I found some code on GitHub to do this, but I can't claim to be an expert :)
Basically it converts data into very small shifts in the pitch of the different harmonics of the speaking voice in the audio. This can survive audio compression, and be recovered afterwards by the same program, but is completely non-obvious to listeners.
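To make the idea concrete, here's a toy sketch in Python of the general principle, not the GitHub program itself. It assumes a synthetic 200 Hz tone with four harmonics standing in for a voice, and nudges each harmonic up or down by 2 Hz in every quarter-second frame to carry one bit. All the names and numbers (`F0`, `SHIFT_HZ`, `encode`, `decode`) are made up for illustration and picked to keep the math clean, not to be inaudible. A real tool would have to follow the actual speaker's pitch and be tuned to survive compression, which this doesn't attempt.

```python
import math

SAMPLE_RATE = 8000        # samples per second (assumed)
FRAME_LEN = 2000          # samples per frame, i.e. 0.25 s (assumed)
F0 = 200.0                # stand-in "voice" fundamental in Hz (assumed)
HARMONICS = (1, 2, 3, 4)  # one hidden bit per harmonic per frame
SHIFT_HZ = 2.0            # each harmonic is nudged by +/- this much


def bytes_to_bits(data):
    """Most significant bit first."""
    return [(byte >> i) & 1 for byte in data for i in range(7, -1, -1)]


def bits_to_bytes(bits):
    out = bytearray()
    for i in range(0, len(bits) - len(bits) % 8, 8):
        value = 0
        for bit in bits[i:i + 8]:
            value = (value << 1) | bit
        out.append(value)
    return bytes(out)


def encode(data):
    """Synthesize audio samples whose harmonic shifts carry the bits of data."""
    bits = bytes_to_bits(data)
    while len(bits) % len(HARMONICS):  # pad the last frame with zeros
        bits.append(0)
    samples = []
    for start in range(0, len(bits), len(HARMONICS)):
        frame_bits = bits[start:start + len(HARMONICS)]
        # bit 1 -> harmonic shifted up, bit 0 -> harmonic shifted down
        freqs = [k * F0 + (SHIFT_HZ if b else -SHIFT_HZ)
                 for k, b in zip(HARMONICS, frame_bits)]
        for n in range(FRAME_LEN):
            t = n / SAMPLE_RATE
            samples.append(sum(math.sin(2 * math.pi * f * t) / k
                               for f, k in zip(freqs, HARMONICS)))
    return samples


def energy_at(frame, freq):
    """Squared magnitude of the frame's correlation with a tone at freq."""
    re = im = 0.0
    for n, s in enumerate(frame):
        angle = 2 * math.pi * freq * n / SAMPLE_RATE
        re += s * math.cos(angle)
        im += s * math.sin(angle)
    return re * re + im * im


def decode(samples, n_bytes):
    """Recover n_bytes by checking whether each harmonic sits high or low."""
    bits = []
    for start in range(0, len(samples), FRAME_LEN):
        frame = samples[start:start + FRAME_LEN]
        for k in HARMONICS:
            up = energy_at(frame, k * F0 + SHIFT_HZ)
            down = energy_at(frame, k * F0 - SHIFT_HZ)
            bits.append(1 if up > down else 0)
    return bits_to_bytes(bits)[:n_bytes]
```

Round-tripping a short message, e.g. `decode(encode(b"hi"), 2)`, should give back `b"hi"`. The frame length and shift are chosen so the "up" and "down" tones fit a whole number of cycles into each frame, which is why a plain correlation can tell them apart cleanly here.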
A government could try to scan all video for signs of steganography, but because so much video is made in so many ways and so little of it is altered, they would be wasting their time unless they already knew what they were looking for.
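And I'd encrypt the message first, with my conversation partner's GPG public key, so that even if the hidden data is discovered, nobody else can read it. (By default GPG output isn't perfectly random-looking: it has a small header and includes the recipient's key ID, though `--throw-keyids` leaves the key ID out.)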
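For the encryption step, here's a rough sketch of calling the `gpg` command-line tool from Python. It assumes `gpg` is installed and that the partner's public key is already imported and trusted in your keyring (in batch mode gpg will refuse an untrusted key). The function names and the example address are my own placeholders. I've added `--throw-keyids` so the output doesn't name the recipient's key.

```python
import subprocess


def gpg_encrypt_command(recipient):
    """Build the gpg command line; reads plaintext on stdin, writes binary to stdout."""
    return ["gpg", "--batch", "--encrypt", "--throw-keyids",
            "--recipient", recipient]


def encrypt_for(recipient, plaintext):
    """Encrypt plaintext (bytes) to the recipient's public key and return the ciphertext."""
    result = subprocess.run(gpg_encrypt_command(recipient),
                            input=plaintext, capture_output=True, check=True)
    return result.stdout
```

Something like `encrypt_for("partner@example.org", b"meet at noon")` would then give the bytes to hide in the audio. Binary output (no `--armor`) keeps the payload as small as possible, which matters when the hiding method only carries a few bits per second.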
That's the best I could do.
SimpleX is probably nearly as good, as long as it's legal to be caught with and its servers are accessible.