If you are new to audio over websockets and come from the telephony world, you might assume it works like an RTP stream, where the audio is framed and timed for real-time playback. That isn't always the case.
Sending audio to a websocket from Asterisk via an External Media channel is fairly straightforward: listen for RTP packets on the external media port, pull the payload out, and write it to the websocket as binary data. As long as you process and forward the data quickly enough to keep up as it arrives, you implicitly get an audio stream that is usable by most services. You don't really have to worry about chunking, timing, or framing because the incoming packets already carry that for you, which is an inherent benefit of reading from an RTP stream.
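Here's a minimal sketch of that direction in Python, assuming the External Media channel was created with ulaw audio, that Asterisk sends plain RTP (12-byte header, no CSRCs or extensions) to 127.0.0.1:4000, and that ws://localhost:8080/audio is a placeholder for your service's endpoint:

```python
import asyncio
import websockets

RTP_HEADER_LEN = 12  # fixed RTP header; padding/extension bits ignored for brevity

class RTPToWebsocket(asyncio.DatagramProtocol):
    """Receive RTP from the External Media port, forward the payload to a websocket."""

    def __init__(self, ws):
        self.ws = ws

    def datagram_received(self, data, addr):
        if len(data) <= RTP_HEADER_LEN:
            return
        # Strip the RTP header and forward the raw ulaw payload as binary.
        asyncio.ensure_future(self.ws.send(data[RTP_HEADER_LEN:]))

async def main():
    async with websockets.connect("ws://localhost:8080/audio") as ws:
        loop = asyncio.get_running_loop()
        await loop.create_datagram_endpoint(
            lambda: RTPToWebsocket(ws), local_addr=("127.0.0.1", 4000))
        await asyncio.Future()  # serve until cancelled

asyncio.run(main())
```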
If you want to send audio back to the External Media channel to make the service bi-directional, it's not as simple as writing the raw stream to the address in the channel's UNICASTRTP_LOCAL_ADDRESS variable. The service's websocket may send you an RTP or RTP-like stream, but most likely the audio will arrive as blobs of unframed audio. That means if you want to send it as RTP, it first needs to be chunked, framed, and timed. You also need to keep track of SSRCs.
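The framing itself is just a 12-byte header in front of each chunk. A sketch, assuming 8 kHz ulaw (static payload type 0) in 20 ms frames of 160 bytes:

```python
import random
import struct

PAYLOAD_TYPE_ULAW = 0
SAMPLES_PER_FRAME = 160  # 20 ms at 8000 Hz, one byte per sample in ulaw

class RTPFramer:
    """Wrap audio frames in RTP headers with a stable SSRC and advancing seq/timestamp."""

    def __init__(self):
        self.ssrc = random.getrandbits(32)  # chosen once per stream
        self.seq = random.getrandbits(16)
        self.timestamp = random.getrandbits(32)

    def frame(self, payload: bytes) -> bytes:
        header = struct.pack(
            "!BBHII",
            0x80,                # version 2, no padding, no extension, no CSRCs
            PAYLOAD_TYPE_ULAW,   # marker bit clear, payload type 0
            self.seq,
            self.timestamp,
            self.ssrc,
        )
        self.seq = (self.seq + 1) & 0xFFFF
        self.timestamp = (self.timestamp + SAMPLES_PER_FRAME) & 0xFFFFFFFF
        return header + payload
```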
The blobs will also require buffering beyond your normal jitter buffer. You may get 30+ seconds of audio in under a second, and you need to keep it somewhere while you play it back as a timed stream. Keep that in mind when interacting with your service as well: don't ask it to read a novel if you can't handle that much audio.
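A sketch of that buffer, again assuming 160-byte ulaw frames. It is unbounded here; a production version would cap it and push back on the service:

```python
class PlayoutBuffer:
    """Accumulate websocket blobs, hand out fixed-size frames at playback pace."""

    FRAME_BYTES = 160  # 20 ms of 8 kHz ulaw

    def __init__(self):
        self.pending = bytearray()

    def write_blob(self, blob: bytes) -> None:
        # 30+ seconds can arrive in under a second; just accumulate it.
        self.pending.extend(blob)

    def next_frame(self) -> bytes | None:
        if len(self.pending) < self.FRAME_BYTES:
            return None  # caller substitutes silence (see below)
        frame = bytes(self.pending[:self.FRAME_BYTES])
        del self.pending[:self.FRAME_BYTES]
        return frame
```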
You also have to deal with the fact that the service doesn't send any form of silence between the audio blobs. If you don't lay a silence stream down 'underneath' the streamed audio, you will get pops, clicks, and awkward speech cadence.
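One way to handle that, sketched below using the RTPFramer and PlayoutBuffer from above: run a 20 ms send loop that transmits real audio when the buffer has a frame and digital silence when it doesn't, so a continuous stream always reaches the address from UNICASTRTP_LOCAL_ADDRESS. The host and port here are whatever you read from that channel variable:

```python
import asyncio
import socket

ULAW_SILENCE_FRAME = bytes([0xFF]) * 160  # 20 ms of ulaw digital silence

async def pace_out(buffer: PlayoutBuffer, framer: RTPFramer,
                   host: str, port: int) -> None:
    """Send one frame every 20 ms: real audio if buffered, silence otherwise."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        frame = buffer.next_frame() or ULAW_SILENCE_FRAME
        sock.sendto(framer.frame(frame), (host, port))
        # Naive tick; a real engine corrects for drift against a clock.
        await asyncio.sleep(0.02)
```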
Luckily, most modern programming languages have UDP or RTP modules to help you write the little RTP engine you didn't know you needed. Chunking the blobs into audio frames is pretty straightforward; just remember to fill the empty spots with silence when you get an incomplete frame, not zeroed or uninitialized bytes. In ulaw, for example, a zeroed byte decodes to a near-full-amplitude sample, not silence.
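For example, a blob chunker under the same ulaw assumption, where digital silence is the byte 0xFF:

```python
ULAW_SILENCE = 0xFF  # ulaw digital silence; 0x00 decodes near full amplitude
FRAME_BYTES = 160    # 20 ms at 8 kHz, one byte per sample

def chunk_blob(blob: bytes):
    """Split a blob into fixed-size frames, padding the last with silence."""
    for start in range(0, len(blob), FRAME_BYTES):
        frame = blob[start:start + FRAME_BYTES]
        if len(frame) < FRAME_BYTES:
            frame += bytes([ULAW_SILENCE]) * (FRAME_BYTES - len(frame))
        yield frame
```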
Hopefully this information will help you build your service with Asterisk!