Background
Welcome to an introduction to text-to-speech and speech-to-text for Asterisk! With speech recognition becoming more and more popular, we decided to take the initiative and integrate these services with Asterisk. In the past, this was usually handled by C modules or AGI scripts. We wanted the experience to be more seamless and easier for developers to use. Let’s take a look at how it’s going to work.
How will it work?
C is not the most popular choice when it comes to programming languages, so library support for things such as speech-to-text doesn’t exist in the same capacity as it does for languages like Python and JavaScript. Because of this, we decided to pass the heavy lifting off to an external application that can better leverage these APIs. The application will be responsible for taking the media and determining what to do with it based on the information Asterisk provides. Data will flow back and forth over a WebSocket connection in the form of JSON to keep things simple. The external application will forward the media along to a speech service such as Google or Amazon, and from there the result will be passed back to Asterisk.
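To make the flow concrete, here is a minimal sketch of the message handling an external application might do. The message `type` values, field names, and the `transcribe` helper are all hypothetical placeholders, since the actual JSON schema has not been published; only the overall shape (JSON in, JSON out, media forwarded to a speech service) comes from the description above.

```python
import json

def transcribe(payload: str) -> str:
    # Stand-in for a real call to a speech service such as Google or Amazon.
    return "<transcription of %d bytes>" % len(payload)

def handle_message(raw: str) -> str:
    """Dispatch one JSON message received from Asterisk over the WebSocket.

    Message types and fields here are illustrative, not the final protocol.
    """
    msg = json.loads(raw)
    if msg.get("type") == "setup":
        # Asterisk describes the call and its media; acknowledge so media can flow.
        return json.dumps({"type": "ack", "id": msg.get("id")})
    if msg.get("type") == "media":
        # Forward the media to the speech service and hand the result back.
        text = transcribe(msg["payload"])
        return json.dumps({"type": "result", "id": msg.get("id"), "text": text})
    return json.dumps({"type": "error", "reason": "unknown message type"})
```

In a real application this handler would sit inside a WebSocket server loop; the point here is just that plain JSON messages keep the Asterisk side of the protocol simple.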
This new functionality will be accessed via dialplan functions. A configuration will need to be set up and passed into the dialplan function, which will then go through the process described above. Results will be passed back so that you know what happened and can act accordingly. It’s important to note that you will need to create an external application or use an existing one, as well as decide which speech service you want to use. The goal is to take submissions for different applications that support various speech services and grow the pool over time, so start thinking about which service you want to use!
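As a rough illustration of what the dialplan side might look like, here is a sketch. The function name `SPEECH_TO_TEXT` and the `my_config` profile are placeholders, not the final interface; the only thing taken from the description above is that a configuration is set up ahead of time, passed into a dialplan function, and a result comes back.

```
; Hypothetical dialplan usage -- names are illustrative only.
[incoming]
exten => 100,1,Answer()
 same => n,Set(RESULT=${SPEECH_TO_TEXT(my_config)})   ; my_config selects the external app / speech service
 same => n,Verbose(1,Caller said: ${RESULT})
 same => n,Hangup()
```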
Documentation
A more detailed description of the project can be found here. The wiki page will be updated as development continues, and we’re open to suggestions for improvements or for anything we may have missed, so leave a comment!