I should just be able to tell it what I want it to do.
As the cost of live transcription comes down and the quality and capabilities of language models expand, the possibility of leveraging the two together as an effective service becomes irresistible. While traditional PBX speech tools centered on using speech input to direct a call, we now expect our speech to be able to do much more.
To figure out how to integrate speech-to-text more broadly to drive AI applications, let's take a look at speech-to-text vs. live transcription.
Speech to Text as we have known it.
Traditionally, when we have talked about speech-to-text in the PBX context it has been for voice IVRs and simple voice commands. This was occasionally helpful for navigating increasingly complicated phone systems: things like choosing between Sales or Support, or maybe something more advanced like issuing a 'Dial Alice' command. Various forms of this have existed for years, and Asterisk's res_speech exists to aid in this by helping turn speech input into dialplan variables.
One implementation of a speech engine, the res_speech_aeap module, ties together res_speech and res_aeap to provide a framework for interworking with AEAP services such as Deepgram or Google's Speech API. The expected keywords ('dial', 'extension', 'Alice', 'representative') can be seeded to your AEAP application, allowing it to translate words or short phrases into variables that make sense for your dialplan.
The general model for res_speech, then, is:
- Place a channel into a speech application.
- Allow the application to interpret the incoming speech into a meaningful result.
- Add the results to the channel.
How that application interprets the speech depends entirely on which transcription provider you are using and how complex a result you are looking for. In a simple voice IVR example, the detected word 'sales' would return 100, 'support' and 'help' would return 200, and so on. The channel spends a limited amount of time in the application, coming in and out of it as the call requires speech services.
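As a rough sketch of that model, a voice IVR built on res_speech might look something like the dialplan below. The engine name, grammar name, prompt file, and target extensions are placeholders; exactly how keywords are matched depends on the speech backend you have configured.

```
[voice-ivr]
exten => s,1,Answer()
 same => n,SpeechCreate(my-aeap-engine)               ; engine name registered by the res_speech module in use (placeholder)
 same => n,SpeechActivateGrammar(menu)                ; a grammar/keyword set the backend has been seeded with (placeholder)
 same => n(listen),SpeechBackground(say-sales-or-support,10) ; play the prompt and wait up to 10s for speech
 same => n,GotoIf($["${SPEECH(results)}" = "0"]?listen)      ; nothing recognized, prompt again
 same => n,NoOp(Heard ${SPEECH_TEXT(0)} with score ${SPEECH_SCORE(0)})
 same => n,GotoIf($["${SPEECH_TEXT(0)}" = "sales"]?internal,100,1)
 same => n,GotoIf($["${SPEECH_TEXT(0)}" = "support"]?internal,200,1)
 same => n,Goto(listen)                               ; unrecognized word, try again
```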
Live Transcription enables more diverse services.
We have come, however, to expect more from speech services. While it is possible to interface with such a service using Asterisk's speech modules, it makes much more sense to move to a Unicast/ARI model where we treat the service we are integrating with as a channel rather than an application. From a model perspective, this aligns with how we think of these services: a 'thing' we are talking to.
The general model for Unicast is:
- Create a Unicast channel that sends RTP to your transcription application.
- Bridge/mix/snoop the source channel into the Unicast channel.
- Process results within your application.
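A minimal sketch of that flow in Python, talking to the ARI REST interface directly with the requests library. The ARI credentials, the Stasis application name, and the host/port where the transcription application listens for RTP are all placeholders, and the caller's channel is assumed to have already entered the Stasis application.

```python
import requests

ARI = "http://localhost:8088/ari"     # default ARI HTTP endpoint (assumption)
AUTH = ("asterisk", "asterisk")       # ari.conf user/password (placeholder)
APP = "transcriber"                   # the Stasis application that owns these channels (placeholder)

def attach_transcription(caller_channel_id):
    # 1. Create a Unicast (external media) channel that streams the call's
    #    audio as RTP to the transcription service.
    ext = requests.post(f"{ARI}/channels/externalMedia", auth=AUTH, params={
        "app": APP,
        "external_host": "127.0.0.1:9999",  # where the transcription app listens for RTP (placeholder)
        "format": "slin16",                 # raw signed linear audio is easy to hand to an STT engine
    }).json()

    # 2. Create a mixing bridge and put both channels in it.
    bridge = requests.post(f"{ARI}/bridges", auth=AUTH,
                           params={"type": "mixing"}).json()
    for chan_id in (caller_channel_id, ext["id"]):
        requests.post(f"{ARI}/bridges/{bridge['id']}/addChannel",
                      auth=AUTH, params={"channel": chan_id})

    # 3. From here the transcription application receives the RTP stream,
    #    forwards it to the STT provider, and processes the results.
    return bridge["id"], ext["id"]
```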
The Unicast channel isn't necessarily tied directly to the source call or caller. This means that the channel can be passive, doing things like live transcribing for a hearing-impaired user, or active, responding audibly to a caller's intent. It can be added to a conference to transcribe, act as a virtual assistant, and so on.
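For the passive case, a snoop channel lets the transcription leg listen in without disturbing the original call. A hedged sketch, reusing the placeholder ARI setup from the previous example:

```python
def snoop_and_transcribe(caller_channel_id):
    # Create a snoop channel that only spies on the audio arriving from the
    # caller (spy=in) and injects nothing back into the call (whisper=none).
    snoop = requests.post(f"{ARI}/channels/{caller_channel_id}/snoop",
                          auth=AUTH,
                          params={"app": APP, "spy": "in", "whisper": "none"}).json()

    # Bridge the snoop channel with an external media channel exactly as before;
    # the caller never interacts with the transcription leg.
    return attach_transcription(snoop["id"])
```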
A bridge, not a destination.
In the first model, Asterisk, or more accurately the PBX Asterisk is running in, is acting as the final endpoint. It may be offloading the transcription to the cloud, but ultimately this is to drive the call within the PBX to its next destination. The goal is to make the PBX experience more natural, not necessarily to add value.
In the second model, Asterisk is acting more as a bridge between the telephony world and a speech-based service. The goal is different: to leverage that added value specifically and act as a services gateway. In this model, a service is just another channel.