The ins and outs of voice recognition technology, with Mike Phillips, CTO at voice app company Vlingo.
All the evidence is that voice recognition technology hasn’t just come of age; it’s got the keys to the door, and is making a lot of friends fast. But why? What’s happened to turn it from something rather limited into a tool that’s been voted the most popular and must-have utility app amongst Nokia and BlackBerry users?
It’s certainly true that the idea of voice recognition has been around for a long time; forty years or more. Serious research work - originally funded in the US by DARPA - started in the 1970s. The first very simplistic commercial versions came out in the 1980s. It wasn’t really until the 1990s that it began to become usable, and even then mainly as interactive voice response for automated call centres and medical transcription systems.
It became obvious around 2000 that mobile phones were the future and that voice interfaces would be a great match for these small devices - but the data networks, devices and operating systems weren’t well enough developed. Some tried to rush things through without thinking enough about the overall user experience, and that led to some negative perceptions.
Also, until recently, the use of speech as an interface on mobile devices was limited by constrained grammars (a list of allowable words or phrases), even with so-called natural language interfaces. Scripted interactions were another barrier for users – having to go through several sequential steps in a particular order to perform some task. For developers this meant intensive effort and specialised speech recognition expertise to create specific custom speech interfaces. For users it meant having to learn commands and prescriptive navigation. Clearly this wouldn’t scale across all the applications people want to use on their mobile devices. It would be rather like getting the hang of a special mouse for every new app on your laptop.
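To make the limitation concrete, here is a minimal sketch of a constrained grammar in Python. The phrases, action names and the `interpret` function are all hypothetical, invented for illustration rather than taken from any real product. It simply shows that anything outside the allowed list is rejected, however natural it sounds.

```python
# Hypothetical constrained grammar: the only phrases the system will accept.
ALLOWED_COMMANDS = {
    "call home": "DIAL_HOME",
    "call office": "DIAL_OFFICE",
    "check voicemail": "OPEN_VOICEMAIL",
}


def interpret(utterance: str):
    """Return the action for an utterance, or None if it is outside the grammar."""
    # Normalise case and whitespace so "Call  Home" matches "call home".
    normalised = " ".join(utterance.lower().split())
    return ALLOWED_COMMANDS.get(normalised)


print(interpret("Call home"))            # a phrase the user has learned
print(interpret("ring my mum at home"))  # natural phrasing, not in the grammar
```

Every new application would need its own table like this, and users would have to memorise each one.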
Natural communication
But the prize was still huge, because the most natural way to communicate is through language. People have evolved to talk. Little keypads, tiny buttons and small displays don’t make things easy, whereas voice isn’t constrained by the size of the device. A “mix of modality” is also very important – one that gives users a choice to suit their personal preference and situation. We can speak faster than we can type, and read faster than we can listen. So the ideal (unless you’re in a car) could be to talk into the device and then receive feedback via the display.
Just think about the lifestyle, work and personal productivity benefits of being able to press a single button, talk to your phone using normal language, and get it to do everything from sending emails and text messages to web searches, voice dialling, writing yourself a note, updating your Facebook site or launching other apps. All this is possible now using the latest voice recognition technology. It’s helping people to access information faster, save time and get more done. It really has come a long way. But getting to this point wasn’t easy. A completely different approach was needed to overcome some hefty challenges.
Unconstrained voice user interface
As devices, operating systems and networks get better, they’ll no longer be constrained by memory or network speed, and the only limiting factor for apps will be the user interface. This was the critical insight behind the new generation of voice recognition technology: it needed an unconstrained and far more capable user interface that could work across any application. To do this, we had to get rid of application-specific grammar constraints and, in turn, remove the need for scripted interactions. Put simply, users should be able to say or type anything they want into a voice-enabled text box. But making this a reality took a number of technological step changes.
The first of these was using hierarchical language models (HLMs): well-defined statistical models, spanning many millions of words, that predict what users are likely to say and how words are grouped together. They scale to web search, directory assistance, navigation and other tasks where a very large number of words is needed.
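The general idea of a statistical language model can be shown with a toy bigram model in Python. This is only a sketch of the basic principle (estimating how likely one word is to follow another from example text). It is not a hierarchical language model and says nothing about how Vlingo's models are built. The training sentences and function names are made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count how often each word follows the previous one."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split()  # <s> marks the sentence start
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probability(counts, prev, word):
    """Estimate P(word | prev) as a relative frequency; 0.0 if prev is unseen."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    return followers[word] / total if total else 0.0


# Illustrative training text, not real usage data.
corpus = [
    "send text to john",
    "send email to mary",
    "send text to mary",
    "search for pizza",
]
model = train_bigrams(corpus)
print(next_word_probability(model, "send", "text"))
print(next_word_probability(model, "send", "pizza"))
```

A recogniser can use scores like these to prefer word sequences people actually say when the audio alone is ambiguous. Real systems work with far more data and smooth their estimates so that unseen word pairs are not ruled out entirely.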
Then there’s automatic adaptation, where a system learns new words, pronunciations, and the speech patterns of individuals and groups. When Wagamama restaurants first opened in the US the name was brand new, but people were doing successful voice-enabled web searches on it within days. Equally, a first time user with a particular accent benefits from other users who’ve spoken into the system using the same accent.
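As a rough illustration of learning new words from usage, the Python sketch below keeps running word counts and updates them with each utterance. The `AdaptiveVocabulary` class is hypothetical and deliberately simplistic. Real adaptation also covers pronunciations and acoustic patterns, which this does not attempt to model.

```python
from collections import Counter


class AdaptiveVocabulary:
    """Toy word-frequency model that updates as new utterances arrive."""

    def __init__(self):
        self.counts = Counter()

    def observe(self, utterance: str) -> None:
        """Add the words of one recognised or corrected utterance to the counts."""
        self.counts.update(utterance.lower().split())

    def probability(self, word: str) -> float:
        """Relative frequency of a word among everything observed so far."""
        total = sum(self.counts.values())
        return self.counts[word.lower()] / total if total else 0.0


vocab = AdaptiveVocabulary()
vocab.observe("find a noodle restaurant")
before = vocab.probability("wagamama")  # the name has never been seen

# A few users search for the new name; everyone's model benefits.
for _ in range(3):
    vocab.observe("find wagamama near me")
after = vocab.probability("wagamama")

print(before, after)
```

Because the counts are shared, the first few users who say a new name make it easier to recognise for everyone who follows.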
Server-side processing is another key. A small amount of software on the mobile device handles audio capture and the user interface, and communicates over the mobile data network to a set of servers which run the bulk of the system processing. This enables the use of the large amounts of CPU and memory resources needed for unconstrained speech recognition, adaptation, the learning of new words and the updating of the language models for the benefit of all users.
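The division of labour can be sketched in Python as two small functions, one for the handset and one for the server. Everything here is an assumption made for illustration: the JSON message format, the field names and the stub recogniser are invented, and no real network call is made.

```python
import base64
import json


def build_request(audio: bytes, device_id: str) -> str:
    """Client side: package captured audio for sending to the server."""
    return json.dumps({
        "device_id": device_id,  # hypothetical field name
        "audio_b64": base64.b64encode(audio).decode("ascii"),
    })


def handle_request(payload: str, recognise) -> str:
    """Server side: decode the audio, run the heavy recognition, return text."""
    request = json.loads(payload)
    audio = base64.b64decode(request["audio_b64"])
    text = recognise(audio)  # the CPU- and memory-hungry part lives here
    return json.dumps({"device_id": request["device_id"], "text": text})


def stub_recogniser(audio: bytes) -> str:
    """Stand-in for a real recogniser; just reports how much audio arrived."""
    return f"<{len(audio)} bytes of audio>"


payload = build_request(b"\x00\x01\x02\x03", device_id="handset-1")
print(handle_request(payload, stub_recogniser))
```

The handset only captures and packages audio, so the recognition models can grow and be updated on the servers without any change to the phone.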
Killer app?
Speech recognition remains a challenging area, and the main challenge is variability. If everyone spoke the same way using the same language, speech patterns and accent, things would be easy. The reality is that people speak in all sorts of ways, choose different words, and use mobiles in quiet offices or on noisy roads - so there’s huge variability in acoustic signals and sound patterns.
Statistical modelling is the answer: models with millions of parameters that can cope with this variability, and a system that can learn as it goes along. It works for adding new languages too, and more are being developed all the time. A ‘freed up’ user interface opens up more and more apps that people can use, just by talking.
It certainly seems that a lot of people are starting to see voice as the killer app that can really move things forward. So talk to your phone – it works!
As the inventor of the mobile phone "voice user interface," Vlingo delivers a voice interface and technology that allows users to instantly access services and content on their device. www.vlingo.com