5 Technologies for Building the Next Generation of Voice Apps

There was a time not so long ago when the future of voice applications was uncertain. Who wants to talk when they can text? Why call a contact center when you can use a mobile app? Voice may have lost its cool for a time, but it has made a strong comeback, driven by a new generation of voice-enabled applications and devices. In the world of voice assistants, smart speakers, and IoT, voice is the ultimate user interface. It’s natural, convenient, hands-free, and fast. And we’re not just talking about asking Siri for the weather or Alexa to play your favorite music. Voice technology will transform how we search and make purchases, increase accessibility to digital information and services, enhance learning, improve elder care, and more.

Voice is now used in ways that weren’t previously possible, and 5 technologies are fueling the next generation of voice application development:

#1 Automated Speech Recognition (ASR)
#2 Voice Biometrics
#3 Embedded Wake Words
#4 Text to Speech (TTS)
#5 Speaker Diarization

These voice technologies aren’t necessarily new. Some have been around for decades. However, each benefits from improved hardware, computing power, and neural networks that enable more sophisticated algorithms. We go into more detail on each of the technologies below.

#1 Automated Speech Recognition (ASR): ASR technology allows humans to interface with computers by converting speech to text in real time. In contact centers, ASR enables customers to use simple voice commands to navigate menus and perform self-service tasks like checking a bank account balance. Many ASR systems use Natural Language Understanding, or NLU, to better comprehend what is being said, which makes conversations feel more natural and human-like. Most of us are familiar with “ask me anything” style systems, such as the ASR technology behind smart assistant voice commands.
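To make the self-service idea concrete, here is a minimal sketch of what happens after ASR has produced a transcript: the text is mapped to an action the system can perform. The intent names and keywords are hypothetical, and simple keyword matching is only a toy stand-in for a real NLU engine.

```python
import re

# Hypothetical intents and trigger keywords for a banking self-service menu.
INTENT_KEYWORDS = {
    "check_balance": {"balance"},
    "transfer_funds": {"transfer", "send"},
    "speak_to_agent": {"agent", "representative", "human"},
}


def match_intent(transcript: str) -> str:
    """Return the intent whose keywords overlap most with the transcript."""
    # Lowercase and split into words so matching ignores case and punctuation.
    words = set(re.findall(r"[a-z']+", transcript.lower()))
    best_intent, best_hits = "unknown", 0
    for intent, keywords in INTENT_KEYWORDS.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best_intent, best_hits = intent, hits
    return best_intent


if __name__ == "__main__":
    print(match_intent("What's my account balance?"))
```

A production system would replace the keyword table with a trained language model, but the flow is the same: speech in, text out, then text to intent.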

Machine learning models make it simpler to deploy the technology at the edge. These capabilities will be found in a wide range of voice-enabled devices – from home security to entertainment to appliances and robotics. Of course, we can’t leave out the ability to add ASR to mobile gaming and VR – making experiences more immersive and intuitive.

#2 Voice Biometrics: Voice biometrics is the science of using a person’s voice to verify identity in order to 1) authenticate that they are who they claim to be for security and/or 2) instantly personalize the interaction based on the speaker’s preferences, purchase history, interests, and more. Like a fingerprint, a person’s voice is unique, based mainly on their vocal tract physiology. Until now, voice biometrics has been focused on improving authentication in the contact center over the telephone channel and, more recently, on mobile application login.

However, advances in neural networks have enabled the development of voice biometric algorithms that are faster and more accurate, and that can authenticate users with a smaller amount of speech. For example, ID R&D’s voice biometric technology has a small enough footprint to work efficiently and effectively on connected, low-power hardware, including AI chips. This has opened the door for using voice biometrics in a range of IoT devices that will benefit from strong security and personalization. In the future, you will be able to use your voice to ask an appliance to order a replacement part, disarm a home alarm, make purchases on your television, and much more.
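One common way to frame voice verification is to compare a numeric "voiceprint" captured at enrollment with one computed from new speech. The sketch below assumes a neural network has already turned each recording into a short vector; the vectors, the function names, and the 0.8 threshold are all illustrative assumptions, not ID R&D's algorithm.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An all-zero vector carries no speaker information.
    return dot / (norm_a * norm_b)


def verify_speaker(enrolled, sample, threshold=0.8):
    """Accept the sample if it is close enough to the enrolled voiceprint.

    The threshold is a made-up value; real systems tune it to balance
    false accepts against false rejects.
    """
    return cosine_similarity(enrolled, sample) >= threshold


if __name__ == "__main__":
    enrolled = [0.9, 0.1, 0.4]       # hypothetical enrolled voiceprint
    same_voice = [0.85, 0.15, 0.42]  # hypothetical new sample, same person
    other_voice = [0.1, 0.9, -0.3]   # hypothetical sample, different person
    print(verify_speaker(enrolled, same_voice))
    print(verify_speaker(enrolled, other_voice))
```

Because the comparison itself is a handful of multiplications, the heavy lifting sits in the model that produces the vectors, which is why smaller, faster models matter for low-power devices.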

#3 Embedded Wake Words: A wake word is a spoken “trigger” word or phrase that activates a smart device. Just as we say a person’s name to get their attention, a wake word alerts a voice-enabled device to listen for and respond to commands. Wake words are most often associated with voice assistants like Alexa or Siri, but we’ll continue to see the emergence of custom branded wake words for use in automobiles, smart home devices, hospitality, retail, and other applications. A custom embedded wake word for your voice assistant does not rely on an internet connection or third parties.
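As a rough illustration of how an always-listening device might decide that it heard its wake word, the sketch below assumes an on-device model emits a confidence score between 0 and 1 for each short audio frame. The scores, window size, and threshold are hypothetical; averaging over a few frames is one simple way to avoid firing on a single noisy spike.

```python
from collections import deque


def detect_wake_word(frame_scores, window=3, threshold=0.7):
    """Return the index of the frame where the wake word fires, or None.

    frame_scores: per-frame confidence values from a (hypothetical)
    on-device wake word model.
    """
    recent = deque(maxlen=window)  # keeps only the last `window` scores
    for i, score in enumerate(frame_scores):
        recent.append(score)
        # Fire only once the window is full and its average is high enough.
        if len(recent) == window and sum(recent) / window >= threshold:
            return i
    return None


if __name__ == "__main__":
    scores = [0.1, 0.2, 0.9, 0.1, 0.8, 0.9, 0.85, 0.2]
    print(detect_wake_word(scores))
```

Everything here runs locally on a stream of numbers, which is the point of an embedded wake word: no audio has to leave the device before the trigger phrase is heard.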

#4 Text to Speech (TTS): Text to Speech is a technology that reads written text aloud. It is now standard on mobile phones, tablets, and laptops. TTS is an assistive technology that helps both adults and children who struggle with reading or are learning a new language, as well as the visually impaired. It can also help people with diseases that impact their ability to speak continue to communicate via voice.

Today, TTS is commonly used to read directions or text messages, thus offering safety while driving and convenience on the go. Likewise, TTS will increasingly be used in conjunction with ASR to read search engine results to users as more web browser sessions go screenless.

With advancements in realistic synthetic voices, future applications will make greater use of TTS for listening to an audiobook in the author’s voice, bringing historical figures back to life in the classroom, or allowing marketers to efficiently and consistently deliver content in a unique brand voice.

#5 Speaker Diarization: In multi-speaker audio recordings, diarization is the ability to match a specific speaker with what they said and when. Diarization is necessary to split audio feeds into homogeneous segments by speaker for call center analytics, closed captioning, legal proceedings, and automated generation of meeting minutes. A use case in high demand with the rise in remote work is identifying speakers during remote web conferences and transcribing these events for easier reference.

Diarization uses voice biometrics for speaker identification and is language-independent.
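The "who spoke when" output of diarization can be pictured as a list of time-stamped speaker turns. The sketch below assumes an upstream system has already labeled short audio segments with a speaker; the segment times and the speaker labels "A" and "B" are made up. It shows only the final step of merging consecutive segments from the same speaker into turns, as you might for meeting minutes.

```python
def merge_speaker_turns(segments):
    """Merge consecutive segments from the same speaker into turns.

    segments: list of (start_seconds, end_seconds, speaker_label) tuples,
    sorted by start time. The labels come from a hypothetical upstream
    speaker identification step.
    """
    turns = []
    for start, end, speaker in segments:
        if turns and turns[-1][2] == speaker:
            # Same speaker keeps talking: extend the current turn.
            prev_start = turns[-1][0]
            turns[-1] = (prev_start, end, speaker)
        else:
            # A new speaker starts a new turn.
            turns.append((start, end, speaker))
    return turns


if __name__ == "__main__":
    labeled = [
        (0.0, 1.5, "A"),
        (1.5, 3.0, "A"),
        (3.0, 4.0, "B"),
        (4.0, 6.0, "A"),
    ]
    for start, end, speaker in merge_speaker_turns(labeled):
        print(f"{start:>4.1f}s to {end:>4.1f}s  speaker {speaker}")
```

Pairing each turn with the ASR transcript for the same time range is what produces a readable, speaker-attributed record of a call or web conference.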

Accelerating the Next Phase of Voice App Development

Integrating modern voice technologies, including products purpose-built and optimized for new IoT and mobile use cases and embedded systems, to prototype, build, and test voice-enabled applications can be complex and expensive.

As part of our efforts to facilitate innovation around voice biometrics, ID R&D partnered with Vivoka. The Vivoka voice application development kit (VDK) comprises a complete range of embedded voice technologies from multiple providers, giving developers the ability to access the capabilities they need, all in one place.

The ability to take advantage of voice biometrics at the edge brings never-before-realized value to smart speakers, smart cameras, connected cars, robotics, and more.

To accelerate development, the VDK offers:

A user-friendly, graphical interface that removes complexity and provides access to various speech technologies in one place
ASR and TTS support for multiple languages; language-independent voice biometrics
Support for offline, embedded use cases
Flexible pricing models to suit your project

Learn how to get started with the Vivoka VDK.
