Speech Recognition Technology

Rachel Andersen
10 min read · Jun 13, 2021

Speech recognition software is a computer application programmed to take in audio speech data, interpret and analyze it, and transcribe it into text. Speech recognition technology is becoming more and more vital to the way we interact with technology today. By 2024, the global voice-based smart speaker market is predicted to be worth $30 billion, and by 2022, voice-based shopping is predicted to amass $40 billion in revenue. Suffice it to say, this technology is becoming an ever-increasing part of the global economy, as well as a larger part of how we perform internet searches, text through voice-to-text, and even set timers or alarms or check the weather on our smart devices. The infographic below shows 2020 statistics on the increase in voice searches.

[Infographic: How Voice Search Is Increasing]

In this post, we will discuss the rich history of speech recognition technology, how speech recognition software works, types of speech recognition software, speech recognition technology impediments, as well as implementations of speech recognition technology.

The History of Voice Recognition Technology

Speech recognition technology dates back to the mid-20th century. In 1952, Bell Labs created the AUDREY system, which could recognize the spoken digits 0 through 9 with 97% accuracy when it was trained to a specific speaker. This technology laid the foundation for voice dialing and was used by toll-line operators.

Following AUDREY, IBM unveiled the Shoebox system at the 1962 World’s Fair. This system could recognize 16 spoken words, including the digits 0 through 9 and arithmetic commands such as “plus” and “total.” Using a combination of these word and digit inputs, Shoebox could be instructed to perform simple mathematical operations via a linked adding machine, making it the world’s first voice-powered calculator.

Thanks to funding from the US Department of Defense and DARPA, speech recognition technology was able to make many strides. DARPA’s Speech Understanding Research (SUR) program, which ran from 1971 to 1976, was a large initiative in the field and aided in the creation of the HARPY voice recognition system at Carnegie Mellon. HARPY was able to process over 1,000 words (about the same as a three-year-old’s vocabulary), and it could look up the meaning of words in a database and determine sentence structure through its “beam search” technology.

By the 1980s, speech recognition technology had advanced dramatically thanks to advances in computing as a whole, resulting in systems that could recognize tens of thousands of words. One of the major advancements was the Hidden Markov Model (HMM), which let computers estimate the probability that an input sound was a given speech sound rather than matching the sound against a rigid template. This allowed the recognition of conversational speech as well as the wide expansion of a system’s lexicon. During this period, speech recognition expanded into commercial use; there was even a doll, Worlds of Wonder’s “Julie,” released in 1987, that could understand simple phrases and reply accordingly.

In 1990, the first consumer-grade speech recognition product, Dragon Dictate, was developed. In 1997, Dragon NaturallySpeaking, which could process natural speech at up to 100 words per minute, was released. That same year, BellSouth developed VAL, the very first “voice portal,” which, through its ability to process speech and respond to questions over the phone, laid the foundation for the voice-activated menus we see today in phone banking and at pharmacies. From the mid-1990s through the late 2000s, speech recognition advancements plateaued, having hit a ceiling of about 80% recognition accuracy due to the limitations of HMMs.

2007 marked the release of Apple’s first iPhone, and the tech market began to orient its focus toward smartphones and mobile devices. The Google Voice Search app for iPhone was released in 2008, marking a major achievement in mobile speech recognition technology. What made this technology important was that processing could be offloaded to Google’s cloud data centers, enabling the high-volume data analysis required to store speech patterns and match words against them. Apple then built upon this existing technology, resulting in the release of Siri, an AI-driven personal assistant, in 2011. Google’s voice assistant and Siri have since been steadily refined, and Google, Amazon, and Microsoft have expanded their speech technology into the world’s homes via Google Home, Alexa, and Cortana, respectively. Speech recognition technology will continue to improve as both its commercial and personal applications and demands continue to increase.

How Does Speech Recognition Software Work?

Speech recognition software works by taking audio input from the user, breaking the input down into individual sounds (called phonemes in linguistics), applying algorithms to find the most likely words that fit the audio input, and transcribing those sounds into text. This software relies on natural language processing (NLP) as well as deep learning neural networks. NLP, according to Etienne Manderschield (VP of AI and Machine Learning at Dialpad), is “a technology built to help computers process and analyze our language, both spoken and written. Essentially, engineers build NLP models to teach computers how to understand us and even replicate the way we communicate.” In short, NLP allows the computer to understand human language well enough for machines to perform simple tasks.

Speech-to-Text Conversion

After the speech recognition software receives the audio input, the next step is converting speech to text. First, the audio input must be sampled: the continuous speech signal is broken down into discrete samples taken thousands of times per second (a common rate is 16,000 samples per second, so each sample covers just 1/16,000th of a second).

[Figure: Sampling of Speech Audio Input]
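
To make this concrete, here is a minimal sketch in Python of what sampling looks like from the software’s side, using the standard library’s wave module and NumPy. The file name and the 16 kHz rate are illustrative assumptions, not details from any particular product.

```python
# A minimal sketch of inspecting sampled audio; assumes 16-bit mono PCM.
import wave

import numpy as np

with wave.open("speech.wav", "rb") as wav:   # hypothetical input file
    sample_rate = wav.getframerate()         # samples per second, e.g. 16000
    n_samples = wav.getnframes()             # total number of discrete samples
    raw = wav.readframes(n_samples)          # raw bytes of the PCM stream

# Interpret the bytes as 16-bit signed integers, one per sample.
signal = np.frombuffer(raw, dtype=np.int16)

# Each sample covers 1/sample_rate of a second (1/16000 s at 16 kHz).
print(f"{n_samples} samples at {sample_rate} Hz "
      f"= {n_samples / sample_rate:.2f} seconds of audio")
```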

Pre-Processing of Speech

The next step the program takes is to pre-process the speech samples in order to attain more accurate results. Pre-processing is key because it determines the efficiency of the speech recognition model. Since individual samples can be as small as 1/16,000th of a second, pre-processing increases efficiency by grouping them into longer frames (typically 20–25 milliseconds each). This aids in converting the sound waves into numbers, or bits, that the software can process.
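
As a rough sketch of this grouping step, the function below (a hypothetical helper of ours, not from any particular toolkit) splits a sampled signal into overlapping 25 ms frames; the 25 ms width and 10 ms hop are typical values rather than prescribed ones.

```python
import numpy as np

def frame_signal(signal: np.ndarray, sample_rate: int,
                 frame_ms: float = 25.0, step_ms: float = 10.0) -> np.ndarray:
    """Split a 1-D audio signal into overlapping frames (e.g. 25 ms wide,
    advancing 10 ms at a time) — the grouping step described above."""
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame
    step_len = int(sample_rate * step_ms / 1000)     # samples per hop
    n_frames = 1 + max(0, (len(signal) - frame_len) // step_len)
    return np.stack([signal[i * step_len : i * step_len + frame_len]
                     for i in range(n_frames)])

# At 16 kHz, a 25 ms frame holds 400 of the 1/16000-second samples.
frames = frame_signal(np.zeros(16000), 16000)        # one second of silence
print(frames.shape)                                  # (98, 400)
```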

Recurrent Neural Network (RNN)

After pre-processing is complete, the resulting data enters a series of algorithms called neural networks, which are modeled on the functioning of the human brain and employ deep learning. These neural networks are able to take in a large set of data and process it, recognizing and drawing out patterns within the data in order to produce output.

Recurrent neural networks (RNNs) are algorithms capable of predicting future outcomes, which contributes to the efficiency of speech recognition. The RNN processes each speech sound and considers the likelihood of the next speech sound. For example, if a user were to produce the speech sounds corresponding to “GOODB,” the RNN would predict that the ensuing speech sounds would correspond to “YE” rather than “XLE.” The RNN also saves the predictions it has made for future use.

[Figure: Speech Recognition Model Using RNN — RNN Example]

[Figure: RNN Algorithm]

The above diagram delineates the steps of an RNN algorithm. The algorithm is divided into input states (X), hidden states (St), and output states (O): at each time step t, the hidden state St is computed from the current input Xt and the previous hidden state St−1, and the output Ot is computed from St.
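
A minimal NumPy sketch of that loop follows. The weight matrices U, W, and V and the toy dimensions are illustrative assumptions, and the update rules (tanh for the hidden state, softmax for the output) are the standard textbook choices rather than details taken from the diagram.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy dimensions: each input x_t could be a feature vector for one
# 25 ms frame; the sizes here are arbitrary illustration values.
input_size, hidden_size, output_size = 5, 8, 4
rng = np.random.default_rng(0)
U = rng.normal(size=(hidden_size, input_size))    # input -> hidden
W = rng.normal(size=(hidden_size, hidden_size))   # hidden -> hidden (the recurrence)
V = rng.normal(size=(output_size, hidden_size))   # hidden -> output

def rnn_step(x_t, s_prev):
    """One step of the diagrammed RNN: the new hidden state S_t mixes the
    current input X_t with the previous hidden state, and the output O_t
    is a probability distribution over possible next sounds."""
    s_t = np.tanh(U @ x_t + W @ s_prev)
    o_t = softmax(V @ s_t)
    return s_t, o_t

s = np.zeros(hidden_size)                         # initial hidden state
for x in rng.normal(size=(3, input_size)):        # three dummy input frames
    s, o = rnn_step(x, s)
print(o)                                          # likelihoods of the next sound
```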

LSTM

An RNN cannot process very long sequences. This is because RNNs are trained through backpropagation through time (BPTT): to calculate the gradient at, say, t=6, you have to backpropagate through the previous 5 steps and sum the contributions, and over long sequences these repeatedly multiplied gradients tend to vanish or explode. This is a major drawback, as it is not efficient. Therefore LSTM (long short-term memory) networks are used. While a plain RNN cell has a single repeating structure, an LSTM cell has four interacting components: a cell state through which information flows, plus three gates (input, output, and forget). Essentially, an LSTM outperforms a plain RNN because the gated cell state lets it control long-term dependencies.

[Diagram — Left: RNN cell; Right: LSTM cell]
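
The sketch below contrasts the two cells in code using PyTorch, which is our library choice for illustration (the article names no framework); the feature and hidden sizes are arbitrary.

```python
import torch
import torch.nn as nn

n_features, hidden_size = 40, 128          # e.g. 40 spectral features per 25 ms frame

rnn = nn.RNN(n_features, hidden_size, batch_first=True)
lstm = nn.LSTM(n_features, hidden_size, batch_first=True)

frames = torch.randn(1, 100, n_features)   # one utterance: 100 frames of features

# The plain RNN carries only a hidden state h between time steps.
out_rnn, h = rnn(frames)

# The LSTM additionally carries a cell state c, whose gated updates let
# information from early frames survive across long sequences.
out_lstm, (h, c) = lstm(frames)
print(out_lstm.shape)                      # torch.Size([1, 100, 128])
```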

Types of Speech Recognition Software

  • Speaker-dependent speech recognition software — This type is for use by one individual. After sufficient input and training, it can be very accurate for speech-to-text dictation for that specific individual.
  • Speaker-independent speech recognition software — This type is trained to recognize anyone’s voice. It is not as accurate as speaker-dependent software but is more widely applicable and is used, for example, in telephone applications.
  • Command & control speech recognition software — This type is used to control and navigate devices via voice commands, for example to start programs and navigate websites.
  • Discrete input speech recognition software — This type is highly accurate but requires a pause after each word is spoken, limiting speaker speed to 60–80 words per minute.
  • Continuous input speech recognition software — This type can process speech without requiring pauses between words, trading some accuracy for a more natural speaking flow.
  • Natural speech input speech recognition software — This type is able to understand continuous human speech spoken fluently at up to 160 words per minute.

Speech Recognition Technology Impediments

Humans began using spoken language to communicate perhaps as long as 2 million years ago; we as a species have had ages of evolutionary neural development that allow us to perceive and process natural human language with efficacy.

Despite this, humans still encounter impediments to speech production, perception, and analysis, including breakdowns at the following stages (labeled A through E in the speech-chain diagram above, running from formulating speech to perceiving it):

  • Breakdown at A: speech formulation difficulties (e.g. Broca’s aphasia which affects speech production)
  • Breakdown at B: human vocal mechanism issues (e.g. impaired larynx)
  • Breakdown at C: interference with the vocal soundwaves due to the presence of background noise
  • Breakdown at D: issues with the individual’s auditory perception mechanism (e.g. hearing loss at certain frequencies, total hearing loss)
  • Breakdown at E: speech perception issues (e.g. inability to discern speech with a certain accent or dialect, Wernicke's aphasia, auditory processing disorders)

Speech recognition software programs also have issues somewhat akin to the human breakdowns discussed above:

Suppression of Noise

While humans are able to discern human speech patterns from background noise provided the speech sound waves are not completely overpowered, speech recognition software takes in the background noise as part of the audio input as a whole. Consider playing loud music and trying to give a voice command to a Google Home at a normal speaking volume. The speech recognition software is unable to separate the speech sounds from the background music, and I personally often have to yell at the Google Home to increase the magnitude of my sound waves enough to make them perceptible to it.
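
As a toy illustration of the problem, the sketch below gates out low-energy frames using an energy threshold estimated from a noise-only segment. This naive approach is our own, for illustration only; it is not how Google Home or any production system actually suppresses noise, and it fails exactly when the music is as loud as the speech.

```python
import numpy as np

def noise_gate(frames: np.ndarray, noise_frames: np.ndarray,
               margin: float = 2.0) -> np.ndarray:
    """Toy noise gate: keep only frames whose energy exceeds the average
    energy of a known noise-only segment by some margin. Real systems use
    far more sophisticated techniques (spectral subtraction, beamforming,
    learned denoisers), but the idea of separating signal energy from
    background energy is the same."""
    noise_energy = np.mean(noise_frames.astype(float) ** 2)
    energies = np.mean(frames.astype(float) ** 2, axis=1)
    keep = energies > margin * noise_energy
    return frames * keep[:, None]                  # silence the low-energy frames

rng = np.random.default_rng(1)
noise = rng.normal(scale=0.1, size=(5, 400))       # quiet background hiss
speech = rng.normal(scale=1.0, size=(3, 400))      # louder "speech" frames
gated = noise_gate(np.vstack([noise, speech]), noise)
print(np.abs(gated).sum(axis=1) > 0)               # noise frames are zeroed out
```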

Accents and Dialects

As discussed in breakdown E above, humans often have difficulty perceiving speech that is accented or spoken in an unfamiliar dialect. At this stage of the game, speech recognition software still struggles with accent and dialect perception as well.

Speed of Verbal Audio Input

I know a little Spanish. However, when I was in Costa Rica speaking to a native speaker who was speaking quickly, I was unable to process the entirety of the input as separate words, and I had to ask people to speak more slowly. Similarly, today’s speech recognition software begins to struggle with processing speech at or above 200 words per minute.

Speech Context

When humans communicate with one another, they use a variety of anecdotes, emotions, slang, and expressions. However, speech recognition software has not quite reached the level of complexity needed to detect these nuances well. For example, I just asked my Google Home Mini, “What’s the sitch with the weather?”, and its output was, “I cannot help you with this request.”

Implementation of Speech Recognition Technology

As of today, in 2021, speech recognition technology has a wide range of applications: it is used across fields including defense, medicine and healthcare, law, education, telecommunications, and personal computing.
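
As one concrete example of how an application might put this technology to work, here is a minimal sketch using the open-source Python SpeechRecognition package (our choice for illustration; the article does not name a library, and the recording file name is hypothetical).

```python
# Requires: pip install SpeechRecognition
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("meeting_notes.wav") as source:   # hypothetical recording
    audio = recognizer.record(source)               # read the whole file

try:
    # Sends the audio to Google's free web speech API for transcription.
    print(recognizer.recognize_google(audio))
except sr.UnknownValueError:
    print("Speech was unintelligible.")             # cf. the impediments above
except sr.RequestError as e:
    print(f"API request failed: {e}")
```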

Conclusion

Speech recognition software is becoming increasingly important in our daily lives, whether at work or at home. As developers, it is important to understand the history of speech recognition technology, how speech recognition software works, types of speech recognition software, speech recognition technology impediments, as well as implementations of speech recognition technology.

References

https://www.researchgate.net/publication/329316345_Speech_Recognition_using_Recurrent_Neural_Networks
