Speech Recognition Technology

Speech recognition software is a computer application that takes in audio speech data, interprets and analyzes it, and transcribes it into text. Speech recognition technology is becoming increasingly central to the way we interact with technology. By 2024, the global voice-based smart speaker market is predicted to be worth $30 billion, and by 2022, voice-based shopping is predicted to generate $40 billion in revenue. In short, speech recognition is a growing part of the global economy and of how we search the internet, send texts via voice-to-text, set timers and alarms, and check the weather on our smart devices. The infographic below shows 2020 statistics on the increase in voice searches.

Infographic: How Voice Search Is Increasing

The History of Voice Recognition Technology

Speech recognition technology dates back to the mid-20th century. In 1952, Bell Labs created the AUDREY system, which could recognize the digits 0 through 9 spoken into its speaker box with 97% accuracy when it was trained to a specific speaker. This technology laid the foundation for voice dialing and was used by toll-line operators.

How Does Speech Recognition Software Work?

Speech recognition software works in four steps. It takes audio input from the user and breaks it down into individual sounds (called phonemes in linguistics). It then applies algorithms to find the word that most likely fits those sounds, and finally transcribes the result into text. This software relies on natural language processing (NLP) as well as deep learning neural networks. According to Etienne Manderscheid (VP of AI and Machine Learning at Dialpad), NLP is “a technology built to help computers process and analyze our language, both spoken and written. Essentially, engineers build NLP models to teach computers how to understand us and even replicate the way we communicate.” In short, NLP lets computers understand human language well enough for machines to carry out simple tasks.
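The Python sketch below is a toy version of the "most likely word" step. It assumes the audio has already been converted into a sequence of phoneme symbols, written ARPAbet-style. It then picks the entry in a tiny, hypothetical pronunciation lexicon whose phonemes are closest by edit distance. The lexicon and the function names are made up for illustration. Production recognizers score candidates with trained acoustic and language models rather than a plain string distance.

```python
# Toy "most likely word" step: compare a recognized phoneme sequence with a
# tiny, hypothetical pronunciation lexicon (ARPAbet-style symbols).
LEXICON = {
    "timer": ["T", "AY", "M", "ER"],
    "weather": ["W", "EH", "DH", "ER"],
    "alarm": ["AH", "L", "AA", "R", "M"],
    "nine": ["N", "AY", "N"],
}

def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))  # distances from an empty prefix of a
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # delete pa
                            curr[j - 1] + 1,      # insert pb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]

def most_likely_word(phonemes, lexicon=LEXICON):
    """Return the lexicon word whose pronunciation is closest to `phonemes`."""
    return min(lexicon, key=lambda word: edit_distance(phonemes, lexicon[word]))

print(most_likely_word(["T", "AY", "M", "AH"]))  # closest entry: "timer"
```

In this example, `["T", "AY", "M", "AH"]` differs from the lexicon entry for "timer" by a single phoneme, so "timer" wins. This scheme cannot separate homophones such as "weather" and "whether", because they would score the same. That is one reason real systems also weigh the surrounding words.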

Figures: Sampling of Speech Audio Input; Speech Recognition Model Using a Recurrent Neural Network (RNN); RNN Example; RNN Algorithm; Long Short-Term Memory (LSTM) Diagram; RNN (left) vs. LSTM (right).
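The figures above contrast a plain recurrent neural network with an LSTM. The pure-Python code below is a rough sketch of what each cell computes at one time step, using the textbook update rules:

- The RNN passes the current input and the previous hidden state through tanh.
- The LSTM adds forget, input, and output gates around a separate cell state.

The layer sizes, constant weights, gate names, and the two "feature frames" are illustrative assumptions. They do not represent a trained model or any library's API.

```python
import math

# Tiny pure-Python vector helpers (lists stand in for vectors/matrices).
def matvec(W, x):
    """Matrix-vector product: one output per row of W."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(parts) for parts in zip(*vectors)]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-z)) for z in v]

def tanh(v):
    return [math.tanh(z) for z in v]

def rnn_step(x, h, Wx, Wh, b):
    """Vanilla RNN: h_t = tanh(Wx*x_t + Wh*h_{t-1} + b)."""
    return tanh(add(matvec(Wx, x), matvec(Wh, h), b))

def lstm_step(x, h, c, params):
    """One LSTM step. params maps a gate name to its (Wx, Wh, b) triple."""
    def gate(name, activation):
        Wx, Wh, b = params[name]
        return activation(add(matvec(Wx, x), matvec(Wh, h), b))

    f = gate("forget", sigmoid)  # how much of the old cell state to keep
    i = gate("input", sigmoid)   # how much of the candidate to write
    o = gate("output", sigmoid)  # how much of the cell state to expose
    g = gate("cell", tanh)       # candidate cell values
    c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
    return h_new, c_new

# Illustrative run: hidden size 2, 3 features per frame, constant weights.
H, D = 2, 3
Wx = [[0.1] * D for _ in range(H)]
Wh = [[0.1] * H for _ in range(H)]
b = [0.0] * H
frames = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2]]  # hypothetical feature frames

h_rnn = [0.0] * H
for x in frames:
    h_rnn = rnn_step(x, h_rnn, Wx, Wh, b)

params = {name: (Wx, Wh, b) for name in ("forget", "input", "output", "cell")}
h_lstm, c_lstm = [0.0] * H, [0.0] * H
for x in frames:
    h_lstm, c_lstm = lstm_step(x, h_lstm, c_lstm, params)

print("RNN hidden state:", h_rnn)
print("LSTM hidden state:", h_lstm)
```

The LSTM's forget gate scales the old cell state directly instead of passing it through tanh at every step. As a result, an LSTM can carry information across many more frames than a plain RNN. This matters for speech, where a single word can span many feature frames.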

Types of Speech Recognition Software

  • Speaker-dependent speech recognition software — This type is for use by one individual. After sufficient input and training, it can be very accurate for speech-to-text dictation for that specific individual.
  • Speaker-independent speech recognition software — This type is trained to recognize anyone’s voice. These systems are not as accurate as speaker-dependent ones but are more widely applicable and are used, for example, in telephone applications.
  • Command & control speech recognition software — This type is used to control and navigate devices via voice commands, for example to start programs and navigate websites.
  • Discrete input speech recognition software — This type is highly accurate but requires a pause after each word is spoken and therefore limits speaker speed to 60–80 words per minute.
  • Continuous input speech recognition software — This type accepts speech without requiring a pause after each word.
  • Natural speech input speech recognition software — This type is able to understand continuous human speech spoken fluently at up to 160 words per minute.
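Command & control software maps a recognized utterance onto a small, fixed set of actions. The sketch below assumes the speech has already been transcribed to text and matches that text against a few regular expressions. The command phrases and the action names (`set_timer`, `open_program`, `check_weather`) are hypothetical. They were chosen to mirror the timer, program, and weather examples mentioned earlier.

```python
import re

# Hypothetical command grammar: each pattern maps a transcript to an action.
COMMANDS = [
    (re.compile(r"set (?:a )?timer for (\d+) (second|minute|hour)s?"), "set_timer"),
    (re.compile(r"open (\w+)"), "open_program"),
    (re.compile(r"(?:what's|what is) the weather"), "check_weather"),
]
UNIT_SECONDS = {"second": 1, "minute": 60, "hour": 3600}

def parse_command(transcript):
    """Return (action, argument) for a known command, or None otherwise."""
    text = transcript.lower().strip()
    for pattern, action in COMMANDS:
        match = pattern.search(text)
        if not match:
            continue
        if action == "set_timer":
            amount, unit = match.groups()
            return action, int(amount) * UNIT_SECONDS[unit]  # duration in seconds
        if action == "open_program":
            return action, match.group(1)  # program name
        return action, None  # command with no argument
    return None  # utterance is outside the command grammar

print(parse_command("Set a timer for 5 minutes"))
```

In this sketch, `parse_command("Set a timer for 5 minutes")` returns `("set_timer", 300)`. Any utterance outside the grammar returns `None`, so the device can ask the user to rephrase instead of guessing.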

Speech Recognition Technology Impediments

By some estimates, humans began using spoken language to communicate about 2 million years ago. Over that time, evolutionary neural development has given our species the ability to perceive and process natural human language effectively. Even so, spoken communication can break down at several points in the chain from speaker to listener:

  • Breakdown at B: human vocal mechanism issues (e.g. impaired larynx)
  • Breakdown at C: interference with the vocal soundwaves due to the presence of background noise
  • Breakdown at D: issues with the individual’s auditory perception mechanism (e.g. hearing loss at certain frequencies, total hearing loss)
  • Breakdown at E: speech perception issues (e.g. inability to discern speech with a certain accent or dialect, Wernicke's aphasia, auditory processing disorders)
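Breakdown at C, background noise, is also a central problem for machine recognizers. One standard way to quantify it is the signal-to-noise ratio in decibels: SNR = 10·log10(P_signal / P_noise), where P is the mean power (mean squared amplitude) of the samples. The minimal sketch below assumes the clean speech and the noise are available as separate sample lists, which real systems rarely have. In the example, a one-second 440 Hz tone sampled at 16 kHz stands in for speech, and an alternating ±0.1 signal stands in for noise. Both are purely hypothetical.

```python
import math

def mean_power(samples):
    """Mean squared amplitude of a list of samples."""
    return sum(s * s for s in samples) / len(samples)

def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_signal / P_noise)."""
    return 10 * math.log10(mean_power(signal) / mean_power(noise))

# Hypothetical one-second example sampled at 16 kHz.
SAMPLE_RATE = 16_000
tone = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
noise = [0.1 if n % 2 == 0 else -0.1 for n in range(SAMPLE_RATE)]
print(f"SNR: {snr_db(tone, noise):.1f} dB")  # about 17.0 dB
```

Here the tone's mean power is 0.5 and the noise's is 0.01, so the script reports an SNR of about 17.0 dB. The lower the SNR, the harder it is for a recognizer, or a listener, to separate speech from noise.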

Implementation of Speech Recognition Technology

As of 2021, speech recognition technology has a wide range of applications in fields including defense, medicine and healthcare, law, education, telecommunications, and personal computing.

Conclusion

Speech recognition software is becoming increasingly important in our daily lives, both at work and at home. For developers, it is important to understand:

  • the history of speech recognition technology
  • how speech recognition software works
  • the types of speech recognition software
  • the impediments speech recognition faces
  • where speech recognition technology is applied

References