The evolution of speech recognition technology.

Remember when KITT, the talking car from Knight Rider, blew your mind? Or when Rick Deckard in Blade Runner verbally ordered his computer to enhance crime scene photos? The idea of being understood by a computer seemed pretty futuristic, let alone one that could answer your questions and follow your commands.

About the author: Graeme John Cole is a contributor to Rev, creator of the world's most accurate automatic speech recognition engine, Rev.ai.

Today we all carry KITT in our pockets. We sigh when KITT answers the phone at the bank. The personality isn't there yet, but computers can recognize the words we speak almost perfectly. Michael Knight, the Knight Rider hero who teamed up with his smart car to fight crime, was skeptical that KITT could understand his questions in 1982. But development of voice recognition technology had been underway since the 1950s. Below is how the technology has evolved over the years, and how our ways of using speech recognition and speech-to-text capabilities have evolved with it.

IBM Shoebox

(Image credit: IBM)

The first computers that listened, 1950-1980

The power of Automatic Speech Recognition (ASR) means its development has always been associated with big names. Bell Laboratories led the way with AUDREY in 1952. The AUDREY system recognized spoken digits with 97 to 99% accuracy, under carefully controlled conditions. However, according to James Flanagan, a scientist and former electrical engineer at Bell Labs, AUDREY occupied "a six-foot-high relay rack, consumed substantial power, and exhibited the myriad maintenance problems associated with complex vacuum-tube circuitry." AUDREY was too expensive and impractical, even for specialized use cases.

IBM followed in 1962 with the Shoebox, which recognized digits and simple arithmetic terms. During this time, Japanese laboratories were developing vowel and phoneme recognizers and the first speech segmenter. It's one thing for a computer to understand a small range of digits (i.e. 0-9), but Kyoto University's breakthrough was to "segment" a line of speech so the technology could work on a variety of spoken sounds.

In the 1970s, the Department of Defense's DARPA funded the Speech Understanding Research (SUR) program. The fruits of this research included Carnegie Mellon's HARPY speech recognition system. HARPY recognized sentences from a vocabulary of 1,011 words, giving it roughly the vocabulary of an average three-year-old. Like a three-year-old, speech recognition was now adorable and full of potential, but you wouldn't want it in the office.

HARPY was one of the first systems to use Hidden Markov Models (HMMs). This probabilistic method drove the development of ASR through the 1980s. Indeed, in the 1980s the first viable use cases for speech-to-text tools appeared with IBM's experimental transcription system, Tangora. With proper training, Tangora could recognize and write out 20,000 English words. However, the system was still too unwieldy for commercial use.
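
By way of illustration, here is a minimal sketch of the forward algorithm, the core probability computation in HMM-based recognition: it scores how well a sequence of audio frames matches a word's phoneme model. The states, transitions, and probabilities below are illustrative toys, not figures from HARPY or any real system.

import numpy as np

# Hypothetical phoneme states for the word "cat"; all numbers are invented.
states = ["sil", "k", "ae", "t"]
start = np.array([1.0, 0.0, 0.0, 0.0])    # always begin in silence
trans = np.array([                        # P(next state | current state)
    [0.6, 0.4, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.0, 1.0],
])
emit = np.array([                         # P(audio frame | state), one row per frame
    [0.7, 0.1, 0.1, 0.1],                 # frame 1 sounds like silence
    [0.1, 0.7, 0.1, 0.1],                 # frame 2 sounds like /k/
    [0.1, 0.6, 0.2, 0.1],
    [0.1, 0.1, 0.7, 0.1],                 # frame 4 sounds like /ae/
    [0.1, 0.1, 0.1, 0.7],                 # frame 5 sounds like /t/
])

alpha = start * emit[0]                   # forward probabilities after frame 1
for frame in emit[1:]:
    alpha = (alpha @ trans) * frame       # propagate, then re-weight by the evidence
print("P(frames | 'cat' model) =", alpha.sum())

A recognizer runs this scoring against every candidate word model and picks the one with the highest probability.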

ASR at the consumer level, from the 1990s to the 2010s

“We thought it was wrong to ask a machine to imitate people,” recalled Fred Jelinek, an IBM speech recognition innovator. “After all, if a machine has to move, it does so on wheels, not by walking. Instead of exhaustively studying how people hear and understand speech, we wanted to find the natural way for the machine to do it.” Statistical analysis was now the driving force behind the evolution of ASR technology.

In 1990, Dragon Dictate was released as the first commercial speech recognition software. It cost $9,000, around $18,890 in 2021 accounting for inflation. Until the release of Dragon NaturallySpeaking in 1997, users still had to pause between each word.

In 1992, AT&T introduced Bell Labs' Voice Recognition Call Processing (VRCP) service. VRCP now handles approximately 1.2 billion voice transactions each year.

But most of the voice recognition work in the 1990s took place under the hood. Personal computing and the ubiquitous web created new angles for innovation. That was the opportunity spotted by Mike Cohen, who joined Google to launch the company's voice technology efforts in 2004. Google Voice Search (2007) brought voice recognition technology to the masses. It also recycled the voice data of millions of web users as training material for machine learning, and it had Google's processing muscle behind it to improve quality. Apple (Siri) and Microsoft (Cortana) followed, if only to stay in the game.

In the early 2010s, the emergence of deep learning, recurrent neural networks (RNNs), and long short-term memory (LSTM) networks led to a huge leap in the capabilities of ASR technology. This forward momentum was also fueled in large measure by the rise and increased availability of low-cost computing and by massive algorithmic advances.

Screenshot of WWDC 2021

(Image credit: Apple)

The current state of ASR

Building on decades of evolution, and in response to rising user expectations, speech recognition technology has made further advances over the past half decade. Solutions for handling variable audio fidelity and for trimming demanding hardware requirements make speech recognition practical for everyday use across voice search and the Internet of Things. For example, smart speakers use on-device keyword detection to deliver immediate results for the wake word, while the rest of the sentence is sent to the cloud for processing. Google's VoiceFilter-Lite isolates the target speaker's voice on the device itself, at the user's end of the transaction. This lets consumers "train" the device with their voice; training improves the source-to-distortion ratio (SDR), enhancing the usability of voice-activated assistive applications. The word error rate (WER, the percentage of words transcribed incorrectly during speech-to-text conversion) improves dramatically. Academics suggest that by the end of the 2020s, 99% of transcription work will be automated, with humans stepping in only for quality control and corrections.
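
WER itself is simple to compute: it is the word-level edit distance (substitutions, insertions, and deletions) between the engine's output and a human reference transcript, divided by the number of reference words. Here is a minimal sketch in Python, with made-up example sentences:

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 1 error / 6 words ≈ 0.17

A WER of 0.17 would mean roughly one word in six is wrong; modern engines report far lower rates on clean audio.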

ASR use cases in the 2020s

ASR capability is improving in symbiosis with developments of the networked age. Here are three compelling use cases for automatic speech recognition.

The podcasting industry will cross the billion-dollar mark in 2021. Audiences are skyrocketing, and the words keep coming. Podcast platforms look for ASR providers with high accuracy and word-by-word timestamps to help people create podcasts more easily and maximize the value of their content. Providers like Descript convert podcasts into text that can be quickly edited. Word-based timestamps save time, allowing the editor to shape the finished podcast like clay; a sketch at the end of this section shows the idea. These transcripts also make content more accessible to all audiences and help creators improve search and discoverability for their shows through SEO.

Today, more and more meetings are taking place online, and even those that aren't are often recorded. Taking minutes by hand is expensive and time-consuming, but meeting notes are a valuable tool for attendees to get a summary or check a detail. Streaming ASR offers real-time speech-to-text, which means easy captioning and live transcription for meetings and seminars.

Processes such as legal depositions, hearings, and more are going virtual. ASR can help make this video content more accessible and engaging. More importantly, end-to-end (E2E) machine learning (ML) models keep improving speaker diarization: the record of who is present and who said what. In high-stakes situations, confidence in the tools is essential. A reliable, ultra-low-WER speech-to-text engine removes the element of doubt and reduces the time required to produce final documents and reach decisions.
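
As a hedged illustration of why word-level timestamps matter for editing, the sketch below maps a text edit directly to an audio cut. The transcript layout and the find_span helper are hypothetical stand-ins, not any specific provider's API.

words = [  # word-level timestamps as an ASR engine might return them
    {"word": "welcome", "start": 0.00, "end": 0.42},
    {"word": "back",    "start": 0.42, "end": 0.70},
    {"word": "to",      "start": 0.70, "end": 0.81},
    {"word": "the",     "start": 0.81, "end": 0.92},
    {"word": "show",    "start": 0.92, "end": 1.30},
]

def find_span(words, phrase):
    """Return (start, end) seconds of the first occurrence of a phrase."""
    target = phrase.lower().split()
    tokens = [w["word"].lower() for w in words]
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            return words[i]["start"], words[i + len(target) - 1]["end"]
    raise ValueError("phrase not found in transcript")

# Deleting "back to" in the text editor maps to a single cut in the audio:
print(find_span(words, "back to"))   # (0.42, 0.81)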

For the record

Do you think Knight Industries ever evaluated the transcripts of KITT and Michael's conversations to improve efficiency? Maybe not. But, fueled by the recent shift to working from home, more and more of our discussions are taking place online or over the phone. Real-time, high-accuracy Natural Language Processing (NLP) gives us power over our words and adds value to every interaction. The tools are no longer exclusive to big names like IBM and DARPA: they are available for consumers, businesses, and developers to use however their imaginations see fit, as speech recognition technology strives to exceed the promises of science fiction.

Interested in speech recognition? Discover our roundup of the best text-to-speech software