For the first time, an artificial intelligence (AI) has achieved higher accuracy than humans in recognizing everyday conversations. In the future, the technology could serve as a basis for automatic translation.
Digital assistants such as Alexa, Cortana, or Siri enable the automated transcription and translation of spoken language. For this purpose, speech recognition systems use artificial neural networks that assign acoustic signals to individual syllables and words with the help of libraries. The results are now very good when the assistants are addressed directly or when a text is read aloud. In everyday life, however, problems still frequently occur; as a recent study by the Ruhr-Universität Bochum (RUB) has shown, misunderstood trigger words can even activate speech assistants unintentionally.
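To make the basic principle concrete, the following Python sketch shows how frame-wise acoustic scores can be turned into text. The symbol inventory, the random stand-in "model" weights, and the CTC-style greedy decoding are illustrative assumptions, not the internals of any of the assistants mentioned above.

```python
import numpy as np

# Toy illustration: an acoustic model scores each audio frame against a
# small symbol inventory, and a decoder turns the frame-wise scores into
# text. Real assistants use deep networks and large lexicons; the random
# weights below are stand-ins.

SYMBOLS = ["-", "h", "e", "l", "o"]  # "-" is a CTC-style blank symbol

rng = np.random.default_rng(0)
n_frames, n_features = 12, 13             # e.g. 13 MFCC features per frame
features = rng.normal(size=(n_frames, n_features))
weights = rng.normal(size=(n_features, len(SYMBOLS)))  # stand-in model

logits = features @ weights
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax

# Greedy CTC-style decoding: pick the best symbol per frame,
# collapse repeated symbols, then drop blanks.
best = [SYMBOLS[i] for i in probs.argmax(axis=1)]
collapsed = [s for i, s in enumerate(best) if i == 0 or s != best[i - 1]]
print("".join(s for s in collapsed if s != "-"))
```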
Conversations between several people also still frequently cause problems. According to Alex Waibel of the Karlsruhe Institute of Technology (KIT), "there are interruptions, stutters, filler sounds like 'ah' or 'hm' and also laughter or coughing when people speak to each other." In addition, as Waibel explains, "words are often pronounced indistinctly." As a result, even humans have trouble producing an exact transcript of such an informal dialogue. For artificial intelligence, the difficulties are even greater.
Everyday conversations problematic for AI
According to a preprint published on arXiv, scientists led by Waibel have now succeeded in developing an AI that transcribes everyday conversations faster and better than humans. The new system builds on a technology that translates university lectures from German and English in real time. So-called encoder-decoder networks are used to analyze the acoustic signals and assign words to them. According to Waibel, "the recognition of spontaneous speech is the most important component in this system, because errors and delays quickly make the translation unintelligible."
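The following is a minimal sketch of the encoder-decoder idea in PyTorch: an encoder compresses a sequence of acoustic feature frames into a hidden representation, and a decoder emits one output token at a time conditioned on it. The dimensions, the greedy decoding loop, and the untrained weights are illustrative assumptions; this does not reproduce the architecture from the paper.

```python
import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    """Generic encoder-decoder sketch for speech-to-text, untrained."""

    def __init__(self, n_features=13, hidden=64, vocab_size=100):
        super().__init__()
        self.encoder = nn.GRU(n_features, hidden, batch_first=True)
        self.decoder = nn.GRUCell(hidden, hidden)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, frames, max_len=20):
        # frames: (batch, time, n_features) acoustic feature frames
        _, h = self.encoder(frames)        # final state summarizes the audio
        h = h.squeeze(0)
        token = torch.zeros(frames.size(0), dtype=torch.long)  # <sos> id = 0
        outputs = []
        for _ in range(max_len):
            h = self.decoder(self.embed(token), h)
            token = self.out(h).argmax(dim=-1)   # greedy decoding
            outputs.append(token)
        return torch.stack(outputs, dim=1)       # (batch, max_len) token ids

model = TinySeq2Seq()
audio = torch.randn(1, 50, 13)   # 50 frames of 13-dimensional features
print(model(audio).shape)        # torch.Size([1, 20])
```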
Increased accuracy and reduced latency
The KIT scientists have now substantially enhanced the system and, in particular, reduced its latency. Waibel and his team used an approach based on the probability of certain word combinations and linked it with two other recognition modules.
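One common way to exploit the probability of word combinations is to rescore competing transcription hypotheses with a language model. The sketch below illustrates this with an invented bigram table, invented acoustic scores, and an assumed interpolation weight; it shows the generic technique, not the specific coupling used in the KIT system.

```python
import math

# Invented example data: log-probabilities of word pairs (a tiny bigram
# language model) and two competing transcriptions with acoustic scores.
bigram_logprob = {
    ("the", "cat"): math.log(0.20),
    ("the", "cap"): math.log(0.01),
}

def rescore(hypotheses, lm_weight=0.5):
    """Pick the hypothesis with the best combined acoustic + LM score."""
    def combined(hyp):
        words, acoustic_logprob = hyp
        lm = sum(bigram_logprob.get(bg, math.log(1e-6))
                 for bg in zip(words, words[1:]))
        return acoustic_logprob + lm_weight * lm
    return max(hypotheses, key=combined)

# The acoustics slightly prefer "the cap", but the language model knows
# "the cat" is a far more probable word combination and tips the decision.
hyps = [(("the", "cap"), -2.0), (("the", "cat"), -2.3)]
print(rescore(hyps))   # (('the', 'cat'), -2.3)
```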
In a standardized test, the new speech recognition system listened to excerpts from a collection of about 2,000 hours of telephone conversations, which it had to transcribe automatically. According to Waibel, "the human error rate here is around 5.5 percent." The AI, by contrast, achieved an error rate of only 5.0 percent, surpassing humans for the first time in recognizing everyday conversations. The latency, i.e. the delay between the arrival of the signal and the result, is also low at an average of 1.63 seconds, although it does not yet quite match the roughly one-second average latency of a human listener.
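Error rates of this kind are typically measured as the word error rate (WER): the minimum number of word substitutions, insertions, and deletions needed to turn the transcript into the reference, divided by the length of the reference. A short sketch of the standard dynamic-programming computation, with an invented example sentence:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = minimum word edits (sub/ins/del) divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# Two substitution errors in six reference words -> WER of about 0.33.
print(word_error_rate("there are interruptions and filler sounds",
                      "there are interruption and filter sounds"))
```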
In the future, the new system could serve, for example, as a basis for automatic translations or for other scenarios in which computers need to process natural language.