Speech Intelligibility for Nuendo 11

Are speech intelligibility and ease of listening the same thing?

In a strict sense, speech intelligibility is measured as the proportion of speech items (e.g., words) that can be recognized correctly in a given situation. More broadly, the term "intelligibility" is often used to describe the perceived effort a listener has to expend to understand speech. This broader sense matters for broadcast applications: even if I am technically able to understand every word of a dialog, I may still have to invest a lot of cognitive resources, e.g., when the background sounds are too loud. This listening effort is what we measure with Nuendo’s new tool.

What "characteristics" of the speech are being considered to decide if it is intelligible or not?

Speech consists of small building blocks, so-called phonemes. Several phonemes combine to form syllables and words. Phonemes are what automatic speech recognition engines detect and assemble into meaningful words. In very clear speech, there is only a single phoneme at any given instant. In technical terms, a machine trained to recognize speech detects a high probability for the presence of one specific phoneme and a low probability for all other phonemes. The more disturbed the speech, the less distinct this probability pattern becomes: the machine is less certain which phoneme is present. This uncertainty is what we use to quantify intelligibility.
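
As a rough illustration of what "distinct" can mean here, the sketch below scores a single frame of phoneme probabilities by its normalized entropy. This is an assumption chosen for illustration, not the measure used in Nuendo: the function name posterior_sharpness, the four-phoneme example values and the entropy-based score are all hypothetical.

```python
import math

def posterior_sharpness(posteriors):
    """Score how distinct a phoneme probability distribution is.

    Returns a value near 1.0 when one phoneme dominates and near 0.0
    when all phonemes are equally likely. Hypothetical illustration only.
    """
    n = len(posteriors)
    if n < 2:
        raise ValueError("need at least two phoneme classes")
    total = sum(posteriors)
    if total <= 0:
        raise ValueError("posteriors must sum to a positive value")
    # Normalize so the values form a probability distribution.
    probs = [p / total for p in posteriors]
    # Shannon entropy; terms with zero probability contribute nothing.
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    # Divide by the maximum possible entropy (uniform distribution).
    return 1.0 - entropy / math.log(n)

# Made-up example frames with four phoneme classes.
clear_frame = [0.97, 0.01, 0.01, 0.01]  # one phoneme clearly dominates
noisy_frame = [0.30, 0.25, 0.25, 0.20]  # recognizer is unsure
```

With these made-up values, the clear frame scores much higher than the noisy one, which mirrors the idea that disturbed speech leaves the recognizer less certain.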

How do you train the AI algorithm?

The algorithm has to perform several tasks. First, it must detect whether speech is present or not. This sounds trivial but is challenging given how diverse and “speech-like” broadcast background sounds can be. Then we use automatic speech recognition technology and compute how certain the recognizer is when detecting individual phonemes. Finally, we map this certainty to a scale that corresponds to human perception as measured in hundreds of hours of listening experiments. For all this to work robustly, we used deep learning with many thousands of hours of training material containing real speech and highly challenging backgrounds.
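
The three stages described above can be sketched in a few lines. Everything in this sketch is an assumption made for illustration: the per-frame input format, the use of the largest phoneme probability as a certainty proxy, the speech threshold, and the logistic mapping with its midpoint and slope are hypothetical and do not reflect the trained models or the perceptual mapping actually used.

```python
import math

def frame_certainty(posteriors):
    """Certainty proxy: share of probability held by the most likely phoneme."""
    total = sum(posteriors)
    if total <= 0:
        raise ValueError("posteriors must sum to a positive value")
    return max(posteriors) / total

def map_to_scale(certainty, midpoint=0.5, slope=10.0):
    """Map certainty (0..1) to a 0..100 score with a logistic curve.

    midpoint and slope are placeholder values; a real mapping would be
    fitted to listening-test data.
    """
    return 100.0 / (1.0 + math.exp(-slope * (certainty - midpoint)))

def intelligibility_score(frames, speech_threshold=0.5):
    """Combine the three stages for a list of frames.

    frames: list of (speech_probability, phoneme_posteriors) tuples,
    assumed to come from a speech detector and a phoneme recognizer.
    Returns None when no frame is classified as speech.
    """
    # Stage 1: keep only frames in which speech was detected.
    speech_frames = [post for p_speech, post in frames if p_speech >= speech_threshold]
    if not speech_frames:
        return None
    # Stage 2: average the recognizer's certainty over the speech frames.
    mean_certainty = sum(frame_certainty(p) for p in speech_frames) / len(speech_frames)
    # Stage 3: map the certainty to a perception-like scale.
    return map_to_scale(mean_certainty)

# Made-up example input: the second clear frame is treated as non-speech.
clear_frames = [(0.9, [0.97, 0.01, 0.01, 0.01]), (0.1, [0.25, 0.25, 0.25, 0.25])]
noisy_frames = [(0.9, [0.30, 0.25, 0.25, 0.20])]
```

Frames without detected speech are skipped before scoring, so background-only passages do not lower the result in this sketch.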

Contact person at Fraunhofer IDMT in Oldenburg:
Dr. Jan Rennies-Hochmuth
Head of Group ‘Personalized Hearing Systems’
Fraunhofer-Institute for Digital Media Technology IDMT
Hearing, Speech and Audio Technology
Marie-Curie-Str. 2
D-26129 Oldenburg, Germany