
Siri can’t talk to me: The challenge of teaching language to voice assistants

Getting voice assistants to speak Slovakian first means getting AI to learn better.

Depending on your language preferences, the answer to this prompt remains "no."

Apple’s most recent fall event centered on excitement about the iPhone X, face recognition replacing Touch ID, OLED displays, and a cellular-enabled Apple Watch. But instead of “one more thing,” people living in Poland, Lithuania, Slovakia, the Czech Republic, and many other places around the world certainly noticed one missing thing.

Siri learned no new languages, and it’s kind of a big deal.

A touchscreen works splendidly as an interface for a smartphone, but on the tiny display of a smartwatch it becomes a nuisance. And the smart speakers that Apple wants to ship by the end of the year will have no screens at all. Siri—and other virtual assistants like Google Assistant, Cortana, or Bixby—are increasingly becoming a primary way we interact with our gadgets. And having to talk to an object in a foreign language, in your own home in your own country, just to make it play a song feels odd.

Believe me, I tried. Today, Siri supports only 21 languages.

A quick glance at the Ethnologue reveals there are more than seven thousand languages spoken in the world today. The 21 that Siri has managed to master account for roughly half of the Earth’s population. Adding new languages is subject to hopelessly diminishing returns, as companies need to go through expensive and elaborate development processes catering to smaller and smaller groups of people. Poland’s population stands at 38 million. The Czech Republic has 10.5 million, and Slovakia has just 5.4 million souls. Adding Slovakian to Siri or any other virtual assistant takes just as much effort and money as teaching it Spanish, only instead of 437 million native Spanish speakers, you get just 5.4 million Slovakians.

While details vary from Siri to Cortana to Google et al, the process of teaching these assistants new languages looks more or less the same across the board. That’s because it’s determined by how a virtual assistant works, specifically how it processes language.

So if Siri doesn't talk to you in your mother tongue right now, you’re probably going to have to wait for the technology driving her to make a leap. Luckily, the first signs of such an evolution have arrived.

It took a while before virtual assistants even got to this point, mind you.

Step one: Make them listen

“In recognizing speech you have to deal with a huge number of variations: accents, background noise, volume. So, recognizing speech is actually much harder than generating it,” says Andrew Gibiansky, a computational linguistics researcher at Baidu. Despite that difficulty, Gibiansky points out that research in speech recognition is more advanced today than speech generation.

The fundamental challenge of speech recognition has always been translating sound into characters. Voice, when you talk to your device, is registered as a waveform, a record of how the audio signal changes over time. One of the first approaches was to align parts of waveforms with corresponding characters. It worked poorly, because we all speak differently, with different voices. Even building systems dedicated to understanding just one person didn’t cut it, because that person could say any word differently, changing tempo, for instance. If a single term was spoken slowly one time and quickly the next, the input signal could be long or quite short, but in both cases it had to translate into the same set of characters.

When computer scientists determined that mapping sound directly onto characters wasn’t the best idea, they moved on to mapping parts of waveforms onto phonemes, the units linguists use to represent speech sounds. This amounted to building an acoustic model, and the phonemes it produced then went into a language model that turned those sounds into written words. What emerged was the scheme of an Automatic Speech Recognition (ASR) system: a signal-processing unit that smooths the input sound a little, transforms waveforms into spectrograms, and chops them into pieces roughly 20 milliseconds long; an acoustic model that translates those pieces into phonemes; and a language model whose job is to turn those phonemes into text.
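To make that front end concrete, here is a minimal sketch in Python with NumPy. The frame length, overlap, and synthetic input are illustrative assumptions rather than values from any shipping recognizer; the point is simply how a waveform becomes a stack of roughly 20-millisecond spectrogram slices.

```python
# A minimal sketch of an ASR signal-processing front end, using only NumPy.
# Frame length, hop size, and the synthetic input are illustrative assumptions.
import numpy as np

SAMPLE_RATE = 16000                            # 16 kHz, a common rate for speech audio
FRAME_MS = 20                                  # chop the waveform into ~20 ms pieces
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000     # 320 samples per frame
HOP_LEN = FRAME_LEN // 2                       # 50% overlap between consecutive frames

def waveform_to_spectrogram(waveform: np.ndarray) -> np.ndarray:
    """Turn a 1-D waveform into a (num_frames, num_bins) magnitude spectrogram."""
    frames = []
    for start in range(0, len(waveform) - FRAME_LEN + 1, HOP_LEN):
        frame = waveform[start:start + FRAME_LEN]
        frame = frame * np.hanning(FRAME_LEN)   # smooth the edges of each piece
        spectrum = np.abs(np.fft.rfft(frame))   # frequency content of this 20 ms slice
        frames.append(spectrum)
    return np.array(frames)

# One second of synthetic "speech": a 440 Hz tone plus noise, standing in for a real recording.
t = np.linspace(0, 1, SAMPLE_RATE, endpoint=False)
audio = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(SAMPLE_RATE)

spec = waveform_to_spectrogram(audio)
print(spec.shape)   # roughly (99, 161): ~99 overlapping 20 ms frames, 161 frequency bins
```

Those spectrogram slices are what the acoustic model consumes downstream.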

“In the old days, translation systems and speech to text systems were designed around the same tools—Hidden Markov Models,” says Joe Dumoulin, chief technology innovation officer at Next IT, a company that designed virtual assistants for the US Army, Amtrak, and Intel, among others.

What HMMs do is calculate probabilities, which are statistical representations of how multiple elements interact with each other in complex systems like languages. Take a vast corpus of human-translated text—like proceedings of the European Parliament, available in all EU member states’ languages—unleash an HMM on it to establish how probable it is for various combinations of words to occur given a particular input phrase, and you’ll end up with a more or less workable translation system. The idea was to pull off the same trick with transcribing speech.
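As a toy illustration of that statistical idea, the sketch below counts word co-occurrences across a handful of invented aligned sentence pairs and normalizes them into rough conditional probabilities. A real system would use millions of sentences and a proper alignment model; the principle, probabilities estimated from parallel text, is the same.

```python
# A toy illustration of estimating translation probabilities from sentence-aligned text.
# The tiny "corpus" is invented for the example; a real system would use millions of
# aligned sentences (e.g. Europarl) and a proper statistical alignment model.
from collections import defaultdict

parallel_corpus = [
    ("the house is small", "das haus ist klein"),
    ("the house is big",   "das haus ist gross"),
    ("the book is small",  "das buch ist klein"),
]

# Count co-occurrences of source and target words within aligned sentence pairs.
cooc = defaultdict(lambda: defaultdict(float))
for english, german in parallel_corpus:
    for e in english.split():
        for g in german.split():
            cooc[e][g] += 1.0

# Normalize counts into rough conditional probabilities P(German word | English word).
def translation_probs(english_word):
    counts = cooc[english_word]
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

# "klein" comes out among the most probable matches for "small", tied with filler
# words like "ist" that appear everywhere; a real alignment model (e.g. IBM Model 1's
# expectation-maximization) is what separates genuine translations from such noise.
print(translation_probs("small"))
```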

The connection becomes clear when you look at it from the right perspective. Think of pieces of sound as one language and of phonemes as another; then do the same with phonemes and written words. Because HMMs worked fairly well in machine translation, they were a natural choice for moving between the steps of speech recognition.
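The sketch below shows that bridging step in miniature: the hidden states are phonemes, the observations are already-discretized acoustic frames, and Viterbi decoding picks the most probable phoneme sequence. Every probability in it is made up for illustration; real acoustic models learn these numbers from hours of transcribed speech.

```python
# A minimal HMM sketch: hidden states are phonemes, observations are discretized
# acoustic frame classes, and Viterbi decoding recovers the most probable phoneme
# sequence. All probabilities are invented for illustration.
import numpy as np

phonemes = ["k", "ae", "t"]                 # hidden states (a toy inventory for "cat")
frames = [0, 0, 1, 1, 1, 2, 2]              # observed frame classes from the front end

# P(next phoneme | current phoneme): mostly stay in place, sometimes advance.
transition = np.array([
    [0.6, 0.4, 0.0],
    [0.0, 0.6, 0.4],
    [0.0, 0.0, 1.0],
])
# P(frame class | phoneme): each phoneme tends to emit its own frame class.
emission = np.array([
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
])
initial = np.array([1.0, 0.0, 0.0])         # utterances start at the first phoneme

def viterbi(obs):
    """Return the most probable hidden phoneme sequence for the observed frames."""
    logp = np.log(initial + 1e-12) + np.log(emission[:, obs[0]] + 1e-12)
    backptr = []
    for o in obs[1:]:
        scores = logp[:, None] + np.log(transition + 1e-12)   # shape (from, to)
        backptr.append(scores.argmax(axis=0))                 # best previous state per state
        logp = scores.max(axis=0) + np.log(emission[:, o] + 1e-12)
    # Trace the best path backwards from the most probable final state.
    state = int(logp.argmax())
    path = [state]
    for bp in reversed(backptr):
        state = int(bp[state])
        path.append(state)
    return [phonemes[s] for s in reversed(path)]

print(viterbi(frames))   # ['k', 'k', 'ae', 'ae', 'ae', 't', 't']
```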

With huge models and vast vocabularies, speech recognition tools fielded by IT giants like Google or Nuance brought the word error rate down to more or less 20 percent over time. But they had one important flaw: they were the result of years of meticulous human fine-tuning. Getting to this level of accuracy in a new language meant starting almost from scratch with teams of engineers, computer scientists, and linguists. It was devilishly expensive, hence only the most popular languages were supported. A breakthrough came in 2015.
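For reference, the word error rate mentioned above is just the edit distance between the recognizer's output and a human transcript, divided by the transcript's length. The sketch below computes it for an invented example.

```python
# Word error rate (WER): word-level edit distance between a reference transcript and
# the recognizer's hypothesis, divided by the reference length. Example sentences are invented.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words
    # (substitutions, insertions, and deletions all cost 1).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution (or match)
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("play the next song", "play the text son"))   # 0.5: two of four words wrong
```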
