One area where Apple is the undisputed king of the personal assistant space is localization; Siri supports 24 languages across 36 country dialects. In contrast, Google’s Assistant can only understand five languages and Alexa (popularized by the Amazon Echo) just two, English and German.
iOS 10.3 is introducing another language, Shanghainese, extending its international advantage even more. In an interview with Reuters, Apple’s head of speech explains how Siri is taught to learn a whole new language …
Alex Acero, who joined Apple in 2013, currently leads the company’s speech team. Siri voice recognition was once powered by Nuance, but Apple replaced it a couple of years ago with a custom-built in-house voice platform that relies heavily on machine learning to improve its understanding of words.
In terms of picking up a new language, Acero explains that the process starts by bringing in real people who can speak the new language to read various paragraphs and word lists, spanning different dialects and accents.
The human speech is recorded and transcribed by other humans. This forms a canonical representation of words and how they sound aloud, dictated by real people to ensure accuracy. This raw training data is then fed into a machine learning model.
The computer language model attempts to predict the transcription of arbitrary strings of words. The algorithm can improve automatically over time as it is trained with more data. Apple will tune the data a little internally and then move on to the next step.
Instead of jumping straight to Siri, Apple releases the new language as a feature of iOS and macOS dictation, available on the iPhone keyboard by pressing the microphone key next to the spacebar. This allows Apple to gain more speech samples (sent anonymously) from a much wider base of people.
These real-world audio clips naturally incorporate background noise and imperfect speech like coughing, pauses and slurring. Apple has humans transcribe the samples, then uses these newly verified pairings of audio and text as more input data for the language model. The report says this secondary process cuts the dictation error rate in half.
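The report does not say how Apple measures the dictation error rate. One common yardstick in speech recognition is word error rate: the number of word insertions, deletions and substitutions needed to turn the machine's transcript into the human one, divided by the length of the human transcript. The Python sketch below is purely illustrative; the function name and sample phrases are assumptions, not anything from Apple.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative only: the report does not state which metric Apple uses.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # a reference word was dropped
                curr[j - 1] + 1,     # an extra word was inserted
                prev[j - 1] + cost,  # words match or were substituted
            ))
        prev = curr
    return prev[-1] / len(ref)


# Human transcript versus two hypothetical machine transcripts.
human = "tell me a joke"
before = word_error_rate(human, "tell we a yolk")  # two wrong words out of four
after = word_error_rate(human, "tell me joke")     # one missing word out of four
```

In this toy example the rate drops from 0.5 to 0.25, which is what "cutting the error rate in half" would look like under this particular metric.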
Apple repeats this procedure until it feels the system is accurate enough to roll out as a headline Siri feature. Separately, voice actors record speech sequences so that Siri can synthesize audio and perform text-to-speech with replies.
The language is then released with a software update, just as Shanghainese will be a part of iOS 10.3 and macOS 10.12.4. Siri is seeded with preset answers to the ‘most common queries’; this enables Siri to answer questions like ‘tell me a joke’. Questions like ‘find nearby restaurants’ are handled dynamically, of course.
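As a rough illustration of that split between preset and dynamic answers, the Python sketch below checks a small table of human-written replies first and falls back to a handler for everything else. All names here (`CANNED_ANSWERS`, `normalize`, `answer`, the placeholder handler) are hypothetical; nothing in the report describes how Siri is actually structured internally.

```python
import string
from typing import Callable

# Hypothetical table of human-written replies for common queries.
CANNED_ANSWERS = {
    "tell me a joke": "I would, but my punchlines are still in beta.",
}


def normalize(query: str) -> str:
    """Lowercase, strip ASCII punctuation and collapse whitespace."""
    no_punct = query.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split())


def answer(query: str, dynamic_handler: Callable[[str], str]) -> str:
    """Return a scripted reply if one exists, else defer to the handler."""
    key = normalize(query)
    if key in CANNED_ANSWERS:
        return CANNED_ANSWERS[key]
    return dynamic_handler(key)


# A stand-in for a live lookup such as a restaurant search.
def placeholder_search(query: str) -> str:
    return "searching for: " + query


scripted = answer("Tell me a joke!", placeholder_search)
dynamic = answer("Find nearby restaurants", placeholder_search)
```

Here ‘Tell me a joke!’ matches the scripted table after normalization, while the restaurant query is passed through to the placeholder handler.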
Eventually, artificial intelligence will be able to answer general conversational questions without the need for a scripted database of human-written replies. That is not really possible today; Siri and all of its competitors currently rely on humans to write jokes and short answers.
Acero says that Apple looks at what real-world users ask once Siri has been deployed in a new language and updates the database of human answers every two weeks.