After Apple launched its new Machine Learning Journal as a place for its engineers to share their work with the community, the Siri team today published three new blog posts based on research being presented at Interspeech 2017 in Stockholm this week.
One post, titled “Deep Learning for Siri’s Voice: On-device Deep Mixture Density Networks for Hybrid Unit Selection Synthesis”, details the evolution of Siri’s voice up to iOS 11 and the process Apple uses for speech synthesis. It includes recordings that compare iOS 9 and iOS 10 to iOS 11 to demonstrate the improvements Apple has made in the newest release, which arrives alongside next-generation iPhones next month:
For iOS 11, we chose a new female voice talent with the goal of improving the naturalness, personality, and expressivity of Siri’s voice. We evaluated hundreds of candidates before choosing the best one. Then, we recorded over 20 hours of speech and built a new TTS voice using the new deep learning based TTS technology. As a result, the new US English Siri voice sounds better than ever. Table 1 contains a few examples of the Siri deep learning-based voices in iOS 11 and 10 compared to a traditional unit selection voice in iOS 9.
The other two posts published today by Apple’s Siri team are titled “Improving Neural Network Acoustic Models by Cross-bandwidth and Cross-lingual Initialization” and “Inverse Text Normalization as a Labeling Problem”. The inverse text normalization post details how Siri uses machine learning to display things like dates, times, addresses and currency amounts in a nicely formatted way, while the acoustic model post covers the techniques Apple uses to make introducing a new language as smooth as possible.
Head over to Apple’s Machine Learning Journal to read the full blog posts.