For developing Conversational AI, what kinds of technologies do we need? It depends on the environment and tasks of the system, but they generally fall into four categories:
- NLU - Natural Language Understanding
- The machine should be able to understand what people say and what they intend. Artificial intelligence has evolved a lot, but it is still very difficult for a computer to understand exactly what people mean.
- DM - Dialog Management
- Think about what happens when we talk to someone: we use a variety of contexts, such as the topic of the conversation, the current situation, the place, and the mood. Most of all, we need to track the flow of the dialogue.
- KB - Knowledge Base
- In addition, we draw on various kinds of knowledge: what we have seen, heard, and learned in the past, what we read in the news last night or in a book, stories we saw in movies, and so on. This also includes information generated by reasoning, beyond what we have seen and heard directly.
- NLG - Natural Language Generation
- Now it is time to express what the machine wants to say in human language. We are born among people, spend our childhood with them, and naturally begin to talk, but a computer cannot.
It would be natural for robots to communicate by voice as we do. The following two technologies are also essential for voice communication:
- STT - Speech to Text (or Speech Recognition)
- Converts voice to text. In other words, it converts sound information, delivered as a waveform, into the text form of a specific human language so that the computer can start processing it.
- TTS - Text to Speech (or Speech Synthesis)
- This time, robots need to synthesize text into a voice. For us, this is the process of making sound through our vocal cords and breathing, and the movement of our lips and tongue. It is important to produce a natural voice in terms of accent, tone, emotion, and so on.
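Both STT and TTS typically work on a time-frequency representation of the waveform rather than on raw samples. Below is a minimal sketch of such a spectrogram front-end in plain NumPy; the frame and hop sizes are illustrative values I chose for the example, not from any particular system:

```python
import numpy as np

# Minimal magnitude-spectrogram sketch: STT systems usually start by slicing
# the raw waveform into overlapping frames and converting each frame into
# frequency features like this. Frame length and hop size are illustrative.

def spectrogram(wave: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    window = np.hanning(frame_len)
    frames = [wave[i:i + frame_len] * window
              for i in range(0, len(wave) - frame_len + 1, hop)]
    # One row of |FFT| magnitudes per frame (keep only non-negative frequencies).
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))

# 1 second of a 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
```

With a 400-sample frame at 16 kHz, each frequency bin spans 40 Hz, so the 440 Hz tone shows up as a peak in bin 11 of every frame.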
Researchers used to build each of these technologies by combining many specialized modules, but with end-to-end deep neural networks we can now build an STT or TTS engine with a single network. Someday, all of these technologies could be integrated through a very large and complex neural network, or through a new paradigm entirely.
Tacotron is a good example of an end-to-end deep neural network for TTS:
- Paper - https://arxiv.org/abs/1703.10135
- Succeeding Publications
- Tacotron: An end-to-end speech synthesis system by Google
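Even in an end-to-end model like Tacotron, the input text is first normalized and mapped to a sequence of symbol IDs before it reaches the encoder. The sketch below shows that front-end step with a toy character vocabulary of my own; it is not Tacotron's actual symbol set:

```python
# Toy text front-end for a Tacotron-style model: normalize the sentence and
# map each character to an integer ID. The symbol inventory here is a toy
# example, not the vocabulary used by the real Tacotron implementation.

PAD, EOS = "_", "~"
symbols = [PAD, EOS, " "] + list("abcdefghijklmnopqrstuvwxyz.,!?'")
symbol_to_id = {s: i for i, s in enumerate(symbols)}

def text_to_sequence(text: str) -> list:
    """Lowercase the text, drop unknown characters, and append an end token."""
    ids = [symbol_to_id[ch] for ch in text.lower() if ch in symbol_to_id]
    ids.append(symbol_to_id[EOS])
    return ids

seq = text_to_sequence("Hello, world!")
```

The end-of-sequence token tells the decoder when to stop generating audio frames; for a Korean corpus like the one described below, the symbol inventory would of course need Hangul characters or phonemes instead.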
I was personally interested in TTS. If I were to have conversations with robots someday, I wanted them to speak to me with unique voices and diverse personalities, not with a flat, mechanical voice devoid of emotion.
So, what data do we need at this point? What training data would be required to simulate a person's accent and tone? What kind of corpus is good for TTS? We were curious and decided to run some experiments. First we needed to select a person to simulate, and after much consideration we chose Park Won-soon, the Mayor of Seoul, whose voice many people can easily recognize.
We searched the internet and collected 17 video clips, which can be classified into three types:
- 6 forum video clips
- One moderator and multiple participants (including Park Won-soon)
- Extracted 13 mins from 8 hours 58 mins
- 7 interview video clips
- One interviewer and one interviewee (= Park Won-soon)
- Extracted 29 mins from 2 hours 49 mins
- 4 speech video clips
- One speaker (= Park Won-soon)
- Extracted 24 mins from 31 mins
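The figures above can be reduced to a usable-audio yield per clip type, which makes the trade-off concrete: the single-speaker clips yield the most usable audio per minute, while the multi-speaker forum clips yield by far the least. A small calculation over the numbers listed above (the category labels are just my shorthand):

```python
# Usable-audio yield per clip type, using the durations listed above,
# converted to minutes (8 h 58 min = 538 min, 2 h 49 min = 169 min).

corpus = {
    "forum (1 moderator, multiple participants)": (13, 8 * 60 + 58),
    "interview (1 interviewer, 1 interviewee)": (29, 2 * 60 + 49),
    "speech (single speaker)": (24, 31),
}

for kind, (extracted, total) in corpus.items():
    print(f"{kind}: {extracted}/{total} min ({extracted / total:.1%} usable)")
```

Roughly 2.4% of the forum footage was usable, versus about 17% of the interview footage and 77% of the speeches.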
The speech data is easy to process because only one person speaks, but its formal intonation is not well suited to training a model for everyday conversation. On the other hand, his voice in the interview and forum clips is more natural because he is talking to someone, but there was a lot of noise, such as his voice overlapping with other people's or different sounds mixing in.
The following audio clips are samples from the Park Won-soon TTS corpus we built:
We then trained the end-to-end deep neural network, Tacotron, to build a TTS engine that simulates the mayor's voice. We used 41 minutes from the interview and forum video clips but excluded the speech videos because their audio quality was not good enough.
Through this empirical verification we learned which kinds of voice corpora are good for TTS. We found, once again, that small errors accumulated during corpus construction have a negative impact on the final model's performance, and that understanding the characteristics of the linguistic data is essential for building a good corpus.
After running this experiment, we released a TTS demo video on YouTube about 4 months ago:
And interestingly, we got a chance to show this TTS demo directly to the mayor on the first working day of 2019.
He watched our demo video, enjoyed it, and promised that he and the Seoul Metropolitan Government would become a testbed for AI startups. Wow! One of the pleasures of working at a startup is the diverse and rich experiences you get, and this is one of my most memorable moments.