A single speech synthesis system that can control emotions via the attention mechanism of neural networks and supports two or more speakers in one model
Conventional speech synthesis techniques simulate emotions either with audio post-processing or with statistical machine learning on large amounts of data. Each of these approaches has its own problems.
Audio post-processing methods provide flexibility by allowing arbitrary modifications, but quality degrades as the amount of modification increases. It is also difficult to reproduce fine-grained features such as intonation naturally.
With statistical machine learning, only certain emotions (such as anger or sadness) are expressed well, and flexibility is limited because these systems do not allow direct control of prosody.
Many Tacotron*-based models can accept rich conditioning, but none expresses a wide range of emotions well. By learning a latent "emotion space" rather than using a small number of discrete emotion labels, it becomes possible to express myriad kinds of emotions, including EbE (emotions between emotions).
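The idea of EbE can be sketched as interpolation between points in the learned emotion space. The embedding values and names below are purely illustrative; in practice the vectors would come from a trained emotion encoder.

```python
import numpy as np

# Hypothetical learned emotion embeddings (illustrative values only;
# in a real system these are produced by a trained emotion encoder).
rng = np.random.default_rng(0)
emotion_space = {
    "angry": rng.normal(size=16),
    "sad": rng.normal(size=16),
}

def interpolate_emotion(a: str, b: str, alpha: float) -> np.ndarray:
    """Linear interpolation in the latent emotion space.

    alpha = 0 gives emotion `a`, alpha = 1 gives emotion `b`; intermediate
    values yield "emotions between emotions" (EbE).
    """
    return (1.0 - alpha) * emotion_space[a] + alpha * emotion_space[b]

# A point halfway between "angry" and "sad":
ebe = interpolate_emotion("angry", "sad", 0.5)
```

The interpolated vector is then used as the conditioning input to the synthesizer in place of a discrete label.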
Another drawback of Tacotron-based models is the difficulty of arbitrary duration and pitch control, which is very important for generating the exact desired speech in many use cases.
Our model takes a different approach that combines all the desired characteristics without these drawbacks. Speech can be generated either from abstract emotion labels or from explicit positions in the interpolation space between emotions. Additionally, arbitrary control of pitch and duration is provided, so a specific nuance or affect can be created while still generating natural-sounding speech.
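The conditioning interface described above can be summarized as a small sketch. All field names here are hypothetical, chosen only to illustrate that the model accepts either a discrete emotion label or an explicit point in the emotion space, plus optional pitch and duration targets.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class SynthesisCondition:
    """Hypothetical conditioning inputs for one utterance (names illustrative)."""
    text: str
    emotion_label: Optional[str] = None                # e.g. "angry"
    emotion_vector: Optional[Sequence[float]] = None   # explicit point in emotion space (EbE)
    pitch_contour: Optional[Sequence[float]] = None    # desired F0 in Hz, one value per frame
    durations: Optional[Sequence[float]] = None        # desired seconds per phoneme

    def __post_init__(self) -> None:
        # The emotion must be specified exactly one way: by label or by vector.
        if (self.emotion_label is None) == (self.emotion_vector is None):
            raise ValueError("specify exactly one of emotion_label or emotion_vector")
```

A call site would build one of these per utterance, e.g. `SynthesisCondition(text="Hello", emotion_label="sad")`, leaving pitch and duration unset to let the model choose natural prosody.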
* Wang, Yuxuan, et al. "Tacotron: Towards end-to-end speech synthesis." arXiv preprint arXiv:1703.10135 (2017).
1. Arbitrary control of pitch:
[Figure: log-spectrogram of synthesized speech; the red line shows the input pitch to the system]
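An arbitrary input pitch contour like the red line in the figure can be constructed as a simple array of F0 values, one per frame. The frame rate and pitch range below are illustrative assumptions, not values from the system.

```python
import numpy as np

# Assumed frame rate for the contour (illustrative).
frames_per_second = 80
duration_s = 2.0
n_frames = int(frames_per_second * duration_s)

# A smooth rise-and-fall contour: F0 climbs from 120 Hz toward 220 Hz
# and returns, following a half-sine shape over the utterance.
t = np.linspace(0.0, 1.0, n_frames)
pitch_hz = 120.0 + 100.0 * np.sin(np.pi * t)
```

Such a contour would be supplied to the synthesizer alongside the text, and the harmonics of the output log-spectrogram should track it.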
2. Emotional synthesis examples: