Over the past year and into the current one, the AlfaNum team has been working intensively on methods for synthesizing speech with different characteristics, provided that the following are available:
- a high-quality acoustic model, that is, one that synthesizes speech with the initial characteristics;
- a small sample of speech (from a few seconds to a few minutes long) with the desired, different characteristics.
You can listen to the results:
A sample of Donald Trump's original speech:
Trump’s synthesized voice reading the same text:
Obama’s synthesized voice reading Trump's text:
A change of speech characteristics can mean:
- Changing the identity of the speaker (the initial acoustic model corresponds to the voice of one speaker, and after conversion the voice of another speaker is obtained).
- Changing the style of speech (the initial acoustic model corresponds to a common, neutral style of speech, and after conversion we can obtain, for instance, an expressive style that conveys an emotion such as joy or anger).
Examples of speech style change:
The possible applications of these innovations are numerous. First of all, they enable the generation of new TTS voices. The cost of developing a single TTS voice is very high, as is evident from the fact that even the largest companies in this field offer no more than a few voices per language, and usually only one voice for "smaller" languages. On the other hand, the need for different TTS voices clearly exists: in interactive voice systems, video games, book reader applications, audio textbooks, and so on. Furthermore, there is also demand for adapting the synthesis to the voice of the user (for reading messages from social networks, IM and e-mail messages, as well as for use in speech translation applications) or to the voice of another person (for dubbing movies using the voices of the original actors).