Audio samples from "Fine-grained Prosody Modeling in Neural Speech Synthesis using ToBI Representation"

Authors: Yuxiang Zou, Shichao Liu, Xiang Yin, Haopeng Lin, Chunfeng Wang, Haoyu Zhang, Zejun Ma.

Abstract: Benefiting from the great development of deep learning, modern neural text-to-speech (TTS) models can generate speech indistinguishable from natural speech. However, the generated utterances often keep an average prosodic style of the database instead of having rich prosodic variation. For pitch-stressed languages, such as English, accurate intonation and stress are important for conveying semantic information. In this work, we propose a fine-grained prosody modeling method in neural speech synthesis with ToBI (Tones and Break Indices) representation. The proposed system consists of a text frontend for ToBI prediction and a Tacotron-based TTS module for prosody modeling. By introducing the ToBI representation, we can control the system to synthesize speech with accurate intonation and stress at syllable level. Compared with the two baselines (Tacotron and unsupervised method), experiments show that our model can generate more natural speech with more accurate prosody, as well as effectively control the stress, intonation, and pause of the speech.

Controllability evaluation of speech synthesis backend

Samples from Table 2 in the paper.

Controllability evaluation: given ground-truth ToBI labels, we test whether each ToBI feature is effectively reflected in the generated speech.

Pitch accents mark the stressed syllable of specific words that carry the most information in a sentence. The default set is H* (high accent), L* (low accent), L*+H (a syllable that starts with a low accent and then rises), L+H* (also low-high on one syllable, but with the second part accented), and !H* (a downstepped H, pitched somewhat lower than the preceding one).

Boundary tones describe the pitch trend at each full intonation phrase boundary. The default set is H% (high tone) and L% (low tone).

Phrase accents describe the pitch movement between the last pitch accent and the boundary tone. The default set is H- (high) and L- (low).
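The three label categories above can be told apart mechanically by their suffix (`*`, `-`, `%`). The following is a minimal sketch, not the paper's frontend: the function name `classify_tobi` and the category strings are hypothetical, and it assumes only the label inventory listed on this page, plus the combined phrase-accent/boundary-tone spelling (for example L-H%) used in the sample transcripts.

```python
import re

# Label inventories exactly as listed in the text above.
PITCH_ACCENTS = {"H*", "L*", "L*+H", "L+H*", "!H*"}
PHRASE_ACCENTS = {"H-", "L-"}
BOUNDARY_TONES = {"H%", "L%"}

# Combined spelling such as "L-H%": a phrase accent followed by a boundary tone.
COMBINED = re.compile(r"([HL]-)([HL]%)")


def classify_tobi(token):
    """Return a list of (category, label) pairs for one ToBI token.

    Hypothetical helper for illustration; raises ValueError on unknown tokens.
    """
    if token in PITCH_ACCENTS:
        return [("pitch_accent", token)]
    if token in PHRASE_ACCENTS:
        return [("phrase_accent", token)]
    if token in BOUNDARY_TONES:
        return [("boundary_tone", token)]
    match = COMBINED.fullmatch(token)
    if match:
        return [("phrase_accent", match.group(1)),
                ("boundary_tone", match.group(2))]
    raise ValueError(f"unrecognized ToBI token: {token!r}")
```

For instance, `classify_tobi("L-H%")` splits the token into a low phrase accent followed by a high boundary tone, the pattern marked at the end of sample 1a.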

1a: Does Bob play basketball(H*) every(L*) day(L-H%)?
1b: Does Bob play basketball every(L*) day(L+H* H-H%)?
2a: Where is the washroom(H* L-L%), Do you know(H* L-L%)?
2b: Where(H*) is the washroom(H* H-L%), Do you know(L+H* H-H%)?
3a: Hey, I'm Jim(H* L-L%). And no(H* L-), I'm not from China(!H* L-L%). My grandparents are(L+H* H-), but I was born(H*) here(!H* L-L%). What about yourself(H* L-L%)?
3b: Hey(H* H-L%), I'm Jim(L*+H H-L%). And no(H* H-), I'm not(H*) from China(L-L%). My grandparents(H*) are(!H* L-), but I(H*) was born here(H* L-L%). What about yourself(H* L-L%)?
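The transcripts above attach labels to a word in parentheses, with multiple labels separated by spaces. As an illustration of how that notation could be read programmatically, here is a small sketch; the function name `parse_annotated` and the regular expression are assumptions made for this page's notation only, not part of the paper's system.

```python
import re

# A word immediately followed by a parenthesized label group, e.g. "day(L-H%)".
ANNOTATED = re.compile(r"([A-Za-z']+)\(([^)]*)\)")


def parse_annotated(text):
    """Return (word, [labels]) for every annotated word in the text.

    Words without a parenthesized label group are skipped.
    """
    return [(word, labels.split()) for word, labels in ANNOTATED.findall(text)]


sample_1a = "Does Bob play basketball(H*) every(L*) day(L-H%)?"
print(parse_annotated(sample_1a))
# [('basketball', ['H*']), ('every', ['L*']), ('day', ['L-H%'])]
```

Comparing the parsed labels of a pair such as 1a and 1b shows which words carry different accents or boundary tones between the two renditions.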

Comparison among systems

Samples from Table 4 in the paper.

TACO: the model architecture is the same as Tacotron, and the input features consist simply of phoneme, lexical stress, and word boundary labels.

TP-DPE: the unsupervised method. Phone-level prosodic features predicted from text, including phone duration, pitch and energy (DPE features), are used to realize prosody modeling.

TP-ToBI: the proposed method. The ToBI-related labels are predicted from text using the proposed ToBI prediction frontend.

GT-ToBI: the acoustic model and the input features are the same as in TP-ToBI, but the ToBI-related labels are manually revised from the output of the ToBI prediction frontend.

TACO | TP-DPE | TP-ToBI | GT-ToBI
1: What do you usually do in the afternoon?
2: What are the man and the woman talking about?
3: How high was the plane when the engine failed?
4: Would you like some rice and beef?
5: What a nice vase it is!
6: How careful she is!
7: There are many fun places to see and things to do.
8: They can grow so big, blocking all daylight, making it very dark and ominous standing under them.