Expressive speech synthesis is crucial for many human-computer interaction scenarios, such as audiobooks, podcasts, and voice assistants. Previous work focuses on predicting a style embedding at a single scale from the information within the current sentence. The context information in neighboring utterances and the multi-scale nature of style in human speech are neglected, making it challenging to convert multi-sentence text into natural and expressive speech. In this paper, we propose MSStyleTTS, a style modeling method for expressive speech synthesis that captures and predicts style at different levels from a wider range of context than a single sentence. Two sub-modules, a multi-scale style extractor and a multi-scale style predictor, are trained together with a FastSpeech 2-based acoustic model. The predictor explores the hierarchical context information by considering the structural relationships in the context, and predicts style embeddings at the global, sentence, and subword levels. The extractor extracts multi-scale style embeddings from the ground-truth speech and explicitly guides the style prediction. Subjective and objective evaluations on a Mandarin audiobook dataset both demonstrate that the proposed method significantly outperforms the three baselines. In addition, we analyze the context information and the multi-scale style representations, which have not been discussed before.
Fig.1: The architecture of our proposed model.
Subjective Evaluation
To demonstrate that our proposed model can significantly improve the naturalness and expressiveness of the synthesized speech, some samples are provided for comparison. GT denotes the ground truth. FastSpeech 2 denotes an open-source implementation of FastSpeech 2. WSV* denotes the word-level style variations (WSV) model with several changes, which are described in detail in the paper. HCE denotes the hierarchical context encoder (HCE) model, which predicts global-level style from the context. In addition, a well-trained HiFi-GAN is used as the vocoder to generate waveforms.
Target Chinese Text
GT
FastSpeech 2
WSV*
HCE
MSStyleTTS
小公母儿俩一进屋儿,屋儿里又多了两个人。
晚上,赏三两小醑酒,又把客人吃剩的汤菜做成杂烩,送到砦四海的窝棚。
小兔崽子!抓帽胡同儿的这几个哈哈珠砸又团聚啦。
寥花儿打累了,也把胆怯的心给打没了,拥被坐着,喘着粗气。
马嵬坡下草青青,今日犹存妃子陵。
郭二坏一眼瞥见余为农,行色匆匆地顺着二道街往前奔。
Ablation Study
The effect of using residuals to represent style variations
Target Chinese Text
MSStyleTTS
without residual style embedding
GT
您老放心,漫说开荒累不死人,就是赴汤蹈火,您侄子第一个跳进去。
勾秀云嘴上缺个把门儿的,她调笑四海。
小兔崽子!抓帽胡同儿的这几个哈哈珠砸又团聚啦。
瓜尔佳氏哼了一声,呵斥道。
打开食盒,里面儿是血肠儿白肉、大馅儿包子,还有一葫芦酒。
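This page does not spell out how the residual style embeddings are combined. As a minimal sketch of one possible reading, assume each finer-scale embedding is a residual added on top of the coarser one (global, then sentence, then subword). The function name `combine_styles` and the plain-list vector type are hypothetical and are not taken from the MSStyleTTS implementation.

```python
from typing import List

Vector = List[float]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two style vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("style vectors must share one dimension")
    return [x + y for x, y in zip(a, b)]


def combine_styles(
    global_style: Vector,
    sentence_residual: Vector,
    subword_residuals: List[Vector],
) -> List[Vector]:
    """Return one style vector per subword.

    Assumption (not confirmed by the page): the sentence-level style is the
    global style plus a sentence residual, and each subword-level style is
    the sentence-level style plus that subword's residual.
    """
    sentence_style = add(global_style, sentence_residual)
    return [add(sentence_style, r) for r in subword_residuals]


# Toy 2-dimensional example with two subwords.
styles = combine_styles([1.0, 0.0], [0.5, 0.5], [[0.0, 1.0], [1.0, 0.0]])
print(styles)
```

Under this reading, the "without residual style embedding" condition would correspond to predicting each level's embedding directly rather than as an offset from the coarser level.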
Comparisons of utilizing different ranges of context information in predictor
Target Chinese Text
L=0
L=1
L=2
L=3
L=4
每年腊月门子忙活一阵,賺到的银两都在正月里的赌场上还了人家。
本以为六格会搂席,未承想却斯文起来,端端正正儿地坐在那儿,莞尔一笑,想了半天他说.
贴上余为农,既养了家也解了自己的饥渴。
瓜尔佳氏哼了一声,呵斥道。
再说了,汪半城也脚着,就算这西施有些说道儿。
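The comparison above varies a context range L without defining it on this page. The sketch below assumes L is the number of neighboring sentences taken on each side of the current sentence, so that L=0 uses the current sentence alone. `context_window` is a hypothetical helper for illustration, not part of the authors' implementation.

```python
from typing import List


def context_window(sentences: List[str], index: int, L: int) -> List[str]:
    """Return the current sentence plus up to L neighbors on each side.

    Assumption: the window is symmetric and is simply truncated at the
    start and end of the document.
    """
    if L < 0:
        raise ValueError("L must be non-negative")
    if not 0 <= index < len(sentences):
        raise IndexError("index out of range")
    start = max(0, index - L)
    end = min(len(sentences), index + L + 1)
    return sentences[start:end]


# Toy document with five placeholder sentences.
doc = ["s0", "s1", "s2", "s3", "s4"]
for L in range(3):
    print(L, context_window(doc, 2, L))
```

With this definition, the window holds at most 2L+1 sentences, so the L=4 condition would give the predictor up to nine sentences of text.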
The effect of the multi-scale style predictor
Target Chinese Text
MSStyleTTS
without residual connections
无论啥人想给猩猩怪翻案,都不是那么好相与的。
要想去掉链子,再花三十吊.
刘二华堂会来事儿,任木匠二进古城子,还住在他家的上房。
他连忙儿打开了盒子,假地契原封不动儿地还躺在里面儿。余为商抹了把汗,胆儿突突地问.
怀瑾听了若有所悟,双手合十唱了一声佛号,躬身退了出去。
Comparisons between global-level, sentence-level and subword-level style representation
Investigation on global-level style
Target Chinese Text
Proposed
without global-level style
GT
您老放心,漫说开荒累不死人,就是赴汤蹈火,您侄子第一个跳进去。
小公母儿俩一进屋儿,屋儿里又多了两个人。
西施留在了汪家,桃儿才体会到了什么叫汪大奶奶。
小施主,关老爷一生最重一个义字。
乌雅氏和勾秀云早已经捷足先登了。
Investigation on global-level and sentence-level style
Target Chinese Text
Proposed
without global-level and sentence-level style
GT
终于有人跳下了炕,明保脑瓜皮酥了一下。
他使劲儿拍了拍穆隆阿,又使劲儿拍了拍六格,骂了一句。
小兔崽子!抓帽胡同儿的这几个哈哈珠砸又团聚啦。
瓜尔佳氏哼了一声,呵斥道。
必须用你的目光逼退鹰眼射出的寒光。
Case Study
To further explore the impact of the multi-scale style modeling framework on the expressiveness and prosody of the synthesized speech, two case studies are conducted, each comparing our MSStyleTTS with one of two mono-scale baselines. The ground-truth speech samples are also provided as references.