TTS | A one-article overview of the fundamentals of speech synthesis, with brief introductions
Text-to-Speech (usually abbreviated as TTS) refers to the technology of converting text into audio.
This article mainly covers the following:
- A brief history of speech synthesis
- Text analysis in speech synthesis
- Types of acoustic models
- Vocoders in speech synthesis
- End-to-end speech synthesis

1. History
The first "talking machine" was probably built in the late 18th century (said to have been invented by a Hungarian scientist). Computer-based speech synthesis dates from the mid-20th century, and the various techniques have now been in use for roughly 50 years. If we classify the earlier techniques:
1) Articulatory synthesis: a technique that simulates the human lips, tongue, and vocal organs.
2) Formant synthesis: the human voice can be viewed as the sound produced when a basic sound is filtered by the vocal organs. This is the so-called source-filter model: various filters are applied on top of a basic sound (for example, a single pitch) to make it sound like a human voice (a form of additive synthesis).
3) Concatenative synthesis: the first approach to make use of data. As a simple example, you could record the sounds for the digits 0 through 9 and read out a phone number by concatenating those recordings. However, the result does not sound very natural or fluent.
4) Statistical parametric speech synthesis (SPSS): an approach that builds an acoustic model, estimates its parameters, and uses it to generate audio. It can be roughly divided into three parts.
First comes "text analysis", which converts the input text into linguistic features; then the "acoustic model", which converts linguistic features into acoustic features; and finally the vocoder, which turns the acoustic features into audio. The most widely used acoustic model in this area was the hidden Markov model (HMM). With HMMs it became possible to create better acoustic features than before, but much of the generated audio still sounded mechanical, like a robot voice.
5) Neural TTS: as we entered the deep learning era in the 2010s, models based on several new kinds of neural networks were developed. These gradually replaced the HMM in the "acoustic model" part and steadily raised the quality of the generated speech. In a sense this can be seen as an evolution of SPSS, but as model performance improved, development moved toward progressively simplifying the three components above. In the figure below, for example, you can see the development going from the top (0) to the bottom (4).
[Figure: the TTS pipeline being simplified step by step, from stage (0) at the top to stage (4) at the bottom]
The models in use today fall roughly into three categories:
- Acoustic model: a model that takes characters (text) or phonemes (units of pronunciation) as input and produces some acoustic feature. Nowadays the acoustic feature almost always means a mel-spectrogram.
- Vocoder: a model that takes a mel-spectrogram (or a similar spectrogram) as input and generates actual audio.
- Fully end-to-end TTS model: a model that takes characters or phonemes as input and directly generates audio.
2. Text analysis
Text analysis converts character text into linguistic features. The following issues need to be considered:
1) Text normalization: converting abbreviations and numbers into their spoken form, for example turning 1989 into "one nine eight nine".
2) Word segmentation: an essential step for character-based languages such as Chinese. For example, depending on context, the model must decide whether to treat "包" as a single word or as part of "书包" or "包子".
3) Part-of-speech tagging: identifying verbs, nouns, prepositions, and so on.
4) Prosody prediction: predicting the subtle cues that express which parts of a sentence are stressed, how long each part lasts, how the intonation changes, and so on. Without this, the output feels very much like "robot speech". Languages differ in how pronounced this is (stress-based languages such as English differ greatly here, others less so), but every language has its own prosody, so being able to predict it from the text certainly helps. For example, a "?" at the end of a sentence naturally produces a rising intonation.
5) Grapheme-to-phoneme (G2P): many words are pronounced differently even when spelled the same. For example, the word "resume" is sometimes read as "ri-zju:m" and sometimes as "re-zju-mei", so the context of the whole sentence has to be examined. The grapheme-to-phoneme step, which converts for example "speech" into phonetic symbols such as "s p iy ch", is therefore handled first.
In the old SPSS era, these various components were added and refined to improve the quality of the generated audio. In neural TTS these parts have been greatly simplified, but some are still clearly needed. In particular, 1) text normalization and 5) G2P are usually applied as preprocessing before the input. Papers that claim to accept either characters or phonemes as input very often add "in practice, results are better when phonemes are used". Even so, things are far simpler than before, so in most neural TTS systems the text analysis part is not treated separately but regarded as simple preprocessing. G2P in particular has been studied for several languages, for example English [Chae18], Chinese [Park20], and Korean [Kim21d]. A minimal sketch of these two preprocessing steps is shown below.
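To make text normalization and G2P concrete, here is a purely illustrative Python sketch of digit verbalization and dictionary-based phoneme lookup. The tiny LEXICON and the helper names normalize/g2p are made up for this example; real systems use large pronunciation dictionaries plus a trained model for out-of-vocabulary words.

```python
import re

# Tiny illustrative lexicon; real systems use large dictionaries plus a
# trained model for unseen words (e.g., the English G2P model in [Chae18]).
LEXICON = {
    "speech": ["S", "P", "IY1", "CH"],
    "synthesis": ["S", "IH1", "N", "TH", "AH0", "S", "AH0", "S"],
}

DIGITS = "zero one two three four five six seven eight nine".split()

def normalize(text: str) -> str:
    """Spell out digit strings, e.g. '1989' -> 'one nine eight nine'."""
    def spell(match: re.Match) -> str:
        return " ".join(DIGITS[int(d)] for d in match.group())
    return re.sub(r"\d+", spell, text.lower())

def g2p(text: str) -> list:
    """Look each word up in the lexicon; keep unknown words as graphemes."""
    phones = []
    for word in normalize(text).split():
        phones.extend(LEXICON.get(word, list(word)))
    return phones

if __name__ == "__main__":
    print(normalize("born in 1989"))   # born in one nine eight nine
    print(g2p("speech synthesis"))     # ['S', 'P', 'IY1', 'CH', ...]
```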
3. Acoustic models
The acoustic model is the part that generates acoustic features, taking as input either characters or phonemes, or the linguistic features produced by the text analysis stage. As mentioned earlier, in the SPSS era the HMM (hidden Markov model) dominated the acoustic model, and neural network techniques gradually took its place. For example, [Zen13][Qian14] showed that replacing the HMM with a DNN works better. RNN-family models, however, may be better suited to time series such as speech, so [Fan14][Zen15] used models such as LSTMs to improve performance. Still, although these models use neural networks, they take linguistic features as input and output features such as MCC (mel-cepstral coefficients), BAP (band aperiodicity), LSP (line spectral pairs), LinS (linear spectrogram), and F0 (fundamental frequency), so they can be regarded as improved SPSS models.
DeepVoice [Arık17a], announced while Andrew Ng was at Baidu Research, is in fact closer to an SPSS model. It consists of several parts, such as a G2P module, a module that finds phoneme boundaries, a module that predicts phoneme durations, and a module that finds F0, with various neural network models used in each. DeepVoice 2 [Arık17b], released later, can be seen as a performance-improved, multi-speaker version of the first, with a similar overall structure.
3.1. Seq2seq-based acoustic models
Around 2014-2015, attention-based seq2seq models became the trend in machine translation, and since mapping letters to sounds has much in common with translation, the idea could be applied to speech as well. Based on this, Google developed Tacotron [Wang17] (so named because the authors like tacos). By adding a CBHG module to the RNN that forms the basis of seq2seq, a proper neural TTS finally appeared that takes characters as input and directly produces acoustic features, breaking away from the old SPSS. This seq2seq design remained the basis of TTS models for a long time afterwards.
At Baidu, DeepVoice 3 [Ping18] discarded the earlier models and adopted seq2seq with attention, although DeepVoice's CNN-based tradition persisted. The DeepVoice name ends with version 3, but the subsequent ClariNet [Ping19] and ParaNet [Peng20] carried on that line of work. In particular, ParaNet introduced several techniques to speed up the seq2seq model.
Google's Tacotron also evolved in various directions while keeping the basic seq2seq form. The first version was somewhat crude, but from Tacotron 2 [Shen18] onward the mel-spectrogram was used as the default intermediate representation. In [Wang18], style tokens that define certain speaking styles are learned and added to Tacotron to build a style-controllable TTS system. Another Google paper published at the same time [Skerry-Ryan18] proposed a model that can change the prosody of the generated audio by adding a part that learns prosody embeddings to Tacotron. DCTTS [Tachibana18] showed that replacing Tacotron's RNN parts with deep CNNs yields large gains in speed; that model was later improved into Fast DCTTS, a fast model with a much smaller footprint [Kang21].
In DurIAN [Yu20], the attention part of Tacotron 2 is changed into an alignment model, reducing errors. Non-Attentive Tacotron [Shen20] does something similar, but there the attention part of Tacotron 2 is replaced with a duration predictor to build a more robust model. FCL-TACO2 [Wang21] proposed a semi-autoregressive (SAR) method, in which each phoneme is generated autoregressively but the utterance as a whole non-autoregressively, to raise speed while keeping quality; distillation is also used to shrink the model. The result is a Tacotron 2-based model claimed to be 17-18x faster.
3.2. Transformer-based acoustic models
With the appearance of the Transformer in 2017, attention models in NLP evolved into Transformers, and Transformer-based models began to appear in TTS as well. TransformerTTS [Li19a] can be seen as the starting point: it keeps most of Tacotron 2 as-is and only replaces the RNN parts with a Transformer. This allows parallel processing and lets the model consider longer-range dependencies.
The FastSpeech [Ren19a] series can be cited as the representative of Transformer-based TTS. Here a feed-forward Transformer is used to create mel-spectrograms at very high speed. For reference, the mel-spectrogram is a transform of the FFT output that takes the characteristics of human hearing into account; it is a fairly old representation but is still widely used. One of its advantages is that it can be represented with a small number of dimensions (typically 80). A minimal sketch of computing one is shown below.
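As a concrete illustration of this representation, here is a minimal sketch using librosa. The 80-band setting and the hop/window lengths are common choices for TTS front ends, not values prescribed by any particular paper, and the bundled example clip is used only so the snippet runs on its own.

```python
import librosa

# Load ~5 s of speech at 16 kHz (librosa ships with a short example clip).
y, sr = librosa.load(librosa.ex("libri1"), sr=16000, duration=5.0)

# 80-band mel-spectrogram: STFT -> mel filter bank -> log compression.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=200, win_length=800, n_mels=80
)
log_mel = librosa.power_to_db(mel)

# One 80-dim column per 12.5 ms hop: roughly 400 frames for 5 s of audio.
print(log_mel.shape)  # (80, ~401)
```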
In TTS it is very important to match the input text to the frames of the mel-spectrogram: you need to work out exactly how many frames each character or phoneme spans. Attention is actually too flexible for this; it may be a virtue in NLP, but in speech it can hurt (words get repeated or skipped). FastSpeech therefore drops attention and uses a module that predicts lengths accurately (the length regulator). Later, FastSpeech 2 [Ren21a] simplified the network further and additionally used more diverse inputs such as pitch, duration, and energy. FastPitch [Łancucki21] proposed a model that further improves results by feeding predicted pitch information into FastSpeech. LightSpeech [Luo21] proposed a structure that makes the already fast FastSpeech another 6.5x faster by optimizing its architecture with NAS (neural architecture search). A small sketch of the length-regulator idea follows.
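The length regulator itself is simple: each phoneme's hidden vector is repeated according to its predicted duration so that the sequence length matches the number of mel frames. A minimal sketch, assuming integer durations are already predicted (the function and variable names are illustrative, not taken from the paper's code):

```python
import torch

def length_regulate(hidden: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Expand phoneme-level states to frame level.

    hidden:    (num_phonemes, channels) encoder outputs
    durations: (num_phonemes,) predicted frame counts per phoneme
    returns:   (sum(durations), channels) frame-level sequence
    """
    return torch.repeat_interleave(hidden, durations, dim=0)

# 3 phonemes with 80-dim states, lasting 4, 7, and 2 frames respectively.
hidden = torch.randn(3, 80)
durations = torch.tensor([4, 7, 2])
frames = length_regulate(hidden, durations)
print(frames.shape)  # torch.Size([13, 80])
```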
MultiSpeech [Chen20] also introduced various techniques to address the weaknesses of the Transformer, and on this basis trained FastSpeech to obtain a further improved FastSpeech model. The TransformerTTS authors later proposed a further improved Transformer TTS model, RobuTrans [Li20], which uses duration-based hard attention. AlignTTS [Zeng20] likewise introduced a way to compute the alignment with a separate network instead of attention. JDI-T from Kakao [Lim20] introduced a simpler Transformer-based architecture, again with an improved attention mechanism. NCSOFT proposed using Transformers hierarchically in both the text encoder and the audio encoder by stacking them in several layers [Bae21]; restricting the attention range and using multi-level pitch embeddings also helped performance.
3.3. Flow-based acoustic models
Flow, a newer family of generative methods that began to be applied in the image domain around 2014, has also been applied to acoustic models. Flowtron [Valle20a] can be seen as an improved Tacotron; it generates mel-spectrograms by applying IAF (inverse autoregressive flow). Flow-TTS [Miao20] built a faster model using a non-autoregressive flow, and in the follow-up model EfficientTTS [Miao21] the model is further generalized while the alignment part is further improved.
Glow-TTS [Kim20] from Kakao also uses flow to create mel-spectrograms. Glow-TTS uses classic dynamic programming to find the match between text and mel frames, showing that this approach can also produce efficient and accurate alignments in TTS. This method (monotonic alignment search) was later used in other work as well; a simplified sketch of the search appears below.
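Here is a simplified NumPy sketch of the dynamic program behind monotonic alignment search: each mel frame must be assigned to exactly one text token, tokens must appear in order, and the total per-frame score is maximized. The variable names and the toy value matrix are illustrative; the official Glow-TTS implementation is more optimized.

```python
import numpy as np

def monotonic_alignment_search(value: np.ndarray) -> np.ndarray:
    """value[i, j]: score for assigning mel frame j to text token i.
    Returns a 0/1 path matrix, monotonic and non-decreasing over frames."""
    n_text, n_frames = value.shape
    Q = np.full((n_text, n_frames), -np.inf)
    Q[0, 0] = value[0, 0]
    for j in range(1, n_frames):
        for i in range(min(j + 1, n_text)):       # token index can't exceed frame index
            stay = Q[i, j - 1]
            advance = Q[i - 1, j - 1] if i > 0 else -np.inf
            Q[i, j] = value[i, j] + max(stay, advance)

    # Backtrack from the last token/frame to recover the alignment path.
    path = np.zeros_like(value, dtype=np.int64)
    i = n_text - 1
    for j in range(n_frames - 1, -1, -1):
        path[i, j] = 1
        if j > 0 and i > 0 and Q[i - 1, j - 1] >= Q[i, j - 1]:
            i -= 1
    return path

# Toy example: 3 text tokens aligned to 7 mel frames.
rng = np.random.default_rng(0)
print(monotonic_alignment_search(rng.normal(size=(3, 7))).argmax(axis=0))
```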
3.4. VAE-based acoustic models
The variational autoencoder (VAE), another generative modeling framework born in 2013, has also been used in TTS. As the name suggests, GMVAE-Tacotron [Hsu19], announced by Google, uses a VAE to model and control various latent attributes of speech. VAE-TTS [Zhang19a], which appeared at the same time, does something similar by adding VAE-modeled style components to the Tacotron 2 model. BVAE-TTS [Lee21a] introduced a model that generates mels quickly with few parameters using a bidirectional VAE. Parallel Tacotron [Elias21a], an extension of the Tacotron family, also introduced a VAE to speed up training and generation.
3.5. GAN-based acoustic models
Generative adversarial networks (GANs), proposed in 2014, were used in [Guo19] with Tacotron 2 as the generator and GAN training as a way of producing better mels. In [Ma19], adversarial training is used so that the Tacotron generator also learns speaking style. Multi-SpectroGAN [Lee21b] learns latent representations of several styles adversarially, this time with FastSpeech 2 as the generator. GANSpeech [Yang21b] likewise trains FastSpeech 1/2 generators with a GAN method; adaptively scaling the feature-matching loss helps performance.
3.6. Diffusion-based acoustic models
TTS systems based on diffusion models, which have attracted much attention recently, have also been proposed one after another. Diff-TTS [Jeong21] further improves output quality by using a diffusion model for the mel-generation part. Grad-TTS [Popov21] does something similar by turning the decoder into a diffusion model, but there Glow-TTS is used for everything except the decoder. In PriorGrad [Lee22a], the prior distribution is built from data statistics, enabling more efficient modeling; the paper shows an example of applying this to the acoustic model using per-phoneme statistics. Tencent's DiffGAN-TTS [Liu22a] also uses a diffusion decoder, trained adversarially, which greatly reduces the number of inference steps and speeds up generation.
3.7. Other acoustic models
The techniques introduced above do not have to be used in isolation; they can be mixed and combined. The FastSpeech authors themselves analyzed this and found that a VAE captures long-range information such as prosody well even at small size but with slightly lower quality, while flow preserves detail well but needs a large model to reach high quality. PortaSpeech [Ren21b] therefore proposed a model that contains an element of each: Transformer, VAE, and flow.
VoiceLoop [Taigman18] proposed a model that stores and processes speech information using a structure resembling a human working-memory model, called a voice loop. It is an early model that considers multiple speakers, and it was later used as the backbone network in other studies such as [Akuzawa18], Facebook's [Nachmani18], and [deKorte20].
DeviceTTS [Huang21] is a model that uses the deep feed-forward sequential memory network (DFSMN) as its basic unit. This is a feed-forward network with memory blocks, a small but efficient network that can retain long-term dependencies without recurrence. Based on this, a TTS model was proposed that can readily be used on ordinary mobile devices.
4. Vocoders
The vocoder is the component that takes the acoustic features produced by the acoustic model and converts them into a waveform. Vocoders were of course needed in the SPSS era too; vocoders used at that time include STRAIGHT [Kawahara06] and WORLD [Morise16].
4.1. Autoregressive vocoders
The neural-vocoder era begins with WaveNet [Oord16], which is important for introducing dilated convolution layers to model long audio samples and for showing that high-quality audio can be generated autoregressively, that is, by producing each next audio sample one at a time from the previously generated samples. WaveNet itself can in fact act as acoustic model plus vocoder, taking linguistic features as input and generating audio, but since then it has become common to create a mel-spectrogram with a more sophisticated acoustic model and generate audio from it with WaveNet. A tiny sketch of why autoregressive sampling is slow follows.
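The cost of autoregressive generation is visible in the sampling loop itself: every one of the 16,000 samples per second requires a full network evaluation conditioned on the samples generated so far. A schematic sketch, in which `dummy_model` is a stand-in for a real conditioned network, not WaveNet's actual architecture:

```python
import numpy as np

def dummy_model(context: np.ndarray) -> np.ndarray:
    """Stand-in for a neural net: a probability over 256 mu-law bins."""
    logits = np.zeros(256)
    logits[128] = 1.0  # always favors silence; a real model also conditions on mels
    return np.exp(logits) / np.exp(logits).sum()

def generate(n_samples: int, receptive_field: int = 1024) -> np.ndarray:
    rng = np.random.default_rng(0)
    samples = np.zeros(receptive_field, dtype=np.int64)      # zero-padded history
    for _ in range(n_samples):                                # one network call per sample,
        probs = dummy_model(samples[-receptive_field:])       # i.e. 16,000 calls per second
        samples = np.append(samples, rng.choice(256, p=probs))
    return samples[receptive_field:]

print(generate(160).shape)  # 10 ms of 16 kHz audio already needs 160 forward passes
```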
In Tacotron [Wang17], a linear spectrogram is created and converted to a waveform with the Griffin-Lim algorithm [Griffin84]. Since this algorithm is roughly 40 years old, the resulting audio was not very satisfying even though the overall network structure was very good. DeepVoice [Arık17a] used a WaveNet vocoder from the start, and in DeepVoice 2 [Arık17b] in particular, besides their own model, the authors also attached a WaveNet vocoder to another group's model, Tacotron, and obtained better performance (reportedly better than DeepVoice 2 itself on a single speaker). From version 2 [Shen18] onward, Tacotron uses WaveNet as its default vocoder. A short Griffin-Lim sketch is shown below.
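For reference, Griffin-Lim iteratively estimates the phase that the magnitude spectrogram discarded by alternating STFT and inverse STFT. A minimal sketch with librosa; the FFT size, hop length, and iteration count are just common defaults, and the bundled example clip stands in for an acoustic model's output:

```python
import librosa
import soundfile as sf

# Magnitude spectrogram of a short clip, then reconstruct audio without phase.
y, sr = librosa.load(librosa.ex("libri1"), sr=16000, duration=3.0)
S = abs(librosa.stft(y, n_fft=1024, hop_length=200))

# Griffin-Lim searches for a phase consistent with the given magnitudes |S|.
y_hat = librosa.griffinlim(S, n_iter=60, hop_length=200)
sf.write("griffinlim_demo.wav", y_hat, sr)
```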
SampleRNN [Mehri17] is another autoregressive model that creates samples one by one, this time with RNNs. These autoregressive models are very slow at generating audio, because each next sample is built from the previous ones, one at a time. Much later work therefore proposed models with faster generation.
FFTNet [Jin18] noted that the shape of WaveNet's dilated convolutions resembles an FFT and proposed a technique to speed up generation. In WaveRNN [Kalchbrenner18], various techniques (custom GPU kernels, pruning, subscaling, and so on) were used to speed up WaveNet-style generation. WaveRNN has since evolved into a universal neural vocoder and various other forms. In [Lorenzo-Trueba19], WaveRNN was trained on data from 74 speakers and 17 languages to create the RNN_MS (multi-speaker) model, demonstrating a vocoder that produces good quality even for speakers and environments not in the training data. [Paul20a] proposed the SC (speaker-conditional) WaveRNN model, trained with additional speaker embeddings, which is also shown to work for speakers and environments outside the data.
Apple's TTS [Achanta21] also uses WaveRNN as its vocoder, with various code optimizations and parameter settings on both the server and mobile sides so that it can run on mobile devices.
The approach of processing the audio signal by splitting it into several subbands, that is, shorter downsampled versions, has been applied to several models, because it allows fast parallel computation and different processing for each subband. For WaveNet, for example, [Okamoto18a] proposed a subband WaveNet that splits the signal into subbands with a filter bank, and [Rabiee18] proposed a method using wavelets. [Okamoto18b] proposed a subband version of FFTNet. DurIAN [Yu20] is mainly an acoustic-model paper, but it also proposed a subband version of WaveRNN.
Many vocoders introduced since then use non-autoregressive methods to remedy the slow generation of autoregressive ones: methods that can generate subsequent samples without looking at the previous ones (usually described as "parallel"). All kinds of non-autoregressive methods have been proposed, but a recent paper showing that autoregressive methods are not dead is Chunked Autoregressive GAN (CARGAN) [Morrison22]. It shows that many non-autoregressive vocoders suffer from pitch errors, and that this problem can be solved with an autoregressive approach. Speed is of course a concern, but by splitting the computation into chunks it introduces a way to reduce time and memory significantly.
4.2. Flow-based vocoders
Normalizing-flow techniques can be divided into two broad categories. The first is the autoregressive transform; with the representative IAF (inverse autoregressive flow), generation is very fast, but in exchange training takes a long time. It can therefore be used to generate audio quickly, but slow training is a problem. In Parallel WaveNet [Oord18], an autoregressive WaveNet model is built first and then a similar non-autoregressive IAF model is trained to mimic it; this is called a teacher-student model, or distillation. ClariNet [Ping19] later used a similar approach to propose a simpler and more stable training method. Once the IAF model is successfully trained, audio can be generated quickly, but the training procedure is complex and computationally heavy.
The other flow technique is the bipartite transform, which uses layers called affine coupling layers to make both training and generation fast. Around the same time, two vocoders using this method were proposed, WaveGlow [Prenger19] and FloWaveNet [Kim19]. The two papers start from almost the same idea, with only minor structural differences, including how channels are mixed. The advantage of the bipartite transform is its simplicity, but the drawback is that many layers must be stacked to match the expressiveness of an IAF, so the parameter count is rather large. A small sketch of an affine coupling layer follows.
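The appeal of the affine coupling layer is that it is invertible by construction: half of the input passes through unchanged and parameterizes an affine transform of the other half. A minimal NumPy sketch; the tiny scale/shift network is a placeholder for illustration, not the WaveGlow or FloWaveNet architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(2, 4)), rng.normal(size=(4, 4))

def st_net(x_a: np.ndarray):
    """Placeholder network producing log-scale s and shift t from one half."""
    h = np.tanh(x_a @ W1)                  # (batch, 2) -> (batch, 4)
    out = h @ W2                           # (batch, 4)
    return out[:, :2], out[:, 2:]          # s and t, 2 channels each

def coupling_forward(x: np.ndarray) -> np.ndarray:
    x_a, x_b = x[:, :2], x[:, 2:]          # split the 4 channels in half
    s, t = st_net(x_a)
    return np.concatenate([x_a, x_b * np.exp(s) + t], axis=1)

def coupling_inverse(y: np.ndarray) -> np.ndarray:
    y_a, y_b = y[:, :2], y[:, 2:]
    s, t = st_net(y_a)                     # y_a == x_a, so s and t are recoverable
    return np.concatenate([y_a, (y_b - t) * np.exp(-s)], axis=1)

x = rng.normal(size=(5, 4))
print(np.allclose(coupling_inverse(coupling_forward(x)), x))   # True
```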
Later, WaveFlow [Ping20] provided a unified view of several audio-generation methods: it explains not only flow methods such as WaveGlow and FloWaveNet but also WaveNet as special cases of a generalized model, and proposes a model that computes faster than these. SqueezeWave [Zhai20] proposed a model that is orders of magnitude faster (with a slight drop in quality) by removing inefficiencies in WaveGlow and using depthwise separable convolutions. WG-WaveNet [Hsu20] likewise builds a model by applying weight sharing in WaveGlow to shrink the model substantially and adding a small WaveNet filter to improve audio quality, so that 44.1 kHz audio can be generated faster than real time on a CPU.
4.3. GAN-based vocoders
Generative adversarial networks (GANs), widely used in the image domain, took a long time (4-5 years) to be applied successfully to audio generation. WaveGAN [Donahue19] can be cited as the first major result. It carried over structures developed for images, so while it produced audio of a certain quality, something still seemed to be missing.
Starting with GAN-TTS [Binkowski20], to make models better suited to audio, attention turned to designing discriminators that capture waveform characteristics well. GAN-TTS uses multiple random window discriminators to look at more diverse features, while MelGAN [Kumar19] uses a method of examining the audio at multiple scales (multi-scale discriminator). HiFi-GAN from Kakao [Kong20] proposed considering yet another audio characteristic, periodicity, with a multi-period discriminator. VocGAN [Yang20a] also uses discriminators at multiple resolutions. In [Gritsenko20], the discrepancy between the generated and real distributions is defined in the form of a generalized energy distance (GED) and the model is trained to minimize it. Sophisticated discriminators have greatly improved the quality of generated audio in various ways; [You21] analyzes this further and highlights the importance of multi-resolution discriminators. In Fre-GAN [Kim21b], both the generator and the discriminator are connected in a multi-resolution fashion, and using the discrete wavelet transform (DWT) also helps.
On the generator side, many models use the combination of dilated and transposed convolutions proposed by MelGAN. With slight variations, Parallel WaveGAN [Yamamoto20] also takes Gaussian noise as input, VocGAN generates waveforms at multiple scales, and HiFi-GAN uses a generator with multiple receptive fields. [Yamamoto19] also proposed training an IAF generator with a GAN method.
The aforementioned Parallel WaveGAN [Yamamoto20], proposed by Naver/Line, can generate audio at very high speed thanks to its non-autoregressive WaveNet generator. [Wu20] proposed a more pitch-robust version by adding pitch-dependent dilated convolutions. Later, [Song21] proposed a further improved Parallel WaveGAN that reduces perceptually salient errors by applying a perceptually weighted spectrogram loss. In addition, [Wang21] proposed a way to build Parallel WaveGAN (and MelGAN) with fewer local artifacts by applying a pointwise relativistic LSGAN (an improved least-squares GAN) to audio. In LVCNet [Zeng21], a generator whose convolution layers vary with the conditioning, called location-variable convolution, is inserted into Parallel WaveGAN and trained, yielding a faster (4x) generation model with little difference in quality.
MelGAN has since been improved in several ways as well. Multi-Band MelGAN [Yang21a] enlarges the receptive field of the original MelGAN, adds the multi-resolution STFT loss (proposed by Parallel WaveGAN), and computes in multiple bands (as suggested by DurIAN), resulting in a faster and more stable model. A multi-speaker version, Universal MelGAN [Jang20], was also proposed; it too uses multi-resolution discriminators to generate audio with more detail. This idea was continued and further refined in the follow-up study UnivNet [Jang21], for example by also using a multi-period discriminator. In these studies, audio quality is additionally improved by using a wider-band (80 → 100) mel-spectrogram.
Seoul National University/NVIDIA introduced a new vocoder called BigVGAN [Lee22b]. It is intended as a universal vocoder that copes with diverse recording environments, unseen languages, and so on; as technical improvements, the snake function gives the HiFi-GAN generator a periodic inductive bias, and low-pass filters are added to reduce the (aliasing) artifacts this would otherwise cause. The model size was also increased considerably (~112M parameters) and trained successfully.
4.4. Diffusion-based vocoders
Diffusion models, arguably the latest generation of generative models, were applied to vocoders relatively early. DiffWave [Kong21] and WaveGrad [Chen21a], which share similar ideas, were both presented at ICLR 2021. Both use a diffusion model for the audio-generation part, but DiffWave resembles WaveNet while WaveGrad builds on GAN-TTS; they also handle the iterations differently, which makes the two papers interesting to compare. PriorGrad [Lee22a], introduced earlier in the acoustic-model section, is also presented with a vocoder as an example; there the prior is computed from the energy of the mel-spectrogram. A small sketch of the diffusion training objective follows.
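For orientation, the DDPM-style recipe these vocoders share is: corrupt the waveform with Gaussian noise according to a schedule, and train a network to predict that noise (in practice conditioned on the mel-spectrogram). A schematic sketch of one training step; the tiny `eps_model` and the schedule values are placeholders, not the DiffWave or WaveGrad architecture:

```python
import torch

T = 50
betas = torch.linspace(1e-4, 0.05, T)             # noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)      # cumulative signal fraction

eps_model = torch.nn.Sequential(                   # placeholder denoiser
    torch.nn.Linear(16000 + 1, 256), torch.nn.ReLU(), torch.nn.Linear(256, 16000)
)

def training_step(x0: torch.Tensor) -> torch.Tensor:
    """One DDPM training step on a batch of 1-second waveforms (batch, 16000)."""
    t = torch.randint(0, T, (x0.shape[0],))
    a = alpha_bar[t].unsqueeze(1)
    eps = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps     # sample from q(x_t | x_0)
    # Condition on the step index (real models also condition on the mel).
    pred = eps_model(torch.cat([x_t, t.unsqueeze(1).float() / T], dim=1))
    return torch.nn.functional.mse_loss(pred, eps)

print(training_step(torch.randn(4, 16000)))        # scalar loss
```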
The advantage of the diffusion approach is that it can learn complex data distributions and produce high-quality results; its biggest drawback is the relatively long generation time. Also, because the method works by removing noise, if it runs for too many steps there is a risk that noise-like components actually present in the original audio (unvoiced sounds and the like) disappear as well. FastDiff [Huang22a] applies the idea of LVCNet [Zeng21] to diffusion models, proposing a time-aware location-variable convolution; this makes the diffusion more robust, and a noise-schedule predictor further reduces generation time.
BDDM from Tencent [Lam22] also proposed a way to greatly reduce generation time: the forward and reverse diffusion processes use different networks (forward: schedule network, reverse: score network), with a new theoretical derivation to justify this. It shows that audio can be generated in as few as three steps; at that speed, diffusion methods become practical. While most earlier work uses DDPM-style modeling, diffusion models can also be expressed as stochastic differential equations (SDEs); ItoWave [Wu22b] shows an example of generating audio with SDE-style modeling.
4.5. Source-filter-based vocoders
At the beginning of this article, when covering the history of TTS, we briefly looked at formant synthesis: the human voice is modeled as a basic source sound (a sine tone, for example) filtered by the structures of the mouth into the sound we hear. The most important part of this approach is how to build the filter, and in the deep learning era the natural idea is that performance might improve if this filter is modeled with a neural network. In the neural source-filter approach [Wang19a], a basic sinusoidal sound is created from f0 (pitch) information, and a filter built from dilated convolutions is trained to produce high-quality sound; it is not autoregressive, so it is fast. Later, [Wang19b] extended and restructured it into a harmonic-plus-noise model to improve performance. DDSP [Engel20] proposed a way to create various sounds with a neural network plus several DSP components, where the harmonic part uses additive synthesis and the noise part a linear time-varying filter. A small source-filter sketch follows.
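To make the source-filter idea concrete, here is a minimal sketch that builds a harmonic source at a fixed f0, adds a noise source, and shapes both with simple fixed filters. Real neural source-filter models learn time-varying filters from data; every parameter here is an arbitrary illustration.

```python
import numpy as np
from scipy.signal import lfilter
import soundfile as sf

sr, f0, dur = 16000, 120.0, 1.0
t = np.arange(int(sr * dur)) / sr

# Harmonic source: a few sine partials at multiples of f0 (additive synthesis).
source = sum(np.sin(2 * np.pi * f0 * k * t) / k for k in range(1, 6))

# Noise source for the unvoiced / breathy component.
noise = 0.05 * np.random.default_rng(0).normal(size=t.shape)

# A crude "vocal tract": fixed one-pole low-pass filters applied to both parts.
voiced = lfilter([0.1], [1.0, -0.9], source)
unvoiced = lfilter([0.3], [1.0, -0.7], noise)

y = voiced + unvoiced
sf.write("source_filter_demo.wav", y / np.abs(y).max(), sr)
```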
Another approach separates and processes the parts related to the pitch of the voice (formants) and the rest (called the residual, excitation, and so on). This too has a long history: the formant part typically uses LP (linear prediction), while various models have been used for the excitation. GlotNet [Juvela18], proposed in the neural-network era, models the (glottal) excitation with a WaveNet. GELP [Juvela19] later extended this to a parallel form using GAN training.
ExcitNet [Song19] from Naver/Yonsei University can be seen as a model with a similar idea; the extended model LP-WaveNet [Hwang20a] then trains source and filter together with a more sophisticated model. [Song20] introduced the modeling-by-generation (MbG) concept, in which information generated by the acoustic model is fed to the vocoder to improve performance. In the neural homomorphic vocoder [Liu20b], the harmonic part uses a linear time-varying (LTV) impulse train and the noise part LTV-filtered noise. [Yoneyama21] proposed a model that uses Parallel WaveGAN as the vocoder and unifies several of the source-filter models above. Parallel WaveGAN itself has also been continuously extended by the original authors' group (Naver and others): first, in [Hwang21b], the generator was extended to a harmonic-plus-noise model and a subband version was added; then [Yamamoto21] proposed several techniques to improve the discriminator, among them splitting it so that voiced (harmonic) and unvoiced (noise) components are treated separately.
LPCNet [Valin19] is probably the most widely used model in this source-filter line. It adds linear prediction to WaveRNN, and it has since been improved in many directions. Bunched LPCNet [Vipperla20] makes LPCNet more efficient using techniques introduced in the original WaveRNN. Gaussian LPCNet [Popov20a] also improves efficiency by predicting multiple samples at once. [Kanagawa20] improves efficiency in another direction by shrinking WaveRNN's internal components with tensor decomposition. iLPCNet [Hwang20b] proposed a model that outperforms the existing LPCNet by exploiting a mixture-density network in the continuous domain. [Popov20b] proposed finding points in the speech where it can be cut (for example, pauses or unvoiced sounds), splitting it there, generating the pieces in parallel, and cross-fading them to speed up generation. LPCNet has also been extended to subband versions, first with the subband LPCNet introduced in FeatherWave [Tian20], and then in [Cui20] with an improved subband LPCNet that takes the correlation between subbands into account. Recently the LPCNet authors (who appear to have moved from Mozilla/Google to Amazon) also proposed further improvements [Valin22]: a tree structure to reduce the computation during sampling and 8-bit quantized weights, all methods that use caches effectively and exploit the improved parallel computation of recent GPUs.
Overall, vocoders are moving from high-quality but slow AR (autoregressive) methods toward fast NAR (non-autoregressive) methods, and thanks to several advanced generative techniques NAR is gradually reaching AR-level quality. For example, TTS-BY-TTS [Hwang21a] used an AR model to create large amounts of data for training a NAR model, with good results. However, using all of that data can be harmful, so TTS-BY-TTS 2 [Song22] proposed training only on a selected subset, using a RankSVM to pick the synthetic audio most similar to the original.
DelightfulTTS [Liu21], the TTS system used by Microsoft, has some structural modifications of its own, such as the use of conformers, and is notable for generating the final audio at 48 kHz (most TTS systems usually generate 16 kHz audio). For this, the mel-spectrogram is generated at 16 kHz, but the final audio is produced at 48 kHz with their internally developed HiFiNet.
5. Fully end-to-end TTS
This section introduces models that learn the acoustic model and vocoder together and produce waveform audio directly from input text or phonemes. Ideally everything is done at once: there are no separate training stages (fewer stages mean fewer accumulated errors), and no intermediate acoustic feature such as the mel-spectrogram is needed. Mel works well, but it is a hand-designed (and therefore suboptimal) representation, and its phase information is lost. The reason such models were not developed from the beginning is that doing everything in one step is hard.
For example, 5 seconds of input text is roughly 20 characters, or roughly 100 phonemes, whereas the waveform is 80,000 samples (at a 16 kHz sampling rate). Matching these directly (text → audio samples) is therefore difficult, and it is much easier to go in two steps through an intermediate-resolution representation such as mel. As the technology matured, however, some models trained in this fully end-to-end way began to appear. For reference, many acoustic-model papers use the term "end-to-end" loosely, meaning that the text analysis part has been absorbed into the model, or that audio can be generated by simply attaching a vocoder. The resolution gap is worked out in the short sketch below.
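The resolution mismatch described above is easy to quantify; here is a small sketch of the bookkeeping (the 12.5 ms hop is a typical choice, not a universal constant, and the character/phoneme counts are the rough figures from the text):

```python
# 5 seconds of speech at a 16 kHz sampling rate.
duration_s, sample_rate = 5.0, 16_000
n_samples = int(duration_s * sample_rate)           # 80,000 waveform samples

n_chars, n_phonemes = 20, 100                        # rough figures from the text
hop_s = 0.0125                                       # common 12.5 ms mel hop
n_mel_frames = int(duration_s / hop_s)               # 400 intermediate frames

print(f"samples per character: {n_samples // n_chars:,}")     # 4,000
print(f"samples per phoneme:   {n_samples // n_phonemes:,}")  # 800
print(f"samples per mel frame: {n_samples // n_mel_frames}")  # 200
```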
Perhaps the first in this area is Char2Wav [Sotelo17], a paper from Yoshua Bengio's group at the University of Montreal, trained in one go by attaching the team's own SampleRNN [Mehri17] vocoder to a seq2seq acoustic model. The main contribution of ClariNet [Ping19] is really to make the WaveNet → IAF style of vocoder more efficient, but since the team (Baidu) had its own acoustic model (DeepVoice 3), the paper also shows how to attach the newly built vocoder to it and train the whole thing together, that is, how to build an end-to-end model.
FastSpeech 2 [Ren21a] is likewise mainly about a good acoustic model, but the paper also introduces a fully end-to-end model called FastSpeech 2s. It attaches a WaveNet-style vocoder to the FastSpeech 2 model and, to overcome the difficulty of training, uses a mel encoder built in advance. The model called EATS [Donahue21] uses GAN-TTS [Binkowski20], created by the same (Google/DeepMind) team, as its vocoder, builds a new acoustic model, and trains the two together; training in one go is still hard, so an intermediate-resolution representation is created and used. Wave-Tacotron [Weiss21] is a model trained directly by integrating a vocoder into Tacotron; a flow-based vocoder is used here (D. Kingma is among the authors), making it possible to build a faster model without a significant drop in quality.
EfficientTTS [Miao21], introduced earlier in the acoustic-model section, also presents a model (EFTS-Wav) that replaces the decoder with MelGAN and trains end-to-end; it shows that audio generation can be sped up significantly while the model still performs well. The Kakao team developed the acoustic model Glow-TTS [Kim20] and the vocoder HiFi-GAN [Kong20], and the two can be put together to build an end-to-end model. The model built this way is VITS [Kim21a], which connects the two parts with a VAE and trains everything with an adversarial method, yielding a model with both good speed and good quality.
Yonsei University/Naver also introduced LiteTTS [Nguyen21] in 2021, an efficient fully end-to-end TTS. It uses a lightweight version of a feed-forward Transformer together with the HiFi-GAN structure; in particular, a domain-transfer encoder is used to learn textual information related to the prosody embedding. Tencent and Zhejiang University proposed the vocoder FastDiff [Huang22a] and also introduced FastDiff-TTS, a fully end-to-end model that combines it with FastSpeech 2. Kakao also introduced JETS, which trains FastSpeech 2 and HiFi-GAN jointly [Lim22]. Microsoft, while upgrading the existing DelightfulTTS to version 2, also introduced a fully end-to-end method [Liu22b]; here a VQ audio encoder is used as the intermediate representation.
Sources
[1] [논문들소개] Neural Text-to-Speech (TTS)
[2] 1906.10859.pdf (arxiv.org)
Reference [Griffin84] D.Griffin, J.Lim. Signal estimation from modified short-time fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243, 1984.[Kawahara06] H.Kawahara. Straight, exploitation of the other aspect of vocoder: Perceptually isomor- phic decomposition of speech sounds. Acoustical science and technology, 27(6):349–353, 2006.[Zen13] H.Zen, A.Senior, M.Schuster. Statistical parametric speech synthesis using deep neural networks. ICASSP 2013.[Fan14] Yuchen Fan, Yao Qian, Feng-Long Xie, and Frank K Soong. TTS synthesis with bidirectional lstm based recurrent neural networks. Fifteenth annual conference of the international speech communication association, 2014.[Qian14] Y. Qian, Y.-C. Fan, W.-P. Hum, F. K. Soong, On the training aspects of deep neural network (DNN) for parametric TTS synthesis. ICASSP 2014.[Zen15] H.Zen, Hasim Sak. Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis. ICASSP 2015.[Morise16] M.Morise, F.Yokomori, K.Ozawa. World: a vocoder-based high-quality speech synthesis system for real-time applications. IEICE Transactions on Information and Systems, 99(7):1877–1884, 2016.[Oord16] A.van den Oord, S.Dieleman, H.Zen, K.Simonyan, O.Vinyals, A.Graves, N.Kalchbrenner, A.Senior, K.Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016. [Arık17a] S.Ö.Arık, M.Chrzanowski, A.Coates, G.Diamos, A.Gibiansky, Y.Kang, X.Li, J.Miller, J.Raiman, S.Sengupta, M.Shoeybi. Deep Voice: Real-time neural text-to-speech. ICML 2017.[Arık17b] S.Ö.Arık, G.Diamos, A.Gibiansky, J.Miller, K.Peng, W.Ping, J.Raiman, Y.Zhou. Deep Voice 2: Multi-speaker neural text-to-speech. NeurIPS 2017.[Lee17] Y.Lee, A.Rabiee, S.-Y.Lee. Emotional end-to-end neural speech synthesizer. arXiv preprint arXiv:1711.05447, 2017.[Mehri17] S.Mehri, K.Kumar, I.Gulrajani, R.Kumar, S.Jain, J.Sotelo, A.Courville, Y.Bengio. SampleRNN: An unconditional end-to-end neural audio generation model. ICLR 2017. [Ming17] H.Ming, Y.Lu, Z.Zhang, M.Dong. Alight-weight method of building an LSTM-RNN-based bilingual TTS system. International Conference on Asian Language Processing 2017.[Sotelo17] J.Sotelo, S.Mehri, K.Kumar, J.F.Santos, K.Kastner, A.Courville, Y.Bengio. Char2wav: End-to-end speech synthesis. ICLR workshop 2017. [Tjandra17] A.Tjandra, S.Sakti, S.Nakamura. Listening while speaking: Speech chain by deep learning. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2017.[Wang17] Y.Wang, RJ Skerry-Ryan, D.Stanton, Y.Wu, R.Weiss, N.Jaitly, Z.Yang, Y.Xiao, Z.Chen, S.Bengio, Q.Le, Y.Agiomyrgiannakis, R.Clark, R.A.Saurous. Tacotron: Towards end-to-end speech synthesis. Interspeech 2017. [Adigwe18] A.Adigwe, N.Tits, K.El Haddad, S.Ostadabbas, T.Dutoit. The emotional voices database: Towards controlling the emotion dimension in voice generation systems. arXiv preprint arXiv:1806.09514, 2018.[Akuzawa18] K.Akuzawa, Y.Iwasawa, Y.Matsuo. Expressive speech synthesis via modeling expressions with variational autoencoder. Interspeech 2018.[Arık18] S.Ö.Arık, J.Chen, K.Peng, W.Ping, Y.Zhou. Neural voice cloning with a few samples. NeurIPS 2018.[Chae18] M.-J.Chae, K.Park, J.Bang, S.Suh, J.Park, N.Kim, L.Park. Convolutional sequence to sequence model with non-sequential greedy decoding for grapheme to phoneme conversion. ICASSP 2018.[Guo18] W.Guo, H.Yang, Z.Gan. A dnn-based mandarin-tibetan cross-lingual speech synthesis. 
Asia-Pacific Signal and Information Processing Association Annual Summit and Conference 2018.[Kalchbrenner18] N.Kalchbrenner, E.Elsen, K.Simonyan, S.Noury, N.Casagrande, E.Lockhart, F.Stimberg, A.van den Oord, S.Dieleman, K.Kavukcuoglu. Efficient neural audio synthesis. ICML 2018. [Jia18] Y.Jia, Y.Zhang, R.J.Weiss, Q.Wang, J.Shen, F.Ren, Z.Chen, P.Nguyen, R.Pang, I.L.Moreno, Y.Wu. Transfer learning from speaker verification to multispeaker text-to-speech synthesis. NeurIPS 2018.[Jin18] Z.Jin, A.Finkelstein, G.J.Mysore, J.Lu. FFTNet: A real-time speaker-dependent neural vocoder. ICASSP 2018.[Juvela18] L.Juvela, V.Tsiaras, B.Bollepalli, M.Airaksinen, J.Yamagishi, P. Alku. Speaker-independent raw waveform model for glottal excitation. Interspeech 2018.[Nachmani18] E.Nachmani, A.Polyak, Y.Taigman, L.Wolf. Fitting new speakers based on a short untranscribed sample. ICML 2018.[Okamoto18a] T. Okamoto, K. Tachibana, T. Toda, Y. Shiga, and H. Kawai. An investigation of subband wavenet vocoder covering entire audible frequency range with limited acoustic features. ICASSP 2018.[Okamoto18b] T. Okamoto, T. Toda, Y. Shiga, and H. Kawai. Improving FFT-Net vocoder with noise shaping and subband approaches. IEEE Spoken Language Technology Workshop (SLT) 2018.[Oord18] A.van den Oord, Y.Li, I.Babuschkin, K.Simonyan, O.Vinyals, K.Kavukcuoglu, G.van den Driessche, E.Lockhart, L.C.Cobo, F.Stimberg et al., Parallel WaveNet: Fast high-fidelity speech synthesis. ICML 2018. [Ping18] W.Ping, K.Peng, A.Gibiansky, S.O.Arık, A.Kannan, S.Narang, J.Raiman, J.Miller. Deep Voice 3: Scaling text-to-speech with convolutional sequence learning. ICLR 2018. [Shen18] J.Shen, R.Pang, R.J.Weiss, M.Schuster, N.Jaitly, Z.Yang, Z.Chen, Y.Zhang, Y.Wang, RJ S.Ryan, R.A.Saurous, Y.Agiomyrgiannakis, Y.Wu. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. ICASSP 2018. [Skerry-Ryan18] R.J.Skerry-Ryan, E.Battenberg, Y.Xiao, Y.Wang, D.Stanton, J.Shor, R.Weiss, R.Clark, R.A.Saurous. Towards end-to-end prosody transfer for expressive speech synthesis with tacotron. ICML 2018.[Tachibana18] H.Tachibana, K.Uenoyama, S.Aihara. Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention. ICASSP 2018.[Taigman18] Y.Taigman, L.Wolf, A.Polyak, E.Nachmani. VoiceLoop: Voice fitting and synthesis via a phonological loop. ICLR 2018.[Tjandra18] A.Tjandra, S.Sakti, S.Nakamura. Machine speech chain with one-shot speaker adaptation. Interspeech 2018.[Wang18] Y.Wang, D.Stanton, Y.Zhang, R.J.Skerry-Ryan, E.Battenberg, J.Shor, Y.Xiao, Y.Jia, F.Ren, R.A.Saurous. Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis. ICML 2018.[Bollepalli19] B.Bollepalli, L.Juvela, P.Alkuetal. Lombard speech synthesis using transfer learning in a Tacotron text-to-speech system. Interspeech 2019.[Chen19a] Y.-J.Chen, T.Tu, C.-c.Yeh, H.-Y.Lee. End-to-end text-to-speech for low-resource languages by cross-lingual transfer learning. Interspeech 2019.[Chen19b] Y.Chen, Y.Assael, B.Shillingford, D.Budden, S.Reed, H.Zen, Q.Wang, L.C.Cobo, A.Trask, B.Laurie, C.Gulcehre, A.van den Oord, O.Vinyals, N.de Freitas. Sample efficient adaptive text-to-speech. ICLR 2019.[Chen19c] M.Chen, M.Chen, S.Liang, J.Ma, L.Chen, S.Wang, J.Xiao. Cross-lingual, multi-speaker text-to-speech synthesis using neural speaker embedding. Interspeech 2019.[Chung19] Y.-A.Chung, Y.Wang, W.-N.Hsu,Y.Zhang, R.J.Skerry-Ryan.Semi-supervised training for improving data efficiency in end-to-end speech synthesis. 
ICASSP 2019.[Donahue19] C.Donahue, J.McAuley, M.Puckette. Adversarial audio synthesis. ICLR 2019. [논문리뷰][Fang19] W.Fang, Y.-A.Chung, J.Glass. Towards transfer learning for end-to-end speech synthesis from deep pre-trained language models. arXiv preprint arXiv:1906.07307, 2019.[Guo19] H.Guo, F.K.Soong, L.He, L.Xie. A new GAN-based end-to-end tts training algorithm. Interspeech 2019.[Gururani19] S.Gururani, K.Gupta, D.Shah, Z.Shakeri, J.Pinto. Prosody transfer in neural text to speech using global pitch and loudness features. arXiv preprint arXiv:1911.09645, 2019.[Habib19] R.Habib, S.Mariooryad, M.Shannon, E.Battenberg, R.J.Skerry-Ryan, D.Stanton, D.Kao, T.Bagby. Semi-supervised generative modeling for controllable speech synthesis. ICLR 2019.[Hayashi19] T. Hayashi, S. Watanabe, T. Toda, K. Takeda, S. Toshniwal, and K. Livescu. Pre-trained text embeddings for enhanced text-to-speech synthesis. Interspeech 2019.[Hsu19] W.-N.Hsu, Y.Zhang, R.J.Weiss, H.Zen, Y.Wu, Y.Wang, Y.Cao, Y.Jia, Z.Chen, J.Shen, P.Nguyen, R.Pang. Hierarchical generative modeling for controllable speech synthesis. ICLR 2019.[Jia19] Y.Jia, R.J.Weiss, F.Biadsy, W.Macherey, M.Johnson, Z.Chen, Y.Wu. Direct speech-to-speech translation with a sequence-to-sequence model. Interspeech 2019.[Juvela19] L.Juvela, B.Bollepalli, J.Yamagishi, P.Alku. Gelp: Gan-excited linear prediction for speech synthesis from mel-spectrogram. Interspeech 2019.[Kim19] S.Kim, S.Lee, J.Song, J.Kim, S.Yoon. FloWaveNet: A Generative flow for raw audio. ICML 2019. [Kenter19] T.Kenter, V.Wan, C.-A.Chan, R.Clark, J.Vit. Chive: Varying prosody in speech synthesis with a linguistically driven dynamic hierarchical conditional variational network. ICML 2019.[Klimkov19] V.Klimkov, S.Ronanki, J.Rohnke, T.Drugman. Fine-grained robust prosody transfer for single-speaker neural text-to-speech. Interspeech 2019.[Kons19] Z.Kons, S.Shechtman, A.Sorin, C.Rabinovitz, R.Hoory. High quality, lightweight and adaptable TTS using LPCNet. Interspeech 2019.[Kwon19] O.Kwon, E.Song, J.-M.Kim, H.-G.Kang. Effective parameter estimation methods for an excitnet model in generative text-to-speech systems. arXiv preprint arXiv:1905.08486, 2019.[Kumar19] K.Kumar, R.Kumar, T.de Boissiere, L.Gestin, W.Z.Teoh, J.Sotelo, A.de Brebisson, Y.Bengio, A. Courville. MelGAN: Generative adversarial networks for conditional waveform synthesis. NeurIPS 2019. [Lee19] Y.Lee, T.Kim. Robust and fine-grained prosody control of end-to-end speech synthesis. ICASSP 2019.[Li19a] N.Li, S.Liu, Y.Liu, S.Zhao, M.Liu, M.Zhou. Neural speech synthesis with transformer network. AAAI 2019. [Li19b] B. Li, Y. Zhang, T. Sainath, Y. Wu, W. Chan. Bytes are all you need: End-to-end multilingual speech recognition and synthesis with bytes. ICASSP, 2019.[Lorenzo-Trueba19] J.Lorenzo-Trueba, T.Drugman, J.Latorre, T.Merritt, B.Putrycz, R.Barra-Chicote, A.Moinet, V.Aggarwal. Towards achieving robust universal neural vocoding. Interspeech 2019.[Ma19] S.Ma, D.Mcduff, Y.Song. Neural TTS stylization with adversarial and collaborative games. ICLR 2019.[Ming19] H. Ming, L. He, H. Guo, and F. Soong. Feature reinforcement with word embedding and parsing information in neural TTS. arXiv preprint arXiv:1901.00707, 2019.[Nachmani19] E.Nachmani, L.Wolf. Unsupervised polyglot text to speech. ICASSP 2019.[Ping19] W.Ping, K.Peng, J.Chen. ClariNet: Parallel wave generation in end-to-end text-to-speech. ICLR 2019.[Prenger19] R.Prenger, R.Valle, B.Catanzaro. WaveGlow: A flow-based generative network for speech synthesis. ICASSP 2019. 
[Ren19a] Y.Ren, Y.Ruan, X.Tan, T.Qin, S.Zhao, Z.Zhao, T.Y.Liu. FastSpeech: Fast, robust and controllable text to speech. NeurIPS 2019.[Ren19b] Y.Ren, X.Tan, T.Qin, S.Zhao, Z.Zhao, T.-Y.Liu. Almost unsupervised text to speech and automatic speech recognition. ICML 2019.[Song19] E.Song, K.Byun, H.-G.Kang. ExcitNet vocoder: A neural excitation model for parametric speech synthesis systems. EUSIPCO, 2019.[Tits19a] N.Tits, K.E.Haddad, T.Dutoit. Exploring transfer learning for low resource emotional TTS. SAI Intelligent Systems Conference. Springer 2019.[Tits19b] N.Tits, F.Wang, K.E.Haddad, V.Pagel, T.Dutoit. Visualization and interpretation of latent spaces for controlling expressive speech synthesis through audio analysis,. arXiv preprint arXiv:1903.11570, 2019.[Tjandra19] A.Tjandra, B.Sisman, M.Zhang, S.Sakti, H.Li, S.Nakamura. VQVAE unsupervised unit discovery and multi-scale code2spec inverter for zerospeech challenge 2019. Interspeech 2019.[Valin19] J.-M.Valin, J.Skoglund. LPCNet: Improving neural speech synthesis through linear prediction. ICASSP 2019.[Wang19a] X.Wang, S.Takaki, J.Yamagishi. Neural source-filter-based waveform model for statistical parametric speech synthesis. ICASSP 2019.[Wang19b] X.Wang, S.Takaki, J.Yamagishi. Neural harmonic-plus-noise waveform model with trainable maximum voice frequency for text-to-speech synthesis. ISCA Speech Synthesis Workshop 2019.[Yamamoto19] R.Yamamoto, E.Song, J.-M.Kim. Probability density distillation with generative adversarial networks for high-quality parallel waveform generation. Interspeech 2019.[Yang19] B.Yang, J.Zhong, S.Liu. Pre-trained text representations for improving front-end text processing in Mandarin text-to-speech synthesis. Interspeech 2019.[Zhang19a] Y.-J.Zhang, S.Pan, L.He, Z.-H.Ling. Learning latent representations for style control and transfer in end-to-end speech synthesis. ICASSP 2019.[Zhang19b] M.Zhang, X.Wang, F.Fang, H.Li, J.Yamagishi. Joint training framework for text-to-speech and voice conversion using multi-source tacotron and wavenet. Interspeech 2019.[Zhang19c] W.Zhang, H.Yang, X.Bu, L.Wang. Deep learning for mandarin-tibetan cross-lingual speech synthesis. IEEE Access 2019.[Zhang19d] Y.Zhang, R.J.Weiss, H.Zen, Y.Wu, Z.Chen, R.J.Skerry-Ryan, Y.Jia, A.Rosenberg, B.Ramabhadran. Learning to speak fluently in a foreign language: Multilingual speech synthesis and cross-language voice cloning. Interspeech 2019.[Azizah20] K.Azizah, M.Adriani, W.Jatmiko. Hierarchical transfer learning for multilingual, multi-speaker, and style transfer DNN-based TTS on low-resource languages. IEEE Access 2020.[Bae20] J.-S.Bae, H.Bae, Y.-S.Joo, J.Lee, G.-H.Lee, H.-Y.Cho. Speaking speed control of end-to-end speech synthesis using sentence-level conditioning. Interspeech 2020.[Binkowski20] M.Binkowski, J.Donahue, S.Dieleman, A.Clark, E.Elsen, N.Casagrande, L.C.Cobo, K.Simonyan. High fidelity speech synthesis with adversarial networks. ICLR 2020. [논문리뷰][Chen20] M.Chen, X.Tan, Y.Ren, J.Xu, H.Sun, S.Zhao, T.Qin. MultiSpeech: Multi-speaker text to speech with transformer. Interspeech 2020.[Choi20] S.Choi, S.Han, D.Kim, S.Ha. Attentron: Few-shot text-to-speech utilizing attention-based variable-length embedding. Interspeech 2020.[Cooper20a] E.Cooper, C.-I.Lai, Y.Yasuda, F.Fang, X.Wang, N.Chen, J.Yamagishi. Zero-shot multi-speaker text-to-speech with state-of-the-art neural speaker embeddings. ICASSP 2020.[Cooper20b] E.Cooper, C.-I.Lai, Y.Yasuda, J.Yamagishi. Can speaker augmentation improve multi-speaker end-to-end TTS? 
Interspeech 2020.[Cui20] Y.Cui, X.Wang, L.He, F.K.Soong. An efficient subband linear prediction for lpcnet-based neural synthesis. Interspeech 2020.[deKorte20] M.de Korte, J.Kim, E.Klabbers. Efficient neural speech synthesis for low-resource languages through multilingual modeling. Interspeech 2020.[Engel20] J.Engel, L.Hantrakul, C.Gu, A.Roberts, DDSP: Differentiable digital signal processing. ICLR 2020.[Gritsenko20] A.Gritsenko, T.Salimans, R.van den Berg, J.Snoek, N.Kalchbrenner. A spectral energy distance for parallel speech synthesis. NeurIPS 2020.[Hemati20] H.Hemati, D.Borth. Using IPA-based tacotron for data efficient cross-lingual speaker adaptation and pronunciation enhancement. arXiv preprint arXiv:2011.06392, 2020.[Himawan20] I.Himawan, S.Aryal, I.Ouyang, S.Kang, P.Lanchantin, S.King. Speaker adaptation of a multilingual acoustic model for cross-language synthesis. ICASSP 2020.[Hsu20] P.-C.Hsu and H.-Y.Lee. WG-WaveNet: Real-time high-fidelity speech synthesis without GPU. Interspeech 2020.[Hwang20a] M.-J.Hwang, F.Soong, E.Song, X.Wang, H. ang, H.-G.Kang. LP-WaveNet: Linear prediction-based WaveNet speech synthesis. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) 2020.[Hwang20b] M.-J.Hwang, E.Song, R.Yamamoto, F.Soong, H.-G.Kang. Improving LPCNet-based text-to-speech with linear prediction-structured mixture density network. ICASSP 2020.[Jang20] W.Jang, D.Lim, J.Yoon. Universal MelGAN: A robust neural vocoder for high-fidelity waveform generation in multiple domains. arXiv preprint arXiv:2011.09631, 2020.[Kanagawa20] H.Kanagawa, Y.Ijima. Lightweight LPCNet-based neural vocoder with tensor decomposition. Interspeech 2020.[Kenter20] T. Kenter, M. K. Sharma, and R. Clark. Improving prosody of RNN-based english text-to-speech synthesis by incorporating a BERT model. Interspeech 2020.[Kim20] J.Kim, S.Kim, J.Kong, S.Yoon. Glow-TTS: A generative flow for text-to-speech via monotonic alignment search. NeurIPS 2020[Kong20] J.Kong, J.Kim, J.Bae. HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. NeurIPS 2020.[Li20] N.Li, Y.Liu, Y.Wu, S.Liu, S.Zhao, M.Liu. RobuTrans: A robust transformer-based text-to-speech model. AAAI 2020.[Lim20] D.Lim, W.Jang, G.O, H.Park, B.Kim, J.Yoon. JDI-T: Jointly trained duration informed transformer for text-to-speech without explicit alignment. Interspeech 2020.[Liu20a] A.H.Liu, T.Tu, H.-y.Lee, L.-s.Lee. Towards unsupervised speech recognition and synthesis with quantized speech representation learning. ICASSP 2020.[Liu20b] Z.Liu, K.Chen, K.Yu. Neural homomorphic vocoder. Interspeech 2020.[Luong20] H.-T.Luong, J.Yamagishi. NAUTILUS: a versatile voice cloning system. IEEE/ACM Transactions on Audio, Speech, and Language Processing 2020.[Maiti20] S.Maiti, E.Marchi, A.Conkie. Generating multilingual voices using speaker space translation based on bilingual speaker data. ICASSP 2020.[Miao20] C.Miao, S.Liang, M.Chen, J.Ma, S.Wang, J.Xiao. Flow-TTS: A non-autoregressive network for text to speech based on flow. ICASSP 2020.[Morrison20] M.Morrison, Z.Jin, J.Salamon, N.J.Bryan, G.J.Mysore. Controllable neural prosody synthesis. Interspeech 2020.[Moss20] H.B.Moss, V.Aggarwal, N.Prateek, J.González, R.Barra-Chicote. BOFFIN TTS: Few-shot speaker adaptation by bayesian optimization. ICASSP 2020.[Nekvinda20] T.Nekvinda, O.Dušek. One model, many languages: Meta-learning for multilingual text-to-speech. Interspeech 2020.[Park20] K.Park, S.Lee. 
G2PM: A neural grapheme-to-phoneme conversion package for mandarin chinese based on a new open benchmark dataset. Interspeech 2020.[Paul20a] D.Paul, Y.Pantazis, Y.Stylianou. Speaker Conditional WaveRNN: Towards universal neural vocoder for unseen speaker and recording conditions. Interspeech 2020.[Paul20b] D.Paul, M.P.V.Shifas, Y.Pantazis, Y.Stylianou. Enhancing speech intelligibility in text-to-speech synthesis using speaking style conversion. Interspeech 2020.[Peng20] K.Peng, W.Ping, Z.Song, K.Zhao. Non-autoregressive neural text-to-speech. ICML 2020. [논문리뷰][Ping20] W.Ping, Ka.Peng, K.Zhao, Z.Song. WaveFlow: A compact flow-based model for raw audio. ICML 2020. [논문리뷰][Popov20a] V.Popov, M.Kudinov, T.Sadekova. Gaussian LPCNet for multisample speech synthesis. ICASSP 2020.[Popov20b] V.Popov, S.Kamenev, M.Kudinov, S.Repyevsky, T.Sadekova, V.Bushaev, V.Kryzhanovskiy, D.Parkhomenko. Fast and lightweight on-device tts with Tacotron2 and LPCNet. Interspeech 2020.[Shen20] J.Shen, Y.Jia, M.Chrzanowski, Y.Zhang, I.Elias, H.Zen, Y.Wu. Non-Attentive Tacotron: Robust and controllable neural TTS synthesis including unsupervised duration modeling. arXiv preprint arXiv:2010.04301, 2020.[Song20] E.Song, M.-J.Hwang, R.Yamamoto, J.-S.Kim, O.Kwon, J.- M.Kim. Neural text-to-speech with a modeling-by-generation excitation vocoder. Interspeech 2020.[Staib20] M.Staib, T.H.Teh, A.Torresquintero, D.S.R.Mohan, L.Foglianti, R.Lenain, J.Gao. Phonological features for 0-shot multilingual speech synthesis. Interspeech 2020.[Sun20a] G.Sun, Y.Zhang, R.J.Weiss, Y.Cao, H.Zen, A.Rosenberg, B.Ramabhadran, Y.Wu. Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and autoregressive prosody prior. ICASSP 2020.[Sun20b] G.Sun, Y.Zhang, R.J.Weiss, Y.Cao, H.Zen, Y.Wu. Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis. ICASSP 2020.[Tian20] Q.Tian, Z.Zhang, L.Heng, L.Chen, S.Liu. FeatherWave: An efficient high-fidelity neural vocoder with multiband linear prediction. Interspeech 2020.[Tu20] T.Tu, Y.-J.Chen, A.H.Liu, H.-y.Lee. Semi-supervised learning for multi-speaker text-to-speech synthesis using discrete speech representation. Interspeech 2020.[Um20] S.-Y.Um, S.Oh, K.Byun, I.Jang, C.H.Ahn, H.-G.Kang. Emotional speech synthesis with rich and granularized control. ICASSP 2020.[Valle20a] R.Valle, K.Shih, R.Prenger, B.Catanzaro. Flowtron: an autoregressive flow-based generative network for text-to-speech synthesis. arXiv preprint arXiv:2005.05957, 2020.[Valle20b] R.Valle, J.Li, R.Prenger, B.Catanzaro. Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens. ICASSP 2020.[Vipperla20] R.Vipperla, S.Park, K.Choo, S.Ishtiaq, K.Min, S.Bhattacharya, A.Mehrotra, A.G.C.P.Ramos, N.D.Lane. Bunched LPCNet: Vocoder for low-cost neural text-to-speech systems. Interspeech 2020.[Wu20] Y.-C.Wu, T.Hayashi, T.Okamoto, H.Kawai, T.Toda. Quasi-periodic Parallel WaveGAN vocoder: A non-autoregressive pitch-dependent dilated convolution model for parametric speech generation. Interspeech 2020.[Xiao20] Y.Xiao, L.He, H.Ming, F.K.Soong. Improving prosody with linguistic and BERT derived features in multi-speaker based Mandarin Chinese neural TTS. ICASSP 2020.[Xu20] J.Xu, X.Tan, Y.Ren, T.Qin, J.Li, S.Zhao, T.-Y.Liu. LRSpeech: Extremely low-resource speech synthesis and recognition. ACM SIGKDD International Conference on Knowledge Discovery & Data Mining 2020.[Yamamoto20] R.Yamamoto, E.Song, and J.M.Kim. 
Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. ICASSP 2020.[Yang20a] J.Yang, J.Lee, Y.Kim, H.-Y.Cho, I.Kim. VocGAN: A high-fidelity real-time vocoder with a hierarchically-nested adversarial network. Interspeech 2020.[Yang20b] J.Yang, L.He. Towards universal text-to-speech. Interspeech 2020.[Yu20] C.Yu, H.Lu, N.Hu, M.Yu, C.Weng, K.Xu, P.Liu, D.Tuo, S.Kang, G.Lei, D.Su, D.Yu. DurIAN: Duration informed attention network for speech synthesis. Interspeech 2020.[Zhang20a] H.Zhang, Y.Lin. Unsupervised learning for sequence-to-sequence text-to-speech for low-resource languages. Interspeech 2020.[Zhang20b] Z.Zhang, Q.Tian, H.Lu, L.-H.Chen, S.Liu. AdaDurIAN: Few-shot adaptation for neural text-to-speech with durian. arXiv preprint arXiv:2005.05642, 2020.[Zhai20] B.Zhai, T.Gao, F.Xue, D.Rothchild, B.Wu, J.E.Gonzalez, K.Keutzer. SqueezeWave: Extremely lightweight vocoders for on-device speech synthesis. arXiv preprint arXiv:2001.05685, 2020.[Zhao20] S.Zhao, T.H.Nguyen, H.Wang, B.Ma. Towards natural bilingual and code-switched speech synthesis based on mix of monolingual recordings and cross-lingual voice conversion. Interspeech 2020.[Zeng20] Zhen Zeng, Jianzong Wang, Ning Cheng, Tian Xia, and Jing Xiao. AlignTTS: Efficient feed-forward text-to-speech system without explicit alignment. ICASSP 2020.[Zhou20] X.Zhou, X.Tian, G.Lee, R.K.Das, H.Li. End-to-end code-switching TTS with cross-lingual language model. ICASSP 2020.[Achanta21] S.Achanta, A.Antony, L.Golipour, J.Li, T.Raitio, R.Rasipuram, F.Rossi, J.Shi, J.Upadhyay, D.Winarsky, H.Zhang. On-device neural speech synthesis. IEEE Workshop on Automatic Speech Recongnition and Understanding 2021.[Bak21] T.Bak, J.-S.Bae, H.Bae, Y.-I.Kim, H.-Y.Cho. FastPitchFormant: Source-filter based decomposed modeling for speech syntehsis. Interspeech 2021.[Bae21] J.-S.Bae, T.-J.Bak, Y.-S.Joo, H.-Y.Cho. Hierarchical context-aware transformers for non-autoregressive text to speech. Interspeech 2021.[Casanova21] E.Casanova, C.Shulby, E.Gölge, N.M.Müller,F.S.de Oliveira, A.C.Junior, A.d.Soares, S.M.Aluisio, M.A.Ponti. SC-GlowTTS: an efficient zero-shot multi-speaker text-to-speech model. Interspeech 2021.[Chen21a] N.Chen, Y.Zhang, H.Zen, R.J.Weiss, M.Norouzi, W.Chan. WaveGrad: Estimating gradients for waveform generation. ICLR 2021.[Chen21b] M.Chen, X.Tan, B.Li, Y.Liu, T.Qin, S.Zhao, T.-Y.Liu. AdaSpeech: Adaptive text to speech for custom voice. ICLR 2021.[Chien21] C.-M.Chien, J.-H.Lin, C.-y.Huang, P.-c.Hsu, H.-y.Lee. Investigating on incorporating pretrained and learnable speaker representations for multi-speaker multi-style text-to-speech. ICASSP 2021.[Christidou21] M.Christidou, A.Vioni, N.Ellinas, G.Vamvoukakis, K.Markopoulos, P.Kakoulidis, J.S.Sung, H.Park, A.Chalamandaris, P.Tsiakoulis. Improved Prosodic Clustering for Multispeaker and Speaker-Independent Phoneme-Level Prosody Control. SPECOM 2021.[Donahue21] J.Donahue, S.Dieleman, M.Binkowski, E.Elsen, K.Simonyan. End-to-end adversarial text-to-speech. ICLR 2021. [Du21] Chenpeng Du and Kai Yu. Rich prosody diversity modelling with phone-level mixture density network. Interspeech 2021.[Elias21a] I.Elias, H.Zen, J.Shen, Y.Zhang, Y.Jia, R.Weiss, Y.Wu. Parallel Tacotron: Non-autoregressive and controllable TTS. ICASSP 2021.[Elias21b] I.Elias, H.Zen, J.Shen, Y.Zhang, Y.Jia, R.J.Skerry-Ryan, Y.Wu. Parallel Tacotron 2: A non-autoregressive neural tts model with differentiable duration modeling. 
Interspeech 2021.[Hu21] Q.Hu, T.Bleisch, P.Petkov, T.Raitio, E.Marchi, V.Lakshminarasimhan. Whispered and lombard neural speech synthesis. IEEE Spoken Language Technology Workshop (SLT) 2021.[Huang21] Z.Huang, H.Li, M.Lei. DeviceTTS: A small-footprint, fast, stable network for on-device text-to-speech. ICASSP 2021.[Huybrechts21] G.Huybrechts, T.Merritt, G.Comini, B.Perz, R.Shah, J.Lorenzo-Trueba. Low-resource expressive text-to-speech using data augmentation. ICASSP 2021.[Hwang21a] M.-J.Hwang, R.Yamamoto, E.Song, J.-M.Kim. TTS-by-TTS: Tts-driven data augmentation for fast and high-quality speech synthesis. ICASSP 2021.[Hwang21b] M.-J.Hwang, R.Yamamoto, E.Song, J.-M.Kim. High-fidelity Parallel WaveGAN with multi-band harmonic-plus-noise model. Interspeech 2021.[Jang21] W.Jang, D.Lim, J.Yoon, B.Kim, J.Kim. UnivNet: A neural vocoder with multi-resolution spectrogram discriminators for high-fidelity waveform generation. Interspeech 2021. [Jeong21] M.Jeong, H.Kim, S.J.Cheon, B.J.Choi, N.S.Kim. Diff-TTS: A Denoising diffusion model for text-to-speech. Interspeech 2021. [Jia21] Y.Jia, H.Zen, J.Shen, Y.Zhang, Y.Wu. PnG BERT: Augmented bert on phonemes and graphemes for neural TTS. arXiv preprint arXiv:2103.15060, 2021.[Kang21] M.Kang, J.Lee, S.Kim, I.Kim. Fast DCTTS: Efficient deep convolutional text-to-speech. ICASSP 2021.[Kim21a] J.Kim, J.Kong, J.Son. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. ICML 2021.[Kim21b] J.-H.Kim, S.-H.Lee, J.-H.Lee, S.-W.Lee. Fre-GAN: Adversarial frequency-consistent audio synthesis. Interspeech 2021.[Kim21c] M.Kim, S.J.Cheon, B.J.Choi, J.J.Kim, N.S.Kim. Expressive text-to-speech using style tag. Interspeech 2021.[Kim21d] H.-Y.Kim, J.-H.Kim, J.-M.Kim. NN-KOG2P: A novel grapheme-to-phoneme model for Korean language. ICASSP 2021.[Kong21] Z.Kong, W.Ping, J.Huang, K.Zhao, B.Catanzaro. DiffWave: A versatile diffusion model for audio synthesis. ICLR 2021.[Łancucki21] A.Łancucki. FastPitch: Parallel text-to-speech with pitch prediction. ICASSP 2021.[Lee21a] Y.Lee, J.Shin, K.Jung. Bidirectional variational inference for non-autoregressive text-to-speech. ICLR 2021.[Lee21b] S.-H.Lee, H.-W.Yoon, H.-R.Noh, J.-H. Kim, S.-W.Lee. Multi-SpectroGAN: High-diversity and high-fidelity spectrogram generation with adversarial style combination for speech synthesis. AAAI 2021.[Lee21c] K.Lee, K.Park, D.Kim. Styler: Style modeling with rapidity and robustness via speech decomposition for expressive and controllable neural text to speech. Interspeech 2021.[Li21a] T.Li, S.Yang, L.Xue, L.Xie. Controllable emotion transfer for end-to-end speech synthesis. International Symposium on Chinese Spoken Language Processing (ISCSLP) 2021.[Li21b] X.Li, C.Song, J.Li, Z.Wu, J.Jia, H.Meng. Towards multiscale style control for expressive speech synthesis. Interspeech, 2021.[Liu21] Y.Liu, Z.Xu, G.Wang, K.Chen, B.Li, X.Tan, J.Li, L.He, S.Zhao. DelightfulTTS: The Microsoft speech synthesis system for Blizzard challenge 2021. arXiv preprint arXiv:2110.12612, 2021.[Luo21] R.Luo, X.Tan, R.Wang, T.Qin, J.Li, S.Zhao, E.Chen, T.-Y.Liu. LightSpeech: Lightweight and fast text to speech with neural architecture search. ICASSP 2021.[Miao21] C.Miao, S.Liang, Z.Liu, M.Chen, J.Ma, S.Wang, J.Xiao. EfficientTTS: An efficient and high-quality text-to-speech architecture. ICML 2021.[Min21] D.Min, D.B.Lee, E.Yang, S.J.Hwang. Meta-StyleSpeech: Multi-speaker adaptive text-to-speech generation. ICML 2021.[Morisson21] M.Morrison, Z.Jin, N.J.Bryan, J.-P.Caceres, B.Pardo. 
Neural pitch-shifting and time-stretching with controllable LPCNet. arXiv preprint arXiv:2110.02360, 2021.[Nguyen21] H.-K.Nguyen, K.Jeong, S.Um, M.-J.Hwang, E.Song, H.-G.Kang. LiteTTS: A lightweight mel-spectrogram-free text-to-wave synthesizer based on generative adversarial networks. Interspeech 2021.[Pan21] S.Pan, L.He. Cross-speaker style transfer with prosody bottleneck in neural speech synthesis. Interspeech 2021.[Popov21] C.Popov, I.Vovk, V.Gogoryan, T.Sadekova, M.Kudinov. Grad-TTS: A diffusion probabilistic model for text-to-speech. ICML 2021.[Ren21a] Y.Ren, C,Hu, X.Tan, T.Qin, S.Zhao, Z.Zhao, T.-Y.Liu. FastSpeech 2: Fast and high-quality end-to-end text to speech. ICLR 2021.[Ren21b] Y.Ren, J.Liu, Z.Zhao. PortaSpeech: Portable and high-quality generative text-to-speech. NeurIPS 2021.[Sivaprasad21] S.Sivaprasad, S.Kosgi, V.Gandhi. Emotional prosody control for speech generation. Interspeech 2021.[Song21] E.Song, R.Yamamoto, M.-J.Hwang, J.-S.Kim, O.Kwon, J.- M.Kim. Improved Parallel WaveGAN vocoder with perceptually weighted spectrogram loss. IEEE Spoken Language Technology Workshop (SLT) 2021.[Tan21] X.Tan, T.Qin, F.Soong, T.-Y. Liu. A survey on neural speech synthesis. arXiv: 2106.15561v3.[Wang21] D.Wang, L.Deng, Y.Zhang, N.Zheng, Y.T.Yeung, X.Chen, X.Liu, H.Meng. FCL-Taco2: Towards fast, controllable and lightweight text-to-speech synthesis. ICASSP 2021.[Weiss21] R.J.Weiss, R.J.Skerry-Ryan, E.Battenberg, S.Mariooryad, D.P.Kingma. Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis. ICASSP 2021.[Xu21] G.Xu, W.Song, Z.Zhang, C.Zhang, X.He, B.Zhou. Improving prosody modelling with cross-utterance BERT embeddings for end-to-end speech synthesis. ICASSP 2021.[Yamamoto21] R.Yamamoto, E.Song, M.-J.Hwang, J.-M.Kim. Parallel waveform synthesis based on generative adversarial networks with voicing-aware conditional discriminators. ICASSP 2021.[Yan21a] Y.Yan, X.Tan, B.Li, T.Qin, S.Zhao, Y.Shen, T.-Y.Liu. AdaSpeech 2: Adaptive text to speech with untranscribed data. ICASSP 2021.[Yan21b] Y.Yan, X.Tan, B.Li, G.Zhang, T.Qin, S.Zhao, Y.Shen, W.-Q.Zhang, T.-Y.Liu. AdaSpeech 3: Adaptive text to speech for spontaneous style. Interspeech 2021.[Yang21a] G.Yang, S.Yang, K.Liu, P.Fang, W.Chen, L.Xie. Multi-Band MelGAN: Faster waveform generation for high-quality text-to-speech. IEEE Spoken Language Technology Workshop (SLT) 2021.[Yang21b] J.Yang, J.-S.Bae, T.Bak, Y.Kim, H.-Y.Cho. GANSpeech: Adversarial training for high-fidelity multi-speaker speech synthesis. Interspeech 2021.[Yoneyama21] R.Yoneyama, Y.-C.Wu, T.Toda. Unified source-filter GAN: Unified source-filter network based on factorization of quasi-periodic Parallel WaveGAN. Interspeech 2021.[You21] J.You, D.Kim, G.Nam, G.Hwang, G.Chae. GAN Vocoder: Multi-resolution discriminator is all you need. Interspeech 2021.[Yue21] F.Yue, Y.Deng, L.He, T.Ko. Exploring machine speech chain for domain adaptation and few-shot speaker adaptation. arXiv preprint arXiv:2104.03815, 2021.[Zaidi21] J.Zaidi, H.Seute, B.van Niekerk, M.-A.Carbonneau. Daft-Exprt: Cross-speaker prosody transfer on any text for expressive speech synthesis. arXiv preprint arXiv:2108.02271, 2021.[Zhang21a] C.Zhang, X.Tan, Y.Ren, T.Qin, K.Zhang, T.-Y.Liu. UWSpeech: Speech to speech translation for unwritten languages. AAAI 2021.[Zhang21b] G.Zhang, Y.Qin, D.Tan, T.Lee. Applying the information bottleneck principle to prosodic representation learning. arXiv preprint arXiv:2108.02821, 2021.[Zeng21] Z.Zeng, J.Wang, N.Cheng, J.Xiao. 
LVCNet: Efficient condition-dependent modeling network for waveform generation. ICASSP 2021.[Bae22] J.-S.Bae, J.Yang, T.-J.Bak, Y.-S.Joo. Hierarchical and multi-scale variational autoencoder for diverse and natural non-autoregressive text-to-speech. Interspeech 2022.[Cho22] H.Cho, W.Jung, J.Lee, S.H.Woo. SANE-TTS: Stable and natural end-to-end multilingual text-to-speech. Interspeech 2022.[Comini22] G.Comini, G.Huybrechts, M.S.Ribeiro, A.Gabrys, J.Lorenzo-Trueba. Low-data? No problem: low-resource, language-agnostic conversational text-to-speech via F0-conditioned data augmentation. Interspeech 2022.[Dai22] Z.Dai, J.Yu, Y.Wang, N.Chen, Y.Bian, G.Li, D.Cai, D.Yu. Automatic prosody annotation with pre-trained text-speech model. Interspeech 2022.[Hsu22] P.-C.Hsu, D.-R.Liu, A.T.Liu, H.-y.Lee. Parallel synthesis for autoregressive speech generation. arXiv preprint arXiv:2204.11806, 2022.[Huang22a] R.Huang, M.W.Y.Lam, J.Wang, D.Su, D.Yu, Y.Ren, Z.Zhao. FastDiff: A fast conditional diffusion model for high-quality speech synthesis. International Joint Conference on Artificial Intelligence 2022.[Huang22b] R.Huang, Y.Ren, J.Liu, C.Cui, Z.Zhao. GenerSpeech: Towards style transfer for generalizable out-of-domain TTS synthesis. arXiv preprint arXiv:2205.07211, 2022.[Kharitonov22] E.Kharitonov, A.Lee, A.Polyak, Y.Adi, J.Copet, K.Lakhotia, T.-A.Nguyen, M.Riviere, A.Mohamed, E.Dupoux, W.-N.Hsu. Text-free prosody-aware generative spoken language modeling. Annual Meeting of the Association for Computational Linguistics (ACL) 2022.[Kim22a] H.Kim, S.Kim, S.Yoon. Guided-TTS: A diffusion model for text-to-speech via classifier guidance. ICML 2022.[Kim22b] S.Kim, H.Kim, S.Yoon. Guided-TTS 2: A diffusion model for high-quality adaptive text-to-speech with untranscribed data. arXiv preprint arXiv:2205.15370, 2022.[Koch22] J.Koch, F.Lux, N.Schauffler, T.Bernhart, F.Dieterle, J.Kuhn, S.Richter, G.Viehhauser, N.T.Vu. PoeticTTS: Controllable poetry reading for literary studies. Interspeech 2022.[Lam22] M.W.Y.Lam, J.Wang, D.Su, D.Yu. BDDM: Bilateral denoising diffusion models for fast and high-quality speech synthesis. ICLR 2022.[Lee22a] S.-G.Lee, H.Kim, C.Shin, X.Tan, C.Liu, Q.Meng, T.Qin, W.Chen, S.Yoon, T.-Y.Liu. PriorGrad: Improving conditional denoising diffusion models with data-driven adaptive prior. ICLR 2022.[Lee22b] S.-G.Lee, W.Ping, B.Ginsburg, B.Catanzaro, S.Yoon. BigVGAN: A universal neural vocoder with large-scale training. arXiv preprint arXiv:2206.04658, 2022.[Lei22] Y.Lei, S.Yang, X.Wang, MsEmoTTS: Multi-scale emotion transfer, prediction, and control for emotional speech synthesis. IEEE/ACM Transactions on Audio, Speech and Language Process Vol.30, 2022.[Li22a] Y.A.Li, C.Han, N.Mesgarani. StyleTTS: A style-based generative model for natural and diverse text-to-speech synthesis. arXiv preprint arXiv:2205.15439, 2022.[Li22b] T.Li, X.Wang, Q.Xie, Z.Wang, M.Jiang, L.Xie. Cross-speaker emotion transfer based on prosody compensation for end-to-end speech synthesis. arXiv preprint arXiv:2207.01198, 2022.[Li22c] X.Li, C.Song, X.Wei, Z.Wu, J.Jia, H.Meng. Towards cross-speaker reading style transfer on audiobook dataset. Interspeech 2022.[Lian22] J.Lian, C.Zhang ,G.K.Anumanchipalli, D.Yu. UTTS: Unsupervised TTS with conditional disentangled sequential variational auto-encoder. arXiv preprint arXiv:2206.02512, 2022.[Lim22] D.Lim, S.Jung, E.Kim. JETS: Jointly training FastSpeech2 and HiFi-GAN for end-to-end text-to-speech. Interspeech 2022.[Liu22a] S.Liu, D.Su, D.Yu. 
DiffGAN-TTS: High-fidelity and efficient text-to-speech with denoising diffusion GANs. arXiv preprint arXiv:2201.11972, 2022.[Liu22b] Y.Liu, R.Xue, L.He, X.Tan, S.Zhao. DelightfulTTS 2: End-to-end speech synthesis with adversarial vector-quantized auto-encoders. Interspeech 2022.[Lu22] Z.Lu, M.He, R.Zhang, C.Gong. A post auto-regressive GAN vocoder focused on spectrum fracture. arXiv preprint arXiv:2204.06086, 2022.[Lux22] F.Lux, J.Koch, N.T.Vu. Prosody cloning in zero-shot multispeaker text-to-speech. arXiv preprint arXiv:2206.12229, 2022.[Mehta22] S.Mehta, E.Szekely, J.Beskow, G.E.Henter. Neural HMMs are all you need (for high-quality attention-free TTS). ICASSP 2022.[Mitsui22] K.Mitsui, T.Zhao, K.Sawada, Y.Hono, Y.Nankaku, K.Tokuda. End-to-end text-to-speech based on latent representation of speaking styles using spontaneous dialogue. Interspeech 2022.[Morrison22] M.Morrison, R.Kumar, K.Kumar, P.Seetharaman, A.Courville, Y.Bengio. Chunked autoregressive GAN for conditional waveform synthesis. ICLR 2022.[Nishimura22] Y.Nishimura, Y.Saito, S.Takamichi, K.Tachibana, H.Saruwatari. Acoustic modeling for end-to-end empathetic dialogue speech synthesis using linguistic and prosodic contexts of dialogue history. Interspeech 2022.[Raitio22] T.Raitio, J.Li, S.Seshadri. Hierarchical prosody modeling and control in non-autoregressive parallel neural TTS. ICASSP 2022.[Ren22] Y.Ren, M.Lei, Z.Huang, S.Zhang, Q.Chen, Z.Yan, Z.Zhao. ProsoSpeech: Enhancing prosody with quantized vector pre-training in TTS. ICASSP 2022.[Ribeiro22] M.S.Ribeiro, J.Roth, G.Comini, G.Huybrechts, A.Gabrys, J.Lorenzo-Trueba. Cross-speaker style transfer for text-to-speech using data augmentation. ICASSP 2022.[Saeki22] T.Saeki, K.Tachibana, R.Yamamoto. DRSpeech: Degradation-robust text-to-speech synthesis with frame-level and utterance-level acoustic representation learning. Interspeech 2022.[Shin22] Y.Shin, Y.Lee, S.Jo, Y.Hwang, T.Kim. Text-driven emotional style control and cross-speaker style transfer in neural TTS. Interspeech 2022.[Song22] E.Song, R.Yamamoto, O.Kwon, C.-H.Song, M.-J.Hwang, S.Oh, H.-W.Yoon, J.-S.Kim, J.-M.Kim. TTS-by-TTS 2: Data-selective augmentation for neural speech synthesis using ranking Support Vector Machine with variational autoencoder. Interspeech 2022.[Tan22] X.Tan, J.Chen, H.Liu, J.Cong, C.Zhang, Y.Liu, X.Wang, Y.Leng, Y.Yi, L.He, F.Soong, T.Qin, S.Zhao, T.-Y.Liu. NaturalSpeech: End-to-end text to speech synthesis with human-level quality. arXiv preprint arXiv:2205.04421, 2022.[Terashima22] R.Terashima, R.Yamamoto, E.Song, Y.Shirahata, H.-W.Yoon, J.-M.Kim, K.Tachibana. Cross-speaker emotion transfer for low-resource text-to-speech using non-parallel voice conversion with pitch-shift data augmentation. Interspeech 2022.[Valin22] J.-M.Valin, U.Isik, P.Smaragdis, A.Krishnaswamy. Neural speech synthesis on a shoestring: Improving the efficiency of LPCNET. ICASSP 2022.[Wang22] Y.Wang, Y.Xie, K.Zhao, H.Wang, Q.Zhang. Unsupervised quantized prosody representation for controllable speech synthesis. IEEE International Conference on Multimedia and Expo (ICME) 2022.[Wu22a] Y.Wu, X.Tan, B.Li, L.He, S.Zhao, R.Song, T.Qin, T.-Y.Liu. AdaSpeech 4: Adaptive text to speech in zero-shot scenarios. arXiv preprint arXiv:2204.00436, 2022.[Wu22b] S.Wu, Z.Shi. ItoWave: Ito stochastic differential equation is all you need for wave generation. ICASSP 2022.[Xie22] Q.Xie, T.Li, X.Wang, Z.Wang, L.Xie, G.Yu, G.Wan. Multi-speaker multi-style text-to-speech synthesis with single-speaker single-style training data scenarios. 
ICASSP 2022.[Yang22] J.Yang, L.He. Cross-lingual TTS using multi-task learning and speaker classifier joint training. arXiv preprint arXiv:2201.08124, 2022.[Ye22] Z.Ye, Z.Zhao, Y.Ren, F.Wu. SyntaSpeech: Syntax-aware generative adversarial text-to-speech. International Joint Conference on Artificial Intelligence 2022.[Yoon22] H.-W.Yoon, O.Kwon, H.Lee, R.Yamamoto, E.Song, J.-M.Kim, M.-J.Hwang. Language model-based emotion prediction methods for emotional speech synthesis systems. Interspeech 2022.[Zhang22] G.Zhang, Y.Qin, W.Zhang, J.Wu, M.Li, Y.Gai, F.Jiang, T.Lee. iEmoTTS: Toward robust cross-speaker emotion transfer and control for speech synthesis based on disentanglement between prosody and timbre. arXiv preprint arXiv:2206.14866, 2022.