2018. 12. 11
DSP & AI Lab
Min-Jae Hwang
TOWARD WAVENET SPEECH SYNTHESIS
PROFILE
2
Min-Jae Hwang
• Affiliation : 7th semester of MS/Ph.D. student
DSP & AI Lab., Yonsei Univ., Seoul, Korea
• Adviser : Prof. Hong-Goo Kang
• E-mail : hmj234@dsp.yonsei.ac.kr
Work experience
• 2017.12 ~ 2017.12 – Intern at NAVER corporation
• 2018.01 ~ 2018.11 – Intern at Microsoft Research Asia
Research area
• 2015.8 ~ 2016.12 - Audio watermarking
• M. Hwang, J. Lee, M. Lee, and H. Kang, “Performance improvement of a spread-spectrum-based audio watermarking algorithm via pre-analysis,” in Proc. 33rd Conference on Speech Communication and Signal Processing, Acoustical Society of Korea, 2016
• M. Hwang, J. Lee, M. Lee, and H. Kang, “SVD-based adaptive QIM watermarking on stereo audio signals,” in IEEE Transactions on Multimedia, 2017
• 2017.1 ~ Present - Deep learning-based statistical parametric speech synthesis
• M. Hwang, E. Song, K. Byun, and H. Kang, “Modeling-by-generation structure-based noise compensation algorithm for glottal vocoding speech synthesis system,” in ICASSP, 2018
• M. Hwang, E. Song, J. Kim, and H. Kang, “A unified framework for the generation of glottal signals in deep learning-based parametric speech synthesis systems,” in Interspeech, 2018
• M. Hwang, F. Soong, F. Xie, X. Wang, and H. Kang, “LP-WaveNet: linear prediction-based WaveNet speech synthesis,” in ICASSP, 2019 [Submitted]
CONTENTS
Introduction
• Story of WaveNet vocoder
Various types of WaveNet vocoders
• SoftMax-WaveNet
• MDN-WaveNet
• ExcitNet
• Proposed LP-WaveNet
Tuning of WaveNet
• Noise injection
• Waveform generation methods
Experiments
Conclusion & Summary
3
INTRODUCTION
Story of WaveNet vocoder
[Timeline] Traditional text-to-speech (~2015) → WaveNet (2016) → WaveNet vocoder (2017) → MDN-WaveNet (2017) → ExcitNet (2018) → LP-WaveNet (2019)
CLASSICAL TTS SYSTEMS
Statistical parametric speech synthesis (SPSS) [2]
• Parameterize the speech signal using a vocoder technique
• Pros) High controllability, small database
• Cons) Low quality
5
Unit-selection speech synthesis [1]
• Concatenate tiny segments of real speech signals
• Pros) High quality
• Cons) Low controllability, large database
Text-to-speech
GENERATIVE MODEL-BASED SPEECH SYNTHESIS
WaveNet [3]
• First generative model for raw audio waveforms
• Predicts the probability distribution of waveform samples auto-regressively
• Generates high-quality audio / speech signals
• Impact on TTS, voice conversion, music synthesis, etc.
WaveNet in TTS task
6
p(\mathbf{x}) = \prod_{n=1}^{N} p(x_n \mid \mathbf{x}_{<n})
[Mean opinion scores]
• Utilize linguistic features and the F0 contour as conditional information
[WaveNet speech synthesis]
• Present higher quality than the conventional TTS systems
WAVENET VOCODER-BASED SPEECH SYNTHESIS
Utilize WaveNet as parametric vocoder [4]
• Use acoustic features as conditional information
Advantages
• Higher-quality synthesized speech than the conventional vocoders
• Does not require a hand-engineered processing pipeline
• Higher controllability than the case of linguistic features
• Controlling acoustic features
• Higher training efficiency than the case of linguistic features
• Linguistic feature: 25~35 hour database
• Acoustic feature: 1 hour database
7
[WaveNet vocoder based parametric speech synthesis]
Conventional vocoder / WaveNet vocoder / Recorded
1. SoftMax-WaveNet
2. MDN-WaveNet
3. ExcitNet
4. Proposed LP-WaveNet
[Timeline] Traditional text-to-speech (~2015) → WaveNet (2016) → WaveNet vocoder (2017) → MDN-WaveNet (2017) → ExcitNet (2018) → LP-WaveNet (2019)
VARIOUS TYPES OF WAVENET VOCODERS
BASIC OF WAVENET
Auto-regressive generative model
• Predict probability distribution of speech samples
• Use past waveform samples as conditioning information for WaveNet
Problem: Long-term dependency nature of the speech signal
• Speech samples are highly correlated at high sampling rates, e.g. 16,000 Hz
• E.g. 1) At least 160 (= 16,000/100) speech samples are needed to represent a 100 Hz voice correctly
• E.g. 2) On average, 6,000 speech samples are needed to represent a single word1
Solution
• Feed the speech samples into a dilated causal convolution layer
• Stack the dilated causal convolution layers
• Results in an exponentially growing receptive field
9
p(\mathbf{x}) = \prod_{n=1}^{N} p(x_n \mid x_{n-R:n-1})
[Dilated causal convolution]
1 Statistics based on the average speaking rate of a set of TED talk speakers
http://sixminutes.dlugan.com/speaking-rate/
An effective method for embedding a long receptive field is required
BASIC OF WAVENET
WaveNet without dilation
10
[WaveNet without dilated convolution]
Receptive field: # layer - 1
BASIC OF WAVENET
WaveNet with dilation
11
[WaveNet with dilated convolution]
Receptive field: 𝟐# layer − 𝟏
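As a concrete illustration (not from the slides), the sketch below counts the receptive field of stacked causal convolutions with kernel size 2; the function and variable names are hypothetical, and the count includes the current sample, which is the convention that yields the 3,070-sample receptive field listed later in the experiments.

```python
# Illustrative sketch: receptive field growth of stacked causal convolutions
# (kernel size 2), with and without dilation.
def receptive_field(num_layers, dilations):
    # Each layer with dilation d and kernel size 2 looks back d extra samples.
    return 1 + sum(dilations[:num_layers])

layers = 10
no_dilation = [1] * layers                        # plain causal convolution
with_dilation = [2 ** i for i in range(layers)]   # 1, 2, 4, ..., 512

print(receptive_field(layers, no_dilation))       # 11 samples: grows linearly
print(receptive_field(layers, with_dilation))     # 1024 samples: grows exponentially

# The experimental setup in this deck stacks 3 blocks of 10 dilated layers
# (dilations 1..512), giving 1 + 3 * 1023 = 3070 samples.
print(1 + 3 * sum(2 ** i for i in range(10)))     # 3070
```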
BASIC OF WAVENET
12
A stack of multiple residual blocks
1. Dilated causal convolution
• Exponentially increase the receptive field
2. Gated activation
• Impose non-linearity on the model
• Enable conditional WaveNet
3. Residual / skip connections
• Speed up the convergence
• Enable deep layered model training
\mathbf{z} = \tanh(W_f * \mathbf{x} + V_f * \mathbf{h}) \odot \text{sigmoid}(W_g * \mathbf{x} + V_g * \mathbf{h})
y[d, n] = \sum_{k=0}^{M-1} h[k] \cdot x[n - d \cdot k]
[Basic WaveNet architecture]
x : input of activation, h : conditional vector, z : output of activation
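For reference, a minimal NumPy sketch of the gated activation unit; the convolutions are reduced to per-sample linear projections for clarity, and all names are illustrative rather than the authors' code.

```python
import numpy as np

# Sketch of z = tanh(W_f*x + V_f*h) (*) sigmoid(W_g*x + V_g*h) at one timestep.
def gated_activation(x, h, W_f, V_f, W_g, V_g):
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    filt = np.tanh(W_f @ x + V_f @ h)      # "filter" branch
    gate = sigmoid(W_g @ x + V_g @ h)      # "gate" branch
    return filt * gate                     # element-wise product

channels, cond_dim = 4, 3
rng = np.random.default_rng(0)
x = rng.standard_normal(channels)          # dilated-conv output at one step
h = rng.standard_normal(cond_dim)          # conditional (acoustic) features
W_f, W_g = rng.standard_normal((2, channels, channels))
V_f, V_g = rng.standard_normal((2, channels, cond_dim))
z = gated_activation(x, h, W_f, V_f, W_g, V_g)
print(z.shape)  # (4,)
```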
SOFTMAX-WAVENET
Use WaveNet as multinomial logistic regression model [4]
• 𝜇-law companding for evenly distributed speech sample
• 8-bit one-hot encoding
• 256 symbols
Estimate each symbol using WaveNet
• Predict the sample by SoftMax distribution
Optimize network by cross-entropy (CE) loss
• Minimize the probabilistic distance between p_n and q_n
13
y = \text{sign}(x) \cdot \frac{\ln(1 + \mu |x|)}{\ln(1 + \mu)}, \quad \mu = 255
\mathbf{p} = \text{OneHot}_{8\,\text{bit}}(y)
\mathbf{z}^{q} = \text{WaveNet}(\mathbf{q}_{<n}, \mathbf{h})
q_n = \frac{\exp(z_n^{q})}{\sum_i \exp(z_i^{q})}
L = -\sum_n p_n \log q_n
[Distribution of speech samples]
[SoftMax-WaveNet]
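A small sketch of the μ-law companding and 8-bit one-hot target construction, assuming waveform samples normalized to [-1, 1]; the helper names are illustrative, not the authors' implementation.

```python
import numpy as np

# mu-law companding and 8-bit quantization for the SoftMax-WaveNet target.
def mu_law_encode(x, mu=255):
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)   # compand to [-1, 1]
    return ((y + 1.0) / 2.0 * mu + 0.5).astype(np.int64)       # quantize to 0..255

def mu_law_decode(symbol, mu=255):
    y = 2.0 * symbol / mu - 1.0                                 # back to [-1, 1]
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu    # expand

x = np.array([-0.5, -0.01, 0.0, 0.01, 0.5])
symbols = mu_law_encode(x)          # class indices, the 256-way SoftMax target
one_hot = np.eye(256)[symbols]      # 8-bit one-hot encoding p
print(symbols, mu_law_decode(symbols))
```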
SOFTMAX-WAVENET
Limitation due to the usage of 8-bit quantization
• Noisy synthetic speech due to insufficient number of quantization bits
Intuitive solution
• Expand the SoftMax dimension to 65,536, corresponding to 16-bit quantization
→ High computational cost & difficult to train
Mixture density network (MDN)-based solution [5]
• Train the WaveNet to predict the parameter of pre-defined speech distribution
14
MDN-WAVENET
Define the distribution of waveform sample as parameterized form [5]
• Discretized mixture of logistic (MoL) distribution
• Discretized logistic mixture with 16-bit quantization (Δ = 1/2^15)
Estimate mixture parameters by WaveNet
Optimize network by negative log-likelihood (NLL) loss
15
p(x_n) = \sum_{i=1}^{N} \pi_i \left[ \sigma\!\left(\frac{x_n + \Delta/2 - \mu_i}{s_i}\right) - \sigma\!\left(\frac{x_n - \Delta/2 - \mu_i}{s_i}\right) \right], \quad \sigma(\cdot) : \text{logistic functions}
[\mathbf{z}^{\pi}, \mathbf{z}^{\mu}, \mathbf{z}^{s}] = \text{WaveNet}(\mathbf{x}_{<n}, \mathbf{h})
\boldsymbol{\pi} = \text{softmax}(\mathbf{z}^{\pi}), for unity-summed mixture gains
\boldsymbol{\mu} = \mathbf{z}^{\mu}
\mathbf{s} = \exp(\mathbf{z}^{s}), for positive values of the mixture scales
L = -\sum_n \log p(x_n \mid \mathbf{x}_{<n}, \mathbf{h})
Mixture density network
[MDN-WaveNet]
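The discretized MoL likelihood can be sketched in NumPy as below; this is an illustrative reconstruction (the bin width Δ = 2/65536 assumes a [-1, 1]-normalized 16-bit signal), not the authors' code.

```python
import numpy as np

# Discretized mixture-of-logistics NLL used as the MDN-WaveNet training target.
def logistic_cdf(x, mu, s):
    return 1.0 / (1.0 + np.exp(-(x - mu) / s))

def discretized_mol_nll(x, logit_pi, mu, log_s, delta=2.0 / 65536):
    # Mixture parameter refinement: softmax for gains, exp for positive scales.
    pi = np.exp(logit_pi - logit_pi.max())
    pi /= pi.sum()
    s = np.exp(log_s)
    # Probability mass of the quantization bin around x under each component.
    prob = pi * (logistic_cdf(x + delta / 2, mu, s) - logistic_cdf(x - delta / 2, mu, s))
    return -np.log(prob.sum() + 1e-12)

# One waveform sample, 10 mixture components (as in the experiments).
rng = np.random.default_rng(0)
x = 0.1
logit_pi = rng.standard_normal(10)
mu = rng.standard_normal(10) * 0.1
log_s = -rng.random(10) * 5
print(discretized_mol_nll(x, logit_pi, mu, log_s))
```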
MDN-WAVENET
Higher quality than SoftMax-WaveNet
• Enables modeling the speech signal with 16-bit resolution
• Overcomes the difficulty of spectrum modeling
Difficulty of WaveNet training
• Due to the increased quantization depth from 8 bits to 16 bits
Solution based on the human speech production model [6]
• Model the vocal source signal, whose physical behavior is much simpler than that of the speech signal
16
Mixture density network
[Spectrogram comparison] [Loss comparison]
SoftMax-WaveNet / MDN-WaveNet / Recorded
SPEECH PRODUCTION MODEL
Concept
• Model speech as the output of a vocal source signal filtered by a vocal tract filter
Methodology: Linear prediction (LP) approach
• Define the speech signal as a linear combination of past speech samples plus a residual
17
S(z) = [G(z) \cdot V(z) \cdot R(z)] \cdot E(z)
Speech = [vocal fold × vocal tract × lip radiation] × excitation
[Speech production model] [Spectral deconvolution by LP analysis]
s_n = \sum_{i=1}^{P} \alpha_i s_{n-i} + e_n
S(z) = H(z) \cdot E(z), \quad \text{where } H(z) = \frac{1}{1 - \sum_{i=1}^{P} \alpha_i z^{-i}}
Spectrum part = LP coefficients
Excitation part = Residual signal of LP analysis
EXCITNET
Model the excitation signal by WaveNet, instead of speech signal [6]
Training stage
• Extract excitation signal by linear prediction (LP) analysis
• Periodically updated filter coefficients matched to frame rate of acoustic feature
• Train WaveNet to model excitation signal
Synthesis stage
• Generate the excitation signal by WaveNet
• Synthesize speech signal by LP synthesis filtering
18
LP analysis: e_n = s_n - \sum_{i=1}^{P} \alpha_i s_{n-i}
LP synthesis: s_n = \sum_{i=1}^{P} \alpha_i s_{n-i} + e_n
[ExcitNet]
Recorded Excitation LP synthesis
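A minimal NumPy/SciPy sketch of the LP analysis/synthesis filtering around the WaveNet: a single frame of a synthetic signal, autocorrelation-method LP of an assumed order; all names and values are assumptions rather than the authors' pipeline (in the real system the coefficients are re-estimated per frame at the acoustic-feature rate).

```python
import numpy as np
from scipy.signal import lfilter

# ExcitNet-style LP analysis/synthesis on one frame (illustrative only).
def lp_coefficients(frame, order=16):
    # Autocorrelation method: solve the normal equations R a = r.
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R + 1e-6 * np.eye(order), r[1:order + 1])
    return a  # s_n ≈ sum_i a_i * s_{n-i}

rng = np.random.default_rng(0)
n = np.arange(1024)
speech = np.sin(2 * np.pi * 120 * n / 16000) + 0.05 * rng.standard_normal(1024)

a = lp_coefficients(speech)
# LP analysis: e_n = s_n - sum_i a_i s_{n-i}  (whitened excitation, WaveNet target)
excitation = lfilter(np.concatenate(([1.0], -a)), [1.0], speech)
# LP synthesis: s_n = sum_i a_i s_{n-i} + e_n  (applied to the generated excitation)
reconstructed = lfilter([1.0], np.concatenate(([1.0], -a)), excitation)
print(np.max(np.abs(speech - reconstructed)))  # ~0 up to numerical error
```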
EXCITNET
High-quality synthesized speech even though 8-bit quantization is used
• Overcomes the difficulty of spectrum modeling by using an external spectral shaping filter
Limitation
• Independent modeling of the excitation and spectrum parts results in unpredictable prediction errors
19
S(z) = \frac{1}{1 - \sum_{i=1}^{P} \alpha_i z^{-i}} \cdot E(z)
Both the LP spectrum term and the generated excitation term contain errors
Objective
Represent LP synthesis process during WaveNet training / generation
Recorded
SoftMax-WaveNet / ExcitNet
LP-WAVENET
Motivation from the assumption of WaveNet vocoder
1. Previous speech samples, \mathbf{x}_{<n}, are given
2. LP coefficients, \{\alpha_i\}, are given
Probabilistic analysis
• The difference between the random variables X_n and E_n is only the constant value \hat{x}_n
Assume discretized MoL-distributed speech
• Shifting property of a 2nd-order random variable
20
Their linear combination, \hat{x}_n = \sum_{i=1}^{P} \alpha_i x_{n-i}, is also given
Linear prediction: x_n = e_n + \hat{x}_n
X_n \mid (\mathbf{x}_{<n}, \mathbf{h}) = (E_n + \hat{x}_n) \mid (\mathbf{x}_{<n}, \mathbf{h}) = E_n \mid (\mathbf{x}_{<n}, \mathbf{h}) + \hat{x}_n
\pi_i^{x} = \pi_i^{e}, \quad \mu_i^{x} = \mu_i^{e} + \hat{x}_n, \quad s_i^{x} = s_i^{e}
→ Only the mean parameters are different
[Difference of distributions between speech and excitation: p(e_n \mid \mathbf{x}_{<n}, \mathbf{h}) vs. p(x_n \mid \mathbf{x}_{<n}, \mathbf{h}), shifted by \hat{x}_n]
LP-WAVENET
Utilize the causality of WaveNet and the linearity of LP synthesis processes [7]
21
1. Mixture parameter prediction
[\mathbf{z}^{\pi}, \mathbf{z}^{\mu}, \mathbf{z}^{s}] = \text{WaveNet}(\mathbf{x}_{<n}, \mathbf{h})
2. Mixture parameter refinement
\boldsymbol{\pi} = \text{softmax}(\mathbf{z}^{\pi}), \quad \boldsymbol{\mu} = \mathbf{z}^{\mu} + \hat{x}_n, \quad \mathbf{s} = \exp(\mathbf{z}^{s})
3. Discretized MoL loss calculation
p(x_n \mid \mathbf{x}_{<n}, \mathbf{h}) = \sum_{i=1}^{N} \pi_i \left[ \sigma\!\left(\frac{x_n + \Delta/2 - \mu_i}{s_i}\right) - \sigma\!\left(\frac{x_n - \Delta/2 - \mu_i}{s_i}\right) \right]
E = -\sum_n \log p(x_n \mid \mathbf{x}_{<n}, \mathbf{h})
[LP-WaveNet]
Linear prediction Recorded ExcitNet LP-WaveNet
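The mixture parameter refinement step can be sketched as follows; an illustrative NumPy example with hypothetical names and toy values, showing how the LP prediction x̂_n is added to the predicted means before the loss is evaluated on the speech sample.

```python
import numpy as np

# LP-WaveNet mixture-parameter refinement (illustrative only): the network
# output z is interpreted in the excitation domain, then the LP prediction
# x_hat is added to the means so the discretized MoL loss can be evaluated
# directly on the speech sample x_n.
def refine_parameters(z_pi, z_mu, z_s, x_hat):
    pi = np.exp(z_pi - z_pi.max())
    pi /= pi.sum()                 # unity-summed mixture gains
    mu = z_mu + x_hat              # shift excitation means by the LP prediction
    s = np.exp(z_s)                # positive mixture scales
    return pi, mu, s

# x_hat is the linear combination of past (already generated) speech samples.
alpha = np.array([1.2, -0.5, 0.1])      # LP coefficients (given per frame)
past = np.array([0.30, 0.28, 0.25])     # x_{n-1}, x_{n-2}, x_{n-3}
x_hat = float(alpha @ past)

rng = np.random.default_rng(0)
pi, mu, s = refine_parameters(rng.standard_normal(10),
                              rng.standard_normal(10) * 0.05,
                              -np.ones(10) * 4,
                              x_hat)
print(x_hat, mu[:3])
```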
TUNING OF WAVENET
1. Solution to waveform divergence problem
2. Waveform generation methods
WAVEFORM DIVERGENCE PROBLEM
Waveform divergence problem
• Waveform divergence caused by accumulated errors during WaveNet’s autoregressive generation
Cause – Overfitting on the silence region
• Unique solution in the silence region
• More likely to happen when the portion of silence in the training set is larger
• Sensitive to tiny errors in the silence region during waveform generation
Solution – Noise injection
• Inject a negligible amount of noise
• Allowing only a 1-bit error
• Increases robustness to prediction errors in the silence region
23
x_{curr} = \text{WaveNet}(\mathbf{x}_{prev}, \mathbf{h}); \quad \text{in silence: } 0 = \text{WaveNet}(\mathbf{0}, \mathbf{h}_{sil})
\hat{\mathbf{x}} = \mathbf{x} + \epsilon \cdot \mathbf{n}, \quad \text{where } \epsilon = 2/2^{16}
x_0 \sim p(x_0 \mid \mathbf{x}_{<0}), \; x_1 \sim p(x_1 \mid \mathbf{x}_{<1}), \; x_2 \sim p(x_2 \mid \mathbf{x}_{<2}), \; \ldots, \; x_n \sim p(x_n \mid \mathbf{x}_{<n})
Accumulated error: err_0, \; err_0 + err_1, \; \ldots, \; \sum_i err_i
[Error stacking during waveform generation]
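A sketch of the noise-injection trick, assuming a [-1, 1]-normalized 16-bit waveform so that ε corresponds to roughly one quantization step; the names are illustrative, not the authors' code.

```python
import numpy as np

# Perturb the teacher-forced input waveform by about one 16-bit quantization
# step so the model does not overfit to the exactly-zero silence pattern.
def inject_noise(x, bits=16, rng=None):
    rng = rng or np.random.default_rng()
    eps = 2.0 / (2 ** bits)                  # one LSB for a [-1, 1] signal
    return x + eps * rng.standard_normal(x.shape)

clean = np.zeros(8)                          # a silent stretch of training data
noisy_input = inject_noise(clean, rng=np.random.default_rng(0))
print(noisy_input)  # tiny non-zero values; the target stays the clean waveform
```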
WAVEFORM GENERATION METHODS
Random sampling
Argmax sampling
Greedy sampling [8]
Mode sampling [7]
• Sharper distribution on voiced region
24
Random: \hat{x}_{rand}(n) \sim p(x(n) \mid x(k<n), \mathbf{h}) = \sum_{i=1}^{I} \pi_i \, \text{DistLogistic}(\mu_i, s_i)
Argmax: \hat{x}_{max}(n) = \arg\max_{x(n)} p(x(n) \mid x(k<n), \mathbf{h})
Greedy: \hat{x}_{greedy}(n) = vuv \cdot \hat{x}_{max}(n) + (1 - vuv) \cdot \hat{x}_{rand}(n)
Mode: \hat{x}_{mode}(n) = vuv \cdot \hat{x}_{rand,nar}(n) + (1 - vuv) \cdot \hat{x}_{rand}(n), \quad \hat{x}_{rand,nar}(n) \sim \sum_{i=1}^{I} \pi_i \, \text{DistLogistic}\!\left(\mu_i, \frac{s_i}{c}\right)
[Random & argmax sampling]
[Scale parameter control in mode sampling]
Random sampling / Recorded / Argmax sampling / Greedy sampling / Mode sampling
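The four generation strategies can be sketched as below; an illustrative NumPy example with hypothetical names, where the argmax is approximated by the mean of the most probable mixture component and mode sampling divides the scales by a constant c in voiced frames.

```python
import numpy as np

# Waveform generation strategies over the predicted logistic mixture.
def sample_mixture(pi, mu, s, rng):
    i = rng.choice(len(pi), p=pi)                       # pick a component
    u = rng.uniform(1e-5, 1.0 - 1e-5)
    return mu[i] + s[i] * np.log(u / (1.0 - u))         # inverse logistic CDF

def generate_sample(pi, mu, s, vuv, method, rng, c=2.0):
    x_rand = sample_mixture(pi, mu, s, rng)
    x_max = mu[np.argmax(pi)]                           # approximate distribution peak
    if method == "random":
        return x_rand
    if method == "argmax":
        return x_max
    if method == "greedy":                              # [8]: peak in voiced, random in unvoiced
        return vuv * x_max + (1 - vuv) * x_rand
    if method == "mode":                                # [7]: sharpened sampling in voiced
        x_narrow = sample_mixture(pi, mu, s / c, rng)
        return vuv * x_narrow + (1 - vuv) * x_rand

rng = np.random.default_rng(0)
pi = np.array([0.7, 0.3]); mu = np.array([0.1, -0.2]); s = np.array([0.02, 0.05])
for m in ["random", "argmax", "greedy", "mode"]:
    print(m, generate_sample(pi, mu, s, vuv=1, method=m, rng=rng))
```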
EXPERIMENTS
EXPERIMENTS
Network architecture
Systems
• WN_S : MDN-WaveNet that models the speech signal
• WN_E : MDN-WaveNet that models the excitation signal
• WN_LP : Proposed LP-WaveNet
26
Database : Korean male (MBC YBS database)
Training / validation / test : 2,500 (~3.2 h) / 200 / 200
Minibatch size : 4 GPUs x 20,000 samples
Dilation : 3 * [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
Layers : 30
Receptive field : 3,070 samples
Residual channels : 128
Skip channels : 128
Quantization : 65,536 levels, uniform
Number of mixtures : 10 mixture components --> 30 output channels
Conditioning features : Voicing flag, LSF, F0 (log), energy (log), BAP
Normalization : Gaussian normalization
Learning rate : 1.00E-04
Initialization / optimizer : Xavier / Adam
Sample generation method : Mode sampling
LEARNING CURVE
27
Training speed: WN_LP = WN_E > WN_S
OBJECTIVE EVALUATION
Performance measurements
• VUV: Voicing error rate (%)
• F0 RMSE: F0 root mean square error (Hz) in voiced region
• LSD: Log-spectral distance of AR spectrum (dB)
• F-LSD: LSD of synthesized speech in frequency domain (dB)
Results
28
A/S: analysis / synthesis
SPSS: synthesis by predicted acoustic features
SUBJECTIVE EVALUATION
Mean opinion score (MOS) test
• 20 random synthesized utterances from test set
• 12 native Korean speakers
• Include STRAIGHT (STR)-based speech synthesis system as baseline [9]
Results
29
A/S: analysis / synthesis
SPSS: synthesis by predicted acoustic features
SAMPLES
Recorded
STRAIGHT – A/S
WN_S – A/S
WN_E – A/S
WN_LP – A/S
30
STRAIGHT – SPSS
WN_S – SPSS
WN_E – SPSS
WN_LP – SPSS
A/S: analysis / synthesis
SPSS: synthesis by predicted acoustic features
SUMMARY & CONCLUSION
Investigated various types of WaveNet vocoder
• SoftMax-WaveNet
• MDN-WaveNet
• ExcitNet
Proposed an LP-WaveNet
• Explicitly represents the linear prediction structure of the speech waveform within the WaveNet framework
Introduced tuning methods of WaveNet training / generation
• Noise injection
• Sample generation methods
Performed objective and subjective experiments
• Achieves faster convergence than conventional WaveNet vocoders while keeping similar speech quality
31
REFERENCES
[1] A. J. Hunt and A. W. Black, “Unit selection in a concatenative speech synthesis system using a large speech database,” in Proc. ICASSP, 1996.
[2] H. Zen, et al., “Statistical parametric speech synthesis using deep neural networks,” in Proc. ICASSP, 2013.
[3] A. van den Oord, et al., “WaveNet: A generative model for raw audio,” in CoRR, 2016.
[4] A. Tamamori, et al., “Speaker-dependent WaveNet vocoder,” in Proc. INTERSPEECH, 2017.
[5] A. van den Oord, et al., “Parallel WaveNet: Fast high-fidelity speech synthesis,” in CoRR, 2017.
[6] E. Song, K. Byun, and H.-G. Kang, “A neural excitation vocoder for statistical parametric speech synthesis systems,” in arXiv, 2018.
[7] M. Hwang, F. Soong, F. Xie, X. Wang, and H. Kang, “LP-WaveNet: Linear prediction-based WaveNet speech synthesis,” submitted to Proc. ICASSP, 2019.
[8] X. Wang, et al., “A comparison of recent waveform generation and acoustic modeling methods for neural-network-based speech synthesis,” in Proc. ICASSP, 2018.
[9] H. Kawahara, “STRAIGHT, exploration of the other aspect of vocoder: Perceptually isomorphic decomposition of speech sounds,” Acoustical Science and Technology, 2006.
32
Thank you!
33