Scyclone: High-Quality and Parallel-Data-Free Voice Conversion Using Spectrogram and Cycle-Consistent Adversarial Networks

Masaya Tanaka (Tokyo University of Agriculture and Technology)
Takashi Nose (Tohoku University)
Aoi Kanagaki (Tohoku University)
Ryouhei Shimizu (The University of Tokyo)
Akira Ito (Tohoku University)

This paper proposes Scyclone, a high-quality voice conversion (VC) technique that requires no parallel training data. Scyclone improves the naturalness and speaker similarity of converted speech by combining CycleGAN-based spectrogram conversion with a simplified WaveRNN-based vocoder. In Scyclone, a linear spectrogram is used as the conversion feature instead of vocoder parameters, which avoids the quality degradation caused by extraction errors in fundamental frequency and voiced/unvoiced parameters. The spectrograms of the source and target speakers are modeled by modified CycleGAN networks, and the waveform is reconstructed using the simplified WaveRNN with a single Gaussian probability density function. Subjective experiments with completely unpaired training data show that Scyclone is significantly better than CycleGAN-VC2, one of the existing state-of-the-art parallel-data-free VC techniques.
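
The two modeling ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the standard CycleGAN L1 cycle-consistency term (the "modified" networks in the paper may use a different formulation) and a plain Gaussian negative log-likelihood for the vocoder output. The function names and the toy generators are hypothetical and exist only for illustration.

```python
import math

def cycle_consistency_loss(spec, g_xy, g_yx):
    """Mean L1 distance between a spectrogram and its round-trip reconstruction.

    spec is a list of frames, each frame a list of magnitudes.
    g_xy and g_yx stand in for the source-to-target and target-to-source
    generators (assumption: any callables mapping spectrogram to spectrogram).
    """
    recon = g_yx(g_xy(spec))
    total, count = 0.0, 0
    for frame, recon_frame in zip(spec, recon):
        for value, recon_value in zip(frame, recon_frame):
            total += abs(value - recon_value)
            count += 1
    return total / count

def gaussian_nll(samples, means, log_stds):
    """Mean negative log-likelihood of waveform samples under one Gaussian per step.

    Assumption: the vocoder predicts a mean and a log standard deviation
    for each output sample.
    """
    total = 0.0
    for s, m, log_std in zip(samples, means, log_stds):
        total += (0.5 * math.log(2.0 * math.pi) + log_std
                  + 0.5 * (s - m) ** 2 * math.exp(-2.0 * log_std))
    return total / len(samples)

# Toy generators (hypothetical): exact inverses of each other.
def toy_g_xy(spec):
    return [[2.0 * v for v in frame] for frame in spec]

def toy_g_yx(spec):
    return [[0.5 * v for v in frame] for frame in spec]

if __name__ == "__main__":
    spec = [[1.0, 2.0], [3.0, 4.0]]
    print(cycle_consistency_loss(spec, toy_g_xy, toy_g_yx))
    print(gaussian_nll([0.1, -0.2], [0.0, 0.0], [0.0, 0.0]))
```

When the two generators invert each other exactly, the cycle term is zero, which is why unpaired utterances are enough for training: the loss compares each utterance only with its own reconstruction, never with a time-aligned recording from the other speaker.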



Sample 1
Target speaker Ayanami F009
Source (natural)
Target (natural)
Target (vocoded)
CycleGAN-VC2
Scyclone

Sample 2
Target speaker Ayanami F009
Source (natural)
Target (natural)
Target (vocoded)
CycleGAN-VC2
Scyclone