Was that Stevie Nicks or Tacotron 2.0? ML Singing in 2018

[S]amim @samim tweeted:

In 2018, machine learning based singing vocal synthesisers will go mainstream. It will transform the music industry beyond recognition.

With these two links:

Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions by Jonathan Shen, et al.


This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize timedomain waveforms from those spectrograms. Our model achieves a mean opinion score (MOS) of 4.53 comparable to a MOS of 4.58 for professionally recorded speech. To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the input to WaveNet instead of linguistic, duration, and F0 features. We further demonstrate that using a compact acoustic intermediate representation enables significant simplification of the WaveNet architecture.


Audio samples from “Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions”

Try the samples before dismissing the prediction of machine learning singing in 2018.

I have a different question:

What is in your test set for ML singing?

Among my top picks, Stevie Nicks, Janis Joplin, and of course, Grace Slick.

Comments are closed.