NHV-Sing

Neural Homomorphic Vocoder tuned for singing voice synthesis.
GitHub Repository  ·  Original Paper (Liu et al., 2020)

Kiritan & Natsume

This section compares speaker-specific training with M4Singer fine-tuning on Japanese singing voices. "Trained (speaker)": a model trained exclusively on that speaker's data. "M4Singer fine-tuned": a model fine-tuned on M4Singer (20 Mandarin singers) and then applied to these speakers without further training.

Speaker  ·  Ground Truth  ·  Trained (speaker)  ·  M4Singer fine-tuned
Tohoku Kiritan unseen
Natsume Yuri

M4Singer Fine-tuning Samples

Samples from the M4Singer dataset (Zhang et al., NeurIPS 2022 · CC BY-NC-SA 4.0). Ground truth audio is included for research demonstration purposes with full attribution.
Note: The M4Singer fine-tuned model is designed as a general-purpose vocoder that can synthesize a wide pitch range across diverse speakers. However, as these samples suggest, better quality can be achieved by training on a single speaker and a narrower pitch range.
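To give a rough sense of what "wide" versus "narrow" pitch range means here, the sketch below converts a frequency range into a span in semitones. The note frequencies are standard equal-temperament values (A4 = 440 Hz); the choice of E2 and C6 as the low and high ends of a bass-to-soprano range is an illustrative assumption, not a measurement from M4Singer or from the models on this page. The function name `semitone_span` is likewise hypothetical.

```python
import math


def semitone_span(f_low_hz: float, f_high_hz: float) -> float:
    """Return the distance between two frequencies in semitones."""
    if f_low_hz <= 0 or f_high_hz <= 0:
        raise ValueError("frequencies must be positive")
    # One octave (a frequency ratio of 2) equals 12 semitones.
    return 12.0 * math.log2(f_high_hz / f_low_hz)


# Illustrative endpoints only (assumed, not taken from the dataset):
E2_HZ = 82.41    # a low note for a bass voice
C6_HZ = 1046.50  # a high note for a soprano voice

multi_voice_span = semitone_span(E2_HZ, C6_HZ)   # bass through soprano
single_octave_span = semitone_span(220.0, 440.0)  # A3 to A4, one octave

print(f"bass-to-soprano span: {multi_voice_span:.1f} semitones")
print(f"single-octave span:   {single_octave_span:.1f} semitones")
```

Under these assumptions, a vocoder covering all four voice types has to handle a span of roughly 44 semitones, while a single-speaker model may only need to cover a fraction of that.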

Alto (7 singers)

Singer  ·  Ground Truth  ·  M4Singer fine-tuned
Alto-1
Alto-2
Alto-3
Alto-4
Alto-5
Alto-6
Alto-7

Soprano (3 singers)

Singer  ·  Ground Truth  ·  M4Singer fine-tuned
Soprano-1
Soprano-2
Soprano-3

Bass (3 singers)

Singer  ·  Ground Truth  ·  M4Singer fine-tuned
Bass-1
Bass-2
Bass-3

Tenor (7 singers)

Singer  ·  Ground Truth  ·  M4Singer fine-tuned
Tenor-1
Tenor-2
Tenor-3
Tenor-4
Tenor-5
Tenor-6
Tenor-7