
Speaker-Characteristics-Guided-Speech-Synthesis

Abstract: Talking head techniques are now widely researched. Most previous work focuses on the association between tone, prosody, and visual cues such as head motion, lip movement, and gestures. However, timbre also matters: the voice should match the speaker identity, since people obtain speaker-specific information from both the auditory and visual modalities. In this paper, we therefore aim to generate voice characteristics that accord with the speaker characteristics we specify. To address the problem, we first select 6 speaker characteristics related to voice quality: gender, age, race, body mass index, face shape, and personality. We then train a Conditional Variational AutoEncoder with attention (attentionCVAE) to infer speaker embeddings from speaker characteristics, and employ a multi-speaker text-to-speech system to generate speech for the nonexistent speakers we define. Subjective tests indicate that our method can reconstruct real-world speaker embeddings and generate meaningful fake embeddings from speaker characteristics. Further analysis uncovers how and to what extent the speaker characteristics influence the voice qualities of speakers.

Comparison with Baselines (Section 5.5)

These samples correspond to Section 5.5 of our paper. LibriTTS is the original TTS training dataset. VoxCeleb2 is unseen during TTS training, and we train the next two models, MAF and attentionCVAE, on this dataset to generate embeddings from fake characteristic labels. We randomly select 4 samples for each.

LibriTTS

VoxCeleb2

MAF

attentionCVAE (ours)

Face Matching Test (Section 5.7)

We randomly select 3 speakers seen during training and 3 unseen speakers from VoxCeleb2. For each speaker, we generate speech from the speaker embedding extracted from their audio (ground truth) and from the embedding predicted by our model (ours), and compare how well each matches the speaker’s face.

seen speakers

id05663


ground truth:

ours:

id07017


ground truth:

ours:

id07254


ground truth:

ours:

unseen speakers

id00425


ground truth:

ours:

id00467


ground truth:

ours:

id01590


ground truth:

ours:

Characteristic Matching Test (Section 5.7)

We select 3 attributes (gender, age, and BMI) to evaluate the influence of speaker characteristics on voice qualities. For each attribute, we choose two opposite characteristics to generate speech from.
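As a rough illustration of how such a contrastive pair of labels could be built, the Python sketch below varies one attribute between two opposite values while holding the others fixed, then one-hot encodes each label set as a conditioning vector. The category names (`female`/`male`, `young`/`old`, `low`/`high`) and the helper names `contrastive_pair` and `one_hot` are assumptions made for this example; this page does not list the label vocabulary or the encoding used in the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical label vocabulary: two opposite values per attribute.
# The actual categories used in the paper are not given on this page.
ATTRIBUTE_VALUES: Dict[str, Tuple[str, str]] = {
    "gender": ("female", "male"),
    "age": ("young", "old"),
    "bmi": ("low", "high"),
}


def contrastive_pair(
    base: Dict[str, str], attribute: str
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return two label sets that differ only in `attribute`."""
    if attribute not in ATTRIBUTE_VALUES:
        raise KeyError(f"unknown attribute: {attribute}")
    first, second = ATTRIBUTE_VALUES[attribute]
    labels_a = dict(base)
    labels_b = dict(base)
    labels_a[attribute] = first
    labels_b[attribute] = second
    return labels_a, labels_b


def one_hot(labels: Dict[str, str]) -> List[float]:
    """Concatenate a one-hot encoding of every attribute, in a fixed order."""
    vector: List[float] = []
    for name, values in ATTRIBUTE_VALUES.items():
        vector.extend(1.0 if labels[name] == value else 0.0 for value in values)
    return vector


# Vary only the age attribute; gender and BMI stay fixed.
base_labels = {"gender": "female", "age": "young", "bmi": "low"}
young_labels, old_labels = contrastive_pair(base_labels, "age")
young_condition = one_hot(young_labels)
old_condition = one_hot(old_labels)
```

Each conditioning vector would then be passed to the embedding model, and the two resulting speech samples compared, so that any audible difference can be attributed to the one attribute that changed.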