Abstract: Nowadays, talking head techniques are widely researched. Most of the previous works pay attention to the association between tones, prosody, and visual cues, such as head motion, lip movement, and gestures. However, it is also important to concern with the timbre, matching the voice with the speaker identity, since people obtain speaker-specific information from both the auditory and visual modalities. Thus in this paper, we aim to generate proper voice characteristics in accordance with the speaker characteristics we set up. To address the problem, we first select 6 speaker characteristics related to the voice qualities: gender, age, race, body mass index, face shape, and personality. We then train a Conditional Variational AutoEncoder with attention (attentionCVAE) model to inference speaker embeddings from speaker characteristics and employ a multi-speaker text-to-speech system to generate speeches of nonexistent speakers we set. The subjective tests indicate that our method can reconstruct real-world speaker embedding, and generate meaningful fake embeddings from speaker characteristics. The further analysis uncovers how and to what extent the speaker characteristics influence the voice qualities of speakers.
Comparison with Baselines(Section 5.5)
These samples as corresponding to Section 5.5 in our paper. LibriTTS is the original TTS training dataset; Voxceleb2 is unseen during TTS training and we train the next two models on this dataset; we train MAF and attentionCVAE to generate embedding from fake characteristic labels. We randomly select 4 samples for each respectively.
LibriTTS
VoxCeleb2
MAF
attentionCVAE(ours)
Face Matching Test(Section 5.7)
We randomly select 3 seen speakers and 3 unseen speakers during training from VoxCeleb2 respectively. We respectively generate speech from their speaker embedding extracted from their audio and the embedding predicted from our model, and compare to what extent these speech match the speaker’s face.
seen speakers
id05663
ground truth:
ours:
id07017
ground truth:
ours:
id07254
ground truth:
ours:
unseen speakers
id00425
ground truth:
ours:
id00467
ground truth:
ours:
id01590
ground truth:
ours:
Characteristic Matching Test(Section 5.7)
We select 3 attributes: gender, age, and BMI to evaluate the influence of speaker characteristics tovoice qualities. We choose two opposite characteristics of one attributes to generate speech from.
gender
male:
female:
age
young:
elderly:
BMI
underweight:
overweight:
Relationship between Characteristic and Voice(Section 5.8)
We give some examples to illustrate how voice change as the characteristic change.
gender
male
female
age
male
young
middleage
elderly
female
young
middleage
elderly
BMI
male
underweight
middleweight
overweight
female
underweight
middleweight
overweight