Abstract: Talking-head techniques are now widely researched. Most previous work focuses on the association between tone, prosody, and visual cues such as head motion, lip movement, and gestures. However, timbre also matters: the voice should match the speaker's identity, since people obtain speaker-specific information from both the auditory and visual modalities. In this paper, we therefore aim to generate voice characteristics that accord with a specified set of speaker characteristics. To address the problem, we first select 6 speaker characteristics related to voice quality: gender, age, race, body mass index, face shape, and personality. We then train a Conditional Variational AutoEncoder with attention (attentionCVAE) to infer speaker embeddings from speaker characteristics, and employ a multi-speaker text-to-speech system to generate speech for the nonexistent speakers we define. Subjective tests indicate that our method can reconstruct real-world speaker embeddings and generate meaningful fake embeddings from speaker characteristics. Further analysis uncovers how, and to what extent, speaker characteristics influence the voice qualities of speakers.
Comparison with Baselines (Section 5.5)
These samples correspond to Section 5.5 of our paper. LibriTTS is the original TTS training dataset. VoxCeleb2 is unseen during TTS training, and we train the next two models, MAF and attentionCVAE, on it to generate embeddings from fake characteristic labels. We randomly select 4 samples for each.
LibriTTS
VoxCeleb2
MAF
attentionCVAE(ours)
Face Matching Test (Section 5.7)
We randomly select 3 speakers seen during training and 3 unseen speakers from VoxCeleb2. For each speaker, we generate speech from the speaker embedding extracted from their audio (ground truth) and from the embedding predicted by our model (ours), and compare how well each matches the speaker’s face.
seen speakers
id05663

ground truth:
ours:
id07017

ground truth:
ours:
id07254

ground truth:
ours:
unseen speakers
id00425

ground truth:
ours:
id00467

ground truth:
ours:
id01590

ground truth:
ours:
Characteristic Matching Test (Section 5.7)
We select 3 attributes (gender, age, and BMI) to evaluate the influence of speaker characteristics on voice qualities. For each attribute, we choose two opposite characteristics and generate speech from each.
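As a rough illustration of how such a contrast pair might be set up, the sketch below builds two label sets that differ only in the attribute under test. It is not the code used in the paper. The names `BASE_LABELS`, `OPPOSITES`, and `contrast_pair` are hypothetical, as are the placeholder values for the remaining attributes and the assumption that those attributes are held fixed across the pair.

```python
from typing import Dict, Tuple

# Hypothetical baseline labels. The six attribute names follow the abstract;
# the values and their string encoding are assumptions made for illustration.
BASE_LABELS: Dict[str, str] = {
    "gender": "male",
    "age": "middleage",
    "race": "unspecified",
    "bmi": "middleweight",
    "face_shape": "unspecified",
    "personality": "unspecified",
}

# Opposite characteristics listed on this page for the three tested attributes.
OPPOSITES: Dict[str, Tuple[str, str]] = {
    "gender": ("male", "female"),
    "age": ("young", "elderly"),
    "bmi": ("underweight", "overweight"),
}


def contrast_pair(
    attribute: str, base: Dict[str, str] = BASE_LABELS
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return two label sets that differ only in `attribute`.

    Assumption: all other attributes are copied unchanged from `base`.
    """
    first_value, second_value = OPPOSITES[attribute]
    first = {**base, attribute: first_value}
    second = {**base, attribute: second_value}
    return first, second


if __name__ == "__main__":
    young, elderly = contrast_pair("age")
    print(young["age"], elderly["age"])
```

Each label set would then be encoded and passed to the embedding model; that step depends on the model's actual input format and is left out here.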
gender
male:
female:
age
young:
elderly:
BMI
underweight:
overweight:
Relationship between Characteristic and Voice (Section 5.8)
We give some examples to illustrate how the voice changes as a characteristic changes.
gender
male

female

age
male
young

middleage

elderly

female
young

middleage

elderly

BMI
male
underweight

middleweight

overweight

female
underweight

middleweight

overweight
