Question about speech conditioning #640

asr-pub · 2023-10-12T13:02:27Z

In the paper:

Two encodings were produced for each training sample, which are averaged together.

Why use two encodings (one from the sample itself, the other from another clip of the same speaker)? Why not use only the encoding of the current sample?

manmay-nakhashi · 2023-10-18T18:00:24Z

Because if we use only the same audio as conditioning, the model overfits to that data.
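A minimal sketch of the averaging described in the quoted sentence, not the repository's actual training code. The names `toy_encoder` and `speaker_conditioning` are hypothetical, the encoder is a toy stand-in, and drawing the second clip at random each time a sample is loaded is an assumption for illustration.

```python
import random


def toy_encoder(clip):
    """Hypothetical stand-in for a conditioning encoder.

    Maps a clip (a list of float samples) to a 2-dimensional vector.
    """
    mean = sum(clip) / len(clip)
    peak = max(abs(x) for x in clip)
    return [mean, peak]


def speaker_conditioning(target_clip, same_speaker_clips, rng):
    """Average the encoding of the target clip with that of another
    clip from the same speaker (assumed to be drawn at random)."""
    # Candidate clips exclude the target itself.
    others = [c for c in same_speaker_clips if c is not target_clip]
    # Fall back to the target if the speaker has no other clip.
    other_clip = rng.choice(others) if others else target_clip

    enc_self = toy_encoder(target_clip)
    enc_other = toy_encoder(other_clip)
    # Element-wise mean of the two encodings.
    return [(a + b) / 2 for a, b in zip(enc_self, enc_other)]


clips = [[0.0, 1.0], [2.0, 4.0], [-1.0, -3.0]]
rng = random.Random(0)
cond = speaker_conditioning(clips[0], clips, rng)
```

Under this assumption, the conditioning vector for the same target clip can differ from one call to the next, since the second clip is redrawn each time, so the model cannot rely on the conditioning input being derived only from the audio it is asked to reproduce.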

1 reply

So during each training iteration, the speaker conditioning (SC) for the same audio varies, which means the SC is computed dynamically?