While figuring out how Bark works, I wrote down my observations in here.
I came up with a few theoretical methods for voice cloning that would work if my observations were correct. They were, and I published the code.
The source code related to "Method 2" can be found here.
The source code for "Method 3" can probably be found in the commit history of audio-webui, though you might need to go quite far back. That method's outputs are far less convincing and much lower quality than Method 2's, so it may be less interesting, but it could still help explain exactly how Bark's three-step process works.
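To make the three-step process concrete, here is a toy sketch of Bark's pipeline shape: text → semantic tokens → coarse EnCodec codebooks → fine codebooks → waveform. The function names, vocabulary sizes, and stub logic below are illustrative placeholders, not Bark's actual API; the real stages are autoregressive transformers, which are stood in for with trivial functions here.

```python
import random

# Illustrative constants (assumptions, not confirmed Bark internals):
SEMANTIC_VOCAB = 10_000   # size of the semantic token vocabulary
CODEBOOK_SIZE = 1024      # EnCodec codebooks have 1024 entries each
N_COARSE, N_FINE = 2, 8   # coarse stage predicts 2 codebooks, fine fills to 8

def text_to_semantic(text: str) -> list[int]:
    """Stage 1 (stub): text -> phoneme-like semantic tokens."""
    rng = random.Random(len(text))
    return [rng.randrange(SEMANTIC_VOCAB) for _ in text.split()]

def semantic_to_coarse(semantic: list[int]) -> list[list[int]]:
    """Stage 2 (stub): semantic tokens -> first two EnCodec codebooks."""
    return [[s % CODEBOOK_SIZE for s in semantic] for _ in range(N_COARSE)]

def coarse_to_fine(coarse: list[list[int]]) -> list[list[int]]:
    """Stage 3 (stub): fill in the remaining codebooks given the coarse ones."""
    length = len(coarse[0])
    rng = random.Random(length)
    fine = [row[:] for row in coarse]
    while len(fine) < N_FINE:
        fine.append([rng.randrange(CODEBOOK_SIZE) for _ in range(length)])
    return fine  # 8 codebooks x sequence length, ready for EnCodec decoding

codes = coarse_to_fine(semantic_to_coarse(text_to_semantic("hello world")))
print(len(codes), len(codes[0]))
```

Voice cloning in this framing amounts to supplying a history prompt (previously extracted semantic and codebook tokens from the target speaker) as context to stages 1 and 2, which is what the methods above exploit.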
IIRC, Bark is an encoder-decoder based on a T5, and uses wav2vec-BERT as the encoder.
I don't recall where I learned that, but I feel like I validated it at some point: wav2vec-BERT uses tokens that represent phonemes, and those are converted to EnCodec codebooks, which produce the final audio.
I'd love to discuss some of our current work on TTS, STT, and STS if you have the bandwidth! My email is [email protected]
More context about our org is at https://AlignmentLab.ai
Thanks!
Austin