Question about the Vocos paper #64

Open
penguin4466 opened this issue Nov 1, 2024 · 0 comments

The paper states that "hidden-dim activations are projected into a tensor h with n_fft + 2 channels".

I was wondering why n_fft + 2 channels (i.e. real numbers) are needed to reconstruct the audio waveform. As I understand it, conjugate symmetry means that only n_fft/2 + 1 Fourier coefficients (rounded down for odd n_fft) are needed to represent the full spectrum of an audio frame of n_fft samples. On top of that, one or two of those coefficients are purely real, depending on whether the number of samples is odd or even: the DC term always, plus the Nyquist term when n_fft is even.

As a result, I believe that in theory only n_fft channels (real numbers) are needed in Fourier space to reconstruct the audio waveform. That makes sense, since it equals the number of audio samples in the first place (i.e. neither redundancy nor loss of information).
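
To make the counting argument concrete, here is a minimal sketch in plain Python (standard library only) for a single frame. The helper names `rdft`, `pack`, and `unpack_and_invert` are made up for this illustration and are not taken from the Vocos code or the paper. It only checks the arithmetic above and says nothing about how Vocos actually uses its n_fft + 2 channels.

```python
import cmath
import math


def rdft(x):
    """One-sided DFT of a real frame: bins 0 .. len(x) // 2 (naive O(n^2))."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n // 2 + 1)
    ]


def pack(spec, n):
    """Flatten the one-sided spectrum into exactly n real numbers."""
    last = n // 2
    out = [spec[0].real]  # the DC bin is always real
    for k in range(1, last + 1):
        if n % 2 == 0 and k == last:
            out.append(spec[k].real)  # the Nyquist bin is real when n is even
        else:
            out.extend([spec[k].real, spec[k].imag])
    return out


def unpack_and_invert(packed, n):
    """Rebuild the spectrum from n reals, then invert using conjugate symmetry."""
    last = n // 2
    spec = [complex(packed[0], 0.0)]
    i = 1
    for k in range(1, last + 1):
        if n % 2 == 0 and k == last:
            spec.append(complex(packed[i], 0.0))
            i += 1
        else:
            spec.append(complex(packed[i], packed[i + 1]))
            i += 2
    # Fill in the mirrored bins: X[n - k] = conj(X[k]).
    full = [spec[k] if k <= last else spec[n - k].conjugate() for k in range(n)]
    return [
        sum(full[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


samples = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5, 0.0, 3.0]
for n in (8, 7):  # one even and one odd frame length
    frame = samples[:n]
    spec = rdft(frame)
    packed = pack(spec, n)
    restored = unpack_and_invert(packed, n)
    max_err = max(abs(a - b) for a, b in zip(frame, restored))
    print(n, len(spec), 2 * len(spec), len(packed), max_err)
```

For an even frame length, storing two real values for every one of the n_fft/2 + 1 bins gives n_fft + 2 numbers, while dropping the imaginary parts that are always zero leaves n_fft. In practice `numpy.fft.rfft` and `numpy.fft.irfft` would replace the naive loops here.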
