Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

token数目不对齐 #110

Open
fanpengustc opened this issue Nov 21, 2024 · 2 comments
Open

token数目不对齐 #110

fanpengustc opened this issue Nov 21, 2024 · 2 comments

Comments

@fanpengustc
Copy link

speech tokenizer输出的token数量是16384个 可是GLM输入的音频token只有16383个 这是个bug?

@sixsixcoder
Copy link

有示例吗?

@NanYANG2015
Copy link

speech tokenizer 支持输出的 audio tokens 数为 16384个,且存在音频会被 tokenizer 为包含最后一个 audio token,
但是 GLM 添加的 audio tokens 只到 <|audio_16382|>,少一个。

音频示例:WenetSpeech/audio/train/youtube/B00000/Y0000000009_-0p8pYdlfjY.opus 中的一段
torchaudio 读取参数:frame_offset=88405920, num_frames=1920000

上述行为是由于最后一个 audio token 的利用率很低可以弃用吗?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants