-
Hi, thanks for the great work and for open-sourcing it! The model always takes input of length 30 seconds; audio shorter than 30 s is simply zero-padded, and no padding mask is applied to the padded frames. While this is simple and efficient, what would you suggest if I want to use the encoder features for downstream tasks? Since no mask is used, the transformer encoder's features at positions corresponding to the padded input contain contextualized information (they are not 0). To use these features downstream, would you suggest dropping the features at the padded positions, or keeping them? Or have you tried using a padding mask in the Whisper forward pass? Best,
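For concreteness, here is a minimal sketch of what I mean, using the openai-whisper Python API (the file name is a placeholder):

```python
import whisper  # openai-whisper package
import torch

model = whisper.load_model("base")

# Load a short clip and zero-pad it to 30 s (480,000 samples at 16 kHz),
# which is what the feature extraction always does.
audio = whisper.load_audio("short_clip.wav")  # placeholder file
n_valid_samples = audio.shape[0]
audio = whisper.pad_or_trim(audio)

mel = whisper.log_mel_spectrogram(audio).to(model.device)  # (80, 3000)
with torch.no_grad():
    features = model.embed_audio(mel.unsqueeze(0))         # (1, 1500, d_model)

# Encoder tokens run at 50 per second; positions past the end of the real
# audio correspond to the zero padding but are NOT zero after self-attention.
n_valid_tokens = int(n_valid_samples / whisper.audio.SAMPLE_RATE * 50)
print(features[0, n_valid_tokens:].abs().mean())  # > 0: contextualized padding
```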
-
Maybe this can help:
-
We haven't done any experiments on truncated encoder features, but I'd expect it'd be just fine to do so, e.g. using the first 150 out of 1500 tokens for a 3-second audio.
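In code, that truncation might look like the following sketch, assuming encoder output of shape (batch, 1500, d_model) and the fixed rate of 50 encoder tokens per second:

```python
# Keep only the encoder tokens that correspond to real (unpadded) audio.
# The encoder emits 1500 tokens for 30 s of input, i.e. 50 tokens per second.
TOKENS_PER_SECOND = 1500 / 30.0

def truncate_features(features, duration_seconds):
    """features: (batch, 1500, d_model) encoder output; returns the leading
    tokens that cover `duration_seconds` of actual audio."""
    n_tokens = int(round(duration_seconds * TOKENS_PER_SECOND))
    return features[:, :n_tokens]  # e.g. the first 150 tokens for a 3 s clip
```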
-
Hi, slightly related: the padding to 30 seconds of silence seems to introduce errors in language detection. I have a clean 1-second sample of English speech, and I get non-English detections when it is padded to 30 seconds as part of feature extraction, versus correct detections when padding is disabled during feature extraction.
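To reproduce, here is a minimal sketch of the padded path using the stock openai-whisper API (the file name is a placeholder; note that the stock encoder asserts a full 30 s input, so the unpadded comparison requires a feature extractor that allows disabling padding):

```python
import whisper

model = whisper.load_model("base")

audio = whisper.load_audio("english_1s.wav")  # placeholder: 1 s of clean English
padded = whisper.pad_or_trim(audio)           # zero-pad to 30 s of (mostly) silence
mel = whisper.log_mel_spectrogram(padded).to(model.device)

# detect_language runs the encoder over the padded mel spectrogram.
_, probs = model.detect_language(mel)
print(max(probs, key=probs.get))  # can come out non-English for a 1 s clip
```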