-
Hi, thanks for the great work and for open-sourcing it! The model always takes input of length 30 seconds; audio shorter than 30 s is simply zero-padded, and no padding mask is applied to the padded frames. While this is simple and efficient, what would you suggest if I want to use the encoder features for downstream tasks? Since no mask is used, the transformer encoder's features at positions corresponding to the padded input contain contextualized information (they are not 0). To use these features downstream, would you suggest dropping the features at the padded positions, or keeping them? Or have you tried using a padding mask in the Whisper forward pass? Best,
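For concreteness, here is a minimal sketch of what I mean, using the openai-whisper Python API (the file name is a placeholder):

```python
import whisper  # openai-whisper package
import torch

model = whisper.load_model("base")

# Load a short clip and zero-pad it to 30 s (480,000 samples at 16 kHz),
# which is what the feature extraction always does.
audio = whisper.load_audio("short_clip.wav")  # placeholder file
n_valid_samples = audio.shape[0]
audio = whisper.pad_or_trim(audio)

mel = whisper.log_mel_spectrogram(audio).to(model.device)  # (80, 3000)
with torch.no_grad():
    features = model.embed_audio(mel.unsqueeze(0))         # (1, 1500, d_model)

# Encoder tokens run at 50 per second; positions past the end of the real
# audio correspond to the zero padding but are NOT zero after self-attention.
n_valid_tokens = int(n_valid_samples / whisper.audio.SAMPLE_RATE * 50)
print(features[0, n_valid_tokens:].abs().mean())  # > 0: contextualized padding
```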
-
Maybe this can help:
-
We haven't done any experiments on truncated encoder features, but I'd expect it'd be just fine to do so, e.g. using the first 150 out of 1500 tokens for a 3-second audio.
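In code, that truncation might look like the following sketch, assuming encoder output of shape (batch, 1500, d_model) and the fixed rate of 50 encoder tokens per second:

```python
# Keep only the encoder tokens that correspond to real (unpadded) audio.
# The encoder emits 1500 tokens for 30 s of input, i.e. 50 tokens per second.
TOKENS_PER_SECOND = 1500 / 30.0

def truncate_features(features, duration_seconds):
    """features: (batch, 1500, d_model) encoder output; returns the leading
    tokens that cover `duration_seconds` of actual audio."""
    n_tokens = int(round(duration_seconds * TOKENS_PER_SECOND))
    return features[:, :n_tokens]  # e.g. the first 150 tokens for a 3 s clip
```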
-
Hi, slightly related: the padding to 30 seconds of silence seems to introduce errors in language detection. I have a clean 1-second sample of English speech, and I get non-English detections when it is padded to 30 seconds as part of feature extraction, versus correct detections when padding is disabled during feature extraction.
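To reproduce, here is a minimal sketch of the padded path using the stock openai-whisper API (the file name is a placeholder; note that the stock encoder asserts a full 30 s input, so the unpadded comparison requires a feature extractor that allows disabling padding):

```python
import whisper

model = whisper.load_model("base")

audio = whisper.load_audio("english_1s.wav")  # placeholder: 1 s of clean English
padded = whisper.pad_or_trim(audio)           # zero-pad to 30 s of (mostly) silence
mel = whisper.log_mel_spectrogram(padded).to(model.device)

# detect_language runs the encoder over the padded mel spectrogram.
_, probs = model.detect_language(mel)
print(max(probs, key=probs.get))  # can come out non-English for a 1 s clip
```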