Support `return_tensors` in audio chat templates
#34601
Open
+362
−150
What does this PR do?
Successor of #34275. Analogously, this supports vectorized output from audio LLMs. Currently we have only Qwen2Audio, which needs to upload its template to the Hub with minor changes instead of relying on the deprecated `default_chat_template`. Already opened a PR for that.

From now on, vectorized chat templates will work only for processors that are already uniform in terms of input kwargs, because we pass `images`, `videos` and `audio`, and any of them can be `None`.
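The uniform-kwargs contract can be illustrated with a minimal sketch. The function below is a toy stand-in, not the actual `transformers` processor code: every modality keyword is always accepted, and any of them may be `None`.

```python
def process_multimodal(text, images=None, videos=None, audio=None):
    """Toy processor call mirroring the uniform-kwargs contract:
    all modality kwargs exist on every processor; unused ones are None."""
    batch = {"input_ids": [len(t) for t in text]}  # stand-in for tokenization
    # Only modalities that were actually provided contribute features.
    for name, value in (("images", images), ("videos", videos), ("audio", audio)):
        if value is not None:
            batch[f"{name}_features"] = value
    return batch

# An audio-only call: images and videos are simply None.
out = process_multimodal(["hello"], audio=[[0.0, 0.1]])
```

Because the signature is identical across processors, a vectorized chat template can always forward all three kwargs without knowing which modalities a given model supports.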
This PR has done:
- Qwen2Audio standardization. This raises one question: if a user initializes with defaults like `AutoTokenizer.from_pretrained(model_id, max_length=300, padding="max_length")`, those defaults will also be used in the audio feature extractor, so the extractor will by default pad to a max length of 300. I see that audio models usually pass the same kwargs to the tokenizer and feature extractor at call time, but I am not sure we want that for initialization.
- The docs will be updated as part of #35657, but the logic is the same as with VLMs.