
Support return_tensors in audio chat templates #34601

Open
wants to merge 10 commits into main
Conversation

zucchini-nlp (Member) commented Nov 4, 2024

What does this PR do?

Successor of #34275. Analogously, this adds support for vectorized output from audio LLMs. Currently we have only Qwen2Audio, which needs to upload its template to the Hub with minor changes instead of relying on the deprecated default_chat_template; a PR for that is already open.

From now on, vectorized chat templates will work only for processors whose input kwargs are already uniform, because we pass images, videos, and audio, and any of them can be None.

This PR has done:

  • Standardized the Qwen2Audio processor
  • Cleaned up processor tests for audio models
  • Added audio chat template support
  • TODO: delete the Qwen2Audio default chat template once my PR on the Hub is merged
  • TODO: verify that all multimodal instruct LLMs have standard processor kwargs, otherwise they will fail to return tensors from the chat template (VideoLlava is the only one that still needs standardization)

Qwen2Audio standardization raises one question:

  • Currently, if we set defaults at tokenizer init time, e.g. AutoTokenizer.from_pretrained(model_id, max_length=300, padding="max_length"), those defaults are also used by the audio feature extractor, so in this case the extractor will pad to a max length of 300 by default. I see that audio models usually pass the same kwargs to the tokenizer and feature extractor at call time, but I am not sure we want that behavior at initialization.
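To make the question concrete, here is a minimal, self-contained sketch of the behavior described above. The class names and the merging logic are illustrative assumptions, not the actual transformers internals: init-time tokenizer kwargs get reused as feature-extractor defaults, and call-time kwargs override them.

```python
# Hypothetical sketch (class names are illustrative, NOT transformers internals):
# defaults set at tokenizer init leak into the audio feature extractor.

class ToyTokenizer:
    def __init__(self, **init_kwargs):
        # e.g. {"max_length": 300, "padding": "max_length"}
        self.init_kwargs = init_kwargs

class ToyFeatureExtractor:
    def __init__(self, defaults):
        self.defaults = defaults

    def __call__(self, audio, **call_kwargs):
        # call-time kwargs win; otherwise the init-time defaults apply
        return {**self.defaults, **call_kwargs}

tok = ToyTokenizer(max_length=300, padding="max_length")
fe = ToyFeatureExtractor(tok.init_kwargs)

print(fe([0.0] * 16000))
# → {'max_length': 300, 'padding': 'max_length'}  (extractor pads to 300 by default)
print(fe([0.0] * 16000, padding=False))
# → {'max_length': 300, 'padding': False}  (call-time kwargs override init defaults)
```

Whether this implicit sharing at init time is desirable is exactly the open question: call-time sharing of kwargs is common for audio models, init-time sharing may surprise users.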

The docs will be updated as part of #35657, but the logic is the same as with VLMs:

messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/glass-breaking-151256.mp3"}
            {"type": "text", "text": "What do you hear in this audio?"},
        ],
    },
]

inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True)
output = model.generate(**inputs, max_new_tokens=50)
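For intuition, here is a toy stand-in for the text-rendering side of apply_chat_template with audio content. This is a sketch only: the real Qwen2Audio template lives on the Hub, and the placeholder token and ChatML-style markers used here are assumptions for illustration, not the library's actual output.

```python
# Toy illustration of chat-template expansion for audio messages.
# AUDIO_TOKEN and the ChatML-style markers are assumptions, not the
# actual Qwen2Audio Hub template.
AUDIO_TOKEN = "<|AUDIO|>"

def render_chat(messages, add_generation_prompt=False):
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n")
        for item in msg["content"]:
            if item["type"] == "audio":
                # audio items become a placeholder token that the processor
                # later replaces with audio features
                parts.append(AUDIO_TOKEN + "\n")
            elif item["type"] == "text":
                parts.append(item["text"])
        parts.append("<|im_end|>\n")
    if add_generation_prompt:
        # open an assistant turn for the model to complete
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)
```

The real processor additionally loads the audio from the URL, extracts features, and (with tokenize=True, return_dict=True) returns ready-to-use model inputs rather than a string.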

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp zucchini-nlp changed the title [WIP ]Add audio chat templates Support return_tensors in audio chat templates Jan 17, 2025
@zucchini-nlp zucchini-nlp requested a review from eustlb January 17, 2025 16:06