
Support return_tensors in audio chat templates #34601

Open
wants to merge 10 commits into main
Conversation

zucchini-nlp (Member) commented Nov 4, 2024

What does this PR do?

Successor of #34275. Analogously, this adds support for vectorized output from audio LLMs. Currently we have only Qwen2Audio, which needs to upload its template to the Hub with minor changes instead of relying on the deprecated default_chat_template; a PR for that is already open.

From now on, vectorized chat templates will work only for processors whose input kwargs are already uniform, because we pass images, videos, and audio, and any of them can be None.

This PR has done:

  • Standardized the Qwen2Audio processor
  • Cleaned up processor tests for audio models
  • Added audio chat template support
  • TODO: delete the Qwen2Audio default chat template once my PR on the Hub is merged
  • TODO: verify that all multimodal instruct LLMs have standard processor kwargs, otherwise they will fail to return tensors from the chat template (VideoLlava is the only one that still needs standardization)

Qwen2Audio standardization raises one question:

  • Currently, if we set defaults at tokenizer init time, e.g. AutoTokenizer.from_pretrained(model_id, max_length=300, padding="max_length"), those defaults are also used by the audio feature extractor, so in this case the extractor will pad to a max length of 300 by default. I see that audio models usually pass the same kwargs to the tokenizer and feature extractor at call time, but I am not sure we want that behavior at initialization.
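To make the question concrete, here is a minimal, self-contained sketch of the behavior described above. The class names and the merging logic are illustrative assumptions, not the actual transformers internals: init-time tokenizer kwargs get reused as feature-extractor defaults, and call-time kwargs override them.

```python
# Hypothetical sketch (class names are illustrative, NOT transformers internals):
# defaults set at tokenizer init leak into the audio feature extractor.

class ToyTokenizer:
    def __init__(self, **init_kwargs):
        # e.g. {"max_length": 300, "padding": "max_length"}
        self.init_kwargs = init_kwargs

class ToyFeatureExtractor:
    def __init__(self, defaults):
        self.defaults = defaults

    def __call__(self, audio, **call_kwargs):
        # call-time kwargs win; otherwise the init-time defaults apply
        return {**self.defaults, **call_kwargs}

tok = ToyTokenizer(max_length=300, padding="max_length")
fe = ToyFeatureExtractor(tok.init_kwargs)

print(fe([0.0] * 16000))
# → {'max_length': 300, 'padding': 'max_length'}  (extractor pads to 300 by default)
print(fe([0.0] * 16000, padding=False))
# → {'max_length': 300, 'padding': False}  (call-time kwargs override init defaults)
```

Whether this implicit sharing at init time is desirable is exactly the open question: call-time sharing of kwargs is common for audio models, init-time sharing may surprise users.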

The docs will be updated as part of #35657, but the logic is the same as with VLMs:

messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/glass-breaking-151256.mp3"}
            {"type": "text", "text": "What do you hear in this audio?"},
        ],
    },
]

inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True)
output = model.generate(**inputs, max_new_tokens=50)
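For intuition, here is a toy stand-in for the text-rendering side of apply_chat_template with audio content. This is a sketch only: the real Qwen2Audio template lives on the Hub, and the placeholder token and ChatML-style markers used here are assumptions for illustration, not the library's actual output.

```python
# Toy illustration of chat-template expansion for audio messages.
# AUDIO_TOKEN and the ChatML-style markers are assumptions, not the
# actual Qwen2Audio Hub template.
AUDIO_TOKEN = "<|AUDIO|>"

def render_chat(messages, add_generation_prompt=False):
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n")
        for item in msg["content"]:
            if item["type"] == "audio":
                # audio items become a placeholder token that the processor
                # later replaces with audio features
                parts.append(AUDIO_TOKEN + "\n")
            elif item["type"] == "text":
                parts.append(item["text"])
        parts.append("<|im_end|>\n")
    if add_generation_prompt:
        # open an assistant turn for the model to complete
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)
```

The real processor additionally loads the audio from the URL, extracts features, and (with tokenize=True, return_dict=True) returns ready-to-use model inputs rather than a string.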

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp zucchini-nlp changed the title [WIP ]Add audio chat templates Support return_tensors in audio chat templates Jan 17, 2025
@zucchini-nlp zucchini-nlp requested a review from eustlb January 17, 2025 16:06