This repository hosts the code and models for "BAT: Learning to Reason about Spatial Sounds with Large Language Models" (ICML 2024; the BibTeX entry is at the end of this README).
Check out our demo page and enjoy a QA game with spatial audio.
We use Spatial-AST as the audio encoder and llama-2-7b as the LLM backbone. We finetune the model by adding a Q-Former and LoRA. To calculate mean average precision (mAP), refer to calculate_map.py.
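For readers unfamiliar with the metric, the following is a minimal pure-Python sketch of mean average precision for multi-label predictions. It is an illustration only and is not taken from calculate_map.py: the function names, the input layout (one row per sample, one column per class), and the choice to skip classes without positives are all assumptions made for this example.

```python
from typing import List, Sequence


def average_precision(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Average precision for one class (hypothetical helper).

    Samples are ranked by descending score; AP is the mean of the
    precision values measured at the rank of each positive sample.
    """
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    hits = 0
    precisions: List[float] = []
    for rank, idx in enumerate(order, start=1):
        if y_true[idx]:
            hits += 1
            precisions.append(hits / rank)
    # A class without positives has no defined AP; return 0.0 here.
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(
    targets: Sequence[Sequence[int]], scores: Sequence[Sequence[float]]
) -> float:
    """Mean of per-class AP over classes that have at least one positive.

    targets[i][c] is 1 if sample i carries class c, else 0.
    scores[i][c] is the predicted score of class c for sample i.
    """
    num_classes = len(targets[0])
    aps: List[float] = []
    for c in range(num_classes):
        col_true = [row[c] for row in targets]
        if not any(col_true):
            continue  # assumption: ignore classes absent from the ground truth
        col_score = [row[c] for row in scores]
        aps.append(average_precision(col_true, col_score))
    return sum(aps) / len(aps) if aps else 0.0
```

Tied scores are broken by input order in this sketch, so the result can differ slightly from library implementations that treat ties differently.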
| Encoder | Projector | LLM |
|---|---|---|
| Spatial-AST | Q-Former (~73.56M) | llama-2-7b-hf |
Try inference.ipynb.
You can download the SpatialSoundQA dataset from SpatialAudio.
Prepare the data as a JSONL file in the following format. The fields audio_id2 and reverb_id2 are null for single-source questions. Below is an example.
{
"audio_id": "eval/audio/YI-HlrcP6Qg4",
"reverb_id": "q9vSo1VnCiC/0.npy",
"audio_id2": null,
"reverb_id2": null,
"question_id": 0,
"question_type": "CLASSIFICATION",
"question": "Enumerate the sound occurrences in the audio clip.",
"answer": "accelerating, revving, vroom; car; vehicle"
}
...
{
"audio_id": "eval/audio/YZX2fVPmUidA",
"reverb_id": "q9vSo1VnCiC/32.npy",
"audio_id2": "eval/audio/YjNjUU01quLs",
"reverb_id2": "q9vSo1VnCiC/31.npy",
"question_id": 58,
"question_type": "MIXUP_NONBINARY_DISTANCE",
"question": "How far away is the sound of the banjo from the sound of the whack, thwack?",
"answer": "2m"
}
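The record layout above can be checked before training with a small loader. The following is a sketch under stated assumptions, not part of the repository: it assumes each record occupies a single line in the actual JSONL file (the example above is pretty-printed for readability), that all eight keys shown are always present, and that audio_id2 and reverb_id2 are either both null or both set. The names REQUIRED_KEYS, parse_jsonl, and has_two_sources are hypothetical.

```python
import json
from typing import Dict, Iterable, List

# Keys taken from the example records above (assumed to be mandatory).
REQUIRED_KEYS = {
    "audio_id", "reverb_id", "audio_id2", "reverb_id2",
    "question_id", "question_type", "question", "answer",
}


def parse_jsonl(lines: Iterable[str]) -> List[Dict]:
    """Parse JSONL text lines into records, validating each one."""
    records: List[Dict] = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        # Assumption: the second source is fully present or fully absent.
        if (record["audio_id2"] is None) != (record["reverb_id2"] is None):
            raise ValueError(f"line {line_no}: audio_id2 and reverb_id2 disagree")
        records.append(record)
    return records


def has_two_sources(record: Dict) -> bool:
    """True for records that mix two sound sources (audio_id2 is set)."""
    return record["audio_id2"] is not None
```

A file can be passed directly, since an open text file iterates over its lines: `with open(path, encoding="utf-8") as f: records = parse_jsonl(f)`, where `path` points to your own JSONL file.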
To finetune the model:
cd examples/seld_spatialsoundqa/
bash scripts/finetune_spatial-ast_qformer_llama_2_7b.sh
To run inference (decoding):
cd examples/seld_spatialsoundqa/
bash scripts/decode_spatial-ast_qformer_llama_2_7b.sh
- Decode with checkpoints
- Upload SpatialSoundQA dataset
- Upload pretrained checkpoints
- Update model performance
@article{zheng2024bat,
author = {Zheng, Zhisheng and Peng, Puyuan and Ma, Ziyang and Chen, Xie and Choi, Eunsol and Harwath, David},
title = {BAT: Learning to Reason about Spatial Sounds with Large Language Models},
journal = {arXiv preprint arXiv:2402.01591},
year = {2024},
}