SELD_SpatialSoundQA

This repo hosts the code and models of "BAT: Learning to Reason about Spatial Sounds with Large Language Models" [ICML 2024 bib].

Check out our demo page and enjoy a QA game with spatial audio.

Performance evaluation on SpatialSoundQA

We use Spatial-AST as the audio encoder and llama-2-7b as the LLM backbone. We finetune the model by adding a Q-Former and LoRA. To calculate mean average precision (MAP), refer to calculate_map.py.
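For readers unfamiliar with the metric, here is a minimal sketch of macro-averaged mean average precision for multi-label classification. It is not the code in calculate_map.py, whose details may differ. The function names, the 0/1 label matrix layout (rows are clips, columns are classes), and the choice to score a class with no positive clips as 0.0 are all assumptions made for illustration.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision for one class: mean precision at the rank of each positive."""
    # Rank clips from highest to lowest score (ties keep their input order).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            precision_sum += hits / rank
    # Assumption: a class with no positive clips contributes 0.0.
    return precision_sum / hits if hits else 0.0


def mean_average_precision(
    y_true: Sequence[Sequence[int]], y_score: Sequence[Sequence[float]]
) -> float:
    """Macro-average of per-class AP; rows are clips, columns are classes."""
    num_classes = len(y_true[0])
    aps = [
        average_precision([row[c] for row in y_true], [row[c] for row in y_score])
        for c in range(num_classes)
    ]
    return sum(aps) / num_classes


# Toy input: 3 clips, 2 classes (hypothetical values).
y_true = [[1, 0], [0, 1], [1, 1]]
y_score = [[0.9, 0.5], [0.1, 0.8], [0.4, 0.3]]
print(mean_average_precision(y_true, y_score))
```

In the toy input, class 0 ranks both positives first (AP = 1), while class 1 has a negative ranked between its two positives (AP = 5/6), so the macro average is 11/12.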

Checkpoints

| Encoder | Projector | LLM |
|---|---|---|
| Spatial-AST | Q-former (~73.56M) | llama-2-7b-hf |

Demo (Spatial Audio Inference)

Try inference.ipynb.

Data preparation

Prepare the data as a jsonl file with one JSON object per line, using the fields shown in the examples below (pretty-printed here for readability).
You can download the SpatialSoundQA dataset from SpatialAudio.

{
  "audio_id": "eval/audio/YI-HlrcP6Qg4",
  "reverb_id": "q9vSo1VnCiC/0.npy", 
  "audio_id2": null, 
  "reverb_id2": null, 
  "question_id": 0, 
  "question_type": "CLASSIFICATION", 
  "question": "Enumerate the sound occurrences in the audio clip.", 
  "answer": "accelerating, revving, vroom; car; vehicle"
}

...

{
  "audio_id": "eval/audio/YZX2fVPmUidA", 
  "reverb_id": "q9vSo1VnCiC/32.npy", 
  "audio_id2": "eval/audio/YjNjUU01quLs", 
  "reverb_id2": "q9vSo1VnCiC/31.npy", 
  "question_id": 58, 
  "question_type": "MIXUP_NONBINARY_DISTANCE", 
  "question": "How far away is the sound of the banjo from the sound of the whack, thwack?", 
  "answer": "2m"
}
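Before training or decoding, it can help to sanity-check a jsonl file against the fields shown in the examples above. The sketch below is a hypothetical helper, not part of this repo. The names `iter_records` and `count_question_types` are made up for illustration, and the rule that `audio_id2` and `reverb_id2` are either both set or both null is an assumption inferred from the two examples.

```python
import json
from collections import Counter
from typing import Iterable, Iterator

# Field names taken from the example records above.
REQUIRED_KEYS = {
    "audio_id", "reverb_id", "audio_id2", "reverb_id2",
    "question_id", "question_type", "question", "answer",
}


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one validated record per non-empty jsonl line."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = REQUIRED_KEYS - set(record)
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        # Assumed rule: the second source is fully specified or fully null.
        if (record["audio_id2"] is None) != (record["reverb_id2"] is None):
            raise ValueError(
                f"line {line_no}: audio_id2 and reverb_id2 must both be set or both be null"
            )
        yield record


def count_question_types(records: Iterable[dict]) -> Counter:
    """Count how many records fall under each question_type."""
    return Counter(record["question_type"] for record in records)
```

Because `iter_records` accepts any iterable of strings, an open file handle can be passed directly, for example `with open(path) as f: print(count_question_types(iter_records(f)))`, where `path` points to your own jsonl file.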

Train a new model

cd examples/seld_spatialsoundqa/
bash scripts/finetune_spatial-ast_qformer_llama_2_7b.sh

Decoding with checkpoints

cd examples/seld_spatialsoundqa/
bash scripts/decode_spatial-ast_qformer_llama_2_7b.sh

TODO

  • Decode with checkpoints
  • Upload SpatialSoundQA dataset
  • Upload pretrained checkpoints
  • Update model performance

Citation

@article{zheng2024bat,
  author    = {Zheng, Zhisheng and Peng, Puyuan and Ma, Ziyang and Chen, Xie and Choi, Eunsol and Harwath, David},
  title     = {BAT: Learning to Reason about Spatial Sounds with Large Language Models},
  journal   = {arXiv preprint arXiv:2402.01591},
  year      = {2024},
}