Merge pull request #170 from Andreas-Xi/lxq-drcap
Lxq drcap
ddlBoJack authored Nov 9, 2024
2 parents 68c5fb9 + 135f69b commit f32b8a2
Showing 5 changed files with 2,023 additions and 9 deletions.
24 changes: 18 additions & 6 deletions examples/drcap_zeroshot_aac/README.md
# DRCap_Zeroshot_Audio_Captioning

## Introduction
[DRCap](https://www.arxiv.org/abs/2410.09472) is a data-efficient and flexible audio captioning system that requires only textual data for training and can quickly adapt to new domains without additional fine-tuning. It uses projection decoding and retrieval-augmented generation to perform zero-shot audio captioning.

![](assets/model.png)

You could download our pretrained CLAP model and linear mapping network through …
* LLM [vicuna-7b-v1.5](https://huggingface.co/lmsys/vicuna-7b-v1.5)
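If it helps, one way to fetch the LLM checkpoint is through the Hugging Face CLI; the local directory below is only an illustration, and any path works as long as `llm_path` later points to it:

```shell
# Illustrative download of the LLM checkpoint.
# Requires the huggingface_hub CLI: pip install -U "huggingface_hub[cli]"
huggingface-cli download lmsys/vicuna-7b-v1.5 --local-dir ./checkpoints/vicuna-7b-v1.5
```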

## Inference
You could modify the variables `run_dir`, `audio_encoder_dir`, `output_dir`, and `llm_path` in `scripts/inference_drcap.sh` to match the paths where the downloaded checkpoints are located. Additionally, update the `source` field in `data_examples/audiocaps_test.jsonl` so that the audio paths point to your audio files, and then run:

```shell
bash scripts/inference_drcap.sh
```
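As a rough sketch, the variables mentioned above might be set inside `scripts/inference_drcap.sh` along the following lines; the paths are placeholders for wherever you stored the downloaded checkpoints, not repository defaults:

```shell
# Placeholder paths -- point these at your own checkpoint locations.
run_dir=/path/to/SLAM-LLM               # repository root
audio_encoder_dir=/path/to/clap_encoder # downloaded CLAP checkpoint
llm_path=/path/to/vicuna-7b-v1.5        # downloaded LLM
output_dir=/path/to/drcap_checkpoint    # downloaded linear mapping network / decoding output
```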
## Data preparation
Prepare your `jsonl` data file in the following format:
```json
{"key": "Y7fmOlUlwoNg_1", "target": "Constant rattling noise and sharp vibrations", "text": "Constant rattling noise and sharp vibrations"}
{"key": "Y6BJ455B1aAs_1", "target": "A rocket flies by followed by a loud explosion and fire crackling as a truck engine runs idle", "text": "A rocket flies by followed by a loud explosion and fire crackling as a truck engine runs idle"}
{"key": "Y7fmOlUlwoNg_1", "target": "Constant rattling noise and sharp vibrations", "text": "Constant rattling noise and sharp vibrations", "similar_captions": ["The engine of a small machine pulling chains", "A market vendor is producing a rhythmic sound with metal forceps.", "A masonry machine is in operation at a fair."]}
{"key": "Y6BJ455B1aAs_1", "target": "A rocket flies by followed by a loud explosion and fire crackling as a truck engine runs idle", "text": "A rocket flies by followed by a loud explosion and fire crackling as a truck engine runs idle", "similar_captions": ["An engine is revving, with fire and an explosion.", "An explosion is heard after an engine cuts out.", "A car speeding past with a large boom"]}
```
Please note that only textual data is required for training. However, for zero-shot inference, audio files are also necessary. You could find an example jsonl file in `data_examples/audiocaps_test.jsonl`.
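As an illustration, a plain-text list of captions (one per line) can be turned into this text-only jsonl layout with a few lines of shell; `captions.txt`, `train.jsonl`, and the synthetic `key` values are made up for this sketch, and the `similar_captions` field would still be filled in by the retrieval-augmentation step below:

```shell
# Hypothetical helper: build a text-only training jsonl from captions.txt
# (one caption per line). Requires jq.
i=0
while IFS= read -r caption; do
  i=$((i + 1))
  jq -nc --arg key "caption_${i}" --arg cap "$caption" \
    '{key: $key, target: $cap, text: $cap}'
done < captions.txt > train.jsonl
```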

Run the retrieval-augmentation step provided with this example to create the text embedding support for evaluation.

## Model Training
Run the following command to fine-tune the model:
```shell
bash scripts/finetune_drcap.sh
```
To train only the linear layer (without LoRA or other PEFT methods), set `use_peft=false` and `freeze_llm=true`. To turn off RAG, set `use_arg=false` and `rag_first=false`.
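How these flags are actually passed depends on `scripts/finetune_drcap.sh`; purely as an illustration, a linear-layer-only, RAG-free run could look like the following if the flags are exposed as shell variables in that script (check the script itself for the real mechanism):

```shell
# Illustrative settings only -- the real script may wire these up differently.
use_peft=false    # no LoRA / other PEFT adapters
freeze_llm=true   # keep the LLM frozen; only the linear layer is trained
use_arg=false     # turn off retrieval-augmented generation
rag_first=false
```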

## Acknowledgement
The code for training the CLAP model is based on the [WavCaps](https://github.com/XinhaoMei/WavCaps) repo; we thank the contributors for open-sourcing their work.


## Citation
You could refer to our paper for more results:
```bibtex
@article{li2024drcap,
  title={DRCap: Decoding CLAP Latents with Retrieval-augmented Generation for Zero-shot Audio Captioning},
  author={Li, Xiquan and Chen, Wenxi and Ma, Ziyang and Xu, Xuenan and Liang, Yuzhe and Zheng, Zhisheng and Kong, Qiuqiang and Chen, Xie},
  journal={arXiv preprint arXiv:2410.09472},
  year={2024}
}
```