Welcome to Awesome-Large-Audio-Models, your go-to resource for recent papers, benchmarks, and advances in large audio models. This repository is curated for researchers, developers, and enthusiasts pushing the boundaries of audio processing and understanding.
Stay at the forefront of audio model research with these recent papers:
- "AudioPaLM: A Large Language Model That Can Speak and Listen"
- Authors: Paul K. Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur Bapna, Zalán Borsos, Félix de Chaumont Quitry, Peter Chen, Dalia El Badawy, Wei Han, Eugene Kharitonov, Hannah Muckenhirn, Dirk Padfield, James Qin, Danny Rozenberg, Tara Sainath, Johan Schalkwyk, Matt Sharifi, Michelle Tadmor Ramanovich, Marco Tagliasacchi, Alexandru Tudor, Mihajlo Velimirović, Damien Vincent, Jiahui Yu, Yongqiang Wang, Vicky Zayats, Neil Zeghidour, Yu Zhang, Zhishuai Zhang, Lukas Zilka, Christian Frank
- "SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities"
- "SeamlessM4T—Massively Multilingual & Multimodal Machine Translation"
- Authors: Seamless Communication, Loïc Barrault, Yu-An Chung, Mariano Cora Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu Gong, Kevin Heffernan, John Hoffman, Christopher Klaiber, Pengwei Li, Daniel Licht, Jean Maillard, Alice Rakotoarison, Kaushik Ram Sadagopan, Guillaume Wenzek, Ethan Ye, Bapi Akula, Peng-Jen Chen, Naji El Hachem, Brian Ellis, Gabriel Mejia Gonzalez, Justin Haaheim, Prangthip Hansanti, Russ Howes, Bernie Huang, Min-Jae Hwang, Hirofumi Inaguma, Somya Jain, Elahe Kalbassi, Amanda Kallet, Ilia Kulikov, Janice Lam, Daniel Li, Xutai Ma, Ruslan Mavlyutov, Benjamin Peloquin, Mohamed Ramadan, Abinesh Ramakrishnan, Anna Sun, Kevin Tran, Tuan Tran, Igor Tufanov, Vish Vogeti, Carleigh Wood, Yilin Yang, Bokai Yu, Pierre Andrews, Can Balioglu, Marta R. Costa-jussà, Onur Celebi, Maha Elbayad, Cynthia Gao, Francisco Guzmán, Justine Kao, Ann Lee, Alexandre Mourachko, Juan Pino, Sravya Popuri, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Paden Tomasello, Changhan Wang, Jeff Wang, Skyler Wang
- “PolyVoice: Language Models for Speech to Speech Translation”
- Authors: Qianqian Dong, Zhiying Huang, Qiao Tian, Chen Xu, Tom Ko, Yunlong Zhao, Siyuan Feng, Tang Li, Kexin Wang, Xuxin Cheng, Fengpeng Yue, Ye Bai, Xi Chen, Lu Lu, Zejun Ma, Yuping Wang, Mingxuan Wang, Yuxuan Wang
- "Unified Model for Image, Video, Audio and Language Tasks"
- “TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition”
- Authors: Hakan Erdogan, Scott Wisdom, Xuankai Chang, Zalán Borsos, Marco Tagliasacchi, Neil Zeghidour, John R. Hershey
- “Prompting and Adapter Tuning for Self-supervised Encoder-Decoder Speech Model”
- Authors: Kai-Wei Chang, Ming-Hsin Chen, Yun-Ping Lin, Jing Neng Hsu, Paul Kuo-Ming Huang, Chien-yu Huang, Shang-Wen Li, Hung-yi Lee
- “Multilingual Speech-to-Speech Translation into Multiple Target Languages”
- Authors: Hongyu Gong, Ning Dong, Sravya Popuri, Vedanuj Goswami, Ann Lee, Juan Pino
- “Textless Direct Speech-to-Speech Translation with Discrete Speech Representation”
- Authors: Xinjian Li, Ye Jia, Chung-Cheng Chiu
- “SpeechGen: Unlocking the Generative Power of Speech Language Models with Prompts”
- “Listen, Think, and Understand”
- Authors: Yuan Gong, Hongyin Luo, Alexander H. Liu, Leonid Karlinsky, James Glass
- “BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs”
- Authors: Yang Zhao, Zhijie Lin, Daquan Zhou, Zilong Huang, Jiashi Feng, Bingyi Kang
Stay up to date with the latest advancements in speech-to-text models:
- "Robust Speech Recognition via Large-Scale Weak Supervision"
- “WhisperX: Time-Accurate Speech Transcription of Long-Form Audio”
- "Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages"
- Authors: Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai Chen, Nanxin Chen, Bo Li, Vera Axelrod, Gary Wang, Zhong Meng, Ke Hu, Andrew Rosenberg, Rohit Prabhavalkar, Daniel S. Park, Parisa Haghani, Jason Riesa, Ginger Perng, Hagen Soltau, Trevor Strohman, Bhuvana Ramabhadran, Tara Sainath, Pedro Moreno, Chung-Cheng Chiu, Johan Schalkwyk, Françoise Beaufays, Yonghui Wu
- “Prompting Large Language Models for Zero-Shot Domain Adaptation in Speech Recognition”
- Authors: Yuang Li, Yu Wu, Jinyu Li, Shujie Liu
- “Exploring the Integration of Large Language Models into Automatic Speech Recognition Systems: An Empirical Study”
- Authors: Zeping Min, Jinbo Wang
- “On decoder-only architecture for speech-to-text and large language model integration”
- Authors: Jian Wu, Yashesh Gaur, Zhuo Chen, Long Zhou, Yimeng Zhu, Tianrui Wang, Jinyu Li, Shujie Liu, Bo Ren, Linquan Liu, Yu Wu
- “Adapting Large Language Model with Speech for Fully Formatted End-to-End Speech Recognition”
- Authors: Shaoshi Ling, Yuxuan Hu, Shuangbei Qian, Guoli Ye, Yao Qian, Yifan Gong, Ed Lin, Michael Zeng
- “Adapting multilingual speech representation model for a new, underresourced language through multilingual fine-tuning and continued pretraining”
- Authors: Karol Nowakowski, Michal Ptaszynski, Kyoko Murasaki, Jagna Nieuważny
- “Prompting Large Language Models with Speech Recognition Abilities”
- Authors: Yassir Fathullah, Chunyang Wu, Egor Lakomkin, Junteng Jia, Yuan Shangguan, Ke Li, Jinxi Guo, Wenhan Xiong, Jay Mahadeokar, Ozlem Kalinli, Christian Fuegen, Mike Seltzer
- “Semantic Segmentation with Bidirectional Language Models Improves Long-form ASR”
- Authors: W. Ronny Huang, Hao Zhang, Shankar Kumar, Shuo-yiin Chang, Tara N. Sainath
- “FLEURS: Few-Shot Learning Evaluation of Universal Representations of Speech”
- Authors: Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, Ankur Bapna
- “MASR: Multi-label Aware Speech Representation”
- Authors: Anjali Raj, Shikhar Bharadwaj, Sriram Ganapathy, Min Ma, Shikhar Vashishth
- “Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study”
- Authors: Xuankai Chang, Brian Yan, Kwanghee Choi, Jee-weon Jung, Yichen Lu, Soumi Maiti, Roshan Sharma, Jiatong Shi, Jinchuan Tian, Shinji Watanabe, Yuya Fujita, Takashi Maekaku, Pengcheng Guo, Yao-Fei Cheng, Pavel Denisov, Kohei Saijo, Hsiu-Hsuan Wang
- “SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models”
- Authors: Xin Zhang, Dong Zhang, Shimin Li, Yaqian Zhou, Xipeng Qiu
Stay up to date on the latest advances in audio generation models:
- “AudioLM: A Language Modeling Approach to Audio Generation”
- Authors: Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, Neil Zeghidour
- “AudioGen: Textually Guided Audio Generation”
- Authors: Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre Défossez, Jade Copet, Devi Parikh, Yaniv Taigman, Yossi Adi
- “FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model”
- Authors: Ruiqing Xue, Yanqing Liu, Lei He, Xu Tan, Linquan Liu, Edward Lin, Sheng Zhao
- “Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model”
- “WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research”
- “AudioLDM: Text-to-Audio Generation with Latent Diffusion Models”
- “Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models”
- Authors: Rongjie Huang, Jiawei Huang, Dongchao Yang, Yi Ren, Luping Liu, Mingze Li, Zhenhui Ye, Jinglin Liu, Xiang Yin, Zhou Zhao
- “Large-scale unsupervised audio pre-training for video-to-speech synthesis”
- Authors: Triantafyllos Kefalas, Yannis Panagakis, Maja Pantic
- “ReVISE: Self-Supervised Speech Resynthesis With Visual Input for Universal and Generalized Speech Regeneration”
- Authors: Wei-Ning Hsu, Tal Remez, Bowen Shi, Jacob Donley, Yossi Adi
- “Back Translation for Speech-to-text Translation Without Transcripts”
- Authors: Qingkai Fang, Yang Feng
- “UniAudio: An Audio Foundation Model Toward Universal Audio Generation”
- Authors: Dongchao Yang, Jinchuan Tian, Xu Tan, Rongjie Huang, Songxiang Liu, Xuankai Chang, Jiatong Shi, Sheng Zhao, Jiang Bian, Xixin Wu, Zhou Zhao, Helen Meng
- “PromptTTS 2: Describing and Generating Voices with Text Prompt”
- Authors: Yichong Leng, Zhifang Guo, Kai Shen, Xu Tan, Zeqian Ju, Yanqing Liu, Yufei Liu, Dongchao Yang, Leying Zhang, Kaitao Song, Lei He, Xiang-Yang Li, Sheng Zhao, Tao Qin, Jiang Bian
- “Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale”
- Authors: Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, Wei-Ning Hsu
- “StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models”
- Authors: Yinghao Aaron Li, Cong Han, Vinay S. Raghavan, Gavin Mischler, Nima Mesgarani
Stay up to date on the latest advances in cross-modal representation learning:
- “CLAP Learning Audio Concepts from Natural Language Supervision”
- Authors: Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, Huaming Wang
- “SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model”
- Authors: Yi-Jen Shih, Hsuan-Fu Wang, Heng-Jui Chang, Layne Berry, Hung-yi Lee, David Harwath
- “BLSP: Bootstrapping Language-Speech Pre-training via Behavior Alignment of Continuation Writing”
- “TalkCLIP: Talking Head Generation with Text-Guided Expressive Speaking Styles”
- Authors: Yifeng Ma, Suzhen Wang, Yu Ding, Bowen Ma, Tangjie Lv, Changjie Fan, Zhipeng Hu, Zhidong Deng, Xin Yu
- “Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding”
- “Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning”
- Authors: Shansong Liu, Atin Sakkeer Hussain, Chenshuo Sun, Ying Shan
- “Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language”
- Authors: Alexei Baevski, Arun Babu, Wei-Ning Hsu, Michael Auli
- “Modality Adaption or Regularization? A Case Study on End-to-End Speech Translation”
- “MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training”
- “VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset”
- “X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages”
- “VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset”
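Many of the models above (CLAP, SpeechCLIP, and similar CLIP-style systems) are trained with a symmetric contrastive (InfoNCE) objective over paired audio and text embeddings. Below is a minimal NumPy sketch of that objective; the function name and the toy temperature value are illustrative, not taken from any particular paper:

```python
import numpy as np

def clip_style_loss(audio_emb: np.ndarray, text_emb: np.ndarray,
                    temperature: float = 0.07) -> float:
    """Symmetric InfoNCE loss over a batch of paired audio/text embeddings."""
    # L2-normalize so dot products are cosine similarities.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature          # (batch, batch) similarity matrix
    labels = np.arange(len(a))              # matching pairs sit on the diagonal

    def xent(l: np.ndarray) -> float:
        l = l - l.max(axis=1, keepdims=True)            # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()             # diagonal = true pairs

    # Average the audio-to-text and text-to-audio directions.
    return (xent(logits) + xent(logits.T)) / 2
```

Training pulls each audio clip toward its own caption on the diagonal of the similarity matrix and pushes it away from every other caption in the batch.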
Track the progress of large audio models through benchmark performance metrics:
- ASR Leaderboard: Monitor the latest results in automatic speech recognition (ASR) across diverse datasets, evaluating accuracy (typically word error rate, WER), speed, and memory efficiency of various models.
- Speech Synthesis Quality: Explore benchmark scores for evaluating the quality and naturalness of synthesized speech, ensuring your applications deliver a premium auditory experience.
- Audio Classification Challenges: Stay up to date with ongoing audio classification challenges and assess model performance across different domains, from environmental sounds to musical genres.
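The headline accuracy metric on most ASR leaderboards is word error rate (WER): the word-level edit distance between hypothesis and reference, divided by the reference length. A minimal, dependency-free sketch (the function name is ours):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming table: d[i][j] = edit distance between
    # the first i reference words and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # deletions only
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insertions only
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sit on mat"))
# 1 substitution + 1 deletion over 6 reference words ≈ 0.33
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why leaderboards usually report it as a percentage alongside the dataset and decoding setup.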
We encourage the audio research community to actively contribute to this repository by adding relevant papers, benchmark results, or tools that can benefit the community at large. Let's collaboratively advance the field of large audio models!
We'd like to express our gratitude to all the researchers and developers who tirelessly work on improving audio models and sharing their findings with the community. This repository stands as a testament to your dedication.
If you find this resource helpful in your research, please consider citing us:
@misc{Awesome-Large-Audio-Models,
  title  = {Awesome-Large-Audio-Models},
  author = {Your Name},
  year   = {2023},
  url    = {https://github.com/yourusername/Awesome-Large-Audio-Models}
}
Join us in the quest to push the boundaries of large audio models. Together, we can create remarkable advancements in audio processing and understanding. Explore, contribute, and innovate! 🚀🎧🔊