This project focuses on fine-tuning the Whisper model for automatic speech recognition (ASR). The goal is to enhance the performance of Whisper on a custom dataset by using transfer learning and optimizing model parameters. The project uses a Jupyter Notebook for fine-tuning, training, and evaluating the model, leveraging the Hugging Face `transformers` library.

## Features
- Fine-tuning Whisper for speech-to-text tasks on custom audio data.
- Supports mixed precision training (`fp16`) for faster training on GPUs (see the training-arguments sketch after this list).
- Integration with Weights & Biases for experiment tracking and monitoring.
- Tokenization of both audio and text inputs for model training.
- Evaluation at the end of each epoch to track performance improvements.
- Model checkpoint saving to allow resumption of training.
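Several of these features map directly onto Hugging Face training arguments. A minimal sketch with `Seq2SeqTrainingArguments` (all values here are illustrative assumptions, not the notebook's exact settings):

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="model",              # where checkpoints are saved
    fp16=True,                       # mixed precision training on GPUs
    evaluation_strategy="epoch",     # evaluate at the end of each epoch
    save_strategy="epoch",           # checkpoint each epoch so training can resume
    report_to="wandb",               # stream metrics to Weights & Biases
    predict_with_generate=True,      # decode during evaluation (needed for WER)
    learning_rate=1e-5,
    per_device_train_batch_size=8,
)
```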
## Requirements

- Python 3.7 or higher
- Hugging Face Transformers
- PyTorch
- Weights & Biases (for experiment tracking)
- Additional libraries: `datasets`, `tqdm`, `torchaudio`, `transformers`
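These can be installed with pip (package list inferred from the requirements above; check the repository for a pinned requirements file):

```bash
pip install torch torchaudio transformers datasets tqdm wandb
```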
## Installation

Clone the repository:

```bash
git clone https://github.com/AliiAhmadi/speech_to_text.git
cd speech_to_text/
```
For experiment tracking, create a W&B account and set up the API key:

```bash
wandb login
```
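With `report_to="wandb"` in the training arguments (see the sketch above), the Hugging Face Trainer starts a W&B run automatically; you can also initialize one explicitly in the notebook (the project name below is a placeholder):

```python
import wandb

# Placeholder project name; use whatever grouping you prefer in W&B
wandb.init(project="whisper-finetuning")
```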
## Usage

This project uses a Jupyter Notebook for fine-tuning the Whisper model. Open the notebook file `whisper_finetuning.ipynb` in your Jupyter environment.
### Dataset

Ensure you have a dataset for fine-tuning. You can use any ASR dataset in the correct format (e.g., audio files with transcriptions). The dataset should have fields such as `audio` and `transcript`.
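For instance, a small dataset with these fields could be built with the `datasets` library (file paths and transcripts here are placeholders):

```python
from datasets import Dataset, Audio

# Placeholder file paths and transcripts; substitute your own data
train_dataset = Dataset.from_dict({
    "audio": ["data/clip_001.wav", "data/clip_002.wav"],
    "transcript": ["first example transcription", "second example transcription"],
})

# Decode audio lazily and resample to 16 kHz, the rate Whisper expects
train_dataset = train_dataset.cast_column("audio", Audio(sampling_rate=16000))
```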
### Tokenization

The notebook includes a custom tokenization function to preprocess the dataset. As written in the original snippet, the audio was passed to the text tokenizer; for Whisper, audio goes through the feature extractor and transcripts through the tokenizer, roughly as follows (`openai/whisper-small` is an illustrative checkpoint, not necessarily the one the notebook loads):
```python
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small")

def encode_audio(example):
    audio = example["audio"]  # decoded by datasets: {"array": ..., "sampling_rate": ...}
    example["input_features"] = processor.feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    example["labels"] = processor.tokenizer(example["transcript"]).input_ids
    return example

train_dataset = train_dataset.map(encode_audio)
```
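Because `input_features` and `labels` have different padding needs, Whisper fine-tuning typically uses a small custom collator. A minimal sketch, assuming the field names produced by `encode_audio` above:

```python
from dataclasses import dataclass
from transformers import WhisperProcessor

@dataclass
class SpeechDataCollator:
    processor: WhisperProcessor

    def __call__(self, features):
        # Pad the log-Mel inputs into a single batch tensor
        inputs = [{"input_features": f["input_features"]} for f in features]
        batch = self.processor.feature_extractor.pad(inputs, return_tensors="pt")
        # Pad the labels, then mask padding out of the loss with -100
        labels = [{"input_ids": f["labels"]} for f in features]
        padded = self.processor.tokenizer.pad(labels, return_tensors="pt")
        batch["labels"] = padded["input_ids"].masked_fill(padded["attention_mask"].eq(0), -100)
        return batch
```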
### Training

Run the training cells in the Jupyter Notebook to start fine-tuning the Whisper model. Model checkpoints are saved during training.
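In `transformers` terms, the training step presumably resembles this sketch (it reuses `training_args`, `train_dataset`, `SpeechDataCollator`, and `processor` from the snippets above; `eval_dataset` is an assumed held-out split prepared the same way):

```python
from transformers import WhisperForConditionalGeneration, Seq2SeqTrainer

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,                      # Seq2SeqTrainingArguments from above
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,               # assumed held-out split
    data_collator=SpeechDataCollator(processor),
    compute_metrics=compute_metrics,         # defined in the evaluation sketch below
)
trainer.train()
```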
### Evaluation

Evaluation occurs at the end of each epoch, and model checkpoints are saved automatically within the notebook.
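The standard ASR metric here is word error rate (WER). A sketch using the `evaluate` library, with field names following the preprocessing above:

```python
import evaluate

wer_metric = evaluate.load("wer")

def compute_metrics(pred):
    label_ids = pred.label_ids
    # Restore padding tokens that were masked with -100 before decoding
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id
    pred_str = processor.batch_decode(pred.predictions, skip_special_tokens=True)
    label_str = processor.batch_decode(label_ids, skip_special_tokens=True)
    return {"wer": wer_metric.compute(predictions=pred_str, references=label_str)}
```

This relies on `predict_with_generate=True` in the training arguments, so that predictions arrive as generated token IDs rather than raw logits.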
### Inference

After training, use the fine-tuned model to make predictions on new audio data, which is also covered in the notebook.
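A minimal transcription sketch (the checkpoint path and audio file are placeholders; point them at a saved checkpoint from the `model/` directory and your own audio):

```python
import torchaudio
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Placeholder checkpoint path; use an actual checkpoint saved during training
model = WhisperForConditionalGeneration.from_pretrained("model/checkpoint-1000")
processor = WhisperProcessor.from_pretrained("model/checkpoint-1000")

waveform, sr = torchaudio.load("data/clip_001.wav")  # placeholder audio file
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)

inputs = processor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")
predicted_ids = model.generate(inputs.input_features)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```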
## Project Structure

```
speech_to_text/
├── data/                      # Dataset (audio and transcriptions)
├── logs/                      # Logs for experiment tracking (via Weights & Biases)
├── model/                     # Fine-tuned model checkpoints
├── whisper_finetuning.ipynb   # Jupyter notebook for training and evaluation
├── README.md                  # Project documentation
└── ...                        # Other helper files and scripts
```
## License

This project is licensed under the MIT License - see the LICENSE file for details.