Whisper Fine-Tuning Project

This project fine-tunes OpenAI's Whisper model for automatic speech recognition (ASR) on the Persian Common Voice dataset. The goal is to improve Whisper's speech-to-text accuracy in Persian through transfer learning and careful tuning of model parameters. Fine-tuning, training, and evaluation are all driven from a Jupyter Notebook built on the Hugging Face transformers library.

Features

  • Fine-tuning Whisper for speech-to-text tasks on custom audio data.
  • Supports mixed precision training (fp16) for faster training on GPUs.
  • Integration with Weights & Biases for experiment tracking and monitoring.
  • Preprocessing of both audio (feature extraction) and text (tokenization) inputs for model training.
  • Evaluation during each epoch to track performance improvements.
  • Model checkpoint saving to allow resumption of training.

Requirements
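
The notebook relies on the Hugging Face stack referenced throughout this README. The package list below is inferred from the notebook's usage rather than a pinned requirements file, so adjust versions to your environment:

pip install torch transformers datasets jupyter wandb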

Installation

Step 1: Clone the repository

git clone https://github.com/AliiAhmadi/speech_to_text.git
cd speech_to_text/

Step 2: Setup Weights & Biases (optional)

For experiment tracking, create a W&B account and set up the API key:

wandb login
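
Inside the notebook, a tracked run can then be started before training. This is a minimal sketch; the project name is illustrative, not a value taken from the notebook:

import wandb

# Start a tracked run; metrics logged during training will appear under this project.
wandb.init(project="whisper-finetuning")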

Usage

Step 1: Open the Jupyter Notebook

This project uses a Jupyter Notebook for fine-tuning the Whisper model. Open the notebook file whisper_finetuning.ipynb in your Jupyter environment.

Step 2: Prepare the Dataset

Ensure you have a dataset for fine-tuning. You can use any ASR dataset in the correct format (e.g., audio files with transcriptions). The dataset should have fields such as audio and transcript.
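
To reproduce the Persian setup, the Common Voice corpus can be pulled from the Hugging Face Hub. The dataset ID and column names below are assumptions based on the public Common Voice releases, not values confirmed by this repository (Common Voice stores its transcription in a sentence column, which you may need to rename to transcript to match the notebook):

from datasets import load_dataset, Audio

# Persian ("fa") split of Common Voice; requires accepting the dataset's terms on the Hub.
train_dataset = load_dataset("mozilla-foundation/common_voice_11_0", "fa", split="train")

# Whisper expects 16 kHz audio; resample on the fly.
train_dataset = train_dataset.cast_column("audio", Audio(sampling_rate=16_000))

# Align the column name with what the notebook expects.
train_dataset = train_dataset.rename_column("sentence", "transcript")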

Step 3: Tokenize the Dataset

The notebook includes a custom preprocessing function for the audio inputs. For Whisper, audio is converted into log-Mel input features by the processor's feature extractor, while the tokenizer is reserved for the transcripts (the openai/whisper-small checkpoint name below is illustrative):

from transformers import WhisperProcessor

# The processor bundles the feature extractor (audio) and the tokenizer (text).
processor = WhisperProcessor.from_pretrained("openai/whisper-small")

def encode_audio(examples):
    # Each "audio" entry is a dict holding the decoded waveform and its sampling rate.
    audio_arrays = [a["array"] for a in examples["audio"]]
    features = processor.feature_extractor(audio_arrays, sampling_rate=16_000)
    return {"input_features": features.input_features}

train_dataset = train_dataset.map(encode_audio, batched=True)
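
The transcripts are encoded separately into label ids for the decoder. A minimal sketch, assuming the transcript column name from Step 2:

def encode_text(examples):
    # Tokenize the target transcriptions into label ids.
    return {"labels": processor.tokenizer(examples["transcript"]).input_ids}

train_dataset = train_dataset.map(encode_text, batched=True)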

Step 4: Fine-Tune the Model

Run the training cells in the Jupyter Notebook to start fine-tuning the Whisper model. Model checkpoints are saved during training.
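
The training setup follows the standard Seq2SeqTrainer recipe from transformers. All hyperparameters below are illustrative defaults rather than values taken from the notebook, eval_dataset is assumed to be prepared the same way as train_dataset, and in practice a padding data collator for input features and labels is also needed:

from transformers import (
    WhisperForConditionalGeneration,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

training_args = Seq2SeqTrainingArguments(
    output_dir="model",              # checkpoints go here (see Project Structure)
    per_device_train_batch_size=16,
    learning_rate=1e-5,
    num_train_epochs=3,
    fp16=True,                       # mixed precision training on GPU
    evaluation_strategy="epoch",     # evaluate at the end of every epoch
                                     # (renamed eval_strategy in newer transformers)
    save_strategy="epoch",           # save checkpoints so training can resume
    report_to="wandb",               # optional Weights & Biases logging
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()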

Step 5: Evaluate the Model

Evaluation occurs at the end of each epoch, and model checkpoints are saved automatically within the notebook.
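
Word error rate (WER) is the usual metric for ASR evaluation. A minimal sketch using the Hugging Face evaluate library (an extra dependency, not confirmed by this README):

import evaluate

wer_metric = evaluate.load("wer")

# predictions and references are lists of decoded transcription strings.
wer = wer_metric.compute(
    predictions=["transcribed hypothesis"],
    references=["reference transcript"],
)
print(f"WER: {wer:.3f}")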

Step 6: Model Inference

After training, use the fine-tuned model to make predictions on new audio data, which is also covered in the notebook.
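
As a quick sanity check, a saved checkpoint can be wrapped in a transformers pipeline. The checkpoint directory and audio filename below are placeholders:

from transformers import pipeline

# Point the pipeline at a checkpoint directory produced during training.
asr = pipeline("automatic-speech-recognition", model="model/checkpoint-500")

result = asr("sample.wav")
print(result["text"])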

Project Structure

speech_to_text/
├── data/                     # Dataset (audio and transcriptions)
├── logs/                     # Logs for experiment tracking (via Weights & Biases)
├── model/                    # Fine-tuned model checkpoints
├── whisper_finetuning.ipynb  # Jupyter notebook for training and evaluation
├── README.md                 # Project documentation
└── ...                       # Other helper files and scripts

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgements
