CPSC 440 - Computer Systems Architecture
Jerry Liu, Sean Del Castillo
This program listens to a microphone device, writes the recording to a WAV file, then uses OpenAI Whisper to convert the audio to text.
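The "write to WAV" step can be sketched with Python's standard `wave` module. This is a minimal illustration, not the code in driver.py: the function names, the synthesized tone standing in for microphone input, and the 16 kHz, 16-bit mono format are all assumptions made for the example.

```python
import math
import struct
import tempfile
import wave
from pathlib import Path

SAMPLE_RATE = 16_000  # assumed sample rate for this sketch


def synth_tone(duration_s=0.5, freq_hz=440.0, sample_rate=SAMPLE_RATE):
    """Stand-in for microphone input: a sine tone as floats in [-1.0, 1.0]."""
    count = int(sample_rate * duration_s)
    return [0.5 * math.sin(2 * math.pi * freq_hz * n / sample_rate) for n in range(count)]


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write float samples as 16-bit mono PCM."""
    # Clip to [-1.0, 1.0], then scale to the signed 16-bit range.
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    frames = struct.pack(f"<{len(ints)}h", *ints)
    with wave.open(str(path), "wb") as wav_file:
        wav_file.setnchannels(1)        # mono
        wav_file.setsampwidth(2)        # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(frames)


def read_wav_info(path):
    """Return (channels, sample width in bytes, sample rate, frame count)."""
    with wave.open(str(path), "rb") as wav_file:
        return (wav_file.getnchannels(), wav_file.getsampwidth(),
                wav_file.getframerate(), wav_file.getnframes())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "tone.wav"
        write_wav(target, synth_tone())
        print(read_wav_info(target))
```

In the real program the samples would come from the microphone rather than from `synth_tone`.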
OpenAI Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. A dataset of this size and variety makes the model more robust to accents, background noise, and technical terminology. It also supports transcription in several languages, as well as translation from those languages into English.
A C++ port of Whisper is included in this repo under the MIT license. The inference code was written by Georgi Gerganov, and the original repo can be found here: https://github.com/ggerganov/whisper.cpp
The hardware driver depends on the GPIOZero library (https://github.com/gpiozero/gpiozero). Hardware must support this library and have at least 11 GPIO pins.
To install all required Python packages:
pip install -r requirements.txt
The sounddevice library requires PortAudio, which is not bundled on Linux:
sudo apt install libportaudio2
To install libav, see this page for general requirements and platform-specific installation instructions: https://wiki.libav.org/Platform
You might also need to install ffmpeg, an open-source command-line tool for transcoding multimedia files:
sudo apt install ffmpeg
Make sure to create two subdirectories, ./recordings and ./transcriptions, to store the output files.
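If you prefer to create the output directories from Python rather than by hand, a small helper along these lines would work. The function name `ensure_output_dirs` is hypothetical and is not part of driver.py; the directory names are the ones given above.

```python
import tempfile
from pathlib import Path


def ensure_output_dirs(base_dir=".", names=("recordings", "transcriptions")):
    """Create each output directory under base_dir if it does not exist yet."""
    created = []
    for name in names:
        directory = Path(base_dir) / name
        # exist_ok=True makes the call safe to repeat on every start-up.
        directory.mkdir(parents=True, exist_ok=True)
        created.append(directory)
    return created


if __name__ == "__main__":
    # Demonstrate in a throwaway directory instead of the working directory.
    with tempfile.TemporaryDirectory() as tmp:
        for directory in ensure_output_dirs(tmp):
            print(directory.name, directory.is_dir())
```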
- Start driver.py
python3 driver.py
- Press and hold the recording_button to record
- Release the button to start the transcription process
- Recorded .wav files are stored in /recordings and the matching transcriptions are stored in /transcripts
- To exit, send a KeyboardInterrupt (Ctrl+C) while the program is in the listening state
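The usage steps above describe three states: listening, recording, and transcribing. The sketch below models those transitions as a small state machine. It is an illustration only: the class and method names are hypothetical, and it does not touch GPIO, audio, or Whisper.

```python
from enum import Enum


class State(Enum):
    LISTENING = "listening"
    RECORDING = "recording"
    TRANSCRIBING = "transcribing"


class RecorderStateMachine:
    """Hypothetical model of the press / release / transcribe cycle."""

    def __init__(self):
        self.state = State.LISTENING

    def button_pressed(self):
        # Holding the button starts a recording, but only from the listening state.
        if self.state is State.LISTENING:
            self.state = State.RECORDING

    def button_released(self):
        # Releasing the button ends the recording and starts transcription.
        if self.state is State.RECORDING:
            self.state = State.TRANSCRIBING

    def transcription_done(self):
        # Once the transcription is written, go back to listening.
        if self.state is State.TRANSCRIBING:
            self.state = State.LISTENING


if __name__ == "__main__":
    machine = RecorderStateMachine()
    for event in (machine.button_pressed, machine.button_released, machine.transcription_done):
        event()
        print(machine.state.value)
```

Ignoring events that do not apply to the current state, such as a button press during transcription, is a design choice of this sketch; the actual driver may behave differently.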
[ ] Unlit [*] Blinking [X] Lit
[ ] <- Recording LED
[ ] <- Caution LED
[ ] <- Graph LEDs
[ ] <-/ /
[ ] <--/
[ ] <- Power LED
- Listening
[ ] <- Recording LED
[ ] <- Caution LED
[ ] <- Graph LEDs
[ ] <-/ /
[ ] <--/
[X] <- Power LED
- Recording
[X] <- Recording LED
[ ] <- Caution LED
[ ] <- Graph LEDs
[ ] <-/ /
[ ] <--/
[X] <- Power LED
- Transcribing
[*] <- Recording LED
[*] <- Caution LED
[ ] <- Graph LEDs
[ ] <-/ /
[ ] <--/
[ ] <- Power LED
- Graph LED states
[ ] <- Recording LED
[X] <- Caution LED
[X] <- Graph LEDs
[X] <-/ /
[X] <--/
[ ] <- Power LED
- The three Graph LEDs are pulse-width modulated (PWM) with a maximum of 20 values. They track how many recordings are in /recordings. If all three Graph LEDs are fully lit and the Caution LED is lit, the Graph LEDs have reached the maximum number of files they can track. Recording and transcribing are unaffected by this maximum.
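One way to map a recording count onto the three Graph LEDs is sketched below. It rests on assumptions that the description above does not confirm: that each LED has 20 brightness steps, that the LEDs fill one after another, and that the Caution LED turns on once all three are full (60 recordings under these assumptions). The function names are hypothetical.

```python
from pathlib import Path

STEPS_PER_LED = 20   # assumption: 20 brightness steps per Graph LED
NUM_GRAPH_LEDS = 3


def count_recordings(directory="recordings"):
    """Count the .wav files in the recordings directory (0 if it is missing)."""
    path = Path(directory)
    return len(list(path.glob("*.wav"))) if path.is_dir() else 0


def graph_led_values(recording_count):
    """Return ([v1, v2, v3], caution) with each value in the range 0.0 to 1.0."""
    capacity = STEPS_PER_LED * NUM_GRAPH_LEDS
    clamped = max(0, min(recording_count, capacity))
    values = []
    for index in range(NUM_GRAPH_LEDS):
        # Steps left for this LED after the earlier LEDs have filled up.
        steps = max(0, min(clamped - index * STEPS_PER_LED, STEPS_PER_LED))
        values.append(steps / STEPS_PER_LED)
    # Assumption: Caution lights once the graph cannot show any more files.
    caution = recording_count >= capacity
    return values, caution


if __name__ == "__main__":
    print(graph_led_values(count_recordings()))
```

Values in the 0.0 to 1.0 range fit the `value` property of a gpiozero `PWMLED`, so each one could be assigned to an LED directly.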