A local implementation of the Kokoro Text-to-Speech model, featuring dynamic module loading, automatic dependency management, and a web interface.
- Local text-to-speech synthesis using the Kokoro-82M model
- Multiple voice support with easy voice selection (31 voices available)
- Automatic model and voice downloading from Hugging Face
- Phoneme output support and visualization
- Interactive CLI and web interface
- Voice listing functionality
- Cross-platform support (Windows, Linux, macOS)
- Real-time generation progress display
- Multiple output formats (WAV, MP3, AAC)
- Python 3.8 or higher
- FFmpeg (optional, for MP3/AAC conversion)
- CUDA-compatible GPU (optional, for faster generation)
- Git (for version control and package management)
- Clone the repository and create a Python virtual environment:
# Windows
python -m venv venv
.\venv\Scripts\activate
# Linux/macOS
python3 -m venv venv
source venv/bin/activate
- Install dependencies:
pip install -r requirements.txt
The system will automatically download required models and voice files on first run.
You can use either the command-line interface or the web interface:
Run the interactive CLI:
python tts_demo.py
The CLI provides an interactive menu with the following options:
- List available voices - Shows all available voice options
- Generate speech - Interactive process to:
- Select a voice from the numbered list
- Enter text to convert to speech
- Adjust speech speed (0.5-2.0)
- Exit - Quit the program
Example session:
=== Kokoro TTS Menu ===
1. List available voices
2. Generate speech
3. Exit
Select an option (1-3): 2
Available voices:
1. af_alloy
2. af_aoede
3. af_bella
...
Select a voice number (or press Enter for default 'af_bella'): 3
Enter the text you want to convert to speech
(or press Enter for default text)
> Hello, world!
Enter speech speed (0.5-2.0, default 1.0): 1.2
Generating speech for: 'Hello, world!'
Using voice: af_bella
Speed: 1.2x
...
For a more user-friendly experience, launch the web interface:
python gradio_interface.py
Then open your browser to the URL shown in the console (typically http://localhost:7860).
The web interface provides:
- Easy voice selection from a dropdown menu
- Text input field with examples
- Real-time generation progress
- Audio playback in the browser
- Multiple output format options (WAV, MP3, AAC)
- Download options for generated audio
The system includes 31 different voices across various categories:
-
Female (af_*):
- af_alloy: Alloy - Clear and professional
- af_aoede: Aoede - Smooth and melodic
- af_bella: Bella - Warm and friendly
- af_jessica: Jessica - Natural and engaging
- af_kore: Kore - Bright and energetic
- af_nicole: Nicole - Professional and articulate
- af_nova: Nova - Modern and dynamic
- af_river: River - Soft and flowing
- af_sarah: Sarah - Casual and approachable
- af_sky: Sky - Light and airy
-
Male (am_*):
- am_adam: Adam - Strong and confident
- am_echo: Echo - Resonant and clear
- am_eric: Eric - Professional and authoritative
- am_fenrir: Fenrir - Deep and powerful
- am_liam: Liam - Friendly and conversational
- am_michael: Michael - Warm and trustworthy
- am_onyx: Onyx - Rich and sophisticated
- am_puck: Puck - Playful and energetic
-
Female (bf_*):
- bf_alice: Alice - Refined and elegant
- bf_emma: Emma - Warm and professional
- bf_isabella: Isabella - Sophisticated and clear
- bf_lily: Lily - Sweet and gentle
-
Male (bm_*):
- bm_daniel: Daniel - Polished and professional
- bm_fable: Fable - Storytelling and engaging
- bm_george: George - Classic British accent
- bm_lewis: Lewis - Modern British accent
-
French Female (ff_*):
- ff_siwis: Siwis - French accent
-
High-pitched Voices:
- Female (hf_*):
- hf_alpha: Alpha - Higher female pitch
- hf_beta: Beta - Alternative high female pitch
- Male (hm_*):
- hm_omega: Omega - Higher male pitch
- hm_psi: Psi - Alternative high male pitch
- Female (hf_*):
.
├── .cache/ # Cache directory for downloaded models
│ └── huggingface/ # Hugging Face model cache
├── .git/ # Git repository data
├── .gitignore # Git ignore rules
├── __pycache__/ # Python cache files
├── voices/ # Voice model files (downloaded on demand)
│ └── *.pt # Individual voice files
├── venv/ # Python virtual environment
├── outputs/ # Generated audio files directory
├── LICENSE # Apache 2.0 License file
├── README.md # Project documentation
├── models.py # Core TTS model implementation
├── gradio_interface.py # Web interface implementation
├── config.json # Model configuration file
├── requirements.txt # Python dependencies
└── tts_demo.py # CLI implementation
The project uses the latest Kokoro model from Hugging Face:
- Repository: hexgrad/Kokoro-82M
- Model file:
kokoro-v1_0.pth
(downloaded automatically) - Sample rate: 24kHz
- Voice files: Located in the
voices/
directory (downloaded automatically) - Available voices: 31 voices across multiple categories
- Languages: American English ('a'), British English ('b')
- Model size: 82M parameters
Common issues and solutions:
-
Model Download Issues
- Ensure stable internet connection
- Check Hugging Face is accessible
- Verify sufficient disk space
- Try clearing the
.cache/huggingface
directory
-
CUDA/GPU Issues
- Verify CUDA installation with
nvidia-smi
- Update GPU drivers
- Check PyTorch CUDA compatibility
- Fall back to CPU if needed
- Verify CUDA installation with
-
Audio Output Issues
- Check system audio settings
- Verify output directory permissions
- Install FFmpeg for MP3/AAC support
- Try different output formats
-
Voice File Issues
- Delete and let system redownload voice files
- Check
voices/
directory permissions - Verify voice file integrity
- Try using a different voice
-
Web Interface Issues
- Check port 7860 availability
- Try different browser
- Clear browser cache
- Check network firewall settings
For any other issues:
- Check the console output for error messages
- Verify all prerequisites are installed
- Ensure virtual environment is activated
- Check system resource usage
- Try reinstalling dependencies
Feel free to contribute by:
- Opening issues for bugs or feature requests
- Submitting pull requests with improvements
- Helping with documentation
- Testing different voices and reporting issues
- Suggesting new features or optimizations
- Testing on different platforms and reporting results
Apache 2.0 - See LICENSE file for details