This repository provides a Python implementation for detecting singing segments within audio files. It uses a classification-based approach with Convolutional Neural Networks (CNNs) to analyze audio data and identify sections containing singing. The output includes timestamps for detected singing segments in both seconds and hh:mm:ss format.
- Detects singing segments within an audio file
- Outputs results in JSON format with detailed timing information
- Configurable detection threshold and analysis stride
- Command-line interface for easy usage
- Minimum duration filtering for singing segments
- Comprehensive logging for better debugging
To install the dependencies, ensure you have Python installed and run:
```
pip install -r requirements.txt
```
The main dependencies include:
- `tensorflow` and `keras` for model handling
- `librosa` for audio feature extraction
- `numpy` for numerical operations
- `madmom` (custom fork for Python 3.10 compatibility)

Refer to the `requirements.txt` file for detailed package versions.
The script can be run from the command line with various options:

```
python SVAD.py --file path/to/audio_file.wav [--threshold 0.5] [--stride 5] [--output path/to/output.json]
```

- `--file`: Path to the audio file (required)
- `--threshold`: Detection threshold (default: 0.5)
- `--stride`: Stride for feature extraction (default: 5)
- `--output`: Output JSON file path (default: `./results/singing_segments.json`)
To integrate the detection functionality within another Python script:
```python
from SVAD import Options, load_model, predict_singing_segments, process_predictions

# Initialize options
options = Options(threshold=0.5, stride=5)

# Load the model
model = load_model('./weights/SVAD_CNN_ML.hdf5')

# Process audio file
file_path = './data/your_audio_file.wav'
predictions = predict_singing_segments(file_path, model, options)

# Get singing segments
segments = process_predictions(predictions, options, min_duration=1.0)
```
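The resulting segments can then be written to disk. A minimal sketch (the `segments` value here is illustrative sample data, and the output path mirrors the CLI default):

```python
import json
import os

# Illustrative sample data; in practice this comes from process_predictions().
segments = [{"start": "21.900", "end": "22.950", "duration": "1.050"}]

# Create the output directory if needed and save the segments as JSON.
os.makedirs("./results", exist_ok=True)
with open("./results/singing_segments.json", "w") as f:
    json.dump(segments, f, indent=2)
```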
The detection results are saved in JSON format, with detailed timing information for each segment:
```json
[
  {
    "start": "21.900",
    "end": "22.950",
    "duration": "1.050",
    "start_hhmmss": "0:00:21.900",
    "end_hhmmss": "0:00:22.950"
  }
]
```
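The `*_hhmmss` fields are the second values rendered as h:mm:ss.mmm. A small sketch of that conversion (the helper name is an illustration, not part of the repository's API):

```python
def to_hhmmss(seconds: float) -> str:
    """Render a time in seconds as h:mm:ss.mmm, e.g. 21.9 -> '0:00:21.900'."""
    whole = int(seconds)
    hours, rem = divmod(whole, 3600)
    minutes, secs = divmod(rem, 60)
    millis = round((seconds - whole) * 1000)
    return f"{hours}:{minutes:02d}:{secs:02d}.{millis:03d}"

print(to_hhmmss(21.9))   # 0:00:21.900
print(to_hhmmss(22.95))  # 0:00:22.950
```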
The detection system can be configured through the `Options` class:

- `threshold`: Probability threshold for classifying singing segments (default: 0.5)
- `stride`: Step size in frames for analysis (default: 5)
- `min_duration`: Minimum duration for a valid singing segment (default: 1.0 seconds)
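For reference, the configuration surface above could be modeled as a simple dataclass. This is an illustrative sketch only; the actual `Options` class is defined in `SVAD.py` and may differ:

```python
from dataclasses import dataclass

@dataclass
class Options:
    # Illustrative sketch of the configuration options listed above;
    # the real class lives in SVAD.py.
    threshold: float = 0.5     # probability threshold for "singing"
    stride: int = 5            # analysis step size in frames
    min_duration: float = 1.0  # minimum segment length in seconds
```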
- `SVAD.py`: Main script with command-line interface and core functionality
- `model_SVAD.py`: CNN model architecture definition
- `load_feature.py`: Audio processing and feature extraction
- `weights/SVAD_CNN_ML.hdf5`: Pre-trained model weights
The system uses a Convolutional Neural Network (CNN) to classify audio segments as singing or non-singing. The process involves:
- Audio feature extraction using sliding windows
- CNN-based classification of each window
- Post-processing to combine adjacent positive detections
- Filtering out segments shorter than the minimum duration
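The post-processing steps (combining adjacent positive windows and filtering short segments) can be sketched in a few lines. This is a simplified illustration, not the repository's actual implementation; the function name and the per-window hop are assumptions:

```python
def predictions_to_segments(probs, hop_seconds, threshold=0.5, min_duration=1.0):
    """Merge consecutive above-threshold windows into (start, end) segments,
    then drop segments shorter than min_duration. Illustrative sketch only."""
    segments, start = [], None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i * hop_seconds                 # open a new segment
        elif p < threshold and start is not None:
            end = i * hop_seconds                   # close the segment
            if end - start >= min_duration:
                segments.append((start, end))
            start = None
    if start is not None:                           # segment runs to the end
        end = len(probs) * hop_seconds
        if end - start >= min_duration:
            segments.append((start, end))
    return segments

# Toy example: 0.5 s windows; the second positive run is too short and is dropped.
probs = [0.1, 0.8, 0.9, 0.7, 0.2, 0.6, 0.1]
print(predictions_to_segments(probs, hop_seconds=0.5))  # [(0.5, 2.0)]
```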
```
.
├── SVAD.py
├── model_SVAD.py
├── load_feature.py
├── requirements.txt
├── weights/
│   └── SVAD_CNN_ML.hdf5
└── results/
    └── singing_segments.json
```
This project is based on the original work developed by researchers at the Korea Advanced Institute of Science and Technology (KAIST). Please refer to the original license terms as specified in the LICENSE file included in this repository.
This project is a modified version of the original algorithm developed by:
- Sangeun Kum: [email protected]
- Juhan Nam: [email protected]
For further inquiries regarding the original research and algorithm, please contact the authors above at KAIST.
The system includes comprehensive error handling and logging:
- Validates model weight file existence
- Creates output directories automatically
- Provides informative logging messages
- Handles common runtime errors gracefully
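The first two checks above could look roughly like this. A minimal sketch under the assumption that validation happens before inference; the function and logger names are illustrative:

```python
import logging
import os

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("SVAD")

def prepare_paths(weights_path, output_path):
    """Validate that the model weights exist and ensure the output
    directory is present (illustrative sketch of the checks above)."""
    if not os.path.isfile(weights_path):
        raise FileNotFoundError(f"Model weights not found: {weights_path}")
    out_dir = os.path.dirname(output_path) or "."
    os.makedirs(out_dir, exist_ok=True)  # create output directory automatically
    log.info("Results will be written to %s", output_path)
```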