Recognition Results Explanation
For details on the recognition techniques and implementations, see Recognition Techniques Discussion.
There are four different types of recognitions implemented. FingerprintRecognizer uses the fingerprinting technique. CorrelationRecognizer uses cross-correlation with the audio arrays. CorrelationSpectrogramRecognizer uses cross-correlation with the audio spectrograms. VisualRecognizer uses either ssim or mse with the audio spectrograms.
The target file is the first file passed into the recognition functions. It is also the only file passed directly into the recognition_directory functions. All other files are "against" files.
Results are ordered by recognition strength, so the first index of each list is always the best alignment between those files.
Formula for seconds to frames: seconds // (config.fft_window_size / sample_rate * (1 - config.DEFAULT_OVERLAP_RATIO))
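As a rough illustration of the formula above, the conversion can be written as a small function. The values used here (a 4096-sample FFT window, a 44100 Hz sample rate, and a 0.5 overlap ratio) are assumptions picked for the example, not values read from the library's config.

```python
def seconds_to_frames(seconds, fft_window_size, sample_rate, overlap_ratio):
    """Convert a duration in seconds to a whole number of spectrogram frames."""
    # Seconds of new audio that each spectrogram frame advances by.
    seconds_per_frame = fft_window_size / sample_rate * (1 - overlap_ratio)
    # Floor division mirrors the `//` in the formula.
    return int(seconds // seconds_per_frame)


# Assumed example values, for illustration only.
frames = seconds_to_frames(10, fft_window_size=4096, sample_rate=44100, overlap_ratio=0.5)
print(frames)  # 215
```

Because of the floor division, converting to frames and back to seconds generally loses a fraction of a frame.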
- match_time
- match_info
match_time contains the time in seconds it took to do the recognition. This stage of recognize is very fast if the target file has been
match_info contains all of the actual alignment information. Each key in match_info is a filename that was recognized against. This is the second filename given for all recognitions other than recognize. For a directory recognition, the keys are every file in the given directory that is not the target file and that resulted in a positive match.
{
"match_time": float,
"match_info": {
against_filename : {
...
}
},
"rankings" : {
"match_info" : {
"against_file" : int,
...
}
}
}
Rankings are from 1-10 with 10 being the best alignment. They are not proof of a good or bad alignment, but they give a good indication of the alignment quality.
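A minimal sketch of reading this structure, assuming a results dictionary shaped like the outline above. The `sample_results` data and the `summarize` helper are made up for illustration; they are not part of the library.

```python
def summarize(results):
    """Return the best offset and ranking for every against file."""
    rankings = results.get("rankings", {}).get("match_info", {})
    summary = {}
    for against_name, info in results["match_info"].items():
        summary[against_name] = {
            # Index 0 is the strongest alignment for this file.
            "best_offset_seconds": info["offset_seconds"][0],
            # Ranking is 1-10, or None if no ranking was recorded.
            "ranking": rankings.get(against_name),
        }
    return summary


# Hand-written example data, not real recognition output.
sample_results = {
    "match_time": 0.42,
    "match_info": {
        "clip_a.wav": {"offset_seconds": [12.5, -3.0]},
        "clip_b.wav": {"offset_seconds": [-1.25]},
    },
    "rankings": {"match_info": {"clip_a.wav": 9, "clip_b.wav": 4}},
}

for name, row in summarize(sample_results).items():
    print(name, row["best_offset_seconds"], row["ranking"])
```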
Every against_filename dictionary has the key "offset_seconds".
Where locality is given, locality tuples are of the form (target_offset, against_offset, confidence). Target_offset is the location of the locality window in the target_file. Against_offset is the location of the locality window in the against file. Confidence is the confidence measure between those specific locality windows.
Each alignment offset between files has its own list of locality tuples. If locality is not given, each list is replaced with None.
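The locality layout described above can be walked like this. This is a sketch only: the `strongest_window` helper and the sample entry are hypothetical, and it assumes each list holds (target_offset, against_offset, confidence) tuples, or None when locality was not requested.

```python
def strongest_window(against_entry, index=0, key="locality_seconds"):
    """Return the highest-confidence locality tuple for one alignment offset.

    Returns None when locality data is missing for that offset.
    """
    localities = against_entry.get(key)
    if not localities or index >= len(localities):
        return None
    windows = localities[index]
    if not windows:  # None (locality not given) or an empty list
        return None
    # Each tuple is (target_offset, against_offset, confidence).
    return max(windows, key=lambda window: window[2])


# Hand-written example entry, not real recognition output.
entry = {
    "offset_seconds": [12.5, -3.0],
    "locality_seconds": [
        [(0.0, 12.5, 14), (10.0, 22.5, 31)],
        None,
    ],
}

print(strongest_window(entry, index=0))  # (10.0, 22.5, 31)
print(strongest_window(entry, index=1))  # None
```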
In the explanations below, I start with against_filename, since "match_time" and "match_info" are common to every alignment.
FingerprintRecognizer
against_filename : {
"offset_seconds" : [float, float,...],
"offset_frames" : [int, int,...],
"confidence" : [int, int,...],
"locality_seconds" : [
[(t_offset, a_offset, confidence), ...],
[(t_offset, a_offset, confidence), ...],
...],
"locality_frames" : [
[(t_offset, a_offset, confidence), ...],
...],
"locality_frames_setting" : int
}
- offset_seconds: offset in seconds between the two files. Positive offset means against file starts x seconds after target file.
- offset_frames: offset in spectrogram frames.
- confidence: number of fingerprints that match between the files.
- locality_seconds: locality tuples in seconds.
- locality_frames: locality tuples in spectrogram frames.
- locality_frames_setting: locality setting after conversion to frames and back to seconds.
The locality tuples here are the center of the locality windows. The windows are of variable length not to exceed the given locality setting. All fingerprints are calculated first, then windows of matching fingerprints are calculated.
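To make the sign convention of offset_seconds concrete, here is a sketch that turns best offsets into the amount of leading silence each file would need so that all files line up. The `leading_silence` helper and the file names are hypothetical, and this is not necessarily how the library performs alignment.

```python
def leading_silence(target_name, best_offsets):
    """Compute seconds of silence to prepend to each file.

    best_offsets maps each against filename to its best offset_seconds.
    A positive offset means the against file starts that many seconds
    after the target file.
    """
    # Start time of every file on a shared timeline, target at 0.
    starts = {target_name: 0.0}
    starts.update(best_offsets)
    earliest = min(starts.values())
    # The earliest file needs no padding; the rest are shifted relative to it.
    return {name: start - earliest for name, start in starts.items()}


# Assumed example offsets, for illustration only.
padding = leading_silence("target.wav", {"a.wav": 2.5, "b.wav": -1.0})
print(padding)  # {'target.wav': 1.0, 'a.wav': 3.5, 'b.wav': 0.0}
```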
CorrelationRecognizer
against_filename : {
"offset_seconds" : [float, float,...],
"offset_samples" : [int, int,...],
"confidence" : [float, float,...],
"locality_seconds" : [
[(t_offset, a_offset, confidence), ...],
...],
"locality_samples" : [
[(t_offset, a_offset, confidence), ...],
...],
"sample_rate" : int,
"scaling_factor" : float,
}
- offset_seconds: offset in seconds between the two files. Positive offset means against file starts x seconds after target file.
- offset_samples: offset in audio samples.
- confidence: From 0-1, peaks from normalized correlation.
- locality_seconds: locality tuples in seconds.
- locality_samples: locality tuples in audio samples.
- sample_rate: inputted sample rate.
- scaling_factor: max(correlation) / len(correlation) / 65536, where 65536 is the range of 16-bit audio.
The locality tuples here are the start of the locality windows. The windows are all the given width unless the audio file is shorter than the locality window, in which case the locality window is the length of the audio file.
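The fields above can be illustrated with a deliberately naive, pure-Python cross-correlation. This is a toy sketch under stated assumptions, not the library's implementation: the `correlate_offset` helper is hypothetical, the lag sign is chosen to match the "positive means the against file starts later" convention, and only the scaling_factor expression is taken from the text.

```python
def correlate_offset(target, against, sample_rate):
    """Find the best lag between two sample lists by brute-force correlation."""
    lags = list(range(-(len(against) - 1), len(target)))
    correlation = []
    for lag in lags:
        total = 0.0
        for n, against_sample in enumerate(against):
            t_index = n + lag  # against[n] is compared with target[n + lag]
            if 0 <= t_index < len(target):
                total += target[t_index] * against_sample
        correlation.append(total)

    peak = max(correlation)
    offset_samples = lags[correlation.index(peak)]
    return {
        "offset_samples": offset_samples,
        "offset_seconds": offset_samples / sample_rate,
        # Expression from the field description above.
        "scaling_factor": peak / len(correlation) / 65536,
    }


# Toy signals: the against pattern appears 3 samples into the target.
target = [0, 0, 0, 1, 2, 3, 0, 0]
against = [1, 2, 3, 0, 0]
result = correlate_offset(target, against, sample_rate=8000)
print(result["offset_samples"], result["offset_seconds"])  # 3 0.000375
```

A real implementation would use an FFT-based correlation, since the double loop here scales with the product of the two lengths.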
CorrelationSpectrogramRecognizer
against_filename : {
"offset_seconds" : [float, float,...],
"offset_frames" : [int, int,...],
"confidence" : [float, float,...],
"locality_seconds" : [
[(t_offset, a_offset, confidence), ...],
...],
"locality_frames" : [
[(t_offset, a_offset, confidence), ...],
...],
"sample_rate" : int,
"scaling_factor" : float,
}
- offset_seconds: offset in seconds between the two files. Positive offset means against file starts x seconds after target file.
- offset_frames: offset in spectrogram frames.
- confidence: From 0-1, peaks from normalized correlation.
- locality_seconds: locality tuples in seconds.
- locality_frames: locality tuples in spectrogram frames.
- sample_rate: inputted sample rate.
- scaling_factor: max(correlation) / len(correlation) / config.fft_window_size / 100. 200 is about the max of the spectrograms; 100 gives a slightly better range.
The locality tuples here are the start of the locality windows. The windows are all the given width unless the audio file is shorter than the locality window, in which case the locality window is the length of the audio file.
VisualRecognizer
against_filename : {
"offset_seconds" : [float, float,...],
"offset_frames" : [int, int,...],
"num_matches" : [int, int,...],
"ssim" : [float, float,...],
"mse" : [float, float,...],
}
- offset_seconds: offset in seconds between the two files. Positive offset means against file starts x seconds after target file.
- offset_frames: offset in spectrogram frames.
- num_matches: Number of windows with corresponding offset calculated.
- ssim: Average ssim for all windows with given offset.
- mse: Average mse for all windows with given offset. If calc_mse is False, defaults to 20000000.0. mse is not as good for this application as ssim, but it can be calculated if you're curious.
Locality is not implemented for this technique. VisualRecognizer is the most finicky of the recognitions, and it relies on averaging over many windows to reach an accurate result, so locality doesn't work as well here as it does for the others.
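The relationship between num_matches, ssim, and mse can be sketched as a simple group-and-average over per-window comparisons. The `aggregate_windows` helper and its input format are assumptions for illustration; only the 20000000.0 placeholder comes from the description above, and the output ordering here (by offset) is arbitrary.

```python
from collections import defaultdict
from statistics import mean

MSE_PLACEHOLDER = 20000000.0  # value reported when mse is not calculated


def aggregate_windows(window_results, calc_mse=False):
    """Group (offset_frames, ssim, mse) window scores by offset and average them."""
    grouped = defaultdict(list)
    for offset_frames, ssim, mse in window_results:
        grouped[offset_frames].append((ssim, mse))

    summary = {"offset_frames": [], "num_matches": [], "ssim": [], "mse": []}
    for offset in sorted(grouped):  # arbitrary but deterministic order
        scores = grouped[offset]
        summary["offset_frames"].append(offset)
        summary["num_matches"].append(len(scores))
        summary["ssim"].append(mean(s for s, _ in scores))
        summary["mse"].append(
            mean(m for _, m in scores) if calc_mse else MSE_PLACEHOLDER
        )
    return summary


# Made-up window scores, for illustration only.
windows = [(10, 0.9, 100.0), (10, 0.7, 300.0), (-4, 0.5, 50.0)]
print(aggregate_windows(windows, calc_mse=True))
```

With more windows per offset, a single noisy window has less influence on the averaged ssim.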
Have Fun Aligning!