This work presents a biometric system for speaker recognition using visual-only speech features. The dataset used can be found at AV DATASET and is openly available.
The libraries used by all the programs are listed in `requirements.txt` and can be installed with pip:

```
pip install -r requirements.txt
```
```
├── auxiliars
│   ├── faceDetection.py        # Detects faces in a frame
│   └── lipsExtraction.py       # Extracts the lip coordinates and saves the lip points (txt file) and images with the points and curves drawn on them
├── AVOriginalDataset           # Contains the AV dataset mentioned above
│   ├── Phrases
│   │   └── **/*                # Folders containing the .mp4 files and CSV timestamps
│   └── Digits
│       └── **/*                # Folders containing the .mp4 files and CSV timestamps
├── AVSegmentedDataset          # Contains the AV dataset, segmented by utterance
│   ├── Digits
│   │   ├── Normal              # Folders containing the .mp4 files, segmented by utterance and speech mode
│   │   ├── Whispered
│   │   └── Silent
│   └── Phrases
│       ├── Normal
│       ├── Whispered
│       └── Silent
├── LipsFrames                  # Contains all the generated lip images
│   └── **/*.jpg
├── modelsFaceRecognition       # Necessary files for the computer-vision functions; if not included, they can be found online (see the loading sketch after this tree)
│   ├── haarcascade_frontalface_alt.xml
│   ├── opencv_face_detector_uint8.pb
│   ├── opencv_face_detector.pbtxt
│   └── shape_predictor_68_face_landmarks.dat
├── AV_lips_coordinates_v0.txt  # Dictionary with the lip coordinates of all utterances (generated by lipsCoordExtraction.py)
├── featuresProcessing.py       # Functions that process the coordinates
├── hmm.py                      # Uses the features to generate the HMMs
├── README.md
├── lipsCoordExtraction.py      # Generates the lip coordinates and the lip images for all utterances
├── requirements.txt            # Libraries needed to run the programs
└── segmentVideos.py            # Segments the original videos into utterances using the CSV timestamps of each video
```
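For reference, this is a minimal sketch of how the files in `modelsFaceRecognition/` are typically loaded. The repository's own loading code lives in `auxiliars/`, so the calls below are the standard OpenCV and dlib entry points rather than an excerpt from it:

```python
import cv2
import dlib

# Haar cascade face detector (classic OpenCV)
haar = cv2.CascadeClassifier("modelsFaceRecognition/haarcascade_frontalface_alt.xml")

# DNN face detector (OpenCV: TensorFlow graph + its config file)
dnn = cv2.dnn.readNetFromTensorflow(
    "modelsFaceRecognition/opencv_face_detector_uint8.pb",
    "modelsFaceRecognition/opencv_face_detector.pbtxt",
)

# 68-point facial landmark predictor (dlib); points 48-67 outline the lips
predictor = dlib.shape_predictor("modelsFaceRecognition/shape_predictor_68_face_landmarks.dat")
```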
First, we run `segmentVideos.py`, which generates the videos separated by speech mode and specific utterance. The script uses the timestamps provided by the dataset to segment each utterance.

Note: the script only segments the phrases; to segment the digits, change the paths used in the script.
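As a sketch of what the script does, assuming moviepy 1.x and a CSV with per-utterance start/end times (the real column names and output paths are the ones defined in `segmentVideos.py`; the names below are placeholders):

```python
import csv
from moviepy.editor import VideoFileClip

def segment_video(video_path, csv_path, out_dir):
    """Cut one recording into per-utterance clips using its CSV timestamps."""
    clip = VideoFileClip(video_path)
    with open(csv_path, newline="") as f:
        # Hypothetical columns: utterance, start, end (seconds)
        for row in csv.DictReader(f):
            sub = clip.subclip(float(row["start"]), float(row["end"]))
            sub.write_videofile(f"{out_dir}/{row['utterance']}.mp4", audio=False)
    clip.close()
```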
Next, we execute:

```
python3 lipsCoordExtraction.py
```

This generates the files containing the lip coordinates of each frame for all videos. The number of coordinates, the face-detection algorithm, and the dataset are selected by editing the commented-out sections of the code.
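A minimal sketch of the per-frame extraction, assuming the dlib 68-landmark model included in `modelsFaceRecognition/` (in that layout, the lip points are indices 48-67); the actual loop over videos and the coordinate-count options live in `lipsCoordExtraction.py`:

```python
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("modelsFaceRecognition/shape_predictor_68_face_landmarks.dat")

def lip_coords(frame):
    """Return the 20 lip landmarks of the first detected face, or None."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    return [(shape.part(i).x, shape.part(i).y) for i in range(48, 68)]
```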
The feature-processing functions, in this case the normalization of the lip coordinates, can be found in `featuresProcessing.py`; they are called directly where needed during the HMM-generation step.
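One plausible normalization, shown only as a sketch; the actual functions in `featuresProcessing.py` may differ:

```python
import numpy as np

def normalize_lips(coords):
    """Make lip coordinates translation- and scale-invariant."""
    pts = np.asarray(coords, dtype=float)  # shape (20, 2)
    pts -= pts.mean(axis=0)                # center on the lip centroid
    scale = np.linalg.norm(pts, axis=1).max()
    return pts / scale if scale > 0 else pts  # bound coordinates to [-1, 1]
```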
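Similarly, a hedged sketch of HMM training with `hmmlearn` (the topology, number of states, and training-data layout actually used in `hmm.py` may differ):

```python
import numpy as np
from hmmlearn import hmm

def train_hmm(sequences, n_states=5):
    """Fit one Gaussian HMM on a list of per-utterance feature sequences."""
    X = np.vstack(sequences)               # all frames stacked row-wise
    lengths = [len(s) for s in sequences]  # frames per utterance
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=100)
    model.fit(X, lengths)
    return model  # score a new utterance with model.score(features)
```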