From video input to audio output, via object detection (YOLOv8, ONNX format), an LLM (ChatGPT, via API) and text-to-speech (fastspeech2-en-ljspeech). You can use a webcam, movie files or YouTube videos as input. Compatible with Mac and Windows, and probably Linux.
Demo video: github_sub_low.mov
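Roughly, the flow is: video frame, YOLOv8 detections, ChatGPT comment, spoken audio. The sketch below is only an illustration under assumptions, not the repository's code: the helper names (`detect_objects`, `speak`), the model name and the openai>=1.0 client are placeholders; in the project the detector runs YOLOv8 in ONNX format and the TTS is fastspeech2-en-ljspeech.

```python
# Minimal sketch of the pipeline; detect_objects and speak are stand-ins,
# NOT the repository's actual implementation.
import cv2
from openai import OpenAI

client = OpenAI(api_key="sk-...")  # the key you pass via -ok


def detect_objects(frame):
    """Stand-in for the YOLOv8 ONNX detector; returns detected class labels."""
    return ["person", "laptop"]  # placeholder output


def speak(text):
    """Stand-in for fastspeech2-en-ljspeech synthesis and audio playback."""
    print(f"[TTS] {text}")


def generate_comment(labels):
    """Ask ChatGPT for a short cynical remark about what was detected."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You make short, cynical remarks."},
            {"role": "user", "content": f"I can see: {', '.join(labels)}."},
        ],
    )
    return response.choices[0].message.content


def run(source=0, comment_every_n_frames=300):
    cap = cv2.VideoCapture(source)  # webcam index, file path or stream URL
    frame_idx = 0
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        labels = detect_objects(frame)
        if labels and frame_idx % comment_every_n_frames == 0:
            speak(generate_comment(labels))
        frame_idx += 1
    cap.release()
```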
Requires python==3.9
If you can leverage your GPU (i.e. all CUDA dependencies are installed), you can substitute onnxruntime with onnxruntime-gpu in requirements.txt.
Got it running with:
- NVIDIA CUDA Driver Version 11.5
- CuDNN library Version 8.3.0
- For Windows: Microsoft Visual C++ (MSVC) compiler
You can install the Python dependencies via
pip install -r requirements.txt
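If you switched to onnxruntime-gpu, a quick sanity check (not part of the project's code) is to confirm that the GPU build and the CUDA execution provider are actually picked up:

```python
# After installing onnxruntime-gpu, the CUDA provider should show up
# in the list of available execution providers.
import onnxruntime as ort

print(ort.get_device())               # "GPU" if the GPU build is active
print(ort.get_available_providers())  # should contain "CUDAExecutionProvider"
```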
You need an OpenAI API key to get it running
- webcam:
python yolo-chat-tts/main.py -ok <your key>
- local video:
python yolo-chat-tts/main.py -ok <your key> -vp "path/to/your/video.mov"
- youtube:
python yolo-chat-tts/main.py -ok <your key> -y "https://www.youtube.com/watch?v=uhkdUdXTUuc"
See all arguments: python yolo-chat-tts/main.py --help
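For the YouTube input, one way to turn a URL into something cv2.VideoCapture can read is to resolve a direct stream URL first, e.g. with yt-dlp. This is a sketch under that assumption, not necessarily how the repository handles it:

```python
# Resolve a direct stream URL with yt-dlp and open it with OpenCV,
# so no file has to be downloaded first.
import cv2
import yt_dlp


def open_youtube(url):
    with yt_dlp.YoutubeDL({"format": "best[ext=mp4]", "quiet": True}) as ydl:
        info = ydl.extract_info(url, download=False)
    return cv2.VideoCapture(info["url"])


cap = open_youtube("https://www.youtube.com/watch?v=uhkdUdXTUuc")
```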
You can (see the sketch after this list):
- choose between multiple camera devices
- set the interval between the cynical comments
- choose whether detections are drawn on the video or only written to the logs
- set a confidence threshold
- set an IoU threshold
- choose the model size
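A rough sketch of how options like these could be exposed via argparse. Only -ok, -vp and -y appear in the examples above; the remaining flag names, defaults and choices are assumptions for illustration, so check --help for the real ones:

```python
# Hypothetical argument layout; only -ok, -vp and -y are taken from the
# usage examples above, everything else is illustrative.
import argparse

parser = argparse.ArgumentParser(description="yolo-chat-tts")
parser.add_argument("-ok", "--openai-key", required=True, help="OpenAI API key")
parser.add_argument("-vp", "--video-path", help="path to a local video file")
parser.add_argument("-y", "--youtube", help="YouTube URL to use as input")
parser.add_argument("--camera-id", type=int, default=0, help="webcam device index")
parser.add_argument("--interval", type=int, default=30, help="seconds between cynical comments")
parser.add_argument("--draw-detections", action="store_true", help="draw detections on the video instead of only logging them")
parser.add_argument("--conf-threshold", type=float, default=0.5, help="confidence threshold for detections")
parser.add_argument("--iou-threshold", type=float, default=0.5, help="IoU threshold for non-max suppression")
parser.add_argument("--model-size", default="n", choices=["n", "s", "m", "l", "x"], help="YOLOv8 model size")
args = parser.parse_args()
```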
Thanks to Tien Luong Ngoc & Ibai Gorordo; I took a bunch of useful code from your linked repositories.