Links to AI solutions

Wake word detection in Frontend

https://github.com/tubignat/wakeword_detector - a wake word detection demo in the browser (VOSK/Kaldi); works very well.

To run it, install Node.js (https://nodejs.org/en/download/package-manager), then run npm install && npm run dev in the repo's root folder.
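
The linked demo runs entirely in the browser; as a point of comparison, the same idea can be sketched server-side with VOSK's Python bindings by restricting the recognizer to a tiny grammar, so that only the wake phrase (or [unk]) can be recognized. Model path, audio source and wake phrase below are placeholders, not part of the linked repo.

```python
import json
import wave

from vosk import Model, KaldiRecognizer

# Placeholder paths: a downloaded VOSK model directory and a 16 kHz mono WAV
# chunk captured from the microphone.
model = Model("vosk-model-small-en-us-0.15")
wake_phrase = "hey computer"

wf = wave.open("mic_chunk.wav", "rb")
# Restricting the grammar to the wake phrase plus [unk] turns the recognizer
# into a simple keyword spotter.
rec = KaldiRecognizer(model, wf.getframerate(), json.dumps([wake_phrase, "[unk]"]))

while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data) and json.loads(rec.Result()).get("text") == wake_phrase:
        print("wake word detected")
```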

Image generation

https://www.youtube.com/watch?v=849xBkgpF3E - a video with a workflow for generating a consistent series of character images

https://github.com/ComfyWorkflows/ComfyUI-Launcher/blob/main/Dockerfile - a container that also downloads the dependencies for a workflow

https://youtu.be/g74Cq9Ip2ik?si=6itn1c1M9iuTZdAo - a tutorial on ComfyUI

Narration ideas

In general, fine-tuning a language model is not low-hanging fruit yet; better/cheaper techniques are needed.

The Creative Articulator (CA) project allows synthesizing summary-to-original-text datasets, so a network can be trained to expand a short plan of a text into the full text. It also contains a basic container that runs the training, as in CoquiTTS.

It might be interesting to train the network on anime/movie dialogues to better capture the genre; whisperX supports diarization.

The CA project also contains pilot research on predicting speech modality from dialogue. Perhaps gestures/intonations can be predicted too, so that in free conversation the image would react naturally to the conversation's course. Along with diarization, emotions and gestures could perhaps also be extracted from video.

https://github.com/OpenAccess-AI-Collective/axolotl supposedly already has a ready-made container for LLM training

https://github.com/stanfordnlp/dspy - lockpicking? It chooses the prompt by "training" on a dataset.
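
Roughly, DSPy replaces a handwritten prompt with a declarative module and then compiles it against a small dataset and a metric. A minimal sketch follows; the model name, dataset and metric are placeholder assumptions, and the API has shifted between DSPy versions:

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# Placeholder LM configuration.
lm = dspy.OpenAI(model="gpt-3.5-turbo")
dspy.settings.configure(lm=lm)

# A declarative signature instead of a handwritten prompt.
class PlanToText(dspy.Signature):
    """Expand a short plan into a full narrative text."""
    plan = dspy.InputField()
    text = dspy.OutputField()

program = dspy.Predict(PlanToText)

# Tiny illustrative trainset; real data would come from the CA summary-to-text pairs.
trainset = [
    dspy.Example(plan="hero meets dragon", text="The hero ...").with_inputs("plan"),
]

# Placeholder metric: the optimizer searches for demonstrations/prompts that maximize it.
def metric(example, prediction, trace=None):
    return len(prediction.text) > len(example.plan)

compiled = BootstrapFewShot(metric=metric).compile(program, trainset=trainset)
```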

Voice recognition

Use https://github.com/SYSTRAN/faster-whisper instead of Whisper; only integration with BrainBox is needed.
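
For reference, a minimal faster-whisper call looks roughly like this (model size, device and file name are placeholders):

```python
from faster_whisper import WhisperModel

# Placeholder model size / device; int8 keeps it CPU-friendly.
model = WhisperModel("small", device="cpu", compute_type="int8")

# transcribe() returns a lazy generator of segments plus audio metadata.
segments, info = model.transcribe("sample.wav", beam_size=5)
print("Detected language:", info.language)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```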

https://github.com/m-bain/whisperX can do diarization of video, which might help when building a corpus for specific speech.
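
A rough sketch of the transcribe-align-diarize flow from the whisperX README (file name, HuggingFace token and device are placeholders):

```python
import whisperx

device = "cuda"
audio = whisperx.load_audio("episode.wav")

# 1. Transcribe (faster-whisper backend).
model = whisperx.load_model("large-v2", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# 2. Align words to timestamps.
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# 3. Diarize (pyannote under the hood, needs an HF token) and attach speaker labels.
diarize_model = whisperx.DiarizationPipeline(use_auth_token="HF_TOKEN", device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

for seg in result["segments"]:
    print(seg.get("speaker"), seg["text"])
```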

Kaldi: it needs to be extracted from Rhasspy and accept several models, much like Resemblyzer does. The intents can be trained hierarchically: top-level intents for intent recognition, and then a separate model for each skill that requires it.

https://github.com/rhasspy/rhasspy-asr-kaldi - this repo seems to actually do it, converting a rhasspy-nlu graph into a Kaldi model. We need to understand how and from where to download the base profiles, and also raise an error when unknown words are encountered.

This tool parses voice into phonemes rather than letters. It may be used in language skills to improve pronunciation:

https://github.com/huytd/speech

Voice/Sound generation

https://mynoise.net/NoiseMachines/dungeonRPGSoundscapeGenerator.php can help generate background noises, such as the ocean, for atmosphere. https://stability.ai/news/introducing-stable-audio-open can do this too, but seems to be just a more expensive tool for the same purpose.

StyleTTS may replace TortoiseTTS.

https://github.com/yl4579/StyleTTS2

https://huggingface.co/spaces/styletts2/styletts2

The voice quality is very good; it's less resource-intensive and more stable than TortoiseTTS.

It also supports emotions, so voice samples can be generated with different emotions and VITS can then train on them as if they were different voices.

To proceed, integration of StyleTTS into BrainBox is needed

To clean up voices from imperfect sources, https://huggingface.co/spaces/ResembleAI/resemble-enhance might be used.

To train a VITS model of a character in another language:

https://github.com/rhasspy/piper has several VITS models for different languages and a recipe for training.
There is no known tool for upsampling (a TortoiseTTS/StyleTTS analogue for German/Russian).
As for voice transfer: the problem is to generate some voice samples in language X while having only English samples. Since the required amount is really small, anything would work, including paid solutions.

ElevenLabs does not really capture voice peculiarities.
OpenVoice (https://github.com/myshell-ai/OpenVoice) captures tone, but not other things such as tempo. A possible solution could be to reproduce the voices manually (i.e. with one's own mouth) and then use OpenVoice to fix the tone. Integration of OpenVoice into BrainBox is needed.
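
A rough sketch of that pipeline, following the demo in the OpenVoice repo (checkpoint and audio paths are placeholders, and exact arguments may differ between OpenVoice versions):

```python
import torch
from openvoice import se_extractor
from openvoice.api import ToneColorConverter

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder checkpoint paths (see the repo for the checkpoint download).
converter = ToneColorConverter("checkpoints/converter/config.json", device=device)
converter.load_ckpt("checkpoints/converter/checkpoint.pth")

# Tone-color embeddings: the manually recorded sample in the new language
# and a reference sample of the character's original voice.
source_se, _ = se_extractor.get_se("manual_recording.wav", converter, vad=True)
target_se, _ = se_extractor.get_se("character_reference.wav", converter, vad=True)

# Re-color the manual recording with the character's tone.
converter.convert(
    audio_src_path="manual_recording.wav",
    src_se=source_se,
    tgt_se=target_se,
    output_path="character_sample.wav",
)
```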

This repo contains a demo of how to run Bark; emotions and multiple languages are also shown: https://github.com/kekdude/bark_tinkering
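
For reference, the plain Bark API already covers the basics shown there; a minimal sketch (speaker preset and text are placeholders, and bracketed cues like [laughs] are how Bark hints at emotion):

```python
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

# Download and cache the Bark models (large; a GPU is strongly recommended).
preload_models()

# The speaker preset is one of Bark's stock voices.
text_prompt = "Hello, my name is Suno. [laughs] And I can speak several languages!"
audio_array = generate_audio(text_prompt, history_prompt="v2/en_speaker_6")

write_wav("bark_generation.wav", SAMPLE_RATE, audio_array)
```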