Tool for searching transcriptions of vtuber videos.
Uses:
- Metadata from Holodex
- Subtitles and audio from YouTube and some archive sites
- pyannote-audio for speaker diarization
- Whisper for transcription
- Use Python 3.11+
- Install dependencies with
python3.11 -m pip install -r requirements.txt
- Start with
HOLODEX_API_KEY
env variable
- (Optional) Start with
HUGGINGFACE_TOKEN
env variable
See env_config.py for configurable ENV variables.
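As a rough illustration of the required/optional split described above, the check could look like the sketch below. It is not taken from env_config.py; the function name `read_tokens` is hypothetical, and only the two variable names come from this README.

```python
import os


def read_tokens(env=os.environ):
    """Return (holodex_key, hf_token); raise if the required key is missing.

    Hypothetical helper for illustration, not the project's own code.
    """
    holodex_key = env.get("HOLODEX_API_KEY")
    if not holodex_key:
        raise RuntimeError("HOLODEX_API_KEY must be set before starting")
    # HUGGINGFACE_TOKEN is optional, so a missing or empty value becomes None.
    hf_token = env.get("HUGGINGFACE_TOKEN") or None
    return holodex_key, hf_token
```

Passing a plain dict instead of `os.environ` makes the check easy to try without touching the real environment.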
-
Fetch list of all Hololive channels
python3.11 -m holo_subs_search --holodex-fetch-org-channels Hololive
-
Go to
./data/channels/
and delete the channels you don't care about. This greatly limits the amount of data that has to be downloaded and speeds everything up.
-
Fetch list of all videos for the channels you did not delete and collabs on other channels.
python3.11 -m holo_subs_search --holodex-refresh-videos
-
Fetch subtitles for all videos (This takes a while). Only English subtitles are downloaded by default.
python3.11 -m holo_subs_search --youtube-fetch-subtitles
python3.11 -m holo_subs_search --youtube-fetch-subtitles --youtube-fetch-subtitles-langs en jp id
-
(Optional) Fetch subtitles for membership videos. This requires you to have a membership and be logged in to YouTube in your browser.
python3.11 -m holo_subs_search --video-filter flags:includes:youtube-membership --youtube-fetch-subtitles --youtube-memberships UCHsx4Hqa-1ORjQTh9TYDhww --youtube-cookies-from-browser chrome
If you plan to commit your data to a public git repo, run
python3.11 -m holo_subs_search --storage-git-privacy public
to automatically create .gitignore
files that will exclude all membership content from git.
-
Search
python3.11 -m holo_subs_search --search "solo live"
python3.11 -m holo_subs_search --search "solo.*?live" --search-regex
python3.11 -m holo_subs_search --search "solo live" --search-subtitle-filter source:eq:youtube langs:includes:en
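To show the difference between a plain query and a `--search-regex` query, here is a minimal sketch over a list of subtitle lines. It is not the tool's implementation: `search_lines` is a hypothetical name, and the case-insensitive matching is an assumption.

```python
import re


def search_lines(lines, query, use_regex=False):
    """Yield (index, line) for every line that matches the query."""
    # Plain queries are escaped so characters like "." match literally.
    # Case-insensitive matching is an assumption made for this sketch.
    pattern = re.compile(query if use_regex else re.escape(query), re.IGNORECASE)
    for index, line in enumerate(lines):
        if pattern.search(line):
            yield index, line


lines = ["My first solo live!", "solo 3D live next week", "just chatting"]
plain = list(search_lines(lines, "solo live"))
regex = list(search_lines(lines, "solo.*?live", use_regex=True))
```

Here the plain query matches only the first line, while the regex also matches the second line, because `.*?` allows other words between "solo" and "live".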
Some videos completely lack subtitles or the subtitles are VERY bad. In this case, you can try to download the audio and transcribe it yourself. The results can be better or worse depending on a lot of variables.
Most of these steps assume that you have Docker with support for Nvidia GPU installed, a relatively powerful/new Nvidia GPU (like RTX 3090), and at least 64GB RAM (for loading long audio to memory).
Everything should also be able to run on CPU, but you would have to use the *-cpu
containers, and it would be a lot slower.
Check PERFORMANCE.md if you are curious about how long this can take.
python3.11 -m holo_subs_search --youtube-fetch-audio
Detecting which parts of the audio file contain speech is needed to prevent hallucinations later in the transcription step. As a bonus, this will also make it possible to detect which lines were spoken by whom in the future.
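One way to picture this step: the diarization output is a set of time spans that contain speech, and only those spans are passed on to transcription. The sketch below merges overlapping or nearly adjacent spans into larger speech regions. The function name and the `max_gap` threshold are hypothetical, and this is not necessarily how the project combines segments.

```python
def merge_speech_segments(segments, max_gap=0.5):
    """Merge (start, end) spans in seconds that overlap or lie within max_gap."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous span: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(span) for span in merged]


# Illustrative spans; silence between 4.0 s and 10.0 s would be skipped.
segments = [(0.0, 2.0), (2.3, 4.0), (10.0, 12.5), (11.0, 13.0)]
regions = merge_speech_segments(segments)
```

With these inputs the result is two regions, 0.0 to 4.0 and 10.0 to 13.0, so the silent stretch in between would never reach the transcription model.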
-
Get access to default models
- Accept https://hf.co/pyannote/segmentation-3.0 user conditions
- Accept https://hf.co/pyannote/speaker-diarization-3.1 user conditions
- Create access token at https://hf.co/settings/tokens and set it as
HUGGINGFACE_TOKEN
ENV variable
-
Start
PyAnnote
server
docker compose up pyannote-server-cuda
-
Process the audio files
python3.11 -m holo_subs_search --pyannote-diarize-audio
-
Start
Whisper
server
docker compose up faster-whisper-server-cuda
Alternatively, you should be able to use Whisper through the official OpenAI API with the
WHISPER_BASE_URLS
and
WHISPER_API_KEYS
ENV variables and the
--whisper-model
parameter.
-
Transcribe the audio files into subtitles
python3.11 -m holo_subs_search --whisper-transcribe-audio --whisper-langs multi
This will try to automatically detect the language of audio segments, so the resulting transcription might contain multiple languages.
Translations to a specific language can be created with
--whisper-langs en
, but the results are not that good, so this is not recommended. You can also use this parameter if Whisper is incorrectly detecting the language.
-
Search transcribed audio
python3.11 -m holo_subs_search --search "solo live" --search-subtitle-filter source:eq:whisper
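The filter arguments used in these commands (`source:eq:whisper`, `langs:includes:en`, `subtitle_sources:excludes:whisper`) follow a `field:operator:value` shape. The sketch below shows one plausible reading of that syntax. The semantics are assumptions: in particular, treating several filters as a logical AND, and the record layout, are guesses and are not confirmed by the project.

```python
def parse_filter(expression):
    """Split 'field:operator:value' into its three parts."""
    field, operator, value = expression.split(":", 2)
    return field, operator, value


def matches(record, expression):
    """Check one filter expression against a dict (assumed semantics)."""
    field, operator, value = parse_filter(expression)
    actual = record.get(field)
    if operator == "eq":
        return actual == value
    if operator == "includes":
        return value in (actual or [])
    if operator == "excludes":
        return value not in (actual or [])
    raise ValueError(f"unsupported operator: {operator}")


def matches_all(record, expressions):
    # Assumption: multiple filters must all hold (logical AND).
    return all(matches(record, expression) for expression in expressions)


# Hypothetical record layout for a single subtitle file.
subtitle = {"source": "whisper", "langs": ["en", "ja"]}
```

For example, `matches_all(subtitle, ["source:eq:whisper", "langs:includes:en"])` is true for the record above, while `matches(subtitle, "source:eq:youtube")` is false.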
Here is a command that:
- Refreshes list of videos
- Does not process videos that already have subtitles generated by Whisper
- Downloads subtitles and audio from YouTube and archive sites, including membership videos for channel
UCHsx4Hqa-1ORjQTh9TYDhww
- Diarizes and transcribes the audio into subtitles with Whisper
- Deletes downloaded audio to free disk space
python3.11 -m holo_subs_search \
  --holodex-refresh-videos \
  --video-filter subtitle_sources:excludes:whisper \
  --youtube-memberships UCHsx4Hqa-1ORjQTh9TYDhww \
  --youtube-cookies-from-browser chrome \
  --youtube-fetch-subtitles \
  --youtube-fetch-subtitles-langs en \
  --youtube-fetch-audio \
  --youtube-clear-audio \
  --pyannote-diarize-audio \
  --whisper-transcribe-audio \
  --whisper-langs multi
If you don't want to spend many hours or days downloading and processing everything, data for some channels can be found in the following repos:
Use --storage PATH
to search data in the downloaded repo.
Use pre-commit