HoloSubs Search

Tool for searching transcriptions of VTuber videos.

Uses:

Metadata from Holodex
Subtitles and audio from YouTube and some archive sites
pyannote-audio for speaker diarization
Whisper for transcription

Setup

Use Python 3.11+
Install dependencies with python3.11 -m pip install -r requirements.txt
Set the HOLODEX_API_KEY environment variable before starting the tool
(Optional) Set the HUGGINGFACE_TOKEN environment variable (used to access the pyannote diarization models)

See env_config.py for configurable ENV variables.
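
A small pre-flight check can catch a missing key before a long run starts. The sketch below is illustrative only: the helper name `missing_env` is hypothetical, and it encodes nothing beyond what the Setup steps state (HOLODEX_API_KEY is required, HUGGINGFACE_TOKEN is optional). env_config.py remains the authoritative list.

```python
import os

# Variable names are taken from the Setup steps above; the
# required/optional split mirrors those steps and nothing more.
REQUIRED = ("HOLODEX_API_KEY",)
OPTIONAL = ("HUGGINGFACE_TOKEN",)


def missing_env(environ=None):
    """Return (missing_required, missing_optional) lists of variable names.

    `environ` defaults to the real process environment; passing a dict
    makes the check easy to try out without touching os.environ.
    """
    env = os.environ if environ is None else environ
    missing_required = [name for name in REQUIRED if not env.get(name)]
    missing_optional = [name for name in OPTIONAL if not env.get(name)]
    return missing_required, missing_optional
```

Calling `missing_env()` with no arguments inspects the current environment. An empty first list means the required key is present.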

Quickstart

Fetch list of all Hololive channels

python3.11 -m holo_subs_search --holodex-fetch-org-channels Hololive

Go to ./data/channels/ and delete the channels you don't care about. This will greatly limit the amount of data that will have to be downloaded, and will speed everything up.
Fetch the list of all videos for the channels you kept, including collabs hosted on other channels.
```
python3.11 -m holo_subs_search --holodex-refresh-videos
```

Fetch subtitles for all videos (This takes a while). Only English subtitles are downloaded by default.

python3.11 -m holo_subs_search --youtube-fetch-subtitles
python3.11 -m holo_subs_search --youtube-fetch-subtitles --youtube-fetch-subtitles-langs en jp id

(Optional) Fetch subtitles for membership videos. This requires you to have a membership and to be logged in to YouTube in your browser.
```
python3.11 -m holo_subs_search --video-filter flags:includes:youtube-membership --youtube-fetch-subtitles --youtube-memberships UCHsx4Hqa-1ORjQTh9TYDhww --youtube-cookies-from-browser chrome
```
If you plan to commit your data to a public git repo, run python3.11 -m holo_subs_search --storage-git-privacy public to automatically create .gitignore files that exclude all membership content from git.

Search

python3.11 -m holo_subs_search --search "solo live"
python3.11 -m holo_subs_search --search "solo.*?live" --search-regex
python3.11 -m holo_subs_search --search "solo live" --search-subtitle-filter source:eq:youtube langs:includes:en
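
To see why the regex form can find more than the plain phrase, here is a minimal sketch using Python's `re` module. The subtitle lines are made up, and it is an assumption that the tool's --search-regex follows Python regex syntax and matches line by line. Check the tool's own behavior before relying on details such as case sensitivity.

```python
import re

# Made-up subtitle lines, for illustration only.
lines = [
    "my first solo live is next month",
    "the solo 3D live was amazing",
    "we went live solo yesterday",
]

plain = "solo live"
pattern = re.compile(r"solo.*?live")

# Plain search: the exact phrase must appear in the line.
plain_hits = [line for line in lines if plain in line]

# Regex search: "solo", then as few characters as possible, then "live".
regex_hits = [line for line in lines if pattern.search(line)]
```

Here `plain_hits` holds only the first line, while `regex_hits` also picks up "the solo 3D live was amazing". The third line matches neither, because "live" comes before "solo".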

Audio Transcription

Some videos lack subtitles entirely, or their subtitles are very poor. In these cases, you can try downloading the audio and transcribing it yourself. The results can be better or worse depending on many variables.

Most of these steps assume that you have Docker installed with Nvidia GPU support, a relatively powerful/new Nvidia GPU (like an RTX 3090), and at least 64GB of RAM (for loading long audio into memory). Everything should also be able to run on CPU, but you would have to use the *-cpu containers, and it would be a lot slower.

Check PERFORMANCE.md if you are curious about how long this can take.

Download all audio files

python3.11 -m holo_subs_search --youtube-fetch-audio

Diarize Audio

Detecting which parts of the audio file contain speech is needed to prevent hallucinations later in the transcription step. As a bonus, this will also make it possible in the future to detect which lines were spoken by whom.
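
As a rough illustration of the idea (not this project's implementation), diarization output can be thought of as a list of timed speech segments. Merging them into larger regions gives the spans worth transcribing, and everything outside them can be skipped. The function name, the `(start, end)` tuple format, and the `max_gap` parameter below are all hypothetical.

```python
def merge_speech_segments(segments, max_gap=0.5):
    """Merge (start, end) speech segments, in seconds, into larger regions.

    Segments that overlap, or that are separated by at most `max_gap`
    seconds of silence, are joined into one region.
    """
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous region: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            # Too far from the previous region: start a new one.
            merged.append([start, end])
    return [tuple(region) for region in merged]
```

For example, `merge_speech_segments([(0.0, 2.0), (2.3, 4.0), (10.0, 12.0), (11.0, 13.5)])` yields two regions, one from 0.0 to 4.0 and one from 10.0 to 13.5. The six seconds of silence between them would never be sent to the transcription step.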

Get access to default models
- Accept https://hf.co/pyannote/segmentation-3.0 user conditions
- Accept https://hf.co/pyannote/speaker-diarization-3.1 user conditions
- Create access token at https://hf.co/settings/tokens and set it as HUGGINGFACE_TOKEN ENV variable
Start PyAnnote server
```
docker compose up pyannote-server-cuda
```

Process the audio files

python3.11 -m holo_subs_search --pyannote-diarize-audio

Transcribe Audio

Start Whisper server
```
docker compose up faster-whisper-server-cuda
```
Alternatively, you should be able to use Whisper through the official OpenAI API by setting the WHISPER_BASE_URLS and WHISPER_API_KEYS environment variables and passing the --whisper-model parameter.
Transcribe the audio files into subtitles
```
python3.11 -m holo_subs_search --whisper-transcribe-audio --whisper-langs multi
```
This will try to automatically detect the language of audio segments, so the resulting transcription might contain multiple languages.

Translations to a specific language can be created with --whisper-langs en, but the results are not that good, so this is not recommended. The same parameter can also be used to force a language when Whisper detects it incorrectly.

Search transcribed audio

python3.11 -m holo_subs_search --search "solo live" --search-subtitle-filter source:eq:whisper
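
The filter expressions used throughout this README (for example source:eq:whisper, langs:includes:en, subtitle_sources:excludes:whisper) all read as field:operator:value. The sketch below only illustrates how such expressions can be read. It is not the tool's parser, the function names are hypothetical, and it covers just the three operators that appear in this README.

```python
def parse_filter(expression):
    """Split 'field:operator:value' into its three parts."""
    field, operator, value = expression.split(":", 2)
    return field, operator, value


def matches(record, expression):
    """Check a dict-like record against one filter expression.

    `eq` compares a single value; `includes` and `excludes` expect the
    field to hold a list of values (a missing field counts as empty).
    """
    field, operator, value = parse_filter(expression)
    actual = record.get(field)
    if operator == "eq":
        return actual == value
    if operator == "includes":
        return actual is not None and value in actual
    if operator == "excludes":
        return actual is None or value not in actual
    raise ValueError(f"unsupported operator: {operator}")
```

With a made-up record such as `{"source": "whisper", "langs": ["en", "id"]}`, the expression source:eq:whisper matches and source:eq:youtube does not.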

Everything Together

Here is a command that:

Refreshes list of videos
Does not process videos that already have subtitles generated by Whisper
Downloads subtitles and audio from YouTube and archive sites, including membership videos for channel UCHsx4Hqa-1ORjQTh9TYDhww
Diarizes and transcribes the audio into subtitles with Whisper
Deletes downloaded audio to free disk space

python3.11 -m holo_subs_search \
    --holodex-refresh-videos \
    --video-filter subtitle_sources:excludes:whisper \
    --youtube-memberships UCHsx4Hqa-1ORjQTh9TYDhww \
    --youtube-cookies-from-browser chrome \
    --youtube-fetch-subtitles \
    --youtube-fetch-subtitles-langs en \
    --youtube-fetch-audio \
    --youtube-clear-audio \
    --pyannote-diarize-audio \
    --whisper-transcribe-audio \
    --whisper-langs multi
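
If you want to run this pipeline from a script (for example on a schedule), the same invocation can be assembled as an argument list. This is a minimal sketch: the function name `build_pipeline_command` and its parameters are hypothetical, while every flag is copied from the command above.

```python
import shlex


def build_pipeline_command(channel_id, browser="chrome", python="python3.11"):
    """Return the full pipeline invocation as a list of arguments."""
    return [
        python, "-m", "holo_subs_search",
        "--holodex-refresh-videos",
        "--video-filter", "subtitle_sources:excludes:whisper",
        "--youtube-memberships", channel_id,
        "--youtube-cookies-from-browser", browser,
        "--youtube-fetch-subtitles",
        "--youtube-fetch-subtitles-langs", "en",
        "--youtube-fetch-audio",
        "--youtube-clear-audio",
        "--pyannote-diarize-audio",
        "--whisper-transcribe-audio",
        "--whisper-langs", "multi",
    ]


def format_command(command):
    """Render the argument list as a single shell-safe string for logging."""
    return shlex.join(command)
```

The list can be passed directly to `subprocess.run(command, check=True)`. Keeping the arguments as a list avoids shell quoting issues.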

Processed Data

If you don't want to spend many hours or days downloading and processing everything, data for some channels can be found in the following repos:

https://github.com/kunesj/holo-subs-search-data

Use --storage PATH to search data in the downloaded repo.

Development

Use pre-commit
