Upstream README | ROCm Install Guide
These install instructions are for https://hub.docker.com/r/rocm/pytorch. They should mostly work for system installs as well, but then you'll have to change the install directories and make sure all dependencies are installed (in the image they are already present in the conda env).
After following the guide at https://hub.docker.com/r/rocm/pytorch (tested with the latest tag at the time, 9e1748e5b, ROCm 6.2), run:
#init conda
conda init
bash
conda activate py_3.9
git clone https://github.com/arlo-phoenix/CTranslate2-rocm.git --recurse-submodules
cd CTranslate2-rocm
#export PYTORCH_ROCM_ARCH=gfx1030 #optionally set this only to your ROCm arch to speed up compiling. You can find it with rocminfo | grep gfx
CLANG_CMAKE_CXX_COMPILER=clang++ CXX=clang++ HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" cmake -S . -B build -DWITH_MKL=OFF -DWITH_HIP=ON -DCMAKE_HIP_ARCHITECTURES=$PYTORCH_ROCM_ARCH -DBUILD_TESTS=ON -DWITH_CUDNN=ON
cmake --build build -- -j16
cd build
cmake --install . --prefix $CONDA_PREFIX #or just sudo make install if not using conda env
sudo ldconfig
cd ../python
pip install -r install_requirements.txt
python setup.py bdist_wheel
pip install dist/*.whl
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/
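To quickly check that the installed wheel actually picks up the HIP-enabled library, a short import test like this should do (this uses the regular ctranslate2 Python API, nothing specific to this fork):

```python
# Minimal smoke test, assuming the wheel built above is installed in the active env.
import ctranslate2

print(ctranslate2.__version__)
# The ROCm/HIP build is still exposed through the CUDA device API,
# so this should report at least 1 if your GPU is visible.
print(ctranslate2.get_cuda_device_count())
```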
In the CT2 project root folder:
./build/tests/ctranslate2_test ./tests/data/ --gtest_filter=*CUDA*:-*bfloat16*
For me only some int8 test failed (I think that test shouldn't even be run for CUDA, but I didn't check too deeply; the guard is from CT2 itself, so it's supposed to fail).
ld -lctranslate2 --verbose
(ignore the warnings; the only important thing is that it doesn't run into any link errors)
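Related to the int8 note above and the bf16 notes below, you can also ask the build which compute types it reports as usable. This is just a small check using the standard ctranslate2 Python API, not part of the fork's test suite:

```python
import ctranslate2

# "cuda" is also the device name used by the HIP/ROCm build
print(ctranslate2.get_supported_compute_types("cuda"))
print(ctranslate2.get_supported_compute_types("cpu"))
```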
This fork just commented out everything related to bf16. I think an implicit conversion operator from __hip_bfloat16 to float is missing.
Example error with bf16 enabled:
CTranslate2/src/cuda/primitives.cu:284:19: error: no viable conversion from 'const __hip_bfloat16' to 'const float'
284 | const float score = previous_scores[i];
Other than that, I won't be adding FA2 or AWQ support. They're written with assembly for CUDA and aren't helpful at all for my use case (whisper). Otherwise, on this older commit (besides bf16) this fork is feature complete, so I might look into cleaning it up and into the possibility of disabling these features for ROCm on master for upstreaming. But I'll only do that after bf16 gets proper support, since this discrepancy adds way too many different code paths between ROCm and CUDA. Apart from that, the conversion worked quite well: for the majority of the project I only had to change a couple of defines. Only the conv1d op required a custom implementation for MIOpen (hipDNN isn't maintained anymore).
pip install faster-whisper
#1.0.3 was the most recent version when I made this, so try testing that one first if a newer one doesn't work
#pip install faster-whisper==1.0.3
I included a small benchmark script in this CT2 fork. You need to download a test file from the faster-whisper repo:
wget -P "./tests/data" https://github.com/SYSTRAN/faster-whisper/raw/master/tests/data/physicsworks.wav
Then you should be able to run:
python faster_whisper_bench.py
By default this does just one test run with the medium model. I'm getting around 10.9-11.0s on my RX6800 (13.7-13.8s with model loading included).
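If you'd rather not use the bench script, a minimal faster-whisper run looks roughly like this (model size, compute type and beam size are just example values, not necessarily what the bench script uses):

```python
from faster_whisper import WhisperModel

# "cuda" is the device name even on the ROCm build
model = WhisperModel("medium", device="cuda", compute_type="float16")

segments, info = model.transcribe("./tests/data/physicsworks.wav", beam_size=5)
print("Detected language:", info.language, info.language_probability)
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
```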
For whisperx, the only system dependency is ffmpeg. Either use your system package manager or install it with conda: conda install conda-forge::ffmpeg
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.1 --force-reinstall
pip3 install transformers pandas nltk pyannote.audio==3.1.1 faster-whisper==1.0.1 -U
pip3 install whisperx --no-deps
Python dependencies are a mess here since versions aren't really pinned and the image doesn't come with torchaudio. The commands above worked for me, but they will take a while since they reinstall all Python dependencies.
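Before debugging whisperx itself, it's worth confirming that the reinstalled torch still sees the GPU (the ROCm wheels expose it through the CUDA API):

```python
import torch

print(torch.__version__)           # should be a +rocm build after the reinstall above
print(torch.version.hip)           # a HIP version string instead of None on ROCm builds
print(torch.cuda.is_available())   # ROCm devices show up through the CUDA API
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```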
To run it, you can use its great CLI tool by just calling whisperx path/to/audio, or run my little bench script for the medium model:
python whisperx_bench.py
This took around 4.1s with language detection and around 3.94s without.
If you do get it running, it's pretty fast. I excluded model load since that one takes quite a while; with model load included it was only slightly faster than faster_whisper, but I think that's connected to the bunch of version conflicts I had. The main advantage of whisperx is its great feature set (Forced Alignment, VAD, Speaker Diarization) and the CLI tool (lots of output options), so do try to get it running, it's worth it.
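If you prefer the Python API over the CLI, the basic transcribe-then-align flow from the whisperx README looks roughly like this (model size, batch size and the audio path are example values; the alignment step downloads an extra model on first use):

```python
import whisperx

device = "cuda"  # the ROCm GPU shows up as a CUDA device here as well
audio = whisperx.load_audio("path/to/audio.wav")

# 1. Transcription via the faster-whisper / CTranslate2 backend
model = whisperx.load_model("medium", device, compute_type="float16")
result = model.transcribe(audio, batch_size=8)

# 2. Forced alignment for word-level timestamps
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

for segment in result["segments"]:
    print(segment["start"], segment["end"], segment["text"])
```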