Bitextor Neural can be installed from source.
Step-by-step Bitextor Neural installation from source.
# if you are cloning from scratch:
git clone --recurse-submodules https://github.com/bitextor/bitextor-neural.git
# otherwise:
git submodule update --init --recursive
Some external tools need to be in the PATH before installing the project. If you are using an apt-like package manager, you can run the following commands to install all of these dependencies:
# mandatory:
sudo apt install python3 python3-venv python3-pip golang-go build-essential cmake libboost-all-dev liblzma-dev time curl pigz parallel
# optional, feel free to skip dependencies for components that you don't expect to use:
## wget crawler:
sudo apt install wget
## warc2text:
sudo apt install uchardet libuchardet-dev libzip-dev
## biroamer:
sudo apt install libgoogle-perftools-dev libsparsehash-dev
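Before moving on, it can help to confirm that the mandatory tools are actually reachable from the PATH. The following is a minimal sketch, not part of Bitextor Neural: `missing_tools` is a hypothetical helper, and the command names passed to it are assumed to be the ones provided by the mandatory packages listed above.

```shell
# Hypothetical helper: print each given command that is missing from PATH.
# Returns 0 only if every command was found.
missing_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            missing=1
        fi
    done
    return "$missing"
}

# Command names assumed to correspond to the mandatory packages above.
if missing_tools python3 pip3 go cmake make curl pigz parallel; then
    echo "all listed tools found"
else
    echo "install the tools printed above before continuing" >&2
fi
```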
Compile and install Bitextor Neural's C++ dependencies:
mkdir build && cd build
cmake -DCMAKE_INSTALL_PREFIX=$HOME/.local ..
# any other prefix can be used, as long as it is in the PATH
make -j install
Optionally, it is possible to skip the compilation of the dependencies that are not expected to be used:
cmake -DSKIP_BIROAMER=ON -DCMAKE_INSTALL_PREFIX=$HOME/.local ..
# dependencies that can optionally be skipped:
# BIROAMER, WARC2TEXT, KENLM
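Several components can presumably be skipped at once by passing one flag per component. The sketch below assumes that every component listed above follows the same `-DSKIP_<NAME>=ON` pattern as `SKIP_BIROAMER`; `skip_flags` is a hypothetical helper, and the command is only printed so that it can be reviewed before being run from the build directory.

```shell
# Hypothetical helper: build one -DSKIP_<NAME>=ON flag per component name.
# Assumes all skippable components follow the SKIP_BIROAMER naming pattern.
skip_flags() {
    flags=""
    for name in "$@"; do
        flags="${flags:+$flags }-DSKIP_${name}=ON"
    done
    echo "$flags"
}

# Print the resulting cmake command instead of running it.
echo cmake $(skip_flags BIROAMER KENLM) "-DCMAKE_INSTALL_PREFIX=$HOME/.local" ..
```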
Additionally, Bitextor Neural uses giashard to preprocess WARC files.
# build and place the necessary tools in $HOME/go/bin
go install github.com/paracrawl/giashard/...@latest
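The tools placed in $HOME/go/bin are only usable if that directory is in the PATH. A minimal sketch of one way to ensure this follows; `prepend_to_path` is a hypothetical helper, and the binary name `giashard` is an assumption based on the repository name.

```shell
# Hypothetical helper: prepend directory $1 to PATH unless it is already listed.
prepend_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, nothing to do
        *) PATH="$1:$PATH" ;;
    esac
}

# $HOME/go/bin is the location named above.
prepend_to_path "$HOME/go/bin"
export PATH

# The binary name "giashard" is assumed here.
if command -v giashard >/dev/null 2>&1; then
    echo "giashard found at $(command -v giashard)"
else
    echo "giashard not found; check the output of 'go install'" >&2
fi
```

To make the change permanent, the same `export` would typically go into your shell startup file.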
Furthermore, most of the scripts in Bitextor Neural are written in Python 3. The minimum requirement is Python>=3.7.
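The installed interpreter can be checked against the 3.7 minimum before creating the environment. This is a sketch only: `version_at_least` is a hypothetical helper that compares the major and minor components, and it assumes `python3 --version` prints output of the form "Python X.Y.Z".

```shell
# Hypothetical helper: succeed if version $1 (MAJOR.MINOR[.PATCH]) is at
# least the MAJOR.MINOR version given in $2.
version_at_least() {
    have_major=${1%%.*}
    have_rest=${1#*.}
    have_minor=${have_rest%%.*}
    want_major=${2%%.*}
    want_rest=${2#*.}
    want_minor=${want_rest%%.*}
    [ "$have_major" -gt "$want_major" ] ||
        { [ "$have_major" -eq "$want_major" ] && [ "$have_minor" -ge "$want_minor" ]; }
}

if command -v python3 >/dev/null 2>&1; then
    # Keep the second word of the "Python X.Y.Z" output.
    py_version=$(python3 --version 2>&1 | awk '{print $2}')
    if version_at_least "$py_version" 3.7; then
        echo "python3 $py_version is recent enough"
    else
        echo "python3 $py_version is older than 3.7" >&2
    fi
else
    echo "python3 not found in PATH" >&2
fi
```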
Some additional Python libraries are required. They can be installed automatically with pip. We recommend using a virtual environment to manage the Bitextor Neural installation.
# create virtual environment & activate
python3 -m venv /path/to/virtual/environment
source /path/to/virtual/environment/bin/activate
# install dependencies in the virtual environment
pip3 install --upgrade pip
# Bitextor Neural:
pip3 install .
# additional dependencies:
pip3 install ./bifixer
pip3 install ./biroamer && python3 -m spacy download en_core_web_sm
pip3 install ./neural-document-aligner
pip3 install ./bicleaner-ai && pip3 install ./kenlm --install-option="--max_order 7"
If you don't want to install all the Python requirements in requirements.txt because you don't expect to run some of the Bitextor Neural modules, you can comment out the corresponding *.txt entries in requirements.txt and rerun the Bitextor Neural installation.
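Commenting out entries can be done by hand in any editor; for scripted setups, a small helper along these lines might be used instead. This is a sketch under assumptions: `comment_out` is a hypothetical function, and the entry name in the usage line is purely illustrative, since the actual contents of requirements.txt are not shown here.

```shell
# Hypothetical helper: comment out every not-yet-commented line of file $2
# that contains the literal text $1. The original is kept as "$2.bak".
comment_out() {
    cp "$2" "$2.bak" &&
    awk -v pat="$1" '
        index($0, pat) && $0 !~ /^[[:space:]]*#/ { print "# " $0; next }
        { print }
    ' "$2.bak" > "$2"
}

# Illustrative usage ("biroamer" is an assumed entry name, not a confirmed one):
#   comment_out biroamer requirements.txt
#   pip3 install .
```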
- Depending on the libboost version shipped by your OS release or distribution, you may run into problems when compiling some of the submodules included in Bitextor Neural. If so, you can install Boost manually by running the following commands:
sudo apt-get remove libboost-all-dev
sudo apt-get autoremove
wget https://archives.boost.io/release/1.76.0/source/boost_1_76_0.tar.gz
tar xvf boost_1_76_0.tar.gz
cd boost_1_76_0/
./bootstrap.sh
./b2 -j4 --layout=system install || echo FAILURE
cd ..
rm -rf boost_1_76_0*
- Some dependencies are GPU-dependent, which can be a problem if the installed versions do not support your specific GPU. This is very common with pytorch; if you run into this problem, you may need to uninstall it and install a version that supports your GPU.