COSMOS Ingestion on ML4AI Lab Servers
This page documents how to set up and run the COSMOS ingestion pipeline on the various ML4AI servers. Currently the only supported server is kraken; the pipeline is failing with a known issue on carp (documented below).
The following is a list of required software (paired with setup guides) that must be installed on a new ML4AI lab server before it can run the COSMOS ingestion pipeline. This is only required when a new (GPU-enabled) server is being set up.
NOTE: these instructions are intended for a server running the Ubuntu 20.04 operating system.
- Install Nvidia graphics drivers
- Install Nvidia CUDA drivers
- Install Nvidia CuDNN drivers
- Verify the Nvidia CUDA/CuDNN installation by downloading and running the cuDNN MNIST sample
- Install the docker-ce runtime environment (the Docker engine)
- Install Nvidia-docker (sometimes called nvidia-docker2 or nvidia-container-toolkit)
- Install docker-compose (ensure this is an up-to-date version -- 1.29.2 works well)
- Restart the Docker daemon
- Verify that the installed GPUs are reachable in docker using the following command:
sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
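As a quick sanity check on the stack installed above, the following commands should all succeed before moving on (exact version numbers will vary by machine; if nvcc is not found, /usr/local/cuda/bin may need to be added to your PATH):

```bash
# GPU(s) and driver version should be listed
nvidia-smi

# CUDA toolkit version reported by the compiler
nvcc --version

# docker-compose version (1.29.2 is known to work)
docker-compose --version
```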
- Ensure your user account is in the ai and docker user groups.
- Clone the Cosmos repository to a location in your home directory (for this working example we will use /home/my_username/Cosmos to refer to that location).
- Create a data directory structure to store COSMOS input, temp and output files (example commands are sketched below).
  - The root of my data directory structure will be /home/my_username/COSMOS-data/. Add the following sub-directories:
    - /home/my_username/COSMOS-data/input_files/
    - /home/my_username/COSMOS-data/output_files/
    - /home/my_username/COSMOS-data/tmp_files/
  - Change the file permissions on the root dir and all child dirs to allow full access with the command: chmod -R 777 /home/my_username/COSMOS-data/
    - This is necessary for COSMOS to read/write data in these dirs (execute permission (x) is also needed, for reasons that are not entirely clear).
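A minimal sketch of the steps above, assuming the example username and paths used throughout this page (the clone URL is a placeholder for whichever Cosmos fork your group uses, and the ai group is assumed to already exist on the server):

```bash
# Add yourself to the required groups (log out and back in for this to take effect)
sudo usermod -aG ai,docker my_username

# Clone the Cosmos repository into the home directory
git clone <cosmos_repo_url> /home/my_username/Cosmos

# Create the data directory tree and open up its permissions
mkdir -p /home/my_username/COSMOS-data/{input_files,output_files,tmp_files}
chmod -R 777 /home/my_username/COSMOS-data/
```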
- Create the file .env within the COSMOS deployment directory (which in our working example would be /home/my_username/Cosmos/deployment/) with the content that follows. NOTE: you will need to replace <absolute_path_to_COSMOS_data> for INPUT_DIR, TMP_DIR, and OUTPUT_DIR with the appropriate absolute path, which in our working example is /home/my_username/COSMOS-data.
BASE_IMAGE=uwcosmos/cosmos-base-c11:latest
DETECT_IMAGE=uwcosmos/cosmos-ingestion-c11:latest
WORKER_IMAGE=uwcosmos/cosmos-ingestion-c11:latest
RETRIEVAL_IMAGE=uwcosmos/cosmos-retrieval:latest
EXTRACTION_IMAGE=ankurgos/cosmos-extraction:latest
VISUALIZER_IMAGE=uwcosmos/visualizer_kb:latest
LINKING_IMAGE=uwcosmos/cosmos-linking:latest
UPLOAD_IMAGE=uwcosmos/cosmos-api:latest
API_IMAGE=uwcosmos/cosmos-api:latest
SCHEDULER_ADDRESS=scheduler:8786
ELASTIC_ADDRESS=es01:9200
NUM_PROCESSES=$WORKER_PROCS
WORKER_PROCS=4
DETECT_PROCS=1
# Default to GPU. On Kraken: use either :0 or :1 to choose specific GPU
DEVICE=cuda:1
RERANKING_DEVICE=cuda:1
# Uncomment to use CPUs
#DEVICE=cpu
#RERANKING_DEVICE=cpu
# Env vars for training
TRAINING_DIR=/path/to/training_data
VALIDATION_DIR=/path/to/validation_data/
CONFIG_DIR=${PWD}/deployment/configs
WEIGHTS_DIR=/path/for/training/output
# Env vars for primary pipeline
INPUT_DIR=<absolute_path_to_COSMOS_data>/input_files
TMP_DIR=<absolute_path_to_COSMOS_data>/tmp_files
OUTPUT_DIR=<absolute_path_to_COSMOS_data>/output_files
ELASTIC_DATA_PATH=/path/for/elasticsearch/
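Before choosing between DEVICE=cuda:0 and DEVICE=cuda:1 on a multi-GPU machine such as kraken, it can help to see which GPU is currently in use; a small optional check using standard nvidia-smi query flags:

```bash
# Show per-GPU index, name and memory usage so you can pick an idle device
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv
```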
- Replace the file docker-compose-ingest.yml (in the COSMOS deployment/ directory, in this example /home/my_username/Cosmos/deployment/) with the following content:
version: '3.4'
networks:
  swarm_network:
    driver: overlay
    attachable: true
services:
  scheduler:
    image: $BASE_IMAGE
    command: "dask-scheduler"
    ports:
      - 8787:8787
    networks:
      swarm_network:
  detect_worker:
    image: $DETECT_IMAGE
    environment:
      - MODEL_CONFIG=/configs/model_config.yaml
      - WEIGHTS_PTH=/weights/model_weights.pth
      - DEVICE
      - OMP_NUM_THREADS=2
    volumes:
      - ${INPUT_DIR}:/input
      - ${TMP_DIR}:/mytmp
      - ${OUTPUT_DIR}:/output
    command: "dask-worker tcp://scheduler:8786 --nprocs ${DETECT_PROCS} --nthreads 1 --memory-limit 0 --resources 'GPU=1' --preload ingest.preload_plugins.detect_setup"
    networks:
      swarm_network:
    depends_on:
      - scheduler
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
  worker:
    image: $WORKER_IMAGE
    environment:
      - OMP_NUM_THREADS=2
      - MODEL_CONFIG=/configs/model_config.yaml
      - CLASSES_PTH=/configs/model_config.yaml
      - PP_WEIGHTS_PTH=/weights/pp_model_weights.pth
    volumes:
      - ${INPUT_DIR}:/input
      - ${TMP_DIR}:/mytmp
      - ${OUTPUT_DIR}:/output
    command: "dask-worker tcp://scheduler:8786 --nprocs ${WORKER_PROCS} --nthreads 1 --memory-limit 0 --resources 'process=1' --preload ingest.preload_plugins.process_setup"
    networks:
      swarm_network:
    depends_on:
      - scheduler
  runner:
    image: $WORKER_IMAGE
    environment:
      - OMP_NUM_THREADS=2
      - MODEL_CONFIG=/configs/model_config.yaml
      - CLASSES_PTH=/configs/model_config.yaml
      - PP_WEIGHTS_PTH=/weights/pp_model_weights.pth
    volumes:
      - ${INPUT_DIR}:/input
      - ${TMP_DIR}:/mytmp
      - ${OUTPUT_DIR}:/output
    networks:
      swarm_network:
    depends_on:
      - scheduler
    command: "python3.8 -m ingest.scripts.ingest_documents --use-semantic-detection \
      --use-rules-postprocess --use-xgboost-postprocess -a pdfs -a sections -a tables -a figures -a equations \
      --no-compute-word-vecs --ngram 3 \
      --input-path /input --output-path /output --dataset-id documents_5Feb \
      --cluster tcp://scheduler:8786 --tmp-dir /mytmp"
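As an optional sanity check, docker-compose can print the fully interpolated compose file, which confirms that the variables in .env are being substituted. Note that docker-compose 1.x reads .env from the directory it is invoked in, so run this from the deployment/ directory (or pass --project-directory deployment):

```bash
cd /home/my_username/Cosmos/deployment
# Prints the compose file with all ${...} variables expanded; errors here usually
# point at a missing or misnamed entry in .env
docker-compose -f docker-compose-ingest.yml config
```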
Assuming you have already set up ssh tunneling access to the <ml4ai_machine> you wish to copy files to, along with a machine alias for <ml4ai_machine> in your .ssh config file, you can use a tool like rsync to copy one or more files:
rsync -avz <source_pdf(s)> <ml4ai_machine>:<path_to_INPUT_DIR>
... where <path_to_INPUT_DIR> is the full directory path specified as INPUT_DIR in the <Cosmos>/deployment/.env file.
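For a concrete (hypothetical) example, assuming an ssh alias named kraken and the example INPUT_DIR used throughout this page:

```bash
# Copy two local test PDFs into the COSMOS input directory on the server
rsync -avz ~/papers/test1.pdf ~/papers/test2.pdf \
    kraken:/home/my_username/COSMOS-data/input_files/
```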
- A prerequisite for running the pipeline is that you can run docker. Execute the following command to verify this:
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
If this succeeds, then you have verified that everything up to docker-compose is working.
- Next, verify that the docker swarm is running by executing
docker swarm init
This may produce the following error, indicating that the swarm is already running:
Error response from daemon: This node is already part of a swarm. Use "docker swarm leave" to leave this swarm and join another one.
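Alternatively, if you want to check the swarm state without potentially initializing a new swarm, the standard docker info template below is a read-only query:

```bash
# Prints "active" if this node is already part of a swarm, "inactive" otherwise
docker info --format '{{.Swarm.LocalNodeState}}'
```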
- Next, ensure the .pdf files you want to process have been placed in the INPUT_DIR directory (as specified in the deployment/.env config file). For initial testing, it is recommended that you place two .pdf files, to verify that multiple files can be processed.
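A quick check that the files are where the pipeline expects them, using the example INPUT_DIR from above:

```bash
# Should list the .pdf files you just copied in
ls -l /home/my_username/COSMOS-data/input_files/
```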
- Finally, run the COSMOS ingest pipeline:
docker-compose -f deployment/docker-compose-ingest.yml up
If everything is working, COSMOS will run to completion and then pause. To exit the run, use the ctrl-c key combination.
A successful run should generate, in the OUTPUT_DIR, parquet files for the main document as well as for equations, figures, sections and tables. These parquet files should then be ready for processing by Paul's script in AutoMATES: scripts/model_assembly/cosmos_integration.py
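Two optional conveniences (using the example paths from this page): run the stack in the background and follow its logs instead of holding the terminal open, then confirm the parquet outputs landed in OUTPUT_DIR.

```bash
# Run detached and follow the logs
docker-compose -f deployment/docker-compose-ingest.yml up -d
docker-compose -f deployment/docker-compose-ingest.yml logs -f

# After the run pauses/completes, tear the stack down and check the outputs
docker-compose -f deployment/docker-compose-ingest.yml down
ls -l /home/my_username/COSMOS-data/output_files/*.parquet
```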
- [2021-09-16 - on kraken]: It appears that only one docker swarm can run at a time. If more than one COSMOS execution is initiated at the same time, the most recent will execute to completion and the other appears to fail. This has not been rigorously tested. TODO: Follow up with Ian Ross @ UWisc.
- [2021-09-16 - COSMOS pipeline not working on carp]: Paul and Ian did a lot of work attempting to get COSMOS running on carp, but it does not work yet. The current recommendation is to try the following:
  - First: try uninstalling (in the reverse order from installation; a sketch of these commands is given below):
    - docker-compose
    - Nvidia-docker (sometimes called nvidia-docker2 or nvidia-container-toolkit)
    - docker-ce runtime environment
  - ... then try reinstalling in the original order, and attempt to proceed with the setup as described above.
  - Second: if the above does not work, consider doing a complete OS wipe of carp and rebuilding from the beginning.
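A sketch of that uninstall step, assuming the standard Ubuntu package names and a docker-compose installed as a standalone binary (adjust to match how these were actually installed on carp):

```bash
# Remove docker-compose (if it was installed as a standalone binary)
sudo rm -f /usr/local/bin/docker-compose

# Remove the NVIDIA container runtime packages
sudo apt-get purge -y nvidia-docker2 nvidia-container-toolkit

# Remove the Docker engine itself
sudo apt-get purge -y docker-ce docker-ce-cli containerd.io
sudo apt-get autoremove -y
```

Reinstalling then follows the original order (docker-ce, nvidia-docker2/nvidia-container-toolkit, docker-compose), after which the verification steps at the top of this page can be repeated.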