COSMOS Ingestion on ML4AI Lab Servers
This page documents how to set up and run the COSMOS ingestion pipeline on the various ML4AI servers. Currently the only supported server is kraken; the pipeline is failing with a known issue on carp (documented below).
The following is a list of required software (paired with setup guides) that must be installed on a new ML4AI lab server before it can run the COSMOS ingestion pipeline. This is only required when a new (GPU-enabled) server is being set up.
NOTE: these instructions are intended for a server running the Ubuntu 20.04 operating system.
- Install Nvidia graphics drivers
- Install Nvidia CUDA drivers
- Install Nvidia CuDNN drivers
- Verify the Nvidia CUDA/CuDNN installation by downloading and running the cuDNN MNIST sample
- Install the docker-ce runtime environment (the Docker engine)
- Install Nvidia-docker (sometimes called nvidia-docker2 or nvidia-container-toolkit)
- Install docker-compose (ensure this is an up-to-date version -- 1.29.2 works well)
- Restart the Docker daemon
- Verify that the installed GPUs are reachable in docker using the following command:
sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
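As a quick sanity check on the stack installed above, the following commands should all succeed before moving on (exact version numbers will vary by machine; if nvcc is not found, /usr/local/cuda/bin may need to be added to your PATH):

```bash
# GPU(s) and driver version should be listed
nvidia-smi

# CUDA toolkit version reported by the compiler
nvcc --version

# docker-compose version (1.29.2 is known to work)
docker-compose --version
```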
- Ensure your user account is in the ai and docker user groups.
- Clone the Cosmos repository to a location in your home directory (for this working example we will use /home/my_username/Cosmos to refer to that location).
- Create a data directory structure to store COSMOS input, temp and output files (example commands are sketched below).
  - The root of my data directory structure will be /home/my_username/COSMOS-data/. Add the following sub-directories:
    - /home/my_username/COSMOS-data/input_files/
    - /home/my_username/COSMOS-data/output_files/
    - /home/my_username/COSMOS-data/tmp_files/
  - Change the file permissions on the root dir and all child dirs to allow full access with the command: chmod -R 777 /home/my_username/COSMOS-data/
    - This is necessary for COSMOS to read/write data in these dirs (execute permission (x) is also needed, for reasons that are not entirely clear).
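A minimal sketch of the steps above, assuming the example username and paths used throughout this page (the clone URL is a placeholder for whichever Cosmos fork your group uses, and the ai group is assumed to already exist on the server):

```bash
# Add yourself to the required groups (log out and back in for this to take effect)
sudo usermod -aG ai,docker my_username

# Clone the Cosmos repository into the home directory
git clone <cosmos_repo_url> /home/my_username/Cosmos

# Create the data directory tree and open up its permissions
mkdir -p /home/my_username/COSMOS-data/{input_files,output_files,tmp_files}
chmod -R 777 /home/my_username/COSMOS-data/
```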
- Create the file .env within the COSMOS deployment directory (which in our working example would be /home/my_username/Cosmos/deployment/) with the content that follows. NOTE: you will need to replace <absolute_path_to_COSMOS_data> for INPUT_DIR, TMP_DIR, and OUTPUT_DIR with the appropriate absolute path, which in our working example is /home/my_username/COSMOS-data.
BASE_IMAGE=uwcosmos/cosmos-base-c11:latest
DETECT_IMAGE=uwcosmos/cosmos-ingestion-c11:latest
WORKER_IMAGE=uwcosmos/cosmos-ingestion-c11:latest
RETRIEVAL_IMAGE=uwcosmos/cosmos-retrieval:latest
EXTRACTION_IMAGE=ankurgos/cosmos-extraction:latest
VISUALIZER_IMAGE=uwcosmos/visualizer_kb:latest
LINKING_IMAGE=uwcosmos/cosmos-linking:latest
UPLOAD_IMAGE=uwcosmos/cosmos-api:latest
API_IMAGE=uwcosmos/cosmos-api:latest
SCHEDULER_ADDRESS=scheduler:8786
ELASTIC_ADDRESS=es01:9200
NUM_PROCESSES=$WORKER_PROCS
WORKER_PROCS=4
DETECT_PROCS=1
# Default to GPU. On Kraken: use either :0 or :1 to choose specific GPU
DEVICE=cuda:1
RERANKING_DEVICE=cuda:1
# Uncomment to use CPUs
#DEVICE=cpu
#RERANKING_DEVICE=cpu
# Env vars for training
TRAINING_DIR=/path/to/training_data
VALIDATION_DIR=/path/to/validation_data/
CONFIG_DIR=${PWD}/deployment/configs
WEIGHTS_DIR=/path/for/training/output
# Env vars for primary pipeline
INPUT_DIR=<absolute_path_to_COSMOS_data>/input_files
TMP_DIR=<absolute_path_to_COSMOS_data>/tmp_files
OUTPUT_DIR=<absolute_path_to_COSMOS_data>/output_files
ELASTIC_DATA_PATH=/path/for/elasticsearch/
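Before choosing between DEVICE=cuda:0 and DEVICE=cuda:1 on a multi-GPU machine such as kraken, it can help to see which GPU is currently in use; a small optional check using standard nvidia-smi query flags:

```bash
# Show per-GPU index, name and memory usage so you can pick an idle device
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv
```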
- Replace the file docker-compose-ingest.yml (in the COSMOS deployment/ directory, in this example /home/my_username/Cosmos/deployment/) with the following content:
version: '3.4'
networks:
  swarm_network:
    driver: overlay
    attachable: true
services:
  scheduler:
    image: $BASE_IMAGE
    command: "dask-scheduler"
    ports:
      - 8787:8787
    networks:
      swarm_network:
  detect_worker:
    image: $DETECT_IMAGE
    environment:
      - MODEL_CONFIG=/configs/model_config.yaml
      - WEIGHTS_PTH=/weights/model_weights.pth
      - DEVICE
      - OMP_NUM_THREADS=2
    volumes:
      - ${INPUT_DIR}:/input
      - ${TMP_DIR}:/mytmp
      - ${OUTPUT_DIR}:/output
    command: "dask-worker tcp://scheduler:8786 --nprocs ${DETECT_PROCS} --nthreads 1 --memory-limit 0 --resources 'GPU=1' --preload ingest.preload_plugins.detect_setup"
    networks:
      swarm_network:
    depends_on:
      - scheduler
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
  worker:
    image: $WORKER_IMAGE
    environment:
      - OMP_NUM_THREADS=2
      - MODEL_CONFIG=/configs/model_config.yaml
      - CLASSES_PTH=/configs/model_config.yaml
      - PP_WEIGHTS_PTH=/weights/pp_model_weights.pth
    volumes:
      - ${INPUT_DIR}:/input
      - ${TMP_DIR}:/mytmp
      - ${OUTPUT_DIR}:/output
    command: "dask-worker tcp://scheduler:8786 --nprocs ${WORKER_PROCS} --nthreads 1 --memory-limit 0 --resources 'process=1' --preload ingest.preload_plugins.process_setup"
    networks:
      swarm_network:
    depends_on:
      - scheduler
  runner:
    image: $WORKER_IMAGE
    environment:
      - OMP_NUM_THREADS=2
      - MODEL_CONFIG=/configs/model_config.yaml
      - CLASSES_PTH=/configs/model_config.yaml
      - PP_WEIGHTS_PTH=/weights/pp_model_weights.pth
    volumes:
      - ${INPUT_DIR}:/input
      - ${TMP_DIR}:/mytmp
      - ${OUTPUT_DIR}:/output
    networks:
      swarm_network:
    depends_on:
      - scheduler
    command: "python3.8 -m ingest.scripts.ingest_documents --use-semantic-detection \
      --use-rules-postprocess --use-xgboost-postprocess -a pdfs -a sections -a tables -a figures -a equations \
      --no-compute-word-vecs --ngram 3 \
      --input-path /input --output-path /output --dataset-id documents_5Feb \
      --cluster tcp://scheduler:8786 --tmp-dir /mytmp"
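As an optional sanity check, docker-compose can print the fully interpolated compose file, which confirms that the variables in .env are being substituted. Note that docker-compose 1.x reads .env from the directory it is invoked in, so run this from the deployment/ directory (or pass --project-directory deployment):

```bash
cd /home/my_username/Cosmos/deployment
# Prints the compose file with all ${...} variables expanded; errors here usually
# point at a missing or misnamed entry in .env
docker-compose -f docker-compose-ingest.yml config
```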
Assuming you have already set up ssh tunneling access to the <ml4ai_machine> you wish to copy files to, along with a machine alias for <ml4ai_machine> in your .ssh config file, you can use a tool like rsync to copy one or more files:
rsync -avz <source_pdf(s)> <ml4ai_machine>:<path_to_INPUT_DIR>
... where <path_to_INPUT_DIR> is the full directory path specified as INPUT_DIR in the <Cosmos>/deployment/.env file.
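For a concrete (hypothetical) example, assuming an ssh alias named kraken and the example INPUT_DIR used throughout this page:

```bash
# Copy two local test PDFs into the COSMOS input directory on the server
rsync -avz ~/papers/test1.pdf ~/papers/test2.pdf \
    kraken:/home/my_username/COSMOS-data/input_files/
```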
- A prerequisite for running the pipeline is that you can run docker. Execute the following command to verify this:
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
If this succeeds, then you have verified that everything up to docker-compose is working.
- Next, verify that the docker swarm is running by executing
docker swarm init
This may produce the following error, indicating that the swarm is already running:
Error response from daemon: This node is already part of a swarm. Use "docker swarm leave" to leave this swarm and join another one.
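Alternatively, if you want to check the swarm state without potentially initializing a new swarm, the standard docker info template below is a read-only query:

```bash
# Prints "active" if this node is already part of a swarm, "inactive" otherwise
docker info --format '{{.Swarm.LocalNodeState}}'
```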
- Next, ensure the .pdf files you want to process have been placed in the INPUT_DIR directory (as specified in the deployment/.env config file). For initial testing, it is recommended that you place two .pdf files, to verify that multiple files can be processed.
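A quick check that the files are where the pipeline expects them, using the example INPUT_DIR from above:

```bash
# Should list the .pdf files you just copied in
ls -l /home/my_username/COSMOS-data/input_files/
```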
- Finally, run the COSMOS ingest pipeline:
docker-compose -f deployment/docker-compose-ingest.yml up
If everything is working, COSMOS will run to completion and then pause. To exit the run, use the ctrl-c key combination.
A successful run should generate, in the OUTPUT_DIR, parquet files for the main document as well as for equations, figures, sections and tables. These parquet files should then be ready for processing by Paul's script in AutoMATES: scripts/model_assembly/cosmos_integration.py
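Two optional conveniences (using the example paths from this page): run the stack in the background and follow its logs instead of holding the terminal open, then confirm the parquet outputs landed in OUTPUT_DIR.

```bash
# Run detached and follow the logs
docker-compose -f deployment/docker-compose-ingest.yml up -d
docker-compose -f deployment/docker-compose-ingest.yml logs -f

# After the run pauses/completes, tear the stack down and check the outputs
docker-compose -f deployment/docker-compose-ingest.yml down
ls -l /home/my_username/COSMOS-data/output_files/*.parquet
```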
- [2021-09-16 - on kraken]: It appears that only one docker swarm can run at a time. If more than one COSMOS execution is initiated at the same time, the most recent will execute to completion and the other appears to fail. This has not been rigorously tested. TODO: Follow up with Ian Ross @ UWisc.
- [2021-09-16 - COSMOS pipeline not working on carp]: Paul and Ian did a lot of work attempting to get COSMOS running on carp, but it does not work yet. The current recommendation is to try the following:
  - First: try uninstalling (in the reverse order from installation; a sketch of these commands is given below):
    - docker-compose
    - Nvidia-docker (sometimes called nvidia-docker2 or nvidia-container-toolkit)
    - docker-ce runtime environment
  - ... then try reinstalling in the original order, and attempt to proceed with the setup as described above.
  - Second: if the above does not work, consider doing a complete OS wipe of carp and rebuilding from the beginning.
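A sketch of that uninstall step, assuming the standard Ubuntu package names and a docker-compose installed as a standalone binary (adjust to match how these were actually installed on carp):

```bash
# Remove docker-compose (if it was installed as a standalone binary)
sudo rm -f /usr/local/bin/docker-compose

# Remove the NVIDIA container runtime packages
sudo apt-get purge -y nvidia-docker2 nvidia-container-toolkit

# Remove the Docker engine itself
sudo apt-get purge -y docker-ce docker-ce-cli containerd.io
sudo apt-get autoremove -y
```

Reinstalling then follows the original order (docker-ce, nvidia-docker2/nvidia-container-toolkit, docker-compose), after which the verification steps at the top of this page can be repeated.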