instructions
soldni committed Oct 24, 2024
1 parent 42be092 commit 476d6d0
Showing 2 changed files with 9 additions and 3 deletions.
10 changes: 8 additions & 2 deletions classifiers/README.md
@@ -19,6 +19,12 @@ python -m dolma_classifiers.inference \
-m HuggingFaceFW/fineweb-edu-classifier
```

Run [NVIDIA's Deberta quality classifier](https://huggingface.co/nvidia/quality-classifier-deberta) on S3 data with model compilation:

<!-- Run [NVIDIA's Deberta quality classifier](https://huggingface.co/nvidia/quality-classifier-deberta) on S3 data:
-->
```bash
python -m dolma_classifiers.inference \
-s 's3://ai2-llm/pretraining-data/sources/dclm/v0/documents/40b-split/*/*zstd' \
-m nvidia/quality-classifier-deberta \
--model-compile \
--max-length 1024
```
2 changes: 1 addition & 1 deletion classifiers/scripts/nvidia-deberta-100b.sh
@@ -5,7 +5,7 @@ DOCUMENTS='s3://ai2-llm/pretraining-data/sources/dclm/v0/documents/100*/*.jsonl.
NUM_NODES=2
MODEL_NAME="nvidia/quality-classifier-deberta"
CLUSTER="ai2/jupiter*"
-BATCH_SIZE=1024
+BATCH_SIZE=512
PRIORITY="high"

# Generate a hash for the run name by combining model name and documents
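
Note on the batch-size change: the commit message does not say why `BATCH_SIZE` was halved from 1024 to 512. One direct consequence is that the same data needs about twice as many batches. The sketch below is only an illustration and is not part of the commit; `batches_needed` is a hypothetical helper and the shard size of 100,000 documents is a made-up figure.

```python
import math


def batches_needed(num_documents: int, batch_size: int) -> int:
    """Return how many batches are needed to score every document once."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # The last batch may be partially filled, hence the ceiling.
    return math.ceil(num_documents / batch_size)


# Hypothetical shard size, chosen only for illustration.
SHARD_DOCS = 100_000

old_steps = batches_needed(SHARD_DOCS, 1024)  # value before this commit
new_steps = batches_needed(SHARD_DOCS, 512)   # value after this commit
print(f"batch size 1024: {old_steps} batches, batch size 512: {new_steps} batches")
```

Each batch at the new setting holds half as many documents, so per-batch memory use should be lower, though the actual effect depends on sequence length and the model.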