Baseline data #61

Draft: wants to merge 69 commits into main

Commits (69)
4da316e
testing warc
soldni Aug 28, 2023
7fc2c9c
ignore
soldni Aug 28, 2023
b35e6ee
testing slow
soldni Aug 28, 2023
4979ea7
langdetect
soldni Aug 28, 2023
b6f96c3
optional import
soldni Aug 28, 2023
bd903b2
refactoring
soldni Aug 29, 2023
8c9e00a
wip
soldni Aug 30, 2023
2d97f76
style
soldni Aug 30, 2023
d25287d
wip
soldni Sep 6, 2023
0350a5f
test
soldni Sep 7, 2023
c08bc6f
wip
soldni Sep 8, 2023
294ffca
configs
soldni Sep 20, 2023
32610e5
hash sample
soldni Sep 20, 2023
ab0d741
small improvements
soldni Sep 20, 2023
d0cde79
updated with output
soldni Sep 21, 2023
5562666
more details
soldni Sep 21, 2023
3909b7f
updated readme
soldni Sep 21, 2023
f1e463a
decon wip
soldni Sep 24, 2023
f9ed26d
new confits
soldni Sep 24, 2023
a3b08c3
taggging content
soldni Sep 24, 2023
07a745c
Merge pull request #49 from allenai/main
soldni Sep 24, 2023
ba2a413
changed name of file
soldni Sep 24, 2023
4b2fb1b
fixes
soldni Sep 24, 2023
534c3c2
deal with empty docs/local files
soldni Sep 24, 2023
0ccf67c
increased bloom size
soldni Sep 24, 2023
59559d8
configs for rest of splits
soldni Sep 25, 2023
39405e8
switching to option2
soldni Sep 25, 2023
8c2af40
forgot to do two more
soldni Sep 25, 2023
c1c5b54
finding puctuation
soldni Sep 26, 2023
abaf44d
tokenizer porting
soldni Sep 26, 2023
1363dff
configs
soldni Sep 27, 2023
637ee26
books config
soldni Sep 27, 2023
b387f6a
more sources
soldni Sep 27, 2023
a099d59
configs
soldni Sep 27, 2023
5fe9e2b
updated paths
soldni Sep 27, 2023
f7796d7
new c4
soldni Sep 27, 2023
a8ff9dc
cleaned up
soldni Sep 27, 2023
33cb671
sampling
soldni Sep 27, 2023
9cd6dcc
sample
soldni Sep 27, 2023
6b369b4
sampling
soldni Sep 28, 2023
2cebbe2
added tokenizer
soldni Sep 28, 2023
7168c8b
update all
soldni Sep 28, 2023
e60bb46
style
soldni Sep 28, 2023
cae806a
updated
soldni Sep 28, 2023
d64f225
configs
soldni Sep 28, 2023
383c1cb
tokenizer cli wip
soldni Sep 28, 2023
14b4724
cli
soldni Oct 2, 2023
1e17a8f
wip big refactor
soldni Oct 5, 2023
110eaee
fixed small bugs
soldni Oct 6, 2023
e2d3f75
tokenizer log
soldni Oct 6, 2023
9c76d04
fixed tokenizer paths
soldni Oct 6, 2023
b1d48a9
added tokenizer small
soldni Oct 6, 2023
d1659f1
fixed glob issue
soldni Oct 13, 2023
e00b8c3
falcon baseline thru mixing
IanMagnusson Oct 19, 2023
3fe273e
trying to debug decon
IanMagnusson Oct 24, 2023
2b97d61
Merge branch 'main' into baselines
soldni Oct 25, 2023
591b6bc
Merge remote-tracking branch 'origin/main' into baselines
IanMagnusson Oct 26, 2023
6d78c2a
pile
IanMagnusson Oct 26, 2023
43ea152
redpajama
IanMagnusson Oct 28, 2023
b30b206
Had to remove this to make decon work.
IanMagnusson Oct 28, 2023
c90345b
bump dolma version
IanMagnusson Nov 7, 2023
921e7ad
pile tokenization
IanMagnusson Nov 8, 2023
50ab5a9
falcon
IanMagnusson Nov 8, 2023
5f30bd4
fix pile tokenization
IanMagnusson Nov 18, 2023
48e323e
c4 decon
IanMagnusson Nov 18, 2023
b09a89c
c4 mixing
IanMagnusson Nov 18, 2023
50ac33b
c4 tokenization
IanMagnusson Nov 18, 2023
dfb1c3b
mc4
IanMagnusson Nov 21, 2023
16295d8
cc only dolma
IanMagnusson Dec 3, 2023
Files changed (changes from all commits)
2 changes: 1 addition & 1 deletion Cargo.lock

Some generated files are not rendered by default.

20 changes: 20 additions & 0 deletions configs/baselines/decontamination/c4.yaml
@@ -0,0 +1,20 @@
documents:
- s3://ai2-llm/pretraining-data/sources/c4/v0/documents/train/*.gz

dedupe:
  name: perplexity_suite_v3_option2_redo
  paragraphs:
    attribute_name: bff_duplicate_paragraph_spans_decontamination
  skip_empty: true

bloom_filter:
  read_only: true
  estimated_doc_count: 488541
  size_in_bytes: 33554432  # 32 MiB; smaller causes too many false positives
  file: s3://ai2-llm/bloom-filters/perplexity-suite-v3_option2.bin

processes: 224

work_dir:
  input: /mnt/tank/dolma_tmp/c4_input
  output: /mnt/tank/dolma_tmp/c4_output
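
For reference, these decontamination configs drive a paragraph-level dedupe pass: each paragraph of each training document is hashed and looked up in the read-only Bloom filter named above (perplexity-suite-v3_option2.bin), which has presumably been pre-populated with paragraphs from the perplexity evaluation suite, and hits are recorded as spans under the configured attribute_name. The Python below is a conceptual sketch only, not the dolma implementation: a plain set stands in for the Bloom filter, and the hashing and span layout are illustrative assumptions (the [start, end, score] layout matches what the mixing configs later test with [0][2] >= 1.0).

import hashlib

# Illustrative stand-in for the read-only Bloom filter referenced in the config
# above; a real Bloom filter is a fixed-size probabilistic structure, a set is exact.
eval_paragraph_hashes: set[str] = set()  # assumed to be loaded from the eval suite

def hash_paragraph(paragraph: str) -> str:
    return hashlib.sha256(paragraph.strip().encode("utf-8")).hexdigest()

def tag_document(doc: dict) -> dict:
    """Return an attributes record marking paragraphs that hit the filter."""
    spans = []
    offset = 0
    for paragraph in doc["text"].split("\n"):
        end = offset + len(paragraph)
        if paragraph.strip() and hash_paragraph(paragraph) in eval_paragraph_hashes:
            # span layout: [start_char, end_char, score]
            spans.append([offset, end, 1.0])
        offset = end + 1  # account for the newline separator
    return {
        "id": doc["id"],
        "attributes": {"bff_duplicate_paragraph_spans_decontamination": spans},
    }
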
20 changes: 20 additions & 0 deletions configs/baselines/decontamination/falcon-refinedweb.yaml
@@ -0,0 +1,20 @@
documents:
- s3://ai2-llm/pretraining-data/sources/falcon-refinedweb/v0-0.05-heldout-complement/documents/*.gz

dedupe:
  name: perplexity_suite_v3_option2
  paragraphs:
    attribute_name: bff_duplicate_paragraph_spans_decontamination
  skip_empty: true

bloom_filter:
  read_only: true
  estimated_doc_count: 488541
  size_in_bytes: 33554432  # 32 MiB; smaller causes too many false positives
  file: s3://ai2-llm/bloom-filters/perplexity-suite-v3_option2.bin

processes: 224

work_dir:
  input: /mnt/tank/dolma_tmp/falcon_input
  output: /mnt/tank/dolma_tmp/falcon_output
20 changes: 20 additions & 0 deletions configs/baselines/decontamination/mc4.yaml
@@ -0,0 +1,20 @@
documents:
- s3://ai2-llm/pretraining-data/sources/mc4/en_wimbd_splits/documents/train/*.gz

dedupe:
  name: perplexity_suite_v3_option2
  paragraphs:
    attribute_name: bff_duplicate_paragraph_spans_decontamination
  skip_empty: true

bloom_filter:
  read_only: true
  estimated_doc_count: 488541
  size_in_bytes: 33554432  # 32 MiB; smaller causes too many false positives
  file: s3://ai2-llm/bloom-filters/perplexity-suite-v3_option2.bin

processes: 224

work_dir:
  input: /mnt/tank/dolma_tmp/mc4_input
  output: /mnt/tank/dolma_tmp/mc4_output
20 changes: 20 additions & 0 deletions configs/baselines/decontamination/pile.yaml
@@ -0,0 +1,20 @@
documents:
- s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/*.gz

dedupe:
  name: perplexity_suite_v3_option2
  paragraphs:
    attribute_name: bff_duplicate_paragraph_spans_decontamination
  skip_empty: true

bloom_filter:
  read_only: true
  estimated_doc_count: 488541
  size_in_bytes: 33554432  # 32 MiB; smaller causes too many false positives
  file: s3://ai2-llm/bloom-filters/perplexity-suite-v3_option2.bin

processes: 224

work_dir:
  input: /mnt/tank/dolma_tmp/pile_input
  output: /mnt/tank/dolma_tmp/pile_output
25 changes: 25 additions & 0 deletions configs/baselines/decontamination/redpajama.yaml
@@ -0,0 +1,25 @@
documents:
- s3://ai2-llm/pretraining-data/sources/redpajama/v1/documents/split=train/dataset=arxiv/*.gz
- s3://ai2-llm/pretraining-data/sources/redpajama/v1/documents/split=train/dataset=book/*.gz
- s3://ai2-llm/pretraining-data/sources/redpajama/v1/documents/split=train/dataset=c4/*.gz
- s3://ai2-llm/pretraining-data/sources/redpajama/v1/documents/split=train/dataset=common_crawl/*.gz
- s3://ai2-llm/pretraining-data/sources/redpajama/v1/documents/split=train/dataset=stackexchange/*.gz
- s3://ai2-llm/pretraining-data/sources/redpajama/v1/documents/split=train/dataset=wikipedia/*.gz

dedupe:
  name: perplexity_suite_v3_option2
  paragraphs:
    attribute_name: bff_duplicate_paragraph_spans_decontamination
  skip_empty: true

bloom_filter:
  read_only: true
  estimated_doc_count: 488541
  size_in_bytes: 33554432  # 32 MiB; smaller causes too many false positives
  file: s3://ai2-llm/bloom-filters/perplexity-suite-v3_option2.bin

processes: 224

work_dir:
  input: /mnt/tank/dolma_tmp/rp_input
  output: /mnt/tank/dolma_tmp/rp_output
27 changes: 27 additions & 0 deletions configs/baselines/mixing/c4.json
@@ -0,0 +1,27 @@
{
"streams": [
{
"name": "c4",
"documents": [
"s3://ai2-llm/pretraining-data/sources/c4/v0/documents/train/*.gz"
],
"output": {
"path": "s3://ai2-llm/pretraining-data/sources/c4/v0_decon_ppl_suite_v3",
"max_size_in_bytes": 1000000000
},
"attributes": [
"perplexity_suite_v3_option2_redo"
],
"filter": {
"exclude": [
"[email protected][?(@.bff_duplicate_paragraph_spans_decontamination && @.bff_duplicate_paragraph_spans_decontamination[0] && @.bff_duplicate_paragraph_spans_decontamination[0][2] >= 1.0)]"
]
}
}
],
"work_dir": {
"input" : "/mnt/tank/dolma_tmp/c4_input_mix",
"output" : "/mnt/tank/dolma_tmp/c4_output_mix"
},
"processes": 1
}
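
The mixing configs pair each documents stream with the attributes produced by the decontamination step and exclude any document whose decontamination attribute exists and whose first span carries a score of at least 1.0; the exclude entry above is a JSONPath-style predicate over that attributes record, evaluated by the dolma mixer. A rough Python equivalent of the same predicate, under the assumption that each documents shard has a line-aligned attributes shard (all paths and function names here are illustrative):

import gzip
import json

ATTR = "bff_duplicate_paragraph_spans_decontamination"

def is_contaminated(attr_row: dict) -> bool:
    # Mirrors the exclude expression: attribute present, has at least one
    # span, and the first span's score is >= 1.0.
    spans = attr_row.get("attributes", {}).get(ATTR)
    return bool(spans) and spans[0][2] >= 1.0

def mix_shard(documents_path: str, attributes_path: str, output_path: str) -> None:
    """Copy documents to the output shard, skipping contaminated ones."""
    with gzip.open(documents_path, "rt") as docs, \
         gzip.open(attributes_path, "rt") as attrs, \
         gzip.open(output_path, "wt") as out:
        for doc_line, attr_line in zip(docs, attrs):
            if not is_contaminated(json.loads(attr_line)):
                out.write(doc_line)
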
27 changes: 27 additions & 0 deletions configs/baselines/mixing/falcon-refinedweb.json
@@ -0,0 +1,27 @@
{
"streams": [
{
"name": "falcon-refinedweb",
"documents": [
"s3://ai2-llm/pretraining-data/sources/falcon-refinedweb/v0-0.05-heldout-complement/documents/*.gz"
],
"output": {
"path": "s3://ai2-llm/pretraining-data/sources/falcon-refinedweb/v0-0.05-heldout-complement_decon_ppl_suite_v3",
"max_size_in_bytes": 1000000000
},
"attributes": [
"perplexity_suite_v3_option2"
],
"filter": {
"exclude": [
"[email protected][?(@.bff_duplicate_paragraph_spans_decontamination && @.bff_duplicate_paragraph_spans_decontamination[0] && @.bff_duplicate_paragraph_spans_decontamination[0][2] >= 1.0)]"
]
}
}
],
"work_dir": {
"input" : "/mnt/tank/dolma_tmp/falcon_input_mix",
"output" : "/mnt/tank/dolma_tmp/falcon_output_mix"
},
"processes": 1
}
27 changes: 27 additions & 0 deletions configs/baselines/mixing/mc4.json
@@ -0,0 +1,27 @@
{
"streams": [
{
"name": "mc4",
"documents": [
"s3://ai2-llm/pretraining-data/sources/mc4/en_wimbd_splits/documents/train/*.gz"
],
"output": {
"path": "s3://ai2-llm/pretraining-data/sources/mc4/en_wimbd_splits_decon_ppl_suite_v3/",
"max_size_in_bytes": 1000000000
},
"attributes": [
"perplexity_suite_v3_option2"
],
"filter": {
"exclude": [
"[email protected][?(@.bff_duplicate_paragraph_spans_decontamination && @.bff_duplicate_paragraph_spans_decontamination[0] && @.bff_duplicate_paragraph_spans_decontamination[0][2] >= 1.0)]"
]
}
}
],
"work_dir": {
"input" : "/mnt/tank/dolma_tmp/mc4_input_mix",
"output" : "/mnt/tank/dolma_tmp/mc4_output_mix"
},
"processes": 1
}
27 changes: 27 additions & 0 deletions configs/baselines/mixing/pile.json
@@ -0,0 +1,27 @@
{
"streams": [
{
"name": "pile",
"documents": [
"s3://ai2-llm/pretraining-data/sources/pile/v0/documents/train/*.gz"
],
"output": {
"path": "s3://ai2-llm/pretraining-data/sources/pile/v0_decon_ppl_suite_v3",
"max_size_in_bytes": 1000000000
},
"attributes": [
"perplexity_suite_v3_option2"
],
"filter": {
"exclude": [
"[email protected][?(@.bff_duplicate_paragraph_spans_decontamination && @.bff_duplicate_paragraph_spans_decontamination[0] && @.bff_duplicate_paragraph_spans_decontamination[0][2] >= 1.0)]"
]
}
}
],
"work_dir": {
"input" : "/mnt/tank/dolma_tmp/pile_input_mix",
"output" : "/mnt/tank/dolma_tmp/pile_output_mix"
},
"processes": 1
}
32 changes: 32 additions & 0 deletions configs/baselines/mixing/redpajama.json
@@ -0,0 +1,32 @@
{
"streams": [
{
"name": "redpajama",
"documents": [
"s3://ai2-llm/pretraining-data/sources/redpajama/v1/documents/split=train/dataset=arxiv/*.gz",
"s3://ai2-llm/pretraining-data/sources/redpajama/v1/documents/split=train/dataset=book/*.gz",
"s3://ai2-llm/pretraining-data/sources/redpajama/v1/documents/split=train/dataset=c4/*.gz",
"s3://ai2-llm/pretraining-data/sources/redpajama/v1/documents/split=train/dataset=common_crawl/*.gz",
"s3://ai2-llm/pretraining-data/sources/redpajama/v1/documents/split=train/dataset=stackexchange/*.gz",
"s3://ai2-llm/pretraining-data/sources/redpajama/v1/documents/split=train/dataset=wikipedia/*.gz"
],
"output": {
"path": "s3://ai2-llm/pretraining-data/sources/redpajama/v1_decon_ppl_suite_v3",
"max_size_in_bytes": 1000000000
},
"attributes": [
"perplexity_suite_v3_option2"
],
"filter": {
"exclude": [
"[email protected][?(@.bff_duplicate_paragraph_spans_decontamination && @.bff_duplicate_paragraph_spans_decontamination[0] && @.bff_duplicate_paragraph_spans_decontamination[0][2] >= 1.0)]"
]
}
}
],
"work_dir": {
"input" : "/mnt/tank/dolma_tmp/rp_input_mix",
"output" : "/mnt/tank/dolma_tmp/rp_output_mix"
},
"processes": 1
}
9 changes: 9 additions & 0 deletions configs/baselines/tokenization/c4.yaml
@@ -0,0 +1,9 @@
destination: s3://ai2-llm/preprocessed/c4/v0_decon_ppl_suite_v3/gpt-neox-20b-pii-special
documents:
- s3://ai2-llm/pretraining-data/sources/c4/v0_decon_ppl_suite_v3/*.json.gz
processes: 224
seed: 3920
tokenizer_name_or_path: allenai/eleuther-ai-gpt-neox-20b-pii-special
work_dir:
  input: /mnt/tank/dolma_tmp/c4_input_tokenized
  output: /mnt/tank/dolma_tmp/c4_output_tokenized
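
The tokenization configs run the decontaminated documents through the allenai/eleuther-ai-gpt-neox-20b-pii-special tokenizer and write token IDs to the preprocessed destination. Below is a minimal sketch of the per-shard transformation, assuming the tokenizer loads via Hugging Face AutoTokenizer and that one EOS token is appended after each document (a common convention, assumed here). The dolma tokenizer step handles the actual sharding, seeding, and on-disk layout; the uint16 dtype is an illustrative choice that fits the roughly 50k GPT-NeoX vocabulary.

import gzip
import json

import numpy as np
from transformers import AutoTokenizer

# Tokenizer named in the config above.
tokenizer = AutoTokenizer.from_pretrained(
    "allenai/eleuther-ai-gpt-neox-20b-pii-special"
)

def tokenize_shard(path: str) -> np.ndarray:
    """Concatenate token IDs for every document in one .json.gz shard,
    appending the EOS token after each document."""
    ids: list[int] = []
    with gzip.open(path, "rt") as f:
        for line in f:
            doc = json.loads(line)
            ids.extend(tokenizer.encode(doc["text"]))
            ids.append(tokenizer.eos_token_id)
    return np.array(ids, dtype=np.uint16)
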
11 changes: 11 additions & 0 deletions configs/baselines/tokenization/dolma_v1_5_cc_only.yaml
@@ -0,0 +1,11 @@
destination: s3://ai2-llm/preprocessed/olmo-mix/v1_5_cc_only/gpt-neox-20b-pii-special/
documents:
- s3://ai2-llm/pretraining-data/sources/olmo-mix/v1_5/documents/cc_en_head/*.json.gz
- s3://ai2-llm/pretraining-data/sources/olmo-mix/v1_5/documents/cc_en_middle/*.json.gz
- s3://ai2-llm/pretraining-data/sources/olmo-mix/v1_5/documents/cc_en_tail/*.json.gz
processes: 224
seed: 3920
tokenizer_name_or_path: allenai/eleuther-ai-gpt-neox-20b-pii-special
work_dir:
  input: /mnt/tank/dolma_tmp/v1_5_cc_only_input_tokenized
  output: /mnt/tank/dolma_tmp/v1_5_cc_only_output_tokenized
9 changes: 9 additions & 0 deletions configs/baselines/tokenization/falcon-refinedweb.yaml
@@ -0,0 +1,9 @@
destination: s3://ai2-llm/preprocessed/falcon-refinedweb/v0-0.05-heldout-complement_decon_ppl_suite_v3/gpt-neox-20b-pii-special
documents:
- s3://ai2-llm/pretraining-data/sources/falcon-refinedweb/v0-0.05-heldout-complement_decon_ppl_suite_v3/*.json.gz
processes: 224
seed: 3920
tokenizer_name_or_path: allenai/eleuther-ai-gpt-neox-20b-pii-special
work_dir:
  input: /mnt/tank/dolma_tmp/falcon_input_tokenized
  output: /mnt/tank/dolma_tmp/falcon_output_tokenized
9 changes: 9 additions & 0 deletions configs/baselines/tokenization/mc4.yaml
@@ -0,0 +1,9 @@
destination: s3://ai2-llm/preprocessed/mc4/en_wimbd_splits_decon_ppl_suite_v3/gpt-neox-20b-pii-special
documents:
- s3://ai2-llm/pretraining-data/sources/mc4/en_wimbd_splits_decon_ppl_suite_v3/*.json.gz
processes: 224
seed: 3920
tokenizer_name_or_path: allenai/eleuther-ai-gpt-neox-20b-pii-special
work_dir:
  input: /mnt/tank/dolma_tmp/mc4_input_tokenized
  output: /mnt/tank/dolma_tmp/mc4_output_tokenized
9 changes: 9 additions & 0 deletions configs/baselines/tokenization/pile.yaml
@@ -0,0 +1,9 @@
destination: s3://ai2-llm/preprocessed/pile/v0_decon_ppl_suite_v3_fixed/gpt-neox-20b-pii-special
documents:
- s3://ai2-llm/pretraining-data/sources/pile/v0_decon_ppl_suite_v3/*.json.gz
processes: 150
seed: 3920
tokenizer_name_or_path: allenai/eleuther-ai-gpt-neox-20b-pii-special
work_dir:
  input: /mnt/tank/tmp/pile_v0_decon_ppl_suite_v3_fixed_input
  output: /mnt/tank/tmp/pile_v0_decon_ppl_suite_v3_fixed_output
9 changes: 9 additions & 0 deletions configs/baselines/tokenization/redpajama.yaml
@@ -0,0 +1,9 @@
destination: s3://ai2-llm/preprocessed/redpajama/v1_decon_ppl_suite_v3/gpt-neox-20b-pii-special
documents:
- s3://ai2-llm/pretraining-data/sources/redpajama/v1_decon_ppl_suite_v3/*.json.gz
processes: 224
seed: 3920
tokenizer_name_or_path: allenai/eleuther-ai-gpt-neox-20b-pii-special
work_dir:
  input: /mnt/tank/dolma_tmp/rp_input_tokenized
  output: /mnt/tank/dolma_tmp/rp_output_tokenized
3 changes: 3 additions & 0 deletions configs/dolma-v1_5/README.md
@@ -0,0 +1,3 @@
# Dolma 1.5

This directory