Adding Tokenizer, Writing Documentation, Misc Bugs & CLI improvements (#54)

* testing warc

* ignore

* testing slow

* langdetect

* optional import

* refactoring

* wip

* style

* wip

* test

* wip

* configs

* hash sample

* small improvements

* updated with output

* more details

* updated readme

* decon wip

* new configs

* tagging content

* changed name of file

* fixes

* deal with empty docs/local files

* increased bloom size

* configs for rest of splits

* switching to option2

* forgot to do two more

* finding punctuation

* tokenizer porting

* configs

* books config

* more sources

* configs

* updated paths

* new c4

* cleaned up

* sampling

* sample

* sampling

* added tokenizer

* update all

* style

* updated

* configs

* tokenizer cli wip

* cli

* wip big refactor

* fixed small bugs

* tokenizer log

* fixed tokenizer paths

* added tokenizer small

* fixed glob issue

* removed temporary directory

* added todo

* conversion script

* more writing

* more docs

* more docs

* logos

* pipelines

* datasheet

* wip

* adding script to make wikipedia

* wip

* more text

* more docs!

* new examples.

* documentation

* fixed bug local file

* lint

* removing docs while they are wip

* reverted bug

* style

* quoting queue to ensure 3.8 compatibility

* quoting queue to ensure 3.8 compatibility
soldni authored Oct 15, 2023
1 parent 9c4d960 commit 1728f4f
Showing 75 changed files with 2,925 additions and 341 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -63,3 +63,4 @@ target/

# ignoring test output
/tests/work/
/python/dolma/core/warc
2 changes: 1 addition & 1 deletion Makefile
@@ -31,7 +31,7 @@ publish:
test: test-python test-rust

test-python:
pytest -vs tests/python
pytest -vsx tests/python
rm -rf tests/work/*

test-rust:
129 changes: 32 additions & 97 deletions README.md
@@ -1,118 +1,53 @@
<img alt="Dolma's official logo. It's dolma written in yellow, round lowercase letters over a blue background." src="https://github.com/allenai/dolma/blob/main/res/logo.png?raw=true" width="100%">
<img alt="Dolma's official logo. It's dolma written in yellow, round lowercase letters over a blue background." src="https://raw.githubusercontent.com/allenai/dolma/main/docs/assets/AI2_Blog_1400x685_2x.webp" width="100%">

Dolma is two things:

Dolma is an open dataset of 3 trillion tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.
It was created as a training corpus for [OLMo](https://allenai.org/olmo), AI2's language model.

Dolma is available for download on the HuggingFace 🤗 Hub: [`huggingface.co/datasets/allenai/dolma`](https://huggingface.co/datasets/allenai/dolma). To access Dolma, users must agree to the terms of the [AI2 ImpACT License for Medium Risk Artifacts](https://allenai.org/licenses/impact-mr).
You can also read more about Dolma in [our announcement](https://blog.allenai.org/dolma-3-trillion-tokens-open-llm-corpus-9a0ff4b8da64), as well as by consulting its [data sheet](https://drive.google.com/file/d/12gOf5I5RytsD159nSP7iim_5zN31FCXq/view?usp=drive_link).

This repository contains tools for generating and inspecting Dolma. To get started, install the Dolma Python library from [PyPI](https://pypi.org/project/dolma/).

```shell
pip install dolma
```

## Usage

The dolma CLI can be accessed using the `dolma` command. To see the available commands, use the `--help` flag.

```shell
dolma --help
```

At the moment, the CLI supports three commands: `tag`, `dedupe`, and `mix`.
1. **Dolma Dataset**: an open dataset of 3 trillion tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.
2. **Dolma Toolkit**: a high-performance toolkit for curating datasets for language modeling.

For all commands, configurations can be specified from the command line, or by passing a YAML or JSON file using the `-c` flag. For example:
## Dolma Dataset

```shell
dolma -c config.yaml dedupe --dedupe.name "test"
```

### The `tag` command

The tag command is used to run any of the built-in taggers on a set of documents. For example:

```shell
dolma tag \
--experiment sample \
--documents \
's3://ai2-llm/pretraining-data/sources/common-crawl/test/v0/documents/**/*.json.gz' \
's3://ai2-llm/pretraining-data/sources/common-crawl/test/v1/documents/*.json.gz' \
--taggers random_number_v1 \
--processes 2
```

This command will run the `random_number_v1` tagger on all documents in the specified S3 paths. The results will be written to the `s3://ai2-llm/pretraining-data/sources/common-crawl/test/v0/attributes/sample` and `s3://ai2-llm/pretraining-data/sources/common-crawl/test/v1/attributes/sample` paths.

### The `dedupe` command

The dedupe command is used to deduplicate a set of documents at the attribute level using a Bloom filter.
For example configurations, see the `tests/config` directory; for instance:

```shell
dolma dedupe -c tests/config/dedupe-paragraphs.json
```

### The `mix` command

The mix command is used to mix documents from multiple sources, optionally filtering by attributes and/or performing string replacement. For example configurations, see the `tests/config` directory; for instance:

```shell
dolma mix -c tests/config/mixer.json
```


## Development

Create a conda environment with Python >= 3.8. In this case, we use Python 3.10 and use Anaconda to create the environment.
Dolma is an open dataset of 3 trillion tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.
It was created as a training corpus for [OLMo](https://allenai.org/olmo), a language model from the [Allen Institute for AI](https://allenai.org) (AI2).

```shell
conda create -n dolma python=3.10
```
Dolma is available for download on the HuggingFace 🤗 Hub: [`huggingface.co/datasets/allenai/dolma`](https://huggingface.co/datasets/allenai/dolma). To access Dolma, users must agree to the terms of the [AI2 ImpACT License for Medium Risk Artifacts](https://allenai.org/licenses/impact-mr).

After creating the environment, activate it and install necessary tools using the included makefile.
You can also read more about Dolma in [our announcement](https://blog.allenai.org/dolma-3-trillion-tokens-open-llm-corpus-9a0ff4b8da64), as well as by consulting its [data sheet](docs/assets/dolma-datasheet-v0.1.pdf).

```shell
conda activate dolma
make setup
```

and restart your shell. Finally, to begin development, install the repository in editable mode using maturin.
## Dolma Toolkit

```shell
make develop
```
Dolma is a toolkit to curate large datasets for (pre)-training ML models. Its key features are:

To run tests, use the following command.
1. **High Performance** ⚡: Can process billions of documents concurrently thanks to built-in parallelism.
2. **Portability** 🧳: Works on a single machine, a cluster, or a cloud environment.
3. **Built-In Taggers** 🏷: Includes ready-to-use taggers commonly used to curate datasets such as [Gopher](https://arxiv.org/abs/2112.11446), [C4](https://arxiv.org/abs/1910.10683), and [OpenWebText](https://openwebtext2.readthedocs.io/en/latest/).
4. **Fast Deduplication** 🗑: Speedy document deduplication using a Rust Bloom filter.
5. **Extensibility** 🧩 & **Cloud Support** ☁: Supports custom taggers and AWS S3-compatible locations.

```shell
make test
```
You can choose to run just the Python or Rust tests by calling `make test-python` or `make test-rust` respectively.
To install, simply type `pip install dolma` in your terminal.

You can skip S3-related tests by exporting `DOLMA_TESTS_SKIP_AWS=True`.
To learn more about how to use the Dolma Toolkit, please visit the [documentation](/docs).

```shell
DOLMA_TESTS_SKIP_AWS=True make test
```
## Citation

## Contributing
If you use the Dolma dataset or toolkit, please cite the following items:

Before committing, use the following command:

```shell
make style
```

```bibtex
@techreport{DolmaDataset,
author = {Soldaini, Luca and Kinney, Rodney and Bhagia, Akshita and Schwenk, Dustin and Atkinson, David and Authur, Russell and Chandu, Khyathi and Dumas, Jennifer and Lucy, Li and Lyu, Xinxi and Magnusson, Ian and Naik, Aakanksha and Nam, Crystal and Peters, Matthew E. and Ravichander, Abhilasha and Shen, Zejiang and Strubell, Emma and Subramani, Nishant and Tafjord, Oyvind and Walsh, Evan Pete and Hajishirzi, Hannaneh and Smith, Noah A. and Zettlemoyer, Luke and Beltagy, Iz and Groeneveld, Dirk and Dodge, Jesse and Lo, Kyle},
title = {{Dolma: An Open Corpus of 3 Trillion Tokens for Language Model Pretraining Research}},
institution = {{Allen Institute for AI}},
year = {2023},
note = {Released under ImpACT License as Medium Risk artifact, \url{https://github.com/allenai/dolma}}
}
```

## Citation

If you use this repository, please cite it as:

```bibtex
@software{dolma,
@software{DolmaToolkit,
author = {{Soldaini, Luca and Lo, Kyle and Kinney, Rodney and Naik, Aakanksha and Ravichander, Abhilasha and Bhagia, Akshita and Groeneveld, Dirk and Schwenk, Dustin and Magnusson, Ian and Chandu, Khyathi}},
license = {{Apache-2.0}},
title = {{Dolma}},
url = {https://github.com/allenai/dolma}
title = {{The Dolma Toolkit}},
year = {2023},
note = {{Apache 2.0 License, Version \texttt{0.9.0}, \url{https://github.com/allenai/dolma}}}
}
```
File renamed without changes.
File renamed without changes.
@@ -17,7 +17,7 @@ class TrafilaturaReformatter(BaseParallelProcessor):
    @classmethod
    def increment_progressbar(  # type: ignore
        cls,
        queue: Queue[Union[Tuple[int, ...], None]],
        queue: "Queue[Union[Tuple[int, ...], None]]",
        /,
        files: int = 0,
        documents: int = 0,
@@ -26,7 +26,7 @@ def increment_progressbar(  # type: ignore

    @classmethod
    def process_single(
        cls, source_path: str, destination_path: str, queue: Queue[Union[Tuple[int, ...], None]], **kwargs: Any
        cls, source_path: str, destination_path: str, queue: "Queue[Union[Tuple[int, ...], None]]", **kwargs: Any
    ):
        documents = 0
        interval = 10_000
File renamed without changes.
File renamed without changes.
44 changes: 44 additions & 0 deletions docs/README.md
@@ -0,0 +1,44 @@
# Dolma Toolkit Documentation


Dolma is a toolkit to curate datasets for pretraining AI models. Reasons to use the Dolma toolkit are:

- **High performance** ⚡️ Dolma is designed to be highly performant, and can be used to process datasets with billions of documents in parallel.
- **Portable** 🧳 Dolma can be run on a single machine, a cluster, or a cloud computing environment.
- **Built-in taggers** 🏷 Dolma comes with a number of built-in taggers, including language detection, toxicity detection, perplexity scoring, and common filtering recipes, such as the ones used to create [Gopher](https://arxiv.org/abs/2112.11446) and [C4](https://arxiv.org/abs/1910.10683).
- **Fast deduplication** 🗑 Dolma can deduplicate documents using a Rust-based Bloom filter, which is significantly faster than other methods.
- **Extensible** 🧩 Dolma is designed to be extensible, and can be extended with custom taggers.
- **Cloud support** ☁️ Dolma supports reading and writing data from local disk and AWS S3-compatible locations.

Dataset curation with Dolma usually happens in four steps:

1. Using **taggers**, spans of documents in a dataset are tagged with properties (e.g. the language they are in, toxicity, etc.);
2. Documents are optionally **deduplicated** based on their content or metadata;
3. Using the **mixer**, documents are removed or filtered depending on the value of their attributes;
4. Finally, documents can be **tokenized** using any [HuggingFace-compatible tokenizer](https://huggingface.co/docs/tokenizers/index).

![The four steps of dataset curation with Dolma.](assets/diagram.webp)

Dolma can be installed using `pip`:

```shell
pip install dolma
```

Dolma can be used either as a Python library or as a command line tool. The command line tool can be accessed using the `dolma` command. To see the available commands, use the `--help` flag.

```shell
dolma --help
```

## Index

To read Dolma's documentation, visit the following pages:

- [Getting Started](getting-started.md)
- [Data Format](data-format.md)
- [Taggers](taggers.md)
- [Deduplication](deduplication.md)
- [Mixer](mixer.md)
- [Tokenization](tokenize.md)
- [Contributing to Dolma](develop.md)
Binary file added docs/assets/AI2_Blog_1400x685.png
Binary file added docs/assets/AI2_Blog_1400x685.webp
Binary file added docs/assets/AI2_Blog_1400x685_2x.png
Binary file added docs/assets/AI2_Blog_1400x685_2x.webp
Binary file added docs/assets/DOLMA.webp
Binary file added docs/assets/DOLMA_2x.png
Binary file added docs/assets/DOLMA_4x.png
Binary file added docs/assets/Small_655x120.png
Binary file added docs/assets/Small_655x120_2x.png
Binary file added docs/assets/Square_1_600x600.png
Binary file added docs/assets/Square_1_600x600_2x.png
Binary file added docs/assets/code-pipeline.pdf
Binary file added docs/assets/code-pipeline.png
Binary file added docs/assets/diagram.webp
Binary file added docs/assets/dolma-datasheet-v0.1.pdf
Binary file added docs/assets/web-pipeline.pdf
Binary file added docs/assets/web-pipeline.png
114 changes: 114 additions & 0 deletions docs/data-format.md
@@ -0,0 +1,114 @@
# Data Format

In this document, we explain the data format for the datasets processed by Dolma.


## Directory Structure

While all components of the Dolma toolkit can read from arbitrary local and S3 locations, we recommend the following directory structure for storing datasets:

```plain-text
|-- dataset-name/
    |-- documents/
        |-- 2019-09/
            |-- 0933_uk_all.jsonl.gz (1GB)
            |-- 0933_vi_all.jsonl.gz (1GB)
            |-- 0106_uk_all.jsonl.gz (1GB)
            |-- 0106_vi_all.jsonl.gz (1GB)
        |-- 2019-08/
            |-- ...
    |-- attributes/
        |-- toxicity-0/
            |-- 2019-09/
                |-- 0933_uk_all.jsonl.gz (..MB)
                |-- 0933_vi_all.jsonl.gz (..MB)
                |-- 0106_uk_all.jsonl.gz (..MB)
                |-- 0106_vi_all.jsonl.gz (..MB)
            |-- 2019-08/
                |-- ...
        |-- paragraph_duplicates/
            |-- ...
```

In the example above, all data is stored under the `documents` subdirectory. The directory structure under `documents` is left up to Dolma users. Each file in the `documents` directory is a gzipped JSONL file, where each line is a JSON object representing a document. We explain the format of each file in the next section.

Data produced by taggers and the deduper is stored under `attributes/attribute-name`; the original directory structure is preserved, and each attributes file contains the same documents as the corresponding file in `documents`.
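
For illustration, here is a minimal Python sketch of how a path under `documents/` maps to the corresponding path under `attributes/` for a given attribute set. The helper name and example paths are ours, not part of the toolkit's API; the function simply encodes the layout described above.

```python
# Minimal sketch (not part of the dolma API): map a documents path to the
# matching attributes path, assuming the directory layout shown above.
def attributes_path(documents_path: str, attribute_name: str) -> str:
    prefix, _, suffix = documents_path.partition("/documents/")
    if not suffix:
        raise ValueError(f"not a documents path: {documents_path}")
    return f"{prefix}/attributes/{attribute_name}/{suffix}"


print(attributes_path("dataset-name/documents/2019-09/0933_uk_all.jsonl.gz", "toxicity-0"))
# -> dataset-name/attributes/toxicity-0/2019-09/0933_uk_all.jsonl.gz
```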


### Dolma Document Format

This is the unified format we will use across all the sources to represent a single **document**. Each row in one of the `documents/*/*.jsonl.gz` files looks like:

```yaml
{
"id": "...", # MANDATORY: source-specific identifier
"text": "foo", # MANDATORY: textual content of the document
"source": "...", # MANDATORY: source of the data, such as peS2o, common-crawl, etc.
"added": "...", # OPTIONAL: timestamp ai2 acquired this data
"created": "..." # OPTIONAL: timestamp when orig document was created (best-guess if not available)
"metadata": {...} # OPTIONAL: source-specific metadata
}
```
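
To make the format concrete, the short sketch below writes a single document in this shape to a gzipped JSONL file and reads it back. The file name and field values are placeholders, not part of any real dataset.

```python
import gzip
import json

# One document in the format described above (all values are placeholders).
doc = {
    "id": "doc-0001",
    "text": "foo",
    "source": "common-crawl",
    "added": "2023-10-15T00:00:00Z",
    "created": "2019-09-01T00:00:00Z",
    "metadata": {"url": "https://example.com"},
}

# Write one JSON object per line to a gzipped JSONL file.
with gzip.open("example.jsonl.gz", "wt", encoding="utf-8") as f:
    f.write(json.dumps(doc) + "\n")

# Read it back, one document per line.
with gzip.open("example.jsonl.gz", "rt", encoding="utf-8") as f:
    for line in f:
        print(json.loads(line)["id"])
```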

#### `id` field

The `id` field is very important as we will need:

- the ability to trace every single document in every version back to the original source document,
- the ability to store a `blocklist` of documents (e.g. avoid due to LLM-Eval, takedown requests, manual inspection).

It is important that document IDs are stable across dataset versions. For example, Document 12345 in `raw` is the same as Document 12345 in `v0`, `v1`, ...

The `id` only needs to be consistent/unique within a `source`. For example, `id='123'` is acceptable because `(c4, '123')` and `(github, '123')` would still uniquely identify this document. But there cannot be two rows in The Stack `v0` dataset that have `id='123'`.
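
Because `(source, id)` uniquely identifies a document, a blocklist can be represented as a set of such pairs and applied when rebuilding a dataset. The snippet below is only an illustration of that idea; the blocklist entries and file name are made up, and this is not a dolma command.

```python
import gzip
import json

# Hypothetical blocklist of (source, id) pairs to drop.
blocklist = {("common-crawl", "123"), ("the-stack", "456")}

kept = []
with gzip.open("example.jsonl.gz", "rt", encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)
        if (doc["source"], doc["id"]) in blocklist:
            continue  # skip documents that appear in the blocklist
        kept.append(doc)
```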

#### `metadata` field

The `metadata` field will be a free-for-all field that contains any source-specific information. This could be things like the code license for The Stack, or paper identifiers for Semantic Scholar (S2) data.

It is especially important to preserve source-specific identifiers when possible. For example, in S2 raw data, we have S2 IDs for each document, but we should also persist things like the DOI, arXiv ID, ACL ID, PubMed ID, etc. when they're available to us.

### Dolma Attributes Format

Let's say we are at a good state of the documents, but we need to iterate on the toxicity classifier a few times. We don't want to create multiple copies of the dataset just because we updated the toxicity classifier. Hence, we store **documents** separately from **attributes**, where attributes are newly derived/predicted aspects as a result of using our tools to analyze the documents.

These are flat JSONs that look like:

```yaml
{
"source": "...",
"id": "...",
"attributes": {
"toxicity": 0.7
}
}
```

where the `source` and `id` keys uniquely identify which document carries these attributes.

The mixer creates a unified `attributes` dictionary by merging all of the individual `attributes` dictionaries.

Note that it's very important that the `*.jsonl.gz` files for attributes line up exactly (same number of rows, same sort order) with the `*.jsonl.gz` files for the associated documents. It'll save us a lot of headache in the future.
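
One way to rely on that alignment is to read a documents file and its attributes files in lockstep, checking ids as you go and merging the per-tagger `attributes` dictionaries. This is only a sketch of the idea (the function name is ours), not how the mixer is actually implemented:

```python
import gzip
import json


def merged_attributes(documents_path, attributes_paths):
    """Yield (document, merged_attributes) pairs, assuming row-aligned files."""
    handles = [gzip.open(p, "rt", encoding="utf-8") for p in [documents_path, *attributes_paths]]
    try:
        for rows in zip(*handles):
            doc, *attr_rows = (json.loads(r) for r in rows)
            merged = {}
            for attr in attr_rows:
                # Every attributes row must describe the same document, in the same order.
                assert attr["id"] == doc["id"], "documents and attributes files are misaligned"
                merged.update(attr.get("attributes", {}))
            yield doc, merged
    finally:
        for handle in handles:
            handle.close()
```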

For something like language identification, this JSON might look like:

```yaml
{
"id": "...",
"attributes": {
"olmo_mix_v1_taggers__ft_lang_id_en_paragraph_with_doc_score_v2__en": [
[0, 300, 0.9], # this means text[0:300] is tagged with score 0.9
[300, 540, 0.3], # this means text[300:540] is tagged with score 0.3
...
],
...
}
}
```

Each attribute can have one or more scores associated with it; in the example above, each paragraph in the document is tagged with a language score.
For each paragraph, the tuple indicates the start and end indices of the paragraph and the score associated with it.

The idea is that attributes identify spans of text within a document that might be problematic.
These signals get cached during tagging, which allows the final dataset to be "built" afterwards purely through configuration. For example, given signal data like this, we might try different confidence thresholds on an attribute such as `mean_word_length` when creating the final data mixture.
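
As a rough sketch of what such thresholding could look like, the snippet below keeps only the spans of one attribute whose score clears a chosen threshold. The attribute key matches the example above, but the threshold value and the code itself are illustrative; in practice this kind of filtering is expressed through a mixer configuration rather than hand-written code.

```python
# Illustrative only: keep spans of one attribute whose score clears a threshold.
key = "olmo_mix_v1_taggers__ft_lang_id_en_paragraph_with_doc_score_v2__en"
attribute_row = {
    "id": "...",
    "attributes": {
        key: [
            [0, 300, 0.9],    # text[0:300] scored 0.9
            [300, 540, 0.3],  # text[300:540] scored 0.3
        ],
    },
}

THRESHOLD = 0.5
kept_spans = [
    (start, end, score)
    for start, end, score in attribute_row["attributes"][key]
    if score >= THRESHOLD
]
print(kept_spans)  # [(0, 300, 0.9)]
```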
