Adding Tokenizer, Writing Documentation, Misc Bugs & CLI improvements (#54)

* testing warc

* ignore

* testing slow

* langdetect

* optional import

* refactoring

* wip

* style

* wip

* test

* wip

* configs

* hash sample

* small improvements

* updated with output

* more details

* updated readme

* decon wip

* new configs

* tagging content

* changed name of file

* fixes

* deal with empty docs/local files

* increased bloom size

* configs for rest of splits

* switching to option2

* forgot to do two more

* finding punctuation

* tokenizer porting

* configs

* books config

* more sources

* configs

* updated paths

* new c4

* cleaned up

* sampling

* sample

* sampling

* added tokenizer

* update all

* style

* updated

* configs

* tokenizer cli wip

* cli

* wip big refactor

* fixed small bugs

* tokenizer log

* fixed tokenizer paths

* added tokenizer small

* fixed glob issue

* removed temporary directory

* added todo

* conversion script

* more writing

* more docs

* more docs

* logos

* pipelines

* datasheet

* wip

* adding script to make wikipedia

* wip

* more text

* more docs!

* new examples.

* documentation

* fixed bug local file

* lint

* removing docs while they are wip

* reverted bug

* style

* quoting queue to ensure 3.8 compatibility

* quoting queue to ensure 3.8 compatibility
soldni authored Oct 15, 2023
1 parent 9c4d960 commit 1728f4f
Showing 75 changed files with 2,925 additions and 341 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -63,3 +63,4 @@ target/

# ignoring test output
/tests/work/
/python/dolma/core/warc
2 changes: 1 addition & 1 deletion Makefile
@@ -31,7 +31,7 @@ publish:
test: test-python test-rust

test-python:
pytest -vs tests/python
pytest -vsx tests/python
rm -rf tests/work/*

test-rust:
129 changes: 32 additions & 97 deletions README.md
@@ -1,118 +1,53 @@
<img alt="Dolma's official logo. It's dolma written in yellow, round lowercase letters over a blue background." src="https://github.com/allenai/dolma/blob/main/res/logo.png?raw=true" width="100%">
<img alt="Dolma's official logo. It's dolma written in yellow, round lowercase letters over a blue background." src="https://raw.githubusercontent.com/allenai/dolma/main/docs/assets/AI2_Blog_1400x685_2x.webp" width="100%">

Dolma is two things:

Dolma is an open dataset of 3 trillion tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.
It was created as a training corpus for [OLMo](https://allenai.org/olmo), AI2's language model.

Dolma is available for download on the HuggingFace 🤗 Hub: [`huggingface.co/datasets/allenai/dolma`](https://huggingface.co/datasets/allenai/dolma). To access Dolma, users must agree to the terms of the [AI2 ImpACT License for Medium Risk Artifacts](https://allenai.org/licenses/impact-mr).
You can also read more about Dolma in [our announcement](https://blog.allenai.org/dolma-3-trillion-tokens-open-llm-corpus-9a0ff4b8da64), as well as by consulting its [data sheet](https://drive.google.com/file/d/12gOf5I5RytsD159nSP7iim_5zN31FCXq/view?usp=drive_link).

This repository contains tools for generating and inspecting Dolma. To get started, install the Dolma Python library from [PyPI](https://pypi.org/project/dolma/).

```shell
pip install dolma
```

## Usage

The dolma CLI can be accessed using the `dolma` command. To see the available commands, use the `--help` flag.

```shell
dolma --help
```

At the moment, the CLI supports three commands: `tag`, `dedupe`, and `mix`.
1. **Dolma Dataset**: an open dataset of 3 trillion tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.
2. **Dolma Toolkit**: a high-performance toolkit for curating datasets for language modeling.

For all commands, configurations can be specified from the command line, or by passing a YAML or JSON file using the `-c` flag. For example:
## Dolma Dataset

```shell
dolma -c config.yaml dedupe --dedupe.name "test"
```

### The `tag` command

The tag command is used to run any of the built-in taggers on a set of documents. For example:

```shell
dolma tag \
--experiment sample \
--documents \
's3://ai2-llm/pretraining-data/sources/common-crawl/test/v0/documents/**/*.json.gz' \
's3://ai2-llm/pretraining-data/sources/common-crawl/test/v1/documents/*.json.gz' \
--taggers random_number_v1 \
--processes 2
```

This command will run the `random_number_v1` tagger on all documents in the specified S3 paths. The results will be written to the `s3://ai2-llm/pretraining-data/sources/common-crawl/test/v0/attributes/sample` and `s3://ai2-llm/pretraining-data/sources/common-crawl/test/v1/attributes/sample` paths.

### The `dedupe` command

The dedupe command is used to deduplicate a set of documents at the attribute level using a Bloom filter.
For example configurations, see the `tests/config` directory; for instance:

```shell
dolma dedupe -c tests/config/dedupe-paragraphs.json
```

### The `mix` command

The mix command is used to mix documents from multiple sources, optionally filtering by attributes and/or performing string replacement. For example configurations, see the `tests/config` directory; for instance:

```shell
dolma mix -c tests/config/mixer.json
```


## Development

Create a conda environment with Python >= 3.8. In this case, we use Python 3.10 and use Anaconda to create the environment.
Dolma is an open dataset of 3 trillion tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.
It was created as a training corpus for [OLMo](https://allenai.org/olmo), a language model from the [Allen Institute for AI](https://allenai.org) (AI2).

```shell
conda create -n dolma python=3.10
```
Dolma is available for download on the HuggingFace 🤗 Hub: [`huggingface.co/datasets/allenai/dolma`](https://huggingface.co/datasets/allenai/dolma). To access Dolma, users must agree to the terms of the [AI2 ImpACT License for Medium Risk Artifacts](https://allenai.org/licenses/impact-mr).

After creating the environment, activate it and install necessary tools using the included makefile.
You can also read more about Dolma in [our announcement](https://blog.allenai.org/dolma-3-trillion-tokens-open-llm-corpus-9a0ff4b8da64), as well as by consulting its [data sheet](docs/assets/dolma-datasheet-v0.1.pdf).

```shell
conda activate dolma
make setup
```

and restart your shell. Finally, to begin development, install the repository in editable mode using maturin.
## Dolma Toolkit

```shell
make develop
```
Dolma is a toolkit to curate large datasets for (pre)-training ML models. Its key features are:

To run tests, use the following command.
1. **High Performance** ⚡: Can process billions of documents concurrently thanks to built-in parallelism.
2. **Portability** 🧳: Works on a single machine, a cluster, or a cloud environment.
3. **Built-In Taggers** 🏷: Includes ready-to-use taggers commonly used to curate datasets such as [Gopher](https://arxiv.org/abs/2112.11446), [C4](https://arxiv.org/abs/1910.10683), and [OpenWebText](https://openwebtext2.readthedocs.io/en/latest/).
4. **Fast Deduplication** 🗑: Speedy document deduplication using a Rust Bloom filter.
5. **Extensibility** 🧩 & **Cloud Support** ☁: Supports custom taggers and AWS S3-compatible locations.

```shell
make test
```
You can choose to run just the Python or Rust tests by calling `make test-python` or `make test-rust` respectively.
To install, simply type `pip install dolma` in your terminal.

You can skip S3-related tests by exporting `DOLMA_TESTS_SKIP_AWS=True`.
To learn more about how to use the Dolma Toolkit, please visit the [documentation](/docs).

```shell
DOLMA_TESTS_SKIP_AWS=True make test
```
## Citation

## Contributing
If you use the Dolma dataset or toolkit, please cite the following items:

Before committing, use the following command:

```shell
make style
```

```bibtex
@techreport{DolmaDataset,
author = {Soldaini, Luca and Kinney, Rodney and Bhagia, Akshita and Schwenk, Dustin and Atkinson, David and Authur, Russell and Chandu, Khyathi and Dumas, Jennifer and Lucy, Li and Lyu, Xinxi and Magnusson, Ian and Naik, Aakanksha and Nam, Crystal and Peters, Matthew E. and Ravichander, Abhilasha and Shen, Zejiang and Strubell, Emma and Subramani, Nishant and Tafjord, Oyvind and Walsh, Evan Pete and Hajishirzi, Hannaneh and Smith, Noah A. and Zettlemoyer, Luke and Beltagy, Iz and Groeneveld, Dirk and Dodge, Jesse and Lo, Kyle},
title = {{Dolma: An Open Corpus of 3 Trillion Tokens for Language Model Pretraining Research}},
institution = {{Allen Institute for AI}},
year = {2023},
note = {Released under ImpACT License as Medium Risk artifact, \url{https://github.com/allenai/dolma}}
}
```

## Citation

If you use this repository, please cite it as:

```bibtex
@software{dolma,
@software{DolmaToolkit,
author = {{Soldaini, Luca and Lo, Kyle and Kinney, Rodney and Naik, Aakanksha and Ravichander, Abhilasha and Bhagia, Akshita and Groeneveld, Dirk and Schwenk, Dustin and Magnusson, Ian and Chandu, Khyathi}},
license = {{Apache-2.0}},
title = {{Dolma}},
url = {https://github.com/allenai/dolma}
title = {{The Dolma Toolkit}},
year = {2023},
note = {{Apache 2.0 License, Version \texttt{0.9.0}, \url{https://github.com/allenai/dolma}}}
}
```
File renamed without changes.
File renamed without changes.
@@ -17,7 +17,7 @@ class TrafilaturaReformatter(BaseParallelProcessor):
    @classmethod
    def increment_progressbar(  # type: ignore
        cls,
        queue: Queue[Union[Tuple[int, ...], None]],
        queue: "Queue[Union[Tuple[int, ...], None]]",
        /,
        files: int = 0,
        documents: int = 0,
@@ -26,7 +26,7 @@ def increment_progressbar(  # type: ignore

    @classmethod
    def process_single(
        cls, source_path: str, destination_path: str, queue: Queue[Union[Tuple[int, ...], None]], **kwargs: Any
        cls, source_path: str, destination_path: str, queue: "Queue[Union[Tuple[int, ...], None]]", **kwargs: Any
    ):
        documents = 0
        interval = 10_000
File renamed without changes.
File renamed without changes.
44 changes: 44 additions & 0 deletions docs/README.md
@@ -0,0 +1,44 @@
# Dolma Toolkit Documentation


Dolma is a toolkit to curate datasets for pretraining AI models. Reasons to use the Dolma toolkit are:

- **High performance** ⚡️ Dolma is designed to be highly performant, and can be used to process datasets with billions of documents in parallel.
- **Portable** 🧳 Dolma can be run on a single machine, a cluster, or a cloud computing environment.
- **Built-in taggers** 🏷 Dolma comes with a number of built-in taggers, including language detection, toxicity detection, perplexity scoring, and common filtering recipes, such as the ones used to create [Gopher](https://arxiv.org/abs/2112.11446) and [C4](https://arxiv.org/abs/1910.10683).
- **Fast deduplication** 🗑 Dolma can deduplicate documents using a Rust-based Bloom filter, which is significantly faster than other methods.
- **Extensible** 🧩 Dolma is designed to be extensible, and can be extended with custom taggers.
- **Cloud support** ☁️ Dolma supports reading and writing data from local disk and AWS S3-compatible locations.

Dataset curation with Dolma usually happens in four steps:

1. Using **taggers**, spans of documents in a dataset are tagged with properties (e.g. the language they are in, toxicity, etc.);
2. Documents are optionally **deduplicated** based on their content or metadata;
3. Using the **mixer**, documents are removed or filtered depending on the value of their attributes;
4. Finally, documents can be **tokenized** using any [HuggingFace-compatible tokenizer](https://huggingface.co/docs/tokenizers/index).

![The four steps of dataset curation with Dolma.](assets/diagram.webp)

Dolma can be installed using `pip`:

```shell
pip install dolma
```

Dolma can be used either as a Python library or as a command line tool. The command line tool can be accessed using the `dolma` command. To see the available commands, use the `--help` flag.

```shell
dolma --help
```

## Index

To read Dolma's documentation, visit the following pages:

- [Getting Started](getting-started.md)
- [Data Format](data-format.md)
- [Taggers](taggers.md)
- [Deduplication](deduplication.md)
- [Mixer](mixer.md)
- [Tokenization](tokenize.md)
- [Contributing to Dolma](develop.md)
Binary file added docs/assets/AI2_Blog_1400x685.png
Binary file added docs/assets/AI2_Blog_1400x685.webp
Binary file added docs/assets/AI2_Blog_1400x685_2x.png
Binary file added docs/assets/AI2_Blog_1400x685_2x.webp
Binary file added docs/assets/DOLMA.webp
Binary file added docs/assets/DOLMA_2x.png
Binary file added docs/assets/DOLMA_4x.png
Binary file added docs/assets/Small_655x120.png
Binary file added docs/assets/Small_655x120_2x.png
Binary file added docs/assets/Square_1_600x600.png
Binary file added docs/assets/Square_1_600x600_2x.png
Binary file added docs/assets/code-pipeline.pdf
Binary file added docs/assets/code-pipeline.png
Binary file added docs/assets/diagram.webp
Binary file added docs/assets/dolma-datasheet-v0.1.pdf
Binary file added docs/assets/web-pipeline.pdf
Binary file added docs/assets/web-pipeline.png
114 changes: 114 additions & 0 deletions docs/data-format.md
@@ -0,0 +1,114 @@
# Data Format

In this document, we explain the data format for the datasets processed by Dolma.


## Directory Structure

While all components of the Dolma toolkit can read from arbitrary local and S3 locations, we recommend the following directory structure for storing datasets:

```plain-text
|-- dataset-name/
    |-- documents/
        |-- 2019-09/
            |-- 0933_uk_all.jsonl.gz (1GB)
            |-- 0933_vi_all.jsonl.gz (1GB)
            |-- 0106_uk_all.jsonl.gz (1GB)
            |-- 0106_vi_all.jsonl.gz (1GB)
        |-- 2019-08/
            |-- ...
    |-- attributes/
        |-- toxicity-0/
            |-- 2019-09/
                |-- 0933_uk_all.jsonl.gz (..MB)
                |-- 0933_vi_all.jsonl.gz (..MB)
                |-- 0106_uk_all.jsonl.gz (..MB)
                |-- 0106_vi_all.jsonl.gz (..MB)
            |-- 2019-08/
                |-- ...
        |-- paragraph_duplicates/
            |-- ...
```

In the example above, all data is stored under the `documents` subdirectory. The directory structure under `documents` is left up to Dolma users. Each file in the `documents` directory is a gzipped JSONL file, where each line is a JSON object representing a document. We explain the format of each file in the next section.

Data produced by taggers and the deduper is stored under `attributes/attribute-name`; the original directory structure is preserved, and each attributes file contains the same documents as the corresponding file in `documents`.
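
For illustration, here is a minimal Python sketch of how a path under `documents/` maps to the corresponding path under `attributes/` for a given attribute set. The helper name and example paths are ours, not part of the toolkit's API; the function simply encodes the layout described above.

```python
# Minimal sketch (not part of the dolma API): map a documents path to the
# matching attributes path, assuming the directory layout shown above.
def attributes_path(documents_path: str, attribute_name: str) -> str:
    prefix, _, suffix = documents_path.partition("/documents/")
    if not suffix:
        raise ValueError(f"not a documents path: {documents_path}")
    return f"{prefix}/attributes/{attribute_name}/{suffix}"


print(attributes_path("dataset-name/documents/2019-09/0933_uk_all.jsonl.gz", "toxicity-0"))
# -> dataset-name/attributes/toxicity-0/2019-09/0933_uk_all.jsonl.gz
```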


### Dolma Document Format

This is the unified format we will use across all the sources to represent a single **document**. Each row in one of the `documents/*/*.jsonl.gz` files looks like:

```yaml
{
"id": "...", # MANDATORY: source-specific identifier
"text": "foo", # MANDATORY: textual content of the document
"source": "...", # MANDATORY: source of the data, such as peS2o, common-crawl, etc.
"added": "...", # OPTIONAL: timestamp ai2 acquired this data
"created": "..." # OPTIONAL: timestamp when orig document was created (best-guess if not available)
"metadata": {...} # OPTIONAL: source-specific metadata
}
```
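
To make the format concrete, the short sketch below writes a single document in this shape to a gzipped JSONL file and reads it back. The file name and field values are placeholders, not part of any real dataset.

```python
import gzip
import json

# One document in the format described above (all values are placeholders).
doc = {
    "id": "doc-0001",
    "text": "foo",
    "source": "common-crawl",
    "added": "2023-10-15T00:00:00Z",
    "created": "2019-09-01T00:00:00Z",
    "metadata": {"url": "https://example.com"},
}

# Write one JSON object per line to a gzipped JSONL file.
with gzip.open("example.jsonl.gz", "wt", encoding="utf-8") as f:
    f.write(json.dumps(doc) + "\n")

# Read it back, one document per line.
with gzip.open("example.jsonl.gz", "rt", encoding="utf-8") as f:
    for line in f:
        print(json.loads(line)["id"])
```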

#### `id` field

The `id` field is very important as we will need:

- the ability to trace every single document in every version back to the original source document,
- the ability to store a `blocklist` of documents (e.g. avoid due to LLM-Eval, takedown requests, manual inspection).

It is important that document IDs are stable across dataset versions. For example, Document 12345 in `raw` is the same as Document 12345 in `v0`, `v1`, ...

The `id` only needs to be consistent/unique within a `source`. For example, `id='123'` is acceptable because `(c4, '123')` and `(github, '123')` would still uniquely identify this document. But there cannot be two rows in The Stack `v0` dataset that have `id='123'`.
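
Because `(source, id)` uniquely identifies a document, a blocklist can be represented as a set of such pairs and applied when rebuilding a dataset. The snippet below is only an illustration of that idea; the blocklist entries and file name are made up, and this is not a dolma command.

```python
import gzip
import json

# Hypothetical blocklist of (source, id) pairs to drop.
blocklist = {("common-crawl", "123"), ("the-stack", "456")}

kept = []
with gzip.open("example.jsonl.gz", "rt", encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)
        if (doc["source"], doc["id"]) in blocklist:
            continue  # skip documents that appear in the blocklist
        kept.append(doc)
```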

#### `metadata` field

The `metadata` field will be a free-for-all field that contains any source-specific information. This could be things like the code license for The Stack, or paper identifiers for Semantic Scholar (S2) data.

It is especially important to preserve source-specific identifiers when possible. For example, in S2 raw data, we have S2 IDs for each document, but we should also persist things like the DOI, arXiv ID, ACL ID, PubMed ID, etc. when they're available to us.

### Dolma Attributes Format

Let's say we are at a good state of the documents, but we need to iterate on the toxicity classifier a few times. We don't want to create multiple copies of the dataset just because we updated the toxicity classifier. Hence, we store **documents** separately from **attributes**, where attributes are newly derived/predicted aspects as a result of using our tools to analyze the documents.

These are flat JSONs that look like:

```yaml
{
"source": "...",
"id": "...",
"attributes": {
"toxicity": 0.7
}
}
```

where the `source` and `id` keys uniquely identify which document carries these attributes.

The mixer creates a unified `attributes` dictionary by merging all of the individual `attributes` dictionaries.

Note that it's very important that the `*.jsonl.gz` files for attributes line up exactly (same number of rows, same sort order) with the `*.jsonl.gz` files for the associated documents. It'll save us a lot of headache in the future.
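
One way to rely on that alignment is to read a documents file and its attributes files in lockstep, checking ids as you go and merging the per-tagger `attributes` dictionaries. This is only a sketch of the idea (the function name is ours), not how the mixer is actually implemented:

```python
import gzip
import json


def merged_attributes(documents_path, attributes_paths):
    """Yield (document, merged_attributes) pairs, assuming row-aligned files."""
    handles = [gzip.open(p, "rt", encoding="utf-8") for p in [documents_path, *attributes_paths]]
    try:
        for rows in zip(*handles):
            doc, *attr_rows = (json.loads(r) for r in rows)
            merged = {}
            for attr in attr_rows:
                # Every attributes row must describe the same document, in the same order.
                assert attr["id"] == doc["id"], "documents and attributes files are misaligned"
                merged.update(attr.get("attributes", {}))
            yield doc, merged
    finally:
        for handle in handles:
            handle.close()
```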

For something like language identification, this JSON might look like:

```yaml
{
"id": "...",
"attributes": {
"olmo_mix_v1_taggers__ft_lang_id_en_paragraph_with_doc_score_v2__en": [
[0, 300, 0.9], # this means text[0:300] is tagged with score 0.9
[300, 540, 0.3], # this means text[300:540] is tagged with score 0.3
...
],
...
}
}
```

Each attribute can have one or more scores associated with it; in the example above, each paragraph in the document is tagged with a language score.
For each paragraph, the tuple indicates the start and end indices of the paragraph and the score associated with it.

The idea is that attributes identify spans of text within a document that might be problematic.
These signals get cached during tagging, which allows the final dataset to be "built" afterwards purely through configuration. For example, given signal data like this, we might try different confidence thresholds on an attribute such as `mean_word_length` when creating the final data mixture.
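
As a rough sketch of what such thresholding could look like, the snippet below keeps only the spans of one attribute whose score clears a chosen threshold. The attribute key matches the example above, but the threshold value and the code itself are illustrative; in practice this kind of filtering is expressed through a mixer configuration rather than hand-written code.

```python
# Illustrative only: keep spans of one attribute whose score clears a threshold.
key = "olmo_mix_v1_taggers__ft_lang_id_en_paragraph_with_doc_score_v2__en"
attribute_row = {
    "id": "...",
    "attributes": {
        key: [
            [0, 300, 0.9],    # text[0:300] scored 0.9
            [300, 540, 0.3],  # text[300:540] scored 0.3
        ],
    },
}

THRESHOLD = 0.5
kept_spans = [
    (start, end, score)
    for start, end, score in attribute_row["attributes"][key]
    if score >= THRESHOLD
]
print(kept_spans)  # [(0, 300, 0.9)]
```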
