Commit ecfda6b ("deploy: c33a8e9")
lllAlexanderlll committed Jan 16, 2024
Showing 62 changed files with 7,205 additions and 0 deletions.
4 changes: 4 additions & 0 deletions .buildinfo
@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: eafc7b279315a8a9b5f44f5c8ecc4328
tags: 645f666f9bcd5a90fca523b33c5a78b7
Empty file added .nojekyll
Empty file.
3 changes: 3 additions & 0 deletions _sources/benchmarking.rst.txt
@@ -0,0 +1,3 @@
Benchmarking
=============================
**EDIT "docs/source/benchmarking.rst" IN ORDER TO MAKE CHANGES HERE**
91 changes: 91 additions & 0 deletions _sources/configuration.rst.txt
@@ -0,0 +1,91 @@
.. role:: python(code)
:language: python

Configuration
========================================================================

**EDIT "docs/source/configuration.rst" IN ORDER TO MAKE CHANGES HERE**

The training config is defined in YAML-formatted files; see :file:`data/config_lorem_ipsum.yaml`. These configs are deliberately explicit and specify all training parameters, so that model training runs remain as transparent and reproducible as possible. Each config setting is reflected in pydantic classes in :file:`src/llm_gym/config/*.py`. In the config you define which config class to load via the field :python:`type_hint`, which specifies the concrete class. A second field, :python:`config`, then takes all the constructor arguments for that config class. This makes it easy to exchange components, e.g. DataLoaders, while still having input validation in place.

Pydantic and ClassResolver
------------------------------------------------------------------------

The mechanism introduced to instantiate classes via :python:`type_hint` in the :file:`config.yaml` utilizes

1) Omegaconf to load the config YAML file
2) Pydantic for the validation of the config
3) ClassResolver to instantiate the correct, concrete class of a class hierarchy.

Firstly, Omegaconf loads the config YAML file and resolves internal references such as `${subconfig.attribute}`.
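
As a rough sketch of this first step (using the standard omegaconf API; illustrative only, not the project's actual loading code):

.. code-block:: python

   from omegaconf import OmegaConf

   # Load the YAML file; interpolations such as ${subconfig.attribute} are resolved
   # lazily on access or explicitly when converting to a plain container.
   cfg = OmegaConf.load("data/config_lorem_ipsum.yaml")
   config_dict = OmegaConf.to_container(cfg, resolve=True)  # plain dict, ready for validation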

Then, Pydantic validates the whole config as is and checks that each of the sub-configs is a :python:`pydantic.BaseModel` class.
For configs that allow different concrete classes to be instantiated by :python:`ClassResolver`, the special member names :python:`type_hint` and :python:`config` are introduced.
With this we utilize Pydantic's ability to auto-select a fitting type based on the keys in the config YAML file.

:python:`ClassResolver` replaces large if-else control structures for inferring the correct concrete type; instead, a :python:`type_hint` is used to select the class:

.. code-block:: python

   from class_resolver import ClassResolver
   from torch import nn

   activation_resolver = ClassResolver(
       [nn.ReLU, nn.Tanh, nn.Hardtanh],
       base=nn.Module,
       default=nn.ReLU,
   )
   type_hint = "ReLU"
   activation_kwargs = {...}
   activation = activation_resolver.make(type_hint, activation_kwargs)

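
For comparison, the manual dispatch such a resolver replaces might look roughly like this (an illustrative sketch, not code from the repository):

.. code-block:: python

   from torch import nn

   def build_activation(type_hint: str, **activation_kwargs) -> nn.Module:
       # The kind of if-else branching that ClassResolver makes unnecessary.
       if type_hint == "ReLU":
           return nn.ReLU(**activation_kwargs)
       elif type_hint == "Tanh":
           return nn.Tanh(**activation_kwargs)
       elif type_hint == "Hardtanh":
           return nn.Hardtanh(**activation_kwargs)
       raise ValueError(f"Unknown activation: {type_hint}")
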
In our implementation we go a step further, as both

* a :python:`type_hint` in a :python:`BaseModel` config must be of type :python:`llm_gym.config.lookup_types.LookupEnum` and
* :python:`config` is a union of allowed concrete configs of base type :python:`BaseModel`.

:python:`config` hereby takes the place of :python:`activation_kwargs` in the example above, replacing it with pydantic-validated :python:`BaseModel` configs.

With this, a mapping between the type-hint strings needed by `class-resolver` and the concrete classes is introduced, while pydantic can still select the correct concrete config:

.. code-block:: python

   from enum import Enum

   import torch
   from pydantic import BaseModel, PositiveInt, PositiveFloat, conint, confloat

   class LookupEnum(Enum):
       @classmethod
       def _missing_(cls, value: str) -> type:
           """Constructs the Enum by member name, if not constructable by value."""
           return cls.__dict__[value]

   class SchedulerTypes(LookupEnum):
       StepLR = torch.optim.lr_scheduler.StepLR
       ConstantLR = torch.optim.lr_scheduler.ConstantLR

   class StepLRConfig(BaseModel):
       step_size: conint(ge=1)
       gamma: confloat(ge=0.0)

   class ConstantLRConfig(BaseModel):
       factor: PositiveFloat
       total_iters: PositiveInt

   class SchedulerConfig(BaseModel):
       type_hint: SchedulerTypes
       config: StepLRConfig | ConstantLRConfig

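
Under these definitions, validating a scheduler entry could look roughly as follows (a hypothetical sketch; the values are made up and assume pydantic resolves the union and the :python:`LookupEnum` as described above):

.. code-block:: python

   # Hypothetical entry as it would arrive from the config YAML file.
   raw = {"type_hint": "StepLR", "config": {"step_size": 10, "gamma": 0.1}}

   # type_hint is constructed by member name via LookupEnum._missing_, and
   # config is validated against the matching BaseModel in the union.
   scheduler_config = SchedulerConfig(**raw)
   assert scheduler_config.type_hint == SchedulerTypes.StepLR
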
To allow user-friendly instantiation, all class resolvers are defined in the :python:`ResolverRegistry`, and :python:`build_component_by_config` is introduced as a convenience function. Dependencies can be passed through with the :python:`extra_kwargs` argument:

.. code-block:: python

   resolvers = ResolverRegister(config=config)
   optimizer = ...  # our example dependency
   scheduler = resolvers.build_component_by_config(config=config.scheduler, extra_kwargs=dict(optimizer=optimizer))

To add a new resolver, use :python:`add_resolver`; the added resolver will then be accessible via the register_key given during adding.

For access, use the :python:`build_component_by_key_query` function of the :python:`ResolverRegistry`.



71 changes: 71 additions & 0 deletions _sources/entrypoints.rst.txt
@@ -0,0 +1,71 @@
.. role:: python(code)
:language: python

.. role:: bash(code)
:language: bash


Entrypoints
=======================================================


**EDIT "docs/source/entrypoints.rst" IN ORDER TO MAKE CHANGES HERE**

We use `click <https://click.palletsprojects.com/en/>`_ as a tool to add new entry points and their CLI arguments.
For this we have a main entry point from which all other entry points are started.

The main entry point is :file:`src/llm_gym/__main__.py:main()`.
We register other sub-entrypoints by using our main :python:`click.group`, called :python:`main`, as follows:

.. code-block:: python

   @main.command(name="my_new_entry_point")

See the following full example:

.. code-block:: python

   from pathlib import Path

   import click
   import click_pathlib

   @click.group()
   def main() -> None:
       pass

   config_option = click.option(
       "--config_file_path",
       type=click_pathlib.Path(exists=False),
       required=True,
       help="Path to a file with the YAML config file.",
   )

   @main.command(name="do_stuff")
   @config_option
   @click.option(
       "--my_cli_argument",
       type=int,
       required=True,
       help="New integer argument",
   )
   def entry_point_do_stuff(config_file_path: Path, my_cli_argument: int):
       print(f"Do stuff with {config_file_path} and {my_cli_argument}...")
       ...

   if __name__ == "__main__":
       main()

With

.. code-block:: toml

   [project.scripts]
   llm_gym = "llm_gym.__main__:main"

in our :file:`pyproject.toml`, we can start the bare main group with :bash:`llm_gym` (which does nothing on its own), or a specific sub-entrypoint, e.g. :bash:`llm_gym do_stuff --config_file_path config_files/config.yaml --my_cli_argument 3537`.
Alternatively, call the module directly: :bash:`src/llm_gym/__main__.py do_stuff --config_file_path config_files/config.yaml --my_cli_argument 3537`.
9 changes: 9 additions & 0 deletions _sources/future_work.rst.txt
@@ -0,0 +1,9 @@
Future Work
=======================================================

The team is currently extending our already established LLM code base to bring multi-modality into the mix. This extension will be based on ideas similar to CoCa and/or AudioPaLM, which would enable users either to use different encoders for different modalities in conjunction with a text-based decoder, or to use a decoder-only architecture.
Beyond text, future modalities include:

* image
* audio
* video
48 changes: 48 additions & 0 deletions _sources/index.rst.txt
@@ -0,0 +1,48 @@
Welcome to Modalities' documentation!
======================================================================

**EDIT "docs/source/index.rst" IN ORDER TO MAKE CHANGES HERE**

<TODO: Add abstract --> still needed: USPs, key features; include FSDP here;>

<TODO: CAN ADD LINKS TO SPECIFIC THINGS USERS CAN EXPLORE AT FIRST>


.. note::

This project is under active development.

.. toctree::
:caption: Getting Started

quickstart
configuration
model_cards
benchmarking
known_issues

.. toctree::
:caption: Datasets

memmap

.. toctree::
:caption: Entrypoints

entrypoints

.. toctree::
:caption: VSCode Setup

vs_code_setup


.. toctree::
:caption: Future Work

future_work

.. toctree::
:caption: API

api/modules
7 changes: 7 additions & 0 deletions _sources/known_issues.rst.txt
@@ -0,0 +1,7 @@
Known Issues
==================================================================

**EDIT "docs/source/known_issues.rst" IN ORDER TO MAKE CHANGES HERE**

1. Hardcoded dataset path :file:`/raid/s3/opengptx/mehdi/temp/temp_data/train_text_document.bin` in :file:`config/config.yaml`
2. Dependency on Weights & Biases
43 changes: 43 additions & 0 deletions _sources/memmap.rst.txt
@@ -0,0 +1,43 @@
.. role:: python(code)
:language: python

.. role:: bash(code)
:language: bash

MemMap Datasets
====================================================

**EDIT "docs/source/memmap.rst" IN ORDER TO MAKE CHANGES HERE**

MemMapDataset Index Generator
------------------------------------------------------------------------------

The :python:`MemMapDataset` requires an index file providing the necessary pointers into the raw data file. The :python:`MemMapDataset` can create the index file lazily; however, it is advised to create it beforehand. This can be done by running

.. code-block:: bash

   modalities create_memmap_index <path/to/jsonl/file>

The index will be created in the same directory as the raw data file. For further options, you may look into the usage documentation via :bash:`modalities create_memmap_index --help`.

Packed Dataset Generator
--------------------------------------------------------------------------------

The :python:`PackedMemMapDatasetContinuous` and :python:`PackedMemMapDatasetMegatron` require a packed data file. To create the data file, you first have to generate a :python:`MemMapDataset` index file as described `above <memMapDataset-index-generator>`_. Assuming the index and raw data are located in the same directory, you can simply execute the following command:

.. code-block:: bash

   modalities create_packed_data <path/to/jsonl/file>

The packed data file will be created in the same directory as the raw data file. For further options, you may look into the usage documentation via :bash:`modalities create_packed_data --help`.

Packed Data Format
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The packed data file is a bytestream containing both the tokenized data and an index denoting the start and length of the tokenized documents inside the bytestream. The data file consists of 3 concatenated parts:

header segment | data segment | index segment

* **header segment**: This section is an 8-byte integer encoding the length of the data segment in bytes.
* **data segment**: This section contains the concatenation of all documents in the form of 4-byte tokens. An end-of-sequence token is placed between consecutive documents.
* **index segment**: This section contains a pickled index that locates the documents inside the data segment. The index is essentially a list of tuples, where each tuple contains the start position and length in bytes for the corresponding document, e.g., :python:`[(start_doc1, len_doc1), (start_doc2, len_doc2), ...]`.
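
For illustration, a minimal sketch that splits such a file into its three segments could look like this (the byte order is an assumption, not taken from the actual implementation):

.. code-block:: python

   import pickle

   HEADER_SIZE = 8  # header segment: data segment length in bytes
   TOKEN_SIZE = 4   # data segment: each token is stored as 4 bytes

   def inspect_packed_file(path: str) -> None:
       """Sketch: split a packed data file into header, data, and index segments."""
       with open(path, "rb") as f:
           raw = f.read()
       # Assumption: the 8-byte header is a little-endian integer.
       data_len = int.from_bytes(raw[:HEADER_SIZE], byteorder="little")
       data_segment = raw[HEADER_SIZE:HEADER_SIZE + data_len]
       index = pickle.loads(raw[HEADER_SIZE + data_len:])  # [(start, length), ...] in bytes
       print(f"{len(data_segment) // TOKEN_SIZE} tokens across {len(index)} documents")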
4 changes: 4 additions & 0 deletions _sources/model_cards.rst.txt
@@ -0,0 +1,4 @@
Model Cards
====================================================
**EDIT "docs/source/model_cards.rst" IN ORDER TO MAKE CHANGES HERE**
<TODO>
41 changes: 41 additions & 0 deletions _sources/quickstart.rst.txt
@@ -0,0 +1,41 @@
Quickstart
====================================================

**EDIT "docs/source/quickstart.rst" IN ORDER TO MAKE CHANGES HERE**

Installation
-----------------------------------------------------
Set up a conda environment with ``conda create -n modalities python=3.10 && conda activate modalities`` and install the requirements with ``pip install -e .``.

Setup Dataset
-------------------------------------------------
To start a training run, you first need to create a memmap index from a jsonl file, then create a packed dataset from it, and then run the training.

.. code-block:: bash

   # Create the memmap index from the jsonl file.
   modalities create_memmap_index <path/to/jsonl/file>

   # Create the packed dataset.
   modalities create_packed_data <path/to/jsonl/file>

For example, using the lorem ipsum data:

.. code-block:: bash

   # Create the memmap index from the jsonl file.
   modalities create_memmap_index data/lorem_ipsum.jsonl

   # Create the packed dataset.
   modalities create_packed_data data/lorem_ipsum.jsonl

Training
----------------------------------------------------
To run a training in a multi-GPU setting, the required environment variables and torchrun arguments are set as follows:

.. code-block:: bash

   CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes 1 --nproc_per_node 2 --rdzv-endpoint=0.0.0.0:29502 src/modalities/__main__.py run --config_file_path config_files/config_lorem_ipsum.yaml

Evaluation
----------------------------------------------------
WIP: contents to be added.
33 changes: 33 additions & 0 deletions _sources/vs_code_setup.rst.txt
@@ -0,0 +1,33 @@
VSCode Setup
====================================================

**EDIT "docs/source/vs_code_setup.rst" IN ORDER TO MAKE CHANGES HERE**

In VSCode, add this to your :file:`launch.json`:

.. code-block:: json

   {
       "name": "Torchrun Main",
       "type": "python",
       "request": "launch",
       "module": "torch.distributed.run",
       "env": {
           "CUDA_VISIBLE_DEVICES": "0"
       },
       "args": [
           "--nnodes",
           "1",
           "--nproc_per_node",
           "2",
           "--rdzv-endpoint=0.0.0.0:29503",
           "src/modalities/__main__.py",
           "run",
           "--config_file_path",
           "config_files/config.yaml"
       ],
       "console": "integratedTerminal",
       "justMyCode": true,
       "envFile": "${workspaceFolder}/.env"
   }