Benchmarking
=============================

**EDIT "docs/source/benchmarking.rst" IN ORDER TO MAKE CHANGES HERE**
.. role:: python(code)
   :language: python

Configuration
========================================================================

**EDIT "docs/source/configuration.rst" IN ORDER TO MAKE CHANGES HERE**

Training configs are defined in YAML-formatted files; see :file:`data/config_lorem_ipsum.yaml` for an example. These configs are very explicit, specifying all training parameters to keep model trainings as transparent and reproducible as possible. Each config setting is reflected in pydantic classes in :file:`src/llm_gym/config/*.py`. In the config you need to define which config class to load in the field :python:`type_hint`; this specifies the concrete class. A second parameter, :python:`config`, then takes all the constructor arguments for that config class. This way it is easy to exchange, e.g., DataLoaders while still having input validation in place.
Pydantic and ClassResolver
------------------------------------------------------------------------

The mechanism introduced to instantiate classes via :python:`type_hint` in the :file:`config.yaml` utilizes

1) Omegaconf to load the config YAML file,
2) Pydantic to validate the config, and
3) ClassResolver to instantiate the correct, concrete class of a class hierarchy.

First, Omegaconf loads the config YAML file and resolves internal references such as ``${subconfig.attribute}``.

Then, Pydantic validates the whole config as-is and checks that each of the sub-configs is a :python:`pydantic.BaseModel` class.
For configs which allow different concrete classes to be instantiated by :python:`ClassResolver`, the special member names :python:`type_hint` and :python:`config` are introduced.
With this we utilize Pydantic's ability to auto-select a fitting type based on the keys in the config YAML file.

:python:`ClassResolver` replaces large if-else control structures for inferring the correct concrete type; the :python:`type_hint` is used for class selection:
.. code-block:: python

   from class_resolver import ClassResolver
   from torch import nn

   activation_resolver = ClassResolver(
       [nn.ReLU, nn.Tanh, nn.Hardtanh],
       base=nn.Module,
       default=nn.ReLU,
   )
   type_hint = "ReLU"
   activation_kwargs = {...}  # constructor arguments for the selected class
   activation_resolver.make(type_hint, activation_kwargs)

In our implementation we go a step further, as both

* a :python:`type_hint` in a :python:`BaseModel` config must be of type :python:`llm_gym.config.lookup_types.LookupEnum` and
* :python:`config` is a union of allowed concrete configs of base type :python:`BaseModel`.

:python:`config` hereby takes the place of :python:`activation_kwargs` in the example above, replacing it with pydantic-validated :python:`BaseModel` configs.
With this, a mapping between the type-hint strings needed for `class-resolver` and the concrete classes is introduced, while allowing pydantic to select the correct concrete config:

.. code-block:: python

   from enum import Enum

   import torch
   from pydantic import BaseModel, PositiveFloat, PositiveInt, confloat, conint


   class LookupEnum(Enum):
       @classmethod
       def _missing_(cls, value: str) -> type:
           """Constructs the Enum by member name, if not constructable by value."""
           return cls.__dict__[value]


   class SchedulerTypes(LookupEnum):
       StepLR = torch.optim.lr_scheduler.StepLR
       ConstantLR = torch.optim.lr_scheduler.ConstantLR


   class StepLRConfig(BaseModel):
       step_size: conint(ge=1)
       gamma: confloat(ge=0.0)


   class ConstantLRConfig(BaseModel):
       factor: PositiveFloat
       total_iters: PositiveInt


   class SchedulerConfig(BaseModel):
       type_hint: SchedulerTypes
       config: StepLRConfig | ConstantLRConfig
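The effect of :python:`_missing_` can be seen in isolation with a small, torch-free sketch, where the scheduler classes are toy stand-ins for the real ones: regular Enum lookup by value keeps working, while lookup by member name, as needed for the plain strings coming from the YAML file, starts working too.

```python
from enum import Enum


class LookupEnum(Enum):
    @classmethod
    def _missing_(cls, value):
        # Fall back to lookup by member *name* when lookup by value fails.
        return cls.__dict__[value]


# Toy stand-ins for real classes such as torch.optim.lr_scheduler.StepLR.
class StepLR: ...
class ConstantLR: ...


class SchedulerTypes(LookupEnum):
    StepLR = StepLR
    ConstantLR = ConstantLR


# Regular Enum lookup by *value* (the class itself) still works:
assert SchedulerTypes(StepLR) is SchedulerTypes.StepLR
# Lookup by member *name*, as needed for YAML strings like "ConstantLR", works too:
assert SchedulerTypes("ConstantLR") is SchedulerTypes.ConstantLR
assert SchedulerTypes("ConstantLR").value is ConstantLR
```

Because the enum member's value is the class itself, resolving the YAML string to a constructable class is a single enum lookup.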
To allow user-friendly instantiation, all class resolvers are defined in the :python:`ResolverRegister`, and :python:`build_component_by_config` is introduced as a convenience function. Dependencies can be passed through with the :python:`extra_kwargs` argument:

.. code-block:: python

   resolvers = ResolverRegister(config=config)
   optimizer = ...  # our example dependency
   scheduler = resolvers.build_component_by_config(
       config=config.scheduler, extra_kwargs=dict(optimizer=optimizer)
   )

To add a new resolver, use :python:`add_resolver`; the added resolver will then be accessible by the ``register_key`` given during adding.

For access, use the :python:`build_component_by_key_query` function of the :python:`ResolverRegister`.
.. role:: python(code)
   :language: python

.. role:: bash(code)
   :language: bash

Entrypoints
=======================================================

**EDIT "docs/source/entrypoints.rst" IN ORDER TO MAKE CHANGES HERE**

We use `click <https://click.palletsprojects.com/en/>`_ as a tool to add new entry points and their CLI arguments.
For this we have a main entry point from which all other entry points are started.

The main entry point is :file:`src/llm_gym/__main__.py:main()`.
We register other sub-entrypoints by using our main :python:`click.group`, called :python:`main`, as follows:
.. code-block:: python

   @main.command(name="my_new_entry_point")

See the following full example:

.. code-block:: python

   from pathlib import Path

   import click
   import click_pathlib


   @click.group()
   def main() -> None:
       pass


   config_option = click.option(
       "--config_file_path",
       type=click_pathlib.Path(exists=False),
       required=True,
       help="Path to a file with the YAML config file.",
   )


   @main.command(name="do_stuff")
   @config_option
   @click.option(
       "--my_cli_argument",
       type=int,
       required=True,
       help="New integer argument",
   )
   def entry_point_do_stuff(config_file_path: Path, my_cli_argument: int):
       print(f"Do stuff with {config_file_path} and {my_cli_argument}...")
       ...


   if __name__ == "__main__":
       main()
With

.. code-block:: toml

   [project.scripts]
   llm_gym = "llm_gym.__main__:main"

in our :file:`pyproject.toml`, we can start only :python:`main` with :bash:`llm_gym` (which does nothing), or a specific sub-entrypoint, e.g. :bash:`llm_gym do_stuff --config_file_path config_files/config.yaml --my_cli_argument 3537`.
Alternatively, directly use :bash:`python src/llm_gym/__main__.py do_stuff --config_file_path config_files/config.yaml --my_cli_argument 3537`.
Future Work
=======================================================

The team is currently extending our already established LLM code base to bring multi-modality into the mix. This extension will be based on ideas similar to CoCa and/or AudioPaLM, which would enable users to either use different encoders for different modalities in conjunction with a text-based decoder, or use a decoder-only architecture.
Modalities other than text can then be used, namely:

* image
* audio
* video
Welcome to Modalities's documentation!
======================================================================

**EDIT "docs/source/index.rst" IN ORDER TO MAKE CHANGES HERE**

<TODO: Add abstract --> still needed: USPs, key features; include FSDP here;>

<TODO: CAN ADD LINKS TO SPECIFIC THINGS USERS CAN EXPLORE AT FIRST>

.. note::

   This project is under active development.

.. toctree::
   :caption: Getting Started

   quickstart
   configuration
   model_cards
   benchmarking
   known_issues

.. toctree::
   :caption: Datasets

   memmap

.. toctree::
   :caption: Entrypoints

   entrypoints

.. toctree::
   :caption: VSCode Setup

   vs_code_setup

.. toctree::
   :caption: Future Work

   future_work

.. toctree::
   :caption: API

   api/modules
Known Issues
==================================================================

**EDIT "docs/source/known_issues.rst" IN ORDER TO MAKE CHANGES HERE**

1. Hardcoded dataset path :file:`/raid/s3/opengptx/mehdi/temp/temp_data/train_text_document.bin` in :file:`config/config.yaml`
2. Dependency on Weights & Biases
.. role:: python(code)
   :language: python

.. role:: bash(code)
   :language: bash

MemMap Datasets
====================================================

**EDIT "docs/source/memmap.rst" IN ORDER TO MAKE CHANGES HERE**

MemMapDataset Index Generator
------------------------------------------------------------------------------

The :python:`MemMapDataset` requires an index file providing the necessary pointers into the raw data file. The :python:`MemMapDataset` can create the index file lazily; however, it is advised to create it beforehand. This can be done by running

.. code-block:: bash

   modalities create_memmap_index <path/to/jsonl/file>

The index will be created in the same directory as the raw data file. For further options, have a look at the usage documentation via :bash:`modalities create_memmap_index --help`.
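The idea behind such an index can be illustrated with a short, hypothetical sketch: scan the jsonl file once and record a byte pointer per line. The function name and the in-memory list-of-tuples shape are illustrative assumptions; the actual index written by ``create_memmap_index`` may be stored differently on disk.

```python
import tempfile
from pathlib import Path


def build_jsonl_index(jsonl_path: Path) -> list[tuple[int, int]]:
    """Record a (byte offset, byte length) pair for every line of a jsonl file.

    Illustrative sketch only: it shows the *idea* of an index of pointers
    into the raw data file, not the project's actual on-disk format.
    """
    index, offset = [], 0
    with jsonl_path.open("rb") as f:
        for line in f:
            index.append((offset, len(line)))
            offset += len(line)
    return index


# Tiny demonstration on a throwaway two-line jsonl file.
sample = Path(tempfile.mkdtemp()) / "sample.jsonl"
sample.write_bytes(b'{"text": "hello"}\n{"text": "world!"}\n')
index = build_jsonl_index(sample)
assert index == [(0, 18), (18, 19)]  # one (offset, length) entry per document
```

With such pointers, any document can later be read with a single seek instead of scanning the whole file.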
Packed Dataset Generator
--------------------------------------------------------------------------------

The :python:`PackedMemMapDatasetContinuous` and :python:`PackedMemMapDatasetMegatron` require a packed data file. To create the data file, you first have to generate a :python:`MemMapDataset` index file as described in `MemMapDataset Index Generator`_ above. Assuming the index and raw data are located in the same directory, you can simply execute the following command:

.. code-block:: bash

   modalities create_packed_data <path/to/jsonl/file>

The packed data file will be created in the same directory as the raw data file. For further options, have a look at the usage documentation via :bash:`modalities create_packed_data --help`.
Packed Data Format
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The packed data file is a bytestream containing both the tokenized data and an index denoting the start and length of the tokenized documents inside the bytestream. The data file consists of three concatenated parts:

``header segment | data segment | index segment``

* **header segment**: This section is an 8-byte integer which encodes the length of the data segment in bytes.
* **data segment**: This section contains a concatenation of all documents in the form of 4-byte tokens. An end-of-sequence token is placed between consecutive documents.
* **index segment**: This section contains a pickled index which locates the documents inside the data segment. The index is basically a list of tuples, where each tuple contains the start position and length in bytes of the corresponding document, e.g., :python:`[(start_doc1, len_doc1), (start_doc2, len_doc2), ...]`.
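As a sketch, the three segments described above can be read back with plain ``struct`` and ``pickle``. Note the assumptions not fixed by the description: the byte order of the header (little-endian here) and index start positions being relative to the beginning of the data segment.

```python
import pickle
import struct


def split_packed_file(raw: bytes):
    """Split a packed bytestream into its data segment and document index.

    Sketch only: assumes a little-endian header and index offsets relative
    to the start of the data segment.
    """
    # header segment: 8-byte integer encoding the data segment's length in bytes
    (data_len,) = struct.unpack("<Q", raw[:8])
    # data segment: all documents concatenated as 4-byte tokens
    data_segment = raw[8 : 8 + data_len]
    # index segment: pickled list of (start, length) tuples, one per document
    index = pickle.loads(raw[8 + data_len :])
    return data_segment, index


# Round-trip demo with two toy "documents" made of 4-byte tokens.
doc1 = struct.pack("<2I", 1, 2)      # document 1: tokens [1, 2]
doc2 = struct.pack("<3I", 3, 4, 5)   # document 2: tokens [3, 4, 5]
data = doc1 + doc2
index = [(0, len(doc1)), (len(doc1), len(doc2))]
packed = struct.pack("<Q", len(data)) + data + pickle.dumps(index)

data_segment, idx = split_packed_file(packed)
assert [data_segment[start : start + length] for start, length in idx] == [doc1, doc2]
```

The header makes the index recoverable without scanning the data: its 8 bytes give the exact position at which the pickled index begins.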
Model Cards
====================================================

**EDIT "docs/source/model_cards.rst" IN ORDER TO MAKE CHANGES HERE**

<TODO>
Quickstart
====================================================

**EDIT "docs/source/quickstart.rst" IN ORDER TO MAKE CHANGES HERE**

Installation
-----------------------------------------------------

Set up a conda environment with `conda create -n modalities python=3.10 && conda activate modalities` and install the requirements with `pip install -e .`.
Setup Dataset
-------------------------------------------------

To start a training, you first need to create a memmap index out of a jsonl file, then pack the data, and finally run the training.

.. code-block:: bash

   # Create memmap index from jsonl file.
   modalities create_memmap_index <path/to/jsonl/file>

   # Create packed dataset.
   modalities create_packed_data <path/to/jsonl/file>

For example, using the lorem ipsum data:

.. code-block:: bash

   # Create memmap index from jsonl file.
   modalities create_memmap_index data/lorem_ipsum.jsonl

   # Create packed dataset.
   modalities create_packed_data data/lorem_ipsum.jsonl
Training
----------------------------------------------------

Running a training in a multi-GPU setting requires setting the respective environment variables, e.g.:

.. code-block:: bash

   CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes 1 --nproc_per_node 2 --rdzv-endpoint=0.0.0.0:29502 src/modalities/__main__.py run --config_file_path config_files/config_lorem_ipsum.yaml

Evaluation
----------------------------------------------------

WIP: add contents.
VSCode Setup
====================================================

**EDIT "docs/source/vs_code_setup.rst" IN ORDER TO MAKE CHANGES HERE**

In VSCode, add this to your :file:`launch.json`:

.. code-block:: json

   {
       "name": "Torchrun Main",
       "type": "python",
       "request": "launch",
       "module": "torch.distributed.run",
       "env": {
           "CUDA_VISIBLE_DEVICES": "0"
       },
       "args": [
           "--nnodes",
           "1",
           "--nproc_per_node",
           "2",
           "--rdzv-endpoint=0.0.0.0:29503",
           "src/modalities/__main__.py",
           "run",
           "--config_file_path",
           "config_files/config.yaml"
       ],
       "console": "integratedTerminal",
       "justMyCode": true,
       "envFile": "${workspaceFolder}/.env"
   }