Commit ecfda6b ("deploy: c33a8e9")
lllAlexanderlll committed Jan 16, 2024
Showing 62 changed files with 7,205 additions and 0 deletions.
4 changes: 4 additions & 0 deletions .buildinfo
@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: eafc7b279315a8a9b5f44f5c8ecc4328
tags: 645f666f9bcd5a90fca523b33c5a78b7
Empty file added .nojekyll
Empty file.
3 changes: 3 additions & 0 deletions _sources/benchmarking.rst.txt
@@ -0,0 +1,3 @@
Benchmarking
=============================
**EDIT "docs/source/benchmarking.rst" IN ORDER TO MAKE CHANGES HERE**
91 changes: 91 additions & 0 deletions _sources/configuration.rst.txt
@@ -0,0 +1,91 @@
.. role:: python(code)
:language: python

Configuration
========================================================================

**EDIT "docs/source/configuration.rst" IN ORDER TO MAKE CHANGES HERE**

The training config is defined in YAML-formatted files; see :file:`data/config_lorem_ipsum.yaml`. These configs are deliberately explicit and specify all training parameters, so that model training runs remain as transparent and reproducible as possible. Each config setting is reflected in pydantic classes in :file:`src/llm_gym/config/*.py`. In the config you define which config class to load via the field :python:`type_hint`, which specifies the concrete class. A second field, :python:`config`, then takes all the constructor arguments for that config class. This makes it easy to exchange components, e.g. DataLoaders, while still having input validation in place.

Pydantic and ClassResolver
------------------------------------------------------------------------

The mechanism introduced to instantiate classes via :python:`type_hint` in the :file:`config.yaml` utilizes

1) Omegaconf to load the config YAML file
2) Pydantic for the validation of the config
3) ClassResolver to instantiate the correct, concrete class of a class hierarchy.

Firstly, Omegaconf loads the config YAML file and resolves internal references such as `${subconfig.attribute}`.
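
As a rough sketch of this first step (using the standard omegaconf API; illustrative only, not the project's actual loading code):

.. code-block:: python

   from omegaconf import OmegaConf

   # Load the YAML file; interpolations such as ${subconfig.attribute} are resolved
   # lazily on access or explicitly when converting to a plain container.
   cfg = OmegaConf.load("data/config_lorem_ipsum.yaml")
   config_dict = OmegaConf.to_container(cfg, resolve=True)  # plain dict, ready for validation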

Then, Pydantic validates the whole config as is and checks that each of the sub-configs is a :python:`pydantic.BaseModel` class.
For configs that allow different concrete classes to be instantiated by :python:`ClassResolver`, the special member names :python:`type_hint` and :python:`config` are introduced.
With this we utilize Pydantic's ability to auto-select a fitting type based on the keys in the config YAML file.

:python:`ClassResolver` replaces large if-else control structures for inferring the correct concrete type; instead, a :python:`type_hint` is used to select the class:

.. code-block:: python

   from class_resolver import ClassResolver
   from torch import nn

   activation_resolver = ClassResolver(
       [nn.ReLU, nn.Tanh, nn.Hardtanh],
       base=nn.Module,
       default=nn.ReLU,
   )
   type_hint = "ReLU"
   activation_kwargs = {...}
   activation = activation_resolver.make(type_hint, activation_kwargs)

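
For comparison, the manual dispatch such a resolver replaces might look roughly like this (an illustrative sketch, not code from the repository):

.. code-block:: python

   from torch import nn

   def build_activation(type_hint: str, **activation_kwargs) -> nn.Module:
       # The kind of if-else branching that ClassResolver makes unnecessary.
       if type_hint == "ReLU":
           return nn.ReLU(**activation_kwargs)
       elif type_hint == "Tanh":
           return nn.Tanh(**activation_kwargs)
       elif type_hint == "Hardtanh":
           return nn.Hardtanh(**activation_kwargs)
       raise ValueError(f"Unknown activation: {type_hint}")
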
In our implementation we go a step further, as both

* a :python:`type_hint` in a :python:`BaseModel` config must be of type :python:`llm_gym.config.lookup_types.LookupEnum` and
* :python:`config` is a union of allowed concrete configs of base type :python:`BaseModel`.

:python:`config` hereby takes the place of :python:`activation_kwargs` in the example above, replacing it with pydantic-validated :python:`BaseModel` configs.

With this, a mapping between the type-hint strings needed by `class-resolver` and the concrete classes is introduced, while pydantic can still select the correct concrete config:

.. code-block:: python

   from enum import Enum

   import torch
   from pydantic import BaseModel, PositiveInt, PositiveFloat, conint, confloat

   class LookupEnum(Enum):
       @classmethod
       def _missing_(cls, value: str) -> type:
           """Constructs the Enum by member name, if not constructable by value."""
           return cls.__dict__[value]

   class SchedulerTypes(LookupEnum):
       StepLR = torch.optim.lr_scheduler.StepLR
       ConstantLR = torch.optim.lr_scheduler.ConstantLR

   class StepLRConfig(BaseModel):
       step_size: conint(ge=1)
       gamma: confloat(ge=0.0)

   class ConstantLRConfig(BaseModel):
       factor: PositiveFloat
       total_iters: PositiveInt

   class SchedulerConfig(BaseModel):
       type_hint: SchedulerTypes
       config: StepLRConfig | ConstantLRConfig

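
Under these definitions, validating a scheduler entry could look roughly as follows (a hypothetical sketch; the values are made up and assume pydantic resolves the union and the :python:`LookupEnum` as described above):

.. code-block:: python

   # Hypothetical entry as it would arrive from the config YAML file.
   raw = {"type_hint": "StepLR", "config": {"step_size": 10, "gamma": 0.1}}

   # type_hint is constructed by member name via LookupEnum._missing_, and
   # config is validated against the matching BaseModel in the union.
   scheduler_config = SchedulerConfig(**raw)
   assert scheduler_config.type_hint == SchedulerTypes.StepLR
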
To allow user-friendly instantiation, all class resolvers are defined in the :python:`ResolverRegistry`, and :python:`build_component_by_config` is introduced as a convenience function. Dependencies can be passed through with the :python:`extra_kwargs` argument:

.. code-block:: python

   resolvers = ResolverRegister(config=config)
   optimizer = ...  # our example dependency
   scheduler = resolvers.build_component_by_config(config=config.scheduler, extra_kwargs=dict(optimizer=optimizer))

To add a new resolver, use :python:`add_resolver`; the added resolver will then be accessible via the register_key given during adding.

For access, use the :python:`build_component_by_key_query` function of the :python:`ResolverRegistry`.



71 changes: 71 additions & 0 deletions _sources/entrypoints.rst.txt
@@ -0,0 +1,71 @@
.. role:: python(code)
:language: python

.. role:: bash(code)
:language: bash


Entrypoints
=======================================================


**EDIT "docs/source/entrypoints.rst" IN ORDER TO MAKE CHANGES HERE**

We use `click <https://click.palletsprojects.com/en/>`_ as a tool to add new entry points and their CLI arguments.
For this we have a main entry point from which all other entry points are started.

The main entry point is :file:`src/llm_gym/__main__.py:main()`.
We register other sub-entrypoints by using our main :python:`click.group`, called :python:`main`, as follows:

.. code-block:: python

   @main.command(name="my_new_entry_point")

See the following full example:

.. code-block:: python

   from pathlib import Path

   import click
   import click_pathlib

   @click.group()
   def main() -> None:
       pass

   config_option = click.option(
       "--config_file_path",
       type=click_pathlib.Path(exists=False),
       required=True,
       help="Path to a file with the YAML config file.",
   )

   @main.command(name="do_stuff")
   @config_option
   @click.option(
       "--my_cli_argument",
       type=int,
       required=True,
       help="New integer argument",
   )
   def entry_point_do_stuff(config_file_path: Path, my_cli_argument: int):
       print(f"Do stuff with {config_file_path} and {my_cli_argument}...")
       ...

   if __name__ == "__main__":
       main()

With

.. code-block:: toml

   [project.scripts]
   llm_gym = "llm_gym.__main__:main"

in our :file:`pyproject.toml`, we can start the bare main group with :bash:`llm_gym` (which does nothing on its own), or a specific sub-entrypoint, e.g. :bash:`llm_gym do_stuff --config_file_path config_files/config.yaml --my_cli_argument 3537`.
Alternatively, call the module directly: :bash:`src/llm_gym/__main__.py do_stuff --config_file_path config_files/config.yaml --my_cli_argument 3537`.
9 changes: 9 additions & 0 deletions _sources/future_work.rst.txt
@@ -0,0 +1,9 @@
Future Work
=======================================================

The team is currently extending our already established LLM code base to bring multi-modality into the mix. This extension will be based on ideas similar to CoCa and/or AudioPaLM, which would enable users either to use different encoders for different modalities in conjunction with a text-based decoder, or to use a decoder-only architecture.
Beyond text, future modalities include:

* image
* audio
* video
48 changes: 48 additions & 0 deletions _sources/index.rst.txt
@@ -0,0 +1,48 @@
Welcome to Modalities' documentation!
======================================================================

**EDIT "docs/source/index.rst" IN ORDER TO MAKE CHANGES HERE**

<TODO: Add abstract --> still needed: USPs, key features; include FSDP here;>

<TODO: CAN ADD LINKS TO SPECIFIC THINGS USERS CAN EXPLORE AT FIRST>


.. note::

This project is under active development.

.. toctree::
:caption: Getting Started

quickstart
configuration
model_cards
benchmarking
known_issues

.. toctree::
:caption: Datasets

memmap

.. toctree::
:caption: Entrypoints

entrypoints

.. toctree::
:caption: VSCode Setup

vs_code_setup


.. toctree::
:caption: Future Work

future_work

.. toctree::
:caption: API

api/modules
7 changes: 7 additions & 0 deletions _sources/known_issues.rst.txt
@@ -0,0 +1,7 @@
Known Issues
==================================================================

**EDIT "docs/source/known_issues.rst" IN ORDER TO MAKE CHANGES HERE**

1. Hardcoded dataset path :file:`/raid/s3/opengptx/mehdi/temp/temp_data/train_text_document.bin` in :file:`config/config.yaml`
2. Dependency on Weights & Biases
43 changes: 43 additions & 0 deletions _sources/memmap.rst.txt
@@ -0,0 +1,43 @@
.. role:: python(code)
:language: python

.. role:: bash(code)
:language: bash

MemMap Datasets
====================================================

**EDIT "docs/source/memmap.rst" IN ORDER TO MAKE CHANGES HERE**

MemMapDataset Index Generator
------------------------------------------------------------------------------

The :python:`MemMapDataset` requires an index file providing the necessary pointers into the raw data file. The :python:`MemMapDataset` can create the index file lazily; however, it is advised to create it beforehand. This can be done by running

.. code-block:: bash

   modalities create_memmap_index <path/to/jsonl/file>

The index will be created in the same directory as the raw data file. For further options, you may look into the usage documentation via :bash:`modalities create_memmap_index --help`.

Packed Dataset Generator
--------------------------------------------------------------------------------

The :python:`PackedMemMapDatasetContinuous` and :python:`PackedMemMapDatasetMegatron` require a packed data file. To create the data file, you first have to generate a :python:`MemMapDataset` index file as described `above <memMapDataset-index-generator>`_. Assuming the index and raw data are located in the same directory, you can simply execute the following command:

.. code-block:: bash

   modalities create_packed_data <path/to/jsonl/file>

The packed data file will be created in the same directory as the raw data file. For further options, you may look into the usage documentation via :bash:`modalities create_packed_data --help`.

Packed Data Format
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The packed data file is a bytestream containing both the tokenized data and an index denoting the start and length of the tokenized documents inside the bytestream. The data file consists of 3 concatenated parts:

header segment | data segment | index segment

* **header segment**: This section is an 8-byte integer encoding the length of the data segment in bytes.
* **data segment**: This section contains the concatenation of all documents in the form of 4-byte tokens. An end-of-sequence token is placed between consecutive documents.
* **index segment**: This section contains a pickled index that locates the documents inside the data segment. The index is essentially a list of tuples, where each tuple contains the start position and length in bytes for the corresponding document, e.g., :python:`[(start_doc1, len_doc1), (start_doc2, len_doc2), ...]`.
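
For illustration, a minimal sketch that splits such a file into its three segments could look like this (the byte order is an assumption, not taken from the actual implementation):

.. code-block:: python

   import pickle

   HEADER_SIZE = 8  # header segment: data segment length in bytes
   TOKEN_SIZE = 4   # data segment: each token is stored as 4 bytes

   def inspect_packed_file(path: str) -> None:
       """Sketch: split a packed data file into header, data, and index segments."""
       with open(path, "rb") as f:
           raw = f.read()
       # Assumption: the 8-byte header is a little-endian integer.
       data_len = int.from_bytes(raw[:HEADER_SIZE], byteorder="little")
       data_segment = raw[HEADER_SIZE:HEADER_SIZE + data_len]
       index = pickle.loads(raw[HEADER_SIZE + data_len:])  # [(start, length), ...] in bytes
       print(f"{len(data_segment) // TOKEN_SIZE} tokens across {len(index)} documents")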
4 changes: 4 additions & 0 deletions _sources/model_cards.rst.txt
@@ -0,0 +1,4 @@
Model Cards
====================================================
**EDIT "docs/source/model_cards.rst" IN ORDER TO MAKE CHANGES HERE**
<TODO>
41 changes: 41 additions & 0 deletions _sources/quickstart.rst.txt
@@ -0,0 +1,41 @@
Quickstart
====================================================

**EDIT "docs/source/quickstart.rst" IN ORDER TO MAKE CHANGES HERE**

Installation
-----------------------------------------------------
Set up a conda environment with ``conda create -n modalities python=3.10 && conda activate modalities`` and install the requirements with ``pip install -e .``.

Setup Dataset
-------------------------------------------------
To start a training run, you first need to create a memmap index from a jsonl file, then create a packed dataset from it, and then run the training.

.. code-block:: bash

   # Create the memmap index from the jsonl file.
   modalities create_memmap_index <path/to/jsonl/file>

   # Create the packed dataset.
   modalities create_packed_data <path/to/jsonl/file>

For example, using the lorem ipsum data:

.. code-block:: bash

   # Create the memmap index from the jsonl file.
   modalities create_memmap_index data/lorem_ipsum.jsonl

   # Create the packed dataset.
   modalities create_packed_data data/lorem_ipsum.jsonl

Training
----------------------------------------------------
To run a training in a multi-GPU setting, the required environment variables and torchrun arguments are set as follows:

.. code-block:: bash

   CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes 1 --nproc_per_node 2 --rdzv-endpoint=0.0.0.0:29502 src/modalities/__main__.py run --config_file_path config_files/config_lorem_ipsum.yaml

Evaluation
----------------------------------------------------
WIP: contents to be added.
33 changes: 33 additions & 0 deletions _sources/vs_code_setup.rst.txt
@@ -0,0 +1,33 @@
VSCode Setup
====================================================

**EDIT "docs/source/vs_code_setup.rst" IN ORDER TO MAKE CHANGES HERE**

In VSCode, add this to your :file:`launch.json`:

.. code-block:: json

   {
       "name": "Torchrun Main",
       "type": "python",
       "request": "launch",
       "module": "torch.distributed.run",
       "env": {
           "CUDA_VISIBLE_DEVICES": "0"
       },
       "args": [
           "--nnodes",
           "1",
           "--nproc_per_node",
           "2",
           "--rdzv-endpoint=0.0.0.0:29503",
           "src/modalities/__main__.py",
           "run",
           "--config_file_path",
           "config_files/config.yaml"
       ],
       "console": "integratedTerminal",
       "justMyCode": true,
       "envFile": "${workspaceFolder}/.env"
   }