Benchmarking
EDIT “docs/source/benchmarking.rst” IN ORDER TO MAKE CHANGES HERE
Configuration
EDIT “docs/source/configuration.rst” IN ORDER TO MAKE CHANGES HERE
Training configs are defined in yaml formatted files; see data/config_lorem_ipsum.yaml. These configs are very explicit, specifying all training parameters to keep model trainings as transparent and reproducible as possible. Each config setting is reflected in pydantic classes in src/llm_gym/config/*.py. In the config you need to define which config class to load in the field type_hint; this specifies the concrete class. A second parameter, config, then takes all the constructor arguments for that config class. This way it is easy to swap e.g. DataLoaders while still having input validation in place.
The mechanism introduced to instantiate classes via type_hint in the config.yaml utilizes
Omegaconf to load the config yaml file
Pydantic for the validation of the config
ClassResolver to instantiate the correct, concrete class of a class hierarchy.
Firstly, Omegaconf loads the config yaml file and resolves internal references such as ${subconfig.attribute}.
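For illustration, a minimal sketch (the keys below are made up for this example) of how OmegaConf resolves such a reference:
from omegaconf import OmegaConf

# Illustrative only: "data" and "model" are hypothetical config keys.
cfg = OmegaConf.create(
    {"data": {"sequence_length": 1024}, "model": {"block_size": "${data.sequence_length}"}}
)
resolved = OmegaConf.to_container(cfg, resolve=True)
assert resolved["model"]["block_size"] == 1024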
Then, Pydantic validates the whole config as is and checks that each of the sub-configs is a pydantic.BaseModel class.
For configs which allow different concrete classes to be instantiated by ClassResolver, the special member names type_hint and config are introduced.
With this we utilize Pydantic's feature to auto-select a fitting type based on the keys in the config yaml file.
ClassResolver replaces large if-else control structures for inferring the correct concrete type, with a type_hint used for class selection:
from class_resolver import ClassResolver
from torch import nn

activation_resolver = ClassResolver(
    [nn.ReLU, nn.Tanh, nn.Hardtanh],
    base=nn.Module,
    default=nn.ReLU,
)
type_hint = "ReLU"
activation_kwargs = {...}
activation_resolver.make(type_hint, activation_kwargs)
In our implementation we go a step further: a type_hint in a BaseModel config must be of type llm_gym.config.lookup_types.LookupEnum, and config is a union of the allowed concrete configs of base type BaseModel. config hereby replaces activation_kwargs from the example above with pydantic-validated BaseModel configs.
With this, a mapping between the type hint strings needed for class-resolver and the concrete classes is introduced, while allowing pydantic to select the correct concrete config:
from enum import Enum

import torch
from pydantic import BaseModel, PositiveInt, PositiveFloat, conint, confloat


class LookupEnum(Enum):
    @classmethod
    def _missing_(cls, value: str) -> type:
        """constructs Enum by member name, if not constructable by value"""
        return cls.__dict__[value]


class SchedulerTypes(LookupEnum):
    StepLR = torch.optim.lr_scheduler.StepLR
    ConstantLR = torch.optim.lr_scheduler.ConstantLR


class StepLRConfig(BaseModel):
    step_size: conint(ge=1)
    gamma: confloat(ge=0.0)


class ConstantLRConfig(BaseModel):
    factor: PositiveFloat
    total_iters: PositiveInt


class SchedulerConfig(BaseModel):
    type_hint: SchedulerTypes
    config: StepLRConfig | ConstantLRConfig
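For illustration, the following sketch (an assumption, not the project's actual loader; the yaml excerpt, optimizer, and variable names are made up) shows how the pieces interact for the SchedulerConfig defined above: OmegaConf parses the yaml, Pydantic validates type_hint and config, and a ClassResolver instantiates the concrete scheduler:
import torch
from class_resolver import ClassResolver
from omegaconf import OmegaConf

# Hypothetical excerpt of a config yaml file.
yaml_snippet = """
scheduler:
  type_hint: StepLR
  config:
    step_size: 10
    gamma: 0.9
"""

raw = OmegaConf.create(yaml_snippet)
# Pydantic validates the sub-config; LookupEnum._missing_ maps "StepLR" to the enum member.
scheduler_cfg = SchedulerConfig(**OmegaConf.to_container(raw.scheduler, resolve=True))

# A resolver over the allowed scheduler classes (base is LRScheduler in torch >= 2.0,
# _LRScheduler in older versions).
scheduler_resolver = ClassResolver(
    [torch.optim.lr_scheduler.StepLR, torch.optim.lr_scheduler.ConstantLR],
    base=torch.optim.lr_scheduler.LRScheduler,
)

# Dummy optimizer as the passed-through dependency.
optimizer = torch.optim.SGD([torch.nn.Parameter(torch.zeros(1))], lr=0.1)
scheduler = scheduler_resolver.make(
    scheduler_cfg.type_hint.name,
    scheduler_cfg.config.dict(),  # .dict() for pydantic v1; model_dump() in pydantic v2
    optimizer=optimizer,
)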
To allow a user-friendly instantiation, all class resolvers are defined in the ResolverRegister, and build_component_by_config is introduced as a convenience function. Dependencies can be passed through with the extra_kwargs argument:
resolvers = ResolverRegister(config=config)
optimizer = ...  # our example dependency
scheduler = resolvers.build_component_by_config(config=config.scheduler, extra_kwargs=dict(optimizer=optimizer))
To add a new resolver, use add_resolver; the added resolver will then be accessible via the register_key given during registration. For access, use the build_component_by_key_query function of the ResolverRegister.
Entry Points
EDIT “docs/source/entrypoints.rst” IN ORDER TO MAKE CHANGES HERE
We use click as a tool to add new entry points and their CLI arguments. For this we have a main entry point from which all other entry points are started.
The main entry point is src/llm_gym/__main__.py:main().
We register other sub-entrypoints by using our main click.group, called main, as follows:
@main.command(name="my_new_entry_point")
See the following full example:
from pathlib import Path

import click
import click_pathlib


@click.group()
def main() -> None:
    pass


config_option = click.option(
    "--config_file_path",
    type=click_pathlib.Path(exists=False),
    required=True,
    help="Path to a file with the YAML config file.",
)


@main.command(name="do_stuff")
@config_option
@click.option(
    "--my_cli_argument",
    type=int,
    required=True,
    help="New integer argument",
)
def entry_point_do_stuff(config_file_path: Path, my_cli_argument: int):
    print(f"Do stuff with {config_file_path} and {my_cli_argument}...")
    ...


if __name__ == "__main__":
    main()
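To exercise such a command in-process, one option (a sketch using click's standard test runner, not something prescribed by the project) is:
from click.testing import CliRunner

# Sketch only: invokes the example command defined above; the argument values are made up.
runner = CliRunner()
result = runner.invoke(
    main, ["do_stuff", "--config_file_path", "config.yaml", "--my_cli_argument", "3537"]
)
print(result.output)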
With
[project.scripts]
llm_gym = "llm_gym.__main__:main"
in our pyproject.toml, we can start only main with llm_gym (which does nothing), or a specific sub-entrypoint, e.g. llm_gym do_stuff --config_file_path config_files/config.yaml --my_cli_argument 3537.
Alternatively, directly use src/llm_gym/__main__.py do_stuff --config_file_path config_files/config.yaml --my_cli_argument 3537.
The team is currently working on our already established LLM code base to bring multi-modality into the mix. This extension will be based on ideas similar to CoCa and/or AudioPaLM, which would enable users to either use different encoders for different modalities in conjunction with a text-based decoder, or use a decoder-only architecture. Future modalities other than text can be used, namely:
image
audio
video
EDIT “docs/source/index.rst” IN ORDER TO MAKE CHANGES HERE
<TODO: Add abstract –> still needed: USPs, key features; include FSDP here;>
<TODO: CAN ADD LINKS TO SPECIFIC THINGS USERS CAN EXPLORE AT FIRST>
Note
This project is under active development.
Known Issues
EDIT “docs/source/known_issues.rst” IN ORDER TO MAKE CHANGES HERE
hardcoded dataset path /raid/s3/opengptx/mehdi/temp/temp_data/train_text_document.bin in config/config.yaml
dependency on Weights & Biases
MemMap Datasets
EDIT “docs/source/memmap.rst” IN ORDER TO MAKE CHANGES HERE
The MemMapDataset requires an index file providing the necessary pointers into the raw data file. The MemMapDataset can create the index file lazily; however, it is advised to create it beforehand. This can be done by running
modalities create_memmap_index <path/to/jsonl/file>
The index will be created in the same directory as the raw data file. For further options you may look into the usage documentation via modalities create_memmap_index --help.
The PackedMemMapDatasetContinuous and PackedMemMapDatasetMegatron require a packed data file. To create it, you first have to generate a MemMapDataset index file as described above. Assuming the index and raw data are located in the same directory, you can simply execute the following command:
modalities create_packed_data <path/to/jsonl/file>
The packed data file will be created in the same directory as the raw data file. For further options you may look into the usage documentation via modalities create_packed_data --help.
The packed data file is a bytestream containing both the tokenized data and an index denoting the start and length of the tokenized documents inside the bytestream. The data file consists of 3 concatenated parts:
header segment | data segment | index segment
header segment: This section is an 8-byte integer which encodes the length of the data segment in bytes.
data segment: This section contains a concatenation of all documents in form of 4-byte tokens. An end-of-sequence token is placed between consecutive documents.
index segment: This section contains a pickled index which locates the documents inside the data segment. The index is basically a list of tuples, where each tuple contains the start position and length in bytes of the corresponding document, e.g., [(start_doc1, len_doc1), (start_doc2, len_doc2), ....].
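As an illustration only (this is not the project's reader; the byte order and function name are assumptions), the layout described above could be parsed like this:
import pickle


def read_packed_layout(path: str):
    with open(path, "rb") as f:
        # header segment: an 8-byte integer with the data segment length in bytes
        # (byte order assumed little-endian here).
        data_segment_len = int.from_bytes(f.read(8), byteorder="little")
        # skip the data segment (concatenated 4-byte tokens)
        f.seek(8 + data_segment_len)
        # index segment: pickled list of (start, length) tuples in bytes
        index = pickle.load(f)
    return data_segment_len, index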
Model Cards
EDIT “docs/source/model_cards.rst” IN ORDER TO MAKE CHANGES HERE
<TODO>
Quickstart
EDIT “docs/source/quickstart.rst” IN ORDER TO MAKE CHANGES HERE
Set up a conda environment with conda create -n modalities python=3.10 && conda activate modalities and install the requirements with pip install -e .
To start a training you need to create a memmap dataset out of a jsonl file first, then pack it, and then run the training.
# Create memmap index from jsonl file.
modalities create_memmap_index <path/to/jsonl/file>

# Create packed dataset.
modalities create_packed_data <path/to/jsonl/file>

For example, using the lorem ipsum data:
# Create memmap index from jsonl file.
modalities create_memmap_index data/lorem_ipsum.jsonl

# Create packed dataset.
modalities create_packed_data data/lorem_ipsum.jsonl
To run a training in a multi-GPU setting, the required environment variables and torchrun call look as follows:
CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes 1 --nproc_per_node 2 --rdzv-endpoint=0.0.0.0:29502 src/modalities/__main__.py run --config_file_path config_files/config_lorem_ipsum.yaml
Evaluation:
WIP: add contents
VS Code Setup
EDIT “docs/source/vs_code_setup.rst” IN ORDER TO MAKE CHANGES HERE
In VS Code, add this to your launch.json:
{
    "name": "Torchrun Main",
    "type": "python",
    "request": "launch",
    "module": "torch.distributed.run",
    "env": {
        "CUDA_VISIBLE_DEVICES": "0"
    },
    "args": [
        "--nnodes",
        "1",
        "--nproc_per_node",
        "2",
        "--rdzv-endpoint=0.0.0.0:29503",
        "src/modalities/__main__.py",
        "run",
        "--config_file_path",
        "config_files/config.yaml"
    ],
    "console": "integratedTerminal",
    "justMyCode": true,
    "envFile": "${workspaceFolder}/.env"
}