Configuration file

Edo Abati edited this page Aug 23, 2024 · 6 revisions

The configuration file is the main input of the benchmark runner. Please refer to the default_config.json to see an example of what this should look like.

Benchmark config

On a high level, the configuration file should list all the pipelines that should be run.

{
    "pipeline_list": [
        <each-pipeline-in-the-benchmark>
    ]
}

Pipeline Config

For each pipeline, the following keys should be present:

  • name: The name of the pipeline. This is used for reporting purposes.
  • s3_input_data_uri: (Optional) The S3 URI of the location of the input data (i.e. train.csv, dev.csv, test.csv).
  • s3_model_uri: (Optional) The S3 URI of the location to save the model artifacts.
  • pipeline_type: Which supported pipeline to use. This should be omitted if custom_code_location is provided. Supported pipeline types are: "biencoder", "bm25", "class_tfidf", "crossencoder", "doc_tfidf", "roberta", "t5"
  • custom_code_location: (Optional) The path to the custom code to use for the pipeline. See Custom Pipeline for more info.
  • preprocessing: The configuration for the preprocessing step. See Processing Config for more details.
  • training: The configuration for the training step. See Training Config for more details.
  • evaluation: The configuration for the evaluation step. See Processing Config for more details.
  • metrics_file_names: The names of the CSV files containing the metrics to show in the benchmark report.
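
Putting these keys together, a single entry in "pipeline_list" might look like the following sketch. The bucket names, pipeline name, and metrics file name are illustrative placeholders, and the nested step configs are elided (see the sections below for their format):

```json
{
    "name": "bm25-baseline",
    "s3_input_data_uri": "s3://my-bucket/benchmark/input-data/",
    "s3_model_uri": "s3://my-bucket/benchmark/models/",
    "pipeline_type": "bm25",
    "preprocessing": { "...": "see Processing Config" },
    "training": { "...": "see Training Config" },
    "evaluation": { "...": "see Processing Config" },
    "metrics_file_names": ["metrics.csv"]
}
```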

Processing Config

The "preprocessing and "evaluation" steps of the pipeline are considered "processing" steps. Therefore their configurations have the same format.

The following keys should be provided:

  • framework: Which framework to use in the processing step. Available frameworks: SKLearn, HuggingFace. See docs for Scikit Learn Processor and HuggingFace Processor for more info. If not provided, it is inferred from the instance type.
  • instance_type: The type of EC2 instance to use for the task. See Sagemaker docs for the available instance types.
  • instance_count: The number of instances to use for the task.
  • max_runtime_in_hours: The maximum runtime in hours for the task.
  • parameters: The parameters to pass to the task script. These will be passed as CLI arguments.
  • processor_kwargs: Any other kwargs argument that the processor accepts (e.g. role, tags, etc).

Training Config

These are the configurations for the training task. The following keys should be provided:

  • framework: Which framework to use in the training step. Available frameworks: SKLearn, HuggingFace. See docs for Scikit Learn Estimator and HuggingFace Estimator for more info. If not provided, it is inferred from the instance type.
  • instance_type: The type of EC2 instance to use for the task. See Sagemaker docs for the available instance types.
  • instance_count: The number of instances to use for the task.
  • max_runtime_in_hours: The maximum runtime in hours for the task.
  • parameters: The parameters to pass to the task script. These will be passed as CLI arguments.
  • enable_smdistributed: Whether to enable distributed training using SMDistributed Data Parallel.
  • metric_definitions: The metric definitions to use for the task. See [Metric Definitions](#metric-definitions) for more info.
  • estimator_kwargs: Any other kwargs argument that the estimator accepts (e.g. role, tags, etc).
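
A training config sketch, mirroring the processing example. The instance type, parameter, regex, and IAM role ARN are illustrative placeholders:

```json
"training": {
    "framework": "HuggingFace",
    "instance_type": "ml.p3.2xlarge",
    "instance_count": 1,
    "max_runtime_in_hours": 12,
    "parameters": {
        "epochs": 3
    },
    "enable_smdistributed": false,
    "metric_definitions": [
        {"Name": "val:loss", "Regex": "eval_loss': (.*?),"}
    ],
    "estimator_kwargs": {
        "role": "arn:aws:iam::123456789012:role/SageMakerExecutionRole"
    }
}
```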

Metric Definitions

Using the AWS SageMaker metric definition format, you can define any metrics that you want SageMaker to capture from the logs. For example:

"metric_definitions": {
    "Name": "val:loss", "Regex": "eval_loss': (.*?),",
    "Name": "val:microf1", "Regex": "eval_micro_f1': (.*?),",
}

Examples

See the default_config.json for a full benchmark config example.

For a more detailed description of all the parameters available in each built-in pipeline type, please see the Built-in Pipeline Types page.