
Remove attributes from PipelineContext that can change per Block in a Pipeline #491

Open
bbrowning opened this issue Jan 21, 2025 · 1 comment
Labels
enhancement New feature or request epic Larger tracking issue encompassing multiple smaller issues UX Affects the User Experience

Comments

@bbrowning
Contributor

The following attributes of PipelineContext are specific to individual Blocks in a Pipeline, and should not be set in the overall PipelineContext:

  • model_id - Not all Blocks use a model, and those that do may use different models for different Blocks within a single Pipeline. We have upcoming use-cases that expect to mix models within one Pipeline, so we need to get ahead of that.
  • model_family - Same reasoning as above: a single Pipeline could have separate Blocks that call into Mixtral, Granite, or other models.
  • num_instructions_to_generate - This is only used to substitute the n parameter in the gen_kwargs of an LLMBlock's config when it has the special value scaled. Users may want a different value per Block, and it's better to show them how to set this in their Pipeline yaml than to offer a single parameter they may not want applied to all Blocks.
  • max_num_tokens - This is only used to substitute the max_tokens parameter in the gen_kwargs of an LLMBlock's config when the block has one of a few predefined names. That's brittle; users need the ability to adjust the token limit for any LLMBlock in a Pipeline independently, without relying on special predefined block names.

All of these must be specified per-block instead, which today means in the yaml configuration of the Block.
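As a sketch of what per-block configuration could look like, the yaml below sets a model and generation parameters on each LLMBlock directly. This is illustrative only: the exact schema keys (version, blocks, type, config) are assumptions, though gen_kwargs, n, and max_tokens are the parameters named above.

```yaml
# Hypothetical pipeline yaml with per-block settings replacing the
# PipelineContext attributes discussed in this issue.
version: "1.0"
blocks:
  - name: gen_knowledge
    type: LLMBlock
    config:
      model_id: mixtral-8x7b-instruct   # per-block, instead of PipelineContext.model_id
      model_family: mixtral
      gen_kwargs:
        n: 10              # per-block, instead of num_instructions_to_generate / "scaled"
        max_tokens: 2048   # per-block, instead of max_num_tokens
  - name: eval_faithfulness
    type: LLMBlock
    config:
      model_id: granite-7b-lab          # a different model in the same Pipeline
      model_family: granite
      gen_kwargs:
        n: 1
        max_tokens: 512
```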

What this will leave on PipelineContext is:

  • client - the OpenAI Client - optional, as not all Pipelines need to call an inference endpoint
  • dataset_num_procs - the number of parallel threads to use when processing datasets - optional, with a reasonable default
  • batch_num_workers - the number of parallel threads to use when calling inference servers - optional, with a reasonable default
  • batch_size - the batch size to use in completion requests when calling inference endpoints that support batching - optional, with a reasonable default
  • checkpoint_dir - the directory to store checkpoints during data generation, enabling some ability to recover from where we left off if a data generation gets interrupted - optional, with checkpointing disabled if not specified
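To make the resulting shape concrete, here is a minimal sketch of the slimmed-down PipelineContext as a dataclass. The field names come from the list above; the default values and the dataclass form itself are assumptions, not the actual implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PipelineContext:
    """Hypothetical PipelineContext after removing per-Block attributes.

    Only Pipeline-wide concerns remain; model_id, model_family,
    num_instructions_to_generate, and max_num_tokens move into each
    Block's yaml config. Defaults here are illustrative assumptions.
    """

    client: Optional[object] = None        # OpenAI client; optional, not all Pipelines infer
    dataset_num_procs: int = 8             # parallel threads for dataset processing (assumed default)
    batch_num_workers: int = 8             # parallel threads for inference calls (assumed default)
    batch_size: int = 8                    # completion batch size for batching endpoints (assumed default)
    checkpoint_dir: Optional[str] = None   # checkpointing disabled when None
```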
@bbrowning
Contributor Author

A number of these are exposed and passed in via the CLI today for ilab data generate, with the expectation that you set each value once for the entire Pipeline, when in reality that's not the correct API surface to expose here. Changing these to be specified per-block in the pipeline yaml will therefore require some coordination with the instructlab/instructlab team to adjust the CLI parameters passed in. cc @instructlab/core-maintainers

@bbrowning bbrowning added enhancement New feature or request UX Affects the User Experience labels Jan 21, 2025
@bbrowning bbrowning added the epic Larger tracking issue encompassing multiple smaller issues label Jan 28, 2025