
Remove attributes from PipelineContext that can change per Block in a Pipeline #491

Open
bbrowning opened this issue Jan 21, 2025 · 1 comment
Labels
enhancement New feature or request epic Larger tracking issue encompassing multiple smaller issues UX Affects the User Experience

Comments

@bbrowning
Contributor

The following attributes of PipelineContext are specific to individual Blocks in a Pipeline, and should not be set in the overall PipelineContext:

  • model_id - Not all Blocks use a model, and those that do may use different models for different Blocks within a single Pipeline. We have upcoming use-cases that expect to mix models within one Pipeline, so we need to get ahead of that.
  • model_family - Same reasoning as above: a single Pipeline could have separate Blocks that call into Mixtral, Granite, or other models.
  • num_instructions_to_generate - This is only used to substitute the n parameter in the gen_kwargs of an LLMBlock's config when it has the special value scaled. Users may want a different value per Block, and it's better to show them how to set this in their Pipeline yaml than to offer a single parameter they may not want applied to all Blocks.
  • max_num_tokens - This is only used to substitute the max_tokens parameter in the gen_kwargs of an LLMBlock's config when the block has one of a few predefined names. That's brittle; users need the ability to adjust the token limit for any LLMBlock in a Pipeline independently, without relying on special predefined block names.

All of these must be specified per-block instead, which today means in the yaml configuration of the Block.
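As a sketch of what per-block configuration could look like, the yaml below sets a model and generation parameters on each LLMBlock directly. This is illustrative only: the exact schema keys (version, blocks, type, config) are assumptions, though gen_kwargs, n, and max_tokens are the parameters named above.

```yaml
# Hypothetical pipeline yaml with per-block settings replacing the
# PipelineContext attributes discussed in this issue.
version: "1.0"
blocks:
  - name: gen_knowledge
    type: LLMBlock
    config:
      model_id: mixtral-8x7b-instruct   # per-block, instead of PipelineContext.model_id
      model_family: mixtral
      gen_kwargs:
        n: 10              # per-block, instead of num_instructions_to_generate / "scaled"
        max_tokens: 2048   # per-block, instead of max_num_tokens
  - name: eval_faithfulness
    type: LLMBlock
    config:
      model_id: granite-7b-lab          # a different model in the same Pipeline
      model_family: granite
      gen_kwargs:
        n: 1
        max_tokens: 512
```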

What this will leave on PipelineContext is:

  • client - the OpenAI Client - optional, as not all Pipelines need to call an inference endpoint
  • dataset_num_procs - the number of parallel threads to use when processing datasets - optional, with a reasonable default
  • batch_num_workers - the number of parallel threads to use when calling inference servers - optional, with a reasonable default
  • batch_size - the batch size to use in completion requests when calling inference endpoints that support batching - optional, with a reasonable default
  • checkpoint_dir - the directory to store checkpoints during data generation, enabling some ability to recover from where we left off if a data generation gets interrupted - optional, with checkpointing disabled if not specified
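To make the resulting shape concrete, here is a minimal sketch of the slimmed-down PipelineContext as a dataclass. The field names come from the list above; the default values and the dataclass form itself are assumptions, not the actual implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PipelineContext:
    """Hypothetical PipelineContext after removing per-Block attributes.

    Only Pipeline-wide concerns remain; model_id, model_family,
    num_instructions_to_generate, and max_num_tokens move into each
    Block's yaml config. Defaults here are illustrative assumptions.
    """

    client: Optional[object] = None        # OpenAI client; optional, not all Pipelines infer
    dataset_num_procs: int = 8             # parallel threads for dataset processing (assumed default)
    batch_num_workers: int = 8             # parallel threads for inference calls (assumed default)
    batch_size: int = 8                    # completion batch size for batching endpoints (assumed default)
    checkpoint_dir: Optional[str] = None   # checkpointing disabled when None
```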
@bbrowning
Contributor Author

A number of these are exposed and passed in via the CLI today for ilab data generate, with the expectation that you set each value once for the entire Pipeline, when in reality that's not the correct API surface to expose here. Changing these to be specified per-block in the pipeline yaml will therefore require some coordination with the instructlab/instructlab team to adjust the CLI parameters passed in. cc @instructlab/core-maintainers

@bbrowning bbrowning added enhancement New feature or request UX Affects the User Experience labels Jan 21, 2025
@bbrowning bbrowning added the epic Larger tracking issue encompassing multiple smaller issues label Jan 28, 2025