Simplify `base_document` column usage with auxiliary instructions in pipeline config #228

bbrowning · 2024-07-29T18:15:42Z

Currently, users who create auxiliary instructions have to add a `base_document` column that contains the original document, and they also have to make sure that column name ends up as one of the `dataset_type` values. An example from our full pipeline config:

blocks:
  - name: duplicate_document_col
    type: DuplicateColumnsBlock
    config:
      columns_map:
        document: base_document
  - name: gen_spellcheck
    type: LLMBlock
    config:
      config_path: ../../configs/knowledge/spellcheck.yaml
      output_cols:
        - spellcheck
      gen_kwargs:
        max_tokens: 2048
  - name: flatten_auxiliary_columns
    type: FlattenColumnsBlock
    config:
      var_cols:
        - spellcheck
        - base_document
      value_name: corrected_document
      var_name: dataset_type
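
To make the data flow in that config easier to follow, here is a minimal, dependency-free sketch of what the three blocks do to a single row. It assumes the flatten step behaves like a wide-to-long reshape. The function names `duplicate_columns` and `flatten_columns` and the sample text are hypothetical stand-ins, not the real block implementations.

```python
# Hypothetical model of the three blocks in the config above.
# The real blocks operate on datasets; this only mirrors the data flow.

def duplicate_columns(row, columns_map):
    """Copy each source column into a new column (DuplicateColumnsBlock stand-in)."""
    out = dict(row)
    for src, dst in columns_map.items():
        out[dst] = row[src]
    return out


def flatten_columns(row, var_cols, value_name, var_name):
    """Emit one output row per column in var_cols (FlattenColumnsBlock stand-in)."""
    id_cols = {k: v for k, v in row.items() if k not in var_cols}
    return [{**id_cols, var_name: col, value_name: row[col]} for col in var_cols]


row = {"document": "Teh cat sat."}
row = duplicate_columns(row, {"document": "base_document"})
row["spellcheck"] = "The cat sat."  # stand-in for the LLMBlock output
rows = flatten_columns(
    row,
    var_cols=["spellcheck", "base_document"],
    value_name="corrected_document",
    var_name="dataset_type",
)

for r in rows:
    print(r["dataset_type"], "->", r["corrected_document"])
```

Under these assumptions, the untouched original only appears as a `base_document` row because the user duplicated the column and listed it in `var_cols` by hand.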

Is there a way to simplify this for authors of pipeline config, so that we handle the base_document dataset automatically and the user never needs to reference that column in their config? That specific dataset_type string has a special meaning in the code, but how would a user know to include it without reading the code?
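
One possible direction, sketched only to make the question concrete: expand the user's block list at load time so the `base_document` wiring is added for them. Everything here is an assumption. The helper `add_implicit_base_document`, its `source_col` parameter, and the idea of rewriting the parsed config as plain dicts are hypothetical and do not exist in the repo.

```python
import copy

# The dataset_type string that, per this issue, has special meaning in the code.
BASE_COL = "base_document"


def add_implicit_base_document(blocks, source_col="document"):
    """Hypothetical helper: return a copy of `blocks` with base_document wired in."""
    blocks = copy.deepcopy(blocks)  # never mutate the caller's config
    flattens = [b for b in blocks if b["type"] == "FlattenColumnsBlock"]
    if not flattens:
        return blocks  # no auxiliary flattening, nothing to add

    # Make sure every flatten step also emits the original document.
    for block in flattens:
        var_cols = block["config"].setdefault("var_cols", [])
        if BASE_COL not in var_cols:
            var_cols.append(BASE_COL)

    # Add the duplicate step only if the user has not already written one.
    already_duplicated = any(
        b["type"] == "DuplicateColumnsBlock"
        and BASE_COL in b["config"].get("columns_map", {}).values()
        for b in blocks
    )
    if not already_duplicated:
        blocks.insert(0, {
            "name": "duplicate_document_col",
            "type": "DuplicateColumnsBlock",
            "config": {"columns_map": {source_col: BASE_COL}},
        })
    return blocks


# What a config author would write under this sketch: no base_document anywhere.
user_blocks = [
    {"name": "gen_spellcheck", "type": "LLMBlock",
     "config": {"output_cols": ["spellcheck"]}},
    {"name": "flatten_auxiliary_columns", "type": "FlattenColumnsBlock",
     "config": {"var_cols": ["spellcheck"],
                "value_name": "corrected_document",
                "var_name": "dataset_type"}},
]
expanded = add_implicit_base_document(user_blocks)
print([b["name"] for b in expanded])
```

The helper is idempotent, so a config that already spells out the `base_document` steps would pass through unchanged. Whether this belongs in config loading or inside the flatten block itself is an open design question.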

This issue is created to track a comment in another PR at #204 (comment) so we don't lose sight of improving this.


github-actions · 2024-11-19T02:02:50Z

This issue has been automatically marked as stale because it has not had activity within 90 days. It will be automatically closed if no further activity occurs within 30 days.

bbrowning · 2024-11-20T13:27:06Z

Still relevant, and pipeline config is changing as we work to merge some external prototype improvements to SDG into this repo.

bbrowning mentioned this issue Jul 29, 2024

Add support for auxiliary dataset generation #204

Merged

nathan-weinberg added the enhancement New feature or request label Aug 20, 2024

github-actions bot added the stale label Nov 19, 2024

github-actions bot removed the stale label Dec 11, 2024
