Fix dataset formats to match each pipeline expectations #55

Closed
russellb opened this issue Jun 30, 2024 · 0 comments · Fixed by #57

generate_data.py calls leaf_node_to_samples(), which currently assumes that every pipeline type expects the same dataset format. That format matched the simple pipelines and the full knowledge pipeline, but I didn't notice that the full skills pipeline was different. I then saw that #50 renamed the fields expected in the template, so the default pipelines now expect 3 different dataset formats:

simple pipelines expect question_N, answer_N for N in (1, 2, 3)
full knowledge pipeline now expects icl_query_N / icl_response_N for N in (1, 2, 3)
full skills pipelines expect seed_question and seed_response (just 1 instead of 3)

Short term fix:

  1. Document the dataset format expected by each pipeline.
  2. Fix the code to generate the dataset in the expected format for each pipeline.
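As a rough illustration of item 2, here is a minimal sketch of emitting a different row shape per pipeline type. The function name `format_seed_samples`, the pipeline type strings, and the `(question, answer)` tuple input are all hypothetical and not taken from the repository; only the field names come from the list above (the knowledge field is spelled `icl_response_N` here, so adjust it to whatever the template actually uses).

```python
from typing import Dict, List, Tuple

QAPair = Tuple[str, str]  # hypothetical input shape: (question, answer)


def _numbered_row(pairs: List[QAPair], q_key: str, a_key: str) -> Dict[str, str]:
    """Pack exactly three Q/A pairs into one row with fields numbered 1..3."""
    if len(pairs) != 3:
        raise ValueError(f"expected 3 question/answer pairs, got {len(pairs)}")
    row: Dict[str, str] = {}
    for n, (question, answer) in enumerate(pairs, start=1):
        row[f"{q_key}_{n}"] = question
        row[f"{a_key}_{n}"] = answer
    return row


def format_seed_samples(pipeline_type: str, pairs: List[QAPair]) -> List[Dict[str, str]]:
    """Return rows in the format the given (hypothetically named) pipeline type expects."""
    if pipeline_type == "simple":
        return [_numbered_row(pairs, "question", "answer")]
    if pipeline_type == "full_knowledge":
        return [_numbered_row(pairs, "icl_query", "icl_response")]
    if pipeline_type == "full_skills":
        # One seed question/response per row instead of three per row.
        return [{"seed_question": q, "seed_response": a} for q, a in pairs]
    raise ValueError(f"unknown pipeline type: {pipeline_type!r}")


pairs = [("q1", "a1"), ("q2", "a2"), ("q3", "a3")]
simple_rows = format_seed_samples("simple", pairs)
knowledge_rows = format_seed_samples("full_knowledge", pairs)
skills_rows = format_seed_samples("full_skills", pairs)
```

With this shape, the simple and knowledge cases each produce a single row of six fields, while the skills case produces three rows of two fields.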

Longer term consideration:

None of this matches the source format for the dataset (taxonomy). Allowing people to specify a custom pipeline implies specifying their expected sample dataset format somehow.

Another idea instead ...

  1. Always assume a consistent dataset format.
  2. Add a new pipeline capability for dataset transformation -- rename fields if you want, squash the rows into groups of 3 seed questions/answers per row (for the knowledge case)

I think something like this is going to be necessary to allow more configurable custom pipelines, as we'll need a way for a custom pipeline to declare the dataset format it is expecting from a known starting point.
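To make the transformation idea concrete, here is a small sketch of the two operations mentioned above: renaming fields and squashing rows into groups of 3. It assumes a consistent starting format of one `question` / `answer` pair per row, which is an assumption for illustration only; `rename_fields` and `squash_rows` are hypothetical helpers, not existing pipeline blocks.

```python
from typing import Dict, List

Row = Dict[str, str]


def rename_fields(rows: List[Row], mapping: Dict[str, str]) -> List[Row]:
    """Rename keys according to mapping; keys not in mapping are kept unchanged."""
    return [{mapping.get(key, key): value for key, value in row.items()} for row in rows]


def squash_rows(rows: List[Row], group_size: int = 3) -> List[Row]:
    """Merge each consecutive run of group_size rows into one row.

    Field names get a 1-based suffix (question -> question_1, question_2, ...).
    A trailing partial group is dropped in this sketch.
    """
    squashed: List[Row] = []
    for start in range(0, len(rows) - group_size + 1, group_size):
        merged: Row = {}
        for n, row in enumerate(rows[start:start + group_size], start=1):
            for key, value in row.items():
                merged[f"{key}_{n}"] = value
        squashed.append(merged)
    return squashed


# Assumed consistent starting format: one question/answer pair per row.
rows = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(1, 8)]

# Knowledge case: rename, then group three pairs per row.
renamed = rename_fields(rows, {"question": "icl_query", "answer": "icl_response"})
squashed = squash_rows(renamed)
```

A custom pipeline could then declare a short list of such steps to get from the known starting format to the format its templates expect.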

russellb added a commit to russellb/sdg that referenced this issue Jun 30, 2024
PR instructlab#50 changed the format used in the full knowledge pipeline. Change
the simple pipelines to match.

Part of issue instructlab#55.

Signed-off-by: Russell Bryant <[email protected]>
jwm4 pushed a commit to jwm4/sdg that referenced this issue Dec 13, 2024
…-past-wikipedia

Knowledge past wikipedia