Fix dataset formats to match each pipeline's expectations #55

russellb · 2024-06-30T15:56:22Z

generate_data.py calls leaf_node_to_samples(), which currently assumes that the expected dataset format is the same for all pipeline types. That assumption held for the simple pipelines and the full knowledge pipeline, but I didn't notice that the full skills pipeline was different. Then #50 changed the names of the fields expected in the template, so we now have 3 different dataset formats expected by the default pipelines.

simple pipelines expect question_N and answer_N for N in (1, 2, 3)
full knowledge pipeline now expects icl_query_N and icl_response_N for N in (1, 2, 3)
full skills pipeline expects seed_question and seed_response (just 1 instead of 3)
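To make the three layouts concrete, here is a minimal sketch of building each one from the same list of seed question/answer pairs. The helper name `build_samples`, the pipeline keys, and the choice to drop leftover pairs that do not fill a complete group are all assumptions for illustration; this is not the actual leaf_node_to_samples() code. The spelling `icl_response_N` is also assumed.

```python
# Hypothetical helper for illustration only; not the real leaf_node_to_samples().
# Field-name templates for the pipelines that expect numbered fields.
NUMBERED_FIELDS = {
    "simple": ("question_{n}", "answer_{n}"),
    "full_knowledge": ("icl_query_{n}", "icl_response_{n}"),
}


def build_samples(seed_pairs, pipeline, group_size=3):
    """Turn (question, answer) tuples into rows for the given pipeline type."""
    if pipeline == "full_skills":
        # One seed question/response per row.
        return [{"seed_question": q, "seed_response": a} for q, a in seed_pairs]

    if pipeline not in NUMBERED_FIELDS:
        raise ValueError(f"unknown pipeline type: {pipeline!r}")

    q_field, a_field = NUMBERED_FIELDS[pipeline]
    samples = []
    # Assumption: only complete groups are emitted; leftover pairs are dropped.
    for start in range(0, len(seed_pairs) - group_size + 1, group_size):
        row = {}
        for n, (q, a) in enumerate(seed_pairs[start:start + group_size], start=1):
            row[q_field.format(n=n)] = q
            row[a_field.format(n=n)] = a
        samples.append(row)
    return samples


pairs = [("q1", "a1"), ("q2", "a2"), ("q3", "a3")]
simple_rows = build_samples(pairs, "simple")
knowledge_rows = build_samples(pairs, "full_knowledge")
skills_rows = build_samples(pairs, "full_skills")
```

With three seed pairs, the simple and full knowledge cases each produce a single row with six fields, while the full skills case produces three rows with two fields each.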

Short term fix:

Document the dataset format expected by each pipeline.
Fix the code to generate the dataset in the expected format for each pipeline.

Longer term consideration:

None of this matches the source format for the dataset (taxonomy). Allowing people to specify a custom pipeline implies that they also need some way to specify the sample dataset format that pipeline expects.

Another idea instead ...

Always assume a consistent dataset format.
Add a new pipeline capability for dataset transformation -- rename fields if you want, squash the rows into groups of 3 seed questions/answers per row (for the knowledge case)

I think something like this is going to be necessary to allow more configurable custom pipelines, as we'll need a way for a custom pipeline to declare the dataset format it is expecting from a known starting point.
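As a rough sketch of what such a transformation capability could look like, the two steps mentioned above (renaming fields and squashing rows into groups of 3) can be written as small composable functions. Everything here is hypothetical: the function names, the starting field names `question` and `answer`, and the use of plain lists of dicts rather than whatever dataset type the pipelines actually consume.

```python
# Hypothetical transformation steps; names and starting format are assumptions.

def rename_fields(rows, mapping):
    """Return new rows with keys renamed according to mapping."""
    return [{mapping.get(key, key): value for key, value in row.items()}
            for row in rows]


def squash_rows(rows, group_size):
    """Merge each run of group_size rows into one row with _N suffixed keys.

    Assumption: a trailing incomplete group is dropped.
    """
    squashed = []
    for start in range(0, len(rows) - group_size + 1, group_size):
        merged = {}
        for n, row in enumerate(rows[start:start + group_size], start=1):
            for key, value in row.items():
                merged[f"{key}_{n}"] = value
        squashed.append(merged)
    return squashed


# A consistent starting format (assumed): one question/answer per row.
rows = [
    {"question": "q1", "answer": "a1"},
    {"question": "q2", "answer": "a2"},
    {"question": "q3", "answer": "a3"},
]

# Knowledge case: rename, then squash into groups of 3.
knowledge = squash_rows(
    rename_fields(rows, {"question": "icl_query", "answer": "icl_response"}), 3
)

# Skills case: a rename alone is enough.
skills = rename_fields(
    rows, {"question": "seed_question", "answer": "seed_response"}
)
```

A custom pipeline could then declare its expected format as a list of such steps applied to the known starting point, instead of requiring changes to the generation code.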


PR instructlab#50 changed the format used in the full knowledge pipeline. Change the simple pipelines to match. Part of issue instructlab#55. Signed-off-by: Russell Bryant <[email protected]>


russellb mentioned this issue Jun 30, 2024

📚 Adding Knowledge llm blocks #50

Merged

russellb mentioned this issue Jun 30, 2024

Fix dataset formatting for pipeline differences #57

Merged

russellb mentioned this issue Jul 1, 2024

Make template input more consistent with each other and taxonomy naming scheme #59

Closed

russellb closed this as completed in e606811 Jul 1, 2024

russellb closed this as completed in #57 Jul 1, 2024

jwm4 pushed a commit to jwm4/sdg that referenced this issue Dec 13, 2024

Merge pull request instructlab#55 from instructlab/jjasghar/knowledge…

471d9e5

…-past-wikipedia Knowledge past wikipedia
