Create a sample per seed example for skills
The full skills pipelines expect a single seed question and response
in each sample in the dataset. Change the simple skills pipelines to
match and update the code to generate the samples in the expected
format.

Closes instructlab#55 (the short term needs at least)

Signed-off-by: Russell Bryant <[email protected]>
russellb committed Jun 30, 2024
1 parent 15ae2b9 commit e606811
Showing 3 changed files with 16 additions and 47 deletions.
@@ -15,16 +15,10 @@ Here are the requirements:
 examples: |
   The task is {task_description}.
-  Here are some examples to help you understand the type of questions that are asked for:
+  Here is an example to help you understand the type of questions that are asked for:
-  {icl_query_1}
-  {icl_response_1}
-  {icl_query_2}
-  {icl_response_2}
-  {icl_query_3}
-  {icl_response_3}
+  {seed_question}
+  {seed_response}
 generation: |
   Provide a single question and answer pair based on the examples.
@@ -15,23 +15,17 @@ Here are the requirements:
 examples: |
   The task is {task_description}.
-  Here is some context for the example questions:
+  Here is some context for the example question:
-  {context}
+  {seed_context}
-  Here are some examples to help you understand the type of questions that are asked for:
+  Here is an example to help you understand the type of questions that are asked for:
-  {icl_query_1}
-  {icl_response_1}
-  {icl_query_2}
-  {icl_response_2}
-  {icl_query_3}
-  {icl_response_3}
+  {seed_question}
+  {seed_response}
 generation: |
-  Provide a single question and answer pair based on the examples.
+  Provide a single question and answer pair based on the example.
 start_tags: [""]
 end_tags: [""]
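As a rough illustration of what the new grounded-skill prompt expects, the sketch below fills the updated `examples` text with a single sample. It assumes plain `str.format` substitution, which may not be how the pipeline actually renders its templates; the sample values are invented, and the line breaks inside the template are a guess.

```python
# Template text copied from the new side of the grounded-skill diff above.
# Exact blank lines in the real config file are unknown; this is an approximation.
GROUNDED_EXAMPLES_TEMPLATE = (
    "The task is {task_description}.\n"
    "Here is some context for the example question:\n"
    "{seed_context}\n"
    "Here is an example to help you understand the type of questions "
    "that are asked for:\n"
    "{seed_question}\n"
    "{seed_response}\n"
)

# Hypothetical sample: one seed example per sample, as the commit describes.
sample = {
    "task_description": "Extract the date mentioned in a sentence",
    "seed_context": "The meeting was moved to 4 July 2024.",
    "seed_question": "What date is mentioned?",
    "seed_response": "4 July 2024",
}

# str.format is an assumption standing in for the pipeline's own rendering step.
rendered = GROUNDED_EXAMPLES_TEMPLATE.format(**sample)
print(rendered)
```

Because every placeholder in the template now maps to exactly one key in the sample, a sample that lacks `seed_context` would raise `KeyError` under this rendering approach.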
33 changes: 7 additions & 26 deletions src/instructlab/sdg/utils/taxonomy.py
@@ -472,43 +472,24 @@ def _knowledge_leaf_node_to_samples(leaf_node, server_ctx_size, chunk_word_count


 def _skill_leaf_node_to_samples(leaf_node):
-    samples = [{}]
+    samples = []

     # pylint: disable=consider-using-enumerate
     for i in range(len(leaf_node)):
-        samples[-1].setdefault("task_description", leaf_node[i]["task_description"])
+        samples.append({})
+        samples[-1]["task_description"] = leaf_node[i]["task_description"]
         if leaf_node[i].get("input"):
-            samples[-1].setdefault("context", leaf_node[i]["input"])
-        if "icl_query_3" in samples[-1]:
-            samples.append({})
-        if "icl_query_1" not in samples[-1]:
-            samples[-1]["icl_query_1"] = leaf_node[i]["instruction"]
-            samples[-1]["icl_response_1"] = leaf_node[i]["output"]
-        elif "icl_query_2" not in samples[-1]:
-            samples[-1]["icl_query_2"] = leaf_node[i]["instruction"]
-            samples[-1]["icl_response_2"] = leaf_node[i]["output"]
-        else:
-            samples[-1]["icl_query_3"] = leaf_node[i]["instruction"]
-            samples[-1]["icl_response_3"] = leaf_node[i]["output"]
-
-    # wrap back around to the beginning if the number of examples was not
-    # evenly divisble by 3
-    if "icl_query_2" not in samples[-1]:
-        samples[-1]["icl_query_2"] = leaf_node[0]["instruction"]
-        samples[-1]["icl_response_2"] = leaf_node[0]["output"]
-    if "icl_query_3" not in samples[-1]:
-        samples[-1]["icl_query_3"] = leaf_node[1 if len(leaf_node) > 1 else 0][
-            "instruction"
-        ]
-        samples[-1]["icl_response_3"] = leaf_node[1 if len(leaf_node) > 1 else 0]["output"]
+            samples[-1]["seed_context"] = leaf_node[i]["input"]
+        samples[-1]["seed_question"] = leaf_node[i]["instruction"]
+        samples[-1]["seed_response"] = leaf_node[i]["output"]

     return samples


 def leaf_node_to_samples(leaf_node, server_ctx_size, chunk_word_count):
     if not leaf_node:
         return []
-    if "document" in leaf_node[0]:
+    if leaf_node[0].get("document"):
         return _knowledge_leaf_node_to_samples(
             leaf_node, server_ctx_size, chunk_word_count
         )
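
For readers who want to see the effect of the taxonomy.py change in isolation, here is a minimal standalone sketch of the post-commit behaviour: each seed example in a skill leaf node becomes its own sample. The function is a simplified restatement of the new side of the diff (iterating directly instead of by index), and the `leaf_node` data is hypothetical.

```python
def skill_leaf_node_to_samples(leaf_node):
    """Return one sample dict per seed example, mirroring the new diff side."""
    samples = []
    for entry in leaf_node:
        sample = {"task_description": entry["task_description"]}
        # seed_context is only set when the seed example carries a non-empty input.
        if entry.get("input"):
            sample["seed_context"] = entry["input"]
        sample["seed_question"] = entry["instruction"]
        sample["seed_response"] = entry["output"]
        samples.append(sample)
    return samples


# Hypothetical leaf node with two seed examples; only the second has context.
leaf_node = [
    {
        "task_description": "Answer questions about dates",
        "instruction": "How many days are in a leap year?",
        "output": "366",
        "input": "",
    },
    {
        "task_description": "Answer questions about dates",
        "instruction": "What date is mentioned?",
        "output": "4 July 2024",
        "input": "The meeting was moved to 4 July 2024.",
    },
]

samples = skill_leaf_node_to_samples(leaf_node)
for sample in samples:
    print(sample)
```

Under the removed code, the same two seed examples would have been packed into a single sample as `icl_query_1` and `icl_query_2`, with `icl_query_3` filled by wrapping back to an earlier example; here they yield two independent samples.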
