
Initial CLI integration with new SDG interfaces #46

Merged
merged 10 commits into instructlab:main from new-cli-integration
Jun 28, 2024

Conversation

russellb
Member

@russellb russellb commented Jun 26, 2024

General Approach

The approach taken here is to try to make the new SDG API the only interface
used as opposed to keeping the old implementation off to the side. Reasons for
this:

  • less to maintain
  • we'll be able to clean up CLI options to only include ones relevant to the new library

To prepare for this, we pulled all SDG code out of the CLI to start this
library. The interface point between the CLI and instructlab.sdg is the
generate_data() method in instructlab.sdg.generate_data. That is where
you'll see most of the changes centered in this PR.

The PR changes the implementation underneath the existing
generate_data() so we can make things work without further CLI changes.
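The shape of that boundary can be sketched with stand-ins (illustrative only -- the real generate_data() in instructlab.sdg.generate_data takes many more parameters, and _run_pipeline() is a hypothetical helper, not part of the library):

```python
def _run_pipeline(taxonomy_path):
    # Hypothetical stand-in for the new SDG machinery
    # (Blocks, Flows, Pipelines).
    return [f"sample derived from {taxonomy_path}"]

def generate_data(taxonomy_path, output_dir):
    # The CLI only calls this function; the pipeline machinery
    # behind it can be swapped out without any further CLI changes.
    samples = _run_pipeline(taxonomy_path)
    return {"output_dir": output_dir, "num_samples": len(samples)}
```

The point is that the CLI stays untouched while everything behind this one call is replaced.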

Status Summary

This has been tested with Merlinite both locally and via the e2e CI job. The
intent here is to replace the existing functionality. Future changes will work
on adding more extensive pipeline support for larger systems.

Changes

1e928ec Split up the utils module
39c943f Import utils.read_taxonomy() from instructlab
d886b12 Add get_model_family() from instructlab.utils
67dd057 Add model_prompt for merlinite/granite
d53a57a Add API to enable disabling batching
14d646f llmblock: fix batched=False
0207156 Add simple knowledge pipeline for use with default merlinite
9ee7f70 Use new API in generate_data
54b065a added number of iterations to generate
31ecfda Add skills variants for the full pipeline

commit 1e928ec
Author: Mark McLoughlin [email protected]
Date: Thu Jun 27 07:27:11 2024 -0400

Split up the utils module

To stop utils.py becoming a dumping ground, split the current code
into json, chunking, and openai sub-modules.

Signed-off-by: Mark McLoughlin <[email protected]>

commit 39c943f
Author: Mark McLoughlin [email protected]
Date: Thu Jun 27 06:27:21 2024 -0400

Import utils.read_taxonomy() from instructlab

Part of #11

sdg appears to be the main user of this, along with `ilab taxonomy diff`.

We want to adapt the output of read_taxonomy() to be better suited to
what sdg needs.

This is the majority of src/instructlab/utils.py from commit 4737feb
with read_taxonomy() and TaxonomyReadingException as the public API.

Temporarily disable logging-fstring-interpolation to get lint passing.

Signed-off-by: Mark McLoughlin <[email protected]>

commit d886b12
Author: Russell Bryant [email protected]
Date: Tue Jun 25 17:57:24 2024 -0400

Add get_model_family() from instructlab.utils

Signed-off-by: Russell Bryant <[email protected]>

commit 67dd057
Author: Russell Bryant [email protected]
Date: Tue Jun 25 20:58:29 2024 -0400

Add model_prompt for merlinite/granite

The model_family param is used to "force" a particular family,
overriding what might be guessed from the model filename.

Since utils.models is copied and pasted from instructlab,
let's isolate the use of utils.models to the generate_data()
function so if we move the generate_data() code to instructlab
we can get rid of the copy here.

In its place add MODEL_FAMILY_MIXTRAL/MERLINITE constants to
the API.

Signed-off-by: Russell Bryant <[email protected]>
Signed-off-by: Mark McLoughlin <[email protected]>

commit d53a57a
Author: Mark McLoughlin [email protected]
Date: Thu Jun 27 05:55:32 2024 -0400

Add API to enable disabling batching

Add a `batched` constructor param.

llama-cpp is still the CLI default and doesn't support batching. We
don't know in this code what backend is being used, so for now just
turn off batching. We need to come back around to disabling it only
when we know the default llama-cpp backend is in use.

Signed-off-by: Russell Bryant <[email protected]>
Signed-off-by: Mark McLoughlin <[email protected]>

commit 14d646f
Author: Russell Bryant [email protected]
Date: Wed Jun 26 09:23:40 2024 -0400

llmblock: fix batched=False

_generate() returns a list of samples. The previous code would create
a list of lists when batching was turned off. This restores that case
back to a flat list of samples.

Signed-off-by: Russell Bryant <[email protected]>
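The shape bug that commit fixes can be shown with a minimal stand-alone sketch (the helper names here are illustrative, not the actual llmblock internals):

```python
def _generate(batch):
    # Stand-in for the real _generate(): returns a list of samples.
    return [f"sample:{prompt}" for prompt in batch]

def generate(prompts, batched):
    if batched:
        # One call over the whole batch already yields a flat list.
        return _generate(prompts)
    # With batching off, calling _generate() per prompt produces a
    # list of lists; flatten it back to a flat list of samples.
    nested = [_generate([prompt]) for prompt in prompts]
    return [sample for batch in nested for sample in batch]
```

Without the flattening step, downstream code sees one-element lists instead of samples whenever batched=False.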

commit 0207156
Author: Russell Bryant [email protected]
Date: Wed Jun 26 11:17:09 2024 -0400

Add simple knowledge pipeline for use with default merlinite

The CLI's default model is quantized merlinite, and it does not seem
good enough to follow the instructions in the full pipeline included
in the new library.

It's not doing any validation on the output, so the output is not
going to be great.  Then again, the output has never been great doing
SDG with merlinite and the old sdg implementation. This at least keeps
the ability to run a basic workflow test and demo on a smaller system.

Signed-off-by: Russell Bryant <[email protected]>

commit 9ee7f70
Author: Russell Bryant [email protected]
Date: Tue Jun 25 16:31:19 2024 -0400

Use new API in generate_data

This makes use of the new SDG API under the generate_data()
method used by the CLI. It uses new simple workflows for knowledge and
skills that are intended for basic use with a small model for testing
and demo purposes. The full pipelines provided in the library will
only work in larger environments capable of running Mixtral-8x7b.

There are still various TODOs in the code, but this is enough to
start with. I'm sure we will make enhancements to these basic
workflows that still work for the small environments.

Signed-off-by: Russell Bryant <[email protected]>

commit 54b065a
Author: Oindrilla Chatterjee [email protected]
Date: Fri Jun 28 11:52:12 2024 -0400

added number of iterations to generate

Signed-off-by: Oindrilla Chatterjee <[email protected]>

commit 31ecfda
Author: Russell Bryant [email protected]
Date: Fri Jun 28 16:30:24 2024 -0400

Add skills variants for the full pipeline

Add the last little bit needed to choose the right "full" pipeline for
skills.

I also renamed "profile" to "pipeline" to better reflect what is being
selected here. The term "profile" is a bit overloaded from lots of
past CLI UX discussion, so it's better not to use that here.

Signed-off-by: Russell Bryant <[email protected]>

@mergify mergify bot added the ci-failure label Jun 26, 2024
@markmc
Contributor

markmc commented Jun 27, 2024

I'm going to try to help this along

Goal: to have ilab data generate use the new SDG API (Flows, Pipelines, Blocks, etc.) and support "full SDG"

(Re - "full SDG" - I'm a bit unclear how we define that precisely)

Constraint: avoid causing any regression in "simple SDG" (aka "small model support")

Step #1 - re-implement instructlab.sdg.generate_data.generate_data() using the new API - i.e. model the existing "simple SDG" using the new API, and then re-write generate_data() to use it

Step #2 - replace the call to instructlab.sdg.generate_data.generate_data() in instructlab/data/generate.py with calls to the new API - i.e. move the relatively simple code in generate_data() back into instructlab

Step #3 - add "full SDG" support in ilab data generate - e.g. by using the more advanced flows available via the new API


This PR is step #1

We can merge this PR once we're confident it is in good enough shape that it doesn't regress the existing ilab data generate (unless there's a particular regression we decide we're ok with)


I've been doing some hacking to get up to speed: https://github.com/russellb/sdg/compare/new-cli-integration...markmc:sdg:new-cli-integration?expand=1

Not being deeply familiar with the existing "simple SDG" flow though, I'm going to be slow in getting the new implementation up to parity. Happy to have help!

@markmc
Contributor

markmc commented Jun 27, 2024

Next steps, as I understand it:

  1. verify generate_data() is producing output in the same format as before
  2. create and incorporate workflows for freeform and grounded skills
  3. audit the CLI options to help identify any functional regressions
  4. get e2e passing

@markmc
Contributor

markmc commented Jun 27, 2024

e2e test failure is:

Generate synthetic data - Wed Jun 26 18:32:00 UTC 2024
...
Generating synthetic data using 'models/merlinite-7b-lab-Q4_K_M.gguf' model, taxonomy:'taxonomy' against http://127.0.0.1:8000/v1 server
INFO 2024-06-26 18:32:13,652 generate_data.py:222: generate_data Synthesizing new instructions. If you aren't satisfied with the generated instructions, interrupt training (Ctrl-C) and try adjusting your YAML files. Adding more examples may help.
INFO 2024-06-26 18:32:13,670 pipeline.py:48: generate Running block: gen_knowledge
INFO 2024-06-26 18:32:13,670 pipeline.py:49: generate Dataset({
    features: ['task_description', 'document', 'domain', 'question_1', 'response_1', 'question_2', 'response_2', 'question_3', 'response_3'],
    num_rows: 6
})
INFO 2024-06-26 18:32:56,378 generate_data.py:276: generate_data Generated 5 samples
INFO 2024-06-26 18:32:56,379 generate_data.py:282: generate_data Generation took 49.68s
...
Train the model - Wed Jun 26 18:32:57 UTC 2024
...
generated does not contain training or test files, did you run `ilab generate`?
Error: Process completed with exit code 1.

@mergify mergify bot added ci-failure CI/CD Affects CI/CD configuration testing Relates to testing and removed ci-failure labels Jun 27, 2024
Contributor

mergify bot commented Jun 27, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @russellb please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@russellb
Member Author

Status notes before I log off for tonight.

The Good News

  • e2e is green!

Status

The e2e test job runs 3 rounds of data generation: grounded skill (a skill with context), freeform skill (a skill with no context), and knowledge. The two skills steps exit cleanly but generate zero data. The training step that follows consumes the data we've generated for the knowledge addition and passes.

This is a great milestone, but we have more to do.

Next Top Priority

  • top priority: this PR includes a "simple" knowledge pipeline, but we need to add similar ones that handle the two types of skills. Now that the knowledge one works, I'm hoping the other two come together a lot quicker, as they will follow a very similar pattern.
    • We need a file similar to src/instructlab/sdg/configs/knowledge/simple_generate_qa.yaml but tweaked a bit: there's no document. The grounded skill one will be similar in that we'll be giving a context blob, much like the document given for knowledge. The freeform skill one will have neither context nor document.
    • In default_flows.py, in addition to the existing SimpleKnowledgeFlow, we need one or two simple one-stage skills flows.
    • generate_data.py needs updates to make use of these new flows
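The differences between the three simple flows can be summarized in a sketch (the class name, fields, and config paths below are hypothetical, not the library's actual API):

```python
from dataclasses import dataclass

@dataclass
class SimpleFlowSpec:
    # Hypothetical description of a one-stage flow.
    config_yaml: str
    needs_document: bool = False  # knowledge: a source document
    needs_context: bool = False   # grounded skill: a context blob

# Knowledge reads a source document; grounded skill takes a context
# blob instead; freeform skill takes neither.
KNOWLEDGE = SimpleFlowSpec(
    "configs/knowledge/simple_generate_qa.yaml", needs_document=True)
GROUNDED_SKILL = SimpleFlowSpec(
    "configs/skills/simple_grounded.yaml", needs_context=True)
FREEFORM_SKILL = SimpleFlowSpec(
    "configs/skills/simple_freeform.yaml")
```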

Other TODO items

There are still other todo items scattered around generate_data.py

  • we still don't handle the num-instructions parameter. @oindrillac was looking at that today.
  • there are other parameters still being ignored. I tried to notate what I thought should happen to them, but I did it quickly. There may be other parameters we need to be acting on.

Post-merge TODOs for the simple workflow

These are things I think can come after we merge this PR (IMO, open for discussion)

  • chunking is overly simplified right now: it just truncates the document rather than iterating over multiple chunks. We also have an open PR working on a smarter chunking mechanism (Updated Chunking Methods #45) -- doing something to process more chunks could possibly be a follow-up issue instead of blocking this PR.
  • There may be some CLI options that can be removed - TBD
  • we automatically turn batching off. we should turn it on when vllm is used, or we're talking to an arbitrary OpenAI compatible API endpoint. In other words, we should default to on unless the CLI tells us to turn it off because llama-cpp is in use. This is an optimization and not a functional blocker so I think a follow-up issue is OK.
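The intended default described in the last bullet could be expressed as a small helper (hypothetical; the actual wiring would live in the CLI, which knows which backend it started):

```python
def batching_enabled(backend: str, force_off: bool = False) -> bool:
    """Default batching to on; turn it off only when the CLI tells us
    the llama-cpp backend (which doesn't support batching) is in use."""
    if force_off:
        return False
    # vllm and arbitrary OpenAI-compatible endpoints handle batching.
    return backend != "llama-cpp"
```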

Full workflow via CLI

Most of the code we've done here applies to the full workflow; we just need to instantiate a different pipeline.

  • we need a CLI option for selecting the "full" pipeline vs the simple one. This may justify a quick design doc in instructlab/dev-docs just to clarify the UX. --pipeline simple and --pipeline full might be enough, but there was a LOT of debate over a very similar set of options proposed for training that never got full support. That may be different though because training has so many variables.
  • we are going to want a new CLI option (or options) for adding custom pipeline configuration that can be tagged on to the end of the chosen pipeline. This is the most significant area of UX impact to very quickly sort out.
  • we don't have any CI environment capable of testing this. It's probably about time we figure that out.
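A minimal sketch of what the pipeline-selection option could look like, using stdlib argparse for illustration (the real ilab CLI is built on click, and the option name and choices are still up for debate):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="ilab data generate")
    # Hypothetical option sketch for choosing the SDG pipeline.
    parser.add_argument(
        "--pipeline",
        choices=["simple", "full"],
        default="simple",
        help="SDG pipeline to run; 'full' needs a large model such as "
             "Mixtral-8x7b (default: %(default)s).",
    )
    return parser
```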

markmc and others added 6 commits June 28, 2024 05:16
@russellb russellb force-pushed the new-cli-integration branch from e931ed8 to 7e932d0 Compare June 28, 2024 19:14
@russellb
Member Author

saving off the "how to test" instructions here before I redo the PR description now that it's almost ready for merge

How to Test

(assumes you have the gh CLI installed and configured) -- https://cli.github.com/

# Set up a venv to work in
mkdir workspace
cd workspace
python -m venv venv
. venv/bin/activate

# Install the `ilab` CLI from main
git clone https://github.com/instructlab/instructlab
cd instructlab
pip install -e .
cd ..

# Install this PR version of instructlab-sdg
git clone https://github.com/instructlab/sdg
cd sdg
gh pr checkout 46
pip install -e .
cd ..

# Configure ilab and put a test taxonomy addition in place
# allow it to check out taxonomy for you
ilab config init
cd taxonomy
git remote add russellb https://github.com/russellb/taxonomy.git
git fetch russellb
git checkout russellb/softball
cd ..

# Download merlinite
ilab download

# Optional - serve the model in another terminal; helpful if you get server-side errors and want all the logs
cd workspace
. venv/bin/activate
ilab model serve

# and now you can test sdg
ilab data generate

# Find the results in the generated/ directory
ls generated/
cat generated/generated_*

@russellb russellb changed the title CLI integration with new SDG interfaces Initial CLI integration with new SDG interfaces Jun 28, 2024
@russellb russellb marked this pull request as ready for review June 28, 2024 19:24
russellb and others added 2 commits June 28, 2024 15:29
@russellb russellb force-pushed the new-cli-integration branch from 7e932d0 to 54b065a Compare June 28, 2024 19:29
@russellb
Member Author

Most of the CI jobs aren't running right now. GitHub is having problems: https://www.githubstatus.com/

@markmc
Contributor

markmc commented Jun 28, 2024

FWIW, +1 from me if we're not aware of it causing significant regressions

@russellb
Member Author

FWIW, +1 from me if we're not aware of it causing significant regressions

Thanks! As long as e2e still passes, I'm good with this going in and then we continue to iterate from here. It won't actually get picked up by anyone else until we tag a library release, anyway.

Contributor

@oindrillac oindrillac left a comment


Tested, +1 to get this merged and do follow-ups on this.

@russellb
Member Author

I'm going to merge this, but won't cut a release just yet. I want to look at knowledge document chunking in particular. I don't think I fully restored the previous behavior on that.

@russellb russellb merged commit 1f71fb6 into instructlab:main Jun 28, 2024
11 checks passed
@russellb russellb added this to the 0.1.0 milestone Jul 8, 2024
jwm4 pushed a commit to jwm4/sdg that referenced this pull request Dec 13, 2024