Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update README.md with newer content from research team #444

Merged
merged 1 commit into from
Dec 11, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 7 additions & 0 deletions .spellcheck-en-custom.txt
Original file line number Diff line number Diff line change
Expand Up @@ -5,8 +5,11 @@ Backport
backported
CLI
codebase
composable
Composable
config
configs
customizable
Dataset
dataset
datasets
Expand All @@ -20,6 +23,7 @@ freeform
ICL
icl
ie
instructlab
Jinja
JSON
Langchain's
Expand All @@ -38,8 +42,11 @@ Pregenerated
qna
quantized
repo
Reusability
scalable
sdg
Splitter
src
subdirectory
subfolder
Tatsu
Expand Down
147 changes: 120 additions & 27 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,59 +9,152 @@
![`e2e-nvidia-l4-x1.yaml` on `main`](https://github.com/instructlab/sdg/actions/workflows/e2e-nvidia-l4-x1.yml/badge.svg?branch=main)
![`e2e-nvidia-l40s-x4.yml` on `main`](https://github.com/instructlab/sdg/actions/workflows/e2e-nvidia-l40s-x4.yml/badge.svg?branch=main)

Python library for Synthetic Data Generation
The SDG Framework is a modular, scalable, and efficient solution for creating synthetic data generation workflows in a “no-code” manner. At its core, this framework is designed to simplify data creation for LLMs, allowing users to chain computational units and build powerful pipelines for generating data and processing tasks.

## Introduction
## Core Design Principles

Synthetic Data Generation (SDG) is a process that creates an artificially generated dataset that mimics real data based on provided examples. SDG uses a YAML file containing question-and-answer pairs as input data.
The framework is built around the following principles:

## Installing the SDG library
1. **Modular Design**: Highly composable blocks form the building units of the framework, allowing users to build workflows effortlessly.
2. **No-Code Workflow Creation**: Specify workflows using simple YAML configuration files.
3. **Scalability and Performance**: Optimized for handling large-scale workflows with millions of records.

Clone the library and navigate to the repo:
---

```bash
git clone https://github.com/instructlab/sdg
cd sdg
```
## Framework Architecture

Install the library:
![overview](assets/imgs/overview.png)

```bash
pip install .
```
### Blocks: The Fundamental Unit

### Using the library
At the heart of the framework is the **Block**. Each block is a self-contained computational unit that performs specific tasks, such as:

You can import SDG into your Python files with the following items:
- Making LLM calls
- Performing data transformations
- Applying filters

```python
from instructlab.sdg.generate_data import generate_data
from instructlab.sdg.utils import GenerateException
```
Blocks are designed to be:

- **Modular**: Reusable across multiple pipelines.
- **Composable**: Easily chained together to create workflows.

These blocks are implemented in the [src/instructlab/sdg/blocks](src/instructlab/sdg/blocks) directory.

## Pipelines
### Pipelines: Higher-Level Abstraction

A pipeline is a series of steps to execute in order to generate data.
Blocks can be chained together to form a **Pipeline**. Pipelines enable:

- Linear or recursive chaining of blocks.
- Execution of complex workflows by chaining multiple pipelines together.

There are three default pipelines shipped in SDG: `simple`, `full`, and `eval`. Each pipeline requires specific hardware specifications

### Simple Pipeline
#### Simple Pipeline

The [simple pipeline](src/instructlab/sdg/pipelines/simple) is designed to be used with [quantized Merlinite](https://huggingface.co/instructlab/merlinite-7b-lab-GGUF) as the teacher model. It enables basic data generation results on low-end consumer grade hardware, such as laptops and desktops with small or no discrete GPUs.

### Full Pipeline
#### Full Pipeline

The [full pipeline](src/instructlab/sdg/pipelines/full) is designed to be used with [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) as the the teacher model, but has also been successfully tested with smaller models such as [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) and even some quantized versions of the two teacher models. This is the preferred data generation pipeline on higher end consumer grade hardware and all enterprise hardware.

### Eval Pipeline
#### Eval Pipeline

The [eval pipeline](src/instructlab/sdg/pipelines/eval) is used to generate [MMLU](https://en.wikipedia.org/wiki/MMLU) benchmark data that can be used to later evaluate a trained model on your knowledge dataset. It does not generate data for use during model training.

### Pipeline architecture
---

### YAML-Based Workflow: The Pipeline Configuration

The Pipeline YAML configuration file is central to defining data generation workflows in the SDG Framework. This configuration file describes how blocks and pipelines are orchestrated to process and generate data efficiently. By leveraging YAML, users can create highly customizable and modular workflows without writing any code.

Pipeline configuration must adhere to our [JSON schema](src/instructlab/sdg/pipelines/schema/v1.json) to be considered valid.

#### Key Features of Pipeline Configuration

1. **Modular Design**:
- Pipelines are composed of blocks, which can be chained together.
- Each block performs a specific task, such as generating, filtering, or transforming data.

2. **Reusability**:
- Blocks and their configurations can be reused across different workflows.
- YAML makes it easy to tweak or extend workflows without significant changes.

3. **Ease of Configuration**:
- Users can specify block types, configurations, and data processing details in a simple and intuitive manner.

---

### Sample Pipeline Configuration

Here is an example of a Pipeline configuration:

```yaml
version: "1.0"
blocks:
- name: gen_questions
type: LLMBlock
config:
config_path: configs/skills/freeform_questions.yaml
output_cols:
- question
batch_kwargs:
num_samples: 30
drop_duplicates:
- question
- name: filter_questions
type: FilterByValueBlock
config:
filter_column: score
filter_value: 1.0
operation: eq
convert_dtype: float
drop_columns:
- evaluation
- score
- num_samples
- name: gen_responses
type: LLMBlock
block_config:
config_path: configs/skills/freeform_responses.yaml
output_cols:
- response
```

### Data Flow and Storage

- **Data Representation**: Data flow between blocks and pipelines is handled using **Hugging Face Datasets**, which are based on Arrow tables. This provides:
- Native parallelization capabilities (e.g., maps, filters).
- Support for efficient data transformations.

- **Data Checkpoints**: Intermediate caches of generated data. Checkpoints allow users to:
- Resume workflows from the last successful state if interrupted.
- Improve reliability for long-running workflows.

All the pipelines are written in a YAML format and must adhere to a [specific schema](src/instructlab/sdg/pipelines/schema/v1.json).
---

The pipelines that generate data for model training (simple and full pipelines) expect to have three different pipeline configs - one each for knowledge, grounded skills, and freeform skills. They are expected to exist in files called `knowledge.yaml`, `grounded_skills.yaml`, and `freeform_skills.yaml` respectively. For background on the difference in knowledge, grounded skills, and freeform skills, refer to the [InstructLab Taxonomy repository](https://github.com/instructlab/taxonomy).
## Installing the SDG library

Clone the library and navigate to the repo:

```bash
git clone https://github.com/instructlab/sdg
cd sdg
```

Install the library:

```bash
pip install .
```

### Using the library

You can import SDG into your Python files with the following items:

```python
from instructlab.sdg.generate_data import generate_data
from instructlab.sdg.utils import GenerateException
```

## Repository structure

Expand Down
Binary file added assets/imgs/overview.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading