prm800k-denorm

This repository is home to a script for converting OpenAI's PRM800K process supervision dataset to a denormalized format, exported as .parquet files for easier consumption.

Datasets converted with this tool have been published at huggingface.co/sl-alex.

Why would I want process supervision?

See OpenAI's blog post and "Let's Verify Step by Step" paper, but essentially it helps to improve a language model's reasoning and human alignment.

If you only reward a language model for correct answers, you incentivize "reward hacking": reaching correct conclusions via incorrect reasoning. This results in a poor underlying understanding, which fails to generalize, and gives rise to logical mistakes (often called "hallucinations").

So, rather than the usual "outcome supervision", let's try "process supervision": reward the language model for making good progress towards solutions. This improves its reasoning.
We encourage this by teaching it to complete maths problems step-by-step, showing its working.

Our goal with this repository is to give you access to a step-by-step dataset, so that you can train your language model in this way.

The dataset is semantic, so that you can template it into your prompt style however works best for you.

Dataset types

In total, the datasets are: three combined datasets (sl-alex/openai-prm800k-solutions-only, sl-alex/openai-prm800k-stepwise-best and sl-alex/openai-prm800k-stepwise-critic), plus twelve per-phase splits.

You probably just want one of the three combined datasets rather than one of the twelve per-phase splits.

Loading a dataset

You can load a dataset and iterate through it, like so:

from datasets import load_dataset, DatasetDict, Dataset
# dd: DatasetDict = load_dataset("sl-alex/openai-prm800k-solutions-only")
# dd: DatasetDict = load_dataset("sl-alex/openai-prm800k-stepwise-best")
dd: DatasetDict = load_dataset("sl-alex/openai-prm800k-stepwise-critic")

d: Dataset = dd['train']

prompt_template = """Below is an instruction that describes a task. Write a response that appropriately completes the request. Show your reasoning step-by-step, and indicate your final answer under a heading titled Answer.

### Instruction:
{instruction}

### Response:
{response_history}"""

answer_template = """{response}

# Answer:
{answer}"""

for record in zip(
  d['instruction'],
  d['responses'],
  d['next_response'],
  d['answer'],
  d['is_human_response'],
  d['is_solution'],
  d['is_preferred_response'],
  d['rating']
):
  instruction, responses, next_response, answer, is_human_response, is_solution, is_preferred_response, rating = record
  prompt = prompt_template.format(
    instruction=instruction,
    response_history=''.join((f'{response}\n' for response in responses)),
  )
  completion=next_response if answer is None else answer_template.format(response=next_response, answer=answer)
  print(f'Prompt:\n<{prompt}>')
  print(f'Completion:\n<{completion}>')

It should print something like:

Prompt:
<Below is an instruction that describes a task. Write a response that appropriately completes the request. Show your reasoning step-by-step, and indicate your final answer under a heading titled Answer.

### Instruction:
How many seconds are in 7.8 minutes?

### Response:
>
Completion:
<7.8 minutes is the same as 7 minutes and 0.8 minutes.>
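If you want to check which splits and columns a dataset exposes before iterating over it, here is a minimal sketch (not part of the repository; the column names are those documented below):

from datasets import load_dataset, DatasetDict

dd: DatasetDict = load_dataset("sl-alex/openai-prm800k-stepwise-critic")
print(dd)                    # lists the available splits (e.g. 'train')
print(dd['train'].features)  # lists the column names and types described below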

Dataset description

Solutions only

# the problem statement (e.g. "what is x, given 2x = 1?")
instruction: str

# history of responses that the bot emitted before reaching this final step
responses: List[str]

# response emitted for the final step in the conversation.
# it is accompanied by an "answer" subsection, which we capture separately
next_response: str

# answer subsection of the final response
answer: str

Only conversations which achieved "finish_reason": "solution" are retained.

We provide only one record per conversation — the final step.
This record contains a list of all the steps which led to that final step. So you have the whole conversation available.

You could use this to create the following kind of training data:

  • Source: initial instruction
  • Target: all steps of the conversation, including the final step and answer

Usage

Example of how to take this sample and turn it into a source->target training pair:

sample:

instruction: "What is x, given 2x + 1 = 2?"

responses: [
    "Okay, let's first rearrange the equation to isolate the x.",
    "2x = 2 - 1, which simplifes to 2x = 1",
    "Now that the x term is isolated, let's divide both sides to eliminate its coefficient."
]

next_response: "x = 1/2"

answer: "0.5"

Example Alpaca-style source prompt:

Below is an instruction that describes a task. Write responses which progress toward a solution to the request. Indicate your final answer under a heading Final Answer.

### Instruction:
What is x, given 2x + 1 = 2?

### Response:
Okay, let's first rearrange the equation to isolate the x.

### Response:
2x = 2 - 1, which simplifies to 2x = 1

### Response:
Now that the x term is isolated, let's divide both sides to eliminate its coefficient.

### Response:

Example Alpaca-style target prompt:

x = 1/2

# Answer
0.5
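As a sketch (this helper is not part of the repository), the same source/target construction can be done programmatically from a solutions-only record. The template wording below just follows the example above; adapt it to your own prompt style.

from typing import List, Tuple

SOURCE_TEMPLATE = """Below is an instruction that describes a task. Write responses which progress toward a solution to the request. Indicate your final answer under a heading Final Answer.

### Instruction:
{instruction}

### Response:
"""

def to_source_target(instruction: str, responses: List[str], next_response: str, answer: str) -> Tuple[str, str]:
  # source: the instruction, then every prior step under its own "### Response:" heading,
  # ending with an empty "### Response:" for the model to complete
  history = ''.join(f'{step}\n\n### Response:\n' for step in responses)
  source = SOURCE_TEMPLATE.format(instruction=instruction) + history
  # target: the final step followed by its extracted answer
  target = f'{next_response}\n\n# Answer\n{answer}'
  return source, target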

Stepwise best

# the problem statement (e.g. "what is x, given 2x = 1?")
instruction: str

# any responses that were emitted in prior conversation steps
# empty list means it's the first conversation turn
responses: List[str]

# the response emitted for the current step in the conversation
# if this is the final response in the conversation: the response may be accompanied by
# an answer subsection. we separate this out into the `answer` field below.
next_response: str

# usually None, but if this is filled in: the final response of the conversation has occurred,
# and has a subsection declaring its overall answer. this field captures that overall answer.
answer: Optional[str]

# often, there are conversation steps in which the human evaluator had to intervene,
# as no generated response was satisfactory.
is_human_response: bool

We provide a record per productive conversation turn.
Whichever completion was preferred for that conversation turn (i.e. the chosen_completion if a bot response was preferred, or a human_completion otherwise) is the one we use.

You could use this to create the following kind of training data:

  • Source: initial instruction + 0-to-many steps of conversation so far
  • Target: next step of conversation (doesn't necessarily get you to a complete solution)

Usage

Example of how to take this sample and turn it into a source->target training pair:

sample:

instruction: "What is x, given 2x + 1 = 2?"

responses: [
    "Okay, let's first rearrange the equation to isolate the x.",
    "2x = 2 - 1, which simplifes to 2x = 1",
]

next_response: "Now that the x term is isolated, let's divide both sides to eliminate its coefficient."

answer: None

is_human_response: False

Example Alpaca-style source prompt:

Below is an instruction that describes a task. Write responses which progress toward a solution to the request. Indicate your final answer under a heading Final Answer.

### Instruction:
What is x, given 2x + 1 = 2?

### Response:
Okay, let's first rearrange the equation to isolate the x.

### Response:
2x = 2 - 1, which simplifies to 2x = 1

### Response:

Example Alpaca-style target prompt:

Now that the x term is isolated, let's divide both sides to eliminate its coefficient.

Notice how this time there was no "answer" subsection, because this conversation step does not propose a solution. It can happen though, so check whether answer is None.
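If you want to treat human-written steps differently from model-written ones (for instance, to weight them separately during fine-tuning), here is a minimal sketch using the is_human_response flag. This split is an illustration only, not repository code:

from datasets import load_dataset, Dataset

d: Dataset = load_dataset("sl-alex/openai-prm800k-stepwise-best")['train']

# separate the turns where the human evaluator had to intervene from the model-written ones
model_steps: Dataset = d.filter(lambda record: not record['is_human_response'])
human_steps: Dataset = d.filter(lambda record: record['is_human_response'])
print(f'{len(model_steps)} model-written steps, {len(human_steps)} human-written steps')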

Stepwise critique

# the problem statement (e.g. "what is x, given 2x = 1?")
instruction: str

# any responses that were emitted in prior conversation steps
# empty list means it's the first conversation turn
responses: List[str]

# the response emitted for the current step in the conversation
# if this is the final response in the conversation: the response may be accompanied by
# an answer subsection. we separate this out into the `answer` field below.
next_response: str

# usually None, but if this is filled in: the final response of the conversation has occurred,
# and has a subsection declaring its overall answer. this field captures that overall answer.
answer: Optional[str]

# often, there are conversation steps in which the human evaluator had to intervene,
# as no generated response was satisfactory.
is_human_response: bool

# whether an answer was provided **and** was correct
is_solution: bool

# tells us if this completion was rated as the best option by the human evaluator
is_preferred_response: bool

#   -1 = counterproductive
#    0 = unproductive
#    1 = productive
# None = human answer (you can treat this as 1)
rating: Optional[int]

We provide a record per proposed completion, per conversation turn.
This includes human ratings.

You could use this to create the following kind of training data:

  • Source: initial instruction + 0-to-many steps of conversation so far
  • Target: a possible completion and rating

The data here includes both good and bad rated samples.

You could keep just the well-rated ones. That would be similar to what Stepwise best gives you, except that instead of only the single best completion per turn it keeps any completion which got a good rating, even if it wasn't the best one. So: more data, at lower quality.

But an even better usage is to train a critic. @crowsonkb has demonstrated that if you have a classifier, you can guide your sampling of the language model, generating a few candidate next-tokens and picking the one that your classifier prefers.
So you could use this dataset to train a critic, and use that critic to guide another language model to employ more reasoning in its responses.

Usage

You'd format this the same as stepwise best.
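As a sketch of the critic use-case (again, not repository code): each candidate step in context becomes an input text, and the human rating becomes the label, treating the None rating of human-written steps as productive, per the field description above.

from datasets import load_dataset, Dataset

d: Dataset = load_dataset("sl-alex/openai-prm800k-stepwise-critic")['train']

critic_examples = []
for instruction, responses, next_response, rating in zip(
  d['instruction'], d['responses'], d['next_response'], d['rating']
):
  # the candidate step, shown in the context of the problem and the steps so far
  history = ''.join(f'{step}\n' for step in responses)
  text = f'{instruction}\n{history}{next_response}'
  # None marks a human-written step, which we treat as productive (rating 1)
  label = 1 if rating is None else rating
  critic_examples.append({'text': text, 'label': label})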

Setup

Note: if you're happy with the published datasets, then there's no need for you to get this repository set up yourself — these Setup instructions are for tinkerers who wish to try the export themselves, or change the format.

Get OpenAI data

Copy the PRM800K and MATH datasets into the prm800k directory, in accordance with prm800k/README.md.

Install Python dependencies

pip install -r requirements.txt

Run the converter

python -m scripts.convert

This should output some .parquet files to an out directory at the root of the repository.
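To sanity-check the export, you could read the files back, for example with pandas (a sketch, assuming pandas and pyarrow are installed; the exact filenames under out depend on what you exported):

from pathlib import Path
import pandas as pd

for parquet_path in sorted(Path('out').glob('*.parquet')):
  df = pd.read_parquet(parquet_path)
  print(parquet_path.name, df.shape)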
