inference script #16

Conversation
Left some comments with either simple questions or mention of a best practice that could (optionally) be improved.
- What I find necessary are type hints and docstrings. That is good for you and others to understand your code :) (a small sketch follows below this list)
- Some of your files seem to be redundant. There are files in `src/delphi/eval/` and in `scripts/`, and I do not think we need both of them. It probably makes sense to have `inference.py` in the `eval/` folder and the bash script in the `scripts/` folder.
- Is the file `inference_on_validation.py` still necessary?
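
A minimal sketch of what the requested type hints and docstrings could look like; the function name, signature, and HF-style `.logits` output are assumptions for illustration, not the PR's actual code:

```python
import torch
from torch import Tensor


def get_next_logprobs(model: torch.nn.Module, sequences: Tensor) -> Tensor:
    """Return the log-probability of the correct next token at each position.

    Args:
        model: causal language model to evaluate.
        sequences: 2D tensor of token ids, shape (batch_size, seq_len).

    Returns:
        Tensor of shape (batch_size, seq_len - 1) with next-token logprobs.
    """
    logits = model(sequences).logits  # assumes an HF-style output with .logits
    logprobs = torch.log_softmax(logits, dim=-1)
    # Gather the logprob assigned to the token that actually came next.
    return logprobs[:, :-1].gather(-1, sequences[:, 1:].unsqueeze(-1)).squeeze(-1)
```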
Please see #24, rebase on top of main and use the function introduced there.

Same for #25.
Force-pushed from e5a29da to d36a135
Just a few small comments - please implement them and merge.
scripts/inference.py (outdated):

```python
output_file = os.path.join(output_folder, f'{model_name.replace("/", "-")}.parquet')
accumulated_df.to_parquet(output_file)
_, next_logprobs = get_all_and_next_logprobs(model, val_sequences)
accumulated_logprobs = torch.cat((accumulated_logprobs, next_logprobs), dim=0)
```
This approach is pretty inefficient: you copy all the data accumulated so far in each step. Instead you could append `next_logprobs` to `logprobs_list` and then at the end have `all_logprobs = torch.cat(logprobs_list, dim=0)`.
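
A self-contained sketch of the suggested pattern; the toy tensors stand in for the PR's per-batch `next_logprobs`:

```python
import torch

# Toy batches standing in for the per-batch next_logprobs tensors.
batches = [torch.randn(80, 512) for _ in range(10)]

# Pattern from the diff: every torch.cat copies all data accumulated so far,
# which makes the total data movement quadratic in the number of batches.
accumulated = torch.empty(0, 512)
for next_logprobs in batches:
    accumulated = torch.cat((accumulated, next_logprobs), dim=0)

# Suggested pattern: O(1) list appends, one concatenation at the end.
logprobs_list = []
for next_logprobs in batches:
    logprobs_list.append(next_logprobs)
all_logprobs = torch.cat(logprobs_list, dim=0)

assert torch.equal(accumulated, all_logprobs)
```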
But actually, you're converting it all to a list anyway to make a DataFrame. But then pandas stores its data as numpy arrays 🙈 It's fine for now, but please remind me in our 1-1 to chat about this.
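
To illustrate the 🙈: going tensor → Python lists → DataFrame leaves the column as an object-dtype array of lists, so the values are boxed rather than stored as one contiguous numeric block. A quick check:

```python
import pandas as pd
import torch

t = torch.randn(3, 4)
df = pd.DataFrame({"logprobs": t.tolist()})
print(df["logprobs"].dtype)             # object: each cell is a Python list
print(type(df["logprobs"].to_numpy()))  # the column is backed by a numpy array of objects
```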
scripts/inference.py (outdated):

```python
df_dataset = pd.DataFrame({"logprobs": extended_next_logprobs.tolist()})
hf_dataset = Dataset.from_pandas(df_dataset)

# change the repo_id to your hf username
```
That should be an argument.
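
A minimal sketch of exposing it as a CLI argument; the flag name and the repo naming scheme here are assumptions, not the PR's actual choices:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--username",
    type=str,
    required=True,
    help="Hugging Face username used to build the upload repo_id",
)
args = parser.parse_args()
repo_id = f"{args.username}/correct-logprobs"  # hypothetical repo name
```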
scripts/inference.py (outdated):

```python
parser.add_argument(
    "--batch_size",
    type=int,
    default=80,
    help="Batch size for processing (default: 80)",
)

parser.add_argument(
    "--dataset_name",
```
"--dataset_name", | |
"--dataset-name", |
Also, a rebase is required.
Force-pushed from e4bd41c to 67bf0f2
A few more comments.
scripts/inference.py (outdated):

```python
val_ds = load_validation_dataset(dataset_name)

# model accepts 2D tensors (batch_size, seq_len)
val_sequences = torch.tensor([s["tokens"] for s in val_ds])
```
This makes a copy of the whole validation dataset; we don't need it (see the batching sketch below).
scripts/inference.py (outdated):

```python
logprobs_list = []
for i in tqdm(range(0, len(val_sequences), batch_size)):
    batch_sequences = val_sequences[i : i + batch_size]
```
Use `val_ds` here to make a batch tensor. `val_ds["tokens"][i : i + batch_size]` should work; otherwise build it with a for loop.
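
A self-contained sketch covering both comments; a toy `Dataset` stands in for the PR's validation set, and the per-batch slice `val_ds[i : i + batch_size]["tokens"]` is a minor variant of the suggestion above that only materialises one batch of tokens at a time:

```python
import torch
from datasets import Dataset
from tqdm import tqdm

# Toy stand-in; in the PR this comes from load_validation_dataset(dataset_name).
val_ds = Dataset.from_dict({"tokens": [[1, 2, 3, 4, 5]] * 200})
batch_size = 80

logprobs_list = []
for i in tqdm(range(0, len(val_ds), batch_size)):
    # Build the 2D (batch_size, seq_len) tensor per batch instead of
    # copying the whole validation set into one tensor up front.
    batch_sequences = torch.tensor(val_ds[i : i + batch_size]["tokens"])
    # In the PR the model runs here, e.g.:
    # _, next_logprobs = get_all_and_next_logprobs(model, batch_sequences)
    # logprobs_list.append(next_logprobs)
```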
Force-pushed from f556811 to 95f4272
Two new files in the `scripts/` folder. A Python module that, for a given model and a given split (usually "validation"), and optionally a given batch size, creates a dataset (.parquet file) with the log probabilities for the correct next token. The bash script includes the names of all the llama models (from 100k to 25.6m) and iterates over them, calling the Python module. The datasets are created in a new folder called "Correct logprobs".