
inference script #16

Merged

Conversation

transcendingvictor
Collaborator

Two new files in the Scripts folder. A Python module that, for a given model and a given split (usually "validation"), and optionally a given batch size, creates a dataset (.parquet file) with the log probabilities of the correct next token. The bash script includes the names of all the llama models (from 100k to 25.6m) and iterates over them, calling the Python module. The datasets are created in a new folder called "Correct logprobs".
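A minimal sketch of what the described bash loop could look like; the model names below are illustrative stand-ins, not the actual list from the PR, and the `python` invocation is shown only as a comment since the script's real arguments aren't spelled out here:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the iteration over llama models; names are made up.
MODELS=("llama-100k" "llama-1m" "llama-25.6m")

for model in "${MODELS[@]}"; do
    # The real script would invoke the Python module here, e.g.:
    #   python scripts/inference.py "$model" --batch_size 80
    echo "would run inference for $model"
done
```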

@transcendingvictor transcendingvictor linked an issue Jan 31, 2024 that may be closed by this pull request
@transcendingvictor transcendingvictor changed the title 10 script to run inference on whole validation dataset [DRAFT] 10 script to run inference on whole validation dataset Jan 31, 2024
Collaborator

@joshuawe joshuawe left a comment


Left some comments with either simple questions or mentions of a best practice that could (optionally) be improved.

  • What I find necessary are type hints and docstrings. That is good for you and others to understand your code :)
  • Some of your files seem to be redundant. There are files in `src/delphi/eval/` and in `scripts/`, and I do not think we need both of them. It probably makes sense to have `inference.py` in the `eval/` folder and the bash script in the `scripts/` folder.
  • Is the file `inference_on_validation.py` necessary anymore?

@jettjaniak
Contributor

please see #24, rebase on top of main and use the function introduced there

@jettjaniak
Contributor

same for #25

@transcendingvictor transcendingvictor force-pushed the 10-script-to-run-inference-on-whole-validation-dataset branch from e5a29da to d36a135 on February 13, 2024 21:27
@transcendingvictor transcendingvictor marked this pull request as ready for review February 15, 2024 20:52
Contributor

@jettjaniak jettjaniak left a comment


just a few small comments - please implement them and merge

output_file = os.path.join(output_folder, f'{model_name.replace("/", "-")}.parquet')
accumulated_df.to_parquet(output_file)
_, next_logprobs = get_all_and_next_logprobs(model, val_sequences)
accumulated_logprobs = torch.cat((accumulated_logprobs, next_logprobs), dim=0)
Contributor


This approach is pretty inefficient: you copy all the data accumulated so far in each step. Instead you could append `next_logprobs` to `logprobs_list` and then at the end have `all_logprobs = torch.cat(logprobs_list, dim=0)`.
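The suggested pattern can be sketched like this, with random tensors standing in for the per-batch `next_logprobs` produced inside the real loop:

```python
# Sketch of the reviewer's suggestion: append each batch tensor to a Python
# list and concatenate once at the end, instead of calling torch.cat every
# iteration (which re-copies everything accumulated so far).
import torch

# Stand-in for the next_logprobs tensors produced per batch.
batches = [torch.randn(4, 10) for _ in range(5)]

logprobs_list = []
for next_logprobs in batches:
    logprobs_list.append(next_logprobs)  # O(1); no tensor data copied here
all_logprobs = torch.cat(logprobs_list, dim=0)  # single copy at the very end
```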

Contributor


But actually, you're converting it all to a list anyway to make a DataFrame. But then pandas stores its data as numpy arrays 🙈 It's fine for now, but please remind me in our 1-1 to chat about this.

df_dataset = pd.DataFrame({"logprobs": extended_next_logprobs.tolist()})
hf_dataset = Dataset.from_pandas(df_dataset)

# change the repo_id to your hf username
Contributor


that should be an argument
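One possible shape for that, exposing the Hugging Face username as a CLI argument instead of a hard-coded comment; the argument name and repo naming scheme here are assumptions, not the PR's final API:

```python
# Hypothetical sketch: take the HF username from the command line rather than
# asking users to edit repo_id in the source.
import argparse

parser = argparse.ArgumentParser(description="Upload correct-token logprobs")
parser.add_argument(
    "--username",
    type=str,
    required=True,
    help="Hugging Face username that owns the destination repo",
)
args = parser.parse_args(["--username", "some-user"])  # example invocation
repo_id = f"{args.username}/v0-validation-logprobs"  # hypothetical repo name
```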

parser.add_argument(
"--batch_size",
type=int,
default=80,
help="Batch size for processing (default: 80)",
)

parser.add_argument(
"--dataset_name",
Contributor


Suggested change
"--dataset_name",
"--dataset-name",

@jettjaniak
Contributor

also rebase required

@jettjaniak jettjaniak changed the title [DRAFT] 10 script to run inference on whole validation dataset inference script Feb 17, 2024
@transcendingvictor transcendingvictor force-pushed the 10-script-to-run-inference-on-whole-validation-dataset branch from e4bd41c to 67bf0f2 on February 17, 2024 12:42
Contributor

@jettjaniak jettjaniak left a comment


A few more comments

val_ds = load_validation_dataset(dataset_name)

# model accepts 2D tensors (batch_size, seq_len)
val_sequences = torch.tensor([s["tokens"] for s in val_ds])
Contributor


This makes a copy of the whole validation dataset, which we don't need.


logprobs_list = []
for i in tqdm(range(0, len(val_sequences), batch_size)):
batch_sequences = val_sequences[i : i + batch_size]
Contributor


Use `val_ds` here to make a batch tensor. `val_ds["tokens"][i : i + batch_size]` should work; otherwise use a for loop.
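The suggestion can be sketched as follows; `val_ds` below is a plain dict standing in for the HF Dataset (assumed to expose a `"tokens"` column of equal-length token lists), and the model call is left as a comment:

```python
# Sketch of the reviewer's suggestion: slice the dataset's "tokens" column per
# batch instead of materializing the whole validation set as one tensor.
import torch

val_ds = {"tokens": [[i, i + 1, i + 2] for i in range(10)]}  # stand-in dataset
batch_size = 4

batch_shapes = []
for i in range(0, len(val_ds["tokens"]), batch_size):
    # Only this batch is converted to a tensor; no full-dataset copy.
    batch_sequences = torch.tensor(val_ds["tokens"][i : i + batch_size])
    batch_shapes.append(tuple(batch_sequences.shape))
    # ... run the model on batch_sequences and collect next_logprobs ...
```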

@transcendingvictor transcendingvictor force-pushed the 10-script-to-run-inference-on-whole-validation-dataset branch from f556811 to 95f4272 on February 21, 2024 16:56
@transcendingvictor transcendingvictor merged commit 75e68aa into main Feb 21, 2024
1 check passed
@transcendingvictor transcendingvictor deleted the 10-script-to-run-inference-on-whole-validation-dataset branch February 21, 2024 17:02