Adding code to benchmark different models on a set of Singaporean math problems #12

ivanleomk · 2024-09-29T13:48:37Z

Important

Add scripts to benchmark models on Singaporean math problems, including data processing and evaluation, with new dependencies for datasets and cloud storage.

Scripts:
- evaluate_psle_math.py: Evaluates models on Singaporean math problems using braintrust and instructor libraries. Supports gemini, openai, and anthropic providers.
- process_math.py: Extracts metadata from solution.json files and uploads images to R2 cloud storage.
Dependencies:
- Updated instructor to 1.4.3 in requirements.txt.
- Added datasets, boto3, autoevals, braintrust, and google-generativeai to requirements.txt.

^{This description was created by}^{for e2ea4e7. It will automatically update as commits are pushed.}

ellipsis-dev

👍 Looks good to me! Reviewed everything up to e2ea4e7 in 13 seconds

More details

Looked at 322 lines of code in 3 files
Skipped 0 files when reviewing.
Skipped posting 5 drafted comments based on config settings.

1. scripts/process_math.py:36

Draft comment:
The upload_images_to_r2 function is defined but never used. Consider removing it if it's not needed.
Reason this comment was not posted:
Confidence changes required: 50%
The upload_images_to_r2 function is defined but not used in the script. This might be an oversight or intentional if the function is meant for future use or manual invocation. However, it's generally a good practice to remove unused code to keep the codebase clean.

2. scripts/process_math.py:56

Draft comment:
The call to upload_images_to_r2 is commented out. If this is intentional, consider adding a comment explaining why. Otherwise, remove the comment to enable the function call.
Reason this comment was not posted:
Confidence changes required: 50%
The upload_images_to_r2 function is commented out in the main block. If this is intentional for testing or development purposes, it should be noted. Otherwise, it should be removed or properly integrated.

3. scripts/evaluate_psle_math.py:128

Draft comment:
The content field in the messages parameter is a list. Ensure this is the expected format for the API call, as it might expect a string instead.
Reason this comment was not posted:
Comment did not seem useful.

4. scripts/evaluate_psle_math.py:167

Draft comment:
Consider handling exceptions for the httpx.get call to avoid potential runtime errors if the request fails.
Reason this comment was not posted:
Confidence changes required: 50%
The generate_anthropic_response function uses httpx.get to fetch an image and then encodes it in base64. This could be inefficient if the image is large or if there are many requests. Consider handling potential exceptions or using a more efficient method if performance becomes an issue.

5. scripts/evaluate_psle_math.py:73

Draft comment:
Consider loading the dataset once and reusing it to improve efficiency, especially if the dataset is large.
Reason this comment was not posted:
Confidence changes required: 50%
The generate_questions function loads the dataset every time it's called, which might be inefficient if the dataset is large. Consider loading the dataset once and reusing it if possible.

Workflow ID: wflow_Ktx7bu2pNZTTLrCi

You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

Added requirements.txt

e2ea4e7

ellipsis-dev bot reviewed Sep 29, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding code to benchmark different models on a set of Singaporean math problems #12

Adding code to benchmark different models on a set of Singaporean math problems #12

ivanleomk commented Sep 29, 2024 •

edited by ellipsis-dev bot

Loading

ellipsis-dev bot left a comment

Adding code to benchmark different models on a set of Singaporean math problems #12

Are you sure you want to change the base?

Adding code to benchmark different models on a set of Singaporean math problems #12

Conversation

ivanleomk commented Sep 29, 2024 • edited by ellipsis-dev bot Loading

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

ivanleomk commented Sep 29, 2024 •

edited by ellipsis-dev bot

Loading