
Adding code to benchmark different models on a set of Singaporean math problems #12

Open
wants to merge 1 commit into
base: main


Contributor

@ivanleomk ivanleomk commented Sep 29, 2024

Important

Add scripts to benchmark models on Singaporean math problems, including data processing and evaluation, with new dependencies for datasets and cloud storage.

  • Scripts:
    • evaluate_psle_math.py: Evaluates models on Singaporean math problems using braintrust and instructor libraries. Supports gemini, openai, and anthropic providers.
    • process_math.py: Extracts metadata from solution.json files and uploads images to R2 cloud storage.
  • Dependencies:
    • Updated instructor to 1.4.3 in requirements.txt.
    • Added datasets, boto3, autoevals, braintrust, and google-generativeai to requirements.txt.

This description was created by Ellipsis for e2ea4e7. It will automatically update as commits are pushed.


@ellipsis-dev ellipsis-dev bot left a comment

👍 Looks good to me! Reviewed everything up to e2ea4e7 in 13 seconds

More details
  • Looked at 322 lines of code in 3 files
  • Skipped 0 files when reviewing.
  • Skipped posting 5 drafted comments based on config settings.
1. scripts/process_math.py:36
  • Draft comment:
    The upload_images_to_r2 function is defined but never used. Consider removing it if it's not needed.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The upload_images_to_r2 function is defined but not used in the script. This might be an oversight or intentional if the function is meant for future use or manual invocation. However, it's generally a good practice to remove unused code to keep the codebase clean.
2. scripts/process_math.py:56
  • Draft comment:
    The call to upload_images_to_r2 is commented out. If this is intentional, consider adding a comment explaining why. Otherwise, uncomment the line to enable the function call.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The call to upload_images_to_r2 is commented out in the main block. If this is intentional for testing or development purposes, it should be noted. Otherwise, the call should be removed or properly integrated.
3. scripts/evaluate_psle_math.py:128
  • Draft comment:
    The content field in the messages parameter is a list. Ensure this is the expected format for the API call, as it might expect a string instead.
  • Reason this comment was not posted:
    Comment did not seem useful.
4. scripts/evaluate_psle_math.py:167
  • Draft comment:
    Consider handling exceptions for the httpx.get call to avoid potential runtime errors if the request fails.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The generate_anthropic_response function uses httpx.get to fetch an image and then encodes it in base64. A failed request would surface as a runtime error, and the fetch could be inefficient if the image is large or if there are many requests. Consider handling potential exceptions, and using a more efficient method if performance becomes an issue.
5. scripts/evaluate_psle_math.py:73
  • Draft comment:
    Consider loading the dataset once and reusing it to improve efficiency, especially if the dataset is large.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The generate_questions function loads the dataset every time it's called, which might be inefficient if the dataset is large. Consider loading the dataset once and reusing it if possible.

Workflow ID: wflow_Ktx7bu2pNZTTLrCi



1 participant