Add automatic evaluation with LLM-as-a-Judge, LangSmith export, and SGI evaluation #174

Merged
merged 61 commits into from
Nov 12, 2023

Conversation

FelixTJDietrich
Collaborator

@FelixTJDietrich FelixTJDietrich commented Nov 5, 2023

Motivation and Context

We want an option to systematically evaluate the modules without manual labor.

closes #84
closes #36

Description

  • Add /evaluation endpoint to Athena for returning arbitrary evaluation results
  • Add an example of what this /evaluation endpoint would look like to module_example (for programming exercises)
  • Add this new module request to the Playground
  • In evaluation mode, run the evaluation for all healthy modules of the same type that support it
  • Display progress of automatic evaluation (and move "Start Generating" button)
  • Add LLM-as-a-judge for text modules
  • Disable LLM-as-a-judge using environment variable
  • Add token usage and response times in evaluation response (coming from LangSmith)
  • Add eval code for SGI evaluation
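
To illustrate the list above, here is a minimal sketch of what a module-side evaluation function behind /evaluation might return, including an environment-variable switch for LLM-as-a-judge. This is not the code from this PR: the function names, the variable name `LLM_ENABLE_LLM_AS_A_JUDGE`, the `credits` field, and the result keys are all assumptions made for illustration.

```python
import os
from typing import Any, Mapping, Optional, Sequence

# Hypothetical variable name; the PR does not state the real one.
JUDGE_ENV_VAR = "LLM_ENABLE_LLM_AS_A_JUDGE"


def judge_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return False only if the variable is explicitly set to a falsy value."""
    env = os.environ if env is None else env
    return env.get(JUDGE_ENV_VAR, "1").strip().lower() not in {"0", "false", "no"}


def evaluate(
    true_feedbacks: Sequence[Mapping[str, Any]],
    predicted_feedbacks: Sequence[Mapping[str, Any]],
    env: Optional[Mapping[str, str]] = None,
) -> dict:
    """Build an arbitrary, JSON-serializable evaluation result."""
    result: dict = {
        "feedback_statistics": {
            "true_count": len(true_feedbacks),
            "predicted_count": len(predicted_feedbacks),
            # Assumes each feedback carries a numeric "credits" field.
            "predicted_credits_total": sum(f["credits"] for f in predicted_feedbacks),
        }
    }
    if judge_enabled(env):
        # Placeholder only: a real module would call an LLM here.
        result["llm_as_a_judge"] = {"status": "not implemented in this sketch"}
    return result
```

Passing `env` explicitly keeps the switch testable without touching the real process environment; with the variable set to `0`, the result contains only the statistics section.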

Note: I omitted the "module_text_llm estimated Accepted with Minor modifications" text from the UI because it is awkward to add and offers no immediate benefit. In the future, automatic results could be made available in the inline feedback view, but this also depends on future experiment designs.

Steps for Testing

  1. Run an evaluation with module_text_llm
  2. After everything is done, click export; the automatic_evaluation should now be exported as a separate file
  3. The file should contain LLM-as-a-judge, llm_statistics (from LangSmith, I added my key to the env), and feedback_statistics
  4. Cancel experiment
  5. Import the results and the automatic evaluation and check that the import works
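
Step 3 can also be checked with a small script instead of by eye. The sketch below searches an exported JSON document for the expected sections at any nesting depth. The names `llm_statistics` and `feedback_statistics` come from the steps above; `llm_as_a_judge` is a guessed key name, and the actual layout of the exported file is not assumed.

```python
import json

# "llm_statistics" and "feedback_statistics" are named in the steps above;
# "llm_as_a_judge" is a guessed key name.
EXPECTED_SECTIONS = ("llm_as_a_judge", "llm_statistics", "feedback_statistics")


def missing_sections(exported_json: str) -> list:
    """Return the expected section names that appear nowhere in the export."""
    found = set()

    def walk(node):
        # Collect every dictionary key, descending through dicts and lists.
        if isinstance(node, dict):
            for key, value in node.items():
                found.add(key)
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(json.loads(exported_json))
    return [name for name in EXPECTED_SECTIONS if name not in found]
```

An empty list means all three sections were found somewhere in the file; any names returned are the ones to look into.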

Screenshots

Docs:
image

Not Started

Screenshot 2023-11-07 at 16 36 26

Sent Training Feedback

Screenshot 2023-11-07 at 16 36 31

Generated Suggestions

Screenshot 2023-11-07 at 16 38 37

Finished

Screenshot 2023-11-07 at 16 38 42

@FelixTJDietrich FelixTJDietrich mentioned this pull request Nov 5, 2023
16 tasks
@FelixTJDietrich FelixTJDietrich self-assigned this Nov 5, 2023
@FelixTJDietrich FelixTJDietrich marked this pull request as ready for review November 7, 2023 20:15
@FelixTJDietrich FelixTJDietrich added dependencies Pull requests that update a dependency file deploy:athena-test1 Athena Test Server 1 enhancement New feature or request python Pull requests that update Python code playground Pull requests that update the playground athena package javascript Pull requests that update Javascript code labels Nov 7, 2023
@FelixTJDietrich
Collaborator Author

It seems I did not validate the structured_grading_instruction_id. It should be fine now.

@FelixTJDietrich FelixTJDietrich added deploy:athena-test1 Athena Test Server 1 and removed lock:athena-test1 Is currently deployed to Athena Test Server 1 labels Nov 9, 2023
@github-actions github-actions bot added lock:athena-test1 Is currently deployed to Athena Test Server 1 and removed deploy:athena-test1 Athena Test Server 1 labels Nov 9, 2023
@FelixTJDietrich
Collaborator Author

Works for me on the test server now :)
image
image

@FelixTJDietrich
Collaborator Author

I have now also added some docs:
image

@FelixTJDietrich FelixTJDietrich added documentation Improvements or additions to documentation and removed lock:athena-test1 Is currently deployed to Athena Test Server 1 labels Nov 11, 2023
@pal03377 pal03377 added the deploy:athena-test1 Athena Test Server 1 label Nov 12, 2023
@github-actions github-actions bot removed the deploy:athena-test1 Athena Test Server 1 label Nov 12, 2023

This comment was marked as outdated.

@github-actions github-actions bot added the deployment-error Added by deployment workflows if an error occured label Nov 12, 2023
@pal03377 pal03377 removed the deployment-error Added by deployment workflows if an error occured label Nov 12, 2023
@pal03377 pal03377 added the deploy:athena-test1 Athena Test Server 1 label Nov 12, 2023
@github-actions github-actions bot added lock:athena-test1 Is currently deployed to Athena Test Server 1 and removed deploy:athena-test1 Athena Test Server 1 labels Nov 12, 2023
@pal03377 pal03377 temporarily deployed to Athena Test Server 3 November 12, 2023 19:59 — with GitHub Actions Inactive
Contributor

@pal03377 pal03377 left a comment

The exported files look as expected. I also could not find any issues (neither in the code nor in my tests) ✅

@pal03377 pal03377 removed the lock:athena-test1 Is currently deployed to Athena Test Server 1 label Nov 12, 2023
@FelixTJDietrich FelixTJDietrich merged commit 89e7047 into develop Nov 12, 2023
18 checks passed
@FelixTJDietrich FelixTJDietrich deleted the feature/automatic-evaluation branch November 12, 2023 22:35
Labels
athena package dependencies Pull requests that update a dependency file documentation Improvements or additions to documentation enhancement New feature or request javascript Pull requests that update Javascript code playground Pull requests that update the playground python Pull requests that update Python code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Initial Evaluation Mode Roadmap
2 participants