
Playground: Side-by-Side Expert Evaluation #345

Merged: 77 commits into develop from feature/side-by-side-tool on Nov 18, 2024

Conversation

@DominikRemo (Contributor) commented on Oct 2, 2024

Motivation and Context

By GPT-4o and Suno (sound on 🎶)

Side.by.Side.-.SD.480p.mov

The primary motivation behind this tool is to create a robust and versatile benchmark for evaluating feedback on student submissions across specific use cases.

  • Sustainability: The tool is designed to be flexible and adaptable, allowing benchmarks to be automatically repeated with newer or alternative large language models (LLMs) or other feedback generation methods as they become available.
  • Reliability: By ensuring that benchmarks are tested on real-world data not included in LLM training data, the tool helps prevent overfitting and provides a more reliable evaluation of feedback quality.
  • Readability: The benchmarks produced by this tool are designed to be clear and understandable, with metrics that are specific to the use case, making it easier for researchers and practitioners to interpret and apply the results effectively.

Description

This PR introduces the Side-by-Side Evaluation Tool, designed to assist experts in evaluating feedback provided on student submissions. This tool is especially useful for researchers seeking to assess the quality and relevance of feedback across multiple criteria and through multiple evaluators. The Side-by-Side Tool provides two views:

  • Researcher View: The researcher defines evaluation metrics (e.g., accuracy, tone), selects exercises and submissions for review, and generates individual links for each expert involved in the evaluation.
  • Expert View: Each expert reviews submissions and evaluates feedback based on the metrics set by the researcher. A Likert scale is used for each metric, allowing for consistent and structured assessments across feedback instances (a rough sketch of such a metric follows below).
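
As a rough illustration of the data involved, a metric with the Title/Summary/Description fields mentioned in the testing steps, plus a Likert-scale rating, could look like the following TypeScript sketch. The type and field names are assumptions made for this sketch, not the actual playground model:

```typescript
// Illustrative sketch only; names are assumptions, not the actual playground types.
interface Metric {
  title: string;       // e.g. "Accuracy"
  summary: string;     // short explanation shown to the expert
  description: string; // longer (Markdown) description shown on demand
}

// One expert judgment: a single Likert value per metric and feedback instance.
type LikertValue = 1 | 2 | 3 | 4 | 5;

interface MetricRating {
  metricTitle: string; // refers to Metric.title
  value: LikertValue;
}

const accuracy: Metric = {
  title: "Accuracy",
  summary: "Does the feedback correctly identify issues in the submission?",
  description: "Rate how factually correct the generated feedback is.",
};

const exampleRating: MetricRating = { metricTitle: accuracy.title, value: 4 };
```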

Steps for Testing

  1. Access the playground on the test server or locally
  2. Select Evaluation Mode and scroll down to the Expert Evaluation section
  3. Create a new evaluation by providing an Evaluation Name
  4. Import exercises using the JSON files below ⬇️
  5. Add at least two Metrics (e.g. Accuracy, Adaptability, Tone, ...) by providing a Title, a Summary, and a Description
  6. Create a new expert link and Save Changes
  7. Start the Evaluation and open the expert link
  8. See the Welcome Screen and proceed to the Tutorial
  9. Make sure the tutorial is understandable and Start Evaluation
  10. Evaluate the submissions as described in the tutorial
  11. Also make sure all buttons work as expected: try Continue Later, exit the page, and continue the evaluation
  12. Finish the evaluation by evaluating all metrics for all submissions

Testserver States

Note

These badges show the state of the test servers.
Green = Currently available, Red = Currently locked
Click on the badges to get to the test servers.


Test Data

Use the following exercises for testing:
exercise-5000.json
exercise-800.json
exercise-700.json
exercise-123.json

Screenshots

Researcher View - Imported Exercises and Metrics

Researcher View - Defining Metrics and creating expert links

Expert View - Welcome Screen

Expert View - Evaluation View

Added dependencies for free Font Awesome.
Adds an attribute (false by default) that allows hiding the detail header (referenced / unreferenced, grading criterion ID, etc.) from inline feedback; a rough sketch of this attribute follows below.
Adds the expert view, auxiliary data, and a temporary auxiliary button to access the expert view during testing.
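
A minimal sketch of what such an attribute could look like on an inline feedback component, assuming a React/TypeScript setup. The component and prop names are made up for illustration and do not claim to match the actual code:

```tsx
// Hypothetical sketch; the real component and prop names in the playground may differ.
import React from "react";

type InlineFeedbackProps = {
  text: string;
  credits: number;
  // Hides the detail header (referenced/unreferenced, grading criterion ID, ...).
  // Defaults to false so existing usages keep their current appearance.
  hideDetailHeader?: boolean;
};

export function InlineFeedback({ text, credits, hideDetailHeader = false }: InlineFeedbackProps) {
  return (
    <div className="inline-feedback">
      {!hideDetailHeader && (
        <div className="inline-feedback-detail-header">
          {/* referenced / unreferenced, grading criterion ID, etc. */}
        </div>
      )}
      <p>
        {text} ({credits} credits)
      </p>
    </div>
  );
}
```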
@DominikRemo changed the title from ´Playground´: Side by Side Expert Evaluation (WIP) to Playground: Side-by-Side Expert Evaluation (WIP) on Oct 2, 2024
laadvo and others added 24 commits October 3, 2024 20:20
- Replaced sanitize-html with Markdown rendering for metric descriptions.
- Simplified the process of editing descriptions by allowing Markdown syntax.
- Improved visualization of descriptions in popups and during the editing process, enhancing the user experience and readability.
- Refactored the Metric type, which was previously defined in two separate locations, to now consistently use the Metric type from the model.
- Moved the ExpertEvaluationConfig definition from the evaluation_management component to the model, enabling reuse across multiple components for better maintainability and consistency.
- Fixes an issue where switching the ExpertEvaluationConfig via the dropdown did not correctly update the selected exercises.
- Refactors the ExpertEvaluationExerciseSelect component to simplify its structure by moving data fetching and error handling to the child component.
- Ensures proper synchronization of selected exercises between the parent and child components.
- Implements better separation of concerns by letting the parent manage state while the child handles fetching and rendering exercises.
- Adds multiple exercise selection capability with clear communication between the parent and child components.

This commit improves code clarity, ensures reliable state management, and enhances the user experience during config switching and exercise selection; a rough sketch of this parent/child split follows below.
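
The parent/child split described in this commit might look roughly like the following sketch (React/TypeScript). The component structure, the fetch URL, and the prop names are assumptions made for illustration; only the idea that the parent owns the selection state while the child fetches and renders the exercises is taken from the commit message:

```tsx
// Hypothetical sketch of the described pattern; not the actual playground code.
import React, { useEffect, useState } from "react";

type Exercise = { id: number; title: string };

// Child: fetches and renders exercises, reports selection changes upward.
function ExerciseList({ selected, onChange }: {
  selected: Exercise[];
  onChange: (exercises: Exercise[]) => void;
}) {
  const [exercises, setExercises] = useState<Exercise[]>([]);
  const [error, setError] = useState<string>();

  useEffect(() => {
    fetch("/api/exercises") // assumed endpoint, for illustration only
      .then((res) => res.json())
      .then(setExercises)
      .catch((e) => setError(String(e)));
  }, []);

  if (error) return <p>Failed to load exercises: {error}</p>;

  const toggle = (exercise: Exercise) => {
    const isSelected = selected.some((e) => e.id === exercise.id);
    onChange(isSelected
      ? selected.filter((e) => e.id !== exercise.id)
      : [...selected, exercise]);
  };

  return (
    <ul>
      {exercises.map((exercise) => (
        <li key={exercise.id} onClick={() => toggle(exercise)}>
          {exercise.title}
        </li>
      ))}
    </ul>
  );
}

// Parent: owns the multi-selection state, e.g. as part of the evaluation config.
export function ExpertEvaluationExerciseSelect() {
  const [selectedExercises, setSelectedExercises] = useState<Exercise[]>([]);
  return <ExerciseList selected={selectedExercises} onChange={setSelectedExercises} />;
}
```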
- Added creationDate Attribute to better distinguish configs with same name.
- Added eager loading of submissions and categorizedFeedbacks in multiple_exercises_select for simpler access.
- Made the header in the expert view of the side-by-side tool sticky, allowing experts to navigate more efficiently without needing to scroll back to the top after each evaluation.
- Reduced the header’s footprint, giving more screen space for critical evaluation features and improving overall usability.
- Enhanced the visual appeal by adding a cleaner, more functional progress bar for tracking evaluation progress.
Once an evaluation is started, the user is not able to change it anymore.
Enables researchers to create links for the expert evaluation. The experts can access the side-by-side tool via these links.
When an experiment is started, no new metrics can be added. Therefore, the new metric form is now hidden accordingly.
- If disabled, buttons are hidden from the evaluation management rather than merely disabled.
- Use sections instead of labels where fitting
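
Taken together, the commits above suggest that the evaluation config carries roughly the following information. Apart from the creation date, the metrics, the expert links, and the started/locked state, which are mentioned above, the field names are guesses for illustration:

```typescript
// Hypothetical shape of ExpertEvaluationConfig; field names are illustrative guesses.
interface ExpertEvaluationConfig {
  name: string;          // the Evaluation Name entered by the researcher
  creationDate: Date;    // distinguishes configs that share the same name
  metrics: { title: string; summary: string; description: string }[];
  exerciseIds: number[]; // exercises imported/selected for this evaluation
  expertLinks: string[]; // one link per participating expert
  started: boolean;      // once true, metrics and exercises can no longer be changed
}
```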
@laadvo changed the title from Playground: Side-by-Side Expert Evaluation (WIP) to Playground: Side-by-Side Expert Evaluation on Nov 12, 2024
@DominikRemo added the deploy:athena-test2 (Athena Test Server 2) label on Nov 12, 2024
@DominikRemo temporarily deployed to athena-test2.ase.cit.tum.de on November 12, 2024 at 13:10 with GitHub Actions (inactive)
@github-actions (bot) added the lock:athena-test2 (Is currently deployed to Athena Test Server 2) label and removed the deploy:athena-test2 (Athena Test Server 2) label on Nov 12, 2024
@DominikRemo marked this pull request as ready for review on November 12, 2024 at 13:30
@DominikRemo removed the lock:athena-test2 (Is currently deployed to Athena Test Server 2) label on Nov 12, 2024
@DominikRemo added the deploy:athena-test2 (Athena Test Server 2) label on Nov 12, 2024
@DominikRemo temporarily deployed to athena-test2.ase.cit.tum.de on November 12, 2024 at 14:33 with GitHub Actions (inactive)
@github-actions (bot) added the lock:athena-test2 (Is currently deployed to Athena Test Server 2) label and removed the deploy:athena-test2 (Athena Test Server 2) label on Nov 12, 2024
@@ -14,6 +14,9 @@
   },
   "dependencies": {
     "@blueprintjs/core": "5.5.1",
+    "@fortawesome/fontawesome-svg-core": "^6.6.0",
+    "@fortawesome/free-solid-svg-icons": "^6.6.0",
+    "@fortawesome/react-fontawesome": "^0.2.2",

Collaborator:

Please pin the dependencies

Contributor Author:

We now pinned the dependencies 👍
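
Pinning presumably means replacing the caret ranges from the diff above with exact versions so installs are reproducible, e.g.:

    "@fortawesome/fontawesome-svg-core": "6.6.0",
    "@fortawesome/free-solid-svg-icons": "6.6.0",
    "@fortawesome/react-fontawesome": "0.2.2",

(whether these are the exact versions in the merged commit was not verified here).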

@@ -1 +1 @@
-export type DataMode = "example" | "evaluation" | string;
+export type DataMode = "example" | "evaluation" | "expert_evaluation" | "expert_evaluation/exercises" | string;

Collaborator:

Why do you have two different modes here instead of just one mode?

Contributor Author:

Good catch! This was part of a previous iteration for storing evaluations. It is not needed anymore. We have now removed it 👍
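
If only the extra "expert_evaluation/exercises" member was removed, the type presumably ended up as follows (a guess based on this exchange, not checked against the merged code):

```typescript
// Presumed final shape after addressing the review comment; not verified against the merged branch.
export type DataMode = "example" | "evaluation" | "expert_evaluation" | string;
```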

@DominikRemo removed the lock:athena-test2 (Is currently deployed to Athena Test Server 2) label on Nov 12, 2024

@FelixTJDietrich (Collaborator) left a comment:

Just implement the comments from above, then this PR looks good to me. I also tested again on the test server and could not see any issues with it. Good job!

@maximiliansoelch (Member) left a comment:

Please incorporate the changes requested by @FelixTJDietrich.
Otherwise, it looks good! Great job 👍

@FelixTJDietrich merged commit acb2929 into develop on Nov 18, 2024
14 checks passed
@FelixTJDietrich deleted the feature/side-by-side-tool branch on November 18, 2024 at 11:27