
Playground: Side-by-Side Expert Evaluation #345

Merged: 77 commits into develop from feature/side-by-side-tool on Nov 18, 2024

Conversation

@DominikRemo (Contributor) commented on Oct 2, 2024

Motivation and Context

By GPT-4o and Suno (sound on 🎶)

Side.by.Side.-.SD.480p.mov

The primary motivation behind this tool is to create a robust and versatile benchmark for evaluating feedback on student submissions across specific use cases.

  • Sustainability: The tool is designed to be flexible and adaptable, allowing benchmarks to be automatically repeated with newer or alternative large language models (LLMs) or other feedback generation methods as they become available.
  • Reliability: By ensuring that benchmarks are tested on real-world data not included in LLM training data, the tool helps prevent overfitting and provides a more reliable evaluation of feedback quality.
  • Readability: The benchmarks produced by this tool are designed to be clear and understandable, with metrics that are specific to the use case, making it easier for researchers and practitioners to interpret and apply the results effectively.

Description

This PR introduces the Side-by-Side Evaluation Tool, designed to assist experts in evaluating feedback provided on student submissions. This tool is especially useful for researchers seeking to assess the quality and relevance of feedback across multiple criteria and through multiple evaluators. The Side-by-Side Tool provides two views:

  • Researcher View: The researcher defines evaluation metrics (e.g., accuracy, tone), selects exercises and submissions for review, and generates individual links for each expert involved in the evaluation.
  • Expert View: Each expert reviews submissions and evaluates feedback based on the metrics set by the researcher. A Likert scale is used for each metric, allowing for consistent and structured assessments across feedback instances (a rough sketch of such a metric follows below).
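
As a rough illustration of the data involved, a metric with the Title/Summary/Description fields mentioned in the testing steps, plus a Likert-scale rating, could look like the following TypeScript sketch. The type and field names are assumptions made for this sketch, not the actual playground model:

```typescript
// Illustrative sketch only; names are assumptions, not the actual playground types.
interface Metric {
  title: string;       // e.g. "Accuracy"
  summary: string;     // short explanation shown to the expert
  description: string; // longer (Markdown) description shown on demand
}

// One expert judgment: a single Likert value per metric and feedback instance.
type LikertValue = 1 | 2 | 3 | 4 | 5;

interface MetricRating {
  metricTitle: string; // refers to Metric.title
  value: LikertValue;
}

const accuracy: Metric = {
  title: "Accuracy",
  summary: "Does the feedback correctly identify issues in the submission?",
  description: "Rate how factually correct the generated feedback is.",
};

const exampleRating: MetricRating = { metricTitle: accuracy.title, value: 4 };
```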

Steps for Testing

  1. Access the playground on the test server or locally
  2. Select Evaluation Mode and scroll down to the Expert Evaluation section
  3. Create a new evaluation by providing an Evaluation Name
  4. Import exercises using the JSON files below ⬇️
  5. Add at least two Metrics (e.g. Accuracy, Adaptability, Tone, ...) by providing a Title, a Summary, and a Description
  6. Create a new expert link and Save Changes
  7. Start the Evaluation and open the expert link
  8. See the Welcome Screen and proceed to the Tutorial
  9. Make sure the tutorial is understandable and Start Evaluation
  10. Evaluate the submissions as described in the tutorial
  11. Also make sure all buttons work as expected: try Continue Later, exit the page, and continue the evaluation
  12. Finish the evaluation by evaluating all metrics for all submissions

Testserver States

Note

These badges show the state of the test servers.
Green = Currently available, Red = Currently locked
Click on the badges to get to the test servers.


Test Data

Use the following exercises for testing:
exercise-5000.json
exercise-800.json
exercise-700.json
exercise-123.json

Screenshots

Researcher View - Imported Exercises and Metrics

Researcher View - Defining Metrics and creating expert links

Expert View - Welcome Screen

Expert View - Evaluation View

Added dependencies for free Font Awesome.
Adds an attribute (false by default) that allows hiding the detail header (referenced / unreferenced, grading criterion ID, etc.) from inline feedback; a rough sketch of this attribute follows below.
Adds the expert view, auxiliary data, and a temporary auxiliary button to access the expert view during testing.
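
A minimal sketch of what such an attribute could look like on an inline feedback component, assuming a React/TypeScript setup. The component and prop names are made up for illustration and do not claim to match the actual code:

```tsx
// Hypothetical sketch; the real component and prop names in the playground may differ.
import React from "react";

type InlineFeedbackProps = {
  text: string;
  credits: number;
  // Hides the detail header (referenced/unreferenced, grading criterion ID, ...).
  // Defaults to false so existing usages keep their current appearance.
  hideDetailHeader?: boolean;
};

export function InlineFeedback({ text, credits, hideDetailHeader = false }: InlineFeedbackProps) {
  return (
    <div className="inline-feedback">
      {!hideDetailHeader && (
        <div className="inline-feedback-detail-header">
          {/* referenced / unreferenced, grading criterion ID, etc. */}
        </div>
      )}
      <p>
        {text} ({credits} credits)
      </p>
    </div>
  );
}
```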
@DominikRemo changed the title from ´Playground´: Side by Side Expert Evaluation (WIP) to Playground: Side-by-Side Expert Evaluation (WIP) on Oct 2, 2024
laadvo and others added 24 commits October 3, 2024 20:20
- Replaced sanitize-html with Markdown rendering for metric descriptions.
- Simplified the process of editing descriptions by allowing Markdown syntax.
- Improved visualization of descriptions in popups and during the editing process, enhancing the user experience and readability.
- Refactored the Metric type, which was previously defined in two separate locations, to now consistently use the Metric type from the model.
- Moved the ExpertEvaluationConfig definition from the evaluation_management component to the model, enabling reuse across multiple components for better maintainability and consistency.
- Fixes an issue where switching the ExpertEvaluationConfig via the dropdown did not correctly update the selected exercises.
- Refactors the ExpertEvaluationExerciseSelect component to simplify its structure by moving data fetching and error handling to the child component.
- Ensures proper synchronization of selected exercises between the parent and child components.
- Implements better separation of concerns by letting the parent manage state while the child handles fetching and rendering exercises.
- Adds multiple exercise selection capability with clear communication between the parent and child components.

This commit improves code clarity, ensures reliable state management, and enhances the user experience during config switching and exercise selection; a rough sketch of this parent/child split follows below.
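
The parent/child split described in this commit might look roughly like the following sketch (React/TypeScript). The component structure, the fetch URL, and the prop names are assumptions made for illustration; only the idea that the parent owns the selection state while the child fetches and renders the exercises is taken from the commit message:

```tsx
// Hypothetical sketch of the described pattern; not the actual playground code.
import React, { useEffect, useState } from "react";

type Exercise = { id: number; title: string };

// Child: fetches and renders exercises, reports selection changes upward.
function ExerciseList({ selected, onChange }: {
  selected: Exercise[];
  onChange: (exercises: Exercise[]) => void;
}) {
  const [exercises, setExercises] = useState<Exercise[]>([]);
  const [error, setError] = useState<string>();

  useEffect(() => {
    fetch("/api/exercises") // assumed endpoint, for illustration only
      .then((res) => res.json())
      .then(setExercises)
      .catch((e) => setError(String(e)));
  }, []);

  if (error) return <p>Failed to load exercises: {error}</p>;

  const toggle = (exercise: Exercise) => {
    const isSelected = selected.some((e) => e.id === exercise.id);
    onChange(isSelected
      ? selected.filter((e) => e.id !== exercise.id)
      : [...selected, exercise]);
  };

  return (
    <ul>
      {exercises.map((exercise) => (
        <li key={exercise.id} onClick={() => toggle(exercise)}>
          {exercise.title}
        </li>
      ))}
    </ul>
  );
}

// Parent: owns the multi-selection state, e.g. as part of the evaluation config.
export function ExpertEvaluationExerciseSelect() {
  const [selectedExercises, setSelectedExercises] = useState<Exercise[]>([]);
  return <ExerciseList selected={selectedExercises} onChange={setSelectedExercises} />;
}
```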
- Added creationDate Attribute to better distinguish configs with same name.
- Added eager loading of submissions and categorizedFeedbacks in multiple_exercises_select for simpler access.
- Made the header in the expert view of the side-by-side tool sticky, allowing experts to navigate more efficiently without needing to scroll back to the top after each evaluation.
- Reduced the header’s footprint, giving more screen space for critical evaluation features and improving overall usability.
- Enhanced the visual appeal by adding a cleaner, more functional progress bar for tracking evaluation progress.
Once an evaluation is started, the user is not able to change it anymore.
Enables researchers to create links for the expert evaluation. The experts can access the side-by-side tool via these links.
When an experiment is started, no new metrics can be added. Therefore, the new metric form is now hidden accordingly.
- If disabled, buttons are hidden from the evaluation management rather than merely disabled.
- Use sections instead of labels where fitting
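
Taken together, the commits above suggest that the evaluation config carries roughly the following information. Apart from the creation date, the metrics, the expert links, and the started/locked state, which are mentioned above, the field names are guesses for illustration:

```typescript
// Hypothetical shape of ExpertEvaluationConfig; field names are illustrative guesses.
interface ExpertEvaluationConfig {
  name: string;          // the Evaluation Name entered by the researcher
  creationDate: Date;    // distinguishes configs that share the same name
  metrics: { title: string; summary: string; description: string }[];
  exerciseIds: number[]; // exercises imported/selected for this evaluation
  expertLinks: string[]; // one link per participating expert
  started: boolean;      // once true, metrics and exercises can no longer be changed
}
```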
@laadvo changed the title from Playground: Side-by-Side Expert Evaluation (WIP) to Playground: Side-by-Side Expert Evaluation on Nov 12, 2024
@DominikRemo added the deploy:athena-test2 (Athena Test Server 2) label on Nov 12, 2024
@DominikRemo temporarily deployed to athena-test2.ase.cit.tum.de on November 12, 2024 at 13:10 with GitHub Actions (inactive)
@github-actions (bot) added the lock:athena-test2 (Is currently deployed to Athena Test Server 2) label and removed the deploy:athena-test2 (Athena Test Server 2) label on Nov 12, 2024
@DominikRemo marked this pull request as ready for review on November 12, 2024 at 13:30
@DominikRemo removed the lock:athena-test2 (Is currently deployed to Athena Test Server 2) label on Nov 12, 2024
@DominikRemo added the deploy:athena-test2 (Athena Test Server 2) label on Nov 12, 2024
@DominikRemo temporarily deployed to athena-test2.ase.cit.tum.de on November 12, 2024 at 14:33 with GitHub Actions (inactive)
@github-actions (bot) added the lock:athena-test2 (Is currently deployed to Athena Test Server 2) label and removed the deploy:athena-test2 (Athena Test Server 2) label on Nov 12, 2024
@@ -14,6 +14,9 @@
   },
   "dependencies": {
     "@blueprintjs/core": "5.5.1",
+    "@fortawesome/fontawesome-svg-core": "^6.6.0",
+    "@fortawesome/free-solid-svg-icons": "^6.6.0",
+    "@fortawesome/react-fontawesome": "^0.2.2",

Collaborator:

Please pin the dependencies

Contributor Author:

We now pinned the dependencies 👍
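
Pinning presumably means replacing the caret ranges from the diff above with exact versions so installs are reproducible, e.g.:

    "@fortawesome/fontawesome-svg-core": "6.6.0",
    "@fortawesome/free-solid-svg-icons": "6.6.0",
    "@fortawesome/react-fontawesome": "0.2.2",

(whether these are the exact versions in the merged commit was not verified here).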

@@ -1 +1 @@
-export type DataMode = "example" | "evaluation" | string;
+export type DataMode = "example" | "evaluation" | "expert_evaluation" | "expert_evaluation/exercises" | string;

Collaborator:

Why do you have two different modes here instead of just one mode?

Contributor Author:

Good catch! This was part of a previous iteration for storing evaluations. It is not needed anymore. We have now removed it 👍
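
If only the extra "expert_evaluation/exercises" member was removed, the type presumably ended up as follows (a guess based on this exchange, not checked against the merged code):

```typescript
// Presumed final shape after addressing the review comment; not verified against the merged branch.
export type DataMode = "example" | "evaluation" | "expert_evaluation" | string;
```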

@DominikRemo removed the lock:athena-test2 (Is currently deployed to Athena Test Server 2) label on Nov 12, 2024

@FelixTJDietrich (Collaborator) left a comment:

Just implement the comments from above, then this PR looks good to me. I also tested again on the test server and could not see any issues with it. Good job!

@maximiliansoelch (Member) left a comment:

Please incorporate the changes requested by @FelixTJDietrich.
Otherwise, it looks good! Great job 👍

@FelixTJDietrich merged commit acb2929 into develop on Nov 18, 2024
14 checks passed
@FelixTJDietrich deleted the feature/side-by-side-tool branch on November 18, 2024 at 11:27