
Merge pull request #18 from lgabs/ita-2025/report/udpate-score-calculation

ITA 2025 - Add score calculation explanation in report.md
lgabs authored Nov 15, 2024
2 parents f15e569 + 080caa8 commit 1865029
Showing 1 changed file with 4 additions and 1 deletion.
5 changes: 4 additions & 1 deletion exams/ita_2025/report.md
@@ -29,6 +29,9 @@ Preliminary solutions can be found by widely known high schools like [Poliedro](
| ITA | 2025 | Portuguese | Essay | o1-preview | 🚧 TODO | - | - |
| ITA | 2025 | Math | Multiple Choice | o1-preview | 🚧 TODO | - | - |

## Score calculation
For multiple-choice exams, the score is straightforward. For essays, each question is worth 10.0 points, and for questions with more than one part, each part is weighted equally. Since essays are graded more subjectively, in some cases the LLM provided almost correct reasoning but made some mistakes, so we assigned partial credit when reviewing each question.
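As a minimal sketch of the grading arithmetic described above (the function name and data layout are illustrative assumptions, not code from this repository):

```python
# Sketch of the essay grading arithmetic: each essay question is worth
# 10.0 points, split equally among its parts. Illustrative only.

def essay_question_score(parts_correct: int, total_parts: int, max_points: float = 10.0) -> float:
    """Score for one essay question, weighting each part equally."""
    return max_points * parts_correct / total_parts

# Example: Chemistry question 3, with parts (a) and (b) correct out of 4 parts.
print(essay_question_score(2, 4))  # 5.0
```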

## Comments

### Math Essay
@@ -38,6 +41,6 @@ For almost all questions, `max_tokens_question_description=400` and `max_toke

### Chemistry Essay

Again, `o1-preview` got almost all questions correct in the Chemistry essay exam. The only question it missed was question 3, where it got parts (a) and (b) correct but (c) and (d) wrong. This is probably because parts (c) and (d) required reasoning about what was represented in the question's picture, which is hard for LLMs to do: gpt-4o produced a limited description of the image, and it was not enough for o1-preview. This is why, for this question, we considered 0.5 points for the score (2 parts out of 4).
Again, `o1-preview` got almost all questions correct in the Chemistry essay exam. The only question it missed was question 3, where it got parts (a) and (b) correct but (c) and (d) wrong. This is probably because parts (c) and (d) required reasoning about what was represented in the question's picture, which is hard for LLMs to do: gpt-4o produced a limited description of the image, and it was not enough for o1-preview. This is why, for this question, we considered 5.0 points for the score (2 parts out of 4).

`max_tokens_question_answer=5000` worked for most questions; only question 8 required `10000`. Also, for some questions, confusion about decimal and thousands separators in Portuguese versus English induced errors even when the worked solution was clearly correct. To fix that, we updated the o1-preview prompt to use `,` as the decimal separator and to avoid thousands separators.
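A hedged sketch of how these settings could be applied when querying the model; the token limit and the separator instruction come from the text above, while the client call, variable names, and prompt wording are assumptions rather than the repository's actual code:

```python
# Illustrative sketch only: shows the token limit and the decimal-separator
# instruction described above; the call shape is an assumption, not the
# repository's actual code.
from openai import OpenAI

client = OpenAI()

max_tokens_question_answer = 5000  # raised to 10000 only for question 8

number_format_note = (
    "Use ',' as the decimal separator and do not use thousands separators."
)
question_text = "..."  # the essay question statement would go here

response = client.chat.completions.create(
    model="o1-preview",
    max_completion_tokens=max_tokens_question_answer,
    messages=[{"role": "user", "content": f"{number_format_note}\n\n{question_text}"}],
)
print(response.choices[0].message.content)
```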
