
Merge pull request #18 from lgabs/ita-2025/report/udpate-score-calculation

ITA 2025 - Add score calculation explanation in report.md
lgabs authored Nov 15, 2024
2 parents f15e569 + 080caa8 commit 1865029
Showing 1 changed file with 4 additions and 1 deletion.
5 changes: 4 additions & 1 deletion exams/ita_2025/report.md
@@ -29,6 +29,9 @@ Preliminary solutions can be found by widely known high schools like [Poliedro](
| ITA | 2025 | Portuguese | Essay | o1-preview | 🚧 TODO | - | - |
| ITA | 2025 | Math | Multiple Choice | o1-preview | 🚧 TODO | - | - |

## Score calculation
For multiple-choice exams, the score is straightforward. For essays, each question is worth 10.0 points, and for questions with more than one part, each part is weighted equally. Since essays are graded more subjectively, in some cases the LLM provided almost correct reasoning but made some mistakes, so we assigned partial credit when reviewing each question.
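As a minimal sketch of the grading arithmetic described above (the function name and data layout are illustrative assumptions, not code from this repository):

```python
# Sketch of the essay grading arithmetic: each essay question is worth
# 10.0 points, split equally among its parts. Illustrative only.

def essay_question_score(parts_correct: int, total_parts: int, max_points: float = 10.0) -> float:
    """Score for one essay question, weighting each part equally."""
    return max_points * parts_correct / total_parts

# Example: Chemistry question 3, with parts (a) and (b) correct out of 4 parts.
print(essay_question_score(2, 4))  # 5.0
```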

## Comments

### Math Essay
@@ -38,6 +41,6 @@ For almost all questions, `max_tokens_question_description=400` and `max_toke

### Chemistry Essay

Again, `o1-preview` got almost all questions correct in the Chemistry essay exam. The only question it missed was question 3, where it got parts (a) and (b) correct but (c) and (d) wrong. This is probably because parts (c) and (d) required reasoning about what was represented in the question's picture, which is hard for LLMs to do: gpt-4o produced a limited description of the image, and it was not enough for o1-preview. This is why, for this question, we considered 0.5 points for the score (2 parts out of 4).
Again, `o1-preview` got almost all questions correct in the Chemistry essay exam. The only question it missed was question 3, where it got parts (a) and (b) correct but (c) and (d) wrong. This is probably because parts (c) and (d) required reasoning about what was represented in the question's picture, which is hard for LLMs to do: gpt-4o produced a limited description of the image, and it was not enough for o1-preview. This is why, for this question, we considered 5.0 points for the score (2 parts out of 4).

`max_tokens_question_answer=5000` worked for most questions; only question 8 required `10000`. Also, for some questions, confusion about decimal and thousands separators in Portuguese versus English induced errors even when the worked solution was clearly correct. To fix that, we updated the o1-preview prompt to use `,` as the decimal separator and to avoid thousands separators.
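A hedged sketch of how these settings could be applied when querying the model; the token limit and the separator instruction come from the text above, while the client call, variable names, and prompt wording are assumptions rather than the repository's actual code:

```python
# Illustrative sketch only: shows the token limit and the decimal-separator
# instruction described above; the call shape is an assumption, not the
# repository's actual code.
from openai import OpenAI

client = OpenAI()

max_tokens_question_answer = 5000  # raised to 10000 only for question 8

number_format_note = (
    "Use ',' as the decimal separator and do not use thousands separators."
)
question_text = "..."  # the essay question statement would go here

response = client.chat.completions.create(
    model="o1-preview",
    max_completion_tokens=max_tokens_question_answer,
    messages=[{"role": "user", "content": f"{number_format_note}\n\n{question_text}"}],
)
print(response.choices[0].message.content)
```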
