Add leaderboard tasks tests #8

KonradSzafer · 2024-07-10T13:45:06Z

No description provided.

clefourrier · 2024-07-25T12:40:51Z

tests/leaderboards/test_tasks_leaderboard.py

+
+    for key, reference_val in reference.items():
+        if isinstance(reference_val, Dict):
+            if recursive:


In which cases would we not want to test the subcontents?

I used it in, e.g., the first test function to check only the lowest level of the config. I am now restructuring the configs so that this flag is no longer used.
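To make the question concrete, here is a minimal sketch of what a `recursive` switch could look like in a dict comparison. It is not the code from this PR: the name `compare_dicts`, the `path` argument, the returned list of mismatching keys, and the sample values are all assumptions for illustration, and it takes one plausible reading of the flag (nested dicts are skipped entirely when it is off).

```python
from typing import Any, Dict, List


def compare_dicts(
    reference: Dict[str, Any],
    observed: Dict[str, Any],
    recursive: bool = True,
    path: str = "",
) -> List[str]:
    """Return the dotted key paths at which `observed` differs from `reference`.

    Hypothetical helper, not the PR's implementation.
    """
    mismatches: List[str] = []
    for key, reference_val in reference.items():
        key_path = f"{path}.{key}" if path else key
        if key not in observed:
            mismatches.append(key_path)
        elif isinstance(reference_val, dict):
            # Nested dicts are only descended into when `recursive` is set;
            # otherwise their contents are not checked at all.
            if recursive:
                if isinstance(observed[key], dict):
                    mismatches += compare_dicts(
                        reference_val, observed[key], True, key_path
                    )
                else:
                    mismatches.append(key_path)
        elif observed[key] != reference_val:
            mismatches.append(key_path)
    return mismatches


# Made-up sample data: the nested value differs, the top-level value does not.
reference = {"model": "some-model", "task": {"num_fewshot": 3, "limit": 10}}
observed = {"model": "some-model", "task": {"num_fewshot": 5, "limit": 10}}

print(compare_dicts(reference, observed, recursive=True))   # ['task.num_fewshot']
print(compare_dicts(reference, observed, recursive=False))  # []
```

With the flag off, a difference inside a nested dict goes unnoticed, which is presumably why the reviewer asks when that would be desirable.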

clefourrier · 2024-07-25T12:45:32Z

tests/leaderboards/test_tasks_leaderboard.py

+        subtasks_configs = (
+            filter_dict(task_configs.to_dict(), task_name)
+            if is_multitask
+            else {
+                task_name: {
+                    k: v for k, v in task_configs.to_dict().items() if k != "limit"
+                }
+            }
+        )


The ternary expression is a bit long; an explicit if/else would be easier to read:

    if is_multitask:
        subtasks_configs = {k: v for k, v in task_configs.to_dict().items() if task_name in k}
    else:
        subtasks_configs = {task_name: {k: v for k, v in task_configs.to_dict().items() if k != "limit"}}

This is typically a function which could use more comments inside.

Why are you not removing limit in the first case?

In the multitask scenario, the limit is defined twice in the config: once as an argument for evaluation (which is removed to avoid a mismatch), and once at a different level for output checking. In the first case, it can be used for both.
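The two branches discussed above can be sketched with plain dicts. This is only an illustration under stated assumptions: `filter_dict` is re-implemented here as "keep keys containing the task name" (following the reviewer's suggested rewrite, not the PR's actual helper), `build_subtask_configs` is a hypothetical wrapper, and the config layouts and task names are invented.

```python
from typing import Any, Dict


def filter_dict(config: Dict[str, Any], task_name: str) -> Dict[str, Any]:
    """Hypothetical stand-in: keep only entries whose key contains `task_name`."""
    return {k: v for k, v in config.items() if task_name in k}


def build_subtask_configs(
    config: Dict[str, Any], task_name: str, is_multitask: bool
) -> Dict[str, Any]:
    if is_multitask:
        # Filtering by task name keeps the per-subtask entries. Any top-level
        # key that does not contain the task name (such as "limit") is dropped
        # as a side effect; per-subtask "limit" values are kept.
        return filter_dict(config, task_name)
    # Single task: the config is flat, so "limit" is removed explicitly and
    # the remaining settings are wrapped under the task name.
    return {task_name: {k: v for k, v in config.items() if k != "limit"}}


# Invented example configs.
multi_config = {
    "limit": 10,
    "task_a_sub1": {"limit": 10, "num_fewshot": 3},
    "task_a_sub2": {"limit": 10, "num_fewshot": 3},
}
single_config = {"limit": 10, "num_fewshot": 5}

print(build_subtask_configs(multi_config, "task_a", is_multitask=True))
print(build_subtask_configs(single_config, "task_b", is_multitask=False))
```

Under these assumptions both branches return a mapping from (sub)task name to settings, which is the shape the rest of the test would compare against.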


clefourrier · 2024-07-25T13:07:11Z

tests/leaderboards/utils.py

+
+
+class ConfigParser:
+    """


I still think it would be good to be homogeneous: either use a ConfigParser everywhere, or use nested dicts everywhere (I'm in favor of the latter, as it's less likely to introduce issues).

By "everywhere", do you mean for both the expected results and the generated results?

NathanHB · 2024-07-26T09:45:38Z

tests/leaderboards/test_tasks_leaderboard.py

+    for task_name, expected_n_samples in config.n_samples.items():
+        results = all_results[task_name]
+        observed_n_samples = (
+            results["n-samples"]


Why do we choose the task name if the len is > 1?

NathanHB · 2024-07-26T10:15:27Z

tests/leaderboards/test_tasks_leaderboard.py

+            - Evaluation settings
+            - Expected results for the given configuration
+    """
+    return request.param


load_all_configs returns a list; doesn't this function return a list as well? Or does pytest automatically yield each element of the list?
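On the pytest side of this question: when a fixture is declared with `params=...`, pytest runs the fixture (and every test that uses it) once per element, and `request.param` holds that single element rather than the whole list. The sketch below assumes the fixture in this PR is parametrized that way; the body of `load_all_configs`, the fixture name `config`, and the sample entries are made up for illustration.

```python
from typing import Any, Dict, List

import pytest


def load_all_configs() -> List[Dict[str, Any]]:
    """Hypothetical loader; the real one presumably reads config files."""
    return [
        {"task": "task_a", "num_fewshot": 3},
        {"task": "task_b", "num_fewshot": 5},
    ]


@pytest.fixture(params=load_all_configs())
def config(request) -> Dict[str, Any]:
    # `request.param` is one element of `params`, not the whole list.
    return request.param


def test_config_is_single_dict(config: Dict[str, Any]) -> None:
    # Collected once per entry returned by load_all_configs().
    assert isinstance(config, dict)
    assert "task" in config
```

Run under pytest, this file would collect `test_config_is_single_dict` twice, once for each config.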

NathanHB force-pushed the main branch from 2c5db37 to 4a62757 · July 22, 2024 09:15

KonradSzafer added 6 commits July 23, 2024 09:26

initial commit

a3eede9

explicit recursive checks

8681ab6

recursive refactor and full config

52b758b

configs update

df780af

small refactor and docstrings update

81d1874

comparing only dicts

085f1d0

KonradSzafer force-pushed the add-leaderboard-tasks-tests branch from d6c0a94 to 085f1d0 · July 23, 2024 09:28

KonradSzafer added 8 commits July 24, 2024 12:34

typing and docstrings improvements

faaa2d1

arg name and multitask refactor

a824a67

removed evaluation tracker args

ffa20f4

renamed ConfigParser

2c1ab05

removed ConfigParser from compare_results

3d1c61f

bbh changed to shot=3

80cb9a1

math_hard shot=4

e31d7ea

update mmlu to shot=5

5b5b11c

clefourrier reviewed Jul 25, 2024


clefourrier reviewed Jul 25, 2024


clefourrier reviewed Jul 25, 2024


KonradSzafer added 4 commits July 25, 2024 15:03

transformers version bump

7ac8cac

removed request caching

99db80e

simplified compare_results

abcabff

test device var refactor

3f2fcdf

NathanHB reviewed Jul 26, 2024
