Merge pull request #6 from pfnet-research/v0.2.0
v0.2.0
masanorihirano authored May 7, 2024
2 parents 970629b + e418e81 commit 135f1e4
Showing 2,838 changed files with 623,464 additions and 9,479 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -168,3 +168,6 @@ run.sh
 results2
 results3
 run
+run2
+.venv*
+cache
268 changes: 174 additions & 94 deletions README.md

Large diffs are not rendered by default.

27 changes: 27 additions & 0 deletions UPDATE.md
@@ -0,0 +1,27 @@
# Major Update History and Explanation
If you have any concerns about a major update, you can raise them in an issue. More discussion is welcome, in English or Japanese!

## v0.2.0
In this update, we made the following major changes that affect the evaluation results:
 - Only 0-shot results are employed for the leaderboard.
 - We no longer allow a different prompt selection for each task; the same prompt version is enforced for all tasks.

### Why?

#### Only 0-shot results are employed for the leaderboard

Recently, many leaderboards have adopted 0-shot evaluation, and we follow this trend. On the other hand, some private evaluations of models that support longer prompts still use n-shot (n >> 5) evaluation. However, checking results across a huge number of n-shot patterns is impractical because it requires substantial computational resources. Moreover, testing many n-shot settings invites p-hacking: with enough settings, there is a good chance of accidentally finding an extremely high score. Therefore, we decided to use only 0-shot results for the leaderboard.

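As a rough illustration of the multiple-comparisons concern (a hypothetical simulation, not part of the benchmark code): if a noisy evaluation is run under k different n-shot configurations and only the best score is kept, the reported score inflates as k grows, even when no configuration is genuinely better.

```python
# Hypothetical sketch: best-of-k selection inflates a noisy score.
import numpy as np

rng = np.random.default_rng(0)
true_score, noise = 50.0, 3.0  # assumed per-run score and noise (illustrative)

for k in [1, 5, 20]:
    # 10,000 trials; each trial evaluates k configurations and keeps the best
    best = rng.normal(true_score, noise, size=(10_000, k)).max(axis=1)
    print(f"best of {k:2d} configurations: mean = {best.mean():.1f}")
# The mean reported score rises with k (about 50 -> 53.5 -> 55.6 here)
# although every configuration has the same true score of 50.
```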

#### We no longer allow a different prompt selection for each task

In the previous version, we allowed a different prompt selection for each task. However, this setting is not fair: the prompt selection should not be optimized per task, but should be fixed for the model. Moreover, testing a huge number of prompt-set combinations across tasks is impractical because it requires substantial computational resources, and, as above, it invites p-hacking. Therefore, we decided to enforce the same prompt version for all tasks. However, some prompt versions are missing for some tasks (e.g., chabsa-1.0-0.1.2). In those cases, the most similar available prompt version is used instead (e.g., chabsa-1.0-0.1 is used in place of chabsa-1.0-0.1.2).

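A minimal sketch of this fallback rule (hypothetical; the helper name and version lists are illustrative, not the repository's actual implementation):

```python
# Hypothetical sketch: fall back to the most similar available prompt version.
def pick_prompt_version(requested: str, available: list[str]) -> str:
    """Return `requested` if available, else the available version sharing
    the longest common prefix of dot-separated components."""
    if requested in available:
        return requested

    def shared_prefix_len(a: str, b: str) -> int:
        n = 0
        for x, y in zip(a.split("."), b.split(".")):
            if x != y:
                break
            n += 1
        return n

    return max(available, key=lambda v: shared_prefix_len(requested, v))


# chabsa has no 1.0-0.1.2 prompt, so 1.0-0.1 is used instead:
print(pick_prompt_version("1.0-0.1.2", ["1.0-0.1", "1.0-0.2"]))  # -> 1.0-0.1
```
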
Binary file modified analysis/figs/ave-chabsa.png
Binary file modified analysis/figs/ave-cma_basics.png
Binary file modified analysis/figs/ave-cpa_audit.png
Binary file modified analysis/figs/ave-fp2.png
Binary file modified analysis/figs/ave-security_sales_1.png
127 changes: 86 additions & 41 deletions analysis/generate.py
@@ -10,34 +10,51 @@

 root_path = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

-for file_name in glob.glob("models/*/*/result.json", root_dir=root_path):
-    _, company, modelname, _ = file_name.split("/")
+for dir_name in glob.glob("models/*/*/"):
+    _, company, modelname, _ = dir_name.split("/")
     company_model = company + "/" + modelname
-    with open(file_name, "r") as f:
-        data = json.load(f)
-    results_sum = 0
-    count = 0
-    for task, results in data["results"].items():
-        task_name = task.split("-")[0]
-        result = results.get("f1", results.get("acc"))
-        result = 100 * (result if result else 0)
-        result_dict.setdefault(company_model, {})[task_name] = result
-        if result:
-            results_sum += result
-            count += 1
-    result_dict.setdefault(company_model, {})["Ave."] = (
-        results_sum / count if count > 0 else 0
-    )
+    results_files = [
+        x.replace("harness", "result").replace(".sh", ".json")
+        for x in glob.glob(dir_name + "harness*.sh")
+    ]
+    if sum([not os.path.exists(x) for x in results_files]) > 0:
+        print(f"Skipping {company_model} as results are not available")
+        continue
+    data = {}
+    best_data = {}
+    best_ave = 0.0
+    for results_file in results_files:
+        task_version = results_file.split("/")[-1].split("-", 1)[-1].replace(".json", "")
+        with open(results_file, "r") as f:
+            results_data = json.load(f)
+        results_sum = 0
+        count = 0
+        tmp_result_dict = {}
+        for task, results in results_data["results"].items():
+            task_name = task.split("-")[0]
+            result = results.get("f1,none", results.get("acc,none"))
+            result = 100 * (result if result else 0)
+            tmp_result_dict[task_name] = result
+            if result:
+                results_sum += result
+                count += 1
+        tmp_result_dict["Ave."] = results_sum / count if count > 0 else 0
+        data[task_version] = tmp_result_dict
+        if tmp_result_dict["Ave."] > best_ave:
+            best_ave = tmp_result_dict["Ave."]
+            best_data = tmp_result_dict
+    for task, result in best_data.items():
+        result_dict.setdefault(company_model, {})[task] = result

 average = np.array(list(map(lambda x: x["Ave."], result_dict.values())))
 target = np.array(list(map(lambda x: x["chabsa"], result_dict.values())))


-def curv(x, a, b, c):
-    return c * (1 - np.exp(a * (x - b)))
+def curv(x, a, b, range_min, range_max):
+    return (range_max - range_min) / (1 + np.exp(-a * (x - b))) + range_min


-popt, pocv = curve_fit(curv, average, target, p0=[-0.1, 30, 95])
+popt, pocv = curve_fit(curv, average, target, p0=[-0.1, 30, 100, 20])
 residuals = target - curv(average, *popt)
 rss = np.sum(residuals**2)
 tss = np.sum((target - np.mean(target)) ** 2)
@@ -48,13 +65,13 @@ def curv(x, a, b, c):
 plt.plot(
     dummy_x,
     dummy_y,
-    label=f"${popt[2]:0.2f} "
-    + "\\times"
-    + " (1 - \\exp{("
-    + f"{popt[0]:0.2f}"
+    label=f"$({popt[2]:0.2f} - {popt[3]:0.2f}) "
+    + "/ (1 + \\exp{("
+    + f"{-popt[0]:0.2f}"
     + " \\times (x - "
     + f"{popt[1]:0.2f}"
-    + "))})$\n$R^2="
+    + "))}) + "
+    + f"{popt[3]:0.2f}$\n$R^2="
     + f"{r_squared:0.2f}$",
 )
 plt.xlabel("Ave.")
@@ -68,11 +85,11 @@ def curv(x, a, b, c):
 target = np.array(list(map(lambda x: x["cma_basics"], result_dict.values())))


-def curv(x, a, b):
-    return a * (x - b)
+def curv(x, a, b, range_min, range_max):
+    return (range_max - range_min) / (1 + np.exp(-a * (x - b))) + range_min


-popt, pocv = curve_fit(curv, average, target, p0=[1, 0])
+popt, pocv = curve_fit(curv, average, target, p0=[0.1, 20, 100, 20])
 residuals = target - curv(average, *popt)
 rss = np.sum(residuals**2)
 tss = np.sum((target - np.mean(target)) ** 2)
@@ -83,7 +100,14 @@ def curv(x, a, b):
 plt.plot(
     dummy_x,
     dummy_y,
-    label=f"${popt[0]:0.2f} (x - {popt[1]:0.2f})$\n$R^2=" + f"{r_squared:0.2f}$",
+    label=f"$({popt[2]:0.2f} - {popt[3]:0.2f}) "
+    + "/ (1 + \\exp{("
+    + f"{-popt[0]:0.2f}"
+    + " \\times (x - "
+    + f"{popt[1]:0.2f}"
+    + "))}) + "
+    + f"{popt[3]:0.2f}$\n$R^2="
+    + f"{r_squared:0.2f}$",
 )
 plt.xlabel("Ave.")
 plt.ylabel("cma_basics")
@@ -96,11 +120,11 @@ def curv(x, a, b):
 target = np.array(list(map(lambda x: x["cpa_audit"], result_dict.values())))


-def curv(x, a, b):
-    return a * (x - b)
+def curv(x, a, b, range_min, range_max):
+    return (range_max - range_min) / (1 + np.exp(-a * (x - b))) + range_min


-popt, pocv = curve_fit(curv, average, target, p0=[1, 0])
+popt, pocv = curve_fit(curv, average, target, p0=[0.1, 70, 100, 20])
 residuals = target - curv(average, *popt)
 rss = np.sum(residuals**2)
 tss = np.sum((target - np.mean(target)) ** 2)
@@ -111,7 +135,14 @@ def curv(x, a, b):
 plt.plot(
     dummy_x,
     dummy_y,
-    label=f"${popt[0]:0.2f} (x - {popt[1]:0.2f})$\n$R^2=" + f"{r_squared:0.2f}$",
+    label=f"$({popt[2]:0.2f} - {popt[3]:0.2f}) "
+    + "/ (1 + \\exp{("
+    + f"{-popt[0]:0.2f}"
+    + " \\times (x - "
+    + f"{popt[1]:0.2f}"
+    + "))}) + "
+    + f"{popt[3]:0.2f}$\n$R^2="
+    + f"{r_squared:0.2f}$",
 )
 plt.xlabel("Ave.")
 plt.ylabel("cpa_audit")
@@ -124,11 +155,11 @@ def curv(x, a, b):
 target = np.array(list(map(lambda x: x["fp2"], result_dict.values())))


-def curv(x, a, b):
-    return a * (x - b)
+def curv(x, a, b, range_min, range_max):
+    return (range_max - range_min) / (1 + np.exp(-a * (x - b))) + range_min


-popt, pocv = curve_fit(curv, average, target, p0=[1, 0])
+popt, pocv = curve_fit(curv, average, target, p0=[0.2, 70, 100, 20])
 residuals = target - curv(average, *popt)
 rss = np.sum(residuals**2)
 tss = np.sum((target - np.mean(target)) ** 2)
@@ -139,7 +170,14 @@ def curv(x, a, b):
 plt.plot(
     dummy_x,
     dummy_y,
-    label=f"${popt[0]:0.2f} (x - {popt[1]:0.2f})$\n$R^2=" + f"{r_squared:0.2f}$",
+    label=f"$({popt[2]:0.2f} - {popt[3]:0.2f}) "
+    + "/ (1 + \\exp{("
+    + f"{-popt[0]:0.2f}"
+    + " \\times (x - "
+    + f"{popt[1]:0.2f}"
+    + "))}) + "
+    + f"{popt[3]:0.2f}$\n$R^2="
+    + f"{r_squared:0.2f}$",
 )
 plt.xlabel("Ave.")
 plt.ylabel("fp2")
@@ -152,11 +190,11 @@ def curv(x, a, b):
 target = np.array(list(map(lambda x: x["security_sales_1"], result_dict.values())))


-def curv(x, a, b):
-    return a * (x - b)
+def curv(x, a, b, range_min, range_max):
+    return (range_max - range_min) / (1 + np.exp(-a * (x - b))) + range_min


-popt, pocv = curve_fit(curv, average, target, p0=[1, 0])
+popt, pocv = curve_fit(curv, average, target, p0=[0.1, 20, 100, 20])
 residuals = target - curv(average, *popt)
 rss = np.sum(residuals**2)
 tss = np.sum((target - np.mean(target)) ** 2)
@@ -167,7 +205,14 @@ def curv(x, a, b):
 plt.plot(
     dummy_x,
     dummy_y,
-    label=f"${popt[0]:0.2f} (x + {-popt[1]:0.2f})$\n$R^2=" + f"{r_squared:0.2f}$",
+    label=f"$({popt[2]:0.2f} - {popt[3]:0.2f}) "
+    + "/ (1 + \\exp{("
+    + f"{-popt[0]:0.2f}"
+    + " \\times (x - "
+    + f"{popt[1]:0.2f}"
+    + "))}) + "
+    + f"{popt[3]:0.2f}$\n$R^2="
+    + f"{r_squared:0.2f}$",
 )
 plt.xlabel("Ave.")
 plt.ylabel("security_sales_1")
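
For readers skimming the diff: the update replaces the saturating-exponential and linear fits with a single four-parameter logistic, (range_max - range_min) / (1 + exp(-a * (x - b))) + range_min, fitted for each task against the per-model average. A self-contained sketch of that procedure on synthetic data (illustrative only; not part of the repository):

```python
# Minimal sketch of the v0.2.0 fitting procedure on synthetic data.
import numpy as np
from scipy.optimize import curve_fit


def curv(x, a, b, range_min, range_max):
    # Four-parameter logistic, as in the updated analysis/generate.py
    return (range_max - range_min) / (1 + np.exp(-a * (x - b))) + range_min


rng = np.random.default_rng(0)
average = np.linspace(10, 80, 40)  # stand-in for per-model "Ave." scores
target = curv(average, 0.15, 45, 20, 95) + rng.normal(0, 2, average.size)

popt, pcov = curve_fit(curv, average, target, p0=[0.1, 40, 20, 100])
residuals = target - curv(average, *popt)
r_squared = 1 - np.sum(residuals**2) / np.sum((target - np.mean(target)) ** 2)
print(popt, f"R^2 = {r_squared:0.2f}")
```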
