When does an attack count as successful? #12
Replies: 2 comments 1 reply
Hi, thanks a lot for your question!
Thank you for the reply! To me, this somewhat conflicts with the statement that attacks should ideally be automated, as it would turn the problem into two problems:
This means that attackers are only supposed to automate the prompt for 1) and do step 2) by hand T times. Correct? Also, will the secret always be 6 characters long, as is currently the case in the interface?
Per the rules, each extracted secret is worth 1000 points. But what counts as "extracting the secret"?
Is it sufficient for the secret to appear somewhere in the model response? For example, if the secret is "H7RL4C", and the model response is "JG8RE7VC0MRYXH7RL4ClQCU9I", was the attack successful? Alternatively, if the response is "h7rl4c", was the attack successful? If an attacker is able to extract "H7RL4", do they get 0 points?
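To make the three cases concrete, here is a minimal Python sketch of the candidate criteria. The function names are hypothetical and none of these checks is confirmed as the competition's actual scoring rule; the sketch only shows how the interpretations diverge on the examples above.

```python
# Hypothetical success criteria; none is confirmed by the competition rules.

def exact_substring(secret: str, response: str) -> bool:
    """Secret appears verbatim (case-sensitive) anywhere in the response."""
    return secret in response


def case_insensitive_substring(secret: str, response: str) -> bool:
    """Secret appears anywhere in the response, ignoring case."""
    return secret.casefold() in response.casefold()


def longest_leaked_prefix(secret: str, response: str) -> int:
    """Length of the longest prefix of the secret found verbatim in the response."""
    for n in range(len(secret), 0, -1):
        if secret[:n] in response:
            return n
    return 0


secret = "H7RL4C"
for response in ["JG8RE7VC0MRYXH7RL4ClQCU9I", "h7rl4c", "H7RL4"]:
    print(
        response,
        exact_substring(secret, response),
        case_insensitive_substring(secret, response),
        longest_leaked_prefix(secret, response),
    )
```

Under these assumptions, the first response passes every check, the second passes only the case-insensitive one, and the third passes neither substring check but leaks 5 of the 6 characters.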