Replies: 2 comments
-
Hi! Thanks for asking. Per the competition rules, the filters can only be used to filter, not to generate text that carries the conversation on. So such a defense would not be legal, although it is a nice mitigation. However, note that you are free to reject a conversation if you detect, in the Python filter or the LLM filter, that the defense prompt was leaked; you don't need to continue the regular conversation. This will easily pass the utility evaluation, unless your defense prompt is written in a way that easily leaks in normal conversations.
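As a rough illustration of rejecting a conversation once a leak is detected, here is a minimal sketch. It assumes the check has access to the defense prompt text (for instance, hardcoded into the filter), and the names `leaks_prompt`, `reject_if_leaked`, `REFUSAL`, and the `window` size are hypothetical, not part of the competition interface.

```python
# Hypothetical sketch: reject the conversation when the model output
# repeats a verbatim run of words from the defense prompt.

REFUSAL = "Sorry, I can't continue this conversation."  # hypothetical fixed reply


def _normalize(text: str) -> list[str]:
    """Lowercase and split on whitespace so spacing and case do not matter."""
    return text.lower().split()


def leaks_prompt(model_output: str, defense_prompt: str, window: int = 6) -> bool:
    """Return True if any `window` consecutive prompt words appear in the output.

    Prompts shorter than `window` words never trigger (the loop is empty).
    """
    prompt_words = _normalize(defense_prompt)
    # Pad with spaces so chunks only match on whole-word boundaries.
    output_text = " " + " ".join(_normalize(model_output)) + " "
    for i in range(len(prompt_words) - window + 1):
        chunk = " " + " ".join(prompt_words[i:i + window]) + " "
        if chunk in output_text:
            return True
    return False


def reject_if_leaked(model_output: str, defense_prompt: str) -> str:
    """Replace the output with a fixed refusal if a leak is detected."""
    if leaks_prompt(model_output, defense_prompt):
        return REFUSAL
    return model_output
```

The fixed refusal only ends the conversation; it does not produce new text to carry it on. A small `window` catches more leaks but raises the chance of rejecting benign answers that happen to share a common phrase with the prompt.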
-
Thanks for the reply. I have rewritten my defense to remove that instruction. However, I am now encountering a TogetherAIException when evaluating utility on llama #24. GPT3.5 utility evaluation runs normally. I am still looking into that error. The defense is 65972558b5ba321c2227a5bf. I believe everything remaining conforms to the rules. Please let me know if any other changes are required.
-
Hi, as I understand from Issue #13, it is disallowed to use the LLM filter to generate an output such that the secret is completely out of scope.
I have a question about an idea:
Suppose the defense prompt contains a word that is unlikely to show up in benign conversation.
The Python filter then checks whether this word appears in the model output and, if so, clears the output in the hope of mitigating defense prompt leakage.
That is, the banned word from the prompt behaves as a canary.
Would it then be allowed to use the LLM filter to generate an output using the original prompt, so that the defense can still engage in basic discussion regarding the banned word, making it difficult to identify this word? Note that when the canary does not appear, the model output is not cleared and will be used by the filter.
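For concreteness, the Python-filter half of this idea could be sketched as below. This is only an illustration: the canary word `zyxquorble` is made up, and the filter signature `f(chat_history, model_output, secret)` is my assumption about the competition interface, so please correct me if it differs.

```python
import re

# Hypothetical canary: a made-up word placed in the defense prompt that
# should essentially never occur in a benign conversation.
CANARY = "zyxquorble"

# Whole-word, case-insensitive match so "Zyxquorble!" is caught as well.
_CANARY_RE = re.compile(rf"\b{re.escape(CANARY)}\b", re.IGNORECASE)


def f(chat_history: list[str], model_output: str, secret: str) -> str:
    """Clear the output when the canary appears; otherwise pass it through.

    The signature is an assumption about the filter interface.
    """
    if _CANARY_RE.search(model_output):
        return ""  # likely defense prompt leakage: clear the output
    return model_output  # canary absent: output is left unchanged
```

With this filter, `f([], "The word is Zyxquorble!", "s")` returns an empty string, while an output without the canary is returned as-is and handed to the next filter.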
The specific defense whose legality I would like to double-check is 6591186684c1c719ea4ddda9.
Thanks.