-
Notifications
You must be signed in to change notification settings - Fork 236
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
probe: Past Tense Vulnerability #924
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Shine-afk <[email protected]>
Signed-off-by: Shine-afk <[email protected]>
DCO Assistant Lite bot All contributors have signed the DCO ✍️ ✅ |
Thanks for this! Can you get it to pass tests? |
Signed-off-by: Shine-afk <[email protected]>
Signed-off-by: Shine-afk <[email protected]>
Signed-off-by: Shine-afk <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work!
I wonder about the module name here, while past_tense
describes the end technique I believe a more generic term for the module might be in order. Consider probes.phrasing.PastTense
.
garak/detectors/keywords.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class may fit better in the existing detectors/specialwords.py
. Something like detectors.specialwords.KeywordsFromTL
or maybe detectors.specialwords.TriggerList
for possible reuse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the Prefixes class to specialwords.py.
garak/resources/plugin_cache.json
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for future reference this file does not need to be committed with PRs, automation will maintain this file on main
.
Note this was due to the PR being offered from main
in the fork. For any future PRs please contributed code to come from a unique branch in your repository. This helps protect the process, ensure users are aware of commits on the branch being considered for merge, allows for a location for more commits to be offered without mingling with other contributor changes and allows contributors to make progress while a PR is still being reviewed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I apologize for this. I don't have much experience working with GitHub.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR also cannot remove
this file. If you are comfortable with performing a rebase the commit deleting this file can be removed, otherwise add a new commit to revert the removal. (A rebase on the upstream main would also address the current conflicts with specialwords.py
The original comment was for future reference as any PR from main
in a fork will trigger the automation process that maintains this files content. This is part of why requesting PRs come from a unique branch helps avoid churn.
This is pretty great, and fits many of the garak patterns well, thank you! We'd like to integrate the entire set of past-tense phrases, supplied in the paper repo https://github.com/tml-epfl/llm-past-tense, to expand beyond the set of prompts supplied here. The method from the paper - "To automatically reformulate an arbitrary request, we use GPT-3.5 Turbo with the prompt shown in Table 2 that relies on a few illustrative examples." - can be used, and the prompt is given in their code. They also do a future tense probe. It seems likely that once the past tense code is done, doing the future tense version should be easy. We might be able to pitch in and help finish that one. For eval: this is a bit trickier, it seems to rely on issue #419 , which is slated for completion by end October. I can see from their code that they've used High-level requests:
Also as jmartin-tech notes, if you have a chance to do this dev on a dedicated branch in your repo instead of |
Signed-off-by: Shine-afk <[email protected]>
Signed-off-by: Shine-afk <[email protected]>
Signed-off-by: Shine-afk <[email protected]>
Signed-off-by: Shine-afk <[email protected]>
Signed-off-by: Shine-afk <[email protected]>
Signed-off-by: Shine-afk <[email protected]>
Signed-off-by: Shine-afk <[email protected]>
Signed-off-by: Shine-afk <[email protected]>
Signed-off-by: Shine-afk <[email protected]>
Signed-off-by: Shine-afk <[email protected]>
Signed-off-by: Shine-afk <[email protected]>
Signed-off-by: Shine-afk <[email protected]>
Signed-off-by: Shine-afk <[email protected]>
I removed the lines 'illegal' and 'not legal' from the new detector because they generate a lot of false positives. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for aligning more with the current code. I think this is pretty close to a workable solution.
Some ideas that this PR brings up that might improve on the usability of the techniques this probe is adding:
The paper suggest that source prompts should be rephrased multiple ways to lead to finding a phrasing that elicits the LLM output to bypass restrictions. Some feedback form the author suggest that at many as 20 past tense variation were needed for some initial phrases.
Given this, to get higher value from the a static probe, I can see a dataset probe that has a pre-compiled data set for a small number of existing prompts with 20 past tense permutations and selects a random subset if for each prompt if generations
is less than 20. Another approach might be for the static format probe to have a some DEFAULT_PARAMS
that define number of unique base questions to send permutations based on and combines that with generations to determine the prompt set that would be used.
As a separate future PR, introducing another probe that defaults to have an openai generator it requests rephrasing from could be used for a more dynamic probe.
I also see a possibility to use tenseflow to do a local rephrasing using NLP in a lighter weight semi-offline (it needs a cached model/dictionary like nltk_data) to process present tense prompts. Drawback to this is it might only get us 1 rephrase per present tense prompt, so might not have high
value.
Again in another iteration after the above past tense
capabilities are completed, the dynamic
probes suggested could be enhance to supply future
tense prompts to for each pattern.
garak/detectors/specialwords.py
Outdated
from garak.detectors.base import TriggerListDetector | ||
from garak.attempt import Attempt | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove imports not used directly in this class.
garak/resources/plugin_cache.json
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR also cannot remove
this file. If you are comfortable with performing a rebase the commit deleting this file can be removed, otherwise add a new commit to revert the removal. (A rebase on the upstream main would also address the current conflicts with specialwords.py
The original comment was for future reference as any PR from main
in a fork will trigger the automation process that maintains this files content. This is part of why requesting PRs come from a unique branch helps avoid churn.
@jmartin-tech Where are you on either landing this + expanding later re: conversations with the author, vs. expanding it now + then landing this PR for the full probe? |
I would like the If that looks promising enough all the suggested additional testing patterns can added as future iterations. |
Signed-off-by: Shine-afk <[email protected]>
Alright. Let's consider using the data in Andriushchenko's repo for static data.
Do we want to have this probe running w/ one inference done per prompt, and picking a number of prompts equal to
That works. Will have to move them to
Agree |
Here are all the hits from a run on [ pro: this looks like the detector in the paper, meaning we have some scientific replicability |
Signed-off-by: Shine-afk <[email protected]>
Signed-off-by: Shine-afk <[email protected]>
Signed-off-by: Shine-afk <[email protected]>
I've fixed the detector. Now it produces actual hits. |
My hitlog for GPT-3.5-turbo. |
Signed-off-by: Leon Derczynski <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upstream changes from #916 have moved files like this to the garak/data/
path and should be accessed via import of data_path
:
from garak.data import path as data_path
data_file_path = data_path / "phrasing" / "past_tense.txt"
Simply rephrasing malicious requests in the past tense often allows you to bypass failure mechanisms.