-
Notifications
You must be signed in to change notification settings - Fork 236
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
probes: add ArtPrompt probes #617
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Jeffrey Martin <[email protected]>
Signed-off-by: Jeffrey Martin <[email protected]>
This is looking pretty reasonable. Agree that a probe works well for the case that the paper presents! |
Hi, my team was thinking of building this as a Buff for the Apart Deception Hackathon (but, just saw this pull request already existed!) - is there a list anywhere of what's left to do for it? (E.g, adding configurable safety words?) |
@zazer0, the current implementation for the probe is mostly complete, a plan for configurable prompts will likely be worked on after #602. The primary reason this is still in draft is that work is still needed to generate a better |
* consolidate init in `ArtPrompt` * make `safety_words` and `stub_prompts` default params to allow override * limit imports Signed-off-by: Jeffrey Martin <[email protected]>
bd1c9d6
to
3a7cd32
Compare
Signed-off-by: Jeffrey Martin <[email protected]>
Given the upcoming payloads work on decoupling content from transformation, can it make sense to put this through as a probe in this PR? And defer (in separate PRs): a. making the encoded texts accessible via the payload mechanism; |
Signed-off-by: Jeffrey Martin <[email protected]>
* require `populate_prompt()` for extending class * use private class lists for initial prompts Signed-off-by: Jeffrey Martin <[email protected]>
Fix #535
Implements two prompt obfuscation patterns based on ArtPrompt
Testing work is still in progress here a new possible base case or
detector
specific to technique, may be needed. As currentmitigation.MitigationBypass
detector does not quite cover the returned values when the model is not able to infer the masked word.Example Usage pattern:
The probe pattern could be enhanced to be provided a dataset of prompts to be augmented with a dictionary of unsafe words often blocked by safety training that can be easily maintained as a set of resource files.