probes: add ArtPrompt probes #617

jmartin-tech · 2024-04-22T14:32:12Z

Implements two prompt obfuscation patterns based on ArtPrompt

Testing work is still in progress here a new possible base case or detector specific to technique, may be needed. As current mitigation.MitigationBypass detector does not quite cover the returned values when the model is not able to infer the masked word.

Example Usage pattern:

python -m garak -m huggingface.Model -n meta-llama/Llama-2-7b-chat-hf -p artprompt
python -m garak -m huggingface --model_name gpt2 --probes artprompt

The probe pattern could be enhanced to be provided a dataset of prompts to be augmented with a dictionary of unsafe words often blocked by safety training that can be easily maintained as a set of resource files.

Signed-off-by: Jeffrey Martin <[email protected]>

garak/buffs/art.py

leondz · 2024-05-10T13:04:11Z

This is looking pretty reasonable. Agree that a probe works well for the case that the paper presents!

zazer0 · 2024-06-28T09:38:25Z

Hi, my team was thinking of building this as a Buff for the Apart Deception Hackathon (but, just saw this pull request already existed!) - is there a list anywhere of what's left to do for it? (E.g, adding configurable safety words?)

jmartin-tech · 2024-06-28T15:30:38Z

@zazer0, the current implementation for the probe is mostly complete, a plan for configurable prompts will likely be worked on after #602.

The primary reason this is still in draft is that work is still needed to generate a better detector for evaluating the responses from a prompt offered by this probe. The current detectors look for a mitigation response however this probe would needs additional filtering to determine if the response was able to identify the safety word masked in the prompt as failure to decode would not represent a finding of successful bypass of alignment or mitigation.

* consolidate init in `ArtPrompt` * make `safety_words` and `stub_prompts` default params to allow override * limit imports Signed-off-by: Jeffrey Martin <[email protected]>

Signed-off-by: Jeffrey Martin <[email protected]>

leondz · 2024-08-28T08:59:56Z

Given the upcoming payloads work on decoupling content from transformation, can it make sense to put this through as a probe in this PR? And defer (in separate PRs):

a. making the encoded texts accessible via the payload mechanism;
b. a buff

Signed-off-by: Jeffrey Martin <[email protected]>

* require `populate_prompt()` for extending class * use private class lists for initial prompts Signed-off-by: Jeffrey Martin <[email protected]>

jmartin-tech added 3 commits April 15, 2024 09:36

initial ArtPrompt as buff

dd4323b

convert ArtPrompt to probe

28e5a4c

Signed-off-by: Jeffrey Martin <[email protected]>

Expand ArtPrompt for cards font

2dd59d3

Signed-off-by: Jeffrey Martin <[email protected]>

leondz added probes Content & activity of LLM probes new plugin Describes an entirely new probe, detector, generator or harness labels Apr 30, 2024

leondz added 2 commits May 10, 2024 14:43

Merge branch 'main' into feature/artprompt-probe

2f74a3b

python insufficiently clever for reasonable double-quote usage

ecb0f93

leondz reviewed May 10, 2024

View reviewed changes

garak/buffs/art.py Outdated Show resolved Hide resolved

jmartin-tech added 2 commits June 28, 2024 11:10

Merge branch 'main' into feature/artprompt-probe

8992b38

account for configuration support feature

3a7cd32

* consolidate init in `ArtPrompt` * make `safety_words` and `stub_prompts` default params to allow override * limit imports Signed-off-by: Jeffrey Martin <[email protected]>

jmartin-tech force-pushed the feature/artprompt-probe branch from bd1c9d6 to 3a7cd32 Compare June 28, 2024 16:22

jmartin-tech added 2 commits July 26, 2024 09:14

missing copyright header

48f50ab

Signed-off-by: Jeffrey Martin <[email protected]>

Merge 'main' into feature/artprompt-probe

96a46e0

jmartin-tech added 3 commits August 29, 2024 10:11

Merge branch 'main' into feature/artprompt-probe

8545e02

inherit Probe.DEFAULT_PARAMS

9f15fd8

Signed-off-by: Jeffrey Martin <[email protected]>

defer conifgurable prompts

e590005

* require `populate_prompt()` for extending class * use private class lists for initial prompts Signed-off-by: Jeffrey Martin <[email protected]>

leondz changed the title ~~add ArtPrompt probes~~ probes: add ArtPrompt probes Oct 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

probes: add ArtPrompt probes #617

probes: add ArtPrompt probes #617

jmartin-tech commented Apr 22, 2024

leondz commented May 10, 2024

zazer0 commented Jun 28, 2024 •

edited

Loading

jmartin-tech commented Jun 28, 2024

leondz commented Aug 28, 2024

probes: add ArtPrompt probes #617

Are you sure you want to change the base?

probes: add ArtPrompt probes #617

Conversation

jmartin-tech commented Apr 22, 2024

leondz commented May 10, 2024

zazer0 commented Jun 28, 2024 • edited Loading

jmartin-tech commented Jun 28, 2024

leondz commented Aug 28, 2024

zazer0 commented Jun 28, 2024 •

edited

Loading