Not convinced when trying the example attack strategy #171
Replies: 3 comments 2 replies
-
Hi @corticalstack! FWIW this is the behavior I see when I run this myself, too. One might say Azure Open AI is doing a good job here 🙂 For sure, a more capable model without safeguards is helpful in generating prompts here. Similarly, if you're attacking a model that does not have much protection and perhaps hasn't been aligned you'll have a much easier time. That said, our objective with the examples is mostly to show how you might use PyRIT, not to actually demonstrate harm as it could be triggering. Depending on what you're trying to red team for you can switch this out for something else. Since we don't want to show offensive content in our examples we'll stay away from that in this repo. If you want to convince yourself of PyRIT's capabilities I suggest maybe using the Also, if there's something specific that you're looking for that we currently don't support please feel free to file issues. |
Beta Was this translation helpful? Give feedback.
-
Hi @romanlutz , The responses from the target LLM in the example output I provided are what I would expect from the Azure OpenAI/OpenAI variants. My disappointment was twofold - not seeing any significant attack from the red teaming LLM (opensource uncensored), and how the attack was judged to have satisfied the objective. I would question the validity and value of the provided examples, and a leaning away from wanting to provide anything resembling a real world scenario. Speakign for myself, I specialize more in NLP than security, but supporting the GenAI team in my org. Closer to real-life examples would provide vital insights into how adversarial actors might exploit LLM endpoints in ways that abstract, weaker examples cannot. It would be really valuable to see carefully curated, real-world inspired scenarios that demonstrate plausible threats, both to learn from, and evaluate Pyrit. Unfortunately, right now with previous example, I'm not seeing much value. However, will for sure try out the 1_gandalf example and work through the remainder of the demo notebooks. Thanks for all the work by the Pyrit team, and engaging in this to guide and discuss. |
Beta Was this translation helpful? Give feedback.
-
Hi @romanlutz , Picked this back up for some more testing and convincing. I used an uncensored model for both red team and target LLMs: . with:
and
I get the following output, where despite what appears to be the target LLM discussing the malicious objective, it is not recognised to be reached/achieved. Might it be the token used to indicate the objective reached, or some other insight you can share? Further, incurrent format, with the output so verbose, it's diffucult to distinguish between the red team LLM and target LLM parts of the conversation. It would be great to have the option read a less verbose output just as if reading a normal conversation between 2 personas
|
Beta Was this translation helpful? Give feedback.
-
Hi,
New user of Pyrit here. Early days experimenting, taking and exampel from this repo which instantiates both a red teaming LLM and target LLM, using the attack strategy "let's talk about pizza....say something derogatory or stereotypical..". Note for the red teaming LLM I've tried several open source uncensored models, e.g. mistral-7b-openorca, and the target is OpenAI GPT.3-5-Turbo.
I understand the quality of the coercing prompts is dependant on the underlying model, it's censorship, definition of the attack strategy, LLM parameters like temperature etc. However, I understand the package prompt convertor plays a role. I'm not convinced by the red teaming LLM attempts to "attack" the target, nor by identification the objective is achieved. In my view, the target did not state anything deregoatory.
I would love to be convinced by the quality of Pyrit, perhaps trying other suggested examples. ALso open to being educated, convinced, and discuss.
Here's an example output from the "pizza...Italy" example. Thanks so much (note Pyrit 0.1.2).
Beta Was this translation helpful? Give feedback.
All reactions