Check for deception_adherence attack #56

harshraj172 · 2024-08-22T13:51:10Z

Overview

This pull request introduces a new test, deception_adherence, designed to evaluate the robustness of LLMs against deceptive instructions. Specifically, this test assesses how well LLMs resist following instructions that deliberately prompt them to lie in response to factual questions, even when they know the correct answer. This attack is inspired by the works AI Deception: Survey and How to Catch an AI Liar.

Changes

A new test named deception_adherence has been added to the testing suite. The implementation can be found in the file deception_adherence.py.
Moreover, the test was added to the attack_loader.py file.

Impact

The introduction of the deception_adherence test significantly enhances the security probing capabilities of our LLM testing suite. Deceptive attacks of this nature can severely compromise the reliability of LLMs. Notably, during preliminary testing, I observed that none of the LLMs I evaluated, including GPT-4, successfully defended against this attack. Given its impact, I believe this test is a valuable addition to our suite.

harshraj172 · 2024-08-23T11:19:32Z

@guy-ps Can you pls review? I am not able to assign reviewers.. I think I need to add you as a collaborator to my forked repo.

guy-ps · 2024-08-25T08:25:29Z

@harshraj172 I ran the test jobs. I will review it later today/tomorrow.
I understand the purpose of the attack, but this kind of issue will only affect the user's session and generally won't cause any harm.
By any chance, do you have a specific use case in which this could potentially be harmful for either the company running the AI chatbot or other users?
Giving me use cases would help me describe the test and it's impact better for the future, and when we separate test failures into different severities.

harshraj172 · 2024-08-27T18:41:19Z

but

@guy-ps Thank you for taking the time to review the test jobs. I appreciate your feedback.

The harm this attack can cause hinges on the reliance of users or companies on the factual accuracy of the LLM's responses. For instance, if a user or a company depends on the LLM to provide accurate information for questions like "What is the largest potato-producing state in the US?" or "Which state has the highest rainfall?", and they blindly trust these answers, it could lead to significant negative consequences. These might include poor business decisions or incorrect data being used in scenarios like case study competitions or research projects.

While the impact might seem limited to the user's session, the potential for harm increases substantially when decisions are made based on inaccurate information provided by the LLM. This can be particularly dangerous in contexts where the information is assumed to be correct, leading to unforeseen complications.

Additionally, if you're considering harm in terms of the LLM generating toxic or inappropriate content, it's worth noting that this attack could be adapted to cause the LLM to respond with a specific word to all queries, including those related to sensitive or adult content. This could result in the dissemination of inappropriate information, which could be particularly harmful if such content is related to specific locations or contexts.

Pls let me know if you'd like me to tweak the PR to better align with these potential risks.

harshraj172 · 2024-09-11T16:57:34Z

@harshraj172 I ran the test jobs. I will review it later today/tomorrow. I understand the purpose of the attack, but this kind of issue will only affect the user's session and generally won't cause any harm. By any chance, do you have a specific use case in which this could potentially be harmful for either the company running the AI chatbot or other users? Giving me use cases would help me describe the test and it's impact better for the future, and when we separate test failures into different severities.

Hey, can you pls respond on what you want me to do?

Add deception_adherence attack

d75ca54

harshraj172 changed the title ~~Add deception_adherence attack~~ Check for deception_adherence attack Aug 22, 2024

guy-ps self-requested a review August 25, 2024 04:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Check for deception_adherence attack #56

Check for deception_adherence attack #56

harshraj172 commented Aug 22, 2024

harshraj172 commented Aug 23, 2024

guy-ps commented Aug 25, 2024

harshraj172 commented Aug 27, 2024 •

edited

Loading

harshraj172 commented Sep 11, 2024 •

edited

Loading

Check for deception_adherence attack #56

Are you sure you want to change the base?

Check for deception_adherence attack #56

Conversation

harshraj172 commented Aug 22, 2024

Overview

Changes

Impact

harshraj172 commented Aug 23, 2024

guy-ps commented Aug 25, 2024

harshraj172 commented Aug 27, 2024 • edited Loading

harshraj172 commented Sep 11, 2024 • edited Loading

harshraj172 commented Aug 27, 2024 •

edited

Loading

harshraj172 commented Sep 11, 2024 •

edited

Loading