Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check for deception_adherence attack #56

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

harshraj172
Copy link

Overview

This pull request introduces a new test, deception_adherence, designed to evaluate the robustness of LLMs against deceptive instructions. Specifically, this test assesses how well LLMs resist following instructions that deliberately prompt them to lie in response to factual questions, even when they know the correct answer. This attack is inspired by the works AI Deception: Survey and How to Catch an AI Liar.

Changes

  1. A new test named deception_adherence has been added to the testing suite. The implementation can be found in the file deception_adherence.py.
  2. Moreover, the test was added to the attack_loader.py file.

Impact

The introduction of the deception_adherence test significantly enhances the security probing capabilities of our LLM testing suite. Deceptive attacks of this nature can severely compromise the reliability of LLMs. Notably, during preliminary testing, I observed that none of the LLMs I evaluated, including GPT-4, successfully defended against this attack. Given its impact, I believe this test is a valuable addition to our suite.

@harshraj172 harshraj172 changed the title Add deception_adherence attack Check for deception_adherence attack Aug 22, 2024
@harshraj172
Copy link
Author

@guy-ps Can you pls review? I am not able to assign reviewers.. I think I need to add you as a collaborator to my forked repo.

@guy-ps guy-ps self-requested a review August 25, 2024 04:32
@guy-ps
Copy link
Contributor

guy-ps commented Aug 25, 2024

@harshraj172 I ran the test jobs. I will review it later today/tomorrow.
I understand the purpose of the attack, but this kind of issue will only affect the user's session and generally won't cause any harm.
By any chance, do you have a specific use case in which this could potentially be harmful for either the company running the AI chatbot or other users?
Giving me use cases would help me describe the test and it's impact better for the future, and when we separate test failures into different severities.

@harshraj172
Copy link
Author

harshraj172 commented Aug 27, 2024

but

@guy-ps Thank you for taking the time to review the test jobs. I appreciate your feedback.

The harm this attack can cause hinges on the reliance of users or companies on the factual accuracy of the LLM's responses. For instance, if a user or a company depends on the LLM to provide accurate information for questions like "What is the largest potato-producing state in the US?" or "Which state has the highest rainfall?", and they blindly trust these answers, it could lead to significant negative consequences. These might include poor business decisions or incorrect data being used in scenarios like case study competitions or research projects.

While the impact might seem limited to the user's session, the potential for harm increases substantially when decisions are made based on inaccurate information provided by the LLM. This can be particularly dangerous in contexts where the information is assumed to be correct, leading to unforeseen complications.

Additionally, if you're considering harm in terms of the LLM generating toxic or inappropriate content, it's worth noting that this attack could be adapted to cause the LLM to respond with a specific word to all queries, including those related to sensitive or adult content. This could result in the dissemination of inappropriate information, which could be particularly harmful if such content is related to specific locations or contexts.

Pls let me know if you'd like me to tweak the PR to better align with these potential risks.

@harshraj172
Copy link
Author

harshraj172 commented Sep 11, 2024

@harshraj172 I ran the test jobs. I will review it later today/tomorrow. I understand the purpose of the attack, but this kind of issue will only affect the user's session and generally won't cause any harm. By any chance, do you have a specific use case in which this could potentially be harmful for either the company running the AI chatbot or other users? Giving me use cases would help me describe the test and it's impact better for the future, and when we separate test failures into different severities.

Hey, can you pls respond on what you want me to do?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants