-
Notifications
You must be signed in to change notification settings - Fork 57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Check for deception_adherence attack #56
base: main
Are you sure you want to change the base?
Conversation
@guy-ps Can you pls review? I am not able to assign reviewers.. I think I need to add you as a collaborator to my forked repo. |
@harshraj172 I ran the test jobs. I will review it later today/tomorrow. |
@guy-ps Thank you for taking the time to review the test jobs. I appreciate your feedback. The harm this attack can cause hinges on the reliance of users or companies on the factual accuracy of the LLM's responses. For instance, if a user or a company depends on the LLM to provide accurate information for questions like "What is the largest potato-producing state in the US?" or "Which state has the highest rainfall?", and they blindly trust these answers, it could lead to significant negative consequences. These might include poor business decisions or incorrect data being used in scenarios like case study competitions or research projects. While the impact might seem limited to the user's session, the potential for harm increases substantially when decisions are made based on inaccurate information provided by the LLM. This can be particularly dangerous in contexts where the information is assumed to be correct, leading to unforeseen complications. Additionally, if you're considering harm in terms of the LLM generating toxic or inappropriate content, it's worth noting that this attack could be adapted to cause the LLM to respond with a specific word to all queries, including those related to sensitive or adult content. This could result in the dissemination of inappropriate information, which could be particularly harmful if such content is related to specific locations or contexts. Pls let me know if you'd like me to tweak the PR to better align with these potential risks. |
Hey, can you pls respond on what you want me to do? |
Overview
This pull request introduces a new test,
deception_adherence
, designed to evaluate the robustness of LLMs against deceptive instructions. Specifically, this test assesses how well LLMs resist following instructions that deliberately prompt them to lie in response to factual questions, even when they know the correct answer. This attack is inspired by the works AI Deception: Survey and How to Catch an AI Liar.Changes
deception_adherence
has been added to the testing suite. The implementation can be found in the file deception_adherence.py.Impact
The introduction of the
deception_adherence
test significantly enhances the security probing capabilities of our LLM testing suite. Deceptive attacks of this nature can severely compromise the reliability of LLMs. Notably, during preliminary testing, I observed that none of the LLMs I evaluated, includingGPT-4
, successfully defended against this attack. Given its impact, I believe this test is a valuable addition to our suite.