-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
The system can be manipulated by user input to always return full marks and predefined feedback #4
Comments
https://www.ibm.com/blog/prevent-prompt-injection/ There some other solutions like Delimiter A second Agent could possible improve stuff like // Logic for valid Output to prevent /improve something like getting full points for one word as well. |
Excellent, can I ad your name to the changelog as someone who has given advice and made a suggestion. It will appear in this file. https://github.com/marcusgreen/moodle-qtype_aitext/blob/main/changelog.md |
Feel free to add my name to the changelog. Alexander Mikasch of Moodle.NRW (https://moodlenrw.de/) If I find time I will introduce the qtype to my team and inspect it in more detail. Good work! |
Thanks Alex, I have been giving it a lot of thought since yesterday. Can you email me at marcusavgreen at gmail.com |
I have created a new branch It is a first attempt at the code and would benefit from refining. |
Why you don't use the open AI assistent api. With this api the user can't "escape" from the user context and change the system context. |
Hi Juttas, my apologies for not getting back tyo you about the open AI Assistant api. That looks very interesting and I will investigate further. |
No problem, but i like AI Text (an we use ist often). In the Open AI Playground you can easy create an assistant with a system context and chat with this assistant, so you can try if the basic requirements are work for AI Text. |
Well in the past I played around again with the open-ai api, did you know that the api für gpt-4o supports to set a system context that ca'nt be manipulated by the user. messages=[
{
"role": "system",
"content": [
{
"type": "text",
"text": "Prompt for the System context\n"
}
]
},
{
"role": "user",
"content": [
{
"type": "text",
"text": "Text from the user input"
}
]
}
]
|
I was not aware of the idea of a system concept which I think is interesting. i want to ensure that the code will work with systems other than OpenAI, but I was wondering if it would be possible for me to add that as an option while still allowing it to work with other AI Inference systems (e.g. Groq cloud with the Llama models etc) |
I recently looked at a LLM model specifically designed to deal with prompt injection. I had the idea that it would be possible to use it with a pre-call to detect injection and then it would pass on the full prompt to the "real" model. Unfotunatly it didn't seem to work in any way at all so far as I could understand. However should I find such a model it is something I will investigate further. This was the model. withsecure/llama3-8b-prompt-injection |
Issue
Describe the bug
The system can be manipulated by user input to always return full marks and predefined feedback, which compromises the integrity of the automated grading process. The user can input a specific prompt that causes the LLM to disregard all previous inputs and system prompts, resulting in a JSON object that always gives full marks.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The system should correctly evaluate the user's input based on the provided criteria and marking scheme, without being influenced by manipulation prompts. The LLM should not be able to be overridden by specific user inputs that force it to give full marks.
Screenshots
Desktop (please complete the following information):
-all
Additional context
This issue allows users to bypass the intended grading mechanism, resulting in unfair assessments and undermining the reliability of the automated grading process. Implementing stricter input validation and prompt handling can help prevent this exploitation.
Yes, the problem could potentially be solved using two agents. The idea is that the first agent processes the user's input and generates a preliminary score and feedback. The second agent then reviews the output of the first agent for manipulation attempts and ensures that the feedback and score adhere to the expected criteria. Here's an overview of how this could work:
Solution with Two Agents
First Agent (Scoring and Feedback Generator):
Second Agent (Validation and Security Check):
Example Workflow
User Input:
First Agent:
Second Agent:
Pseudocode
Here is a simplified pseudocode example of how this could be implemented: JUST A GETTING STARTED IDEA
This implementation provides an additional layer of security and ensures that the assessments are fair and accurate.
Ofc its just an example that you get the idea.
The text was updated successfully, but these errors were encountered: