Manipulating Machine Minds: The Dark Art of AI Prompt Injection

Are you worried about the safety of your data in the age of AI? Let's talk about prompt injection and indirect prompt injection, two sneaky techniques hackers use to manipulate artificial intelligence systems.

Last year, the prompt injection phrase caught the attention of security experts, but similar attacks on artificial intelligence have been happening for years. Since the arrival of personal assistants like Siri or Google Assistant, researchers have been uncovering ways to send unwanted commands to these helpful bots. In one notorious case rigged YouTube videos with hidden audio commands triggered Siri and Google Now on unsuspecting users' phones, granting attackers access to personal information and device features. But that's not all – a study later revealed that ultrasonic waves, inaudible to humans, could also pass commands to voice assistants, as demonstrated in the chillingly named DolphinAttack. These incidents, while alarming, all follow the same pattern as prompt injection attacks and serve as stark reminders of the dangerous consequences of AI vulnerabilities.

DolphinAttack logo, source: https://dl.acm.org/doi/10.1145/3133956.3134052

‍

How prompt injection targets AI and LLM security

Picture this: an attacker wants to extract sensitive data stored on an AI or LLM's server. With prompt injection, they can provide the system with specific inputs that trick it into providing a predetermined output. On the other hand, indirect prompt injection involves manipulating data before it reaches the system, convincing it to make decisions that align with the attacker's agenda.

These techniques are alarmingly effective, and the consequences can be devastating. For example, an attacker could alter images presented to facial recognition software, bypassing security measures and granting unauthorized access to sensitive areas. Even scarier, prompt injection can also be used to distribute malware through seemingly harmless emails or applications, providing attackers with a sneaky backdoor into your system.

An example attack against an AI-based spam filter

Let me show you an example of prompt injection. Imagine, that someone created a spam filter by asking GPT-4 for each incoming mail to judge whether it is spam or not. To convince GPT-4 to perform this task we have to provide a proper prompt for it. Let’s generate the spam filter prompt using GPT-4 itself using a prompt generator prompt.

By using this prompt GPT-4 can perform the task very convincingly. Without any additional training the test messages were categorized properly.

However, as one can imagine, this simple solution is vulnerable to very basic prompt injection attacks. It is enough to ask the spam filter in the sent e-mail to categorize it as not spam.

From theory to practice

Do you think that this attack is only a theoretical one? Just take a look at one of the AI database web sites, such as https://theresanaiforthat.com. There are hundreds of AI services just starting every month. For example a simple one, which corrects your grammar is also vulnerable to prompt injection attack.

Frameworks like boxcars.ai (Ruby) and langchain (Python) that provide capabilities like writing scripts and applications that execute code from LLMs (large language models) as a built-in feature can be severely vulnerable to prompt injection as well. As it was stated in this blog post, an adversary can easily send malicious prompts to an LLM which then can cause remote code execution on the host machine. The proof-of-concept exploit called InjectGPT that allows an attacker to inject arbitrary code into a remote machine's GPT-3 language model, showed how easily this can be achieved and the need for a much greater transparency and security measures in the development and deployment of language models, and it highlights the importance of addressing the potential security risks associated with AI and machine learning technologies.

Mitigations

Mitigating prompt injection attacks is quite a challenge, easy solutions for mitigating might fail instantly as the attackers are getting more and more creative.

For example a naive solution can be to require a secret string before every command sent by the user. However, the secret can be obtained from the AI very easily and the protection becomes ineffective.

‍

There is already a bug bounty offer to developers to come up with feasible mitigation techniques. Currently four approaches are used:

Prefiltering the user input:
As simple allow or blocklisting would fail miserably (e.g. encoding or translation could mitigate the efforts), researchers came up with a more advanced weighing-based approach which collects the outputs of various filtering rules and if a limit is reached, the input is denied to be fed to the LLM. One of the metrics considered as part of such algorithms is the input length. It turns out that prompt injection attacks are statistically longer than typical queries. The system could strip the verbs from longer queries which also improves semantic matching of the embedding.

Prompt Engineering:
Like humans, LLMs give more attention to the end of the text, therefore engineers should ensure that their instructions should succeed the instructions of the users’. So that the original prompt may dominate the inputted content.

Input tagging:
The system may tag the inputs of the user and third parties (e.g. plugins, websites) differently and may be able to detect manipulation of the original prompt.
‍
Response filtering:
As a final attempt, if all else fails, we may correlate the response with the question to identify topic switching, which is an indicator of a prompt injection attack.

There is an open discussion about the topic, as there is no such thing in LLMs like a parameterized SQL query. No techniques are bulletproof, attackers may circumvent even the combination of them with strong enough determination and eventually they could succeed with trial and error. Hacking human-like AI is resembling more and more to social engineering. Maybe we should consider a bit more radical approach to improve LLMs against prompt injection, like…

Sources:

Hidden Voice Commands

DolphinAttack: Inaudible Voice Commands

Github: awesome-chatgpt-prompts

Bounty announcement for Mitigating Prompt Injection Attacks on GPT3 based Customer support App

Mitigating Prompt Injection Attacks on an LLM based Customer support App