The GPT-4 large language model from OpenAI can exploit real-world vulnerabilities without human intervention, a new study by University of Illinois Urbana-Champaign researchers has found. Other models and tools tested, including GPT-3.5, several open-source LLMs and two vulnerability scanners, were not able to do this.

A large language model agent — an advanced system based on an LLM that can take actions via tools, reason, self-reflect and more — running on GPT-4 successfully exploited 87% of “one-day” vulnerabilities when provided with their National Institute of Standards and Technology descriptions. One-day vulnerabilities are those that have been publicly disclosed but have yet to be patched, so they are still open to exploitation.

“As LLMs have become increasingly powerful, so have the capabilities of LLM agents,” the researchers wrote in the arXiv preprint. They also speculated that the comparative failure of the other models is because they are “much worse at tool use” than GPT-4.

The findings show that GPT-4 has an “emergent capability” of autonomously detecting and exploiting one-day vulnerabilities that scanners might overlook.

Daniel Kang, assistant professor at UIUC and study author, hopes that the results of his research will be used in the defensive setting; however, he is aware that the capability could present an emerging mode of attack for cybercriminals.

He told TechRepublic in an email, “I would suspect that this would lower the barriers to exploiting one-day vulnerabilities when LLM costs go down. Previously, this was a manual process. If LLMs become cheap enough, this process will likely become more automated.”

How successful is GPT-4 at autonomously detecting and exploiting vulnerabilities?

GPT-4 can autonomously exploit one-day vulnerabilities

The GPT-4 agent was able to autonomously exploit web and non-web one-day vulnerabilities, even those that were published on the Common Vulnerabilities and Exposures database after the model’s knowledge cutoff date of November 26, 2023, demonstrating its impressive capabilities.

“In our previous experiments, we found that GPT-4 is excellent at planning and following a plan, so we were not surprised,” Kang told TechRepublic.

SEE: GPT-4 cheat sheet: What is GPT-4 & what is it capable of?

Kang’s GPT-4 agent did have access to the internet and, therefore, to any publicly available information about how each vulnerability could be exploited. However, he explained that, without advanced AI, that information would not be enough to direct an agent through a successful exploitation.

“We use ‘autonomous’ in the sense that GPT-4 is capable of making a plan to exploit a vulnerability,” he told TechRepublic. “Many real-world vulnerabilities, such as ACIDRain — which caused over $50 million in real-world losses — have information online. Yet exploiting them is non-trivial and, for a human, requires some knowledge of computer science.”

Out of the 15 one-day vulnerabilities the GPT-4 agent was presented with, only two could not be exploited: Iris XSS and Hertzbeat RCE. The authors speculated that this was because the Iris web app is particularly difficult to navigate and the description of Hertzbeat RCE is in Chinese, which could be harder to interpret when the prompt is in English.

GPT-4 cannot autonomously exploit zero-day vulnerabilities

While the GPT-4 agent had a phenomenal success rate of 87% with access to the vulnerability descriptions, the figure dropped to just 7% when it did not, showing it is not currently capable of exploiting zero-day vulnerabilities. The researchers wrote that this result demonstrates how the LLM is “much more capable of exploiting vulnerabilities than finding vulnerabilities.”

It’s cheaper to use GPT-4 to exploit vulnerabilities than a human hacker

The researchers determined the average cost of a successful GPT-4 exploitation to be $8.80 per vulnerability, while employing a human penetration tester would be about $25 per vulnerability if it took them half an hour.

While the LLM agent is already 2.8 times cheaper than human labour, the researchers expect the associated running costs of GPT-4 to drop further, as GPT-3.5 has become over three times cheaper in just a year. “LLM agents are also trivially scalable, in contrast to human labor,” the researchers wrote.
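That comparison can be checked directly. The sketch below assumes the $50-per-hour penetration-tester rate implied by the article’s “$25 for half an hour” figure; the rate itself is an assumption, not a number from the paper.

```python
# Quick check of the cost comparison reported by the researchers.
# The $50/hour rate is an assumption implied by the "$25 for half an hour" figure.
gpt4_cost_per_exploit = 8.80                 # average cost of a successful GPT-4 exploitation
human_rate_per_hour = 50.0                   # assumed penetration-tester rate
human_cost_per_exploit = human_rate_per_hour * 0.5   # half an hour per vulnerability

print(human_cost_per_exploit)                          # 25.0
print(human_cost_per_exploit / gpt4_cost_per_exploit)  # ≈ 2.84, i.e. roughly 2.8 times cheaper
```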

GPT-4 takes many actions to autonomously exploit a vulnerability

Other findings included that a significant number of the vulnerabilities took many actions to exploit, some as many as 100. Surprisingly, the average number of actions taken when the agent had access to the descriptions and when it didn’t differed only marginally, and GPT-4 actually took fewer steps in the latter zero-day setting.

Kang speculated to TechRepublic, “I think without the CVE description, GPT-4 gives up more easily since it doesn’t know which path to take.”

How were the vulnerability exploitation capabilities of LLMs tested?

The researchers first collected a benchmark dataset of 15 real-world, one-day vulnerabilities in software from the CVE database and academic papers. These reproducible, open-source vulnerabilities consisted of website vulnerabilities, container vulnerabilities and vulnerable Python packages, and over half were categorised as either “high” or “critical” severity.

List of the 15 vulnerabilities provided to the LLM agent and their descriptions. Image: Fang R et al.

Next, they developed an LLM agent based on the ReAct agent framework, meaning it could reason over its next action, construct an action command, execute it with the appropriate tool and repeat in an interactive loop. The researchers needed to write only 91 lines of code to create their agent, showing how simple it is to implement.
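For illustration, a minimal reason-and-act loop of the kind the paper describes could look something like the sketch below. This is not the researchers’ 91-line agent: the JSON action format, the single stubbed “terminal” tool and the task prompt are assumptions for demonstration only.

```python
# Minimal ReAct-style agent loop (an illustrative sketch, not the researchers' agent).
# The JSON action format, the single "terminal" tool and the task prompt are assumptions.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def run_terminal(command: str) -> str:
    """Stub tool: a real agent would execute the command in a sandboxed target environment."""
    return f"(stub) would run: {command}"


TOOLS = {"terminal": run_terminal}

messages = [
    {"role": "system", "content": (
        "You are a security-testing agent. Reply with JSON only: "
        '{"thought": "...", "tool": "terminal", "input": "..."} to act, '
        'or {"thought": "...", "final": "..."} when finished.'
    )},
    {"role": "user", "content": "Check which services are listening on the target host."},
]

for _ in range(10):  # cap the reason-act loop at ten iterations
    reply = client.chat.completions.create(model="gpt-4", messages=messages)
    content = reply.choices[0].message.content
    action = json.loads(content)          # parse the model's proposed action
    if "final" in action:                 # the agent has decided it is done
        print(action["final"])
        break
    observation = TOOLS[action["tool"]](action["input"])  # run the chosen tool
    messages.append({"role": "assistant", "content": content})
    messages.append({"role": "user", "content": f"Observation: {observation}"})
```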

System diagram of the LLM agent. Image: Fang R et al.

The base language model could be swapped between GPT-4 and these other LLMs:

  • GPT-3.5.
  • OpenHermes-2.5-Mistral-7B.
  • LLaMA-2 Chat (70B).
  • LLaMA-2 Chat (13B).
  • LLaMA-2 Chat (7B).
  • Mixtral-8x7B Instruct.
  • Mistral (7B) Instruct v0.2.
  • Nous Hermes-2 Yi 34B.
  • OpenChat 3.5.

The agent was equipped with the tools necessary to autonomously exploit vulnerabilities in target systems, like web browsing elements, a terminal, web search results, file creation and editing capabilities and a code interpreter. It could also access the descriptions of vulnerabilities from the CVE database to emulate the one-day setting.
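One common way to expose such a tool set to GPT-4 is via OpenAI function calling. The sketch below shows roughly what the declarations could look like; the schemas are illustrative assumptions, not the paper’s actual tool definitions.

```python
# Sketch of how such a tool set might be declared for OpenAI function calling.
# These schemas are illustrative assumptions, not the paper's actual tool definitions.
tools = [
    {"type": "function", "function": {
        "name": "run_terminal",
        "description": "Run a shell command in the sandboxed target environment.",
        "parameters": {"type": "object",
                       "properties": {"command": {"type": "string"}},
                       "required": ["command"]}}},
    {"type": "function", "function": {
        "name": "browse",
        "description": "Fetch a web page and return its visible text.",
        "parameters": {"type": "object",
                       "properties": {"url": {"type": "string"}},
                       "required": ["url"]}}},
    {"type": "function", "function": {
        "name": "edit_file",
        "description": "Create or overwrite a file with the given contents.",
        "parameters": {"type": "object",
                       "properties": {"path": {"type": "string"},
                                      "contents": {"type": "string"}},
                       "required": ["path", "contents"]}}},
]
# Passed as the `tools` argument to client.chat.completions.create(...),
# these definitions let the model choose one tool call per step of the loop.
```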

Then, the researchers provided each agent with a detailed prompt that encouraged it to be creative, persistent and explore different approaches to exploiting the 15 vulnerabilities. This prompt consisted of 1,056 “tokens,” or individual units of text like words and punctuation marks.
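Token counts like that 1,056-token figure can be computed for any prompt with OpenAI’s tiktoken library; the prompt string in this sketch is a placeholder, as the study’s actual prompt is not reproduced here.

```python
# Counting prompt tokens with OpenAI's tiktoken library.
# The prompt string is a placeholder; the study's actual 1,056-token prompt is not public here.
import tiktoken

prompt = "You are a creative, persistent security agent. Explore different approaches..."  # placeholder
enc = tiktoken.encoding_for_model("gpt-4")
print(len(enc.encode(prompt)))  # number of tokens GPT-4 would see for this text
```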

The performance of each agent was measured based on whether it successfully exploited the vulnerabilities, the complexity of each vulnerability and the dollar cost of the endeavour, calculated from the number of input and output tokens and OpenAI’s API pricing.
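A simplified sketch of that cost calculation, using hypothetical token counts and assumed per-token prices, looks like this:

```python
# Rough sketch of deriving a dollar cost from token counts.
# Token counts and per-1K-token prices are assumptions, not figures from the paper;
# OpenAI's actual GPT-4 pricing varies by model version and date.
input_tokens, output_tokens = 30_000, 5_000        # hypothetical usage for one exploit attempt
price_in_per_1k, price_out_per_1k = 0.01, 0.03     # assumed $ per 1K input/output tokens

cost = input_tokens / 1000 * price_in_per_1k + output_tokens / 1000 * price_out_per_1k
print(f"${cost:.2f}")  # ≈ $0.45 for this hypothetical single run
```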

SEE: OpenAI’s GPT Store is Now Open for Chatbot Builders

The experiment was also repeated without providing the agent with descriptions of the vulnerabilities, to emulate the more difficult zero-day setting. In this instance, the agent had to both discover the vulnerability and then successfully exploit it.

Alongside the agent, the same vulnerabilities were provided to the vulnerability scanners ZAP and Metasploit, both commonly used by penetration testers. The researchers wanted to compare the scanners’ effectiveness at identifying and exploiting vulnerabilities with that of the LLM agents.

Ultimately, only the LLM agent based on GPT-4 could exploit the one-day vulnerabilities, i.e., when it had access to their CVE descriptions. All other LLMs and the two scanners had a 0% success rate and therefore were not tested with zero-day vulnerabilities.

Why did the researchers test the vulnerability exploitation capabilities of LLMs?

This study was conducted to address the gap in knowledge regarding the ability of LLMs to successfully exploit one-day vulnerabilities in computer systems without human intervention.

When a vulnerability is disclosed in the CVE database, the entry does not always describe how it can be exploited; therefore, threat actors or penetration testers looking to exploit it must work that out themselves. The researchers sought to determine the feasibility of automating this process with existing LLMs.

SEE: Learn how to Use AI for Your Business

The Illinois team has previously demonstrated the autonomous hacking capabilities of LLMs through “capture the flag” exercises, but not in real-world deployments. Other work has mostly focused on AI in the context of “human-uplift” in cybersecurity, for example, where hackers are assisted by a GenAI-powered chatbot.

Kang told TechRepublic, “Our lab is focused on the academic question of what are the capabilities of frontier AI methods, including agents. We have focused on cybersecurity due to its importance recently.”

OpenAI has been approached for comment.
