Over the past year, we’ve been exploring a big question: Can AI actually find good bugs in zero-knowledge circuits and applications? And if it can… what does that mean for us as auditors? Is our job safe, or are we on the brink of faster, cheaper, AI-powered audits? After digging into this at EthCC in Cannes and on the Zero Knowledge podcast, we went one step further — we built SnarkSentinel, our own experimental AI-powered ZK auditing tool. In this post, I’ll share what worked, what didn’t, and what the future of auditing might look like when humans and AI team up… or clash.
From Idea to Implementation: Circom as Ground Zero
We started zkSecurity in early 2023, just a few months after the first release of ChatGPT. We were impressed, but we didn’t yet grasp how quickly things would evolve. Fast forward two years, and we now have astonishing LLMs capable of doing all sorts of things we didn’t think possible.
This made us wonder… is AI coming for us? Then a guy in a leather jacket, looking suspiciously like a prophet for our times, took the stage at some conference and said something that stuck:
“AI is not gonna take your jobs. The person who uses AI is gonna take your job.” – guy in leather jacket
Game on. We decided we were going to be part of the problem, and started experimenting with our own tooling.
Our first target was Circom code. If you don’t know Circom, it’s a programming language for writing ZK circuits. We chose it because we were already auditing a number of projects built with this framework (see for example). Also, Circom projects are often self-contained and include a lot of low-level, cryptography-heavy algorithms.
If we believe the latest report from Anthropic (System Card: Claude Opus 4 & Claude Sonnet 4), Large Language Models (LLMs) have a much easier time solving Capture The Flag (CTF) challenges involving simple web applications than cryptography-related ones. So the thought was that if we could tackle the hard problem first, then any other problem should be easy.
From Naive Prompting to Context Engineering
Our first approach was quite naive: we sat down and quickly wrote a prompt (literally the one you can see above) and then pasted all the code that mattered underneath it.
In the picture above, words appear color coded because text is seen by LLMs as tokens (each color represents a token). Tokens are the unit of currency of LLMs: you pay per token (and output tokens are generally more expensive than input tokens). This is important as every model has a limit for the total number of tokens it can ingest and produce, the so-called context window.
The first models we used had quite small context windows, and we often hit the limit (even though Circom applications are relatively small compared to your usual codebase). As time passed, LLMs’ context windows grew, and we now usually have access to at least 200k tokens (and sometimes even more, like with the Gemini 1.5 Pro model and its 2M-token context window!).
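To give an idea of what that budget means in practice, here is a minimal sketch of how one might check whether a set of circuit files fits in a context window, using the tiktoken tokenizer. The directory name, the encoding choice, and the 200k budget are illustrative assumptions, not what SnarkSentinel actually does.

# Sketch: estimate whether a set of files fits in a context window.
# Assumes the `tiktoken` package; the encoding and the 200k budget are illustrative.
import tiktoken
from pathlib import Path

enc = tiktoken.get_encoding("cl100k_base")  # a tokenizer used by recent OpenAI models
BUDGET = 200_000                            # rough context budget; varies per model

files = list(Path("circuits").rglob("*.circom"))
total = sum(len(enc.encode(f.read_text(errors="ignore"))) for f in files)
print(f"{total} tokens across {len(files)} files:",
      "fits" if total < BUDGET else "does not fit", "in a 200k-token window")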
What we quickly realized is that even when the context window was large enough to contain everything we pasted into it, the results we got were iffy. On the other hand, if we curated the content of the prompt down to only what was relevant to a bug we already knew existed, then the LLM would sometimes find it! This showed that too much data confused the LLM, while the right amount of data helped it focus.
There were a number of approaches we could take. The first was to fine-tune a model, that is, to continue training it on a specific codebase. But this is costly and has to be redone for every new codebase or model (and oh boy did we swap models to test different approaches or to upgrade to whatever was the latest trendy model). Furthermore, it doesn’t work well unless you have a lot of data, and as we realized, the state-of-the-art (SOTA) models all managed to navigate Circom code really well out of the box (with some fine print that we will address later).
The next solution we turned to was RAG, which stands for Retrieval-Augmented Generation. RAG is usually implemented with a vector database that can ingest all types of documents (here, the contents of our codebase) and serve similarity searches. (Under the hood, vector stores work by figuring out which stored vectors are closest to the query’s vector, but that is outside the scope of this article.) A human orchestrator can then perform queries (for example, for functions relevant to the one we’re trying to analyze) and insert the results into the prompt.
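To make the idea concrete, here is a minimal sketch of the kind of retrieval a vector store performs, using the OpenAI embeddings API and plain cosine similarity. The model name, the naive fixed-size chunking, and the directory are illustrative assumptions; real vector stores (and SnarkSentinel) do this more carefully.

# Sketch of RAG retrieval: embed code chunks, then fetch the most similar ones for a query.
# Assumes the `openai` and `numpy` packages; model name and chunk size are illustrative.
import numpy as np
from pathlib import Path
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# "Ingest": split every file into fixed-size chunks (real vector stores chunk more carefully).
files = [p.read_text(errors="ignore") for p in Path("circuits").rglob("*.circom")]
chunks = [f[i:i + 2000] for f in files for i in range(0, len(f), 2000)]
vectors = embed(chunks)

def retrieve(query: str, k: int = 5) -> list[str]:
    """Return the k chunks whose embeddings are closest to the query's embedding."""
    q = embed([query])[0]
    scores = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(-scores)[:k]]

# The retrieved chunks then get pasted into the prompt next to the question.
context = "\n\n".join(retrieve("templates that constrain signals to be boolean"))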
But we’re trying to automate things! So at this point, we moved from simple generation to using agents that can retrieve data from these vector databases by themselves. This is perhaps a good moment to introduce the concept of an agent: a way to have a back-and-forth between you (the user) and the LLM to cooperatively build a prompt. This is similar to chatting with an AI as we’re used to in day-to-day interactions with applications like ChatGPT and Le Chat. The end result can be a structured output, for example some JSON with expected fields, describing what we want the LLM to eventually generate.
Using an agent that could query a vector store containing chunks of a codebase was a promising approach, until we realized that we could do better than just “getting close to what we were looking for”. We already had all the information we needed from Circom, after all. So we forked Circom and made it output a dependency graph that we could use to serve our agents’ queries:
assistant: what is the implementation of this function?
user: It’s …, and it’s also calling … and is called by …
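In the automated version, the “user” side of that exchange is served by a lookup over the dependency graph. Here is a sketch of what such a tool could look like; the JSON file name and its shape are hypothetical stand-ins, not the actual output format of our Circom fork.

# Sketch: answer "what is the implementation of this function?" from a dependency graph.
# The graph format (a JSON file mapping names to code, callees, and callers) is hypothetical.
import json

with open("dependency_graph.json") as f:
    GRAPH = json.load(f)  # e.g. {"MultiMux1": {"code": "...", "calls": [...], "called_by": [...]}}

def describe_function(name: str) -> str:
    """Tool exposed to the agent: return a function's code plus its call relations."""
    node = GRAPH.get(name)
    if node is None:
        return f"Unknown function: {name}"
    return (
        f"Implementation of {name}:\n{node['code']}\n"
        f"Calls: {', '.join(node['calls']) or 'nothing'}\n"
        f"Called by: {', '.join(node['called_by']) or 'nothing'}"
    )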
This approach worked well for Circom, but not for general codebases. In the latter case, it turns out that giving an agent access to cat, ls, and grep is enough to generalize the tool to all kinds of codebases.
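Concretely, the generic version boils down to three small tools the agent can call. A sketch follows; the exact signatures our agents use differ, and the framework-specific registration is omitted.

# Sketch: the three file-system tools that proved sufficient for arbitrary codebases.
# Each would be registered as a callable tool with whatever agent framework is in use.
import subprocess
from pathlib import Path

def cat(path: str) -> str:
    """Return the contents of a file."""
    return Path(path).read_text(errors="ignore")

def ls(path: str = ".") -> str:
    """List the entries of a directory."""
    return "\n".join(sorted(p.name for p in Path(path).iterdir()))

def grep(pattern: str, path: str = ".") -> str:
    """Search the codebase for a pattern (recursive, with line numbers)."""
    out = subprocess.run(["grep", "-rn", pattern, path], capture_output=True, text=True)
    return out.stdout or "no matches"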
At this point, having the agent request the code it needs is a good way to reduce the number of tokens we insert into the context window. But sometimes it doesn’t request enough, and starts hallucinating things. For example, it’ll hallucinate that some function returns a boolean that you’re supposed to assert, when the assertion is actually done inside the called function and no boolean is returned. With some prompt engineering we can force it to reference sources and snippets from the transcript (and thus to request missing implementations). A lot of the time we spent working on the tool went into debugging traces to see where the agent would mess up, and then fixing it via better prompting.
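For illustration, the kind of citation-forcing instruction we mean looks roughly like this (a simplified stand-in, not SnarkSentinel’s actual prompt):

“For every claim you make about the code, quote the exact snippet it is based on, with its file path, copied from tool outputs earlier in this conversation. If you have not seen the implementation of a function you are reasoning about, request it with your tools before making any claim about it. Never describe return values, constraints, or assertions you have not seen.”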
On the other hand, the exploration can sometimes involve too many useless cat and ls calls, and pull in unrelated content from the files it reads. This is a general issue we observed with web searches, documentation searches, and so on. For this reason, we often front tools with their own “expert agents”. In this case, we created a codebase-expert agent that explores the codebase on demand and returns a summarized result. This is costly but worth it, as what ends up in context is much more compact and relevant to the prompt.
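Here is a sketch of that “expert agent fronting a tool” pattern: the security agent asks a question in natural language, and a separate call with access to the raw files does the digging and returns only a compact summary. This simplified version is handed the files directly, whereas the real expert explores on demand with its own tools; the model name, prompts, and file paths are illustrative.

# Sketch: a codebase-expert step that fronts the raw files and returns a short summary.
# Uses the OpenAI chat completions API directly for brevity; prompts are illustrative.
from pathlib import Path
from openai import OpenAI

client = OpenAI()

def ask_codebase_expert(question: str, files: list[str]) -> str:
    """Dig through the given files and return only a short, focused summary."""
    snippets = "\n\n".join(f"--- {p} ---\n{Path(p).read_text(errors='ignore')}" for p in files)
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "You are a codebase expert. Answer using only the provided files, "
                        "in at most ten sentences, citing file paths for every claim."},
            {"role": "user", "content": f"{question}\n\n{snippets}"},
        ],
    )
    return resp.choices[0].message.content

# The security agent only ever sees this compact summary, not the raw file dumps.
summary = ask_codebase_expert(
    "How are the Merkle path indices constrained?",
    ["circuits/binary-merkle-root.circom"],
)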
Mix of Agents and the Frameworks War
At this point we were using service agents behind security agents. But security agents could only take on so much by themselves, so we also introduced an overview agent whose main goal in life was to create an overview of the codebase and find smaller, self-contained areas to analyze. For each of these areas, we would then spin up a security agent tasked with constraining itself to that corner of the codebase only. This mix-of-agents approach is a natural evolution, and we foresee that the next step will probably involve even more agents, potentially a long list of “micro agents” that each focus on specific types of bugs. On top of that, we started using RAG again, to allow humans to add more relevant documents to the context without overwhelming it, with document-expert agents fronting the vector stores.
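Roughly, the flow looks like the sketch below: an overview step carves the codebase into self-contained areas, and a fresh security agent is run per area. The prompts, the model, the JSON shape, and the framework-free structure are all illustrative stand-ins for the real orchestration.

# Sketch of the mix-of-agents flow: overview agent -> one security agent per area.
# Assumes the model returns bare JSON; a real implementation would use structured outputs.
import json
from pathlib import Path
from openai import OpenAI

client = OpenAI()

def llm(system: str, user: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": system}, {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

def overview_agent(file_listing: str) -> list[dict]:
    """Split the codebase into small, self-contained areas to analyze."""
    answer = llm(
        "Group the files of this project into small self-contained areas. "
        'Reply with JSON: [{"name": ..., "files": [...], "summary": ...}].',
        file_listing,
    )
    return json.loads(answer)

def security_agent(area: dict) -> str:
    """Audit a single area only, ignoring the rest of the codebase."""
    return llm(
        "You are a ZK security auditor. Only analyze the files listed; report potential bugs.",
        json.dumps(area),
    )

file_listing = "\n".join(str(p) for p in Path(".").rglob("*.circom"))
findings = [security_agent(area) for area in overview_agent(file_listing)]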
To orchestrate these complex agentic workflows, we took a look at the frameworks out there. Unfortunately, the situation is akin to 2016 JavaScript, with too many frameworks to choose from. Most frameworks will also break your application on each update, as AI concepts are still in flux and evolving rapidly. Some frameworks also have heavily biased abstractions that make it difficult to reason about low-level components.
At the end of the day, we decided to use the OpenAI Agents SDK as the main framework, as it offered interesting built-in services while remaining low-level enough (which should allow us to swap frameworks fairly easily if one of them ends up becoming the React of agentic frameworks). We still made use of other frameworks when they offered specific tooling (like llama-index for RAG).
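As a taste of what building on that SDK looks like, here is a minimal, quickstart-style sketch with a single tool and a single agent. The instructions and file path are illustrative, and the API details may differ across SDK versions; this is not SnarkSentinel’s actual setup.

# Minimal sketch with the OpenAI Agents SDK (package `openai-agents`): one tool, one agent.
# Instructions and paths are illustrative; the real tool wires many more agents and tools.
from pathlib import Path
from agents import Agent, Runner, function_tool

@function_tool
def read_file(path: str) -> str:
    """Return the contents of a file from the codebase under audit."""
    return Path(path).read_text(errors="ignore")

security_agent = Agent(
    name="security-agent",
    instructions="You audit Circom circuits. Read the files you need and report potential "
                 "under-constrained signals, citing file paths and code snippets.",
    tools=[read_file],
)

result = Runner.run_sync(security_agent, "Audit circuits/binary-merkle-root.circom")
print(result.final_output)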
In-Context Learning: Saving Noob Models Since 2023
Finding the best available model is one of the easiest ways to improve the tool. In our quest to find the champion of them all, we realized that most models did not have a great understanding of Circom, both of its functionalities (e.g. assert is not a secure function to use, because it does not create constraints) and of its threat model (i.e. hints should not be trusted, as the prover can arbitrarily modify their logic).
Worse, no small model (including distilled ones with ~7B parameters) could even recall any Circom code from its training.
To figure out how different models fared, we used a trick called LLM-as-a-judge: we let an agent (who knows the answer) judge another agent’s answers to our trick questions.
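A minimal sketch of that LLM-as-a-judge loop: one call answers the trick question, and a second call, which is given the reference answer, grades it. The question, reference answer, models, and prompts here are illustrative, not our actual evaluation set.

# Sketch of LLM-as-a-judge: grade a model's answer to a Circom trick question
# against a reference answer. Question, models, and prompts are illustrative.
from openai import OpenAI

client = OpenAI()

QUESTION = "Does `assert` in Circom add a constraint to the circuit?"
REFERENCE = "No. assert only checks at witness-generation time; it adds no constraint."

def ask(model: str, question: str) -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": question}]
    )
    return resp.choices[0].message.content

def judge(question: str, answer: str, reference: str) -> str:
    verdict = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Question: {question}\nReference answer: {reference}\n"
                       f"Candidate answer: {answer}\nReply with exactly PASS or FAIL.",
        }],
    )
    return verdict.choices[0].message.content.strip()

print(judge(QUESTION, ask("gpt-4o-mini", QUESTION), REFERENCE))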
The results were surprising: even the SOTA models would get questions wrong, despite answering them correctly some of the time! To address that, we first gave the models some guidance, and later ended up adding most of Circom’s documentation to the context, as we realized that the models we were using still had an endless list of questions about how Circom worked (this was obvious from the web searches they would make).
While this shows that fine-tuning could help us at some point, this in-context learning approach (as opposed to a training approach) worked extremely well with the SOTA models, and especially well with reasoning models.
Putting It to the Test: Results
OK, how about the actual results? To experiment with SnarkSentinel, we ran it on a number of open-source projects, but also on real audit projects right after finishing the audit, to see if the tool could find the bugs that we had actually found (and potentially more).
The tool ended up being very good at finding minor bugs that are well known and not necessarily Circom-specific. For example, it found some legitimate off-by-ones (that were not exploitable). We also found that it excelled at comparing a specification with an implementation (something you often need to do as an auditor, and it can be a tedious process). It also found critical cryptographic bugs, including one that we had to report (see next section).
We also got a number of false negatives and almost-false negatives. One finding pointed at a critical bug we had found, but the agent wrongly deemed its impact minor (we assumed it couldn’t understand the impact well, probably a lack-of-context problem). Another finding pointed at a critical bug we had found, but the agent concluded it was not exploitable due to an assumption it quoted directly from the project’s documentation. This one was interesting: the project’s documentation did say that 2048-bit numbers were too hard to factor, wrongly assuming that such numbers couldn’t be smooth.
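To see why that assumption is wrong: a number can be 2048 bits long and still be a product of tiny primes (i.e. smooth), in which case plain trial division recovers its factors instantly. A quick, self-contained sketch of that point (the 10,000 bound and target size are arbitrary):

# Sketch: a 2048-bit number built from small primes is trivial to factor,
# which is exactly what "hard to factor because it is 2048 bits" overlooks.
import random

def sieve(bound: int) -> list[int]:
    """Return all primes below `bound` (simple Eratosthenes sieve)."""
    flags = [True] * bound
    flags[0] = flags[1] = False
    for i in range(2, int(bound ** 0.5) + 1):
        if flags[i]:
            flags[i * i::i] = [False] * len(flags[i * i::i])
    return [i for i, ok in enumerate(flags) if ok]

small_primes = sieve(10_000)

# Build a ~2048-bit smooth number as a product of random small primes.
n = 1
while n.bit_length() < 2048:
    n *= random.choice(small_primes)

# Factor it back by plain trial division: essentially instant, despite the 2048-bit size.
m, factors = n, {}
for p in small_primes:
    while m % p == 0:
        factors[p] = factors.get(p, 0) + 1
        m //= p
print(n.bit_length(), "bits,", sum(factors.values()), "prime factors, remainder", m)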
For other critical bugs we had found, the tool often pointed in their direction but did not find them (and reported something unrelated instead). Our assumption at this point is that some low-level cryptographic bugs are just too hard to find because they are not common and require more than pattern recognition. That said, as mentioned earlier, we are now looking into micro-agents specifically tuned to find these kinds of bugs, to see if we can bridge the gap in some way. We have a number of ideas that we will keep for a future update!
Furthermore, we are thinking of different techniques that could improve the tool’s overall results. This is important, as every run seems to produce different bugs, and one might have to run the tool many times to find a specific bug. This has been confirmed many times by similar research; for example, this article mentions that:
o3 finds the kerberos authentication vulnerability in the benchmark in 8 of the 100 runs
For this reason, we currently also allow the tool to be run multiple times, and we use an LLM-as-a-judge approach to figure out if newly reported bugs are duplicates of previously found ones. This process currently involves an agent making queries to a vector store of reported bugs, which is a bit slow but could be improved.
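A sketch of that dedup step: embed each new finding, pull its nearest previously reported findings, and let a judge call decide whether it is a duplicate. The in-memory “store”, the 0.6 similarity cut-off, and the model names are illustrative assumptions.

# Sketch of findings dedup: nearest neighbours in embedding space, then an LLM judge.
# Model names, the 0.6 cut-off, and the in-memory "store" are illustrative.
import numpy as np
from openai import OpenAI

client = OpenAI()
reported: list[str] = []               # previously reported findings
reported_vecs: list[np.ndarray] = []

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=[text])
    return np.array(resp.data[0].embedding)

def is_duplicate(finding: str) -> bool:
    v = embed(finding)
    for old, old_v in zip(reported, reported_vecs):
        sim = float(v @ old_v / (np.linalg.norm(v) * np.linalg.norm(old_v)))
        if sim < 0.6:
            continue  # not even close, skip the expensive judge call
        verdict = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content":
                       f"Are these two bug reports about the same issue? Answer YES or NO.\n"
                       f"A: {old}\nB: {finding}"}],
        ).choices[0].message.content
        if verdict.strip().upper().startswith("YES"):
            return True
    return False

def report(finding: str) -> None:
    if not is_duplicate(finding):
        reported.append(finding)
        reported_vecs.append(embed(finding))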
The biggest issue in our experiments was by far the false positives. We received a ton of them, which made triaging the most time-consuming part of this research. We think that a tool with too many false positives is not going to be useful to anyone but auditors at this point, and even for auditors it might create too much friction to be used during audits. We experimented with asking agents to produce proofs of concept, which worked well in some cases (especially if a testing framework was already set up), but when exploits involved cryptography the tool would often get stuck (for example, trying to produce keypairs with specific attributes).
In any case, the tool still seems to be a net positive, as it allows for more exhaustive coverage of a project and gives good ideas and leads, even when it doesn’t find actual bugs. Overall, it also seems that the tool requires heavy human attention: at the start to gather more context, during the runs to debug traces, and especially at the end to provide the expertise needed to decide whether findings are truly findings.
Side-Quest: Uncovering and Reporting a Nasty Bug
During one of the runs, SnarkSentinel found a critical bug in an out-of-scope dependency of the project we had been reviewing. (The nice thing about SnarkSentinel is that it will even inspect the transitive dependencies of a project.) More specifically, SnarkSentinel found a bug in the PSE binary Merkle root library.
The library can be used to compute the Merkle root of a binary Merkle tree, given a leaf, a list of siblings, and an array of indices.
This indices array is supposed to be composed of 0 and 1 values, but the tool found that nowhere in the circuit were the indices actually enforced to be booleans, and thus they could be any field element.
In the library’s implementation, the MultiMux1 template from circomlib is used to select which values to hash (given a bit selector) with the Poseidon hash function, producing the next node on the path to the Merkle root:
var c[2][2] = [ [nodes[i], siblings[i]], [siblings[i], nodes[i]] ];
var childNodes[2] = MultiMux1(2)(c, indices[i]);
nodes[i + 1] <== Poseidon(2)(childNodes);
However, taking a look at the MultiMux1 template, we can see that it just computes a linear combination of the two constants, assuming that the given indices are already constrained to be 0 or 1:
template MultiMux1(n) {
    signal input c[n][2]; // Constants
    signal input s;       // Selector
    signal output out[n];

    for (var i=0; i<n; i++) {
        out[i] <== (c[i][1] - c[i][0])*s + c[i][0];
    }
}
With a bit of algebra, anyone can find an index value and a sibling such that the output of the MultiMux1 template is equivalent to some honest pair of leaves. In other words, an attacker could craft malicious inputs such that any leaf could be proven to be part of the Merkle tree, even if it was not.
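Here is that algebra as a quick sketch over the scalar field used by circomlib: given any honest pair (a, b) that hashes to a node in the tree, solve MultiMux1’s two linear output equations for the sibling and the (non-boolean) index. The variable names and toy values are ours for illustration.

# Sketch of the forgery: pick the sibling and a non-boolean "index" so that MultiMux1
# outputs any honest pair (a, b), even though our node value is unrelated to the tree.
# P is the BN254 scalar field modulus used by circom/circomlib; values below are toy examples.
P = 21888242871839275222246405745257275088548364400416034343698204186575808495617

def forge(node: int, a: int, b: int) -> tuple[int, int]:
    """Return (sibling, index) such that MultiMux1 yields (a, b) for our `node`."""
    # MultiMux1 computes: out0 = (sibling - node)*s + node, out1 = (node - sibling)*s + sibling.
    # Adding the two target equations out0 = a and out1 = b gives node + sibling = a + b, so:
    sibling = (a + b - node) % P
    # Then out0 = a gives s = (a - node) / (sibling - node), assuming the denominator is nonzero:
    s = (a - node) * pow((sibling - node) % P, -1, P) % P
    return sibling, s

# Sanity check against MultiMux1's linear combination:
node, a, b = 12345, 777, 888          # toy values; (a, b) would be an honest pair in the tree
sibling, s = forge(node, a, b)
out0 = ((sibling - node) * s + node) % P
out1 = ((node - sibling) * s + sibling) % P
assert (out0, out1) == (a % P, b % P)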
The tool was not able to correctly assess the impact of this bug, and it did not immediately flag it as critical. However, it gave us enough detail to point us in the right direction and allowed us to confirm the bug quickly. This is a good example of how AI can enhance human auditing: it can point out areas of code that are worth investigating, even if it does not fully characterize the bugs themselves.
After we discovered the bug, we immediately reached out to PSE, who were already aware of the issue and fixed it in the v2.0.0 release of the library.
What’s Next for AI Audits
It’s hard to predict what’s going to happen next. We know that AI is getting better and better, but we are also starting to understand its limitations. At the moment, AI is only good at things humans are good at, and the more niche something is, the harder it is to get good results out of it.
In its current state, AI is most likely not going to replace us for serious audits that involve advanced cryptographic applications. But there is no doubt that it will for low-stakes audits (e.g. some web apps) and for well-understood domains (e.g. smart contracts). We think a lot of consulting shops will shrink or die. Interestingly, bug bounties are mostly going to become AI agents reporting bugs to AI agents triaging them, if they haven’t already.
Another interesting question we’re asking ourselves is “are we going to be more secure or less secure in general?” It seems to us that finding and exploiting bugs is going to become even easier with AI, potentially allowing less technical people (so-called script kiddies) to become much more dangerous. At the same time, developers (if they don’t get replaced either) will get access to the same tools, and will be able to secure their applications much more efficiently.
In any case, one thing’s clear: we are getting more and more dangerous as auditors, too.