We have audited the ChaCha20 circuits of the Reclaim protocol.

You can find the report here as well as Reclaim's own announcement here.
Through two audit iterations, it became evident that the word-based circuit approach still posed challenges, as several bit-level operations could not be efficiently constrained. This realization led us to revisit the initial individual-bits approach, optimize it, and undergo the audit for the third, final, time.
The outcome of this comprehensive audit was a significant performance improvement. Our new circuit, validated as both secure and efficient, emerged as a 10% smaller and faster alternative to the word-based approach. This achievement not only enhances the performance of our cryptographic solutions but also underscores our commitment to providing the best security practices to our users.
The implementation used Circom, targeting the Groth16 proof system, and offering a flexible API for the user to use. In addition, it purposefully does not provide authentication over the ciphertext as the Reclaim protocol does not need it. Oracles can see the ciphertexts and authenticate their integrity by passing them as public inputs to the circuit when verifying the proofs.
The audit followed RFC 7539: ChaCha20 and Poly1305 for IETF Protocols in order to ensure that the implementation matched the specification.
To address original findings and subsequent reviews, Questbook ended up changing its approach to encode all bitwise logic in the circuits with bit-level constraints. We directly copy below the overview of the changes and the reasoning behind them as written in the report.
Questbook uses the Groth16 proof system, which works with circuits that are encoded using the R1CS arithmetization. Such an arithmetization allows someone to encode a circuit as a list of quadratic constraints which are often referred to as the rows or constraints of the circuits. Specifically, each constraint has full access to a list of "witness values" (that usually starts with the public input values), and can encode the multiplication of two linear combinations of that witness list and enforce that it is equal to another linear combination of that same witness list.
In other words, for a witness vector $\vec{w} = (w_0, w_1, \cdots)$ we can add new constraints by hardcoding the scalar values $a_i, b_i, c_i$ which will enforce the following:
$$ (a_0 w_0 + a_1 w_1 + \cdots)(b_0 w_0 + b_1 w_1 + \cdots) = c_0 w_0 + c_1 w_1 + \cdots $$
Circuit size is usually paramount to a zero-knowledge application, as it directly impacts the proof size and prover time. As such, reducing a circuit is often a priority.
Non-snark friendly cryptography primitives, like ChaCha20 implemented here, are often full of bitwise operations (e.g. XOR, ROT, etc.) which are not efficient to encode in such arithmetization systems. This often leads to circuits blowing up.
Some systems accelerate unfriendly low-level operations by making use of lookup tables or higher-degree constraints (that fits in a single row). Groth16 does not support any of these. For this reason, bitwise operations in the ChaCha20 circuit have to be encoded by first unpacking values down to their bit representations, and then acting on the bits directly. Later, bits can be packed back into field elements if needed.
Originally, the ChaCha20 circuit attempted to save constraints by avoiding the packing and unpacking of values. Dealing with bits directly is undesirable as each value being unpacked leads to a growth of the witness size by the number of bits. Then each bit must be constrained to ensure that they can only be 0 or 1 and nothing else. These "bit constraints" are quadratic constraints (i.e. $x(x-1) = 0$) and thus cost one constraint/row in R1CS for each of them. That is unlike the packing and unpacking operations ($x = \sum_i x_i 2^i$) which are linear and thus can be encoded in a single constraint/row.
Unfortunately, as it was found in the original review contained which forms the base of this report, that approach was found not to be sound.
The circuits were subsequently refactored to perform all bitwise operations on the bits of the values directly. In addition, packing and unpacking operations are removed as bits can be passed from one operation to the other without paying the cost of constantly re-encoding.