Optimizing Barrett Reduction: Tighter Bounds Eliminate Redundant Subtractions

Written by Suneal Gong on May 1, 2025

Barrett reduction is a widely used algorithm for reducing a value modulo $m$. Our analysis, conducted during the Rust p256 crate audit, shows that the error bound for Barrett reduction can be tighter than traditionally assumed. For most moduli used in cryptography (e.g., NIST curves), the quotient approximation error is at most 1 (not 2). This improvement eliminates the need for a second subtraction in practice. By adopting this optimization, RustCrypto p256 achieves a roughly 14% performance improvement in scalar field multiplication and inversion.

What Is Barrett Reduction?

Barrett reduction is a method for efficiently computing the remainder of a division operation (i.e., modulus operation), $x \mod m$ , without performing an actual division.

We want to compute $r = x \mod m$, which can be expressed as $x = q \cdot m + r$, where $q$ is the quotient and $r$ is the remainder. In practice (e.g., cryptographic field arithmetic), the modulus $m$ can be a large integer represented using $k$ limbs. Each limb is a 32-bit or 64-bit value (depending on the machine word), making the radix $b = 2^{32}$ or $2^{64}$. The value $x$ is typically a $2 k$-limb integer because it results from the multiplication of two $k$-limb integers.

One way to compute $r$ is to first calculate $q$ as:

$q = ⌊ x / m ⌋$

Once $q$ is determined, $r$ can be computed as $r = x - q \cdot m$ . Barrett reduction provides a more efficient way to approximate $q$ , avoiding the need for direct division, which is computationally expensive.

First, let’s rewrite the formula above as:

$q = ⌊ x / m ⌋ = ⌊ \frac{x}{m} \cdot \frac{b^{2 k}}{b^{2 k}} ⌋ = ⌊ \frac{x}{m} \cdot \frac{b^{2 k}}{b^{k + 1} \cdot b^{k - 1}} ⌋ = ⌊ \frac{b^{2 k}}{m} \cdot \frac{x}{b^{k - 1}} \cdot \frac{1}{b^{k + 1}} ⌋$

So far, our computation is exact. However, instead of computing $q$ exactly, we’ll approximate it by computing:

$\tilde{q} = ⌊ \frac{⌊ \frac{x}{b^{k - 1}} ⌋ \cdot ⌊ \frac{b^{2 k}}{m} ⌋}{b^{k + 1}} ⌋$

This allows us to precompute $μ = ⌊ \frac{b^{2 k}}{m} ⌋$ and rewrite the above as:

$\tilde{q} = ⌊ \frac{⌊ \frac{x}{b^{k - 1}} ⌋ \cdot μ}{b^{k + 1}} ⌋$

Note: Both $⌊ \frac{\cdot}{b^{k - 1}} ⌋$ and $⌊ \frac{\cdot}{b^{k + 1}} ⌋$ are fast to compute using right shifts.

Since the two approximations can be smaller than their exact computations, we have $\tilde{q} \leq q$ . Traditional analysis has shown that $\tilde{q} \in [q - 2, q]$ (we will elaborate on this in the analysis section). This means the approximate quotient $\tilde{q}$ is at most 2 less than the true quotient $q$ .
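The approximation can be sketched in a few lines of Python, whose arbitrary-precision integers stand in for multi-limb arithmetic. The function names and the sample modulus below are assumptions made for illustration only; the sketch simply collects the differences $q - \tilde{q}$ observed on random inputs.

```python
import random

def barrett_quotient(x, m, k, limb_bits=64):
    """Approximate floor(x / m) with two right shifts and one multiplication."""
    mu = (1 << (2 * k * limb_bits)) // m        # mu = floor(b^(2k) / m), precomputable
    q1 = x >> ((k - 1) * limb_bits)             # floor(x / b^(k-1))
    return (q1 * mu) >> ((k + 1) * limb_bits)   # floor(q1 * mu / b^(k+1))

def observed_errors(m, k, samples=2000, seed=0):
    """Collect the values of q - q_tilde seen on random inputs below m^2."""
    rng = random.Random(seed)
    return {x // m - barrett_quotient(x, m, k)
            for x in (rng.randrange(m * m) for _ in range(samples))}

# Illustrative 4-limb modulus (an assumption of this sketch, not taken from the article).
M = (1 << 256) - 189
print(observed_errors(M, k=4))
```

Every value in the printed set should lie in $\{0, 1, 2\}$, matching the traditional bound.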

Computing $r$ in $x \approx \tilde{q} \cdot m + r$

If we had computed $q$ exactly, we would simply compute:

$r = x - q \cdot m$

Since $r \in [0, m)$ and $m < b^{k}$, the result fits in $k$ limbs. Thus, we can compute it faster by involving only the $k$ least significant limbs of each operand, i.e., by working modulo $b^{k}$:

$r = (x - q \cdot m) \mod b^{k} = ((x \mod b^{k}) - (q \cdot m \mod b^{k})) \mod b^{k}$

(where modulo $b^{k}$ can be computed efficiently)
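As a small illustration of this truncated subtraction, the following sketch masks both operands to their low $k$ limbs and wraps the difference back into $[0, b^{k})$. The helper name and the sample values are hypothetical, and the exact quotient is obtained with ordinary division purely to check the identity.

```python
def remainder_from_low_limbs(x, q, m, k, limb_bits=64):
    """Compute x - q*m using only the k least significant limbs of each term."""
    mask = (1 << (k * limb_bits)) - 1    # "mod b^k" is a bitwise AND with this mask
    low_x = x & mask
    low_qm = (q * m) & mask
    return (low_x - low_qm) & mask       # the final mask wraps a negative difference

m = (1 << 256) - 189          # illustrative 4-limb modulus (assumption)
x = 3 * m * m // 7 + 12345    # arbitrary input below m^2
q = x // m                    # exact quotient, used here only for the check
print(remainder_from_low_limbs(x, q, m, k=4) == x % m)
```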


However, remember that we have computed an approximation $\tilde{q}$ of $q$ , where $\tilde{q} \in [q - 2, q]$ .

We know that one of the following holds:

$r = x - \tilde{q} \cdot m$
$r = x - (\tilde{q} + 1) \cdot m$
$r = x - (\tilde{q} + 2) \cdot m$

To compute $r$, we first calculate the first candidate, $\tilde{r} = x - \tilde{q} \cdot m$. If the result is not smaller than $m$, we subtract $m$ once or twice to bring it into the correct range.

This means that $\tilde{r} = x - \tilde{q} \cdot m$ is not guaranteed to be smaller than $b^{k}$. Instead, it can exceed $r$ by $m$ or $2 m$.

Since $m < b^{k}$ , we know $2 \cdot m < 2 \cdot b^{k}$ . This allows us to upper bound our approximation:

$r \leq \tilde{r} \leq r + 2 m < b^{k} + 2 \cdot b^{k} < b^{k + 1}$

Then, we can calculate $\tilde{r}$ more efficiently:

$\tilde{r} = ((x \mod b^{k + 1}) - (\tilde{q} \cdot m \mod b^{k + 1})) \mod b^{k + 1}$


Again, the $\mod b^{k + 1}$ operation is very efficient on binary machines.

We finally arrive at the algorithm as described in Handbook of Applied Cryptography, Chapter 14. The steps outlined above closely follow the process detailed in the book.

[Figure: Barrett reduction algorithm. Source: Handbook of Applied Cryptography, Chapter 14]
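For readers who prefer code, here is a minimal Python sketch of the steps derived above (quotient estimate, subtraction modulo $b^{k+1}$, fix-up of a negative difference, and the final correction loop). It is not the RustCrypto implementation: the function names, the returned subtraction counter, and the sample modulus are assumptions made for illustration.

```python
import random

def barrett_setup(m, limb_bits=64):
    """Return (k, mu) for modulus m, with b = 2**limb_bits and b^(k-1) <= m < b^k."""
    k = (m.bit_length() + limb_bits - 1) // limb_bits
    mu = (1 << (2 * k * limb_bits)) // m
    return k, mu

def barrett_reduce(x, m, k, mu, limb_bits=64):
    """Reduce 0 <= x < b^(2k) modulo m; also report how many subtractions ran."""
    lo_shift = (k - 1) * limb_bits
    hi_shift = (k + 1) * limb_bits
    mask = (1 << hi_shift) - 1

    # Step 1: approximate quotient.
    q3 = ((x >> lo_shift) * mu) >> hi_shift
    # Step 2: r = (x mod b^(k+1)) - (q3 * m mod b^(k+1)).
    r = (x & mask) - ((q3 * m) & mask)
    # Step 3: fix up a negative difference.
    if r < 0:
        r += 1 << hi_shift
    # Step 4: final correction loop (at most two iterations by the classic bound).
    subtractions = 0
    while r >= m:
        r -= m
        subtractions += 1
    return r, subtractions

m = (1 << 256) - 189                      # illustrative modulus (assumption)
k, mu = barrett_setup(m)
rng = random.Random(1)
for _ in range(1000):
    x = rng.randrange(1 << (2 * k * 64))
    r, n_sub = barrett_reduce(x, m, k, mu)
    assert r == x % m and n_sub <= 2
```

The subtraction counter is returned only so that the bound on the correction loop can be observed; a production implementation would not need it.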

A Tighter Bound Analysis

The bound of $\tilde{q}$ directly determines how many times $m$ must be subtracted from $\tilde{r}$ for the result to fall within the correct range. Traditionally, the bound is known to be $\tilde{q} \in [q - 2, q]$. In this section, we will show that for almost all moduli in practice, the tighter bound holds: $\tilde{q} \in [q - 1, q]$.

Remember that

$q = ⌊ \frac{x}{m} ⌋$

$\tilde{q} = ⌊ \frac{⌊ \frac{x}{b^{k - 1}} ⌋ \cdot ⌊ \frac{b^{2 k}}{m} ⌋}{b^{k + 1}} ⌋$

Since $\tilde{q}$ is derived from a truncated approximation of $\frac{x}{m}$ , it naturally satisfies $\tilde{q} \leq q$ .

Let’s denote $α = x \mod b^{k - 1}$ , where $α < b^{k - 1}$ . Denote $β = b^{2 k} \mod m$ , where $β < m$ . Then we can remove the floor operation:

$⌊ \frac{x}{b^{k - 1}} ⌋ = \frac{x - α}{b^{k - 1}}$

$⌊ \frac{b^{2 k}}{m} ⌋ = \frac{b^{2 k} - β}{m}$

Then we can simplify $\tilde{q}$ :

$\begin{aligned} \tilde{q} & = ⌊ \frac{⌊ \frac{x}{b^{k - 1}} ⌋ \cdot ⌊ \frac{b^{2 k}}{m} ⌋}{b^{k + 1}} ⌋ \\ & = ⌊ \frac{\frac{x - α}{b^{k - 1}} \cdot \frac{b^{2 k} - β}{m}}{b^{k + 1}} ⌋ \\ & = ⌊ \frac{(x - α) \cdot (b^{2 k} - β)}{m \cdot b^{2 k}} ⌋ \\ & = ⌊ \frac{x}{m} - \frac{α \cdot b^{2 k} + β \cdot (x - α)}{m \cdot b^{2 k}} ⌋ \end{aligned}$

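Let $z = \frac{α \cdot b^{2 k} + β \cdot (x - α)}{m \cdot b^{2 k}}$ denote the subtracted term (note that $z \geq 0$). Then we have $\tilde{q} = ⌊ \frac{x}{m} - z ⌋$. By the floor function inequality $⌊ u ⌋ + ⌊ v ⌋ + 1 \geq ⌊ u + v ⌋$, applied with $u = \frac{x}{m} - z$ and $v = z$, we have: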

$⌊ \frac{x}{m} - z ⌋ + ⌊ z ⌋ + 1 \geq ⌊ (\frac{x}{m} - z) + z ⌋ = ⌊ \frac{x}{m} ⌋ = q$

That is, $\tilde{q} + ⌊ z ⌋ + 1 \geq q$.

The bound of $z$ is the key to analyzing the bound of $\tilde{q}$ . If we can prove that $0 \leq z < 2$ , then we will have $⌊ z ⌋ \leq 1$ and finally $\tilde{q} + 2 \geq \tilde{q} + ⌊ z ⌋ + 1 \geq q$ . Let’s analyze the bound of $z$ .

Proof that $\tilde{q} \in [q - 2, q]$

Using $α < b^{k - 1}$ and $x - α \leq x < b^{2 k}$, we have:

$\begin{aligned} z & = \frac{α \cdot b^{2 k} + β \cdot (x - α)}{m \cdot b^{2 k}} \\ & < \frac{b^{k - 1} \cdot b^{2 k} + β \cdot b^{2 k}}{m \cdot b^{2 k}} \\ & = \frac{b^{k - 1} + β}{m} \end{aligned}$

We know that $b^{k - 1} \leq m$ (because $m$ has $k$ limbs) and $β < m$. Then we have:

$z < \frac{b^{k - 1} + β}{m} < \frac{m + m}{m} = 2$

And thus:

$⌊ z ⌋ \leq 1$

That’s exactly what we want. Therefore, we conclude that $\tilde{q} + 2 \geq \tilde{q} + ⌊ z ⌋ + 1 \geq q$ .
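The argument can be checked numerically with exact rational arithmetic. The sketch below uses a deliberately tiny radix and modulus (assumptions chosen so that every input can be enumerated) and verifies, for each $x$, the identity $\tilde{q} = ⌊ \frac{x}{m} - z ⌋$ together with the bounds $0 \leq z < 2$ and $q - \tilde{q} \leq ⌊ z ⌋ + 1$.

```python
from fractions import Fraction
from math import floor

def gap_and_z(x, m, k, b):
    """Return (q - q_tilde, z), computed exactly with rational arithmetic."""
    alpha = x % b**(k - 1)
    beta = b**(2 * k) % m
    z = Fraction(alpha * b**(2 * k) + beta * (x - alpha), m * b**(2 * k))
    q = x // m
    q_tilde = ((x // b**(k - 1)) * (b**(2 * k) // m)) // b**(k + 1)
    # Identity from the derivation: q_tilde = floor(x/m - z).
    assert q_tilde == floor(Fraction(x, m) - z)
    return q - q_tilde, z

# Toy parameters (assumptions): radix b = 16, k = 2 limbs, modulus m = 200.
b, k, m = 16, 2, 200
worst_gap, worst_z = 0, Fraction(0)
for x in range(b**(2 * k)):
    gap, z = gap_and_z(x, m, k, b)
    assert 0 <= z < 2 and 0 <= gap <= floor(z) + 1
    worst_gap, worst_z = max(worst_gap, gap), max(worst_z, z)
print(worst_gap, float(worst_z))
```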

Tighter Bound in Practice: $\tilde{q} \in [q - 1, q]$

We have proved that $\tilde{q} \in [q - 2, q]$ under all circumstances. However, this is a loose bound. Here, we claim that for almost all moduli $m$ in practice, we have a tighter bound: $z < 1$ , and thus $\tilde{q} + 1 \geq \tilde{q} + ⌊ z ⌋ + 1 \geq q$ .

Let’s examine which moduli $m$ satisfy this tighter bound. Recall that $z < \frac{b^{k - 1} + β}{m}$ , so $z < 1$ holds if the following is true:

$\frac{b^{k - 1} + β}{m} \leq 1$

Which means:

$b^{k - 1} + β \leq m$

Further:

$β \leq m - b^{k - 1}$

In other words, if $β$ is no greater than $m - b^{k - 1}$, then $z < 1$ and thus $\tilde{q} \in [q - 1, q]$. We can formalize this as the Tighter Bound Criterion:

Given a modulus $m$, if $β \leq m - b^{k - 1}$ (where $β = b^{2 k} \mod m$), then $\tilde{q} \in [q - 1, q]$.
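To see the criterion at work, one can exhaustively test a toy radix where every modulus and every input can be enumerated. The radix $b = 8$ and $k = 2$ below are assumptions chosen only to keep the search small, and the helper names are hypothetical. The sketch checks that no modulus satisfying the criterion ever shows an error of 2, then lists the moduli that fail it.

```python
def criterion_holds(m, k, b):
    """Tighter Bound Criterion: beta <= m - b^(k-1), with beta = b^(2k) mod m."""
    return b**(2 * k) % m <= m - b**(k - 1)

def max_gap(m, k, b):
    """Largest q - q_tilde over every input 0 <= x < b^(2k)."""
    mu = b**(2 * k) // m
    return max(x // m - ((x // b**(k - 1)) * mu) // b**(k + 1)
               for x in range(b**(2 * k)))

# Toy radix and limb count (assumptions chosen so exhaustive search is cheap).
b, k = 8, 2
results = {m: (criterion_holds(m, k, b), max_gap(m, k, b))
           for m in range(b**(k - 1), b**k)}

for m, (ok, gap) in results.items():
    if ok:
        assert gap <= 1          # guaranteed by the criterion
    else:
        print(f"m = {m}: criterion fails, worst observed error = {gap}")
```

The criterion is sufficient rather than necessary, so a modulus that fails it may still have a worst-case error of 1. With these toy parameters, $m = 10$ does reach an error of 2 (for example at $x = 4071$, where $q = 407$ and $\tilde{q} = 405$).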

It’s worth noting that $β = b^{2 k} \mod m$ falls within $[0, m)$ . In practice, where $b$ is either $2^{32}$ or $2^{64}$ , and $m$ approaches $b^{k}$ , the value $m - b^{k - 1}$ is very close to $m$ .

The behavior of $β = b^{2 k} \mod m$ resembles a uniform random distribution over $[0, m)$ for any random modulus $m$. This makes it highly probable that $β \leq m - b^{k - 1}$. Consequently, most moduli $m$ will satisfy $z < 1$, leading to $\tilde{q} + 1 \geq q$.

To quantify this probability, let’s assume the common case where $\frac{b^{k}}{2} < m < b^{k}$ and $β$ follows a uniform distribution. Under these conditions:

$\Pr [β \leq m - b^{k - 1}] = \frac{m - b^{k - 1}}{m} = 1 - \frac{b^{k - 1}}{m} > 1 - \frac{b^{k - 1}}{b^{k} / 2} = 1 - \frac{2}{b}$

To put this in perspective:

When $b = 2^{32}$ , the probability of achieving the tighter bound exceeds $1 - \frac{1}{2^{31}}$ .
When $b = 2^{64}$ , the probability of achieving the tighter bound exceeds $1 - \frac{1}{2^{63}}$ .

That is, for nearly all moduli in practice, the tighter bound $\tilde{q} \in [q - 1, q]$ holds.
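The estimate can be explored with a short script. The first helper evaluates the heuristic lower bound $1 - \frac{2}{b}$ exactly; the second counts, for a small radix where enumeration is feasible, how many moduli in $(\frac{b^{k}}{2}, b^{k})$ actually satisfy the criterion. Both function names and the choice of $b = 2^{8}$ are assumptions, and the uniformity of $β$ is a heuristic, so the empirical fraction is shown for comparison only.

```python
from fractions import Fraction

def heuristic_lower_bound(b):
    """Lower bound 1 - 2/b on Pr[beta <= m - b^(k-1)] under the uniformity assumption."""
    return 1 - Fraction(2, b)

def empirical_fraction(b, k):
    """Fraction of moduli b^k/2 < m < b^k that satisfy the criterion (exhaustive count)."""
    moduli = range(b**k // 2 + 1, b**k)
    hits = sum(1 for m in moduli if b**(2 * k) % m <= m - b**(k - 1))
    return Fraction(hits, len(moduli))

# The bound for b = 2^32 equals 1 - 1/2^31, as stated in the text.
print(heuristic_lower_bound(2**32) == 1 - Fraction(1, 2**31))

# Small radix (assumption) so that every modulus can be enumerated.
print(float(heuristic_lower_bound(2**8)), float(empirical_fraction(b=2**8, k=2)))
```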

Here is an intuitive explanation: $μ$ (i.e., $⌊ \frac{b^{2 k}}{m} ⌋$) is the quotient of $b^{2 k}$ divided by $m$, and $β$ is the remainder. If $β$ is very small, then $μ$ will be very close to the exact value of $\frac{b^{2 k}}{m}$, and thus the calculation of $\tilde{q}$ will incur less approximation error. Our analysis shows that as long as $β$ is no greater than $m - b^{k - 1}$, the calculated $\tilde{q}$ is at most $1$ less than $q$. As that is a very loose requirement for $β$ (or $m$), most moduli satisfy the tighter bound.

Practical Optimization For Barrett Reduction Implementation

The tighter bound enables a faster Barrett reduction implementation. According to traditional analysis, we may need to subtract $m$ from $\tilde{r}$ twice to get the final result. For a given modulus $m$, if the tighter bound holds, then we need to subtract $m$ at most once. This means we can save one subtraction.

This is especially useful for constant-time implementations, which always perform the maximum number of subtractions. For example, the RustCrypto P-256 scalar field implementation always subtracts twice.

pub(super) const fn barrett_reduce(lo: U256, hi: U256) -> U256 {
    [...]

    let r1 = [a0, a1, a2, a3, a4];
    let r2 = q3_times_n_keep_five(&q3);
    let r = sub_inner_five(r1, r2);

    // Result is in range (0, 3*n - 1),
    // and 90% of the time, no subtraction will be needed.
    let r = subtract_n_if_necessary(r);
    let r = subtract_n_if_necessary(r);
    U256::new([r[0], r[1], r[2], r[3]])
}

As the P-256 scalar field modulus satisfies $β \leq m - b^{k - 1}$, the tighter bound $\tilde{q} \in [q - 1, q]$ holds. This implies the calculated $\tilde{q}$ is at most $1$ less than the actual quotient $q$, so $m$ needs to be subtracted from $r$ at most once to get the final result. The second subtract_n_if_necessary call in the code above is unnecessary and can be safely removed.
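The effect of dropping the second call can be modeled in Python. This is only a big-integer model, not the RustCrypto code: the function names mirror the Rust snippet for readability, the conditional subtraction is an ordinary branch rather than a constant-time select, and the limb layout (four 64-bit limbs) is an assumption. The sketch reduces edge cases and random 512-bit inputs with a single conditional subtraction and counts disagreements with Python's own `%` operator.

```python
import random

# P-256 scalar field modulus, as quoted in the article's script.
N = 0xffffffff00000000ffffffffffffffffbce6faada7179e84f3b9cac2fc632551
LIMB_BITS, K = 64, 4
MU = (1 << (2 * K * LIMB_BITS)) // N    # floor(b^(2k) / N)

def subtract_n_if_necessary(r):
    """Plain-Python stand-in for a conditional subtraction (not constant time here)."""
    diff = r - N
    return r if diff < 0 else diff

def barrett_reduce_once(x):
    """Barrett reduction with exactly one conditional subtraction at the end."""
    mask = (1 << ((K + 1) * LIMB_BITS)) - 1
    q3 = ((x >> ((K - 1) * LIMB_BITS)) * MU) >> ((K + 1) * LIMB_BITS)
    r = ((x & mask) - ((q3 * N) & mask)) & mask
    return subtract_n_if_necessary(r)

rng = random.Random(2025)
samples = [0, N - 1, N, (1 << 512) - 1]
samples += [rng.randrange(1 << 512) for _ in range(10_000)]
mismatches = sum(1 for x in samples if barrett_reduce_once(x) != x % N)
print("mismatches:", mismatches)
```

Random sampling cannot prove the bound; that is what the criterion is for. The sketch only illustrates that a single subtraction is consistent with it.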

Benchmarks show that simply removing the second subtraction improves multiplication and inversion performance by 14%.

scalar operations/mul
    time:   [38.900 ns 38.957 ns 39.026 ns]
    change: [-14.379% -14.052% -13.734%] (p = 0.00 < 0.05)
    Performance has improved.
scalar operations/invert
    time:   [20.716 µs 20.758 µs 20.823 µs]
    change: [-14.817% -14.331% -13.969%] (p = 0.00 < 0.05)
    Performance has improved.

As analyzed in the previous section, the tighter bound holds for almost all moduli. Thus, this optimization applies to most of the Barrett reductions with fixed moduli (e.g., ECC, ZKP).

Here is a Python script to test whether a modulus $m$ satisfies the Tighter Bound Criterion:

# Tighter Bound Criterion for Barrett reduction
def tighter_bound_criterion(m):
    def inner_test(m, b):
        # choose k such that b^(k-1) <= m < b^k
        # (using <= so that m == b^k is treated as a (k+1)-limb value)
        k = 1
        while b**k <= m:
            k += 1
        print("k =", k)
        beta = b**(2*k) % m

        # criterion: beta <= m - b^(k-1)
        return beta <= m - b**(k-1)

    # test both b=2^32 and b=2^64
    return inner_test(m, 2**32) and inner_test(m, 2**64)

# P-256 scalar field
assert tighter_bound_criterion(0xffffffff00000000ffffffffffffffffbce6faada7179e84f3b9cac2fc632551)

# P-256 base field
assert tighter_bound_criterion(0xffffffff00000001000000000000000000000000ffffffffffffffffffffffff)

Conclusion

Our analysis demonstrates that the error bound for the quotient approximation $\tilde{q}$ can be tightened from $[q - 2, q]$ to $[q - 1, q]$ for almost all moduli used in practice. This tighter bound eliminates the need for a second subtraction in most cases, enabling faster and more efficient implementations.

We showed how this optimization applies to real-world cryptographic libraries, such as RustCrypto’s P-256 scalar field implementation, where removing the unnecessary subtraction improved performance significantly. These findings are broadly applicable to other cryptographic systems and zero-knowledge proof frameworks that rely on fixed moduli.