Quake III's Q_rsqrt on Apple M4 vs Zen 3: when it actually beats 1/sqrtf in 2026
I tested Quake III’s Fast Inverse Square Root (Q_rsqrt, the famous 0x5f3759df one) on two machines: an Apple M4 Mac mini and a Ryzen 7 5800HS Windows laptop running WSL2. The short version: Q_rsqrt is still faster than 1.0f / sqrtf(x) on both platforms, but the reason and the size of the gap depend more on the compiler flags and target than on the bit hack itself.
My prior expectation was that a 1999 bit hack should have no chance in 2026, with frsqrte / rsqrtss baked into hardware and clang supposedly using them. The numbers said otherwise. On M4 with Apple clang 17 at -O2, Q_rsqrt was 1.55× faster than 1/sqrtf. On Zen 3 with clang 18 at -O2, it was 12× faster — but only because clang refused to rewrite 1/sqrtf until I added -ffast-math, after which the two basically tied. Hand-writing NEON vrsqrte or SSE rsqrtss inside a float -> float wrapper made things several times slower than Q_rsqrt on both targets (5.6× on M4, 3.8× on Zen 3).
While I was at it, I also measured how Newton iteration count (0 / 1 / 2) trades speed for accuracy, and whether Chris Lomont’s 0x5f375a86 constant actually helps. Then I walked back through the IEEE 754 log-approximation math at the end.
The benchmark
Full code is at the bottom of the post. The core is straightforward:
```c
static inline float q_rsqrt_with(uint32_t magic, float number, int newton_iters) {
    float x2 = number * 0.5f;
    float y = number;
    uint32_t i;
    memcpy(&i, &y, sizeof(i));
    i = magic - (i >> 1);
    memcpy(&y, &i, sizeof(y));
    for (int n = 0; n < newton_iters; n++) {
        y = y * (1.5f - (x2 * y * y));
    }
    return y;
}
```
The original code used pointer-cast type punning, which is UB in modern C, so I rewrote it with memcpy. Modern clang collapses memcpy to a bit transfer at compile time, and static inline keeps the loop optimizer happy.
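To make the punning point concrete, here is a minimal sketch (helper names are mine) of the two sanctioned ways to read a float's bits in C11. A pointer cast like `*(uint32_t *)&y` is the strict-aliasing violation the memcpy replaces; both forms below compile down to a plain register move:

```c
#include <stdint.h>
#include <string.h>

/* Defined behavior in C11: copy the object representation. */
static uint32_t bits_memcpy(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Type punning through a union: also defined in C (though not in C++). */
static uint32_t bits_union(float f) {
    union { float f; uint32_t u; } pun = { .f = f };
    return pun.u;
}
```

The union form being UB in C++ is one more reason the memcpy spelling travels best across codebases.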
The variants I measured:
- `0x5f3759df` (the original) with Newton 0 / 1 / 2 iterations
- `0x5f375a86` (Chris Lomont’s optimized constant) with Newton 0 / 1
- `1.0f / sqrtf(x)`
- A hand-written wrapper around NEON `vrsqrte_f32` or SSE `_mm_rsqrt_ss` with Newton 1 / 2
All running over an array of N = 2,000,000 floats sampled log-uniform from $10^{-2}$ to $10^{4}$. Throughput is the minimum of repeated loop times divided by N. Accuracy is measured against 1.0 / sqrt((double)x) as the reference, summarized as max and mean relative error.
On Apple M4 (Mac mini)
Environment
| Item | Value |
|---|---|
| Machine | Mac mini (Mac16,10) |
| CPU | Apple M4, 10 cores (4 Performance + 6 Efficiency) |
| Memory | 16 GB unified memory |
| OS | macOS 26.3 (build 25D125) |
| Compiler | Apple clang 17.0.0 |
| Build | clang -O2 -std=c11 bench.c -o bench -lm |
| Measurement | best-of-5 within the bench |
Results
| Implementation | ns/op (M4) | Max rel err | Mean rel err |
|---|---|---|---|
| Q_rsqrt (0x5f3759df, Newton 0) | 0.102 | 3.438e-02 | 2.294e-02 |
| Q_rsqrt (0x5f3759df, Newton 1) | 0.134 | 1.752e-03 | 9.326e-04 |
| Q_rsqrt (0x5f3759df, Newton 2) | 0.185 | 4.712e-06 | 1.812e-06 |
| Q_rsqrt (Lomont 0x5f375a86, Newton 0) | 0.084 | 3.437e-02 | 2.295e-02 |
| Q_rsqrt (Lomont 0x5f375a86, Newton 1) | 0.113 | 1.751e-03 | 9.332e-04 |
| 1.0f / sqrtf(x) | 0.208 | 8.940e-08 | 2.956e-08 |
| NEON vrsqrte + Newton 1 (via wrapper) | 0.755 | 1.614e-05 | 2.092e-06 |
| NEON vrsqrte + Newton 2 (via wrapper) | 0.925 | 1.304e-07 | 3.567e-08 |
Adding -ffast-math produced essentially identical numbers. M4’s 1.0f / sqrtf(x) is already well optimized at -O2, leaving almost nothing for fast-math to rewrite.
The most surprising number
Q_rsqrt (Newton 1) at 0.134 ns/op vs 1.0f / sqrtf(x) at 0.208 ns/op. M4 P-cores run around 4 GHz, so 0.208 ns/op is roughly 0.8 cycles/element and 0.134 ns/op is 0.5 cycles/element. Both are under one cycle per element, which only happens when SIMD is processing 4 elements in parallel.
The whole loop has been quietly converted to NEON. The body of Q_rsqrt is just two memcpys, some integer arithmetic, and three FMAs for one Newton step. Clang’s loop vectorizer expands that into int32x4_t / float32x4_t widths, four elements at a time.
1.0f / sqrtf(x) also has an ARM path that vectorizes to frsqrte + Newton refinement, but sqrtf’s max-ulp accuracy guarantee is stricter than what Q_rsqrt produces, so it ends up using more refinement steps. Fewer instructions per element wins, and Q_rsqrt happens to have fewer instructions.
The bit hack itself isn’t fast. It’s that the bit hack is shallow and simple enough to vectorize well. Same reasoning: Q_rsqrt with Newton 2 drops to 0.185 ns/op, narrowing the gap with 1/sqrtf to just 0.023 ns. The more accuracy (and instructions) you add to Q_rsqrt, the more its lead evaporates.
Hand-written NEON wrapper is slower
This part was unexpected. A direct wrapper around vrsqrte_f32:
```c
static inline float neon_rsqrt_1(float x) {
    float32x2_t v = vdup_n_f32(x);
    float32x2_t e = vrsqrte_f32(v);
    e = vmul_f32(vrsqrts_f32(vmul_f32(v, e), e), e);
    return vget_lane_f32(e, 0);
}
```
came out at 0.755 ns/op, or 0.925 ns/op with Newton 2. That’s more than 5× slower than Q_rsqrt (Newton 1).
static inline should eliminate the call overhead, but looking at the loop overall:
- `vdup_n_f32(x)` forces a scalar→NEON register move per element
- `vget_lane_f32(e, 0)` forces a NEON→scalar move per element
- The constant scalar↔NEON shuffling makes the loop vectorizer back off and emit a scalar loop
The hardware vrsqrte is called one element at a time, opposite to the int32x4_t-wide path that Q_rsqrt follows.
The rule of thumb if you want NEON: don’t stick NEON intrinsics inside a float -> float function, write the loop itself at float32x4_t width from the outside. Or just write the straightforward 1.0f / sqrtf(x) and let the compiler do it, which on M4 already produces good code.
How many Newton iterations to use
The other thing I wanted to check: was Quake III’s “Newton 1” choice actually correct? The numbers settle this quickly.
| Newton iters | Max rel err | Mean rel err | ns/op (M4) |
|---|---|---|---|
| 0 | 3.44e-02 | 2.29e-02 | 0.102 |
| 1 | 1.75e-03 | 9.33e-04 | 0.134 |
| 2 | 4.71e-06 | 1.81e-06 | 0.185 |
Newton’s method is quadratically convergent in this form, so digits of precision roughly double each iteration. ~3.4% error at 0 iterations, ~0.17% at 1, ~5e-6 at 2. For 3D lighting and normal computation, 0.17% is invisible to human eyes; 3.4% produces visible shading artifacts on surfaces.
Picking Newton 1 was a sound call both for the 1999 CPU budget and for the error curve. The comment // 2nd iteration, this can be removed isn’t just “we wanted speed,” it’s reading the quadratic-convergence behavior of the underlying math.
The Lomont constant
In his 2003 paper Fast Inverse Square Root, Chris Lomont rederived the constant analytically to minimize max relative error after one Newton step, and proposed 0x5f375a86 as slightly better than 0x5f3759df.
Measured on M4:
| Constant | Newton 0 max_rel | Newton 1 max_rel |
|---|---|---|
| 0x5f3759df (original) | 3.438e-02 | 1.752e-03 |
| 0x5f375a86 (Lomont) | 3.437e-02 | 1.751e-03 |
Indistinguishable in practice. The Newton iteration’s feedback is strong enough to absorb the initial-value quality difference. The only place a difference shows up is Newton 0, and even there the agreement is good to three significant figures.
The “magic constant” reputation belongs to the original 0x5f3759df, but once you commit to one Newton iteration, the choice of constant doesn’t matter much.
On x86_64 (Zen 3 / WSL2)
Once I had a feel for Apple Silicon, I wanted to run the same code on a different ISA. I built bench.c under WSL2 on a Windows laptop. The NEON branch automatically drops out under #if defined(__aarch64__), and I added a parallel _mm_rsqrt_ss-based variant for x86.
Environment
| Item | Value |
|---|---|
| CPU | AMD Ryzen 7 5800HS (Zen 3, 8C/16T, base 3.2 GHz / boost ~4.4 GHz) |
| SIMD | AVX2, FMA, F16C, AES-NI (no AVX-512) |
| Host OS | Windows 11 Home (26200) |
| Runtime | WSL2 / Ubuntu 24.04.3 LTS (kernel 6.6.87.2-microsoft-standard-WSL2) |
| Memory | ~7.6 GB allocated to WSL2 |
| Compiler | clang 18.1.3 (Ubuntu) |
| Build | clang -O2 -std=gnu11 -mavx2 -mfma bench.c -o bench -lm |
| Measurement | taskset -c 0 ./bench, best-of-30 within the bench |
I took the minimum across three best-of-30 runs. WSL2’s scheduler jitter is real — single-run timings swing by ±50% easily. Pinning to one core with taskset plus enough repetitions is the minimum needed for stable numbers. When comparing directly against the M4 best-of-5, keep this condition difference in mind.
Results
| Implementation | ns/op (Zen 3 / WSL) | Max rel err | Mean rel err |
|---|---|---|---|
| Q_rsqrt (0x5f3759df, Newton 0) | 0.172 | 3.438e-02 | 2.295e-02 |
| Q_rsqrt (0x5f3759df, Newton 1) | 0.173 | 1.752e-03 | 9.334e-04 |
| Q_rsqrt (0x5f3759df, Newton 2) | 0.197 | 4.717e-06 | 1.815e-06 |
| Q_rsqrt (Lomont 0x5f375a86, Newton 0) | 0.174 | 3.437e-02 | 2.295e-02 |
| Q_rsqrt (Lomont 0x5f375a86, Newton 1) | 0.172 | 1.751e-03 | 9.340e-04 |
| 1.0f / sqrtf(x) (-O2 only) | 2.155 | 8.941e-08 | 2.957e-08 |
| 1.0f / sqrtf(x) (-O2 -ffast-math) | 0.169 | 1.957e-07 | 3.301e-08 |
| SSE rsqrtss (Newton 0, scalar) | 0.291 | 2.586e-04 | 6.415e-05 |
| SSE rsqrtss + Newton 1 (scalar) | 0.663 | 2.040e-07 | 3.371e-08 |
Where Apple Silicon and x86 diverge
The same pattern from M4 — “the bit hack isn’t fast, it just vectorizes well” — shows up on Zen 3, but more extreme.
Q_rsqrt settles at ~0.17 ns/op across all variants because the loop expands cleanly to AVX2 width (8 floats in parallel). At 4.4 GHz boost with 8-wide AVX2, the theoretical peak is roughly 1 / 4.4 GHz / 8 ≈ 0.028 ns per element. 0.17 ns/op corresponds to about 6 cycles / 8 elements ≈ 0.75 cycles per element, which is a reasonable number for this kind of code.
The interesting one is 1.0f / sqrtf(x). At -O2 alone it runs at 2.155 ns/op — over 12× slower than Q_rsqrt. That’s a different universe from M4’s 0.208 ns/op.
The reason is simple: x86 clang 18 at -O2 refuses to rewrite sqrtf to vrsqrtps (the approximate reciprocal-sqrt instruction). sqrtf is required to give a correctly rounded IEEE 754 result, and vrsqrtps only delivers about 12 bits of precision, so the safe-by-default path is to keep emitting scalar vsqrtss. Inspecting the assembly confirms it: the loop is calling vsqrtss + vdivss one element at a time.
Add -ffast-math and 1.0f / sqrtf(x) snaps down to 0.169 ns/op, on par with Q_rsqrt. Clang now replaces the expression with vrsqrtps + a Newton step, and the resulting precision (max 1.96e-7) is actually two orders of magnitude better than Q_rsqrt.
Apple clang on M4 rewrites 1/sqrtf to a frsqrte-based path at plain -O2, while x86 clang refuses to do the same without -ffast-math. Same compiler family, different floating-point defaults per target. Carrying the M4 conclusion straight over to x86 would have missed this entirely, which is the most surprising thing I found this run.
SSE rsqrtss is also slower when wrapped
The “wrapping a SIMD intrinsic in a float -> float function kills vectorization” effect from M4 reproduces directly on x86. The wrapper around _mm_rsqrt_ss runs at 0.291 ns/op, and 0.663 ns/op once you add a Newton refinement — 1.7× and 3.8× slower than the ~0.17 ns/op Q_rsqrt loop.
rsqrtss operates only on the bottom lane of a 128-bit XMM register, so if you sandwich it between _mm_set_ss and _mm_cvtss_f32 inside a scalar-shaped function, the compiler can no longer widen the loop to rsqrtps. Same wall as vget_lane_f32 on ARM.
The practical move on x86 is the same as on ARM: write the loop body at __m256 width from the outside, or accept -ffast-math and lean on 1.0f / sqrtf(x).
When Q_rsqrt actually wins in 2026
Combining the two targets’ results:
- M4 (Apple clang 17, `-O2`): `Q_rsqrt` 0.134 vs `1/sqrtf` 0.208 ns/op. 1.55× win, both already auto-vectorized.
- Zen 3 (clang 18, `-O2`): `Q_rsqrt` 0.173 vs `1/sqrtf` 2.155 ns/op. 12× win, but only because `1/sqrtf` is stuck on scalar `vsqrtss`.
- Zen 3 (clang 18, `-O2 -ffast-math`): `Q_rsqrt` 0.173 vs `1/sqrtf` 0.169 ns/op. Effectively tied, and `1/sqrtf` is two orders of magnitude more accurate.
“Q_rsqrt is still faster in 2026” is only true on x86_64 at plain -O2, with fast-math left off. And even there, the win comes from clang playing it safe on 1/sqrtf, not from Q_rsqrt having an algorithmic edge. Allow -ffast-math and the two basically tie, with 1/sqrtf ahead on accuracy.
Wrapping rsqrtss or vrsqrte in a scalar function is a portable footgun in both directions.
Walking back through the algorithm
Now that the runtime behavior makes sense, here’s the math behind why this works at all. This is the part most Q_rsqrt write-ups lead with — I left it for last so the benchmark context is fresh.
The trick has four moving parts:
- Reading a float’s bit pattern as an integer gives you something close to a scaled, shifted $\log_2 x$
- $y = 1/\sqrt{x}$ corresponds to $\log_2 y = -\tfrac{1}{2}\log_2 x$ in log space
- That log-space arithmetic can be reproduced on the integer bits as a right shift by one + subtraction from a constant
- The resulting estimate is decent but not great, so one Newton iteration tightens it up
IEEE 754 bits as log₂
IEEE 754 single precision (float) lays out its 32 bits as:
| sign | exponent | mantissa |
|---|---|---|
| 1 bit | 8 bits | 23 bits |
For a positive $x = 2^{e}(1+m)$ with $0 \le m < 1$ and $e$ the unbiased exponent, the stored fields are:

- Exponent field $E = e + 127$
- Mantissa field $M = m \cdot 2^{23}$

Reading the bit pattern as an unsigned integer gives:

$$I_x = 2^{23}E + M = 2^{23}(e + m + 127)$$

Since $\log_2 x = e + \log_2(1+m)$, we can approximate $\log_2(1+m)$ over $0 \le m < 1$ with a first-order fit. The least-squares-flavored fit is:

$$\log_2(1+m) \approx m + \sigma, \qquad \sigma \approx 0.0450$$

Substituting:

$$I_x \approx 2^{23}(\log_2 x + 127 - \sigma)$$
Reading a float’s bits as an integer gives you scaled and shifted by constants. That’s the foundation everything else builds on.
Computing 1/√x in log space
We want $y = \frac{1}{\sqrt{x}} = x^{-1/2}$. Taking $\log_2$ on both sides:

$$\log_2 y = -\tfrac{1}{2}\log_2 x$$

Now translate that into the integer representations $I_y$ and $I_x$ of $y$ and $x$. Applying the approximation to both:

$$I_y \approx 2^{23}\left(-\tfrac{1}{2}\log_2 x + 127 - \sigma\right) = -\tfrac{1}{2}I_x + \tfrac{3}{2} \cdot 2^{23}(127 - \sigma)$$

Pull out the second term as a single integer constant $K$:

$$K = \tfrac{3}{2} \cdot 2^{23}(127 - \sigma)$$

Which leaves:

$$I_y \approx K - \frac{I_x}{2}$$

That’s exactly the line:

```c
i = 0x5f3759df - (i >> 1);
```

`(i >> 1)` is $I_x/2$, and `0x5f3759df - ...` is $K - I_x/2$. One integer line covers the log-space halving and the constant offset.
The Newton step
The bit hack alone leaves a few percent of error, so the code finishes with one Newton iteration. We want $y = 1/\sqrt{x}$, so $y^{-2} = x$, i.e. $y$ is a root of $f(y) = \frac{1}{y^2} - x$. Its derivative is:

$$f'(y) = -\frac{2}{y^3}$$

Newton’s iteration becomes:

$$y_{n+1} = y_n - \frac{f(y_n)}{f'(y_n)} = y_n\left(\frac{3}{2} - \frac{x}{2}\,y_n^2\right)$$

The last line of Q_rsqrt, `y = y * (1.5f - (x2 * y * y));`, is this expression directly. `x2 = x / 2` is precomputed at the top.
In this form Newton’s method is quadratically convergent, so each iteration roughly doubles the digits of precision. The bench numbers — 0 iters ~3%, 1 iter ~0.17%, 2 iters ~5e-6 — line up with that ratio.
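To convince yourself the fused form really is the textbook update and not a separate approximation, here is a sketch comparing the two in double precision (function names are mine); algebraically both expand to $\tfrac{3}{2}y - \tfrac{x}{2}y^3$:

```c
#include <math.h>

/* Textbook Newton step for f(y) = 1/y^2 - x, with f'(y) = -2/y^3. */
static double newton_textbook(double x, double y) {
    double f  = 1.0 / (y * y) - x;
    double fp = -2.0 / (y * y * y);
    return y - f / fp;
}

/* The fused form Quake ships: y * (3/2 - (x/2) * y^2). */
static double newton_fused(double x, double y) {
    return y * (1.5 - 0.5 * x * y * y);
}
```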
The whole pipeline
```mermaid
flowchart TD
    A["float x as IEEE 754 bits"] --> B["reinterpret as uint32_t"]
    B --> C["scaled approximation of log2(x)"]
    C --> D["right shift 1 bit = log2(x)/2"]
    D --> E["subtract from 0x5f3759df = -log2(x)/2 + K"]
    E --> F["reinterpret as float ~ 1/sqrt(x)"]
    F --> G["one Newton step"]
    G --> H["final output (max rel err 0.175%)"]
```
The core move is converting “log → subtract → exponentiate” into nothing but bit reinterpretation of a float. No call to log2, no call to exp2, no lookup table. The IEEE 754 layout itself encodes a log-scale, and the code exploits that directly.
Should you write Q_rsqrt today?
Based on the numbers I just collected, no.
Q_rsqrt did beat 1/sqrtf on both M4 and Zen 3, but in both cases the speedup came from “the loop happens to auto-vectorize well,” not from the bit hack being algorithmically clever. Plain 1.0f / sqrtf(x) at -O2 (or -O2 -ffast-math on x86) gets the same treatment from the compiler, with better accuracy, portability, and readability. And wrapping vrsqrte / rsqrtss inside a float -> float function actively slows things down on both targets.
A reasonable priority order is: write 1.0f / sqrtf(x) first and benchmark it. If it’s not enough, drop down to hand-written SIMD at float32x4_t / __m256 width from the outside of the loop. If that’s still not enough, the real fix is usually loop layout and data parallelism, not the per-element math. Q_rsqrt doesn’t fit anywhere in that order. It was the right answer in 1999; in 2026 it’s a reading exercise.
Aside: where the // what the fuck? comment came from
The infamous // what the fuck? comment on Q_rsqrt became public when id Software released the Quake III source under GPL in August 2005. Even inside id, the provenance of the constant wasn’t fully documented; John Carmack himself later said he didn’t write that code.
Later investigation found that:
- Greg Walsh (Ardent Computer) had implemented a similar trick in the late 1980s
- A version popularized by Cleve Moler (the original author of MATLAB) appears to have been the direct ancestor
- Related ideas show up in Kahan/Ng SGI code and IBM bit-twiddling notes
Rather than a single inventor, this is the visible tip of a 1980s–90s floating-point hack culture that happened to surface when Quake III went open source.
Full bench code
bench.c as run on the ARM side. For the x86 side I added a #if defined(__x86_64__) block with _mm_rsqrt_ss wrappers; I’ll show the ARM-only version here.
```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <math.h>
#include <time.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

static const uint32_t MAGIC_Q3     = 0x5f3759dfu;
static const uint32_t MAGIC_LOMONT = 0x5f375a86u;

static inline float q_rsqrt_with(uint32_t magic, float number, int newton_iters) {
    float x2 = number * 0.5f;
    float y = number;
    uint32_t i;
    memcpy(&i, &y, sizeof(i));
    i = magic - (i >> 1);
    memcpy(&y, &i, sizeof(y));
    for (int n = 0; n < newton_iters; n++) {
        y = y * (1.5f - (x2 * y * y));
    }
    return y;
}

static inline float q_rsqrt_quake_n0(float x)  { return q_rsqrt_with(MAGIC_Q3, x, 0); }
static inline float q_rsqrt_quake(float x)     { return q_rsqrt_with(MAGIC_Q3, x, 1); }
static inline float q_rsqrt_quake_n2(float x)  { return q_rsqrt_with(MAGIC_Q3, x, 2); }
static inline float q_rsqrt_lomont_n0(float x) { return q_rsqrt_with(MAGIC_LOMONT, x, 0); }
static inline float q_rsqrt_lomont(float x)    { return q_rsqrt_with(MAGIC_LOMONT, x, 1); }
static inline float std_rsqrt(float x)         { return 1.0f / sqrtf(x); }

#if defined(__aarch64__)
static inline float neon_rsqrt_1(float x) {
    float32x2_t v = vdup_n_f32(x);
    float32x2_t e = vrsqrte_f32(v);
    e = vmul_f32(vrsqrts_f32(vmul_f32(v, e), e), e);
    return vget_lane_f32(e, 0);
}

static inline float neon_rsqrt_2(float x) {
    float32x2_t v = vdup_n_f32(x);
    float32x2_t e = vrsqrte_f32(v);
    e = vmul_f32(vrsqrts_f32(vmul_f32(v, e), e), e);
    e = vmul_f32(vrsqrts_f32(vmul_f32(v, e), e), e);
    return vget_lane_f32(e, 0);
}
#endif

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

#define N 2000000
static float input[N];
static float output[N];

typedef float (*rsqrt_fn)(float);

static void bench(const char* name, rsqrt_fn fn, int repeats) {
    for (int i = 0; i < N; i++) output[i] = fn(input[i]);
    uint64_t best = UINT64_MAX;
    for (int r = 0; r < repeats; r++) {
        uint64_t t0 = now_ns();
        for (int i = 0; i < N; i++) output[i] = fn(input[i]);
        uint64_t dt = now_ns() - t0;
        if (dt < best) best = dt;
    }
    double ns_per_op = (double)best / (double)N;
    double max_rel = 0.0, sum_rel = 0.0;
    for (int i = 0; i < N; i++) {
        double ref = 1.0 / sqrt((double)input[i]);
        double got = (double)fn(input[i]);
        double rel = fabs(got - ref) / ref;
        if (rel > max_rel) max_rel = rel;
        sum_rel += rel;
    }
    printf("%-32s %8.3f ns/op max_rel=%.3e avg_rel=%.3e\n",
           name, ns_per_op, max_rel, sum_rel / (double)N);
}

static volatile float sink;

int main(void) {
    srand(42);
    for (int i = 0; i < N; i++) {
        double u = (double)rand() / (double)RAND_MAX;
        double e = -2.0 + u * 6.0;
        input[i] = (float)pow(10.0, e);
    }
    bench("Q_rsqrt (0x5f3759df, N=0)", q_rsqrt_quake_n0, 5);
    bench("Q_rsqrt (0x5f3759df, N=1)", q_rsqrt_quake, 5);
    bench("Q_rsqrt (0x5f3759df, N=2)", q_rsqrt_quake_n2, 5);
    bench("Q_rsqrt (Lomont, N=0)", q_rsqrt_lomont_n0, 5);
    bench("Q_rsqrt (Lomont, N=1)", q_rsqrt_lomont, 5);
    bench("1.0f / sqrtf(x)", std_rsqrt, 5);
#if defined(__aarch64__)
    bench("NEON vrsqrte + 1 Newton", neon_rsqrt_1, 5);
    bench("NEON vrsqrte + 2 Newton", neon_rsqrt_2, 5);
#endif
    sink = output[N - 1];
    (void)sink;
    return 0;
}
```
ARM build and run:
```sh
clang -O2 -std=c11 bench.c -o bench -lm
./bench
```
For x86, I replaced the NEON block with _mm_rsqrt_ss wrappers, built with -mavx2 -mfma under WSL2’s clang 18, and ran with taskset -c 0 ./bench to pin one core and reduce jitter.
The headline finding from running this on both M4 and Zen 3 wasn’t “Q_rsqrt is still fast.” It was that the same code, same compiler family, gives radically different answers depending on target and flags: a 1.55× edge on M4 at -O2, a 12× edge on Zen 3 at -O2, then a tie on Zen 3 with -ffast-math. The bit hack’s speed in 2026 is a story about compiler defaults, not about the bit hack.