A floating point number f is represented as \(f = s \cdot m \cdot 2^e = s \cdot m \cdot 2^{E-127}\). Here s is the sign (+ or -), m is the mantissa (the significant digits of the number, with \(1.0 \le m < 2.0\)), e is the exponent and E is the "biased" exponent which is actually stored in memory. As an example, \(3 = +1.5 \cdot 2^1\), so s=+, m=1.5, e=1 and E=128. A single precision floating point number (or float32 for short) consists of 32 bits: 1 bit for the sign s, 8 bits for the biased exponent E, and 23 bits for the mantissa m.
Fig. 1: Bit-level representation of a float32 number. The first row shows the sizes of the bit fields while the second row shows the offsets (zero-based numbering).
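To make the bit layout concrete, here is a small sketch (an illustrative addition; the variable names are arbitrary) that extracts the three fields from a float32 and prints them for the example value 3 from above:

#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    float f = 3.0f;                       // example from the text: 3 = +1.5 * 2^1
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // view the float32 bit pattern as an integer

    uint32_t s = bits >> 31;              // 1 sign bit (offset 31)
    uint32_t E = (bits >> 23) & 0xFF;     // 8 bits of biased exponent (offsets 23..30)
    uint32_t m_bits = bits & 0x7FFFFF;    // 23 mantissa bits (offsets 0..22), leading 1 is implicit

    float m = 1.0f + m_bits / float(1 << 23);  // mantissa m in [1, 2)
    int e = int(E) - 127;                      // unbiased exponent e = E - 127
    std::printf("s=%u E=%u e=%d m=%f\n", s, E, e, m);  // prints: s=0 E=128 e=1 m=1.500000
    return 0;
}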
We have a floating point number (ignoring the sign bit from now on) \(x = m \cdot 2^e\) and want to compute \(\frac{1}{\sqrt{x}} = \frac{1}{\sqrt{m \cdot 2^e}} = \frac{1}{\sqrt{m}} \cdot 2^{-e/2}\). A simple approximation is to ignore the mantissa and only care about the exponent. Then we have \(\frac{1}{\sqrt{x}} \approx 2^{-e/2}\). This approximation is exact if m=1. For \(x = 3 = 1.5 \cdot 2^1\), for example, it gives \(2^{-1/2} \approx 0.71\), while the true value is \(1/\sqrt{3} \approx 0.58\).
We now implement the approximation \(\frac{1}{\sqrt{x}} \approx 2^{-e/2} = 2^{-(E-127)/2} = 2^{-E/2 + 127/2}\) in C++. As already mentioned, only the biased exponent E is stored in memory, and the bias term -127 is implicit. We only need one integer subtraction, one bit-shift and one bit-mask, and will call this procedure "bit-magic" from now on. The steps are as follows (illustrated in Fig. 2 and sketched in code below):
Fig. 2: Illustration of the operations applied to the integer representation i of the float32 value.
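For illustration, the operations from Fig. 2 can also be written as a stand-alone sketch (not part of the original listing; it uses std::memcpy instead of the pointer cast used further below, but performs the same shift, mask and subtraction):

#include <cstdint>
#include <cstring>

// "bit-magic" alone, without the Newton refinement
float bit_magic(float x) {
    uint32_t i;
    std::memcpy(&i, &x, sizeof i);               // integer view of the float32 bits
    i = (190u << 23) - ((i >> 1) & 0x7F800000u); // shift by 1, mask the exponent field, subtract from the magic number
    float y;
    std::memcpy(&y, &i, sizeof y);               // exponent field now holds 190 - E/2, so y is roughly 2^(-e/2)
    return y;
}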
Fig. 3: True value of \(1/\sqrt{x}\) in green and approximation computed by "bit-magic" in red for the interval \(x \in [1, 16]\).
Further, an error is introduced if the mantissa is not exactly 1. The "bit-magic" is piece-wise constant, with the constant segments starting at odd e values, continuing over the following even e value range, and ending just before the next odd e value. For an example, see Fig. 3, where a segment starts at \(2^1=2\) and ends just before \(2^3=8\). The true value is \(y_T = 1/\sqrt{x} = 1/\sqrt{m \cdot 2^e} = 1/\sqrt{m} \cdot 2^{-e/2}\) with e fixed to the odd e value and m running from 1 to 4. The value computed by the "bit-magic" is \(y_M = 1/\sqrt{2} \cdot 2^{-e/2}\). The relative error is \(\epsilon_{rel} = (y_M - y_T)/y_T = \frac{\sqrt{m}}{\sqrt{2}} - 1\). This is exactly the behaviour which can be observed in the measured relative error shown in Fig. 4: it starts at ~-0.29 and then goes up to ~0.41, until it jumps back to ~-0.29 again. The relative error therefore has a repeating pattern on a logarithmic scale.
Fig. 4: Relative error of the "bit-magic" (left) and the Newton refinement (right) for the interval \(x \in [1, 16]\).
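To reproduce the bounds quoted above, one can scan the interval [1, 16] and record the extremes of the relative error; a small sketch, assuming the bit_magic helper from the earlier sketch:

#include <algorithm>
#include <cmath>
#include <cstdio>

int main() {
    double min_err = 0.0, max_err = 0.0;
    for (float x = 1.0f; x <= 16.0f; x += 0.001f) {
        double y_true = 1.0 / std::sqrt(double(x));
        double err = (double(bit_magic(x)) - y_true) / y_true;  // relative error of the "bit-magic"
        min_err = std::min(min_err, err);
        max_err = std::max(max_err, err);
    }
    std::printf("min=%f max=%f\n", min_err, max_err);  // roughly -0.29 and +0.41
    return 0;
}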
The "bit-magic" gives a rough estimate for the true value, but can have a relative error of up to 41%. To improve on this, a small number of Newton iterations is applied. Newton's method computes the zero of a function, i.e. it computes y such that \(f(y)=0\). We re-formulate \(y=1/\sqrt{x}\) into a zero-finding problem, that is \(f(y) = y^2-1/x = 0\). Applying the Newton update \(y_{k+1}=y_k - f(y_k)/f'(y_k)\) gives a better approximation for y than we had before, where f'(y) is the first derivative of f(y). The maximum relative error decreases fast:
#include <cstdint>
#include <cstddef>

float fast_inv_sqrt(float x) {
    float y = x;                                    // y holds the current guess for 1/sqrt(x)
    uint32_t *i = reinterpret_cast<uint32_t *>(&y); // i points to current guess y
    const uint32_t exp_mask = 0x7F800000;           // 0xFF << 23
    const uint32_t magic_number = 0x5F000000;       // 190 << 23

    // initial guess using magic number
    *i = magic_number - ((*i >> 1) & exp_mask);

    // refine guess using a small number of Newton iterations
    const size_t num_newton_iter = 2;
    for (size_t k = 0; k < num_newton_iter; ++k) {
        y = (x * y * y + 1) / (2 * x * y);
    }
    return y;
}
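A short usage example (not part of the original article) that compares the result against the standard library for the value 3 used earlier:

#include <cmath>
#include <cstdio>

// assumes fast_inv_sqrt() from the listing above
int main() {
    float x = 3.0f;
    float approx = fast_inv_sqrt(x);
    float exact = 1.0f / std::sqrt(x);
    std::printf("approx=%f exact=%f rel_err=%e\n", approx, exact, (approx - exact) / exact);
    return 0;
}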
Compared to the original hack, the version analyzed in this article only uses the exponent bit field of a float32 number. This gives a relative error that is an order of magnitude worse, but it also makes analyzing the "bit-magic" much easier. Using only two Newton iterations drives the worst-case relative error down to <0.2%. The implementation ignores the sign bit and does not work for denormalized float32 numbers.
Harald Scheidl, 2020