Algorithms used in decimal-scaled¶
Catalogue of the published algorithms the crate evaluates, with academic citations and the source files where each is implemented. This is engineering credit for the ideas the crate's own code is built from. The crate incorporates no third-party source code; see LICENSE-THIRD-PARTY.
Integer arithmetic¶
Möller–Granlund magic-number division by an invariant¶
Used for the ÷ 10^SCALE step in every Mul / Div operator and in
rescale. The divisor 10^SCALE is known at compile time, so a
pre-computed magic constant and a single 128-bit multiplication plus
a one-step correction replace a generic divide instruction. The
crate ships a 39-entry table (MG_EXP_MAGICS, scales 0–38) and the
256/128-bit fast-2-word divide built around it.
Möller, N. and Granlund, T. (2011). "Improved Division by Invariant Integers." IEEE Transactions on Computers 60(2), 165–175. DOI: 10.1109/TC.2010.143.
Earlier foundational reference for the magic-multiplier idea:
Granlund, T. and Montgomery, P. L. (1994). "Division by Invariant Integers using Multiplication." Proc. PLDI '94. ACM, 61–72. DOI: 10.1145/178243.178249.
Implementation: src/algos/support/mg_divide.rs (divmod_pow10_2word,
MG_EXP_MAGICS, mg_reciprocal). The
magic table is re-derived from the paper's reciprocal formula by a
const fn generator (mg_reciprocal) that performs the
floor(2^256 / (10^k << s)) long division at compile time, and the
divide body follows the paper's normalise / estimate / single-add-back
algorithm.
Prior art and clean-room declaration. Applying Möller–Granlund to
the constant divisor 10^SCALE for fixed-point rescaling is prior art:
it appears in
primitive_fixed_point_decimal
by Wu Bingzheng (MIT-licensed), which we acknowledge as the inspiration
for this approach. This crate's implementation is nonetheless an
independent clean-room rewrite derived directly from the Möller–Granlund
2011 paper — its code expression, structure, naming, comments, and the
generated magic table are the crate's own and were not copied or adapted
from that crate's source. The computed magics are mathematically
determined facts; their bit-identity to any correct implementation is a
property of the algorithm, not of shared code.
Further reading:
- Wikipedia - Division algorithm § Division by a constant
- Wolfram MathWorld - Division
- Niels Möller's homepage: https://www.lysator.liu.se/~nisse/
- Torbjörn Granlund's homepage (GMP project): https://gmplib.org/~tege/
Limb storage shape — [u64; N]¶
The wide integer is a single const-generic type, Int<N> (and its
unsigned sibling Uint<N>), stored as a little-endian [u64; N]
array of N u64 limbs — one type for every width, not a family of
per-tier IntNNN types. A power-of-two width is just N = bits / 64
(e.g. Int<4> is 256-bit, Int<8> is 512-bit), and the half-width
tiers fall out at the in-between N. The [u64; N] choice exposes
native u64 × u64 → u128 and u128 / u64 hardware instructions
directly (Zen 4 / Intel Golden Cove issue one widening 64×64 mul per
cycle in steady state) where a [u128; N] layout would have to
soft-emulate every mul as four u64 × u64 sub-products plus a nested
carry chain. The public from_limbs_le(limbs: [u64; N]) -> Self /
limbs_le(self) -> [u64; N] API takes and returns the [u64; N]
limbs directly — each body is a one-line move (Self { limbs } /
self.limbs), with no shape conversion.
Implementation: the [u64; N] storage and from_limbs_le / limbs_le
accessors in src/int/types/mod.rs; the primitive limb operations
(add, shift, compare) in src/int/algos/support/limbs.rs.
Base-2⁶⁴ schoolbook multiplication¶
Standard O(n²) schoolbook on u64 limbs, using the native
u64 × u64 → u128 widening multiply for each sub-product.
Karatsuba was implemented (and tested) but lost to schoolbook at
every tier this crate emits because the per-cycle widening mul
throughput beats Karatsuba's 3·(n/2)² + add/sub overhead until
n > ~64 limbs (beyond our widest tier). The Karatsuba code is
retained in src/int/algos/mul/mul_karatsuba.rs as a reference / future
SIMD baseline.
Implementation: src/int/algos/mul/mul_schoolbook.rs::mul_schoolbook.
Further reading:
- Wikipedia - Multiplication algorithm
- Wolfram MathWorld - Multiplication
Möller–Granlund 2-by-1 invariant reciprocal in Knuth's q̂ loop¶
Inside Knuth Algorithm D,
each quotient limb's q̂ estimate would otherwise cost a 64-iteration
bit-recovery loop. Precomputing the divisor's 2-by-1 reciprocal
(Möller-Granlund Algorithm 4) once per call collapses the q̂ step
to ~2 multiplies plus a constant fix-up. Setup cost is one hardware
u128 / u64 divide; amortised across the m + 1 quotient-limb
estimations the per-limb work drops by ~30×.
Möller, N. and Granlund, T. (2011). "Improved Division by Invariant Integers." IEEE Transactions on Computers 60(2), 165–175.
Implementation: src/int/algos/div/div_mg.rs::Mg2By1 (renamed from MG2by1U64).
A 3-by-2 sibling (Mg3By2, also implemented per MG Algorithm 5
with the paper's Algorithm 6 reciprocal refinement that accounts
for d0) is kept available for arbitrary-divisor use cases. It was
not faster than 2-by-1 + refinement loop on decimal divisors
because the refinement loop almost never fires for our
(well-conditioned) divisors, so the 3-by-2's per-call extra
multiply costs more than it saves.
Knuth Algorithm D — multi-limb divide¶
Textbook Algorithm D (Knuth, TAOCP Vol. 2, §4.3.1) adapted to base
2⁶⁴. Normalise the divisor so its top limb has the high bit set,
then for each quotient limb estimate q̂ from the top two limbs of
the running dividend (via the MG 2-by-1 reciprocal above), refine
once against the second-from-top divisor limb, multiply-subtract,
and add-back on the rare miss. Complexity O(m·n) limb-ops vs the
shift-subtract fallback's O((m+n)·n·64).
Implementation: src/int/algos/div/div_knuth.rs::div_knuth,
dispatched by src/int/policy/div_rem.rs.
Base-2⁶⁴ schoolbook long division (u64-divisor fast path)¶
For divisors that fit a single u64 word, the crate uses one hardware
u128 / u64 divide per dividend limb — every 10^scale with
scale ≤ 19 lands here. This is the standard schoolbook long
division, now riding the native hardware instruction directly
(previously this path had to split each u128 limb into 64-bit halves
and do two divides per limb).
Implementation: src/int/algos/div/div_rem.rs::div_rem (fast path B,
single-limb hardware divide); src/algos/support/mg_divide.rs::div_long_256_by_128
(256-bit specialisation for D38 transcendentals, still u128-typed by design).
Further reading:
- Wikipedia - Long division
- Wolfram MathWorld - Long division
Binary shift-subtract long division (const fallback)¶
Last-resort divide for arbitrary multi-limb divisors in const
context. Runtime callers route through the
div_knuth dispatch
dispatcher to Knuth instead; this path stays as the const fn
backstop for wrapping_div / wrapping_rem and as the setup path
for the MG reciprocal computation.
Implementation: src/int/algos/div/div_rem.rs::div_rem (general / shift-subtract path).
Further reading:
- Wikipedia - Division algorithm § Restoring division
- Wolfram MathWorld - Division
Roots¶
Newton iteration for integer square root (isqrt)¶
x_{k+1} = (x_k + N / x_k) / 2, started from a power-of-2
overestimate so the sequence decreases monotonically. Converges
quadratically.
At wide tiers the per-iteration divide routes through
div_knuth dispatch,
i.e. Knuth Algorithm D with the MG 2-by-1 q̂ reciprocal. Earlier
versions used the const-context limbs_divmod_u64 shift-subtract
path here, which made wide-tier sqrt 24–92× slower than necessary;
swapping to the runtime dispatcher closes that gap completely. D38
keeps its hand-tuned isqrt_256 because the 256-bit specialisation
out-paces the generic path at single-limb scales.
Implementation: src/algos/support/mg_divide.rs::isqrt_256 (D38 2-limb fast path),
src/int/algos/isqrt/isqrt_newton.rs::isqrt_newton (wide-tier Newton kernel).
Further reading:
- Wikipedia - Methods of computing square roots § Heron's method
- Wikipedia - Newton's method § Description (the parent recurrence)
- Wolfram MathWorld - Square Root, Newton's Method
Newton iteration for integer cube root (icbrt)¶
x_{k+1} = (2·x_k + N / x_k²) / 3. Same monotone-decreasing setup
as isqrt. Implementation: src/algos/support/mg_divide.rs::icbrt_384 (D38 3-limb fast path),
src/int/algos/icbrt/icbrt_newton.rs::icbrt_newton (wide-tier Newton kernel),
src/macros/wide_roots.rs (decl_wide_roots! emits correctly-rounded
wrappers per wide tier).
Further reading:
- Wikipedia - Cube root § Numerical methods
- Wolfram MathWorld - Cube Root
Correctly-rounded sqrt / cbrt¶
After the integer root q = floor(N^{1/k}), the crate decides
"round up to q+1?" by an integer comparison of N against the
midpoint, which is an integer for sqrt (the midpoint test is
N − q² > q) and a multiple of 1/8 for cbrt (the test is
8N ≥ (2q + 1)³). For integer N the midpoint is never an integer
in either case, so the rounding decision is mode-independent -
every RoundingMode agrees with the half-to-nearest choice.
Implementation: src/algos/support/mg_divide.rs::sqrt_raw_correctly_rounded /
cbrt_raw_correctly_rounded (D38 path); the wide-tier counterparts in
src/macros/wide_roots.rs.
Further reading:
- Wikipedia - IEEE 754 § Roundings to nearest (the "correctly rounded" contract the crate emulates at the storage scale)
Transcendentals¶
Table-driven ln / exp / sin_cos / atan / hyperbolic (Tang) — narrow-GUARD bands¶
For working-scale bands clustered around each tier's "design SCALE"
(roughly half the storage's decimal capacity), the strict-mode wide
kernels first consult a precomputed lookup at the working width and
fall back to the artanh / Taylor path only on a table miss. The
table shape follows P. T. P. Tang's table-driven decomposition: split
the argument as x = T_i + r where T_i is drawn from a small
(256- or 512-entry) lookup of decimal-aligned breakpoints, then
evaluate the short residual series on r. Convergence on r is one
or two pair-terms instead of the artanh / Taylor path's ~p / 2, so
end-to-end time at the narrow-GUARD bands drops by 3× to 34× over
the canonical fallback (peak measured 33.8× at D1232<615>).
Tables ship as plain const arrays sized for each tier's
(width, working-scale band) pair —
D57<18-22>, D115<57>, D153<70-82>, D307<140-160>, D462<225-235>,
D616<300-315>, D924<455-465>, D1232<610-620> — and are bit-stable
across builds.
Tang, P. T. P. (1989). "Table-driven implementation of the exponential function in IEEE floating-point arithmetic." ACM Transactions on Mathematical Software 15(2), 144–157. DOI: 10.1145/63522.214389.
Tang, P. T. P. (1990). "Table-driven implementation of the logarithm function in IEEE floating-point arithmetic." ACM Transactions on Mathematical Software 16(4), 378–400. DOI: 10.1145/98267.98294.
Implementation: src/algos/exp/exp_tang.rs::exp_tang (all tiers),
src/algos/ln/ln_tang.rs::ln_tang (all tiers),
src/algos/trig/sincos_tang.rs and sincos_tang_3limb_s18_22.rs (sin/cos),
src/algos/trig/hyper_exp_identity.rs (hyperbolic),
src/algos/trig/atan_tang_3limb_s44_56.rs and inverse_tang_3limb_s18_22.rs (atan/asin/acos).
Dispatched per (width, scale) by src/policy/ matchers. Outside the listed
bands the artanh / Taylor path below handles the call.
ln via multi-level sqrt argument reduction + Mercator's artanh¶
Range-reduce x = 2^k · m with m ∈ [1, 2). Then apply l square
roots to shrink m → m^(1/2^l), so the artanh-series argument
t = (m' − 1)/(m' + 1) is bounded by 0.35 · 2^-l. The series then
converges in ~p / (2 + 2l) pair-terms instead of ~p / 2, traded
against l extra wide-isqrt calls. We pick l ≈ √p_bits / 4
empirically.
Reassembly: ln(m) = 2^(l+1) · artanh(t) (the extra 2^l factor
folds into the bit-shift since ln(m^(1/2^l)) = ln(m) / 2^l). The
storage-scale identity ln(x) = k·ln 2 + ln(m) stays unchanged.
Brent, R. P. (1976). "Multiple-precision zero-finding methods and the complexity of elementary function evaluation." In Analytic Computational Complexity, Academic Press.
Implementation: src/macros/wide_transcendental.rs::ln_fixed (wide tiers). The
fastnum crate uses the same sqrt-halving recursion as its ln_
inner — we cross-validated against it.
ln via plain artanh series (legacy reference)¶
Range-reduce x = 2^k · m with m ∈ [1, 2), then compute
ln(m) = 2·artanh((m − 1) / (m + 1)). The argument t = (m − 1) /
(m + 1) lies in [0, 1/3], so the Mercator series
artanh(t) = t + t³/3 + t⁵/5 + … converges as roughly 3^(-n) -
about 22 terms per decimal digit.
Mercator's logarithm series:
Mercator, N. (1668). Logarithmotechnia. (Cited via Borwein & Borwein, "Pi and the AGM", 1987, Wiley.)
The artanh form is a textbook identity; combined with bit-by-bit range reduction it's sometimes called the "Cody–Waite" approach after the influential 1980 implementation:
Cody, W. J. and Waite, W. (1980). "Software Manual for the Elementary Functions." Prentice-Hall.
Implementation: src/algos/ln/ln_series_2limb.rs::ln_fixed (D38 narrow path),
src/macros/wide_transcendental.rs::ln_fixed (every wide tier — D57 / D76 / D115 / D153 / D230 / D307 / D462 / D616 / D924 / D1232).
Further reading:
- Wikipedia - Mercator series (the
ln(1+x)expansion at the top) - Wikipedia - Inverse hyperbolic functions § Series expansions (the
artanhseries the crate evaluates) - Wolfram MathWorld - Mercator Series, Inverse Hyperbolic Tangent
exp via two-stage argument reduction + Taylor¶
Wide-tier exp uses Brent's two-stage argument reduction (dashu's
exp_internal pattern, traced to Brent 1976 §3):
- Stage 1 — modular:
k = round(v/ln 2);s = v − k·ln 2, giving|s| ≤ ln 2 / 2 ≈ 0.347. - Stage 2 — multiplicative:
s ← s / 2^nwithn ≈ √(precision_bits). After both reductions the Taylor argument satisfies|r| < 2⁻ⁿ ≈ 2⁻√ᵖ, so the series converges inO(√p)terms instead ofO(p). n is chosen via integersqrt(3w + 1)(usingw·log₂(10) ≈ 3.32was the bit estimate). - Taylor on the reduced argument.
- Reassembly: square
ntimes to undo stage 2, then bit-shift bykto undo stage 1.
The squaring step replaces ≈ 60 Taylor mul+div pairs with
n plain wide multiplies, which is the dominant saving (a divide
is more expensive than a multiply at our widths even after the
u64 storage migration).
Brent, R. P. (1976). "Fast multiple-precision evaluation of elementary functions." Journal of the ACM 23(2), 242–251. DOI: 10.1145/321941.321944.
Implementation: src/macros/wide_transcendental.rs::exp_fixed (wide tiers);
narrow tier in src/algos/exp/exp_series_2limb.rs::exp_fixed.
Further: dashu-float::exp::Context::exp_internal is the modern
reference implementation we cross-checked against.
Cached 10^w divisor in Taylor / AGM / Newton inner loops¶
Every mul(a, b, w) / div(a, b, w) in wide_transcendental does
a round_div against 10^w. Computing lit(10).pow(w) from
scratch on each call costs ~10–50 µs at D307<150> (w=180, ~8 wide
squarings followed by ~180 cumulative multiplies). The cached
variants mul_cached(a, b, pow10_w) / div_cached(...) accept
a precomputed 10^w and skip the recomputation; the inner Taylor
/ AGM / Newton loops hoist let pow10_w = pow10(w); once and pass
it down, saving ~20 % on every wide transcendental.
Implementation: src/macros/wide_transcendental.rs::mul_cached,
div_cached; consumed by exp_fixed, ln_fixed, sin_taylor,
atan_taylor.
exp via plain Taylor (legacy reference)¶
Range-reduce x = k · ln 2 + s with |s| ≤ ln 2 / 2, then
exp(x) = 2^k · exp(s). The Taylor series for exp(s) converges
absolutely on the reduced interval. The same Cody–Waite shape.
Implementation: src/algos/exp/exp_series_2limb.rs::exp_fixed (D38 narrow path),
src/macros/wide_transcendental.rs::exp_fixed (wide tiers).
Further reading:
- Wikipedia - Exponential function § Computation (the Taylor series and the
2^k · exp(s)reduction) - Wikipedia - Taylor series § Exponential function
- Wolfram MathWorld - Exponential Function, Maclaurin Series
sin via [0, π/4] reduction with sin / cos branching¶
Reduce v mod τ to r ∈ [−π, π]; fold to |r| ∈ [0, π/2] via
sin(π − x) = sin(x); then route to two sub-kernels based on
whether the reduced argument lies in the lower or upper half of
[0, π/2]:
r ≤ π/4:sin_taylor(r)— standardr − r³/3! + r⁵/5! − …series.r > π/4:cos_taylor(π/2 − r)—1 − r²/2! + r⁴/4! − …series on an argument in[0, π/4]. Cos's leading constant-1term means it converges marginally faster than sin at the same argument, and the [0, π/4] cap halves the Taylor argument range vs the historic [0, π/2].
Reference: Muller 2016 §11.4 attributes the "switch to cos at π/4" trick to standard mid-precision practice.
Implementation: src/macros/wide_transcendental.rs::sin_fixed
(routes to sin_taylor / cos_taylor).
sin_cos joint kernel¶
sin_cos_strict(self) -> (sin, cos): shares the Taylor evaluation
between sin and cos, recovers cos via the Pythagorean identity
|cos| = √(1 − sin²). Net cost ≈ one sin_strict + one wide
sqrt + one wide mul vs the historic two independent sin
evaluations (cos = sin(x + π/2)). 2–3× faster when both values
are needed.
Pattern adapted from
fastnum::decimal::dec::math::sin_cos.
Implementation: src/macros/wide_transcendental.rs::sin_cos_fixed
and sin_cos_strict.
sin / cos via plain range-reduced Taylor (legacy reference)¶
Earlier implementation reduced to [0, π/2] and ran sin Taylor
directly without the cos branch; cos(x) = sin(x + π/2) was a
full second sin evaluation. Replaced by the variants above.
Implementation: src/algos/trig/trig_series_2limb.rs::sin_fixed (D38 narrow path),
src/macros/wide_transcendental.rs::sin_fixed / sin_taylor (wide tiers).
Further reading:
- Wikipedia - Taylor series § Trigonometric functions (the
sin x = x − x³/3! + …andcos x = 1 − x²/2! + …series) - Wolfram MathWorld - Sine, Cosine, Maclaurin Series
atan via per-width argument halvings + Taylor¶
The identity atan(x) = 2·atan(x / (1 + √(1 + x²))) halves the
argument; applying it l times reduces |x| by ~2^l, then the
Taylor series for atan converges in ~p_bits / (2l) terms.
The halving count is chosen per working scale w:
w < 60→ 5 halvings (D38 / D18 strict path)60 ≤ w < 110→ 6 halvings (D57 / D76 / light D115)w ≥ 110→ 7 halvings (D115 / D153 / D230 / D307 / D462 / D616 / D924 / D1232)
Break-even rationale: each halving costs ~one wide mul + one wide
sqrt + one wide div; each saved Taylor term saves ~one wide mul.
The trade-off depends on p_bits and sits in the 5–7 range for
our tiers.
Implementation: src/algos/trig/trig_series_2limb.rs::atan_fixed (D38 narrow path),
src/macros/wide_transcendental.rs::atan_fixed (wide tiers).
atan2 with max-branch quotient selection¶
atan2(y, x) evaluates atan(y/x) plus a quadrant offset. The
historical implementation always fed y/x to atan_fixed, losing
~log₂(|y/x|) bits when |y| ≫ |x| because atan_fixed's
argument-halving cascade had to consume them. The current
implementation max-branches: it feeds atan_fixed whichever of
y/x or x/y has |·| ≤ 1 and applies the identity
atan(t) = sign(t)·π/2 − atan(1/t) for |t| > 1 to recover the
quotient. Eliminates the asymptotic-edge precision loss; modest
speed win at any |y/x| significantly different from 1.
Implementation: src/algos/trig/trig_series_2limb.rs::atan2_kernel (D38 narrow path),
src/macros/wide_transcendental.rs::atan2_strict (wide tiers).
asin / acos two-range kernel¶
Earlier path: asin(x) = atan(x / √(1 − x²)). At |x| → 1 the
1 − x² subtraction cancelled all leading bits, losing ≈
log(1/(1−|x|²)) digits of precision and risking 1-ULP error at
the asymptotic edge.
Two-range kernel preserves the 0-ULP contract at every representable input:
|x| ≤ 0.5: existingatan(x / √(1 − x²))path. At this range1 − x² ∈ [0.75, 1]— no cancellation, full precision.|x| > 0.5: half-angle identityasin(|x|) = π/2 − 2·asin(√((1 − |x|)/2)). Recurses once on√((1 − |x|)/2) ∈ (0, 0.5], which hits the stable branch. The inner(1 − |x|)/2is exact (no cancellation:|x| ≤ 1guarantees1 − |x| ≥ 0), so the precision floor scales with the working scale instead of with the input's distance from 1.
acos shares the same kernel via acos(x) = π/2 − asin(x).
Implementation: src/algos/trig/trig_series_2limb.rs / src/algos/trig/inverse_tang_3limb_s18_22.rs
(asin_strict, acos_strict for D38 / D57) and the wide-tier variants in
src/macros/wide_transcendental.rs (asin_strict,
asin_strict_with, acos_strict, acos_strict_with).
Further reading:
- Wikipedia - Inverse trigonometric functions § Infinite series (the
atanTaylor series) - Wikipedia - Inverse trigonometric functions § Argument halving (the halving identity)
- Wolfram MathWorld - Inverse Tangent
π via Machin's formula (wide tier only)¶
π = 16·atan(1/5) − 4·atan(1/239). Each atan is evaluated via the
crate's Taylor implementation; with the small arguments the series
converges fast.
Machin, J. (1706). Cited via Beckmann, P. (1971). A History of π. St. Martin's Press.
Implementation: src/macros/wide_transcendental.rs::pi. (D38
embeds π to 63 fractional digits as a literal - no series at run
time, since the constant fits the working width comfortably.)
Further reading:
- Wikipedia - Machin-like formula (the
π = 16 atan(1/5) − 4 atan(1/239)equation at the top) - Wolfram MathWorld - Machin's Formula, Pi Formulas
Hyperbolic functions¶
Composed from exp/ln:
- sinh(x) = (eˣ − e⁻ˣ) / 2
- cosh(x) = (eˣ + e⁻ˣ) / 2
- tanh(x) = sinh(x) / cosh(x)
- asinh(x) = ln(x + √(x² + 1)) (with the x ≥ 1 form factored as
ln(x) + ln(1 + √(1 + 1/x²)) to keep x² in the working width)
- acosh(x) = ln(x + √(x² − 1)) (analogous factoring for x ≥ 2)
- atanh(x) = ln((1 + x) / (1 − x)) / 2
All textbook identities - no specific paper attribution.
Implementation: src/algos/trig/trig_series_2limb.rs (D38 narrow path),
src/algos/trig/hyper_exp_identity.rs / src/macros/wide_transcendental.rs (wide tiers).
Further reading:
- Wikipedia - Hyperbolic functions § Definitions in terms of the exponential function (the
sinh/cosh/tanhidentities) - Wikipedia - Inverse hyperbolic functions § Logarithmic forms (the
asinh/acosh/atanhlog-forms) - Wolfram MathWorld - Hyperbolic Functions, Inverse Hyperbolic Functions
Rounding¶
Half-to-even (banker's rounding) and the IEEE-754 family¶
The crate's default rounding rule. Implementation in
src/support/rounding.rs::should_bump, dispatched per
RoundingMode via a strategy hook.
IEEE Std 754-2019. "IEEE Standard for Floating-Point Arithmetic." IEEE Standards Association.
Further reading:
- Wikipedia - Rounding § Round half to even (the tie-breaking rule the crate uses by default)
- Wikipedia - IEEE 754 § Roundings to nearest
Constants¶
The mathematical constants in src/types/consts/d38.rs / src/types/consts/wide.rs (pi, tau,
half_pi, quarter_pi, e, golden) are stored as raw Int<4>
(256-bit) integer literals at SCALE_REF = 75 for D18/D38/D76, with
wider reference constants in wide.rs for D153/D307 and beyond. Sources:
pi,tau,half_pi,quarter_pi: ISO 80000-2.e: OEIS A001113.golden: OEIS A001622.
Cross-over algorithms¶
- Karatsuba multiplication. Implemented in
src/int/algos/mul/mul_karatsuba.rs::mul_karatsubaand dispatched to bysrc/int/policy/mul.rswhen both operands are equal-length and at leastKARATSUBA_THRESHOLD = 256u64 limbs. Every shipped tier (widest work-int Int12288 = 192 u64 limbs, normal arithmetic at ≤ 96 limbs) sits below the threshold, so the canonical path is u64 schoolbook in practice. An M2 micro-bench at L = 16 – 96 limbs measured schoolbook 1.07× – 1.92× faster than Karatsuba everywhere: the LLVM-unrolled limb-by-limbu64 × u64 → u128schoolbook keeps both multiplier ports saturated, while the recursive3·(n/2)² + add/subdecomposition pays an allocation (Vecscratch forz0,z1,z2,sum_a,sum_b) per recursive call that swamps the asymptotic win at any length this crate emits. The implementation and a property-test oracle against schoolbook are retained for future use — SIMD widening, extra-wide tiers past D1232, or a scratch-passing rewrite that removes the heap allocations. (Karatsuba, A. and Ofman, Yu. (1962). Doklady Akad. Nauk SSSR 145, 293–294.) Anatoly Karatsuba (1937–2008) and Yuri Ofman are both deceased; see the Wikipedia biography links below. Further reading: Karatsuba algorithm, Anatoly Karatsuba bio, Yuri Ofman bio, MathWorld - Karatsuba Algorithm. - AGM-based ln / exp (Brent–Salamin 1976).
ln_strict_agm(every wide tier) uses Brent's identityln(s) ≈ π / (2 · AGM(1, 4/s))with range reductionln(x) = ln(x · 2^m) − m·ln 2.exp_strict_agmuses Newton's iteration onln_strict_agm. Both converge quadratically -O(log p)iterations vs the artanh path'sO(p)series terms - so they win asymptotically as working scale grows. Currently exposed as the alternate path; the canonicalln_strict/exp_strictstays on the artanh / Taylor implementations until a bench at the relevant working scale shows AGM winning by a benchmarked margin. Caveat: the present implementation runs the AGM iteration at the same working scalewas the artanh path; at storage scales beyond ~30 the early- phasesqrt(a·b)step's truncation error amplifies and the output drops to ~p/2 bits of precision. Brent §3 fixes this by raising intermediate AGM precision; recorded as a follow-up. (Brent, R. P. (1976). "Fast multiple-precision evaluation of elementary functions." J. ACM 23(2), 242–251.) Richard Brent is at ANU - homepage: https://maths-people.anu.edu.au/~brent/. Further reading: Arithmetic–geometric mean (theaₙ₊₁ = (aₙ+bₙ)/2,bₙ₊₁ = √(aₙ bₙ)recurrence), Gauss–Legendre algorithm (the same AGM iteration applied to π), MathWorld - Arithmetic-Geometric Mean. - Burnikel–Ziegler recursive division.
div_burnikel_ziegler_with_knuthinsrc/int/algos/div/div_burnikel_ziegler_with_knuth.rsis the recursive wrapper; its base case is the in-crate Knuth Algorithm D port (div_knuth,src/int/algos/div/div_knuth.rs, TAOCP §4.3.1 adapted to base 2⁶⁴). Both functions sit alongside the canonical const-fndiv_rem- the canonical path is unchanged, anddiv_knuth/div_burnikel_ziegler_with_knuthare exposed for bench-driven promotion. Knuth'sO(m·n)multi-limb shape beats the shift-subtract path'sO((m+n)·n·64)for any multi-limb divisor; BZ's recursion adds value only past the threshold (currentlyBZ_THRESHOLD = 16u64 limbs) and the full §3 two-by-one / three-by-two recursion is recorded as the next layer to add once a bench shows it winning at this crate's widths. (Burnikel, C. and Ziegler, J. (1998). "Fast recursive division." MPI-I-98-1-022; Knuth, D. E. (1981). The Art of Computer Programming, Vol. 2: Seminumerical Algorithms, §4.3.1.) Donald Knuth's homepage: https://www-cs-faculty.stanford.edu/~knuth/. Further reading: Division algorithm § Newton–Raphson division and recursive division (no dedicated BZ article, but the parent page lists the recursive-division family), MathWorld - Long Division. The Burnikel–Ziegler tech report is the canonical algorithm reference: MPI-I-98-1-022.
Evaluated and not used¶
Algorithms whose implementation was researched, prototyped, or analytically vetted, and rejected for this crate's mix of fixed-point storage, medium working widths, and software-only target. Each entry records the reason so a future contributor does not relitigate the same trade-off.
Comba multiplication¶
Single-pass accumulator-per-output-column schoolbook variant: each
output limb collects its sub-products into a wide accumulator stream,
folding the carry chain into the loop body instead of a separate
sweep. Evaluated as a candidate replacement for the canonical
limb-by-limb schoolbook inside mul_schoolbook. Not adopted: at
every length this crate emits (≤ 96 u64 limbs in the widest work
integer) the LLVM-unrolled schoolbook already saturates both
mulhi / mullo ports on Zen 4 / Golden Cove, so the Comba
reordering wins nothing in steady-state. The crossover where
Comba's column accumulator pays off lies past every shipped tier.
Comba, P. G. (1990). "Exponentiation cryptosystems on the IBM PC." IBM Systems Journal 29(4), 526–538. DOI: 10.1147/sj.294.0526.
Johansson denominator-collection for atan¶
Fredrik Johansson's medium-precision elementary-function paper
proposes collecting per-term denominators in the inverse-trig Taylor
series so that the loop performs one deferred long-division instead
of O(p) per-term divides. Evaluated as a candidate speed-up for
atan_fixed at the wide tiers. Not adopted: the accumulated
denominator product overflows a single wide-limb well before the
Taylor loop terminates at this crate's widths, so the supposed single
deferred divide degenerates back into per-term divides plus extra
book-keeping. The crate's per-width argument-halving cascade ahead
of the Taylor step is the larger lever and is already taken (see
the atan Taylor section above).
Johansson, F. (2015). "Efficient implementation of elementary functions in the medium-precision range." 22nd IEEE Symposium on Computer Arithmetic (ARITH-22).
Further reading: arXiv:1410.7176 (preprint).
CORDIC¶
Coordinate Rotation Digital Computer — bit-by-bit shift-add-rotation iteration popular in hardware floating-point. Not adopted: each iteration delivers ~one bit of accuracy, so the wide tiers (D1232<615> ≈ 2046 working bits) would need thousands of iterations versus the artanh / Taylor path's tens of pair-terms. CORDIC's hardware win comes from its shift-add primitives being one cycle each; in software the wide bit-shifts plus fixed-point sign-table lookups cost more per bit than a wide multiply delivers in tens of bits, so the asymptotic loss is decisive.
Further reading: - Wikipedia — CORDIC (the rotation-mode and vectoring-mode iterations). - Wolfram MathWorld — CORDIC.
Direct mpfr-style arbitrary-precision libmpfr port¶
Considered as a "just wrap MPFR" alternative for the wide-tier transcendentals. Not adopted: MPFR is LGPL-licensed (with the GMP backend either LGPL or GPL depending on build), incompatible with this crate's MIT-OR-Apache-2.0 licence. The wide-tier kernels are therefore implemented from the original published papers (Brent, Tang, Möller-Granlund, Mercator) rather than transliterated from MPFR or any other viral-licensed reference implementation.
Newton–Raphson reciprocal divide for ÷ 10^SCALE¶
Considered as an alternative to the Möller-Granlund magic-multiplier
at the wide tiers, where MG's setup (one hardware u128 / u64 divide
to compute the magic) is unavoidable per (SCALE, width) and a
Newton iteration on 1/10^SCALE could amortise it. Not adopted:
MG's per-call cost is already a single 128-bit multiply plus a
constant fix-up (no iteration), so the Newton path needs to break
even against ~3 wide multiplies and ends up slower until storage
width exceeds the largest current tier. The MG path also yields
chain-MG (div_mg + Mg2By1) which already
covers the multi-limb divisor case at hardware speed.
Round-to-odd directed rounding¶
Boldo–Melquiond's round-to-odd technique: round the wide working value
to an "odd" sticky representative, then re-round once to the storage
scale, so the directed modes (Floor / Ceiling / Trunc /
HalfTowardZero / HalfAwayFromZero) need only a single narrowing.
Prototyped behind an off-by-default feature and A/B-benched against the
shipped residual-sign Ziv escalation. Not adopted: it tied Ziv
mid-cell and regressed ~30 % near grid lines — the very inputs it was
meant to accelerate. The cause is structural: the kernels deliver 0.5
ULP at the storage scale, not at the working-scale LSB, so the
round-to-odd marker is unreliable precisely near a boundary, and the
implementation has to defer those cases back to Ziv — computing the
value twice. Unlike the multiplication / division entries above (which
a future CPU or a wider tier could flip into a win), this mismatch is
inherent to the fixed-point rounding model, so the technique is not
retained even as compiled-out reference.
Boldo, S. & Melquiond, G. (2008). "Emulation of FMA and correctly rounded sums: proved algorithms using rounding to odd." IEEE Transactions on Computers 57(4), 462–471. DOI: 10.1109/TC.2007.70819.
Related external crates (benchmark baselines only)¶
bnum- fixed-width big-integer crate, used as a wide-int baseline inbenches/wide_int_backends.rs.ruint- Ethereum-flavoured wide-integer crate, used as a 256-bit baseline.rust_decimal- 96-bit-mantissa decimal crate, used as a decimal baseline.fixed- binary fixed-point crate, used for the I64F64 baseline.
These crates are dev-dependencies only - they are never compiled
into a normal build.