Algorithms used in decimal-scaled

Catalogue of the published algorithms the crate evaluates, with academic citations and the source files where each is implemented. This is engineering credit - it complements LICENSE-THIRD-PARTY (which covers verbatim/adapted code from upstream repositories) by attributing the ideas; see LICENSE-THIRD-PARTY for the lines-of-code attributions.

Integer arithmetic

Möller–Granlund magic-number division by an invariant divisor

Used for the ÷ 10^SCALE step in every Mul / Div operator and in rescale. The divisor 10^SCALE is known at compile time, so a pre-computed magic constant and a single 128-bit multiplication plus a one-step correction replace a generic divide instruction. The crate ships a 39-entry table (MG_EXP_MAGICS, scales 0–38) and the 256/128-bit fast-2-word divide built around it.
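
The shape of the trick can be sketched at word width. The following is a hypothetical divide-by-10 for u64 using one precomputed reciprocal constant - the textbook form of the idea, not the crate's MG_EXP_MAGICS table, which carries one such constant per scale:

```rust
/// Divide a u64 by 10 without a divide instruction: one widening
/// 64x64 -> 128 multiply by a precomputed "magic" reciprocal, then a shift.
/// M = ceil(2^67 / 10); for every u64 x, (x * M) >> 67 == x / 10.
fn div10(x: u64) -> u64 {
    const M: u128 = 0xCCCC_CCCC_CCCC_CCCD; // ceil(2^67 / 10)
    ((x as u128 * M) >> 67) as u64
}
```

The crate's div_exp_fast_2word generalises this to 256/128-bit operands and adds the one-step correction from the paper.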

Möller, N. and Granlund, T. (2011). "Improved Division by Invariant Integers." IEEE Transactions on Computers 60(2), 165–175. DOI: 10.1109/TC.2010.143.

Earlier foundational reference for the magic-multiplier idea:

Granlund, T. and Montgomery, P. L. (1994). "Division by Invariant Integers using Multiplication." Proc. PLDI '94. ACM, 61–72. DOI: 10.1145/178243.178249.

Implementation: src/mg_divide.rs (mul2, div_exp_fast_2word, div_exp_fast_2word_with_rem, MG_EXP_MAGICS). The algorithm shape was adapted from the primitive_fixed_point_decimal crate - see LICENSE-THIRD-PARTY for the verbatim attribution.

Limb storage shape — [u64; 2·N]

The wide-integer types (the eight power-of-two widths Int256, Int512, Int1024, Int2048, Int4096, Int8192, Int12288, Int16384 plus the half-width siblings Int192, Int384, Int768, Int1536, Int3072, Int6144) are stored as little-endian [u64; L] arrays where L is the bit-width divided by 64. The choice exposes native u64 × u64 → u128 and u128 / u64 hardware instructions directly (Zen 4 / Intel Golden Cove issue one widening 64×64 mul per cycle in steady state) where the historical [u128; N] layout had to soft-emulate every mul as four u64 × u64 sub-products plus a nested carry chain. The public from_limbs_le([u128; N]) / limbs_le() -> [u128; N] API preserves its u128-shaped signatures via a 4-line const-fn boundary conversion, so the wire format and downstream pattern-matching stay bit-stable.
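
The boundary conversion is a few lines of limb splitting; a sketch for a hypothetical 256-bit type (N = 2, so a [u128; 2] public shape over [u64; 4] storage - illustrative, not the crate's actual code):

```rust
/// Split little-endian u128 limbs into little-endian u64 limbs (256-bit case).
const fn limbs_to_u64(src: [u128; 2]) -> [u64; 4] {
    let mut out = [0u64; 4];
    let mut i = 0;
    while i < 2 {
        out[2 * i] = src[i] as u64;             // low half
        out[2 * i + 1] = (src[i] >> 64) as u64; // high half
        i += 1;
    }
    out
}

/// Inverse: rebuild the public u128 shape from the u64 storage.
const fn limbs_to_u128(src: [u64; 4]) -> [u128; 2] {
    let mut out = [0u128; 2];
    let mut i = 0;
    while i < 2 {
        out[i] = (src[2 * i] as u128) | ((src[2 * i + 1] as u128) << 64);
        i += 1;
    }
    out
}
```

Because both directions are const fns, the conversion can sit on the API boundary without runtime or const-context cost.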

Implementation: every limbs_*_u64 routine in src/wide_int/mod.rs; the $U / $S storage in src/wide_int/macros/mod.rs.

Base-2⁶⁴ schoolbook multiplication

Standard O(n²) schoolbook on u64 limbs, using the native u64 × u64 → u128 widening multiply for each sub-product. Karatsuba was implemented (and tested) but lost to schoolbook at every tier this crate emits because the per-cycle widening mul throughput beats Karatsuba's 3·(n/2)² + add/sub overhead until n > ~64 limbs (beyond our widest tier). The Karatsuba code is retained in src/wide_int/mod.rs (u128-base) as a reference / future SIMD baseline. Implementation: src/wide_int/mod.rs::limbs_mul_u64.
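
A minimal schoolbook kernel over u64 limbs (little-endian, same-width wrapping product for the sketch; the crate's limbs_mul_u64 is the generic version):

```rust
/// O(n^2) schoolbook product of two little-endian u64-limb numbers,
/// truncated to the same width (wrapping multiply). Each sub-product is
/// one native u64 x u64 -> u128 widening multiply; the u128 accumulator
/// cannot overflow because out + a*b + carry < 2^128.
fn schoolbook_mul<const N: usize>(a: &[u64; N], b: &[u64; N]) -> [u64; N] {
    let mut out = [0u64; N];
    for i in 0..N {
        let mut carry: u128 = 0;
        for j in 0..N - i {
            let t = out[i + j] as u128 + a[i] as u128 * b[j] as u128 + carry;
            out[i + j] = t as u64;
            carry = t >> 64;
        }
    }
    out
}
```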

Möller–Granlund 2-by-1 invariant reciprocal in Knuth's q̂ loop

Inside Knuth Algorithm D, each quotient limb's q̂ estimate would otherwise cost a 64-iteration bit-recovery loop. Precomputing the divisor's 2-by-1 reciprocal (Möller-Granlund Algorithm 4) once per call collapses the q̂ step to ~2 multiplies plus a constant fix-up. Setup cost is one hardware u128 / u64 divide; amortised across the m + 1 quotient-limb estimations the per-limb work drops by ~30×.

Möller, N. and Granlund, T. (2011). "Improved Division by Invariant Integers." IEEE Transactions on Computers 60(2), 165–175.

Implementation: src/wide_int/mod.rs::MG2by1U64.

A 3-by-2 sibling (MG3by2U64, also implemented per MG Algorithm 5 with the paper's Algorithm 6 reciprocal refinement that accounts for d0) is kept available for arbitrary-divisor use cases. It was not faster than 2-by-1 + refinement loop on decimal divisors because the refinement loop almost never fires for our (well-conditioned) divisors, so the 3-by-2's per-call extra multiply costs more than it saves.
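
A sketch of the 2-by-1 scheme, with the reciprocal computed via one hardware u128 / u64 divide as in the crate's setup path (variable names are illustrative, not MG2by1U64's actual API):

```rust
/// Reciprocal v = floor((2^128 - 1) / d) - 2^64 for a normalized divisor
/// (top bit of d set). One hardware u128 / u64 divide, paid once per call.
fn reciprocal_2by1(d: u64) -> u64 {
    debug_assert!(d >= 1 << 63);
    ((u128::MAX / d as u128) - (1u128 << 64)) as u64
}

/// Divide the 128-bit value (u1:u0) by normalized d given its reciprocal v.
/// Requires u1 < d so the quotient fits one limb: ~2 multiplies plus
/// constant fix-ups instead of a hardware divide per quotient limb.
fn div_2by1(u1: u64, u0: u64, d: u64, v: u64) -> (u64, u64) {
    let q = v as u128 * u1 as u128 + (((u1 as u128) << 64) | u0 as u128);
    let mut q1 = ((q >> 64) as u64).wrapping_add(1);
    let q0 = q as u64;
    let mut r = u0.wrapping_sub(q1.wrapping_mul(d));
    if r > q0 {
        // estimate was one too high (common fix-up)
        q1 = q1.wrapping_sub(1);
        r = r.wrapping_add(d);
    }
    if r >= d {
        // estimate was one too low (rare fix-up)
        q1 += 1;
        r -= d;
    }
    (q1, r) // (quotient limb, remainder)
}
```

Inside Algorithm D this runs once per quotient limb with the same v, which is where the amortisation comes from.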

Knuth Algorithm D — multi-limb divide

Textbook Algorithm D (Knuth, TAOCP Vol. 2, §4.3.1) adapted to base 2⁶⁴. Normalise the divisor so its top limb has the high bit set, then for each quotient limb estimate q̂ from the top two limbs of the running dividend (via the MG 2-by-1 reciprocal above), refine once against the second-from-top divisor limb, multiply-subtract, and add-back on the rare miss. Complexity O(m·n) limb-ops vs the shift-subtract fallback's O((m+n)·n·64).

Implementation: src/wide_int/mod.rs::limbs_divmod_knuth_u64, fronted by limbs_divmod_dispatch_u64.

Base-2⁶⁴ schoolbook long division (u64-divisor fast path)

For divisors that fit a single u64 word, the crate uses one hardware u128 / u64 divide per dividend limb — every 10^scale with scale ≤ 19 lands here. This is the standard schoolbook long division, now riding the native hardware instruction directly (previously this path had to split each u128 limb into 64-bit halves and do two divides per limb). Implementation: src/wide_int/mod.rs::limbs_divmod_u64 (fast path B); src/mg_divide.rs::div_long_256_by_128 (256-bit specialisation for D38 transcendentals, still u128-typed by design).
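
The fast path is a one-divide-per-limb loop; a sketch over a little-endian limb slice (illustrative shape, not limbs_divmod_u64's exact signature):

```rust
/// Divide a little-endian u64-limb number in place by a single-word
/// divisor, returning the remainder: one hardware u128 / u64 divide per
/// limb, walking from the most significant limb down.
fn divmod_by_word(limbs: &mut [u64], d: u64) -> u64 {
    assert!(d != 0);
    let mut rem: u64 = 0;
    for limb in limbs.iter_mut().rev() {
        let cur = ((rem as u128) << 64) | *limb as u128;
        *limb = (cur / d as u128) as u64;
        rem = (cur % d as u128) as u64;
    }
    rem
}
```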

Binary shift-subtract long division (const fallback)

Last-resort divide for arbitrary multi-limb divisors in const context. Runtime callers route through the limbs_divmod_dispatch_u64 dispatcher to Knuth instead; this path stays as the const fn backstop for wrapping_div / wrapping_rem and as the setup path for the MG reciprocal computation. Implementation: src/wide_int/mod.rs::limbs_divmod_u64 general path.

Roots

Newton iteration for integer square root (isqrt)

x_{k+1} = (x_k + N / x_k) / 2, started from a power-of-2 overestimate so the sequence decreases monotonically. Converges quadratically.
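
A u128 sketch of the iteration (the crate's limbs_isqrt_u64 is the multi-limb version; monotone descent from the power-of-2 overestimate is what makes `y >= x` a valid stop condition):

```rust
/// Integer square root by Newton iteration, seeded with a power-of-2
/// overestimate (x^2 > n) so x_k decreases monotonically to floor(sqrt(n)).
fn isqrt(n: u128) -> u128 {
    if n < 2 {
        return n;
    }
    let bits = 128 - n.leading_zeros();
    let mut x = 1u128 << ((bits + 1) / 2); // 2^ceil(bits/2), squares above n
    loop {
        let y = (x + n / x) / 2;
        if y >= x {
            return x; // sequence stopped decreasing: x == floor(sqrt(n))
        }
        x = y;
    }
}
```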

At wide tiers the per-iteration divide routes through limbs_divmod_dispatch_u64, i.e. Knuth Algorithm D with the MG 2-by-1 q̂ reciprocal. Earlier versions used the const-context limbs_divmod_u64 shift-subtract path here, which made wide-tier sqrt 24–92× slower than necessary; swapping to the runtime dispatcher closes that gap completely. D38 keeps its hand-tuned isqrt_256 because the 256-bit specialisation out-paces the generic path at single-limb scales.

Implementation: src/mg_divide.rs::isqrt_256, src/wide_int/mod.rs::limbs_isqrt_u64.

Newton iteration for integer cube root (icbrt)

x_{k+1} = (2·x_k + N / x_k²) / 3. Same monotone-decreasing setup as isqrt. Implementation: src/mg_divide.rs::icbrt_384, src/macros/wide_roots.rs (decl_wide_roots! emits a 384/512-bit variant per wide tier).

Correctly-rounded sqrt / cbrt

After computing the integer root q = floor(N^{1/k}), the crate decides whether to round up to q + 1 by an integer comparison against the midpoint (q + 1/2)^k: for sqrt the test is N − q² > q, and for cbrt it is 8N ≥ (2q + 1)³. For integer N the midpoint power is never hit exactly in either case (q² + q + 1/4 is never an integer, and (2q + 1)³ is odd while 8N is even), so the rounding decision is mode-independent - every RoundingMode agrees with the half-to-nearest choice. Implementation: src/mg_divide.rs::sqrt_raw_correctly_rounded / cbrt_raw_correctly_rounded; the wide-tier counterparts in src/macros/wide_roots.rs.
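
A u128 sketch of the sqrt midpoint test, with a floor isqrt inlined for self-containment:

```rust
/// floor(sqrt(n)) by Newton descent from a power-of-2 overestimate.
fn isqrt(n: u128) -> u128 {
    if n < 2 { return n; }
    let mut x = 1u128 << ((129 - n.leading_zeros()) / 2);
    loop {
        let y = (x + n / x) / 2;
        if y >= x { return x; }
        x = y;
    }
}

/// Round-half-to-nearest sqrt: round up iff n lies past the midpoint
/// (q + 1/2)^2 = q^2 + q + 1/4, i.e. iff n - q^2 > q. The midpoint is
/// never hit exactly for integer n, so every rounding mode agrees.
fn sqrt_rounded(n: u128) -> u128 {
    let q = isqrt(n);
    if n - q * q > q { q + 1 } else { q }
}
```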

Transcendentals

ln via multi-level sqrt argument reduction + Mercator's artanh

Range-reduce x = 2^k · m with m ∈ [1, 2). Then apply l square roots to shrink m → m^(1/2^l), so the artanh-series argument t = (m' − 1)/(m' + 1) is bounded by 0.35 · 2^-l. The series then converges in ~p / (2 + 2l) pair-terms instead of ~p / 2, traded against l extra wide-isqrt calls. We pick l ≈ √p_bits / 4 empirically.

Reassembly: ln(m) = 2^(l+1) · artanh(t) (the extra 2^l factor folds into the bit-shift since ln(m^(1/2^l)) = ln(m) / 2^l). The storage-scale identity ln(x) = k·ln 2 + ln(m) stays unchanged.

Brent, R. P. (1976). "Multiple-precision zero-finding methods and the complexity of elementary function evaluation." In Analytic Computational Complexity, Academic Press.

Implementation: src/macros/wide_transcendental.rs::ln_fixed. The fastnum crate uses the same sqrt-halving recursion as its ln_ inner — we cross-validated against it.

ln via plain artanh series (legacy reference)

Range-reduce x = 2^k · m with m ∈ [1, 2), then compute ln(m) = 2·artanh((m − 1) / (m + 1)). The argument t = (m − 1) / (m + 1) lies in [0, 1/3], so the terms of the Mercator series artanh(t) = t + t³/3 + t⁵/5 + … shrink at least as fast as 3^−(2n+1) - roughly one decimal digit of convergence per term.
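
An f64 sketch of the series kernel (the crate runs the same shape in fixed point at the working scale):

```rust
/// ln(m) for m in [1, 2) via 2 * artanh((m - 1) / (m + 1)); with t <= 1/3
/// each odd-power series term gains roughly one decimal digit.
fn ln_mercator(m: f64) -> f64 {
    assert!((1.0..2.0).contains(&m));
    let t = (m - 1.0) / (m + 1.0);
    let t2 = t * t;
    let (mut pow, mut sum, mut k) = (t, 0.0, 1.0);
    while pow.abs() > 1e-18 {
        sum += pow / k; // t^k / k, k odd
        pow *= t2;
        k += 2.0;
    }
    2.0 * sum
}
```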

Mercator's logarithm series:

Mercator, N. (1668). Logarithmotechnia. (Cited via Borwein & Borwein, "Pi and the AGM", 1987, Wiley.)

The artanh form is a textbook identity; combined with bit-by-bit range reduction it's sometimes called the "Cody–Waite" approach after the influential 1980 implementation:

Cody, W. J. and Waite, W. (1980). "Software Manual for the Elementary Functions." Prentice-Hall.

Implementation: src/log_exp_strict.rs::ln_fixed (D38), src/macros/wide_transcendental.rs::ln_fixed (every wide tier — D56 / D76 / D114 / D153 / D230 / D307 / D461 / D615 / D923 / D1231).

exp via two-stage argument reduction + Taylor

Wide-tier exp uses Brent's two-stage argument reduction (dashu's exp_internal pattern, traced to Brent 1976 §3):

  1. Stage 1 — modular: k = round(v/ln 2); s = v − k·ln 2, giving |s| ≤ ln 2 / 2 ≈ 0.347.
  2. Stage 2 — multiplicative: s ← s / 2^n with n ≈ √(precision_bits). After both reductions the Taylor argument satisfies |r| < 2⁻ⁿ ≈ 2⁻√ᵖ, so the series converges in O(√p) terms instead of O(p). n is chosen via integer sqrt(3w + 1) (using w·log₂(10) ≈ 3.32w as the bit estimate).
  3. Taylor on the reduced argument.
  4. Reassembly: square n times to undo stage 2, then bit-shift by k to undo stage 1.

The squaring step replaces ≈ 60 Taylor mul+div pairs with n plain wide multiplies, which is the dominant saving (a divide is more expensive than a multiply at our widths even after the u64 storage migration).
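
The four steps can be traced in an f64 sketch (n fixed at 8 here; the crate scales n with the working precision):

```rust
/// exp via two-stage reduction: modular k*ln2 step, n halvings, a short
/// Taylor run on the tiny remainder, then n squarings and a 2^k shift.
fn exp_two_stage(v: f64) -> f64 {
    let ln2 = std::f64::consts::LN_2;
    // stage 1 (modular): v = k*ln2 + s with |s| <= ln2/2
    let k = (v / ln2).round();
    let s = v - k * ln2;
    // stage 2 (multiplicative): |r| <= ln2 / 2^(n+1)
    let n = 8;
    let r = s / f64::from(1 << n);
    // Taylor on the reduced argument
    let (mut term, mut sum) = (1.0, 1.0);
    for i in 1..=10 {
        term *= r / i as f64;
        sum += term;
    }
    // reassembly: square n times to undo stage 2, shift by k to undo stage 1
    let mut e = sum;
    for _ in 0..n {
        e *= e;
    }
    e * 2f64.powi(k as i32)
}
```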

Brent, R. P. (1976). "Fast multiple-precision evaluation of elementary functions." Journal of the ACM 23(2), 242–251. DOI: 10.1145/321941.321944.

Implementation: src/macros/wide_transcendental.rs::exp_fixed; narrow tier in src/log_exp_strict.rs.

Further: dashu-float::exp::Context::exp_internal is the modern reference implementation we cross-checked against.

Cached 10^w divisor in Taylor / AGM / Newton inner loops

Every mul(a, b, w) / div(a, b, w) in wide_transcendental does a round_div against 10^w. Computing lit(10).pow(w) from scratch on each call costs ~10–50 µs at D307<150> (w=180, ~8 wide squarings followed by ~180 cumulative multiplies). The cached variants mul_cached(a, b, pow10_w) / div_cached(...) accept a precomputed 10^w and skip the recomputation; the inner Taylor / AGM / Newton loops hoist let pow10_w = pow10(w); once and pass it down, saving ~20 % on every wide transcendental.

Implementation: src/macros/wide_transcendental.rs::mul_cached, div_cached; consumed by exp_fixed, ln_fixed, sin_taylor, atan_taylor.

exp via plain Taylor (legacy reference)

Range-reduce x = k · ln 2 + s with |s| ≤ ln 2 / 2, then exp(x) = 2^k · exp(s). The Taylor series for exp(s) converges absolutely on the reduced interval. The same Cody–Waite shape. Implementation: src/log_exp_strict.rs::exp_fixed, src/macros/wide_transcendental.rs::exp_fixed.

sin via [0, π/4] reduction with sin / cos branching

Reduce v mod τ to r ∈ [−π, π]; fold to |r| ∈ [0, π/2] via sin(π − x) = sin(x); then route to two sub-kernels based on whether the reduced argument lies in the lower or upper half of [0, π/2]:

  • r ≤ π/4: sin_taylor(r) — standard r − r³/3! + r⁵/5! − … series.
  • r > π/4: cos_taylor(π/2 − r), the 1 − r²/2! + r⁴/4! − … series on an argument in [0, π/4]. Cos's leading constant-1 term means it converges marginally faster than sin at the same argument, and the [0, π/4] cap halves the Taylor argument range vs the historic [0, π/2].
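
An f64 sketch of the branch (argument assumed already folded into [0, π/2]):

```rust
use std::f64::consts::{FRAC_PI_2, FRAC_PI_4};

/// r - r^3/3! + r^5/5! - ... for r in [0, pi/4].
fn sin_taylor(r: f64) -> f64 {
    let (mut term, mut sum, r2) = (r, r, r * r);
    for k in 1..12 {
        term *= -r2 / ((2 * k) as f64 * (2 * k + 1) as f64);
        sum += term;
    }
    sum
}

/// 1 - r^2/2! + r^4/4! - ... for r in [0, pi/4].
fn cos_taylor(r: f64) -> f64 {
    let (mut term, mut sum, r2) = (1.0, 1.0, r * r);
    for k in 1..12 {
        term *= -r2 / ((2 * k - 1) as f64 * (2 * k) as f64);
        sum += term;
    }
    sum
}

/// Route to whichever kernel sees an argument in [0, pi/4].
fn sin_kernel(r: f64) -> f64 {
    if r <= FRAC_PI_4 { sin_taylor(r) } else { cos_taylor(FRAC_PI_2 - r) }
}
```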

Reference: Muller 2016 §11.4 attributes the "switch to cos at π/4" trick to standard mid-precision practice.

Implementation: src/macros/wide_transcendental.rs::sin_fixed (routes to sin_taylor / cos_taylor).

sin_cos joint kernel

sin_cos_strict(self) -> (sin, cos): shares the Taylor evaluation between sin and cos, recovers cos via the Pythagorean identity |cos| = √(1 − sin²). Net cost ≈ one sin_strict + one wide sqrt + one wide mul vs the historic two independent sin evaluations (cos = sin(x + π/2)). 2–3× faster when both values are needed.

Pattern adapted from fastnum::decimal::dec::math::sin_cos.

Implementation: src/macros/wide_transcendental.rs::sin_cos_fixed and sin_cos_strict.

sin / cos via plain range-reduced Taylor (legacy reference)

Earlier implementation reduced to [0, π/2] and ran sin Taylor directly without the cos branch; cos(x) = sin(x + π/2) was a full second sin evaluation. Replaced by the variants above. Implementation: src/trig_strict.rs::sin_fixed, src/macros/wide_transcendental.rs::sin_fixed / sin_taylor.

atan via per-width argument halvings + Taylor

The identity atan(x) = 2·atan(x / (1 + √(1 + x²))) halves the argument; applying it l times reduces |x| by ~2^l, then the Taylor series for atan converges in ~p_bits / (2l) terms.
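
An f64 sketch of the cascade (halving count fixed at 5 here; the crate picks it per working scale as described below):

```rust
/// atan via l applications of atan(x) = 2*atan(x / (1 + sqrt(1 + x^2))),
/// then the Taylor series on the shrunken argument, then scale by 2^l.
fn atan_halved(x: f64, l: u32) -> f64 {
    let mut t = x;
    for _ in 0..l {
        t /= 1.0 + (1.0 + t * t).sqrt();
    }
    // t - t^3/3 + t^5/5 - ...
    let t2 = t * t;
    let (mut pow, mut sum, mut k) = (t, t, 1.0);
    for _ in 0..20 {
        pow *= -t2;
        k += 2.0;
        sum += pow / k;
    }
    sum * f64::from(1 << l)
}
```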

The halving count is chosen per working scale w:

  • w < 60 → 5 halvings (D38 / D9 / D18 strict path)
  • 60 ≤ w < 110 → 6 halvings (D56 / D76 / light D114)
  • w ≥ 110 → 7 halvings (D114 / D153 / D230 / D307 / D461 / D615 / D923 / D1231)

Break-even rationale: each halving costs ~one wide mul + one wide sqrt + one wide div; each saved Taylor term saves ~one wide mul. The trade-off depends on p_bits and sits in the 5–7 range for our tiers.

Implementation: src/trig_strict.rs::atan_fixed, src/macros/wide_transcendental.rs::atan_fixed.

atan2 with max-branch quotient selection

atan2(y, x) evaluates atan(y/x) plus a quadrant offset. The historical implementation always fed y/x to atan_fixed, losing ~log₂(|y/x|) bits when |y| ≫ |x| because atan_fixed's argument-halving cascade had to consume them. The current implementation max-branches: it feeds atan_fixed whichever of y/x or x/y has |·| ≤ 1 and applies the identity atan(t) = sign(t)·π/2 − atan(1/t) for |t| > 1 to recover the quotient. Eliminates the asymptotic-edge precision loss; modest speed win at any |y/x| significantly different from 1.
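
The max-branch selection in an f64 sketch, restricted to the x > 0 half-plane so the quadrant-offset bookkeeping stays out of the way (std's atan stands in for the crate's atan_fixed):

```rust
use std::f64::consts::FRAC_PI_2;

/// Feed atan whichever of y/x or x/y has magnitude <= 1; recover the
/// other case via atan(t) = sign(t)*pi/2 - atan(1/t). Assumes x > 0
/// (full atan2 adds the quadrant offsets on top of this).
fn atan2_maxbranch(y: f64, x: f64) -> f64 {
    let t = y / x;
    if t.abs() <= 1.0 {
        t.atan()
    } else {
        t.signum() * FRAC_PI_2 - (1.0 / t).atan()
    }
}
```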

Implementation: src/trig_strict.rs::atan2_strict (D38), src/macros/wide_transcendental.rs::atan2_strict (wide tier).

asin / acos two-range kernel

Earlier path: asin(x) = atan(x / √(1 − x²)). At |x| → 1 the 1 − x² subtraction cancelled all leading bits, losing ≈ log(1/(1−|x|²)) digits of precision and risking 1-ULP error at the asymptotic edge.

Two-range kernel preserves the 0-ULP contract at every representable input:

  • |x| ≤ 0.5: existing atan(x / √(1 − x²)) path. At this range 1 − x² ∈ [0.75, 1] — no cancellation, full precision.
  • |x| > 0.5: half-angle identity asin(|x|) = π/2 − 2·asin(√((1 − |x|)/2)). Recurses once on √((1 − |x|)/2) ∈ [0, 0.5), which hits the stable branch. The inner (1 − |x|)/2 is exact (no cancellation: |x| ≤ 1 guarantees 1 − |x| ≥ 0), so the precision floor scales with the working scale instead of with the input's distance from 1.

acos shares the same kernel via acos(x) = π/2 − asin(x).
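
An f64 sketch of the two-range kernel (std's atan and sqrt stand in for the crate's fixed-point kernels):

```rust
use std::f64::consts::FRAC_PI_2;

/// Two-range asin: the atan form where 1 - x^2 cannot cancel, and one
/// half-angle recursion back onto that stable branch everywhere else.
fn asin_two_range(x: f64) -> f64 {
    let a = x.abs();
    let r = if a <= 0.5 {
        // stable: 1 - a^2 in [0.75, 1], no cancellation
        (a / (1.0 - a * a).sqrt()).atan()
    } else {
        // half-angle: recurse once, inner argument lands in [0, 0.5)
        FRAC_PI_2 - 2.0 * asin_two_range(((1.0 - a) / 2.0).sqrt())
    };
    if x < 0.0 { -r } else { r }
}
```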

Implementation: src/trig_strict.rs::asin_strict / acos_strict (D38) and the wide-tier variants in src/macros/wide_transcendental.rs (asin_strict, asin_strict_with, acos_strict, acos_strict_with).

π via Machin's formula (wide tier only)

π = 16·atan(1/5) − 4·atan(1/239). Each atan is evaluated via the crate's Taylor implementation; with arguments this small, both series converge quickly.
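
The formula in one line of f64 (the crate evaluates the two atan calls through its fixed-point Taylor kernel instead of std):

```rust
/// Machin (1706): pi = 16*atan(1/5) - 4*atan(1/239).
fn machin_pi() -> f64 {
    16.0 * (1.0f64 / 5.0).atan() - 4.0 * (1.0f64 / 239.0).atan()
}
```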

Machin, J. (1706). Cited via Beckmann, P. (1971). A History of π. St. Martin's Press.

Implementation: src/macros/wide_transcendental.rs::pi. (D38 embeds π to 63 fractional digits as a literal - no series at run time, since the constant fits the working width comfortably.)

Hyperbolic functions

Composed from exp/ln:

  • sinh(x) = (eˣ − e⁻ˣ) / 2
  • cosh(x) = (eˣ + e⁻ˣ) / 2
  • tanh(x) = sinh(x) / cosh(x)
  • asinh(x) = ln(x + √(x² + 1)), with the x ≥ 1 form factored as ln(x) + ln(1 + √(1 + 1/x²)) to keep in the working width
  • acosh(x) = ln(x + √(x² − 1)), with the analogous factoring for x ≥ 2
  • atanh(x) = ln((1 + x) / (1 − x)) / 2
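
The asinh factoring is the only non-obvious step; an f64 sketch of why it helps (for x ≥ 1 the inner ln argument stays in [2, 1 + √2] instead of growing with x):

```rust
/// asinh(x) = ln(x + sqrt(x^2 + 1)), factored for x >= 1 as
/// ln(x) + ln(1 + sqrt(1 + 1/x^2)) so the second ln's argument stays
/// near 2 instead of scaling with x (which matters in a fixed working width).
fn asinh_factored(x: f64) -> f64 {
    assert!(x >= 1.0);
    x.ln() + (1.0 + (1.0 + 1.0 / (x * x)).sqrt()).ln()
}
```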

All textbook identities - no specific paper attribution. Implementation: src/trig_strict.rs, src/macros/wide_transcendental.rs.

Rounding

Half-to-even (banker's rounding) and the IEEE-754 family

The crate's default rounding rule. Implementation in src/rounding.rs::should_bump, dispatched per RoundingMode via a strategy hook.

IEEE Std 754-2019. "IEEE Standard for Floating-Point Arithmetic." IEEE Standards Association.

Constants

The mathematical constants in src/consts.rs (pi, tau, half_pi, quarter_pi, e, golden) are stored as 37-digit raw i128 values at SCALE_REF = 37 (the i128 maximum for the largest of the six). Sources:

  • pi, tau, half_pi, quarter_pi: ISO 80000-2.
  • e: OEIS A001113.
  • golden: OEIS A001622.

Cross-over algorithms

  • Karatsuba multiplication. Implemented in wide_int::limbs_mul_karatsuba and dispatched to by wide_int::limbs_mul_fast when both operands are equal-length and at least KARATSUBA_MIN = 16 limbs. Below that threshold, schoolbook wins outright; at or above it, the recursive (a₁b₁, (a₁+a₀)(b₁+b₀) − a₁b₁ − a₀b₀, a₀b₀) decomposition reduces the asymptotic cost. In practice the threshold means storage tiers through Int1024 (8 limbs) use schoolbook; the 2048-/4096-bit work integers behind the wide-tier strict transcendentals (Int2048 = 16 limbs, Int4096 = 32 limbs) hit the Karatsuba path. (Karatsuba, A. and Ofman, Yu. (1962). Doklady Akad. Nauk SSSR 145, 293–294.) Anatoly Karatsuba (1937–2008) and Yuri Ofman are both deceased; see the Wikipedia biography links below. Further reading: Karatsuba algorithm, Anatoly Karatsuba bio, Yuri Ofman bio, MathWorld - Karatsuba Algorithm.
  • AGM-based ln / exp (Brent–Salamin 1976). ln_strict_agm (every wide tier) uses Brent's identity ln(s) ≈ π / (2 · AGM(1, 4/s)) with range reduction ln(x) = ln(x · 2^m) − m·ln 2. exp_strict_agm uses Newton's iteration on ln_strict_agm. Both converge quadratically - O(log p) iterations vs the artanh path's O(p) series terms - so they win asymptotically as working scale grows. Currently exposed as the alternate path; the canonical ln_strict / exp_strict stays on the artanh / Taylor implementations until a bench at the relevant working scale shows AGM winning by the OVERRIDE_POLICY.md margin. Caveat: the present implementation runs the AGM iteration at the same working scale w as the artanh path; at storage scales beyond ~30 the early-phase sqrt(a·b) step's truncation error amplifies and the output drops to ~p/2 bits of precision. Brent §3 fixes this by raising intermediate AGM precision; recorded as a follow-up. (Brent, R. P. (1976). "Fast multiple-precision evaluation of elementary functions." J. ACM 23(2), 242–251.) Richard Brent is at ANU - homepage: https://maths-people.anu.edu.au/~brent/. Further reading: Arithmetic–geometric mean (the aₙ₊₁ = (aₙ+bₙ)/2, bₙ₊₁ = √(aₙ bₙ) recurrence), Gauss–Legendre algorithm (the same AGM iteration applied to π), MathWorld - Arithmetic-Geometric Mean.
  • Burnikel–Ziegler recursive division. limbs_divmod_bz in src/wide_int/mod.rs is the recursive wrapper; its base case is the in-crate Knuth Algorithm D port (limbs_divmod_knuth, TAOCP §4.3.1 adapted to base 2^128). Both functions sit alongside the canonical const-fn binary limbs_divmod - the canonical path is unchanged, and _knuth / _bz are exposed for bench-driven promotion. Knuth's O(m·n) multi-limb shape beats the binary path's O((m+n)·n·128) for any multi-limb divisor; BZ's recursion adds value only past the threshold (currently BZ_THRESHOLD = 8 limbs) and the full §3 two-by-one / three-by-two recursion is recorded as the next layer to add once a bench shows it winning at this crate's widths. (Burnikel, C. and Ziegler, J. (1998). "Fast recursive division." MPI-I-98-1-022; Knuth, D. E. (1981). The Art of Computer Programming, Vol. 2: Seminumerical Algorithms, §4.3.1.) Donald Knuth's homepage: https://www-cs-faculty.stanford.edu/~knuth/. Further reading: Division algorithm § Newton–Raphson division and recursive division (no dedicated BZ article, but the parent page lists the recursive-division family), MathWorld - Long Division. The Burnikel–Ziegler tech report is the canonical algorithm reference: MPI-I-98-1-022.
  • CORDIC. Common in hardware floating-point; not competitive with Taylor + reduction in a software fixed-point context. Further reading: CORDIC (the rotation-mode and vectoring-mode iterations are the central equations there), MathWorld - CORDIC.
  • bnum - fixed-width big-integer crate, used as a wide-int baseline in benches/wide_int_backends.rs.
  • ruint - Ethereum-flavoured wide-integer crate, used as a 256-bit baseline.
  • rust_decimal - 96-bit-mantissa decimal crate, used as a decimal baseline.
  • fixed - binary fixed-point crate, used for the I64F64 baseline.
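
The Karatsuba decomposition from the first bullet can be illustrated at toy scale - one u64 × u64 → u128 product from three 32×32 sub-products instead of four (illustration only; limbs_mul_karatsuba recurses on limb vectors):

```rust
/// Karatsuba on a single u64 x u64 -> u128, splitting into 32-bit halves:
/// the (a1*b1, (a1+a0)(b1+b0) - a1*b1 - a0*b0, a0*b0) decomposition with
/// three multiplies. All intermediates fit u128 since the total is a*b.
fn karatsuba_u64(a: u64, b: u64) -> u128 {
    let (a1, a0) = (a >> 32, a & 0xFFFF_FFFF);
    let (b1, b0) = (b >> 32, b & 0xFFFF_FFFF);
    let hi = a1 as u128 * b1 as u128;
    let lo = a0 as u128 * b0 as u128;
    let mid = (a1 + a0) as u128 * (b1 + b0) as u128 - hi - lo;
    (hi << 64) + (mid << 32) + lo
}
```

The real win appears only when the three half-size products are themselves multi-limb multiplications, which is why the crate's threshold sits at 16 limbs.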

These crates are dev-dependencies only - they are never compiled into a normal build.