Roadmap

Known performance gaps and planned improvements, tracked by tier of the §5 Library-comparison benchmark in docs/benchmarks.md. Cells where decimal-scaled already wins are out of scope; this page covers the loss columns and how we plan to close them.

The crate's accuracy invariants are not on this roadmap. decimal-scaled is 0 ULP correctly-rounded on every transcendental tested at every tier, and stays that way. The roadmap is throughput-only - give people a way to keep the exactness when they need it and an opt-out when they don't.

Versioning intent

target	gating work
0.3.0	The current development cycle. Ships: the half-width tier ladder (D56 / D114 / D230 / D461 / D615 / D923 / D1231); the comprehensive cross-tier `widen()` / `narrow()` chain (breaking — D38.widen() now returns D56, etc.); the chain-of-÷10^38 wide-tier `mul` speedup (≥ 2× at D307<150>); MG magic-multiply table extension across every wide-tier SCALE; trig functions in the per-width summary chart family.
0.4.0	(1) Signed `SCALE` (`SCALE: i32`) so callers can express implicit-trailing-zero magnitudes (`D38<-3>` = "stored value × 10³"). Shares the per-tier `10^k` constant tables with the magic-multiply extension landed in 0.3. (2) Cryptographically-secure RNG surface: uniform-decimal sampling over `[0, 1)`, `[a, b]`, and full-storage; rejection-sampling at any SCALE; bring-your-own `CryptoRng` so callers can plug `OsRng` / `ChaCha20Rng` / hardware RNGs.
1.0.0	The version stays pre-1.0 until either (a) the wide-tier `mul` / `div` numbers are competitive with the best peer at every shipped width — currently the `dashu-float` heap-arbitrary-precision baseline, which we trail by ~14× to ~100× at the wide tiers — or (b) the gap has a clearly-defensible structural reason (different storage shape, different precision invariant, different ULP contract) documented per row in the benchmarks. Adapter + ecosystem crates (per the sections below) ship at their own pace and do not gate the core 1.0.

Wide-tier `÷ 10^SCALE` - primary bottleneck

The Möller–Granlund magic-multiply (mg_divide) is the kernel behind every wide-tier mul and div (they both end with a ÷ 10^SCALE step to keep the result at the right scale). At D38 it's the right algorithm. At D76 and above the magic constant has to be widened a tier above the storage width, which serialises a chain of limb multiplies through a single carry-propagating accumulator.
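As a narrow-width illustration of the magic-multiply idea, the sketch below divides a `u64` by 10 with one widening multiply and a shift. It is not the crate's `mg_divide`; the names `div10_magic` and `divrem10_magic` are hypothetical. It does show the shape of the problem described above: even for a 64-bit input the product has to be formed one width up (128 bits), which is the step that turns into a serialised limb chain on the wide tiers.

```rust
/// ceil(2^67 / 10). Multiplying by this and shifting right by 67 bits
/// yields floor(n / 10) for every u64 input.
const MAGIC_DIV10: u64 = 0xCCCC_CCCC_CCCC_CCCD;
const SHIFT_DIV10: u32 = 67;

/// Divide by 10 without a hardware divide (hypothetical helper).
pub fn div10_magic(n: u64) -> u64 {
    // Widen to 128 bits so the full 64 x 64 product is kept.
    let product = (n as u128) * (MAGIC_DIV10 as u128);
    (product >> SHIFT_DIV10) as u64
}

/// Quotient and remainder, using one extra multiply for the fix-up.
pub fn divrem10_magic(n: u64) -> (u64, u64) {
    let q = div10_magic(n);
    (q, n - q * 10)
}
```

The constant works because `10 * MAGIC_DIV10 - 2^67 = 2`, which is small enough relative to `2^(67 - 64) = 8` that the rounding error never reaches the quotient for a 64-bit dividend.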

Concrete symptom: at 256-bit / s=35, decimal-scaled div is ~830× slower than fastnum's div (5.08 µs vs 6.07 ns).

approach	status	expected win
Bench-pick MG vs alternatives at each width / scale point	TODO	knowing the breakeven matters for the next two
Burnikel–Ziegler recursive divide on top of `limbs_divmod`	TODO	the right asymptote for D153+; D307 should benefit most
Newton-iteration reciprocal as a `mg_divide` fast-path replacement at extreme scales	TODO	flatlines div cost across the deepest tiers if the iteration count stays bounded

Wide-tier multiplication - Karatsuba / Toom-Cook

At D76 / D153 / D307 the multiplication kernel is straight schoolbook over [u64; 4] / [u64; 8] / [u64; 16]. The crossover for Karatsuba is typically around 8 limbs, Toom-3 around 32. D153 and D307 sit squarely in Karatsuba and Karatsuba-vs-Toom-3 territory, respectively, but neither is implemented.

Concrete symptom: 1024-bit mul is 66.7 µs in decimal-scaled vs 141 ns in bigdecimal. The crate carries the cost of 16 × 16 = 256 limb multiplies serially.
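For readers unfamiliar with the trade, here is a minimal single-level Karatsuba step next to its schoolbook equivalent. It works on the two 32-bit halves of a `u64` rather than on the crate's `[u64; N]` limb arrays, and both function names are hypothetical. The point is only the multiply count: three in place of four, at the price of a few extra additions.

```rust
const MASK32: u64 = 0xFFFF_FFFF;

/// Schoolbook 64 x 64 -> 128: four half-width multiplies.
pub fn schoolbook_mul_u64(a: u64, b: u64) -> u128 {
    let (a1, a0) = ((a >> 32) as u128, (a & MASK32) as u128);
    let (b1, b0) = ((b >> 32) as u128, (b & MASK32) as u128);
    ((a1 * b1) << 64) + ((a1 * b0) << 32) + ((a0 * b1) << 32) + a0 * b0
}

/// One Karatsuba step: the same product from three multiplies.
pub fn karatsuba_mul_u64(a: u64, b: u64) -> u128 {
    let (a1, a0) = (a >> 32, a & MASK32);
    let (b1, b0) = (b >> 32, b & MASK32);

    let z0 = (a0 as u128) * (b0 as u128); // low  x low
    let z2 = (a1 as u128) * (b1 as u128); // high x high
    // (a0 + a1)(b0 + b1) = z0 + z2 + (a0*b1 + a1*b0), so one multiply
    // recovers both cross terms. The sums fit in 33 bits.
    let cross = ((a0 + a1) as u128) * ((b0 + b1) as u128);
    let z1 = cross - z0 - z2;

    (z2 << 64) + (z1 << 32) + z0
}
```

Applied recursively to limb arrays, the same identity is what would replace the 16 × 16 schoolbook grid; whether it pays off at 8 or 16 limbs is exactly what a probe bench would have to confirm.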

approach	status	expected win
Karatsuba on `Int512` / `Int1024` mul	TODO	~2× at D153, ~3-4× at D307
Toom-3 on `Int1024` mul, gated by limb count	TODO	further ~1.5-2× on top of Karatsuba at the very deepest scale
SIMD limb multiplies (AVX-512, NEON) gated behind a `simd` feature	speculative	hardware-dependent; worth a probe bench before committing

Wide-tier transcendentals - give callers an opt-out

decimal-scaled deliberately keeps every transcendental at 0 ULP correctly rounded, regardless of tier. At D76+ that costs ~µs per call (ln, exp, sin); at D307 it's ~ms. For callers that don't need 0-ULP determinism (e.g. plotting a curve, doing approximate convergence checks) this is overkill.

*_fast already exists on every width, but on the wide tiers it routes through to_f64 / f64::ln / from_f64, so the result collapses to about 16 significant decimal digits regardless of the storage width - a precision cliff that's hard to communicate.
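The cliff is easy to reproduce without the crate at all. The sketch below round-trips a 30-digit raw integer through `f64`, which is effectively what an f64 bridge does to its input and output; both helper names are hypothetical and no crate API is used.

```rust
/// Round-trip a raw storage value through f64 (nearest-value cast each way).
pub fn roundtrip_via_f64(raw: i128) -> i128 {
    (raw as f64) as i128
}

/// Number of leading decimal digits two values share. Intended for
/// same-sign values whose decimal strings have the same length.
pub fn matching_leading_digits(a: i128, b: i128) -> usize {
    let (sa, sb) = (a.to_string(), b.to_string());
    sa.chars()
        .zip(sb.chars())
        .take_while(|(x, y)| x == y)
        .count()
}
```

Feeding in `123456789012345678901234567890` returns a value that agrees only in its leading digits: an `f64` mantissa holds 53 bits, so everything below roughly the 16th significant digit is rounding noise, however wide the storage tier is.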

approach	status	expected win
`_approx(working_digits: u32)` family - same series as `_strict` but with caller-controlled working-scale cutoff	TODO	linear cost reduction proportional to the requested digit cut
Document the precision cliff of `*_fast` on wide tiers more loudly	TODO	non-code; reader expectations
Newton-on-AGM `ln` / `exp` paths past D153 - quadratic convergence, asymptotically wins where the artanh series stalls	partial (`bench-alt`)	not yet promoted by the dispatcher; crossover point measured in `benches/agm_vs_taylor.rs`

More decimal widths - fill the tier ladder

Current widths cover the power-of-two storage sequence (32 / 64 / 128 / 256 / 512 / 1024 bits). Real-world picks often fall between these - e.g. a D57 covers IEEE 754 binary192 mantissa precision, a D462 covers cryptographic-class high-precision intermediates without paying the full D616 cost.

Plan:

Double the top end up to 4096 bits. D307 (1024 bit) is the current ceiling; add D616 (2048 bit) and D1232 (4096 bit).
Fill in the half-step widths between each existing pair.

Resulting tier ladder:

storage bits	type	safe decimal digits	status
32	`D9`	9	shipped
48	`D14`	14	TODO
64	`D18`	18	shipped
96	`D28`	28	TODO
128	`D38`	38	shipped
192	`D57`	57	TODO
256	`D76`	76	shipped
384	`D115`	115	TODO
512	`D153`	153	shipped
768	`D230`	230	TODO
1024	`D307`	307	shipped
1536	`D462`	462	TODO
2048	`D616`	616	TODO
3072	`D924`	924	TODO
4096	`D1232`	1232	TODO
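The "safe decimal digits" column can be reproduced from the storage width alone, assuming signed two's-complement storage where one bit is the sign: an N-digit value always fits when 10^N ≤ 2^(bits − 1). The function below is a hypothetical helper for checking the table, not part of the crate.

```rust
/// Largest N such that every N-digit decimal magnitude fits in a signed
/// integer of `storage_bits` bits: floor((bits - 1) * log10(2)).
/// Assumes storage_bits >= 1.
pub fn safe_decimal_digits(storage_bits: u32) -> u32 {
    // One bit is the sign; the rest bound the magnitude by 2^(bits-1) - 1.
    let magnitude_bits = (storage_bits - 1) as f64;
    (magnitude_bits * std::f64::consts::LOG10_2).floor() as u32
}
```

The `f64` arithmetic is adequate here because none of the listed widths lands close to an integer boundary; a production version would more likely use a lookup table.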

Each new tier needs its own IntN storage in crate::wide_int, the corresponding MAX_SCALE plumbing, and matching wide-int + strict transcendental kernels (the macros already generate the per-tier code once the storage type exists). Cargo features follow the existing wide / x-wide pattern - probably a new xx-wide / xxx-wide gate for the additions past D307 to keep default build times sane.

Narrow-tier - already competitive

D9 / D18 / D38 arithmetic already matches or beats fixed::I*F* (the only directly-comparable competitor at these widths). D38 transcendentals at s=19 take 1.47 µs (ln) and 40.5 µs (exp), against fastnum's 16 ns and 8.92 µs, but fastnum's figures come from an f64 bridge (1 ULP off) whereas ours are 0 ULP. No roadmap item here unless the accuracy contract changes.

Methodology / infrastructure

approach	status	expected win
Split `benches/library_comparison.rs` into one bench-binary per width (`lib_cmp_d38.rs`, `lib_cmp_d76.rs`, `lib_cmp_d307.rs`, …) so `cargo bench --bench lib_cmp_d307` can iterate on a single tier without re-running the whole matrix	TODO (post-0.3.0)	minutes vs hours per iteration when tuning one tier; each file stays focused on its peer set
Re-bench every release on a single dedicated machine, not whatever runner happened to be available	TODO	reduces inter-release noise that currently looks like regressions
Track ULP deltas continuously, not one-shot at 0.2.5	TODO	catches accuracy regressions early; cheap to run
Cross-platform bit-determinism CI (Linux/macOS/Windows × x86_64/aarch64)	TODO	proves the `*_strict` invariant the docs claim

Wide-tier MG magic-multiply extension + negative SCALE

approach	status	expected win
Extend the Möller–Granlund magic-multiply tables past `10^38` to cover every wide-tier SCALE (target: `10^SCALE` for `SCALE` up to the tier MAX_SCALE) so the `÷10^SCALE` step on D76 and above swaps multi-limb Knuth divide for one magic multiply + a fix-up	TODO	est. 3–10× on wide-tier mul/div per the 2026-05-17 gap research
Make `SCALE` signed (i32) so callers can express implicit-trailing-zero magnitudes (D38<-3> = "stored value × 10³"); orthogonal to the magic-multiply work but shares the per-tier `10^k` constant tables — a single tables-rewrite covers both	TODO	enables values up to `±i128::MAX × 10^(-SCALE)` without burning storage on zero-padding; common in actuarial / national-accounts work

See research/2026_05_17_wide_mul_div_gap.md for the gap analysis and research/2026_05_17_mg_magic_extension_eval.md for the design eval combining both items.
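To make the signed-SCALE reading concrete, here is a minimal sketch of the intended semantics, where the logical value is `raw × 10^(-SCALE)`. The `Scaled` type and its methods are hypothetical stand-ins, not the crate's `D38`; only the const-generic `i32` parameter mirrors the plan above.

```rust
/// Hypothetical stand-in for a decimal with a signed compile-time scale.
/// Logical value = raw * 10^(-SCALE).
pub struct Scaled<const SCALE: i32> {
    pub raw: i128,
}

impl<const SCALE: i32> Scaled<SCALE> {
    pub const fn new(raw: i128) -> Self {
        Self { raw }
    }

    /// Integer part of the logical value, truncated toward zero.
    /// Returns None if the power of ten or the product does not fit in i128.
    pub fn to_integer(&self) -> Option<i128> {
        let factor = 10i128.checked_pow(SCALE.unsigned_abs())?;
        if SCALE >= 0 {
            // Positive scale: the low SCALE digits are fractional.
            Some(self.raw / factor)
        } else {
            // Negative scale: |SCALE| trailing zeros are implied.
            self.raw.checked_mul(factor)
        }
    }
}
```

With this reading, `Scaled::<-3>::new(42)` stands for 42 000 while storing only `42`, which is the "no storage burned on zero-padding" property the table row describes.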

Random number generation (0.4.0 target)

A cryptographically-secure RNG surface for sampling decimals. Same out-of-tree-via-trait pattern as the rest of the ecosystem: bring your own RngCore + CryptoRng from rand, and the crate provides the decimal-shaped sampling primitives.

primitive	shape
`gen_unit::<T, R>(rng)`	uniform `T` in `[0, 1)` — generate SCALE random decimal digits
`gen_range::<T, R>(rng, lo..hi)`	uniform `T` in a closed-or-half-open range; rejection-sampling at any SCALE so the distribution stays unbiased
`gen_storage::<T, R>(rng)`	fill the storage bits directly — useful for token-like opaque IDs
`gen_signed_unit::<T, R>(rng)`	uniform `T` in `(-1, 1)` with the sign bit also sampled

Design choices:

No global state. The crate doesn't ship its own thread_rng() or default RNG. Callers pass an R: RngCore + CryptoRng. This matches the *_with(mode) story for rounding — explicit > magic.
no_std-friendly. Trait-bound RNG so getrandom / OsRng aren't required dependencies. Embedded callers can plug their own HRNG-backed RngCore.
Cryptographic correctness. The rejection sampler for gen_range follows the well-trodden "draw N bytes, modulo by range only when below the rejection threshold" pattern; no modulo bias even at the widest tiers.
Distribution helpers in decimal-scaled-math. Normal (e.g. via Box-Muller) / log-normal / exponential / gamma etc. live in the ecosystem math crate, not in the core. The core only provides the uniform primitives that everything else composes on top of.
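The rejection-sampling pattern named in the design choices can be sketched on a single `u64` as follows. Everything here is illustrative: `NextU64` is a hypothetical stand-in for `rand`'s `RngCore` so the example has no dependencies, `SplitMix64` is a deterministic generator that is NOT cryptographically secure and exists only to drive the demo, and `gen_unit_raw` is a guess at the shape of `gen_unit`, not its planned signature.

```rust
/// Hypothetical stand-in for an RNG trait (not rand's RngCore).
pub trait NextU64 {
    fn next_u64(&mut self) -> u64;
}

/// SplitMix64: deterministic, NOT cryptographically secure. Demo only.
pub struct SplitMix64(pub u64);

impl NextU64 for SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Uniform value in [0, range) with no modulo bias. Panics if range == 0.
pub fn uniform_below<R: NextU64>(rng: &mut R, range: u64) -> u64 {
    assert!(range > 0, "range must be non-zero");
    // 2^64 mod range: size of the incomplete block at the top of u64 space.
    let tail = range.wrapping_neg() % range;
    // Largest draw that still lies inside a whole multiple of `range`.
    let max_accept = u64::MAX - tail;
    loop {
        let v = rng.next_u64();
        if v <= max_accept {
            return v % range; // reduce only after the draw is accepted
        }
    }
}

/// Raw storage for a uniform decimal in [0, 1) with `scale` fractional
/// digits, i.e. a uniform integer in [0, 10^scale). Requires scale <= 19.
pub fn gen_unit_raw<R: NextU64>(rng: &mut R, scale: u32) -> u64 {
    uniform_below(rng, 10u64.pow(scale))
}
```

Rejecting the top `2^64 mod range` draws leaves an exact multiple of `range` equally likely values, so the final reduction maps the same number of draws onto each residue. On the wide tiers the draw would be several limbs rather than one `u64`, but the accept-then-reduce order stays the same.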

Adapter crates (in-workspace) - DB / serialisation bridges

The core crate is deliberately compile-time-fixed-precision: a runtime-variable scale would break const-fn arithmetic, deterministic limb work, and the per-tier specialised transcendentals. Database / serialisation ergonomics that need a runtime scale are a better fit as thin adapter crates layered on top.

Layout decision: adapters live in this repo as sibling workspace crates. They're thin shims, they version-couple tightly to the core, and atomic cross-crate refactors (D38.widen() moving, SCALE going signed, etc.) land in one PR with one CI run.

crate (proposed)	what it bridges	status
`decimal-scaled-sqlx`	Map SQL `NUMERIC(p, s)` columns to a caller-chosen `D{N}<SCALE>`; handle string-form fallback for non-matching scale	TODO
`decimal-scaled-diesel`	Same shape for Diesel's `Numeric` SQL type	TODO
`decimal-scaled-arrow`	Arrow `Decimal128` / `Decimal256` column round-trip	TODO
`decimal-scaled-protobuf`	A protobuf `Decimal` message round-trip helper	TODO

These intentionally live outside the core crate so the core stays no_std, has no DB / serialisation drivers as deps, and keeps a small public surface. Each adapter owns its own runtime-scale negotiation and converts at the boundary into the caller's compile-time-fixed tier.
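A sketch of what "converts at the boundary" could look like for one adapter, assuming the database hands over an unscaled integer plus a runtime scale (as `NUMERIC(p, s)` drivers commonly do). `rescale_exact` is a hypothetical helper, not a planned API: it moves the value to the caller's compile-time scale only when that is lossless and otherwise returns `None`, which is where a string-form fallback would take over.

```rust
/// Re-express `unscaled * 10^(-from_scale)` at `to_scale` fractional digits.
/// Returns None if the result would overflow i128 or lose non-zero digits.
pub fn rescale_exact(unscaled: i128, from_scale: u32, to_scale: u32) -> Option<i128> {
    if to_scale >= from_scale {
        // Widening the scale: append zeros, checking for overflow.
        let factor = 10i128.checked_pow(to_scale - from_scale)?;
        unscaled.checked_mul(factor)
    } else {
        // Narrowing the scale: allowed only if the dropped digits are zero.
        let factor = 10i128.checked_pow(from_scale - to_scale)?;
        if unscaled % factor == 0 {
            Some(unscaled / factor)
        } else {
            None
        }
    }
}
```

For example, `123.45` stored as `(12345, 2)` becomes `1234500` at scale 4, whereas `1.2345` stored as `(12345, 4)` cannot be represented at scale 2 without rounding and is refused.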

Ecosystem crates (separate repos under the `mootable` org) - applications of the 0-ULP core

Layout decision: ecosystem crates live in their own repos under a shared GitHub org, not in this workspace. They're substantial standalone codebases with distinct contributor audiences (finance, symbolic-math, formula-DSL), independent release cadence, and per-domain CI policy (finance wants regulator-driven golden-vector tests; expr might want fuzz CI; math wants property-based tests over algebraic laws). Splitting them out keeps each repo focused and lets specialists own their domain without learning the wide-int internals.

The core ships the deterministic primitive; the interesting applications layer above it. Three planned downstream crates that together turn decimal-scaled into a complete numerical toolkit without bloating the core:

crate (proposed)	what it adds	why it wants this backend
`decimal-scaled-expr`	Dual-track expression engine: type-level builders for compile-time-known shapes that monomorphise to direct decimal ops, plus a runtime AST for spreadsheet-style string formulas	Bit-exact reproducible formula / rule engine; the only such engine that doesn't drift on `0.1 + 0.2`. Shared substrate for the math + finance crates below
`decimal-scaled-math`	Extended types: complex, rationals, vectors, matrices (small + sparse), statistical distributions, interval arithmetic, error propagation	0-ULP determinism propagates through every algebra; lets you compose linear-algebra pipelines that are bit-identical across machines
`decimal-scaled-finance`	Time-value-of-money (NPV, IRR, PV, FV), amortisation schedules, day-count conventions (ACT/360, 30/360, ACT/ACT, ACT/365), bond pricing, Black-Scholes, FX with caller-chosen rounding	Finance is the original deterministic-decimal use case; every regulator-facing calc must reproduce exactly across re-runs and across counterparties

Same out-of-tree principle as the adapters: each lives in its own crate, depends on decimal-scaled only, opts into the tier (d76, d307, etc.) it actually needs, and exposes its surface generically over Decimal-trait-implementing types so the caller picks the storage tier. The finance crate in particular benefits from *_with(mode) propagation: every regulator has its own last-digit-rounding rule (HALFUP, HALFDOWN, HALFEVEN), and we can honour each per-call without forking the engine.
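The three last-digit rules differ only in how an exact tie is resolved, which a small integer sketch makes visible. The enum and function below are hypothetical and do not reflect the crate's `*_with(mode)` signatures; the tie semantics are also an assumption (HALFUP rounds ties away from zero, HALFDOWN toward zero, HALFEVEN to the even neighbour), following common finance usage.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum RoundingMode {
    HalfUp,
    HalfDown,
    HalfEven,
}

/// Divide `n` by a positive `d`, rounding to the nearest integer and
/// resolving exact ties per `mode`. Assumes d <= i128::MAX / 2.
pub fn div_round(n: i128, d: i128, mode: RoundingMode) -> i128 {
    assert!(d > 0, "divisor must be positive");
    let q = n / d; // truncates toward zero
    let r = n % d; // carries the sign of n
    let twice_r = 2 * r.abs();
    let round_away = if twice_r > d {
        true // past the midpoint: every mode rounds away from zero
    } else if twice_r < d {
        false // short of the midpoint: every mode truncates
    } else {
        match mode {
            RoundingMode::HalfUp => true,
            RoundingMode::HalfDown => false,
            RoundingMode::HalfEven => q % 2 != 0,
        }
    };
    if round_away { q + n.signum() } else { q }
}
```

Dropping one digit from `25` gives 3, 2 and 2 under the three modes respectively, while `26` and `24` give the same answer in all of them; only the tie case needs the per-call mode.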

Expression engine design notes (`decimal-scaled-expr`)

Two complementary tracks under a single Compute trait:

Type-level expression templates. Operator overloads on a small Expr<...> newtype build a zero-sized AST in the type system (Add<X, Mul<Y, Lit>>). #[inline(always)] traversal at .eval() time lets the compiler see straight through to direct decimal ops — when the operands are bound at the call site this should compile to the same machine code as hand-written arithmetic. Math + finance APIs use this track because their formulas are mostly fixed-shape at the call site (NPV is always Σ cf_i / (1+r)^i; cross-product is always a × b - c × d).
Runtime AST. Box<Node> shape for spreadsheet-style string-parsed formulas, evaluated by tree traversal. Required for the dynamic-expression use case; same Compute trait so downstream code is path-agnostic.
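A minimal sketch of the two tracks sharing one trait, using `i64` as a stand-in for a decimal type. All names other than `Compute` are hypothetical, operator overloading is left out for brevity, and the real design may differ; the aim is only to show how a type-level tree and a boxed tree can sit behind the same interface.

```rust
/// Shared evaluation interface for both tracks.
pub trait Compute {
    fn compute(&self) -> i64;
}

// Track 1: the expression shape lives in the type, e.g. Add<Lit, Mul<Lit, Lit>>.
pub struct Lit(pub i64);
pub struct Add<L, R>(pub L, pub R);
pub struct Mul<L, R>(pub L, pub R);

impl Compute for Lit {
    #[inline(always)]
    fn compute(&self) -> i64 {
        self.0
    }
}

impl<L: Compute, R: Compute> Compute for Add<L, R> {
    #[inline(always)]
    fn compute(&self) -> i64 {
        self.0.compute() + self.1.compute()
    }
}

impl<L: Compute, R: Compute> Compute for Mul<L, R> {
    #[inline(always)]
    fn compute(&self) -> i64 {
        self.0.compute() * self.1.compute()
    }
}

// Track 2: the shape is only known at run time, so nodes are boxed.
pub enum Node {
    Lit(i64),
    Add(Box<Node>, Box<Node>),
    Mul(Box<Node>, Box<Node>),
}

impl Compute for Node {
    fn compute(&self) -> i64 {
        match self {
            Node::Lit(v) => *v,
            Node::Add(l, r) => l.compute() + r.compute(),
            Node::Mul(l, r) => l.compute() * r.compute(),
        }
    }
}

/// Downstream code written once against the trait works with either track.
pub fn doubled<E: Compute>(expr: &E) -> i64 {
    2 * expr.compute()
}
```

Because the type-level variant is fully monomorphised, the optimiser has the whole expression in view and can in principle reduce it to plain arithmetic, whereas the `Node` variant pays for pointer chasing and a `match` per node.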

The "lazy decimal" idea is the bridge: the expr types behave like decimals (impl Add / Mul / etc.) and can be passed wherever a Decimal is expected. Code that materialises immediately (let result: D38<12> = (x + y * 2).compute();) compiles away the AST entirely under the type-level track. Code that defers the materialisation gets symbolic manipulation, caching, partial evaluation, etc., for free.

Math and finance crates compose on top by importing the trait and writing their algorithms once — npv(cashflows, rate) works identically whether cashflows is a Vec<D38<12>> (immediate arithmetic), a Vec<Expr<D38<12>>> (lazy with re-evaluation), or a Vec<DynExpr> (parsed from a spreadsheet cell).

Whole-tree serialisation is a first-class requirement. Expressions (the entire AST, not just the materialised result) need to round-trip through Serialize / Deserialize so they can be:

persisted to disk (spreadsheet save / load, business-rule storage, version-controlled formula libraries);
transmitted over the network (API submission of a custom formula; remote evaluation; send-formula-to-server with the values resolved on the receiving side);
written to audit logs for regulator-facing finance work (every applied formula recorded in its exact deserialisable form, so a re-run reproduces bit-identically — input values, intermediate operator tree, applied rounding mode, all stored as one tagged payload);
diff-able as text (RON / JSON / S-expression for human-readable change review of business rules);
equality-comparable in serialised form (two formulas that serialise to the same payload are structurally identical; useful for memoisation keys and conflict detection).

Implementation shape: the runtime AST Box<Node> is the natural serialisation target, with every operator node, every literal, every variable reference, and every nested sub-tree emitted as a tagged-union element in the payload. The type-level templates can serialise (visit the type-level AST, emit the same tagged-union sequence), but deserialisation always materialises a DynExpr because the inbound shape isn't known at compile time. The Compute trait abstracts both so callers don't care which side they got. Multiple wire formats are supported behind feature flags (serde-json, serde-postcard, serde-ron) without forcing a default dependency. A schema versioning scheme on the root node keeps long-lived persisted formulas decodable as the AST grows new node types.
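As a rough picture of the tagged-union payload, the sketch below hand-writes an S-expression emitter for a small runtime AST with a version tag on the root. It deliberately avoids serde so it stays dependency-free, covers serialisation only, and every name in it (`Node`, `SCHEMA_VERSION`, `to_sexpr`) as well as the payload layout is an assumption made for illustration.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Node {
    Lit(i64),
    Var(String),
    Add(Box<Node>, Box<Node>),
    Mul(Box<Node>, Box<Node>),
}

/// Hypothetical schema version carried on the root of every payload.
pub const SCHEMA_VERSION: u32 = 1;

/// Emit one node as `(tag operand...)`, recursing into sub-trees.
fn write_node(node: &Node, out: &mut String) {
    match node {
        Node::Lit(v) => {
            out.push_str("(lit ");
            out.push_str(&v.to_string());
            out.push(')');
        }
        Node::Var(name) => {
            out.push_str("(var ");
            out.push_str(name);
            out.push(')');
        }
        Node::Add(l, r) => write_binary("add", l, r, out),
        Node::Mul(l, r) => write_binary("mul", l, r, out),
    }
}

fn write_binary(tag: &str, l: &Node, r: &Node, out: &mut String) {
    out.push('(');
    out.push_str(tag);
    out.push(' ');
    write_node(l, out);
    out.push(' ');
    write_node(r, out);
    out.push(')');
}

/// Serialise the whole tree, wrapped in a versioned root element.
pub fn to_sexpr(root: &Node) -> String {
    let mut out = format!("(expr v{} ", SCHEMA_VERSION);
    write_node(root, &mut out);
    out.push(')');
    out
}
```

Two structurally identical trees produce the same string, so the payload can double as a memoisation key or a diff-able text form, in line with the requirements listed above. A real implementation would also need escaping for variable names and a matching parser.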