aozora is a 21-crate workspace. The split exists for three reasons:
narrow each crate’s compile surface (faster cargo check), pin
dependency boundaries (cycles are forbidden by the layout), and let
each binding (CLI, WASM, FFI, Python) compose only the layers it
needs.
aozora-spec
Single source of truth for shared types: Span, Diagnostic, TriggerKind, PairKind, PUA sentinel codepoints, SLUGS dispatch table. No internal dependencies, so every other crate may depend on it.
aozora-veb
no_std Eytzinger-layout sorted-set lookup. Cache-friendly binary search for sub-256-entry registries.
aozora-syntax
AST node types — AozoraNode<'src>, Container<'src>, Bouten<'src>, Ruby<'src>, …. Borrows from the bumpalo arena.
aozora-encoding
Shift_JIS decoding, JIS X 0213 patch, 外字 PHF resolver, accent decomposition.
aozora-scan
SIMD-friendly multi-pattern byte scanner (Phase 1’s trigger scan). One of three crates that locally relax unsafe_code, in this case for aligned-load SIMD intrinsics.
Four-phase lexer (sanitize → events → pair → classify) plus the lex_into_arena orchestrator that fuses normalize + registry + diagnostics into a single output walk.
aozora-render
HTML and canonical-serialisation walkers. Single O(n) tree pass each; no allocation outside the output buffer.
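To make the Eytzinger-layout lookup from the crate list above more concrete, here is a minimal sketch of the general technique. It is not the crate's actual code: the `EytzingerSet` name, the `u32` key type, and the use of `Vec` (which a real no_std build would replace with a fixed-size array) are all assumptions made for illustration.

```rust
/// Sorted set stored in Eytzinger (breadth-first) order.
/// Slot 0 is unused so the children of node `k` sit at `2k` and `2k + 1`.
/// Hypothetical type; uses `Vec` for brevity, so this sketch is not no_std.
pub struct EytzingerSet {
    slots: Vec<u32>,
}

impl EytzingerSet {
    /// Builds the layout from a sorted, de-duplicated slice.
    pub fn from_sorted(sorted: &[u32]) -> Self {
        let mut slots = vec![0u32; sorted.len() + 1];
        let mut next = 0usize;
        fill(sorted, &mut slots, &mut next, 1);
        Self { slots }
    }

    /// Walks from the root towards a leaf; each step touches a slot
    /// that is close in memory to its siblings on the same level.
    pub fn contains(&self, needle: u32) -> bool {
        let mut k = 1usize;
        while k < self.slots.len() {
            let v = self.slots[k];
            if v == needle {
                return true;
            }
            // Go left (2k) when the needle is smaller, right (2k + 1) otherwise.
            k = 2 * k + usize::from(v < needle);
        }
        false
    }
}

/// In-order walk of the implicit tree: assigning sorted keys in this
/// order makes every node larger than its left subtree and smaller
/// than its right subtree.
fn fill(sorted: &[u32], slots: &mut [u32], next: &mut usize, k: usize) {
    if k < slots.len() {
        fill(sorted, slots, next, 2 * k);
        slots[k] = sorted[*next];
        *next += 1;
        fill(sorted, slots, next, 2 * k + 1);
    }
}
```

For example, a set built with `EytzingerSet::from_sorted(&[2, 3, 5, 7, 11])` answers `contains(7)` with `true` and `contains(8)` with `false`. Compared with a plain binary search, the first few probes land near each other at the front of the array, which is where the cache-friendliness comes from.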
A single crate holding the same code would force a full
recompile on any internal change. With the workspace split, a
change in the renderer doesn’t touch the lexer, scanner, or any of
the bindings, so incremental compile times stay sub-second on
iteration.
aozora-veb and aozora-spec are no_std-clean. aozora-scan is
no_std-clean by default; the SIMD backends opt in to the std
feature for runtime CPU detection. That matters for the wasm32
build (where std is a real cost) and would matter for embedded
targets if anyone ever needed one. Keeping them in dedicated crates
enforces the no_std discipline at the crate-graph level —
adding a std import would require depending on a std-using
crate, which is a visible Cargo.toml change.
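The "no_std by default, std as an opt-in feature" arrangement described above can be sketched roughly as follows. This is only an illustration under assumptions: the `Backend` enum and `pick_backend` function are hypothetical names, and the feature is assumed to be called `std` as in the text.

```rust
// The crate root of such a crate would typically begin with:
//     #![cfg_attr(not(feature = "std"), no_std)]
// It is left as a comment here so the snippet also compiles as a plain file.

/// Which scan backend to use. Hypothetical names for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Scalar,
    Avx2,
}

/// With the `std` feature on x86_64, ask the CPU at runtime.
#[cfg(all(feature = "std", target_arch = "x86_64"))]
pub fn pick_backend() -> Backend {
    if std::arch::is_x86_feature_detected!("avx2") {
        Backend::Avx2
    } else {
        Backend::Scalar
    }
}

/// Without `std` (or on other targets) there is no runtime detection,
/// so fall back to the portable scalar path.
#[cfg(not(all(feature = "std", target_arch = "x86_64")))]
pub fn pick_backend() -> Backend {
    Backend::Scalar
}
```

The point of the sketch is that the only code touching `std` sits behind the feature gate, so a build without the feature never links it.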
The C ABI driver (aozora-ffi) needs aozora + serde and nothing
else. It does not pull in the bench harness, the trace loader, or
the corpus crate. The wasm driver is similarly minimal. Each
binding’s dependency closure is exactly what it needs — which is
what keeps the wasm bundle inside its 500 KiB budget.
A few things stay co-located despite plausible split points:
HTML render and canonical serialise in aozora-render. Both
are tree walkers; sharing the visitor helper between them keeps
the implementation small.
Phase 0 sanitize sub-passes in aozora-pipeline. Each sub-pass
is < 100 LOC and operates on the same &str slice; pulling them
out would create a 5-crate ecosystem for a transformation that’s
conceptually one phase.
Trigger-byte enum and pair-kind enum in aozora-spec. They’re
used by both aozora-scan (which produces them) and
aozora-pipeline (which consumes them); putting them in spec
avoids a back-reference.
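The last point, keeping the shared enums in the spec layer so that producer and consumer never reference each other, can be sketched with modules standing in for crates. Everything here is assumed for illustration: the `TriggerKind` variants, `from_char`, `triggers`, and `count_ruby_opens` are hypothetical, and the real scanner works on bytes rather than chars.

```rust
/// Stand-in for the spec layer: depends on nothing else.
mod spec {
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    pub enum TriggerKind {
        RubyOpen,        // 《
        RubyClose,       // 》
        AnnotationOpen,  // ［
        AnnotationClose, // ］
    }

    impl TriggerKind {
        pub fn from_char(c: char) -> Option<Self> {
            match c {
                '《' => Some(Self::RubyOpen),
                '》' => Some(Self::RubyClose),
                '［' => Some(Self::AnnotationOpen),
                '］' => Some(Self::AnnotationClose),
                _ => None,
            }
        }
    }
}

/// Stand-in for the scanner: produces (byte offset, kind) pairs.
mod scan {
    use super::spec::TriggerKind;

    pub fn triggers(src: &str) -> Vec<(usize, TriggerKind)> {
        src.char_indices()
            .filter_map(|(i, c)| TriggerKind::from_char(c).map(|k| (i, k)))
            .collect()
    }
}

/// Stand-in for the pipeline: consumes the pairs. It imports `spec`
/// only, never `scan`, so no back-reference is needed.
mod pipeline {
    use super::spec::TriggerKind;

    pub fn count_ruby_opens(events: &[(usize, TriggerKind)]) -> usize {
        events
            .iter()
            .filter(|(_, k)| *k == TriggerKind::RubyOpen)
            .count()
    }
}
```

A caller wires the two together, for instance `pipeline::count_ruby_opens(&scan::triggers("青空《あおぞら》"))`, while each module only ever names `spec`.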
Splits aren’t free: every additional crate adds a Cargo.toml, a
README, doc-link reachability, and a test surface. A split lands only
when one of the three benefits named at the top (a narrower compile
surface, a pinned dependency boundary, or a leaner binding) is real.