Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Crate map

aozora is an 18-crate workspace. The split exists for three reasons: narrow each crate’s compile surface (faster cargo check), pin dependency boundaries (cycles are forbidden by the layout), and let each binding (CLI, WASM, FFI, Python) compose only the layers it needs.

At a glance

flowchart TD
    subgraph foundation
      spec
    end
    subgraph types
      veb
      syntax
      encoding
      scan
    end
    subgraph parser
      lexer
      lex
      render
    end
    subgraph facade
      aozora_facade["aozora"]
    end
    subgraph bindings
      cli
      ffi
      wasm
      py
    end
    subgraph dev
      bench
      corpus
      test_utils["test-utils"]
      trace
      xtask
    end

    spec --> veb
    spec --> syntax
    spec --> encoding
    spec --> scan
    veb --> lexer
    syntax --> lexer
    encoding --> lexer
    scan --> lexer
    lexer --> lex
    lex --> render
    render --> aozora_facade
    aozora_facade --> cli
    aozora_facade --> ffi
    aozora_facade --> wasm
    aozora_facade --> py
    aozora_facade --> bench
    corpus --> bench
    test_utils --> lexer
    trace --> xtask

Per-crate purpose

Foundation

CrateRole
aozora-specSingle source of truth for shared types: Span, Diagnostic, TriggerKind, PairKind, PUA sentinel codepoints. No internal dependencies — every other crate may depend on it.

Types & primitives

CrateRole
aozora-vebno_std Eytzinger-layout sorted-set lookup. Cache-friendly binary search for sub-256-entry registries.
aozora-syntaxAST node types — AozoraNode<'src>, Container<'src>, Bouten<'src>, Ruby<'src>, …. Borrows from the bumpalo arena.
aozora-encodingShift_JIS decoding, JIS X 0213 patch, 外字 PHF resolver, accent decomposition.
aozora-scanSIMD-friendly multi-pattern byte scanner. The only crate (besides aozora-ffi) that locally relaxes unsafe_code — for aligned-load SIMD intrinsics.

Parser

CrateRole
aozora-lexerSeven-phase classifier pipeline (sanitize → scan → tokenize → classify → pair → resolve → diagnostics). Emits the diagnostic catalogue.
aozora-lexStreaming orchestrator — fused lex_into_arena over the lexer’s per-phase calls. The front door for the public crate.
aozora-renderHTML and canonical-serialisation walkers. Single O(n) tree pass each; no allocation outside the output buffer.

Facade

CrateRole
aozoraPublic facade. Document::parse() -> AozoraTree<'_>, tree.to_html(), tree.serialize(), tree.diagnostics(). The single import for library consumers.

Bindings

CrateRole
aozora-cliThe aozora binary (check / fmt / render).
aozora-ffiC ABI driver. Opaque handles, JSON-encoded structured data. Locally relaxes unsafe_code; every block carries a // SAFETY: comment.
aozora-wasmwasm32-unknown-unknown target with wasm-bindgen exports.
aozora-pyPyO3 binding shipped via maturin.

Development-only

CrateRole
aozora-benchCriterion + corpus-driven probes. Source of the PGO training data.
aozora-corpusCorpus source abstraction (zstd-archived, blake3-pinned). Dev-only.
aozora-proptestShared proptest strategies. Dev-only.
aozora-traceDWARF symbolicator + samply gecko-trace loader. Dev-only.
aozora-xtaskHost-side dev tooling (samply wrapper, trace analysis, corpus pack/unpack). Not on the just build path.

Why 18 crates?

Three concrete wins from the split.

1. Compile latency

A single-crate workspace with the same code would force a full re-compile on any internal change. With 18 crates, a change in the renderer doesn’t touch the lexer, scanner, or any of the bindings — incremental compile times stay sub-second on iteration.

2. No-std reach

aozora-veb and aozora-spec are no_std-clean. That matters for the wasm32 build (where std is a real cost) and would matter for embedded targets if anyone ever needed one. Keeping them in dedicated crates enforces the no_std discipline at the crate-graph level — adding a std import would require depending on a std-using crate, which is a visible Cargo.toml change.

3. Binding modularity

The C ABI driver (aozora-ffi) needs aozora + serde_json and nothing else. It does not pull in the bench harness, the trace loader, or the corpus crate. The wasm driver is similarly minimal. Each binding’s dependency closure is exactly what it needs — which is what keeps the wasm bundle inside its 500 KiB budget.

What we deliberately don’t split

A few things stay co-located despite plausible split points:

  • HTML render and canonical serialise in aozora-render. Both are tree walkers; sharing the walk() helper between them keeps the implementation small.
  • Phase 0 sanitize sub-passes in aozora-lexer. Each sub-pass is < 100 LOC and operates on the same &str slice; pulling them out would create a 5-crate ecosystem for a transformation that’s conceptually one phase.
  • Trigger-byte enum and pair-kind enum in aozora-spec. They’re used by both aozora-scan (which produces them) and aozora-lexer (which consumes them); putting them in spec avoids a back-reference.

Splits aren’t free — every additional crate adds a Cargo.toml, a README, doc-link reachability, and a test surface. Splits land when the cohesion benefit (one of the three above) is real.

See also