Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Seven-phase lexer

aozora-lexer runs as seven distinct phases, each a pure function on the previous phase’s output. The split exists because each phase has a different cost profile — separating them keeps the dominant hot path (Phase 2 tokenize) tight, and lets the bench harness measure each phase independently via the phase_breakdown probe.

Phase ordering

flowchart LR
    p0["Phase 0<br/>sanitize"]
    p1["Phase 1<br/>scan triggers"]
    p2["Phase 2<br/>tokenize"]
    p3["Phase 3<br/>classify"]
    p4["Phase 4<br/>pair containers"]
    p5["Phase 5<br/>resolve targets"]
    p6["Phase 6<br/>diagnostics"]

    p0 --> p1 --> p2 --> p3 --> p4 --> p5 --> p6

Each arrow carries a small data structure (offsets, slices, AST nodes); no phase reads back into a previous phase’s output.

PhaseInputOutputWhat it does
0 — Sanitizeraw &strnormalised &strBOM strip, CRLF → LF, accent decomp, PUA assignment for gaiji refs
1 — Scannormalised &strtrigger offsets &[Trigger]SIMD multi-pattern scan for |《》※[]
2 — Tokenizenormalised &str + offsets&[Token]Slice the source at trigger boundaries; classify each slice as Plain / Open / Close / RefMark
3 — Classify&[Token]&[ClassifiedToken]Recogniser registry decides what each [#…] body actually is
4 — Pair&[ClassifiedToken]&[Container]Bracket matching: openers ↔ closers, build container tree
5 — Resolve&[Container] + sourceAozoraTree<'_>Look-back resolution for bouten / tcy targets, tie inline annotations to AST nodes
6 — DiagnosticsAozoraTree<'_> + accumulatorDiagnosticsCollect diagnostics from earlier phases, sort by span, pin codes

Phase 0: sanitize

The most varied phase by what it touches. Sub-passes:

  • bom_strip — UTF-8 / UTF-16 BOM detection and removal.
  • crlf — CRLF → LF in one memchr2 pass.
  • rule_isolate — separate inline ※[#…] from following text so the tokenizer has unambiguous boundaries.
  • accent — 114 ASCII digraph / ligature decomposition (see Notation → Gaiji).
  • pua_scan — assign each ※[#…] reference a private-use codepoint inline so subsequent phases treat it as a single character.

Each sub-pass is independent; phase0_breakdown probe measures them separately. In the corpus sweep, pua_scan dominates Phase 0 (60% of phase wall time on average) because it has to ※[#…] scan the whole document — the SIMD scanner from Phase 1 isn’t yet active.

Phase 1: scan triggers

The hot path. SIMD multi-pattern scan for the seven trigger bytes:

|  《  》  ※  [  ]    (full-width space)

The chosen scanner backend (Teddy, Hoehrmann DFA, memchr-based) produces a &[Trigger] of byte offsets. See SIMD scanner backends for the selection logic.

Throughput on a typical mid-size work (crime_and_punishment.txt, ~600 KiB UTF-8): ~12 GB/s on Teddy, ~3.5 GB/s on the DFA fallback. Both are well above the rest of the pipeline’s throughput — Phase 1 is essentially free at the corpus level.

Phase 2: tokenize

Slice the source at trigger boundaries and classify each slice:

pub enum Token<'src> {
    Plain(&'src str),
    Open(OpenKind, Span),
    Close(CloseKind, Span),
    RefMark(Span),                // ※ in isolation
}

Single linear pass over the trigger array; no allocation outside the output Vec (which is sized exactly from the trigger count).

Phase 3: classify

The most code-heavy phase. The classifier registry has one recogniser per [#…] directive family:

  • RubyRecogniser
  • BoutenRecogniser
  • TcyRecogniser
  • IndentRecogniser
  • AlignRecogniser
  • LineLengthRecogniser
  • BreakRecogniser
  • KaeritenRecogniser
  • … 17 in total

The recognisers run in deterministic order; the first recogniser that matches the directive body wins. Order matters because some directive bodies are valid prefixes of others (e.g. ここから2字下げ is valid prefix of ここから2字下げ、地寄せ). Compile-time tests in aozora-lexer enforce ordering invariants.

The recognisers themselves are short (most are < 50 LOC) — the bulk of classify cost is the phf::Map of directive prefixes the recognisers share for opener detection.

Phase 4: pair

Bracket matching. Walk the classified token stream, push openers onto a stack, pop on closers, fail if mismatched. The output is a tree of Container<'_> nodes whose children are flat &[Token<'_>] slices.

Single linear pass; the stack is a SmallVec<[ContainerKind; 8]> so it stays on the stack for typical 1–4 deep nesting.

Phase 5: resolve

Bouten / tcy targets quote-by-look-back: the directive [#「平和」に傍点] applies to the most recent 平和 in the preceding text. Phase 5 walks the container tree and resolves these references.

Pre-Phase-5 the tree carries unresolved BoutenRef { target: "平和" } nodes; post-Phase-5 they’re Bouten { target_span: Span } pointing at the actual matched run. The resolver uses an aho-corasick DFA over the live target strings — single-pass over the source, no recogniser-order dependencies.

Phase 6: diagnostics

Collect, sort by span, pin codes. Diagnostics emitted in earlier phases were buffered in a DiagnosticAccumulator threaded through the call stack; Phase 6 sorts them and assigns the stable error codes (E0001, W0001, …).

Why seven phases, not one big function?

Three reasons.

  1. Bench-driven optimisation. The phase_breakdown probe reports per-phase wall time per corpus document. Knowing that “this document spends 80% of parse time in Phase 3 classify” tells you exactly where to focus a perf PR. A monolithic lex() would force you to re-instrument every PR.
  2. Spec compliance. Each phase corresponds to a discrete transformation that the spec describes. If a spec gap shows up in production, it almost always lands in one phase, and the test harness can pin a regression test that exercises that phase only.
  3. Composability. aozora-lexer exposes both the fused lex_into_arena and the per-phase calls. The fused version is what aozora-lex ships to consumers; the per-phase calls are what the bench harness and integration tests use to isolate regressions.

The cost is conceptual (more API surface internal to the lexer); the win is that every perf decision in the parser has a measurement attached.

See also