Welcome
aozora is a pure-functional Rust parser for 青空文庫記法 (Aozora Bunko notation) — the in-text annotation language used by 青空文庫, the long-running volunteer digital library of Japanese literature in the public domain.
It handles ruby (|青梅《おうめ》), bouten / bousen
([#「X」に傍点]), 縦中横, gaiji references
(※[#…、第3水準1-85-54]), kunten / kaeriten, indent and align
containers ([#ここから2字下げ]… [#ここで字下げ終わり]), and
page / section breaks — every notation that appears in a real Aozora
Bunko .txt source.
The repository is CommonMark-free, Markdown-free: aozora deals only with the 青空文庫 notation. The renderer emits semantic HTML5; the lexer reports structured diagnostics; the AST is a borrowed-arena tree that can be walked in O(n) without copying source bytes. If you want a Markdown dialect that also understands aozora notation, see the sibling project afm, which is built on top of this parser.
What this handbook is for
A practical tour and a deep reference, in one document.
- Tour — install the CLI, drop the library into a Rust project, or call it from WASM, C, or Python.
- Notation reference — every annotation aozora recognises, with examples, output, edge cases, and the diagnostics that fire when authors get them subtly wrong.
- Architecture — what makes aozora fast and small: the borrowed-arena AST, the seven-phase lexer, the SIMD scanner backends (Teddy, structural bitmaps, Hoehrmann-style multi-pattern DFA), Eytzinger-layout sorted-set lookup, and the Shift_JIS + 外字 resolver. Every choice is motivated against the alternative we didn’t take.
- Performance — the release-profile decisions, PGO pipeline, samply workflow, criterion benchmarks, and the parallel corpus sweep that exercises the parser against every Aozora Bunko work.
- Reference & contributing — CLI, env vars, rustdoc API, and how the dev loop / TDD policy / release pipeline fit together.
Project shape
aozora is a single-author, green-field project that takes the opportunity to reach for the right algorithm and data structure for each problem rather than the obvious naive one. That orientation permeates every chapter — when you read about the scanner or the arena or the gaiji table, you’ll see the why of each technique spelled out, not just what the code does.
Status
v0.2.x working set. The CLI, Rust library, WASM, C ABI, and Python
binding all build and pass the integration smoke tests in CI. Public
crates.io publication is gated on the v1.0 API freeze; in the
meantime, depend on a tagged commit (see
Install).
A live build of this site lives at https://p4suta.github.io/aozora/; the rustdoc API reference is layered underneath at https://p4suta.github.io/aozora/api/aozora/.
Install
aozora ships in five shapes — pick the one that matches how you want to consume the parser.
CLI binary (release archive)
Pre-built aozora binaries for the three Tier-1 platforms ride on
every GitHub Release:
- aozora-vX.Y.Z-x86_64-unknown-linux-gnu.tar.gz
- aozora-vX.Y.Z-aarch64-apple-darwin.tar.gz
- aozora-vX.Y.Z-x86_64-pc-windows-msvc.zip
Each archive is shipped with a SHA256SUMS companion. Browse them at
https://github.com/P4suta/aozora/releases.
curl -L -O \
https://github.com/P4suta/aozora/releases/latest/download/aozora-x86_64-unknown-linux-gnu.tar.gz
tar -xzf aozora-*.tar.gz
sudo install -m 0755 aozora /usr/local/bin/
aozora --version
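The SHA256SUMS companion lets you verify the download before installing. The snippet below demonstrates the sha256sum -c step against a locally created stand-in file (the real archive name carries the release version and target triple):

```shell
# Demonstration of SHA256SUMS verification using a stand-in file;
# substitute the downloaded release archive in real use.
cd "$(mktemp -d)"
printf 'stand-in archive contents\n' > aozora.tar.gz
sha256sum aozora.tar.gz > SHA256SUMS
sha256sum -c SHA256SUMS   # prints "aozora.tar.gz: OK" when the hash matches
```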
CLI binary (build from source)
Cargo can build the CLI directly from the repository. The --locked
flag is non-negotiable — it pins to the exact Cargo.lock we shipped,
which matters because the workspace uses fat LTO (mismatched dep
versions silently change inlining behaviour).
Latest main (default — tracks the development tip):
cargo install --git https://github.com/P4suta/aozora --locked aozora-cli
Reproducible build pinned to a release tag (replace the tag with the current value from the releases page):
cargo install --git https://github.com/P4suta/aozora \
--tag v0.3.0 --locked aozora-cli
Rust library
aozora is not yet on crates.io — public release tracks the v1.0 API freeze. Until then, depend on a tagged commit. This snippet is the single source of truth for the recommended pin — every other doc links here instead of inlining the tag, so a new release only needs this one block updated:
[dependencies]
aozora = { git = "https://github.com/P4suta/aozora.git", tag = "v0.3.0" }
aozora-encoding = { git = "https://github.com/P4suta/aozora.git", tag = "v0.3.0" }
The current tag is whatever
GitHub Releases marks as
Latest; bump the two tag = "..." lines accordingly.
Ship-it pattern: pin the tag in Cargo.toml, let Dependabot bump it
on the next release. The repo follows Conventional Commits and
SemVer; breaking changes always advance the major version (post-1.0)
or the minor version (during 0.x).
WASM (browser / Node)
rustup target add wasm32-unknown-unknown # one-time
wasm-pack build --target web --release crates/aozora-wasm
The post-wasm-opt artifact has a 500 KiB size budget. See
Bindings → WASM for the JS surface and the
post-build wasm-opt invocation we recommend.
C ABI
cargo build --release -p aozora-ffi
# → target/release/libaozora_ffi.{so,dylib,a}
# → target/release/aozora.h (cbindgen-generated)
Link with -laozora_ffi and include aozora.h. See
Bindings → C ABI for the API surface and memory
ownership rules.
Python
pip install maturin # one-time
cd crates/aozora-py
maturin develop -F extension-module # install in current venv
maturin build -F extension-module --release # produce a redistributable wheel
See Bindings → Python for the API and the
unsendable thread-safety contract.
Toolchain pin
aozora pins Rust 1.95.0 as its MSRV (rust-toolchain.toml). CI
enforces it via a dedicated msrv job. If you run rustup show
inside the repo and see something else, your local override needs
updating.
CLI Quickstart
The aozora binary covers three operations:
aozora check FILE.txt # lex + report diagnostics on stderr
aozora fmt FILE.txt # round-trip parse ∘ serialize, print to stdout
aozora render FILE.txt # render to HTML on stdout
A lone - (or no path argument) reads from stdin. --encoding sjis (alias
-E sjis) decodes Shift_JIS source — Aozora Bunko’s distributed
.txt files are Shift_JIS, so this flag is the common case for real
corpus work.
Common invocations
# Lex an Aozora Bunko file and print diagnostics
aozora check -E sjis crime_and_punishment.txt
# Render to HTML (stdout)
aozora render -E sjis crime_and_punishment.txt > out.html
# Pipe from stdin
cat src.txt | aozora render -
# CI gate: fail if format is not idempotent
aozora fmt --check src.txt
Flag reference
| Flag | Subcommand | Effect |
|---|---|---|
| -E sjis, --encoding sjis | all | Decode Shift_JIS source. Default is UTF-8. |
| --strict | check | Exit non-zero on any diagnostic. |
| --check | fmt | Exit non-zero if formatted output differs from input. |
| --write | fmt | Overwrite the input file with the canonical form. (Ignored when reading from stdin.) |
| --no-color | all | Disable ANSI colour in diagnostics output. |
| --verbose | all | Print parse phase timings to stderr. |
Exit codes
| Code | Meaning |
|---|---|
| 0 | Success. |
| 1 | Diagnostics emitted under --strict, or formatting mismatch under --check. |
| 2 | Usage error (bad flag, missing file, decode error). |
Diagnostics format
aozora check prints diagnostics in
miette style — a coloured source snippet
with carets pointing at the byte range, a short message, and (where
applicable) a help line:
× ruby reading mismatch: target spans 3 chars but |《》 reading is empty
╭─[input.txt:42:9]
42 │ |青梅《》
· ───┬───
· ╰── empty reading
╰────
help: provide a reading inside 《…》 or remove the | marker
Every diagnostic carries a stable error code (E0001, E0002, …);
see the Diagnostics catalogue for the
full list.
Why not a single subcommand?
check / fmt / render are intentionally separate so each one has
a single, predictable failure mode in shell pipelines:
- check exits 0 on parse success, regardless of warnings (use --strict for “no diagnostics allowed”).
- fmt is a pure-text transform: stdin in, canonical text out. --check upgrades it to a CI gate without forking a second binary.
- render is a pure-text-to-HTML transform with the same exit-code shape.
Combining them behind flags would make the exit-code semantics
ambiguous (does --check mean format-check or strict-check?). Keeping
them split is the same logic that splits gofmt from vet from
go build.
Library Quickstart
The minimal Rust use of aozora is a handful of lines:
use aozora::Document;
fn main() {
let source = std::fs::read_to_string("src.txt").unwrap();
let doc = Document::new(source);
let tree = doc.parse();
println!("{}", tree.to_html());
}
That’s enough to get HTML out of any UTF-8 青空文庫 source. The rest of this page covers the lifetime model, the diagnostic stream, and the AST walk — three things you’ll need once you do anything beyond “render to HTML”.
The lifetime model
Document owns two things: a bumpalo::Bump
arena and the source Box<str>. AozoraTree<'a> borrows from both:
let doc = aozora::Document::new(source); // Document: 'static
let tree = doc.parse(); // AozoraTree<'_> bound to &doc
let html = tree.to_html(); // walks the borrow
// dropping doc releases every node in a single Bump::reset()
drop(doc);
That is: hand the Document around, not the tree. If you need
to keep a parse result alive across function boundaries, the function
takes ownership of (or borrows) the Document, and re-derives the
tree on the inside. This is unusual for Rust libraries — most parse
APIs hand back an owned tree — but it’s what makes aozora’s
zero-copy AST safe. See Architecture → Borrowed-arena AST
for why this trade is worth it.
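The shape of that contract can be sketched without the real crate. The stand-in Doc and View types below are hypothetical (the real types are Document and AozoraTree); they show why callers pass the owner around and re-derive the borrowed view inside:

```rust
// Minimal sketch of the owner-plus-borrowed-view pattern. `Doc` and
// `View` are hypothetical stand-ins, not the aozora API.
struct Doc {
    source: String, // stands in for the Bump arena + Box<str>
}

// The view borrows from the owner; its lifetime is tied to `&Doc`.
struct View<'a> {
    first_line: &'a str,
}

impl Doc {
    fn new(source: String) -> Self {
        Doc { source }
    }
    // Re-derive the borrowed view on demand -- cheap, no copying.
    fn parse(&self) -> View<'_> {
        View {
            first_line: self.source.lines().next().unwrap_or(""),
        }
    }
}

// Functions that need the parse result take the *owner*, not the view.
fn render(doc: &Doc) -> String {
    let view = doc.parse(); // derived inside the borrow
    format!("<p>{}</p>", view.first_line)
}

fn main() {
    let doc = Doc::new("青空文庫\n第二行".to_string());
    println!("{}", render(&doc));
    // `doc` dropped here; every borrowed view is already gone.
}
```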
Shift_JIS input
Aozora Bunko ships its corpus as Shift_JIS. Decode through
aozora-encoding first:
use aozora::Document;
use aozora_encoding::sjis;
let bytes = std::fs::read("src.sjis.txt")?;
let utf8 = sjis::decode_to_string(&bytes)?; // returns Cow<'_, str>
let doc = Document::new(utf8.into_owned());
let tree = doc.parse();
sjis::decode_to_string handles BOM stripping, JIS X 0213 codepoints,
and the Aozora-specific 外字 references that survive the decode pass
as private-use sentinels (resolved later in the parser).
Diagnostics
use aozora::Diagnostic;
let diags: &[Diagnostic] = tree.diagnostics();
for d in diags {
eprintln!("[{}] {} @ {}..{}", d.code, d.message, d.span.start, d.span.end);
}
Each Diagnostic carries a stable error code, a span, and a level.
Diagnostics are non-fatal by design: the parser always produces a
tree, even from malformed input. Callers that want strict behaviour
treat any diagnostic as an error themselves. See the
Diagnostics catalogue for the code list.
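Treating diagnostics as fatal is then a one-line policy decision in the caller. A minimal sketch, using hypothetical stand-in types rather than aozora's real Diagnostic:

```rust
// Hypothetical stand-ins for aozora's Diagnostic / level types; the
// strict-vs-lenient policy logic is the caller's either way.
#[derive(Clone, Copy, PartialEq)]
enum Level { Warning, Error }

struct Diagnostic {
    code: &'static str,
    level: Level,
}

// "Strict" callers: any diagnostic at all fails the run.
fn strict_ok(diags: &[Diagnostic]) -> bool {
    diags.is_empty()
}

// Lenient callers: only hard errors fail the run.
fn lenient_ok(diags: &[Diagnostic]) -> bool {
    diags.iter().all(|d| d.level != Level::Error)
}

fn main() {
    let diags = [Diagnostic { code: "E0001", level: Level::Warning }];
    println!("strict={} lenient={}", strict_ok(&diags), lenient_ok(&diags));
}
```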
Walking the AST
AozoraTree exposes a flat node iterator and a typed enum:
use aozora::AozoraNode;
for node in tree.nodes() {
match node {
AozoraNode::Plain(s) => print!("{s}"),
AozoraNode::Ruby(r) => print!("[ruby:{}={}]", r.target(), r.reading()),
AozoraNode::Bouten(b) => print!("[bouten {}]", b.kind().slug()),
AozoraNode::Tcy(t) => print!("[tcy:{}]", t.text()),
AozoraNode::Gaiji(g) => print!("[gaiji {}]", g.codepoint()),
AozoraNode::Container(c)=> { /* recurse into c.children() */ }
// …
}
}
For richer traversal patterns (visitor, fold, structural diff), the
nodes implement Copy (they’re effectively (tag, &str, &Bump-slice)
triples), so you can keep references around freely as long as the
Document lives.
Round-trip and canonicalisation
Every parse should round-trip:
let parsed = doc.parse();
let canonical: String = parsed.serialize();
assert_eq!(canonical, doc.source()); // for *canonical* input
Real Aozora Bunko sources contain stylistic variations (CRLF vs LF,
NFC vs NFD around accents, half-width vs full-width punctuation) that
the lexer normalises before tokenising. For those the assertion above
holds after aozora fmt has been applied once.
The pure round-trip property is what aozora fmt --check exercises in
CI, and what the corpus sweep verifies across the full Aozora Bunko
catalogue (~17 000 works).
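The property fmt --check relies on is idempotence: formatting already-canonical text is a fixed point. A toy sketch with one of the normalisations listed above (CRLF to LF):

```rust
// Toy normaliser covering one of the variations mentioned above
// (CRLF vs LF). The property `fmt --check` relies on is idempotence:
// formatting already-canonical input is a fixed point.
fn normalize(src: &str) -> String {
    src.replace("\r\n", "\n")
}

fn main() {
    let raw = "第一行\r\n第二行\r\n";
    let once = normalize(raw);
    let twice = normalize(&once);
    // Canonicalisation is idempotent: a second pass changes nothing.
    assert_eq!(once, twice);
    // Raw input round-trips only *after* one canonical pass.
    assert_ne!(raw, once);
    println!("idempotent: {}", once == twice);
}
```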
Where to next
- Notation reference for what each node type represents.
- Architecture → Pipeline overview for what happens between Document::new and Document::parse.
- API reference for the rustdoc-generated surface.
Node reference
aozora exposes 19 NodeKind variants. Each is documented
on its own page with source examples, the rendered HTML, the
serialize round-trip output, the in-memory AST shape, and the
diagnostics it can fire alongside.
The page layout matches the aozora explain <kind> CLI subcommand:
once you find the variant in the table, the deep dive is one click —
or one shell invocation — away.
| Variant | Wire tag | Notation |
|---|---|---|
| Ruby | ruby | |base《reading》 |
| Bouten | bouten | [#「target」に傍点] |
| TateChuYoko | tateChuYoko | [#「12」は縦中横] |
| Gaiji | gaiji | ※[#...、第3水準1-85-54] |
| Indent | indent | [#2字下げ] |
| AlignEnd | alignEnd | [#地から2字上げ] |
| Warichu | warichu | [#割り注]... |
| Keigakomi | keigakomi | [#罫囲み] |
| PageBreak | pageBreak | [#改ページ] |
| SectionBreak | sectionBreak | [#改丁] |
| AozoraHeading | heading | [#見出し] |
| HeadingHint | headingHint | [#「対象」は中見出し] |
| Sashie | sashie | [#挿絵(path.png)入る] |
| Kaeriten | kaeriten | [#返り点 一・二] |
| Annotation | annotation | [#任意のコメント] |
| DoubleRuby | doubleRuby | 《《重要》》 |
| Container | container | [#ここから...]...[#ここで...終わり] |
| ContainerOpen | containerOpen | (NodeRef projection) |
| ContainerClose | containerClose | (NodeRef projection) |
How to read these pages
Every node page follows the same skeleton:
| Section | Content |
|---|---|
| Source examples | One or two minimal Aozora-notation strings that produce this variant. |
| Rendered HTML | What Document::new(src).parse().to_html() emits. |
| Serialize output | What serialize() emits — typically the canonical form of the source. |
| AST shape | The borrowed-AST struct fields the variant carries. |
| When emitted | Phase 3 classification rule that produces this variant. |
| Diagnostics | Codes that may accompany this variant. |
| Related kinds | Cross-links to neighbours (Bouten ↔ Bousen, Indent ↔ Container::Indent, etc.). |
#[non_exhaustive] on NodeKind: a future minor release can add a
new variant without a breaking change. Downstream consumers that
match on NodeKind must therefore include a _ arm.
NodeKind::Ruby
Wire tag: ruby — base text + reading annotation. The most common
non-trivial variant in Aozora Bunko.
Source examples
|青梅《おうめ》
青梅《おうめ》
Both forms classify as Ruby; the leading | (U+FF5C) makes the
delimiter explicit and lets the parser disambiguate the base run
when ambiguous neighbours could otherwise extend the base.
Rendered HTML
<ruby>青梅<rp>(</rp><rt>おうめ</rt><rp>)</rp></ruby>
<rp> parens are emitted so HTML clients without ruby support
still display a readable fallback.
Serialize output
serialize() always emits the explicit-delimiter form
(|base《reading》), so a parse → serialize → parse round-trip is
a fixed point regardless of which form the source used.
AST shape
pub struct Ruby<'src> {
pub base: NonEmpty<Content<'src>>,
pub reading: NonEmpty<Content<'src>>,
pub delim_explicit: bool,
}
Both fields are NonEmpty<Content>;
empty base or reading is rejected upstream and never produces a
Ruby node.
When emitted
Phase 3 classifies a 《…》 pair as ruby when the preceding run is a
sequence of CJK / kana / latin glyphs and the close is followed by
neither a glyph (which would extend the base further) nor a stray
opener.
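A simplified sketch of that backward scan, restricted to the CJK-ideograph case (the real phase-3 classifier also accepts kana and latin runs and checks what follows the closing 》):

```rust
// Simplified sketch of the implicit-base heuristic: scan backwards
// from the 《 over a contiguous kanji run to find the ruby base.
// Illustrative only -- this handles the CJK-ideograph case and
// ignores kana / latin bases and stray-opener checks.
fn is_kanji(c: char) -> bool {
    ('\u{4E00}'..='\u{9FFF}').contains(&c)
}

/// Returns (base, reading) for the first `base《reading》` pair, if any.
fn implicit_ruby(text: &str) -> Option<(&str, &str)> {
    let open = text.find('《')?;
    let close = text[open..].find('》')? + open;
    let reading = &text[open + '《'.len_utf8()..close];
    // Walk backwards over the kanji run immediately preceding 《.
    let base_start = text[..open]
        .char_indices()
        .rev()
        .take_while(|&(_, c)| is_kanji(c))
        .last()
        .map(|(i, _)| i)?;
    Some((&text[base_start..open], reading))
}

fn main() {
    let (base, reading) = implicit_ruby("いざ青梅《おうめ》へ").unwrap();
    println!("{base} => {reading}");
}
```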
Diagnostics
- aozora::lex::unclosed_bracket — unbalanced 《 reaches EOF.
- aozora::lex::unmatched_close — stray 》 with no matching open.
Related kinds
- DoubleRuby — 《《…》》 double-bracket variant.
- Annotation::InvalidRubySpan — fallback when the ruby pair could not be parsed cleanly.
NodeKind::Bouten
Wire tag: bouten — emphasis dots / sidelines over a target span.
Source examples
青空に[#「青空」に傍点]
青空に[#「青空」に丸傍点]
The bracketed annotation refers backwards to the literal text
quoted with 「…」, so the parser resolves the target by string
match against the preceding line(s).
Rendered HTML
<em class="aozora-bouten aozora-bouten-goma aozora-bouten-right">青空</em>に
The two trailing class slots carry the bouten kind (goma,
circle, wavy-line, …) and the position (right for vertical
text, left for the rare under-side variant).
Serialize output
Round-trips to the explicit [#「target」に<kind>傍点] form.
AST shape
pub struct Bouten<'src> {
pub kind: BoutenKind,
pub target: NonEmpty<Content<'src>>,
pub position: BoutenPosition,
}
BoutenKind enumerates the 11 visual variants (Goma, WhiteSesame,
Circle, …); BoutenPosition is Right (default for vertical text)
or Left.
When emitted
Phase 3 sees [#「QUOTE」に <slug>傍点] / [#「QUOTE」に <slug>傍線],
walks back through the recent text to find QUOTE, and emits the
node with the matched span.
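That backward string match can be sketched with rfind, which yields the nearest preceding occurrence (a simplification: the real resolver bounds how far back it searches and works on decoded runs):

```rust
// Sketch of backward target resolution: given the text before a
// [#「X」に傍点] annotation and the quoted target X, find the byte
// span of the nearest preceding occurrence. Illustrative only.
fn resolve_target(preceding: &str, target: &str) -> Option<(usize, usize)> {
    // rfind gives the last (nearest) occurrence before the annotation.
    let start = preceding.rfind(target)?;
    Some((start, start + target.len()))
}

fn main() {
    let preceding = "青空を見た。青空に";
    let (s, e) = resolve_target(preceding, "青空").unwrap();
    println!("span {s}..{e}: {}", &preceding[s..e]);
}
```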
Diagnostics
- aozora::lex::unclosed_bracket — annotation [# opened with no matching ].
- Annotation (fallback) — quote target unresolved.
Related kinds
- Annotation — fallback when the target cannot be matched.
NodeKind::TateChuYoko
Wire tag: tateChuYoko — horizontal text inside a vertical
writing-mode run (縦中横, “vertical-with-horizontal-inside”).
Source examples
昭和[#「12」は縦中横]年
Rendered HTML
<span class="aozora-tcy">12</span>
Downstream CSS gives the span text-combine-upright: all for proper
vertical-writing display.
Serialize output
Round-trips to [#「target」は縦中横].
AST shape
pub struct TateChuYoko<'src> {
pub text: NonEmpty<Content<'src>>,
}
When emitted
Phase 3 matches the directive [#「TARGET」は縦中横] and resolves
TARGET in preceding text, then emits with the matched span.
Diagnostics
aozora::lex::unclosed_bracket if [# is unmatched.
Related kinds
- Annotation — fallback if target resolution fails.
NodeKind::Gaiji
Wire tag: gaiji — out-of-character-set glyph reference. The
historical Aozora-Bunko notation for characters Shift_JIS could
not encode; modern files mostly use them for genuine non-Unicode
glyphs.
Source examples
※[#「木+吶のつくり」、第3水準1-85-54]
The ※ (U+203B) flags the construct; [#description、mencode]
carries the human description and a structured Mojikyō / JIS / U+
identifier.
Rendered HTML
<span class="aozora-gaiji" title="木+吶のつくり" data-mencode="第3水準1-85-54">〓</span>
The fallback glyph 〓 (U+3013, “geta mark”) is the conventional
Japanese typesetting placeholder for missing glyphs. When the
resolver finds a Unicode mapping the inner text becomes the
resolved character instead of the geta mark.
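A sketch of that fallback rule, with a hypothetical render_gaiji helper (the real renderer lives in aozora-render):

```rust
// Sketch of the fallback rule described above: emit the resolved
// character when the resolver found a Unicode mapping, otherwise the
// geta mark 〓 (U+3013). `render_gaiji` is a hypothetical helper.
fn gaiji_inner(resolved: Option<char>) -> String {
    match resolved {
        Some(c) => c.to_string(),
        None => '\u{3013}'.to_string(), // 〓 geta mark placeholder
    }
}

fn render_gaiji(description: &str, mencode: &str, resolved: Option<char>) -> String {
    format!(
        "<span class=\"aozora-gaiji\" title=\"{}\" data-mencode=\"{}\">{}</span>",
        description,
        mencode,
        gaiji_inner(resolved)
    )
}

fn main() {
    println!("{}", render_gaiji("木+吶のつくり", "第3水準1-85-54", None));
}
```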
Serialize output
Round-trips to ※[#description、mencode].
AST shape
pub struct Gaiji<'src> {
pub description: &'src str,
pub ucs: Option<Resolved>,
pub mencode: Option<&'src str>,
}
Resolved is either a single Unicode scalar or one of 25
predefined static combining sequences (e.g. か゚ — か + the IPA
voicing-pair-mark — kept as a static constant so the borrowed-AST
stays Copy).
When emitted
Phase 3 sees the ※[#…] digraph and parses the description /
mencode payload. The encoding crate’s gaiji resolver lifts the
mencode reference into a Unicode character when one exists.
Diagnostics
None on a well-formed ※[#...]. Ambiguous descriptions land as
Annotation::Unknown instead of Gaiji.
Related kinds
- Annotation — fallback when description is malformed.
NodeKind::Indent
Wire tag: indent — single-line [#N字下げ] indent marker.
Source examples
[#2字下げ]
[#3字下げ]もう一段下げる
Rendered HTML
<span class="aozora-indent" data-amount="2"></span>
CSS controls the actual padding (typically padding-inline-start: Nem).
Serialize output
Round-trips to [#N字下げ].
AST shape
pub struct Indent {
pub amount: u8,
}
When emitted
Phase 3 matches the digraph plus a numeric prefix and emits a
single inline marker. For paired indent regions ([#ここから2字下げ]
… [#ここで字下げ終わり]), see Container.
Diagnostics
None on well-formed input.
Related kinds
- Container — paired indent / dedent regions (ContainerKind::Indent).
- AlignEnd — right-edge alignment counterpart.
NodeKind::AlignEnd
Wire tag: alignEnd — right-edge alignment marker (字上げ).
Source examples
[#地付き]
[#地から3字上げ]
Rendered HTML
<span class="aozora-align-end" data-offset="0"></span>
offset is 0 for 地付き, N for 地から N 字上げ.
Serialize output
Round-trips to [#地付き] / [#地からN字上げ].
AST shape
pub struct AlignEnd {
pub offset: u8,
}
When emitted
Phase 3 matches the directive form. Paired alignment regions
([#ここから地から N 字上げ] … [#ここで字上げ終わり]) are
Container instead.
Diagnostics
None.
Related kinds
- Indent — left-edge indent counterpart.
NodeKind::Warichu
Wire tag: warichu — split-line annotation (割注). Two text runs
are stacked into a single line of the surrounding text.
Source examples
[#割り注]上の段/下の段[#割り注終わり]
Rendered HTML
<span class="aozora-warichu">
<span class="aozora-warichu-upper">上の段</span>
<span class="aozora-warichu-lower">下の段</span>
</span>
Serialize output
Round-trips to the explicit [#割り注].../...[#割り注終わり].
AST shape
pub struct Warichu<'src> {
pub upper: Content<'src>,
pub lower: Content<'src>,
}
upper / lower are plain Content;
empty halves are valid (one-sided warichu).
When emitted
The single-line [#割り注]...[#割り注終わり] form is
inline-classified; multi-line [#割注] containers become a
Container of kind Warichu.
Diagnostics
None on well-formed input.
Related kinds
- Container — multi-line counterpart.
NodeKind::Keigakomi
Wire tag: keigakomi — ruled-box annotation (罫囲み).
Source examples
[#罫囲み]本文[#罫囲み終わり]
Rendered HTML
<span class="aozora-keigakomi"></span>
(Inline marker; the multi-line container form yields a
<div class="aozora-container-keigakomi"> wrapper instead — see
Container.)
Serialize output
Round-trips to [#罫囲み]...[#罫囲み終わり].
AST shape
pub struct Keigakomi;
Marker struct with no payload — the surrounding text carries the content.
When emitted
Phase 3 sees the inline form. Multi-line keigakomi blocks classify
as Container Keigakomi.
Diagnostics
None on well-formed input.
Related kinds
- Container — multi-line counterpart.
NodeKind::PageBreak
Wire tag: pageBreak — [#改ページ] page break marker.
Source examples
end of chapter
[#改ページ]
beginning of next chapter
Rendered HTML
<div class="aozora-page-break"></div>
CSS gives the div a page-break-before: always for paged media
(EPUB / print).
Serialize output
Round-trips to [#改ページ]\n.
AST shape
AozoraNode::PageBreak is a unit variant — no payload.
When emitted
Phase 3 sees [#改ページ] and emits a single BlockLeaf
classification covering the whole bracket span.
Diagnostics
None on well-formed input.
Related kinds
- SectionBreak — the [#改丁] family.
NodeKind::SectionBreak
Wire tag: sectionBreak — section breaks (改丁 / 改段 / 改見開き).
Source examples
[#改丁]
[#改段]
[#改見開き]
Rendered HTML
<div class="aozora-section-break aozora-section-break-choho"></div>
The second class slot carries the variant slug (choho, dan,
spread, other).
Serialize output
Round-trips to [#改丁] etc.
AST shape
AozoraNode::SectionBreak(SectionKind)
SectionKind is Choho (改丁) / Dan (改段) / Spread (改見開き).
When emitted
Phase 3 matches each directive; the kind enum captures which.
Diagnostics
None on well-formed input.
Related kinds
- PageBreak — finer-grained [#改ページ] variant.
NodeKind::AozoraHeading
Wire tag: heading — Aozora 見出し (window / sub heading).
Source examples
[#見出し]序章[#見出し終わり]
Rendered HTML
<h2 class="aozora-heading aozora-heading-window">序章</h2>
The Pandoc projection uses level 2 for Window, level 3 for Sub.
Serialize output
Round-trips to [#<kind>見出し]...[#<kind>見出し終わり].
AST shape
pub struct AozoraHeading<'src> {
pub kind: AozoraHeadingKind,
pub text: NonEmpty<Content<'src>>,
}
AozoraHeadingKind is Window (窓見出し) or Sub (副見出し).
When emitted
Phase 3 matches the keyword 見出し family and binds the body run.
Diagnostics
None on well-formed input.
Related kinds
- HeadingHint — forward-reference style heading hint.
NodeKind::HeadingHint
Wire tag: headingHint — forward-reference heading hint
([#「target」は中見出し]).
Source examples
序章
[#「序章」は中見出し]
The hint refers to a quoted target string in the preceding line(s); downstream renderers pick this up as “promote the matched run to a heading.”
Rendered HTML
The marker itself emits no visible content; renderers that honour
the hint elevate the previously-matched span to a <h2> /
<h3> retroactively. The default HTML renderer in aozora-render
emits a structural marker comment.
Serialize output
Round-trips to [#「target」は<level>見出し].
AST shape
pub struct HeadingHint<'src> {
pub level: u8,
pub target: NonEmptyStr<'src>,
}
level follows the Aozora convention: 1=大見出し, 2=中見出し,
3=小見出し.
When emitted
Phase 3 matches the directive and records the level + target. Empty target is rejected and falls through to plain text.
Diagnostics
None on well-formed input.
Related kinds
- AozoraHeading — direct heading-marker variant.
NodeKind::Sashie
Wire tag: sashie — illustration reference (挿絵).
Source examples
[#挿絵(cover.png)入る]
[#挿絵(pages/03.jpg、第3章扉絵)入る]
Rendered HTML
<figure class="aozora-sashie">
<img src="cover.png" alt="">
</figure>
When a caption is present it lands as a <figcaption> next to the
<img>.
Serialize output
Round-trips to [#挿絵(path[、caption])入る].
AST shape
pub struct Sashie<'src> {
pub file: NonEmptyStr<'src>,
pub caption: Option<Content<'src>>,
}
Empty file is rejected upstream — the construct cannot ship a
nameless image.
When emitted
Phase 3 matches the 挿絵(…)入る digraph and parses out the path
plus optional caption.
Diagnostics
None on well-formed input.
Related kinds
- Annotation — fallback when the directive is malformed.
NodeKind::Kaeriten
Wire tag: kaeriten — kanbun reading-order marker (返り点).
Source examples
読[#返り点 一・二]本
Rendered HTML
<sup class="aozora-kaeriten" data-mark="一・二"></sup>
CSS positions the sup glyph appropriately for vertical / horizontal writing mode.
Serialize output
Round-trips to [#返り点 mark].
AST shape
pub struct Kaeriten<'src> {
pub mark: NonEmptyStr<'src>,
}
When emitted
Phase 3 matches 返り点 keyword + marker payload. Empty marker
rejected upstream.
Diagnostics
None on well-formed input.
Related kinds
None.
NodeKind::Annotation
Wire tag: annotation — generic [#...] annotation that no
specific recogniser claimed.
Source examples
text[#任意のメモ]more
text[#ふりがな付きの説明]more
Rendered HTML
<span class="aozora-annotation" title="..."></span>
The default renderer suppresses the body; downstream filters can
match on aozora-annotation to surface the comment.
Serialize output
Round-trips to [#<raw>].
AST shape
pub struct Annotation<'src> {
pub raw: NonEmptyStr<'src>,
pub kind: AnnotationKind,
}
AnnotationKind discriminates the recognised sub-variants
(Unknown, AsIs, TextualNote, InvalidRubySpan, …); raw
carries the raw bracket body for any further analysis.
When emitted
Phase 3 reaches [#...] after no specific recogniser matched.
Annotation is the fallback that always preserves the user’s
content rather than dropping it.
Diagnostics
None — Annotation is the recovery path for unrecognised
directives. A genuine invalid-bracket diagnostic
(unclosed_bracket / unmatched_close) appears separately.
Related kinds
None.
NodeKind::DoubleRuby
Wire tag: doubleRuby — double-bracket bouten (《《重要》》).
Source examples
《《重要》》
Rendered HTML
<em class="aozora-double-ruby">重要</em>
CSS typically sets font-weight: bold or attaches sidelines for
this construct; the default class hand-off lets stylesheets pick
the visual.
Serialize output
Round-trips to 《《content》》.
AST shape
pub struct DoubleRuby<'src> {
pub content: NonEmpty<Content<'src>>,
}
content is NonEmpty — empty 《《》》 is rejected upstream and
falls through to plain text rather than producing an empty node.
When emitted
Phase 3 sees 《《 as a single tokenised opener (not two 《); the
classifier matches 《《...》》 as a single pair and emits the
node.
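The longest-match behaviour at 《 can be sketched as a one-character peek (illustrative only, not the real tokeniser):

```rust
// Sketch of the longest-match rule at 《: peek one more char and emit
// a double opener when it is also 《. Illustrative tokeniser fragment.
fn openers(text: &str) -> Vec<&'static str> {
    let mut out = Vec::new();
    let mut chars = text.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '《' {
            if chars.peek() == Some(&'《') {
                chars.next(); // consume the second 《 -- one token, not two
                out.push("double-open");
            } else {
                out.push("open");
            }
        }
    }
    out
}

fn main() {
    println!("{:?}", openers("《《重要》》と青梅《おうめ》"));
}
```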
Diagnostics
unclosed_bracket for 《《 without 》》.
Related kinds
- Ruby — single-bracket variant.
NodeKind::Container
Wire tag: container — paired-container wrapping
([#ここから...]...[#ここで...終わり]).
Source examples
[#ここから2字下げ]
第一節
第二節
[#ここで字下げ終わり]
[#罫囲み]
本文
[#罫囲み終わり]
[#地から3字上げ]
寄付者一覧
[#字上げ終わり]
Rendered HTML
<div class="aozora-container-indent" data-amount="2">
...
</div>
The wrapping div carries the kind-specific class
(aozora-container-indent, aozora-container-warichu,
aozora-container-keigakomi, aozora-container-align-end) plus
any structural data (indent amount, align offset) on data-*.
Serialize output
Round-trips to the explicit-paired directive form.
AST shape
pub struct Container {
pub kind: ContainerKind,
}
pub enum ContainerKind {
Indent { amount: u8 },
Warichu,
Keigakomi,
AlignEnd { offset: u8 },
}
The Container payload wraps the content — the walker driver fires
visit_container_open on enter and visit_container_close on exit so
renderers wrap the body cleanly.
When emitted
Phase 2 pairs the [#ここから…] / [#ここで…終わり] openers
and closers; Phase 3’s BlockOpen / BlockClose events project to
this variant.
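The pairing itself is a plain stack discipline. A sketch with kinds as strings and events as (kind, is_open) tuples (the real phase 2 works on lexed tokens and emits BlockOpen / BlockClose sentinels):

```rust
// Stack-based open/close pairing sketch: each opener index is pushed,
// each closer pops its partner. A close with an empty stack is the
// unmatched_close case, reported here as Err(index).
fn pair_containers(events: &[(&str, bool)]) -> Result<Vec<(usize, usize)>, usize> {
    let mut stack: Vec<usize> = Vec::new();
    let mut pairs = Vec::new();
    for (i, &(_kind, is_open)) in events.iter().enumerate() {
        if is_open {
            stack.push(i);
        } else {
            let open = stack.pop().ok_or(i)?; // unmatched close
            pairs.push((open, i));
        }
    }
    // Leftover opens would surface as unclosed_bracket; ignored here.
    Ok(pairs)
}

fn main() {
    let events = [
        ("2字下げ", true),  // [#ここから2字下げ]
        ("罫囲み", true),   // [#罫囲み]
        ("罫囲み", false),  // [#罫囲み終わり]
        ("字下げ", false),  // [#ここで字下げ終わり]
    ];
    println!("{:?}", pair_containers(&events));
}
```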
Diagnostics
unclosed_bracket for unbalanced opens.
Related kinds
- ContainerOpen — NodeRef projection of the open boundary.
- ContainerClose — NodeRef projection of the close boundary.
- Indent, AlignEnd, Warichu, Keigakomi — single-line counterparts.
NodeKind::ContainerOpen
Wire tag: containerOpen — paired-container open boundary marker.
This variant only appears in NodeRef-flavoured wire output (e.g.
serialize_nodes); the structural AozoraNode::Container
payload covers the wrapping construct itself.
Source examples
[#ここから2字下げ] <- ContainerOpen
indented body
[#ここで字下げ終わり] <- ContainerClose
Rendered HTML
The default HTML renderer routes the open / close pair through
visit_container_open / visit_container_close and emits the
opening <div class="aozora-container-..."> wrapping the body.
Serialize output
Round-trips together with the matching close to the
[#ここから…]...[#ここで…終わり] form.
AST shape
NodeRef::BlockOpen(ContainerKind) — see
ContainerKind.
When emitted
Phase 2 pairs the open / close brackets; Phase 3’s normalised text
emits a BlockOpen PUA sentinel at the position of the opener so
the registry can dispatch the open event during walking.
Diagnostics
unclosed_bracket if the open never finds a matching close.
Related kinds
- ContainerClose — paired close-side counterpart.
- Container — the structural payload variant.
NodeKind::ContainerClose
Wire tag: containerClose — paired-container close boundary marker.
NodeRef-only counterpart of ContainerOpen.
Source examples
[#ここから2字下げ] <- ContainerOpen
body
[#ここで字下げ終わり] <- ContainerClose
Rendered HTML
Routed through visit_container_close; the default renderer emits
the closing </div> of the
<div class="aozora-container-..."> opened by the matching
ContainerOpen.
Serialize output
Round-trips with the matching open.
AST shape
NodeRef::BlockClose(ContainerKind).
When emitted
Phase 3 normalised-text emits a BlockClose PUA sentinel at the
matching close position.
Diagnostics
unmatched_close if the close has no open partner — in which case
no ContainerClose is emitted and the close-bracket bytes flow
through as plain.
Related kinds
- ContainerOpen — open-side counterpart.
- Container — structural payload.
Notation overview
青空文庫記法 is a small, line-oriented annotation language layered inside a plain-text Japanese document. Authors mark up the text in two distinct registers:
- Inline markers — single-character sigils (|, 《, 》, ※) that fence inline annotations directly inside the prose.
- Block annotations — [#…] brackets containing a Japanese directive in natural language (“ここから2字下げ”, “「X」に傍点”, …) that act as openers, closers, or self-contained directives.
aozora recognises every annotation that survives in real Aozora Bunko sources — the volunteer corpus has ~17 000 works in active rotation, and the parser is exercised against the entire archive in CI as part of the corpus sweep.
Notations covered
| Chapter | What it marks |
|---|---|
| Ruby | Pronunciation glosses (|青梅《おうめ》, 青梅《おうめ》). |
| Bouten / bousen | Emphasis dots and lines: 傍点 (sesame, white sesame, filled circle, open circle, …) and 傍線 (single, double, dashed, …). |
| 縦中横 | Horizontally-set runs inside vertical text ([#「数字」は縦中横]). |
| Gaiji | Out-of-Shift_JIS character references (※[#…、第3水準1-85-54]) and accented-Latin decomposition. |
| Kunten | 漢文 reading marks: 返り点 (レ, 一, 二, 上, 中, 下), 再読文字, 送り仮名. |
| Indent containers | [#ここから2字下げ]… [#ここで字下げ終わり] and the geji / 地付き / 地寄せ family. |
| Page & section breaks | 改ページ, 改丁, 改見開き, 改段. |
| Diagnostics | The catalogue of structured diagnostics the parser emits. |
Spec source of truth
The authoritative spec lives at
https://www.aozora.gr.jp/annotation/index.html. A snapshot is
vendored at docs/specs/aozora/
in the repo so that every page in this handbook can link to a stable
fragment (the upstream HTML reorganises occasionally; the snapshot
shields chapter cross-references from rot).
When this handbook says “the spec says X”, that means that snapshot. Where the live spec drifts, we update the snapshot, then update the parser, then update this handbook — in that order.
How a sample input looks
|青梅《おうめ》街道を歩いて、※[#「魚+師のつくり」、第3水準1-94-37]を見た。
[#ここから2字下げ]
[#「平和」に傍点]という言葉は、もう古い。
[#ここで字下げ終わり]
[#改ページ]
That single sample exercises ruby, gaiji, indent containers, bouten, and a page break. The parser turns it into a flat node stream — see the per-chapter pages for the exact AST shapes.
Notation we deliberately omit
Aozora Bunko’s spec mentions a handful of annotations that don’t appear in the maintained corpus:
- Image references beyond [#挿絵] — covered up to the caption, no actual image rendering.
- キャプション alignment edge cases that the spec lists but no active work uses (verified against the corpus sweep).
These are recognised as Container::Unknown with a
W0010 advisory diagnostic. Adding full
support is a one-PR job once a real corpus document needs it.
Ruby (|青梅《おうめ》)
Ruby is a pronunciation gloss attached to a run of base text. In 青空文庫 source it appears in two shapes:
|青梅《おうめ》 ← explicit-base form
青梅《おうめ》 ← implicit-base form (auto-detect)
Both forms render the same HTML:
<ruby>青梅<rt>おうめ</rt></ruby>
Explicit base (|…《…》)
The full-width vertical bar | (U+FF5C) marks the start of the
base text; 《…》 (U+300A / U+300B) wraps the reading. The base
runs from | to the 《. Use this form when:
- The base contains characters that the auto-detect heuristic would otherwise skip (kana, ASCII letters, mixed scripts).
- The boundary between base and surrounding text is ambiguous.
|山田《やまだ》さん → <ruby>山田<rt>やまだ</rt></ruby>さん
|HTTP《ハイパー・テキスト》 → <ruby>HTTP<rt>ハイパー・テキスト</rt></ruby>
Implicit base
When 《…》 follows a run of kanji without a leading |, the
parser auto-detects the base by scanning backwards through the kanji
run. The auto-detect terminates at the first non-kanji character
(kana, punctuation, ASCII, full-width digit).
青梅《おうめ》 → <ruby>青梅<rt>おうめ</rt></ruby>
お青梅《おうめ》 → お<ruby>青梅<rt>おうめ</rt></ruby>
The “kanji” predicate is CJK Unified Ideographs + CJK Compatibility Ideographs + CJK Unified Ideographs Extension A–F, plus the iteration mark 々. JIS X 0213 plane-2 ideographs not in Unicode are represented as gaiji references (see Gaiji) and likewise terminate the auto-detect.
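The backward scan can be sketched in a few lines. This is an illustrative reconstruction, not the crate’s real API — the function names are hypothetical, the Extension B–F check is approximated as a single range, and the real predicate also handles the gaiji PUA sentinels:

```rust
/// Hypothetical sketch of the "kanji" predicate described above.
fn is_base_kanji(c: char) -> bool {
    matches!(c,
        '々'                          // iteration mark
        | '\u{3400}'..='\u{4DBF}'   // CJK Extension A
        | '\u{4E00}'..='\u{9FFF}'   // CJK Unified Ideographs
        | '\u{F900}'..='\u{FAFF}'   // CJK Compatibility Ideographs
        | '\u{20000}'..='\u{2FA1F}' // Extensions B–F, approximated as one range
    )
}

/// Scan backwards from the byte offset of 《 and return the implicit base.
fn implicit_base(text: &str, open: usize) -> &str {
    let start = text[..open]
        .char_indices()
        .rev()
        .take_while(|&(_, c)| is_base_kanji(c))
        .last()
        .map(|(i, _)| i)
        .unwrap_or(open); // no kanji run → empty base → literal 《…》
    &text[start..open]
}

fn main() {
    // お is not kanji, so the scan stops and the base is 青梅.
    assert_eq!(implicit_base("お青梅《おうめ》", 9), "青梅");
}
```

The empty-base case (《おうめ》 with no preceding kanji) falls out naturally: the scan finds nothing and the slice is empty, matching the literal-text behaviour in the edge-case table below.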
Empty reading
|青梅《》 is a parse error. The lexer emits diagnostic
E0001 (“ruby reading mismatch: target spans
N chars but |《》 reading is empty”) and the node is dropped
from the AST.
The implicit-base form silently skips a 《》 with empty contents —
that combination cannot have arisen from valid markup, so the parser
treats the bare 《》 as literal text.
Nested ruby (forbidden)
The spec disallows ruby inside ruby. Sources with |青梅《|お《お》うめ》
are rejected with diagnostic E0002.
AST shape
pub struct Ruby<'src> {
pub target: &'src str, // borrowed from source
pub reading: &'src str, // borrowed from source
pub span: Span, // byte range in the source
pub explicit_base: bool, // true if the input used the |…《…》 form
}
Both target and reading are &str slices into the
Document-owned source — no allocation, no copy. Re-emitting
canonical form is exactly:
match (ruby.explicit_base, ruby.target, ruby.reading) {
(true, t, r) => format!("|{t}《{r}》"),
(false, t, r) => format!("{t}《{r}》"),
}
Edge cases
| Input | Output |
|---|---|
| 青梅《おうめ》 | <ruby>青梅<rt>おうめ</rt></ruby> |
| \|青梅《おうめ》 | <ruby>青梅<rt>おうめ</rt></ruby> (canonical-equivalent) |
| \|山田《やまだ》 | <ruby>山田<rt>やまだ</rt></ruby> |
| \|HTTP《ハイパー・テキスト》 | <ruby>HTTP<rt>ハイパー・テキスト</rt></ruby> |
| お青梅《おうめ》 | お<ruby>青梅<rt>おうめ</rt></ruby> (auto-detect skips kana) |
| 1青梅《おうめ》 | 1<ruby>青梅<rt>おうめ</rt></ruby> (auto-detect skips digit) |
| \|青梅《》 | parse error E0001 |
| 《おうめ》 | literal text (no preceding kanji to anchor) |
| \|青梅《\|お《お》うめ》 | parse error E0002 |
See also
- Bouten / bousen — emphasis annotations that share the 「X」に… indirection idiom.
- Architecture → Seven-phase lexer — where ruby recognition fits in the classifier pipeline.
Bouten / bousen (傍点・傍線)
Bouten (傍点) are emphasis dots placed beside characters in vertical text — the Japanese typographic equivalent of italic or bold. Bousen (傍線) are the same idea with a line instead of dots. The spec recognises eleven dot variants and six line variants; aozora accepts every one.
Notation forms
Two indirection styles, both common in the real corpus:
[#「平和」に傍点] ← target-by-quoting
平和[#「平和」に傍点] ← redundant explicit copy (also accepted)
[#ここから傍点]平和[#ここで傍点終わり] ← container form
The target-by-quoting form is by far the most common: the inline annotation looks backwards in the text for the most recent occurrence of the quoted string and applies the bouten to that run.
Variant catalogue
| Slug | Source kanji | Renders as |
|---|---|---|
| sesame | 傍点 | small black sesame ﹅ |
| white_sesame | 白ゴマ傍点 | small white sesame ﹆ |
| circle | 丸傍点 | filled circle ● |
| white_circle | 白丸傍点 | open circle ○ |
| dot | 黒点傍点 | bold black dot |
| triangle | 三角傍点 | filled triangle |
| white_triangle | 白三角傍点 | open triangle |
| bullseye | 二重丸傍点 | bullseye |
| kotenten | コ点傍点 | small katakana ko-mark |
| kotenten_white | 白コ点傍点 | white ko-mark |
| linear | 線傍点 | dotted underline |
| single_line | 傍線 | single line |
| double_line | 二重傍線 | double line |
| dashed_line | 鎖線 | dashed line |
| wavy_line | 波線 | wavy line |
| chained_line | 二重鎖線 | double dashed line |
| under_dotted | 下線 | dotted underline |
Each variant has a stable BoutenKind::slug() that the HTML renderer
emits as a class name (e.g. <em class="aozora-bouten-sesame">). See
Architecture → HTML renderer for the full
class-name scheme.
Default rendering
aozora emits <em class="aozora-bouten-<slug>">…</em> so that an
external stylesheet can pick the visual treatment per variant.
Default CSS hooks live at the consumer side; the parser ships no
stylesheet of its own.
<!-- 平和[#「平和」に傍点] -->
平和<em class="aozora-bouten-sesame">平和</em>
(The redundant copy is intentional — the [#…] indirection
re-emits the target wrapped in <em>, leaving the original run
in place. The HTML rendering matches what print Aozora Bunko output
does in practice.)
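That re-emit step is simple enough to sketch. The function name and signature below are illustrative, not the crate’s real renderer API; only the class-name scheme comes from the text above:

```rust
// Sketch of the indirect-bouten render step: the quoted target already
// flowed through as plain text, so the directive appends a wrapped copy.
fn render_bouten_indirect(out: &mut String, target: &str, slug: &str) {
    out.push_str("<em class=\"aozora-bouten-");
    out.push_str(slug);
    out.push_str("\">");
    out.push_str(target);
    out.push_str("</em>");
}

fn main() {
    // 平和[#「平和」に傍点] — the original run stays in place.
    let mut out = String::from("平和");
    render_bouten_indirect(&mut out, "平和", "sesame");
    assert_eq!(out, "平和<em class=\"aozora-bouten-sesame\">平和</em>");
}
```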
Container form
For runs that span multiple lines or include other annotations, use the container form:
[#ここから傍点]
平和は手の届かないものだった。
そして、戦争もまた。
[#ここで傍点終わり]
Renders as:
<em class="aozora-bouten-sesame">
平和は手の届かないものだった。
そして、戦争もまた。
</em>
The opening directive can be any of the variant openers (ここから二重傍線,
ここから波線, …); the matching closer must use the same family
(ここで傍線終わり for any 線 variant, ここで傍点終わり for any 点
variant). Mismatched closers fire diagnostic
E0004.
AST shape
pub struct Bouten<'src> {
pub target: &'src str, // the run wrapped in emphasis
pub kind: BoutenKind, // one of 17 variants
pub form: BoutenForm, // Indirect | Inline | Container
pub span: Span,
}
BoutenKind is a flat enum with slug accessors; see the
rustdoc for the exact variant list.
See also
- Notation overview — how this fits with the other inline annotations.
- Diagnostics catalogue — E0004, W0003.
縦中横 (tate-chū-yoko)
縦中横 (tate-chū-yoko, “horizontal in vertical”) is a typographic construct that lays a short run — usually digits, Latin letters, or mixed punctuation — horizontally inside otherwise vertical text. In print, it is the common treatment for two- or three-digit numbers in a vertical paragraph.
Notation
The annotation always uses the indirect-quoting form:
昭和27年生まれ[#「27」は縦中横]
Renders as:
昭和<span class="aozora-tcy">27</span>年生まれ
The [#…] directive looks back through the most recent text and
applies the tcy treatment to the most recent occurrence of the
quoted run. The target text is not re-emitted — the wrapper is
applied in place, unlike bouten.
Container form
For longer mixed-orientation runs (multi-line table data, Latin abbreviations spanning a paragraph), the container form sits inside an outer indent block:
[#ここから縦中横]
27 / 100 = 0.27
[#ここで縦中横終わり]
Renders as:
<div class="aozora-tcy-block">
27 / 100 = 0.27
</div>
Common targets
| Source | Output |
|---|---|
| 27[#「27」は縦中横] | <span class="aozora-tcy">27</span> |
| 100%[#「100」は縦中横] | <span class="aozora-tcy">100</span>% |
| A4[#「A4」は縦中横] | <span class="aozora-tcy">A4</span> |
| &[#「&」は縦中横] | <span class="aozora-tcy">&</span> |
(HTML escapes are handled by the renderer, not the AST.)
Anchor lookup
The lookup that finds the target run:
- Scans backwards from the [#…] directive through the current line.
- Stops at the first match for the quoted run.
- Falls through to the previous line if no match (with an upper bound of 64 KiB or one paragraph break, whichever comes first).
If no match is found, diagnostic W0001
fires and the directive is dropped from the output. Authors get the
same look-back semantics they’d get from bouten — see
Bouten for the symmetric case.
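The look-back rule can be sketched as a bounded backward substring search. Names and the trigger representation here are hypothetical — a real implementation must also clamp the window floor to a char boundary, which this sketch elides:

```rust
// Illustrative look-back anchor search: most recent occurrence of the
// quoted run, bounded by a paragraph break or 64 KiB, whichever is nearer.
fn find_anchor(text: &str, directive_at: usize, needle: &str) -> Option<usize> {
    let floor = directive_at.saturating_sub(64 * 1024);
    let window_start = text[floor..directive_at]
        .rfind("\n\n")                 // stop at a paragraph break
        .map(|i| floor + i + 2)
        .unwrap_or(floor);
    text[window_start..directive_at]
        .rfind(needle)                 // most recent match wins
        .map(|i| window_start + i)
}

fn main() {
    // 「27」 starts at byte 6 of 昭和27年生まれ (3 bytes per kanji).
    assert_eq!(find_anchor("昭和27年生まれ", 20, "27"), Some(6));
}
```

A miss returns None, which is exactly the W0001 case described above.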
Why a span, not a flow rotation?
Web renderers reach for writing-mode: horizontal-tb inside a
writing-mode: vertical-rl parent, but that has poor browser support
and breaks line-break propagation. aozora’s HTML output uses a
single class hook (<span class="aozora-tcy">) so the consuming
stylesheet can decide:
- print stylesheet → font-feature-settings: "vert"; text-combine-upright: all;
- screen stylesheet → leave horizontal, set monospace
- e-book renderer → use the renderer’s native tcy primitive
Pushing this decision into the HTML output (e.g. emitting an inline SVG with rotated glyphs) would lock consumers into a specific typographic model. The class-hook output keeps the HTML semantic and defers presentation to the consumer.
AST shape
pub struct Tcy<'src> {
pub text: &'src str,
pub form: TcyForm, // Inline | Container
pub span: Span,
}
See also
- Indent containers — tcy commonly appears inside 字下げ blocks; the parser applies tcy after the indent fence is established so the look-back search is bounded by the inner block.
Gaiji (外字 references)
Aozora Bunko predates ubiquitous Unicode support; many works still ship as Shift_JIS source. Characters that don’t fit in Shift_JIS — JIS X 0213 plane-2 ideographs, accented Latin letters, ad-hoc combining marks — appear in source as gaiji references:
※[#「魚+師のつくり」、第3水準1-94-37]
※[#「彳+寺」、U+5F85、393-13]
※[#濁点付き片仮名ヰ]
The leading ※ (U+203B, reference mark) opens the annotation; the
[#…] body describes the character in three orthogonal ways:
- A descriptive name in Japanese (「魚+師のつくり」 — “魚 plus the right-hand side of 師”) for human readers.
- A JIS X 0213 plane / row / cell triple (第3水準1-94-37 — plane 1, row 94, cell 37).
- A Unicode codepoint (U+5F85) when the character has one.
aozora resolves gaiji references through a compile-time PHF lookup table built from the JIS X 0213 official mapping plus the Unicode UCS register, with the descriptive name as a tertiary fallback.
Why a compile-time table?
The gaiji table has ~14 000 entries. Loading it at runtime from a JSON / TOML asset would:
- Add a startup cost on every Document::new (the parser is supposed to start reading bytes within microseconds).
- Force every binding (CLI, WASM, FFI, Python wheel) to ship the table as a separate asset, complicating distribution.
- Defeat dead-code elimination — the linker can’t strip entries the consumer’s input never references if they’re loaded behind an opaque file read.
A phf::Map baked into the binary at compile time wins on every
axis: zero-allocation lookup, single-binary distribution, full
DCE and LTO visibility. The build cost is real (~40 s the first
time, ~0 s incremental) but happens once per workspace build, not
per-invocation.
We chose phf over a static HashMap (which would require runtime
construction in a OnceLock): phf produces a true compile-time
perfect-hash table — O(1) lookup with no first-call cost and no
synchronisation on the hot path.
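To show the shape without pulling in the phf dependency, here is a dependency-free stand-in: a sorted static array with binary search has the same “no runtime construction, no first-call cost” property, just O(log n) instead of O(1). The two entries are standard JIS X 0208 row/cell assignments (plane 1 of JIS X 0213 is a superset); everything else is illustrative:

```rust
// Stand-in for the compile-time gaiji table (the real crate bakes a
// phf::Map at build time). Keys are (plane, row, cell), kept sorted.
static JIS_TO_UNICODE: &[((u8, u8, u8), char)] = &[
    ((1, 16, 1), '亜'),
    ((1, 16, 2), '唖'),
];

fn lookup_jis(plane: u8, row: u8, cell: u8) -> Option<char> {
    JIS_TO_UNICODE
        .binary_search_by_key(&(plane, row, cell), |&(key, _)| key)
        .ok()
        .map(|i| JIS_TO_UNICODE[i].1)
}

fn main() {
    assert_eq!(lookup_jis(1, 16, 1), Some('亜'));
    assert_eq!(lookup_jis(2, 1, 1), None); // miss → W0006 path
}
```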
Resolution order
For a reference like ※[#「魚+師のつくり」、第3水準1-94-37]:
- Unicode codepoint if the source explicitly provided one (U+XXXX) — used directly.
- JIS X 0213 plane-row-cell lookup (第N水準P-R-C) — most ideographs land here.
- Descriptive name — the parser ships a curated mapping for the ~120 characters that have no JIS / Unicode codepoint at all.

Misses fire diagnostic W0006 and the gaiji is rendered as the descriptive text in <span> brackets.
AST shape
pub struct Gaiji<'src> {
pub description: &'src str, // 「魚+師のつくり」
pub jis: Option<JisCode>, // (plane, row, cell)
pub unicode: Option<char>, // resolved codepoint
pub resolution: GaijiResolution, // Direct | Lookup | Fallback
pub span: Span,
}
pub enum GaijiResolution {
/// The source provided U+XXXX directly.
Direct,
/// Resolved via JIS table.
Lookup,
/// Could not resolve; rendered as descriptive text.
Fallback,
}
Render output
| Resolution | HTML |
|---|---|
| Direct / Lookup | the resolved codepoint inline, with a data-aozora-gaiji-jis="1-94-37" attribute for downstream analysis tools. |
| Fallback | <span class="aozora-gaiji-fallback" title="魚+師のつくり">[魚+師のつくり]</span> |
Accent decomposition
Aozora Bunko also encodes accented Latin letters (è, ñ, ä) using a
separate notation that does not go through ※[#…]:
M[i!]cher ← in some sources
me-zin ← in others
The full table is at https://www.aozora.gr.jp/accent_separation.html — 114 ASCII digraphs / ligatures mapping to Unicode. aozora applies this decomposition during the lexer’s Phase 0 (sanitize), so by the time classification runs the source is pure Unicode. See Architecture → Seven-phase lexer for the phase ordering.
See also
- Architecture → Shift_JIS + 外字 resolver — the encoding pipeline and the PHF table internals.
- Diagnostics → W0006 — unresolved gaiji reference.
Kunten / kaeriten (訓点・返り点)
Kunten are the marginal annotations Japanese readers add to classical Chinese (漢文) source so that it can be read in Japanese word order. The two categories aozora handles:
- Kaeriten (返り点) — reading-order marks inserted between characters: レ, 一, 二, 三, 上, 中, 下, 甲, 乙, 天, 地, 人.
- Saidoku-moji (再読文字) — characters that are read twice with different glosses (e.g. 未, 将, 当).
A handful of late-Edo / Meiji Aozora Bunko works carry these. The notation:
有﹅レ朋﹅自﹅遠﹅方﹅来
…where ﹅ stands in for the actual kaeriten character. In real
source the marks are interleaved between characters using either the
direct character or a [#…] annotation:
有[#二]朋自遠方来[#一]
Notation forms
Inline (preferred in modern works)
The kaeriten character is inserted directly between the source characters:
有レ朋自遠方来
Renders as:
有<span class="aozora-kaeriten" data-aozora-kaeriten="レ">レ</span>朋自遠方来
Bracketed (older works)
有[#二]朋自遠方来[#一]
Renders as:
有<span class="aozora-kaeriten" data-aozora-kaeriten="二">二</span>朋自遠方来<span class="aozora-kaeriten" data-aozora-kaeriten="一">一</span>
The bracketed form is useful when the kaeriten character would
otherwise be ambiguous with the surrounding text (e.g. a real 一
that is not a reading mark).
Saidoku-moji
未[#「未」に二の字点]
The 二の字点 / 一二点 prefix tells the renderer that the preceding
character is read twice. aozora emits a data-aozora-saidoku data
attribute on the wrapper.
AST shape
pub struct Kaeriten<'src> {
pub mark: KaeritenKind, // Re | Ichi | Ni | San | Jou | Chuu | Ge | Kou | Otsu | Ten | Chi | Jin
pub form: KaeritenForm, // Inline | Bracketed
pub span: Span,
}
pub struct Saidoku<'src> {
pub target: &'src str, // the character being re-read
pub gloss: &'src str, // the second reading
pub span: Span,
}
Why a flat enum, not just &str?
The 12 kaeriten kinds form a closed set fixed by the spec — there
will never be a 13th. A KaeritenKind enum lets the renderer match
exhaustively (the compiler catches unhandled variants), and pins the
data-aozora-kaeriten attribute value to a stable slug rather than
the literal source character. That matters because the inline form
uses the actual 一 / 二 / 上 / … glyphs, which are also valid
plain text — the enum lets the AST distinguish “a 一 that’s a
kaeriten” from “the digit one in the running text”.
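A minimal sketch of that distinction — the variant names follow the AST shape above, but the slug strings and the from_char classifier are illustrative, not the crate’s real API:

```rust
// Closed set of kaeriten marks with stable slugs for the data attribute.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum KaeritenKind { Re, Ichi, Ni, San, Jou, Chuu, Ge, Kou, Otsu, Ten, Chi, Jin }

impl KaeritenKind {
    /// Classify a glyph as a kaeriten mark; the caller decides (from
    /// context) whether a 一 is a mark or the digit one in running text.
    fn from_char(c: char) -> Option<Self> {
        Some(match c {
            'レ' => Self::Re, '一' => Self::Ichi, '二' => Self::Ni, '三' => Self::San,
            '上' => Self::Jou, '中' => Self::Chuu, '下' => Self::Ge,
            '甲' => Self::Kou, '乙' => Self::Otsu,
            '天' => Self::Ten, '地' => Self::Chi, '人' => Self::Jin,
            _ => return None,
        })
    }

    /// Stable slug for the data-aozora-kaeriten attribute (illustrative).
    fn slug(self) -> &'static str {
        match self {
            Self::Re => "re", Self::Ichi => "ichi", Self::Ni => "ni", Self::San => "san",
            Self::Jou => "jou", Self::Chuu => "chuu", Self::Ge => "ge",
            Self::Kou => "kou", Self::Otsu => "otsu",
            Self::Ten => "ten", Self::Chi => "chi", Self::Jin => "jin",
        }
    }
}

fn main() {
    assert_eq!(KaeritenKind::from_char('レ'), Some(KaeritenKind::Re));
    assert_eq!(KaeritenKind::from_char('一').map(KaeritenKind::slug), Some("ichi"));
    assert_eq!(KaeritenKind::from_char('あ'), None);
}
```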
Diagnostics
| Code | Condition |
|---|---|
| W0007 | Kaeriten outside a 漢文-like context (lookahead heuristic) |
| E0009 | Bracketed kaeriten with no matching pair |
See also
- Notation overview — the orientation map for all the inline annotations.
Indent & align containers (字下げ)
Aozora Bunko uses paired [#ここから…] / [#ここで…終わり]
brackets to delimit blocks of text with custom layout. The five
families:
| Family | Opener | Closer | Effect |
|---|---|---|---|
| 字下げ (indent) | [#ここから2字下げ] | [#ここで字下げ終わり] | Indent every line by N full-width chars |
| 地付き (right-flush) | [#ここから地付き] | [#ここで地付き終わり] | Flush right (vertical: 地 = ground = bottom) |
| 地寄せ (right-align with margin) | [#ここから2字下げ、地寄せ] | [#ここで字下げ終わり] | Right-align with N-char inset |
| 字詰め (line-length) | [#ここから30字詰め] | [#ここで字詰め終わり] | Force a line length of N chars |
| 中央揃え | [#ここから中央揃え] | [#ここで中央揃え終わり] | Centre each line |
aozora parses every variant; the HTML renderer maps them to a
<div class="aozora-indent-N"> / aozora-align-end / etc. wrapper.
Single-line forms
Some directives apply only to the next single line and don’t need a closer:
[#地付き]平和への誓い
Renders as:
<div class="aozora-align-end">平和への誓い</div>
AST shape
pub struct Container<'src> {
pub kind: ContainerKind,
pub indent: Option<u8>, // 字 count for indent variants
pub form: ContainerForm, // SingleLine | Block
pub children: &'src [AozoraNode<'src>],
pub span: Span,
}
pub enum ContainerKind {
Indent,
AlignEnd,
AlignEndWithIndent,
LineLength,
Centre,
/// Composite: indent + align-end on a single block.
Composite { indent: u8, align: ContainerAlign },
/// Bouten / 縦中横 / 鎖線 / 罫囲み container forms.
Emphasis(EmphasisKind),
/// Spec-listed but not present in maintained corpus.
Unknown,
}
Why a small flat enum?
ContainerKind is closed by spec. A flat enum (vs a trait object
or string tag) gives the parser O(1) variant dispatch in the lexer’s
classify phase and the renderer’s HTML walk, and lets clippy’s
exhaustiveness check enforce that every variant has a render path.
The Composite variant is the one place we don’t extend the enum
horizontally — composite indent+align combinations would explode the
enum to ~30 variants, most of which never appear in real corpus. A
nested struct with a sub-enum keeps the variant count finite while
staying matchable.
large_enum_variant clippy lint: Container::Composite is the
largest variant at 4 bytes; the others are ≤ 2 bytes. The variant
data is tiny enough that boxing would add a pointer chase for no
real layout win — see the [workspace.lints.clippy] large_enum_variant = "allow" carve-out in Cargo.toml.
Composition
Containers nest:
[#ここから2字下げ]
通常の段落。
[#ここから地付き]
右寄せの行。
[#ここで地付き終わり]
通常に戻る。
[#ここで字下げ終わり]
Renders as nested divs:
<div class="aozora-indent-2">
通常の段落。
<div class="aozora-align-end">
右寄せの行。
</div>
通常に戻る。
</div>
Mismatched closers (e.g. [#ここから地付き] … [#ここで字下げ終わり])
fire diagnostic E0005 and the parser
auto-closes the offending opener at the closer’s position.
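That recovery step can be sketched against a simple open-container stack. The types and function below are illustrative, not the crate’s real internals:

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Kind { Indent, AlignEnd }

/// Pop the innermost opener. A `false` flag means the closer did not
/// match — the caller fires E0005 and the opener is auto-closed at the
/// closer's position. `None` means an unmatched close.
fn close_container(stack: &mut Vec<Kind>, closer: Kind) -> Option<(Kind, bool)> {
    let opened = stack.pop()?;
    Some((opened, opened == closer))
}

fn main() {
    // [#ここから地付き] … [#ここで字下げ終わり]
    let mut stack = vec![Kind::AlignEnd];
    assert_eq!(close_container(&mut stack, Kind::Indent), Some((Kind::AlignEnd, false)));
}
```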
Why containers, not stack-based push/pop tokens?
The spec describes these as opener / closer brackets, but the natural implementation in Rust is a recursive container node. That choice:
- Lets the renderer walk the tree once with a single match on ContainerKind, instead of maintaining a render-time stack.
- Surfaces shape errors (mismatched closers, dangling openers) at parse time — the lexer’s classify phase already has all the information to decide.
- Makes the canonical-serialise pass trivial (each container prints its opener, walks its children, prints its closer).
The trade-off is one extra heap touch per container — a single
bumpalo slice for children. The arena is already hot, so the cost
is negligible (bumpalo returns aligned pointers in O(1) bumps).
See also
- Architecture → Borrowed-arena AST — how container child slices are laid out in the arena.
- Diagnostics →
E0005— mismatched closer.
Page & section breaks (改ページ・改丁)
Aozora Bunko inherits print conventions for page-level structure. Four annotations split a work into pages, signatures, and openings:
| Notation | Renders as | Meaning |
|---|---|---|
| [#改ページ] | <div class="aozora-page-break"/> | Begin a new page |
| [#改丁] | <div class="aozora-page-break aozora-recto"/> | Begin a new recto (right-hand) page |
| [#改見開き] | <div class="aozora-page-break aozora-spread"/> | Begin a new two-page spread |
| [#改段] | <div class="aozora-section-break"/> | Section break (smaller than a page) |
All four are self-contained directives — no opener / closer pair, no inner content. They appear on their own line in the source.
AST shape
pub enum Break {
Page,
PageRecto, // 改丁
PageSpread, // 改見開き
Section, // 改段
}
pub struct BreakNode {
pub kind: Break,
pub span: Span,
}
Why distinct variants for each break flavour?
The four flavours render to identical HTML structure (an empty
<div>) but different class hooks. Collapsing them to a single
variant with a string tag would:
- Force the renderer to plumb the original notation through to the output, defeating the AST’s role as a normalised IR.
- Lose the type-system check that every break flavour has a render path — clippy’s exhaustiveness lint catches the bug at compile time.
- Make it impossible to count page breaks of a specific flavour at the AST level without a string match.
The 4-variant enum is a single discriminant byte — no real cost over the alternative.
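The size claim is easy to verify with a fieldless enum (this snippet mirrors the AST shape above purely for illustration):

```rust
// Four fieldless variants collapse into one discriminant byte.
#[allow(dead_code)]
#[derive(Clone, Copy)]
enum Break { Page, PageRecto, PageSpread, Section }

fn break_size() -> usize {
    std::mem::size_of::<Break>()
}

fn main() {
    assert_eq!(break_size(), 1);
}
```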
Composition with other annotations
Breaks unconditionally close any open inline annotation (ruby, bouten, tcy) at their line. They do not close container directives (字下げ, 地付き, etc.) — those persist across page boundaries, which matches print typography.
[#ここから2字下げ]
第一節
[#改ページ]
第二節 (still 2字下げ)
[#ここで字下げ終わり]
Diagnostics
| Code | Condition |
|---|---|
| W0008 | Page break inside a single-line container (drops the container) |
See also
- Indent containers — containers persist across breaks.
Diagnostics catalogue
aozora is non-fatal by design: the parser always produces a tree, even from malformed input, and reports problems through structured diagnostics that callers can choose to treat as errors. This page lists every diagnostic the lexer can emit.
Each diagnostic carries:
- A stable code (E0001, W0001, …). The number suffix is permanent across versions; codes are added but never renumbered.
- A level: Error, Warning, Info.
- A span (byte range in the source).
- A message in English.
- (optional) a help line suggesting a fix.
The CLI renders diagnostics through miette;
all bindings (Rust library, FFI JSON, WASM JSON, Python list) carry
the same structured data.
E-codes (errors)
E0001 — empty ruby reading
|青梅《》
The base text is given but the reading inside 《…》 is empty.
Fix: provide a reading or remove the | marker.
E0002 — nested ruby
|青梅《|お《お》うめ》
The spec disallows ruby inside ruby; the inner |…《…》 is
ambiguous. Fix: restructure so the readings are siblings, not
nested.
E0004 — mismatched bouten container closer
[#ここから傍点]…[#ここで傍線終わり]
The opener was a bouten variant; the closer was a bousen variant.
Fix: match the closer to the opener family (傍点終わり for any
点 variant; 傍線終わり for any 線 variant).
E0005 — mismatched container closer
[#ここから2字下げ]…[#ここで地付き終わり]
Different container kinds. The parser auto-closes the offending opener at the closer’s position. Fix: match opener and closer.
E0009 — bracketed kaeriten with no pair
有[#二]朋自遠方来 ([#一] missing)
The bracketed kaeriten form requires a paired closer. Fix: add
the matching [#一] (or remove the [#二]).
W-codes (warnings)
W0001 — tcy target not found
昭和27年生まれ[#「999」は縦中横]
The quoted run does not appear in the look-back window (current line + previous line, max 64 KiB). The directive is dropped. Fix: quote a run that actually appears in the source.
W0003 — bouten target ambiguous
平和平和[#「平和」に傍点]
Two candidate runs in the look-back window. The parser applies the
bouten to the most recent match (right-most in vertical / left-to-
right reading); W0003 flags the ambiguity for the author to
disambiguate.
W0006 — unresolved gaiji reference
The gaiji reference resolved to neither a Unicode codepoint nor a
JIS X 0213 entry, and no descriptive-name fallback applied. The
character is rendered as descriptive text in <span> brackets.
Fix: check the JIS triple, add the codepoint manually, or extend
the descriptive-name table.
W0007 — kaeriten outside 漢文 context
こんにちは レ
A kaeriten character (レ, 一, 上, …) appeared in a context
that doesn’t look like 漢文 (no preceding kanji run, surrounded by
kana). The parser still emits the kaeriten node but flags the
suspicious placement.
W0008 — break inside single-line container
[#地付き]right-flushed[#改ページ]
The page break terminates the single-line container before its implicit end-of-line closer. The container is dropped from the output.
W0010 — unrecognised container directive
The [#ここから…] directive matched no known container kind.
The parser emits a Container::Unknown and copies the directive
verbatim into the canonical-serialise output.
I-codes (info)
I0001 — accent decomposition applied
M[i!]cher → Micher
Reported once per source for each distinct ASCII digraph that the
sanitize phase decomposed. Off by default; enable with
--diagnostics info on the CLI.
Why a stable code, not just a message?
Two reasons.
- Test stability. The corpus sweep counts diagnostics by code to detect parser regressions. A test like “the corpus emits at most 12 W0006 warnings” is robust against message wording tweaks; a test that greps the message string breaks every localisation pass.
- Tool integration. Editors / LSPs / CI lints filter diagnostics by code (e.g. “treat E* as error, ignore W0010 for legacy files”). String matching there is fragile in practice.
The cost is a small lookup table (code → message); the win is
that diagnostics survive refactors and translation.
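A minimal sketch of the diagnostic shape the list above describes — the field and function names here are illustrative, not the crate’s real types:

```rust
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Level { Error, Warning, Info }

#[allow(dead_code)]
#[derive(Debug)]
struct Diagnostic {
    code: &'static str,      // stable across versions, e.g. "E0001"
    level: Level,
    span: (usize, usize),    // byte range in the source
    message: String,
    help: Option<&'static str>,
}

/// Filter by code prefix the way a CI lint would ("treat E* as error").
fn is_fatal(d: &Diagnostic) -> bool {
    d.code.starts_with('E')
}

fn main() {
    let d = Diagnostic {
        code: "E0001",
        level: Level::Error,
        span: (0, 12),
        message: "empty ruby reading".into(),
        help: Some("provide a reading or remove the | marker"),
    };
    assert!(is_fatal(&d));
}
```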
See also
- Architecture → Error recovery — what the parser does after each diagnostic fires (preserved output, dropped tokens, where the bytes go).
- Library Quickstart → Diagnostics
- CLI Quickstart → Diagnostics format
- Architecture → Seven-phase lexer — which phase emits which code.
Pipeline overview
aozora is a pure-functional parser: given the same input, the same
arena, and the same compile-time configuration, the output is
bit-for-bit identical. There are no thread-locals, no OnceCell
caches in the parse path, no environmental side effects. The only
state the parser owns is the arena and a string interner, both reset
per Document.
Three layers
flowchart TD
src["source text<br/>(UTF-8 or Shift_JIS)"]
decode["Shift_JIS decode<br/>(aozora-encoding)"]
lex["Lex<br/>(aozora-lex)<br/>sanitize → tokenize → pair → classify"]
tree["AozoraTree<'arena><br/>(borrowed AST)"]
render["Render<br/>(aozora-render)<br/>html / serialize"]
out["HTML / canonical 青空文庫 source"]
src --> decode --> lex --> tree --> render --> out
Each arrow is a pure function. The arena is threaded through lex;
nothing else holds state.
Crate dependency graph
flowchart TD
spec["aozora-spec<br/>shared types"]
encoding["aozora-encoding<br/>SJIS + 外字 PHF"]
scan["aozora-scan<br/>SIMD multi-pattern"]
veb["aozora-veb<br/>Eytzinger sorted-set"]
syntax["aozora-syntax<br/>AST node types"]
lexer["aozora-lexer<br/>7-phase classifier"]
lex["aozora-lex<br/>fused orchestrator"]
render["aozora-render<br/>html / serialize"]
facade["aozora<br/>public facade"]
cli["aozora-cli"]
ffi["aozora-ffi"]
wasm["aozora-wasm"]
py["aozora-py"]
spec --> encoding
spec --> scan
spec --> veb
spec --> syntax
encoding --> syntax
scan --> lexer
veb --> lexer
syntax --> lexer
lexer --> lex
lex --> render
render --> facade
facade --> cli
facade --> ffi
facade --> wasm
facade --> py
aozora-spec is the foundation — every other crate depends on it.
The dependency graph forms a strict DAG; cargo itself rejects cyclic
crate dependencies, and the intended layering is enforced by the cargo
metadata check in just lint.
What each layer does
Sanitize → Tokenize → Pair → Classify
The lexer pipeline is split into four sub-phases because each stage has a different cost / cache profile:
| Sub-phase | Input | Output | Why separate |
|---|---|---|---|
| Sanitize | raw &str | normalised &str | BOM / CRLF / accent-decomposition / PUA assignment all happen here, once, before any expensive lookahead. Keeps later phases linear-time. |
| Tokenize | normalised &str | trigger offsets | SIMD scanner fires here; finds every | 《 》 ※ [ ] byte. |
| Pair | trigger offsets | balanced (open, close) pairs | Bracket matching only; no semantic interpretation. |
| Classify | pairs + slices | AozoraNode<'_> stream | Decides “is this [#…] an indent opener, a bouten directive, a tcy directive, …”. |
Splitting them lets the parser ship two surface APIs without code duplication:
- lex_into_arena() — fused, allocates one tree.
- Per-phase calls — used by the bench harness’s phase_breakdown probe (and the aozora-lexer integration tests for spec conformance).
Sanitize details
Phase 0 sanitize covers:
- BOM strip — UTF-8 and UTF-16 BOMs (rare in source, but real).
- CRLF normalisation — CRLF → LF.
- Rule isolation — separates inline ※[#…] from following text so the tokenizer has unambiguous boundaries.
- Accent decomposition — 114 ASCII digraphs / ligatures → Unicode (see Gaiji).
- PUA assignment — gaiji references get private-use codepoints inline so the tokenizer can treat them as single-character tokens without re-parsing the ※[#…] body.
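The first two sanitize steps are mechanical; a hedged sketch (the real phase fuses all five steps into one pass, and the function name here is illustrative):

```rust
/// Minimal sketch of BOM strip + CRLF → LF normalisation. Rule isolation,
/// accent decomposition, and PUA assignment are elided.
fn sanitize(src: &str) -> String {
    let src = src.strip_prefix('\u{FEFF}').unwrap_or(src);
    src.replace("\r\n", "\n")
}

fn main() {
    assert_eq!(sanitize("\u{FEFF}一行目\r\n二行目"), "一行目\n二行目");
}
```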
Tokenize: SIMD scan
Trigger byte detection runs the SIMD multi-pattern scanner. Three backends:
- Teddy (Hyperscan-style packed-pattern via aho-corasick) on x86_64 with AVX2.
- Hoehrmann-style multi-pattern DFA (regex-automata engine) as the portable fallback.
- Memchr-based for wasm32 until wasm-simd lands in the workspace.
See Architecture → SIMD scanner backends for the selection logic and what each backend looks like in samply.
Pair → Classify
Bracket matching is a single linear-time stack walk over the trigger
offsets. Classify then does the actual recognition: each opener
type has a recogniser registered under
aozora-lexer::recognisers::*. The recognisers run in deterministic
order (see Architecture → Seven-phase lexer).
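The Pair walk can be sketched as a single pass over Tokenize’s trigger offsets. The trigger representation here is hypothetical — the multi-byte triggers (《, 》, ※) are elided since they don’t pair in this sub-phase:

```rust
/// One linear stack walk over trigger offsets. Unmatched closers are left
/// for Classify, which lets the bytes flow through as plain text.
fn pair_brackets(triggers: &[(usize, u8)]) -> Vec<(usize, usize)> {
    let mut stack = Vec::new();
    let mut pairs = Vec::new();
    for &(offset, byte) in triggers {
        match byte {
            b'[' => stack.push(offset),
            b']' => {
                if let Some(open) = stack.pop() {
                    pairs.push((open, offset)); // innermost pairs close first
                }
            }
            _ => {} // non-bracket triggers don't pair here
        }
    }
    pairs
}

fn main() {
    // Nested brackets: inner pair is emitted before the outer one.
    assert_eq!(pair_brackets(&[(0, b'['), (2, b'['), (4, b']'), (6, b']')]),
               vec![(2, 4), (0, 6)]);
}
```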
Render
Two render walkers:
- html::render_to_string — a single O(n) tree walk emitting semantic HTML5 with aozora-* class hooks.
- serialize::serialize — re-emits canonical 青空文庫 source.
Both are pure functions; both allocate exactly the output buffer and nothing else.
What the pipeline does not do
No tree mutation between layers. No optimisation passes. No
“resolver” stage that mutates the AST. The lexer produces the
final tree; the renderer consumes it; that’s it. This is the same
shape as a functional reactive pipeline, and it’s what lets the
borrowed-arena AST (next chapter) work without RefCell or
UnsafeCell.
See also
- Borrowed-arena AST — what AozoraTree<'arena> actually points at.
- Seven-phase lexer — the inside of the Lex box.
- Crate map — every crate, its purpose, what depends on what.
Borrowed-arena AST
AozoraTree<'a> is not an owned tree. It’s a borrow into two
things owned by Document:
- the source Box<str>,
- a bumpalo::Bump arena that holds every intermediate node and child slice.
flowchart LR
subgraph Document
src["Box<str> source"]
bump["bumpalo::Bump arena"]
end
tree["AozoraTree<'a>"]
walk["render / serialize / iterate"]
src -.borrows.-> tree
bump -.borrows.-> tree
tree --> walk
When the Document drops, the source Box<str> and the arena’s
single backing buffer drop in two free() calls — every node, every
container, every interned string releases together. There is no
per-node destructor and no walk-the-tree-to-free pass.
Why an arena and not Box<Node> everywhere?
The naive Rust shape — enum Node { Ruby { target: String, … }, … }
— would allocate per node, per String, per Vec<Node> for
container children. For a typical Aozora Bunko work (~500 KiB
source, ~50 000 nodes) that’s:
- ~50 000 individual heap allocations,
- ~50 000 individual frees on drop (each a round-trip through the heap allocator’s free list),
- 16+ bytes of allocator metadata per allocation,
- random-access fragmentation that defeats prefetch.
The arena variant produces:
- ~16 bump allocations (4 KiB pages, refilled on overflow),
- 1 free on drop (Bump::reset returns the pages to the OS; in practice the pages are reused via the system allocator’s page cache),
- Sequential layout: nodes that were lexed near each other live near each other in memory, which is exactly the order the renderer walks them.
Measured on the corpus sweep: the arena variant
parses 6.4× faster than the equivalent Box<Node> shape, and the
peak RSS is 30% lower. The win is cumulative — every binding
(CLI / WASM / FFI / Python) inherits it.
Why bumpalo over typed-arena, slotmap, or hand-rolled?
| Crate | Shape | Why aozora doesn’t use it |
|---|---|---|
typed-arena | One arena per type (Arena<Ruby>, Arena<Bouten>, …) | aozora has 30+ node types; managing 30 arenas is operationally awkward and forces lifetime-bound &'a per type. |
slotmap | Index-keyed nodes; arena owns; access via SlotMap::get | Adds an indirection (key → slot → node) on every walk, regressing render throughput by ~25% on the bench harness. Also forces Copy keys, which for variable-length text fields means re-interning. |
id-arena / index_vec | Index-typed, &str borrowing | Same indirection cost as slotmap. |
| Hand-rolled bump | Custom; tightest control | Correct, but bumpalo is already a stable, mainstream, allocator-aware bump arena with bumpalo::collections::Vec for child slices. Reinventing wins nothing. |
bumpalo | Single arena, type-erased; allocate any T with bump.alloc(T) | One arena per Document; allocate-then-borrow gives &'a T for the lifetime of the arena. Matches aozora’s “one arena per Document” need exactly. |
bumpalo’s collections::Vec<'bump, T> (used for container child
slices) is Vec-shaped but allocated inside the arena — child
slices get the same arena lifetime as the parent without a separate
allocation strategy.
How the AST shape interacts with the lifetime
pub enum AozoraNode<'src> {
Plain(&'src str),
Ruby(Ruby<'src>),
Bouten(Bouten<'src>),
Tcy(Tcy<'src>),
Gaiji(Gaiji<'src>),
Container(&'src Container<'src>), // boxed in the arena
BreakNode(BreakNode),
// … 30+ variants
}
The 'src lifetime is the arena lifetime (re-using 'src because
all node text borrows from the source buffer, which lives at least
as long as the arena). Each variant either:
- holds a &str slice into the source (zero copy), or
- is a small Copy struct (BreakNode, Saidoku, …), or
- is &'src Container<'src> — boxed in the arena because Container itself contains a &'src [AozoraNode<'src>] child slice.
The whole AozoraNode is Copy (it’s a tagged union of references
and small primitives), so iterating the tree never needs & — just
deref the reference, copy the node, walk on.
What you trade
The big trade-off: you can’t outlive the Document. A
Vec<AozoraNode<'_>> doesn’t compile because the '_ lifetime is
bound to the arena, which is bound to the Document.
In practice this rarely matters — consumers either:
- Render the tree immediately and discard it (tree.to_html() returns String, which has no lifetime tie).
- Walk the tree once and emit their own owned IR (most editor backends do this).
- Hold the Document itself across function boundaries and re-derive the tree on the inside.
For consumers that genuinely need an owned tree, aozora::owned
(planned for v0.3) will provide a walk helper that builds a
Vec<OwnedNode> from a tree pass. We resist shipping it pre-1.0
because the conversion is trivial and shipping a built-in owned
version would push consumers toward it even when they don’t need it.
Lifetime safety
The 'src parameter prevents these shapes at compile time:
fn bad() -> AozoraTree<'static> {
let doc = aozora::Document::new("…".into());
doc.parse() // ERROR: cannot return value referencing local
}
Borrow-checker enforcement; no runtime Drop ordering bugs possible.
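The same borrow discipline can be sketched std-only. The `Doc` / `Tree` pair below is a hypothetical stand-in, not aozora's API — it shows why owned outputs escape freely while the borrowed tree cannot outlive its owner:

```rust
// Std-only analogue of the Document / AozoraTree lifetime pattern.
// `Doc`, `Tree`, and `render` are illustrative names, not aozora's API.
struct Doc {
    source: String,
}

struct Tree<'a> {
    // Borrows Doc's buffer, like AozoraNode's &'src str slices.
    nodes: Vec<&'a str>,
}

impl Doc {
    fn parse(&self) -> Tree<'_> {
        Tree { nodes: self.source.split_whitespace().collect() }
    }
}

// Owned output: no lifetime tie, free to outlive the Doc.
fn render(tree: &Tree<'_>) -> String {
    tree.nodes.join("|")
}
```

Returning `Tree<'_>` from a function that owns its `Doc` fails to compile, exactly as in the `bad()` example above; returning the `String` from `render` always works.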
See also
- Pipeline overview — where the arena is created.
- Crate map — aozora-syntax defines the node types; aozora-lex does the allocation.
Seven-phase lexer
aozora-lexer runs as seven distinct phases, each a pure function
on the previous phase’s output. The split exists because each phase
has a different cost profile — separating them keeps the dominant
hot path (Phase 2 tokenize) tight, and lets the bench harness measure
each phase independently via the phase_breakdown probe.
Phase ordering
flowchart LR
p0["Phase 0<br/>sanitize"]
p1["Phase 1<br/>scan triggers"]
p2["Phase 2<br/>tokenize"]
p3["Phase 3<br/>classify"]
p4["Phase 4<br/>pair containers"]
p5["Phase 5<br/>resolve targets"]
p6["Phase 6<br/>diagnostics"]
p0 --> p1 --> p2 --> p3 --> p4 --> p5 --> p6
Each arrow carries a small data structure (offsets, slices, AST nodes); no phase reads back into a previous phase’s output.
| Phase | Input | Output | What it does |
|---|---|---|---|
| 0 — Sanitize | raw &str | normalised &str | BOM strip, CRLF → LF, accent decomp, PUA assignment for gaiji refs |
| 1 — Scan | normalised &str | trigger offsets &[Trigger] | SIMD multi-pattern scan for |《》※[] |
| 2 — Tokenize | normalised &str + offsets | &[Token] | Slice the source at trigger boundaries; classify each slice as Plain / Open / Close / RefMark |
| 3 — Classify | &[Token] | &[ClassifiedToken] | Recogniser registry decides what each [#…] body actually is |
| 4 — Pair | &[ClassifiedToken] | &[Container] | Bracket matching: openers ↔ closers, build container tree |
| 5 — Resolve | &[Container] + source | AozoraTree<'_> | Look-back resolution for bouten / tcy targets, tie inline annotations to AST nodes |
| 6 — Diagnostics | AozoraTree<'_> + accumulator | Diagnostics | Collect diagnostics from earlier phases, sort by span, pin codes |
Phase 0: sanitize
The most varied phase by what it touches. Sub-passes:
- bom_strip — UTF-8 / UTF-16 BOM detection and removal.
- crlf — CRLF → LF in one memchr2 pass.
- rule_isolate — separate inline ※[#…] from following text so the tokenizer has unambiguous boundaries.
- accent — 114 ASCII digraph / ligature decomposition (see Notation → Gaiji).
- pua_scan — assign each ※[#…] reference a private-use codepoint inline so subsequent phases treat it as a single character.
Each sub-pass is independent; phase0_breakdown probe measures them
separately. In the corpus sweep, pua_scan dominates Phase 0 (60%
of phase wall time on average) because it has to scan the whole
document for ※[#…] — the SIMD scanner from Phase 1 isn’t yet active.
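The semantics of the first two sub-passes fit in a few lines. This is a simplified sketch (the real crlf pass locates `\r` bytes with memchr2 instead of a substring replace, and the real code is allocation-conscious):

```rust
// Simplified bom_strip + crlf sub-passes of Phase 0 sanitize.
// Sketch only: the production passes are memchr2-driven and avoid
// allocating when the input is already clean.
fn sanitize_basic(src: &str) -> String {
    let src = src.strip_prefix('\u{FEFF}').unwrap_or(src); // bom_strip
    src.replace("\r\n", "\n")                              // crlf
}
```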
Phase 1: scan triggers
The hot path. SIMD multi-pattern scan for the seven trigger bytes:
| 《 》 ※ [ ] (full-width space)
The chosen scanner backend (Teddy, Hoehrmann DFA, memchr-based)
produces a &[Trigger] of byte offsets. See
SIMD scanner backends for the selection logic.
Throughput on a typical mid-size work (crime_and_punishment.txt,
~600 KiB UTF-8): ~12 GB/s on Teddy, ~3.5 GB/s on the DFA fallback.
Both are well above the rest of the pipeline’s throughput — Phase 1
is essentially free at the corpus level.
Phase 2: tokenize
Slice the source at trigger boundaries and classify each slice:
pub enum Token<'src> {
Plain(&'src str),
Open(OpenKind, Span),
Close(CloseKind, Span),
RefMark(Span), // ※ in isolation
}
Single linear pass over the trigger array; no allocation outside the
output Vec (which is sized exactly from the trigger count).
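The slicing step can be sketched over a bare offset list. This is a simplification — real Phase 2 tokens carry OpenKind / CloseKind and spans, and the classification of each segment's leading character is omitted here:

```rust
// Cut the source into segments that each begin at a trigger offset.
// `triggers` must be sorted, ascending byte offsets on char boundaries.
fn slice_at<'src>(src: &'src str, triggers: &[usize]) -> Vec<&'src str> {
    let mut out = Vec::with_capacity(triggers.len() + 1);
    let mut prev = 0;
    for &off in triggers {
        if off > prev {
            out.push(&src[prev..off]); // plain run before the trigger
        }
        prev = off;
    }
    if prev < src.len() {
        out.push(&src[prev..]); // tail segment from the last trigger
    }
    out
}
```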
Phase 3: classify
The most code-heavy phase. The classifier registry has one
recogniser per [#…] directive family:
- RubyRecogniser
- BoutenRecogniser
- TcyRecogniser
- IndentRecogniser
- AlignRecogniser
- LineLengthRecogniser
- BreakRecogniser
- KaeritenRecogniser
- … 17 in total
The recognisers run in deterministic order; the first recogniser
that matches the directive body wins. Order matters because some
directive bodies are valid prefixes of others (e.g. ここから2字下げ
is a valid prefix of ここから2字下げ、地寄せ). Compile-time tests
in aozora-lexer enforce ordering invariants.
The recognisers themselves are short (most are < 50 LOC) — the bulk
of classify cost is the phf::Map of directive prefixes the
recognisers share for opener detection.
Phase 4: pair
Bracket matching. Walk the classified token stream, push openers
onto a stack, pop on closers, fail if mismatched. The output is a
tree of Container<'_> nodes whose children are flat &[Token<'_>]
slices.
Single linear pass; the stack is a SmallVec<[ContainerKind; 8]> so
it stays on the stack for typical 1–4 deep nesting.
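The pairing walk can be sketched with simplified kinds. In the real lexer the stack is a SmallVec and a mismatch becomes a diagnostic rather than an `Err`; the types here are illustrative:

```rust
#[derive(Clone, Copy, PartialEq)]
enum ContainerKind { Indent, Align }

enum Tok { Open(ContainerKind), Close(ContainerKind), Text }

// Push openers, pop on closers, reject mismatches and leftovers.
fn pair_containers(tokens: &[Tok]) -> Result<(), &'static str> {
    let mut stack: Vec<ContainerKind> = Vec::new(); // SmallVec<[_; 8]> in aozora
    for tok in tokens {
        match tok {
            Tok::Open(k) => stack.push(*k),
            Tok::Close(k) => match stack.pop() {
                Some(open) if open == *k => {}
                _ => return Err("mismatched closer"),
            },
            Tok::Text => {}
        }
    }
    if stack.is_empty() { Ok(()) } else { Err("unclosed container") }
}
```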
Phase 5: resolve
Bouten / tcy targets quote-by-look-back: the directive [#「平和」に傍点]
applies to the most recent 平和 in the preceding text. Phase 5
walks the container tree and resolves these references.
Pre-Phase-5 the tree carries unresolved BoutenRef { target: "平和" }
nodes; post-Phase-5 they’re Bouten { target_span: Span } pointing
at the actual matched run. The resolver uses an aho-corasick DFA
over the live target strings — single-pass over the source, no
recogniser-order dependencies.
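For a single target, the look-back amounts to an `rfind` over the text before the directive. The helper below is hypothetical — the real resolver batches all live targets into one aho-corasick pass instead of searching per directive:

```rust
// Resolve a [#「target」に傍点]-style reference: find the last
// occurrence of `target` before the directive. Byte offsets throughout;
// `directive_at` must lie on a char boundary.
fn resolve_lookback(source: &str, directive_at: usize, target: &str) -> Option<(usize, usize)> {
    let start = source[..directive_at].rfind(target)?;
    Some((start, start + target.len()))
}
```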
Phase 6: diagnostics
Collect, sort by span, pin codes. Diagnostics emitted in earlier
phases were buffered in a DiagnosticAccumulator threaded through
the call stack; Phase 6 sorts them and assigns the stable error
codes (E0001, W0001, …).
Why seven phases, not one big function?
Three reasons.
- Bench-driven optimisation. The phase_breakdown probe reports per-phase wall time per corpus document. Knowing that “this document spends 80% of parse time in Phase 3 classify” tells you exactly where to focus a perf PR. A monolithic lex() would force you to re-instrument every PR.
- Spec compliance. Each phase corresponds to a discrete transformation that the spec describes. If a spec gap shows up in production, it almost always lands in one phase, and the test harness can pin a regression test that exercises that phase only.
- Composability. aozora-lexer exposes both the fused lex_into_arena and the per-phase calls. The fused version is what aozora-lex ships to consumers; the per-phase calls are what the bench harness and integration tests use to isolate regressions.
The cost is conceptual (more API surface internal to the lexer); the win is that every perf decision in the parser has a measurement attached.
See also
- Pipeline overview — how the lexer fits into the parse layer.
- SIMD scanner backends — Phase 1 in detail.
- Performance → Profiling with samply — how to measure the per-phase cost on your own workload.
SIMD scanner backends
Phase 1 of the lexer is a multi-pattern byte scan: find every
occurrence of the seven trigger bytes (|《》※[] ) in the
source. On a typical Japanese corpus document — where every
codepoint is a 3-byte UTF-8 sequence and no trigger byte appears
more than once per kilobyte — the scan dominates the interpretation
by an order of magnitude. So this is the place where SIMD pays for
itself.
aozora-scan ships three backends, one of which is selected per
target at compile time:
| Backend | Target | Throughput (corpus) | Selection |
|---|---|---|---|
| Teddy | x86_64 + AVX2 | ~12 GB/s | first choice when AVX2 is available |
| Hoehrmann DFA | portable | ~3.5 GB/s | x86_64 fallback, native arm64, etc. |
| Memchr-multi | wasm32 | ~1.2 GB/s | wasm32 until the SIMD proposal lands |
Each backend produces the same (offset, TriggerKind) stream; the
lexer cannot tell which one ran. Selection happens behind a
runtime-dispatched trait so a single binary can carry both the SIMD
fast path and a portable fallback.
Backend 1: Teddy (Hyperscan-style packed)
Teddy is the small-string multi-pattern algorithm from Intel’s
Hyperscan. The
aho-corasick crate ships a packed::teddy implementation that
aozora calls into directly.
Why Teddy here:
- The trigger set is small (7 patterns) and short (1 char each — 1 byte for the ASCII triggers, 3 bytes for the full-width ones in UTF-8). Teddy’s regime is exactly N small patterns where N ≤ 64 — ours has 7.
- The patterns share no common prefix structure (they’re distinct full-width punctuation), so a Boyer-Moore-style suffix table doesn’t help.
- AVX2 lets Teddy compare 32 bytes per cycle against the packed shuffle table, and our patterns fit cleanly into that lane width.
Why not just memchr-multi (the obvious upgrade):
memchr3 does scan for up to 3 distinct bytes simultaneously — but
covering all seven patterns (four of them 3-byte UTF-8 sequences)
would take multiple separate passes, each streaming the whole
source. Teddy does one pass for all seven patterns. The arithmetic
favours Teddy by ~3.5×.
Why not memchr’s own packed-pattern path:
memchr does have a packed multi-pattern API now, but it tops out
at ~5 GB/s on our workload because it goes through a generic 16-byte
SSE2 lane. Teddy’s AVX2 32-byte lane — combined with aho-corasick’s
shuffle-table compilation — wins on the corpus by 2.5×.
Backend 2: Hoehrmann-style multi-pattern DFA
For targets that lack AVX2 (older x86_64, native arm64 on some
runners, Alpine builds) the fallback is a byte-DFA built by
regex-automata’s dense::Builder. Hoehrmann’s design — single-byte
transitions, no backtracking, table-driven — gives O(1) per byte
with no SIMD requirement.
Why Hoehrmann-style over Aho-Corasick textbook NFA:
Aho-Corasick at runtime is an NFA with failure transitions; each mismatched byte may walk a chain of failure links before consuming the next input byte. Hoehrmann compiles those failure links into the dense table at build time, so every byte consumes exactly one table lookup. For a small pattern set that fits in cache, the dense table is faster than the NFA representation by 2×.
Why a DFA over a hand-rolled state machine:
regex-automata gives us a battle-tested table compiler with
correctness guarantees (panics from malformed transitions are
impossible) and the same crate handles the build-time DFA →
serialised-table flow if we ever want to ship the table as a static
asset. Hand-rolling buys nothing here — the patterns are small
enough that the compiler-emitted code generation isn’t the bottleneck.
Backend 3: memchr-multi (wasm32)
wasm32-unknown-unknown doesn’t yet have AVX2 (and even after
wasm-simd lands, the lane width is 16 bytes — which would put it
between Teddy and the DFA). Until the workspace targets wasm-simd,
the wasm build uses memchr’s portable multi-pattern path:
- memchr3 for the three single-byte triggers (|, [, ]),
- a follow-up scan for the multi-byte 《, 》, ※, and full-width-space sequences (3 bytes each in UTF-8).
Throughput is lower (~1.2 GB/s) but the WASM bundle stays small —
no need to ship a Teddy table or a regex-automata DFA in the
500 KiB-budgeted wasm artifact.
Backend selection
pub fn best_scanner_name() -> &'static str {
    // wasm32 is checked first: is_x86_feature_detected! only exists on
    // x86 targets, so the AVX2 probe must be compile-time gated.
    if cfg!(target_arch = "wasm32") {
        return "memchr-multi";
    }
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            return "teddy";
        }
    }
    "hoehrmann-dfa"
}
Runtime detection (not compile-time cfg!) so a single x86_64
binary works on AVX2-less CPUs without recompilation.
The dispatch goes through a &'static dyn Scanner trait object;
the indirect call is hoisted out of the inner loop in the lexer’s
Phase 2, so the trait dispatch is paid once per Document::parse,
not per byte.
Why a runtime dispatch over per-target binaries?
Two reasons.
- Distribution. Shipping one binary that adapts to its host is
simpler than shipping
aozora-x86_64-avx2andaozora-x86_64separately. The release pipeline only has to manage three archives (linux-gnu, darwin-arm64, windows-msvc), not six. - Container portability.
docker run --platform linux/amd64on an arm64 Mac (Rosetta) lands on x86_64 without AVX2 — runtime detection picks the DFA backend silently. A compile-time-only build would crash withSIGILLon first trigger byte.
The cost is a single indirect call per parse; the win is that the distribution surface stays minimal.
Verifying the scanner is firing
println!("{}", aozora_scan::best_scanner_name());
// "teddy" | "hoehrmann-dfa" | "memchr-multi"
Or under samply, look for one of:
- aozora_scan::backends::teddy::scan_offsets — Teddy is firing.
- aozora_scan::backends::dfa::scan_offsets — Hoehrmann fallback.
- memchr::arch::*::scan — memchr’s own internal SIMD; the scalar / wasm path is firing.
See Performance → Profiling with samply for the full workflow.
See also
- Pipeline overview
- Seven-phase lexer — Phase 1 fits in here.
Eytzinger sorted-set lookup
aozora-veb is a no_std crate that provides one data structure: a
sorted-set lookup over a static byte slice, laid out in
Eytzinger order so that the binary search is cache-friendly. It
backs the placeholder registry the lexer uses to recognise the
fixed-set strings inside [#…] directives (“ここから”, “ここで”,
“傍点”, “傍線”, “字下げ”, …).
flowchart LR
needle["needle: &str"]
table["Eytzinger-laid sorted set<br/>(static &[&str])"]
cmp["compare at index, branch left/right"]
found["Some(idx) | None"]
needle --> cmp
table --> cmp
cmp --> cmp
cmp --> found
What is Eytzinger order?
A standard sorted array stores elements in their natural order:
[a, b, c, d, e, f, g]. Binary search visits indexes
3, 1 or 5, 0/2/4/6 — accesses that are spatially distant in
memory. On modern CPUs that’s a cache miss per level past L1.
Eytzinger order stores the same elements in implicit-binary-tree
order: the root at index 1 (index 0 is reserved as a sentinel),
left child at 2i, right child at 2i+1. The walk visits indexes
1, 2 or 3, 4/5/6/7 — the hot top levels of the tree pack into a few adjacent cache lines.
For 256+ entries the cache-line packing is a measured 2–3× speedup
over std::slice::binary_search on the same data. Below 64 entries
the difference is in the noise (everything fits in one cache line).
The placeholder registry has ~120 entries — well into Eytzinger’s
favourable regime.
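The layout and lookup can be sketched std-only, over `u32` keys for brevity (aozora-veb works over `&str` entries and builds the permutation at compile time via its macro):

```rust
// In-order fill of the implicit tree turns a sorted slice into
// Eytzinger order: root at index 1, children of i at 2i and 2i+1.
fn eytzinger_fill(sorted: &[u32], out: &mut [u32], i: usize, pos: &mut usize) {
    if i < out.len() {
        eytzinger_fill(sorted, out, 2 * i, pos);     // left subtree first
        out[i] = sorted[*pos];
        *pos += 1;
        eytzinger_fill(sorted, out, 2 * i + 1, pos); // then right subtree
    }
}

// Binary search over the implicit tree: each step moves to index
// 2i or 2i + 1, so the top levels stay within the same cache lines.
fn eytzinger_contains(table: &[u32], needle: u32) -> bool {
    let mut i = 1; // index 0 is the sentinel
    while i < table.len() {
        if table[i] == needle {
            return true;
        }
        i = 2 * i + usize::from(needle > table[i]);
    }
    false
}
```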
Why this and not phf::Set?
phf::Set is a perfect-hash table: O(1) lookup, but with a real
constant — one hash computation, one table probe, one strcmp. For
short strings (the placeholder registry’s median is 4 chars) the
hash dominates, and the table probe is a pointer chase to a separate
allocation.
Eytzinger search is log N — but for N=120 that’s 7 comparisons,
all in one contiguous slice, no hashing, no separate allocation.
Measured: Eytzinger is ~1.5× faster than phf::Set on this
workload.
For larger sets (the gaiji table at ~14 000 entries),
phf::Set wins — log₂(14000) is 14 comparisons and the cache
locality stops mattering. The choice is entry-count-dependent.
The aozora codebase uses Eytzinger for sub-256-entry tables and
phf::Set for larger ones; the cutoff was determined empirically.
Why not a hash table?
A HashMap<&str, ()> allocates and rehashes; phf and Eytzinger
don’t. In the lexer’s Phase 3 classify, the placeholder registry
is hit once per [#…] directive — measured as ~5 lookups per
KB of source. A HashMap’s startup cost (build the table from a
const array on first use, even with OnceLock) would dominate
the parser’s per-Document::parse cost on tiny inputs.
API
pub struct EytzingerSet<'a> {
entries: &'a [&'a str], // already in Eytzinger order
}
impl<'a> EytzingerSet<'a> {
pub const fn new(entries: &'a [&'a str]) -> Self { Self { entries } }
pub fn contains(&self, needle: &str) -> bool { … }
pub fn position(&self, needle: &str) -> Option<usize> { … }
}
new is const fn so registries are computed at compile time and
end up in .rodata. Lookup is a single function with no allocation.
Building the order
The crate ships a build-time helper that takes a sorted slice and produces the Eytzinger permutation:
const PLACEHOLDERS: &[&str] = aozora_veb::eytzinger_layout!(
"ここから", "ここで", "傍点", "傍線", "字下げ", …
);
The macro is const-evaluated; the resulting slice is what
EytzingerSet::new takes.
Why a separate crate?
The lookup is no_std and has no aozora-specific dependencies. By
extracting it, three things become true:
- The lexer can depend on aozora-veb without pulling in any workspace state, which keeps aozora-veb’s test surface small.
- aozora-veb can be reused by aozora-encoding (for the accent decomposition table) and by aozora-bench (for category slug lookups in the trace rollup) without forming a circular dependency.
- Future consumers can depend on just aozora-veb for the data structure, without taking the whole parser.
See also
- Crate map — aozora-veb is the foundation crate with no internal deps.
- Performance → Benchmarks — the Eytzinger vs phf cutoff measurement.
Shift_JIS + 外字 resolver
aozora-encoding covers the full source-decoding stack:
- Shift_JIS / Shift_JIS-2004 / cp932 byte stream → UTF-8 string.
- JIS X 0213 plane-2 ideographs → Unicode (where possible).
- 外字 references (※[#…]) → resolved Unicode codepoint, JIS triple, or descriptive-text fallback.
- Accent decomposition (114 ASCII digraph / ligature → Unicode).
All four are pure functions; the crate has no global state and nothing that varies per-call.
Decode chain
flowchart TD
raw["raw bytes<br/>(SJIS-encoded .txt from Aozora Bunko)"]
sjis["encoding_rs::SHIFT_JIS<br/>or aozora-specific JIS X 0213 patch"]
utf8["UTF-8 String"]
sanitize["Phase 0 sanitize<br/>(in aozora-lexer)"]
pua["PUA assignment for 外字"]
classified["normalised &str ready for Phase 1 scan"]
raw --> sjis --> utf8 --> sanitize --> pua --> classified
The Shift_JIS decode itself uses encoding_rs
— the same crate Firefox uses for HTML decoding. Battle-tested,
SIMD-accelerated, and handles every Shift_JIS variant Aozora Bunko
sources have used since the 1990s. We add a thin patch layer for
JIS X 0213 plane-2 codepoints that encoding_rs’s strict cp932
mapping doesn’t cover (Aozora’s spec extends Shift_JIS into JIS
X 0213 territory; encoding_rs keeps the strict cp932 surface).
外字 (gaiji) PHF table
The reference table contains ~14 000 entries:
static GAIJI_TABLE: phf::Map<&'static str, GaijiEntry> = phf_map! {
    "1-94-37" => GaijiEntry::JisX0213 { plane: 1, row: 94, cell: 37, codepoint: '鰤' },
"U+5F85" => GaijiEntry::Direct { codepoint: '待' },
"魚+師のつくり" => GaijiEntry::Description { fallback: "[魚+師]" },
…
};
Why PHF (perfect hash function):
- The table is large enough (~14 000 entries) that linear scan or Eytzinger search would dominate the lookup cost.
- It’s static and known at compile time — the perfect hash is computable once.
phfproduces zero-allocation, zero-comparison-on-collision lookups. The hash is onewyhashround; the probe is one slice index; the comparison is one strcmp. ~25 ns per lookup on the bench harness.
Why not OnceLock<HashMap>:
- First-call cost: building a HashMap<&str, GaijiEntry> from 14 000 entries on first use takes ~5 ms. That’s longer than parsing a small document end-to-end.
- Memory: the runtime HashMap takes 2–3× the size of the static PHF (load-factor padding + RawTable metadata).
- Concurrency: OnceLock adds an atomic load on every access, even after initialisation. PHF is static — no synchronisation.
Why not load from a JSON / TOML asset:
- Adds startup cost on every Document::new (for small inputs, the file I/O alone is on the order of the parser’s whole runtime budget).
- Forces every binding (CLI / WASM / FFI / Python wheel) to ship the asset as a separate file, complicating distribution.
- Defeats dead-code elimination: the linker can’t strip entries the consumer’s input never references.
The build-time cost of compiling the PHF (~40 s the first time, 0 s incremental) is paid once per workspace build, not per-invocation.
Resolution order
pub fn resolve(reference: &str) -> Resolved {
// 1. Direct codepoint (U+XXXX) wins outright.
if let Some(c) = parse_unicode_form(reference) { return Resolved::Direct(c); }
// 2. JIS X 0213 plane-row-cell triple.
if let Some(triple) = parse_jis_triple(reference) {
if let Some(c) = JIS_TABLE.get(&triple) { return Resolved::Lookup(c); }
}
// 3. Descriptive name lookup (curated subset).
if let Some(fallback) = DESCRIPTION_TABLE.get(reference) {
return Resolved::Fallback(fallback);
}
Resolved::Unknown
}
Three layers, in order. Direct wins because the source author
explicitly wrote a Unicode codepoint — overriding it would be
wrong even if our JIS table disagreed. Lookup is the common case.
Fallback is the curated subset of characters that have no Unicode
codepoint at all (~120 entries from the 14 000); we ship a
descriptive-text rendering rather than dropping the character.
Unknown fires diagnostic W0006.
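Layer 1 — the explicit U+XXXX form — can be sketched std-only. This is a loose sketch of the `parse_unicode_form` helper from the snippet above; the real function also validates the hex digit count:

```rust
// Layer 1 of resolve(): parse an explicit "U+XXXX" reference.
// Sketch only — production code also bounds-checks the digit count.
fn parse_unicode_form(reference: &str) -> Option<char> {
    let hex = reference.strip_prefix("U+")?;
    let cp = u32::from_str_radix(hex, 16).ok()?;
    char::from_u32(cp) // rejects surrogates and out-of-range values
}
```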
Accent decomposition
Older Aozora works encode accented Latin letters using a separate
notation that is not a ※[#…] reference:
〔e'tude〕 → étude
〔Du:rer〕 → Dürer
〔[ae]on〕 → æon
The full mapping (114 entries — every digraph and ligature in the
spec) is at accent_separation.html in the spec snapshot. aozora
applies this decomposition during Phase 0 sanitize, before the
trigger scan, so by Phase 1 the source is pure Unicode with no
ASCII-encoded accents.
The lookup is also Eytzinger-laid (see Eytzinger sorted-set lookup) since 114 entries is well inside its favourable regime.
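The mapping's shape is a plain digraph → codepoint function. The entries below are a hypothetical subset for illustration only — consult accent_separation.html in the spec snapshot for the authoritative 114-entry table:

```rust
// Tiny illustrative slice of the digraph / ligature → Unicode table.
// The digraph spellings here are assumptions, not the spec's exact list.
fn decompose_accent(digraph: &str) -> Option<char> {
    match digraph {
        "e'" => Some('é'),   // acute
        "a`" => Some('à'),   // grave
        "u^" => Some('û'),   // circumflex
        "[ae]" => Some('æ'), // ligature
        _ => None,
    }
}
```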
Why a single crate for all of this?
encoding, gaiji, and accent are three distinct concerns, but:
- They all need to be applied once, in order, at the boundary between the source bytes and the parser proper.
- Splitting them would force three separate crate surfaces and three separate trigger points in the lexer.
- Their data tables are all built from upstream Aozora Bunko spec
pages, so a single update workflow (refresh
docs/specs/aozora/, re-extract tables) hits all three at once.
Co-locating them in one crate keeps the boundary tight and the update surface predictable.
See also
- Notation → Gaiji — author-facing notation reference.
- Seven-phase lexer → Phase 0 — where the resolver is invoked.
HTML renderer & canonical serialiser
aozora-render ships two walkers over AozoraTree<'_>:
- html::render_to_string — emits semantic HTML5 with aozora-* class hooks.
- serialize::serialize — emits canonical 青空文庫 source.
Both are pure functions. Both walk the tree once, in source order,
allocating exactly the output buffer (a String pre-sized to the
arena footprint).
HTML renderer
Class-name scheme
aozora emits stable class names that downstream stylesheets can hook:
| AST node | HTML | Class hook |
|---|---|---|
| Ruby | <ruby>X<rt>Y</rt></ruby> | (no class — semantic ruby element) |
| Bouten { kind: Sesame } | <em class="aozora-bouten-sesame">…</em> | aozora-bouten-<slug> |
| Tcy | <span class="aozora-tcy">…</span> | aozora-tcy |
| Gaiji { resolution: Direct } | <span data-aozora-gaiji-jis="1-94-37">字</span> | data-aozora-gaiji-* |
| Gaiji { resolution: Fallback } | <span class="aozora-gaiji-fallback" title="…">[…]</span> | aozora-gaiji-fallback |
| Container { kind: Indent { n: 2 } } | <div class="aozora-indent-2">…</div> | aozora-indent-<n> |
| Container { kind: AlignEnd } | <div class="aozora-align-end">…</div> | aozora-align-end |
| Break::Page | <div class="aozora-page-break"/> | aozora-page-break |
| Kaeriten { mark: Re } | <span class="aozora-kaeriten" data-aozora-kaeriten="レ">レ</span> | aozora-kaeriten |
The aozora- prefix is reserved for our class names — a downstream
stylesheet can target every aozora-emitted hook with [class^="aozora-"]
without conflicting with the consumer’s own classes.
Why a class-hook output instead of inline styles?
Inline styles would force a single typographic decision for every consumer — print stylesheet, screen stylesheet, e-book renderer, and LSP/preview pane all want different presentation. The class-hook output:
- Lets each consumer ship its own stylesheet for its medium.
- Survives content-security-policy regimes that block style attrs.
- Stays diff-able (the rendered HTML is stable across runs; presentation churn doesn’t ripple into snapshot tests).
HTML escaping
The renderer escapes <, >, &, ", ' in user text exactly
once, at emission. Pre-escaped or doubly-escaped output is a
correctness bug, not a perf decision — every CI run validates that
html_unescape ∘ render_to_string is the identity on plain runs.
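The escaping pass is small enough to sketch in full. The function name is assumed; the behaviour matches the five-character set described above:

```rust
// Escape the five HTML-special characters exactly once, at emission.
fn html_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '&' => out.push_str("&amp;"),
            '"' => out.push_str("&quot;"),
            '\'' => out.push_str("&#39;"),
            _ => out.push(c),
        }
    }
    out
}
```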
Canonical serialiser
The serialiser is the inverse of the lexer’s surface form: walk the tree, emit the source notation that would re-parse identically. It exists for three reasons:
- Round-trip property. parse ∘ serialize ∘ parse must be stable on the second iteration. The corpus sweep verifies this on every Aozora Bunko work.
- aozora fmt. The CLI’s fmt subcommand canonicalises author input (CRLF → LF, accent decomposition, container directive spacing).
- Diff-quality output. When the parser drops a malformed construct, the serialiser re-emits the surrounding text without the offending fragment, so authors can see the exact change.
Why a separate walker, not “render with a different visitor”?
The HTML and canonical-serialise outputs differ on every node type:
- HTML wraps Ruby { target, reading } in <ruby>X<rt>Y</rt></ruby>; serialise emits |X《Y》 (or the auto-detect form).
- HTML wraps Container { kind: Indent { n } } in <div class="aozora-indent-N">…</div>; serialise emits the bracketed directives [#ここからN字下げ]…[#ここで字下げ終わり].
- HTML emits <span data-aozora-gaiji-jis="1-94-37">字</span> for a resolved gaiji; serialise emits the original ※[#…、第3水準1-94-37].
The transformations don’t share enough structure to fit a single “visitor with two methods per node” abstraction. Two purpose-built walkers stay clearer and slightly faster — the compiler can inline the per-node match, which a generic visitor with virtual dispatch prevents.
Walker shape
Both walkers follow the same shape:
pub fn render_to_string(tree: &AozoraTree<'_>) -> String {
let mut buf = String::with_capacity(tree.estimated_html_size());
walk(tree, &mut buf);
buf
}
fn walk(tree: &AozoraTree<'_>, out: &mut String) {
for node in tree.nodes() {
match node {
AozoraNode::Plain(s) => out.push_str(&html_escape(s)),
AozoraNode::Ruby(r) => emit_ruby(r, out),
AozoraNode::Bouten(b) => emit_bouten(b, out),
AozoraNode::Tcy(t) => emit_tcy(t, out),
AozoraNode::Gaiji(g) => emit_gaiji(g, out),
AozoraNode::Container(c) => emit_container(c, out),
AozoraNode::BreakNode(b) => emit_break(b, out),
// … exhaustive
}
}
}
Single linear pass; no allocation outside the output buffer; no recursion that the compiler can’t unroll (containers recurse, but the fan-out is small — typically 1–4 children per container).
estimated_html_size heuristic
The buffer pre-size avoids String reallocations during the walk.
Empirical heuristic from the corpus sweep: 2.6 × source_byte_len
is at the 95th percentile (some HTML wraps a 3-byte ruby kanji in
30 bytes of <ruby>X<rt>Y</rt></ruby> markup). Going under leaves
~1 reallocation per render in the worst case; going over wastes
memory on every render. 2.6× is the measured optimum.
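The heuristic itself is one line. Sketch only — the real method lives on the tree and may account for more than raw source length:

```rust
// 2.6 × source length, in integer arithmetic to avoid float rounding.
fn estimated_html_size(source_byte_len: usize) -> usize {
    source_byte_len * 26 / 10
}
```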
See also
- Notation overview — what each AST node represents.
- Borrowed-arena AST — the input shape.
- Performance → Benchmarks — the
render_hot_pathprobe that drives the size estimate.
Concrete syntax tree (CST)
A rowan-backed lossless syntax tree lives under the cst
Cargo feature on the aozora crate. The CST is a pure projection
over the existing parse output — Phase 3 classification is unchanged,
the AST stays the perf-critical path, and the CST adds zero overhead
for consumers that don’t enable the feature.
Why a CST exists
The borrowed AST (AozoraNode<'src>) is great for renderers:
classified spans, typed payload, no whitespace noise. It is the wrong
shape for source-faithful tooling:
- A formatter rewriting 日本《にほん》 → |日本《にほん》 needs the exact whitespace and trivia between tokens.
- An LSP textDocument/foldingRange provider needs the open / close positions of every nestable region, including ones the renderer ignores.
- A refactor that renames the kanji range in [#「青空」に傍点] to [#「あおぞら」に傍点] must preserve every bracket character the user wrote, not just the parsed target.
A CST whose leaves concatenate to the parser’s input gives those tools what they need without any custom plumbing.
Lossless invariant
The contract is sharp:
Concatenating every leaf token’s text yields the sanitized source bytes the parser actually saw.
“Sanitized” matters: Phase 0 normalises CRLF→LF, strips a leading
BOM, isolates long decorative rule lines with a leading blank line,
and rewrites 〔…〕 accent spans through accent decomposition. These
transformations happen before classification, so source_nodes
coordinates address sanitized bytes. The CST tracks that coordinate
system; an editor that wants to map back to the user’s raw bytes
runs the same Phase 0 transformation and inverts where needed.
The proptest in tests/property_lossless.rs runs the invariant
across the full Aozora-shaped input distribution
(aozora_fragment / pathological_aozora /
unicode_adversarial from aozora-proptest). A regression here
breaks every editor surface that walks the CST.
Architecture
The crate stays decoupled by design:
- aozora-cst depends on aozora-pipeline + aozora-spec directly, not on the aozora meta crate. Going through aozora would create a cycle (the meta crate’s cst feature re-exports aozora-cst).
- build_cst(sanitized_source, source_nodes) -> SyntaxNode takes the lower-level bits explicitly so consumers writing custom pipelines can reach in.
- aozora::cst::from_tree(&tree) -> SyntaxNode is the ergonomic entry point; it runs Phase 0 sanitize internally and forwards.
- The Phase 3 classifier sees no changes — adding / removing CST consumers cannot perturb AST perf.
SyntaxKind granularity
The CST is intentionally coarser than a token-stream re-construction:
| SyntaxKind | Role |
|---|---|
| Document | Tree root |
| Container | Paired-container region ([#ここから...]...[#ここで...終わり]) |
| Construct | Single classified Aozora construct |
| ContainerOpen / ContainerClose | Container boundary tokens |
| ConstructText | Source slice of a Construct |
| Plain | Plain text run between classifications |
Finer per-token granularity (individual punctuation, kana runs, …)
can land later once a concrete consumer needs it. The lossless
property holds at any granularity, so widening the leaf set is
non-breaking for downstream tooling that walks preorder_with_tokens.
Why rowan, not Phase 3 integration
The bumpalo-arena AST stays the hot path; the CST sits on top as an editor-grade convenience layer rather than coupling lossless-tree concerns into the perf-critical classifier. rowan (over cstree) gives the lossless tree a maintained home — rust-analyzer’s tree infrastructure with 86 reverse deps — and the bumpalo / Arc dual-allocator overhead is the price for keeping the AST untouched.
Cross-references
- Architecture → Borrowed-arena AST — the underlying perf-critical tree.
- Architecture → Seven-phase lexer — where Phase 0 sanitize and Phase 3 classify do their work.
- Document::edit — the incremental-parse counterpart that reuses the same CST.
Error recovery
aozora is non-fatal by design: the parser always returns an
AozoraTree even when the input violates the spec. Every
problem is reported as a structured Diagnostic whose
code tooling can match on; nothing is ever raised as a
panic from Document::parse.
This page documents what the parser actually does when each diagnostic fires — useful when implementing editor surfaces, lint fixers, or anything else that runs over imperfect documents.
Recovery model
Every diagnostic carries two orthogonal axes:
| Axis | Values | Meaning |
|---|---|---|
| severity | Error / Warning / Note | Routing hint for downstream surfaces; does not affect parsing. |
| source | Source / Internal | Whether the issue is in the user’s input (Source) or in the library’s invariants (Internal). |
The parser keeps running regardless of severity. Error does not
short-circuit; it only marks the surrounding output region as
suspect so callers (CLI --strict, LSP) can decide policy. CI gates
typically treat any Error as failure, but the AST is still safe
to walk — the spans, classifications, and renderer all remain
consistent.
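The caller-side policy split can be sketched with toy types (the Severity and Diag shapes below are illustrative stand-ins, not the crate's real API): the parser only reports, and a strict consumer such as the CLI's --strict mode decides that any Error means a non-zero exit.

```rust
// Toy diagnostic shape mirroring the two-axis model described above.
#[derive(PartialEq)]
enum Severity { Error, Warning, Note }

struct Diag {
    severity: Severity,
    code: &'static str,
}

// A strict gate: the parse already completed; we only route on severity.
fn strict_exit_code(diags: &[Diag]) -> i32 {
    if diags.iter().any(|d| d.severity == Severity::Error) { 1 } else { 0 }
}

fn main() {
    let diags = vec![
        Diag { severity: Severity::Warning, code: "aozora::lex::source_contains_pua" },
        Diag { severity: Severity::Error, code: "aozora::lex::unclosed_bracket" },
    ];

    // Collect which codes actually gate the build.
    let failing: Vec<&str> = diags.iter()
        .filter(|d| d.severity == Severity::Error)
        .map(|d| d.code)
        .collect();
    assert_eq!(failing, vec!["aozora::lex::unclosed_bracket"]);

    assert_eq!(strict_exit_code(&diags), 1);   // strict mode fails
    assert_eq!(strict_exit_code(&diags[..1]), 0); // warnings alone pass
}
```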
Source-side codes
aozora::lex::source_contains_pua
Hello, …<U+E001>… world.
A user-supplied codepoint in the range U+E001..U+E004 collides with one of the lexer’s PUA sentinel reservations. The placeholder registry keys on these codepoints, so a bare collision means the classifier could no longer tell user-text occurrences from lexer-inserted markers.
Recovery: the colliding bytes are kept verbatim in the sanitised text — Phase 0 does not delete them. Downstream the character flows through as plain text (the registry has no entry for the position so it is treated as ordinary content). Editors that want to surface the collision visually can match on this code; ordinary HTML rendering is unaffected.
aozora::lex::unclosed_bracket
|青梅《おうめ
An open delimiter (|, 《, [, 〔, 「, …) reached
end-of-input with no matching close on the pairing stack.
Recovery: no PairLink is emitted for the orphaned
opener (Unclosed opens have no partner span and would only
confuse editor highlights). Phase 3 then sees no Aozora construct
covering the unclosed open and degrades the whole region to plain
text — the bytes from the opener to EOF are preserved literally,
just without ruby / annotation classification.
aozora::lex::unmatched_close
》orphaned
A close delimiter saw an empty pairing stack, or its PairKind
mismatched the stack top.
Recovery: the stray close is not matched against any opener;
no PairLink is emitted. The bytes flow through as plain text,
preserving the user’s content; nothing on the stack pops. The
diagnostic span points at the close itself so editors can surface
it without corrupting the document tree.
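A toy pairing-stack scanner makes both recovery paths concrete (kinds, byte positions, and diagnostic strings here are illustrative, not the real PairKind or Diagnostic types): openers push, a matching close pops, a stray or mismatched close is reported without popping, and leftover openers become unclosed-bracket reports at end of input.

```rust
#[derive(PartialEq, Clone, Copy)]
enum PairKind { Ruby, Bracket }

// Scan for 《…》 and [ … ] pairs; never abort, only report.
fn scan(input: &str) -> Vec<String> {
    let mut stack: Vec<(PairKind, usize)> = Vec::new();
    let mut diags = Vec::new();
    for (pos, c) in input.char_indices() {
        match c {
            '《' => stack.push((PairKind::Ruby, pos)),
            '[' => stack.push((PairKind::Bracket, pos)),
            '》' | ']' => {
                let kind = if c == '》' { PairKind::Ruby } else { PairKind::Bracket };
                match stack.last() {
                    Some(&(k, _)) if k == kind => { stack.pop(); }
                    // Stray or mismatched close: report, pop nothing.
                    _ => diags.push(format!("unmatched_close at {pos}")),
                }
            }
            _ => {}
        }
    }
    // Openers still on the stack at EOF get no partner span.
    for (_, pos) in stack {
        diags.push(format!("unclosed_bracket at {pos}"));
    }
    diags
}

fn main() {
    assert_eq!(scan("《a》"), Vec::<String>::new());
    assert_eq!(scan("》"), vec!["unmatched_close at 0".to_string()]);
    assert_eq!(scan("《a"), vec!["unclosed_bracket at 0".to_string()]);
}
```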
Internal codes
Internal-source diagnostics indicate library bugs — production
parses on well-formed input never emit these. They are kept
publicly visible so tooling can distinguish “user input has a
problem” from “the library has a problem”; the parse still
completes best-effort to keep editors usable.
| Code | What broke |
|---|---|
| residual_annotation_marker | An [# digraph survived classification — a recogniser is missing for the contained keyword. |
| unregistered_sentinel | A PUA sentinel is in normalised text without a registry entry. |
| registry_out_of_order | The placeholder-registry vector is not strictly position-sorted. |
| registry_position_mismatch | A registry entry references a normalised position whose codepoint is not the expected sentinel kind. |
Recovery: the parser never acts on internal diagnostics —
the problematic stretch flows through as plain text, the diagnostic
records what was wrong, and Document::parse returns normally.
Reproductions belong in aozora-spec test fixtures so the bug
surface keeps shrinking over releases.
What recovery is not
The parser does not attempt fix-it suggestions. There is no
“did you mean [#ここで字下げ終わり]?” guess; the diagnostic’s
help text describes the symptom, not the cure. Higher-level
tooling (LSPs, editor extensions) is the right place for fix-it
proposals — they have user context the parser does not.
The parser also does not try to synthesise missing tokens. A
truly unclosed bracket stays unclosed in the tree; we don’t insert
a phantom 》 to “balance” it. Synthesising tokens hides the
diagnostic from any caller that walks the AST instead of the
diagnostic list, and turns a fixable user error into a silent
correction.
Cross-references
- Diagnostics catalogue — code-by-code reference, including the [#改ページ]-family directives this page does not cover.
- Architecture → Seven-phase lexer — which pipeline phase emits which code.
- Wire format → DiagnosticWire — the JSON shape every binding (FFI, WASM, Python) carries diagnostics over.
tree-sitter reference grammar
aozora ships a tree-sitter grammar at
grammars/aozora.tree-sitter/grammar.js as a reference
implementation alongside the canonical Rust parser. When the two
disagree the Rust parser wins; this grammar exists to plug Aozora
documents into the tree-sitter ecosystem (neovim, helix,
web-tree-sitter / CodeMirror) and to serve as a teaching artefact.
Why a separate grammar at all
The Rust parser is a seven-phase pipeline with a hand-rolled classifier; reading it tells you how the canonical implementation works but not what the spec accepts. A declarative grammar is the language community’s preferred form for “what the spec accepts.” Shipping one alongside the parser lets external tooling consume Aozora without binding to the Rust ABI.
What it does cover
The grammar handles bracket structure faithfully:
- |base《reading》 and base《reading》 — explicit / implicit ruby
- 《《content》》 — double-bracket bouten
- ※[#...] — gaiji marker
- [#...] — generic bracket annotation
- 〔...〕 — tortoise-bracket / accent-decomposition span
Plain text — any byte that is not one of the bracket openers —
flows through as a plain_text token, keeping the grammar lossless
against the byte stream.
What it deliberately does not cover
Three classes of behaviour are intentionally out of reach:
- Stateful container pairing. [#ここから2字下げ] matches [#ここで字下げ終わり] across intervening content; a context-free grammar without a hand-written scanner.c cannot close this. Consumers rely on the body content of the bracket annotation to recognise the pairing themselves, or fall back to the Rust parser.
- Forward 「target」に傍点 resolution. The bouten directive walks back through preceding text to bind to a quoted run. The grammar accepts the directive faithfully; the lookup stays the consumer’s job.
- Ruby base disambiguation. When the glyph run preceding 《...》 could extend further, the Rust classifier uses a more nuanced rule. The grammar accepts the greedy base match uniformly.
A scanner.c extension could plug some of these gaps, but doing
so contradicts the declarative-reference framing of the artefact
and would put the canonical-parser-replacement question on the
table prematurely.
Status
The grammar covers approximately 40 % of the canonical parser’s
constructs as measured by an unweighted variant count. The gap to
full coverage is documented; closing it would require a scanner.c
extension, which trades the declarative-reference framing for a
higher ceiling.
Cross-references
- Architecture → Concrete syntax tree — the rowan-backed in-process equivalent.
- Conformance suite — a future xtask conformance run --implementation tree-sitter will run the fixture set against this grammar to compute the per-tier pass rate against must / should / may.
- grammars/aozora.tree-sitter/README.md — build instructions.
Crate map
aozora is an 18-crate workspace. The split exists for three reasons:
narrow each crate’s compile surface (faster cargo check), pin
dependency boundaries (cycles are forbidden by the layout), and let
each binding (CLI, WASM, FFI, Python) compose only the layers it
needs.
At a glance
flowchart TD
subgraph foundation
spec
end
subgraph types
veb
syntax
encoding
scan
end
subgraph parser
lexer
lex
render
end
subgraph facade
aozora_facade["aozora"]
end
subgraph bindings
cli
ffi
wasm
py
end
subgraph dev
bench
corpus
test_utils["test-utils"]
trace
xtask
end
spec --> veb
spec --> syntax
spec --> encoding
spec --> scan
veb --> lexer
syntax --> lexer
encoding --> lexer
scan --> lexer
lexer --> lex
lex --> render
render --> aozora_facade
aozora_facade --> cli
aozora_facade --> ffi
aozora_facade --> wasm
aozora_facade --> py
aozora_facade --> bench
corpus --> bench
test_utils --> lexer
trace --> xtask
Per-crate purpose
Foundation
| Crate | Role |
|---|---|
| aozora-spec | Single source of truth for shared types: Span, Diagnostic, TriggerKind, PairKind, PUA sentinel codepoints. No internal dependencies — every other crate may depend on it. |
Types & primitives
| Crate | Role |
|---|---|
| aozora-veb | no_std Eytzinger-layout sorted-set lookup. Cache-friendly binary search for sub-256-entry registries. |
| aozora-syntax | AST node types — AozoraNode<'src>, Container<'src>, Bouten<'src>, Ruby<'src>, …. Borrows from the bumpalo arena. |
| aozora-encoding | Shift_JIS decoding, JIS X 0213 patch, 外字 PHF resolver, accent decomposition. |
| aozora-scan | SIMD-friendly multi-pattern byte scanner. The only crate (besides aozora-ffi) that locally relaxes unsafe_code — for aligned-load SIMD intrinsics. |
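The Eytzinger layout that aozora-veb is described as using can be sketched in a few lines: a sorted array is re-laid out in BFS (breadth-first) order so that a search walks children at indices 2i+1 / 2i+2, touching memory in a cache-friendlier pattern than classic binary search. The function names below are illustrative, not the crate's API.

```rust
// Re-lay a sorted slice into Eytzinger (BFS) order: an in-order walk
// of the implicit tree over indices 0..n assigns the sorted values.
fn eytzinger_from_sorted(sorted: &[u32]) -> Vec<u32> {
    fn fill(sorted: &[u32], out: &mut [u32], i: usize, k: &mut usize) {
        if i < out.len() {
            fill(sorted, out, 2 * i + 1, k); // left subtree
            out[i] = sorted[*k];
            *k += 1;
            fill(sorted, out, 2 * i + 2, k); // right subtree
        }
    }
    let mut out = vec![0; sorted.len()];
    let mut k = 0;
    fill(sorted, &mut out, 0, &mut k);
    out
}

// Branch-predictable descent: index arithmetic instead of lo/hi halving.
fn contains(eyt: &[u32], key: u32) -> bool {
    let mut i = 0;
    while i < eyt.len() {
        if eyt[i] == key { return true; }
        i = if key < eyt[i] { 2 * i + 1 } else { 2 * i + 2 };
    }
    false
}

fn main() {
    let sorted: Vec<u32> = (0..15).map(|x| x * 3).collect();
    let eyt = eytzinger_from_sorted(&sorted);
    assert!(contains(&eyt, 21));
    assert!(!contains(&eyt, 22));
}
```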
Parser
| Crate | Role |
|---|---|
| aozora-lexer | Seven-phase classifier pipeline (sanitize → scan → tokenize → classify → pair → resolve → diagnostics). Emits the diagnostic catalogue. |
| aozora-lex | Streaming orchestrator — fused lex_into_arena over the lexer’s per-phase calls. The front door for the public crate. |
| aozora-render | HTML and canonical-serialisation walkers. Single O(n) tree pass each; no allocation outside the output buffer. |
Facade
| Crate | Role |
|---|---|
| aozora | Public facade. Document::parse() -> AozoraTree<'_>, tree.to_html(), tree.serialize(), tree.diagnostics(). The single import for library consumers. |
Bindings
| Crate | Role |
|---|---|
| aozora-cli | The aozora binary (check / fmt / render). |
| aozora-ffi | C ABI driver. Opaque handles, JSON-encoded structured data. Locally relaxes unsafe_code; every block carries a // SAFETY: comment. |
| aozora-wasm | wasm32-unknown-unknown target with wasm-bindgen exports. |
| aozora-py | PyO3 binding shipped via maturin. |
Development-only
| Crate | Role |
|---|---|
| aozora-bench | Criterion + corpus-driven probes. Source of the PGO training data. |
| aozora-corpus | Corpus source abstraction (zstd-archived, blake3-pinned). Dev-only. |
| aozora-proptest | Shared proptest strategies. Dev-only. |
| aozora-trace | DWARF symbolicator + samply gecko-trace loader. Dev-only. |
| aozora-xtask | Host-side dev tooling (samply wrapper, trace analysis, corpus pack/unpack). Not on the just build path. |
Why 18 crates?
Three concrete wins from the split.
1. Compile latency
A single-crate workspace with the same code would force a full re-compile on any internal change. With 18 crates, a change in the renderer doesn’t touch the lexer, scanner, or any of the bindings — incremental compile times stay sub-second on iteration.
2. No-std reach
aozora-veb and aozora-spec are no_std-clean. That matters for
the wasm32 build (where std is a real cost) and would matter for
embedded targets if anyone ever needed one. Keeping them in dedicated
crates enforces the no_std discipline at the crate-graph level —
adding a std import would require depending on a std-using crate,
which is a visible Cargo.toml change.
3. Binding modularity
The C ABI driver (aozora-ffi) needs aozora + serde_json and
nothing else. It does not pull in the bench harness, the trace
loader, or the corpus crate. The wasm driver is similarly minimal.
Each binding’s dependency closure is exactly what it needs — which
is what keeps the wasm bundle inside its 500 KiB budget.
What we deliberately don’t split
A few things stay co-located despite plausible split points:
- HTML render and canonical serialise in aozora-render. Both are tree walkers; sharing the walk() helper between them keeps the implementation small.
- Phase 0 sanitize sub-passes in aozora-lexer. Each sub-pass is < 100 LOC and operates on the same &str slice; pulling them out would create a 5-crate ecosystem for a transformation that’s conceptually one phase.
- Trigger-byte enum and pair-kind enum in aozora-spec. They’re used by both aozora-scan (which produces them) and aozora-lexer (which consumes them); putting them in spec avoids a back-reference.
Splits aren’t free — every additional crate adds a Cargo.toml, a
README, doc-link reachability, and a test surface. Splits land when
the cohesion benefit (one of the three above) is real.
See also
- Pipeline overview
- Borrowed-arena AST
- Reference → API — generated rustdoc for the public surface.
Rust library
The first-class binding. Full type safety, zero copy, and the borrowed-arena AST exposed directly.
Adding to a project
The recommended Cargo.toml snippet (with the current release tag)
lives in the install chapter.
Keeping the pin in one place avoids drift between this doc and the
install page when a new release lands.
crates.io publication tracks the v1.0 API freeze; until then, the git tag form documented there is the canonical entry point.
Surface
The public surface is small by design — three types and four methods cover everything:
pub struct Document { /* opaque */ }
impl Document {
pub fn new(source: String) -> Self;
pub fn parse(&self) -> AozoraTree<'_>;
pub fn source(&self) -> &str;
}
pub struct AozoraTree<'a> { /* borrows from Document */ }
impl<'a> AozoraTree<'a> {
pub fn nodes(&self) -> impl Iterator<Item = AozoraNode<'a>>;
pub fn to_html(&self) -> String;
pub fn serialize(&self) -> String;
pub fn diagnostics(&self) -> &[Diagnostic];
}
pub enum AozoraNode<'src> { Plain(&'src str), Ruby(Ruby<'src>), … }
See Library Quickstart for the walk-through.
Feature flags
aozora exposes one optional feature:
| Feature | Default | What it enables |
|---|---|---|
| serde | off | serde::Serialize / Deserialize impls on AozoraNode, Diagnostic, Span. Useful for downstream tools that need to ship the AST over a wire. |
The default-off policy keeps a plain cargo build of aozora slim — the JSON
encoders that the bindings need live in the bindings themselves
(aozora-ffi, aozora-wasm, aozora-py), not in the core crate.
Error handling
Three philosophies, used consistently:
- Diagnostics are not errors. Document::parse() always returns an AozoraTree<'_>. Per-input diagnostics live in tree.diagnostics(). Callers decide whether to treat any diagnostic as fatal.
- Decoding is fallible. aozora_encoding::sjis::decode_to_string returns Result<Cow<str>, DecodeError>. Malformed Shift_JIS is the one place a function actually fails — the parser proper assumes UTF-8.
- Panics are bugs. No .unwrap() on user-data paths in non-test code; clippy’s unwrap_used and expect_used lints are warned workspace-wide. If you ever see a panic in aozora::*, file a bug.
Thread safety
Document is Send but not Sync — the bumpalo arena does not
support concurrent allocation. Pass a Document between threads
freely; do not share &Document across threads.
AozoraTree<'_> borrows from &Document, so by Rust’s lifetime
rules the same shape applies: a &AozoraTree is Send + Sync (it’s
just & to immutable data), but it can’t outlive its Document.
For parallel corpus processing (e.g. the corpus sweep harness
parsing 1000s of documents concurrently), each thread creates its
own Document from its own source. The arena resets per-Document,
so there’s no contention point.
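The per-thread-ownership pattern can be sketched with a toy stand-in for Document (the Doc type below is hypothetical — it just mirrors the Send-but-not-Sync shape via Cell state): each thread owns its document outright, so no synchronisation is needed.

```rust
use std::cell::Cell;
use std::thread;

// Toy stand-in for Document: owns its source and some interior Cell
// state, so it is Send but not Sync — the shape described above.
struct Doc {
    src: String,
    scratch: Cell<usize>,
}

impl Doc {
    fn new(src: String) -> Self {
        Doc { src, scratch: Cell::new(0) }
    }
    // Pretend "parse": mutates interior state, returns a result.
    fn parse_len(&self) -> usize {
        self.scratch.set(self.src.len());
        self.scratch.get()
    }
}

fn main() {
    let sources = vec!["a".to_string(), "bb".to_string(), "ccc".to_string()];
    // One Doc per thread; each thread allocates and parses independently.
    let totals: Vec<usize> = thread::scope(|s| {
        let handles: Vec<_> = sources
            .into_iter()
            .map(|src| s.spawn(move || Doc::new(src).parse_len()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    assert_eq!(totals.iter().sum::<usize>(), 6);
}
```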
MSRV policy
aozora pins Rust 1.95.0. The MSRV advances roughly once per
quarter, when a new stable feature is needed and the workspace
moves to it. The msrv job in CI gates every PR; Dependabot is
configured to not auto-bump the MSRV pin (manual decision).
Public API stability
Pre-1.0: minor-version bumps may break the API. cargo-semver-checks
runs in CI to catch unintentional breakage between releases, so a
v0.2.x → v0.2.y upgrade is always safe; only v0.x.y →
v0.x+1.y opens the door for breaks.
Post-1.0 (planned): semver discipline. Breaking changes accumulate
on a next branch and ship in a major bump.
See also
- Library Quickstart
- Borrowed-arena AST — the lifetime model.
- Reference → API — generated rustdoc.
WASM (wasm-pack)
The aozora-wasm crate compiles to wasm32-unknown-unknown and
exposes a Document class via wasm-bindgen. The wasm artifact has
a hard 500 KiB size budget after wasm-opt -O3 — measured on every
release.
Build
rustup target add wasm32-unknown-unknown # one-time
wasm-pack build --target web --release crates/aozora-wasm
Outputs land at crates/aozora-wasm/pkg/:
- aozora_wasm_bg.wasm — the binary module
- aozora_wasm.js — the wasm-bindgen JS shim
- aozora_wasm.d.ts — TypeScript types
- package.json — minimal npm-publishable metadata
Why wasm-opt = false in Cargo.toml?
wasm-pack ships its own bundled wasm-opt (via the binaryen crate)
which lags upstream. Recent Rust releases emit bulk-memory opcodes
(memory.copy, memory.fill) that the bundled wasm-opt mishandles
on -O3, occasionally producing artifacts that crash on init. We
disable the bundled run and recommend a fresh wasm-opt invocation
externally:
wasm-opt -O3 \
--enable-bulk-memory \
--enable-mutable-globals \
crates/aozora-wasm/pkg/aozora_wasm_bg.wasm \
-o crates/aozora-wasm/pkg/aozora_wasm_bg.wasm
The post-wasm-opt artifact has a 500 KiB size budget. CI gates on
this number — exceeding it is a release-blocking regression.
Usage
import init, { Document } from "./pkg/aozora_wasm.js";
await init(); // load the .wasm
const doc = new Document("|青梅《おうめ》");
const html = doc.to_html();
const canonical = doc.serialize();
const diagnostics = JSON.parse(doc.diagnostics_json());
console.log(html);
doc.free(); // release the bumpalo arena
In TypeScript, the .d.ts file gives you full type checking on
every method.
API surface
| Method | Returns | Notes |
|---|---|---|
| new Document(source: string) | Document | Copies the JS string into a Rust Box<str>. |
| to_html() | string | Renders to semantic HTML5 with aozora-* class hooks. |
| serialize() | string | Re-emits canonical 青空文庫 source. |
| diagnostics_json() | string | JSON-encoded array of diagnostic objects. |
| source_byte_len() | number | Source byte length, useful for progress UI. |
| free() | — | Explicit drop; otherwise the JS GC eventually releases. |
The diagnostics JSON shape mirrors aozora-ffi’s C ABI:
interface Diagnostic {
code: string; // "E0001", "W0006", …
level: "error" | "warning" | "info";
message: string;
span: { start: number; end: number };
help?: string;
}
Why a hand-written JSON projection over serde-wasm-bindgen?
serde-wasm-bindgen would let us pass the Diagnostic directly to
JS as a structured object — no JSON round-trip needed. We don’t use
it because:
- It pulls in a meaningful chunk of serde_json machinery that bloats the wasm bundle by ~80 KiB.
- The wire format ({ code: "E0001", level: "warning", … }) is exactly what every JS consumer is going to deserialise into anyway.
- It would force a serde::Serialize derivation on every diagnostic-related type in aozora-spec, which the Rust library consumers don’t otherwise need (they take &[Diagnostic] directly).
A small, hand-written JSON emitter (one core::fmt::Write impl, ~60
LOC) costs nothing and keeps the bundle small.
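The core of such an emitter is just string escaping plus object assembly through fmt::Write. A minimal sketch (the function name and emitted fields are illustrative, not the crate's actual emitter):

```rust
use std::fmt::Write;

// Minimal hand-written JSON string emitter: escape and quote one
// string value into an output buffer, no serde involved.
fn push_json_str(out: &mut String, s: &str) {
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            // Other control characters get \u escapes.
            c if (c as u32) < 0x20 => {
                let _ = write!(out, "\\u{:04x}", c as u32);
            }
            c => out.push(c),
        }
    }
    out.push('"');
}

fn main() {
    // Assemble a tiny diagnostic-shaped object by hand.
    let mut out = String::new();
    out.push('{');
    out.push_str("\"code\":");
    push_json_str(&mut out, "E0001");
    out.push(',');
    out.push_str("\"message\":");
    push_json_str(&mut out, "line\nbreak");
    out.push('}');
    assert_eq!(out, "{\"code\":\"E0001\",\"message\":\"line\\nbreak\"}");
}
```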
Why Document.free() and not just GC?
wasm-bindgen does wire Drop to a JS finalizer, but JS finalizers
fire on the GC’s schedule — which can be minutes after the last
reference goes out of scope, especially on Node.js where the GC
batches aggressively. For large documents this means the bumpalo
arena (potentially several MB) sits unreleased.
Explicit .free() is the same idiom every wasm-bindgen library
exposes for resource-heavy types. Consumers that want JS-native
ergonomics wrap the class in their own using helper (TC39 stage-3
explicit resource management).
Browser support
Tier-1 (CI-tested):
- Chrome 110+
- Firefox 110+
- Safari 16+
Tier-2 (works, not in CI):
- Node.js 18+ (use --target nodejs in wasm-pack build)
- Deno 1.30+
The bundle uses bulk-memory and mutable-globals; both have been universally supported since 2021.
Why wasm at all?
The CLI and the Rust library cover Linux / macOS / Windows native; the wasm build covers everywhere else — particularly:
- Browser-side preview / formatter for a 青空文庫 LSP front-end.
- Cloudflare Workers / Vercel Edge / Deno Deploy serverless rendering.
- Notebook environments (Jupyter via pyodide, Observable, Quarto).
The same parser, same diagnostics, same canonical-serialise — across every wasm-runtime host.
See also
- Install
- Architecture → SIMD scanner backends — the wasm32 scanner backend.
C ABI
The aozora-ffi crate compiles to a cdylib + staticlib. The API
is opaque-handle + JSON-encoded structured data — the C side never
sees a Rust type, just opaque pointers and byte buffers.
Build
cargo build --release -p aozora-ffi
# → target/release/libaozora_ffi.{so,dylib,a}
# → target/release/aozora.h (cbindgen-generated)
The build script regenerates aozora.h automatically. After build,
the header lands at:
- target/release/aozora.h — host-side convenience copy
- $OUT_DIR/aozora.h — cargo build-script standard location
#include "aozora.h" and link with -laozora_ffi.
Smoke test
just smoke-ffi
Builds the cdylib, compiles crates/aozora-ffi/tests/c_smoke/smoke.c
against it, runs it end-to-end. CI runs this on every PR — if the
ABI shape changes accidentally, the smoke test fails before the PR
merges.
Minimal C usage
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include "aozora.h"
int main(void) {
const char *src = "|青梅《おうめ》";
AozoraDocument *doc = NULL;
if (aozora_document_new((const uint8_t *)src, strlen(src), &doc) != 0)
return 1;
AozoraBytes html = {0};
if (aozora_document_to_html(doc, &html) != 0) {
aozora_document_free(doc);
return 1;
}
fwrite(html.ptr, 1, html.len, stdout);
aozora_bytes_free(&html);
aozora_document_free(doc);
return 0;
}
API surface
typedef struct AozoraDocument AozoraDocument;
typedef struct {
uint8_t *ptr;
size_t len;
size_t cap;
} AozoraBytes;
extern int32_t aozora_document_new(const uint8_t *src, size_t src_len,
AozoraDocument **out_doc);
extern int32_t aozora_document_to_html(const AozoraDocument *doc,
AozoraBytes *out_html);
extern int32_t aozora_document_serialize(const AozoraDocument *doc,
AozoraBytes *out_canonical);
extern int32_t aozora_document_diagnostics_json(const AozoraDocument *doc,
AozoraBytes *out_json);
extern void aozora_bytes_free(AozoraBytes *bytes);
extern void aozora_document_free(AozoraDocument *doc);
Status codes
| Code | Meaning |
|---|---|
| 0 | Ok |
| -1 | Null input pointer |
| -2 | Input was not valid UTF-8 |
| -3 | Allocation failed |
| -4 | Internal serialisation error |
Memory ownership
Every pointer or AozoraBytes returned by an aozora_* function
must be released by the matching _free call:
| Returned by | Free with |
|---|---|
| aozora_document_new (AozoraDocument *) | aozora_document_free |
| aozora_document_to_html (AozoraBytes) | aozora_bytes_free |
| aozora_document_serialize (AozoraBytes) | aozora_bytes_free |
| aozora_document_diagnostics_json (AozoraBytes) | aozora_bytes_free |
Dropping a handle without _free leaks; freeing then dereferencing
is undefined behaviour. This is the standard ABI contract — any
unsafe { Box::from_raw(...) } mistake on the consumer side
trips both ASan and miri (both run in CI on the FFI test suite).
Why JSON for diagnostics, not a C struct?
Three reasons.
- Variant types. Diagnostic has optional fields (help, sometimes a multi-span). A flat C struct would either lose data or grow nullable pointers everywhere. JSON expresses optionality naturally.
- Schema stability. Adding a new diagnostic field is a backward-compatible JSON change. Adding a field to a C struct breaks every consumer that compiled against the old size.
- Single emitter. The same JSON shape is produced by aozora-wasm (consumed by JS) and aozora-py (consumed by Python). Aligning the C ABI on the same shape means downstream polyglot consumers don’t translate between three different schemas.
The cost is one serde_json::to_string call per
aozora_document_diagnostics_json invocation — a one-shot O(N)
allocation that is a rounding error compared to the parse itself.
Why opaque handle + bytes, not a flat C struct projection?
A flat C struct projection of AozoraTree would require:
- Naming every Rust enum variant in C (not supported cleanly via cbindgen for tagged unions).
- Translating the bumpalo arena into a malloc-backed block contiguous with the tree (which means copying the tree out).
- Pinning the AST shape across the C ABI — internal refactors (e.g. adding a new AozoraNode variant) would break ABI without warning.
The opaque-handle approach keeps the AST entirely Rust-side. C consumers ask for HTML, canonical text, or JSON-encoded diagnostics — three stable shapes that don’t change with internal refactors.
Use from Go / Zig / Nim
Anything with a C FFI. The aozora.h header is plain C99 — no
inline functions, no macros that depend on a compiler-specific
extension, no #pragma. Tested in CI by the smoke test against
gcc, clang, and msvc.
See also
- Install → C ABI
- Bindings → WASM — same JSON diagnostics shape.
Python (PyO3 / maturin)
The aozora-py crate is a PyO3 binding shipped
via maturin.
Install
pip install maturin # one-time
cd crates/aozora-py
maturin develop -F extension-module # install in current venv
# or
maturin build -F extension-module --release # produce a redistributable wheel
The extension-module feature gates the PyO3 import-side machinery
behind a flag, so a plain cargo build --workspace succeeds without
Python development headers installed. CI has both modes covered.
Minimal Python usage
from aozora_py import Document
doc = Document("|青梅《おうめ》")
print(doc.to_html()) # <ruby>青梅<rt>おうめ</rt></ruby>
print(doc.serialize()) # |青梅《おうめ》
print(doc.diagnostics()) # JSON-encoded list of diagnostic dicts
API surface
| Method | Returns | Notes |
|---|---|---|
| Document(source: str) | Document | The constructor copies source into a Rust Box<str>. |
| to_html() -> str | str | Renders to semantic HTML5 with aozora-* class hooks. |
| serialize() -> str | str | Re-emits canonical 青空文庫 source. |
| diagnostics() -> str | str | JSON-encoded list (same schema as the WASM and FFI bindings). |
| source_byte_len() -> int | int | Source byte length. |
The diagnostics JSON shape is shared across every binding — see Bindings → WASM for the schema.
Thread safety: unsendable
The Document type is marked unsendable (PyO3 marker) because
the underlying bumpalo arena uses interior Cell state. Concurrent
access from another Python thread raises a RuntimeError:
import threading
from aozora_py import Document
doc = Document(open("src.txt").read())
def worker(): doc.to_html() # raises RuntimeError on second thread
threading.Thread(target=worker).start() # boom
For parallel corpus processing, create a Document per thread.
The arena resets per-Document, so there’s no contention point;
each thread allocates from its own arena.
Why not Send?
PyO3 classes are Send-capable by default, which enables cross-thread
access for binding types. We opt out via the unsendable marker because:
- Arena correctness. bumpalo::Bump is !Sync — the per-page allocator state isn’t atomic. Making the class sendable would require a mutex around every allocation, which is the cost we designed the arena to avoid in the first place.
- GIL semantics. Python threads share the GIL; “concurrent” in the Python sense is rarely actually parallel. The unsendable marker turns the misuse case into a loud RuntimeError instead of a silent data race.
- Multiprocessing path. The right answer for parallel corpus work is multiprocessing (one Document per process — the arenas are independent by construction). The unsendable marker nudges users toward this.
Why JSON-encoded diagnostics?
Same reason as the WASM binding:
- The wire shape is stable across every binding.
- Avoids forcing a pyclass declaration on every diagnostic-related type.
- Downstream Python consumers call json.loads() once and work with native dicts — no second translation.
The diagnostics() method returns a str, not a list[dict], so
the json.loads is visible to the caller. Hiding it behind a
PyO3 Vec<PyDict> mapping would silently allocate one Python
object per diagnostic per call.
Wheel distribution
aozora-py is not yet on PyPI — public release tracks the v1.0 freeze of the core library. Until then, build wheels locally:
maturin build -F extension-module --release # → target/wheels/*.whl
pip install target/wheels/aozora_py-*.whl
Pre-1.0 distribution will likely use cibuildwheel to ship wheels
for every supported (python, target) combination — that’s the
mainstream path for PyO3 projects in 2026.
See also
- Install → Python
- Bindings → C ABI — same diagnostics JSON shape.
- PyO3 user guide — the binding framework.
Pandoc integration
The aozora-pandoc crate (workspace-internal, available via the
aozora CLI) projects a parsed Aozora document into the
Pandoc AST. Once you have Pandoc JSON, every Pandoc
output format (HTML, EPUB, LaTeX/PDF, DOCX, ODT, MediaWiki, …) is one
shell pipe away.
This is the recommended path if you want to convert Aozora Bunko notation into anything other than the built-in HTML renderer. Adding a new output format means adding a Pandoc filter (or none, if the default Span/Div mapping is enough), not extending the parser crate.
Quickstart
# Pandoc JSON to stdout
aozora pandoc input.txt > out.json
# Or pipe through pandoc directly
aozora pandoc input.txt | pandoc -f json -t html
aozora pandoc input.txt | pandoc -f json -t epub3 -o out.epub
# `--format` is shorthand for the pipe (requires pandoc on PATH)
aozora pandoc input.txt --format html > out.html
aozora pandoc -E sjis legacy.txt -t epub > out.epub
Projection rules
Each AozoraNode variant lifts to a Pandoc construct
carrying a stable CSS class so downstream filters or stylesheets can
specialise the rendering:
| Aozora variant | Pandoc construct | Class on the construct |
|---|---|---|
| Ruby | Span | aozora-ruby |
| ↳ base text | nested Span | aozora-ruby-base |
| ↳ reading text | nested Span | aozora-ruby-reading |
| Bouten | Span over target text | aozora-bouten |
| TateChuYoko | Span | aozora-tate-chu-yoko |
| Gaiji | Span carrying mencode | aozora-gaiji |
| Indent, AlignEnd | empty Span (marker) | aozora-indent / align-end |
| Warichu | Span with two children | aozora-warichu |
| DoubleRuby | Span | aozora-double-ruby |
| Annotation, Kaeriten, HeadingHint | empty Span carrying raw | aozora-annotation / etc. |
| PageBreak | HorizontalRule block | (n/a — semantic block) |
| SectionBreak | empty Div | aozora-section-break |
| AozoraHeading | Header block | aozora-heading |
| Sashie | Para with Image | aozora-sashie |
| Container (字下げ等) | Div wrapping inner blocks | aozora-container-indent / etc. |
The structural key-value pairs (the kvs — the third element of
Pandoc's Attr triple) carry non-textual metadata (bouten kind /
position, gaiji description / mencode, indent amount, container
kind). Filters that want format-native rendering pattern-match on
the class + kvs.
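For orientation, here is a sketch of how a bouten span could appear in the emitted Pandoc JSON. Pandoc's Attr is the triple [identifier, [classes], [[key, value], ...]]; the kvs names below ("kind", "position") are illustrative, not the crate's exact schema:

```json
{
  "t": "Span",
  "c": [
    ["", ["aozora-bouten"], [["kind", "sesame"], ["position", "above"]]],
    [{ "t": "Str", "c": "X" }]
  ]
}
```

A filter targeting, say, DOCX would match on the aozora-bouten class and read the kvs to pick a native emphasis mark.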
Why a Pandoc projection at all
Aozora notation has rich semantic markup (ruby, bouten, tate-chu-yoko,
gaiji…) that no single Pandoc native construct captures. The naive
shortcut of emitting RawInline("html", "<ruby>…</ruby>") would only
work for the HTML writer; every other Pandoc output format would
strip the raw HTML and lose the meaning.
By lifting each Aozora variant to a Span / Div with a stable
class, the same JSON renders sensibly across every Pandoc format
today (each format’s writer renders Span as a stylable container)
and stays open for richer format-native rendering tomorrow via
filters. That’s the same pattern Pandoc itself uses for
[content]{.smallcaps} — semantic in the AST, format-specific in the
writer.
Architecture
The library entry point is aozora_pandoc::to_pandoc:
use aozora::Document;
use aozora_pandoc::to_pandoc;
let doc = Document::new(std::fs::read_to_string("input.txt")?);
let pandoc = to_pandoc(&doc.parse());
let json = serde_json::to_string(&pandoc)?;
aozora-cli wires that into aozora pandoc so binary consumers
don’t need to write Rust.
Release profile & PGO
aozora’s [profile.release] is tuned for cross-crate inlining at
the expense of compile time:
[profile.release]
lto = "fat" # full LTO across the whole workspace
codegen-units = 1 # single CGU so LTO sees everything
strip = "symbols" # smaller binary, faster cold start
panic = "abort" # no unwinding tables in the binary
opt-level = 3
Why fat LTO over thin
A thin LTO build keeps each crate’s IR isolated; the cross-crate inliner only inlines through summary stubs. Fat LTO concatenates every crate’s IR into one module before optimisation, so the inliner can see across the whole pipeline.
For aozora that pays off because the lex pipeline is deep:
aozora-render → aozora → aozora-lex → aozora-lexer Phase
functions, each in its own crate. A function call across that depth
under thin LTO costs four indirect calls and four stack frames; the
fat LTO build folds the chain into ~40 inlined instructions on the
hot per-byte path.
Measured on the corpus sweep: fat LTO is 30%+ faster than thin LTO once the lex orchestrator is split across crates. Compile-time cost is real (release builds take ~3 minutes vs ~1 minute for thin), but release builds happen at tag time, not on every iteration.
Why codegen-units = 1
codegen-units = N splits each crate into N parallel codegen jobs
during compilation. Each unit optimises independently, then the
linker stitches them together. With N > 1 the LLVM inliner can’t
see across unit boundaries inside a single crate — which under fat
LTO defeats half the point.
codegen-units = 1 ensures fat LTO actually sees every function in
every crate. Compile time grows; runtime wins back.
Why panic = "abort"
aozora is a parser, not a server. There’s no panic handler to
recover into — a panic on user input would be a parser bug, not a
recoverable error. panic = "abort":
- Drops the unwinding tables from the binary (~80 KiB savings on the CLI).
- Removes the panic-handling overhead from every function call (the compiler doesn’t insert landing pads).
- Surfaces parser bugs as SIGABRT immediately, which is what we want — a panic always indicates an invariant violation that needs fixing, not a state to gracefully degrade through.
For library consumers that want unwinding (e.g. embedding in a long-running server), the dependency-mode build inherits the consumer’s profile, so this only affects the binaries we publish.
Profile-guided optimisation (PGO)
The release pipeline supports PGO via scripts/pgo-build.sh:
./scripts/pgo-build.sh
Three-stage build:
- Instrumented build — cargo build --release with RUSTFLAGS="-Cprofile-generate=/tmp/pgo-data". The resulting binary is slower than vanilla release because of the instrumentation overhead.
- Profile collection — run the corpus sweep against the instrumented binary. The corpus must contain a representative spread of document sizes and notation density. The aozora-bench throughput_by_class probe handles this.
- Final build — cargo build --release with RUSTFLAGS="-Cprofile-use=/tmp/pgo-data/merged.profdata" (the raw profiles are first merged into merged.profdata, conventionally via llvm-profdata merge). LLVM uses the profile to drive its inliner, branch-prediction hints, and basic-block ordering decisions.
Measured win on the corpus sweep: 8–12% faster than non-PGO release build. The cost is operational complexity (the build-script needs a real corpus available); the win compounds with fat LTO, since both target the same hot paths.
BOLT (post-link optimisation)
BOLT is the next layer after PGO: it reorders basic blocks in the
final binary based on the same profile. scripts/pgo-build.sh ends
with an optional BOLT pass when llvm-bolt is on PATH.
BOLT wins another ~3% on top of PGO, mostly by improving I-cache density for the lex hot path. The win is smaller than PGO’s because PGO already used the profile during compilation; BOLT only refines the final binary’s layout.
Tricks we deliberately do not use
- -Cforce-frame-pointers=yes — would help samply unwind on some platforms, but the workspace [profile.bench] covers the profiling case (debug = 1 + strip = none). Release builds get the smaller binary.
- unsafe perf shortcuts — unsafe_code = "forbid" at the workspace level. Three crates locally relax it (FFI / scan / xtask), each with // SAFETY: comments and #[deny(unsafe_op_in_unsafe_fn)]. Where a perf opportunity needs unsafe, we measure it first and cite the win in the comment.
- #[inline(always)] — used sparingly. The compiler's default heuristics have improved enough that forcing inlining usually costs binary size for negligible win. Where it does help (e.g. the per-byte scanner inner loop), the call site has a measurement comment.
See also
- Profiling with samply — how to measure whether a perf change helped.
- Benchmarks — the harness that produces the PGO profile.
- Corpus sweeps — the input the bench harness consumes.
Profiling with samply
samply is the workspace’s
sampling profiler. It produces .json.gz traces in the
Firefox-Profiler gecko format
that can be loaded into the web UI for visual analysis, or fed to
the in-tree aozora-trace crate for automated rollups.
Quick commands
# Single corpus document
AOZORA_CORPUS_ROOT=/path/to/corpus \
just samply-doc 001529/files/50685_ruby_67979/50685_ruby_67979.txt
# Full corpus, parser-bound (5 parse passes after the one-time load)
AOZORA_CORPUS_ROOT=/path/to/corpus just samply-corpus
# Full corpus, render-bound
AOZORA_CORPUS_ROOT=/path/to/corpus just samply-render
# Open in Firefox-Profiler
samply load /tmp/aozora-corpus-<timestamp>.json.gz
All three are wrappers over the aozora-xtask samply subcommand,
which:
- Builds the bench probe with --profile=bench (debug info preserved).
- Runs samply against the resulting binary.
- Drops the .json.gz in /tmp/.
Why these run on the host (not Docker)
samply uses perf_event_open(2) for kernel sampling. Docker’s
default seccomp profile blocks that syscall. The xtask binary
therefore runs on the host (not via docker compose run) and the
Justfile recipes are exempt from the workspace’s normal
“everything in Docker” policy.
The recipes check /proc/sys/kernel/perf_event_paranoid on entry
and print the fix-up command if the value is too high (default 2;
needs to be ≤ 1 for unprivileged sampling):
echo 1 | sudo tee /proc/sys/kernel/perf_event_paranoid
Why --profile=bench and not --release
cargo build --release uses [profile.release], which has
debug = 0 + strip = "symbols". Samply still records samples,
but they show up as raw addresses (0x8fb61) instead of function
names — every sample becomes useless to a human reader.
The workspace [profile.bench] inherits from release but sets
debug = 1 + strip = "none". The xtask wrappers automatically
build with --profile=bench. If you launch samply manually, do the
same.
Corpus load dominates a single-pass trace
throughput_by_class and render_hot_path spend most wall time in
Shift_JIS decode + filesystem I/O during the one-time corpus load.
A single-pass samply trace puts __memmove_avx_unaligned and
encoding_rs::ShiftJisDecoder at the top — not the parser.
Fix: set AOZORA_PROFILE_REPEAT=K (or pass K to
just samply-corpus) so the parse pass runs K times after the
load. The xtask defaults to 5; raise to 10+ for very small corpora.
Trace analysis from the CLI
aozora-xtask trace … (and the just trace-* shortcuts) load
saved .json.gz traces, symbolicate them via the aozora-trace
crate (DWARF lookup is pure-Rust through addr2line::Loader), and
run the bundled analyses.
# 1. One-time per trace: write the symbol cache next to it
just trace-cache /tmp/aozora-corpus-<ts>.json.gz
# 2. Analyses (cache is auto-loaded if present)
just trace-libs /tmp/aozora-corpus-<ts>.json.gz # binary vs libc vs vdso
just trace-hot /tmp/aozora-corpus-<ts>.json.gz 25 # top-25 hot leaf frames
just trace-rollup /tmp/aozora-corpus-<ts>.json.gz # bucketed by aozora's built-in categories
just trace-stacks /tmp/aozora-corpus-<ts>.json.gz 'teddy' 5 # full call chains hitting any frame matching `teddy`
just trace-compare /tmp/before.json.gz /tmp/after.json.gz 25 # before/after diff
just trace-flame /tmp/aozora-corpus-<ts>.json.gz | flamegraph.pl > flame.svg
Each analysis returns a typed report — HotReport, LibraryReport,
RollupReport, ComparisonReport, MatchedStacksReport,
FlameReport — whose module docstring explains the algorithm.
Why a pure-Rust DWARF symbolicator?
The mainstream alternative is shelling out to addr2line(1) from
binutils. We don’t because:
- Process spawn cost. A typical trace has 5 000+ unique addresses; spawning addr2line per address is unworkable. Pipelining through a single subprocess works but ties symbolisation to the presence of binutils on PATH (not always true on minimal containers).
- Build-id verification. The aozora-trace::Symbolicator checks the binary's gnu-build-id against the trace's codeId, so rebuilding between recording and analysis fails loudly rather than producing wrong symbol names. addr2line(1) has no such check.
- Caching. The symbolicator writes a sidecar <trace>.symbols.json on first call (~100 ms per binary) and reads from it on every subsequent call (instant). Re-running addr2line per analysis would re-walk DWARF every time.
Verifying the SIMD scanner is firing
// In any binary or test
println!("{}", aozora_scan::best_scanner_name());
// "teddy" | "hoehrmann-dfa" | "memchr-multi"
Or under samply, look for aozora_scan::backends::teddy::scan_offsets
in the trace’s call tree. If the trace shows
memchr::arch::x86_64::avx2::* instead, you’re on the scalar
fallback (which uses memchr’s own SIMD dispatch internally — still
SIMD, just not aozora-scan’s).
Workflow recipes
“I changed something, did I regress?”
# Microbench the per-band tokenizer throughput
cargo bench -p aozora-lex --bench tokenize_compare
# Macrobench the full pipeline end-to-end
AOZORA_CORPUS_ROOT=… cargo run --release --example throughput_by_class -p aozora-bench
AOZORA_CORPUS_ROOT=… cargo run --release --example render_hot_path -p aozora-bench
# Check the worst doc didn't regress
AOZORA_CORPUS_ROOT=… AOZORA_PROBE_DOC=000286/files/49178_ruby_58807/49178_ruby_58807.txt \
cargo run --release --example pathological_probe -p aozora-bench
“Where is lex_into_arena spending its time?”
# Macroscopic per-phase split
AOZORA_CORPUS_ROOT=… cargo run --release --example phase_breakdown -p aozora-bench
# Latency tail shape
AOZORA_CORPUS_ROOT=… cargo run --release --example latency_histogram -p aozora-bench
# Microscopic: which classify recogniser dominates a specific doc?
AOZORA_CORPUS_ROOT=… AOZORA_PROBE_DOC=… \
cargo run --release --features instrument --example pathological_probe -p aozora-bench
See also
- Benchmarks — the per-probe descriptions.
- Corpus sweeps — corpus setup and AOZORA_* env vars.
Benchmarks (criterion)
aozora ships two layers of perf measurement:
- Criterion microbenchmarks in crates/aozora-lex/benches/ and crates/aozora-render/benches/. Reproducible per-function timings with statistical confidence intervals.
- Corpus probes in crates/aozora-bench/examples/. Each probe is a cargo run --release --example <name> binary that reports per-band statistics across a real corpus.
Criterion microbenchmarks
Run a specific bench:
cargo bench -p aozora-lex --bench tokenize_compare
cargo bench -p aozora-render --bench html_emit
Criterion writes HTML reports under target/criterion/. Each bench
reports throughput in MB/s, ns/byte, and a confidence interval; the
HTML reports include violin plots that surface multi-modal latency
distributions (which often indicate cache-line or page-fault
effects we’d otherwise miss).
Why criterion over #[bench]
Three reasons.
- Statistical rigour. #[bench] reports the minimum of N iterations; criterion fits a model and reports a confidence interval. The minimum is a known-bad estimator on a system with any noise (which is every real machine).
- Iteration-count auto-tuning. Criterion picks the iteration count to reach a target precision; #[bench] requires a hand-picked count.
- Stability. #[bench] is unstable Rust and only works on nightly. Criterion is stable Rust.
Corpus probes
Each probe under crates/aozora-bench/examples/ reports a different
slice of the workload. All read AOZORA_CORPUS_ROOT; most accept
AOZORA_PROFILE_LIMIT=N to cap the sweep.
| Probe | Question it answers | Output shape |
|---|---|---|
throughput_by_class | Per-band MB/s for lex_into_arena | 4-band table + p50 / p90 / p99 / max + ns/byte |
phase_breakdown | Per-phase ms for sanitize / tokenize / pair / classify | per-doc latencies + top-5 worst classify / sanitize |
latency_histogram | Log-bucketed latency distribution per phase | bar histogram, 10 buckets, 1 µs … 1 s |
pathological_probe | Single-doc 100-iter avg per phase | tight per-call numbers; takes AOZORA_PROBE_DOC for any corpus path |
phase0_breakdown | Per-sub-pass cost inside Phase 0 sanitize | bom_strip / crlf / rule_isolate / accent / pua_scan |
phase0_impact | Does Phase 0 sub-pass firing change Phase 1 cost? | bucketed by which sub-passes fired |
phase3_subsystems | Per-recogniser ms inside classify | requires --features instrument |
diagnostic_distribution | What fraction of docs emit diagnostics? | histogram by diag count; latency-by-diag-bucket |
allocator_pressure | Arena bytes / source byte ratio + intern dedup | per-doc histograms |
fused_vs_materialized | Does the deforestation actually win? | per-band gap % between fused (lex_into_arena) and materialized (per-phase collect) |
intern_dedup_ratio | How well does the interner dedup short strings? | corpus-aggregate (cache + table) / calls |
render_hot_path | Per-band MB/s for HTML render | 4-band MB/s + render/parse ratio + out/in size ratio |
Each probe is invoked directly:
AOZORA_CORPUS_ROOT=… cargo run --release --example <name> -p aozora-bench
For phase3_subsystems, build with the instrumentation feature:
AOZORA_CORPUS_ROOT=… cargo run --release --features instrument \
--example phase3_subsystems -p aozora-bench
Why corpus probes and criterion benches?
Different questions.
- Criterion answers "is function X faster after my change?" on a fixed input. Microscopic, reproducible, the right tool for optimising a single hot loop.
- Corpus probes answer "is the parser faster on the real Aozora Bunko catalogue after my change?" Macroscopic, includes every distribution effect (small-doc dispatch overhead, large-doc cache pressure, gaiji-density variation). The right tool for validating a perf PR end-to-end.
A perf PR that wins on criterion but loses on the corpus is suspicious — usually it’s optimised the small-input path at the cost of the large-input path. The corpus probe catches it.
Phase 3 instrumentation caveat
phase3-instrument wraps every recogniser entry in a
SubsystemGuard that calls Instant::now() on construction +
drop. For the dominant inner-loop recognisers this adds enough
overhead that the report’s own timing is significantly skewed.
Use the instrumentation to compare relative costs between
subsystems, not as an absolute number. For absolute numbers, run
phase_breakdown (no instrumentation).
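The guard pattern is small enough to sketch in full (a hypothetical simplification; the real SubsystemGuard presumably also records which recogniser it wraps):

```rust
use std::cell::Cell;
use std::time::Instant;

// RAII timing guard: stamps Instant::now() on construction and
// accumulates the elapsed time into per-subsystem counters on drop.
struct SubsystemGuard<'a> {
    start: Instant,
    calls: &'a Cell<u64>,
    total_ns: &'a Cell<u64>,
}

impl<'a> SubsystemGuard<'a> {
    fn enter(calls: &'a Cell<u64>, total_ns: &'a Cell<u64>) -> Self {
        Self { start: Instant::now(), calls, total_ns }
    }
}

impl Drop for SubsystemGuard<'_> {
    fn drop(&mut self) {
        self.calls.set(self.calls.get() + 1);
        self.total_ns
            .set(self.total_ns.get() + self.start.elapsed().as_nanos() as u64);
    }
}
```

The two Instant::now() calls per recogniser entry are exactly the overhead the caveat above warns about: cheap in absolute terms, but not relative to a recogniser that inspects a handful of bytes.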
Where to look in samply
If a corpus probe regresses, sample-profile the same workload:
AOZORA_CORPUS_ROOT=… just samply-corpus 5
samply load /tmp/aozora-corpus-<ts>.json.gz
# or
just trace-rollup /tmp/aozora-corpus-<ts>.json.gz
The trace-rollup analysis groups samples into aozora’s built-in
categories (Phase 0/1/2/3/4 + corpus_load + intern + alloc + …) so
a regression’s category jumps out at a glance.
See also
- Profiling with samply — the trace workflow.
- Corpus sweeps — what AOZORA_CORPUS_ROOT should point at.
- Release profile & PGO — the build profile that produces these numbers.
Corpus sweeps
aozora’s tier-A acceptance gate is a corpus sweep: every Aozora
Bunko work parses without panicking, and the
parse ∘ serialize ∘ parse round-trip is stable. The corpus has
~17 000 works in active rotation; sweeping the lot takes ~90 s on a
modern x86_64 desktop.
Setting up the corpus
AOZORA_CORPUS_ROOT should point at a directory containing the
unpacked Aozora Bunko tarball:
$AOZORA_CORPUS_ROOT/
├── 000001/
│ └── files/
│ └── 18310_ruby_01058/
│ └── 18310_ruby_01058.txt ← Shift_JIS .txt source
├── 000002/
│ └── files/
│ └── …
└── …
The structure mirrors the upstream aozorabunko repo. Set the env var once in your shell:
export AOZORA_CORPUS_ROOT=/path/to/aozorabunko
Every probe, every sample-profile recipe, and the corpus sweep test suite reads it.
Running the sweep
just corpus-sweep
Wraps the aozora-corpus crate’s ParallelSweep runner. Iterates
every .txt file under $AOZORA_CORPUS_ROOT, parses it, verifies:
- No panic.
- tree.diagnostics() count is within an expected envelope.
- parse(serialize(parse(source))) == parse(source) (round-trip property).
- Render emits valid UTF-8 HTML (no broken byte sequences).
Failure: prints the offending document path + diagnostic, exits non-zero.
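The round-trip property is easiest to see on a toy stand-in. Here a whitespace-normalising split/join plays the role of aozora's parse and serialize (names reused for illustration only):

```rust
// Toy stand-ins for the real parse / serialize pair: "parsing"
// discards insignificant whitespace, "serialising" prints the
// canonical form, so one round-trip reaches a fixed point.
fn parse(src: &str) -> Vec<String> {
    src.split_whitespace().map(String::from).collect()
}

fn serialize(tree: &[String]) -> String {
    tree.join(" ")
}

fn round_trip_stable(src: &str) -> bool {
    let once = parse(src);
    parse(&serialize(&once)) == once
}
```

The sweep asserts this fixed-point property for every document: re-parsing the canonical form must reproduce the same tree.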
Why blake3 / zstd for the archive variant?
aozora-corpus ships an archive mode: the corpus packed into a
single .zst file with a blake3 manifest. This is what CI uses
(the corpus is downloaded once per workflow run and unpacked
in-memory).
- blake3 for per-entry content-addressed hashing. Used so the archive packer can detect “this work hasn’t changed since the last build” and skip re-encoding it. blake3 over sha256: ~10× faster on the same data, no security trade-off for our use case (we’re not signing anything, just diffing).
- zstd for compression. Frame-level random access matters because the ParallelSweep runner wants to mmap individual works on demand without decompressing the whole archive. zstd over gzip / xz: 5–10× faster decompression at comparable ratios.
Both are mainstream crates with safe Rust APIs (the underlying
libzstd is C, but the boundary is hidden behind the zstd crate's safe API).
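The unchanged-work check is plain content addressing, sketched here with std's DefaultHasher standing in for blake3:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Manifest from the previous pack: corpus path -> content digest.
// A work is re-encoded only when its digest no longer matches.
fn digest(bytes: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    bytes.hash(&mut h);
    h.finish()
}

fn needs_reencode(manifest: &HashMap<String, u64>, path: &str, bytes: &[u8]) -> bool {
    manifest.get(path) != Some(&digest(bytes))
}
```

The real packer keys the manifest on blake3 digests, which keeps the check content-addressed (a touched-but-identical file is still a skip) rather than mtime-based.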
Why parallel sweep?
A serial sweep runs through every work one at a time, leaving 15 of
a 16-core machine's cores idle — roughly a 16× wall-clock penalty
over a fully parallel sweep. The ParallelSweep runner uses rayon to
parse documents in parallel, sized to physical cores via
num_cpus::get_physical() — not logical cores.
The reason is memory bandwidth. The parser is bandwidth-bound, not ALU-bound (the SIMD scanner streams the source through L1 once per trigger byte, then the lexer touches each token a few more times). SMT siblings starve each other for cache lines and bus bandwidth, so oversubscribing logical cores actively slows the sweep. Sized to physical, the throughput peaks where the bandwidth ceiling does.
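In spirit, the runner looks like this (a hypothetical simplification using std::thread in place of rayon, and byte-counting in place of parsing):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// Stand-in for the rayon-based ParallelSweep: largest documents
// first, one shared work index, `workers` sized to physical cores
// in the real runner.
fn parallel_sweep(mut docs: Vec<String>, workers: usize) -> usize {
    // Largest-first ordering: a huge document picked up last would
    // leave one worker running long after the others have finished.
    docs.sort_by_key(|d| std::cmp::Reverse(d.len()));
    let next = AtomicUsize::new(0);
    let total = AtomicUsize::new(0);
    thread::scope(|s| {
        for _ in 0..workers {
            s.spawn(|| loop {
                // Claim the next unprocessed document.
                let i = next.fetch_add(1, Ordering::Relaxed);
                let Some(doc) = docs.get(i) else { break };
                // Stand-in for parse(doc): accumulate byte count.
                total.fetch_add(doc.len(), Ordering::Relaxed);
            });
        }
    });
    total.load(Ordering::Relaxed)
}
```

The shared-index claim loop is what gives the largest-first ordering its load-balancing effect: whichever worker finishes early simply claims the next document.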
posix_fadvise(POSIX_FADV_DONTNEED) for honest cold-cache numbers
The xtask corpus uncache command evicts every corpus file from
the kernel page cache before a measurement run:
cargo run -p aozora-xtask --release -- corpus uncache
It uses posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) per file —
no sudo required (unlike echo 1 > /proc/sys/vm/drop_caches, which
needs root and drops every cache, defeating the purpose).
Why this matters: a “fresh” benchmark run that finds the corpus
already warm in the page cache reports throughput numbers that no
cold start can ever achieve. The uncache step makes “cold
benchmark” a real, repeatable thing.
Probes that go corpus-wide
| Probe | What |
|---|---|
throughput_by_class | Per-band MB/s for lex_into_arena. Splits the corpus by document size (small / medium / large / huge). |
phase_breakdown | Per-phase ms per doc. |
latency_histogram | Log-bucketed latency distribution per phase. |
diagnostic_distribution | What fraction of docs emit diagnostics? Histogram by diag count. |
allocator_pressure | Arena bytes / source byte ratio + intern dedup ratio. |
render_hot_path | Per-band render MB/s. |
See Benchmarks for the full list.
Why a dedicated aozora-corpus crate?
Three concerns kept apart from aozora-bench:
- Corpus discovery and loading. Walking the directory, decoding Shift_JIS, applying any per-work filters. This is shared by every probe + by the xtask corpus pack/unpack tooling.
- Archive format. The blake3 + zstd packing/unpacking lives here so the bench harness doesn’t pull in compression libraries.
- Parallel sweep runner. A reusable rayon::par_iter wrapper with the right ordering (largest documents first to balance load).
aozora-bench then builds on this — each probe is a thin
for doc in corpus { measure(doc) } loop, with the corpus crate
handling all the I/O.
Why a separate AOZORA_PROFILE_REPEAT?
samply traces of probes that include corpus loading get dominated
by I/O and Shift_JIS decode (see
Profiling with samply).
Running the parse pass K times per document after the one-time
load gives samply enough parse-bound wall time to catch the
parser hot frames. Default K = 5; raise to 10+ for very small
corpora.
See also
- Benchmarks — the per-probe descriptions.
- Profiling with samply — the trace workflow.
Phase D — Sentinel enum + single-table registry results
The single-table registry collapsed four per-kind sentinel position
tables into one position-keyed EytzingerMap dispatched through a
NodeRef enum. Before the refactor the registry held independent
inline / block_leaf / block_open / block_close EytzingerMaps
and Registry::node_at(pos) swept them in declaration order with
four if let Some(...) = table.get(&pos) chains; the current shape
is one binary search per lookup, with the variant tag carried on the
entry itself.
Structural changes
old : Registry { inline, block_leaf, block_open, block_close } // 4× EytzingerMap
node_at(pos) → 4-way if-let chain, ~4 binary searches worst-case
now : Registry { table: EytzingerMap<u32, NodeRef<'src>> } // 1× EytzingerMap
node_at(pos) → one binary search, NodeRef variant tags the kind
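For readers who have not met the layout: an Eytzinger (breadth-first) array answers the same membership query as binary search over a sorted slice, but the next index is always 2i or 2i+1, which is friendlier to prefetch and branch prediction. A minimal sketch, with plain u32 keys in place of the registry's position-to-NodeRef entries:

```rust
// Build the Eytzinger (breadth-first heap order) layout from a
// sorted slice: an in-order walk of the implicit tree consumes the
// sorted values left to right. Node k's children live at 2k, 2k+1.
fn build(sorted: &[u32], eyt: &mut [u32], mut next: usize, k: usize) -> usize {
    if k <= sorted.len() {
        next = build(sorted, eyt, next, 2 * k);
        eyt[k - 1] = sorted[next];
        next += 1;
        next = build(sorted, eyt, next, 2 * k + 1);
    }
    next
}

// One comparison per level; the child index is pure arithmetic, so
// the loop carries no hard-to-predict pointer chasing.
fn search(eyt: &[u32], key: u32) -> Option<usize> {
    let mut i = 1;
    while i <= eyt.len() {
        let v = eyt[i - 1];
        if v == key {
            return Some(i - 1);
        }
        i = 2 * i + usize::from(v < key);
    }
    None
}
```

The registry's single table is this idea with each key carrying its NodeRef payload, so node_at(pos) is one such walk.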
Renderers (crates/aozora-render/src/html.rs,
crates/aozora-render/src/serialize.rs) replaced the parallel
4-way if let Some(...) = registry.<kind>.get(...) chains with
a single (Structural, NodeRef) cross-product match — the
compiler now enforces variant coverage at the call site.
Expected runtime impact
Theoretical: per-lookup binary search count drops from ≤ 4 to 1.
Render hot path is dominated by registry lookups inside the
memchr2_iter loop in html::render_into (one lookup per PUA
sentinel hit), so the savings scale with sentinel density. Aozora
corpus profiling against the four-table layout showed registry
lookups at ~12 % of render time on bouten-heavy documents; the
unified dispatch should absorb roughly that fraction.
Measurement procedure
Run before each minor release:
# Take a baseline against the previous release tag
git checkout v0.3.0
just samply-corpus --repeat 5 --out before.json.gz
git checkout -
# Take a current measurement
just samply-corpus --repeat 5 --out after.json.gz
# Diff at the function level
xtask trace compare before.json.gz after.json.gz
Numbers go in the table below at release time:
| Metric | Four-table | Single-table | Δ |
|---|---|---|---|
| Render hot path (corpus median, ns/doc) | to fill | to fill | to fill |
| Registry lookup CPU share (%) | to fill | to fill | to fill |
| End-to-end parse + render p50 (ms/doc) | to fill | to fill | to fill |
Repro environment recorded in perf/samply.md. Pin the host
CPU + corpus version + Rust toolchain so the table is comparable
across releases.
CLI reference
Full reference for the aozora binary. For a guided tour, see
CLI Quickstart.
Synopsis
aozora [OPTIONS] <SUBCOMMAND> [ARGS]
Subcommands:
| Subcommand | What it does |
|---|---|
check | Lex + report diagnostics. |
fmt | Round-trip parse ∘ serialize. |
render | Render to HTML on stdout. |
Global options apply to every subcommand:
| Option | Effect |
|---|---|
-E sjis, --encoding sjis | Decode Shift_JIS source. Default is UTF-8. |
--no-color | Disable ANSI colour in diagnostics output. |
--verbose | Print parse phase timings to stderr. |
--diagnostics LEVEL | Filter diagnostics by minimum level (error | warning | info). Default: warning. |
-V, --version | Print version and exit. |
-h, --help | Print help and exit. |
aozora check
aozora check [OPTIONS] [PATH]
Lex the source and print diagnostics. PATH of - (or omitted)
reads from stdin.
| Option | Effect |
|---|---|
--strict | Exit non-zero on any diagnostic. |
Exit codes: 0 on parse success (regardless of diagnostics, unless
--strict); 1 on diagnostics under --strict; 2 on usage error.
aozora check src.txt # warnings shown, exit 0
aozora check --strict src.txt # warnings -> exit 1
aozora check -E sjis crime.txt # SJIS source
cat src.txt | aozora check # stdin
aozora fmt
aozora fmt [OPTIONS] [PATH]
Round-trip the source through parse ∘ serialize. Default behaviour
prints the canonical form on stdout.
| Option | Effect |
|---|---|
--check | Exit non-zero if the formatted output differs from the input. Don’t print the canonical form. |
--write | Overwrite the input file with the canonical form. (Ignored when reading from stdin.) |
Exit codes: 0 on success (or no diff under --check); 1 on a
formatting mismatch under --check; 2 on usage error.
aozora fmt src.txt > formatted.txt
aozora fmt --check src.txt # CI gate
aozora fmt --write src.txt # in-place
cat src.txt | aozora fmt # stdin → stdout
aozora render
aozora render [OPTIONS] [PATH]
Render the parsed tree to HTML on stdout.
aozora render src.txt > out.html
aozora render -E sjis crime.txt > crime.html
cat src.txt | aozora render -
The output is semantic HTML5 with aozora-* class hooks (no inline
styles). See HTML renderer
for the class-name reference.
Exit codes
| Code | Meaning |
|---|---|
0 | Success. |
1 | Diagnostics emitted under --strict, or formatting mismatch under --check. |
2 | Usage error (bad flag, missing file, decode error). |
Environment
| Variable | Effect |
|---|---|
NO_COLOR | If set (any value), disable ANSI colour output. Same as --no-color. |
AOZORA_LOG | tracing-subscriber filter (e.g. aozora_lex=debug). For internal debugging; not part of the stable surface. |
See Reference → Environment variables for the full env matrix (which includes the bench / profiling vars).
See also
- CLI Quickstart — examples and the three-subcommand rationale.
- Notation overview — what the parser recognises.
- Diagnostics catalogue — the codes you'll see in check's output.
API reference (rustdoc)
The full rustdoc surface for every crate in the workspace is auto-deployed alongside this handbook. Browse it at:
The landing redirects to the top-level facade (aozora); from there
every workspace crate is reachable via the side panel.
Why /api/ instead of docs.rs?
aozora is not yet on crates.io — public release tracks the v1.0 API
freeze. Until then, docs.rs has nothing to render against, so the
rustdoc API reference is built directly from the workspace and
deployed under the GitHub Pages site that serves this handbook.
When the v1.0 release lands and we publish to crates.io, docs.rs
will pick up the API reference automatically; the in-tree /api/
copy will keep working as a mirror, since the GitHub Pages deploy
runs on every main push regardless.
Layout
| Path | What |
|---|---|
/aozora/ (this site) | Handbook (this mdbook) |
/aozora/api/aozora/ | Public facade crate |
/aozora/api/aozora_lex/ | Lexer orchestrator |
/aozora/api/aozora_lexer/ | Seven-phase lexer |
/aozora/api/aozora_render/ | HTML / serialise renderers |
/aozora/api/aozora_syntax/ | AST node types |
/aozora/api/aozora_spec/ | Shared types |
/aozora/api/aozora_scan/ | SIMD scanner |
/aozora/api/aozora_veb/ | Eytzinger sorted-set |
/aozora/api/aozora_encoding/ | SJIS + 外字 |
/aozora/api/aozora_cli/ | CLI binary internals |
/aozora/api/aozora_ffi/ | C ABI driver |
/aozora/api/aozora_wasm/ | WASM driver |
/aozora/api/aozora_py/ | Python binding |
/aozora/api/aozora_bench/ | Bench probes |
/aozora/api/aozora_corpus/ | Corpus runner |
/aozora/api/aozora_proptest/ | Proptest strategies |
/aozora/api/aozora_trace/ | Samply trace loader |
/aozora/api/aozora_xtask/ | Dev tooling |
Doc-link discipline
The workspace [workspace.lints.rustdoc] block sets every
documentation lint to warn (target: deny). Specifically:
- broken_intra_doc_links = "warn" — every [name] link in a doc comment must resolve.
- private_intra_doc_links = "warn" — links to pub(crate) items flagged so the public docs don't dangle into private structures.
- invalid_codeblock_attributes = "warn" — typos in rust,no_run-style code-block attributes get caught.
- invalid_html_tags = "warn" — accidental <foo> in prose flagged.
- invalid_rust_codeblocks = "warn" — every rust code block must parse as Rust.
- bare_urls = "warn" — links must be <https://...> or [label](url), not bare URLs (which markdown parses inconsistently).
- redundant_explicit_links = "warn" — [x](x) where the autolink form would do.
- unescaped_backticks = "warn" — stray backticks flagged.
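In Cargo.toml terms the block presumably reads as follows (reconstructed from the lint list, not copied from the workspace):

```toml
[workspace.lints.rustdoc]
broken_intra_doc_links = "warn"
private_intra_doc_links = "warn"
invalid_codeblock_attributes = "warn"
invalid_html_tags = "warn"
invalid_rust_codeblocks = "warn"
bare_urls = "warn"
redundant_explicit_links = "warn"
unescaped_backticks = "warn"
```

Member crates opt in with lints.workspace = true, so the gate applies uniformly.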
The deferred deny upgrade is tracked separately; once the existing warnings are cleaned up the gate will tighten.
Local rustdoc build
just doc # workspace-wide rustdoc (no deps)
just doc-open # rustdoc + open in default browser
Both run inside the dev container; output lands at
target/doc/aozora/index.html.
Building this handbook
just book-build # render to crates/aozora-book/book/
just book-serve # live-preview at localhost:3000
just book-linkcheck # lychee link verification
See Contributing → Development loop for the full toolchain.
See also
- Crate map — narrative description of each crate.
- Library Quickstart — common API patterns.
Environment variables
A central reference for every env var aozora reads. Variables fall into three groups: parser configuration, dev / bench harness, and container plumbing.
Parser configuration
| Variable | Read by | Effect |
|---|---|---|
NO_COLOR | aozora-cli | If set (any value), disable ANSI colour output. Same as --no-color. Standard convention from https://no-color.org. |
AOZORA_LOG | aozora-cli, library opt-in | tracing-subscriber filter directive (e.g. aozora_lex=debug,aozora_render=info). For internal debugging; not part of the stable surface. |
Dev / bench harness
| Variable | Read by | Effect |
|---|---|---|
AOZORA_CORPUS_ROOT | aozora-corpus, every probe, every sample-profile recipe, the corpus sweep | Directory of 青空文庫 source files (UTF-8 or Shift_JIS). Required for any corpus-driven operation. |
AOZORA_PROFILE_LIMIT | aozora-bench probes | Cap the number of corpus documents per probe. Useful for fast iteration; set to 100 for a sub-second sweep. |
AOZORA_PROFILE_REPEAT | samply-corpus, samply-render | Number of parse / render passes per document after the one-time corpus load. Default 5; raise to give samply enough parser-bound wall time to attach to. |
AOZORA_PROBE_DOC | pathological_probe | Single corpus path to probe in tight per-call mode. Path is relative to $AOZORA_CORPUS_ROOT. |
AOZORA_PROPTEST_CASES | aozora-proptest::config | Override default proptest case count (default 128 per block). 4096 for just prop-deep. |
Container plumbing
These are set by docker-compose.yml and don’t need manual handling
unless you’re invoking cargo directly outside the dev container.
| Variable | Set by | Purpose |
|---|---|---|
CARGO_HOME | compose | /workspace/.cargo — registry + git deps cached on a named volume. |
CARGO_TARGET_DIR | compose | /workspace/target — build output cached on a named volume. |
RUSTC_WRAPPER | compose | sccache — compile cache. |
SCCACHE_DIR | compose | /workspace/.sccache — sccache backing store on a named volume. |
SCCACHE_CACHE_SIZE | compose | 10G — default cap. |
CARGO_INCREMENTAL | compose | 0 — incremental compile defeats sccache; turning it off lets sccache cache the very crates we build most often. |
RUST_BACKTRACE | compose | 1 — full backtraces on panic. |
GIT_CONFIG_* | compose | Whitelists /workspace for git’s “dubious ownership” check (the bind-mounted host source is owned by a non-root UID; the container runs as root). |
Variables we deliberately do not read
A few standard variables aozora intentionally ignores:
| Variable | Why ignored |
|---|---|
LANG / LC_ALL | aozora handles its own encoding via --encoding. Locale-driven byte interpretation would make the parser non-reproducible across machines. |
RUSTFLAGS (in non-build context) | The release / bench / PGO profiles set their own flags; per-invocation RUSTFLAGS would defeat sccache hits for unrelated crates. |
CARGO_BUILD_JOBS | Cargo’s default (CPU count) is what we want. Overriding usually fights the bench harness’s own parallelism control. |
See also
- CLI reference → Environment — the CLI’s per-invocation env.
- Performance → Corpus sweeps — the `AOZORA_CORPUS_ROOT` setup.
- Performance → Profiling with samply — the `AOZORA_PROFILE_REPEAT` knob.
Conformance suite
aozora ships a WPT-style conformance corpus so other implementations of the Aozora Bunko notation (the tree-sitter reference grammar, third-party ports, alternate parsers in other languages) can measure their adherence against the same set of cases the Rust parser is held to.
Tier model
| Level | Meaning | Effect on xtask conformance run |
|---|---|---|
must | Required for any conforming implementation. | A failure here exits non-zero. |
should | Recommended but not strictly required. | A failure here logs a warning. |
may | Optional; implementations decide. | Pure information; never fails. |
The tier is declared per case in
crates/aozora-conformance/fixtures/render/<case>/meta.toml
alongside a feature tag (ruby, bouten, composite, recovery,
…). The runner aggregates pass / fail counts by (feature, level).
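A `meta.toml` for a hypothetical ruby case might look like the following — the field names are inferred from the tier/feature description above, not copied from the fixtures:

```toml
# crates/aozora-conformance/fixtures/render/<case>/meta.toml (illustrative)
level = "must"     # must | should | may
feature = "ruby"   # aggregation key for the (feature, level) report
```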
Running
just conformance # full suite, exits non-zero on must-fail
just render-gate # the byte-identical render gate, K3-style
xtask conformance run # invoke the runner directly
A successful run also writes
crates/aozora-book/src/conformance-results.json with per-case
detail. The JSON shape is stable; downstream dashboards / shields
parse it.
What gets compared
The runner checks two outputs per fixture:
- `tree.to_html()` byte-identical to `expected.html`.
- `tree.serialize()` byte-identical to `expected.serialize.txt`.
Both goldens regenerate via
UPDATE_GOLDEN=1 cargo test -p aozora-conformance --test render_gate
after intentional output changes. The runner does not yet compare
diagnostics or wire-format output; both are future extensions of the
same manifest.
Implementations
The runner currently targets a single implementation — the Rust
parser itself. The results.json format carries an implementation
field so external runs can append their own results without
disturbing the canonical Rust pass-rate.
See also
- Architecture → Error recovery — what the parser does after each diagnostic fires; the recovery-feature fixtures pin those semantics.
- Node reference — per-`NodeKind` documentation.
AST query DSL
A tree-sitter-flavoured pattern DSL selects nodes / tokens from the
concrete syntax tree. Editor surfaces (LSP
textDocument/documentHighlight, “find all ruby annotations”,
refactoring filters, syntax-aware search) compose against the DSL
instead of re-implementing tree walks.
The DSL ships behind the query Cargo feature on the aozora
crate; that feature also enables cst since queries run against
SyntaxNode.
Quickstart
use aozora::Document;
use aozora::query::compile;
let doc = Document::new("|青梅《おうめ》と|青空《あおぞら》");
let cst = aozora::cst::from_tree(&doc.parse());
let query = compile("(Construct @ruby)").expect("compile");
for capture in query.captures(&cst) {
println!("{} -> {:?}", capture.name, capture.node);
}
Grammar
query := pattern ('\n' pattern)* '\n'?
pattern := '(' kind capture? ')'
| '(' '_' capture? ')'
kind := SyntaxKind ident // e.g. `Construct`, `Container`
capture := '@' ident
ident := [A-Za-z_][A-Za-z0-9_-]*
- `(Construct)` — match every `Construct` node.
- `(Construct @ruby)` — capture each `Construct` under the name `ruby`.
- `(_)` — match any kind (node or token).
- `(_ @any)` — combined; tour every kind in preorder.
- Multiple patterns separated by newlines run as an OR — every matching node yields one `Capture` per pattern that hits.
Execution model
The DSL compiles once into a Vec<Pattern>; the engine then tests
every pattern at every preorder step (O(nodes × patterns)). The
small capture-only surface keeps the implementation tight while the
predicate / field-access / alternation extensions wait for a
concrete consumer ask.
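That execution model is small enough to sketch in full. The following toy engine uses simplified, hypothetical types (`Node`, `Pattern`) — the real `SyntaxNode` and compiled query are richer — but shows the same O(nodes × patterns) preorder walk:

```rust
/// Toy stand-in for a CST node (hypothetical; not aozora's SyntaxNode).
struct Node {
    kind: &'static str,
    children: Vec<Node>,
}

/// A compiled pattern: `(Kind @name)`, with `kind: None` for the `_` wildcard.
struct Pattern {
    kind: Option<&'static str>,
    capture: Option<&'static str>,
}

/// Test every pattern at every preorder step — O(nodes × patterns).
fn captures<'a>(node: &'a Node, patterns: &[Pattern], out: &mut Vec<(&'static str, &'a Node)>) {
    for p in patterns {
        let kind_ok = p.kind.map_or(true, |k| k == node.kind);
        if kind_ok {
            if let Some(name) = p.capture {
                out.push((name, node));
            }
        }
    }
    for child in &node.children {
        captures(child, patterns, out);
    }
}

fn main() {
    let tree = Node {
        kind: "Document",
        children: vec![
            Node { kind: "Construct", children: vec![] },
            Node { kind: "Construct", children: vec![] },
        ],
    };
    // Equivalent of compiling "(Construct @ruby)".
    let patterns = [Pattern { kind: Some("Construct"), capture: Some("ruby") }];
    let mut out = Vec::new();
    captures(&tree, &patterns, &mut out);
    assert_eq!(out.len(), 2); // both Construct nodes captured as @ruby
    println!("{} captures", out.len());
}
```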
Not yet supported
- Predicates (`#eq?`, `#match?`) — the tree-sitter query language exposes per-capture filters. The DSL ships without them; consumers filter the resulting `Capture` vec in Rust.
- Field accessors (`(Container body: (Construct))`) — the CST has no named fields yet.
- Quantifiers (`(...)?`, `(...)*`, `(...)+`).
- Alternation (`[...]`) between patterns.
These extensions are forward-compatible with the existing API
shape (compile → captures); a future release can land them
without breaking existing queries.
Cross-references
- Architecture → Concrete syntax tree — the CST the DSL queries.
- Node reference — `NodeKind` / `SyntaxKind` documentation.
Wire format
aozora ships a stable JSON wire format used by every binding —
aozora-ffi (C ABI), aozora-wasm (npm), aozora-py (PyO3) —
to project the parser’s output across language boundaries.
aozora::wire
is the single authority for that projection; downstream drivers
call into it and receive bit-identical output.
Envelope shape
Every wire JSON has the form
{ "schema_version": 1, "data": [ /* … entries … */ ] }
where schema_version is the major version of the wire contract and
data is the per-endpoint payload array.
The four endpoint envelopes are:
| Endpoint | Entry shape | JSON Schema |
|---|---|---|
serialize_diagnostics | { kind, severity, source, span, codepoint? } | schema-diagnostics.json |
serialize_nodes | { kind, span: { start, end } } | schema-nodes.json |
serialize_pairs | { kind, open: { start, end }, close: { … } } | schema-pairs.json |
serialize_container_pairs | { kind, open: { offset }, close: { offset } } | schema-container-pairs.json |
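Putting the envelope and the `serialize_nodes` entry shape together, a payload might look like this — the `kind` value and span offsets here are made-up illustrations, not pinned output:

```json
{
  "schema_version": 1,
  "data": [
    { "kind": "ruby", "span": { "start": 0, "end": 21 } }
  ]
}
```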
SCHEMA_VERSION
The schema_version integer (aozora::wire::SCHEMA_VERSION)
bumps on any breaking change to the serialised shape — variant
additions that surface as a new kind value, field renames, envelope
restructuring. Clients should branch on the version and handle
unknown values defensively; schema 1 makes no forward-compatibility
guarantees with later schemas.
Stability vs. non_exhaustive
Diagnostic
and AozoraNode
are #[non_exhaustive] — minor releases can add variants. The wire
format protects callers in two ways:
- Unrecognised variants emit `kind: "unknown"` rather than failing to serialise, so an old client never sees parse-time data loss.
- `SCHEMA_VERSION` bumps when new variants ship in the wire surface, giving version-branching clients a chance to react before `"unknown"` shows up in production traffic.
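A client following that advice gates on the version before trusting `kind` tags. A sketch, assuming the envelope has already been deserialised into plain values — the concrete tag names (`"ruby"`, `"bouten"`) and the return strings are illustrative:

```rust
/// The newest wire schema this hypothetical client understands.
const SUPPORTED_SCHEMA: u64 = 1;

/// What a defensive client does with one entry's `kind` tag.
fn classify(schema_version: u64, kind: &str) -> &'static str {
    if schema_version > SUPPORTED_SCHEMA {
        // Newer producer: shapes may have changed; bail to a safe path.
        return "refetch-with-newer-client";
    }
    match kind {
        // Tags this client was built against (assumed names).
        "ruby" | "bouten" => "handle",
        // Includes the producer's "unknown" escape hatch.
        _ => "skip-and-log",
    }
}

fn main() {
    assert_eq!(classify(1, "ruby"), "handle");
    assert_eq!(classify(1, "unknown"), "skip-and-log");
    assert_eq!(classify(2, "ruby"), "refetch-with-newer-client");
    println!("defensive wire handling ok");
}
```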
See also
- Diagnostics catalogue — the source-code identifiers each `DiagnosticWire` entry's `kind` field carries.
- Architecture → Error recovery — what the parser actually does after each diagnostic fires.
- Node reference — per-`NodeKind` documentation for every wire `kind` tag emitted by `serialize_nodes`.
- `aozora::wire` rustdoc — Rust API surface (envelope structs, the `schema_*` introspection helpers behind the `schema` Cargo feature).
Development loop
aozora’s development workflow is built around three rules:
- Docker-only execution. The host toolchain is never invoked.
- `just` is the entry point. Every operation goes through a `just` recipe that wraps the underlying tool inside the dev container.
- Lint gates run automatically. lefthook installs git hooks that run `fmt` + `clippy` + `typos` pre-commit and `test` + `deny` pre-push, so a passing local commit roughly mirrors a passing CI run.
First-time setup
git clone git@github.com:P4suta/aozora.git
cd aozora
docker compose build dev # ~5 min the first time, cached afterwards
just hooks # install lefthook git hooks
just test # confirm green
Daily loop
just shell # drop into the dev container
just build # cargo build --workspace --all-targets
just test # workspace nextest
just lint # fmt + clippy + typos + strict-code
just prop # property-based sweep (128 cases / block)
just ci # full CI replica (lint + build + test + prop + deny + audit + udeps + coverage + book-build)
just --list enumerates everything available; just --list --unsorted
preserves the topical grouping (build → test → lint → deps → bench →
docs → release → dev-helpers).
Watch mode (bacon)
just watch # default `check` job
just watch clippy
just watch test
Inside bacon: t test, c clippy, d doc, f failing-only,
esc previous job, q quit, Ctrl-J list jobs. The watcher runs
inside the dev container so file change detection works against the
bind-mounted source.
For headless usage (no TTY, e.g. piping to tee):
just watch-headless check # plain output, no TUI
Why Docker for everything?
Three reasons.
- Toolchain reproducibility. The dev image pins `rust:1.95.0-bookworm` plus exact versions of `cargo-nextest`, `cargo-llvm-cov`, `cargo-deny`, `cargo-audit`, `cargo-udeps`, `cargo-semver-checks`, `cargo-fuzz`, `mdbook`, `mdbook-mermaid`, `lychee`, `git-cliff`, `bacon`, and `lefthook`. A fresh checkout on any machine produces identical tool behaviour.
- sccache hits. The compose file mounts a named volume at `/workspace/.sccache` and sets `RUSTC_WRAPPER=sccache`. Across sessions and across branches, the cache stays warm.
- Host insulation. Nothing in the workspace touches `~/.cargo`, `~/.rustup`, or any global state. Removing the project means `docker compose down -v && rm -rf aozora/`.
The two exceptions to Docker-only:
- samply profiling. `perf_event_open(2)` doesn’t survive the container seccomp profile; the `samply-*` recipes invoke the host toolchain (see Profiling with samply).
- Release builds. GitHub Actions runners build the release binaries natively per OS (each release binary needs to match its runner OS exactly).
Editor / IDE setup
The repository includes a .devcontainer/ config, so:
- VS Code with Dev Containers extension — “Reopen in Container” picks up the dev image, the rust-analyzer toolchain, and the `aozora-*` workspace at once. No host-side Rust install needed.
- Anything else — point your editor’s rust-analyzer at the dev container via `docker exec`. The cleanest approach is symlinking `target/` from the named volume to a host-visible path; the alternative is the editor’s own remote-LSP support.
sccache stats
After a build cycle, check that the cache is actually warm:
just sccache-stats
Healthy steady state: 80%+ hit rate during normal iteration. A
sub-50% hit rate usually means RUSTC_WRAPPER got defeated — the
likely culprit is a stray env override or an [env] in
.cargo/config.toml. To reset counters before a measurement window:
just sccache-zero && just clean && just build && just sccache-stats
Pre-commit hooks (lefthook)
lefthook.yml configures:
- pre-commit (parallel): `fmt`, `clippy`, `typos`.
- commit-msg: Conventional Commits regex.
- pre-push (parallel): `test`, `deny`.
The hooks shell into docker compose run --rm dev … so they’re
identical to the just recipes you ran manually. To skip a hook
temporarily, push from the dev container’s shell directly (the
hooks attach to the host git, not the container’s git).
Why lefthook over husky / pre-commit / cargo-husky?
- husky — Node-only ecosystem; would force a Node dep into a Rust workspace.
- pre-commit (Python framework) — Python-only ecosystem; same issue inverted.
- cargo-husky — abandoned upstream.
- lefthook — single Go binary, language-neutral, parallel execution, ships from a small upstream that’s actively maintained. Mainstream choice for polyglot Rust workspaces in 2026.
Conventional commits
The commit-msg hook enforces:
<type>(<scope>): <subject>
Where <type> ∈ feat | fix | docs | style | refactor | perf | test | build | ci | chore | revert,
and <scope> is typically a crate name without the aozora- prefix
(e.g. feat(render): add aozora-tcy class hook).
git-cliff turns these into the CHANGELOG on release.
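The commit-msg rule is essentially one pattern. Here is a plain-Rust sketch of the check — the real hook uses a regex configured in `lefthook.yml`, and this simplified version ignores details like the `!` breaking-change marker:

```rust
const TYPES: [&str; 11] = [
    "feat", "fix", "docs", "style", "refactor", "perf",
    "test", "build", "ci", "chore", "revert",
];

/// Check `<type>(<scope>): <subject>` with the scope optional.
fn is_conventional(msg: &str) -> bool {
    // Split "<type>(<scope>)" from "<subject>" at the first ": ".
    let Some((head, subject)) = msg.split_once(": ") else { return false };
    if subject.is_empty() {
        return false;
    }
    // Peel an optional "(scope)" off the type.
    let ty = match head.split_once('(') {
        Some((ty, rest)) => {
            if !rest.ends_with(')') || rest.len() < 2 {
                return false; // unclosed or empty scope
            }
            ty
        }
        None => head,
    };
    TYPES.contains(&ty)
}

fn main() {
    assert!(is_conventional("feat(render): add aozora-tcy class hook"));
    assert!(is_conventional("docs: refresh CHANGELOG for v0.2.7"));
    assert!(!is_conventional("update stuff"));
    println!("commit-msg sketch ok");
}
```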
Adding a new 青空文庫 notation
End-to-end TDD flow:
- Spec fixture. Add an `(input, html, serialise)` triple under `spec/aozora/cases/`.
- AST variant. Add a borrowed-arena variant to `AozoraNode` in `crates/aozora-syntax/src/borrowed.rs`.
- Lexer test (red). Add a case to the relevant phase test under `crates/aozora-lexer/tests/`.
- Lexer impl (green). Wire the recogniser into the appropriate phase (sanitize → tokenize → pair → classify).
- Renderer. Emit the new HTML shape in `crates/aozora-render/src/html.rs` and the canonical serialisation in `crates/aozora-render/src/serialize.rs`.
- Cross-layer invariants. Extend the property test or corpus predicate that the new shape interacts with (escape-safety, round-trip, span well-formedness).
See also
- Testing strategy — what each test layer asserts.
- Release process — how a tag becomes a published release.
Testing strategy
aozora targets 100% C1 (branch) coverage — but coverage is the floor, not the ceiling. Every invariant is asserted from multiple angles so a single missed test path doesn’t silently hide a regression.
The five test layers
flowchart TD
A["1. Spec cases<br/>(spec/aozora/cases/*.json)"]
B["2. Property tests<br/>(crates/*/tests/property_*.rs)"]
C["3. Corpus sweep<br/>(every Aozora Bunko work)"]
D["4. Fuzz harness<br/>(cargo-fuzz)"]
E["5. Sanitizers<br/>(Miri / TSan / ASan)"]
A --> B --> C --> D --> E
Each layer catches a different kind of bug:
| Layer | Catches |
|---|---|
| Spec cases | Per-feature contract regressions (the (input, html, canonical) triple). |
| Property tests | Invariant violations in the space of inputs (round-trip, escape-safety, span well-formedness). |
| Corpus sweep | Real-world distribution effects the property generator missed. |
| Fuzz | Latent panics on adversarial inputs the corpus doesn’t contain. |
| Sanitizers | UB / data race / heap-corruption issues the language can’t catch. |
When you add a new invariant, land all five touchpoints in the same PR, or split them into a chain of PRs that explicitly references the invariant.
Layer 1: spec cases
spec/aozora/cases/
├── ruby-nested-gaiji.json
├── emphasis-bouten.json
├── emphasis-double-ruby.json
├── kunten-kaeriten.json
├── page-break.json
└── …
Each case pins an `(input, html, serialise)` triple:
{
"input": "|青梅《おうめ》",
"html": "<ruby>青梅<rt>おうめ</rt></ruby>",
"serialise": "|青梅《おうめ》"
}
The unit test runner (cargo nextest run -p aozora-render) loads
every case, parses, renders, serialises, and compares against the
pinned strings. The property harness also uses these cases as
seed inputs for shrinking.
The flagship in-tree fixture lives at
spec/aozora/fixtures/56656/ — the Japanese translation of Crime
and Punishment (Aozora Bunko card 56656). It exercises 1000+ ruby
annotations, forward-reference bouten, JIS X 0213 gaiji, and
accent decomposition edge cases.
Layer 2: property tests
proptest generators in
crates/aozora-proptest drive parse / render / round-trip
invariants. Default 128 cases per proptest! block (CI budget);
just prop-deep runs 4096 per block (release-cut budget).
just prop # 128 cases
just prop-deep # 4096 cases
AOZORA_PROPTEST_CASES=10000 cargo nextest run --workspace --test 'property_*'
Why proptest over quickcheck:
- Proptest’s shrinker is structural (reduces by the generator’s ops), so a counterexample collapses to a minimal reproduction that still fails. Quickcheck shrinks per-type, which produces noisier outputs.
- Proptest persists failure seeds to `proptest-regressions/` — every reproduced failure becomes a permanent regression test. Quickcheck has nothing like this.
Why a separate generator crate (aozora-proptest):
The generators are non-trivial (they have to produce valid 青空文庫 source — random byte streams would just stress the parser’s error path, which the fuzz harness already covers). Centralising them means every property test in every crate gets the same generator quality, and the generator itself can be unit-tested.
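As a flavour of what such an invariant looks like, here is a hand-rolled round-trip check over a toy ruby-only subset of the notation. This is a deliberately tiny stand-in — the real generators and parser cover the full grammar and malformed inputs, which this sketch does not:

```rust
/// Toy event: plain text or a |base《reading》 ruby pair.
#[derive(Debug, PartialEq)]
enum Ev {
    Text(String),
    Ruby { base: String, reading: String },
}

/// Minimal parse of the ruby subset: `|base《reading》` anywhere in `src`.
/// Assumes well-formed input (every `|` is followed by 《…》).
fn parse(src: &str) -> Vec<Ev> {
    let mut out = Vec::new();
    let mut text = String::new();
    let mut chars = src.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '|' {
            let mut base = String::new();
            while let Some(&d) = chars.peek() {
                if d == '《' { break; }
                base.push(d);
                chars.next();
            }
            chars.next(); // consume 《
            let mut reading = String::new();
            while let Some(d) = chars.next() {
                if d == '》' { break; }
                reading.push(d);
            }
            if !text.is_empty() {
                out.push(Ev::Text(std::mem::take(&mut text)));
            }
            out.push(Ev::Ruby { base, reading });
        } else {
            text.push(c);
        }
    }
    if !text.is_empty() {
        out.push(Ev::Text(text));
    }
    out
}

/// Canonical serialisation back to notation.
fn serialize(evs: &[Ev]) -> String {
    evs.iter()
        .map(|e| match e {
            Ev::Text(t) => t.clone(),
            Ev::Ruby { base, reading } => format!("|{base}《{reading}》"),
        })
        .collect()
}

fn main() {
    for seed in ["|青梅《おうめ》", "雨|青空《あおぞら》晴"] {
        // The round-trip invariant: serialize(parse(s)) == s.
        assert_eq!(serialize(&parse(seed)), seed);
    }
    println!("round-trip holds on seeds");
}
```

A property harness replaces the fixed seed list with a generator of well-formed documents and shrinks any counterexample it finds.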
Layer 3: corpus sweep
export AOZORA_CORPUS_ROOT=$HOME/aozora-corpus
just corpus-sweep
Walks every .txt under $AOZORA_CORPUS_ROOT, parses, verifies
the round-trip property holds, no panics. ~17 000 works in active
rotation; ~90 s sweep on a modern x86_64 desktop using the parallel
loader.
The sweep catches what the property generator can’t — every weird real-world idiom the maintained corpus has accumulated over 25 years of volunteer encoding choices. It’s the parser’s truth-from-the-field.
See Performance → Corpus sweeps for the corpus structure, archive format, and parallel loader details.
Layer 4: fuzz
just fuzz parse_render -- -runs=10000
Targets under crates/*/fuzz/fuzz_targets/:
- `parse_render` — feed arbitrary bytes through `Document::new ∘ to_html`.
- `serialize_roundtrip` — `parse ∘ serialize ∘ parse` stability.
- `sjis_decode` — `aozora_encoding::sjis::decode_to_string` on arbitrary byte streams.
Fuzz failures auto-shrink to a minimal byte sequence and land in
crates/<crate>/fuzz/artifacts/. Add the failing input to
spec/aozora/cases/ as a regression case after diagnosing.
Why libFuzzer / cargo-fuzz:
Mainstream Rust fuzzing runs on libFuzzer via cargo-fuzz; it has
the broadest crate-ecosystem support (most upstream crates ship
fuzz targets), the corpus-management tooling is mature, and the
crash artefacts are diff-able with git diff.
Layer 5: sanitizers
bash scripts/sanitizers.sh miri # UB on FFI / scan intrinsics
bash scripts/sanitizers.sh tsan # data races (parallel corpus loader)
bash scripts/sanitizers.sh asan # heap correctness
Sanitizer runs are slower (~10× under Miri), so they don’t run on every PR — they run nightly via the dev-image cron in CI, plus at release cut. The slow path catches the slow class of bugs.
Why all three:
- Miri catches undefined behaviour the compiler couldn’t see (out-of-bounds slice access, dangling references, transmute mismatches). The FFI driver and the SIMD scanner have unsafe surfaces; Miri is the only fully-checked oracle for them.
- TSan catches race conditions in the parallel corpus loader. We use `rayon` correctly as far as we know, but TSan is the backstop.
- ASan catches the small set of heap-correctness bugs that get through Miri (typically C-side issues in the FFI smoke test).
Coverage measurement
just coverage # cargo llvm-cov branch coverage; CI gate
just coverage-html # local HTML report at coverage/html/index.html
just coverage-branch # nightly toolchain, branch-coverage detail
cargo llvm-cov over tarpaulin: tarpaulin is x86_64-linux
only and uses ptrace-based instrumentation that misses some
optimised-out branches. llvm-cov uses LLVM’s source-based
coverage instrumentation — works on every target and gives accurate
branch numbers.
The CI gate is region coverage; branch coverage is informational (it requires the nightly compiler, which the workspace doesn’t pin on the hot path).
Test naming and structure
- Unit tests in `mod tests {}` at the bottom of each module.
- Integration tests in `crates/<crate>/tests/`. One file per area (e.g. `tests/lexer_phase0.rs`, `tests/lexer_phase3.rs`).
- Property tests prefixed `property_` (the `prop` recipe globs on this).
- Doc tests inside fenced `rust` blocks in rustdoc comments. CI runs `just test-doc` separately because nextest skips them.
Snapshot testing
Where the output is a multi-line string that’s tedious to inline
(rendered HTML, diagnostic-formatted text), we use
insta:
insta::assert_snapshot!(tree.to_html());
The first run writes tests/snapshots/<test>.snap; subsequent runs
compare against it. Updates happen via cargo insta review (the
interactive UI inside the dev container), never by manually editing
the .snap file.
See also
- Development loop — `just test` and friends.
- Performance → Corpus sweeps — how the layer-3 corpus sweep works in practice.
Release process
aozora releases are git-tag-driven: push an annotated v<semver>
tag, and .github/workflows/release.yml builds the cross-platform
binaries, generates release notes from Conventional Commits, and
publishes the GitHub Release.
Cutting a release
# 1. Pre-flight (everything green locally)
just ci # lint + build + test + prop + deny + audit + udeps + coverage + book-build
just prop-deep # 4096 cases per proptest block
AOZORA_CORPUS_ROOT=… just corpus-sweep
# 2. Bump workspace version
cargo set-version --workspace 0.2.7
git commit -am "chore(release): bump workspace to v0.2.7"
# 3. Refresh CHANGELOG (Unreleased → version)
just changelog # runs git-cliff with --unreleased --prepend
git add CHANGELOG.md && git commit -m "docs: refresh CHANGELOG for v0.2.7"
# 4. Tag (annotated)
git tag -a v0.2.7 -m "v0.2.7"
git push origin main v0.2.7
release.yml reacts to the tag: builds release binaries on three
runners (linux x86_64, macOS arm64, windows x86_64), assembles
tarballs / zips with the aozora binary + LICENSE-MIT +
LICENSE-APACHE + NOTICE + README.md, and publishes the
archives plus SHA256SUMS to the GitHub Release.
Sanity check after release
# Verify checksums
curl -L -O https://github.com/P4suta/aozora/releases/download/v0.2.7/SHA256SUMS
curl -L -O https://github.com/P4suta/aozora/releases/download/v0.2.7/aozora-v0.2.7-x86_64-unknown-linux-gnu.tar.gz
sha256sum --check SHA256SUMS
# Verify the binary
tar -xzf aozora-v0.2.7-*.tar.gz
./aozora --version # prints "aozora 0.2.7"
Why annotated tags?
git tag -a creates an annotated tag object with a message; git tag
alone creates a lightweight tag (a bare ref). git-cliff’s release
note extraction only walks annotated tags, and the standard
ecosystem expectation (cargo-release, cargo-dist) is that release
tags are annotated. Using lightweight tags would silently break the
changelog generator.
Why git-tag-driven, not branch-driven?
A release/v0.2.7 branch model is the alternative. We don’t use
it because:
- Single-author workflow doesn’t benefit from the parallel-tracks model that branch-driven releases enable.
- An annotated tag is the release artefact — anything you need to retroactively understand about a release lives in `git show v0.2.7`. A branch loses that locality.
- Rollback is `git tag -d` plus deleting the GitHub release. Trivial.
CHANGELOG generation
git-cliff consumes Conventional Commits
and produces Keep-a-Changelog formatted output:
just changelog # incremental: --unreleased --prepend CHANGELOG.md
just changelog-full # rebuild from scratch
cliff.toml configures the grouping:
| Commit type | Section in CHANGELOG |
|---|---|
feat: | Added |
fix: | Fixed |
perf: | Performance |
refactor: | Changed |
docs: | Documentation |
test: | Tests |
build: | Build |
ci: | CI |
chore: | (skipped unless scope is release) |
revert: | Reverted |
Non-conventional commits are silently skipped (they survive in
git log but don’t pollute the changelog).
Why --unreleased --prepend over -o CHANGELOG.md:
The full-rebuild form (-o) regenerates the entire changelog from
git history every time, which churns the diff for past releases
even when nothing about them changed (whitespace, footer
formatting). The incremental form only writes the new “Unreleased”
section between the latest release and HEAD, leaving past entries
byte-stable.
Why three release targets and not five?
The CI matrix builds:
- `x86_64-unknown-linux-gnu` (linux x86_64)
- `aarch64-apple-darwin` (macOS arm64)
- `x86_64-pc-windows-msvc` (windows x86_64)
We don’t build `x86_64-apple-darwin` (macOS Intel — Apple has deprecated the platform; the arm64 build covers all current Macs) or `aarch64-unknown-linux-gnu` (linux arm64 — covered by `cargo install` from source for the niche ARM Linux deployment case).
Adding a target is one line in release.yml; we add them when a
real consumer asks for a binary build of one. Pre-emptive coverage
isn’t worth the CI minutes.
Why not cargo-dist / release-plz?
Both are mainstream choices; we use a hand-written release.yml
because:
- `cargo-dist` is opinionated about archive layout (assumes you ship `bin/` + `share/`); aozora’s archive is flat (`aozora` + `LICENSE-*` + `NOTICE` + `README.md`).
- `release-plz` automates the version-bump + PR flow; for a single-author repo the manual `cargo set-version` + `git tag` is two commands and one fewer integration to debug.
When the workspace grows past three release targets or aozora goes multi-author, both will be worth re-evaluating.
Pre-1.0 SemVer
aozora is currently in the 0.x series. The contract:
- `0.x.y` → `0.x.y+1`: patches and additions, no breaks. Always safe to upgrade.
- `0.x.y` → `0.x+1.0`: may break the API. `cargo-semver-checks` flags the breaks during CI; the version-bump commit references the break in its body.
- `0.x.y` → `1.0.0`: the API freeze. Post-1.0, breaking changes collect on a `next` branch and ship in a major bump.
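The contract reduces to a tiny predicate — a sketch for intuition, not an API aozora ships (`upgrade_is_safe` is a made-up name):

```rust
/// True when upgrading `old` -> `new` is guaranteed safe under the
/// pre-1.0 contract above: only the 0.x.y -> 0.x.(y+n) step is.
fn upgrade_is_safe(old: (u64, u64, u64), new: (u64, u64, u64)) -> bool {
    old.0 == 0 && new.0 == 0 && old.1 == new.1 && new.2 >= old.2
}

fn main() {
    assert!(upgrade_is_safe((0, 2, 6), (0, 2, 7)));  // patch: always safe
    assert!(!upgrade_is_safe((0, 2, 7), (0, 3, 0))); // minor: may break
    assert!(!upgrade_is_safe((0, 9, 9), (1, 0, 0))); // 1.0: the freeze
    println!("pre-1.0 contract sketch ok");
}
```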
The MSRV pin (rust-toolchain.toml) advances on its own cadence,
roughly quarterly. MSRV bumps are not breaking under our pre-1.0
contract — consumers that need a frozen MSRV pin a release tag.
Publishing to crates.io
Deferred until v1.0. The reasoning:
- Pre-1.0, every minor bump may break the API; publishing those churns the registry for downstream `Cargo.lock` consumers.
- Once published, the crate name becomes load-bearing — name changes cost goodwill. Holding the name unpublished keeps the option to refactor the crate boundary.
When v1.0 lands, the publication workflow will run from a tag:
cargo publish per crate in topological order
(aozora-spec first, aozora last), driven from release.yml.
See also
- Development loop — the local pre-flight commands.
- Testing strategy — `prop-deep` and corpus sweep details.