Welcome

aozora is a pure-functional Rust parser for 青空文庫記法 (Aozora Bunko notation) — the in-text annotation language used by 青空文庫, the long-running volunteer digital library of Japanese literature in the public domain.

It handles ruby (|青梅《おうめ》), bouten / bousen ([#「X」に傍点]), 縦中横, gaiji references (※[#…、第3水準1-85-54]), kunten / kaeriten, indent and align containers ([#ここから2字下げ]… [#ここで字下げ終わり]), and page / section breaks — every notation that appears in a real Aozora Bunko .txt source.

The repository is CommonMark-free, Markdown-free: aozora deals only with the 青空文庫 notation. The renderer emits semantic HTML5; the lexer reports structured diagnostics; the AST is a borrowed-arena tree that can be walked in O(n) without copying source bytes. If you want a Markdown dialect that also understands aozora notation, see the sibling project afm, which is built on top of this parser.

What this handbook is for

A practical tour and a deep reference, in one document.

Project shape

aozora is a single-author, green-field project that takes the opportunity to reach for a good algorithm and data structure for each problem rather than the obvious naive one. That orientation permeates every chapter: when you read about the scanner, the arena, or the gaiji table, you’ll see why the technique was chosen spelled out, not just what the code does.

Status

Released versions track GitHub Releases; the bindings — the CLI, the Rust library, WASM, the C ABI, Go, Python, and the Extism host-SDK — all build and pass CI smoke tests. Public crates.io publication is gated on the v1.0 API freeze; in the meantime, depend on a tagged commit (see Install for the current pin).

A live build of this site lives at https://p4suta.github.io/aozora/; the rustdoc API reference is layered underneath at https://p4suta.github.io/aozora/api/aozora/.

Install

aozora ships in five shapes — pick the one that matches how you want to consume the parser.

CLI binary (release archive)

Pre-built aozora binaries for the three Tier-1 platforms ride on every GitHub Release:

  • aozora-vX.Y.Z-x86_64-unknown-linux-gnu.tar.gz
  • aozora-vX.Y.Z-aarch64-apple-darwin.tar.gz
  • aozora-vX.Y.Z-x86_64-pc-windows-msvc.zip

Each archive is shipped with a SHA256SUMS companion. Browse them at https://github.com/P4suta/aozora/releases.

curl -L -O \
  https://github.com/P4suta/aozora/releases/latest/download/aozora-x86_64-unknown-linux-gnu.tar.gz
tar -xzf aozora-*.tar.gz
sudo install -m 0755 aozora /usr/local/bin/
aozora --version

CLI binary (build from source)

The released CLI is on crates.io — cargo install compiles it from the published source:

cargo install aozora-cli --locked

The --locked flag is non-negotiable — it pins to the exact Cargo.lock we shipped, which matters because the workspace uses fat LTO (mismatched dep versions silently change inlining behaviour).

To track the development tip instead, install from git:

cargo install --git https://github.com/P4suta/aozora --locked aozora-cli

Or pin a specific release tag (the current value is on the releases page):

cargo install --git https://github.com/P4suta/aozora \
              --tag v0.4.1 --locked aozora-cli

Rust library

aozora is on crates.io. Depend on the umbrella crate alone: it is the single front door, and the building-block crates (aozora-encoding, …) are reached through its re-exports (aozora::encoding, …):

[dependencies]
aozora = "0.4"

Bleeding-edge alternative — to track unreleased fixes on main, pin a git tag instead. This block is the single source of truth for the recommended git pin — every other doc links here, so a new release only needs this one tag updated:

[dependencies]
aozora = { git = "https://github.com/P4suta/aozora.git", tag = "v0.4.1" }

The current tag is whichever release GitHub Releases marks as Latest. Either way the repo follows Conventional Commits and SemVer: breaking changes advance the major version (post-1.0) or the minor version (during 0.x), so a "0.4" requirement stays safe.

WASM (browser / Node)

The browser package is on npm as aozora-wasm:

npm install aozora-wasm

To build it from a checkout instead:

rustup target add wasm32-unknown-unknown        # one-time
wasm-pack build --target web --release crates/aozora-wasm

The post-wasm-opt artifact has a 500 KiB size budget. See Bindings → WASM for the JS surface and the post-build wasm-opt invocation we recommend.

C ABI

cargo build --release -p aozora-ffi
# → target/release/libaozora_ffi.{so,dylib,a}
# → target/release/aozora.h          (cbindgen-generated)

Link with -laozora_ffi and include aozora.h. See Bindings → C ABI for the API surface and memory ownership rules.

Python

The wheel is on PyPI as aozora_py:

pip install aozora_py

For local development against a checkout, build with maturin instead:

pip install maturin                              # one-time
cd crates/aozora-py
maturin develop -F extension-module              # install in current venv
maturin build   -F extension-module --release    # produce a redistributable wheel

See Bindings → Python for the API and the unsendable thread-safety contract.

Toolchain pin

aozora pins Rust 1.95.0 as its MSRV (rust-toolchain.toml). CI enforces it via a dedicated msrv job. If you run rustup show inside the repo and see something else, your local override needs updating.

CLI Quickstart

The aozora binary covers three operations:

aozora check  FILE.txt          # lex + report diagnostics on stderr
aozora fmt    FILE.txt          # round-trip parse ∘ serialize, print to stdout
aozora render FILE.txt          # render to HTML on stdout

- (or no path argument) reads from stdin. --encoding sjis (alias -E sjis) decodes Shift_JIS source — Aozora Bunko’s distributed .txt files are Shift_JIS, so this flag is the common case for real corpus work.

Common invocations

# Lex an Aozora Bunko file and print diagnostics
aozora check -E sjis crime_and_punishment.txt

# Render to HTML (stdout)
aozora render -E sjis crime_and_punishment.txt > out.html

# Pipe from stdin
cat src.txt | aozora render -

# CI gate: fail if format is not idempotent
aozora fmt --check src.txt

Flag reference

| Flag | Subcommand | Effect |
| --- | --- | --- |
| -E sjis, --encoding sjis | all | Decode Shift_JIS source. Default is UTF-8. |
| --strict | check | Exit non-zero on any diagnostic. |
| --check | fmt | Exit non-zero if formatted output differs from input. |
| --write | fmt | Overwrite the input file with the canonical form. (Ignored when reading from stdin.) |
| --no-color | all | Disable ANSI colour in diagnostics output. |
| --verbose | all | Print parse phase timings to stderr. |

Exit codes

| Code | Meaning |
| --- | --- |
| 0 | Success. |
| 1 | Diagnostics emitted under --strict, or formatting mismatch under --check. |
| 2 | Usage error (bad flag, missing file, decode error). |

Diagnostics format

aozora check prints diagnostics in miette style — a coloured source snippet with carets pointing at the byte range, a short message, and (where applicable) a help line:

  × ruby reading mismatch: target spans 3 chars but |《》 reading is empty
   ╭─[input.txt:42:9]
42 │ |青梅《》
   · ───┬───
   ·    ╰── empty reading
   ╰────
  help: provide a reading inside 《…》 or remove the | marker

Every diagnostic carries a stable dotted code (aozora::lex::empty_ruby_reading, aozora::lex::unresolved_gaiji, …); see the Diagnostics catalogue for the full list.

Why not a single subcommand?

check / fmt / render are intentionally separate so each one has a single, predictable failure mode in shell pipelines:

  • check exits 0 on parse success, regardless of warnings (use --strict for “no diagnostics allowed”).
  • fmt is a pure-text transform: stdin in, canonical text out. --check upgrades it to a CI gate without forking a second binary.
  • render is a pure-text-to-HTML transform with the same exit-code shape.

Combining them behind flags would make the exit-code semantics ambiguous (does --check mean format-check or strict-check?). Keeping them split follows the same logic that separates gofmt, go vet, and go build.

Library Quickstart

The minimal Rust use of aozora is a handful of lines:

use aozora::Document;

fn main() {
    let source = std::fs::read_to_string("src.txt").unwrap();
    let doc = Document::new(source);
    let tree = doc.parse();
    println!("{}", tree.to_html());
}

That’s enough to get HTML out of any UTF-8 青空文庫 source. The rest of this page covers the lifetime model, the diagnostic stream, and the AST walk — three things you’ll need once you do anything beyond “render to HTML”.

The lifetime model

Document owns two things: a bumpalo::Bump arena and the source Box<str>. AozoraTree<'a> borrows from both:

let doc  = aozora::Document::new(source);   // Document: 'static
let tree = doc.parse();                     // AozoraTree<'_> bound to &doc
let html = tree.to_html();                  // walks the borrow

// dropping doc releases every node in a single Bump::reset()
drop(doc);

That is: hand the Document around, not the tree. If you need to keep a parse result alive across function boundaries, the function takes ownership of (or borrows) the Document, and re-derives the tree on the inside. This is unusual for Rust libraries — most parse APIs hand back an owned tree — but it’s what makes aozora’s zero-copy AST safe. See Architecture → Borrowed-arena AST for why this trade is worth it.
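The ownership shape described above can be modelled without the real crate. The following is a minimal sketch in which `MiniDocument`, `MiniTree`, and `count_lines` are hypothetical stand-ins, not aozora's actual types; it only illustrates passing the owner across a function boundary and re-deriving the borrowed view on the inside.

```rust
/// Hypothetical stand-in for `aozora::Document`: owns the source text.
pub struct MiniDocument {
    source: Box<str>,
}

/// Hypothetical stand-in for `AozoraTree<'a>`: borrows from the document.
pub struct MiniTree<'a> {
    lines: Vec<&'a str>,
}

impl MiniDocument {
    pub fn new(source: String) -> Self {
        Self { source: source.into_boxed_str() }
    }

    /// Re-derives a borrowed view; no text is copied out of `source`.
    pub fn parse(&self) -> MiniTree<'_> {
        MiniTree { lines: self.source.lines().collect() }
    }
}

impl<'a> MiniTree<'a> {
    pub fn line_count(&self) -> usize {
        self.lines.len()
    }
}

/// The function receives the document and builds the tree locally, so a
/// borrowed tree never has to outlive its owner.
pub fn count_lines(doc: &MiniDocument) -> usize {
    let tree = doc.parse();
    tree.line_count()
}
```

A caller would construct a `MiniDocument` once and pass `&doc` to as many such functions as it needs; each one pays only for re-deriving its own view.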

Shift_JIS input

Aozora Bunko ships its corpus as Shift_JIS. Decode through the umbrella aozora::encoding module first (consumers depend on aozora alone — never on the internal aozora-encoding crate directly):

use aozora::Document;
use aozora::encoding::decode_sjis;

let bytes = std::fs::read("src.sjis.txt")?;
let utf8  = decode_sjis(&bytes)?;   // -> String; Err(DecodeError) on bad input
let doc   = Document::new(utf8);
let tree  = doc.parse();

decode_sjis handles BOM stripping, JIS X 0213 codepoints, and the Aozora-specific 外字 references that survive the decode pass as private-use sentinels (resolved later in the parser). It is strict — malformed bytes return Err(DecodeError) rather than silently substituting replacement characters. A runnable version is just example sjis.

Diagnostics

use aozora::Diagnostic;

let diags: &[Diagnostic] = tree.diagnostics();
for d in diags {
    let span = d.span();
    // `Diagnostic` is an enum — reach its parts through the accessors.
    // `Display` ({d}) renders the human message; there is no `.message`.
    eprintln!("[{:?}] {} @ {}..{}", d.severity(), d.code(), span.start, span.end);
}

Each Diagnostic carries a stable code(), a span(), and a severity() (Error / Warning / Note). A runnable version is just example diagnostics. Diagnostics are non-fatal by design: the parser always produces a tree, even from malformed input. Callers that want strict behaviour treat any diagnostic as an error themselves. See the Diagnostics catalogue for the code list.

Walking the AST

AozoraTree::source_nodes() returns a source-ordered side table — one SourceNode per classified Aozora / container span (plain-text runs between constructs round-trip verbatim and are not listed). It is the surface editor tooling uses for semantic tokens and document symbols:

for entry in tree.source_nodes() {
    let span = entry.source_span;            // byte range into the source
    // `entry.node` is a `NodeRef`: Inline / BlockLeaf / BlockOpen /
    // BlockClose, each wrapping the borrowed AST node or container kind.
    println!("{}..{}  {:?}", span.start, span.end, entry.node);
}

Match on entry.node (NodeRef) to destructure a specific construct — e.g. NodeRef::Inline(AozoraNode::Ruby(r)) gives you the ruby base and reading. A runnable version is just example walk_ast.

The borrowed nodes are cheap to copy (they’re effectively (tag, &str, &Bump-slice) triples), so you can keep references around freely as long as the Document lives.

Round-trip and canonicalisation

Every parse should round-trip:

let parsed = doc.parse();
let canonical: String = parsed.serialize();
assert_eq!(canonical, doc.source());     // for *canonical* input

Real Aozora Bunko sources contain stylistic variations (CRLF vs LF, NFC vs NFD around accents, half-width vs full-width punctuation) that the lexer normalises before tokenising. For those the assertion above holds after aozora fmt has been applied once.

The pure round-trip property is what aozora fmt --check exercises in CI, and what the corpus sweep verifies across the full Aozora Bunko catalogue (~17 000 works).
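The fixed-point property can be illustrated with a deliberately tiny canonicaliser. This is a sketch under stated assumptions: `canonicalise` and `is_canonical` are hypothetical names, and the function folds line endings only, whereas the real lexer normalises considerably more.

```rust
/// Hypothetical miniature canonicaliser: folds CRLF and lone CR to LF.
/// The real lexer also normalises Unicode forms and punctuation width.
pub fn canonicalise(source: &str) -> String {
    source.replace("\r\n", "\n").replace('\r', "\n")
}

/// The property a `fmt --check` style gate relies on: input that is
/// already canonical comes back unchanged.
pub fn is_canonical(source: &str) -> bool {
    canonicalise(source) == source
}
```

Applying `canonicalise` twice gives the same result as applying it once, which is why a single formatting pass is enough before the round-trip assertion can hold.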

Node reference

aozora exposes 19 NodeKind variants. Each is documented on its own page with source examples, the rendered HTML, the serialize round-trip output, the in-memory AST shape, and the diagnostics it can fire alongside.

The page layout matches the aozora explain <kind> CLI subcommand: once you find the variant in the table, the deep dive is one click — or one shell invocation — away.

| Variant | Wire tag | Notation |
| --- | --- | --- |
| Ruby | ruby | ｜base《reading》 |
| Bouten | bouten | [#「target」に傍点] |
| TateChuYoko | tateChuYoko | [#「12」は縦中横] |
| Gaiji | gaiji | ※[#...、第3水準1-85-54] |
| Indent | indent | [#2字下げ] |
| AlignEnd | alignEnd | [#地から2字上げ] |
| Warichu | warichu | [#割り注]... |
| Keigakomi | keigakomi | [#罫囲み] |
| PageBreak | pageBreak | [#改ページ] |
| SectionBreak | sectionBreak | [#改丁] |
| AozoraHeading | heading | [#見出し] |
| HeadingHint | headingHint | [#「対象」は中見出し] |
| Sashie | sashie | [#挿絵(path.png)入る] |
| Kaeriten | kaeriten | [#返り点 一・二] |
| Annotation | annotation | [#任意のコメント] |
| AngleQuote | angleQuote | ≪重要≫ (displayed as 《重要》) |
| Container | container | [#ここから...]...[#ここで...終わり] |
| ContainerOpen | containerOpen | (NodeRef projection) |
| ContainerClose | containerClose | (NodeRef projection) |

How to read these pages

Every node page follows the same skeleton:

| Section | Content |
| --- | --- |
| Source examples | One or two minimal Aozora-notation strings that produce this variant. |
| Rendered HTML | What Document::new(src).parse().to_html() emits. |
| Serialize output | What serialize() emits, typically the canonical form of the source. |
| AST shape | The borrowed-AST struct fields the variant carries. |
| When emitted | Phase 3 classification rule that produces this variant. |
| Diagnostics | Codes that may accompany this variant. |
| Related kinds | Cross-links to neighbours (Bouten and Bousen, Indent and Container::Indent, etc.). |

#[non_exhaustive] on NodeKind: a future minor release adding a new variant lands here without a breaking change. Downstream consumers that match on NodeKind exhaustively must include a _ arm.
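A minimal sketch of what the wildcard arm looks like in practice. The enum here is a hypothetical four-variant excerpt declared locally, and `is_block_level` is an illustrative helper rather than part of the crate. Note that `#[non_exhaustive]` only constrains matches written in other crates, so inside this single file the `_` arm is a convention rather than a compiler requirement.

```rust
/// Hypothetical excerpt of the enum; the real `NodeKind` has 19 variants.
#[non_exhaustive]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NodeKind {
    Ruby,
    Bouten,
    PageBreak,
    Annotation,
}

/// Classifies a kind as block-level for an imaginary downstream tool.
pub fn is_block_level(kind: NodeKind) -> bool {
    match kind {
        NodeKind::PageBreak => true,
        NodeKind::Ruby | NodeKind::Bouten => false,
        // Downstream crates must keep this arm: variants added in a later
        // minor release land here instead of breaking the build.
        _ => false,
    }
}
```

Choosing a conservative default in the `_` arm (here, "treat as inline") keeps downstream behaviour predictable when a new variant appears.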

NodeKind::Ruby

Wire tag: ruby — base text + reading annotation. The most common non-trivial variant in Aozora Bunko.

Source examples

|青梅《おうめ》
青梅《おうめ》

Both forms classify as Ruby; the leading ｜ (U+FF5C) makes the delimiter explicit and lets the parser disambiguate the base run when ambiguous neighbours could otherwise extend the base.

Rendered HTML

<ruby>青梅<rp>(</rp><rt>おうめ</rt><rp>)</rp></ruby>

<rp> parens are emitted so HTML clients without ruby support still display a readable fallback.
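The markup shape above can be reproduced with a few lines of string formatting. This is a sketch only: `render_ruby` and `escape_html` are hypothetical helpers, not the crate's renderer, and the escaping policy shown is an assumption.

```rust
/// Escapes the characters that are unsafe in HTML text content.
fn escape_html(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for ch in text.chars() {
        match ch {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            _ => out.push(ch),
        }
    }
    out
}

/// Hypothetical helper: formats one ruby pair with <rp> fallback
/// parentheses around the reading.
pub fn render_ruby(base: &str, reading: &str) -> String {
    format!(
        "<ruby>{}<rp>(</rp><rt>{}</rt><rp>)</rp></ruby>",
        escape_html(base),
        escape_html(reading)
    )
}
```

A client without ruby support ignores the `<ruby>`, `<rt>`, and `<rp>` tags and shows their text in order, so the pair reads as the base followed by the reading in parentheses.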

Serialize output

serialize() always emits the explicit-delimiter form (|base《reading》), so a parse → serialize → parse round-trip is a fixed point regardless of which form the source used.

AST shape

pub struct Ruby<'src> {
    pub base: NonEmpty<Content<'src>>,
    pub reading: NonEmpty<Content<'src>>,
    pub delim_explicit: bool,
}

Both base and reading are NonEmpty<Content>; an empty base or reading is rejected upstream and never produces a Ruby node.

When emitted

Phase 3 classifies a 《…》 pair as ruby when the preceding run is a sequence of CJK / kana / latin glyphs and the close is followed by neither a glyph (which would extend the base further) nor a stray opener.

Diagnostics

  • aozora::lex::unclosed_bracket — unbalanced 《 reaches EOF.
  • aozora::lex::unmatched_close — stray 》 with no matching open.

NodeKind::Bouten

Wire tag: bouten — emphasis dots / sidelines over a target span.

Source examples

青空に[#「青空」に傍点]
青空に[#「青空」に丸傍点]

The bracketed annotation refers backwards to the literal text quoted with 「…」, so the parser resolves the target by string match against the preceding line(s).
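Backward resolution by string match can be sketched as a reverse search over the text that precedes the annotation. `resolve_target` is a hypothetical helper, and picking the occurrence nearest to the annotation is an assumption about the matching policy, not a statement of what the parser does.

```rust
use std::ops::Range;

/// Hypothetical helper: finds the byte range of the occurrence of `quote`
/// closest to the end of `preceding` (the text before the annotation).
pub fn resolve_target(preceding: &str, quote: &str) -> Option<Range<usize>> {
    if quote.is_empty() {
        return None;
    }
    // `rfind` returns the byte offset where the last match starts.
    let start = preceding.rfind(quote)?;
    Some(start..start + quote.len())
}
```

The returned range is in bytes, so it can be used directly to slice the source; a `None` result corresponds to the unresolved case, where the parser falls back to a generic annotation.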

Rendered HTML

<em class="aozora-bouten aozora-bouten-goma aozora-bouten-right">青空</em>に

The two trailing class slots carry the bouten kind (goma, circle, wavy-line, …) and the position (right for vertical text, left for the rare under-side variant).

Serialize output

Round-trips to the explicit [#「target」に<kind>傍点] form.

AST shape

pub struct Bouten<'src> {
    pub kind: BoutenKind,
    pub target: NonEmpty<Content<'src>>,
    pub position: BoutenPosition,
}

BoutenKind enumerates the 11 visual variants (Goma, WhiteSesame, Circle, …); BoutenPosition is Right (default for vertical text) or Left.

When emitted

Phase 3 sees [#「QUOTE」に <slug>傍点] / [#「QUOTE」に <slug>傍線], walks back through the recent text to find QUOTE, and emits the node with the matched span.

Diagnostics

  • aozora::lex::unclosed_bracket — annotation [# opened with no matching ].
  • Annotation (fallback) — quote target unresolved.

Related kinds

  • Annotation — fallback when the target cannot be matched.

NodeKind::TateChuYoko

Wire tag: tateChuYoko — horizontal text inside a vertical writing-mode run (縦中横, “vertical-with-horizontal-inside”).

Source examples

昭和[#「12」は縦中横]年

Rendered HTML

<span class="aozora-tcy">12</span>

Downstream CSS gives the span text-combine-upright: all for proper vertical-writing display.

Serialize output

Round-trips to [#「target」は縦中横].

AST shape

pub struct TateChuYoko<'src> {
    pub text: NonEmpty<Content<'src>>,
}

When emitted

Phase 3 matches the directive [#「TARGET」は縦中横] and resolves TARGET in preceding text, then emits with the matched span.

Diagnostics

aozora::lex::unclosed_bracket if [# is unmatched.

Related kinds

  • Annotation — fallback if target resolution fails.

NodeKind::Gaiji

Wire tag: gaiji — out-of-character-set glyph reference. The historical Aozora-Bunko notation for characters Shift_JIS could not encode; modern files mostly use them for genuine non-Unicode glyphs.

Source examples

※[#「木+吶のつくり」、第3水準1-85-54]

The ※ (U+203B) flags the construct; [#description、mencode] carries the human description and a structured Mojikyō / JIS / U+ identifier.

Rendered HTML

<span class="aozora-gaiji" title="木+吶のつくり" data-mencode="第3水準1-85-54">〓</span>

The fallback glyph 〓 (U+3013, “geta mark”) is the conventional Japanese typesetting placeholder for missing glyphs. When the resolver finds a Unicode mapping the inner text becomes the resolved character instead of the geta mark.
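The fallback rule is small enough to sketch directly. `render_gaiji` and `escape_html` below are hypothetical helpers written for illustration; the attribute names follow the rendered example above, while the escaping policy is an assumption.

```rust
/// Escapes characters that are unsafe in HTML text and attribute values.
fn escape_html(value: &str) -> String {
    value
        .replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
        .replace('"', "&quot;")
}

/// Hypothetical helper, not the crate's renderer. `resolved` is the
/// Unicode character found by a resolver, if any.
pub fn render_gaiji(description: &str, mencode: &str, resolved: Option<char>) -> String {
    // U+3013 GETA MARK stands in for a glyph that cannot be shown.
    let glyph = resolved.unwrap_or('\u{3013}').to_string();
    format!(
        "<span class=\"aozora-gaiji\" title=\"{}\" data-mencode=\"{}\">{}</span>",
        escape_html(description),
        escape_html(mencode),
        escape_html(&glyph)
    )
}
```

Keeping the description and mencode on the element even when a character is resolved means a reader or downstream tool can still see which reference produced the glyph.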

Serialize output

Round-trips to ※[#description、mencode].

AST shape

pub struct Gaiji<'src> {
    pub description: &'src str,
    pub ucs: Option<Resolved>,
    pub mencode: Option<&'src str>,
}

Resolved is either a single Unicode scalar or one of 25 predefined static combining sequences (e.g. か゚, which is か plus the IPA voicing-pair-mark, kept as a static constant so the borrowed-AST stays Copy).

When emitted

Phase 3 sees the ※[#…] digraph and parses the description / mencode payload. The encoding crate’s gaiji resolver lifts the mencode reference into a Unicode character when one exists.

Diagnostics

None on a well-formed ※[#...]. Ambiguous descriptions land as Annotation::Unknown instead of Gaiji.

Related kinds

  • Annotation — fallback when description is malformed.

NodeKind::Indent

Wire tag: indent — single-line [#N字下げ] indent marker.

Source examples

[#2字下げ]
[#3字下げ]もう一段下げる

Rendered HTML

<span class="aozora-indent" data-amount="2"></span>

CSS controls the actual padding (typically padding-inline-start: Nem).

Serialize output

Round-trips to [#N字下げ].

AST shape

pub struct Indent {
    pub amount: u8,
}

When emitted

Phase 3 matches the digraph plus a numeric prefix and emits a single inline marker. For paired indent regions ([#ここから2字下げ][#ここで字下げ終わり]), see Container.

Diagnostics

None on well-formed input.

Related kinds

  • Container — paired indent / dedent regions (ContainerKind::Indent).
  • AlignEnd — right-edge alignment counterpart.

NodeKind::AlignEnd

Wire tag: alignEnd — right-edge alignment marker (字上げ).

Source examples

[#地付き]
[#地から3字上げ]

Rendered HTML

<span class="aozora-align-end" data-offset="0"></span>

offset is 0 for 地付き, N for 地から N 字上げ.
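The mapping from directive text to offset can be sketched as a small parser over the bracket body. `align_end_offset` is a hypothetical helper; accepting both ASCII and full-width digits and rejecting values that overflow `u8` are assumptions made for the sketch.

```rust
/// Hypothetical helper: maps a directive body (the text between `[#` and
/// `]`) to an align-end offset. Accepts ASCII and full-width digits.
pub fn align_end_offset(directive: &str) -> Option<u8> {
    if directive == "地付き" {
        return Some(0);
    }
    let digits = directive.strip_prefix("地から")?.strip_suffix("字上げ")?;
    if digits.is_empty() {
        return None;
    }
    let mut value: u8 = 0;
    for ch in digits.chars() {
        let digit = match ch {
            '0'..='9' => ch as u32 - '0' as u32,
            '０'..='９' => ch as u32 - '０' as u32,
            _ => return None,
        };
        // Checked arithmetic: reject amounts that do not fit the u8 field.
        value = value.checked_mul(10)?.checked_add(digit as u8)?;
    }
    Some(value)
}
```

Returning `None` for anything that is not one of the two recognised forms leaves room for a fallback path, in the same spirit as the generic annotation fallback described elsewhere in this reference.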

Serialize output

Round-trips to [#地付き] / [#地からN字上げ].

AST shape

pub struct AlignEnd {
    pub offset: u8,
}

When emitted

Phase 3 matches the directive form. Paired alignment regions ([#ここから地から N 字上げ][#ここで字上げ終わり]) are Container instead.

Diagnostics

None.

Related kinds

  • Indent — left-edge counterpart.
  • Container — paired regions (ContainerKind::AlignEnd).

NodeKind::Warichu

Wire tag: warichu — split-line annotation (割注). Two text runs are stacked into a single line of the surrounding text.

Source examples

[#割り注]上の段/下の段[#割り注終わり]

Rendered HTML

<span class="aozora-warichu">
  <span class="aozora-warichu-upper">上の段</span>
  <span class="aozora-warichu-lower">下の段</span>
</span>

Serialize output

Round-trips to the explicit [#割り注].../...[#割り注終わり].

AST shape

pub struct Warichu<'src> {
    pub upper: Content<'src>,
    pub lower: Content<'src>,
}

upper / lower are plain Content; empty halves are valid (one-sided warichu).

When emitted

The single-line [#割り注]...[#割り注終わり] form is inline-classified; multi-line [#割注] containers become a Container of kind Warichu.

Diagnostics

None on well-formed input.

NodeKind::Keigakomi

Wire tag: keigakomi — ruled-box annotation (罫囲み).

Source examples

[#罫囲み]本文[#罫囲み終わり]

Rendered HTML

<span class="aozora-keigakomi"></span>

(Inline marker; the multi-line container form yields a <div class="aozora-container-keigakomi"> wrapper instead — see Container.)

Serialize output

Round-trips to [#罫囲み]...[#罫囲み終わり].

AST shape

pub struct Keigakomi;

Marker struct with no payload — the surrounding text carries the content.

When emitted

Phase 3 sees the inline form. Multi-line keigakomi blocks classify as Container Keigakomi.

Diagnostics

None on well-formed input.

NodeKind::PageBreak

Wire tag: pageBreak — [#改ページ] page break marker.

Source examples

end of chapter
[#改ページ]
beginning of next chapter

Rendered HTML

<div class="aozora-page-break"></div>

CSS gives the div a page-break-before: always for paged media (EPUB / print).

Serialize output

Round-trips to [#改ページ]\n.

AST shape

AozoraNode::PageBreak is a unit variant — no payload.

When emitted

Phase 3 sees [#改ページ] and emits a single BlockLeaf classification covering the whole bracket span.

Diagnostics

None on well-formed input.

NodeKind::SectionBreak

Wire tag: sectionBreak — section breaks (改丁 / 改段 / 改見開き).

Source examples

[#改丁]
[#改段]
[#改見開き]

Rendered HTML

<div class="aozora-section-break aozora-section-break-kaicho"></div>

The second class slot carries the variant slug (kaicho, kaidan, kaimihiraki, other).

Serialize output

Round-trips to [#改丁] etc.

AST shape

AozoraNode::SectionBreak(SectionKind)

SectionKind is Choho (改丁) / Dan (改段) / Spread (改見開き).

When emitted

Phase 3 matches each directive; the kind enum captures which.

Diagnostics

None on well-formed input.

Related kinds

  • PageBreak — finer-grained [#改ページ] variant.

NodeKind::AozoraHeading

Wire tag: heading — Aozora 見出し (window / sub heading).

Source examples

[#見出し]序章[#見出し終わり]

Rendered HTML

<h2 class="aozora-heading aozora-heading-window">序章</h2>

The Pandoc projection uses level 2 for Window, level 3 for Sub.

Serialize output

Round-trips to [#<kind>見出し]...[#<kind>見出し終わり].

AST shape

pub struct AozoraHeading<'src> {
    pub kind: AozoraHeadingKind,
    pub text: NonEmpty<Content<'src>>,
}

AozoraHeadingKind is Window (窓見出し) or Sub (副見出し).

When emitted

Phase 3 matches the keyword 見出し family and binds the body run.

Diagnostics

None on well-formed input.

Related kinds

  • HeadingHint — forward-reference style heading hint.

NodeKind::HeadingHint

Wire tag: headingHint — forward-reference heading hint ([#「target」は中見出し]).

Source examples

序章
[#「序章」は中見出し]

The hint refers to a quoted target string in the preceding line(s); downstream renderers pick this up as “promote the matched run to a heading.”

Rendered HTML

The marker itself emits no visible content; renderers that honour the hint elevate the previously-matched span to a <h2> / <h3> retroactively. The default HTML renderer in aozora-render emits a structural marker comment.

Serialize output

Round-trips to [#「target」は<level>見出し].

AST shape

pub struct HeadingHint<'src> {
    pub level: u8,
    pub target: NonEmptyStr<'src>,
}

level follows the Aozora convention: 1=大見出し, 2=中見出し, 3=小見出し.

When emitted

Phase 3 matches the directive and records the level + target. Empty target is rejected and falls through to plain text.
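The level convention and the empty-target rule can be sketched together. `heading_level` and `parse_heading_hint` are hypothetical helpers that work on the directive body (the text between `[#` and `]`); they are not the crate's classifier.

```rust
/// Hypothetical helper: maps the heading word to its numeric level.
pub fn heading_level(word: &str) -> Option<u8> {
    match word {
        "大見出し" => Some(1),
        "中見出し" => Some(2),
        "小見出し" => Some(3),
        _ => None,
    }
}

/// Splits a directive body such as `「序章」は中見出し` into (target, level).
pub fn parse_heading_hint(body: &str) -> Option<(&str, u8)> {
    let rest = body.strip_prefix('「')?;
    let (target, tail) = rest.split_once("」は")?;
    if target.is_empty() {
        // Mirrors the rule above: an empty target is rejected.
        return None;
    }
    Some((target, heading_level(tail)?))
}
```

The target is returned as a borrowed slice of the input, which matches the zero-copy style used by the AST shapes in this reference.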

Diagnostics

None on well-formed input.

NodeKind::Sashie

Wire tag: sashie — illustration reference (挿絵).

Source examples

[#挿絵(cover.png)入る]
[#挿絵(pages/03.jpg、第3章扉絵)入る]

Rendered HTML

<figure class="aozora-sashie">
  <img src="cover.png" alt="">
</figure>

When a caption is present it lands as a <figcaption> next to the <img>.

Serialize output

Round-trips to [#挿絵(path[、caption])入る].

AST shape

pub struct Sashie<'src> {
    pub file: NonEmptyStr<'src>,
    pub caption: Option<Content<'src>>,
}

Empty file is rejected upstream — the construct cannot ship a nameless image.

When emitted

Phase 3 matches the 挿絵(…)入る digraph and parses out the path and the optional caption.

Diagnostics

None on well-formed input.

Related kinds

  • Annotation — fallback when the directive is malformed.

NodeKind::Kaeriten

Wire tag: kaeriten — kanbun reading-order marker (返り点).

Source examples

読[#返り点 一・二]本

Rendered HTML

<sup class="aozora-kaeriten" data-mark="一・二"></sup>

CSS positions the sup glyph appropriately for vertical / horizontal writing mode.

Serialize output

Round-trips to [#返り点 mark].

AST shape

pub struct Kaeriten<'src> {
    pub mark: NonEmptyStr<'src>,
}

When emitted

Phase 3 matches 返り点 keyword + marker payload. Empty marker rejected upstream.

Diagnostics

None on well-formed input.

Related kinds

None.

NodeKind::Annotation

Wire tag: annotation — generic [#...] annotation that no specific recogniser claimed.

Source examples

text[#任意のメモ]more
text[#ふりがな付きの説明]more

Rendered HTML

<span class="aozora-annotation" title="..."></span>

The default renderer suppresses the body; downstream filters can match on aozora-annotation to surface the comment.

Serialize output

Round-trips to [#<raw>].

AST shape

pub struct Annotation<'src> {
    pub raw: NonEmptyStr<'src>,
    pub kind: AnnotationKind,
}

AnnotationKind discriminates the recognised sub-variants (Unknown, AsIs, TextualNote, InvalidRubySpan, …); raw carries the raw bracket body for any further analysis.

When emitted

Phase 3 reaches [#...] after no specific recogniser matched. Annotation is the fallback that always preserves the user’s content rather than dropping it.

Diagnostics

None — Annotation is the recovery path for unrecognised directives. A genuine invalid-bracket diagnostic (unclosed_bracket / unmatched_close) appears separately.

NodeKind::AngleQuote

Wire tag: angleQuote — double-angle quotation (二重山括弧).

A 底本’s twin angle brackets 《…》 would collide with the ruby markers 《…》 (U+300A/U+300B), so Aozora Bunko input encodes them as ≪…≫ (U+226A/U+226B). The renderer restores the display form 《…》.

Source examples

≪重要≫

底本 《重要》 → aozora text ≪重要≫ → display 《重要》.

Rendered HTML

<span class="aozora-angle-quote">《重要》</span>

The 《…》 display glyphs (U+300A/U+300B) are restored inside the span; stylesheets target .aozora-angle-quote for any further treatment.
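The glyph substitution can be sketched as a strip-and-rewrap over one ≪…≫ pair. `render_angle_quote` is a hypothetical helper, not the crate's renderer, and the inline escaping is an assumption.

```rust
/// Hypothetical helper: renders one ≪…≫ pair, restoring the 《…》 display
/// glyphs. Returns None when the input is not a non-empty ≪…≫ pair.
pub fn render_angle_quote(source: &str) -> Option<String> {
    let inner = source
        .strip_prefix('\u{226A}')? // ≪ input opener
        .strip_suffix('\u{226B}')?; // ≫ input closer
    if inner.is_empty() {
        return None;
    }
    let escaped = inner
        .replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;");
    // U+300A / U+300B are the display glyphs restored inside the span.
    Some(format!(
        "<span class=\"aozora-angle-quote\">\u{300A}{}\u{300B}</span>",
        escaped
    ))
}
```

Because the input form and the display form use different code points, the reverse direction (serialising back to ≪…≫) never confuses the quote with a ruby reading marker.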

Serialize output

Round-trips to the input form ≪content≫ (U+226A/U+226B).

AST shape

pub struct AngleQuote<'src> {
    pub content: NonEmpty<Content<'src>>,
}

content is NonEmpty — empty ≪≫ is rejected upstream and falls through to plain text rather than producing an empty node.

When emitted

Phase 1 tokenises ≪ / ≫ (U+226A/U+226B) as ordinary single-character triggers; Phase 3 pairs ≪…≫ into one AngleQuote node. A stray 底本-style 《《…》》 is not this node: it is two ruby openers and yields a nested-ruby diagnostic with plain fallback.

Diagnostics

  • aozora::lex::unclosed_bracket — ≪ reaches EOF without ≫.
  • aozora::lex::unmatched_close — stray ≫ with no matching open.

Related kinds

  • Ruby — 《…》 reading marker (the colliding notation).

NodeKind::Container

Wire tag: container — paired-container wrapping ([#ここから...]...[#ここで...終わり]).

Source examples

[#ここから2字下げ]
 第一節
 第二節
[#ここで字下げ終わり]

[#罫囲み]
本文
[#罫囲み終わり]

[#地から3字上げ]
寄付者一覧
[#字上げ終わり]

Rendered HTML

<div class="aozora-container-indent" data-amount="2">
  ...
</div>

The wrapping div carries the kind-specific class (aozora-container-indent, aozora-container-warichu, aozora-container-keigakomi, aozora-container-align-end) plus any structural data (indent amount, align offset) on data-*.

Serialize output

Round-trips to the explicit-paired directive form.

AST shape

pub struct Container {
    pub kind: ContainerKind,
}

pub enum ContainerKind {
    Indent { amount: u8 },
    Warichu,
    Keigakomi,
    AlignEnd { offset: u8 },
}

The Container payload appears wrapping the content — the actual walker driver fires visit_container_open on enter and visit_container_close on exit so renderers wrap the body cleanly.
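The open and close hooks can be sketched as two functions over `ContainerKind`. This is an illustration under stated assumptions: `open_tag`, `close_tag`, and `wrap` are hypothetical names standing in for the visitor callbacks, and `data-offset` is an assumed attribute name for the align-end container.

```rust
/// Mirrors the enum shown above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ContainerKind {
    Indent { amount: u8 },
    Warichu,
    Keigakomi,
    AlignEnd { offset: u8 },
}

/// Hypothetical counterpart of `visit_container_open`.
pub fn open_tag(kind: ContainerKind) -> String {
    match kind {
        ContainerKind::Indent { amount } => format!(
            "<div class=\"aozora-container-indent\" data-amount=\"{}\">",
            amount
        ),
        ContainerKind::Warichu => "<div class=\"aozora-container-warichu\">".to_string(),
        ContainerKind::Keigakomi => "<div class=\"aozora-container-keigakomi\">".to_string(),
        // `data-offset` is an assumed attribute name.
        ContainerKind::AlignEnd { offset } => format!(
            "<div class=\"aozora-container-align-end\" data-offset=\"{}\">",
            offset
        ),
    }
}

/// Hypothetical counterpart of `visit_container_close`.
pub fn close_tag(_kind: ContainerKind) -> &'static str {
    "</div>"
}

/// Wraps already-rendered body HTML in the container's open / close pair.
pub fn wrap(kind: ContainerKind, body_html: &str) -> String {
    format!("{}{}{}", open_tag(kind), body_html, close_tag(kind))
}
```

Splitting the work into an open and a close step lets a streaming renderer emit the wrapper without buffering the body.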

When emitted

Phase 2 pairs the [#ここから…] / [#ここで…終わり] openers and closers; Phase 3’s BlockOpen / BlockClose events project to this variant.

Diagnostics

unclosed_bracket for unbalanced opens.

NodeKind::ContainerOpen

Wire tag: containerOpen — paired-container open boundary marker.

This variant only appears in NodeRef-flavoured wire output (e.g. serialize_nodes); the structural AozoraNode::Container payload covers the wrapping construct itself.

Source examples

[#ここから2字下げ]     <- ContainerOpen
indented body
[#ここで字下げ終わり]   <- ContainerClose

Rendered HTML

The default HTML renderer routes the open / close pair through visit_container_open / visit_container_close and emits the opening <div class="aozora-container-..."> wrapping the body.

Serialize output

Round-trips together with the matching close to the [#ここから…]...[#ここで…終わり] form.

AST shape

NodeRef::BlockOpen(ContainerKind) — see ContainerKind.

When emitted

Phase 2 pairs the open / close brackets; Phase 3’s normalised text emits a BlockOpen PUA sentinel at the position of the opener so the registry can dispatch the open event during walking.

Diagnostics

unclosed_bracket if the open never finds a matching close.

NodeKind::ContainerClose

Wire tag: containerClose — paired-container close boundary marker.

NodeRef-only counterpart of ContainerOpen.

Source examples

[#ここから2字下げ]     <- ContainerOpen
body
[#ここで字下げ終わり]   <- ContainerClose

Rendered HTML

Routed through visit_container_close; the default renderer emits the closing </div> of the <div class="aozora-container-..."> opened by the matching ContainerOpen.

Serialize output

Round-trips with the matching open.

AST shape

NodeRef::BlockClose(ContainerKind).

When emitted

Phase 3 normalised-text emits a BlockClose PUA sentinel at the matching close position.

Diagnostics

unmatched_close if the close has no open partner — in which case no ContainerClose is emitted and the close-bracket bytes flow through as plain.
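The open / close pairing that the three container pages describe can be sketched with a stack. This is a simplification under stated assumptions: `Event`, `Pairing`, and `pair_containers` are hypothetical names, and the sketch ignores container kinds, which a real pairing pass would presumably also check.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    Open,
    Close,
}

#[derive(Debug, PartialEq, Eq)]
pub struct Pairing {
    /// (open index, close index) pairs, in the order the closes appear.
    pub pairs: Vec<(usize, usize)>,
    /// Opens that never found a close (cf. `unclosed_bracket`).
    pub unclosed: Vec<usize>,
    /// Closes with no open partner (cf. `unmatched_close`).
    pub unmatched: Vec<usize>,
}

/// Pairs each close with the most recent still-open opener.
pub fn pair_containers(events: &[Event]) -> Pairing {
    let mut stack = Vec::new();
    let mut pairs = Vec::new();
    let mut unmatched = Vec::new();
    for (index, event) in events.iter().enumerate() {
        match event {
            Event::Open => stack.push(index),
            Event::Close => match stack.pop() {
                Some(open) => pairs.push((open, index)),
                None => unmatched.push(index),
            },
        }
    }
    // Whatever is left on the stack was opened but never closed.
    Pairing { pairs, unclosed: stack, unmatched }
}
```

A close that lands in `unmatched` produces no pair, which lines up with the behaviour above where no ContainerClose is emitted for a stray closer.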

Notation overview

青空文庫記法 is a small, line-oriented annotation language layered inside a plain-text Japanese document. Authors mark up the text in two distinct registers:

  1. Inline markers — single-character sigils (｜, 《, 》, ※) that fence inline annotations directly inside the prose.
  2. Block annotations — [#…] brackets containing a Japanese directive in natural language (“ここから2字下げ”, “「X」に傍点”, …) that act as openers, closers, or self-contained directives.

aozora recognises every annotation that survives in real Aozora Bunko sources — the volunteer corpus has ~17 000 works in active rotation, and the parser is exercised against the entire archive in CI as part of the corpus sweep.

Notations covered

| Chapter | What it marks |
| --- | --- |
| Ruby | Pronunciation glosses (｜青梅《おうめ》, 青梅《おうめ》). |
| Bouten / bousen | Emphasis dots and lines: 傍点 (sesame, white sesame, filled circle, open circle, …) and 傍線 (single, double, dashed, …). |
| 縦中横 | Horizontally-set runs inside vertical text ([#「数字」は縦中横]). |
| Gaiji | Out-of-Shift_JIS character references (※[#…、第3水準1-85-54]) and accented-Latin decomposition. |
| Kunten | 漢文 reading marks: 返り点, 再読文字, 送り仮名. |
| Indent containers | [#ここから2字下げ]… [#ここで字下げ終わり] and the geji / 地付き / 地寄せ family. |
| Page & section breaks | 改ページ, 改丁, 改見開き, 改段. |
| Diagnostics | The catalogue of structured diagnostics the parser emits. |

Spec source of truth

The authoritative spec lives at https://www.aozora.gr.jp/annotation/index.html. A snapshot is vendored at docs/specs/aozora/ in the repo so that every page in this handbook can link to a stable fragment (the upstream HTML reorganises occasionally; the snapshot shields chapter cross-references from rot).

When this handbook says “the spec says X”, that means that snapshot. Where the live spec drifts, we update the snapshot, then update the parser, then update this handbook — in that order.

How a sample input looks

|青梅《おうめ》街道を歩いて、※[#「魚+師のつくり」、第3水準1-94-37]を見た。
[#ここから2字下げ]
 [#「平和」に傍点]という言葉は、もう古い。
[#ここで字下げ終わり]
[#改ページ]

That single sample exercises ruby, gaiji, indent containers, bouten, and a page break. The parser turns it into a flat node stream — see the per-chapter pages for the exact AST shapes.

Notation we deliberately omit

Aozora Bunko’s spec mentions a handful of annotations that don’t appear in the maintained corpus:

  • Image references beyond [#挿絵] — covered up to the caption, no actual image rendering.
  • キャプション alignment edge cases that the spec lists but no active work uses (verified against the corpus sweep).

These are kept as a generic Annotation{Unknown} and rendered best-effort (the “no bare [#” guarantee still holds); a ここから… opener that names no known container also emits unrecognised_container_directive. Adding full support is a one-PR job once a real corpus document needs it.

Ruby (|青梅《おうめ》)

Ruby is a pronunciation gloss attached to a run of base text. In 青空文庫 source it appears in two shapes:

|青梅《おうめ》            ← explicit-base form
青梅《おうめ》              ← implicit-base form (auto-detect)

Both forms render the same HTML:

<ruby>青梅<rt>おうめ</rt></ruby>

Explicit base (|…《…》)

The full-width vertical bar | (U+FF5C) marks the start of the base text; 《…》 (U+300A / U+300B) wraps the reading. The base runs from the | to the 《. Use this form when:

  • The base contains characters that the auto-detect heuristic would otherwise skip (kana, ASCII letters, mixed scripts).
  • The boundary between base and surrounding text is ambiguous.
|山田《やまだ》さん         → <ruby>山田<rt>やまだ</rt></ruby>さん
|HTTP《ハイパー・テキスト》 → <ruby>HTTP<rt>ハイパー・テキスト</rt></ruby>

Implicit base

When 《…》 follows a run of kanji without a leading |, the parser auto-detects the base by scanning backwards through the kanji run. The auto-detect terminates at the first non-kanji character (kana, punctuation, ASCII, full-width digit).

青梅《おうめ》     → <ruby>青梅<rt>おうめ</rt></ruby>
お青梅《おうめ》   → お<ruby>青梅<rt>おうめ</rt></ruby>

The “kanji” predicate is CJK Unified Ideographs + CJK Compatibility Ideographs + CJK Unified Ideographs Extension A–F + the iteration mark 々. JIS X 0213 plane-2 ideographs not in Unicode are represented as gaiji references (see Gaiji) and likewise terminate the auto-detect.
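
The backward scan can be sketched in a few lines of Rust. This is only an illustration of the rule described above, not the crate's implementation: is_kanji and split_implicit_base are hypothetical names, and the Extension B to F range is written as one span, so it also covers a few unassigned gaps between those blocks.

```rust
/// Approximate "kanji" predicate: the Unicode blocks named in the text plus
/// the iteration mark. Hypothetical helper, not the crate's API.
fn is_kanji(c: char) -> bool {
    matches!(
        c,
        '\u{4E00}'..='\u{9FFF}'          // CJK Unified Ideographs
            | '\u{3400}'..='\u{4DBF}'    // Extension A
            | '\u{F900}'..='\u{FAFF}'    // CJK Compatibility Ideographs
            | '\u{20000}'..='\u{2EBEF}'  // Extensions B through F (one span)
            | '\u{3005}'                 // iteration mark 々
    )
}

/// Splits the text that precedes `《` into (untouched prefix, implicit base).
/// The base is the longest trailing run of kanji; it may be empty.
fn split_implicit_base(before: &str) -> (&str, &str) {
    let mut start = before.len();
    // Walk backwards one scalar at a time and stop at the first non-kanji.
    for (idx, ch) in before.char_indices().rev() {
        if !is_kanji(ch) {
            break;
        }
        start = idx;
    }
    before.split_at(start)
}
```

An empty base (for example when the text before 《 ends in kana) corresponds to the "literal text" row in the edge-case table: there is nothing to anchor the reading to.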

Empty reading

|青梅《》 supplies a base but an empty reading. The lexer emits aozora::lex::empty_ruby_reading (an Error) and the construct degrades to plain text — no Ruby node is built.

The implicit-base form silently skips a 《》 with empty contents — the parser can’t be sure a base was intended, so it treats the bare 《》 as literal text and stays silent.

Nested ruby (forbidden)

The spec disallows ruby inside ruby. A reading whose body opens another 《…》 (e.g. |漢《か《ん》じ》) fires aozora::lex::nested_ruby; the outer ruby is still parsed best-effort. (An adjacent 《《…》》 is a different construct — double-bracket bouten — not nested ruby.)

AST shape

pub struct Ruby<'src> {
    pub base:           NonEmpty<Content<'src>>,  // never empty
    pub reading:        NonEmpty<Content<'src>>,  // never empty
    pub delim_explicit: bool,                     // true for the |…《…》 form
}

base and reading are Content values (a Plain(&str) fast path or a Segments run carrying nested gaiji / annotations), wrapped in NonEmpty so an empty payload is unrepresentable: Phase 3 only emits a Ruby once both sides have content (an empty reading takes the empty-reading path instead). delim_explicit records whether the source used the |…《…》 form, so the serializer re-emits the | only when the original did.
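
To make the "unrepresentable empty payload" idea concrete, here is a minimal sketch of a non-empty wrapper. It is an assumption-laden simplification: NonEmptyStr, RubySketch and build_ruby are hypothetical names, and the payload is a plain &str rather than the crate's Content type.

```rust
/// A string slice that is guaranteed to be non-empty.
/// Hypothetical stand-in for the NonEmpty wrapper described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct NonEmptyStr<'a>(&'a str);

impl<'a> NonEmptyStr<'a> {
    /// The only constructor: an empty slice is rejected here, once.
    fn new(s: &'a str) -> Option<Self> {
        if s.is_empty() { None } else { Some(Self(s)) }
    }

    fn as_str(self) -> &'a str {
        self.0
    }
}

/// Simplified ruby node: both sides are non-empty by construction.
#[derive(Debug, PartialEq, Eq)]
struct RubySketch<'a> {
    base: NonEmptyStr<'a>,
    reading: NonEmptyStr<'a>,
    delim_explicit: bool,
}

/// Returns None when either side is empty, so the caller has to take
/// another path (such as the empty-reading path) instead of building a node.
fn build_ruby<'a>(base: &'a str, reading: &'a str, delim_explicit: bool) -> Option<RubySketch<'a>> {
    Some(RubySketch {
        base: NonEmptyStr::new(base)?,
        reading: NonEmptyStr::new(reading)?,
        delim_explicit,
    })
}
```

The benefit of this shape is that downstream code (renderer, serializer) never needs an "is it empty?" branch: the check happens once at construction.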

Edge cases

| Input | Output |
| --- | --- |
| 青梅《おうめ》 | <ruby>青梅<rt>おうめ</rt></ruby> |
| \|青梅《おうめ》 | <ruby>青梅<rt>おうめ</rt></ruby> (canonical-equivalent) |
| \|山田《やまだ》 | <ruby>山田<rt>やまだ</rt></ruby> |
| \|HTTP《ハイパー・テキスト》 | <ruby>HTTP<rt>ハイパー・テキスト</rt></ruby> |
| お青梅《おうめ》 | お<ruby>青梅<rt>おうめ</rt></ruby> (auto-detect skips kana) |
| 1青梅《おうめ》 | 1<ruby>青梅<rt>おうめ</rt></ruby> (auto-detect skips digit) |
| \|青梅《》 | plain text + empty_ruby_reading |
| 《おうめ》 | literal text (no preceding kanji to anchor) |
| \|漢《か《ん》じ》 | best-effort ruby + nested_ruby |

Bouten / bousen (傍点・傍線)

Bouten (傍点) are emphasis dots placed beside characters in vertical text, the Japanese typographic equivalent of italic or bold. Bousen (傍線) are the same idea with a line instead of dots. aozora recognises eleven variants in total: eight dot variants and three line variants (listed in the variant catalogue).

Notation forms

Three forms are accepted; the two quoting styles are both common in the real corpus:

[#「平和」に傍点]           ← target-by-quoting
平和[#「平和」に傍点]        ← redundant explicit copy (also accepted)
[#傍点]平和[#傍点終わり]     ← range form (bare opener / closer)

The target-by-quoting form is by far the most common: the inline annotation looks backwards in the text for the most recent occurrence of the quoted string and applies the bouten to that run.

Variant catalogue

aozora recognises eleven variants — eight 点 (dot) families and three 線 (line) families:

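| Slug | Source keyword | Family |
| --- | --- | --- |
| goma | 傍点 | 点 |
| white-sesame | 白ゴマ傍点 | 点 |
| circle | 丸傍点 | 点 |
| white-circle | 白丸傍点 | 点 |
| double-circle | 二重丸傍点 | 点 |
| janome | 蛇の目傍点 | 点 |
| cross | ばつ傍点 | 点 |
| white-triangle | 白三角傍点 | 点 |
| wavy-line | 波線 | 線 |
| under-line | 傍線 | 線 |
| double-under-line | 二重傍線 | 線 |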

Each variant has a stable slug that the HTML renderer emits as a class name (e.g. <em class="aozora-bouten-goma">). The 点/線 family boundary is what mismatched_bouten_container checks for the range form below.

Default rendering

aozora emits an <em> element carrying an aozora-bouten-<slug> class (next to the generic aozora-bouten class and a position class, as the example below shows) so that an external stylesheet can pick the visual treatment per variant. Styling is left to the consumer; the parser ships no stylesheet of its own.

<!-- 平和[#「平和」に傍点] -->
平和<em class="aozora-bouten aozora-bouten-goma aozora-bouten-right">平和</em>

(The redundant copy is intentional — the [#…] indirection re-emits the target wrapped in <em>, leaving the original run in place. The HTML rendering matches what print Aozora Bunko output does in practice.)

Range form

To emphasise a run directly (rather than by quoting it), wrap it between a bare opener and its matching closer — note there is no ここから / ここで (those prefixes are for block layout / 太字 / 斜体, not 傍点):

彼は[#傍点]必ず[#傍点終わり]来る
本文[#二重傍線]乙[#二重傍線終わり]
[#左に傍線]丙[#左に傍線終わり]

Renders inline as <em>:

彼は<em class="aozora-bouten aozora-bouten-goma aozora-bouten-right">必ず</em>来る

The opener can be any variant keyword (傍点, 白丸傍点, 二重傍線, …), with an optional 左に prefix for left-side marks; the closer is the same keyword plus 終わり. The closer’s family must match the opener’s: a 点 opener ([#傍点]) pairs with a 点 closer ([#傍点終わり]), a 線 opener ([#傍線]) with a 線 closer ([#傍線終わり]). A family mismatch fires mismatched_bouten_container.

AST shape

Both the indirect ([#「X」に傍点]) and range ([#傍点]…[#傍点終わり]) forms produce Bouten nodes:

pub struct Bouten<'src> {
    pub kind:     BoutenKind,            // one of 11 variants (点 / 線)
    pub target:   NonEmpty<Content<'src>>, // the emphasised run
    pub position: BoutenPosition,        // Right (default) | Left (左に…)
    pub consumed_predecessor: bool,      // whether it absorbed the run before it
}

BoutenKind is a flat enum (BoutenKind::is_line splits 点 from 線); see the rustdoc for the exact variant list.
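
As a rough illustration of how a flat enum ties together the keyword, the slug, and the 点 / 線 split, the sketch below mirrors the variant catalogue. The variant names and the from_keyword / slug / same_family helpers are hypothetical; only is_line is named in this handbook, and the real variant list lives in the rustdoc.

```rust
/// Hypothetical mirror of the eleven catalogue rows.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BoutenKind {
    Goma, WhiteSesame, Circle, WhiteCircle, DoubleCircle, Janome, Cross, WhiteTriangle,
    WavyLine, UnderLine, DoubleUnderLine,
}

impl BoutenKind {
    /// Maps a source keyword (without any 左に prefix) to a variant.
    fn from_keyword(keyword: &str) -> Option<Self> {
        Some(match keyword {
            "傍点" => Self::Goma,
            "白ゴマ傍点" => Self::WhiteSesame,
            "丸傍点" => Self::Circle,
            "白丸傍点" => Self::WhiteCircle,
            "二重丸傍点" => Self::DoubleCircle,
            "蛇の目傍点" => Self::Janome,
            "ばつ傍点" => Self::Cross,
            "白三角傍点" => Self::WhiteTriangle,
            "波線" => Self::WavyLine,
            "傍線" => Self::UnderLine,
            "二重傍線" => Self::DoubleUnderLine,
            _ => return None,
        })
    }

    /// Stable slug used in the class name.
    fn slug(self) -> &'static str {
        match self {
            Self::Goma => "goma",
            Self::WhiteSesame => "white-sesame",
            Self::Circle => "circle",
            Self::WhiteCircle => "white-circle",
            Self::DoubleCircle => "double-circle",
            Self::Janome => "janome",
            Self::Cross => "cross",
            Self::WhiteTriangle => "white-triangle",
            Self::WavyLine => "wavy-line",
            Self::UnderLine => "under-line",
            Self::DoubleUnderLine => "double-under-line",
        }
    }

    /// True for the 線 family, false for the 点 family.
    fn is_line(self) -> bool {
        matches!(self, Self::WavyLine | Self::UnderLine | Self::DoubleUnderLine)
    }
}

/// Range-form check: a closer is acceptable when it is in the opener's family.
fn same_family(open: BoutenKind, close: BoutenKind) -> bool {
    open.is_line() == close.is_line()
}
```

Because both match expressions are exhaustive, adding a variant without a slug is a compile error rather than a silent rendering gap.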

縦中横 (tate-chū-yoko)

縦中横 (tate-chū-yoko, “horizontal in vertical”) is a typographic construct that lays a short run — usually digits, Latin letters, or mixed punctuation — horizontally inside otherwise vertical text. In print, it is the common treatment for two- or three-digit numbers in a vertical paragraph.

Notation

The annotation always uses the indirect-quoting form:

昭和27年生まれ[#「27」は縦中横]

Renders as:

昭和<span class="aozora-tcy">27</span>年生まれ

The [#…] directive looks back through the most recent text and applies the tcy treatment to the most recent occurrence of the quoted run. The target text is not re-emitted — the wrapper is applied in place, unlike bouten.

Container form

For longer mixed-orientation runs (multi-line table data, Latin abbreviations spanning a paragraph), the container form sits inside an outer indent block:

[#ここから縦中横]
27 / 100 = 0.27
[#ここで縦中横終わり]

Renders as:

<div class="aozora-tcy-block">
27 / 100 = 0.27
</div>

Common targets

| Source | Output |
| --- | --- |
| 27[#「27」は縦中横] | <span class="aozora-tcy">27</span> |
| 100%[#「100」は縦中横] | <span class="aozora-tcy">100</span>% |
| A4[#「A4」は縦中横] | <span class="aozora-tcy">A4</span> |
| &[#「&」は縦中横] | <span class="aozora-tcy">&amp;</span> |

(HTML escapes are handled by the renderer, not the AST.)
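
A minimal sketch of that split, assuming the AST stores the raw run and the renderer escapes it on output. escape_html and render_tcy are hypothetical helper names, and the escape set shown (&, <, >, ") is an assumption about what a renderer would cover.

```rust
/// Escapes the characters that are unsafe in HTML text and attribute values.
fn escape_html(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for ch in text.chars() {
        match ch {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            _ => out.push(ch),
        }
    }
    out
}

/// Wraps a raw (unescaped) tcy run in the class-hook span.
fn render_tcy(raw: &str) -> String {
    format!("<span class=\"aozora-tcy\">{}</span>", escape_html(raw))
}
```

Keeping the raw text in the tree means a non-HTML consumer (a serializer, a plain-text exporter) never has to undo an escape.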

Anchor lookup

The lookup that finds the target run:

  1. Scans backwards from the [#…] directive through the current line.
  2. Stops at the first match for the quoted run.
  3. Falls through to the previous line if no match (with an upper bound of 64 KiB or one paragraph break, whichever comes first).

If no match is found, diagnostic aozora::lex::tcy_target_not_found fires and the directive degrades to a plain Annotation{Unknown}. Authors get the same look-back semantics they’d get from bouten — see Bouten for the symmetric case.
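
The look-back rules can be approximated as a bounded reverse search. This is a sketch under stated assumptions, not the lexer's code: find_anchor and MAX_LOOKBACK are hypothetical names, a paragraph break is assumed to be a blank line ("\n\n"), and a single rfind over the window stands in for the line-by-line scan (both pick the most recent match).

```rust
use std::ops::Range;

/// Upper bound on the look-back window, per the 64 KiB limit above.
const MAX_LOOKBACK: usize = 64 * 1024;

/// Returns the byte range of the most recent occurrence of `target` in
/// `before` (the text preceding the directive), or None if the window
/// holds no match.
fn find_anchor(before: &str, target: &str) -> Option<Range<usize>> {
    if target.is_empty() {
        return None;
    }
    // Clamp the window to the last 64 KiB, nudged forward to a char boundary.
    let mut lo = before.len().saturating_sub(MAX_LOOKBACK);
    while !before.is_char_boundary(lo) {
        lo += 1;
    }
    // Assumption: a paragraph break is a blank line; never look past it.
    if let Some(pos) = before[lo..].rfind("\n\n") {
        lo += pos + 2;
    }
    // The most recent match inside the window wins.
    let start = lo + before[lo..].rfind(target)?;
    Some(start..start + target.len())
}
```

A None result is the case that would surface as tcy_target_not_found.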

Why a span, not a flow rotation?

Web renderers reach for writing-mode: horizontal-tb inside a writing-mode: vertical-rl parent, but that has poor browser support and breaks line-break propagation. aozora’s HTML output uses a single class hook (<span class="aozora-tcy">) so the consuming stylesheet can decide:

  • print stylesheet → font-feature-settings: "vert"; text-combine-upright: all;
  • screen stylesheet → leave horizontal, set monospace
  • e-book renderer → use the renderer’s native tcy primitive

Pushing this decision into the HTML output (e.g. emitting an inline SVG with rotated glyphs) would lock consumers into a specific typographic model. The class-hook output keeps the HTML semantic and defers presentation to the consumer.

AST shape

pub struct Tcy<'src> {
    pub text: &'src str,
    pub form: TcyForm,    // Inline | Container
    pub span: Span,
}

See also

  • Indent containers — tcy commonly appears inside 字下げ blocks; the parser applies tcy after the indent fence is established so the look-back search is bounded by the inner block.

Gaiji (外字 references)

Aozora Bunko predates ubiquitous Unicode support; many works still ship as Shift_JIS source. Characters that don’t fit in Shift_JIS — JIS X 0213 plane-2 ideographs, accented Latin letters, ad-hoc combining marks — appear in source as gaiji references:

※[#「魚+師のつくり」、第3水準1-94-37]
※[#「彳+寺」、U+5F85、393-13]
※[#濁点付き片仮名ヰ]

The leading ※ (U+203B, reference mark) opens the annotation; the [#…] body describes the character in three orthogonal ways:

  1. A descriptive name in Japanese (「魚+師のつくり」 — “魚 plus the right-hand side of 師”) for human readers.
  2. A JIS X 0213 plane / row / cell triple (第3水準1-94-37 — plane 1, row 94, cell 37).
  3. A Unicode codepoint (U+5F85) when the character has one.

aozora resolves gaiji references through a compile-time PHF lookup table built from the JIS X 0213 official mapping plus the Unicode UCS register, with the descriptive name as a tertiary fallback.

Why a compile-time table?

The gaiji table has ~14 000 entries. Loading it at runtime from a JSON / TOML asset would:

  • Add a startup cost on every Document::new (the parser is supposed to start reading bytes within microseconds).
  • Force every binding (CLI, WASM, FFI, Python wheel) to ship the table as a separate asset, complicating distribution.
  • Defeat dead-code elimination — the linker can’t strip entries the consumer’s input never references if they’re loaded behind an opaque file read.

A phf::Map baked into the binary at compile time wins on every axis: zero-allocation lookup, single-binary distribution, full DCE and LTO visibility. The build cost is real (~40 s the first time, ~0 s incremental) but happens once per workspace build, not per-invocation.

phf is preferred over a static HashMap (which would require runtime construction in a OnceLock): phf produces a true compile-time perfect-hash table, giving O(1) lookup with no first-call cost and no synchronisation on the hot path.

Resolution order

For a reference like ※[#「魚+師のつくり」、第3水準1-94-37]:

  1. Unicode codepoint if the source explicitly provided one (U+XXXX) — used directly.
  2. JIS X 0213 plane-row-cell lookup (第N水準P-R-C) — most ideographs land here.
  3. Descriptive name — the parser ships a curated mapping plus a single-character fallback (a description that is itself one glyph resolves to it). A reference that matches none of these resolves to nothing: the aozora::lex::unresolved_gaiji warning fires and the gaiji renders as its description text.
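
The three-step order can be sketched as a short fallback chain. Several things here are assumptions for illustration: resolve, parse_codepoint and parse_menkuten are hypothetical names; a tiny sorted static slice with binary search stands in for the compile-time phf map; its three rows are ordinary JIS level-1 kanji used as placeholders rather than real gaiji; the curated name mapping is omitted; and only single-scalar results are modelled.

```rust
/// Placeholder men-ku-ten table, sorted by key so binary search works.
/// A std-only stand-in for the perfect-hash map described above.
static MENKUTEN: &[((u8, u8, u8), char)] = &[
    ((1, 16, 1), '亜'),
    ((1, 16, 2), '唖'),
    ((1, 16, 3), '娃'),
];

/// Parses "U+5F85" into a scalar value.
fn parse_codepoint(mencode: &str) -> Option<char> {
    let hex = mencode.strip_prefix("U+")?;
    char::from_u32(u32::from_str_radix(hex, 16).ok()?)
}

/// Parses "第3水準1-94-37" into (plane, row, cell).
fn parse_menkuten(mencode: &str) -> Option<(u8, u8, u8)> {
    let (_, cell) = mencode.split_once("水準")?;
    let mut parts = cell.split('-').map(|p| p.parse::<u8>().ok());
    let triple = (parts.next()??, parts.next()??, parts.next()??);
    if parts.next().is_some() {
        return None; // more than three components
    }
    Some(triple)
}

/// Resolution order: explicit codepoint, then men-ku-ten, then a
/// single-character description. None is the unresolved case.
fn resolve(description: &str, mencode: Option<&str>) -> Option<char> {
    if let Some(m) = mencode {
        if let Some(c) = parse_codepoint(m) {
            return Some(c);
        }
        if let Some(key) = parse_menkuten(m) {
            if let Ok(i) = MENKUTEN.binary_search_by_key(&key, |&(k, _)| k) {
                return Some(MENKUTEN[i].1);
            }
        }
    }
    // Fallback: a description that is itself exactly one scalar.
    let mut chars = description.chars();
    match (chars.next(), chars.next()) {
        (Some(c), None) => Some(c),
        _ => None,
    }
}
```

With these placeholders, a reference carrying U+5F85 resolves directly, while an out-of-range men-ku-ten with a multi-character description falls through all three steps and returns None.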

AST shape

pub struct Gaiji<'src> {
    /// Free-form description from the source (e.g. "魚+師のつくり").
    pub description: &'src str,
    /// Resolved Unicode value — a single scalar or a static combining
    /// sequence — or `None` when no path matched.
    pub ucs: Option<Resolved>,
    /// Raw mencode reference (e.g. "第3水準1-85-54", "U+XXXX").
    pub mencode: Option<&'src str>,
}

Resolved is Char(char) for the 99%+ single-scalar case or Multi(&'static str) for the 25 JIS X 0213 plane-1 combining-sequence cells. ucs == None is the unresolved case the unresolved_gaiji warning flags.

Render output

| ucs | HTML |
| --- | --- |
| Some(_) | <span class="aozora-gaiji" data-codepoint="U+20B9B">𠮛</span>: the resolved glyph as content, the scalar(s) as space-separated U+XXXX in data-codepoint. |
| None | <span class="aozora-gaiji" data-description="魚+師のつくり">魚+師のつくり</span>: the description as both attribute and content. |
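
The two rows can be reproduced with a small rendering function. This is a sketch, not the crate's renderer: render_gaiji and escape_html are hypothetical helpers, and the Resolved enum is re-declared locally with the two variants named in the AST section.

```rust
/// Local copy of the two resolution shapes described in the AST section.
enum Resolved {
    Char(char),
    Multi(&'static str),
}

/// Minimal HTML escaping for text and attribute values.
fn escape_html(text: &str) -> String {
    text.replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
        .replace('"', "&quot;")
}

/// Renders a gaiji either as its resolved glyph or as its description.
fn render_gaiji(description: &str, ucs: Option<&Resolved>) -> String {
    match ucs {
        Some(resolved) => {
            let glyph: String = match resolved {
                Resolved::Char(c) => c.to_string(),
                Resolved::Multi(s) => (*s).to_string(),
            };
            // One U+XXXX token per scalar, separated by spaces.
            let codepoints: Vec<String> = glyph
                .chars()
                .map(|c| format!("U+{:04X}", c as u32))
                .collect();
            format!(
                "<span class=\"aozora-gaiji\" data-codepoint=\"{}\">{}</span>",
                codepoints.join(" "),
                escape_html(&glyph)
            )
        }
        None => {
            let d = escape_html(description);
            format!("<span class=\"aozora-gaiji\" data-description=\"{d}\">{d}</span>")
        }
    }
}
```

A combining sequence such as か followed by U+309A renders with two tokens in data-codepoint, which is why the attribute is described as space-separated.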

Accent decomposition

Aozora Bunko also encodes accented Latin letters (è, ñ, ä) using a separate notation that does not go through ※[#…]:

M&iexcl;cher    ← in some sources
me-zin       ← in others

The full table is at https://www.aozora.gr.jp/accent_separation.html — 114 ASCII digraphs / ligatures mapping to Unicode. aozora applies this decomposition during the lexer’s Phase 0 (sanitize), so by the time classification runs the source is pure Unicode. See Architecture → Seven-phase lexer for the phase ordering.

Kunten / kaeriten (訓点・返り点)

Kunten are the marginal annotations Japanese readers add to classical Chinese (漢文) source so that it can be read in Japanese word order. aozora recognises kaeriten (返り点) — the reading-order return marks — in their bracketed form. The recognised marks are:

  • single: 一, 二, 三, 四, 上, 中, 下, 甲, 乙, 丙, 丁, レ
  • Xレ compounds: 一レ, 二レ, 三レ, 上レ, 中レ, 下レ
  • 送り仮名: the parenthesised (…) form

(Re-reading marks — 再読文字 like 未 / 将 / 当 — and any other kunten that do not match the above are carried as generic [#…] annotations.)

A handful of late-Edo / Meiji Aozora Bunko works carry these. In real source the marks sit between characters as [#…] annotations:

有[#二]朋自遠方来[#一]

Notation forms

Bracketed (the recognised form)

aozora recognises the bracketed form only — the mark in a [#…] annotation:

有[#二]朋自遠方来[#一]

Renders as:

有<sup class="aozora-kaeriten">二</sup>朋自遠方来<sup class="aozora-kaeriten">一</sup>

Inline (not recognised)

A bare reading-mark glyph written directly between characters (有レ朋自遠方来) is left as plain text: the parser cannot tell a genuine 返り点 from the same glyph used as an ordinary character in running prose, which is exactly why the bracketed form exists. Use [#…] for any mark you want recognised.

Okurigana

Kunten 送り仮名 (reading-aid kana) use the parenthesised form, also inside a [#…] annotation:

有[#(リ)]

These are classified as kaeriten nodes but are not ladder marks (they take no part in the pairing check).

AST shape

The recognised marks (single 一 二 三 四 上 中 下 甲 乙 丙 丁 レ, the Xレ compounds, and (…) okurigana) all produce one node that stores the raw mark text:

pub struct Kaeriten<'src> {
    pub mark: NonEmptyStr<'src>,   // the raw mark, e.g. "二" / "一レ" / "(リ)"
}

The renderer wraps it in <sup class="aozora-kaeriten">…</sup>. The bracketed_kaeriten_no_pair / kaeriten_outside_kanbun checks classify the mark’s family and rank from this string at diagnostic time rather than storing a typed enum.
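
Deciding whether a [#…] body is one of the recognised marks reduces to three membership tests. The sketch below is illustrative only: is_kaeriten and is_okurigana are hypothetical names, and accepting both ASCII and full-width parentheses is an assumption about how the input has been normalised.

```rust
/// The twelve single marks.
const SINGLE: [&str; 12] = ["一", "二", "三", "四", "上", "中", "下", "甲", "乙", "丙", "丁", "レ"];

/// The six Xレ compounds.
const COMPOUND: [&str; 6] = ["一レ", "二レ", "三レ", "上レ", "中レ", "下レ"];

/// Parenthesised okurigana with a non-empty body, e.g. "(リ)".
fn is_okurigana(body: &str) -> bool {
    let inner = body
        .strip_prefix('(')
        .and_then(|rest| rest.strip_suffix(')'))
        .or_else(|| body.strip_prefix('（').and_then(|rest| rest.strip_suffix('）')));
    matches!(inner, Some(s) if !s.is_empty())
}

/// True when the text between `[#` and `]` is a recognised kaeriten mark.
fn is_kaeriten(body: &str) -> bool {
    SINGLE.contains(&body) || COMPOUND.contains(&body) || is_okurigana(body)
}
```

Anything that fails all three tests (a re-reading character such as 未, for instance) would stay a generic annotation.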

Diagnostics

| Code | Condition |
| --- | --- |
| kaeriten_outside_kanbun | A lone kaeriten in kana prose (conservative lookahead heuristic) |
| bracketed_kaeriten_no_pair | A rank-≥2 mark whose family base (一 / 上 / 甲) is absent from the document |

Indent & align containers (字下げ)

Aozora Bunko uses paired [#ここから…] / [#ここで…終わり] brackets to delimit blocks of text with custom layout. The block container families aozora recognises:

| Family | Opener | Closer | Effect |
| --- | --- | --- | --- |
| 字下げ (indent) | [#ここから2字下げ] | [#ここで字下げ終わり] | Indent every line by N full-width chars |
| 地付き / 地上げ (align-end) | [#ここから地付き] / [#ここから地から2字上げ] | [#ここで地付き終わり] | Flush right (vertical: 地 = ground = bottom) |
| 罫囲み (boxed) | [#罫囲み] | [#罫囲み終わり] | Draw a rule frame around the block |

The HTML renderer maps them to <div class="aozora-container …"> wrappers. Two more container kinds are inline, not block: 割り注 ([#割り注]…[#割り注終わり]) and the 傍点 / 傍線 range form ([#傍点]…[#傍点終わり], see bouten).

Single-line forms

The 字下げ / 地付き / 地上げ directives also have a single-line form (no ここから prefix, no closer) that applies to the rest of the line:

 [#地付き]平和への誓い

In the borrowed AST a single-line directive is a zero-width marker node (AozoraNode::Indent / AlignEnd), not a wrapping container — it renders as an empty span and the following text stays a sibling:

<span class="aozora-align-end aozora-align-end-0" data-offset="0"></span>平和への誓い

A page / section break sharing the line with such a marker drops it — see break_in_single_line_container.

AST shape

A paired block container is one Container node tagging the wrapped children (the lexer splices the enclosed siblings under it during post-processing); single-line forms and breaks are leaf nodes:

pub struct Container {
    pub kind: ContainerKind,
}

pub enum ContainerKind {
    Indent { amount: u8 },    // [#ここからN字下げ]
    AlignEnd { offset: u8 },  // [#ここから地付き / 地からN字上げ]
    Keigakomi,                // [#罫囲み]
    Warichu,                  // [#割り注]           (inline)
    BoutenRange { kind: BoutenKind, position: BoutenPosition }, // [#傍点]… (inline)
}

Why a small flat enum?

ContainerKind is closed by spec. A flat enum (vs a trait object or string tag) gives the parser O(1) variant dispatch in the classify phase and the renderer’s HTML walk, and lets the compiler’s exhaustiveness check enforce that every variant has a render path. The payloads are tiny (u8 / BoutenKind / BoutenPosition), so the whole enum stays within a few bytes — pinned by the container_kind_is_copy_and_fits_in_a_word assertion.
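
A size-and-Copy assertion of the kind mentioned above might look like the following. It is a sketch: the enums are re-declared locally (BoutenKind uses hypothetical variant names), and the function body is a guess at what such an assertion checks, not the crate's actual test.

```rust
use std::mem::size_of;

#[allow(dead_code)]
#[derive(Clone, Copy)]
enum BoutenKind {
    Goma, WhiteSesame, Circle, WhiteCircle, DoubleCircle, Janome, Cross, WhiteTriangle,
    WavyLine, UnderLine, DoubleUnderLine,
}

#[allow(dead_code)]
#[derive(Clone, Copy)]
enum BoutenPosition {
    Right,
    Left,
}

#[allow(dead_code)]
#[derive(Clone, Copy)]
enum ContainerKind {
    Indent { amount: u8 },
    AlignEnd { offset: u8 },
    Keigakomi,
    Warichu,
    BoutenRange { kind: BoutenKind, position: BoutenPosition },
}

/// Compiles only if T is Copy; the body is intentionally empty.
fn assert_copy<T: Copy>() {}

/// Returns true when ContainerKind is Copy (checked at compile time) and
/// no larger than one machine word (checked at run time).
fn container_kind_is_copy_and_fits_in_a_word() -> bool {
    assert_copy::<ContainerKind>();
    size_of::<ContainerKind>() <= size_of::<usize>()
}
```

Pinning the size this way turns an accidental payload growth (say, a String field) into a failing test instead of a silent slowdown.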

Composition

Containers nest:

[#ここから2字下げ]
 通常の段落。
 [#ここから地付き]
  右寄せの行。
 [#ここで地付き終わり]
 通常に戻る。
[#ここで字下げ終わり]

Renders as nested divs:

<div class="aozora-indent-2">
通常の段落。
<div class="aozora-align-end">
右寄せの行。
</div>
通常に戻る。
</div>

Mismatched closers (e.g. [#ここから地付き][#ここで字下げ終わり]) fire diagnostic aozora::lex::mismatched_container_close and the parser auto-closes the offending opener at the closer’s position. The check compares container families, so closing a 2字下げ opener with a plain 字下げ終わり (both indent) is fine — only a different family (indent vs align-end vs 罫囲み vs 割り注) is flagged.
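
The family comparison and the auto-close recovery can be sketched with a small stack walk. Family, Event and check_pairs are hypothetical names; the real lexer works on its own token stream and attaches spans, which this sketch leaves out.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Family {
    Indent,
    AlignEnd,
    Keigakomi,
    Warichu,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Event {
    Open(Family),
    Close(Family),
}

/// Returns one diagnostic code per problem found, in source order.
fn check_pairs(events: &[Event]) -> Vec<&'static str> {
    let mut stack = Vec::new();
    let mut diags = Vec::new();
    for event in events {
        match *event {
            Event::Open(family) => stack.push(family),
            // Popping on every close is the auto-close: the innermost
            // opener ends here whether or not the families agree.
            Event::Close(family) => match stack.pop() {
                Some(open) if open == family => {}
                Some(_) => diags.push("aozora::lex::mismatched_container_close"),
                None => diags.push("aozora::lex::unmatched_close"),
            },
        }
    }
    // Whatever is still open at end of input never found a closer.
    diags.extend(stack.iter().map(|_| "aozora::lex::unclosed_bracket"));
    diags
}
```

Because Family carries no amount, a 2字下げ opener and a plain 字下げ終わり closer compare equal, matching the by-family rule.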

Why containers, not stack-based push/pop tokens?

The spec describes these as opener / closer brackets, but the natural implementation in Rust is a recursive container node. That choice:

  • Lets the renderer walk the tree once with a single match on ContainerKind, instead of maintaining a render-time stack.
  • Surfaces shape errors (mismatched closers, dangling openers) at parse time — the lexer’s classify phase already has all the information to decide.
  • Makes the canonical-serialise pass trivial (each container prints its opener, walks its children, prints its closer).

The trade-off is one extra heap touch per container — a single bumpalo slice for children. The arena is already hot, so the cost is negligible (bumpalo returns aligned pointers in O(1) bumps).
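
A recursive container node makes the render walk a single match, as the sketch below shows. It is a simplified model under stated assumptions: Node and render are hypothetical, children live in a Vec rather than a bumpalo slice, only two container kinds are modelled, and text is emitted without escaping.

```rust
enum ContainerKind {
    Indent { amount: u8 },
    AlignEnd,
}

enum Node {
    Text(String),
    Container { kind: ContainerKind, children: Vec<Node> },
}

/// Walks the tree once; each container prints its opener, its children,
/// then its closer. No render-time stack is needed.
fn render(nodes: &[Node], out: &mut String) {
    for node in nodes {
        match node {
            Node::Text(text) => {
                out.push_str(text);
                out.push('\n');
            }
            Node::Container { kind, children } => {
                let class = match kind {
                    ContainerKind::Indent { amount } => format!("aozora-indent-{amount}"),
                    ContainerKind::AlignEnd => "aozora-align-end".to_string(),
                };
                out.push_str(&format!("<div class=\"{class}\">\n"));
                render(children, out);
                out.push_str("</div>\n");
            }
        }
    }
}
```

Feeding it the nested 字下げ / 地付き sample from the Composition section reproduces the nested divs shown there.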

Page & section breaks (改ページ・改丁)

Aozora Bunko inherits print conventions for page-level structure. Four annotations split a work into pages, signatures, and openings:

| Notation | Renders as | Meaning |
| --- | --- | --- |
| [#改ページ] | <div class="aozora-page-break"></div> | Begin a new page |
| [#改丁] | <div class="aozora-section-break aozora-section-break-kaicho"></div> | Begin a new 丁 (leaf / recto) |
| [#改段] | <div class="aozora-section-break aozora-section-break-kaidan"></div> | Section break (smaller than a page) |
| [#改見開き] | <div class="aozora-section-break aozora-section-break-kaimihiraki"></div> | Begin a new two-page spread |

All four are self-contained directives — no opener / closer pair, no inner content. They appear on their own line in the source.

AST shape

[#改ページ] is its own borrowed-AST node; the three 段 / 丁 / 見開き breaks share one SectionBreak node tagged by SectionKind:

// borrowed::AozoraNode variants
AozoraNode::PageBreak,                  // [#改ページ]
AozoraNode::SectionBreak(SectionKind),  // [#改丁 / 改段 / 改見開き]

pub enum SectionKind {
    Choho,   // 改丁
    Dan,     // 改段
    Spread,  // 改見開き
}

Why distinct variants for each break flavour?

The flavours render to identical HTML structure (an empty <div>) but different class hooks (aozora-page-break, aozora-section-break-{kaicho,kaidan,kaimihiraki}). Keeping PageBreak separate and tagging the section flavours with a SectionKind enum (rather than a string) means:

  • The renderer never plumbs the original notation through to the output, preserving the AST’s role as a normalised IR.
  • The compiler’s exhaustiveness check guarantees every flavour has a render path.
  • Tooling can count breaks of a specific flavour at the AST level without a string match.
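
The points above can be seen in a small render function over a typed break. This is a sketch: Break and render_break are hypothetical (the handbook's own nodes are AozoraNode::PageBreak and AozoraNode::SectionBreak), while the SectionKind variants and class names are taken from the tables in this section.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SectionKind {
    Choho,  // 改丁
    Dan,    // 改段
    Spread, // 改見開き
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Break {
    Page,
    Section(SectionKind),
}

/// Maps each break flavour to its empty div. The match is exhaustive, so a
/// new flavour without a render path fails to compile.
fn render_break(b: Break) -> String {
    match b {
        Break::Page => "<div class=\"aozora-page-break\"></div>".to_string(),
        Break::Section(kind) => {
            let flavour = match kind {
                SectionKind::Choho => "kaicho",
                SectionKind::Dan => "kaidan",
                SectionKind::Spread => "kaimihiraki",
            };
            format!("<div class=\"aozora-section-break aozora-section-break-{flavour}\"></div>")
        }
    }
}

/// Counts breaks of one flavour without any string matching.
fn count_breaks(nodes: &[Break], wanted: Break) -> usize {
    nodes.iter().filter(|&&b| b == wanted).count()
}
```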

Composition with other annotations

Breaks unconditionally close any open inline annotation (ruby, bouten, tcy) at their line. They do not close container directives (字下げ, 地付き, etc.) — those persist across page boundaries, which matches print typography.

[#ここから2字下げ]
 第一節
[#改ページ]
 第二節 (still 2字下げ)
[#ここで字下げ終わり]

Diagnostics

| Code | Condition |
| --- | --- |
| break_in_single_line_container | A page / section break sharing a line with a single-line container (or inside a warichu range), which drops it |

See also

Diagnostics catalogue

aozora is non-fatal by design: the parser always produces a tree, even from malformed input, and reports what it noticed through structured diagnostics that callers choose how to treat. This page is the catalogue.

Each Diagnostic carries:

  • a stable code — a dotted string such as aozora::lex::unclosed_bracket. The string is pinned by a test and never changes within a major release; new diagnostics add new codes.
  • a severity: Error / Warning / Note.
  • a source axis: Source (your input tripped it) or Internal (a library-bug sanity check — see Internal).
  • a span — a byte range in the sanitized source (the Phase 0 output: BOM stripped, CRLF→LF, 〔…〕 accents decomposed). For input with none of those, the sanitized bytes equal the original bytes.

Rendering them

The aozora check CLI renders diagnostics three ways, chosen with --diagnostic-format:

  • human (the default on a terminal) — a graphical miette report: the source line, a caret under the offending span, the label, the help text, and a link back to this page.
  • json (the default when stderr is piped) — the aozora::wire diagnostics envelope, byte-identical to what the WASM / FFI / Python / Extism front doors emit. This is the machine / agent path.
  • short — one grep-able line per diagnostic: path:offset: severity[code]: message.

Exit codes: 0 (diagnostics printed but tolerated), 1 (--strict with at least one diagnostic), 2 (CLI usage error), 3 (an Internal diagnostic fired — a library bug). See the CLI reference.

Library consumers get tree.diagnostics() -> &[Diagnostic] and reach the parts through code(), severity(), source(), and span(). All bindings carry the same structured data.
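
To show how the four parts combine, here is a sketch of the short format and the exit-code rule. Everything about the types is an assumption made for illustration: DiagnosticSketch is a plain struct rather than the crate's Diagnostic (which exposes accessors), the message field, the lowercase severity labels and the use of the span start as the offset are guesses, and letting an Internal diagnostic take precedence over --strict is also an assumption. Exit code 2 (usage error) is decided before parsing and is not modelled.

```rust
use std::ops::Range;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Severity {
    Error,
    Warning,
    Note,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Origin {
    Source,
    Internal,
}

/// Hypothetical flat view of one diagnostic.
struct DiagnosticSketch {
    code: &'static str,
    severity: Severity,
    origin: Origin,
    span: Range<usize>,
    message: String,
}

/// One grep-able line: path:offset: severity[code]: message
fn short_line(path: &str, d: &DiagnosticSketch) -> String {
    let severity = match d.severity {
        Severity::Error => "error",
        Severity::Warning => "warning",
        Severity::Note => "note",
    };
    format!("{}:{}: {}[{}]: {}", path, d.span.start, severity, d.code, d.message)
}

/// 3 if an Internal diagnostic fired, 1 under strict mode with any
/// diagnostic, otherwise 0.
fn exit_code(diags: &[DiagnosticSketch], strict: bool) -> i32 {
    if diags.iter().any(|d| d.origin == Origin::Internal) {
        3
    } else if strict && !diags.is_empty() {
        1
    } else {
        0
    }
}
```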

Source diagnostics

These trace back to your input. The parser emits exactly these — the authoring-error catalogue is complete (no diagnostic is specified-but-unimplemented).

Source contains PUA

aozora::lex::source_contains_pua · Warning

…￯…        (a literal U+E001..=U+E004 codepoint in the source)

The source contains a codepoint in U+E001..=U+E004, which the lexer reserves as inline / block placeholder sentinels. A source-side occurrence collides with the lexer’s own markers and would confuse the placeholder registry. Fix: remove the private-use codepoint from the source (these are not normal text characters and effectively never occur in real 青空文庫 files).
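
Detecting the collision is a single pass over the source. The helper below is a hypothetical sketch (find_reserved_pua is not a crate function); it only reports positions, whereas the lexer would attach a span and the warning code.

```rust
/// Returns the byte offset and value of every reserved sentinel codepoint
/// (U+E001..=U+E004) found in the source.
fn find_reserved_pua(src: &str) -> Vec<(usize, char)> {
    src.char_indices()
        .filter(|&(_, c)| ('\u{E001}'..='\u{E004}').contains(&c))
        .collect()
}
```

Other private-use codepoints (U+E000, U+E005 and up) pass through untouched, since only the four sentinels are reserved.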

Unclosed bracket

aozora::lex::unclosed_bracket · Error

[#ここから2字下げ            (no matching [#ここで字下げ終わり])

An Aozora open delimiter (ruby 《, annotation [#, quote, …) reached end-of-input with no matching close on the pairing stack. The label points at the opener. The region degrades to plain text; no pair link is emitted. Fix: add the missing close delimiter, or remove the dangling opener.

Unmatched close

aozora::lex::unmatched_close · Error

青空]》            (a close with no matching open on the stack)

A close delimiter was seen with an empty pairing stack, or against a stack top of a different PairKind. The label points at the stray close. Fix: add the matching open delimiter, or remove the stray close.

Accent decomposition applied

aozora::lex::accent_decomposition_applied · Note

〔cafe'〕        (decomposed to 〔café〕)

A 〔…〕 accent digraph was rewritten to its Unicode-combined form during Phase 0 sanitize (cafe' → café, fune + backtick → funè, …). This is intended behaviour, not an error; it is surfaced as a Note so an editor can show what changed. One note fires per 〔…〕 span that actually contained a digraph; a 〔…〕 with no accent digraph is silent. The span is in sanitized (post-decomposition) coordinates. The transform is loss-free: the serializer reconstructs the original 〔…〕 source form. See ADR-0003. No action required.
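
A reduced sketch of that rewrite, limited to the two digraphs used as examples here (e' and e followed by a backtick). The names decompose_accents and rewrite_body are hypothetical, the two-row table is a stand-in for the full 114-entry table, and the note count is returned as a plain number instead of diagnostics with spans.

```rust
/// Two-row stand-in for the full accent table: (ASCII digraph, precomposed).
const DIGRAPHS: &[(&str, char)] = &[("e'", '\u{E9}'), ("e`", '\u{E8}')];

/// Rewrites digraphs inside one 〔…〕 body; reports whether anything changed.
fn rewrite_body(body: &str) -> (String, bool) {
    let mut out = String::with_capacity(body.len());
    let mut changed = false;
    let mut i = 0;
    'scan: while i < body.len() {
        for (digraph, replacement) in DIGRAPHS {
            if body[i..].starts_with(*digraph) {
                out.push(*replacement);
                i += digraph.len();
                changed = true;
                continue 'scan;
            }
        }
        // No digraph here: copy one scalar through unchanged.
        if let Some(ch) = body[i..].chars().next() {
            out.push(ch);
            i += ch.len_utf8();
        }
    }
    (out, changed)
}

/// Rewrites every closed 〔…〕 span; text outside the brackets is untouched.
/// Returns the new text and the number of spans that actually changed.
fn decompose_accents(src: &str) -> (String, usize) {
    let mut out = String::with_capacity(src.len());
    let mut notes = 0;
    let mut rest = src;
    while let Some(open) = rest.find('〔') {
        let body_start = open + '〔'.len_utf8();
        let Some(close_rel) = rest[body_start..].find('〕') else { break };
        let close = body_start + close_rel;
        out.push_str(&rest[..body_start]);
        let (body, changed) = rewrite_body(&rest[body_start..close]);
        out.push_str(&body);
        out.push('〕');
        if changed {
            notes += 1;
        }
        rest = &rest[close + '〕'.len_utf8()..];
    }
    out.push_str(rest);
    (out, notes)
}
```

Counting per span rather than per digraph matches the "one note per 〔…〕 span" rule above.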

Unresolved gaiji

aozora::lex::unresolved_gaiji · Warning

※[#「架空の外字」、第3水準99-99-99]   (men-ku-ten out of range)

A 外字 (gaiji) reference — ※[#…] — resolved to neither a Unicode scalar nor a JIS X 0213 cell: no 第N水準P-R-C men-ku-ten or U+XXXX reference matched, and the description is not itself a single resolvable character. The construct still parses; the renderer falls back to the description text (<span class="aozora-gaiji" data-description="…">…</span>) rather than the intended glyph. The label points at the ※[#…] reference. Fix: correct the men-ku-ten / U+XXXX reference, or accept the description-only rendering. (Fires for top-level references; gaiji nested inside a ruby / bouten reading is not yet flagged.)

Mismatched container close

aozora::lex::mismatched_container_close · Error

[#ここから2字下げ]…[#ここで地付き終わり]   (indent opened, align-end closed)

A paired container opened with one family (indent / warichu / keigakomi / align-end) was closed by a closer of a different family. The comparison is by family, so closing a 2字下げ opener with a plain 字下げ終わり (both indent, differing only in amount) is not flagged; only a genuine family mismatch is. The label points at the close marker. The parser recovers by auto-closing the opener at the closer’s position (the container pair is still emitted, keyed by the open family). Fix: match the closer to the opener: ここから字下げ with ここで字下げ終わり, ここから地付き with ここで地付き終わり, etc.

Empty ruby reading

aozora::lex::empty_ruby_reading · Error

|青梅《》        (base given, reading empty)

An explicit-base ruby supplied a base (a | precedes the 《) but an empty 《》 reading. Because the | marks the base unambiguously, this is a genuine authoring slip rather than a literal 《》 run, so a bare 青梅《》 with no | is not flagged (the parser can’t be sure a base was intended and treats it as text). The construct degrades to plain text. The label spans the whole |青梅《》. Fix: supply a reading, or drop the |…《》 markers to keep the base as plain text.

Nested ruby

aozora::lex::nested_ruby · Error

|漢《か《ん》じ》      (the reading body opens another 《…》)

A ruby reading body itself opened another ruby. Ruby does not nest; the label points at the inner 《. The outer ruby is still parsed best-effort. Note that an adjacent 《《…》》 is not nested ruby: the tokenizer reads 《《 / 》》 as double-bracket bouten, a separate construct, so this fires only when the inner 《…》 closes before the outer (text between the two closes, as in the catalogue shape |…《…《…》…》). Fix: close the outer reading before the inner 《, or remove the inner 《…》.

Unrecognised container directive

aozora::lex::unrecognised_container_directive · Warning

[#ここからナントカ]      (no such container kind)

A [#ここから…] directive looked like a paired-container opener but named no known container kind (字下げ, 地付き, 地から N 字上げ). The bracket is kept as a plain Annotation{Unknown} (so output is preserved and the “no bare [#” guarantee holds) but is not treated as a container — any matching [#ここで…終わり] will not pair with it. The label spans the directive. Fix: use a recognised opener, e.g. [#ここから2字下げ] or [#ここから地付き].

TCY target not found

aozora::lex::tcy_target_not_found · Warning

あ[#「い」は縦中横]      (no 「い」 earlier in the line)

A 縦中横 forward reference ([#「X」は縦中横]) named a target that does not appear anywhere in the preceding text, so it has no run to rotate. The directive degrades to an Annotation{Unknown}. The label spans the directive. Fix: check the spelling of the quoted target, or place the [#「X」は縦中横] after the run it should style.

Bouten target ambiguous

aozora::lex::bouten_target_ambiguous · Warning

青空青空[#「青空」に傍点]      (「青空」 occurs twice before the directive)

A forward-reference bouten ([#「X」に傍点]) named a target that occurs more than once in the preceding look-back window, so which run it emphasises is ambiguous. The parser still applies it (to the match its look-back rule selects) but the chosen run may not be the intended one. The label spans the directive. Fix: reword so the quoted target is unique before the directive. (Multi-target brackets like [#「A」「B」に傍点] name distinct runs and are never flagged.)
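
The ambiguity test amounts to counting matches in the look-back window. In this sketch the whole `before` string is treated as the window (an assumption; the real bound is not restated here), matches are counted without overlap, and the most recent occurrence is taken as the selected run, following the look-back rule in the bouten chapter. Lookback and look_back are hypothetical names.

```rust
/// Outcome of looking back for a quoted bouten target.
#[derive(Debug, PartialEq, Eq)]
struct Lookback {
    /// Byte offset of the run the directive applies to (most recent match).
    chosen: usize,
    /// True when the target occurs more than once in the window.
    ambiguous: bool,
}

/// Returns None when the target does not occur at all.
fn look_back(before: &str, target: &str) -> Option<Lookback> {
    if target.is_empty() {
        return None;
    }
    let chosen = before.rfind(target)?;
    let occurrences = before.matches(target).count();
    Some(Lookback { chosen, ambiguous: occurrences > 1 })
}
```

The directive is still applied in the ambiguous case; the flag only decides whether the warning fires.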

Mismatched bouten container

aozora::lex::mismatched_bouten_container · Error

彼は[#傍点]必ず[#傍線終わり]来る   (傍点 opened, 傍線 closed)

A 傍点 / 傍線 range form ([#傍点] … [#傍点終わり]) was opened with one family — 点 (dots) or 線 (line) — and closed by the other, e.g. a [#傍点] opener closed by [#傍線終わり]. The two families render differently (dots beside the text vs a line alongside it), so the run’s emphasis is ambiguous. The parser recovers by keying the run to the opener’s variant. A same-family variant difference (白丸傍点 closed by 丸傍点終わり) is tolerated. The label points at the close marker. Fix: match the closer’s family to the opener — [#傍点終わり] for any 点 variant, [#傍線終わり] for any 線 variant.

Bracketed kaeriten no pair

aozora::lex::bracketed_kaeriten_no_pair · Error

怪物[#二]   ([#二] with no [#一] anywhere in the document)

A bracketed kaeriten of rank ≥ 2 ([#二] / [#下] / [#乙]) appears in a document whose matching family base ([#一] / [#上] / [#甲]) is absent entirely, so the return mark has nothing to pair back to. The check is document-wide and base-only by design: real 漢文 return-mark groups span punctuation and line boundaries (and write 二 before 一), and 上下点 may use just 上 and 下 (skipping 中), so any narrower scope would wrongly flag valid kanbun. レ (re-ten) is standalone and never flagged; 送り仮名 ([#(ス)]) is not a ladder mark. Fix: add the missing base mark, or check the mark is a genuine 返り点.
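
A document-wide, base-only check can be sketched like this. The names (Ladder, classify, unpaired_marks) are hypothetical, the input is the list of raw mark strings already extracted from [#…] annotations, and classifying an Xレ compound by its leading ladder mark is an assumption made for the sketch.

```rust
/// The three ladder families.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Ladder {
    Number, // 一 二 三 四
    UpDown, // 上 中 下
    Stem,   // 甲 乙 丙 丁
}

/// Returns (family, rank) for ladder marks; None for レ, okurigana, etc.
fn classify(mark: &str) -> Option<(Ladder, u8)> {
    let first = mark.chars().next()?;
    Some(match first {
        '一' => (Ladder::Number, 1),
        '二' => (Ladder::Number, 2),
        '三' => (Ladder::Number, 3),
        '四' => (Ladder::Number, 4),
        '上' => (Ladder::UpDown, 1),
        '中' => (Ladder::UpDown, 2),
        '下' => (Ladder::UpDown, 3),
        '甲' => (Ladder::Stem, 1),
        '乙' => (Ladder::Stem, 2),
        '丙' => (Ladder::Stem, 3),
        '丁' => (Ladder::Stem, 4),
        _ => return None,
    })
}

/// Returns every rank >= 2 mark whose family base (rank 1) does not occur
/// anywhere in the document. Order and position are deliberately ignored.
fn unpaired_marks<'a>(marks: &[&'a str]) -> Vec<&'a str> {
    let classified: Vec<(&'a str, Ladder, u8)> = marks
        .iter()
        .filter_map(|&m| classify(m).map(|(family, rank)| (m, family, rank)))
        .collect();
    let mut unpaired = Vec::new();
    for &(mark, family, rank) in &classified {
        let has_base = classified.iter().any(|&(_, f, r)| f == family && r == 1);
        if rank >= 2 && !has_base {
            unpaired.push(mark);
        }
    }
    unpaired
}
```

Because the base may legitimately follow the higher-rank mark in the text, the check asks only whether the base exists, never where.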

Kaeriten outside kanbun

aozora::lex::kaeriten_outside_kanbun · Warning

これは[#レ]と書いた。   (a lone kaeriten in kana prose)

A kaeriten ([#二] / [#レ] / …) is the only one in the entire document and its surroundings read as ordinary kana prose, so it is most likely a stray [#…] annotation rather than a genuine 返り点. The lookahead heuristic is deliberately conservative — a document carrying a cluster of kaeriten (real 漢文) is never flagged. The label points at the lone mark. (Only the bracketed [#…] form is recognised; a bare reading-mark glyph in running text is left as plain text.) Fix: confirm the mark is intended; remove it if it is not a reading mark.

Break in single line container

aozora::lex::break_in_single_line_container · Warning

[#地付き]本文[#改ページ]   (single-line directive shares its line with a break)

A single-line layout directive ([#地付き], [#N字下げ]) or a warichu range ([#割り注] … [#割り注終わり]) governs only the rest of its line. A page / section break sharing that line — or, for warichu, falling between the open and close — drops the container: the break starts a new block, so the directive’s run is cut short. Paired block forms ([#ここから…] … [#ここで…終わり]) persist across breaks and are not flagged (print typography keeps the layout across pages). The label points at the break. Fix: move the break off the line, or use the paired block form.

Internal

aozora::internal · Error · source = Internal

Pipeline-internal sanity checks. A correct build never emits these — their appearance means a bug in aozora itself, not a problem with your input. The specific check is identified by an InternalCheckCode:

| Check code | Fires when |
| --- | --- |
| aozora::lex::residual_annotation_marker | an [# digraph survived classification into the normalized text (a missing recogniser) |
| aozora::lex::unregistered_sentinel | a PUA sentinel sits at a normalized position not recorded in the placeholder registry |
| aozora::lex::registry_out_of_order | a placeholder-registry vector is not strictly ordered by position |
| aozora::lex::registry_position_mismatch | a registry entry references a position whose character is not the expected sentinel |

aozora check exits 3 when one fires. Please report it with the source that triggered it.

Planned diagnostics

None outstanding. Every authoring-error diagnostic in the catalogue — including the four model-dependent ones (mismatched_bouten_container, bracketed_kaeriten_no_pair, kaeriten_outside_kanbun, break_in_single_line_container) — is now emitted; see the Source diagnostics above. New 記法 work adds new codes here as it lands.

Why a stable string code, not just a message?

  1. Test stability. The corpus sweep and conformance gate count diagnostics by code; a test like “this corpus emits at most N unresolved_gaiji warnings” survives message-wording tweaks and localisation. A test that greps the message string does not.
  2. Tool integration. Editors / LSPs / CI lints filter by code (e.g. “treat every Error-severity code as fatal, ignore unrecognised_container_directive for legacy files”). String matching on prose is fragile.
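
Both points can be sketched in a few lines. This is a minimal sketch under stated assumptions: the `Diagnostic` struct, `Severity` enum, and both helper functions are hypothetical stand-ins rather than aozora’s public types; only the code strings are taken from the catalogue above.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Severity {
    Warning,
    Error,
}

/// Hypothetical diagnostic shape: a stable code plus free-form prose.
#[derive(Debug)]
struct Diagnostic {
    code: &'static str,
    severity: Severity,
    message: String,
}

/// Test stability: count by code, never by message wording.
fn count_by_code(diags: &[Diagnostic], code: &str) -> usize {
    diags.iter().filter(|d| d.code == code).count()
}

/// Tool integration: every Error is fatal unless its code is allow-listed.
fn is_fatal(d: &Diagnostic, allowed: &[&str]) -> bool {
    d.severity == Severity::Error && !allowed.contains(&d.code)
}

/// The message is only ever used for display.
fn render(d: &Diagnostic) -> String {
    format!("{} [{:?}]: {}", d.code, d.severity, d.message)
}
```

Rewording or localising `message` changes what `render` prints but leaves `count_by_code` and `is_fatal` untouched, which is the property the two points rely on.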

Pipeline overview

aozora is a pure-functional parser: given the same input, the same arena, and the same compile-time configuration, the output is bit-for-bit identical. There are no thread-locals, no OnceCell caches of parse state in the parse path, and no environmental side effects; the only process-wide cache is the one-time SIMD backend selection (see Events: SIMD scan), and every backend produces identical output. The only state the parser owns is the arena and a string interner, both reset per Document.

Three layers

flowchart TD
    src["source text<br/>(UTF-8 or Shift_JIS)"]
    decode["Shift_JIS decode<br/>(aozora-encoding)"]
    lex["Lex<br/>(aozora-pipeline::lex_into_arena)<br/>sanitize → events → pair → classify"]
    tree["AozoraTree&lt;'arena&gt;<br/>(borrowed AST)"]
    render["Render<br/>(aozora-render)<br/>html  /  serialize"]
    out["HTML  /  canonical 青空文庫 source"]

    src --> decode --> lex --> tree --> render --> out

Each arrow is a pure function. The arena is threaded through lex; nothing else holds state.

Crate dependency graph

flowchart TD
    spec["aozora-spec<br/>shared types"]
    encoding["aozora-encoding<br/>SJIS + 外字 PHF"]
    scan["aozora-scan<br/>SIMD multi-pattern"]
    veb["aozora-veb<br/>Eytzinger sorted-set"]
    syntax["aozora-syntax<br/>AST node types"]
    pipeline["aozora-pipeline<br/>4-phase lexer +<br/>lex_into_arena"]
    render["aozora-render<br/>html / serialize"]
    facade["aozora<br/>public facade"]
    cli["aozora-cli"]
    ffi["aozora-ffi"]
    wasm["aozora-wasm"]
    py["aozora-py"]

    spec --> encoding
    spec --> scan
    spec --> veb
    spec --> syntax
    encoding --> syntax
    scan --> pipeline
    veb --> pipeline
    syntax --> pipeline
    pipeline --> render
    render --> facade
    facade --> cli
    facade --> ffi
    facade --> wasm
    facade --> py

aozora-spec is the foundation — every other crate depends on it. The dependency graph forms a strict DAG; circular deps are forbidden by cargo deny’s bans config and by the cargo metadata check in just lint.

What each layer does

Sanitize → Events → Pair → Classify

The lexer pipeline is split into four phases because each stage has a different cost / cache profile:

| Phase | Input | Output | Why separate |
| --- | --- | --- | --- |
| Sanitize | raw &str | normalised &str + Phase-0 diagnostics | BOM / CRLF / accent decomposition / decorative-rule isolation / PUA collision pre-scan all happen here, once, before any expensive lookahead. Keeps later phases linear-time. |
| Events | sanitised &str | Iterator<Token> | SIMD trigger scan (aozora-scan) fires here; the linear tokenise that follows fuses with the scan so no per-event vector is allocated. |
| Pair | Iterator<Token> | Iterator<PairEvent> | Balanced-stack bracket matching across all opener / closer pairs (\|》《, [], 〔〕, 「」, 《《》》). Recovery diagnostics for unclosed / unmatched fire here. |
| Classify | Iterator<PairEvent> | Iterator<ClassifiedSpan> (→ AozoraNode<'arena>) | Decides “is this [#…] an indent opener, a bouten directive, a tcy directive, …” via the slug-canonicalised dispatch table. |

Splitting them lets the parser ship two surface APIs without code duplication:

  • lex_into_arena — fused, allocates one borrowed-AST tree.
  • Per-phase calls (sanitize, tokenize, pair, classify) — used by the bench harness’s per-phase probes and the integration tests in crates/aozora-pipeline/tests/.

Sanitize details

Phase 0 sanitize covers:

  • BOM strip — UTF-8 BOM detection at the head.
  • CRLF normalisation — CRLF → LF in one memchr2 pass.
  • Decorative rule isolation — separates long horizontal-rule patterns from neighbouring text so Phase 1’s trigger scan does not split them mid-glyph.
  • Accent decomposition — ASCII digraphs / ligatures → Unicode (see Gaiji).
  • PUA collision pre-scan — emits Diagnostic::SourceContainsPua for stray U+E001..U+E004 codepoints in the source so they can never be confused with the lexer’s own sentinel insertions later.
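
Three of these sub-passes are simple enough to sketch with the standard library alone. This is an illustration under assumptions, not the aozora-pipeline code: `sanitize` and `PuaCollision` are hypothetical names, `str::replace` stands in for the memchr2 pass, and the decorative-rule and accent passes are left out.

```rust
/// One Phase-0-style finding: a sentinel codepoint already present in the input.
#[derive(Debug, PartialEq, Eq)]
struct PuaCollision {
    byte_offset: usize,
    ch: char,
}

/// Strip a leading UTF-8 BOM, normalise CRLF to LF, and report stray
/// U+E001..=U+E004 codepoints. The input is never rejected; collisions
/// are returned as data alongside the rewritten text.
fn sanitize(raw: &str) -> (String, Vec<PuaCollision>) {
    // BOM strip: only a BOM at the very head is removed.
    let body = raw.strip_prefix('\u{FEFF}').unwrap_or(raw);
    // CRLF normalisation (stand-in for the single memchr2 pass).
    let text = body.replace("\r\n", "\n");
    // PUA collision pre-scan over the already-normalised text, so the
    // reported offsets match what later phases see.
    let collisions = text
        .char_indices()
        .filter(|(_, c)| ('\u{E001}'..='\u{E004}').contains(c))
        .map(|(byte_offset, ch)| PuaCollision { byte_offset, ch })
        .collect();
    (text, collisions)
}
```

Returning the collisions as a value instead of logging them keeps the function pure, which matches how the rest of the pipeline reports diagnostics.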

Events: SIMD scan

Trigger byte detection runs the SIMD multi-pattern scanner from aozora-scan. Multiple backends share a common trait; selection happens once via runtime CPU detection and is cached for the process lifetime. See Architecture → SIMD scanner backends for the dispatch order and what each backend looks like in samply.

Pair → Classify

Bracket matching is a single linear-time stack walk over the trigger event stream. Classify then does the actual recognition: each opener type maps via the SLUGS dispatch table to a recogniser, and the recogniser produces the borrowed AozoraNode<'arena>. lex_into_arena then registers that node and substitutes a PUA sentinel for it in the normalised text. The slug canonicalisation makes prefix collisions (ここから2字下げ vs ここから2字下げ、地寄せ) deterministic without relying on declaration order. Look-back targets (bouten / tcy) resolve in the same walk against the sanitised text.

Render

Two render walkers:

  • html::render_to_string — a single O(n) tree walk emitting semantic HTML5 with aozora-* class hooks.
  • serialize::serialize — re-emits canonical 青空文庫 source.

Both are pure functions; both allocate exactly the output buffer and nothing else.

What the pipeline does not do

No tree mutation between layers. No optimisation passes. No “resolver” stage that mutates the AST. The lexer produces the final tree; the renderer consumes it; that’s it. This is the same shape as a functional reactive pipeline, and it’s what lets the borrowed-arena AST (next chapter) work without RefCell or UnsafeCell.

Borrowed-arena AST

AozoraTree<'a> is not an owned tree. It’s a borrow into two things owned by Document:

  • the source Box<str>,
  • a bumpalo::Bump arena that holds every intermediate node and child slice.

flowchart LR
    subgraph Document
        src["Box&lt;str&gt; source"]
        bump["bumpalo::Bump arena"]
    end
    tree["AozoraTree&lt;'a&gt;"]
    walk["render / serialize / iterate"]

    src -.borrows.-> tree
    bump -.borrows.-> tree
    tree --> walk

When the Document drops, the source Box<str> and the arena’s backing chunks are released in a handful of free() calls, and every node, every container, every interned string goes with them. There is no per-node destructor and no walk-the-tree-to-free pass.

Why an arena and not Box<Node> everywhere?

The naive Rust shape — enum Node { Ruby { target: String, … }, … } — would allocate per node, per String, per Vec<Node> for container children. For a typical Aozora Bunko work (~500 KiB source, ~50 000 nodes) that’s:

  • ~50 000 individual heap allocations,
  • ~50 000 individual frees on drop (each one a separate call into the heap allocator to update its free list),
  • 16+ bytes of allocator metadata per allocation,
  • random-access fragmentation that defeats prefetch.

The arena variant produces:

  • ~16 bump allocations (4 KiB pages, refilled on overflow),
  • a handful of frees on drop, one per arena chunk (Bump::reset, by contrast, keeps one chunk for reuse and returns the excess chunks to the global allocator),
  • Sequential layout: nodes that were lexed near each other live near each other in memory, which is exactly the order the renderer walks them.

Measured on the corpus sweep: the arena variant parses 6.4× faster than the equivalent Box<Node> shape, and the peak RSS is 30% lower. The win is cumulative — every binding (CLI / WASM / FFI / Python) inherits it.

Why bumpalo over typed-arena, slotmap, or hand-rolled?

| Crate | Shape | Why aozora does or doesn’t use it |
| --- | --- | --- |
| typed-arena | One arena per type (Arena<Ruby>, Arena<Bouten>, …) | aozora has 30+ node types; managing 30 arenas is operationally awkward and forces lifetime-bound &'a per type. |
| slotmap | Index-keyed nodes; arena owns; access via SlotMap::get | Adds an indirection (key → slot → node) on every walk, regressing render throughput by ~25% on the bench harness. Also forces Copy keys, which for variable-length text fields means re-interning. |
| id-arena / index_vec | Index-typed, &str borrowing | Same indirection cost as slotmap. |
| Hand-rolled bump | Custom; tightest control | Correct, but bumpalo is already a stable, mainstream, allocator-aware bump arena with bumpalo::collections::Vec for child slices. Reinventing wins nothing. |
| bumpalo | Single arena, type-erased; allocate any T with bump.alloc(T) | Chosen. One arena per Document; allocate-then-borrow gives &'a T for the lifetime of the arena. Matches aozora’s “one arena per Document” need exactly. |

bumpalo’s collections::Vec<'bump, T> (used for container child slices) is Vec-shaped but allocated inside the arena — child slices get the same arena lifetime as the parent without a separate allocation strategy.

How the AST shape interacts with the lifetime

pub enum AozoraNode<'src> {
    Plain(&'src str),
    Ruby(Ruby<'src>),
    Bouten(Bouten<'src>),
    Tcy(Tcy<'src>),
    Gaiji(Gaiji<'src>),
    Container(&'src Container<'src>),    // boxed in the arena
    BreakNode(BreakNode),
    // … 30+ variants
}

The 'src lifetime is the arena lifetime (re-using 'src because all node text borrows from the source buffer, which lives at least as long as the arena). Each variant either:

  • holds a &str slice into the source (zero copy), or
  • is a small Copy struct (BreakNode, Saidoku, …), or
  • is &'src Container<'src> — boxed in the arena because Container itself contains a &'src [AozoraNode<'src>] child slice.

The whole AozoraNode is Copy (it’s a tagged union of references and small primitives), so iterating the tree never needs to hold a borrow on a node: dereference the slice element, copy the node out, and walk on.

What you trade

The big trade-off: the tree can’t outlive the Document. A Vec<AozoraNode<'_>> can’t be returned from, or stored beyond, the scope that owns the Document, because the '_ lifetime is bound to the arena, which is bound to the Document.

In practice this rarely matters — consumers either:

  • Render the tree immediately and discard (tree.to_html() returns String, which has no lifetime tie).
  • Walk the tree once and emit their own owned IR (most editor backends do this).
  • Hold the Document itself across function boundaries and re-derive the tree on the inside.

For consumers that genuinely need an owned tree, the visitor trait on AozoraTree makes the conversion trivial — walk the tree once and emit your own owned IR. We resist shipping a built-in aozora::owned because doing so would push consumers toward it even when an immediate to_html() or per-walk transcription would serve them better.
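
The “walk once, emit an owned IR” pattern can be sketched without the real crate. Everything here is a hypothetical toy model: `Document`, `Tree`, `Node`, `OwnedNode`, and `to_owned_ir` are stand-ins, the parser only understands explicit-base ruby, and a plain iterator replaces the visitor trait mentioned above.

```rust
/// Borrowed, Copy node: every field is a slice into the Document's source.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Node<'src> {
    Plain(&'src str),
    Ruby { base: &'src str, reading: &'src str },
}

/// Owned counterpart a consumer might define for itself.
#[derive(Debug, PartialEq)]
enum OwnedNode {
    Plain(String),
    Ruby { base: String, reading: String },
}

struct Document {
    source: Box<str>,
}

/// The tree borrows from the Document and cannot outlive it.
struct Tree<'src> {
    nodes: Vec<Node<'src>>,
}

impl Document {
    fn new(source: String) -> Self {
        Self { source: source.into_boxed_str() }
    }

    /// Toy parser: recognises only `|base《reading》`.
    fn parse(&self) -> Tree<'_> {
        let mut nodes = Vec::new();
        let mut rest: &str = &self.source;
        while let Some(bar) = rest.find('|') {
            let after = &rest[bar + 1..];
            let (Some(open), Some(close)) = (after.find('《'), after.find('》')) else { break };
            if close < open {
                break;
            }
            if bar > 0 {
                nodes.push(Node::Plain(&rest[..bar]));
            }
            nodes.push(Node::Ruby {
                base: &after[..open],
                reading: &after[open + '《'.len_utf8()..close],
            });
            rest = &after[close + '》'.len_utf8()..];
        }
        if !rest.is_empty() {
            nodes.push(Node::Plain(rest));
        }
        Tree { nodes }
    }
}

/// One walk over the borrowed tree, copying text out into owned nodes.
fn to_owned_ir(tree: &Tree<'_>) -> Vec<OwnedNode> {
    tree.nodes
        .iter()
        .map(|node| match *node {
            Node::Plain(s) => OwnedNode::Plain(s.to_owned()),
            Node::Ruby { base, reading } => OwnedNode::Ruby {
                base: base.to_owned(),
                reading: reading.to_owned(),
            },
        })
        .collect()
}
```

The owned vector carries no lifetime, so it can leave the scope that owns the `Document`, while the borrowed `Tree` cannot.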

Lifetime safety

The 'src parameter prevents these shapes at compile time:

fn bad() -> AozoraTree<'static> {
    let doc = aozora::Document::new("…".into());
    doc.parse()        // ERROR: cannot return value referencing local
}

Borrow-checker enforcement; no runtime Drop ordering bugs possible.

Four-phase lexer

aozora-pipeline runs the lexer as four pure-functional phases, each fn(input) -> output with no shared mutable state. The split keeps the dominant hot path (Phase 1 events / Phase 3 classify) tight, lets the bench harness measure each phase independently, and maps every diagnostic to a single phase boundary.

The single public entry lex_into_arena drives all four phases and lands the resulting borrowed AST inside an aozora_syntax::borrowed::Arena provided by the caller. The legacy “phase 4 normalize / phase 5 registry / phase 6 validate” steps disappeared into a fused walk inside lex_into_arena; they no longer have standalone phase functions.

Phase ordering

flowchart LR
    p0["Phase 0<br/>sanitize"]
    p1["Phase 1<br/>events"]
    p2["Phase 2<br/>pair"]
    p3["Phase 3<br/>classify"]
    fused["lex_into_arena<br/>(fused walk:<br/>normalize + registry + validate)"]

    p0 --> p1 --> p2 --> p3 --> fused

Each arrow carries a small data structure (sanitised text, trigger events, pair events, classified spans); no phase reads back into a previous phase’s output.

| Phase | Input | Output | Responsibility |
| --- | --- | --- | --- |
| 0 · Sanitize | raw &str | SanitizeOutput { sanitized: &str, .. } | BOM strip, CRLF → LF, accent decomposition, decorative-rule isolation, PUA collision pre-scan |
| 1 · Events | sanitised &str | Iterator<Item = Token> | SIMD trigger scan (aozora-scan) followed by linear tokenise into Plain / trigger events |
| 2 · Pair | Iterator<Token> | Iterator<Item = PairEvent> | Balanced-stack pairing for all opener/closer triggers (\|》《, [], 〔〕, 「」, 《《》》) |
| 3 · Classify | Iterator<PairEvent> | Iterator<Item = ClassifiedSpan> | Full-spec Aozora classification into AozoraNode variants (ruby, bouten, gaiji, tcy, kaeriten, sashie, annotation, …) |

The orchestrator lex_into_arena consumes the Phase 3 stream, substitutes PUA sentinels into the normalised text, builds the side-table registry that maps sentinel positions back to classified AozoraNode values, and accumulates diagnostics — all in a single fused walk over the classified-span stream.

Phase 0: sanitize

This phase touches the widest variety of input features. Sub-passes (in order):

  • bom_strip — UTF-8 BOM detection and removal at the head.
  • normalize_line_endings — CRLF → LF in one memchr2 pass.
  • rewrite_accent_spans — ASCII digraph / ligature decomposition for accent gaiji.
  • isolate_decorative_rules — long horizontal-rule lines (────────── patterns) get separated from neighbouring text so Phase 1’s trigger scan does not split them mid-glyph.
  • scan_for_sentinel_collisions — pre-scan for stray PUA codepoints (U+E001..U+E004); any hit emits Diagnostic::SourceContainsPua and the colliding bytes flow through verbatim (the registry has no entry for them, so they degrade to plain text).

Each sub-pass is independent and runs over the same buffer. The output SanitizeOutput carries the rewritten text alongside any diagnostics emitted along the way.

Phase 1: events

The hot path. SIMD multi-pattern scan from aozora-scan finds every trigger byte position; a single linear walk converts those positions into Token events:

pub enum Token<'src> {
    Plain(&'src str),
    Trigger(TriggerKind, Span),
}

The trigger scan and the tokenise loop fuse so the output stream allocates no per-event vector — downstream phases consume the iterator directly. See SIMD scanner backends for the runtime backend selection.
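
A scalar sketch of that fused shape follows. It is illustrative only: `str::find` stands in for the SIMD scan, a `(char, Range<usize>)` pair stands in for `(TriggerKind, Span)`, and the trigger list is copied from the 11 characters named in the SIMD scanner chapter.

```rust
use std::ops::Range;

#[derive(Debug, PartialEq, Eq)]
enum Token<'src> {
    Plain(&'src str),
    /// Stand-in for `Trigger(TriggerKind, Span)`: the character and its byte range.
    Trigger(char, Range<usize>),
}

const TRIGGERS: &[char] = &['|', '《', '》', '#', '※', '[', ']', '〔', '〕', '「', '」'];

/// Lazily yields tokens; no intermediate vector of events is built.
fn tokenize<'src>(src: &'src str) -> impl Iterator<Item = Token<'src>> + 'src {
    let mut pos = 0;
    // A trigger found behind a plain run is parked here for the next call.
    let mut pending: Option<Token<'src>> = None;
    std::iter::from_fn(move || {
        if let Some(token) = pending.take() {
            return Some(token);
        }
        if pos >= src.len() {
            return None;
        }
        let rest = &src[pos..];
        match rest.find(TRIGGERS) {
            Some(rel) => {
                let ch = rest[rel..].chars().next()?;
                let start = pos + rel;
                let end = start + ch.len_utf8();
                pos = end;
                let trigger = Token::Trigger(ch, start..end);
                if rel == 0 {
                    Some(trigger)
                } else {
                    pending = Some(trigger);
                    Some(Token::Plain(&rest[..rel]))
                }
            }
            None => {
                pos = src.len();
                Some(Token::Plain(rest))
            }
        }
    })
}
```

Because the result is an iterator, a downstream phase can consume tokens as they are produced, which is the property the paragraph above describes.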

Throughput on a typical mid-size work (crime_and_punishment.txt, ~600 KiB UTF-8): on the order of GB/s for the SIMD backends, which is well above the rest of the pipeline’s throughput; Phase 1 is essentially free at the corpus level. Concrete numbers are pinned by cargo bench -p aozora-bench --bench crime_and_punishment and the synthetic corpus bench.

Phase 2: pair

Balanced-stack bracket matching. Walk the trigger event stream, push openers onto a SmallVec<[(PairKind, Span); 8]> (inline capacity 8 covers 99th-percentile bracket nesting in the real corpus), pop on closers, and emit a PairEvent::Solo / Matched / Unmatched / Unclosed for every trigger.

Phase 2 is also the first place recovery semantics fire: stray closers and unclosed openers each emit a structured diagnostic but never abort, so downstream consumers see a complete event stream regardless of input wellformedness.
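
The stack walk and its never-abort recovery can be sketched as follows. This is a simplified model under assumptions: a `Vec` replaces the `SmallVec`, triggers arrive as `(char, offset)` pairs, only four bracket kinds are handled, one `Matched` event is emitted per pair, and the recovery policy (report a non-matching closer and leave the stack alone) is hypothetical.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum PairEvent {
    /// A trigger that never pairs (for example ※).
    Solo { at: usize },
    Matched { open: usize, close: usize },
    /// A closer with no matching opener on top of the stack.
    Unmatched { close: usize },
    /// An opener still on the stack at end of input.
    Unclosed { open: usize },
}

fn closer_for(open: char) -> Option<char> {
    match open {
        '[' => Some(']'),
        '《' => Some('》'),
        '〔' => Some('〕'),
        '「' => Some('」'),
        _ => None,
    }
}

fn is_closer(ch: char) -> bool {
    matches!(ch, ']' | '》' | '〕' | '」')
}

fn pair(triggers: &[(char, usize)]) -> Vec<PairEvent> {
    // Each stack entry is (expected closer, opener offset).
    let mut stack: Vec<(char, usize)> = Vec::new();
    let mut out = Vec::new();
    for &(ch, at) in triggers {
        if let Some(expected) = closer_for(ch) {
            stack.push((expected, at));
        } else if is_closer(ch) {
            match stack.last().copied() {
                Some((expected, open)) if expected == ch => {
                    stack.pop();
                    out.push(PairEvent::Matched { open, close: at });
                }
                // Recovery: record the stray closer and keep going.
                _ => out.push(PairEvent::Unmatched { close: at }),
            }
        } else {
            out.push(PairEvent::Solo { at });
        }
    }
    // Recovery: whatever is left on the stack was never closed.
    out.extend(stack.into_iter().map(|(_, open)| PairEvent::Unclosed { open }));
    out
}
```

Malformed input produces extra events, never an early return, so the consumer always sees the whole stream.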

Phase 3: classify

The most code-heavy phase. The classifier maps PairEvents to AozoraNode variants via a slug-canonicalised dispatch table (SLUGS / canonicalise_slug). Recognisers are organised per construct family:

  • Ruby (|青梅《おうめ》, with implicit-base auto-glob)
  • Bouten / forward-bouten ([#「平和」に傍点], with look-back target resolution)
  • Tate-chu-yoko ([#「12」は縦中横])
  • Gaiji (※[#説明、ページ-行])
  • Kaeriten (Chinese-text reading marks)
  • Sashie (illustrations)
  • Indent / alignment / line-length annotations
  • Section / page breaks

The recogniser dispatch is deterministic and slug-canonicalised so prefix collisions (ここから2字下げ vs ここから2字下げ、地寄せ) resolve via the SLUGS entry’s family + arity, not by recogniser ordering. Look-back targets (bouten / tcy) resolve against the sanitised text in the same walk.
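
One way to picture why canonicalisation removes the ordering dependence is the sketch below. It is an assumed model, not the real SLUGS / canonicalise_slug code: digit runs are replaced by a single `N` and collected as arguments, and the table is then probed by exact match. The `Family` variants and the two table rows are hypothetical.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Family {
    IndentOpen,
    IndentOpenWithAlign,
}

/// Hypothetical dispatch table keyed by canonical slug.
const SLUGS: &[(&str, Family)] = &[
    ("ここからN字下げ", Family::IndentOpen),
    ("ここからN字下げ、地寄せ", Family::IndentOpenWithAlign),
];

/// Replace every run of ASCII or full-width digits with one `N`
/// and return the numbers that were removed, in order.
fn canonicalise_slug(body: &str) -> (String, Vec<u32>) {
    let mut slug = String::new();
    let mut args = Vec::new();
    let mut current: Option<u32> = None;
    for ch in body.chars() {
        let digit = match ch {
            '0'..='9' => Some(ch as u32 - '0' as u32),
            '０'..='９' => Some(ch as u32 - '０' as u32),
            _ => None,
        };
        match digit {
            Some(d) => {
                if current.is_none() {
                    slug.push('N');
                }
                current = Some(current.unwrap_or(0).saturating_mul(10).saturating_add(d));
            }
            None => {
                if let Some(n) = current.take() {
                    args.push(n);
                }
                slug.push(ch);
            }
        }
    }
    if let Some(n) = current {
        args.push(n);
    }
    (slug, args)
}

/// Exact match on the canonical slug: a longer directive can never be
/// captured by a shorter one that happens to be its prefix.
fn classify(body: &str) -> Option<(Family, Vec<u32>)> {
    let (slug, args) = canonicalise_slug(body);
    SLUGS
        .iter()
        .find(|(s, _)| *s == slug)
        .map(|&(_, family)| (family, args))
}
```

Because the lookup is an exact match on the whole canonical string, reordering the rows of the table cannot change which recogniser a directive reaches.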

Fused finishing walk

After Phase 3, lex_into_arena runs a single output-build walk that does what was once three separate phases:

  • Normalise: substitute each Aozora span with its PUA sentinel (U+E001/E002/E003/E004 for inline / block-leaf / block-open / block-close) so a downstream consumer (such as the CommonMark parser in the sibling project afm) sees flat text with single-codepoint placeholders.
  • Register: build the Registry (an EytzingerMap<u32, NodeRef<'src>>, see Eytzinger sorted-set lookup) keyed by sentinel byte position so the post-process walk can recover the borrowed-AST node from a normalised position in O(log n).
  • Validate + diagnostics: collect every Phase-0 / Phase-2 / Phase-3 diagnostic, sort by span, and pin stable codes (aozora::lex::source_contains_pua, aozora::lex::unclosed_bracket, …; see diagnostics).

Performing all three in one walk avoids three extra passes over the (potentially MB-class) source and keeps the Registry’s EytzingerMap build amortised.
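
The normalise and register steps can be sketched together in one walk. This is a simplified model: only the inline sentinel is used, a string label stands in for `NodeRef`, and a sorted `Vec` with the standard binary search stands in for the EytzingerMap. The names `normalise` and `node_at` are hypothetical.

```rust
use std::ops::Range;

const INLINE_SENTINEL: char = '\u{E001}';

/// `spans` must be sorted and non-overlapping byte ranges of `src`.
/// Returns the normalised text and a registry of (sentinel position, node).
fn normalise<'a>(src: &str, spans: &[(Range<usize>, &'a str)]) -> (String, Vec<(u32, &'a str)>) {
    let mut out = String::with_capacity(src.len());
    let mut registry = Vec::with_capacity(spans.len());
    let mut cursor = 0;
    for (range, node) in spans {
        // Copy the plain text before the span, then drop in one sentinel.
        out.push_str(&src[cursor..range.start]);
        registry.push((out.len() as u32, *node));
        out.push(INLINE_SENTINEL);
        cursor = range.end;
    }
    out.push_str(&src[cursor..]);
    (out, registry)
}

/// O(log n) lookup from a normalised byte position back to its node.
/// Positions are pushed in increasing order, so the registry is sorted.
fn node_at<'a>(registry: &[(u32, &'a str)], pos: u32) -> Option<&'a str> {
    registry
        .binary_search_by_key(&pos, |&(p, _)| p)
        .ok()
        .map(|i| registry[i].1)
}
```

Since the text and the registry are built in the same loop, registry positions come out strictly increasing without a separate sort.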

Why four phases, not one big function?

Three reasons.

  1. Bench-driven optimisation. Per-phase boundaries let cargo bench -p aozora-bench measure each phase’s wall time independently. Knowing that “this document spends 80 % of parse time in Phase 3 classify” tells you where the next perf PR belongs. A monolithic lex() would force re-instrumentation in every PR.
  2. Spec compliance. Each phase corresponds to a discrete transformation the spec describes. Spec gaps in production almost always land in one phase, and the conformance suite can pin regression fixtures targeting that phase only.
  3. Composability. aozora-pipeline exposes both the fused lex_into_arena entry and the per-phase functions (sanitize, tokenize / tokenize_in, pair / pair_in, classify). Production code uses the fused entry; benchmarks and the type-state Pipeline state machine use per-phase calls to isolate regressions.

The cost is conceptual (more API surface internal to the crate); the win is that every perf decision in the parser has a measurement attached.

SIMD scanner backends

Phase 1 of the lexer is a multi-pattern byte scan: find every occurrence of the 11 Aozora trigger characters (|《》#※[]〔〕「」) in the source. On a typical Japanese corpus document, where nearly every codepoint is a 3-byte UTF-8 sequence and trigger characters account for on the order of 1–2 % of bytes, the bytes to be scanned outnumber the bytes that need interpreting by more than an order of magnitude. So this is the place where SIMD pays for itself.

Architecture: outer driver × inner kernel

aozora-scan ships a single algorithm — Hyperscan-style Teddy with nibble LUTs — implemented once as a platform-agnostic outer driver and plugged into per-ISA inner kernels. The split is the spine of the crate:

  • crate::kernel::teddy (algorithm side). Defines the const-built bucket LUTs (one bit per pattern; the 11 triggers fit comfortably in the 16-bit mask), the verify table, the TeddyInner trait every kernel implements, and teddy_outer, the platform-agnostic chunk loop + verify pass.
  • crate::arch::* (platform side). One file per ISA; each implements TeddyInner::lead_mask_chunk using the appropriate 16-byte LUT shuffle: pshufb on x86 SSSE3, _mm256_shuffle_epi8 on AVX2, vqtbl1q_u8 on NEON, i8x16_swizzle on WASM SIMD.

Adding a new SIMD ISA is one file under arch/. Adding a new algorithm (e.g. SHIFT-OR baseline, AVX-512 64-byte chunk) is one file under kernel/. The two axes never tangle.
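
The nibble-LUT idea behind the inner kernels can be shown in scalar form. This is a sketch under assumptions, not the aozora-scan code: it fingerprints only the lead byte of each pattern, processes one byte at a time where a SIMD kernel would shuffle 16 or 32 at once, and all names (`NibbleLuts`, `build_luts`, `scan`) are hypothetical.

```rust
/// The 11 trigger characters, one bit each in a u16 bucket mask.
const PATTERNS: [&str; 11] = ["|", "《", "》", "#", "※", "[", "]", "〔", "〕", "「", "」"];

struct NibbleLuts {
    lo: [u16; 16],
    hi: [u16; 16],
}

/// For each pattern, set its bit in the LUT rows selected by the low
/// and high nibble of its lead byte.
fn build_luts() -> NibbleLuts {
    let mut luts = NibbleLuts { lo: [0; 16], hi: [0; 16] };
    for (bit, pattern) in PATTERNS.iter().enumerate() {
        let lead = pattern.as_bytes()[0];
        luts.lo[(lead & 0x0F) as usize] |= 1u16 << bit;
        luts.hi[(lead >> 4) as usize] |= 1u16 << bit;
    }
    luts
}

/// Returns (byte offset, pattern index) for every trigger occurrence.
fn scan(haystack: &[u8], luts: &NibbleLuts) -> Vec<(usize, usize)> {
    let mut hits = Vec::new();
    for (i, &byte) in haystack.iter().enumerate() {
        // Candidate mask: patterns whose lead byte agrees on both nibbles.
        let mut mask = luts.lo[(byte & 0x0F) as usize] & luts.hi[(byte >> 4) as usize];
        while mask != 0 {
            let p = mask.trailing_zeros() as usize;
            mask &= mask - 1; // clear the lowest set bit
            // Verify pass: confirm the full pattern at this offset.
            if haystack[i..].starts_with(PATTERNS[p].as_bytes()) {
                hits.push((i, p));
            }
        }
    }
    hits
}
```

Most Japanese text shares the lead byte 0xE3 with several triggers, so in this lead-byte-only sketch the verify step rejects many candidates; a continuation byte such as 0x80 never reaches the verify step because no pattern has that lead byte.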

BackendChoice + static dispatch

BackendChoice is a Copy enum carrying one variant per inner kernel currently compiled into the build. BackendChoice::detect() runs once at process start, picks the fastest variant the host CPU supports (cached in OnceLock), and the match-based BackendChoice::scan gives static dispatch straight into the monomorphised teddy_outer<I> instantiation. No &dyn, no virtual call on the hot path.

Static dispatch is the whole point: a trait object cannot carry a generic S: OffsetSink method, so a &dyn-based dispatcher would force every parse to allocate a heap Vec<u32> and memcpy it into the lex pipeline’s bumpalo arena. The enum-and-match shape gives us the same runtime-CPU adaptation a single binary needs without that detour.
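
The enum-and-match shape can be sketched with the standard library alone. This is an illustrative model, not the crate’s API: the variant names, the `record` method, and the single-byte `scan_chunks` kernel are hypothetical, and the two “backends” differ only in chunk width.

```rust
use std::sync::OnceLock;

/// Stand-in for the sink trait: receives each trigger offset as it is found.
trait OffsetSink {
    fn record(&mut self, offset: u32);
}

impl OffsetSink for Vec<u32> {
    fn record(&mut self, offset: u32) {
        self.push(offset);
    }
}

/// Toy kernel, generic over chunk width and sink, so each call site
/// below is monomorphised separately.
fn scan_chunks<const N: usize, S: OffsetSink>(haystack: &[u8], needle: u8, sink: &mut S) {
    for (chunk_idx, chunk) in haystack.chunks(N).enumerate() {
        for (i, &byte) in chunk.iter().enumerate() {
            if byte == needle {
                sink.record((chunk_idx * N + i) as u32);
            }
        }
    }
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum BackendChoice {
    Wide32,
    Narrow16,
}

impl BackendChoice {
    /// Detect once, then serve the cached answer.
    fn detect() -> Self {
        static CHOICE: OnceLock<BackendChoice> = OnceLock::new();
        *CHOICE.get_or_init(|| {
            #[cfg(target_arch = "x86_64")]
            {
                if std::arch::is_x86_feature_detected!("avx2") {
                    return BackendChoice::Wide32;
                }
            }
            BackendChoice::Narrow16
        })
    }

    /// A generic method like this cannot live behind `&dyn`; the match
    /// keeps the sink type visible to the compiler in every arm.
    fn scan<S: OffsetSink>(self, haystack: &[u8], needle: u8, sink: &mut S) {
        match self {
            BackendChoice::Wide32 => scan_chunks::<32, S>(haystack, needle, sink),
            BackendChoice::Narrow16 => scan_chunks::<16, S>(haystack, needle, sink),
        }
    }
}
```

The caller picks the sink type, so an arena-backed vector could implement `OffsetSink` and receive offsets directly, with no intermediate heap `Vec`.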

Backends compiled into the build

| Variant | Target gate | Kernel size | Notes |
| --- | --- | --- | --- |
| TeddyAvx2 | x86_64 | 32-byte chunk | Production winner on every modern dev / CI host. _mm256_shuffle_epi8 per-lane LUT shuffle. |
| TeddySsse3 | x86_64 | 16-byte chunk | Selected when AVX2 is unavailable but SSSE3 is. _mm_shuffle_epi8 (pshufb). |
| TeddyNeon | aarch64 | 16-byte chunk | aarch64 ABI mandates NEON, so always selected on that target. vqtbl1q_u8. |
| TeddyWasm | wasm32 | 16-byte chunk | WASM SIMD128 baseline since 2022. i8x16_swizzle. |
| ScalarTeddy | always | 16-byte chunk, no SIMD | Pure-Rust reference; the no_std last-resort dispatch target and the proptest oracle for SIMD ports. |

NaiveScanner (brute-force PHF walker) is #[doc(hidden)] — kept reachable for the integration proptests and the bake-off bench, never the dispatch target.

Why a self-rolled Teddy

The previous production stack drove three scanners: two external crates, aho_corasick::packed::teddy (SSSE3-only) and regex_automata (DFA), plus a hand-rolled simdjson-style structural bitmap (AVX2). Coverage gaps forced redundant fallback code on every commit and the trio carried ~1.4 MB of compiled dependency surface.

Switching to a self-rolled Teddy:

  • One algorithm, four ISAs. The outer driver is ~120 LOC; each ISA inner kernel is ~30 LOC. NEON / WASM SIMD ports compile natively rather than waiting on upstream aho_corasick.
  • No external SIMD deps. aho_corasick and regex_automata are gone from the default dep tree. The aozora-scan build no longer pulls in regex-automata’s ~600 KB of state-table code.
  • One-bit-per-pattern bucket layout. The 11 triggers fit in the lower 11 bits of a u16; we don’t pay for the collision-verify pass Hyperscan’s “fat-finger” packing requires.
  • OffsetSink visitor. Every kernel writes through the same generic sink, so the lex pipeline’s BumpVec<'_, u32> receives offsets directly from the SIMD inner loop — the legacy heap-allocate-then-memcpy detour is gone.

Every kernel cross-validates byte-identically against NaiveScanner in proptest, both in-source (chunk-level) and in tests/property_backend_equiv.rs (end-to-end across the workhorse fragment / pathological / unicode-adversarial distributions).

Verifying the scanner is firing

println!("{}", aozora_scan::BackendChoice::detect().name());
// "teddy-avx2" | "teddy-ssse3" | "teddy-neon" | "teddy-wasm" | "scalar-teddy"

Or under samply, look for one of the per-ISA inner kernels:

  • aozora_scan::arch::x86_64::lead_mask_chunk_avx2
  • aozora_scan::arch::x86_64::lead_mask_chunk_ssse3
  • aozora_scan::arch::aarch64::lead_mask_chunk_neon
  • aozora_scan::arch::wasm32::lead_mask_chunk_wasm
  • aozora_scan::kernel::teddy::ScalarTeddyKernel::lead_mask_chunk

Their parent in the call tree is always aozora_scan::kernel::teddy::teddy_outer, where the chunk loop lives.

Eytzinger sorted-set lookup

aozora-veb is a no_std crate that provides one data structure: a sorted-set lookup over a static slice of strings, laid out in Eytzinger order so that the binary search is cache-friendly. It backs the placeholder registry the lexer uses to recognise the fixed-set strings inside [#…] directives (“ここから”, “ここで”, “傍点”, “傍線”, “字下げ”, …).

flowchart LR
    needle["needle: &str"]
    table["Eytzinger-laid sorted set<br/>(static &[&str])"]
    cmp["compare at index, branch left/right"]
    found["Some(idx) | None"]

    needle --> cmp
    table --> cmp
    cmp --> cmp
    cmp --> found

What is Eytzinger order?

A standard sorted array stores elements in their natural order: [a, b, c, d, e, f, g]. Binary search visits indexes 3, 1 or 5, 0/2/4/6 — accesses that are spatially distant in memory. On modern CPUs that’s a cache miss per level past L1.

Eytzinger order stores the same elements in implicit-binary-tree (breadth-first) order: the root at index 1 (index 0 is reserved as a sentinel), left child at 2i, right child at 2i+1. The walk visits index 1, then 2 or 3, then one of 4/5/6/7, so the top levels of every search sit next to each other in memory.

For 256+ entries the cache-line packing is a measured 2–3× speedup over the standard library’s slice binary_search on the same data. Below 64 entries the difference is in the noise (the whole table stays resident in a handful of cache lines). The placeholder registry has ~120 entries, which sits between those two marks and inside the sub-256-entry range where aozora uses Eytzinger.

Why this and not phf::Set?

phf::Set is a perfect-hash table: O(1) lookup, but with a real constant — one hash computation, one table probe, one strcmp. For short strings (the placeholder registry’s median is 4 chars) the hash dominates, and the table probe is a pointer chase to a separate allocation.

Eytzinger search is log N — but for N=120 that’s 7 comparisons, all in one contiguous slice, no hashing, no separate allocation. Measured: Eytzinger is ~1.5× faster than phf::Set on this workload.

For larger sets (the gaiji table at ~14 000 entries), phf::Set wins — log₂(14000) is 14 comparisons and the cache locality stops mattering. The choice is entry-count-dependent. The aozora codebase uses Eytzinger for sub-256-entry tables and phf::Set for larger ones; the cutoff was determined empirically.

Why not a hash table?

A HashMap<&str, ()> allocates and rehashes; phf and Eytzinger don’t. In the lexer’s Phase 3 classify, the placeholder registry is hit once per [#…] directive — measured as ~5 lookups per KB of source. A HashMap’s startup cost (build the table from a const array on first use, even with OnceLock) would dominate the parser’s per-Document::parse cost on tiny inputs.

API

pub struct EytzingerSet<'a> {
    entries: &'a [&'a str],   // already in Eytzinger order
}

impl<'a> EytzingerSet<'a> {
    pub const fn new(entries: &'a [&'a str]) -> Self { Self { entries } }

    pub fn contains(&self, needle: &str) -> bool { … }
    pub fn position(&self, needle: &str) -> Option<usize> { … }
}

new is const fn so registries are computed at compile time and end up in .rodata. Lookup is a single function with no allocation.

Building the order

The crate ships a build-time helper that takes a sorted slice and produces the Eytzinger permutation:

const PLACEHOLDERS: &[&str] = aozora_veb::eytzinger_layout!(
    "ここから", "ここで", "傍点", "傍線", "字下げ", …
);

The macro is const-evaluated; the resulting slice is what EytzingerSet::new takes.
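
For readers who want to see the permutation and the search spelled out, here is a runtime sketch. It is not the crate’s const macro or its `EytzingerSet` type: `build_eytzinger` and `position` are hypothetical free functions, and the layout is built at run time with an in-order fill.

```rust
use std::cmp::Ordering;

/// Lay a sorted slice out in Eytzinger order. Index 0 is an unused
/// sentinel slot; the root lives at index 1, children at 2k and 2k + 1.
fn build_eytzinger<'a>(sorted: &[&'a str]) -> Vec<&'a str> {
    // An in-order walk of the implicit tree visits slots in sorted order,
    // so handing out `sorted` elements during that walk yields the layout.
    fn fill<'a>(sorted: &[&'a str], out: &mut [&'a str], next: &mut usize, k: usize) {
        if k < out.len() {
            fill(sorted, out, next, 2 * k);
            out[k] = sorted[*next];
            *next += 1;
            fill(sorted, out, next, 2 * k + 1);
        }
    }
    let mut out: Vec<&'a str> = vec![""; sorted.len() + 1];
    let mut next = 0;
    fill(sorted, &mut out, &mut next, 1);
    out
}

/// Search the Eytzinger-ordered slice; returns the slot index on a hit.
fn position(tree: &[&str], needle: &str) -> Option<usize> {
    let mut k = 1;
    while k < tree.len() {
        match needle.cmp(tree[k]) {
            Ordering::Equal => return Some(k),
            Ordering::Less => k *= 2,
            Ordering::Greater => k = 2 * k + 1,
        }
    }
    None
}
```

For the seven-element example from earlier, `[a, b, c, d, e, f, g]` becomes `[_, d, b, f, a, c, e, g]`: the root `d` sits at index 1 and its children `b` and `f` sit right beside it.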

Why a separate crate?

The lookup is no_std and has no aozora-specific dependencies. By extracting it, three things become true:

  1. The lexer can depend on aozora-veb without pulling in any workspace state, which keeps aozora-veb’s test surface small.
  2. aozora-veb can be reused by aozora-encoding (for the accent decomposition table) and by aozora-bench (for category slug lookups in the trace rollup) without forming a circular dependency.
  3. Future consumers can depend on just aozora-veb for the data structure, without taking the whole parser.

Shift_JIS + 外字 resolver

aozora-encoding covers the full source-decoding stack:

  1. Shift_JIS / Shift_JIS-2004 / cp932 byte stream → UTF-8 string.
  2. JIS X 0213 plane-2 ideographs → Unicode (where possible).
  3. 外字 references (※[#…]) → resolved Unicode codepoint, JIS triple, or descriptive-text fallback.
  4. Accent decomposition (114 ASCII digraph / ligature → Unicode).

All four are pure functions; the crate has no global state and nothing that varies per-call.

Decode chain

flowchart TD
    raw["raw bytes<br/>(SJIS-encoded .txt from Aozora Bunko)"]
    sjis["encoding_rs::SHIFT_JIS<br/>or aozora-specific JIS X 0213 patch"]
    utf8["UTF-8 String"]
    sanitize["Phase 0 sanitize<br/>(in aozora-pipeline)"]
    pua["PUA assignment for 外字"]
    classified["normalised &str ready for Phase 1 scan"]

    raw --> sjis --> utf8 --> sanitize --> pua --> classified

The Shift_JIS decode itself uses encoding_rs, the same crate Firefox uses for HTML decoding. It is battle-tested and SIMD-accelerated, and it covers the cp932-compatible Shift_JIS repertoire that Aozora Bunko sources have relied on since the 1990s. We add a thin patch layer for the JIS X 0213 plane-2 codepoints that encoding_rs’s strict cp932 mapping doesn’t cover (Aozora’s spec extends Shift_JIS into JIS X 0213 territory; encoding_rs keeps the strict cp932 surface).

外字 (gaiji) PHF table

The reference table contains ~14 000 entries:

static GAIJI_TABLE: phf::Map<&'static str, GaijiEntry> = phf_map! {
    "1-94-37" => GaijiEntry::JisX0213 { plane: 1, row: 94, cell: 37, codepoint: '⿰魚師' },
    "U+5F85"  => GaijiEntry::Direct   { codepoint: '待' },
    "魚+師のつくり" => GaijiEntry::Description { fallback: "[魚+師]" },
    …
};

Why PHF (perfect hash function):

  • The table is large enough (~14 000 entries) that linear scan or Eytzinger search would dominate the lookup cost.
  • It’s static and known at compile time — the perfect hash is computable once.
  • phf produces zero-allocation, collision-free lookups: one hash computation, one slice index, and one strcmp. ~25 ns per lookup on the bench harness.

Why not OnceLock<HashMap>:

  • First-call cost: building a HashMap<&str, GaijiEntry> from 14 000 entries on first use takes ~5 ms. That’s longer than parsing a small document end-to-end.
  • Memory: the runtime HashMap takes 2–3× the size of the static PHF (load-factor padding + RawTable metadata).
  • Concurrency: OnceLock adds an atomic load on every access, even after initialisation. PHF is static — no synchronisation.

Why not load from a JSON / TOML asset:

  • Adds startup cost on every Document::new (file I/O is microseconds away from the parser’s whole runtime budget for small inputs).
  • Forces every binding (CLI / WASM / FFI / Python wheel) to ship the asset as a separate file, complicating distribution.
  • Defeats dead-code elimination: the linker can’t strip entries the consumer’s input never references.

The build-time cost of compiling the PHF (~40 s the first time, 0 s incremental) is paid once per workspace build, not per-invocation.

Resolution order

pub fn resolve(reference: &str) -> Resolved {
    // 1. Direct codepoint (U+XXXX) wins outright.
    if let Some(c) = parse_unicode_form(reference) { return Resolved::Direct(c); }

    // 2. JIS X 0213 plane-row-cell triple.
    if let Some(triple) = parse_jis_triple(reference) {
        if let Some(c) = JIS_TABLE.get(&triple) { return Resolved::Lookup(c); }
    }

    // 3. Descriptive name lookup (curated subset).
    if let Some(fallback) = DESCRIPTION_TABLE.get(reference) {
        return Resolved::Fallback(fallback);
    }

    Resolved::Unknown
}

Three layers, in order. Direct wins because the source author explicitly wrote a Unicode codepoint — overriding it would be wrong even if our JIS table disagreed. Lookup is the common case. Fallback is the curated subset of characters that have no Unicode codepoint at all (~120 entries from the 14 000); we ship a descriptive-text rendering rather than dropping the character. Unknown fires diagnostic unresolved_gaiji.
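The three-layer order can be sketched in a self-contained form. This is a minimal illustration, not the crate's code: the two tables below are tiny hypothetical stand-ins for the real PHF maps, the helper bodies are assumptions about how parse_unicode_form and parse_jis_triple might behave, and Resolved is redefined locally.

```rust
/// Outcome of resolving one gaiji reference (mirrors the variants named above).
#[derive(Debug, PartialEq)]
enum Resolved {
    Direct(char),
    Lookup(char),
    Fallback(&'static str),
    Unknown,
}

// Hypothetical stand-ins for the real tables; the entries are illustrative only.
const JIS_TABLE: &[((u8, u8, u8), char)] = &[((1, 1, 1), '\u{3000}')];
const DESCRIPTION_TABLE: &[(&str, &str)] = &[("魚+師のつくり", "[魚+師]")];

/// Assumed behaviour: accept "U+XXXX" and reject anything else.
fn parse_unicode_form(s: &str) -> Option<char> {
    let hex = s.strip_prefix("U+")?;
    char::from_u32(u32::from_str_radix(hex, 16).ok()?)
}

/// Assumed behaviour: accept exactly three dash-separated small integers.
fn parse_jis_triple(s: &str) -> Option<(u8, u8, u8)> {
    let mut parts = s.split('-').map(|p| p.parse::<u8>().ok());
    let triple = (parts.next()??, parts.next()??, parts.next()??);
    parts.next().is_none().then_some(triple)
}

fn resolve(reference: &str) -> Resolved {
    // 1. An explicit codepoint wins outright.
    if let Some(c) = parse_unicode_form(reference) {
        return Resolved::Direct(c);
    }
    // 2. Plane-row-cell triple, looked up in the table.
    if let Some(triple) = parse_jis_triple(reference) {
        if let Some(&(_, c)) = JIS_TABLE.iter().find(|(key, _)| *key == triple) {
            return Resolved::Lookup(c);
        }
    }
    // 3. Descriptive name with a curated text fallback.
    if let Some(&(_, text)) = DESCRIPTION_TABLE.iter().find(|(key, _)| *key == reference) {
        return Resolved::Fallback(text);
    }
    Resolved::Unknown
}
```

A triple that parses but is missing from the table falls through to the description lookup and then to Unknown, which is the point where a caller would raise unresolved_gaiji.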

Accent decomposition

Older Aozora works encode accented Latin letters using a separate notation that is not a ※[#…] reference:

M[i!]cher  →  Micher
M[a!]ria   →  Maria
[ae]on     →  Aeon

The full mapping (114 entries — every digraph and ligature in the spec) is at accent_separation.html in the spec snapshot. aozora applies this decomposition during Phase 0 sanitize, before the trigger scan, so by Phase 1 the source is pure Unicode with no ASCII-encoded accents.

The lookup is also Eytzinger-laid (see Eytzinger sorted-set lookup) since 114 entries is well inside its favourable regime.
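For readers unfamiliar with the layout, here is a minimal sketch of an Eytzinger (breadth-first) sorted set. It is an illustration of the general technique under simplifying assumptions, not the aozora-veb implementation: keys are plain u32 values, the array is 1-indexed with slot 0 unused, and the function names are hypothetical.

```rust
/// Lay a sorted slice out in Eytzinger (BFS) order; slot 0 is unused padding.
fn eytzinger(sorted: &[u32]) -> Vec<u32> {
    let mut out = vec![0; sorted.len() + 1];
    let mut next = 0;
    fill(sorted, &mut out, 1, &mut next);
    out
}

/// In-order walk of the implicit tree: the children of node k are 2k and 2k + 1.
fn fill(sorted: &[u32], out: &mut [u32], k: usize, next: &mut usize) {
    if k < out.len() {
        fill(sorted, out, 2 * k, next);
        out[k] = sorted[*next];
        *next += 1;
        fill(sorted, out, 2 * k + 1, next);
    }
}

/// Membership test: descend from the root, going right when the node is smaller.
fn contains(tree: &[u32], x: u32) -> bool {
    let mut k = 1;
    while k < tree.len() {
        if tree[k] == x {
            return true;
        }
        k = 2 * k + usize::from(tree[k] < x);
    }
    false
}
```

The search touches indices 1, then 2 or 3, then 4 to 7, and so on, so the first few levels share cache lines no matter which key is probed. That is why the layout tends to suit small, hot tables.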

Why a single crate for all of this?

encoding, gaiji, and accent are three distinct concerns, but:

  • They all need to be applied once, in order, at the boundary between the source bytes and the parser proper.
  • Splitting them would force three separate crate surfaces and three separate trigger points in the lexer.
  • Their data tables are all built from upstream Aozora Bunko spec pages, so a single update workflow (refresh docs/specs/aozora/, re-extract tables) hits all three at once.

Co-locating them in one crate keeps the boundary tight and the update surface predictable.

HTML renderer & canonical serialiser

aozora-render ships two walkers over AozoraTree<'_>:

  • html::render_to_string — emits semantic HTML5 with aozora-* class hooks.
  • serialize::serialize — emits canonical 青空文庫 source.

Both are pure functions. Both walk the tree once, in source order, allocating exactly the output buffer (a String pre-sized to the arena footprint).

HTML renderer

Class-name scheme

aozora emits stable class names that downstream stylesheets can hook:

| AST node | HTML | Class hook |
| --- | --- | --- |
| Ruby | <ruby>X<rt>Y</rt></ruby> | (no class; semantic ruby element) |
| Bouten { kind: Sesame } | <em class="aozora-bouten-sesame">…</em> | aozora-bouten-<slug> |
| Tcy | <span class="aozora-tcy">…</span> | aozora-tcy |
| Gaiji { resolution: Direct } | <span data-aozora-gaiji-jis="1-94-37">字</span> | data-aozora-gaiji-* |
| Gaiji { resolution: Fallback } | <span class="aozora-gaiji-fallback" title="…">[…]</span> | aozora-gaiji-fallback |
| Container { kind: Indent { n: 2 } } | <div class="aozora-indent-2">…</div> | aozora-indent-<n> |
| Container { kind: AlignEnd } | <div class="aozora-align-end">…</div> | aozora-align-end |
| Break::Page | <div class="aozora-page-break"/> | aozora-page-break |
| Kaeriten { mark: Re } | <span class="aozora-kaeriten" data-aozora-kaeriten="レ">レ</span> | aozora-kaeriten |

The aozora- prefix is reserved for our class names — a downstream stylesheet can target every aozora-emitted hook with [class^="aozora-"] without conflicting with the consumer’s own classes.

Why a class-hook output instead of inline styles?

Inline styles would force a single typographic decision for every consumer — print stylesheet, screen stylesheet, e-book renderer, and LSP/preview pane all want different presentation. The class-hook output:

  • Lets each consumer ship its own stylesheet for its medium.
  • Survives content-security-policy regimes that block style attrs.
  • Stays diff-able (the rendered HTML is stable across runs; presentation churn doesn’t ripple into snapshot tests).

HTML escaping

The renderer escapes <, >, &, ", ' in user text exactly once, at emission. Pre-escaped or doubly-escaped output is a correctness bug, not a perf decision: every CI run validates that html_unescape ∘ render_to_string is the identity on plain runs (render first, then unescape, and the source text comes back).
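A minimal sketch of escape-once-at-emission, together with the round-trip check described above. The function names and the choice of &#39; for the apostrophe are assumptions made for illustration; the real renderer may use different entities.

```rust
/// Append `text` to `out`, escaping the five HTML-significant characters.
/// Escaping happens here and nowhere else, so no text is escaped twice.
fn escape_into(text: &str, out: &mut String) {
    for ch in text.chars() {
        match ch {
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '&' => out.push_str("&amp;"),
            '"' => out.push_str("&quot;"),
            '\'' => out.push_str("&#39;"), // assumed entity for the apostrophe
            other => out.push(other),
        }
    }
}

/// Inverse used only by the round-trip check. `&amp;` is decoded last so that
/// an escaped "&lt;" (that is, "&amp;lt;") decodes to "&lt;" and not to "<".
fn unescape(html: &str) -> String {
    html.replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&#39;", "'")
        .replace("&amp;", "&")
}
```

Writing into a caller-supplied buffer rather than returning a fresh String keeps the escape step from allocating on its own.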

Canonical serialiser

The serialiser is the inverse of the lexer’s surface form: walk the tree, emit the source notation that would re-parse identically. It exists for three reasons:

  1. Round-trip property. parse ∘ serialize ∘ parse must be stable on the second iteration. The corpus sweep verifies this on every Aozora Bunko work.
  2. aozora fmt. The CLI’s fmt subcommand canonicalises author input (CRLF → LF, accent decomposition, container directive spacing).
  3. Diff-quality output. When the parser drops a malformed construct, the serialiser re-emits the surrounding text without the offending fragment, so authors can see the exact change.

Why a separate walker, not “render with a different visitor”?

The HTML and canonical-serialise outputs differ on every node type:

  • HTML wraps Ruby { target, reading } in <ruby>X<rt>Y</rt></ruby>; serialise emits |X《Y》 (or auto-detect form).
  • HTML wraps Container { kind: Indent { n } } in <div class="aozora-indent-N">…</div>; serialise emits the bracketed directives [#ここからN字下げ]…[#ここで字下げ終わり].
  • HTML emits <span data-aozora-gaiji-jis="1-94-37">字</span> for a resolved gaiji; serialise emits the original ※[#…、第3水準1-94-37].

The transformations don’t share enough structure to fit a single “visitor with two methods per node” abstraction. Two purpose-built walkers stay clearer and slightly faster — the compiler can inline the per-node match, which a generic visitor with virtual dispatch prevents.
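To make the divergence concrete, here is a small sketch with a toy node type and two purpose-built walkers. The Node enum and both function names are hypothetical simplifications (three variants, no escaping, no line breaks around directives); only the emitted shapes follow the bullets above.

```rust
use std::fmt::Write;

/// Toy stand-in for the real AST: just enough variants to show the contrast.
enum Node<'a> {
    Plain(&'a str),
    Ruby { target: &'a str, reading: &'a str },
    Indent { n: u32, children: Vec<Node<'a>> },
}

/// HTML walker. Escaping is omitted here to keep the sketch short.
fn to_html(nodes: &[Node<'_>], out: &mut String) {
    for node in nodes {
        match node {
            Node::Plain(s) => out.push_str(s),
            Node::Ruby { target, reading } => {
                write!(out, "<ruby>{target}<rt>{reading}</rt></ruby>")
                    .expect("writing to a String cannot fail");
            }
            Node::Indent { n, children } => {
                write!(out, "<div class=\"aozora-indent-{n}\">")
                    .expect("writing to a String cannot fail");
                to_html(children, out);
                out.push_str("</div>");
            }
        }
    }
}

/// Canonical-source walker: same traversal, unrelated output per node.
fn to_source(nodes: &[Node<'_>], out: &mut String) {
    for node in nodes {
        match node {
            Node::Plain(s) => out.push_str(s),
            Node::Ruby { target, reading } => {
                write!(out, "|{target}《{reading}》").expect("writing to a String cannot fail");
            }
            Node::Indent { n, children } => {
                write!(out, "[#ここから{n}字下げ]").expect("writing to a String cannot fail");
                to_source(children, out);
                out.push_str("[#ここで字下げ終わり]");
            }
        }
    }
}
```

The two functions share the loop and the match skeleton but nothing inside the arms, which is the argument for keeping them as separate walkers.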

Walker shape

Both walkers follow the same shape:

pub fn render_to_string(tree: &AozoraTree<'_>) -> String {
    let mut buf = String::with_capacity(tree.estimated_html_size());
    walk(tree, &mut buf);
    buf
}

fn walk(tree: &AozoraTree<'_>, out: &mut String) {
    for node in tree.nodes() {
        match node {
            AozoraNode::Plain(s)     => out.push_str(&html_escape(s)), // push_str takes &str
            AozoraNode::Ruby(r)      => emit_ruby(r, out),
            AozoraNode::Bouten(b)    => emit_bouten(b, out),
            AozoraNode::Tcy(t)       => emit_tcy(t, out),
            AozoraNode::Gaiji(g)     => emit_gaiji(g, out),
            AozoraNode::Container(c) => emit_container(c, out),
            AozoraNode::BreakNode(b) => emit_break(b, out),
            // … exhaustive
        }
    }
}

Single linear pass; no allocation outside the output buffer; no recursion that the compiler can’t unroll (containers recurse, but the fan-out is small — typically 1–4 children per container).

estimated_html_size heuristic

The buffer pre-size avoids String reallocations during the walk. Empirical heuristic from the corpus sweep: 2.6 × source_byte_len is at the 95th percentile (some HTML wraps a 3-byte ruby kanji in 30 bytes of <ruby>X<rt>Y</rt></ruby> markup). Going under leaves ~1 reallocation per render in the worst case; going over wastes memory on every render. 2.6× is the measured optimum.
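A minimal sketch of how such a pre-size could be computed, assuming the estimate depends only on the source length. The free-function form and its signature are assumptions for illustration; the document only names the method tree.estimated_html_size().

```rust
/// Estimate the rendered HTML size as 2.6 x the source length.
/// 2.6 is written as 13 / 5 so the arithmetic stays in integers, and the
/// multiplication saturates instead of overflowing on absurdly large inputs.
fn estimated_html_size(source_byte_len: usize) -> usize {
    source_byte_len.saturating_mul(13) / 5
}

/// Allocate the output buffer once, up front.
fn presized_buffer(source: &str) -> String {
    String::with_capacity(estimated_html_size(source.len()))
}
```

String::with_capacity guarantees at least the requested capacity, so a render that stays under the estimate never reallocates.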

Concrete syntax tree (CST)

A rowan-backed lossless syntax tree lives under the cst Cargo feature on the aozora crate. The CST is a pure projection over the existing parse output — Phase 3 classification is unchanged, the AST stays the perf-critical path, and the CST adds zero overhead for consumers that don’t enable the feature.

Why a CST exists

The borrowed AST (AozoraNode<'src>) is great for renderers: classified spans, typed payload, no whitespace noise. It is the wrong shape for source-faithful tooling:

  • A formatter rewriting 日本《にほん》 to |日本《にほん》 needs the exact whitespace and trivia between tokens.
  • An LSP textDocument/foldingRange provider needs the open / close positions of every nestable region, including ones the renderer ignores.
  • A refactor that renames a kanji-range [#「青空」に傍点] to [#「あおぞら」に傍点] must preserve every bracket character the user wrote, not just the parsed target.

A CST whose leaves concatenate to the parser’s input gives those tools what they need without any custom plumbing.

Lossless invariant

The contract is sharp:

Concatenating every leaf token’s text yields the sanitized source bytes the parser actually saw.

“Sanitized” matters: Phase 0 normalises CRLF→LF, strips a leading BOM, isolates long decorative rule lines with a leading blank line, and rewrites 〔…〕 accent spans through accent decomposition. These transformations happen before classification, so source_nodes coordinates address sanitized bytes. The CST tracks that coordinate system; an editor that wants to map back to the user’s raw bytes runs the same Phase 0 transformation and inverts where needed.

The proptest in tests/property_lossless.rs runs the invariant across the full Aozora-shaped input distribution (aozora_fragment / pathological_aozora / unicode_adversarial from aozora-proptest). A regression here breaks every editor surface that walks the CST.
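The invariant itself is easy to state in code. The sketch below uses a toy tokenizer (every 《…》 pair is one construct, everything else is plain) and hypothetical Leaf / Kind types; it is not the rowan-backed tree, only an illustration of what "leaves concatenate to the input" means, including the unclosed case.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    Plain,
    ConstructText,
}

#[derive(Debug)]
struct Leaf<'a> {
    kind: Kind,
    text: &'a str,
}

/// Toy leaf splitter: a closed 《…》 pair becomes one ConstructText leaf;
/// an opener with no closer degrades to plain text.
fn leaves(src: &str) -> Vec<Leaf<'_>> {
    let mut out = Vec::new();
    let mut rest = src;
    while !rest.is_empty() {
        let span = rest.find('《').and_then(|open| {
            rest[open..]
                .find('》')
                .map(|close| (open, open + close + '》'.len_utf8()))
        });
        match span {
            Some((open, end)) => {
                if open > 0 {
                    out.push(Leaf { kind: Kind::Plain, text: &rest[..open] });
                }
                out.push(Leaf { kind: Kind::ConstructText, text: &rest[open..end] });
                rest = &rest[end..];
            }
            None => {
                out.push(Leaf { kind: Kind::Plain, text: rest });
                rest = "";
            }
        }
    }
    out
}

/// The lossless contract: concatenating every leaf reproduces the input.
fn is_lossless(src: &str, leaves: &[Leaf<'_>]) -> bool {
    leaves.iter().map(|leaf| leaf.text).collect::<String>() == src
}
```

A property test would feed generated inputs through leaves and assert is_lossless on each; the check does not care how coarse or fine the leaf kinds are.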

Architecture

The crate stays decoupled by design:

  • aozora-cst depends on aozora-pipeline + aozora-spec directly, not on the aozora meta crate. Going through aozora would create a cycle (the meta crate’s cst feature re-exports aozora-cst).
  • build_cst(sanitized_source, source_nodes) -> SyntaxNode takes the lower-level bits explicitly so consumers writing custom pipelines can reach in.
  • aozora::cst::from_tree(&tree) -> SyntaxNode is the ergonomic entry point; it runs Phase 0 sanitize internally and forwards.
  • The Phase 3 classifier sees no changes — adding / removing CST consumers cannot perturb AST perf.

SyntaxKind granularity

The CST is intentionally coarser than a token-stream re-construction:

| SyntaxKind | Role |
| --- | --- |
| Document | Tree root |
| Container | Paired-container region ([#ここから...]...[#ここで...終わり]) |
| Construct | Single classified Aozora construct |
| ContainerOpen / ContainerClose | Container boundary tokens |
| ConstructText | Source slice of a Construct |
| Plain | Plain text run between classifications |

Finer per-token granularity (individual punctuation, kana runs, …) can land later once a concrete consumer needs it. The lossless property holds at any granularity, so widening the leaf set is non-breaking for downstream tooling that walks preorder_with_tokens.

Why rowan, not Phase 3 integration

The bumpalo-arena AST stays the hot path; the CST sits on top as an editor-grade convenience layer rather than coupling lossless-tree concerns into the perf-critical classifier. rowan (over cstree) gives the lossless tree a maintained home — rust-analyzer’s tree infrastructure with 86 reverse deps — and the bumpalo / Arc dual-allocator overhead is the price for keeping the AST untouched.

Error recovery

aozora is non-fatal by design: the parser always returns an AozoraTree even when the input violates the spec. Every problem is reported as a structured Diagnostic whose code tooling can match on; nothing is ever raised as a panic from Document::parse.

This page documents what the parser actually does when each diagnostic fires — useful when implementing editor surfaces, lint fixers, or anything else that runs over imperfect documents.

Recovery model

Every diagnostic carries two orthogonal axes:

| Axis | Values | Meaning |
| --- | --- | --- |
| severity | Error / Warning / Note | Routing hint for downstream surfaces; does not affect parsing. |
| source | Source / Internal | Whether the issue is in the user’s input (Source) or in the library’s invariants (Internal). |

The parser keeps running regardless of severity. Error does not short-circuit; it only marks the surrounding output region as suspect so callers (CLI --strict, LSP) can decide policy. CI gates typically treat any Error as failure, but the AST is still safe to walk — the spans, classifications, and renderer all remain consistent.
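Since policy lives in the caller, a strict gate is a few lines over the diagnostic list. The sketch below is a hypothetical caller-side policy: the type names are local stand-ins (the "source" axis is called Origin here to avoid clashing with source text), and the rule "any Internal diagnostic is a library bug, any Error fails the check" is one reasonable choice, not the CLI's documented behaviour.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Severity {
    Error,
    Warning,
    Note,
}

/// Stand-in for the `source` axis: user input versus library invariant.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Origin {
    Source,
    Internal,
}

#[derive(Debug)]
struct Diagnostic {
    code: &'static str,
    severity: Severity,
    origin: Origin,
}

#[derive(Debug, PartialEq, Eq)]
enum Verdict {
    Pass,
    FailUserInput,
    FailLibraryBug,
}

/// Hypothetical strict policy applied after a parse that always completes.
fn strict_verdict(diagnostics: &[Diagnostic]) -> Verdict {
    if diagnostics.iter().any(|d| d.origin == Origin::Internal) {
        return Verdict::FailLibraryBug;
    }
    if diagnostics.iter().any(|d| d.severity == Severity::Error) {
        return Verdict::FailUserInput;
    }
    Verdict::Pass
}
```

Because the two axes are orthogonal, the gate can route "file a bug" and "fix your input" to different places without inspecting individual codes.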

Source-side codes

aozora::lex::source_contains_pua

Hello, …<U+E001>… world.

A user-supplied codepoint in the range U+E001..U+E004 collides with one of the lexer’s PUA sentinel reservations. The placeholder registry keys on these codepoints, so a bare collision means the classifier could no longer tell user-text occurrences from lexer-inserted markers.

Recovery: the colliding bytes are kept verbatim in the sanitised text — Phase 0 does not delete them. Downstream the character flows through as plain text (the registry has no entry for the position so it is treated as ordinary content). Editors that want to surface the collision visually can match on this code; ordinary HTML rendering is unaffected.

aozora::lex::unclosed_bracket

|青梅《おうめ

An open delimiter (for example the 《 in the sample above) reached end-of-input with no matching close on the pairing stack.

Recovery: no PairLink is emitted for the orphaned opener (Unclosed opens have no partner span and would only confuse editor highlights). Phase 3 then sees no Aozora construct covering the unclosed open and degrades the whole region to plain text — the bytes from the opener to EOF are preserved literally, just without ruby / annotation classification.

aozora::lex::unmatched_close

》orphaned

A close delimiter saw an empty pairing stack, or its PairKind mismatched the stack top.

Recovery: the stray close is not matched against any opener; no PairLink is emitted. The bytes flow through as plain text, preserving the user’s content; nothing on the stack pops. The diagnostic span points at the close itself so editors can surface it without corrupting the document tree.
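The two recoveries above fall out of an ordinary pairing stack. The sketch below is an assumption-laden miniature: it tracks only 《》 and 〔〕, uses byte offsets as spans, and defines its own Diag type; the real pairing phase handles more delimiter kinds and richer spans.

```rust
#[derive(Debug, PartialEq)]
enum Diag {
    UnclosedBracket { at: usize },
    UnmatchedClose { at: usize },
}

/// Pair delimiters without ever synthesising a missing token.
/// Returns (open, close) byte-offset links plus diagnostics.
fn pair(src: &str) -> (Vec<(usize, usize)>, Vec<Diag>) {
    let mut stack: Vec<(char, usize)> = Vec::new();
    let mut links = Vec::new();
    let mut diags = Vec::new();

    for (at, ch) in src.char_indices() {
        match ch {
            '《' | '〔' => stack.push((ch, at)),
            '》' | '〕' => {
                let want = if ch == '》' { '《' } else { '〔' };
                match stack.last().copied() {
                    Some((open, open_at)) if open == want => {
                        stack.pop();
                        links.push((open_at, at));
                    }
                    // Empty stack or kind mismatch: report it and pop nothing.
                    _ => diags.push(Diag::UnmatchedClose { at }),
                }
            }
            _ => {}
        }
    }

    // Whatever is still open at end-of-input gets a diagnostic and no link.
    diags.extend(stack.into_iter().map(|(_, at)| Diag::UnclosedBracket { at }));
    (links, diags)
}
```

A later classification pass that only trusts the returned links will treat every unlinked delimiter as plain text, which matches the degrade-to-plain behaviour described for both codes.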

Internal codes

Internal-source diagnostics indicate library bugs — production parses on well-formed input never emit these. They are kept publicly visible so tooling can distinguish “user input has a problem” from “the library has a problem”; the parse still completes best-effort to keep editors usable.

| Code | What broke |
| --- | --- |
| residual_annotation_marker | An [# digraph survived classification — a recogniser is missing for the contained keyword. |
| unregistered_sentinel | A PUA sentinel is in normalised text without a registry entry. |
| registry_out_of_order | The placeholder-registry vector is not strictly position-sorted. |
| registry_position_mismatch | A registry entry references a normalised position whose codepoint is not the expected sentinel kind. |

Recovery: the parser never acts on internal diagnostics — the problematic stretch flows through as plain text, the diagnostic records what was wrong, and Document::parse returns normally. Reproductions belong in aozora-spec test fixtures so the bug surface keeps shrinking over releases.

What recovery is not

The parser does not attempt fix-it suggestions. There is no “did you mean [#ここで字下げ終わり]?” guess; the diagnostic’s help text describes the symptom, not the cure. Higher-level tooling (LSPs, editor extensions) is the right place for fix-it proposals — they have user context the parser does not.

The parser also does not try to synthesise missing tokens. A truly unclosed bracket stays unclosed in the tree; we don’t insert a phantom to “balance” it. Synthesising tokens hides the diagnostic from any caller that walks the AST instead of the diagnostic list, and turns a fixable user error into a silent correction.

tree-sitter reference grammar

aozora ships a tree-sitter grammar at grammars/aozora.tree-sitter/grammar.js as a reference implementation alongside the canonical Rust parser. When the two disagree the Rust parser wins; this grammar exists to plug Aozora documents into the tree-sitter ecosystem (neovim, helix, web-tree-sitter / CodeMirror) and to serve as a teaching artefact.

Why a separate grammar at all

The Rust parser is a four-phase pipeline with a hand-rolled classifier; reading it tells you how the canonical implementation works but not what the spec accepts. A declarative grammar is the language community’s preferred form for “what the spec accepts.” Shipping one alongside the parser lets external tooling consume Aozora without binding to the Rust ABI.

What it does cover

The grammar handles bracket structure faithfully:

  • |base《reading》 and base《reading》 — explicit / implicit ruby
  • 《《content》》 — double-bracket bouten
  • ※[#...] — gaiji marker
  • [#...] — generic bracket annotation
  • 〔...〕 — tortoise-bracket / accent-decomposition span

Plain text — any byte that is not one of the bracket openers — flows through as a plain_text token, keeping the grammar lossless against the byte stream.

What it deliberately does not cover

Three classes of behaviour are intentionally out of reach:

  1. Stateful container pairing. [#ここから2字下げ] matches [#ここで字下げ終わり] across intervening content; a context-free grammar without a hand-written scanner.c cannot express this. Consumers rely on the body content of the bracket annotation to recognise the pairing themselves, or fall back to the Rust parser.
  2. Forward 「target」に傍点 resolution. The bouten directive walks back through preceding text to bind to a quoted run. The grammar accepts the directive faithfully; the lookup stays the consumer’s job.
  3. Ruby base disambiguation. When the glyph run preceding 《...》 could extend further, the Rust classifier uses a more nuanced rule. The grammar accepts the greedy base match uniformly.

A scanner.c extension could plug some of these gaps, but doing so contradicts the declarative-reference framing of the artefact and would put the canonical-parser-replacement question on the table prematurely.

Status

The grammar covers approximately 40 % of the canonical parser’s constructs as measured by an unweighted variant count. The gap to full coverage is documented; closing it would require a scanner.c extension, which trades the declarative-reference framing for a higher ceiling.

Crate map

aozora is a 21-crate workspace. The split exists for three reasons: narrow each crate’s compile surface (faster cargo check), pin dependency boundaries (cycles are forbidden by the layout), and let each binding (CLI, WASM, FFI, Python) compose only the layers it needs.

At a glance

flowchart TD
    subgraph foundation
      spec
    end
    subgraph types
      veb
      syntax
      encoding
      scan
    end
    subgraph parser
      pipeline
      render
    end
    subgraph editor
      cst
      query
    end
    subgraph integration
      pandoc
    end
    subgraph facade
      aozora_facade["aozora"]
    end
    subgraph bindings
      cli
      ffi
      wasm
      py
    end
    subgraph dev
      bench
      conformance
      corpus
      proptest
      trace
      xtask
    end

    spec --> veb
    spec --> syntax
    spec --> encoding
    spec --> scan
    veb --> pipeline
    syntax --> pipeline
    encoding --> pipeline
    scan --> pipeline
    pipeline --> render
    render --> aozora_facade
    aozora_facade --> cli
    aozora_facade --> ffi
    aozora_facade --> wasm
    aozora_facade --> py
    aozora_facade --> bench
    pipeline --> cst
    cst --> query
    syntax --> pandoc
    aozora_facade --> conformance
    corpus --> bench
    proptest --> pipeline
    trace --> xtask

Per-crate purpose

Foundation

| Crate | Role |
| --- | --- |
| aozora-spec | Single source of truth for shared types: Span, Diagnostic, TriggerKind, PairKind, PUA sentinel codepoints, SLUGS dispatch table. No internal dependencies; every other crate may depend on it. |

Types & primitives

| Crate | Role |
| --- | --- |
| aozora-veb | no_std Eytzinger-layout sorted-set lookup. Cache-friendly binary search for sub-256-entry registries. |
| aozora-syntax | AST node types: AozoraNode<'src>, Container<'src>, Bouten<'src>, Ruby<'src>, …. Borrows from the bumpalo arena. |
| aozora-encoding | Shift_JIS decoding, JIS X 0213 patch, 外字 PHF resolver, accent decomposition. |
| aozora-scan | SIMD-friendly multi-pattern byte scanner (Phase 1’s trigger scan). One of three crates that locally relaxes unsafe_code, for aligned-load SIMD intrinsics. |

Parser

| Crate | Role |
| --- | --- |
| aozora-pipeline | Four-phase lexer (sanitize → events → pair → classify) plus the lex_into_arena orchestrator that fuses normalize + registry + diagnostics into a single output walk. |
| aozora-render | HTML and canonical-serialisation walkers. Single O(n) tree pass each; no allocation outside the output buffer. |

Editor-grade surface

| Crate | Role |
| --- | --- |
| aozora-cst | Lossless rowan-backed concrete syntax tree built as a pure projection over the AST. Powers formatters, LSP folding, source-faithful refactors. |
| aozora-query | Tree-sitter-flavoured pattern DSL over aozora-cst’s SyntaxNode. Selects nodes by SyntaxKind + capture name. |

Integration

| Crate | Role |
| --- | --- |
| aozora-pandoc | Pandoc AST projection: turns an AozoraTree into pandoc_ast::Pandoc, unlocking 50+ output formats via Pandoc’s writer matrix. |

Facade

| Crate | Role |
| --- | --- |
| aozora | Public facade. Document::parse() -> AozoraTree<'_>, tree.to_html(), tree.serialize(), tree.diagnostics(). The single import for library consumers. |

Bindings

| Crate | Role |
| --- | --- |
| aozora-cli | The aozora binary (check / fmt / schema / kinds / explain / pandoc). |
| aozora-ffi | C ABI driver. Opaque handles, JSON-encoded structured data. Locally relaxes unsafe_code; every block carries a // SAFETY: comment. |
| aozora-wasm | wasm32-unknown-unknown target with wasm-bindgen exports. |
| aozora-py | PyO3 binding shipped via maturin. |

Development-only

| Crate | Role |
| --- | --- |
| aozora-bench | Criterion + corpus-driven probes. Source of the PGO training data. |
| aozora-conformance | WPT-style fixture runner; pins golden HTML / serialise / diagnostics / wire output across 23 fixtures. |
| aozora-corpus | Corpus source abstraction (zstd-archived, blake3-pinned). Dev-only. |
| aozora-proptest | Shared proptest strategies (aozora_fragment, pathological_aozora, unicode_adversarial, xss_payload). Dev-only. |
| aozora-trace | DWARF symbolicator + samply gecko-trace loader. Dev-only. |
| aozora-xtask | Host-side dev tooling (samply wrapper, trace analysis, corpus pack/unpack, schema dumps). Not on the just build path. |

Why 21 crates?

Three concrete wins from the split.

1. Compile latency

A single-crate workspace with the same code would force a full re-compile on any internal change. With the workspace split, a change in the renderer doesn’t touch the lexer, scanner, or any of the bindings — incremental compile times stay sub-second on iteration.

2. No-std reach

aozora-veb and aozora-spec are no_std-clean. aozora-scan is no_std-clean by default; the SIMD backends opt in to the std feature for runtime CPU detection. That matters for the wasm32 build (where std is a real cost) and would matter for embedded targets if anyone ever needed one. Keeping them in dedicated crates enforces the no_std discipline at the crate-graph level — adding a std import would require depending on a std-using crate, which is a visible Cargo.toml change.

3. Binding modularity

The C ABI driver (aozora-ffi) needs aozora + serde and nothing else. It does not pull in the bench harness, the trace loader, or the corpus crate. The wasm driver is similarly minimal. Each binding’s dependency closure is exactly what it needs — which is what keeps the wasm bundle inside its 500 KiB budget.

What we deliberately don’t split

A few things stay co-located despite plausible split points:

  • HTML render and canonical serialise in aozora-render. Both are tree walkers; sharing the visitor helper between them keeps the implementation small.
  • Phase 0 sanitize sub-passes in aozora-pipeline. Each sub-pass is < 100 LOC and operates on the same &str slice; pulling them out would create a 5-crate ecosystem for a transformation that’s conceptually one phase.
  • Trigger-byte enum and pair-kind enum in aozora-spec. They’re used by both aozora-scan (which produces them) and aozora-pipeline (which consumes them); putting them in spec avoids a back-reference.

Splits aren’t free — every additional crate adds a Cargo.toml, a README, doc-link reachability, and a test surface. Splits land when the cohesion benefit (one of the three above) is real.

Choosing a binding

aozora reaches a lot of languages, but there is only one parser behind them. Every surface — the Rust library, the CLI, the wasm package, the PyO3 module, the Go module, the C ABI, the Extism plugin — funnels the same source text through the same lexer and renders it through the same aozora::wire authority. The HTML, the canonical serialise, and the diagnostic stream are therefore byte-identical across every binding. What differs between them is only the host language you write in and the overhead you pay to cross the language boundary.

So the decision is not “which binding is more correct” — they all produce the same bytes. It is “which one fits the language and runtime I already have, at the cost I’m willing to pay.”

Decision table

Find the row that describes you; the rest of the page explains the trade-offs behind it.

| You are… | Use | Why | Distribution |
| --- | --- | --- | --- |
| Writing Rust | umbrella aozora library | Zero-copy borrowed AST, full type safety, the fastest path. No serialise. | crates.io¹ |
| At a shell / in CI / scripting | the aozora binary | check / render / fmt / pandoc, reads stdin, exits with a code. | GitHub release |
| In the browser, Node, or TypeScript | aozora-wasm | wasm-bindgen Document class; runs client-side and at the edge. | npm |
| Writing Python | aozora-py (PyO3) | In-process native module via maturin; idiomatic Python API. | build-from-source² |
| Writing Go | aozora-go | Pure-Go wazero host: no cgo, no C toolchain. | go get |
| Embedding from C / C++ / another native FFI | aozora-ffi C ABI | Opaque handle + JSON over a stable C header; link it like any library. | GitHub release |
| Writing Java, PHP, Ruby, or the long tail | aozora-extism host SDK | One portable aozora.wasm loaded by any Extism SDK. | GitHub release |
| Producing anything other than HTML (EPUB, LaTeX/PDF, DOCX, …) | aozora pandoc | Projects to the Pandoc AST; 50+ output formats via Pandoc writers. | GitHub release (CLI) |

¹ crates.io publication tracks the v1.0 API freeze; until then the git-tag form in the install chapter is the canonical entry point.

² PyPI wheels are pending; pre-1.0 the Python binding builds from source via maturin.

In-process vs host-runtime

The bindings fall into two camps, and the split is the single most useful lens for choosing.

In-process / native — the Rust library, aozora-py, aozora-wasm, and the C ABI. The parser runs inside your process’s address space. Overhead is zero (Rust) to low (a string copy and a JSON projection at the boundary for the others). The cost is on our side: each of these is a native artifact that has to be built and published for every (OS × arch) pair the ecosystem expects.

Host-runtime: aozora-extism. The parser is a single portable aozora.wasm, and your language loads it through its Extism host SDK. You pay a JSON round-trip at the wasm boundary (text in, a versioned JSON envelope out), and your host gains one runtime dependency (the Extism runtime). In exchange, we do not maintain a native build matrix for your language: the same wasm bytes load identically on every platform. This is the deliberate breadth strategy for the long tail of languages where writing and shipping a bespoke native binding is the real cost. See ADR-0006 for the full reasoning.

Because every binding already presents an interface of “string → JSON envelope”, the serialization round-trip Extism adds is intrinsic to the shape of the problem, not a new tax. Native bindings simply skip it by sharing the address space.

By language

A quick jump list:

  • Rust → the aozora umbrella library.
  • JavaScript / TypeScript (browser, Node, Deno, edge) → aozora-wasm.
  • Python → aozora-py.
  • Go → aozora-go.
  • C / C++ / Zig / any FFI-capable native language → the aozora-ffi C ABI.
  • Java, PHP, Ruby, .NET, Elixir, Haskell, … → the aozora-extism host SDK.
  • None of the above / shell / CI → the aozora CLI binary.

By output format

aozora’s renderer emits semantic HTML5. The decision here is binary:

  • You want HTML. Use the built-in renderer — to_html() in any library binding, or aozora render at the CLI. It is the canonical output and what the conformance suite gates on.
  • You want anything else — EPUB, LaTeX/PDF, DOCX, ODT, MediaWiki, and ~50 more — use aozora pandoc. It projects the parsed tree into the Pandoc AST, where every Pandoc writer is one pipe away. Adding a new format means adding a Pandoc filter, never extending the parser.

A note on performance

If raw throughput is the deciding factor, the ordering is:

  1. Rust, borrowed-arena. The library hands you AozoraNodes that borrow directly from the bumpalo arena — no copies, no serialise, no JSON. Nothing is faster.
  2. In-process native bindings (aozora-py, aozora-wasm, C ABI). One string copy in, one JSON projection out, but all in-process. Low, constant overhead.
  3. Extism. A wasm-boundary JSON round-trip on top of the in-process cost. The slowest of the three transports — and still the right choice when the alternative is no binding for your language at all.

For the overwhelming majority of documents this difference is invisible against I/O. Reach for the Rust library’s borrowed AST only when you are parsing at scale (the corpus sweep over ~17 000 works is the motivating case); otherwise pick the binding that fits your language and let the constant overhead disappear into the noise.

Rust library

The first-class binding. Full type safety, zero copy, and the borrowed-arena AST exposed directly.

Adding to a project

The recommended Cargo.toml snippet (with the current release tag) lives in the install chapter. Keeping the pin in one place avoids drift between this doc and the install page when a new release lands.

crates.io publication tracks the v1.0 API freeze; until then, the git tag form documented there is the canonical entry point.

Surface

The public surface is small by design. Three types and the seven methods listed here cover everything:

pub struct Document { /* opaque */ }
impl Document {
    pub fn new(source: String) -> Self;
    pub fn parse(&self) -> AozoraTree<'_>;
    pub fn source(&self) -> &str;
}

pub struct AozoraTree<'a> { /* borrows from Document */ }
impl<'a> AozoraTree<'a> {
    pub fn nodes(&self) -> impl Iterator<Item = AozoraNode<'a>>;
    pub fn to_html(&self) -> String;
    pub fn serialize(&self) -> String;
    pub fn diagnostics(&self) -> &[Diagnostic];
}

pub enum AozoraNode<'src> { Plain(&'src str), Ruby(Ruby<'src>), … }

See Library Quickstart for the walk-through.

Feature flags

aozora exposes one optional feature:

FeatureDefaultWhat it enables
serdeoffserde::Serialize / Deserialize impls on AozoraNode, Diagnostic, Span. Useful for downstream tools that need to ship the AST over a wire.

The default-off policy keeps cargo build aozora slim — the JSON encoders that the bindings need live in the bindings themselves (aozora-ffi, aozora-wasm, aozora-py), not in the core crate.

Error handling

Three philosophies, used consistently:

  1. Diagnostics are not errors. Document::parse() always returns an AozoraTree<'_>. Per-input diagnostics live in tree.diagnostics(). Callers decide whether to treat any diagnostic as fatal.
  2. Decoding is fallible. aozora_encoding::sjis::decode_to_string returns Result<Cow<str>, DecodeError>. Malformed Shift_JIS is the one place a function actually fails — the parser proper assumes UTF-8.
  3. Panics are bugs. No .unwrap() on user-data paths in non-test code; clippy’s unwrap_used and expect_used are warned workspace-wide. If you ever see a panic in aozora::*, file a bug.

Thread safety

Document is Send but not Sync — the bumpalo arena does not support concurrent allocation. Pass a Document between threads freely; do not share &Document across threads.

AozoraTree<'_> borrows from &Document, so by Rust’s lifetime rules the same shape applies: a &AozoraTree is Send + Sync (it’s just & to immutable data), but it can’t outlive its Document.

For parallel corpus processing (e.g. the corpus sweep harness parsing 1000s of documents concurrently), each thread creates its own Document from its own source. The arena resets per-Document, so there’s no contention point.
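The one-document-per-thread pattern can be sketched with scoped threads from the standard library. ToyDocument below is a hypothetical stand-in with no arena and a trivial "parse" (counting 《 openers), used only so the example compiles on its own; the worker function name and the chunking strategy are likewise assumptions.

```rust
use std::thread;

/// Hypothetical stand-in for a real Document: it owns its source and
/// nothing is shared with any other instance.
struct ToyDocument {
    source: String,
}

impl ToyDocument {
    fn new(source: String) -> Self {
        Self { source }
    }

    /// Stand-in for parsing: count ruby openers in the source.
    fn ruby_count(&self) -> usize {
        self.source.matches('《').count()
    }
}

/// Split the corpus into batches; each worker builds its own documents.
fn count_ruby_parallel(sources: &[&str], workers: usize) -> usize {
    let workers = workers.max(1);
    // chunks() panics on a zero size, so clamp for the empty-corpus case.
    let chunk = sources.len().div_ceil(workers).max(1);

    thread::scope(|scope| {
        let handles: Vec<_> = sources
            .chunks(chunk)
            .map(|batch| {
                scope.spawn(move || {
                    batch
                        .iter()
                        .map(|s| ToyDocument::new((*s).to_owned()).ruby_count())
                        .sum::<usize>()
                })
            })
            .collect();

        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .sum::<usize>()
    })
}
```

Each document is created, used, and dropped on a single thread, so the pattern never needs a shared &Document and works with a type that is Send but not Sync.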

MSRV policy

aozora pins Rust 1.95.0. The MSRV advances roughly once per quarter, when a new stable feature is needed and the workspace moves to it. The msrv job in CI gates every PR; Dependabot is configured to not auto-bump the MSRV pin (manual decision).

Public API stability

Pre-1.0: minor-version bumps may break the API. cargo-semver-checks runs in CI to catch unintentional breakage between releases, so vX.Y.* patch bumps are always safe; only a minor bump (vX.Y.* → vX.(Y+1).*) opens the door for breaks. The current pin to track lives in the install chapter.

Post-1.0 (planned): semver discipline. Breaking changes accumulate on a next branch and ship in a major bump.

WASM (wasm-pack)

The aozora-wasm crate compiles to wasm32-unknown-unknown and exposes a Document class via wasm-bindgen. The wasm artifact has a hard 500 KiB size budget after wasm-opt -O3 — measured on every release.

Build

rustup target add wasm32-unknown-unknown        # one-time
wasm-pack build --target web --release crates/aozora-wasm

Outputs land at crates/aozora-wasm/pkg/:

  • aozora_wasm_bg.wasm — the binary module
  • aozora_wasm.js — the wasm-bindgen JS shim
  • aozora_wasm.d.ts — TypeScript types
  • package.json — minimal npm-publishable metadata

Why wasm-opt = false in Cargo.toml?

wasm-pack ships its own bundled wasm-opt (via the binaryen crate) which lags upstream. Recent Rust releases emit bulk-memory opcodes (memory.copy, memory.fill) that the bundled wasm-opt mishandles on -O3, occasionally producing artifacts that crash on init. We disable the bundled run and recommend a fresh wasm-opt invocation externally:

wasm-opt -O3 \
    --enable-bulk-memory \
    --enable-mutable-globals \
    crates/aozora-wasm/pkg/aozora_wasm_bg.wasm \
    -o crates/aozora-wasm/pkg/aozora_wasm_bg.wasm

The post-wasm-opt artifact has a 500 KiB size budget. CI gates on this number — exceeding it is a release-blocking regression.
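
A budget gate of this kind is small enough to sketch in full. The following is an illustration only, not the project's CI script: within_budget and check_artifact are hypothetical names, and treating exactly 500 KiB as passing is an assumption.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Budget from the text: 500 KiB after wasm-opt.
const BUDGET_BYTES: u64 = 500 * 1024;

/// Pure check, kept separate from I/O so it is easy to test.
/// Assumption: an artifact of exactly 500 KiB still passes.
fn within_budget(size_bytes: u64) -> bool {
    size_bytes <= BUDGET_BYTES
}

/// Reads the artifact size from disk and reports whether it fits.
fn check_artifact(path: &Path) -> io::Result<bool> {
    Ok(within_budget(fs::metadata(path)?.len()))
}

fn main() {
    println!("499 KiB fits: {}", within_budget(499 * 1024));
    println!("501 KiB fits: {}", within_budget(501 * 1024));

    // Path taken from the build output listed above.
    let path = Path::new("crates/aozora-wasm/pkg/aozora_wasm_bg.wasm");
    match check_artifact(path) {
        Ok(fits) => println!("{} fits: {fits}", path.display()),
        Err(e) => eprintln!("cannot stat {}: {e}", path.display()),
    }
}
```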

Usage

import init, { Document } from "./pkg/aozora_wasm.js";

await init();                                  // load the .wasm

const doc = new Document("|青梅《おうめ》");
const html = doc.to_html();
const canonical = doc.serialize();
const diagnostics = JSON.parse(doc.diagnostics_json());
console.log(html);
doc.free();                                    // release the bumpalo arena

In TypeScript, the .d.ts file gives you full type checking on every method.

API surface

| Method | Returns | Notes |
| --- | --- | --- |
| new Document(source: string) | Document | Copies the JS string into a Rust Box<str>. |
| to_html() | string | Renders to semantic HTML5 with aozora-* class hooks. |
| serialize() | string | Re-emits canonical 青空文庫 source. |
| diagnostics_json() | string | JSON-encoded array of diagnostic objects. |
| source_byte_len() | number | Source byte length, useful for progress UI. |
| free() | void | Explicit drop; otherwise the JS GC eventually releases. |

The diagnostics JSON shape mirrors aozora-ffi’s C ABI:

interface Diagnostic {
    code:    string;            // "aozora::lex::unresolved_gaiji", …
    level:   "error" | "warning" | "info";
    message: string;
    span:    { start: number; end: number };
    help?:   string;
}

Why a hand-written JSON projection over serde-wasm-bindgen?

serde-wasm-bindgen would let us pass the Diagnostic directly to JS as a structured object — no JSON round-trip needed. We don’t use it because:

  • It pulls in a meaningful chunk of serde_json machinery that bloats the wasm bundle by ~80 KiB.
  • The wire format ({ code: "aozora::lex::unresolved_gaiji", level: "warning", … }) is exactly what every JS consumer is going to deserialise into anyway.
  • It would force a serde::Serialize derivation on every diagnostic-related type in aozora-spec, which the Rust library consumers don’t otherwise need (they take &[Diagnostic] directly).

A small, hand-written JSON emitter (one core::fmt::Write impl, ~60 LOC) costs nothing and keeps the bundle small.
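
To give a feel for the size of such an emitter, here is a minimal sketch built on std::fmt::Write. It is not the aozora-wasm source: the Diagnostic struct below is a hypothetical mirror of the TypeScript interface above, and the function names are invented for the example.

```rust
use std::fmt::{self, Write};

/// Hypothetical mirror of the TypeScript `Diagnostic` interface above.
struct Diagnostic<'a> {
    code: &'a str,
    level: &'a str,
    message: &'a str,
    start: u32,
    end: u32,
    help: Option<&'a str>,
}

/// Writes `s` as a JSON string literal, escaping quotes, backslashes,
/// and control characters.
fn write_json_str<W: Write>(out: &mut W, s: &str) -> fmt::Result {
    out.write_char('"')?;
    for c in s.chars() {
        match c {
            '"' => out.write_str("\\\"")?,
            '\\' => out.write_str("\\\\")?,
            '\n' => out.write_str("\\n")?,
            '\r' => out.write_str("\\r")?,
            '\t' => out.write_str("\\t")?,
            c if (c as u32) < 0x20 => write!(out, "\\u{:04x}", c as u32)?,
            c => out.write_char(c)?,
        }
    }
    out.write_char('"')
}

/// Writes one diagnostic object; `help` is omitted when absent.
fn write_diagnostic<W: Write>(out: &mut W, d: &Diagnostic<'_>) -> fmt::Result {
    out.write_str("{\"code\":")?;
    write_json_str(out, d.code)?;
    out.write_str(",\"level\":")?;
    write_json_str(out, d.level)?;
    out.write_str(",\"message\":")?;
    write_json_str(out, d.message)?;
    write!(out, ",\"span\":{{\"start\":{},\"end\":{}}}", d.start, d.end)?;
    if let Some(help) = d.help {
        out.write_str(",\"help\":")?;
        write_json_str(out, help)?;
    }
    out.write_char('}')
}

/// Serialises a slice of diagnostics as a JSON array.
fn diagnostics_json(diags: &[Diagnostic<'_>]) -> String {
    let mut out = String::from("[");
    for (i, d) in diags.iter().enumerate() {
        if i > 0 {
            out.push(',');
        }
        // Writing into a String cannot fail.
        write_diagnostic(&mut out, d).expect("String writes are infallible");
    }
    out.push(']');
    out
}

fn main() {
    let d = Diagnostic {
        code: "aozora::lex::unresolved_gaiji",
        level: "warning",
        message: "could not resolve gaiji",
        start: 0,
        end: 3,
        help: None,
    };
    println!("{}", diagnostics_json(&[d]));
}
```

Because the emitter only needs fmt::Write, the same functions can write into a String, a fixed buffer, or any other sink without pulling in a serialisation framework.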

Why Document.free() and not just GC?

wasm-bindgen does wire Drop to a JS finalizer, but JS finalizers fire on the GC’s schedule — which can be minutes after the last reference goes out of scope, especially on Node.js where the GC batches aggressively. For large documents this means the bumpalo arena (potentially several MB) sits unreleased.

Explicit .free() is the same idiom every wasm-bindgen library exposes for resource-heavy types. Consumers that want JS-native ergonomics wrap the class in their own using (TC39 stage-3 explicit resource management) helper.

Browser support

Tier-1 (CI-tested):

  • Chrome 110+
  • Firefox 110+
  • Safari 16+

Tier-2 (works, not in CI):

  • Node.js 18+ (use --target nodejs in wasm-pack build)
  • Deno 1.30+

The bundle uses bulk-memory and mutable-globals; both have been universally supported since 2021.

Why wasm at all?

The CLI and the Rust library cover Linux / macOS / Windows native; the wasm build covers everywhere else — particularly:

  • Browser-side preview / formatter for a 青空文庫 LSP front-end.
  • Cloudflare Workers / Vercel Edge / Deno Deploy serverless rendering.
  • Notebook environments (Jupyter via pyodide, Observable, Quarto).

The same parser, same diagnostics, same canonical-serialise — across every wasm-runtime host.

Python (PyO3 / maturin)

The aozora-py crate is a PyO3 binding shipped via maturin.

Install

pip install maturin                         # one-time

cd crates/aozora-py
maturin develop -F extension-module         # install in current venv
# or
maturin build -F extension-module --release # produce a redistributable wheel

The extension-module feature gates the PyO3 import-side machinery behind a flag, so a plain cargo build --workspace succeeds without Python development headers installed. CI has both modes covered.

Minimal Python usage

from aozora_py import Document

doc = Document("|青梅《おうめ》")
print(doc.to_html())          # <ruby>青梅<rt>おうめ</rt></ruby>
print(doc.serialize())        # |青梅《おうめ》
print(doc.diagnostics())      # JSON-encoded list of diagnostic dicts

API surface

| Method | Returns | Notes |
| --- | --- | --- |
| Document(source: str) | Document | The constructor copies source into a Rust Box<str>. |
| to_html() -> str | str | Renders to semantic HTML5 with aozora-* class hooks. |
| serialize() -> str | str | Re-emits canonical 青空文庫 source. |
| diagnostics() -> str | str | JSON-encoded list (same schema as the WASM and FFI bindings). |
| source_byte_len() -> int | int | Source byte length. |

The diagnostics JSON shape is shared across every binding — see Bindings → WASM for the schema.

Thread safety: unsendable

The Document type is marked unsendable (PyO3 marker) because the underlying bumpalo arena uses interior Cell state. Concurrent access from another Python thread raises a RuntimeError:

import threading
from aozora_py import Document

doc = Document(open("src.txt").read())
def worker(): doc.to_html()              # raises RuntimeError on second thread
threading.Thread(target=worker).start()  # boom

For parallel corpus processing, create a Document per thread. The arena resets per-Document, so there’s no contention point; each thread allocates from its own arena.

Why not Send?

PyO3 requires a #[pyclass] type to be thread-safe (Send, plus Sync in recent PyO3 releases) unless the class is explicitly marked unsendable. We opt out with unsendable rather than making Document thread-safe because:

  1. Arena correctness. bumpalo::Bump is !Sync: the per-page allocator state isn’t atomic. Exposing one Document to several Python threads would require a mutex around every allocation, which is the cost we designed the arena to avoid in the first place.
  2. GIL semantics. Python threads share the GIL; “concurrent” in the Python sense is rarely actually parallel. The unsendable marker turns the misuse case into a loud RuntimeError instead of a silent data race.
  3. Multiprocessing path. The right answer for parallel corpus work is multiprocessing (one Document per process — the arenas are independent by construction). The unsendable marker nudges users toward this.
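
The idea behind a thread-affinity marker can be sketched in plain Rust: remember the creating thread and refuse access from any other. This illustrates the concept only; it is not PyO3's implementation, and ThreadBound and WrongThread are hypothetical names.

```rust
use std::thread::{self, ThreadId};

/// Wraps a value and remembers which thread created it.
struct ThreadBound<T> {
    owner: ThreadId,
    value: T,
}

/// Returned when the value is touched from a thread other than its creator.
#[derive(Debug, PartialEq, Eq)]
struct WrongThread;

impl<T> ThreadBound<T> {
    fn new(value: T) -> Self {
        Self { owner: thread::current().id(), value }
    }

    /// Hands out the value only on the creating thread; every other thread
    /// gets a loud error instead of unsynchronised access.
    fn get(&self) -> Result<&T, WrongThread> {
        if thread::current().id() == self.owner {
            Ok(&self.value)
        } else {
            Err(WrongThread)
        }
    }
}

fn main() {
    // A String stands in for the document here; it is Sync, which lets the
    // sketch share a reference across threads purely to show the check firing.
    let doc = ThreadBound::new(String::from("|青梅《おうめ》"));
    assert!(doc.get().is_ok());

    let rejected = thread::scope(|s| s.spawn(|| doc.get().is_err()).join().unwrap());
    assert!(rejected);
    println!("second thread rejected: {rejected}");
}
```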

Why JSON-encoded diagnostics?

Same reason as the WASM binding:

  • The wire shape is stable across every binding.
  • Avoids forcing a pyclass declaration on every diagnostic-related type.
  • Downstream Python consumers json.loads() once and work with native dicts — no second translation.

The diagnostics() method returns a str, not a list[dict], so the json.loads is visible to the caller. Hiding it behind a PyO3 Vec<PyDict> mapping would silently allocate one Python object per diagnostic per call.

Wheel distribution

aozora_py is on PyPI (since v0.4.1):

pip install aozora_py

To build a wheel from a checkout instead:

maturin build -F extension-module --release  # → target/wheels/*.whl
pip install target/wheels/aozora_py-*.whl

Release wheels are built in CI with maturin for every supported (python, target) combination — the mainstream path for PyO3 projects.

Go (wazero host SDK)

The Go binding is a host SDK over the portable aozora.wasm Extism plugin, run through the pure-Go wazero runtime. There is no cgo and no native libextism to link — go get is the whole install:

go get github.com/P4suta/aozora-go

It is one spoke of aozora’s polyglot binding strategy: rather than a hand-written native binding per language, every non-Rust front door funnels through the same aozora.wasm bytes and the same aozora::wire authority. See Choosing a binding for when to reach for Go versus the native C ABI or the in-process Rust library, and Extism plugin for the wasm artifact this SDK loads. The rationale for the whole approach is recorded in ADR-0006 (linked from the crate README).

Install & quickstart

package main

import (
	"context"
	"fmt"

	aozora "github.com/P4suta/aozora-go"
)

func main() {
	ctx := context.Background()
	p, err := aozora.Open(ctx)
	if err != nil {
		panic(err)
	}
	defer p.Close(ctx)

	html, err := p.ToHTML("|青梅《おうめ》")
	if err != nil {
		panic(err)
	}
	fmt.Println(html) // <ruby>青梅<rt>おうめ</rt></ruby>

	nodes, err := p.Nodes("|青梅《おうめ》")
	if err != nil {
		panic(err)
	}
	for _, n := range nodes.Data {
		fmt.Printf("%s @ [%d,%d)\n", n.Kind, n.Span.Start, n.Span.End)
	}
}

Open(ctx) instantiates the plugin once; reuse the returned Parser across calls and Close(ctx) it when done. Beyond ToHTML and Nodes, a Parser exposes Serialize, Diagnostics, Pairs, and ContainerPairs — each returning the matching wire envelope decoded into the generated Go types:

| Method | Returns | Notes |
| --- | --- | --- |
| ToHTML(src) | string | Semantic HTML5 with aozora-* class hooks. |
| Serialize(src) | wire envelope | Canonical 青空文庫 source round-trip. |
| Nodes(src) | wire envelope | Borrowed-AST nodes with Kind + Span. |
| Diagnostics(src) | wire envelope | Same diagnostic schema as every other binding (see WASM → API surface). |
| Pairs(src) | wire envelope | Matched ruby / bracket / quote pairs. |
| ContainerPairs(src) | wire envelope | Matched indent / align-end container pairs. |

Concurrency

A Parser is not safe for concurrent use — the underlying Extism instance carries per-call wasm linear-memory state. Open one Parser per goroutine (each Open is independent), or guard a shared one behind your own mutex. For parallel corpus processing the per-goroutine pattern is the intended one; instances do not contend.

How it works

Open(ctx) loads the embedded aozora.wasm plugin into a fresh wazero runtime. Every method serialises its argument, calls the corresponding plugin export, and decodes the JSON envelope into a Go type. Those wire types live in wire_gen.go and are generated by just types-langs (quicktype, fed from the wire JSON Schema) — they are not hand-maintained, so they cannot drift from the Rust aozora::wire definitions. Because the plugin bytes and the wire schema are shared, the Go output is byte-identical to the Rust, WASM, Python, and C-ABI front doors: same HTML, same canonical serialisation, same diagnostics.

Building / contributing

From the aozora workspace root, just smoke-go builds the plugin (just extism-build), embeds the resulting aozora.wasm into the module, and runs gofmt + go vet + go test. The aozora.wasm artifact is git-ignored locally and dropped in by that target or by the release workflow; wire_gen.go is regenerated by just types-langs and must not be edited by hand.

C ABI

The aozora-ffi crate compiles to a cdylib + staticlib. The API is opaque-handle + JSON-encoded structured data — the C side never sees a Rust type, just opaque pointers and byte buffers.

Build

cargo build --release -p aozora-ffi
# → target/release/libaozora_ffi.{so,dylib,a}
# → target/release/aozora.h          (cbindgen-generated)

The build script regenerates aozora.h automatically. After build, the header lands at:

  • target/release/aozora.h — host-side convenience copy
  • $OUT_DIR/aozora.h — cargo build-script standard location

#include "aozora.h" and link with -laozora_ffi.

Smoke test

just smoke-ffi

Builds the cdylib, compiles crates/aozora-ffi/tests/c_smoke/smoke.c against it, runs it end-to-end. CI runs this on every PR — if the ABI shape changes accidentally, the smoke test fails before the PR merges.

Minimal C usage

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include "aozora.h"

int main(void) {
    const char *src = "|青梅《おうめ》";
    AozoraDocument *doc = NULL;
    if (aozora_document_new((const uint8_t *)src, strlen(src), &doc) != 0)
        return 1;

    AozoraBytes html = {0};
    if (aozora_document_to_html(doc, &html) != 0) {
        aozora_document_free(doc);
        return 1;
    }
    fwrite(html.ptr, 1, html.len, stdout);

    aozora_bytes_free(&html);
    aozora_document_free(doc);
    return 0;
}

API surface

typedef struct AozoraDocument AozoraDocument;
typedef struct {
    uint8_t *ptr;
    size_t   len;
    size_t   cap;
} AozoraBytes;

extern int32_t aozora_document_new(const uint8_t *src, size_t src_len,
                                   AozoraDocument **out_doc);
extern int32_t aozora_document_to_html(const AozoraDocument *doc,
                                       AozoraBytes *out_html);
extern int32_t aozora_document_serialize(const AozoraDocument *doc,
                                         AozoraBytes *out_canonical);
extern int32_t aozora_document_diagnostics_json(const AozoraDocument *doc,
                                                AozoraBytes *out_json);
extern void    aozora_bytes_free(AozoraBytes *bytes);
extern void    aozora_document_free(AozoraDocument *doc);

Status codes

| Code | Meaning |
| --- | --- |
| 0 | Ok |
| -1 | Null input pointer |
| -2 | Input was not valid UTF-8 |
| -3 | Allocation failed |
| -4 | Internal serialisation error |
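
A wrapper in any host language will usually lift these integers into a typed error once, at the boundary. A minimal sketch of that mapping follows; AozoraError and check_status are hypothetical names for the illustration, and the Unknown arm is an assumption that keeps the wrapper forward-compatible with codes added later.

```rust
/// Hypothetical typed view of the status codes in the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AozoraError {
    NullPointer,
    InvalidUtf8,
    AllocationFailed,
    Serialisation,
    /// Assumption: any code this wrapper does not know yet is preserved
    /// rather than silently treated as success.
    Unknown(i32),
}

/// Converts a raw status code into a `Result`.
fn check_status(code: i32) -> Result<(), AozoraError> {
    match code {
        0 => Ok(()),
        -1 => Err(AozoraError::NullPointer),
        -2 => Err(AozoraError::InvalidUtf8),
        -3 => Err(AozoraError::AllocationFailed),
        -4 => Err(AozoraError::Serialisation),
        other => Err(AozoraError::Unknown(other)),
    }
}

fn main() {
    for code in [0, -2, -99] {
        println!("{code} -> {:?}", check_status(code));
    }
}
```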

Memory ownership

Every pointer or AozoraBytes returned by an aozora_* function must be released by the matching _free call:

| Returned by | Free with |
| --- | --- |
| aozora_document_new (AozoraDocument *) | aozora_document_free |
| aozora_document_to_html (AozoraBytes) | aozora_bytes_free |
| aozora_document_serialize (AozoraBytes) | aozora_bytes_free |
| aozora_document_diagnostics_json (AozoraBytes) | aozora_bytes_free |

Dropping a handle without _free leaks; freeing then dereferencing is undefined behaviour. This is the standard ABI contract — any unsafe { Box::from_raw(...) } mistake on the consumer side trips both ASan and miri (both run in CI on the FFI test suite).
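
The ptr / len / cap triple is the usual way to lend a Rust Vec<u8> to C and take it back later. The sketch below shows that round trip under stated assumptions: Bytes, bytes_from_vec, and bytes_free are hypothetical stand-ins, not the aozora-ffi source, and resetting the struct on free is an assumed behaviour chosen to make a second free harmless.

```rust
use std::mem::ManuallyDrop;
use std::ptr;

/// Hypothetical stand-in with the same layout as `AozoraBytes` above.
#[repr(C)]
pub struct Bytes {
    ptr: *mut u8,
    len: usize,
    cap: usize,
}

/// Hands ownership of the Vec's allocation to the caller.
fn bytes_from_vec(v: Vec<u8>) -> Bytes {
    // ManuallyDrop stops Rust from freeing the buffer when `v` goes away.
    let mut v = ManuallyDrop::new(v);
    Bytes { ptr: v.as_mut_ptr(), len: v.len(), cap: v.capacity() }
}

/// Reclaims and frees the allocation, then resets the struct.
///
/// # Safety
/// `b` must come from `bytes_from_vec` and its fields must be unmodified.
unsafe fn bytes_free(b: &mut Bytes) {
    if !b.ptr.is_null() {
        // SAFETY: ptr/len/cap describe an allocation produced by a Vec<u8>.
        drop(unsafe { Vec::from_raw_parts(b.ptr, b.len, b.cap) });
        b.ptr = ptr::null_mut();
        b.len = 0;
        b.cap = 0;
    }
}

fn main() {
    let mut html = bytes_from_vec(b"<ruby>...</ruby>".to_vec());

    // SAFETY: ptr/len still describe the live buffer; it has not been freed.
    let view = unsafe { std::slice::from_raw_parts(html.ptr, html.len) };
    println!("{}", String::from_utf8_lossy(view));

    // SAFETY: `html` came from `bytes_from_vec` and was not freed before.
    unsafe { bytes_free(&mut html) };
    assert!(html.ptr.is_null());
}
```

Carrying cap alongside len matters: Vec::from_raw_parts needs the original capacity to return the allocation to the allocator correctly.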

Why JSON for diagnostics, not a C struct?

Three reasons.

  1. Variant types. Diagnostic has optional fields (help, sometimes a multi-span). A flat C struct would either lose data or grow nullable pointers everywhere. JSON expresses optionality naturally.
  2. Schema stability. Adding a new diagnostic field is a backward-compatible JSON change. Adding a field to a C struct breaks every consumer that compiled against the old size.
  3. Single emitter. The same JSON shape is produced by aozora-wasm (consumed by JS) and aozora-py (consumed by Python). Aligning the C ABI on the same shape means downstream polyglot consumers don’t translate between three different schemas.

The cost is one serde_json::to_string call per aozora_document_diagnostics_json invocation — a one-shot O(N) allocation that is a rounding error compared to the parse itself.

Why opaque handle + bytes, not a flat C struct projection?

A flat C struct projection of AozoraTree would require:

  • Naming every Rust enum variant in C (not supported cleanly via cbindgen for tagged unions).
  • Translating the bumpalo arena into a malloc-backed block contiguous with the tree (which means copying the tree out).
  • Pinning the AST shape across the C ABI — internal refactors (e.g. adding a new AozoraNode variant) would break ABI without warning.

The opaque-handle approach keeps the AST entirely Rust-side. C consumers ask for HTML, canonical text, or JSON-encoded diagnostics — three stable shapes that don’t change with internal refactors.

Use from Go / Zig / Nim

Any language with a C FFI can consume the library. The aozora.h header is plain C99: no inline functions, no macros that depend on a compiler-specific extension, no #pragma. CI runs the smoke test against gcc, clang, and msvc.

Extism host SDKs (Java / PHP / Ruby / … the polyglot tail)

The aozora-extism crate compiles to one portable wasm32-unknown-unknown artifact — aozora.wasm — that any language with an Extism host SDK can load. The bytes are identical on every platform, so there is no per-(OS × arch) native build to produce, sign, and publish: a Java, PHP, or Ruby host loads the same wasm a Go host does.

This is the breadth strategy for new languages (ADR-0006). The native bindings stay where they already pay their way — Python (PyO3) and the browser WASM (wasm-bindgen) are in-process and faster — and the C ABI remains for max-performance embedders willing to ship a native library per platform. Extism covers everyone else with a single artifact and mechanically generated types.

The contract is the same “text in → bytes out” waist as the C ABI: each export takes the Aozora source as input bytes and returns either HTML, a round-tripped source string, or a versioned JSON envelope. Every JSON path delegates to aozora::wire — the single cross-driver authority — so the output is byte-identical to the C ABI, the browser WASM, and the PyO3 drivers.

The plugin contract

aozora.wasm exports seven #[plugin_fn] entry points. Each takes the source text as input and returns a string:

| Export | Input | Returns | Shape |
| --- | --- | --- | --- |
| to_html | source | string | Semantic HTML5 with aozora-* class hooks. |
| serialize | source | string | Canonical 青空文庫 source (round-trip). |
| diagnostics_json | source | string | Wire envelope of diagnostics. |
| nodes_json | source | string | Wire envelope of source-keyed nodes. |
| pairs_json | source | string | Wire envelope of matched open/close pairs. |
| container_pairs_json | source | string | Wire envelope of container open/close pairs. |
| schema_version | (ignored) | string | The wire schema version as a decimal string. |

The four *_json exports each emit the standard wire envelope:

{ "schema_version": 1, "data": [ /* … entries … */ ] }

The per-endpoint data entry shapes — and the committed JSON Schema for each — are documented in the Wire format chapter. to_html and serialize return a bare string (no envelope), and schema_version returns just the integer rendered as text (e.g. "1"); it ignores its input, so a host calls it with an empty buffer.
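
The envelope itself is a thin wrapper around already-serialised entries. As a small illustration of the shape (not the aozora::wire implementation), the hypothetical envelope function below joins pre-encoded JSON entries under a version number; the entry contents in main are placeholders, not a real payload.

```rust
/// Wraps pre-serialised JSON entries in the wire envelope shape.
/// Each element of `entries` must already be a valid JSON value.
fn envelope(schema_version: u32, entries: &[&str]) -> String {
    format!(
        "{{\"schema_version\":{},\"data\":[{}]}}",
        schema_version,
        entries.join(",")
    )
}

fn main() {
    // Placeholder entries for illustration; real entry shapes are defined
    // by the per-endpoint JSON Schemas.
    let entries = ["{\"kind\":\"example\"}"];
    println!("{}", envelope(1, &entries));
    println!("{}", envelope(1, &[]));
}
```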

A source larger than the parser’s 4 GiB (u32::MAX) span limit is rejected on the Extism error channel rather than aborting the instance — the same guard the C ABI and browser WASM apply.
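
A guard like this reduces to one checked conversion. The sketch below is an assumption about the shape of such a check, not the crate's code; SourceTooLarge and checked_span_len are hypothetical names, and the length is taken as u64 so the example behaves the same on 32-bit and 64-bit hosts.

```rust
/// Hypothetical error for a source whose byte length does not fit in a
/// u32 span offset.
#[derive(Debug, PartialEq, Eq)]
struct SourceTooLarge {
    len: u64,
}

/// Returns the length as a u32 span bound, or an error instead of
/// truncating or aborting.
fn checked_span_len(len: u64) -> Result<u32, SourceTooLarge> {
    u32::try_from(len).map_err(|_| SourceTooLarge { len })
}

fn main() {
    let source = "|青梅《おうめ》";
    match checked_span_len(source.len() as u64) {
        Ok(n) => println!("source fits: {n} bytes"),
        Err(e) => eprintln!("rejected: {} bytes exceeds the span limit", e.len),
    }
}
```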

The schema_version wire contract

Every *_json export wraps its payload in { "schema_version": N, "data": [...] }, where N is aozora::wire::SCHEMA_VERSION baked into the wasm at build time.

A host MUST call schema_version at load time and assert that the returned integer equals the version its types were generated for:

  1. The wasm and the host’s generated types are version-locked. Mismatch means the data array may not decode into the types you compiled against.
  2. schema_version is a cheap, input-free probe — the canonical place to fail fast, before the first real parse.

A SCHEMA_VERSION bump is a breaking change to the wire shape (a new kind value, a field rename, an envelope restructuring). Per ADR-0006’s consequences, a bump forces:

  • regeneration of every language’s types (just types-langs, drift-gated), and
  • a coordinated SDK release — the wasm release asset and the host SDKs are released together, version-locked.

So a host that asserts schema_version == <generated-for> at load can treat any other value as “this wasm is from a different release than my types” and refuse to proceed, rather than silently decoding against the wrong shape.
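
The load-time assertion is a parse followed by an equality check. A minimal host-side sketch follows, assuming the export returned the string passed in as raw; SchemaError and check_schema_version are hypothetical names, and trimming whitespace before parsing is a tolerance added for the example.

```rust
/// Hypothetical error type for the load-time probe.
#[derive(Debug, PartialEq, Eq)]
enum SchemaError {
    /// The export returned something that is not a decimal integer.
    NotAnInteger(String),
    /// The wasm and the generated types come from different releases.
    Mismatch { expected: u32, found: u32 },
}

/// Validates the string returned by the `schema_version` export against the
/// version the host's types were generated for.
fn check_schema_version(raw: &str, expected: u32) -> Result<(), SchemaError> {
    let found: u32 = raw
        .trim()
        .parse()
        .map_err(|_| SchemaError::NotAnInteger(raw.to_owned()))?;
    if found == expected {
        Ok(())
    } else {
        Err(SchemaError::Mismatch { expected, found })
    }
}

fn main() {
    // The version the host's generated types target (1 in the envelope
    // example above).
    const EXPECTED_SCHEMA_VERSION: u32 = 1;
    for raw in ["1", "2", "one"] {
        println!("{raw:?} -> {:?}", check_schema_version(raw, EXPECTED_SCHEMA_VERSION));
    }
}
```

Failing here, before the first parse, keeps a version skew from surfacing later as a confusing decode error.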

Worked example: the Go SDK

The reference host SDK is aozora-go — a pure-Go host built on the wazero runtime (no cgo, no native build). It is the concrete instance of the language-agnostic pattern below: load aozora.wasm, assert schema_version, call the exports, and decode the envelopes with types generated from the committed JSON Schema. Every other Extism host SDK follows the identical shape — only the host-SDK API calls and the generated type syntax differ.

See aozora-go for the worked, idiomatic version; the section below is the template every language instantiates.

Language-agnostic “call a plugin export” template

The steps are the same in every Extism host SDK; only the method names and type syntax change.

  1. Obtain aozora.wasm. Download it from a GitHub release asset, or build it yourself with just extism-build (see Building the plugin).
  2. Create an Extism plugin from the bytes. Hand the wasm bytes to your host SDK’s plugin constructor. WASI is not required — the plugin needs no filesystem or environment access.
  3. Call schema_version and assert. Invoke schema_version with an empty input, parse the returned decimal string to an integer, and assert it equals the version your types were generated for. Abort on mismatch.
  4. Call to_html(source). Pass the source bytes; receive the HTML5 string.
  5. Call nodes_json(source) (or any *_json export). Receive the JSON envelope string and parse it.
  6. Decode data with generated types. Deserialize the envelope’s data array into the types generated from the committed JSON Schema for that endpoint.

In pseudocode:

plugin  = ExtismPlugin(read("aozora.wasm"))      // step 2

ver     = int(plugin.call("schema_version", ""))  // step 3
assert ver == EXPECTED_SCHEMA_VERSION

html    = plugin.call("to_html", source)          // step 4

env     = json_parse(plugin.call("nodes_json", source))   // step 5
assert env.schema_version == EXPECTED_SCHEMA_VERSION
nodes   = decode<NodeWire[]>(env.data)            // step 6

One plugin instance is not concurrency-safe. A single Extism plugin wraps a single wasm instance with its own linear memory; do not call into one instance from multiple threads at once. Use one instance per thread, or pool them.
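
Pooling can be sketched without any Extism dependency. In the example below FakePlugin is a hypothetical stand-in for a loaded plugin instance and Pool is an illustrative helper, not part of any SDK; the point is that the lock guards only the idle list, never the call into the instance.

```rust
use std::sync::Mutex;
use std::thread;

/// Hypothetical stand-in for one loaded plugin instance with its own state.
struct FakePlugin {
    calls: u32,
}

impl FakePlugin {
    fn new() -> Self {
        Self { calls: 0 }
    }

    /// Pretends to call an export; needs `&mut self`, like a real instance
    /// that mutates its linear memory.
    fn call(&mut self, input: &str) -> String {
        self.calls += 1;
        format!("{input}#{}", self.calls)
    }
}

/// A minimal pool of idle instances.
struct Pool<T> {
    idle: Mutex<Vec<T>>,
}

impl<T> Pool<T> {
    fn new() -> Self {
        Self { idle: Mutex::new(Vec::new()) }
    }

    /// Runs `f` with exclusive access to one instance, creating a fresh one
    /// with `make` when none is idle. The lock is held only while moving an
    /// instance out of or back into the pool, never during the call itself.
    fn with<R>(&self, make: impl FnOnce() -> T, f: impl FnOnce(&mut T) -> R) -> R {
        let checked_out = self.idle.lock().unwrap().pop();
        let mut instance = checked_out.unwrap_or_else(make);
        let result = f(&mut instance);
        self.idle.lock().unwrap().push(instance);
        result
    }
}

fn main() {
    let pool: Pool<FakePlugin> = Pool::new();
    thread::scope(|s| {
        for i in 0..4 {
            let pool = &pool;
            s.spawn(move || {
                let out = pool.with(FakePlugin::new, |p| p.call(&format!("doc{i}")));
                println!("{out}");
            });
        }
    });
    println!("instances created: {}", pool.idle.lock().unwrap().len());
}
```

The pool grows to the peak number of simultaneous callers and then stops allocating, which suits instances that are expensive to create.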

Per-language pointers

Extism publishes host SDKs for roughly 15 languages — including Java, PHP, Ruby, .NET, Elixir, Haskell, OCaml, C/C++, and more — plus the pure-Go aozora-go reference. Browse the current set at the Extism host-SDK docs.

  • Types for every supported language are generated from the committed wire JSON Schema by just types-langs (the quicktype driver), wired into the same drift-gate that guards the TypeScript .d.ts. Generate once per SCHEMA_VERSION; commit the output.
  • The wasm ships as a GitHub release asset (one artifact, all platforms) and is reproducible locally via just extism-build.

Building the plugin

just extism-build

Builds aozora-extism for wasm32-unknown-unknown and runs binaryen’s wasm-opt (the pinned, bulk-memory-capable build baked into the dev image), producing:

  • crates/aozora-extism/dist/aozora.wasm — the portable plugin artifact.

To exercise it end-to-end:

just smoke-extism

Both run inside the dev image — never invoke cargo / wasm-opt on the host.

See also

  • Choosing a binding — native vs. C ABI vs. Extism, and when to reach for each.
  • Go SDK — the reference Extism host SDK (pure-Go wazero).
  • Wire format — the envelope shape, the four endpoint payloads, and their JSON Schemas.
  • C ABI — the in-process alternative for embedders that ship a native library.
  • ADR-0006 — why Extism + schema-driven type generation is the breadth strategy.

Pandoc integration

The aozora-pandoc crate (workspace-internal, available via the aozora CLI) projects a parsed Aozora document into the Pandoc AST. Once you have Pandoc JSON, every Pandoc output format (HTML, EPUB, LaTeX/PDF, DOCX, ODT, MediaWiki, …) is one shell pipe away.

This is the recommended path if you want to convert Aozora Bunko notation into anything other than the built-in HTML renderer. Adding a new output format means adding a Pandoc filter (or none, if the default Span/Div mapping is enough), not extending the parser crate.

Quickstart

# Pandoc JSON to stdout
aozora pandoc input.txt > out.json

# Or pipe through pandoc directly
aozora pandoc input.txt | pandoc -f json -t html
aozora pandoc input.txt | pandoc -f json -t epub3 -o out.epub

# `--format` is shorthand for the pipe (requires pandoc on PATH)
aozora pandoc input.txt --format html > out.html
aozora pandoc -E sjis legacy.txt -t epub > out.epub

Projection rules

Each AozoraNode variant lifts to a Pandoc construct carrying a stable CSS class so downstream filters or stylesheets can specialise the rendering:

| Aozora variant | Pandoc construct | Class on the construct |
| --- | --- | --- |
| Ruby | Span | aozora-ruby |
| ↳ base text | nested Span | aozora-ruby-base |
| ↳ reading text | nested Span | aozora-ruby-reading |
| Bouten | Span over target text | aozora-bouten |
| TateChuYoko | Span | aozora-tate-chu-yoko |
| Gaiji | Span carrying mencode | aozora-gaiji |
| Indent, AlignEnd | empty Span (marker) | aozora-indent / align-end |
| Warichu | Span with two children | aozora-warichu |
| AngleQuote | Span | aozora-angle-quote |
| Annotation, Kaeriten, HeadingHint | empty Span carrying raw | aozora-annotation / etc. |
| PageBreak | HorizontalRule block | n/a (semantic block) |
| SectionBreak | empty Div | aozora-section-break |
| AozoraHeading | Header block | aozora-heading |
| Sashie | Para with Image | aozora-sashie |
| Container (字下げ等) | Div wrapping inner blocks | aozora-container-indent / etc. |

The attribute key-value list kvs (the third component of Pandoc’s Attr tuple) carries non-textual metadata (bouten kind / position, gaiji description / mencode, indent amount, container kind). Filters that want format-native rendering pattern-match on the class and the kvs.
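
In Pandoc's JSON encoding a Span is {"t":"Span","c":[attr, inlines]}, where attr is [identifier, [classes], [[key, value], …]]. The sketch below builds that shape by hand to make the projection concrete. It is not the aozora-pandoc code; json_str, str_inline, and span are hypothetical helpers, and a complete Pandoc document would additionally need the top-level pandoc-api-version, meta, and blocks wrapper.

```rust
/// Encodes `s` as a JSON string literal.
fn json_str(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// A Pandoc `Str` inline.
fn str_inline(text: &str) -> String {
    format!("{{\"t\":\"Str\",\"c\":{}}}", json_str(text))
}

/// A Pandoc `Span` with an empty identifier, one class, and key-value
/// attributes, wrapping already-encoded inlines.
fn span(class: &str, kvs: &[(&str, &str)], inlines: &[String]) -> String {
    let kvs_json: Vec<String> = kvs
        .iter()
        .map(|(k, v)| format!("[{},{}]", json_str(k), json_str(v)))
        .collect();
    format!(
        "{{\"t\":\"Span\",\"c\":[[\"\",[{}],[{}]],[{}]]}}",
        json_str(class),
        kvs_json.join(","),
        inlines.join(",")
    )
}

fn main() {
    // Class names come from the table above; no attribute keys are used
    // here because the real key names are not specified in this chapter.
    let base = span("aozora-ruby-base", &[], &[str_inline("青梅")]);
    let reading = span("aozora-ruby-reading", &[], &[str_inline("おうめ")]);
    println!("{}", span("aozora-ruby", &[], &[base, reading]));
}
```

A filter on the Pandoc side would match on the class in the attr's second slot and read any metadata from the third.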

Why a Pandoc projection at all

Aozora notation has rich semantic markup (ruby, bouten, tate-chu-yoko, gaiji…) that no single Pandoc native construct captures. The naive shortcut of emitting RawInline("html", "<ruby>…</ruby>") would only work for the HTML writer; every other Pandoc output format would strip the raw HTML and lose the meaning.

By lifting each Aozora variant to a Span / Div with a stable class, the same JSON renders sensibly across every Pandoc format today (each format’s writer renders Span as a stylable container) and stays open for richer format-native rendering tomorrow via filters. That’s the same pattern Pandoc itself uses for [content]{.smallcaps} — semantic in the AST, format-specific in the writer.

Architecture

The library entry point is aozora_pandoc::to_pandoc:

use aozora::Document;
use aozora_pandoc::to_pandoc;

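fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Read the source, parse it, and project the tree into the Pandoc AST.
    let doc = Document::new(std::fs::read_to_string("input.txt")?);
    let pandoc = to_pandoc(&doc.parse());
    // Serialise to Pandoc JSON, ready to pipe into `pandoc -f json`.
    let json = serde_json::to_string(&pandoc)?;
    println!("{json}");
    Ok(())
}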

aozora-cli wires that into aozora pandoc so binary consumers don’t need to write Rust.

Recipes

Task-shaped, copy-paste answers to “how do I do X with aozora?”. Each recipe is a single problem stated in one sentence, the minimal correct code to solve it, the output you should expect, and a jump list to the deeper chapters.

The Rust snippets use the umbrella aozora crate and nothing else: downstream consumers depend on aozora alone, never the internal building-block crates. The shell snippets use the aozora binary. If you do not yet have either in scope, start at Install, then the Library or CLI quickstart.

Each recipe that has a Rust solution maps to a runnable example under crates/aozora/examples/, so you can read the whole program and run it rather than reassembling fragments. Where that applies the recipe says so — e.g. run with just example walk_ast.

The recipes

| I want to… | Recipe |
| --- | --- |
| Pull every ruby base + reading pair out of a document | Extract ruby pairs |
| Get diagnostics as machine-readable JSON | Diagnostics as JSON |
| Walk the parsed tree node by node | Walk the AST |
| Parse a Shift_JIS file and resolve 外字 | Shift_JIS & gaiji |
| Convert to EPUB / LaTeX / DOCX | EPUB via Pandoc |
| Check that a file is already canonical | Round-trip & fmt --check |
| Call aozora from Go / Java / Python / JS | Call from another language |

The example programs

The recipes mirror these runnable examples (authored under crates/aozora/examples/); each is launched with just example <name>:

| Example | Mirrors |
| --- | --- |
| hello | The six-line render in the Library quickstart |
| walk_ast | Walk the AST, Extract ruby pairs |
| diagnostics | Diagnostics as JSON |
| round_trip | Round-trip & fmt --check |
| sjis | Shift_JIS & gaiji |

See also

  • Library Quickstart: the lifetime model and the core Document → AozoraTree flow every recipe assumes.
  • Choosing a binding: picking the surface (Rust / CLI / wasm / Python / Go / Extism) before you start.
  • Node reference: what each AST node represents.
  • Wire format: the JSON envelope the aozora::wire serialisers emit.

Extract ruby pairs

Problem. You want every ruby annotation in a document as (base, reading) string pairs — to build a furigana glossary, audit readings, or feed a dictionary.

Solution

Walk source_nodes() (see Walk the AST), keep only the Ruby nodes, and read each node’s base and reading. Both are NonEmpty<Content>; call .get() to get the Content, then .as_plain() for the common case where the text carries no nested constructs.

use aozora::{Document, AozoraNode, NodeRef};

fn main() {
    let source = "|青梅《おうめ》街道を|逢《お》う";
    let doc = Document::new(source);
    let tree = doc.parse();

    for sn in tree.source_nodes() {
        // Ruby is always an inline construct.
        if let NodeRef::Inline(AozoraNode::Ruby(ruby)) = sn.node {
            // `base` / `reading` are NonEmpty<Content>; `.get()` is the
            // Content, `.as_plain()` its text when there are no nested nodes.
            let base = ruby.base.get().as_plain().unwrap_or("<mixed>");
            let reading = ruby.reading.get().as_plain().unwrap_or("<mixed>");
            println!("{base}\t{reading}");
        }
    }
}

Expected output

青梅	おうめ
逢	お

Notes

  • Why NonEmpty. The parser only emits a Ruby node once both base and reading have content, so the fields are NonEmpty<Content> — an empty side is unrepresentable, and you never have to guard against it. .get() unwraps to the inner Content.

  • The <mixed> arm. Content::as_plain() returns None when the run carries nested constructs (a gaiji reference or annotation inside the base, for instance). That is rare for readings but does happen for bases. To flatten those too, iterate the segments instead of bailing (Segment lives under the syntax module since it is not in the top-level re-export set):

    use aozora::syntax::borrowed::Segment;
    
    fn text_of(content: aozora::Content<'_>) -> String {
        let mut out = String::new();
        for seg in content.iter() {
            if let Segment::Text(s) = seg {
                out.push_str(s);
            }
            // Segment::Gaiji / Segment::Annotation carry non-plain payloads;
            // handle them here if your glossary needs them.
        }
        out
    }

    Content::iter() yields a Segment per logical run; the Plain case yields exactly one Text segment, so the loop is uniform.

  • delim_explicit. ruby.delim_explicit records whether the source used the explicit base delimiter. It does not affect the base/reading text — see the Ruby node chapter for why both source forms classify identically.

See also

  • Runnable example: just example walk_ast (crates/aozora/examples/walk_ast.rs) shows the full node walk this recipe narrows.
  • Walk the AST — the general traversal.
  • Ruby node reference — the Ruby struct, the two source forms, and the rendered HTML.
  • Ruby notation — the |青梅《おうめ》 syntax itself.

Diagnostics as JSON

Problem. You want the parser’s diagnostics as a stable, machine-readable JSON document — to feed an editor, a CI annotation, or a cross-language tool.

Solution (library)

The parser always produces a tree, even from malformed input; diagnostics ride alongside it. AozoraTree::diagnostics is the typed slice, and aozora::wire::serialize_diagnostics projects that slice into the shared wire envelope — the exact JSON every binding (FFI, wasm, Python, Extism) emits.

use aozora::Document;
use aozora::wire::serialize_diagnostics;

fn main() {
    // U+E001 is a private-use sentinel the parser reserves; feeding one
    // in raises a diagnostic without aborting the parse.
    let doc = Document::new("abc\u{E001}def");
    let tree = doc.parse();

    let json = serialize_diagnostics(tree.diagnostics());
    println!("{json}");
}

The wire module is behind the wire Cargo feature on aozora.

Expected output

{"schema_version":1,"data":[{"kind":"source_contains_pua","severity":"warning","source":"source","span":{"start":3,"end":6},"codepoint":""}]}

Each entry is { kind, severity, source, span: { start, end }, codepoint? }. schema_version lets a consumer branch before an added variant shows up; see the Wire format chapter for the full schema and the "unknown" fallback contract.

Walking diagnostics without serialising

If you are staying in Rust, you usually do not need JSON at all — read the typed slice directly:

for d in tree.diagnostics() {
    // `Diagnostic` is an enum: `{d}` is the human message (thiserror),
    // `code()` the stable id, `span()` the byte range.
    let span = d.span();
    eprintln!("[{}] {d} @ {}..{}", d.code(), span.start, span.end);
}

Diagnostics are non-fatal by design: callers that want strict behaviour treat any diagnostic as an error themselves. The Diagnostics catalogue lists every stable code.
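A strict wrapper in the sense of the paragraph above might look like the following sketch. It is generic over the diagnostic type so that it compiles without the crate; `strict` is a hypothetical helper, not part of aozora's API.

```rust
/// Hypothetical helper: turn "value plus diagnostics" into a Result.
/// `D` stands in for the parser's diagnostic type.
fn strict<T, D>(value: T, diagnostics: &[D]) -> Result<T, usize> {
    if diagnostics.is_empty() {
        Ok(value)
    } else {
        // Report how many diagnostics blocked the strict path.
        Err(diagnostics.len())
    }
}

fn main() {
    // Stand-in data; with aozora you would pass `tree.diagnostics()`.
    let clean: &[&str] = &[];
    let noisy: &[&str] = &["source_contains_pua"];

    assert_eq!(strict("tree", clean), Ok("tree"));
    assert_eq!(strict("tree", noisy), Err(1));
    println!("strict gate behaves as expected");
}
```

Keeping the policy in the caller means the same parse result can serve both a lenient editor and a strict CI gate.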

Solution (CLI)

For shell / CI use, aozora check lexes a file and reports diagnostics; under --strict it exits non-zero when any are reported:

aozora check src.txt              # human-readable; exit 0 even with warnings
aozora check --strict src.txt     # warnings → exit 1 (the CI gate)
cat src.txt | aozora check        # reads stdin

A JSON output mode for check (--diagnostic-format json, emitting the same serialize_diagnostics envelope) is planned so scripts get the structured stream without writing Rust. Until it lands, the library path above is the supported way to obtain the JSON; the CLI’s current output is the human-readable form documented in the CLI reference.

Walk the AST

Problem. You have parsed a document and want to visit every classified Aozora construct in source order — to count node kinds, build an index, or drive a custom renderer.

Solution

AozoraTree::source_nodes returns a slice of SourceNode, one per classified construct, sorted by source position. Each carries a source_span (byte offsets into the source) and a node, which is a NodeRef tagging the sentinel kind that fired.

use aozora::{Document, NodeRef};

fn main() {
    let source = "|青梅《おうめ》の[#ここから2字下げ]街道《かいどう》[#ここで字下げ終わり]";
    let doc = Document::new(source);
    let tree = doc.parse();

    for sn in tree.source_nodes() {
        let span = sn.source_span;
        match sn.node {
            NodeRef::Inline(node) | NodeRef::BlockLeaf(node) => {
                // `node` is an AozoraNode; `.kind()` is the cross-cutting tag.
                println!("{:>3}..{:<3} {:?}", span.start, span.end, node.kind());
            }
            NodeRef::BlockOpen(kind) => {
                println!("{:>3}..{:<3} open  {kind:?}", span.start, span.end);
            }
            NodeRef::BlockClose(kind) => {
                println!("{:>3}..{:<3} close {kind:?}", span.start, span.end);
            }
        }
    }
}

Expected output

  0..21  Ruby
 24..45  open  Indent { amount: 2 }
 45..72  Ruby
 72..105 close Indent { amount: 2 }

(Byte offsets are over the full-width UTF-8 source; the exact numbers depend on your input.)
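Because spans are byte offsets, they slice the source string directly. The following std-only sketch uses hand-written spans (the numbers are computed for this literal, not produced by the parser); every kanji, kana, and bracket in it occupies three bytes in UTF-8. `slice_span` is a hypothetical helper name.

```rust
use std::ops::Range;

/// Slice a source string by a byte span, returning None if the span
/// does not fall on UTF-8 character boundaries or is out of range.
fn slice_span(source: &str, span: Range<usize>) -> Option<&str> {
    source.get(span)
}

fn main() {
    let source = "青梅《おうめ》の街道";
    // 7 characters × 3 bytes each = 21 bytes for the ruby run.
    assert_eq!(slice_span(source, 0..21), Some("青梅《おうめ》"));
    assert_eq!(slice_span(source, 21..24), Some("の"));
    // A span that cuts through a character is rejected, not a panic.
    assert_eq!(slice_span(source, 0..1), None);
    println!("{} bytes, {} chars", source.len(), source.chars().count());
}
```

Using `str::get` rather than indexing keeps a bad span from panicking, which matters when spans come from another tool.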

How the surface is shaped

source_nodes() is the source-coordinate view — the one editor features and indexers want. The NodeRef variant tells you where the construct landed:

  • Inline — an inline construct (ruby, bouten, gaiji, 縦中横, …) carrying an AozoraNode.
  • BlockLeaf — a standalone block construct (page break, section break, heading) carrying an AozoraNode.
  • BlockOpen / BlockClose — the two ends of a paired container ([#ここから…] / [#ここで…終わり]), each carrying a ContainerKind.

NodeRef::kind() collapses all four into a single NodeKind tag when you only need the discriminant; NodeRef::sentinel_kind() gives the sentinel family.

Matching container open/close pairs

The walk above sees opens and closes as independent events. When you need them paired — “where does this [#ここから…] close?” — read AozoraTree::container_pairs instead, which yields one entry per balanced pair (in normalized coordinates). The inline-delimiter analogue (ruby 《…》, brackets) is AozoraTree::pairs. See Indent & align containers for the container model.
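To make the pairing concrete, here is a sketch of stack-based matching over a flat open/close event stream. The `Event` enum and `pair_events` function are hypothetical stand-ins for illustration; with aozora itself, container_pairs already hands you the pairs.

```rust
/// Hypothetical flat event: a container opens or closes at a byte offset.
#[derive(Debug, Clone, Copy)]
enum Event {
    Open(usize),
    Close(usize),
}

/// Pair each close with the most recent unmatched open (LIFO), which is
/// how properly nested containers balance. Unmatched events are skipped.
fn pair_events(events: &[Event]) -> Vec<(usize, usize)> {
    let mut stack = Vec::new();
    let mut pairs = Vec::new();
    for event in events {
        match *event {
            Event::Open(at) => stack.push(at),
            Event::Close(at) => {
                if let Some(open_at) = stack.pop() {
                    pairs.push((open_at, at));
                }
            }
        }
    }
    pairs
}

fn main() {
    // outer open, inner open, inner close, outer close
    let events = [
        Event::Open(0),
        Event::Open(30),
        Event::Close(60),
        Event::Close(90),
    ];
    let pairs = pair_events(&events);
    assert_eq!(pairs, vec![(30, 60), (0, 90)]);
    println!("{pairs:?}");
}
```

Pairs come out in close order, so inner containers appear before the containers that enclose them.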

Reaching inside a node

AozoraNode is a borrowed enum; its payload fields hold the construct’s content. To pull text out of a specific variant — say the base and reading of a ruby node — match the variant and read its Content; that is the next recipe, Extract ruby pairs.

Shift_JIS & gaiji

Problem. Aozora Bunko ships its corpus as Shift_JIS, and those files contain 外字 (gaiji) references like ※[#「木+吶のつくり」、第3水準1-85-54]. You want to decode the bytes and see how each gaiji reference resolved.

Two concerns, two layers

  • Encoding is not the parser’s job — the parser is strictly UTF-8. Decode Shift_JIS first with aozora::encoding, then hand the resulting String to Document::new.
  • Gaiji resolution is the parser’s job. As it classifies a ※[#…] reference it resolves the mencode against the bundled JIS X 0213 tables, attaching the result to the Gaiji node. You read it off the node; you do not call the resolver yourself.

Solution (library)

use aozora::{Document, AozoraNode, NodeRef};
use aozora::encoding::decode_sjis;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Decode the Shift_JIS archive file to UTF-8 (strict — errors on
    // malformed bytes rather than substituting replacement chars).
    let bytes = std::fs::read("crime_and_punishment.txt")?;
    let utf8 = decode_sjis(&bytes)?;

    let doc = Document::new(utf8);
    let tree = doc.parse();

    for sn in tree.source_nodes() {
        if let NodeRef::Inline(AozoraNode::Gaiji(g)) = sn.node {
            match g.ucs.and_then(|r| r.as_char()) {
                Some(ch) => println!("{} → {ch}", g.description),
                None => println!("{} → (unresolved)", g.description),
            }
        }
    }
    Ok(())
}

Expected output

木+吶のつくり → 吶

Gaiji carries three fields: description (the free-form source text), ucs (the resolved Resolved, None when no table matched), and mencode (the raw reference such as 第3水準1-85-54). Resolved is either a single Char — recovered via as_char() above — or a Multi combining sequence for the handful of plane-1 cells that need one; see the Gaiji node chapter.

Picking the decoder

aozora::encoding offers more than one entry point:

  • decode_sjis(&[u8]) -> Result<String, _> — force Shift_JIS. Use it when you know the input is the canonical archive encoding.
  • decode_auto(&[u8]) -> Result<Cow<str>, _> — sniff: valid UTF-8 is returned borrowed (zero-copy), otherwise the bytes decode as Shift_JIS. Use it for a mixed corpus where some files are pre-converted UTF-8 mirrors.

Both are strict — neither substitutes replacement characters — so you learn when you are looking at corrupted source rather than silently absorbing it.
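The sniffing strategy described for decode_auto can be sketched with the standard library alone. This is an illustration of the borrow-or-decode idea, not the crate's implementation: `decode_sniff` and `refuse` are hypothetical names, and the fallback parameter stands in for a strict Shift_JIS decoder.

```rust
use std::borrow::Cow;

/// Borrow when the bytes are already valid UTF-8, otherwise hand them
/// to a fallback decoder. `fallback` is a hypothetical stand-in for a
/// strict Shift_JIS decoder.
fn decode_sniff<'a, E>(
    bytes: &'a [u8],
    fallback: impl FnOnce(&[u8]) -> Result<String, E>,
) -> Result<Cow<'a, str>, E> {
    match std::str::from_utf8(bytes) {
        Ok(text) => Ok(Cow::Borrowed(text)), // zero-copy path
        Err(_) => fallback(bytes).map(Cow::Owned),
    }
}

/// Toy fallback that always refuses; a real one would decode Shift_JIS.
fn refuse(_: &[u8]) -> Result<String, &'static str> {
    Err("not decodable")
}

fn main() {
    let utf8 = "青梅".as_bytes();
    assert!(matches!(decode_sniff(utf8, refuse), Ok(Cow::Borrowed("青梅"))));

    // 0x90 cannot start a UTF-8 sequence, so these bytes take the
    // fallback path (0x90 is in the Shift_JIS lead-byte range).
    let not_utf8 = [0x90u8, 0xC2];
    assert_eq!(decode_sniff(&not_utf8, refuse), Err("not decodable"));
    println!("sniffing picks the expected path");
}
```

Returning `Cow` lets the caller treat both paths uniformly while paying for an allocation only when a real decode happened.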

Solution (CLI)

The aozora binary decodes Shift_JIS with -E sjis (alias --encoding sjis); the default is UTF-8:

aozora render -E sjis crime.txt > crime.html
aozora check  -E sjis crime.txt          # diagnostics on the decoded text
aozora pandoc -E sjis crime.txt -t epub > crime.epub

EPUB via Pandoc

Problem. You want an EPUB (or LaTeX/PDF, DOCX, ODT, …) out of an Aozora Bunko source — anything the built-in HTML renderer does not produce.

Solution

The aozora binary projects a parsed document into the Pandoc AST as JSON. Pipe that JSON into pandoc, and every Pandoc output format is one writer away. For EPUB:

aozora pandoc input.txt | pandoc -f json -t epub3 -o out.epub

That is the whole recipe. The same pipe reaches the other formats by swapping the -t writer:

aozora pandoc input.txt | pandoc -f json -t latex  -o out.tex
aozora pandoc input.txt | pandoc -f json -t docx   -o out.docx
aozora pandoc input.txt | pandoc -f json -t html   > out.html

Shift_JIS source decodes with -E sjis, exactly as for render / check:

aozora pandoc -E sjis crime.txt | pandoc -f json -t epub3 -o crime.epub

aozora pandoc also has a --format / -t shorthand that runs the pipe for you when pandoc is on PATH:

aozora pandoc input.txt -t epub > out.epub

Expected output

out.epub — a valid EPUB 3 container. Each Aozora construct lifts to a Pandoc Span / Div carrying a stable CSS class (aozora-ruby, aozora-bouten, …), so you can style or filter them per format. The projection-rules table lists every variant’s mapping.

Why a Pandoc projection at all

Aozora notation has rich semantic markup (ruby, bouten, 縦中横, gaiji) that no single Pandoc native construct captures. Emitting raw HTML would only survive the HTML writer; every other format would strip it. Lifting each variant to a classed Span/Div instead means the same JSON renders sensibly across every Pandoc format today and stays open to richer format-native rendering via filters tomorrow. Adding a new output format is a Pandoc filter, never a parser change.

Round-trip & fmt --check

Problem. You want to confirm a file is already in canonical Aozora form — or to canonicalise it — and to rely on parse ∘ serialize being lossless.

The property

AozoraTree::serialize re-emits Aozora source from the parsed tree. The guarantee is a fixed point: parsing a canonical document and serialising it returns the same bytes, and serialising again changes nothing.

use aozora::Document;

fn main() {
    let source = "|青梅《おうめ》";

    let once = Document::new(source).parse().serialize();
    let twice = Document::new(once.clone()).parse().serialize();

    assert_eq!(once, twice, "serialize is a fixed point");
    println!("{twice}");
}

Expected output

|青梅《おうめ》

Canonical vs. raw input

Real Aozora Bunko sources carry stylistic variation the lexer normalises before tokenising — CRLF vs LF, NFC vs NFD around accents, and the bare-vs-explicit ruby delimiter (青梅《おうめ》 vs |青梅《おうめ》). For raw input, therefore:

// Not guaranteed for arbitrary raw input:
assert_eq!(Document::new(raw).parse().serialize(), raw);   // may differ

// Guaranteed: the SECOND pass is a fixed point.
let canonical = Document::new(raw).parse().serialize();
assert_eq!(Document::new(canonical.clone()).parse().serialize(), canonical);

The first serialize() is the canonical form (e.g. it always emits the explicit ruby delimiter — see the Ruby node chapter); from there it is stable. This fixed-point property is what the corpus sweep verifies across the full ~17 000-work catalogue.
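The shape of that property can be shown without the parser. In the sketch below, `canonicalise` is a toy stand-in for parse-then-serialize that only normalises CRLF to LF, and `is_fixed_point` is a hypothetical helper; neither is aozora's code.

```rust
/// Toy canonicaliser standing in for parse-then-serialize: it only
/// normalises CRLF line endings to LF. Hypothetical, for illustration.
fn canonicalise(input: &str) -> String {
    input.replace("\r\n", "\n")
}

/// True when applying `f` to `input` changes nothing.
fn is_fixed_point(f: impl Fn(&str) -> String, input: &str) -> bool {
    f(input) == input
}

fn main() {
    let raw = "一行目\r\n二行目\r\n";
    // Raw input is not guaranteed to survive unchanged...
    assert!(!is_fixed_point(canonicalise, raw));
    // ...but its canonical form is stable under a second pass.
    let canonical = canonicalise(raw);
    assert!(is_fixed_point(canonicalise, &canonical));
    println!("{canonical:?}");
}
```

A corpus-wide check is this same test applied to every document's first-pass output.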

Solution (CLI)

aozora fmt is the round-trip at the shell. With --check it is a read-only gate — exit 0 if the file is already canonical, 1 if it would change:

aozora fmt --check src.txt        # CI gate: nonzero if not canonical
aozora fmt src.txt > out.txt      # write the canonical form to stdout
aozora fmt --write src.txt        # rewrite in place
cat src.txt | aozora fmt          # stdin → stdout

Exit codes: 0 on success (or no diff under --check), 1 on a formatting mismatch under --check, 2 on a usage error. aozora fmt --check is exactly what this project runs in CI to keep fixtures canonical.

Call from another language

Problem. You are not writing Rust — you want to parse Aozora notation from Go, Java, Python, JavaScript, Ruby, PHP, or something further down the long tail.

One parser, many front doors

There is exactly one parser. Every binding funnels the same source text through the same lexer and emits the same HTML, the same canonical serialise, and the same wire-envelope JSON — byte-identical across every language. So the decision is not “which binding is more correct”; it is “which fits the language and runtime I already have.” Choosing a binding is the full decision table; this recipe is the short jump list.

Pick your language

  • JavaScript / TypeScript (browser, Node, Deno, edge) → aozora-wasm. A wasm-bindgen Document class; runs client-side and at the edge, distributed on npm.

  • Python → aozora-py. An in-process PyO3 native module built with maturin:

    from aozora_py import Document
    doc = Document("|青梅《おうめ》")
    print(doc.to_html())     # <ruby>青梅<rt>おうめ</rt></ruby>
    
  • Go → aozora-go. A pure-Go wazero host over aozora.wasm, with no cgo and no C toolchain:

    go get github.com/P4suta/aozora-go
    
  • C / C++ / Zig / any FFI-capable native language → the aozora-ffi C ABI: an opaque handle plus JSON over a stable C header (aozora.h).

  • Java, PHP, Ruby, .NET, Elixir, Haskell, … the long tail → the aozora-extism host SDK. One portable aozora.wasm that any Extism host SDK loads — see below.

  • Anything other than HTML (EPUB, LaTeX/PDF, DOCX, …) → the aozora pandoc pipe, regardless of host language.

The Extism template (the breadth strategy)

For the languages without a bespoke native binding, the answer is the single aozora.wasm artifact loaded through that language’s Extism host SDK. The steps are identical in every SDK — only the method names change:

  1. Obtain aozora.wasm (a GitHub release asset).
  2. Load it with your host SDK’s plugin constructor (no WASI needed).
  3. Assert schema_version matches the wire schema you compiled against.
  4. Call an export with the source string:
    • to_html / serialize → a bare string;
    • diagnostics_json / nodes_json / pairs_json / container_pairs_json → a { schema_version, data } wire envelope.
  5. Parse the envelope data with types generated from the committed JSON Schema.

The reference host SDK (aozora-go) is exactly this template instantiated in Go; every other Extism SDK follows the same shape. The full export list and the language-agnostic walkthrough live in the Extism chapter. The rationale for a wasm plugin for the tail, rather than a native binding per language, is recorded in ADR-0006; the short version is in Choosing a binding → In-process vs host-runtime.

Release profile & PGO

aozora’s [profile.release] is tuned for cross-crate inlining at the expense of compile time:

[profile.release]
lto           = "fat"        # full LTO across the whole workspace
codegen-units = 1            # single CGU so LTO sees everything
strip         = "symbols"    # smaller binary, faster cold start
panic         = "abort"      # no unwinding tables in the binary
opt-level     = 3

Why fat LTO over thin

A thin LTO build keeps each crate’s IR isolated; the cross-crate inliner only inlines through summary stubs. Fat LTO concatenates every crate’s IR into one module before optimisation, so the inliner can see across the whole pipeline.

For aozora that pays off because the lex pipeline is deep: aozora-render → aozora → aozora-pipeline::lex_into_arena → per-phase functions, each living behind a crate boundary or a module boundary that LLVM treats the same way under thin LTO. A call across that depth under thin LTO costs several out-of-line calls and stack frames; the fat LTO build folds the chain into ~40 inlined instructions on the hot per-byte path.

Measured on the corpus sweep: fat LTO is 30%+ faster than thin LTO once the lex orchestrator is split across crates. Compile-time cost is real (release builds take ~3 minutes vs ~1 minute for thin), but release builds happen at tag time, not on every iteration.

Why codegen-units = 1

codegen-units = N splits each crate into N parallel codegen jobs during compilation. Each unit optimises independently, then the linker stitches them together. With N > 1 the LLVM inliner can’t see across unit boundaries inside a single crate — which under fat LTO defeats half the point.

codegen-units = 1 ensures fat LTO actually sees every function in every crate. Compile time grows; the runtime win pays for it.

Why panic = "abort"

aozora is a parser, not a server. There’s no panic handler to recover into — a panic on user input would be a parser bug, not a recoverable error. panic = "abort":

  • Drops the unwinding tables from the binary (~80 KiB savings on the CLI).
  • Removes the panic-handling overhead from every function call (the compiler doesn’t insert landing pads).
  • Surfaces parser bugs as SIGABRT immediately, which is what we want — a panic always indicates an invariant violation that needs fixing, not a state to gracefully degrade through.

For library consumers that want unwinding (e.g. embedding in a long-running server), the dependency-mode build inherits the consumer’s profile, so this only affects the binaries we publish.
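A consumer that keeps unwinding in its own profile can contain a panic at a request boundary. The sketch below uses only std::panic::catch_unwind; `handle_request` is a hypothetical name, and the closure stands in for whatever work the server does per request.

```rust
use std::panic;

/// Hypothetical request handler that contains a panic from `work`.
/// This only works when the final binary is built with panic = "unwind";
/// under panic = "abort" the process terminates before returning.
fn handle_request<F>(work: F) -> Result<String, String>
where
    F: FnOnce() -> String + panic::UnwindSafe,
{
    panic::catch_unwind(work).map_err(|_| "internal error".to_string())
}

fn main() {
    let ok = handle_request(|| "rendered".to_string());
    assert_eq!(ok, Ok("rendered".to_string()));

    // Stand-in for an invariant violation somewhere below the boundary.
    let failed = handle_request(|| panic!("invariant violated"));
    assert_eq!(failed, Err("internal error".to_string()));
    println!("server keeps running");
}
```

The default panic hook still prints the panic message to stderr before the error value is returned.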

Profile-guided optimisation (PGO)

The release pipeline supports PGO via scripts/pgo-build.sh:

./scripts/pgo-build.sh

Three-stage build:

  1. Instrumented build: cargo build --release with RUSTFLAGS="-Cprofile-generate=/tmp/pgo-data". The resulting binary is slower than vanilla release because of the instrumentation overhead.
  2. Profile collection — run the corpus sweep against the instrumented binary. The corpus must contain a representative spread of document sizes and notation density. The aozora-bench throughput_by_class probe handles this.
  3. Final build: cargo build --release with RUSTFLAGS="-Cprofile-use=/tmp/pgo-data/merged.profdata". LLVM uses the profile to drive its inliner, branch-prediction hints, and basic-block ordering decisions.

Measured win on the corpus sweep: 8–12% faster than a non-PGO release build. The cost is operational complexity (the build script needs a real corpus available); the win compounds with fat LTO, since both target the same hot paths.

BOLT (post-link optimisation)

BOLT is the next layer after PGO: it reorders basic blocks in the final binary based on the same profile. scripts/pgo-build.sh ends with an optional BOLT pass when llvm-bolt is on PATH.

BOLT wins another ~3% on top of PGO, mostly by improving I-cache density for the lex hot path. The win is smaller than PGO’s because PGO already used the profile during compilation; BOLT only refines the final binary’s layout.

Why we don’t use specific tricks

  • -Cforce-frame-pointers=yes — would help samply unwind on some platforms, but the workspace [profile.bench] covers the profiling case (debug = 1 + strip = none). Release builds get the smaller binary.
  • unsafe perf shortcuts: unsafe_code = "forbid" at the workspace level. Three crates locally relax it (FFI / scan / xtask), each with // SAFETY: comments and #[deny(unsafe_op_in_unsafe_fn)]. Where a perf opportunity needs unsafe, we measure it first and cite the win in the comment.
  • #[inline(always)] — used sparingly. The compiler’s default heuristics have improved enough that forcing inlining usually costs binary size for negligible win. Where it does help (e.g. the per-byte scanner inner loop), the call site has a measurement comment.

Profiling with samply

samply is the workspace’s sampling profiler. It produces .json.gz traces in the Firefox-Profiler gecko format that can be loaded into the web UI for visual analysis, or fed to the in-tree aozora-trace crate for automated rollups.

Quick commands

# Single corpus document
AOZORA_CORPUS_ROOT=/path/to/corpus \
  just samply-doc 001529/files/50685_ruby_67979/50685_ruby_67979.txt

# Full corpus, parser-bound (5 parse passes after the one-time load)
AOZORA_CORPUS_ROOT=/path/to/corpus just samply-corpus

# Full corpus, render-bound
AOZORA_CORPUS_ROOT=/path/to/corpus just samply-render

# Open in Firefox-Profiler
samply load /tmp/aozora-corpus-<timestamp>.json.gz

All three are wrappers over the aozora-xtask samply subcommand, which:

  • Builds the bench probe with --profile=bench (debug info preserved).
  • Runs samply against the resulting binary.
  • Drops the .json.gz in /tmp/.

Why these run on the host (not Docker)

samply uses perf_event_open(2) for kernel sampling. Docker’s default seccomp profile blocks that syscall. The xtask binary therefore runs on the host (not via docker compose run) and the Justfile recipes are exempt from the workspace’s normal “everything in Docker” policy.

The recipes check /proc/sys/kernel/perf_event_paranoid on entry and print the fix-up command if the value is too high (default 2; needs to be ≤ 1 for unprivileged sampling):

echo 1 | sudo tee /proc/sys/kernel/perf_event_paranoid

Why --profile=bench and not --release

cargo build --release uses [profile.release], which has debug = 0 + strip = "symbols". Samply still records samples, but they show up as raw addresses (0x8fb61) instead of function names — every sample becomes useless to a human reader.

The workspace [profile.bench] inherits from release but sets debug = 1 + strip = "none". The xtask wrappers automatically build with --profile=bench. If you launch samply manually, do the same.

Corpus load dominates a single-pass trace

throughput_by_class and render_hot_path spend most wall time in Shift_JIS decode + filesystem I/O during the one-time corpus load. A single-pass samply trace puts __memmove_avx_unaligned and encoding_rs::ShiftJisDecoder at the top — not the parser.

Fix: set AOZORA_PROFILE_REPEAT=K (or pass K to just samply-corpus) so the parse pass runs K times after the load. The xtask defaults to 5; raise to 10+ for very small corpora.
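The repeat-after-load pattern can be sketched as follows. Only the AOZORA_PROFILE_REPEAT variable and its default of 5 come from the text above; the `repeat_count` helper, its handling of zero or unparsable values, and the toy in-memory "corpus" are assumptions of this sketch rather than the xtask's actual code.

```rust
/// Resolve the repeat count from an optional raw setting, falling back
/// to the documented default of 5. Zero and unparsable values also fall
/// back (that choice is an assumption of this sketch).
fn repeat_count(raw: Option<&str>) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .filter(|&k| k > 0)
        .unwrap_or(5)
}

fn main() {
    let raw = std::env::var("AOZORA_PROFILE_REPEAT").ok();
    let repeats = repeat_count(raw.as_deref());

    // One-time "load", then K passes over the same in-memory data so
    // the repeated work outweighs the load in a sampled trace.
    let corpus = vec!["|青梅《おうめ》"; 1_000];
    let mut bytes_seen = 0usize;
    for _ in 0..repeats {
        bytes_seen += corpus.iter().map(|doc| doc.len()).sum::<usize>();
    }
    println!("{repeats} passes, {bytes_seen} bytes scanned");
}
```

With K passes the load is amortised to roughly 1/K of what it would cost relative to the measured work in a single-pass trace.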

Trace analysis from the CLI

aozora-xtask trace … (and the just trace-* shortcuts) load saved .json.gz traces, symbolicate them via the aozora-trace crate (DWARF lookup is pure-Rust through addr2line::Loader), and run the bundled analyses.

# 1. One-time per trace: write the symbol cache next to it
just trace-cache /tmp/aozora-corpus-<ts>.json.gz

# 2. Analyses (cache is auto-loaded if present)
just trace-libs    /tmp/aozora-corpus-<ts>.json.gz                  # binary vs libc vs vdso
just trace-hot     /tmp/aozora-corpus-<ts>.json.gz 25               # top-25 hot leaf frames
just trace-rollup  /tmp/aozora-corpus-<ts>.json.gz                  # bucketed by aozora's built-in categories
just trace-stacks  /tmp/aozora-corpus-<ts>.json.gz 'teddy' 5        # full call chains hitting any frame matching `teddy`
just trace-compare /tmp/before.json.gz /tmp/after.json.gz 25        # before/after diff
just trace-flame   /tmp/aozora-corpus-<ts>.json.gz | flamegraph.pl > flame.svg

Each analysis returns a typed report — HotReport, LibraryReport, RollupReport, ComparisonReport, MatchedStacksReport, FlameReport — whose module docstring explains the algorithm.

Why a pure-Rust DWARF symbolicator?

The mainstream alternative is shelling out to addr2line(1) from binutils. We don’t because:

  • Process spawn cost. A typical trace has 5 000+ unique addresses; spawning addr2line per address is unworkable. Pipelining through a single subprocess works but ties symbolisation to the presence of binutils on PATH (not always true on minimal containers).
  • Build-id verification. The aozora-trace::Symbolicator checks the binary’s gnu-build-id against the trace’s codeId so rebuilding between recording and analysis fails loudly rather than producing wrong symbol names. addr2line(1) has no such check.
  • Caching. The symbolicator writes a sidecar <trace>.symbols.json on first call (~100 ms per binary) and reads from it on every subsequent call (instant). Re-running addr2line per analysis would re-walk DWARF every time.

Verifying the SIMD scanner is firing

// In any binary or test
println!("{}", aozora_scan::BackendChoice::detect().name());
// "teddy-avx2" | "teddy-ssse3" | "teddy-neon" | "teddy-wasm" | "scalar-teddy"

Or under samply, look for aozora_scan::arch::x86_64::lead_mask_chunk_avx2 in the trace’s call tree. If the trace shows aozora_scan::arch::x86_64::lead_mask_chunk_ssse3 instead, the SSSE3 fallback is firing because the host lacked AVX2; aozora_scan::kernel::teddy::ScalarTeddyKernel::lead_mask_chunk indicates the pure-Rust last resort fired.

Workflow recipes

“I changed something, did I regress?”

# Microbench the per-band tokenizer throughput
cargo bench -p aozora-pipeline --bench tokenize_compare

# Macrobench the full pipeline end-to-end
AOZORA_CORPUS_ROOT=… cargo run --release --example throughput_by_class -p aozora-bench
AOZORA_CORPUS_ROOT=… cargo run --release --example render_hot_path     -p aozora-bench

# Check the worst doc didn't regress
AOZORA_CORPUS_ROOT=… AOZORA_PROBE_DOC=000286/files/49178_ruby_58807/49178_ruby_58807.txt \
  cargo run --release --example pathological_probe -p aozora-bench

“Where is lex_into_arena spending its time?”

# Macroscopic per-phase split
AOZORA_CORPUS_ROOT=… cargo run --release --example phase_breakdown -p aozora-bench

# Latency tail shape
AOZORA_CORPUS_ROOT=… cargo run --release --example latency_histogram -p aozora-bench

# Microscopic: which classify recogniser dominates a specific doc?
AOZORA_CORPUS_ROOT=… AOZORA_PROBE_DOC=… \
  cargo run --release --features instrument --example pathological_probe -p aozora-bench

Benchmarks (criterion)

aozora ships two layers of perf measurement:

  • Criterion microbenchmarks in crates/aozora-pipeline/benches/, crates/aozora-syntax/benches/, crates/aozora-scan/benches/, and crates/aozora-bench/benches/. Reproducible per-function timings with statistical confidence intervals.
  • Corpus probes in crates/aozora-bench/examples/. Each probe is a cargo run --release --example <name> binary that reports per-band statistics across a real corpus.

Criterion microbenchmarks

Run a specific bench:

cargo bench -p aozora-pipeline --bench tokenize_compare
cargo bench -p aozora-pipeline --bench classify_kaeriten
cargo bench -p aozora-syntax   --bench accent_decompose
cargo bench -p aozora-scan     --bench scanner_bakeoff
cargo bench -p aozora-bench    --bench crime_and_punishment
cargo bench -p aozora-bench    --bench synthetic_corpus

Criterion writes HTML reports under target/criterion/. Each bench reports throughput in MB/s, ns/byte, and a confidence interval; the HTML reports include violin plots that surface multi-modal latency distributions (which often indicate cache-line or page-fault effects we’d otherwise miss).

Why criterion over #[bench]

Three reasons.

  1. Statistical rigour. #[bench] reports a median with a bare max−min spread; criterion fits a model and reports a confidence interval. A bare spread says little on a system with any noise (which is every real machine).
  2. Iteration count auto-tuning. Criterion picks the iteration count to reach a target precision; #[bench] requires a hand-picked count.
  3. Stability. #[bench] is unstable and only works on nightly Rust; criterion runs on stable.

Corpus probes

Each probe under crates/aozora-bench/examples/ reports a different slice of the workload. All read AOZORA_CORPUS_ROOT; most accept AOZORA_PROFILE_LIMIT=N to cap the sweep.

| Probe | Question it answers | Output shape |
|---|---|---|
| throughput_by_class | Per-band MB/s for lex_into_arena | 4-band table + p50 / p90 / p99 / max + ns/byte |
| phase_breakdown | Per-phase ms for sanitize / events / pair / classify | per-doc latencies + top-5 worst classify / sanitize |
| latency_histogram | Log-bucketed latency distribution per phase | bar histogram, 10 buckets, 1 µs … 1 s |
| pathological_probe | Single-doc 100-iter avg per phase | tight per-call numbers; takes AOZORA_PROBE_DOC for any corpus path |
| phase0_breakdown | Per-sub-pass cost inside Phase 0 sanitize | bom_strip / crlf / rule_isolate / accent / pua_scan |
| phase0_impact | Does Phase 0 sub-pass firing change Phase 1 cost? | bucketed by which sub-passes fired |
| phase3_subsystems | Per-recogniser ms inside classify | requires --features instrument |
| diagnostic_distribution | What fraction of docs emit diagnostics? | histogram by diag count; latency-by-diag-bucket |
| allocator_pressure | Arena bytes / source byte ratio + intern dedup | per-doc histograms |
| fused_vs_materialized | Does the deforestation actually win? | per-band gap % between fused (lex_into_arena) and materialized (per-phase collect) |
| intern_dedup_ratio | How well does the interner dedup short strings? | corpus-aggregate (cache + table) / calls |
| render_hot_path | Per-band MB/s for HTML render | 4-band MB/s + render/parse ratio + out/in size ratio |

Each probe is invoked directly:

AOZORA_CORPUS_ROOT=… cargo run --release --example <name> -p aozora-bench

For phase3_subsystems, build with the instrumentation feature:

AOZORA_CORPUS_ROOT=… cargo run --release --features instrument \
  --example phase3_subsystems -p aozora-bench

Why corpus probes and criterion benches?

Different questions.

  • Criterion answers “is function X faster after my change?” on a fixed input. Microscopic, reproducible, the right tool for optimising a single hot loop.
  • Corpus probes answer “is the parser faster on the real Aozora Bunko catalogue after my change?” Macroscopic, includes every distribution effect (small-doc dispatch overhead, large-doc cache pressure, gaiji-density variation). The right tool for validating a perf PR end-to-end.

A perf PR that wins on criterion but loses on the corpus is suspicious — usually it’s optimised the small-input path at the cost of the large-input path. The corpus probe catches it.

Phase 3 instrumentation caveat

phase3-instrument wraps every recogniser entry in a SubsystemGuard that calls Instant::now() on construction + drop. For the dominant inner-loop recognisers this adds enough overhead that the report’s own timing is significantly skewed.

Use the instrumentation to compare relative costs between subsystems, not as an absolute number. For absolute numbers, run phase_breakdown (no instrumentation).
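The source of the skew is easier to see in code. The following is a sketch of an RAII timing guard in the spirit of the one described above; `TimingGuard` and its fields are hypothetical, not the crate's SubsystemGuard.

```rust
use std::cell::Cell;
use std::time::{Duration, Instant};

/// Hypothetical RAII timer: it reads the clock on construction and
/// again on drop, then adds the elapsed time to a shared counter.
struct TimingGuard<'a> {
    started: Instant,
    total: &'a Cell<Duration>,
    calls: &'a Cell<u64>,
}

impl<'a> TimingGuard<'a> {
    fn new(total: &'a Cell<Duration>, calls: &'a Cell<u64>) -> Self {
        Self { started: Instant::now(), total, calls }
    }
}

impl Drop for TimingGuard<'_> {
    fn drop(&mut self) {
        // The clock reads themselves cost time; for a tiny guarded body
        // that overhead can rival the work being measured.
        self.total.set(self.total.get() + self.started.elapsed());
        self.calls.set(self.calls.get() + 1);
    }
}

fn main() {
    let (total, calls) = (Cell::new(Duration::ZERO), Cell::new(0));
    for _ in 0..1_000 {
        let _guard = TimingGuard::new(&total, &calls);
        std::hint::black_box(1 + 1); // stand-in for a tiny recogniser
    }
    assert_eq!(calls.get(), 1_000);
    println!("{} calls, {:?} attributed", calls.get(), total.get());
}
```

Every guarded call pays the same fixed overhead, which is why ratios between subsystems stay meaningful even when the absolute totals do not.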

Where to look in samply

If a corpus probe regresses, sample-profile the same workload:

AOZORA_CORPUS_ROOT=… just samply-corpus 5
samply load /tmp/aozora-corpus-<ts>.json.gz
# or
just trace-rollup /tmp/aozora-corpus-<ts>.json.gz

The trace-rollup analysis groups samples into aozora’s built-in categories (Phase 0/1/2/3 + corpus_load + intern + alloc + …) so a regression’s category jumps out at a glance.

Corpus sweeps

aozora’s tier-A acceptance gate is a corpus sweep: every Aozora Bunko work parses without panicking, and the parse ∘ serialize ∘ parse round-trip is stable. The corpus has ~17 000 works in active rotation; sweeping the lot takes ~90 s on a modern x86_64 desktop.

Setting up the corpus

AOZORA_CORPUS_ROOT should point at a directory containing the unpacked Aozora Bunko tarball:

$AOZORA_CORPUS_ROOT/
├── 000001/
│   └── files/
│       └── 18310_ruby_01058/
│           └── 18310_ruby_01058.txt   ← Shift_JIS .txt source
├── 000002/
│   └── files/
│       └── …
└── …

The structure mirrors the upstream aozorabunko repo. Set the env var once in your shell:

export AOZORA_CORPUS_ROOT=/path/to/aozorabunko

Every probe, every sample-profile recipe, and the corpus sweep test suite reads it.

Running the sweep

just corpus-sweep

The recipe wraps the aozora-corpus crate’s ParallelSweep runner. It iterates over every .txt file under $AOZORA_CORPUS_ROOT, parses it, and verifies:

  • No panic.
  • tree.diagnostics() count is within an expected envelope.
  • parse(serialize(parse(source))) == parse(source) (round-trip property).
  • Render emits valid UTF-8 HTML (no broken byte sequences).

On failure, the sweep prints the offending document path and diagnostic, then exits non-zero.

Why blake3 / zstd for the archive variant?

aozora-corpus ships an archive mode: the corpus packed into a single .zst file with a blake3 manifest. This is what CI uses (the corpus is downloaded once per workflow run and unpacked in-memory).

  • blake3 for per-entry content-addressed hashing. Used so the archive packer can detect “this work hasn’t changed since the last build” and skip re-encoding it. blake3 over sha256: ~10× faster on the same data, no security trade-off for our use case (we’re not signing anything, just diffing).
  • zstd for compression. Frame-level random access matters because the ParallelSweep runner wants to mmap individual works on demand without decompressing the whole archive. zstd over gzip / xz: 5–10× faster decompression at comparable ratios.

Both are mainstream crates with safe Rust APIs (the underlying libzstd is C, but the boundary is hidden behind the zstd crate’s safe API).

Why parallel sweep?

A serial sweep parses every work one after another, so its wall-clock time is the sum of all per-document parse times; on a 16-core machine that leaves up to 16× of throughput unused. The ParallelSweep runner uses rayon to parse documents in parallel, sized to physical cores via num_cpus::get_physical(), not logical cores.

The reason is memory bandwidth. The parser is bandwidth-bound, not ALU-bound (the SIMD scanner streams the source through L1 once per trigger byte, then the lexer touches each token a few more times). SMT siblings starve each other for cache lines and bus bandwidth, so oversubscribing logical cores actively slows the sweep. Sized to physical, the throughput peaks where the bandwidth ceiling does.

posix_fadvise(POSIX_FADV_DONTNEED) for honest cold-cache numbers

The xtask corpus uncache command evicts every corpus file from the kernel page cache before a measurement run:

cargo run -p aozora-xtask --release -- corpus uncache

It uses posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) per file, so no sudo is required (unlike echo 1 > /proc/sys/vm/drop_caches, which needs root and drops the page cache for the whole system rather than just the corpus files).

Why this matters: a “fresh” benchmark run that finds the corpus already warm in the page cache reports throughput numbers that no cold start can ever achieve. The uncache step makes “cold benchmark” a real, repeatable thing.

Probes that go corpus-wide

| Probe | What |
|---|---|
| throughput_by_class | Per-band MB/s for lex_into_arena. Splits the corpus by document size (small / medium / large / huge). |
| phase_breakdown | Per-phase ms per doc. |
| latency_histogram | Log-bucketed latency distribution per phase. |
| diagnostic_distribution | What fraction of docs emit diagnostics? Histogram by diag count. |
| allocator_pressure | Arena bytes / source byte ratio + intern dedup ratio. |
| render_hot_path | Per-band render MB/s. |

See Benchmarks for the full list.

Why a dedicated aozora-corpus crate?

Three concerns kept apart from aozora-bench:

  1. Corpus discovery and loading. Walking the directory, decoding Shift_JIS, applying any per-work filters. This is shared by every probe + by the xtask corpus pack/unpack tooling.
  2. Archive format. The blake3 + zstd packing/unpacking lives here so the bench harness doesn’t pull in compression libraries.
  3. Parallel sweep runner. A reusable wrapper around rayon’s parallel iterators with the right ordering (largest documents first to balance load).

aozora-bench then builds on this — each probe is a thin for doc in corpus { measure(doc) } loop, with the corpus crate handling all the I/O.
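The largest-first ordering mentioned for the sweep runner can be sketched without rayon. This example uses std scoped threads and a shared atomic cursor as a stand-in for the real rayon-based runner; the function name and signature are hypothetical, and the worker count is passed in rather than derived from the physical core count.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Process `docs` on `workers` threads, handing out the largest documents
/// first so a huge work never lands at the tail of the schedule.
/// Returns (document index, measured value) pairs in arbitrary order.
fn sweep_largest_first<F>(docs: &[String], workers: usize, measure: F) -> Vec<(usize, usize)>
where
    F: Fn(&str) -> usize + Sync,
{
    // Order indices by descending byte length.
    let mut order: Vec<usize> = (0..docs.len()).collect();
    order.sort_by_key(|&i| std::cmp::Reverse(docs[i].len()));

    let next = AtomicUsize::new(0);
    let mut results = Vec::with_capacity(docs.len());
    thread::scope(|scope| {
        let handles: Vec<_> = (0..workers.max(1))
            .map(|_| {
                scope.spawn(|| {
                    let mut local = Vec::new();
                    // Each worker claims the next unprocessed slot.
                    loop {
                        let slot = next.fetch_add(1, Ordering::Relaxed);
                        let Some(&i) = order.get(slot) else { break };
                        local.push((i, measure(docs[i].as_str())));
                    }
                    local
                })
            })
            .collect();
        for handle in handles {
            results.extend(handle.join().expect("worker panicked"));
        }
    });
    results
}
```

Starting with the biggest documents means the short ones fill in the gaps at the end, which keeps all workers busy until the sweep finishes.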

Why a separate AOZORA_PROFILE_REPEAT?

samply traces of probes that include corpus loading get dominated by I/O and Shift_JIS decode (see Profiling with samply). Running the parse pass K times per document after the one-time load gives samply enough parse-bound wall time to catch the parser hot frames. Default K = 5; raise to 10+ for very small corpora.
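A minimal sketch of the repeat logic follows. The default of 5 comes from the text; falling back to the default on an unparsable or zero value is an assumption of this example, and both function names are hypothetical.

```rust
/// Parse a raw AOZORA_PROFILE_REPEAT value, falling back to the documented
/// default of 5 when the variable is unset, unparsable, or zero.
/// The fallback-on-garbage policy is an assumption of this sketch.
fn profile_repeat(raw: Option<&str>) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .filter(|&k| k > 0)
        .unwrap_or(5)
}

/// Run `parse` K times over an already-loaded document so a profiler sees
/// parse-bound time rather than I/O. Returns the number of passes executed.
fn repeat_parse<F: FnMut(&str)>(doc: &str, k: usize, mut parse: F) -> usize {
    for _ in 0..k {
        parse(doc);
    }
    k
}
```

A probe would call profile_repeat(std::env::var("AOZORA_PROFILE_REPEAT").ok().as_deref()) once, after the corpus load.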

Phase D — Sentinel enum + single-table registry

The single-table registry collapsed four per-kind sentinel position tables into one position-keyed EytzingerMap dispatched through a NodeRef enum. Before the refactor the registry held independent inline / block_leaf / block_open / block_close EytzingerMaps and Registry::node_at(pos) swept them in declaration order with four if let Some(...) = table.get(&pos) chains; the current shape is one binary search per lookup, with the variant tag carried on the entry itself.

Structural changes

old  : Registry { inline, block_leaf, block_open, block_close }   // 4× EytzingerMap
       node_at(pos) → 4-way if-let chain, ~4 binary searches worst-case

now  : Registry { table: EytzingerMap<u32, NodeRef<'src>> }       // 1× EytzingerMap
       node_at(pos) → one binary search, NodeRef variant tags the kind

Renderers (crates/aozora-render/src/html.rs, crates/aozora-render/src/serialize.rs) replaced the parallel 4-way if let Some(...) = registry.<kind>.get(...) chains with a single (Structural, NodeRef) cross-product match — the compiler now enforces variant coverage at the call site.
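The shape of the unified table can be sketched as follows. A sorted Vec with binary search stands in for the Eytzinger-layout map, the NodeRef variants are named after the four former tables, and the payloads are simplified to labels; none of this is the actual aozora definition.

```rust
/// Which kind of sentinel sits at a position. Variant names mirror the four
/// former tables; the payload is simplified to a label for the sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
enum NodeRef<'src> {
    Inline(&'src str),
    BlockLeaf(&'src str),
    BlockOpen(&'src str),
    BlockClose(&'src str),
}

/// Single position-keyed table. A sorted Vec searched with
/// `binary_search_by_key` stands in for the Eytzinger-layout map.
struct Registry<'src> {
    table: Vec<(u32, NodeRef<'src>)>,
}

impl<'src> Registry<'src> {
    fn new(mut entries: Vec<(u32, NodeRef<'src>)>) -> Self {
        entries.sort_by_key(|&(pos, _)| pos);
        Self { table: entries }
    }

    /// One binary search per lookup; the variant tag travels with the entry.
    fn node_at(&self, pos: u32) -> Option<NodeRef<'src>> {
        self.table
            .binary_search_by_key(&pos, |&(p, _)| p)
            .ok()
            .map(|i| self.table[i].1)
    }
}

/// Exhaustive match: adding a variant is a compile error here until handled.
fn describe(node: NodeRef<'_>) -> &'static str {
    match node {
        NodeRef::Inline(_) => "inline",
        NodeRef::BlockLeaf(_) => "block leaf",
        NodeRef::BlockOpen(_) => "block open",
        NodeRef::BlockClose(_) => "block close",
    }
}
```

Because describe has no wildcard arm, the compiler enforces the variant coverage that the renderers rely on.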

Expected runtime impact

Theoretical: per-lookup binary search count drops from ≤ 4 to 1. Render hot path is dominated by registry lookups inside the memchr2_iter loop in html::render_into (one lookup per PUA sentinel hit), so the savings scale with sentinel density. Aozora corpus profiling against the four-table layout showed registry lookups at ~12 % of render time on bouten-heavy documents. That figure is the upper bound on what the unified dispatch can recover: one binary search per lookup remains, so the realised saving is some fraction of it.

Measuring before / after

The repro recipe lives in perf/samply.md. Numerical comparisons against the previous release are produced as release-PR artefacts (the corpus-sweep run output in /tmp/aozora-corpus-<timestamp>.json.gz, plus the diff produced by xtask trace compare) and summarised in the CHANGELOG entry for the release that lands the change. Pinned numbers in this page would rot; the recipe + per-release artefact pair stays current without an editing step here.

CLI reference

Full reference for the aozora binary. For a guided tour, see CLI Quickstart.

Synopsis

aozora <SUBCOMMAND> [OPTIONS] [ARGS]

| Subcommand | What it does |
|---|---|
| check | Lex + report diagnostics. |
| fmt | Round-trip parse ∘ serialize (canonicalise). |
| render | Render to HTML on stdout. |
| pandoc | Project to a Pandoc AST (JSON, or pipe through pandoc). |
| kinds | Tabulate every NodeKind / PairKind / Severity / … wire tag. |
| schema | Print the JSON Schema for a wire envelope. |
| explain | Print short prose for a NodeKind tag. |

There are no global options beyond clap’s -h/--help and -V/--version; the input-shaping flags below are per-subcommand. All document subcommands accept - (or no path) to read stdin.

| Common flag | Subcommands | Effect |
|---|---|---|
| -E, --encoding {auto,utf8,sjis} | check / fmt / render / pandoc | Source encoding. Default auto: UTF-8 if the bytes are valid UTF-8, else Shift_JIS. |
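The auto rule is simple enough to sketch with the standard library alone. The enum and function names below are illustrative, not the CLI's internals, and the sketch only decides which decoder to use; it does not decode Shift_JIS.

```rust
/// Outcome of the `auto` rule: UTF-8 when the bytes validate, otherwise
/// Shift_JIS. Names are illustrative only.
#[derive(Debug, PartialEq)]
enum Detected {
    Utf8,
    ShiftJis,
}

fn detect_encoding(bytes: &[u8]) -> Detected {
    match std::str::from_utf8(bytes) {
        Ok(_) => Detected::Utf8,
        Err(_) => Detected::ShiftJis,
    }
}
```

Pure-ASCII input is valid UTF-8, so it takes the UTF-8 path under this rule.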

Colour follows the terminal and the NO_COLOR environment variable (miette honours it); there is no --no-color flag.

aozora check

aozora check [OPTIONS] [PATH]

Lex the source and report diagnostics. PATH of - (or omitted) reads from stdin.

| Option | Effect |
|---|---|
| --strict, -s | Exit non-zero (1) on any diagnostic. |
| --encoding, -E | Source encoding (see above). |
| --diagnostic-format {human,json,short} | How to render diagnostics. When omitted, the format is chosen automatically: human when stderr is a terminal, json when piped. |

The three formats:

  • human — a graphical miette report: the source line, a caret under the span, the label, the help, and a link to the diagnostics catalogue.
  • json — the aozora::wire diagnostics envelope, byte-identical to every other binding. The machine / agent path (the default when piped).
  • short — one grep-able line: path:offset: severity[code]: msg.

Exit codes: 0 (parse succeeded; diagnostics may have been printed but were tolerated), 1 (--strict and at least one diagnostic), 2 (usage error), 3 (an Internal-source diagnostic fired — a library bug, not bad input; please report it).
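The exit-code rules for check can be written down as a small decision function. This is a sketch of the documented mapping only: the function name is hypothetical, usage errors (2) are assumed to be raised before parsing and so do not appear, and giving the internal-bug code precedence over --strict is an assumption.

```rust
/// Map the outcome of a `check` run to its documented exit code.
/// Assumption: an Internal-source diagnostic (3) takes precedence over the
/// `--strict` failure (1). Usage errors (2) happen before this point.
fn check_exit_code(strict: bool, diagnostics: usize, internal_bug: bool) -> i32 {
    if internal_bug {
        3
    } else if strict && diagnostics > 0 {
        1
    } else {
        0
    }
}
```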

aozora check src.txt                       # human on a TTY, json when piped
aozora check --strict src.txt              # any diagnostic -> exit 1
aozora check -E sjis crime.txt             # Shift_JIS source
aozora check --diagnostic-format short -   # one line per diagnostic, from stdin
cat src.txt | aozora check                 # stdin input; json envelope only when stderr is not a terminal

aozora fmt

aozora fmt [OPTIONS] [PATH]

Round-trip the source through parse ∘ serialize. Default prints the canonical form on stdout.

| Option | Effect |
|---|---|
| --check | Exit non-zero if the formatted output differs from the input (after Phase 0 sanitize: BOM strip, CRLF→LF). Mutually exclusive with --write. |
| --write | Overwrite the input file with the canonical form. Ignored when reading from stdin. |
| --encoding, -E | Source encoding (see above). |

Exit codes: 0 (success, or no diff under --check), 1 (formatting mismatch under --check), 2 (usage error).
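The --check comparison can be sketched as below. The sanitize step follows the two operations named in the option table (BOM strip, CRLF to LF); anything else Phase 0 may do is not modelled, and both function names are hypothetical.

```rust
/// Phase 0 sanitize as described in the option table: strip a leading BOM
/// and normalise CRLF to LF. Other sanitize steps, if any, are not modelled.
fn sanitize(input: &str) -> String {
    input
        .strip_prefix('\u{feff}')
        .unwrap_or(input)
        .replace("\r\n", "\n")
}

/// `fmt --check` decision: compare the canonical form against the sanitized
/// input and map the result to the documented exit code (0 or 1).
fn fmt_check_exit(input: &str, canonical: &str) -> i32 {
    if sanitize(input) == canonical {
        0
    } else {
        1
    }
}
```

Comparing against the sanitized input means a file that differs only by a BOM or Windows line endings still passes the gate.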

aozora fmt src.txt > formatted.txt
aozora fmt --check src.txt                 # CI gate
aozora fmt --write src.txt                 # in-place
cat src.txt | aozora fmt                    # stdin -> stdout

aozora render

aozora render [OPTIONS] [PATH]

Render the parsed tree to HTML on stdout. Accepts --encoding/-E.

aozora render src.txt > out.html
aozora render -E sjis crime.txt > crime.html
cat src.txt | aozora render -

The output is semantic HTML5 with aozora-* class hooks (no inline styles). See HTML renderer for the class-name reference.

aozora pandoc

aozora pandoc [OPTIONS] [PATH]

Project the parsed document to a Pandoc AST. Without --format/-t, prints Pandoc JSON to stdout (consumable by pandoc -f json -t …); with --format, spawns pandoc and pipes the JSON through it. Accepts --encoding/-E.

aozora pandoc src.txt | pandoc -f json -t epub3 -o out.epub
aozora pandoc src.txt -t latex > src.tex          # spawns pandoc directly

See Bindings → Pandoc.

Introspection subcommands

kinds, schema {diagnostics|nodes|pairs|container-pairs}, and explain <tag> print typed contracts and need no input file. They back the drift-gated wire artefacts; see Wire format.

Exit codes

| Code | Meaning |
|---|---|
| 0 | Success. |
| 1 | Diagnostics under --strict, or a formatting mismatch under fmt --check, or a spawned tool (pandoc) exited non-zero. |
| 2 | Usage error (bad flag, unreadable file, decode failure). |
| 3 | An Internal-source diagnostic fired during check; a library bug. |

Environment

| Variable | Effect |
|---|---|
| NO_COLOR | If set (any value), disable ANSI colour in diagnostics output. |
| AOZORA_LOG | tracing-subscriber filter (e.g. aozora_pipeline=debug). Internal debugging; not part of the stable surface. |

See Reference → Environment variables for the full matrix.

API reference (rustdoc)

The full rustdoc surface for every crate in the workspace is auto-deployed alongside this handbook. Browse it at:

https://p4suta.github.io/aozora/api/aozora/

The landing page redirects to the top-level facade (aozora); from there every workspace crate is reachable via the side panel.

Why /api/ instead of docs.rs?

aozora is on crates.io (since v0.4.1), so docs.rs/aozora hosts the released API reference. We also build and deploy the full rustdoc under /api/ on every main push: the in-tree copy tracks the development tip, ahead of whatever the latest crates.io release renders on docs.rs, and presents the umbrella plus every building-block crate as one cross-linked set.

Read docs.rs for the version you depend on; use the /api/ mirror here when you need unreleased main.

Layout

| Path | What |
|---|---|
| /aozora/ (this site) | Handbook (this mdbook) |
| /aozora/api/aozora/ | Public facade crate |
| /aozora/api/aozora_pipeline/ | Four-phase lexer + lex_into_arena orchestrator |
| /aozora/api/aozora_render/ | HTML / serialise renderers |
| /aozora/api/aozora_syntax/ | AST node types |
| /aozora/api/aozora_spec/ | Shared types + SLUGS dispatch table |
| /aozora/api/aozora_scan/ | SIMD trigger scanner |
| /aozora/api/aozora_veb/ | Eytzinger sorted-set |
| /aozora/api/aozora_encoding/ | SJIS + 外字 |
| /aozora/api/aozora_cst/ | rowan-backed lossless CST |
| /aozora/api/aozora_query/ | tree-sitter-flavoured pattern DSL |
| /aozora/api/aozora_pandoc/ | Pandoc AST projection |
| /aozora/api/aozora_cli/ | CLI binary internals |
| /aozora/api/aozora_ffi/ | C ABI driver |
| /aozora/api/aozora_wasm/ | WASM driver |
| /aozora/api/aozora_py/ | Python binding |
| /aozora/api/aozora_bench/ | Bench probes |
| /aozora/api/aozora_conformance/ | Conformance fixture runner |
| /aozora/api/aozora_corpus/ | Corpus runner |
| /aozora/api/aozora_proptest/ | Proptest strategies |
| /aozora/api/aozora_trace/ | Samply trace loader |
| /aozora/api/aozora_xtask/ | Dev tooling |

The workspace [workspace.lints.rustdoc] block denies every documentation lint:

  • broken_intra_doc_links = "deny" — every [name] link in a doc comment must resolve.
  • private_intra_doc_links = "deny" — links to pub(crate) items flagged so the public docs don’t dangle into private structures.
  • invalid_codeblock_attributes = "deny" — typos in ```rust,no_run style attributes get caught.
  • invalid_html_tags = "deny" — accidental <foo> in prose flagged.
  • invalid_rust_codeblocks = "deny" — every ```rust block must parse as Rust.
  • bare_urls = "deny" — links must be <https://...> or [label](url), not bare URLs (which markdown parses inconsistently).
  • redundant_explicit_links = "deny": [x](x) is flagged where the autolink form would do.
  • unescaped_backticks = "deny" — stray backticks flagged.

Every workspace-internal pub item that lands in rustdoc is verified by cargo doc --workspace --no-deps running with RUSTDOCFLAGS=-D warnings.

Local rustdoc build

just doc                        # workspace-wide rustdoc (no deps)
just doc-open                   # rustdoc + open in default browser

Both run inside the dev container; output lands at target/doc/aozora/index.html.

Building this handbook

just book-build                 # render to crates/aozora-book/book/
just book-serve                 # live-preview at localhost:3000
just book-linkcheck             # lychee link verification

See Contributing → Development loop for the full toolchain.

Environment variables

A central reference for every env var aozora reads. Variables fall into three groups: parser configuration, dev / bench harness, and container plumbing.

Parser configuration

| Variable | Read by | Effect |
|---|---|---|
| NO_COLOR | aozora-cli | If set (any value), disable ANSI colour output. The CLI has no --no-color flag, so this variable is the only switch. Standard convention from https://no-color.org. |
| AOZORA_LOG | aozora-cli, library opt-in | tracing-subscriber filter directive (e.g. aozora_pipeline=debug,aozora_render=info). For internal debugging; not part of the stable surface. |

Dev / bench harness

| Variable | Read by | Effect |
|---|---|---|
| AOZORA_CORPUS_ROOT | aozora-corpus, every probe, every sample-profile recipe, the corpus sweep | Directory of 青空文庫 source files (UTF-8 or Shift_JIS). Required for any corpus-driven operation. |
| AOZORA_PROFILE_LIMIT | aozora-bench probes | Cap the number of corpus documents per probe. Useful for fast iteration; set to 100 for a sub-second sweep. |
| AOZORA_PROFILE_REPEAT | samply-corpus, samply-render | Number of parse / render passes per document after the one-time corpus load. Default 5; raise to give samply enough parser-bound wall time to attach to. |
| AOZORA_PROBE_DOC | pathological_probe | Single corpus path to probe in tight per-call mode. Path is relative to $AOZORA_CORPUS_ROOT. |
| AOZORA_PROPTEST_CASES | aozora-proptest::config | Override default proptest case count (default 128 per block). 4096 for just prop-deep. |

Container plumbing

These are set by docker-compose.yml and don’t need manual handling unless you’re invoking cargo directly outside the dev container.

| Variable | Set by | Purpose |
|---|---|---|
| CARGO_HOME | compose | /workspace/.cargo: registry + git deps cached on a named volume. |
| CARGO_TARGET_DIR | compose | /workspace/target: build output cached on a named volume. |
| RUSTC_WRAPPER | compose | sccache: compile cache. |
| SCCACHE_DIR | compose | /workspace/.sccache: sccache backing store on a named volume. |
| SCCACHE_CACHE_SIZE | compose | 10G: default cap. |
| CARGO_INCREMENTAL | compose | 0: incremental compile defeats sccache; turning it off lets sccache cache the very crates we build most often. |
| RUST_BACKTRACE | compose | 1: backtraces enabled on panic. |
| GIT_CONFIG_* | compose | Whitelists /workspace for git’s “dubious ownership” check (the bind-mounted host source is owned by a non-root UID; the container runs as root). |

Variables we deliberately do not read

A few standard variables aozora intentionally ignores:

| Variable | Why ignored |
|---|---|
| LANG / LC_ALL | aozora handles its own encoding via --encoding. Locale-driven byte interpretation would make the parser non-reproducible across machines. |
| RUSTFLAGS (in non-build context) | The release / bench / PGO profiles set their own flags; per-invocation RUSTFLAGS would defeat sccache hits for unrelated crates. |
| CARGO_BUILD_JOBS | Cargo’s default (CPU count) is what we want. Overriding usually fights the bench harness’s own parallelism control. |

Conformance suite

aozora ships a WPT-style conformance corpus so other implementations of the Aozora Bunko notation (the tree-sitter reference grammar, third-party ports, alternate parsers in other languages) can measure their adherence against the same set of cases the Rust parser is held to.

Tier model

| Level | Meaning | Effect on xtask conformance run |
|---|---|---|
| must | Required for any conforming implementation. | A failure here exits non-zero. |
| should | Recommended but not strictly required. | A failure here logs a warning. |
| may | Optional; implementations decide. | Pure information; never fails. |

The tier is declared per case in crates/aozora-conformance/fixtures/render/<case>/meta.toml alongside a feature tag (ruby, bouten, composite, recovery, …). The runner aggregates pass / fail counts by (feature, level).
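The aggregation and exit rule can be sketched as follows. The types and the summarise function are hypothetical stand-ins for the runner's internals; only the documented behaviour (counts keyed by feature and level, non-zero exit on a failing must case) is modelled.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Level {
    Must,
    Should,
    May,
}

/// One executed fixture. Field names are assumptions for the sketch.
struct CaseResult<'a> {
    feature: &'a str,
    level: Level,
    passed: bool,
}

/// Aggregate (pass, fail) counts by (feature, level) and decide the exit
/// code: only a failing `must` case makes the run exit non-zero.
fn summarise<'a>(
    results: &[CaseResult<'a>],
) -> (BTreeMap<(&'a str, Level), (u32, u32)>, i32) {
    let mut counts = BTreeMap::new();
    let mut exit = 0;
    for r in results {
        let entry = counts.entry((r.feature, r.level)).or_insert((0u32, 0u32));
        if r.passed {
            entry.0 += 1;
        } else {
            entry.1 += 1;
            if r.level == Level::Must {
                exit = 1;
            }
        }
    }
    (counts, exit)
}
```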

Running

just conformance               # full suite, exits non-zero on must-fail
just render-gate               # the byte-identical render gate, K3-style
xtask conformance run          # invoke the runner directly

A successful run also writes crates/aozora-book/src/conformance-results.json with per-case detail. The JSON shape is stable; downstream dashboards / shields parse it.

What gets compared

The runner pins six axes per fixture:

  1. tree.to_html() byte-identical to expected.html.
  2. tree.serialize() byte-identical to expected.serialize.txt.
  3. aozora::wire::serialize_diagnostics(tree.diagnostics()) byte-identical to expected.diagnostics.json.
  4. aozora::wire::serialize_nodes(&tree) byte-identical to expected.nodes.json.
  5. aozora::wire::serialize_pairs(&tree) byte-identical to expected.pairs.json.
  6. aozora::wire::serialize_container_pairs(&tree) byte-identical to expected.container_pairs.json.

Axes 1–2 anchor the human-readable surface; axes 3–6 pin the JSON projections that drivers (FFI / WASM / PyO3) consume in production, so a regression that survives the renderer gate but breaks a wire client lights up here.

All six goldens regenerate via UPDATE_GOLDEN=1 cargo test -p aozora-conformance --test render_gate after intentional output changes.
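A byte-identical golden check with an update switch might look like the sketch below. check_golden and update_requested are hypothetical names, and treating only the value 1 as "update" is an assumption based on the command shown above.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Compare `actual` against the golden file byte for byte. When `update` is
/// true (the UPDATE_GOLDEN=1 case), rewrite the golden instead of failing.
fn check_golden(golden: &Path, actual: &[u8], update: bool) -> io::Result<bool> {
    if update {
        fs::write(golden, actual)?;
        return Ok(true);
    }
    Ok(fs::read(golden)? == actual)
}

/// Read the switch from the environment.
/// Assumption: only the exact value "1" enables regeneration.
fn update_requested() -> bool {
    std::env::var("UPDATE_GOLDEN").map_or(false, |v| v == "1")
}
```

Comparing raw bytes rather than decoded strings is what makes the gate sensitive to whitespace and escaping changes.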

Implementations

The runner currently targets a single implementation: the Rust parser itself. The conformance-results.json format carries an implementation field so external runs can append their own results without disturbing the canonical Rust pass-rate.

AST query DSL

A tree-sitter-flavoured pattern DSL selects nodes / tokens from the concrete syntax tree. Editor surfaces (LSP textDocument/documentHighlight, “find all ruby annotations”, refactoring filters, syntax-aware search) compose against the DSL instead of re-implementing tree walks.

The DSL ships behind the query Cargo feature on the aozora crate; that feature also enables cst since queries run against SyntaxNode.

Quickstart

use aozora::Document;
use aozora::query::compile;

let doc = Document::new("|青梅《おうめ》と|青空《あおぞら》");
let cst = aozora::cst::from_tree(&doc.parse());
let query = compile("(Construct @ruby)").expect("compile");
for capture in query.captures(&cst) {
    println!("{} -> {:?}", capture.name, capture.node);
}

Grammar

query   := pattern ('\n' pattern)* '\n'?
pattern := '(' kind capture? ')'
         | '(' '_'  capture? ')'
kind    := SyntaxKind ident      // e.g. `Construct`, `Container`
capture := '@' ident
ident   := [A-Za-z_][A-Za-z0-9_-]*
  • (Construct) — match every Construct node.
  • (Construct @ruby) — capture each Construct under the name ruby.
  • (_) — match any kind (node or token).
  • (_ @any) — combined; tour every kind in preorder.
  • Multiple patterns separated by newlines run as an OR — every matching node yields one Capture per pattern that hits.
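To make the grammar concrete, here is a small standalone parser for it. It is a sketch written against the grammar as printed above, not the aozora::query implementation; Pattern, is_ident, and compile_sketch are names invented for the example, and kinds are only checked for identifier shape, not against the real SyntaxKind set.

```rust
/// One compiled pattern. A `None` kind is the `_` wildcard.
#[derive(Debug, PartialEq)]
struct Pattern {
    kind: Option<String>,
    capture: Option<String>,
}

/// ident := [A-Za-z_][A-Za-z0-9_-]*
fn is_ident(s: &str) -> bool {
    let mut chars = s.chars();
    matches!(chars.next(), Some(c) if c.is_ascii_alphabetic() || c == '_')
        && chars.all(|c| c.is_ascii_alphanumeric() || c == '_' || c == '-')
}

/// Parse a newline-separated list of `(kind @capture)` patterns.
fn compile_sketch(query: &str) -> Result<Vec<Pattern>, String> {
    let mut patterns = Vec::new();
    for line in query.lines().map(str::trim).filter(|l| !l.is_empty()) {
        let inner = line
            .strip_prefix('(')
            .and_then(|rest| rest.strip_suffix(')'))
            .ok_or_else(|| format!("expected a parenthesised pattern: {}", line))?;
        let mut parts = inner.split_whitespace();
        let kind = match parts.next() {
            Some("_") => None,
            Some(k) if is_ident(k) => Some(k.to_owned()),
            _ => return Err(format!("bad kind in: {}", line)),
        };
        let capture = match parts.next() {
            None => None,
            Some(c) => match c.strip_prefix('@') {
                Some(name) if is_ident(name) => Some(name.to_owned()),
                _ => return Err(format!("bad capture in: {}", line)),
            },
        };
        if parts.next().is_some() {
            return Err(format!("trailing input in: {}", line));
        }
        patterns.push(Pattern { kind, capture });
    }
    Ok(patterns)
}
```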

Execution model

The DSL compiles once into a Vec<Pattern>; the engine then tests every pattern at every preorder step (O(nodes × patterns)). The small capture-only surface keeps the implementation tight while the predicate / field-access / alternation extensions wait for a concrete consumer ask.
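The execution model can be illustrated on a toy tree. The Node type and the tuple-based pattern representation are assumptions for the sketch; the point is only the loop structure, where every pattern is tested at every preorder step.

```rust
/// Toy tree node standing in for a CST node.
struct Node {
    kind: &'static str,
    children: Vec<Node>,
}

/// Return (capture name, node kind) pairs in preorder. Each pattern is a
/// (kind, capture name) pair where a `None` kind is the `_` wildcard.
/// Every pattern is tested at every step: O(nodes × patterns).
fn captures<'p>(
    root: &Node,
    patterns: &[(Option<&'p str>, &'p str)],
) -> Vec<(&'p str, &'static str)> {
    let mut out = Vec::new();
    let mut stack = vec![root];
    while let Some(node) = stack.pop() {
        for (kind, name) in patterns {
            if kind.map_or(true, |k| k == node.kind) {
                out.push((*name, node.kind));
            }
        }
        // Push children reversed so the leftmost child is visited first.
        stack.extend(node.children.iter().rev());
    }
    out
}
```

With two patterns, a node that matches both yields two captures, which is the OR behaviour described for multi-pattern queries.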

Not yet supported

  • Predicates (#eq?, #match?) — the tree-sitter query language exposes per-capture filters. The DSL ships without them; consumers filter the resulting [Capture] vec in Rust.
  • Field accessors ((Container body: (Construct))) — the CST has no named fields yet.
  • Quantifiers ((...)?, (...)*, (...)+).
  • Alternation [...] between patterns.

These extensions are forward-compatible with the existing API shape (compile → captures); a future release can land them without breaking existing queries.

Wire format

aozora ships a stable JSON wire format used by every binding — aozora-ffi (C ABI), aozora-wasm (npm), aozora-py (PyO3) — to project the parser’s output across language boundaries. aozora::wire is the single authority for that projection; downstream drivers call into it and receive bit-identical output.

Envelope shape

Every wire JSON has the form

{ "schema_version": 1, "data": [ /* … entries … */ ] }

where schema_version is the major version of the wire contract and data is the per-endpoint payload array.
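As a minimal sketch of the envelope, the function below wraps already-serialised entries by hand. It assumes each entry is a valid JSON object, emits compact output (the real serializer's whitespace is not specified here), and uses a local SCHEMA_VERSION constant rather than the one from aozora::wire.

```rust
/// Local stand-in for the wire contract's major version.
const SCHEMA_VERSION: u32 = 1;

/// Wrap pre-serialised JSON entries in the envelope shape.
/// Assumption: every element of `entries` is already a valid JSON object.
fn envelope(entries: &[&str]) -> String {
    format!(
        "{{\"schema_version\":{},\"data\":[{}]}}",
        SCHEMA_VERSION,
        entries.join(",")
    )
}
```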

The four endpoint envelopes are:

| Endpoint | Entry shape | JSON Schema |
|---|---|---|
| serialize_diagnostics | { kind, severity, source, span, codepoint? } | schema-diagnostics.json |
| serialize_nodes | { kind, span: { start, end } } | schema-nodes.json |
| serialize_pairs | { kind, open: { start, end }, close: { … } } | schema-pairs.json |
| serialize_container_pairs | { kind, open: { offset }, close: { offset } } | schema-container-pairs.json |

SCHEMA_VERSION

The schema_version integer (aozora::wire::SCHEMA_VERSION) bumps on any breaking change to the serialised shape: variant additions that surface as a new kind value, field renames, envelope restructuring. Clients should branch on the version and handle unknown values defensively; schema 1 makes no forward-compatibility guarantees with later schemas.

Stability vs. non_exhaustive

Diagnostic and AozoraNode are #[non_exhaustive] — minor releases can add variants. The wire format protects callers in two ways:

  1. Unrecognised variants emit kind: "unknown" rather than failing to serialise, so an old client never sees parse-time data loss.
  2. SCHEMA_VERSION bumps when new variants ship in the wire surface, giving version-branching clients a chance to react before "unknown" shows up in production traffic.
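On the client side, the two protections translate into a version gate plus a catch-all variant. The sketch below is written from the consumer's point of view; the tag strings "ruby" and "bouten" are hypothetical examples, not confirmed wire tags, and the type names are invented.

```rust
/// Wire contract version this hypothetical client was written against.
const SUPPORTED_SCHEMA: u64 = 1;

#[derive(Debug, PartialEq)]
enum ClientKind {
    Ruby,
    Bouten,
    /// Anything this client does not recognise, including the literal "unknown".
    Unknown(String),
}

/// Decode a `kind` tag defensively: never fail on an unfamiliar value.
/// The tag spellings are assumptions for the sketch.
fn decode_kind(tag: &str) -> ClientKind {
    match tag {
        "ruby" => ClientKind::Ruby,
        "bouten" => ClientKind::Bouten,
        other => ClientKind::Unknown(other.to_owned()),
    }
}

/// Branch on the envelope's schema_version before touching `data`.
fn check_schema(version: u64) -> Result<(), String> {
    if version == SUPPORTED_SCHEMA {
        Ok(())
    } else {
        Err(format!(
            "unsupported schema_version {} (client supports {})",
            version, SUPPORTED_SCHEMA
        ))
    }
}
```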

Your first PR

Not every contribution touches the parser. A typo fix, a clarified sentence, a broken link — these are real, welcome PRs, and they ride a much lighter path than the add-a-notation TDD flow. This chapter is that lightweight path. For parser changes (a new 青空文庫 notation, a lexer phase, a renderer shape), follow the full TDD flow in Development loop → Adding a new 青空文庫 notation instead.

Before your first commit: environment setup

If this is a brand-new checkout, get the environment standing first:

just setup            # one-shot first-time environment bootstrap
# or, equivalently:
./bootstrap

That builds the dev image and installs the lefthook git hooks. If hooks ever stop firing later, re-run just hooks (see Troubleshooting → Hooks not firing).

The lightweight path (a doc / typo fix)

  1. Pick a small fix. A typo, a stale link, an unclear sentence in the handbook (crates/aozora-book/src/…) or a top-level doc. Keep it to a single logical change.

  2. Branch. main is branch-protected — never commit to it directly.

    git switch -c docs/fix-ruby-example-typo
    
  3. Edit the .md. Use only your editor here; no parser code is involved, so there’s nothing to compile.

  4. Verify locally. Two quick gates cover doc-only changes:

    just typos            # spelling across the tree
    just book-build       # the mdbook handbook still builds
    

    just book-build catches broken intra-book links and bad Markdown that a typo check won’t. Both run inside the dev container, matching what CI runs.

  5. Commit with a signed, Conventional Commit. Doc changes take the docs: type:

    git commit -m "docs: fix ruby example in the notation chapter"
    

    Both requirements are enforced by hooks: the commit-msg hook rejects a non-Conventional subject, and the signing layers reject an unsigned commit. Scope is optional for cross-cutting doc edits; use one when the change is crate-local (e.g. docs(render): …). See Conventional commits for the accepted types.

  6. Push and open a PR. The PR title mirrors the commit subject (docs: …). The PR template walks the checklist — keep it. CI re-runs the same gates you ran locally.

If a commit is rejected for signing, or a hook misbehaves, jump to Troubleshooting & gate recovery.

The inner loop (while you iterate)

For anything beyond a one-line fix, run a watcher in a second terminal so feedback is continuous instead of per-commit:

just watch            # default check job — recompiles on save
just watch-lint       # fmt + clippy on save
just watch-test       # nextest on save

The watcher runs inside the dev container, so it detects saves against the bind-mounted source. See Development loop → Watch mode for the in-watcher keybindings.

How this differs from a parser change

A doc fix is intentionally cheap. A parser change is not: it lands a failing test first, then the fix, and extends every test layer the new shape touches. The contrast is deliberate.

| | Doc / typo fix | New notation / parser change |
|---|---|---|
| Touches | A .md file | Spec fixture → AST → lexer → renderer → invariants |
| Verify | just typos, just book-build | just test, just prop, just coverage |
| Commit type | docs: | feat: / fix: / perf: |
| TDD | Not applicable | Red test first, then green (required) |

Both paths share the same two hard rules: signed commits and Conventional Commits. Everything else scales with the size of the change.

Development loop

aozora’s development workflow is built around three rules:

  1. Docker-only execution. The host toolchain is never invoked.
  2. just is the entry point. Every operation goes through a just recipe that wraps the underlying tool inside the dev container.
  3. Lint gates run automatically. lefthook installs git hooks that run fmt + clippy + typos pre-commit, and pre-push runs the full local CI gate suite plus a deep property sweep before every push (signed-commit check first), so a passing local commit roughly mirrors a passing CI run.

First-time setup

git clone git@github.com:P4suta/aozora.git
cd aozora
docker compose build dev        # ~5 min the first time, cached afterwards
just hooks                      # install lefthook git hooks
just test                       # confirm green

Daily loop

just shell                      # drop into the dev container
just build                      # cargo build --workspace --all-targets
just test                       # workspace nextest
just lint                       # fmt + clippy + typos + strict-code
just prop                       # property-based sweep (128 cases / block)
just ci                         # full CI replica (lint + build + test + prop + deny + audit + udeps + coverage + book-build)

just --list enumerates everything available; just --list --unsorted preserves the topical grouping (build → test → lint → deps → bench → docs → release → dev-helpers).

Watch mode (bacon)

just watch                      # default `check` job
just watch clippy
just watch test

Inside bacon: t test, c clippy, d doc, f failing-only, esc previous job, q quit, Ctrl-J list jobs. The watcher runs inside the dev container so file change detection works against the bind-mounted source.

For headless usage (no TTY, e.g. piping to tee):

just watch-headless check       # plain output, no TUI

Why Docker for everything?

Three reasons.

  1. Toolchain reproducibility. The dev image pins rust:1.95.0-bookworm plus exact versions of cargo-nextest, cargo-llvm-cov, cargo-deny, cargo-audit, cargo-udeps, cargo-semver-checks, cargo-fuzz, mdbook, mdbook-mermaid, lychee, git-cliff, bacon, and lefthook. A fresh checkout on any machine produces identical tool behaviour.
  2. sccache hits. The compose file mounts a named volume at /workspace/.sccache and sets RUSTC_WRAPPER=sccache. Across sessions and across branches, the cache stays warm.
  3. Host insulation. Nothing in the workspace touches ~/.cargo, ~/.rustup, or any global state. Removing the project means docker compose down -v && rm -rf aozora/.

The two exceptions to Docker-only:

  • samply profiling. perf_event_open(2) doesn’t survive the container seccomp profile; the samply-* recipes invoke the host toolchain (see Profiling with samply).
  • Release builds. GitHub Actions runners build the release binaries natively per OS (the cross-target binary needs to match its runner OS exactly).

Editor / IDE setup

The repository includes a .devcontainer/ config, so:

  • VS Code with Dev Containers extension — “Reopen in Container” picks up the dev image, the rust-analyzer toolchain, and the aozora-* workspace at once. No host-side rust install needed.
  • Anything else — point your editor’s rust-analyzer at the dev container via docker exec. The cleanest approach is symlinking target/ from the named volume to a host-visible path; the alternative is the editor’s own remote-LSP support.

sccache stats

After a build cycle, check that the cache is actually warm:

just sccache-stats

Healthy steady state: 80%+ hit rate during normal iteration. A sub-50% hit rate usually means RUSTC_WRAPPER got defeated — the likely culprit is a stray env override or an [env] in .cargo/config.toml. To reset counters before a measurement window:

just sccache-zero && just clean && just build && just sccache-stats

Pre-commit hooks (lefthook)

lefthook.yml configures:

  • pre-commit (parallel): fmt, clippy, typos.
  • commit-msg: Conventional Commits regex.
  • pre-push: the full local CI gate suite plus a deep property sweep before every push (the signed-commit check runs first).

The hooks shell into docker compose run --rm dev … so they’re identical to the just recipes you ran manually. To skip a hook temporarily, push from the dev container’s shell directly (the hooks attach to the host git, not the container’s git).

Why lefthook over husky / pre-commit / cargo-husky?

  • husky — Node-only ecosystem; would force a Node dep into a Rust workspace.
  • pre-commit (Python framework) — Python-only ecosystem; same issue inverted.
  • cargo-husky — abandoned upstream.
  • lefthook — single Go binary, language-neutral, parallel execution, ships from a small upstream that’s actively maintained. Mainstream choice for polyglot Rust workspaces in 2026.

Conventional commits

The commit-msg hook enforces:

<type>(<scope>): <subject>

Where <type> is one of feat | fix | docs | style | refactor | perf | test | build | ci | chore | revert, and <scope> is typically a crate name without the aozora- prefix (e.g. feat(render): add aozora-tcy class hook).
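The subject-line shape can be checked without a regex engine. This is a sketch of the rule as described in this chapter, with the scope treated as optional; it is not the hook's actual regex, and is_conventional is a hypothetical name.

```rust
/// Accepted commit types, as listed above.
const TYPES: [&str; 11] = [
    "feat", "fix", "docs", "style", "refactor", "perf", "test", "build", "ci", "chore", "revert",
];

/// Validate `<type>(<scope>): <subject>` with an optional scope.
/// A sketch of the documented rule, not the hook's real regex.
fn is_conventional(subject_line: &str) -> bool {
    let Some((head, subject)) = subject_line.split_once(": ") else {
        return false;
    };
    if subject.trim().is_empty() {
        return false;
    }
    let (ty, scope) = match head.split_once('(') {
        Some((ty, rest)) => match rest.strip_suffix(')') {
            Some(scope) => (ty, Some(scope)),
            None => return false,
        },
        None => (head, None),
    };
    TYPES.contains(&ty) && scope.map_or(true, |s| !s.is_empty())
}
```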

git-cliff turns these into the CHANGELOG on release.

Adding a new 青空文庫 notation

End-to-end TDD flow:

  1. Conformance fixture. Add a source + expected.* golden under crates/aozora-conformance/fixtures/render/ (and, for a normative case, a spec vector in ../aozora-notation-spec, synced via just sync-spec-vectors).
  2. AST variant. Add a borrowed-arena variant to AozoraNode in crates/aozora-syntax/src/borrowed.rs.
  3. Lexer test (red). Add a case to the relevant phase test under crates/aozora-pipeline/tests/.
  4. Lexer impl (green). Wire the recogniser into the appropriate phase (sanitize → events → pair → classify).
  5. Renderer. Emit the new HTML shape in crates/aozora-render/src/html.rs and the canonical serialisation in crates/aozora-render/src/serialize.rs.
  6. Cross-layer invariants. Extend the property test or corpus predicate that the new shape interacts with (escape-safety, round-trip, span well-formedness).

Testing strategy

aozora targets C1 100% branch coverage as a goal — but coverage is the floor, not the ceiling. Every invariant is asserted from multiple angles so a single missed test path doesn’t silently hide a regression.

The five test layers

flowchart TD
    A["1. Conformance suite<br/>(crates/aozora-conformance/)"]
    B["2. Property tests<br/>(crates/*/tests/property_*.rs)"]
    C["3. Corpus sweep<br/>(every Aozora Bunko work)"]
    D["4. Fuzz harness<br/>(cargo-fuzz)"]
    E["5. Sanitizers<br/>(Miri / TSan / ASan)"]

    A --> B --> C --> D --> E

Each layer catches a different kind of bug:

| Layer | Catches |
|---|---|
| Conformance suite | Per-feature contract regressions: render goldens + spec vectors. |
| Property tests | Invariant violations in the space of inputs (round-trip, escape-safety, span well-formedness). |
| Corpus sweep | Real-world distribution effects the property generator missed. |
| Fuzz | Latent panics on adversarial inputs the corpus doesn’t contain. |
| Sanitizers | UB / data race / heap-corruption issues the language can’t catch. |

When you add a new invariant, land all five touchpoints in the same PR, or split them into a chain of PRs that explicitly references the invariant.

Layer 1: conformance suite

The aozora-conformance crate is the per-feature contract layer, with two CI-gated halves:

  • Render fixtures: crates/aozora-conformance/fixtures/render/<case>/ pins a source plus expected.html / expected.nodes / expected.pairs goldens. just render-gate asserts byte-identical output; just render-gate-update (UPDATE_GOLDEN=1) refreshes the goldens after an intentional change.
  • Spec vectors: crates/aozora-conformance/spec-vectors/vectors/<case>/vector.json pins a (source, expected.{html,serialize,nodes,pairs,diagnostics}) tuple. The specification repo (../aozora-notation-spec) is the single source of truth: vectors are vendored via just sync-spec-vectors, just verify-spec-vectors guards the copy against drift, and just conformance runs them across must / should / may tiers.

Both halves enumerate their fixture directories automatically, so a new case is picked up without editing a manual list. The romaji CSS slugs the fixtures assert are themselves centralised in aozora-spec::RENDER_SLUGS and machine-checked against their kana reading, so a misread slug fails a unit test before it ever reaches a fixture.
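The golden-file mechanics behind the render gate (byte-identical comparison, with a refresh mode behind UPDATE_GOLDEN=1) can be sketched roughly as follows. This is not the actual gate implementation; `check_golden` and `update_requested` are hypothetical helpers, and only the environment variable name comes from the text above.

```rust
use std::{env, fs, path::Path};

/// True when the caller asked for goldens to be refreshed.
pub fn update_requested() -> bool {
    env::var("UPDATE_GOLDEN").map(|v| v == "1").unwrap_or(false)
}

/// Compare `actual` with the golden file at `path`, byte for byte.
/// In update mode the golden is rewritten instead of compared.
/// Hypothetical helper, not the real render-gate code.
pub fn check_golden(path: &Path, actual: &[u8], update: bool) -> Result<(), String> {
    if update {
        return fs::write(path, actual)
            .map_err(|e| format!("cannot write {}: {e}", path.display()));
    }
    let expected =
        fs::read(path).map_err(|e| format!("cannot read {}: {e}", path.display()))?;
    if expected.as_slice() == actual {
        return Ok(());
    }
    // Report the first differing offset; if one side is a prefix of the
    // other, that is the length of the shorter one.
    let offset = expected
        .iter()
        .zip(actual)
        .position(|(a, b)| a != b)
        .unwrap_or_else(|| expected.len().min(actual.len()));
    Err(format!("{} differs from golden at byte {offset}", path.display()))
}
```

A fixture runner would call `check_golden(path, rendered.as_bytes(), update_requested())` once per expected.* file in a case directory.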

The flagship corpus fixture lives at spec/aozora/fixtures/56656/ — the Japanese translation of Crime and Punishment (Aozora Bunko card 56656). It exercises 1000+ ruby annotations, forward-reference bouten, JIS X 0213 gaiji, and accent decomposition edge cases.

Layer 2: property tests

proptest generators in crates/aozora-proptest drive parse / render / round-trip invariants. The default is 128 cases per proptest! block (CI budget); just prop-deep runs 4096 per block (release-cut budget).

just prop                       # 128 cases
just prop-deep                  # 4096 cases
AOZORA_PROPTEST_CASES=10000 cargo nextest run --workspace --test 'property_*'

Why proptest over quickcheck:

  • Proptest’s shrinker is structural (reduces by the generator’s ops), so a counterexample collapses to a minimal reproduction that still fails. Quickcheck shrinks per-type, which produces noisier outputs.
  • Proptest persists failure seeds to proptest-regressions/ — every reproduced failure becomes a permanent regression test. Quickcheck has nothing like this.

Why a separate generator crate (aozora-proptest):

The generators are non-trivial (they have to produce valid 青空文庫 source — random byte streams would just stress the parser’s error path, which the fuzz harness already covers). Centralising them means every property test in every crate gets the same generator quality, and the generator itself can be unit-tested.
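To make the shape of such a property concrete, here is a dependency-free sketch of an escape-safety plus round-trip check driven by a generator biased toward interesting characters. It is a hand-rolled stand-in, not the aozora-proptest generators: the real suite uses proptest, which adds shrinking and seed persistence on top. `XorShift`, `gen_source`, `escape_html`, and `unescape_html` are all hypothetical names.

```rust
/// Tiny deterministic PRNG (xorshift64) standing in for proptest's runner.
struct XorShift(u64);

impl XorShift {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Alphabet biased toward HTML-significant and notation characters.
const ALPHABET: [char; 10] = ['<', '>', '&', '"', '|', '《', '》', '青', '空', 'a'];

/// Generate a short string (0 to 15 characters) from the alphabet.
fn gen_source(rng: &mut XorShift) -> String {
    let len = (rng.next_u64() % 16) as usize;
    (0..len)
        .map(|_| ALPHABET[(rng.next_u64() % ALPHABET.len() as u64) as usize])
        .collect()
}

/// Hypothetical renderer helper: escape the HTML-significant characters.
pub fn escape_html(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            _ => out.push(c),
        }
    }
    out
}

/// Inverse of `escape_html`; `&amp;` is handled last so that an escaped
/// ampersand is not mistaken for the start of another entity.
pub fn unescape_html(s: &str) -> String {
    s.replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&amp;", "&")
}

/// Run `cases` generated inputs; return the first counterexample, if any.
pub fn check_escape_properties(seed: u64, cases: usize) -> Result<(), String> {
    let mut rng = XorShift(seed.max(1)); // xorshift must not start at zero
    for _ in 0..cases {
        let src = gen_source(&mut rng);
        let html = escape_html(&src);
        // Escape-safety: no raw markup characters survive.
        if html.contains(|c: char| matches!(c, '<' | '>' | '"')) {
            return Err(src);
        }
        // Round-trip: unescaping restores the source exactly.
        if unescape_html(&html) != src {
            return Err(src);
        }
    }
    Ok(())
}
```

Unlike proptest, this loop returns the raw failing input without minimising it, which is one reason a structural shrinker matters.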

Layer 3: corpus sweep

export AOZORA_CORPUS_ROOT=$HOME/aozora-corpus
just corpus-sweep

The sweep walks every .txt under $AOZORA_CORPUS_ROOT, parses it, and verifies that the round-trip property holds and that nothing panics. ~17 000 works are in active rotation; a sweep takes ~90 s on a modern x86_64 desktop using the parallel loader.
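The walk itself can be sketched with the standard library alone. This is a sequential illustration under stated assumptions, not the parallel loader: `collect_txt` and `sweep` are hypothetical names, and the per-file `check` closure stands in for the parse and round-trip check. Files are read as raw bytes because corpus sources are not guaranteed to be UTF-8.

```rust
use std::{
    fs, io,
    path::{Path, PathBuf},
};

/// Recursively collect every `.txt` file under `root`, sorted so that
/// repeated sweeps visit works in the same order.
pub fn collect_txt(root: &Path) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    let mut stack = vec![root.to_path_buf()];
    while let Some(dir) = stack.pop() {
        for entry in fs::read_dir(&dir)? {
            let path = entry?.path();
            if path.is_dir() {
                stack.push(path);
            } else if path.extension().map_or(false, |ext| ext == "txt") {
                found.push(path);
            }
        }
    }
    found.sort();
    Ok(found)
}

/// Run `check` on every work and collect failures rather than stopping
/// at the first one, so a single sweep reports everything that broke.
pub fn sweep<F>(root: &Path, mut check: F) -> io::Result<Vec<(PathBuf, String)>>
where
    F: FnMut(&[u8]) -> Result<(), String>,
{
    let mut failures = Vec::new();
    for path in collect_txt(root)? {
        let bytes = fs::read(&path)?;
        if let Err(why) = check(&bytes) {
            failures.push((path, why));
        }
    }
    Ok(failures)
}
```

Collecting failures instead of aborting keeps one odd work from hiding the rest of the report.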

The sweep catches what the property generator can’t — every weird real-world idiom the maintained corpus has accumulated over 25 years of volunteer encoding choices. It’s the parser’s truth-from-the-field.

See Performance → Corpus sweeps for the corpus structure, archive format, and parallel loader details.

Layer 4: fuzz

just fuzz parse_render -- -runs=10000

Targets under crates/*/fuzz/fuzz_targets/:

  • parse_render: feed arbitrary bytes through Document::new ∘ to_html.
  • serialize_roundtrip: parse ∘ serialize ∘ parse stability.
  • sjis_decode: aozora_encoding::sjis::decode_to_string on arbitrary byte streams.

Fuzz failures auto-shrink to a minimal byte sequence and land in crates/<crate>/fuzz/artifacts/. Add the failing input to crates/aozora-conformance/fixtures/render/ as a regression case after diagnosing.

Why libFuzzer / cargo-fuzz:

Mainstream Rust fuzzing runs on libFuzzer via cargo-fuzz; it has the broadest crate-ecosystem support (most upstream crates ship fuzz targets), the corpus-management tooling is mature, and the crash artefacts are diff-able with git diff.

Layer 5: sanitizers

bash scripts/sanitizers.sh miri      # UB on FFI / scan intrinsics
bash scripts/sanitizers.sh tsan      # data races (parallel corpus loader)
bash scripts/sanitizers.sh asan      # heap correctness

Sanitizer runs are slower (~10× under Miri) so they don’t run on every PR — they’re nightly via the dev-image cron in CI, plus release-cut. The slow path catches the slow-class of bugs.

Why all three:

  • Miri catches undefined behaviour the compiler couldn’t see (out-of-bounds slice access, dangling references, transmute mismatches). The FFI driver and the SIMD scanner have unsafe surfaces; Miri is the only fully-checked oracle for them.
  • TSan catches race conditions in the parallel corpus loader. We use rayon correctly as far as we know, but TSan is the backstop.
  • ASan catches the small set of heap-correctness bugs that get through Miri (typically C-side issues in the FFI smoke test).

Coverage measurement

just coverage           # cargo llvm-cov region coverage; CI gate
just coverage-html      # local HTML report at coverage/html/index.html
just coverage-branch    # nightly toolchain, branch-coverage detail

cargo llvm-cov over tarpaulin: tarpaulin is x86_64-linux only and uses ptrace-based instrumentation that misses some optimised-out branches. llvm-cov uses LLVM’s source-based coverage instrumentation — works on every target and gives accurate branch numbers.

The CI gate is region coverage; branch coverage is informational (it requires the nightly compiler, which the workspace doesn’t pin on the hot path).

Test naming and structure

  • Unit tests in mod tests {} at the bottom of each module.
  • Integration tests in crates/<crate>/tests/. One file per area (e.g. tests/lexer_phase0.rs, tests/lexer_phase3.rs).
  • Property tests prefixed property_ (the prop recipe globs on this).
  • Doc tests inside ```rust blocks in rustdoc comments. CI runs just test-doc separately because nextest skips them.

Snapshot testing

Where the output is a multi-line string that’s tedious to inline (rendered HTML, diagnostic-formatted text), we use insta:

insta::assert_snapshot!(tree.to_html());

The first run writes tests/snapshots/<test>.snap; subsequent runs compare against it. Updates happen via cargo insta review (the interactive UI inside the dev container), never by manually editing the .snap file.

Troubleshooting & gate recovery

Most first-run friction is environmental, not code. This chapter collects the failures people actually hit on a fresh checkout and the shortest path back to a green tree. If you haven’t set up the environment yet, start with Development loop → First-time setup; if a commit is being rejected for signing, see Your first PR.

First-run failures

Docker daemon not running

Every just target shells into the dev container via docker compose run …, so a stopped daemon makes all of them fail at once — usually with Cannot connect to the Docker daemon at unix:///var/run/docker.sock.

Start (or restart) Docker Desktop / the docker service and re-run the target. Nothing in aozora runs on the host toolchain; the daemon is a hard prerequisite for build, test, lint, and ci alike.

Disk full / image build fails midway

docker compose build dev pulls a rust:*-bookworm base and layers a pinned toolchain on top. A build that dies partway through — or a no space left on device from just build — almost always means the Docker volume is out of room, not a Dockerfile bug.

docker system prune          # reclaim dangling images / layers / build cache

Keep roughly 5 GB free for the dev image plus the named cargo / sccache volumes. After pruning, re-run docker compose build dev; the layer cache resumes from the last good step.

Commit signing fails

Signed commits are mandatory. If a commit is rejected — or the post-commit re-amend rolls your commit back because the signer was unavailable — your SSH/GPG signing key isn’t reachable from the container’s git context. Walk through the signing setup in CONTRIBUTING.md → First-time setup and confirm git config commit.gpgsign is true with a configured user.signingkey.

This is the three-layer defense working as designed: a post-commit re-amend, the signing-check pre-push command (scripts/check-signed-commits.sh), and GitHub’s “require signed commits” ruleset. Do not weaken any layer — the redundancy is intentional. Fix the key, don’t disable the gate.

Hooks not firing

If fmt / clippy / typos aren’t running on commit, or the signing re-amend never happens, lefthook isn’t installed for this clone. Hooks live in .git/hooks/, which is per-clone and never committed, so a fresh checkout always needs:

just hooks                   # (re)install the lefthook git hooks

Re-run it any time hooks go quiet (e.g. after git init-level surgery or switching the hooks path).

Reading lefthook output

Lefthook prints one icon per command in its post-run summary. The non-obvious one:

  • 🥊 is a failure, not decoration. Lefthook falls back to its branding glyph when the underlying tooling (a docker compose run, or a multi-step just recipe with background jobs) buries the real exit status. Treat 🥊 exactly like a plain failure mark and scroll up: the actual error line is in the command output above the summary, not in the summary itself.

Each command in lefthook.yml also carries a fail_text: hint naming the recipe responsible, so a failing push prints both the raw output and a pointer at what to fix.

When a gate fails

just ci runs the full pipeline; the pre-push hook runs the same jobs plus a deep property sweep. When one trips, this table maps the symptom to its recovery recipe:

| Gate | Symptom | Recovery |
| --- | --- | --- |
| coverage | Region coverage below the floor | just coverage-html, open coverage/html/index.html, add tests for the uncovered regions |
| clippy / fmt | cargo fmt --check diff or clippy denial | just fmt to auto-format, fix any clippy findings, then re-run just lint |
| drift-gate (schema) | wire JSON Schema is stale | just schema to regenerate, then commit the diff |
| drift-gate (types) | TypeScript .d.ts drift | just types to regenerate, then commit the diff |
| drift-gate (langs) | Generated host-SDK wire types are stale | just types-langs to regenerate, then commit the diff |
| typos | Spelling hit | just typos to see every hit; fix, or add a genuine term to typos.toml |
| deny / audit | License / advisory failure | Read the captured log under /tmp (the recipe writes the full cargo deny / cargo audit output there), then update the dependency or the deny.toml exception |

For the schema / types gates, the regenerate-then-commit step is the fix — the gate only checks that the committed artefact matches what the generator would emit, so a stale checkout fails until you regenerate and stage it. See Wire format for what wire / .d.ts / langs each cover.

Escape hatches

Two exist, and they are not interchangeable:

  • SKIP_TAGS=deep git push — the narrow hatch. Skips only the tagged deep command (the 4096-case prop-deep sweep) while leaving signing-check, ci, and everything else in force. Use this when a deep-sweep regression is unrelated to your change — and file an issue against the failing crate so it doesn’t stay hidden.
  • LEFTHOOK=0 — the nuclear hatch. Disables all hooks, including signing-check. An unsigned commit pushed this way is rejected server-side by the ruleset anyway, so you gain nothing but a later, more confusing failure. Avoid it. Reach for SKIP_TAGS instead.

Release process

aozora releases are git-tag-driven: push an annotated v<semver> tag, and .github/workflows/release.yml builds the cross-platform binaries, generates release notes from Conventional Commits, and publishes the GitHub Release.

Cutting a release

# 1. Pre-flight (everything green locally)
just ci                          # lint + build + test + prop + deny + audit + udeps + coverage + book-build
just prop-deep                   # 4096 cases per proptest block
AOZORA_CORPUS_ROOT=… just corpus-sweep
just smoke-py                    # host-side: abi3 wheel build + mypy + pytest (not in `just ci`)

# 2. Bump workspace version
cargo set-version --workspace 0.2.7
git commit -am "chore(release): bump workspace to v0.2.7"

# 3. Refresh CHANGELOG (Unreleased → version)
just changelog                   # runs git-cliff with --unreleased --prepend
git add CHANGELOG.md && git commit -m "docs: refresh CHANGELOG for v0.2.7"

# 4. Tag (annotated)
git tag -a v0.2.7 -m "v0.2.7"
git push origin main v0.2.7

release.yml reacts to the tag: builds release binaries on three runners (linux x86_64, macOS arm64, windows x86_64), assembles tarballs / zips with the aozora binary + LICENSE-MIT + LICENSE-APACHE + NOTICE + README.md, and publishes the archives plus SHA256SUMS to the GitHub Release.

Sanity check after release

# Verify checksums
curl -L -O https://github.com/P4suta/aozora/releases/download/v0.2.7/SHA256SUMS
curl -L -O https://github.com/P4suta/aozora/releases/download/v0.2.7/aozora-v0.2.7-x86_64-unknown-linux-gnu.tar.gz
sha256sum --check SHA256SUMS

# Verify the binary
tar -xzf aozora-v0.2.7-*.tar.gz
./aozora --version              # prints "aozora 0.2.7"

Why annotated tags?

git tag -a creates a tag object with a message; git tag alone creates a lightweight tag (a bare ref). git-cliff’s release note extraction only walks annotated tags, and the standard ecosystem expectation (cargo-release, cargo-dist) is that release tags are annotated. Using lightweight tags would silently break the changelog generator.

Why git-tag-driven, not branch-driven?

A release/v0.2.7 branch model is the alternative. We don’t use it because:

  • Single-author workflow doesn’t benefit from the parallel-tracks model that branch-driven releases enable.
  • An annotated tag is the release artefact — anything you need to retroactively understand about a release lives in git show v0.2.7. A branch loses that locality.
  • Rollback is git tag -d + delete the GitHub release. Trivial.

CHANGELOG generation

git-cliff consumes Conventional Commits and produces Keep-a-Changelog formatted output:

just changelog          # incremental: --unreleased --prepend CHANGELOG.md
just changelog-full     # rebuild from scratch

cliff.toml configures the grouping:

| Commit type | Section in CHANGELOG |
| --- | --- |
| feat: | Added |
| fix: | Fixed |
| perf: | Performance |
| refactor: | Changed |
| docs: | Documentation |
| test: | Tests |
| build: | Build |
| ci: | CI |
| chore: | (skipped unless scope is release) |
| revert: | Reverted |

Non-conventional commits are silently skipped (they survive in git log but don’t pollute the changelog).
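The grouping table reads naturally as a single match. The sketch below mirrors it in Rust for illustration only; the real mapping lives in cliff.toml, and `Routing` and `route` are hypothetical names. Two assumptions are made: style is not listed in the table, so it is treated as skipped here, and the table does not say which section chore(release) lands in, so it gets its own variant rather than a guessed section name.

```rust
/// Where a commit ends up in the CHANGELOG. Hypothetical type.
#[derive(Debug, PartialEq, Eq)]
pub enum Routing {
    /// Listed under the named CHANGELOG section.
    Section(&'static str),
    /// `chore(release)`: kept, but the table does not name its section.
    ReleaseChore,
    /// Left out of the changelog.
    Skipped,
}

/// Route a commit by its Conventional Commit type and optional scope.
pub fn route(kind: &str, scope: Option<&str>) -> Routing {
    match (kind, scope) {
        ("feat", _) => Routing::Section("Added"),
        ("fix", _) => Routing::Section("Fixed"),
        ("perf", _) => Routing::Section("Performance"),
        ("refactor", _) => Routing::Section("Changed"),
        ("docs", _) => Routing::Section("Documentation"),
        ("test", _) => Routing::Section("Tests"),
        ("build", _) => Routing::Section("Build"),
        ("ci", _) => Routing::Section("CI"),
        ("revert", _) => Routing::Section("Reverted"),
        // chore is skipped unless its scope is `release`.
        ("chore", Some("release")) => Routing::ReleaseChore,
        // Everything else, including non-conventional types and (by
        // assumption) `style`, stays out of the changelog.
        _ => Routing::Skipped,
    }
}
```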

Why --unreleased --prepend over -o CHANGELOG.md:

The full-rebuild form (-o) regenerates the entire changelog from git history every time, which churns the diff for past releases even when nothing about them changed (whitespace, footer formatting). The incremental form only writes the new “Unreleased” section between the latest release and HEAD, leaving past entries byte-stable.

Why three release targets and not five?

The CI matrix builds:

  • x86_64-unknown-linux-gnu (linux x86_64)
  • aarch64-apple-darwin (macOS arm64)
  • x86_64-pc-windows-msvc (windows x86_64)

We don’t build x86_64-apple-darwin (macOS Intel — Apple deprecated the platform; arm64 covers all current Apple Silicon machines) or aarch64-unknown-linux-gnu (linux arm64 — covered by cargo install from source for the niche ARM Linux deployment case).

Adding a target is one line in release.yml; we add them when a real consumer asks for a binary build of one. Pre-emptive coverage isn’t worth the CI minutes.

Why not cargo-dist / release-plz?

Both are mainstream choices; we use a hand-written release.yml because:

  • cargo-dist is opinionated about archive layout (assumes you ship bin/ + share/); aozora’s archive is flat (aozora + LICENSE-* + NOTICE + README.md).
  • release-plz automates the version-bump + PR flow; for a single-author repo the manual cargo set-version + git tag is two commands and one fewer integration to debug.

When the workspace grows past three release targets or aozora goes multi-author, both will be worth re-evaluating.

Pre-1.0 SemVer

aozora is currently in the 0.x series. The contract:

  • 0.x.y → 0.x.y+1: patches and additions, no breaks. Always safe to upgrade.
  • 0.x.y → 0.x+1.0: may break the API. cargo-semver-checks flags the breaks during CI; the version-bump commit references the break in its body.
  • 0.x.y → 1.0.0: the API freeze. Post-1.0, breaking changes collect on a next branch and ship in a major bump.

The MSRV pin (rust-toolchain.toml) advances on its own cadence, roughly quarterly. MSRV bumps are not breaking under our pre-1.0 contract — consumers that need a frozen MSRV pin a release tag.

When you raise the MSRV, bump the Dockerfile FROM rust: base in the same commit so the dev image keeps building on exactly the pinned channel (one toolchain, no dead second one). Dependabot deliberately ignores the rust base image (.github/dependabot.yml) precisely so it cannot drift ahead of rust-toolchain.toml, so this base bump is manual. Resolve the new digest with docker buildx imagetools inspect rust:<ver>-bookworm.

Publishing to crates.io

Live since v0.4.1. The whole workspace publishes through the manual .github/workflows/publish-crates.yml workflow:

gh workflow run publish-crates.yml -f dry_run=false

It runs cargo publish --workspace (cargo 1.90+), which publishes every publishable member in topological order — aozora-encoding / aozora-spec first, aozora and aozora-cli last — and waits for crates.io index propagation between dependent crates itself. Members marked publish = false (aozora-corpus, aozora-conformance, aozora-bench, aozora-trace, aozora-xtask, plus the aozora-wasm / aozora-ffi / aozora-py drivers that ship through npm / GitHub Releases / PyPI) are skipped automatically.
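Cargo computes the publish order itself, so nothing like the following exists in the workflow. Purely as an illustration of what "topological order" means here, this sketch runs Kahn's algorithm over a dependency map. `publish_order` is a hypothetical function, any dependency edges fed to it in an example are assumptions rather than the workspace's real graph, and the sketch assumes every dependency also appears as a key and is listed at most once.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Order crates so that each one comes after all of its dependencies
/// (Kahn's algorithm). Returns `None` if the graph contains a cycle or
/// names a dependency that is not itself a key of `deps`.
pub fn publish_order<'a>(deps: &BTreeMap<&'a str, Vec<&'a str>>) -> Option<Vec<&'a str>> {
    // Number of not-yet-published dependencies per crate.
    let mut pending: BTreeMap<&str, usize> =
        deps.iter().map(|(name, list)| (*name, list.len())).collect();
    // Crates with nothing pending, kept sorted for a stable result.
    let mut ready: BTreeSet<&str> = pending
        .iter()
        .filter(|(_, count)| **count == 0)
        .map(|(name, _)| *name)
        .collect();

    let mut order = Vec::with_capacity(deps.len());
    while let Some(next) = ready.pop_first() {
        order.push(next);
        // Publishing `next` unblocks every crate that depends on it.
        for (krate, list) in deps {
            if list.contains(&next) {
                let count = pending.get_mut(krate)?;
                *count -= 1;
                if *count == 0 {
                    ready.insert(*krate);
                }
            }
        }
    }
    // Crates left unpublished mean a cycle or an unknown dependency.
    (order.len() == deps.len()).then_some(order)
}
```

With leaf crates such as aozora-encoding and aozora-spec having no in-workspace dependencies, they come out first, and a crate that everything else feeds into comes out last.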

The default dry_run: true runs cargo publish --workspace --dry-run only — a safe metadata gate that succeeds even on a first publish because --workspace resolves intra-workspace deps locally. A live run needs the CARGO_TOKEN repo secret populated with a crates.io API token carrying both the publish-new and publish-update scopes (the first run creates brand-new crates).

Single front door, still. The parser is built from many internal crates (aozora-spec, aozora-syntax, aozora-pipeline, aozora-render, aozora-encoding, aozora-scan, aozora-veb, plus aozora-cst / aozora-query / aozora-proptest). They are now on crates.io so the umbrella aozora crate can depend on them, but they carry no API-stability contract — their crate descriptions say so, and downstream consumers should depend on aozora alone.

Why we publish before v1.0

Earlier this was deferred to v1.0 (every pre-1.0 minor may break the API; a published name is load-bearing). We publish now because the crate boundary has stabilised and claiming the aozora* namespace is itself worth doing. The pre-1.0 SemVer contract above still holds — a 0.x → 0.x+1 bump may break the API and is flagged by cargo-semver-checks.

Publishing to npm and PyPI

The browser (WASM) and Python drivers ship through their own manual workflows, same dry_run: true default as crates:

# npm — aozora-wasm (needs the NPM_TOKEN repo secret)
gh workflow run publish-npm.yml -f dry_run=false

# PyPI — aozora_py wheels (OIDC trusted publishing; no token secret)
gh workflow run publish-pypi.yml -f dry_run=false

publish-npm.yml builds the package with wasm-pack build --target web --release and npm publishes crates/aozora-wasm/pkg/. publish-pypi.yml builds one cp311-abi3 wheel per OS (pyo3 abi3-py311, so a single wheel covers CPython 3.11 → 3.14 and future 3.x — no per-Python-version matrix) plus an sdist, and uploads via PyPI trusted publishing (configure the project’s trusted publisher once, pointing at this repo + publish-pypi.yml). Run just smoke-py first. Linux aarch64, macOS universal2, and free-threaded (3.13t/3.14t, which abi3 cannot target) wheels are a future cibuildwheel addition.

Cut these from the same vX.Y.Z tag as the GitHub Release so every channel ships the same version. Run each workflow once with the default dry_run: true first and confirm it’s green before flipping to dry_run=false.

Code signing

Release binaries are not CA code-signed (no Authenticode on the Windows .exe, no Apple Developer ID / notarization on the macOS build). This is a deliberate pre-1.0 decision.

What we ship instead — and why it covers the current audience:

  • Build provenance attestation (actions/attest-build-provenance, since v0.4.0): every archive carries a Sigstore-backed SLSA provenance statement, verifiable with gh attestation verify <archive> --repo P4suta/aozora — no certificates, no CA. It proves which CI built which artefact from which source: a supply-chain control, not an OS-level execution-trust signal.
  • SHA256SUMS for integrity; signed git tags / commits for authorship.

CA code signing solves a different problem — suppressing the Windows SmartScreen / macOS Gatekeeper “unknown publisher” prompt for end users who double-click a downloaded binary. For a parser library + developer CLI installed via cargo install / package managers, that prompt is low-friction, so the recurring cost and operational overhead (HSM-stored keys mandatory since 2023-06; ≤458-day cert validity since 2026-03) is not justified yet.

When we revisit this (post-1.0, if desktop double-click installs become a real distribution path):

  • Windows → SignPath Foundation free OSS code signing (Sectigo-issued, HSM-backed, CI-integrated). Note the 2024 SmartScreen change: EV no longer buys instant trust; both OV and EV build reputation organically over downloads.
  • macOS → Apple Developer ID ($99/yr Apple Developer Program) + notarization. Third-party CA certs (e.g. ssl.com) do not satisfy Gatekeeper; only an Apple-issued Developer ID does.
  • A paid CA (ssl.com eSigner, etc.) was evaluated and rejected: it covers Windows only, no longer removes the first-run warning on day one, and adds a yearly cost the project does not need pre-1.0.

See also