Welcome
aozora is a pure-functional Rust parser for 青空文庫記法 (Aozora Bunko notation) — the in-text annotation language used by 青空文庫, the long-running volunteer digital library of Japanese literature in the public domain.
It handles ruby (|青梅《おうめ》), bouten / bousen
([#「X」に傍点]), 縦中横, gaiji references
(※[#…、第3水準1-85-54]), kunten / kaeriten, indent and align
containers ([#ここから2字下げ]… [#ここで字下げ終わり]), and
page / section breaks — every notation that appears in a real Aozora
Bunko .txt source.
The repository is CommonMark-free, Markdown-free: aozora deals only with the 青空文庫 notation. The renderer emits semantic HTML5; the lexer reports structured diagnostics; the AST is a borrowed-arena tree that can be walked in O(n) without copying source bytes. If you want a Markdown dialect that also understands aozora notation, see the sibling project afm, which is built on top of this parser.
What this handbook is for
A practical tour and a deep reference, in one document.
- Tour — install the CLI, drop the library into a Rust project, or call it from WASM, C, or Python.
- Notation reference — every annotation aozora recognises, with examples, output, edge cases, and the diagnostics that fire when authors get them subtly wrong.
- Architecture — what makes aozora fast and small: the borrowed-arena AST, the seven-phase lexer, the SIMD scanner backends (Teddy, structural bitmaps, Hoehrmann-style multi-pattern DFA), Eytzinger-layout sorted-set lookup, and the Shift_JIS + 外字 resolver. Every choice is motivated against the alternative we didn’t take.
- Performance — the release-profile decisions, PGO pipeline, samply workflow, criterion benchmarks, and the parallel corpus sweep that exercises the parser against every Aozora Bunko work.
- Reference & contributing — CLI, env vars, rustdoc API, and how the dev loop / TDD policy / release pipeline fit together.
Project shape
aozora is a single-author, green-field project that takes the opportunity to reach for the right algorithm and data structure for each problem rather than the obvious naive one. That orientation permeates every chapter — when you read about the scanner or the arena or the gaiji table, you’ll see the why of each technique spelled out, not just what the code does.
Status
v0.2.x working set. The CLI, Rust library, WASM, C ABI, and Python
binding all build and pass the integration smoke tests in CI. Public
crates.io publication is gated on the v1.0 API freeze; in the
meantime, depend on a tagged commit (see
Install).
A live build of this site lives at https://p4suta.github.io/aozora/; the rustdoc API reference is layered underneath at https://p4suta.github.io/aozora/api/aozora/.
Install
aozora ships in five shapes — pick the one that matches how you want to consume the parser.
CLI binary (release archive)
Pre-built aozora binaries for the three Tier-1 platforms ride on
every GitHub Release:
- aozora-vX.Y.Z-x86_64-unknown-linux-gnu.tar.gz
- aozora-vX.Y.Z-aarch64-apple-darwin.tar.gz
- aozora-vX.Y.Z-x86_64-pc-windows-msvc.zip
Each archive is shipped with a SHA256SUMS companion. Browse them at
https://github.com/P4suta/aozora/releases.
curl -L -O \
https://github.com/P4suta/aozora/releases/latest/download/aozora-x86_64-unknown-linux-gnu.tar.gz
tar -xzf aozora-*.tar.gz
sudo install -m 0755 aozora /usr/local/bin/
aozora --version
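The SHA256SUMS companion lets you verify the download before installing. The snippet below demonstrates the sha256sum -c step against a locally created stand-in file (the real archive name carries the release version and target triple):

```shell
# Demonstration of SHA256SUMS verification using a stand-in file;
# substitute the downloaded release archive in real use.
cd "$(mktemp -d)"
printf 'stand-in archive contents\n' > aozora.tar.gz
sha256sum aozora.tar.gz > SHA256SUMS
sha256sum -c SHA256SUMS   # prints "aozora.tar.gz: OK" when the hash matches
```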
CLI binary (build from source)
Cargo can build the CLI directly from the repository. The --locked
flag is non-negotiable — it pins to the exact Cargo.lock we shipped,
which matters because the workspace uses fat LTO (mismatched dep
versions silently change inlining behaviour).
Latest main (default — tracks the development tip):
cargo install --git https://github.com/P4suta/aozora --locked aozora-cli
Reproducible build pinned to a release tag (replace the tag with the current value from the releases page):
cargo install --git https://github.com/P4suta/aozora \
--tag v0.3.0 --locked aozora-cli
Rust library
aozora is not yet on crates.io — public release tracks the v1.0 API freeze. Until then, depend on a tagged commit. This snippet is the single source of truth for the recommended pin — every other doc links here instead of inlining the tag, so a new release only needs this one block updated:
[dependencies]
aozora = { git = "https://github.com/P4suta/aozora.git", tag = "v0.3.0" }
aozora-encoding = { git = "https://github.com/P4suta/aozora.git", tag = "v0.3.0" }
The current tag is whatever
GitHub Releases marks as
Latest; bump the two tag = "..." lines accordingly.
Ship-it pattern: pin the tag in Cargo.toml, let Dependabot bump it
on the next release. The repo follows Conventional Commits and
SemVer; breaking changes always advance the major version (post-1.0)
or the minor version (during 0.x).
WASM (browser / Node)
rustup target add wasm32-unknown-unknown # one-time
wasm-pack build --target web --release crates/aozora-wasm
The post-wasm-opt artifact has a 500 KiB size budget. See
Bindings → WASM for the JS surface and the
post-build wasm-opt invocation we recommend.
C ABI
cargo build --release -p aozora-ffi
# → target/release/libaozora_ffi.{so,dylib,a}
# → target/release/aozora.h (cbindgen-generated)
Link with -laozora_ffi and include aozora.h. See
Bindings → C ABI for the API surface and memory
ownership rules.
Python
pip install maturin # one-time
cd crates/aozora-py
maturin develop -F extension-module # install in current venv
maturin build -F extension-module --release # produce a redistributable wheel
See Bindings → Python for the API and the
unsendable thread-safety contract.
Toolchain pin
aozora pins Rust 1.95.0 as its MSRV (rust-toolchain.toml). CI
enforces it via a dedicated msrv job. If you run rustup show
inside the repo and see something else, your local override needs
updating.
CLI Quickstart
The aozora binary covers three operations:
aozora check FILE.txt # lex + report diagnostics on stderr
aozora fmt FILE.txt # round-trip parse ∘ serialize, print to stdout
aozora render FILE.txt # render to HTML on stdout
A lone - (or no path argument) reads from stdin. --encoding sjis (alias
-E sjis) decodes Shift_JIS source — Aozora Bunko’s distributed
.txt files are Shift_JIS, so this flag is the common case for real
corpus work.
Common invocations
# Lex an Aozora Bunko file and print diagnostics
aozora check -E sjis crime_and_punishment.txt
# Render to HTML (stdout)
aozora render -E sjis crime_and_punishment.txt > out.html
# Pipe from stdin
cat src.txt | aozora render -
# CI gate: fail if format is not idempotent
aozora fmt --check src.txt
Flag reference
| Flag | Subcommand | Effect |
|---|---|---|
| -E sjis, --encoding sjis | all | Decode Shift_JIS source. Default is UTF-8. |
| --strict | check | Exit non-zero on any diagnostic. |
| --check | fmt | Exit non-zero if formatted output differs from input. |
| --write | fmt | Overwrite the input file with the canonical form. (Ignored when reading from stdin.) |
| --no-color | all | Disable ANSI colour in diagnostics output. |
| --verbose | all | Print parse phase timings to stderr. |
Exit codes
| Code | Meaning |
|---|---|
| 0 | Success. |
| 1 | Diagnostics emitted under --strict, or formatting mismatch under --check. |
| 2 | Usage error (bad flag, missing file, decode error). |
Diagnostics format
aozora check prints diagnostics in
miette style — a coloured source snippet
with carets pointing at the byte range, a short message, and (where
applicable) a help line:
× ruby reading mismatch: target spans 3 chars but |《》 reading is empty
╭─[input.txt:42:9]
42 │ |青梅《》
· ───┬───
· ╰── empty reading
╰────
help: provide a reading inside 《…》 or remove the | marker
Every diagnostic carries a stable error code (E0001, E0002, …);
see the Diagnostics catalogue for the
full list.
Why not a single subcommand?
check / fmt / render are intentionally separate so each one has
a single, predictable failure mode in shell pipelines:
- check exits 0 on parse success, regardless of warnings (use --strict for “no diagnostics allowed”).
- fmt is a pure-text transform: stdin in, canonical text out. --check upgrades it to a CI gate without forking a second binary.
- render is a pure-text-to-HTML transform with the same exit-code shape.
Combining them behind flags would make the exit-code semantics
ambiguous (does --check mean format-check or strict-check?). Keeping
them split is the same logic that splits gofmt from vet from
go build.
Library Quickstart
The minimal Rust use of aozora is a handful of lines:
use aozora::Document;
fn main() {
let source = std::fs::read_to_string("src.txt").unwrap();
let doc = Document::new(source);
let tree = doc.parse();
println!("{}", tree.to_html());
}
That’s enough to get HTML out of any UTF-8 青空文庫 source. The rest of this page covers the lifetime model, the diagnostic stream, and the AST walk — three things you’ll need once you do anything beyond “render to HTML”.
The lifetime model
Document owns two things: a bumpalo::Bump
arena and the source Box<str>. AozoraTree<'a> borrows from both:
let doc = aozora::Document::new(source); // Document: 'static
let tree = doc.parse(); // AozoraTree<'_> bound to &doc
let html = tree.to_html(); // walks the borrow
// dropping doc releases every node in a single Bump::reset()
drop(doc);
That is: hand the Document around, not the tree. If you need
to keep a parse result alive across function boundaries, the function
takes ownership of (or borrows) the Document, and re-derives the
tree on the inside. This is unusual for Rust libraries — most parse
APIs hand back an owned tree — but it’s what makes aozora’s
zero-copy AST safe. See Architecture → Borrowed-arena AST
for why this trade is worth it.
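The shape of that contract can be sketched without the real crate. The stand-in Doc and View types below are hypothetical (the real types are Document and AozoraTree); they show why callers pass the owner around and re-derive the borrowed view inside:

```rust
// Minimal sketch of the owner-plus-borrowed-view pattern. `Doc` and
// `View` are hypothetical stand-ins, not the aozora API.
struct Doc {
    source: String, // stands in for the Bump arena + Box<str>
}

// The view borrows from the owner; its lifetime is tied to `&Doc`.
struct View<'a> {
    first_line: &'a str,
}

impl Doc {
    fn new(source: String) -> Self {
        Doc { source }
    }
    // Re-derive the borrowed view on demand -- cheap, no copying.
    fn parse(&self) -> View<'_> {
        View {
            first_line: self.source.lines().next().unwrap_or(""),
        }
    }
}

// Functions that need the parse result take the *owner*, not the view.
fn render(doc: &Doc) -> String {
    let view = doc.parse(); // derived inside the borrow
    format!("<p>{}</p>", view.first_line)
}

fn main() {
    let doc = Doc::new("青空文庫\n第二行".to_string());
    println!("{}", render(&doc));
    // `doc` dropped here; every borrowed view is already gone.
}
```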
Shift_JIS input
Aozora Bunko ships its corpus as Shift_JIS. Decode through
aozora-encoding first:
use aozora::Document;
use aozora_encoding::sjis;
let bytes = std::fs::read("src.sjis.txt")?;
let utf8 = sjis::decode_to_string(&bytes)?; // returns Cow<'_, str>
let doc = Document::new(utf8.into_owned());
let tree = doc.parse();
sjis::decode_to_string handles BOM stripping, JIS X 0213 codepoints,
and the Aozora-specific 外字 references that survive the decode pass
as private-use sentinels (resolved later in the parser).
Diagnostics
use aozora::Diagnostic;
let diags: &[Diagnostic] = tree.diagnostics();
for d in diags {
eprintln!("[{}] {} @ {}..{}", d.code, d.message, d.span.start, d.span.end);
}
Each Diagnostic carries a stable error code, a span, and a level.
Diagnostics are non-fatal by design: the parser always produces a
tree, even from malformed input. Callers that want strict behaviour
treat any diagnostic as an error themselves. See the
Diagnostics catalogue for the code list.
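Treating diagnostics as fatal is then a one-line policy decision in the caller. A minimal sketch, using hypothetical stand-in types rather than aozora's real Diagnostic:

```rust
// Hypothetical stand-ins for aozora's Diagnostic / level types; the
// strict-vs-lenient policy logic is the caller's either way.
#[derive(Clone, Copy, PartialEq)]
enum Level { Warning, Error }

struct Diagnostic {
    code: &'static str,
    level: Level,
}

// "Strict" callers: any diagnostic at all fails the run.
fn strict_ok(diags: &[Diagnostic]) -> bool {
    diags.is_empty()
}

// Lenient callers: only hard errors fail the run.
fn lenient_ok(diags: &[Diagnostic]) -> bool {
    diags.iter().all(|d| d.level != Level::Error)
}

fn main() {
    let diags = [Diagnostic { code: "E0001", level: Level::Warning }];
    println!("strict={} lenient={}", strict_ok(&diags), lenient_ok(&diags));
}
```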
Walking the AST
AozoraTree exposes a flat node iterator and a typed enum:
use aozora::AozoraNode;
for node in tree.nodes() {
match node {
AozoraNode::Plain(s) => print!("{s}"),
AozoraNode::Ruby(r) => print!("[ruby:{}={}]", r.target(), r.reading()),
AozoraNode::Bouten(b) => print!("[bouten {}]", b.kind().slug()),
AozoraNode::Tcy(t) => print!("[tcy:{}]", t.text()),
AozoraNode::Gaiji(g) => print!("[gaiji {}]", g.codepoint()),
AozoraNode::Container(c)=> { /* recurse into c.children() */ }
// …
}
}
For richer traversal patterns (visitor, fold, structural diff), the
nodes implement Copy (they’re effectively (tag, &str, &Bump-slice)
triples), so you can keep references around freely as long as the
Document lives.
Round-trip and canonicalisation
Every parse should round-trip:
let parsed = doc.parse();
let canonical: String = parsed.serialize();
assert_eq!(canonical, doc.source()); // for *canonical* input
Real Aozora Bunko sources contain stylistic variations (CRLF vs LF,
NFC vs NFD around accents, half-width vs full-width punctuation) that
the lexer normalises before tokenising. For those the assertion above
holds after aozora fmt has been applied once.
The pure round-trip property is what aozora fmt --check exercises in
CI, and what the corpus sweep verifies across the full Aozora Bunko
catalogue (~17 000 works).
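The property fmt --check relies on is idempotence: formatting already-canonical text is a fixed point. A toy sketch with one of the normalisations listed above (CRLF to LF):

```rust
// Toy normaliser covering one of the variations mentioned above
// (CRLF vs LF). The property `fmt --check` relies on is idempotence:
// formatting already-canonical input is a fixed point.
fn normalize(src: &str) -> String {
    src.replace("\r\n", "\n")
}

fn main() {
    let raw = "第一行\r\n第二行\r\n";
    let once = normalize(raw);
    let twice = normalize(&once);
    // Canonicalisation is idempotent: a second pass changes nothing.
    assert_eq!(once, twice);
    // Raw input round-trips only *after* one canonical pass.
    assert_ne!(raw, once);
    println!("idempotent: {}", once == twice);
}
```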
Where to next
- Notation reference for what each node type represents.
- Architecture → Pipeline overview for what happens between Document::new and Document::parse.
- API reference for the rustdoc-generated surface.
Node reference
aozora exposes 19 NodeKind variants. Each is documented
on its own page with source examples, the rendered HTML, the
serialize round-trip output, the in-memory AST shape, and the
diagnostics it can fire alongside.
The page layout matches the aozora explain <kind> CLI subcommand:
once you find the variant in the table, the deep dive is one click —
or one shell invocation — away.
| Variant | Wire tag | Notation |
|---|---|---|
| Ruby | ruby | |base《reading》 |
| Bouten | bouten | [#「target」に傍点] |
| TateChuYoko | tateChuYoko | [#「12」は縦中横] |
| Gaiji | gaiji | ※[#...、第3水準1-85-54] |
| Indent | indent | [#2字下げ] |
| AlignEnd | alignEnd | [#地から2字上げ] |
| Warichu | warichu | [#割り注]... |
| Keigakomi | keigakomi | [#罫囲み] |
| PageBreak | pageBreak | [#改ページ] |
| SectionBreak | sectionBreak | [#改丁] |
| AozoraHeading | heading | [#見出し] |
| HeadingHint | headingHint | [#「対象」は中見出し] |
| Sashie | sashie | [#挿絵(path.png)入る] |
| Kaeriten | kaeriten | [#返り点 一・二] |
| Annotation | annotation | [#任意のコメント] |
| DoubleRuby | doubleRuby | 《《重要》》 |
| Container | container | [#ここから...]...[#ここで...終わり] |
| ContainerOpen | containerOpen | (NodeRef projection) |
| ContainerClose | containerClose | (NodeRef projection) |
How to read these pages
Every node page follows the same skeleton:
| Section | Content |
|---|---|
| Source examples | One or two minimal Aozora-notation strings that produce this variant. |
| Rendered HTML | What Document::new(src).parse().to_html() emits. |
| Serialize output | What serialize() emits — typically the canonical form of the source. |
| AST shape | The borrowed-AST struct fields the variant carries. |
| When emitted | Phase 3 classification rule that produces this variant. |
| Diagnostics | Codes that may accompany this variant. |
| Related kinds | Cross-links to neighbours (Bouten ↔ Bousen, Indent ↔ Container::Indent, etc.). |
#[non_exhaustive] on NodeKind: a future minor release can add a
new variant without a breaking change. Downstream consumers that
match on NodeKind must therefore include a _ arm.
NodeKind::Ruby
Wire tag: ruby — base text + reading annotation. The most common
non-trivial variant in Aozora Bunko.
Source examples
|青梅《おうめ》
青梅《おうめ》
Both forms classify as Ruby; the leading | (U+FF5C) makes the
delimiter explicit and lets the parser disambiguate the base run
when ambiguous neighbours could otherwise extend the base.
Rendered HTML
<ruby>青梅<rp>(</rp><rt>おうめ</rt><rp>)</rp></ruby>
<rp> parens are emitted so HTML clients without ruby support
still display a readable fallback.
Serialize output
serialize() always emits the explicit-delimiter form
(|base《reading》), so a parse → serialize → parse round-trip is
a fixed point regardless of which form the source used.
AST shape
pub struct Ruby<'src> {
pub base: NonEmpty<Content<'src>>,
pub reading: NonEmpty<Content<'src>>,
pub delim_explicit: bool,
}
Both fields are NonEmpty<Content>;
empty base or reading is rejected upstream and never produces a
Ruby node.
When emitted
Phase 3 classifies a 《…》 pair as ruby when the preceding run is a
sequence of CJK / kana / latin glyphs and the close is followed by
neither a glyph (which would extend the base further) nor a stray
opener.
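A simplified sketch of that backward scan, restricted to the CJK-ideograph case (the real phase-3 classifier also accepts kana and latin runs and checks what follows the closing 》):

```rust
// Simplified sketch of the implicit-base heuristic: scan backwards
// from the 《 over a contiguous kanji run to find the ruby base.
// Illustrative only -- this handles the CJK-ideograph case and
// ignores kana / latin bases and stray-opener checks.
fn is_kanji(c: char) -> bool {
    ('\u{4E00}'..='\u{9FFF}').contains(&c)
}

/// Returns (base, reading) for the first `base《reading》` pair, if any.
fn implicit_ruby(text: &str) -> Option<(&str, &str)> {
    let open = text.find('《')?;
    let close = text[open..].find('》')? + open;
    let reading = &text[open + '《'.len_utf8()..close];
    // Walk backwards over the kanji run immediately preceding 《.
    let base_start = text[..open]
        .char_indices()
        .rev()
        .take_while(|&(_, c)| is_kanji(c))
        .last()
        .map(|(i, _)| i)?;
    Some((&text[base_start..open], reading))
}

fn main() {
    let (base, reading) = implicit_ruby("いざ青梅《おうめ》へ").unwrap();
    println!("{base} => {reading}");
}
```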
Diagnostics
- aozora::lex::unclosed_bracket — unbalanced 《 reaches EOF.
- aozora::lex::unmatched_close — stray 》 with no matching open.
Related kinds
- DoubleRuby — 《《…》》 double-bracket variant.
- Annotation::InvalidRubySpan — fallback when the ruby pair could not be parsed cleanly.
NodeKind::Bouten
Wire tag: bouten — emphasis dots / sidelines over a target span.
Source examples
青空に[#「青空」に傍点]
青空に[#「青空」に丸傍点]
The bracketed annotation refers backwards to the literal text
quoted with 「…」, so the parser resolves the target by string
match against the preceding line(s).
Rendered HTML
<em class="aozora-bouten aozora-bouten-goma aozora-bouten-right">青空</em>に
The two trailing class slots carry the bouten kind (goma,
circle, wavy-line, …) and the position (right for vertical
text, left for the rare under-side variant).
Serialize output
Round-trips to the explicit [#「target」に<kind>傍点] form.
AST shape
pub struct Bouten<'src> {
pub kind: BoutenKind,
pub target: NonEmpty<Content<'src>>,
pub position: BoutenPosition,
}
BoutenKind enumerates the 11 visual variants (Goma, WhiteSesame,
Circle, …); BoutenPosition is Right (default for vertical text)
or Left.
When emitted
Phase 3 sees [#「QUOTE」に <slug>傍点] / [#「QUOTE」に <slug>傍線],
walks back through the recent text to find QUOTE, and emits the
node with the matched span.
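That backward string match can be sketched with rfind, which yields the nearest preceding occurrence (a simplification: the real resolver bounds how far back it searches and works on decoded runs):

```rust
// Sketch of backward target resolution: given the text before a
// [#「X」に傍点] annotation and the quoted target X, find the byte
// span of the nearest preceding occurrence. Illustrative only.
fn resolve_target(preceding: &str, target: &str) -> Option<(usize, usize)> {
    // rfind gives the last (nearest) occurrence before the annotation.
    let start = preceding.rfind(target)?;
    Some((start, start + target.len()))
}

fn main() {
    let preceding = "青空を見た。青空に";
    let (s, e) = resolve_target(preceding, "青空").unwrap();
    println!("span {s}..{e}: {}", &preceding[s..e]);
}
```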
Diagnostics
- aozora::lex::unclosed_bracket — annotation [# opened with no matching ].
- Annotation (fallback) — quote target unresolved.
Related kinds
- Annotation — fallback when the target cannot be matched.
NodeKind::TateChuYoko
Wire tag: tateChuYoko — horizontal text inside a vertical
writing-mode run (縦中横, “vertical-with-horizontal-inside”).
Source examples
昭和[#「12」は縦中横]年
Rendered HTML
<span class="aozora-tcy">12</span>
Downstream CSS gives the span text-combine-upright: all for proper
vertical-writing display.
Serialize output
Round-trips to [#「target」は縦中横].
AST shape
pub struct TateChuYoko<'src> {
pub text: NonEmpty<Content<'src>>,
}
When emitted
Phase 3 matches the directive [#「TARGET」は縦中横] and resolves
TARGET in preceding text, then emits with the matched span.
Diagnostics
aozora::lex::unclosed_bracket if [# is unmatched.
Related kinds
- Annotation — fallback if target resolution fails.
NodeKind::Gaiji
Wire tag: gaiji — out-of-character-set glyph reference. The
historical Aozora-Bunko notation for characters Shift_JIS could
not encode; modern files mostly use them for genuine non-Unicode
glyphs.
Source examples
※[#「木+吶のつくり」、第3水準1-85-54]
The ※ (U+203B) flags the construct; [#description、mencode]
carries the human description and a structured Mojikyō / JIS / U+
identifier.
Rendered HTML
<span class="aozora-gaiji" title="木+吶のつくり" data-mencode="第3水準1-85-54">〓</span>
The fallback glyph 〓 (U+3013, “geta mark”) is the conventional
Japanese typesetting placeholder for missing glyphs. When the
resolver finds a Unicode mapping the inner text becomes the
resolved character instead of the geta mark.
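A sketch of that fallback rule, with a hypothetical render_gaiji helper (the real renderer lives in aozora-render):

```rust
// Sketch of the fallback rule described above: emit the resolved
// character when the resolver found a Unicode mapping, otherwise the
// geta mark 〓 (U+3013). `render_gaiji` is a hypothetical helper.
fn gaiji_inner(resolved: Option<char>) -> String {
    match resolved {
        Some(c) => c.to_string(),
        None => '\u{3013}'.to_string(), // 〓 geta mark placeholder
    }
}

fn render_gaiji(description: &str, mencode: &str, resolved: Option<char>) -> String {
    format!(
        "<span class=\"aozora-gaiji\" title=\"{}\" data-mencode=\"{}\">{}</span>",
        description,
        mencode,
        gaiji_inner(resolved)
    )
}

fn main() {
    println!("{}", render_gaiji("木+吶のつくり", "第3水準1-85-54", None));
}
```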
Serialize output
Round-trips to ※[#description、mencode].
AST shape
pub struct Gaiji<'src> {
pub description: &'src str,
pub ucs: Option<Resolved>,
pub mencode: Option<&'src str>,
}
Resolved is either a single Unicode scalar or one of 25
predefined static combining sequences (e.g. か゚ — か + the IPA
voicing-pair-mark — kept as a static constant so the borrowed-AST
stays Copy).
When emitted
Phase 3 sees the ※[#…] digraph and parses the description /
mencode payload. The encoding crate’s gaiji resolver lifts the
mencode reference into a Unicode character when one exists.
Diagnostics
None on a well-formed ※[#...]. Ambiguous descriptions land as
Annotation::Unknown instead of Gaiji.
Related kinds
- Annotation — fallback when description is malformed.
NodeKind::Indent
Wire tag: indent — single-line [#N字下げ] indent marker.
Source examples
[#2字下げ]
[#3字下げ]もう一段下げる
Rendered HTML
<span class="aozora-indent" data-amount="2"></span>
CSS controls the actual padding (typically padding-inline-start: Nem).
Serialize output
Round-trips to [#N字下げ].
AST shape
pub struct Indent {
pub amount: u8,
}
When emitted
Phase 3 matches the digraph plus a numeric prefix and emits a
single inline marker. For paired indent regions ([#ここから2字下げ]
… [#ここで字下げ終わり]), see Container.
Diagnostics
None on well-formed input.
Related kinds
- Container — paired indent / dedent regions (ContainerKind::Indent).
- AlignEnd — right-edge alignment counterpart.
NodeKind::AlignEnd
Wire tag: alignEnd — right-edge alignment marker (字上げ).
Source examples
[#地付き]
[#地から3字上げ]
Rendered HTML
<span class="aozora-align-end" data-offset="0"></span>
offset is 0 for 地付き, N for 地から N 字上げ.
Serialize output
Round-trips to [#地付き] / [#地からN字上げ].
AST shape
pub struct AlignEnd {
pub offset: u8,
}
When emitted
Phase 3 matches the directive form. Paired alignment regions
([#ここから地から N 字上げ] … [#ここで字上げ終わり]) are
Container instead.
Diagnostics
None.
Related kinds
- Indent — left-edge indent counterpart.
NodeKind::Warichu
Wire tag: warichu — split-line annotation (割注). Two text runs
are stacked into a single line of the surrounding text.
Source examples
[#割り注]上の段/下の段[#割り注終わり]
Rendered HTML
<span class="aozora-warichu">
<span class="aozora-warichu-upper">上の段</span>
<span class="aozora-warichu-lower">下の段</span>
</span>
Serialize output
Round-trips to the explicit [#割り注].../...[#割り注終わり].
AST shape
pub struct Warichu<'src> {
pub upper: Content<'src>,
pub lower: Content<'src>,
}
upper / lower are plain Content;
empty halves are valid (one-sided warichu).
When emitted
The single-line [#割り注]...[#割り注終わり] form is
inline-classified; multi-line [#割注] containers become a
Container of kind Warichu.
Diagnostics
None on well-formed input.
Related kinds
- Container — multi-line counterpart.
NodeKind::Keigakomi
Wire tag: keigakomi — ruled-box annotation (罫囲み).
Source examples
[#罫囲み]本文[#罫囲み終わり]
Rendered HTML
<span class="aozora-keigakomi"></span>
(Inline marker; the multi-line container form yields a
<div class="aozora-container-keigakomi"> wrapper instead — see
Container.)
Serialize output
Round-trips to [#罫囲み]...[#罫囲み終わり].
AST shape
pub struct Keigakomi;
Marker struct with no payload — the surrounding text carries the content.
When emitted
Phase 3 sees the inline form. Multi-line keigakomi blocks classify
as Container Keigakomi.
Diagnostics
None on well-formed input.
Related kinds
- Container — multi-line counterpart.
NodeKind::PageBreak
Wire tag: pageBreak — [#改ページ] page break marker.
Source examples
end of chapter
[#改ページ]
beginning of next chapter
Rendered HTML
<div class="aozora-page-break"></div>
CSS gives the div a page-break-before: always for paged media
(EPUB / print).
Serialize output
Round-trips to [#改ページ]\n.
AST shape
AozoraNode::PageBreak is a unit variant — no payload.
When emitted
Phase 3 sees [#改ページ] and emits a single BlockLeaf
classification covering the whole bracket span.
Diagnostics
None on well-formed input.
Related kinds
- SectionBreak — the [#改丁] family.
NodeKind::SectionBreak
Wire tag: sectionBreak — section breaks (改丁 / 改段 / 改見開き).
Source examples
[#改丁]
[#改段]
[#改見開き]
Rendered HTML
<div class="aozora-section-break aozora-section-break-choho"></div>
The second class slot carries the variant slug (choho, dan,
spread, other).
Serialize output
Round-trips to [#改丁] etc.
AST shape
AozoraNode::SectionBreak(SectionKind)
SectionKind is Choho (改丁) / Dan (改段) / Spread (改見開き).
When emitted
Phase 3 matches each directive; the kind enum captures which.
Diagnostics
None on well-formed input.
Related kinds
- PageBreak — finer-grained [#改ページ] variant.
NodeKind::AozoraHeading
Wire tag: heading — Aozora 見出し (window / sub heading).
Source examples
[#見出し]序章[#見出し終わり]
Rendered HTML
<h2 class="aozora-heading aozora-heading-window">序章</h2>
The Pandoc projection uses level 2 for Window, level 3 for Sub.
Serialize output
Round-trips to [#<kind>見出し]...[#<kind>見出し終わり].
AST shape
pub struct AozoraHeading<'src> {
pub kind: AozoraHeadingKind,
pub text: NonEmpty<Content<'src>>,
}
AozoraHeadingKind is Window (窓見出し) or Sub (副見出し).
When emitted
Phase 3 matches the keyword 見出し family and binds the body run.
Diagnostics
None on well-formed input.
Related kinds
- HeadingHint — forward-reference style heading hint.
NodeKind::HeadingHint
Wire tag: headingHint — forward-reference heading hint
([#「target」は中見出し]).
Source examples
序章
[#「序章」は中見出し]
The hint refers to a quoted target string in the preceding line(s); downstream renderers pick this up as “promote the matched run to a heading.”
Rendered HTML
The marker itself emits no visible content; renderers that honour
the hint elevate the previously-matched span to a <h2> /
<h3> retroactively. The default HTML renderer in aozora-render
emits a structural marker comment.
Serialize output
Round-trips to [#「target」は<level>見出し].
AST shape
pub struct HeadingHint<'src> {
pub level: u8,
pub target: NonEmptyStr<'src>,
}
level follows the Aozora convention: 1=大見出し, 2=中見出し,
3=小見出し.
When emitted
Phase 3 matches the directive and records the level + target. Empty target is rejected and falls through to plain text.
Diagnostics
None on well-formed input.
Related kinds
- AozoraHeading — direct heading-marker variant.
NodeKind::Sashie
Wire tag: sashie — illustration reference (挿絵).
Source examples
[#挿絵(cover.png)入る]
[#挿絵(pages/03.jpg、第3章扉絵)入る]
Rendered HTML
<figure class="aozora-sashie">
<img src="cover.png" alt="">
</figure>
When a caption is present it lands as a <figcaption> next to the
<img>.
Serialize output
Round-trips to [#挿絵(path[、caption])入る].
AST shape
pub struct Sashie<'src> {
pub file: NonEmptyStr<'src>,
pub caption: Option<Content<'src>>,
}
Empty file is rejected upstream — the construct cannot ship a
nameless image.
When emitted
Phase 3 matches the 挿絵(…)入る digraph and parses out the path
plus optional caption.
Diagnostics
None on well-formed input.
Related kinds
- Annotation — fallback when the directive is malformed.
NodeKind::Kaeriten
Wire tag: kaeriten — kanbun reading-order marker (返り点).
Source examples
読[#返り点 一・二]本
Rendered HTML
<sup class="aozora-kaeriten" data-mark="一・二"></sup>
CSS positions the sup glyph appropriately for vertical / horizontal writing mode.
Serialize output
Round-trips to [#返り点 mark].
AST shape
pub struct Kaeriten<'src> {
pub mark: NonEmptyStr<'src>,
}
When emitted
Phase 3 matches 返り点 keyword + marker payload. Empty marker
rejected upstream.
Diagnostics
None on well-formed input.
Related kinds
None.
NodeKind::Annotation
Wire tag: annotation — generic [#...] annotation that no
specific recogniser claimed.
Source examples
text[#任意のメモ]more
text[#ふりがな付きの説明]more
Rendered HTML
<span class="aozora-annotation" title="..."></span>
The default renderer suppresses the body; downstream filters can
match on aozora-annotation to surface the comment.
Serialize output
Round-trips to [#<raw>].
AST shape
pub struct Annotation<'src> {
pub raw: NonEmptyStr<'src>,
pub kind: AnnotationKind,
}
AnnotationKind discriminates the recognised sub-variants
(Unknown, AsIs, TextualNote, InvalidRubySpan, …); raw
carries the raw bracket body for any further analysis.
When emitted
Phase 3 reaches [#...] after no specific recogniser matched.
Annotation is the fallback that always preserves the user’s
content rather than dropping it.
Diagnostics
None — Annotation is the recovery path for unrecognised
directives. A genuine invalid-bracket diagnostic
(unclosed_bracket / unmatched_close) appears separately.
Related kinds
None.
NodeKind::DoubleRuby
Wire tag: doubleRuby — double-bracket bouten (《《重要》》).
Source examples
《《重要》》
Rendered HTML
<em class="aozora-double-ruby">重要</em>
CSS typically sets font-weight: bold or attaches sidelines for
this construct; the default class hand-off lets stylesheets pick
the visual.
Serialize output
Round-trips to 《《content》》.
AST shape
pub struct DoubleRuby<'src> {
pub content: NonEmpty<Content<'src>>,
}
content is NonEmpty — empty 《《》》 is rejected upstream and
falls through to plain text rather than producing an empty node.
When emitted
Phase 3 sees 《《 as a single tokenised opener (not two 《); the
classifier matches 《《...》》 as a single pair and emits the
node.
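The longest-match behaviour at 《 can be sketched as a one-character peek (illustrative only, not the real tokeniser):

```rust
// Sketch of the longest-match rule at 《: peek one more char and emit
// a double opener when it is also 《. Illustrative tokeniser fragment.
fn openers(text: &str) -> Vec<&'static str> {
    let mut out = Vec::new();
    let mut chars = text.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '《' {
            if chars.peek() == Some(&'《') {
                chars.next(); // consume the second 《 -- one token, not two
                out.push("double-open");
            } else {
                out.push("open");
            }
        }
    }
    out
}

fn main() {
    println!("{:?}", openers("《《重要》》と青梅《おうめ》"));
}
```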
Diagnostics
unclosed_bracket for 《《 without 》》.
Related kinds
- Ruby — single-bracket variant.
NodeKind::Container
Wire tag: container — paired-container wrapping
([#ここから...]...[#ここで...終わり]).
Source examples
[#ここから2字下げ]
第一節
第二節
[#ここで字下げ終わり]
[#罫囲み]
本文
[#罫囲み終わり]
[#地から3字上げ]
寄付者一覧
[#字上げ終わり]
Rendered HTML
<div class="aozora-container-indent" data-amount="2">
...
</div>
The wrapping div carries the kind-specific class
(aozora-container-indent, aozora-container-warichu,
aozora-container-keigakomi, aozora-container-align-end) plus
any structural data (indent amount, align offset) on data-*.
Serialize output
Round-trips to the explicit-paired directive form.
AST shape
pub struct Container {
pub kind: ContainerKind,
}
pub enum ContainerKind {
Indent { amount: u8 },
Warichu,
Keigakomi,
AlignEnd { offset: u8 },
}
The Container payload wraps the content — the walker driver fires
visit_container_open on enter and visit_container_close on exit so
renderers wrap the body cleanly.
When emitted
Phase 2 pairs the [#ここから…] / [#ここで…終わり] openers
and closers; Phase 3’s BlockOpen / BlockClose events project to
this variant.
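The pairing itself is a plain stack discipline. A sketch with kinds as strings and events as (kind, is_open) tuples (the real phase 2 works on lexed tokens and emits BlockOpen / BlockClose sentinels):

```rust
// Stack-based open/close pairing sketch: each opener index is pushed,
// each closer pops its partner. A close with an empty stack is the
// unmatched_close case, reported here as Err(index).
fn pair_containers(events: &[(&str, bool)]) -> Result<Vec<(usize, usize)>, usize> {
    let mut stack: Vec<usize> = Vec::new();
    let mut pairs = Vec::new();
    for (i, &(_kind, is_open)) in events.iter().enumerate() {
        if is_open {
            stack.push(i);
        } else {
            let open = stack.pop().ok_or(i)?; // unmatched close
            pairs.push((open, i));
        }
    }
    // Leftover opens would surface as unclosed_bracket; ignored here.
    Ok(pairs)
}

fn main() {
    let events = [
        ("2字下げ", true),  // [#ここから2字下げ]
        ("罫囲み", true),   // [#罫囲み]
        ("罫囲み", false),  // [#罫囲み終わり]
        ("字下げ", false),  // [#ここで字下げ終わり]
    ];
    println!("{:?}", pair_containers(&events));
}
```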
Diagnostics
unclosed_bracket for unbalanced opens.
Related kinds
- ContainerOpen — NodeRef projection of the open boundary.
- ContainerClose — NodeRef projection of the close boundary.
- Indent, AlignEnd, Warichu, Keigakomi — single-line counterparts.
NodeKind::ContainerOpen
Wire tag: containerOpen — paired-container open boundary marker.
This variant only appears in NodeRef-flavoured wire output (e.g.
serialize_nodes); the structural AozoraNode::Container
payload covers the wrapping construct itself.
Source examples
[#ここから2字下げ] <- ContainerOpen
indented body
[#ここで字下げ終わり] <- ContainerClose
Rendered HTML
The default HTML renderer routes the open / close pair through
visit_container_open / visit_container_close and emits the
opening <div class="aozora-container-..."> wrapping the body.
Serialize output
Round-trips together with the matching close to the
[#ここから…]...[#ここで…終わり] form.
AST shape
NodeRef::BlockOpen(ContainerKind) — see
ContainerKind.
When emitted
Phase 2 pairs the open / close brackets; Phase 3’s normalised text
emits a BlockOpen PUA sentinel at the position of the opener so
the registry can dispatch the open event during walking.
Diagnostics
unclosed_bracket if the open never finds a matching close.
Related kinds
- ContainerClose — paired close-side counterpart.
- Container — the structural payload variant.
NodeKind::ContainerClose
Wire tag: containerClose — paired-container close boundary marker.
NodeRef-only counterpart of ContainerOpen.
Source examples
[#ここから2字下げ] <- ContainerOpen
body
[#ここで字下げ終わり] <- ContainerClose
Rendered HTML
Routed through visit_container_close; the default renderer emits
the closing </div> of the
<div class="aozora-container-..."> opened by the matching
ContainerOpen.
Serialize output
Round-trips with the matching open.
AST shape
NodeRef::BlockClose(ContainerKind).
When emitted
Phase 3 normalised-text emits a BlockClose PUA sentinel at the
matching close position.
Diagnostics
unmatched_close if the close has no open partner — in which case
no ContainerClose is emitted and the close-bracket bytes flow
through as plain.
Related kinds
- ContainerOpen — open-side counterpart.
- Container — structural payload.
Notation overview
青空文庫記法 is a small, line-oriented annotation language layered inside a plain-text Japanese document. Authors mark up the text in two distinct registers:
- Inline markers — single-character sigils (|, 《, 》, ※) that fence inline annotations directly inside the prose.
- Block annotations — [#…] brackets containing a Japanese directive in natural language (“ここから2字下げ”, “「X」に傍点”, …) that act as openers, closers, or self-contained directives.
aozora recognises every annotation that survives in real Aozora Bunko sources — the volunteer corpus has ~17 000 works in active rotation, and the parser is exercised against the entire archive in CI as part of the corpus sweep.
Notations covered
| Chapter | What it marks |
|---|---|
| Ruby | Pronunciation glosses (|青梅《おうめ》, 青梅《おうめ》). |
| Bouten / bousen | Emphasis dots and lines: 傍点 (sesame, white sesame, filled circle, open circle, …) and 傍線 (single, double, dashed, …). |
| 縦中横 | Horizontally-set runs inside vertical text ([#「数字」は縦中横]). |
| Gaiji | Out-of-Shift_JIS character references (※[#…、第3水準1-85-54]) and accented-Latin decomposition. |
| Kunten | 漢文 reading marks: 返り点 (レ, 一, 二, 上, 中, 下), 再読文字, 送り仮名. |
| Indent containers | [#ここから2字下げ]… [#ここで字下げ終わり] and the geji / 地付き / 地寄せ family. |
| Page & section breaks | 改ページ, 改丁, 改見開き, 改段. |
| Diagnostics | The catalogue of structured diagnostics the parser emits. |
Spec source of truth
The authoritative spec lives at
https://www.aozora.gr.jp/annotation/index.html. A snapshot is
vendored at docs/specs/aozora/
in the repo so that every page in this handbook can link to a stable
fragment (the upstream HTML reorganises occasionally; the snapshot
shields chapter cross-references from rot).
When this handbook says “the spec says X”, that means that snapshot. Where the live spec drifts, we update the snapshot, then update the parser, then update this handbook — in that order.
How a sample input looks
|青梅《おうめ》街道を歩いて、※[#「魚+師のつくり」、第3水準1-94-37]を見た。
[#ここから2字下げ]
[#「平和」に傍点]という言葉は、もう古い。
[#ここで字下げ終わり]
[#改ページ]
That single sample exercises ruby, gaiji, indent containers, bouten, and a page break. The parser turns it into a flat node stream — see the per-chapter pages for the exact AST shapes.
Notation we deliberately omit
Aozora Bunko’s spec mentions a handful of annotations that don’t appear in the maintained corpus:
- Image references beyond [#挿絵] — covered up to the caption, no actual image rendering.
- キャプション alignment edge cases that the spec lists but no active work uses (verified against the corpus sweep).
These are recognised as Container::Unknown with a
W0010 advisory diagnostic. Adding full
support is a one-PR job once a real corpus document needs it.
Ruby (|青梅《おうめ》)
Ruby is a pronunciation gloss attached to a run of base text. In 青空文庫 source it appears in two shapes:
|青梅《おうめ》 ← explicit-base form
青梅《おうめ》 ← implicit-base form (auto-detect)
Both forms render the same HTML:
<ruby>青梅<rt>おうめ</rt></ruby>
Explicit base (|…《…》)
The full-width vertical bar | (U+FF5C) marks the start of the
base text; 《…》 (U+300A / U+300B) wraps the reading. The base
runs from | to the 《. Use this form when:
- The base contains characters that the auto-detect heuristic would otherwise skip (kana, ASCII letters, mixed scripts).
- The boundary between base and surrounding text is ambiguous.
|山田《やまだ》さん → <ruby>山田<rt>やまだ</rt></ruby>さん
|HTTP《ハイパー・テキスト》 → <ruby>HTTP<rt>ハイパー・テキスト</rt></ruby>
Implicit base
When 《…》 follows a run of kanji without a leading |, the
parser auto-detects the base by scanning backwards through the kanji
run. The auto-detect terminates at the first non-kanji character
(kana, punctuation, ASCII, full-width digit).
青梅《おうめ》 → <ruby>青梅<rt>おうめ</rt></ruby>
お青梅《おうめ》 → お<ruby>青梅<rt>おうめ</rt></ruby>
The “kanji” predicate is CJK Unified Ideographs + CJK Compatibility Ideographs + CJK Unified Ideographs Extension A–F, plus the iteration mark 々. JIS X 0213 plane-2 ideographs not in Unicode are represented as gaiji references (see Gaiji) and likewise terminate the auto-detect.
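The backward scan can be sketched in a few lines. This is an illustrative reconstruction, not the crate’s real API — the function names are hypothetical, the Extension B–F check is approximated as a single range, and the real predicate also handles the gaiji PUA sentinels:

```rust
/// Hypothetical sketch of the "kanji" predicate described above.
fn is_base_kanji(c: char) -> bool {
    matches!(c,
        '々'                          // iteration mark
        | '\u{3400}'..='\u{4DBF}'   // CJK Extension A
        | '\u{4E00}'..='\u{9FFF}'   // CJK Unified Ideographs
        | '\u{F900}'..='\u{FAFF}'   // CJK Compatibility Ideographs
        | '\u{20000}'..='\u{2FA1F}' // Extensions B–F, approximated as one range
    )
}

/// Scan backwards from the byte offset of 《 and return the implicit base.
fn implicit_base(text: &str, open: usize) -> &str {
    let start = text[..open]
        .char_indices()
        .rev()
        .take_while(|&(_, c)| is_base_kanji(c))
        .last()
        .map(|(i, _)| i)
        .unwrap_or(open); // no kanji run → empty base → literal 《…》
    &text[start..open]
}

fn main() {
    // お is not kanji, so the scan stops and the base is 青梅.
    assert_eq!(implicit_base("お青梅《おうめ》", 9), "青梅");
}
```

The empty-base case (《おうめ》 with no preceding kanji) falls out naturally: the scan finds nothing and the slice is empty, matching the literal-text behaviour in the edge-case table below.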
Empty reading
|青梅《》 is a parse error. The lexer emits diagnostic
E0001 (“ruby reading mismatch: target spans
N chars but |《》 reading is empty”) and the node is dropped
from the AST.
The implicit-base form silently skips a 《》 with empty contents —
that combination cannot have arisen from valid markup, so the parser
treats the bare 《》 as literal text.
Nested ruby (forbidden)
The spec disallows ruby inside ruby. Sources with |青梅《|お《お》うめ》
are rejected with diagnostic E0002.
AST shape
pub struct Ruby<'src> {
pub target: &'src str, // borrowed from source
pub reading: &'src str, // borrowed from source
pub span: Span, // byte range in the source
pub explicit_base: bool, // true if the input used the |…《…》 form
}
Both target and reading are &str slices into the
Document-owned source — no allocation, no copy. Re-emitting
canonical form is exactly:
match (ruby.explicit_base, ruby.target, ruby.reading) {
(true, t, r) => format!("|{t}《{r}》"),
(false, t, r) => format!("{t}《{r}》"),
}
Edge cases
| Input | Output |
|---|---|
| 青梅《おうめ》 | <ruby>青梅<rt>おうめ</rt></ruby> |
| \|青梅《おうめ》 | <ruby>青梅<rt>おうめ</rt></ruby> (canonical-equivalent) |
| \|山田《やまだ》 | <ruby>山田<rt>やまだ</rt></ruby> |
| \|HTTP《ハイパー・テキスト》 | <ruby>HTTP<rt>ハイパー・テキスト</rt></ruby> |
| お青梅《おうめ》 | お<ruby>青梅<rt>おうめ</rt></ruby> (auto-detect skips kana) |
| 1青梅《おうめ》 | 1<ruby>青梅<rt>おうめ</rt></ruby> (auto-detect skips digit) |
| \|青梅《》 | parse error E0001 |
| 《おうめ》 | literal text (no preceding kanji to anchor) |
| \|青梅《\|お《お》うめ》 | parse error E0002 |
See also
- Bouten / bousen — emphasis annotations that share the 「X」に… indirection idiom.
- Architecture → Seven-phase lexer — where ruby recognition fits in the classifier pipeline.
Bouten / bousen (傍点・傍線)
Bouten (傍点) are emphasis dots placed beside characters in vertical text — the Japanese typographic equivalent of italic or bold. Bousen (傍線) are the same idea with a line instead of dots. The spec recognises eleven dot variants and six line variants; aozora accepts every one.
Notation forms
Two indirection styles, both common in the real corpus:
[#「平和」に傍点] ← target-by-quoting
平和[#「平和」に傍点] ← redundant explicit copy (also accepted)
[#ここから傍点]平和[#ここで傍点終わり] ← container form
The target-by-quoting form is by far the most common: the inline annotation looks backwards in the text for the most recent occurrence of the quoted string and applies the bouten to that run.
Variant catalogue
| Slug | Source kanji | Renders as |
|---|---|---|
| sesame | 傍点 | small black sesame ﹅ |
| white_sesame | 白ゴマ傍点 | small white sesame ﹆ |
| circle | 丸傍点 | filled circle ● |
| white_circle | 白丸傍点 | open circle ○ |
| dot | 黒点傍点 | bold black dot |
| triangle | 三角傍点 | filled triangle |
| white_triangle | 白三角傍点 | open triangle |
| bullseye | 二重丸傍点 | bullseye |
| kotenten | コ点傍点 | small katakana ko-mark |
| kotenten_white | 白コ点傍点 | white ko-mark |
| linear | 線傍点 | dotted underline |
| single_line | 傍線 | single line |
| double_line | 二重傍線 | double line |
| dashed_line | 鎖線 | dashed line |
| wavy_line | 波線 | wavy line |
| chained_line | 二重鎖線 | double dashed line |
| under_dotted | 下線 | dotted underline |
Each variant has a stable BoutenKind::slug() that the HTML renderer
emits as a class name (e.g. <em class="aozora-bouten-sesame">). See
Architecture → HTML renderer for the full
class-name scheme.
Default rendering
aozora emits <em class="aozora-bouten-<slug>">…</em> so that an
external stylesheet can pick the visual treatment per variant.
Default CSS hooks live at the consumer side; the parser ships no
stylesheet of its own.
<!-- 平和[#「平和」に傍点] -->
平和<em class="aozora-bouten-sesame">平和</em>
(The redundant copy is intentional — the [#…] indirection
re-emits the target wrapped in <em>, leaving the original run
in place. The HTML rendering matches what print Aozora Bunko output
does in practice.)
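That re-emit step is simple enough to sketch. The function name and signature below are illustrative, not the crate’s real renderer API; only the class-name scheme comes from the text above:

```rust
// Sketch of the indirect-bouten render step: the quoted target already
// flowed through as plain text, so the directive appends a wrapped copy.
fn render_bouten_indirect(out: &mut String, target: &str, slug: &str) {
    out.push_str("<em class=\"aozora-bouten-");
    out.push_str(slug);
    out.push_str("\">");
    out.push_str(target);
    out.push_str("</em>");
}

fn main() {
    // 平和[#「平和」に傍点] — the original run stays in place.
    let mut out = String::from("平和");
    render_bouten_indirect(&mut out, "平和", "sesame");
    assert_eq!(out, "平和<em class=\"aozora-bouten-sesame\">平和</em>");
}
```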
Container form
For runs that span multiple lines or include other annotations, use the container form:
[#ここから傍点]
平和は手の届かないものだった。
そして、戦争もまた。
[#ここで傍点終わり]
Renders as:
<em class="aozora-bouten-sesame">
平和は手の届かないものだった。
そして、戦争もまた。
</em>
The opening directive can be any of the variant openers (ここから二重傍線,
ここから波線, …); the matching closer must use the same family
(ここで傍線終わり for any 線 variant, ここで傍点終わり for any 点
variant). Mismatched closers fire diagnostic
E0004.
AST shape
pub struct Bouten<'src> {
pub target: &'src str, // the run wrapped in emphasis
pub kind: BoutenKind, // one of 17 variants
pub form: BoutenForm, // Indirect | Inline | Container
pub span: Span,
}
BoutenKind is a flat enum with slug accessors; see the
rustdoc for the exact variant list.
See also
- Notation overview — how this fits with the other inline annotations.
- Diagnostics catalogue — E0004, W0003.
縦中横 (tate-chū-yoko)
縦中横 (tate-chū-yoko, “horizontal in vertical”) is a typographic construct that lays a short run — usually digits, Latin letters, or mixed punctuation — horizontally inside otherwise vertical text. In print, it is the common treatment for two- or three-digit numbers in a vertical paragraph.
Notation
The annotation always uses the indirect-quoting form:
昭和27年生まれ[#「27」は縦中横]
Renders as:
昭和<span class="aozora-tcy">27</span>年生まれ
The [#…] directive looks back through the most recent text and
applies the tcy treatment to the most recent occurrence of the
quoted run. The target text is not re-emitted — the wrapper is
applied in place, unlike bouten.
Container form
For longer mixed-orientation runs (multi-line table data, Latin abbreviations spanning a paragraph), the container form sits inside an outer indent block:
[#ここから縦中横]
27 / 100 = 0.27
[#ここで縦中横終わり]
Renders as:
<div class="aozora-tcy-block">
27 / 100 = 0.27
</div>
Common targets
| Source | Output |
|---|---|
| 27[#「27」は縦中横] | <span class="aozora-tcy">27</span> |
| 100%[#「100」は縦中横] | <span class="aozora-tcy">100</span>% |
| A4[#「A4」は縦中横] | <span class="aozora-tcy">A4</span> |
| &[#「&」は縦中横] | <span class="aozora-tcy">&</span> |
(HTML escapes are handled by the renderer, not the AST.)
Anchor lookup
The lookup that finds the target run:
- Scans backwards from the [#…] directive through the current line.
- Stops at the first match for the quoted run.
- Falls through to the previous line if no match (with an upper bound of 64 KiB or one paragraph break, whichever comes first).
If no match is found, diagnostic W0001
fires and the directive is dropped from the output. Authors get the
same look-back semantics they’d get from bouten — see
Bouten for the symmetric case.
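The look-back rule can be sketched as a bounded backward substring search. Names and the trigger representation here are hypothetical — a real implementation must also clamp the window floor to a char boundary, which this sketch elides:

```rust
// Illustrative look-back anchor search: most recent occurrence of the
// quoted run, bounded by a paragraph break or 64 KiB, whichever is nearer.
fn find_anchor(text: &str, directive_at: usize, needle: &str) -> Option<usize> {
    let floor = directive_at.saturating_sub(64 * 1024);
    let window_start = text[floor..directive_at]
        .rfind("\n\n")                 // stop at a paragraph break
        .map(|i| floor + i + 2)
        .unwrap_or(floor);
    text[window_start..directive_at]
        .rfind(needle)                 // most recent match wins
        .map(|i| window_start + i)
}

fn main() {
    // 「27」 starts at byte 6 of 昭和27年生まれ (3 bytes per kanji).
    assert_eq!(find_anchor("昭和27年生まれ", 20, "27"), Some(6));
}
```

A miss returns None, which is exactly the W0001 case described above.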
Why a span, not a flow rotation?
Web renderers reach for writing-mode: horizontal-tb inside a
writing-mode: vertical-rl parent, but that has poor browser support
and breaks line-break propagation. aozora’s HTML output uses a
single class hook (<span class="aozora-tcy">) so the consuming
stylesheet can decide:
- print stylesheet → font-feature-settings: "vert"; text-combine-upright: all;
- screen stylesheet → leave horizontal, set monospace
- e-book renderer → use the renderer’s native tcy primitive
Pushing this decision into the HTML output (e.g. emitting an inline SVG with rotated glyphs) would lock consumers into a specific typographic model. The class-hook output keeps the HTML semantic and defers presentation to the consumer.
AST shape
pub struct Tcy<'src> {
pub text: &'src str,
pub form: TcyForm, // Inline | Container
pub span: Span,
}
See also
- Indent containers — tcy commonly appears inside 字下げ blocks; the parser applies tcy after the indent fence is established so the look-back search is bounded by the inner block.
Gaiji (外字 references)
Aozora Bunko predates ubiquitous Unicode support; many works still ship as Shift_JIS source. Characters that don’t fit in Shift_JIS — JIS X 0213 plane-2 ideographs, accented Latin letters, ad-hoc combining marks — appear in source as gaiji references:
※[#「魚+師のつくり」、第3水準1-94-37]
※[#「彳+寺」、U+5F85、393-13]
※[#濁点付き片仮名ヰ]
The leading ※ (U+203B, reference mark) opens the annotation; the
[#…] body describes the character in three orthogonal ways:
- A descriptive name in Japanese (「魚+師のつくり」 — “魚 plus the right-hand side of 師”) for human readers.
- A JIS X 0213 plane / row / cell triple (第3水準1-94-37 — plane 1, row 94, cell 37).
- A Unicode codepoint (U+5F85) when the character has one.
aozora resolves gaiji references through a compile-time PHF lookup table built from the JIS X 0213 official mapping plus the Unicode UCS register, with the descriptive name as a tertiary fallback.
Why a compile-time table?
The gaiji table has ~14 000 entries. Loading it at runtime from a JSON / TOML asset would:
- Add a startup cost on every Document::new (the parser is supposed to start reading bytes within microseconds).
- Force every binding (CLI, WASM, FFI, Python wheel) to ship the table as a separate asset, complicating distribution.
- Defeat dead-code elimination — the linker can’t strip entries the consumer’s input never references if they’re loaded behind an opaque file read.
A phf::Map baked into the binary at compile time wins on every
axis: zero-allocation lookup, single-binary distribution, full
DCE and LTO visibility. The build cost is real (~40 s the first
time, ~0 s incremental) but happens once per workspace build, not
per-invocation.
We chose phf over a static HashMap (which would require runtime
construction in a OnceLock): phf produces a true compile-time
perfect-hash table — O(1) lookup with no first-call cost and no
synchronisation on the hot path.
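To show the shape without pulling in the phf dependency, here is a dependency-free stand-in: a sorted static array with binary search has the same “no runtime construction, no first-call cost” property, just O(log n) instead of O(1). The two entries are standard JIS X 0208 row/cell assignments (plane 1 of JIS X 0213 is a superset); everything else is illustrative:

```rust
// Stand-in for the compile-time gaiji table (the real crate bakes a
// phf::Map at build time). Keys are (plane, row, cell), kept sorted.
static JIS_TO_UNICODE: &[((u8, u8, u8), char)] = &[
    ((1, 16, 1), '亜'),
    ((1, 16, 2), '唖'),
];

fn lookup_jis(plane: u8, row: u8, cell: u8) -> Option<char> {
    JIS_TO_UNICODE
        .binary_search_by_key(&(plane, row, cell), |&(key, _)| key)
        .ok()
        .map(|i| JIS_TO_UNICODE[i].1)
}

fn main() {
    assert_eq!(lookup_jis(1, 16, 1), Some('亜'));
    assert_eq!(lookup_jis(2, 1, 1), None); // miss → W0006 path
}
```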
Resolution order
For a reference like ※[#「魚+師のつくり」、第3水準1-94-37]:
- Unicode codepoint if the source explicitly provided one (U+XXXX) — used directly.
- JIS X 0213 plane-row-cell lookup (第N水準P-R-C) — most ideographs land here.
- Descriptive name — the parser ships a curated mapping for the ~120 characters that have no JIS / Unicode codepoint at all.

Misses fire diagnostic W0006 and the gaiji is rendered as the descriptive text in <span> brackets.
AST shape
pub struct Gaiji<'src> {
pub description: &'src str, // 「魚+師のつくり」
pub jis: Option<JisCode>, // (plane, row, cell)
pub unicode: Option<char>, // resolved codepoint
pub resolution: GaijiResolution, // Direct | Lookup | Fallback
pub span: Span,
}
pub enum GaijiResolution {
/// The source provided U+XXXX directly.
Direct,
/// Resolved via JIS table.
Lookup,
/// Could not resolve; rendered as descriptive text.
Fallback,
}
Render output
| Resolution | HTML |
|---|---|
| Direct / Lookup | the resolved codepoint inline, with a data-aozora-gaiji-jis="1-94-37" attribute for downstream analysis tools. |
| Fallback | <span class="aozora-gaiji-fallback" title="魚+師のつくり">[魚+師のつくり]</span> |
Accent decomposition
Aozora Bunko also encodes accented Latin letters (è, ñ, ä) using a
separate notation that does not go through ※[#…]:
M[i!]cher ← in some sources
me-zin ← in others
The full table is at https://www.aozora.gr.jp/accent_separation.html — 114 ASCII digraphs / ligatures mapping to Unicode. aozora applies this decomposition during the lexer’s Phase 0 (sanitize), so by the time classification runs the source is pure Unicode. See Architecture → Seven-phase lexer for the phase ordering.
See also
- Architecture → Shift_JIS + 外字 resolver — the encoding pipeline and the PHF table internals.
- Diagnostics → W0006 — unresolved gaiji reference.
Kunten / kaeriten (訓点・返り点)
Kunten are the marginal annotations Japanese readers add to classical Chinese (漢文) source so that it can be read in Japanese word order. The two categories aozora handles:
- Kaeriten (返り点) — reading-order marks inserted between characters: レ, 一, 二, 三, 上, 中, 下, 甲, 乙, 天, 地, 人.
- Saidoku-moji (再読文字) — characters that are read twice with different glosses (e.g. 未, 将, 当).
A handful of late-Edo / Meiji Aozora Bunko works carry these. The notation:
有﹅レ朋﹅自﹅遠﹅方﹅来
…where ﹅ stands in for the actual kaeriten character. In real
source the marks are interleaved between characters using either the
direct character or a [#…] annotation:
有[#二]朋自遠方来[#一]
Notation forms
Inline (preferred in modern works)
The kaeriten character is inserted directly between the source characters:
有レ朋自遠方来
Renders as:
有<span class="aozora-kaeriten" data-aozora-kaeriten="レ">レ</span>朋自遠方来
Bracketed (older works)
有[#二]朋自遠方来[#一]
Renders as:
有<span class="aozora-kaeriten" data-aozora-kaeriten="二">二</span>朋自遠方来<span class="aozora-kaeriten" data-aozora-kaeriten="一">一</span>
The bracketed form is useful when the kaeriten character would
otherwise be ambiguous with the surrounding text (e.g. a real 一
that is not a reading mark).
Saidoku-moji
未[#「未」に二の字点]
The 二の字点 / 一二点 prefix tells the renderer that the preceding
character is read twice. aozora emits a data-aozora-saidoku data
attribute on the wrapper.
AST shape
pub struct Kaeriten<'src> {
pub mark: KaeritenKind, // Re | Ichi | Ni | San | Jou | Chuu | Ge | Kou | Otsu | Ten | Chi | Jin
pub form: KaeritenForm, // Inline | Bracketed
pub span: Span,
}
pub struct Saidoku<'src> {
pub target: &'src str, // the character being re-read
pub gloss: &'src str, // the second reading
pub span: Span,
}
Why a flat enum, not just &str?
The 12 kaeriten kinds form a closed set fixed by the spec — there
will never be a 13th. A KaeritenKind enum lets the renderer match
exhaustively (the compiler catches unhandled variants), and pins the
data-aozora-kaeriten attribute value to a stable slug rather than
the literal source character. That matters because the inline form
uses the actual 一 / 二 / 上 / … glyphs, which are also valid
plain text — the enum lets the AST distinguish “a 一 that’s a
kaeriten” from “the digit one in the running text”.
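A minimal sketch of that distinction — the variant names follow the AST shape above, but the slug strings and the from_char classifier are illustrative, not the crate’s real API:

```rust
// Closed set of kaeriten marks with stable slugs for the data attribute.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum KaeritenKind { Re, Ichi, Ni, San, Jou, Chuu, Ge, Kou, Otsu, Ten, Chi, Jin }

impl KaeritenKind {
    /// Classify a glyph as a kaeriten mark; the caller decides (from
    /// context) whether a 一 is a mark or the digit one in running text.
    fn from_char(c: char) -> Option<Self> {
        Some(match c {
            'レ' => Self::Re, '一' => Self::Ichi, '二' => Self::Ni, '三' => Self::San,
            '上' => Self::Jou, '中' => Self::Chuu, '下' => Self::Ge,
            '甲' => Self::Kou, '乙' => Self::Otsu,
            '天' => Self::Ten, '地' => Self::Chi, '人' => Self::Jin,
            _ => return None,
        })
    }

    /// Stable slug for the data-aozora-kaeriten attribute (illustrative).
    fn slug(self) -> &'static str {
        match self {
            Self::Re => "re", Self::Ichi => "ichi", Self::Ni => "ni", Self::San => "san",
            Self::Jou => "jou", Self::Chuu => "chuu", Self::Ge => "ge",
            Self::Kou => "kou", Self::Otsu => "otsu",
            Self::Ten => "ten", Self::Chi => "chi", Self::Jin => "jin",
        }
    }
}

fn main() {
    assert_eq!(KaeritenKind::from_char('レ'), Some(KaeritenKind::Re));
    assert_eq!(KaeritenKind::from_char('一').map(KaeritenKind::slug), Some("ichi"));
    assert_eq!(KaeritenKind::from_char('あ'), None);
}
```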
Diagnostics
| Code | Condition |
|---|---|
| W0007 | Kaeriten outside a 漢文-like context (lookahead heuristic) |
| E0009 | Bracketed kaeriten with no matching pair |
See also
- Notation overview — the orientation map for all the inline annotations.
Indent & align containers (字下げ)
Aozora Bunko uses paired [#ここから…] / [#ここで…終わり]
brackets to delimit blocks of text with custom layout. The five
families:
| Family | Opener | Closer | Effect |
|---|---|---|---|
| 字下げ (indent) | [#ここから2字下げ] | [#ここで字下げ終わり] | Indent every line by N full-width chars |
| 地付き (right-flush) | [#ここから地付き] | [#ここで地付き終わり] | Flush right (vertical: 地 = ground = bottom) |
| 地寄せ (right-align with margin) | [#ここから2字下げ、地寄せ] | [#ここで字下げ終わり] | Right-align with N-char inset |
| 字詰め (line-length) | [#ここから30字詰め] | [#ここで字詰め終わり] | Force a line length of N chars |
| 中央揃え | [#ここから中央揃え] | [#ここで中央揃え終わり] | Centre each line |
aozora parses every variant; the HTML renderer maps them to a
<div class="aozora-indent-N"> / aozora-align-end / etc. wrapper.
Single-line forms
Some directives apply only to the next single line and don’t need a closer:
[#地付き]平和への誓い
Renders as:
<div class="aozora-align-end">平和への誓い</div>
AST shape
pub struct Container<'src> {
pub kind: ContainerKind,
pub indent: Option<u8>, // 字 count for indent variants
pub form: ContainerForm, // SingleLine | Block
pub children: &'src [AozoraNode<'src>],
pub span: Span,
}
pub enum ContainerKind {
Indent,
AlignEnd,
AlignEndWithIndent,
LineLength,
Centre,
/// Composite: indent + align-end on a single block.
Composite { indent: u8, align: ContainerAlign },
/// Bouten / 縦中横 / 鎖線 / 罫囲み container forms.
Emphasis(EmphasisKind),
/// Spec-listed but not present in maintained corpus.
Unknown,
}
Why a small flat enum?
ContainerKind is closed by spec. A flat enum (vs a trait object
or string tag) gives the parser O(1) variant dispatch in the lexer’s
classify phase and the renderer’s HTML walk, and lets clippy’s
exhaustiveness check enforce that every variant has a render path.
The Composite variant is the one place we don’t extend the enum
horizontally — composite indent+align combinations would explode the
enum to ~30 variants, most of which never appear in real corpus. A
nested struct with a sub-enum keeps the variant count finite while
staying matchable.
large_enum_variant clippy lint: Container::Composite is the
largest variant at 4 bytes; the others are ≤ 2 bytes. The variant
data is tiny enough that boxing would add a pointer chase for no
real layout win — see the [workspace.lints.clippy] large_enum_variant = "allow" carve-out in Cargo.toml.
Composition
Containers nest:
[#ここから2字下げ]
通常の段落。
[#ここから地付き]
右寄せの行。
[#ここで地付き終わり]
通常に戻る。
[#ここで字下げ終わり]
Renders as nested divs:
<div class="aozora-indent-2">
通常の段落。
<div class="aozora-align-end">
右寄せの行。
</div>
通常に戻る。
</div>
Mismatched closers (e.g. [#ここから地付き] … [#ここで字下げ終わり])
fire diagnostic E0005 and the parser
auto-closes the offending opener at the closer’s position.
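That recovery step can be sketched against a simple open-container stack. The types and function below are illustrative, not the crate’s real internals:

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Kind { Indent, AlignEnd }

/// Pop the innermost opener. A `false` flag means the closer did not
/// match — the caller fires E0005 and the opener is auto-closed at the
/// closer's position. `None` means an unmatched close.
fn close_container(stack: &mut Vec<Kind>, closer: Kind) -> Option<(Kind, bool)> {
    let opened = stack.pop()?;
    Some((opened, opened == closer))
}

fn main() {
    // [#ここから地付き] … [#ここで字下げ終わり]
    let mut stack = vec![Kind::AlignEnd];
    assert_eq!(close_container(&mut stack, Kind::Indent), Some((Kind::AlignEnd, false)));
}
```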
Why containers, not stack-based push/pop tokens?
The spec describes these as opener / closer brackets, but the natural implementation in Rust is a recursive container node. That choice:
- Lets the renderer walk the tree once with a single match on ContainerKind, instead of maintaining a render-time stack.
- Surfaces shape errors (mismatched closers, dangling openers) at parse time — the lexer’s classify phase already has all the information to decide.
- Makes the canonical-serialise pass trivial (each container prints its opener, walks its children, prints its closer).
The trade-off is one extra heap touch per container — a single
bumpalo slice for children. The arena is already hot, so the cost
is negligible (bumpalo returns aligned pointers in O(1) bumps).
See also
- Architecture → Borrowed-arena AST — how container child slices are laid out in the arena.
- Diagnostics →
E0005— mismatched closer.
Page & section breaks (改ページ・改丁)
Aozora Bunko inherits print conventions for page-level structure. Four annotations split a work into pages, signatures, and openings:
| Notation | Renders as | Meaning |
|---|---|---|
| [#改ページ] | <div class="aozora-page-break"/> | Begin a new page |
| [#改丁] | <div class="aozora-page-break aozora-recto"/> | Begin a new recto (right-hand) page |
| [#改見開き] | <div class="aozora-page-break aozora-spread"/> | Begin a new two-page spread |
| [#改段] | <div class="aozora-section-break"/> | Section break (smaller than a page) |
All four are self-contained directives — no opener / closer pair, no inner content. They appear on their own line in the source.
AST shape
pub enum Break {
Page,
PageRecto, // 改丁
PageSpread, // 改見開き
Section, // 改段
}
pub struct BreakNode {
pub kind: Break,
pub span: Span,
}
Why distinct variants for each break flavour?
The four flavours render to identical HTML structure (an empty
<div>) but different class hooks. Collapsing them to a single
variant with a string tag would:
- Force the renderer to plumb the original notation through to the output, defeating the AST’s role as a normalised IR.
- Lose the type-system check that every break flavour has a render path — clippy’s exhaustiveness lint catches the bug at compile time.
- Make it impossible to count page breaks of a specific flavour at the AST level without a string match.
The 4-variant enum is a single discriminant byte — no real cost over the alternative.
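The size claim is easy to verify with a fieldless enum (this snippet mirrors the AST shape above purely for illustration):

```rust
// Four fieldless variants collapse into one discriminant byte.
#[allow(dead_code)]
#[derive(Clone, Copy)]
enum Break { Page, PageRecto, PageSpread, Section }

fn break_size() -> usize {
    std::mem::size_of::<Break>()
}

fn main() {
    assert_eq!(break_size(), 1);
}
```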
Composition with other annotations
Breaks unconditionally close any open inline annotation (ruby, bouten, tcy) at their line. They do not close container directives (字下げ, 地付き, etc.) — those persist across page boundaries, which matches print typography.
[#ここから2字下げ]
第一節
[#改ページ]
第二節 (still 2字下げ)
[#ここで字下げ終わり]
Diagnostics
| Code | Condition |
|---|---|
| W0008 | Page break inside a single-line container (drops the container) |
See also
- Indent containers — containers persist across breaks.
Diagnostics catalogue
aozora is non-fatal by design: the parser always produces a tree, even from malformed input, and reports problems through structured diagnostics that callers can choose to treat as errors. This page lists every diagnostic the lexer can emit.
Each diagnostic carries:
- A stable code (E0001, W0001, …). The number suffix is permanent across versions; codes are added but never renumbered.
- A level: Error, Warning, Info.
- A span (byte range in the source).
- A message in English.
- (optional) a help line suggesting a fix.
The CLI renders diagnostics through miette;
all bindings (Rust library, FFI JSON, WASM JSON, Python list) carry
the same structured data.
E-codes (errors)
E0001 — empty ruby reading
|青梅《》
The base text is given but the reading inside 《…》 is empty.
Fix: provide a reading or remove the | marker.
E0002 — nested ruby
|青梅《|お《お》うめ》
The spec disallows ruby inside ruby; the inner |…《…》 is
ambiguous. Fix: restructure so the readings are siblings, not
nested.
E0004 — mismatched bouten container closer
[#ここから傍点]…[#ここで傍線終わり]
The opener was a bouten variant; the closer was a bousen variant.
Fix: match the closer to the opener family (傍点終わり for any
点 variant; 傍線終わり for any 線 variant).
E0005 — mismatched container closer
[#ここから2字下げ]…[#ここで地付き終わり]
Different container kinds. The parser auto-closes the offending opener at the closer’s position. Fix: match opener and closer.
E0009 — bracketed kaeriten with no pair
有[#二]朋自遠方来 ([#一] missing)
The bracketed kaeriten form requires a paired closer. Fix: add
the matching [#一] (or remove the [#二]).
W-codes (warnings)
W0001 — tcy target not found
昭和27年生まれ[#「999」は縦中横]
The quoted run does not appear in the look-back window (current line + previous line, max 64 KiB). The directive is dropped. Fix: quote a run that actually appears in the source.
W0003 — bouten target ambiguous
平和平和[#「平和」に傍点]
Two candidate runs in the look-back window. The parser applies the
bouten to the most recent match (right-most in vertical / left-to-
right reading); W0003 flags the ambiguity for the author to
disambiguate.
W0006 — unresolved gaiji reference
The gaiji reference resolved to neither a Unicode codepoint nor a
JIS X 0213 entry, and no descriptive-name fallback applied. The
character is rendered as descriptive text in <span> brackets.
Fix: check the JIS triple, add the codepoint manually, or extend
the descriptive-name table.
W0007 — kaeriten outside 漢文 context
こんにちは レ
A kaeriten character (レ, 一, 上, …) appeared in a context
that doesn’t look like 漢文 (no preceding kanji run, surrounded by
kana). The parser still emits the kaeriten node but flags the
suspicious placement.
W0008 — break inside single-line container
[#地付き]right-flushed[#改ページ]
The page break terminates the single-line container before its implicit end-of-line closer. The container is dropped from the output.
W0010 — unrecognised container directive
The [#ここから…] directive matched no known container kind.
The parser emits a Container::Unknown and copies the directive
verbatim into the canonical-serialise output.
I-codes (info)
I0001 — accent decomposition applied
M[i!]cher → Micher
Reported once per source for each distinct ASCII digraph that the
sanitize phase decomposed. Off by default; enable with
--diagnostics info on the CLI.
Why a stable code, not just a message?
Two reasons.
- Test stability. The corpus sweep counts diagnostics by code to detect parser regressions. A test like “the corpus emits at most 12 W0006 warnings” is robust against message wording tweaks; a test that greps the message string breaks every localisation pass.
- Tool integration. Editors / LSPs / CI lints filter diagnostics by code (e.g. “treat E* as error, ignore W0010 for legacy files”). String matching there is fragile in practice.
The cost is a small lookup table (code → message); the win is
that diagnostics survive refactors and translation.
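A minimal sketch of the diagnostic shape the list above describes — the field and function names here are illustrative, not the crate’s real types:

```rust
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Level { Error, Warning, Info }

#[allow(dead_code)]
#[derive(Debug)]
struct Diagnostic {
    code: &'static str,      // stable across versions, e.g. "E0001"
    level: Level,
    span: (usize, usize),    // byte range in the source
    message: String,
    help: Option<&'static str>,
}

/// Filter by code prefix the way a CI lint would ("treat E* as error").
fn is_fatal(d: &Diagnostic) -> bool {
    d.code.starts_with('E')
}

fn main() {
    let d = Diagnostic {
        code: "E0001",
        level: Level::Error,
        span: (0, 12),
        message: "empty ruby reading".into(),
        help: Some("provide a reading or remove the | marker"),
    };
    assert!(is_fatal(&d));
}
```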
See also
- Architecture → Error recovery — what the parser does after each diagnostic fires (preserved output, dropped tokens, where the bytes go).
- Library Quickstart → Diagnostics
- CLI Quickstart → Diagnostics format
- Architecture → Seven-phase lexer — which phase emits which code.
Pipeline overview
aozora is a pure-functional parser: given the same input, the same
arena, and the same compile-time configuration, the output is
bit-for-bit identical. There are no thread-locals, no OnceCell
caches in the parse path, no environmental side effects. The only
state the parser owns is the arena and a string interner, both reset
per Document.
Three layers
flowchart TD
src["source text<br/>(UTF-8 or Shift_JIS)"]
decode["Shift_JIS decode<br/>(aozora-encoding)"]
lex["Lex<br/>(aozora-lex)<br/>sanitize → tokenize → pair → classify"]
tree["AozoraTree<'arena><br/>(borrowed AST)"]
render["Render<br/>(aozora-render)<br/>html / serialize"]
out["HTML / canonical 青空文庫 source"]
src --> decode --> lex --> tree --> render --> out
Each arrow is a pure function. The arena is threaded through lex;
nothing else holds state.
Crate dependency graph
flowchart TD
spec["aozora-spec<br/>shared types"]
encoding["aozora-encoding<br/>SJIS + 外字 PHF"]
scan["aozora-scan<br/>SIMD multi-pattern"]
veb["aozora-veb<br/>Eytzinger sorted-set"]
syntax["aozora-syntax<br/>AST node types"]
lexer["aozora-lexer<br/>7-phase classifier"]
lex["aozora-lex<br/>fused orchestrator"]
render["aozora-render<br/>html / serialize"]
facade["aozora<br/>public facade"]
cli["aozora-cli"]
ffi["aozora-ffi"]
wasm["aozora-wasm"]
py["aozora-py"]
spec --> encoding
spec --> scan
spec --> veb
spec --> syntax
encoding --> syntax
scan --> lexer
veb --> lexer
syntax --> lexer
lexer --> lex
lex --> render
render --> facade
facade --> cli
facade --> ffi
facade --> wasm
facade --> py
aozora-spec is the foundation — every other crate depends on it.
The dependency graph forms a strict DAG; cargo itself rejects cyclic
crate dependencies, and the intended layering is enforced by the cargo
metadata check in just lint.
What each layer does
Sanitize → Tokenize → Pair → Classify
The lexer pipeline is split into four sub-phases because each stage has a different cost / cache profile:
| Sub-phase | Input | Output | Why separate |
|---|---|---|---|
| Sanitize | raw &str | normalised &str | BOM / CRLF / accent-decomposition / PUA assignment all happen here, once, before any expensive lookahead. Keeps later phases linear-time. |
| Tokenize | normalised &str | trigger offsets | SIMD scanner fires here; finds every | 《 》 ※ [ ] byte. |
| Pair | trigger offsets | balanced (open, close) pairs | Bracket matching only; no semantic interpretation. |
| Classify | pairs + slices | AozoraNode<'_> stream | Decides “is this [#…] an indent opener, a bouten directive, a tcy directive, …”. |
Splitting them lets the parser ship two surface APIs without code duplication:
- lex_into_arena() — fused, allocates one tree.
- Per-phase calls — used by the bench harness’s phase_breakdown probe (and the aozora-lexer integration tests for spec conformance).
Sanitize details
Phase 0 sanitize covers:
- BOM strip — UTF-8 and UTF-16 BOMs (rare in source, but real).
- CRLF normalisation — CRLF → LF.
- Rule isolation — separates inline ※[#…] from following text so the tokenizer has unambiguous boundaries.
- Accent decomposition — 114 ASCII digraphs / ligatures → Unicode (see Gaiji).
- PUA assignment — gaiji references get private-use codepoints inline so the tokenizer can treat them as single-character tokens without re-parsing the ※[#…] body.
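The first two sanitize steps are mechanical; a hedged sketch (the real phase fuses all five steps into one pass, and the function name here is illustrative):

```rust
/// Minimal sketch of BOM strip + CRLF → LF normalisation. Rule isolation,
/// accent decomposition, and PUA assignment are elided.
fn sanitize(src: &str) -> String {
    let src = src.strip_prefix('\u{FEFF}').unwrap_or(src);
    src.replace("\r\n", "\n")
}

fn main() {
    assert_eq!(sanitize("\u{FEFF}一行目\r\n二行目"), "一行目\n二行目");
}
```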
Tokenize: SIMD scan
Trigger byte detection runs the SIMD multi-pattern scanner. Three backends:
- Teddy (Hyperscan-style packed-pattern via aho-corasick) on x86_64 with AVX2.
- Hoehrmann-style multi-pattern DFA (regex-automata engine) as the portable fallback.
- Memchr-based for wasm32 until wasm-simd lands in the workspace.
See Architecture → SIMD scanner backends for the selection logic and what each backend looks like in samply.
Pair → Classify
Bracket matching is a single linear-time stack walk over the trigger
offsets. Classify then does the actual recognition: each opener
type has a recogniser registered under
aozora-lexer::recognisers::*. The recognisers run in deterministic
order (see Architecture → Seven-phase lexer).
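The Pair walk can be sketched as a single pass over Tokenize’s trigger offsets. The trigger representation here is hypothetical — the multi-byte triggers (《, 》, ※) are elided since they don’t pair in this sub-phase:

```rust
/// One linear stack walk over trigger offsets. Unmatched closers are left
/// for Classify, which lets the bytes flow through as plain text.
fn pair_brackets(triggers: &[(usize, u8)]) -> Vec<(usize, usize)> {
    let mut stack = Vec::new();
    let mut pairs = Vec::new();
    for &(offset, byte) in triggers {
        match byte {
            b'[' => stack.push(offset),
            b']' => {
                if let Some(open) = stack.pop() {
                    pairs.push((open, offset)); // innermost pairs close first
                }
            }
            _ => {} // non-bracket triggers don't pair here
        }
    }
    pairs
}

fn main() {
    // Nested brackets: inner pair is emitted before the outer one.
    assert_eq!(pair_brackets(&[(0, b'['), (2, b'['), (4, b']'), (6, b']')]),
               vec![(2, 4), (0, 6)]);
}
```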
Render
Two render walkers:
- html::render_to_string — a single O(n) tree walk emitting semantic HTML5 with aozora-* class hooks.
- serialize::serialize — re-emits canonical 青空文庫 source.
Both are pure functions; both allocate exactly the output buffer and nothing else.
What the pipeline does not do
No tree mutation between layers. No optimisation passes. No
“resolver” stage that mutates the AST. The lexer produces the
final tree; the renderer consumes it; that’s it. This is the same
shape as a functional reactive pipeline, and it’s what lets the
borrowed-arena AST (next chapter) work without RefCell or
UnsafeCell.
See also
- Borrowed-arena AST — what AozoraTree<'arena> actually points at.
- Seven-phase lexer — the inside of the Lex box.
- Crate map — every crate, its purpose, what depends on what.
Borrowed-arena AST
AozoraTree<'a> is not an owned tree. It’s a borrow into two
things owned by Document:
- the source Box<str>,
- a bumpalo::Bump arena that holds every intermediate node and child slice.
flowchart LR
subgraph Document
src["Box<str> source"]
bump["bumpalo::Bump arena"]
end
tree["AozoraTree<'a>"]
walk["render / serialize / iterate"]
src -.borrows.-> tree
bump -.borrows.-> tree
tree --> walk
When the Document drops, the source Box<str> and the arena’s
single backing buffer drop in two free() calls — every node, every
container, every interned string releases together. There is no
per-node destructor and no walk-the-tree-to-free pass.
Why an arena and not Box<Node> everywhere?
The naive Rust shape — enum Node { Ruby { target: String, … }, … }
— would allocate per node, per String, per Vec<Node> for
container children. For a typical Aozora Bunko work (~500 KiB
source, ~50 000 nodes) that’s:
- ~50 000 individual heap allocations,
- ~50 000 individual frees on drop (each a round-trip through the heap allocator’s free list),
- 16+ bytes of allocator metadata per allocation,
- random-access fragmentation that defeats prefetch.
The arena variant produces:
- ~16 bump allocations (4 KiB pages, refilled on overflow),
- 1 free on drop (Bump::reset returns the pages to the OS; in practice the pages are reused via the system allocator’s page cache),
- Sequential layout: nodes that were lexed near each other live near each other in memory, which is exactly the order the renderer walks them.
Measured on the corpus sweep: the arena variant
parses 6.4× faster than the equivalent Box<Node> shape, and the
peak RSS is 30% lower. The win is cumulative — every binding
(CLI / WASM / FFI / Python) inherits it.
Why bumpalo over typed-arena, slotmap, or hand-rolled?
| Crate | Shape | Why aozora doesn’t use it |
|---|---|---|
typed-arena | One arena per type (Arena<Ruby>, Arena<Bouten>, …) | aozora has 30+ node types; managing 30 arenas is operationally awkward and forces lifetime-bound &'a per type. |
slotmap | Index-keyed nodes; arena owns; access via SlotMap::get | Adds an indirection (key → slot → node) on every walk, regressing render throughput by ~25% on the bench harness. Also forces Copy keys, which for variable-length text fields means re-interning. |
id-arena / index_vec | Index-typed, &str borrowing | Same indirection cost as slotmap. |
| Hand-rolled bump | Custom; tightest control | Correct, but bumpalo is already a stable, mainstream, allocator-aware bump arena with bumpalo::collections::Vec for child slices. Reinventing wins nothing. |
bumpalo | Single arena, type-erased; allocate any T with bump.alloc(T) | One arena per Document; allocate-then-borrow gives &'a T for the lifetime of the arena. Matches aozora’s “one arena per Document” need exactly. |
bumpalo’s collections::Vec<'bump, T> (used for container child
slices) is Vec-shaped but allocated inside the arena — child
slices get the same arena lifetime as the parent without a separate
allocation strategy.
How the AST shape interacts with the lifetime
pub enum AozoraNode<'src> {
Plain(&'src str),
Ruby(Ruby<'src>),
Bouten(Bouten<'src>),
Tcy(Tcy<'src>),
Gaiji(Gaiji<'src>),
Container(&'src Container<'src>), // boxed in the arena
BreakNode(BreakNode),
// … 30+ variants
}
The 'src lifetime is the arena lifetime (re-using 'src because
all node text borrows from the source buffer, which lives at least
as long as the arena). Each variant either:
- holds a &str slice into the source (zero copy), or
- is a small Copy struct (BreakNode, Saidoku, …), or
- is &'src Container<'src> — boxed in the arena because Container itself contains a &'src [AozoraNode<'src>] child slice.
The whole AozoraNode is Copy (it’s a tagged union of references
and small primitives), so iterating the tree never needs & — just
deref the reference, copy the node, walk on.
What you trade
The big trade-off: you can’t outlive the Document. A
Vec<AozoraNode<'_>> doesn’t compile because the '_ lifetime is
bound to the arena, which is bound to the Document.
In practice this rarely matters — consumers either:
- Render the tree immediately and discard it (tree.to_html() returns String, which has no lifetime tie).
- Walk the tree once and emit their own owned IR (most editor backends do this).
- Hold the Document itself across function boundaries and re-derive the tree on the inside.
For consumers that genuinely need an owned tree, aozora::owned
(planned for v0.3) will provide a walk helper that builds a
Vec<OwnedNode> from a tree pass. We resist shipping it pre-1.0
because the conversion is trivial and shipping a built-in owned
version would push consumers toward it even when they don’t need it.
Lifetime safety
The 'src parameter prevents these shapes at compile time:
fn bad() -> AozoraTree<'static> {
let doc = aozora::Document::new("…".into());
doc.parse() // ERROR: cannot return value referencing local
}
Borrow-checker enforcement; no runtime Drop ordering bugs possible.
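The same borrow discipline can be sketched std-only. The `Doc` / `Tree` pair below is a hypothetical stand-in, not aozora's API — it shows why owned outputs escape freely while the borrowed tree cannot outlive its owner:

```rust
// Std-only analogue of the Document / AozoraTree lifetime pattern.
// `Doc`, `Tree`, and `render` are illustrative names, not aozora's API.
struct Doc {
    source: String,
}

struct Tree<'a> {
    // Borrows Doc's buffer, like AozoraNode's &'src str slices.
    nodes: Vec<&'a str>,
}

impl Doc {
    fn parse(&self) -> Tree<'_> {
        Tree { nodes: self.source.split_whitespace().collect() }
    }
}

// Owned output: no lifetime tie, free to outlive the Doc.
fn render(tree: &Tree<'_>) -> String {
    tree.nodes.join("|")
}
```

Returning `Tree<'_>` from a function that owns its `Doc` fails to compile, exactly as in the `bad()` example above; returning the `String` from `render` always works.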
See also
- Pipeline overview — where the arena is created.
- Crate map — aozora-syntax defines the node types; aozora-lex does the allocation.
Seven-phase lexer
aozora-lexer runs as seven distinct phases, each a pure function
on the previous phase’s output. The split exists because each phase
has a different cost profile — separating them keeps the dominant
hot path (Phase 2 tokenize) tight, and lets the bench harness measure
each phase independently via the phase_breakdown probe.
Phase ordering
flowchart LR
p0["Phase 0<br/>sanitize"]
p1["Phase 1<br/>scan triggers"]
p2["Phase 2<br/>tokenize"]
p3["Phase 3<br/>classify"]
p4["Phase 4<br/>pair containers"]
p5["Phase 5<br/>resolve targets"]
p6["Phase 6<br/>diagnostics"]
p0 --> p1 --> p2 --> p3 --> p4 --> p5 --> p6
Each arrow carries a small data structure (offsets, slices, AST nodes); no phase reads back into a previous phase’s output.
| Phase | Input | Output | What it does |
|---|---|---|---|
| 0 — Sanitize | raw &str | normalised &str | BOM strip, CRLF → LF, accent decomp, PUA assignment for gaiji refs |
| 1 — Scan | normalised &str | trigger offsets &[Trigger] | SIMD multi-pattern scan for |《》※[] |
| 2 — Tokenize | normalised &str + offsets | &[Token] | Slice the source at trigger boundaries; classify each slice as Plain / Open / Close / RefMark |
| 3 — Classify | &[Token] | &[ClassifiedToken] | Recogniser registry decides what each [#…] body actually is |
| 4 — Pair | &[ClassifiedToken] | &[Container] | Bracket matching: openers ↔ closers, build container tree |
| 5 — Resolve | &[Container] + source | AozoraTree<'_> | Look-back resolution for bouten / tcy targets, tie inline annotations to AST nodes |
| 6 — Diagnostics | AozoraTree<'_> + accumulator | Diagnostics | Collect diagnostics from earlier phases, sort by span, pin codes |
Phase 0: sanitize
The most varied phase by what it touches. Sub-passes:
- bom_strip — UTF-8 / UTF-16 BOM detection and removal.
- crlf — CRLF → LF in one memchr2 pass.
- rule_isolate — separate inline ※[#…] from following text so the tokenizer has unambiguous boundaries.
- accent — 114 ASCII digraph / ligature decomposition (see Notation → Gaiji).
- pua_scan — assign each ※[#…] reference a private-use codepoint inline so subsequent phases treat it as a single character.
Each sub-pass is independent; phase0_breakdown probe measures them
separately. In the corpus sweep, pua_scan dominates Phase 0 (60%
of phase wall time on average) because it has to scan the whole
document for ※[#…] — the SIMD scanner from Phase 1 isn’t yet active.
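The semantics of the first two sub-passes fit in a few lines. This is a simplified sketch (the real crlf pass locates `\r` bytes with memchr2 instead of a substring replace, and the real code is allocation-conscious):

```rust
// Simplified bom_strip + crlf sub-passes of Phase 0 sanitize.
// Sketch only: the production passes are memchr2-driven and avoid
// allocating when the input is already clean.
fn sanitize_basic(src: &str) -> String {
    let src = src.strip_prefix('\u{FEFF}').unwrap_or(src); // bom_strip
    src.replace("\r\n", "\n")                              // crlf
}
```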
Phase 1: scan triggers
The hot path. SIMD multi-pattern scan for the seven trigger bytes:
| 《 》 ※ [ ] (full-width space)
The chosen scanner backend (Teddy, Hoehrmann DFA, memchr-based)
produces a &[Trigger] of byte offsets. See
SIMD scanner backends for the selection logic.
Throughput on a typical mid-size work (crime_and_punishment.txt,
~600 KiB UTF-8): ~12 GB/s on Teddy, ~3.5 GB/s on the DFA fallback.
Both are well above the rest of the pipeline’s throughput — Phase 1
is essentially free at the corpus level.
Phase 2: tokenize
Slice the source at trigger boundaries and classify each slice:
pub enum Token<'src> {
Plain(&'src str),
Open(OpenKind, Span),
Close(CloseKind, Span),
RefMark(Span), // ※ in isolation
}
Single linear pass over the trigger array; no allocation outside the
output Vec (which is sized exactly from the trigger count).
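The slicing step can be sketched over a bare offset list. This is a simplification — real Phase 2 tokens carry OpenKind / CloseKind and spans, and the classification of each segment's leading character is omitted here:

```rust
// Cut the source into segments that each begin at a trigger offset.
// `triggers` must be sorted, ascending byte offsets on char boundaries.
fn slice_at<'src>(src: &'src str, triggers: &[usize]) -> Vec<&'src str> {
    let mut out = Vec::with_capacity(triggers.len() + 1);
    let mut prev = 0;
    for &off in triggers {
        if off > prev {
            out.push(&src[prev..off]); // plain run before the trigger
        }
        prev = off;
    }
    if prev < src.len() {
        out.push(&src[prev..]); // tail segment from the last trigger
    }
    out
}
```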
Phase 3: classify
The most code-heavy phase. The classifier registry has one
recogniser per [#…] directive family:
- RubyRecogniser
- BoutenRecogniser
- TcyRecogniser
- IndentRecogniser
- AlignRecogniser
- LineLengthRecogniser
- BreakRecogniser
- KaeritenRecogniser
- … 17 in total
The recognisers run in deterministic order; the first recogniser
that matches the directive body wins. Order matters because some
directive bodies are valid prefixes of others (e.g. ここから2字下げ
is a valid prefix of ここから2字下げ、地寄せ). Compile-time tests
in aozora-lexer enforce ordering invariants.
The recognisers themselves are short (most are < 50 LOC) — the bulk
of classify cost is the phf::Map of directive prefixes the
recognisers share for opener detection.
Phase 4: pair
Bracket matching. Walk the classified token stream, push openers
onto a stack, pop on closers, fail if mismatched. The output is a
tree of Container<'_> nodes whose children are flat &[Token<'_>]
slices.
Single linear pass; the stack is a SmallVec<[ContainerKind; 8]> so
it stays on the stack for typical 1–4 deep nesting.
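The pairing walk can be sketched with simplified kinds. In the real lexer the stack is a SmallVec and a mismatch becomes a diagnostic rather than an `Err`; the types here are illustrative:

```rust
#[derive(Clone, Copy, PartialEq)]
enum ContainerKind { Indent, Align }

enum Tok { Open(ContainerKind), Close(ContainerKind), Text }

// Push openers, pop on closers, reject mismatches and leftovers.
fn pair_containers(tokens: &[Tok]) -> Result<(), &'static str> {
    let mut stack: Vec<ContainerKind> = Vec::new(); // SmallVec<[_; 8]> in aozora
    for tok in tokens {
        match tok {
            Tok::Open(k) => stack.push(*k),
            Tok::Close(k) => match stack.pop() {
                Some(open) if open == *k => {}
                _ => return Err("mismatched closer"),
            },
            Tok::Text => {}
        }
    }
    if stack.is_empty() { Ok(()) } else { Err("unclosed container") }
}
```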
Phase 5: resolve
Bouten / tcy targets quote-by-look-back: the directive [#「平和」に傍点]
applies to the most recent 平和 in the preceding text. Phase 5
walks the container tree and resolves these references.
Pre-Phase-5 the tree carries unresolved BoutenRef { target: "平和" }
nodes; post-Phase-5 they’re Bouten { target_span: Span } pointing
at the actual matched run. The resolver uses an aho-corasick DFA
over the live target strings — single-pass over the source, no
recogniser-order dependencies.
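For a single target, the look-back amounts to an `rfind` over the text before the directive. The helper below is hypothetical — the real resolver batches all live targets into one aho-corasick pass instead of searching per directive:

```rust
// Resolve a [#「target」に傍点]-style reference: find the last
// occurrence of `target` before the directive. Byte offsets throughout;
// `directive_at` must lie on a char boundary.
fn resolve_lookback(source: &str, directive_at: usize, target: &str) -> Option<(usize, usize)> {
    let start = source[..directive_at].rfind(target)?;
    Some((start, start + target.len()))
}
```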
Phase 6: diagnostics
Collect, sort by span, pin codes. Diagnostics emitted in earlier
phases were buffered in a DiagnosticAccumulator threaded through
the call stack; Phase 6 sorts them and assigns the stable error
codes (E0001, W0001, …).
Why seven phases, not one big function?
Three reasons.
- Bench-driven optimisation. The phase_breakdown probe reports per-phase wall time per corpus document. Knowing that “this document spends 80% of parse time in Phase 3 classify” tells you exactly where to focus a perf PR. A monolithic lex() would force you to re-instrument every PR.
- Spec compliance. Each phase corresponds to a discrete transformation that the spec describes. If a spec gap shows up in production, it almost always lands in one phase, and the test harness can pin a regression test that exercises that phase only.
- Composability. aozora-lexer exposes both the fused lex_into_arena and the per-phase calls. The fused version is what aozora-lex ships to consumers; the per-phase calls are what the bench harness and integration tests use to isolate regressions.
The cost is conceptual (more API surface internal to the lexer); the win is that every perf decision in the parser has a measurement attached.
See also
- Pipeline overview — how the lexer fits into the parse layer.
- SIMD scanner backends — Phase 1 in detail.
- Performance → Profiling with samply — how to measure the per-phase cost on your own workload.
SIMD scanner backends
Phase 1 of the lexer is a multi-pattern byte scan: find every
occurrence of the seven trigger bytes (|《》※[] ) in the
source. On a typical Japanese corpus document — where every
codepoint is a 3-byte UTF-8 sequence and no trigger byte appears
more than once per kilobyte — the scan dominates the interpretation
by an order of magnitude. So this is the place where SIMD pays for
itself.
aozora-scan ships three backends, one of which is selected per
target at compile time:
| Backend | Target | Throughput (corpus) | Selection |
|---|---|---|---|
| Teddy | x86_64 + AVX2 | ~12 GB/s | first choice when AVX2 is available |
| Hoehrmann DFA | portable | ~3.5 GB/s | x86_64 fallback, native arm64, etc. |
| Memchr-multi | wasm32 | ~1.2 GB/s | wasm32 until the SIMD proposal lands |
Each backend produces the same (offset, TriggerKind) stream; the
lexer cannot tell which one ran. Selection happens behind a
runtime-dispatched trait so a single binary can carry both the SIMD
fast path and a portable fallback.
Backend 1: Teddy (Hyperscan-style packed)
Teddy is the small-string multi-pattern algorithm from Intel’s
Hyperscan. The
aho-corasick crate ships a packed::teddy implementation that
aozora calls into directly.
Why Teddy here:
- The trigger set is small (7 patterns) and short (1 char each — 1 byte for the ASCII triggers, 3 bytes for the full-width ones in UTF-8). Teddy’s regime is exactly N small patterns where N ≤ 64 — ours has 7.
- The patterns share no common prefix structure (they’re distinct full-width punctuation), so a Boyer-Moore-style suffix table doesn’t help.
- AVX2 lets Teddy compare 32 bytes per cycle against the packed shuffle table, and our patterns fit cleanly into that lane width.
Why not just memchr-multi (the obvious upgrade):
memchr3 does scan for up to 3 distinct bytes simultaneously — but
covering all seven patterns (four of them 3-byte UTF-8 sequences)
would take multiple separate passes, each streaming the whole
source. Teddy does one pass for all seven patterns. The arithmetic
favours Teddy by ~3.5×.
Why not memchr’s own packed-pattern path:
memchr does have a packed multi-pattern API now, but it tops out
at ~5 GB/s on our workload because it goes through a generic 16-byte
SSE2 lane. Teddy’s AVX2 32-byte lane — combined with aho-corasick’s
shuffle-table compilation — wins on the corpus by 2.5×.
Backend 2: Hoehrmann-style multi-pattern DFA
For targets that lack AVX2 (older x86_64, native arm64 on some
runners, Alpine builds) the fallback is a byte-DFA built by
regex-automata’s dense::Builder. Hoehrmann’s design — single-byte
transitions, no backtracking, table-driven — gives O(1) per byte
with no SIMD requirement.
Why Hoehrmann-style over Aho-Corasick textbook NFA:
Aho-Corasick at runtime is an NFA with failure transitions; each mismatched byte may walk a chain of failure links before consuming the next input byte. Hoehrmann compiles those failure links into the dense table at build time, so every byte consumes exactly one table lookup. For a small pattern set that fits in cache, the dense table is faster than the NFA representation by 2×.
Why a DFA over a hand-rolled state machine:
regex-automata gives us a battle-tested table compiler with
correctness guarantees (panics from malformed transitions are
impossible) and the same crate handles the build-time DFA →
serialised-table flow if we ever want to ship the table as a static
asset. Hand-rolling buys nothing here — the patterns are small
enough that the compiler-emitted code generation isn’t the bottleneck.
Backend 3: memchr-multi (wasm32)
wasm32-unknown-unknown doesn’t yet have AVX2 (and even after
wasm-simd lands, the lane width is 16 bytes — which would put it
between Teddy and the DFA). Until the workspace targets wasm-simd,
the wasm build uses memchr’s portable multi-pattern path:
- memchr3 for the three single-byte triggers (|, [, ]),
- a follow-up scan for the multi-byte 《, 》, ※, and full-width-space sequences (3 bytes each in UTF-8).
Throughput is lower (~1.2 GB/s) but the WASM bundle stays small —
no need to ship a Teddy table or a regex-automata DFA in the
500 KiB-budgeted wasm artifact.
Backend selection
pub fn best_scanner_name() -> &'static str {
    // wasm32 is checked first: is_x86_feature_detected! only exists on
    // x86 targets, so the AVX2 probe must be compile-time gated.
    if cfg!(target_arch = "wasm32") {
        return "memchr-multi";
    }
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            return "teddy";
        }
    }
    "hoehrmann-dfa"
}
Runtime detection (not compile-time cfg!) so a single x86_64
binary works on AVX2-less CPUs without recompilation.
The dispatch goes through a &'static dyn Scanner trait object;
the indirect call is hoisted out of the inner loop in the lexer’s
Phase 2, so the trait dispatch is paid once per Document::parse,
not per byte.
Why a runtime dispatch over per-target binaries?
Two reasons.
- Distribution. Shipping one binary that adapts to its host is
simpler than shipping
aozora-x86_64-avx2andaozora-x86_64separately. The release pipeline only has to manage three archives (linux-gnu, darwin-arm64, windows-msvc), not six. - Container portability.
docker run --platform linux/amd64on an arm64 Mac (Rosetta) lands on x86_64 without AVX2 — runtime detection picks the DFA backend silently. A compile-time-only build would crash withSIGILLon first trigger byte.
The cost is a single indirect call per parse; the win is that the distribution surface stays minimal.
Verifying the scanner is firing
println!("{}", aozora_scan::best_scanner_name());
// "teddy" | "hoehrmann-dfa" | "memchr-multi"
Or under samply, look for one of:
- aozora_scan::backends::teddy::scan_offsets — Teddy is firing.
- aozora_scan::backends::dfa::scan_offsets — Hoehrmann fallback.
- memchr::arch::*::scan — memchr’s own internal SIMD; the scalar / wasm path is firing.
See Performance → Profiling with samply for the full workflow.
See also
- Pipeline overview
- Seven-phase lexer — Phase 1 fits in here.
Eytzinger sorted-set lookup
aozora-veb is a no_std crate that provides one data structure: a
sorted-set lookup over a static byte slice, laid out in
Eytzinger order so that the binary search is cache-friendly. It
backs the placeholder registry the lexer uses to recognise the
fixed-set strings inside [#…] directives (“ここから”, “ここで”,
“傍点”, “傍線”, “字下げ”, …).
flowchart LR
needle["needle: &str"]
table["Eytzinger-laid sorted set<br/>(static &[&str])"]
cmp["compare at index, branch left/right"]
found["Some(idx) | None"]
needle --> cmp
table --> cmp
cmp --> cmp
cmp --> found
What is Eytzinger order?
A standard sorted array stores elements in their natural order:
[a, b, c, d, e, f, g]. Binary search visits indexes
3, 1 or 5, 0/2/4/6 — accesses that are spatially distant in
memory. On modern CPUs that’s a cache miss per level past L1.
Eytzinger order stores the same elements in implicit-binary-tree
order: the root at index 1 (index 0 is reserved as a sentinel),
left child at 2i, right child at 2i+1. The walk visits indexes
1, 2 or 3, 4/5/6/7 — the hot top levels of the tree pack into a few adjacent cache lines.
For 256+ entries the cache-line packing is a measured 2–3× speedup
over std::slice::binary_search on the same data. Below 64 entries
the difference is in the noise (everything fits in one cache line).
The placeholder registry has ~120 entries — well into Eytzinger’s
favourable regime.
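The layout and lookup can be sketched std-only, over `u32` keys for brevity (aozora-veb works over `&str` entries and builds the permutation at compile time via its macro):

```rust
// In-order fill of the implicit tree turns a sorted slice into
// Eytzinger order: root at index 1, children of i at 2i and 2i+1.
fn eytzinger_fill(sorted: &[u32], out: &mut [u32], i: usize, pos: &mut usize) {
    if i < out.len() {
        eytzinger_fill(sorted, out, 2 * i, pos);     // left subtree first
        out[i] = sorted[*pos];
        *pos += 1;
        eytzinger_fill(sorted, out, 2 * i + 1, pos); // then right subtree
    }
}

// Binary search over the implicit tree: each step moves to index
// 2i or 2i + 1, so the top levels stay within the same cache lines.
fn eytzinger_contains(table: &[u32], needle: u32) -> bool {
    let mut i = 1; // index 0 is the sentinel
    while i < table.len() {
        if table[i] == needle {
            return true;
        }
        i = 2 * i + usize::from(needle > table[i]);
    }
    false
}
```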
Why this and not phf::Set?
phf::Set is a perfect-hash table: O(1) lookup, but with a real
constant — one hash computation, one table probe, one strcmp. For
short strings (the placeholder registry’s median is 4 chars) the
hash dominates, and the table probe is a pointer chase to a separate
allocation.
Eytzinger search is log N — but for N=120 that’s 7 comparisons,
all in one contiguous slice, no hashing, no separate allocation.
Measured: Eytzinger is ~1.5× faster than phf::Set on this
workload.
For larger sets (the gaiji table at ~14 000 entries),
phf::Set wins — log₂(14000) is 14 comparisons and the cache
locality stops mattering. The choice is entry-count-dependent.
The aozora codebase uses Eytzinger for sub-256-entry tables and
phf::Set for larger ones; the cutoff was determined empirically.
Why not a hash table?
A HashMap<&str, ()> allocates and rehashes; phf and Eytzinger
don’t. In the lexer’s Phase 3 classify, the placeholder registry
is hit once per [#…] directive — measured as ~5 lookups per
KB of source. A HashMap’s startup cost (build the table from a
const array on first use, even with OnceLock) would dominate
the parser’s per-Document::parse cost on tiny inputs.
API
pub struct EytzingerSet<'a> {
entries: &'a [&'a str], // already in Eytzinger order
}
impl<'a> EytzingerSet<'a> {
pub const fn new(entries: &'a [&'a str]) -> Self { Self { entries } }
pub fn contains(&self, needle: &str) -> bool { … }
pub fn position(&self, needle: &str) -> Option<usize> { … }
}
new is const fn so registries are computed at compile time and
end up in .rodata. Lookup is a single function with no allocation.
Building the order
The crate ships a build-time helper that takes a sorted slice and produces the Eytzinger permutation:
const PLACEHOLDERS: &[&str] = aozora_veb::eytzinger_layout!(
"ここから", "ここで", "傍点", "傍線", "字下げ", …
);
The macro is const-evaluated; the resulting slice is what
EytzingerSet::new takes.
Why a separate crate?
The lookup is no_std and has no aozora-specific dependencies. By
extracting it, three things become true:
- The lexer can depend on aozora-veb without pulling in any workspace state, which keeps aozora-veb’s test surface small.
- aozora-veb can be reused by aozora-encoding (for the accent decomposition table) and by aozora-bench (for category slug lookups in the trace rollup) without forming a circular dependency.
- Future consumers can depend on just aozora-veb for the data structure, without taking the whole parser.
See also
- Crate map — aozora-veb is the foundation crate with no internal deps.
- Performance → Benchmarks — the Eytzinger vs phf cutoff measurement.
Shift_JIS + 外字 resolver
aozora-encoding covers the full source-decoding stack:
- Shift_JIS / Shift_JIS-2004 / cp932 byte stream → UTF-8 string.
- JIS X 0213 plane-2 ideographs → Unicode (where possible).
- 外字 references (※[#…]) → resolved Unicode codepoint, JIS triple, or descriptive-text fallback.
- Accent decomposition (114 ASCII digraph / ligature → Unicode).
All four are pure functions; the crate has no global state and nothing that varies per-call.
Decode chain
flowchart TD
raw["raw bytes<br/>(SJIS-encoded .txt from Aozora Bunko)"]
sjis["encoding_rs::SHIFT_JIS<br/>or aozora-specific JIS X 0213 patch"]
utf8["UTF-8 String"]
sanitize["Phase 0 sanitize<br/>(in aozora-lexer)"]
pua["PUA assignment for 外字"]
classified["normalised &str ready for Phase 1 scan"]
raw --> sjis --> utf8 --> sanitize --> pua --> classified
The Shift_JIS decode itself uses encoding_rs
— the same crate Firefox uses for HTML decoding. Battle-tested,
SIMD-accelerated, and handles every Shift_JIS variant Aozora Bunko
sources have used since the 1990s. We add a thin patch layer for
JIS X 0213 plane-2 codepoints that encoding_rs’s strict cp932
mapping doesn’t cover (Aozora’s spec extends Shift_JIS into JIS
X 0213 territory; encoding_rs keeps the strict cp932 surface).
外字 (gaiji) PHF table
The reference table contains ~14 000 entries:
static GAIJI_TABLE: phf::Map<&'static str, GaijiEntry> = phf_map! {
    "1-94-37" => GaijiEntry::JisX0213 { plane: 1, row: 94, cell: 37, codepoint: '鰤' },
"U+5F85" => GaijiEntry::Direct { codepoint: '待' },
"魚+師のつくり" => GaijiEntry::Description { fallback: "[魚+師]" },
…
};
Why PHF (perfect hash function):
- The table is large enough (~14 000 entries) that linear scan or Eytzinger search would dominate the lookup cost.
- It’s static and known at compile time — the perfect hash is computable once.
phfproduces zero-allocation, zero-comparison-on-collision lookups. The hash is onewyhashround; the probe is one slice index; the comparison is one strcmp. ~25 ns per lookup on the bench harness.
Why not OnceLock<HashMap>:
- First-call cost: building a HashMap<&str, GaijiEntry> from 14 000 entries on first use takes ~5 ms. That’s longer than parsing a small document end-to-end.
- Memory: the runtime HashMap takes 2–3× the size of the static PHF (load-factor padding + RawTable metadata).
- Concurrency: OnceLock adds an atomic load on every access, even after initialisation. PHF is static — no synchronisation.
Why not load from a JSON / TOML asset:
- Adds startup cost on every Document::new (for small inputs, the file I/O alone is on the order of the parser’s whole runtime budget).
- Forces every binding (CLI / WASM / FFI / Python wheel) to ship the asset as a separate file, complicating distribution.
- Defeats dead-code elimination: the linker can’t strip entries the consumer’s input never references.
The build-time cost of compiling the PHF (~40 s the first time, 0 s incremental) is paid once per workspace build, not per-invocation.
Resolution order
pub fn resolve(reference: &str) -> Resolved {
// 1. Direct codepoint (U+XXXX) wins outright.
if let Some(c) = parse_unicode_form(reference) { return Resolved::Direct(c); }
// 2. JIS X 0213 plane-row-cell triple.
if let Some(triple) = parse_jis_triple(reference) {
if let Some(c) = JIS_TABLE.get(&triple) { return Resolved::Lookup(c); }
}
// 3. Descriptive name lookup (curated subset).
if let Some(fallback) = DESCRIPTION_TABLE.get(reference) {
return Resolved::Fallback(fallback);
}
Resolved::Unknown
}
Three layers, in order. Direct wins because the source author
explicitly wrote a Unicode codepoint — overriding it would be
wrong even if our JIS table disagreed. Lookup is the common case.
Fallback is the curated subset of characters that have no Unicode
codepoint at all (~120 entries from the 14 000); we ship a
descriptive-text rendering rather than dropping the character.
Unknown fires diagnostic W0006.
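Layer 1 — the explicit U+XXXX form — can be sketched std-only. This is a loose sketch of the `parse_unicode_form` helper from the snippet above; the real function also validates the hex digit count:

```rust
// Layer 1 of resolve(): parse an explicit "U+XXXX" reference.
// Sketch only — production code also bounds-checks the digit count.
fn parse_unicode_form(reference: &str) -> Option<char> {
    let hex = reference.strip_prefix("U+")?;
    let cp = u32::from_str_radix(hex, 16).ok()?;
    char::from_u32(cp) // rejects surrogates and out-of-range values
}
```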
Accent decomposition
Older Aozora works encode accented Latin letters using a separate
notation that is not a ※[#…] reference:
〔e'tude〕 → étude
〔Du:rer〕 → Dürer
〔[ae]on〕 → æon
The full mapping (114 entries — every digraph and ligature in the
spec) is at accent_separation.html in the spec snapshot. aozora
applies this decomposition during Phase 0 sanitize, before the
trigger scan, so by Phase 1 the source is pure Unicode with no
ASCII-encoded accents.
The lookup is also Eytzinger-laid (see Eytzinger sorted-set lookup) since 114 entries is well inside its favourable regime.
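The mapping's shape is a plain digraph → codepoint function. The entries below are a hypothetical subset for illustration only — consult accent_separation.html in the spec snapshot for the authoritative 114-entry table:

```rust
// Tiny illustrative slice of the digraph / ligature → Unicode table.
// The digraph spellings here are assumptions, not the spec's exact list.
fn decompose_accent(digraph: &str) -> Option<char> {
    match digraph {
        "e'" => Some('é'),   // acute
        "a`" => Some('à'),   // grave
        "u^" => Some('û'),   // circumflex
        "[ae]" => Some('æ'), // ligature
        _ => None,
    }
}
```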
Why a single crate for all of this?
encoding, gaiji, and accent are three distinct concerns, but:
- They all need to be applied once, in order, at the boundary between the source bytes and the parser proper.
- Splitting them would force three separate crate surfaces and three separate trigger points in the lexer.
- Their data tables are all built from upstream Aozora Bunko spec
pages, so a single update workflow (refresh
docs/specs/aozora/, re-extract tables) hits all three at once.
Co-locating them in one crate keeps the boundary tight and the update surface predictable.
See also
- Notation → Gaiji — author-facing notation reference.
- Seven-phase lexer → Phase 0 — where the resolver is invoked.
HTML renderer & canonical serialiser
aozora-render ships two walkers over AozoraTree<'_>:
- html::render_to_string — emits semantic HTML5 with aozora-* class hooks.
- serialize::serialize — emits canonical 青空文庫 source.
Both are pure functions. Both walk the tree once, in source order,
allocating exactly the output buffer (a String pre-sized to the
arena footprint).
HTML renderer
Class-name scheme
aozora emits stable class names that downstream stylesheets can hook:
| AST node | HTML | Class hook |
|---|---|---|
| Ruby | <ruby>X<rt>Y</rt></ruby> | (no class — semantic ruby element) |
| Bouten { kind: Sesame } | <em class="aozora-bouten-sesame">…</em> | aozora-bouten-<slug> |
| Tcy | <span class="aozora-tcy">…</span> | aozora-tcy |
| Gaiji { resolution: Direct } | <span data-aozora-gaiji-jis="1-94-37">字</span> | data-aozora-gaiji-* |
| Gaiji { resolution: Fallback } | <span class="aozora-gaiji-fallback" title="…">[…]</span> | aozora-gaiji-fallback |
| Container { kind: Indent { n: 2 } } | <div class="aozora-indent-2">…</div> | aozora-indent-<n> |
| Container { kind: AlignEnd } | <div class="aozora-align-end">…</div> | aozora-align-end |
| Break::Page | <div class="aozora-page-break"/> | aozora-page-break |
| Kaeriten { mark: Re } | <span class="aozora-kaeriten" data-aozora-kaeriten="レ">レ</span> | aozora-kaeriten |
The aozora- prefix is reserved for our class names — a downstream
stylesheet can target every aozora-emitted hook with [class^="aozora-"]
without conflicting with the consumer’s own classes.
Why a class-hook output instead of inline styles?
Inline styles would force a single typographic decision for every consumer — print stylesheet, screen stylesheet, e-book renderer, and LSP/preview pane all want different presentation. The class-hook output:
- Lets each consumer ship its own stylesheet for its medium.
- Survives content-security-policy regimes that block style attrs.
- Stays diff-able (the rendered HTML is stable across runs; presentation churn doesn’t ripple into snapshot tests).
HTML escaping
The renderer escapes <, >, &, ", ' in user text exactly
once, at emission. Pre-escaped or doubly-escaped output is a
correctness bug, not a perf decision — every CI run validates that
html_unescape ∘ render_to_string is the identity on plain runs.
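The escaping pass is small enough to sketch in full. The function name is assumed; the behaviour matches the five-character set described above:

```rust
// Escape the five HTML-special characters exactly once, at emission.
fn html_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '&' => out.push_str("&amp;"),
            '"' => out.push_str("&quot;"),
            '\'' => out.push_str("&#39;"),
            _ => out.push(c),
        }
    }
    out
}
```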
Canonical serialiser
The serialiser is the inverse of the lexer’s surface form: walk the tree, emit the source notation that would re-parse identically. It exists for three reasons:
- Round-trip property. parse ∘ serialize ∘ parse must be stable on the second iteration. The corpus sweep verifies this on every Aozora Bunko work.
- aozora fmt. The CLI’s fmt subcommand canonicalises author input (CRLF → LF, accent decomposition, container directive spacing).
- Diff-quality output. When the parser drops a malformed construct, the serialiser re-emits the surrounding text without the offending fragment, so authors can see the exact change.
Why a separate walker, not “render with a different visitor”?
The HTML and canonical-serialise outputs differ on every node type:
- HTML wraps Ruby { target, reading } in <ruby>X<rt>Y</rt></ruby>; serialise emits |X《Y》 (or the auto-detect form).
- HTML wraps Container { kind: Indent { n } } in <div class="aozora-indent-N">…</div>; serialise emits the bracketed directives [#ここからN字下げ]…[#ここで字下げ終わり].
- HTML emits <span data-aozora-gaiji-jis="1-94-37">字</span> for a resolved gaiji; serialise emits the original ※[#…、第3水準1-94-37].
The transformations don’t share enough structure to fit a single “visitor with two methods per node” abstraction. Two purpose-built walkers stay clearer and slightly faster — the compiler can inline the per-node match, which a generic visitor with virtual dispatch prevents.
Walker shape
Both walkers follow the same shape:
pub fn render_to_string(tree: &AozoraTree<'_>) -> String {
let mut buf = String::with_capacity(tree.estimated_html_size());
walk(tree, &mut buf);
buf
}
fn walk(tree: &AozoraTree<'_>, out: &mut String) {
for node in tree.nodes() {
match node {
AozoraNode::Plain(s) => out.push_str(&html_escape(s)),
AozoraNode::Ruby(r) => emit_ruby(r, out),
AozoraNode::Bouten(b) => emit_bouten(b, out),
AozoraNode::Tcy(t) => emit_tcy(t, out),
AozoraNode::Gaiji(g) => emit_gaiji(g, out),
AozoraNode::Container(c) => emit_container(c, out),
AozoraNode::BreakNode(b) => emit_break(b, out),
// … exhaustive
}
}
}
Single linear pass; no allocation outside the output buffer; no recursion that the compiler can’t unroll (containers recurse, but the fan-out is small — typically 1–4 children per container).
estimated_html_size heuristic
The buffer pre-size avoids String reallocations during the walk.
Empirical heuristic from the corpus sweep: 2.6 × source_byte_len
is at the 95th percentile (some HTML wraps a 3-byte ruby kanji in
30 bytes of <ruby>X<rt>Y</rt></ruby> markup). Going under leaves
~1 reallocation per render in the worst case; going over wastes
memory on every render. 2.6× is the measured optimum.
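The heuristic itself is one line. Sketch only — the real method lives on the tree and may account for more than raw source length:

```rust
// 2.6 × source length, in integer arithmetic to avoid float rounding.
fn estimated_html_size(source_byte_len: usize) -> usize {
    source_byte_len * 26 / 10
}
```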
See also
- Notation overview — what each AST node represents.
- Borrowed-arena AST — the input shape.
- Performance → Benchmarks — the
render_hot_pathprobe that drives the size estimate.
Concrete syntax tree (CST)
A rowan-backed lossless syntax tree lives under the cst
Cargo feature on the aozora crate. The CST is a pure projection
over the existing parse output — Phase 3 classification is unchanged,
the AST stays the perf-critical path, and the CST adds zero overhead
for consumers that don’t enable the feature.
Why a CST exists
The borrowed AST (AozoraNode<'src>) is great for renderers:
classified spans, typed payload, no whitespace noise. It is the wrong
shape for source-faithful tooling:
- A formatter rewriting 日本《にほん》 → |日本《にほん》 needs the exact whitespace and trivia between tokens.
- An LSP textDocument/foldingRange provider needs the open / close positions of every nestable region, including ones the renderer ignores.
- A refactor that renames the kanji range in [#「青空」に傍点] to [#「あおぞら」に傍点] must preserve every bracket character the user wrote, not just the parsed target.
A CST whose leaves concatenate to the parser’s input gives those tools what they need without any custom plumbing.
Lossless invariant
The contract is sharp:
Concatenating every leaf token’s text yields the sanitized source bytes the parser actually saw.
“Sanitized” matters: Phase 0 normalises CRLF→LF, strips a leading
BOM, isolates long decorative rule lines with a leading blank line,
and rewrites 〔…〕 accent spans through accent decomposition. These
transformations happen before classification, so source_nodes
coordinates address sanitized bytes. The CST tracks that coordinate
system; an editor that wants to map back to the user’s raw bytes
runs the same Phase 0 transformation and inverts where needed.
The proptest in tests/property_lossless.rs runs the invariant
across the full Aozora-shaped input distribution
(aozora_fragment / pathological_aozora /
unicode_adversarial from aozora-proptest). A regression here
breaks every editor surface that walks the CST.
Architecture
The crate stays decoupled by design:
- aozora-cst depends on aozora-pipeline + aozora-spec directly, not on the aozora meta crate. Going through aozora would create a cycle (the meta crate’s cst feature re-exports aozora-cst).
- build_cst(sanitized_source, source_nodes) -> SyntaxNode takes the lower-level bits explicitly so consumers writing custom pipelines can reach in.
- aozora::cst::from_tree(&tree) -> SyntaxNode is the ergonomic entry point; it runs Phase 0 sanitize internally and forwards.
- The Phase 3 classifier sees no changes — adding / removing CST consumers cannot perturb AST perf.
SyntaxKind granularity
The CST is intentionally coarser than a token-stream re-construction:
| SyntaxKind | Role |
|---|---|
| Document | Tree root |
| Container | Paired-container region ([#ここから...]...[#ここで...終わり]) |
| Construct | Single classified Aozora construct |
| ContainerOpen / ContainerClose | Container boundary tokens |
| ConstructText | Source slice of a Construct |
| Plain | Plain text run between classifications |
Finer per-token granularity (individual punctuation, kana runs, …)
can land later once a concrete consumer needs it. The lossless
property holds at any granularity, so widening the leaf set is
non-breaking for downstream tooling that walks preorder_with_tokens.
Why rowan, not Phase 3 integration
The bumpalo-arena AST stays the hot path; the CST sits on top as an editor-grade convenience layer rather than coupling lossless-tree concerns into the perf-critical classifier. rowan (over cstree) gives the lossless tree a maintained home — rust-analyzer’s tree infrastructure with 86 reverse deps — and the bumpalo / Arc dual-allocator overhead is the price for keeping the AST untouched.
Cross-references
- Architecture → Borrowed-arena AST — the underlying perf-critical tree.
- Architecture → Seven-phase lexer — where Phase 0 sanitize and Phase 3 classify do their work.
- Document::edit — the incremental-parse counterpart that reuses the same CST.
Error recovery
aozora is non-fatal by design: the parser always returns an
AozoraTree even when the input violates the spec. Every
problem is reported as a structured Diagnostic whose
code tooling can match on; nothing is ever raised as a
panic from Document::parse.
This page documents what the parser actually does when each diagnostic fires — useful when implementing editor surfaces, lint fixers, or anything else that runs over imperfect documents.
Recovery model
Every diagnostic carries two orthogonal axes:
| Axis | Values | Meaning |
|---|---|---|
| severity | Error / Warning / Note | Routing hint for downstream surfaces; does not affect parsing. |
| source | Source / Internal | Whether the issue is in the user’s input (Source) or in the library’s invariants (Internal). |
The parser keeps running regardless of severity. Error does not
short-circuit; it only marks the surrounding output region as
suspect so callers (CLI --strict, LSP) can decide policy. CI gates
typically treat any Error as failure, but the AST is still safe
to walk — the spans, classifications, and renderer all remain
consistent.
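The caller-side policy split can be sketched with toy types (the Severity and Diag shapes below are illustrative stand-ins, not the crate's real API): the parser only reports, and a strict consumer such as the CLI's --strict mode decides that any Error means a non-zero exit.

```rust
// Toy diagnostic shape mirroring the two-axis model described above.
#[derive(PartialEq)]
enum Severity { Error, Warning, Note }

struct Diag {
    severity: Severity,
    code: &'static str,
}

// A strict gate: the parse already completed; we only route on severity.
fn strict_exit_code(diags: &[Diag]) -> i32 {
    if diags.iter().any(|d| d.severity == Severity::Error) { 1 } else { 0 }
}

fn main() {
    let diags = vec![
        Diag { severity: Severity::Warning, code: "aozora::lex::source_contains_pua" },
        Diag { severity: Severity::Error, code: "aozora::lex::unclosed_bracket" },
    ];

    // Collect which codes actually gate the build.
    let failing: Vec<&str> = diags.iter()
        .filter(|d| d.severity == Severity::Error)
        .map(|d| d.code)
        .collect();
    assert_eq!(failing, vec!["aozora::lex::unclosed_bracket"]);

    assert_eq!(strict_exit_code(&diags), 1);   // strict mode fails
    assert_eq!(strict_exit_code(&diags[..1]), 0); // warnings alone pass
}
```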
Source-side codes
aozora::lex::source_contains_pua
Hello, …<U+E001>… world.
A user-supplied codepoint in the range U+E001..U+E004 collides with one of the lexer’s PUA sentinel reservations. The placeholder registry keys on these codepoints, so a bare collision means the classifier could no longer tell user-text occurrences from lexer-inserted markers.
Recovery: the colliding bytes are kept verbatim in the sanitised text — Phase 0 does not delete them. Downstream the character flows through as plain text (the registry has no entry for the position so it is treated as ordinary content). Editors that want to surface the collision visually can match on this code; ordinary HTML rendering is unaffected.
aozora::lex::unclosed_bracket
|青梅《おうめ
An open delimiter (|, 《, [, 〔, 「, …) reached
end-of-input with no matching close on the pairing stack.
Recovery: no PairLink is emitted for the orphaned
opener (Unclosed opens have no partner span and would only
confuse editor highlights). Phase 3 then sees no Aozora construct
covering the unclosed open and degrades the whole region to plain
text — the bytes from the opener to EOF are preserved literally,
just without ruby / annotation classification.
aozora::lex::unmatched_close
》orphaned
A close delimiter saw an empty pairing stack, or its PairKind
mismatched the stack top.
Recovery: the stray close is not matched against any opener;
no PairLink is emitted. The bytes flow through as plain text,
preserving the user’s content; nothing on the stack pops. The
diagnostic span points at the close itself so editors can surface
it without corrupting the document tree.
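A toy pairing-stack scanner makes both recovery paths concrete (kinds, byte positions, and diagnostic strings here are illustrative, not the real PairKind or Diagnostic types): openers push, a matching close pops, a stray or mismatched close is reported without popping, and leftover openers become unclosed-bracket reports at end of input.

```rust
#[derive(PartialEq, Clone, Copy)]
enum PairKind { Ruby, Bracket }

// Scan for 《…》 and [ … ] pairs; never abort, only report.
fn scan(input: &str) -> Vec<String> {
    let mut stack: Vec<(PairKind, usize)> = Vec::new();
    let mut diags = Vec::new();
    for (pos, c) in input.char_indices() {
        match c {
            '《' => stack.push((PairKind::Ruby, pos)),
            '[' => stack.push((PairKind::Bracket, pos)),
            '》' | ']' => {
                let kind = if c == '》' { PairKind::Ruby } else { PairKind::Bracket };
                match stack.last() {
                    Some(&(k, _)) if k == kind => { stack.pop(); }
                    // Stray or mismatched close: report, pop nothing.
                    _ => diags.push(format!("unmatched_close at {pos}")),
                }
            }
            _ => {}
        }
    }
    // Openers still on the stack at EOF get no partner span.
    for (_, pos) in stack {
        diags.push(format!("unclosed_bracket at {pos}"));
    }
    diags
}

fn main() {
    assert_eq!(scan("《a》"), Vec::<String>::new());
    assert_eq!(scan("》"), vec!["unmatched_close at 0".to_string()]);
    assert_eq!(scan("《a"), vec!["unclosed_bracket at 0".to_string()]);
}
```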
Internal codes
Internal-source diagnostics indicate library bugs — production
parses on well-formed input never emit these. They are kept
publicly visible so tooling can distinguish “user input has a
problem” from “the library has a problem”; the parse still
completes best-effort to keep editors usable.
| Code | What broke |
|---|---|
| residual_annotation_marker | An [# digraph survived classification — a recogniser is missing for the contained keyword. |
| unregistered_sentinel | A PUA sentinel is in normalised text without a registry entry. |
| registry_out_of_order | The placeholder-registry vector is not strictly position-sorted. |
| registry_position_mismatch | A registry entry references a normalised position whose codepoint is not the expected sentinel kind. |
Recovery: the parser never acts on internal diagnostics —
the problematic stretch flows through as plain text, the diagnostic
records what was wrong, and Document::parse returns normally.
Reproductions belong in aozora-spec test fixtures so the bug
surface keeps shrinking over releases.
What recovery is not
The parser does not attempt fix-it suggestions. There is no
“did you mean [#ここで字下げ終わり]?” guess; the diagnostic’s
help text describes the symptom, not the cure. Higher-level
tooling (LSPs, editor extensions) is the right place for fix-it
proposals — they have user context the parser does not.
The parser also does not try to synthesise missing tokens. A
truly unclosed bracket stays unclosed in the tree; we don’t insert
a phantom 》 to “balance” it. Synthesising tokens hides the
diagnostic from any caller that walks the AST instead of the
diagnostic list, and turns a fixable user error into a silent
correction.
Cross-references
- Diagnostics catalogue — code-by-code reference, including the [#改ページ]-family directives this page does not cover.
- Architecture → Seven-phase lexer — which pipeline phase emits which code.
- Wire format → DiagnosticWire — the JSON shape every binding (FFI, WASM, Python) carries diagnostics over.
tree-sitter reference grammar
aozora ships a tree-sitter grammar at
grammars/aozora.tree-sitter/grammar.js as a reference
implementation alongside the canonical Rust parser. When the two
disagree the Rust parser wins; this grammar exists to plug Aozora
documents into the tree-sitter ecosystem (neovim, helix,
web-tree-sitter / CodeMirror) and to serve as a teaching artefact.
Why a separate grammar at all
The Rust parser is a seven-phase pipeline with a hand-rolled classifier; reading it tells you how the canonical implementation works but not what the spec accepts. A declarative grammar is the language community’s preferred form for “what the spec accepts.” Shipping one alongside the parser lets external tooling consume Aozora without binding to the Rust ABI.
What it does cover
The grammar handles bracket structure faithfully:
- |base《reading》 and base《reading》 — explicit / implicit ruby
- 《《content》》 — double-bracket bouten
- ※[#...] — gaiji marker
- [#...] — generic bracket annotation
- 〔...〕 — tortoise-bracket / accent-decomposition span
Plain text — any byte that is not one of the bracket openers —
flows through as a plain_text token, keeping the grammar lossless
against the byte stream.
What it deliberately does not cover
Three classes of behaviour are intentionally out of reach:
- Stateful container pairing. [#ここから2字下げ] matches [#ここで字下げ終わり] across intervening content; a context-free grammar without a hand-written scanner.c cannot close this. Consumers rely on the body content of the bracket annotation to recognise the pairing themselves, or fall back to the Rust parser.
- Forward 「target」に傍点 resolution. The bouten directive walks back through preceding text to bind to a quoted run. The grammar accepts the directive faithfully; the lookup stays the consumer’s job.
- Ruby base disambiguation. When the glyph run preceding 《...》 could extend further, the Rust classifier uses a more nuanced rule. The grammar accepts the greedy base match uniformly.
A scanner.c extension could plug some of these gaps, but doing
so contradicts the declarative-reference framing of the artefact
and would put the canonical-parser-replacement question on the
table prematurely.
Status
The grammar covers approximately 40 % of the canonical parser’s
constructs as measured by an unweighted variant count. The gap to
full coverage is documented; closing it would require a scanner.c
extension, which trades the declarative-reference framing for a
higher ceiling.
Cross-references
- Architecture → Concrete syntax tree — the rowan-backed in-process equivalent.
- Conformance suite — a future xtask conformance run --implementation tree-sitter will run the fixture set against this grammar to compute the per-tier pass rate against must / should / may.
- grammars/aozora.tree-sitter/README.md — build instructions.
Crate map
aozora is an 18-crate workspace. The split exists for three reasons:
narrow each crate’s compile surface (faster cargo check), pin
dependency boundaries (cycles are forbidden by the layout), and let
each binding (CLI, WASM, FFI, Python) compose only the layers it
needs.
At a glance
flowchart TD
subgraph foundation
spec
end
subgraph types
veb
syntax
encoding
scan
end
subgraph parser
lexer
lex
render
end
subgraph facade
aozora_facade["aozora"]
end
subgraph bindings
cli
ffi
wasm
py
end
subgraph dev
bench
corpus
test_utils["test-utils"]
trace
xtask
end
spec --> veb
spec --> syntax
spec --> encoding
spec --> scan
veb --> lexer
syntax --> lexer
encoding --> lexer
scan --> lexer
lexer --> lex
lex --> render
render --> aozora_facade
aozora_facade --> cli
aozora_facade --> ffi
aozora_facade --> wasm
aozora_facade --> py
aozora_facade --> bench
corpus --> bench
test_utils --> lexer
trace --> xtask
Per-crate purpose
Foundation
| Crate | Role |
|---|---|
| aozora-spec | Single source of truth for shared types: Span, Diagnostic, TriggerKind, PairKind, PUA sentinel codepoints. No internal dependencies — every other crate may depend on it. |
Types & primitives
| Crate | Role |
|---|---|
| aozora-veb | no_std Eytzinger-layout sorted-set lookup. Cache-friendly binary search for sub-256-entry registries. |
| aozora-syntax | AST node types — AozoraNode<'src>, Container<'src>, Bouten<'src>, Ruby<'src>, …. Borrows from the bumpalo arena. |
| aozora-encoding | Shift_JIS decoding, JIS X 0213 patch, 外字 PHF resolver, accent decomposition. |
| aozora-scan | SIMD-friendly multi-pattern byte scanner. The only crate (besides aozora-ffi) that locally relaxes unsafe_code — for aligned-load SIMD intrinsics. |
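The Eytzinger layout that aozora-veb is described as using can be sketched in a few lines: a sorted array is re-laid out in BFS (breadth-first) order so that a search walks children at indices 2i+1 / 2i+2, touching memory in a cache-friendlier pattern than classic binary search. The function names below are illustrative, not the crate's API.

```rust
// Re-lay a sorted slice into Eytzinger (BFS) order: an in-order walk
// of the implicit tree over indices 0..n assigns the sorted values.
fn eytzinger_from_sorted(sorted: &[u32]) -> Vec<u32> {
    fn fill(sorted: &[u32], out: &mut [u32], i: usize, k: &mut usize) {
        if i < out.len() {
            fill(sorted, out, 2 * i + 1, k); // left subtree
            out[i] = sorted[*k];
            *k += 1;
            fill(sorted, out, 2 * i + 2, k); // right subtree
        }
    }
    let mut out = vec![0; sorted.len()];
    let mut k = 0;
    fill(sorted, &mut out, 0, &mut k);
    out
}

// Branch-predictable descent: index arithmetic instead of lo/hi halving.
fn contains(eyt: &[u32], key: u32) -> bool {
    let mut i = 0;
    while i < eyt.len() {
        if eyt[i] == key { return true; }
        i = if key < eyt[i] { 2 * i + 1 } else { 2 * i + 2 };
    }
    false
}

fn main() {
    let sorted: Vec<u32> = (0..15).map(|x| x * 3).collect();
    let eyt = eytzinger_from_sorted(&sorted);
    assert!(contains(&eyt, 21));
    assert!(!contains(&eyt, 22));
}
```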
Parser
| Crate | Role |
|---|---|
| aozora-lexer | Seven-phase classifier pipeline (sanitize → scan → tokenize → classify → pair → resolve → diagnostics). Emits the diagnostic catalogue. |
| aozora-lex | Streaming orchestrator — fused lex_into_arena over the lexer’s per-phase calls. The front door for the public crate. |
| aozora-render | HTML and canonical-serialisation walkers. Single O(n) tree pass each; no allocation outside the output buffer. |
Facade
| Crate | Role |
|---|---|
| aozora | Public facade. Document::parse() -> AozoraTree<'_>, tree.to_html(), tree.serialize(), tree.diagnostics(). The single import for library consumers. |
Bindings
| Crate | Role |
|---|---|
| aozora-cli | The aozora binary (check / fmt / render). |
| aozora-ffi | C ABI driver. Opaque handles, JSON-encoded structured data. Locally relaxes unsafe_code; every block carries a // SAFETY: comment. |
| aozora-wasm | wasm32-unknown-unknown target with wasm-bindgen exports. |
| aozora-py | PyO3 binding shipped via maturin. |
Development-only
| Crate | Role |
|---|---|
| aozora-bench | Criterion + corpus-driven probes. Source of the PGO training data. |
| aozora-corpus | Corpus source abstraction (zstd-archived, blake3-pinned). Dev-only. |
| aozora-proptest | Shared proptest strategies. Dev-only. |
| aozora-trace | DWARF symbolicator + samply gecko-trace loader. Dev-only. |
| aozora-xtask | Host-side dev tooling (samply wrapper, trace analysis, corpus pack/unpack). Not on the just build path. |
Why 18 crates?
Three concrete wins from the split.
1. Compile latency
A single-crate workspace with the same code would force a full re-compile on any internal change. With 18 crates, a change in the renderer doesn’t touch the lexer, scanner, or any of the bindings — incremental compile times stay sub-second on iteration.
2. No-std reach
aozora-veb and aozora-spec are no_std-clean. That matters for
the wasm32 build (where std is a real cost) and would matter for
embedded targets if anyone ever needed one. Keeping them in dedicated
crates enforces the no_std discipline at the crate-graph level —
adding a std import would require depending on a std-using crate,
which is a visible Cargo.toml change.
3. Binding modularity
The C ABI driver (aozora-ffi) needs aozora + serde_json and
nothing else. It does not pull in the bench harness, the trace
loader, or the corpus crate. The wasm driver is similarly minimal.
Each binding’s dependency closure is exactly what it needs — which
is what keeps the wasm bundle inside its 500 KiB budget.
What we deliberately don’t split
A few things stay co-located despite plausible split points:
- HTML render and canonical serialise in aozora-render. Both are tree walkers; sharing the walk() helper between them keeps the implementation small.
- Phase 0 sanitize sub-passes in aozora-lexer. Each sub-pass is < 100 LOC and operates on the same &str slice; pulling them out would create a 5-crate ecosystem for a transformation that’s conceptually one phase.
- Trigger-byte enum and pair-kind enum in aozora-spec. They’re used by both aozora-scan (which produces them) and aozora-lexer (which consumes them); putting them in spec avoids a back-reference.
Splits aren’t free — every additional crate adds a Cargo.toml, a
README, doc-link reachability, and a test surface. Splits land when
the cohesion benefit (one of the three above) is real.
See also
- Pipeline overview
- Borrowed-arena AST
- Reference → API — generated rustdoc for the public surface.
Rust library
The first-class binding. Full type safety, zero copy, and the borrowed-arena AST exposed directly.
Adding to a project
The recommended Cargo.toml snippet (with the current release tag)
lives in the install chapter.
Keeping the pin in one place avoids drift between this doc and the
install page when a new release lands.
crates.io publication tracks the v1.0 API freeze; until then, the git tag form documented there is the canonical entry point.
Surface
The public surface is small by design — three types and four methods cover everything:
pub struct Document { /* opaque */ }
impl Document {
pub fn new(source: String) -> Self;
pub fn parse(&self) -> AozoraTree<'_>;
pub fn source(&self) -> &str;
}
pub struct AozoraTree<'a> { /* borrows from Document */ }
impl<'a> AozoraTree<'a> {
pub fn nodes(&self) -> impl Iterator<Item = AozoraNode<'a>>;
pub fn to_html(&self) -> String;
pub fn serialize(&self) -> String;
pub fn diagnostics(&self) -> &[Diagnostic];
}
pub enum AozoraNode<'src> { Plain(&'src str), Ruby(Ruby<'src>), … }
See Library Quickstart for the walk-through.
Feature flags
aozora exposes one optional feature:
| Feature | Default | What it enables |
|---|---|---|
| serde | off | serde::Serialize / Deserialize impls on AozoraNode, Diagnostic, Span. Useful for downstream tools that need to ship the AST over a wire. |
The default-off policy keeps a plain cargo build of aozora slim — the JSON
encoders that the bindings need live in the bindings themselves
(aozora-ffi, aozora-wasm, aozora-py), not in the core crate.
Error handling
Three philosophies, used consistently:
- Diagnostics are not errors. Document::parse() always returns an AozoraTree<'_>. Per-input diagnostics live in tree.diagnostics(). Callers decide whether to treat any diagnostic as fatal.
- Decoding is fallible. aozora_encoding::sjis::decode_to_string returns Result<Cow<str>, DecodeError>. Malformed Shift_JIS is the one place a function actually fails — the parser proper assumes UTF-8.
- Panics are bugs. No .unwrap() on user-data paths in non-test code; clippy’s unwrap_used and expect_used lints are warned workspace-wide. If you ever see a panic in aozora::*, file a bug.
Thread safety
Document is Send but not Sync — the bumpalo arena does not
support concurrent allocation. Pass a Document between threads
freely; do not share &Document across threads.
AozoraTree<'_> borrows from &Document, so by Rust’s lifetime
rules the same shape applies: a &AozoraTree is Send + Sync (it’s
just & to immutable data), but it can’t outlive its Document.
For parallel corpus processing (e.g. the corpus sweep harness
parsing 1000s of documents concurrently), each thread creates its
own Document from its own source. The arena resets per-Document,
so there’s no contention point.
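The per-thread-ownership pattern can be sketched with a toy stand-in for Document (the Doc type below is hypothetical — it just mirrors the Send-but-not-Sync shape via Cell state): each thread owns its document outright, so no synchronisation is needed.

```rust
use std::cell::Cell;
use std::thread;

// Toy stand-in for Document: owns its source and some interior Cell
// state, so it is Send but not Sync — the shape described above.
struct Doc {
    src: String,
    scratch: Cell<usize>,
}

impl Doc {
    fn new(src: String) -> Self {
        Doc { src, scratch: Cell::new(0) }
    }
    // Pretend "parse": mutates interior state, returns a result.
    fn parse_len(&self) -> usize {
        self.scratch.set(self.src.len());
        self.scratch.get()
    }
}

fn main() {
    let sources = vec!["a".to_string(), "bb".to_string(), "ccc".to_string()];
    // One Doc per thread; each thread allocates and parses independently.
    let totals: Vec<usize> = thread::scope(|s| {
        let handles: Vec<_> = sources
            .into_iter()
            .map(|src| s.spawn(move || Doc::new(src).parse_len()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    assert_eq!(totals.iter().sum::<usize>(), 6);
}
```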
MSRV policy
aozora pins Rust 1.95.0. The MSRV advances roughly once per
quarter, when a new stable feature is needed and the workspace
moves to it. The msrv job in CI gates every PR; Dependabot is
configured to not auto-bump the MSRV pin (manual decision).
Public API stability
Pre-1.0: minor-version bumps may break the API. cargo-semver-checks
runs in CI to catch unintentional breakage between releases, so a
v0.2.x → v0.2.y upgrade is always safe; only v0.x.y →
v0.x+1.y opens the door for breaks.
Post-1.0 (planned): semver discipline. Breaking changes accumulate
on a next branch and ship in a major bump.
See also
- Library Quickstart
- Borrowed-arena AST — the lifetime model.
- Reference → API — generated rustdoc.
WASM (wasm-pack)
The aozora-wasm crate compiles to wasm32-unknown-unknown and
exposes a Document class via wasm-bindgen. The wasm artifact has
a hard 500 KiB size budget after wasm-opt -O3 — measured on every
release.
Build
rustup target add wasm32-unknown-unknown # one-time
wasm-pack build --target web --release crates/aozora-wasm
Outputs land at crates/aozora-wasm/pkg/:
- aozora_wasm_bg.wasm — the binary module
- aozora_wasm.js — the wasm-bindgen JS shim
- aozora_wasm.d.ts — TypeScript types
- package.json — minimal npm-publishable metadata
Why wasm-opt = false in Cargo.toml?
wasm-pack ships its own bundled wasm-opt (via the binaryen crate)
which lags upstream. Recent Rust releases emit bulk-memory opcodes
(memory.copy, memory.fill) that the bundled wasm-opt mishandles
on -O3, occasionally producing artifacts that crash on init. We
disable the bundled run and recommend a fresh wasm-opt invocation
externally:
wasm-opt -O3 \
--enable-bulk-memory \
--enable-mutable-globals \
crates/aozora-wasm/pkg/aozora_wasm_bg.wasm \
-o crates/aozora-wasm/pkg/aozora_wasm_bg.wasm
The post-wasm-opt artifact has a 500 KiB size budget. CI gates on
this number — exceeding it is a release-blocking regression.
Usage
import init, { Document } from "./pkg/aozora_wasm.js";
await init(); // load the .wasm
const doc = new Document("|青梅《おうめ》");
const html = doc.to_html();
const canonical = doc.serialize();
const diagnostics = JSON.parse(doc.diagnostics_json());
console.log(html);
doc.free(); // release the bumpalo arena
In TypeScript, the .d.ts file gives you full type checking on
every method.
API surface
| Method | Returns | Notes |
|---|---|---|
| new Document(source: string) | Document | Copies the JS string into a Rust Box<str>. |
| to_html() | string | Renders to semantic HTML5 with aozora-* class hooks. |
| serialize() | string | Re-emits canonical 青空文庫 source. |
| diagnostics_json() | string | JSON-encoded array of diagnostic objects. |
| source_byte_len() | number | Source byte length, useful for progress UI. |
| free() | — | Explicit drop; otherwise the JS GC eventually releases. |
The diagnostics JSON shape mirrors aozora-ffi’s C ABI:
interface Diagnostic {
code: string; // "E0001", "W0006", …
level: "error" | "warning" | "info";
message: string;
span: { start: number; end: number };
help?: string;
}
Why a hand-written JSON projection over serde-wasm-bindgen?
serde-wasm-bindgen would let us pass the Diagnostic directly to
JS as a structured object — no JSON round-trip needed. We don’t use
it because:
- It pulls in a meaningful chunk of serde_json machinery that bloats the wasm bundle by ~80 KiB.
- The wire format ({ code: "E0001", level: "warning", … }) is exactly what every JS consumer is going to deserialise into anyway.
- It would force a serde::Serialize derivation on every diagnostic-related type in aozora-spec, which the Rust library consumers don’t otherwise need (they take &[Diagnostic] directly).
A small, hand-written JSON emitter (one core::fmt::Write impl, ~60
LOC) costs nothing and keeps the bundle small.
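The core of such an emitter is just string escaping plus object assembly through fmt::Write. A minimal sketch (the function name and emitted fields are illustrative, not the crate's actual emitter):

```rust
use std::fmt::Write;

// Minimal hand-written JSON string emitter: escape and quote one
// string value into an output buffer, no serde involved.
fn push_json_str(out: &mut String, s: &str) {
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            // Other control characters get \u escapes.
            c if (c as u32) < 0x20 => {
                let _ = write!(out, "\\u{:04x}", c as u32);
            }
            c => out.push(c),
        }
    }
    out.push('"');
}

fn main() {
    // Assemble a tiny diagnostic-shaped object by hand.
    let mut out = String::new();
    out.push('{');
    out.push_str("\"code\":");
    push_json_str(&mut out, "E0001");
    out.push(',');
    out.push_str("\"message\":");
    push_json_str(&mut out, "line\nbreak");
    out.push('}');
    assert_eq!(out, "{\"code\":\"E0001\",\"message\":\"line\\nbreak\"}");
}
```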
Why Document.free() and not just GC?
wasm-bindgen does wire Drop to a JS finalizer, but JS finalizers
fire on the GC’s schedule — which can be minutes after the last
reference goes out of scope, especially on Node.js where the GC
batches aggressively. For large documents this means the bumpalo
arena (potentially several MB) sits unreleased.
Explicit .free() is the same idiom every wasm-bindgen library
exposes for resource-heavy types. Consumers that want JS-native
ergonomics wrap the class in their own using helper (TC39 stage-3
explicit resource management).
Browser support
Tier-1 (CI-tested):
- Chrome 110+
- Firefox 110+
- Safari 16+
Tier-2 (works, not in CI):
- Node.js 18+ (use --target nodejs in wasm-pack build)
- Deno 1.30+
The bundle uses bulk-memory and mutable-globals; both have been universally supported since 2021.
Why wasm at all?
The CLI and the Rust library cover Linux / macOS / Windows native; the wasm build covers everywhere else — particularly:
- Browser-side preview / formatter for a 青空文庫 LSP front-end.
- Cloudflare Workers / Vercel Edge / Deno Deploy serverless rendering.
- Notebook environments (Jupyter via pyodide, Observable, Quarto).
The same parser, same diagnostics, same canonical-serialise — across every wasm-runtime host.
See also
- Install
- Architecture → SIMD scanner backends — the wasm32 scanner backend.
C ABI
The aozora-ffi crate compiles to a cdylib + staticlib. The API
is opaque-handle + JSON-encoded structured data — the C side never
sees a Rust type, just opaque pointers and byte buffers.
Build
cargo build --release -p aozora-ffi
# → target/release/libaozora_ffi.{so,dylib,a}
# → target/release/aozora.h (cbindgen-generated)
The build script regenerates aozora.h automatically. After build,
the header lands at:
- target/release/aozora.h — host-side convenience copy
- $OUT_DIR/aozora.h — cargo build-script standard location
#include "aozora.h" and link with -laozora_ffi.
Smoke test
just smoke-ffi
Builds the cdylib, compiles crates/aozora-ffi/tests/c_smoke/smoke.c
against it, runs it end-to-end. CI runs this on every PR — if the
ABI shape changes accidentally, the smoke test fails before the PR
merges.
Minimal C usage
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include "aozora.h"
int main(void) {
const char *src = "|青梅《おうめ》";
AozoraDocument *doc = NULL;
if (aozora_document_new((const uint8_t *)src, strlen(src), &doc) != 0)
return 1;
AozoraBytes html = {0};
if (aozora_document_to_html(doc, &html) != 0) {
aozora_document_free(doc);
return 1;
}
fwrite(html.ptr, 1, html.len, stdout);
aozora_bytes_free(&html);
aozora_document_free(doc);
return 0;
}
API surface
typedef struct AozoraDocument AozoraDocument;
typedef struct {
uint8_t *ptr;
size_t len;
size_t cap;
} AozoraBytes;
extern int32_t aozora_document_new(const uint8_t *src, size_t src_len,
AozoraDocument **out_doc);
extern int32_t aozora_document_to_html(const AozoraDocument *doc,
AozoraBytes *out_html);
extern int32_t aozora_document_serialize(const AozoraDocument *doc,
AozoraBytes *out_canonical);
extern int32_t aozora_document_diagnostics_json(const AozoraDocument *doc,
AozoraBytes *out_json);
extern void aozora_bytes_free(AozoraBytes *bytes);
extern void aozora_document_free(AozoraDocument *doc);
Status codes
| Code | Meaning |
|---|---|
| 0 | Ok |
| -1 | Null input pointer |
| -2 | Input was not valid UTF-8 |
| -3 | Allocation failed |
| -4 | Internal serialisation error |
Memory ownership
Every pointer or AozoraBytes returned by an aozora_* function
must be released by the matching _free call:
| Returned by | Free with |
|---|---|
| aozora_document_new (AozoraDocument *) | aozora_document_free |
| aozora_document_to_html (AozoraBytes) | aozora_bytes_free |
| aozora_document_serialize (AozoraBytes) | aozora_bytes_free |
| aozora_document_diagnostics_json (AozoraBytes) | aozora_bytes_free |
Dropping a handle without _free leaks; freeing then dereferencing
is undefined behaviour. This is the standard ABI contract — any
unsafe { Box::from_raw(...) } mistake on the consumer side
trips both ASan and miri (both run in CI on the FFI test suite).
Why JSON for diagnostics, not a C struct?
Three reasons.
- Variant types. Diagnostic has optional fields (help, sometimes a multi-span). A flat C struct would either lose data or grow nullable pointers everywhere. JSON expresses optionality naturally.
- Schema stability. Adding a new diagnostic field is a backward-compatible JSON change. Adding a field to a C struct breaks every consumer that compiled against the old size.
- Single emitter. The same JSON shape is produced by aozora-wasm (consumed by JS) and aozora-py (consumed by Python). Aligning the C ABI on the same shape means downstream polyglot consumers don’t translate between three different schemas.
The cost is one serde_json::to_string call per
aozora_document_diagnostics_json invocation — a one-shot O(N)
allocation that is a rounding error compared to the parse itself.
Why opaque handle + bytes, not a flat C struct projection?
A flat C struct projection of AozoraTree would require:
- Naming every Rust enum variant in C (not supported cleanly via cbindgen for tagged unions).
- Translating the bumpalo arena into a malloc-backed block contiguous with the tree (which means copying the tree out).
- Pinning the AST shape across the C ABI — internal refactors (e.g. adding a new AozoraNode variant) would break ABI without warning.
The opaque-handle approach keeps the AST entirely Rust-side. C consumers ask for HTML, canonical text, or JSON-encoded diagnostics — three stable shapes that don’t change with internal refactors.
Use from Go / Zig / Nim
Anything with a C FFI. The aozora.h header is plain C99 — no
inline functions, no macros that depend on a compiler-specific
extension, no #pragma. Tested in CI by the smoke test against
gcc, clang, and msvc.
See also
- Install → C ABI
- Bindings → WASM — same JSON diagnostics shape.
Python (PyO3 / maturin)
The aozora-py crate is a PyO3 binding shipped
via maturin.
Install
pip install maturin # one-time
cd crates/aozora-py
maturin develop -F extension-module # install in current venv
# or
maturin build -F extension-module --release # produce a redistributable wheel
The extension-module feature gates the PyO3 import-side machinery
behind a flag, so a plain cargo build --workspace succeeds without
Python development headers installed. CI has both modes covered.
Minimal Python usage
from aozora_py import Document
doc = Document("|青梅《おうめ》")
print(doc.to_html()) # <ruby>青梅<rt>おうめ</rt></ruby>
print(doc.serialize()) # |青梅《おうめ》
print(doc.diagnostics()) # JSON-encoded list of diagnostic dicts
API surface
| Method | Returns | Notes |
|---|---|---|
| Document(source: str) | Document | The constructor copies source into a Rust Box<str>. |
| to_html() -> str | str | Renders to semantic HTML5 with aozora-* class hooks. |
| serialize() -> str | str | Re-emits canonical 青空文庫 source. |
| diagnostics() -> str | str | JSON-encoded list (same schema as the WASM and FFI bindings). |
| source_byte_len() -> int | int | Source byte length. |
The diagnostics JSON shape is shared across every binding — see Bindings → WASM for the schema.
Thread safety: unsendable
The Document type is marked unsendable (PyO3 marker) because
the underlying bumpalo arena uses interior Cell state. Concurrent
access from another Python thread raises a RuntimeError:
import threading
from aozora_py import Document
doc = Document(open("src.txt").read())
def worker(): doc.to_html() # raises RuntimeError on second thread
threading.Thread(target=worker).start() # boom
For parallel corpus processing, create a Document per thread.
The arena resets per-Document, so there’s no contention point;
each thread allocates from its own arena.
Why not Send?
PyO3 classes are Send-capable by default, which enables cross-thread
access for binding types. We opt out via the unsendable marker because:
- Arena correctness. bumpalo::Bump is !Sync — the per-page allocator state isn’t atomic. Making the class sendable would require a mutex around every allocation, which is the cost we designed the arena to avoid in the first place.
- GIL semantics. Python threads share the GIL; “concurrent” in the Python sense is rarely actually parallel. The unsendable marker turns the misuse case into a loud RuntimeError instead of a silent data race.
- Multiprocessing path. The right answer for parallel corpus work is multiprocessing (one Document per process — the arenas are independent by construction). The unsendable marker nudges users toward this.
Why JSON-encoded diagnostics?
Same reason as the WASM binding:
- The wire shape is stable across every binding.
- Avoids forcing a pyclass declaration on every diagnostic-related type.
- Downstream Python consumers call json.loads() once and work with native dicts — no second translation.
The diagnostics() method returns a str, not a list[dict], so
the json.loads is visible to the caller. Hiding it behind a
PyO3 Vec<PyDict> mapping would silently allocate one Python
object per diagnostic per call.
Wheel distribution
aozora-py is not yet on PyPI — public release tracks the v1.0 freeze of the core library. Until then, build wheels locally:
maturin build -F extension-module --release # → target/wheels/*.whl
pip install target/wheels/aozora_py-*.whl
Pre-1.0 distribution will likely use cibuildwheel to ship wheels
for every supported (python, target) combination — that’s the
mainstream path for PyO3 projects in 2026.
See also
- Install → Python
- Bindings → C ABI — same diagnostics JSON shape.
- PyO3 user guide — the binding framework.
Pandoc integration
The aozora-pandoc crate (workspace-internal, available via the
aozora CLI) projects a parsed Aozora document into the
Pandoc AST. Once you have Pandoc JSON, every Pandoc
output format (HTML, EPUB, LaTeX/PDF, DOCX, ODT, MediaWiki, …) is one
shell pipe away.
This is the recommended path if you want to convert Aozora Bunko notation into anything other than the built-in HTML renderer. Adding a new output format means adding a Pandoc filter (or none, if the default Span/Div mapping is enough), not extending the parser crate.
Quickstart
# Pandoc JSON to stdout
aozora pandoc input.txt > out.json
# Or pipe through pandoc directly
aozora pandoc input.txt | pandoc -f json -t html
aozora pandoc input.txt | pandoc -f json -t epub3 -o out.epub
# `--format` is shorthand for the pipe (requires pandoc on PATH)
aozora pandoc input.txt --format html > out.html
aozora pandoc -E sjis legacy.txt -t epub > out.epub
Projection rules
Each AozoraNode variant lifts to a Pandoc construct
carrying a stable CSS class so downstream filters or stylesheets can
specialise the rendering:
| Aozora variant | Pandoc construct | Class on the construct |
|---|---|---|
| Ruby | Span | aozora-ruby |
| ↳ base text | nested Span | aozora-ruby-base |
| ↳ reading text | nested Span | aozora-ruby-reading |
| Bouten | Span over target text | aozora-bouten |
| TateChuYoko | Span | aozora-tate-chu-yoko |
| Gaiji | Span carrying mencode | aozora-gaiji |
| Indent, AlignEnd | empty Span (marker) | aozora-indent / align-end |
| Warichu | Span with two children | aozora-warichu |
| DoubleRuby | Span | aozora-double-ruby |
| Annotation, Kaeriten, HeadingHint | empty Span carrying raw | aozora-annotation / etc. |
| PageBreak | HorizontalRule block | (n/a — semantic block) |
| SectionBreak | empty Div | aozora-section-break |
| AozoraHeading | Header block | aozora-heading |
| Sashie | Para with Image | aozora-sashie |
| Container (字下げ等) | Div wrapping inner blocks | aozora-container-indent / etc. |
The structural key-value pairs (the kvs — the third element of
Pandoc's Attr triple) carry non-textual metadata (bouten kind /
position, gaiji description / mencode, indent amount, container
kind). Filters that want format-native rendering pattern-match on
the class + kvs.
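For orientation, here is a sketch of how a bouten span could appear in the emitted Pandoc JSON. Pandoc's Attr is the triple [identifier, [classes], [[key, value], ...]]; the kvs names below ("kind", "position") are illustrative, not the crate's exact schema:

```json
{
  "t": "Span",
  "c": [
    ["", ["aozora-bouten"], [["kind", "sesame"], ["position", "above"]]],
    [{ "t": "Str", "c": "X" }]
  ]
}
```

A filter targeting, say, DOCX would match on the aozora-bouten class and read the kvs to pick a native emphasis mark.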
Why a Pandoc projection at all
Aozora notation has rich semantic markup (ruby, bouten, tate-chu-yoko,
gaiji…) that no single Pandoc native construct captures. The naive
shortcut of emitting RawInline("html", "<ruby>…</ruby>") would only
work for the HTML writer; every other Pandoc output format would
strip the raw HTML and lose the meaning.
By lifting each Aozora variant to a Span / Div with a stable
class, the same JSON renders sensibly across every Pandoc format
today (each format’s writer renders Span as a stylable container)
and stays open for richer format-native rendering tomorrow via
filters. That’s the same pattern Pandoc itself uses for
[content]{.smallcaps} — semantic in the AST, format-specific in the
writer.
Architecture
The library entry point is aozora_pandoc::to_pandoc:
use aozora::Document;
use aozora_pandoc::to_pandoc;
let doc = Document::new(std::fs::read_to_string("input.txt")?);
let pandoc = to_pandoc(&doc.parse());
let json = serde_json::to_string(&pandoc)?;
aozora-cli wires that into aozora pandoc so binary consumers
don’t need to write Rust.
Release profile & PGO
aozora’s [profile.release] is tuned for cross-crate inlining at
the expense of compile time:
[profile.release]
lto = "fat" # full LTO across the whole workspace
codegen-units = 1 # single CGU so LTO sees everything
strip = "symbols" # smaller binary, faster cold start
panic = "abort" # no unwinding tables in the binary
opt-level = 3
Why fat LTO over thin
A thin LTO build keeps each crate’s IR isolated; the cross-crate inliner only inlines through summary stubs. Fat LTO concatenates every crate’s IR into one module before optimisation, so the inliner can see across the whole pipeline.
For aozora that pays off because the lex pipeline is deep:
aozora-render → aozora → aozora-lex → aozora-lexer Phase
functions, each in its own crate. A function call across that depth
under thin LTO costs four indirect calls and four stack frames; the
fat LTO build folds the chain into ~40 inlined instructions on the
hot per-byte path.
Measured on the corpus sweep: fat LTO is 30%+ faster than thin LTO once the lex orchestrator is split across crates. Compile-time cost is real (release builds take ~3 minutes vs ~1 minute for thin), but release builds happen at tag time, not on every iteration.
Why codegen-units = 1
codegen-units = N splits each crate into N parallel codegen jobs
during compilation. Each unit optimises independently, then the
linker stitches them together. With N > 1 the LLVM inliner can’t
see across unit boundaries inside a single crate — which under fat
LTO defeats half the point.
codegen-units = 1 ensures fat LTO actually sees every function in
every crate. Compile time grows; runtime wins back.
Why panic = "abort"
aozora is a parser, not a server. There’s no panic handler to
recover into — a panic on user input would be a parser bug, not a
recoverable error. panic = "abort":
- Drops the unwinding tables from the binary (~80 KiB savings on the CLI).
- Removes the panic-handling overhead from every function call (the compiler doesn’t insert landing pads).
- Surfaces parser bugs as SIGABRT immediately, which is what we want — a panic always indicates an invariant violation that needs fixing, not a state to gracefully degrade through.
For library consumers that want unwinding (e.g. embedding in a long-running server), the dependency-mode build inherits the consumer’s profile, so this only affects the binaries we publish.
Profile-guided optimisation (PGO)
The release pipeline supports PGO via scripts/pgo-build.sh:
./scripts/pgo-build.sh
Three-stage build:
- Instrumented build — cargo build --release with RUSTFLAGS="-Cprofile-generate=/tmp/pgo-data". The resulting binary is slower than vanilla release because of the instrumentation overhead.
- Profile collection — run the corpus sweep against the instrumented binary. The corpus must contain a representative spread of document sizes and notation density. The aozora-bench throughput_by_class probe handles this.
- Final build — cargo build --release with RUSTFLAGS="-Cprofile-use=/tmp/pgo-data/merged.profdata" (the raw profiles are first merged into merged.profdata, conventionally via llvm-profdata merge). LLVM uses the profile to drive its inliner, branch-prediction hints, and basic-block ordering decisions.
Measured win on the corpus sweep: 8–12% faster than non-PGO release build. The cost is operational complexity (the build-script needs a real corpus available); the win compounds with fat LTO, since both target the same hot paths.
BOLT (post-link optimisation)
BOLT is the next layer after PGO: it reorders basic blocks in the
final binary based on the same profile. scripts/pgo-build.sh ends
with an optional BOLT pass when llvm-bolt is on PATH.
BOLT wins another ~3% on top of PGO, mostly by improving I-cache density for the lex hot path. The win is smaller than PGO’s because PGO already used the profile during compilation; BOLT only refines the final binary’s layout.
Tricks we deliberately do not use
- -Cforce-frame-pointers=yes — would help samply unwind on some platforms, but the workspace [profile.bench] covers the profiling case (debug = 1 + strip = none). Release builds get the smaller binary.
- unsafe perf shortcuts — unsafe_code = "forbid" at the workspace level. Three crates locally relax it (FFI / scan / xtask), each with // SAFETY: comments and #[deny(unsafe_op_in_unsafe_fn)]. Where a perf opportunity needs unsafe, we measure it first and cite the win in the comment.
- #[inline(always)] — used sparingly. The compiler's default heuristics have improved enough that forcing inlining usually costs binary size for negligible win. Where it does help (e.g. the per-byte scanner inner loop), the call site has a measurement comment.
See also
- Profiling with samply — how to measure whether a perf change helped.
- Benchmarks — the harness that produces the PGO profile.
- Corpus sweeps — the input the bench harness consumes.
Profiling with samply
samply is the workspace’s
sampling profiler. It produces .json.gz traces in the
Firefox-Profiler gecko format
that can be loaded into the web UI for visual analysis, or fed to
the in-tree aozora-trace crate for automated rollups.
Quick commands
# Single corpus document
AOZORA_CORPUS_ROOT=/path/to/corpus \
just samply-doc 001529/files/50685_ruby_67979/50685_ruby_67979.txt
# Full corpus, parser-bound (5 parse passes after the one-time load)
AOZORA_CORPUS_ROOT=/path/to/corpus just samply-corpus
# Full corpus, render-bound
AOZORA_CORPUS_ROOT=/path/to/corpus just samply-render
# Open in Firefox-Profiler
samply load /tmp/aozora-corpus-<timestamp>.json.gz
All three are wrappers over the aozora-xtask samply subcommand,
which:
- Builds the bench probe with --profile=bench (debug info preserved).
- Runs samply against the resulting binary.
- Drops the .json.gz in /tmp/.
Why these run on the host (not Docker)
samply uses perf_event_open(2) for kernel sampling. Docker’s
default seccomp profile blocks that syscall. The xtask binary
therefore runs on the host (not via docker compose run) and the
Justfile recipes are exempt from the workspace’s normal
“everything in Docker” policy.
The recipes check /proc/sys/kernel/perf_event_paranoid on entry
and print the fix-up command if the value is too high (default 2;
needs to be ≤ 1 for unprivileged sampling):
echo 1 | sudo tee /proc/sys/kernel/perf_event_paranoid
Why --profile=bench and not --release
cargo build --release uses [profile.release], which has
debug = 0 + strip = "symbols". Samply still records samples,
but they show up as raw addresses (0x8fb61) instead of function
names — every sample becomes useless to a human reader.
The workspace [profile.bench] inherits from release but sets
debug = 1 + strip = "none". The xtask wrappers automatically
build with --profile=bench. If you launch samply manually, do the
same.
Corpus load dominates a single-pass trace
throughput_by_class and render_hot_path spend most wall time in
Shift_JIS decode + filesystem I/O during the one-time corpus load.
A single-pass samply trace puts __memmove_avx_unaligned and
encoding_rs::ShiftJisDecoder at the top — not the parser.
Fix: set AOZORA_PROFILE_REPEAT=K (or pass K to
just samply-corpus) so the parse pass runs K times after the
load. The xtask defaults to 5; raise to 10+ for very small corpora.
Trace analysis from the CLI
aozora-xtask trace … (and the just trace-* shortcuts) load
saved .json.gz traces, symbolicate them via the aozora-trace
crate (DWARF lookup is pure-Rust through addr2line::Loader), and
run the bundled analyses.
# 1. One-time per trace: write the symbol cache next to it
just trace-cache /tmp/aozora-corpus-<ts>.json.gz
# 2. Analyses (cache is auto-loaded if present)
just trace-libs /tmp/aozora-corpus-<ts>.json.gz # binary vs libc vs vdso
just trace-hot /tmp/aozora-corpus-<ts>.json.gz 25 # top-25 hot leaf frames
just trace-rollup /tmp/aozora-corpus-<ts>.json.gz # bucketed by aozora's built-in categories
just trace-stacks /tmp/aozora-corpus-<ts>.json.gz 'teddy' 5 # full call chains hitting any frame matching `teddy`
just trace-compare /tmp/before.json.gz /tmp/after.json.gz 25 # before/after diff
just trace-flame /tmp/aozora-corpus-<ts>.json.gz | flamegraph.pl > flame.svg
Each analysis returns a typed report — HotReport, LibraryReport,
RollupReport, ComparisonReport, MatchedStacksReport,
FlameReport — whose module docstring explains the algorithm.
Why a pure-Rust DWARF symbolicator?
The mainstream alternative is shelling out to addr2line(1) from
binutils. We don’t because:
- Process spawn cost. A typical trace has 5 000+ unique addresses; spawning addr2line per address is unworkable. Pipelining through a single subprocess works but ties symbolisation to the presence of binutils on PATH (not always true on minimal containers).
- Build-id verification. The aozora-trace::Symbolicator checks the binary's gnu-build-id against the trace's codeId, so rebuilding between recording and analysis fails loudly rather than producing wrong symbol names. addr2line(1) has no such check.
- Caching. The symbolicator writes a sidecar <trace>.symbols.json on first call (~100 ms per binary) and reads from it on every subsequent call (instant). Re-running addr2line per analysis would re-walk DWARF every time.
Verifying the SIMD scanner is firing
// In any binary or test
println!("{}", aozora_scan::best_scanner_name());
// "teddy" | "hoehrmann-dfa" | "memchr-multi"
Or under samply, look for aozora_scan::backends::teddy::scan_offsets
in the trace’s call tree. If the trace shows
memchr::arch::x86_64::avx2::* instead, you’re on the scalar
fallback (which uses memchr’s own SIMD dispatch internally — still
SIMD, just not aozora-scan’s).
Workflow recipes
“I changed something, did I regress?”
# Microbench the per-band tokenizer throughput
cargo bench -p aozora-lex --bench tokenize_compare
# Macrobench the full pipeline end-to-end
AOZORA_CORPUS_ROOT=… cargo run --release --example throughput_by_class -p aozora-bench
AOZORA_CORPUS_ROOT=… cargo run --release --example render_hot_path -p aozora-bench
# Check the worst doc didn't regress
AOZORA_CORPUS_ROOT=… AOZORA_PROBE_DOC=000286/files/49178_ruby_58807/49178_ruby_58807.txt \
cargo run --release --example pathological_probe -p aozora-bench
“Where is lex_into_arena spending its time?”
# Macroscopic per-phase split
AOZORA_CORPUS_ROOT=… cargo run --release --example phase_breakdown -p aozora-bench
# Latency tail shape
AOZORA_CORPUS_ROOT=… cargo run --release --example latency_histogram -p aozora-bench
# Microscopic: which classify recogniser dominates a specific doc?
AOZORA_CORPUS_ROOT=… AOZORA_PROBE_DOC=… \
cargo run --release --features instrument --example pathological_probe -p aozora-bench
See also
- Benchmarks — the per-probe descriptions.
- Corpus sweeps — corpus setup and AOZORA_* env vars.
Benchmarks (criterion)
aozora ships two layers of perf measurement:
- Criterion microbenchmarks in crates/aozora-lex/benches/ and crates/aozora-render/benches/. Reproducible per-function timings with statistical confidence intervals.
- Corpus probes in crates/aozora-bench/examples/. Each probe is a cargo run --release --example <name> binary that reports per-band statistics across a real corpus.
Criterion microbenchmarks
Run a specific bench:
cargo bench -p aozora-lex --bench tokenize_compare
cargo bench -p aozora-render --bench html_emit
Criterion writes HTML reports under target/criterion/. Each bench
reports throughput in MB/s, ns/byte, and a confidence interval; the
HTML reports include violin plots that surface multi-modal latency
distributions (which often indicate cache-line or page-fault
effects we’d otherwise miss).
Why criterion over #[bench]
Three reasons.
- Statistical rigour. #[bench] reports the minimum of N iterations; criterion fits a model and reports a confidence interval. The minimum is a known-bad estimator on a system with any noise (which is every real machine).
- Iteration-count auto-tuning. Criterion picks the iteration count to reach a target precision; #[bench] requires a hand-picked count.
- Stability. #[bench] is unstable Rust and only works on nightly. Criterion is stable Rust.
Corpus probes
Each probe under crates/aozora-bench/examples/ reports a different
slice of the workload. All read AOZORA_CORPUS_ROOT; most accept
AOZORA_PROFILE_LIMIT=N to cap the sweep.
| Probe | Question it answers | Output shape |
|---|---|---|
throughput_by_class | Per-band MB/s for lex_into_arena | 4-band table + p50 / p90 / p99 / max + ns/byte |
phase_breakdown | Per-phase ms for sanitize / tokenize / pair / classify | per-doc latencies + top-5 worst classify / sanitize |
latency_histogram | Log-bucketed latency distribution per phase | bar histogram, 10 buckets, 1 µs … 1 s |
pathological_probe | Single-doc 100-iter avg per phase | tight per-call numbers; takes AOZORA_PROBE_DOC for any corpus path |
phase0_breakdown | Per-sub-pass cost inside Phase 0 sanitize | bom_strip / crlf / rule_isolate / accent / pua_scan |
phase0_impact | Does Phase 0 sub-pass firing change Phase 1 cost? | bucketed by which sub-passes fired |
phase3_subsystems | Per-recogniser ms inside classify | requires --features instrument |
diagnostic_distribution | What fraction of docs emit diagnostics? | histogram by diag count; latency-by-diag-bucket |
allocator_pressure | Arena bytes / source byte ratio + intern dedup | per-doc histograms |
fused_vs_materialized | Does the deforestation actually win? | per-band gap % between fused (lex_into_arena) and materialized (per-phase collect) |
intern_dedup_ratio | How well does the interner dedup short strings? | corpus-aggregate (cache + table) / calls |
render_hot_path | Per-band MB/s for HTML render | 4-band MB/s + render/parse ratio + out/in size ratio |
Each probe is invoked directly:
AOZORA_CORPUS_ROOT=… cargo run --release --example <name> -p aozora-bench
For phase3_subsystems, build with the instrumentation feature:
AOZORA_CORPUS_ROOT=… cargo run --release --features instrument \
--example phase3_subsystems -p aozora-bench
Why corpus probes and criterion benches?
Different questions.
- Criterion answers "is function X faster after my change?" on a fixed input. Microscopic, reproducible, the right tool for optimising a single hot loop.
- Corpus probes answer "is the parser faster on the real Aozora Bunko catalogue after my change?" Macroscopic, includes every distribution effect (small-doc dispatch overhead, large-doc cache pressure, gaiji-density variation). The right tool for validating a perf PR end-to-end.
A perf PR that wins on criterion but loses on the corpus is suspicious — usually it’s optimised the small-input path at the cost of the large-input path. The corpus probe catches it.
Phase 3 instrumentation caveat
phase3-instrument wraps every recogniser entry in a
SubsystemGuard that calls Instant::now() on construction +
drop. For the dominant inner-loop recognisers this adds enough
overhead that the report’s own timing is significantly skewed.
Use the instrumentation to compare relative costs between
subsystems, not as an absolute number. For absolute numbers, run
phase_breakdown (no instrumentation).
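The guard pattern is small enough to sketch in full (a hypothetical simplification; the real SubsystemGuard presumably also records which recogniser it wraps):

```rust
use std::cell::Cell;
use std::time::Instant;

// RAII timing guard: stamps Instant::now() on construction and
// accumulates the elapsed time into per-subsystem counters on drop.
struct SubsystemGuard<'a> {
    start: Instant,
    calls: &'a Cell<u64>,
    total_ns: &'a Cell<u64>,
}

impl<'a> SubsystemGuard<'a> {
    fn enter(calls: &'a Cell<u64>, total_ns: &'a Cell<u64>) -> Self {
        Self { start: Instant::now(), calls, total_ns }
    }
}

impl Drop for SubsystemGuard<'_> {
    fn drop(&mut self) {
        self.calls.set(self.calls.get() + 1);
        self.total_ns
            .set(self.total_ns.get() + self.start.elapsed().as_nanos() as u64);
    }
}
```

The two Instant::now() calls per recogniser entry are exactly the overhead the caveat above warns about: cheap in absolute terms, but not relative to a recogniser that inspects a handful of bytes.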
Where to look in samply
If a corpus probe regresses, sample-profile the same workload:
AOZORA_CORPUS_ROOT=… just samply-corpus 5
samply load /tmp/aozora-corpus-<ts>.json.gz
# or
just trace-rollup /tmp/aozora-corpus-<ts>.json.gz
The trace-rollup analysis groups samples into aozora’s built-in
categories (Phase 0/1/2/3/4 + corpus_load + intern + alloc + …) so
a regression’s category jumps out at a glance.
See also
- Profiling with samply — the trace workflow.
- Corpus sweeps — what AOZORA_CORPUS_ROOT should point at.
- Release profile & PGO — the build profile that produces these numbers.
Corpus sweeps
aozora’s tier-A acceptance gate is a corpus sweep: every Aozora
Bunko work parses without panicking, and the
parse ∘ serialize ∘ parse round-trip is stable. The corpus has
~17 000 works in active rotation; sweeping the lot takes ~90 s on a
modern x86_64 desktop.
Setting up the corpus
AOZORA_CORPUS_ROOT should point at a directory containing the
unpacked Aozora Bunko tarball:
$AOZORA_CORPUS_ROOT/
├── 000001/
│ └── files/
│ └── 18310_ruby_01058/
│ └── 18310_ruby_01058.txt ← Shift_JIS .txt source
├── 000002/
│ └── files/
│ └── …
└── …
The structure mirrors the upstream aozorabunko repo. Set the env var once in your shell:
export AOZORA_CORPUS_ROOT=/path/to/aozorabunko
Every probe, every sample-profile recipe, and the corpus sweep test suite reads it.
Running the sweep
just corpus-sweep
Wraps the aozora-corpus crate’s ParallelSweep runner. Iterates
every .txt file under $AOZORA_CORPUS_ROOT, parses it, verifies:
- No panic.
- tree.diagnostics() count is within an expected envelope.
- parse(serialize(parse(source))) == parse(source) (round-trip property).
- Render emits valid UTF-8 HTML (no broken byte sequences).
Failure: prints the offending document path + diagnostic, exits non-zero.
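The round-trip property is easiest to see on a toy stand-in. Here a whitespace-normalising split/join plays the role of aozora's parse and serialize (names reused for illustration only):

```rust
// Toy stand-ins for the real parse / serialize pair: "parsing"
// discards insignificant whitespace, "serialising" prints the
// canonical form, so one round-trip reaches a fixed point.
fn parse(src: &str) -> Vec<String> {
    src.split_whitespace().map(String::from).collect()
}

fn serialize(tree: &[String]) -> String {
    tree.join(" ")
}

fn round_trip_stable(src: &str) -> bool {
    let once = parse(src);
    parse(&serialize(&once)) == once
}
```

The sweep asserts this fixed-point property for every document: re-parsing the canonical form must reproduce the same tree.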
Why blake3 / zstd for the archive variant?
aozora-corpus ships an archive mode: the corpus packed into a
single .zst file with a blake3 manifest. This is what CI uses
(the corpus is downloaded once per workflow run and unpacked
in-memory).
- blake3 for per-entry content-addressed hashing. Used so the archive packer can detect “this work hasn’t changed since the last build” and skip re-encoding it. blake3 over sha256: ~10× faster on the same data, no security trade-off for our use case (we’re not signing anything, just diffing).
- zstd for compression. Frame-level random access matters because the ParallelSweep runner wants to mmap individual works on demand without decompressing the whole archive. zstd over gzip / xz: 5–10× faster decompression at comparable ratios.
Both are mainstream crates with safe Rust APIs (the underlying
libzstd is C, but the boundary is hidden behind the zstd crate's safe API).
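The unchanged-work check is plain content addressing, sketched here with std's DefaultHasher standing in for blake3:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Manifest from the previous pack: corpus path -> content digest.
// A work is re-encoded only when its digest no longer matches.
fn digest(bytes: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    bytes.hash(&mut h);
    h.finish()
}

fn needs_reencode(manifest: &HashMap<String, u64>, path: &str, bytes: &[u8]) -> bool {
    manifest.get(path) != Some(&digest(bytes))
}
```

The real packer keys the manifest on blake3 digests, which keeps the check content-addressed (a touched-but-identical file is still a skip) rather than mtime-based.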
Why parallel sweep?
A serial sweep runs through every work one at a time, leaving 15 of
a 16-core machine's cores idle — roughly a 16× wall-clock penalty
over a fully parallel sweep. The ParallelSweep runner uses rayon to
parse documents in parallel, sized to physical cores via
num_cpus::get_physical() — not logical cores.
The reason is memory bandwidth. The parser is bandwidth-bound, not ALU-bound (the SIMD scanner streams the source through L1 once per trigger byte, then the lexer touches each token a few more times). SMT siblings starve each other for cache lines and bus bandwidth, so oversubscribing logical cores actively slows the sweep. Sized to physical, the throughput peaks where the bandwidth ceiling does.
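In spirit, the runner looks like this (a hypothetical simplification using std::thread in place of rayon, and byte-counting in place of parsing):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// Stand-in for the rayon-based ParallelSweep: largest documents
// first, one shared work index, `workers` sized to physical cores
// in the real runner.
fn parallel_sweep(mut docs: Vec<String>, workers: usize) -> usize {
    // Largest-first ordering: a huge document picked up last would
    // leave one worker running long after the others have finished.
    docs.sort_by_key(|d| std::cmp::Reverse(d.len()));
    let next = AtomicUsize::new(0);
    let total = AtomicUsize::new(0);
    thread::scope(|s| {
        for _ in 0..workers {
            s.spawn(|| loop {
                // Claim the next unprocessed document.
                let i = next.fetch_add(1, Ordering::Relaxed);
                let Some(doc) = docs.get(i) else { break };
                // Stand-in for parse(doc): accumulate byte count.
                total.fetch_add(doc.len(), Ordering::Relaxed);
            });
        }
    });
    total.load(Ordering::Relaxed)
}
```

The shared-index claim loop is what gives the largest-first ordering its load-balancing effect: whichever worker finishes early simply claims the next document.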
posix_fadvise(POSIX_FADV_DONTNEED) for honest cold-cache numbers
The xtask corpus uncache command evicts every corpus file from
the kernel page cache before a measurement run:
cargo run -p aozora-xtask --release -- corpus uncache
It uses posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) per file —
no sudo required (unlike echo 1 > /proc/sys/vm/drop_caches, which
needs root and drops every cache, defeating the purpose).
Why this matters: a “fresh” benchmark run that finds the corpus
already warm in the page cache reports throughput numbers that no
cold start can ever achieve. The uncache step makes “cold
benchmark” a real, repeatable thing.
Probes that go corpus-wide
| Probe | What |
|---|---|
throughput_by_class | Per-band MB/s for lex_into_arena. Splits the corpus by document size (small / medium / large / huge). |
phase_breakdown | Per-phase ms per doc. |
latency_histogram | Log-bucketed latency distribution per phase. |
diagnostic_distribution | What fraction of docs emit diagnostics? Histogram by diag count. |
allocator_pressure | Arena bytes / source byte ratio + intern dedup ratio. |
render_hot_path | Per-band render MB/s. |
See Benchmarks for the full list.
Why a dedicated aozora-corpus crate?
Three concerns kept apart from aozora-bench:
- Corpus discovery and loading. Walking the directory, decoding Shift_JIS, applying any per-work filters. This is shared by every probe + by the xtask corpus pack/unpack tooling.
- Archive format. The blake3 + zstd packing/unpacking lives here so the bench harness doesn’t pull in compression libraries.
- Parallel sweep runner. A reusable rayon::par_iter wrapper with the right ordering (largest documents first to balance load).
aozora-bench then builds on this — each probe is a thin
for doc in corpus { measure(doc) } loop, with the corpus crate
handling all the I/O.
Why a separate AOZORA_PROFILE_REPEAT?
samply traces of probes that include corpus loading get dominated
by I/O and Shift_JIS decode (see
Profiling with samply).
Running the parse pass K times per document after the one-time
load gives samply enough parse-bound wall time to catch the
parser hot frames. Default K = 5; raise to 10+ for very small
corpora.
See also
- Benchmarks — the per-probe descriptions.
- Profiling with samply — the trace workflow.
Phase D — Sentinel enum + single-table registry results
The single-table registry collapsed four per-kind sentinel position
tables into one position-keyed EytzingerMap dispatched through a
NodeRef enum. Before the refactor the registry held independent
inline / block_leaf / block_open / block_close EytzingerMaps
and Registry::node_at(pos) swept them in declaration order with
four if let Some(...) = table.get(&pos) chains; the current shape
is one binary search per lookup, with the variant tag carried on the
entry itself.
Structural changes
old : Registry { inline, block_leaf, block_open, block_close } // 4× EytzingerMap
node_at(pos) → 4-way if-let chain, ~4 binary searches worst-case
now : Registry { table: EytzingerMap<u32, NodeRef<'src>> } // 1× EytzingerMap
node_at(pos) → one binary search, NodeRef variant tags the kind
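For readers who have not met the layout: an Eytzinger (breadth-first) array answers the same membership query as binary search over a sorted slice, but the next index is always 2i or 2i+1, which is friendlier to prefetch and branch prediction. A minimal sketch, with plain u32 keys in place of the registry's position-to-NodeRef entries:

```rust
// Build the Eytzinger (breadth-first heap order) layout from a
// sorted slice: an in-order walk of the implicit tree consumes the
// sorted values left to right. Node k's children live at 2k, 2k+1.
fn build(sorted: &[u32], eyt: &mut [u32], mut next: usize, k: usize) -> usize {
    if k <= sorted.len() {
        next = build(sorted, eyt, next, 2 * k);
        eyt[k - 1] = sorted[next];
        next += 1;
        next = build(sorted, eyt, next, 2 * k + 1);
    }
    next
}

// One comparison per level; the child index is pure arithmetic, so
// the loop carries no hard-to-predict pointer chasing.
fn search(eyt: &[u32], key: u32) -> Option<usize> {
    let mut i = 1;
    while i <= eyt.len() {
        let v = eyt[i - 1];
        if v == key {
            return Some(i - 1);
        }
        i = 2 * i + usize::from(v < key);
    }
    None
}
```

The registry's single table is this idea with each key carrying its NodeRef payload, so node_at(pos) is one such walk.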
Renderers (crates/aozora-render/src/html.rs,
crates/aozora-render/src/serialize.rs) replaced the parallel
4-way if let Some(...) = registry.<kind>.get(...) chains with
a single (Structural, NodeRef) cross-product match — the
compiler now enforces variant coverage at the call site.
Expected runtime impact
Theoretical: per-lookup binary search count drops from ≤ 4 to 1.
Render hot path is dominated by registry lookups inside the
memchr2_iter loop in html::render_into (one lookup per PUA
sentinel hit), so the savings scale with sentinel density. Aozora
corpus profiling against the four-table layout showed registry
lookups at ~12 % of render time on bouten-heavy documents; the
unified dispatch should absorb roughly that fraction.
Measurement procedure
Run before each minor release:
# Take a baseline against the previous release tag
git checkout v0.3.0
just samply-corpus --repeat 5 --out before.json.gz
git checkout -
# Take a current measurement
just samply-corpus --repeat 5 --out after.json.gz
# Diff at the function level
xtask trace compare before.json.gz after.json.gz
Numbers go in the table below at release time:
| Metric | Four-table | Single-table | Δ |
|---|---|---|---|
| Render hot path (corpus median, ns/doc) | to fill | to fill | to fill |
| Registry lookup CPU share (%) | to fill | to fill | to fill |
| End-to-end parse + render p50 (ms/doc) | to fill | to fill | to fill |
Repro environment recorded in perf/samply.md. Pin the host
CPU + corpus version + Rust toolchain so the table is comparable
across releases.
CLI reference
Full reference for the aozora binary. For a guided tour, see
CLI Quickstart.
Synopsis
aozora [OPTIONS] <SUBCOMMAND> [ARGS]
Subcommands:
| Subcommand | What it does |
|---|---|
check | Lex + report diagnostics. |
fmt | Round-trip parse ∘ serialize. |
render | Render to HTML on stdout. |
Global options apply to every subcommand:
| Option | Effect |
|---|---|
-E sjis, --encoding sjis | Decode Shift_JIS source. Default is UTF-8. |
--no-color | Disable ANSI colour in diagnostics output. |
--verbose | Print parse phase timings to stderr. |
--diagnostics LEVEL | Filter diagnostics by minimum level (error | warning | info). Default: warning. |
-V, --version | Print version and exit. |
-h, --help | Print help and exit. |
aozora check
aozora check [OPTIONS] [PATH]
Lex the source and print diagnostics. PATH of - (or omitted)
reads from stdin.
| Option | Effect |
|---|---|
--strict | Exit non-zero on any diagnostic. |
Exit codes: 0 on parse success (regardless of diagnostics, unless
--strict); 1 on diagnostics under --strict; 2 on usage error.
aozora check src.txt # warnings shown, exit 0
aozora check --strict src.txt # warnings -> exit 1
aozora check -E sjis crime.txt # SJIS source
cat src.txt | aozora check # stdin
aozora fmt
aozora fmt [OPTIONS] [PATH]
Round-trip the source through parse ∘ serialize. Default behaviour
prints the canonical form on stdout.
| Option | Effect |
|---|---|
--check | Exit non-zero if the formatted output differs from the input. Don’t print the canonical form. |
--write | Overwrite the input file with the canonical form. (Ignored when reading from stdin.) |
Exit codes: 0 on success (or no diff under --check); 1 on a
formatting mismatch under --check; 2 on usage error.
aozora fmt src.txt > formatted.txt
aozora fmt --check src.txt # CI gate
aozora fmt --write src.txt # in-place
cat src.txt | aozora fmt # stdin → stdout
aozora render
aozora render [OPTIONS] [PATH]
Render the parsed tree to HTML on stdout.
aozora render src.txt > out.html
aozora render -E sjis crime.txt > crime.html
cat src.txt | aozora render -
The output is semantic HTML5 with aozora-* class hooks (no inline
styles). See HTML renderer
for the class-name reference.
Exit codes
| Code | Meaning |
|---|---|
0 | Success. |
1 | Diagnostics emitted under --strict, or formatting mismatch under --check. |
2 | Usage error (bad flag, missing file, decode error). |
Environment
| Variable | Effect |
|---|---|
NO_COLOR | If set (any value), disable ANSI colour output. Same as --no-color. |
AOZORA_LOG | tracing-subscriber filter (e.g. aozora_lex=debug). For internal debugging; not part of the stable surface. |
See Reference → Environment variables for the full env matrix (which includes the bench / profiling vars).
See also
- CLI Quickstart — examples and the three-subcommand rationale.
- Notation overview — what the parser recognises.
- Diagnostics catalogue — the codes you'll see in check's output.
API reference (rustdoc)
The full rustdoc surface for every crate in the workspace is auto-deployed alongside this handbook. Browse it at:
The landing redirects to the top-level facade (aozora); from there
every workspace crate is reachable via the side panel.
Why /api/ instead of docs.rs?
aozora is not yet on crates.io — public release tracks the v1.0 API
freeze. Until then, docs.rs has nothing to render against, so the
rustdoc API reference is built directly from the workspace and
deployed under the GitHub Pages site that serves this handbook.
When the v1.0 release lands and we publish to crates.io, docs.rs
will pick up the API reference automatically; the in-tree /api/
copy will keep working as a mirror, since the GitHub Pages deploy
runs on every main push regardless.
Layout
| Path | What |
|---|---|
/aozora/ (this site) | Handbook (this mdbook) |
/aozora/api/aozora/ | Public facade crate |
/aozora/api/aozora_lex/ | Lexer orchestrator |
/aozora/api/aozora_lexer/ | Seven-phase lexer |
/aozora/api/aozora_render/ | HTML / serialise renderers |
/aozora/api/aozora_syntax/ | AST node types |
/aozora/api/aozora_spec/ | Shared types |
/aozora/api/aozora_scan/ | SIMD scanner |
/aozora/api/aozora_veb/ | Eytzinger sorted-set |
/aozora/api/aozora_encoding/ | SJIS + 外字 |
/aozora/api/aozora_cli/ | CLI binary internals |
/aozora/api/aozora_ffi/ | C ABI driver |
/aozora/api/aozora_wasm/ | WASM driver |
/aozora/api/aozora_py/ | Python binding |
/aozora/api/aozora_bench/ | Bench probes |
/aozora/api/aozora_corpus/ | Corpus runner |
/aozora/api/aozora_proptest/ | Proptest strategies |
/aozora/api/aozora_trace/ | Samply trace loader |
/aozora/api/aozora_xtask/ | Dev tooling |
Doc-link discipline
The workspace [workspace.lints.rustdoc] block sets every
documentation lint to warn (target: deny). Specifically:
- broken_intra_doc_links = "warn" — every [name] link in a doc comment must resolve.
- private_intra_doc_links = "warn" — links to pub(crate) items flagged so the public docs don't dangle into private structures.
- invalid_codeblock_attributes = "warn" — typos in rust,no_run-style code-block attributes get caught.
- invalid_html_tags = "warn" — accidental <foo> in prose flagged.
- invalid_rust_codeblocks = "warn" — every rust code block must parse as Rust.
- bare_urls = "warn" — links must be <https://...> or [label](url), not bare URLs (which markdown parses inconsistently).
- redundant_explicit_links = "warn" — [x](x) where the autolink form would do.
- unescaped_backticks = "warn" — stray backticks flagged.
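In Cargo.toml terms the block presumably reads as follows (reconstructed from the lint list, not copied from the workspace):

```toml
[workspace.lints.rustdoc]
broken_intra_doc_links = "warn"
private_intra_doc_links = "warn"
invalid_codeblock_attributes = "warn"
invalid_html_tags = "warn"
invalid_rust_codeblocks = "warn"
bare_urls = "warn"
redundant_explicit_links = "warn"
unescaped_backticks = "warn"
```

Member crates opt in with lints.workspace = true, so the gate applies uniformly.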
The deferred deny upgrade is tracked separately; once the existing warnings are cleaned up the gate will tighten.
Local rustdoc build
just doc # workspace-wide rustdoc (no deps)
just doc-open # rustdoc + open in default browser
Both run inside the dev container; output lands at
target/doc/aozora/index.html.
Building this handbook
just book-build # render to crates/aozora-book/book/
just book-serve # live-preview at localhost:3000
just book-linkcheck # lychee link verification
See Contributing → Development loop for the full toolchain.
See also
- Crate map — narrative description of each crate.
- Library Quickstart — common API patterns.
Environment variables
A central reference for every env var aozora reads. Variables fall into three groups: parser configuration, dev / bench harness, and container plumbing.
Parser configuration
| Variable | Read by | Effect |
|---|---|---|
NO_COLOR | aozora-cli | If set (any value), disable ANSI colour output. Same as --no-color. Standard convention from https://no-color.org. |
AOZORA_LOG | aozora-cli, library opt-in | tracing-subscriber filter directive (e.g. aozora_lex=debug,aozora_render=info). For internal debugging; not part of the stable surface. |
Dev / bench harness
| Variable | Read by | Effect |
|---|---|---|
AOZORA_CORPUS_ROOT | aozora-corpus, every probe, every sample-profile recipe, the corpus sweep | Directory of 青空文庫 source files (UTF-8 or Shift_JIS). Required for any corpus-driven operation. |
AOZORA_PROFILE_LIMIT | aozora-bench probes | Cap the number of corpus documents per probe. Useful for fast iteration; set to 100 for a sub-second sweep. |
AOZORA_PROFILE_REPEAT | samply-corpus, samply-render | Number of parse / render passes per document after the one-time corpus load. Default 5; raise to give samply enough parser-bound wall time to attach to. |
AOZORA_PROBE_DOC | pathological_probe | Single corpus path to probe in tight per-call mode. Path is relative to $AOZORA_CORPUS_ROOT. |
AOZORA_PROPTEST_CASES | aozora-proptest::config | Override default proptest case count (default 128 per block). 4096 for just prop-deep. |
Container plumbing
These are set by docker-compose.yml and don’t need manual handling
unless you’re invoking cargo directly outside the dev container.
| Variable | Set by | Purpose |
|---|---|---|
CARGO_HOME | compose | /workspace/.cargo — registry + git deps cached on a named volume. |
CARGO_TARGET_DIR | compose | /workspace/target — build output cached on a named volume. |
RUSTC_WRAPPER | compose | sccache — compile cache. |
SCCACHE_DIR | compose | /workspace/.sccache — sccache backing store on a named volume. |
SCCACHE_CACHE_SIZE | compose | 10G — default cap. |
CARGO_INCREMENTAL | compose | 0 — incremental compile defeats sccache; turning it off lets sccache cache the very crates we build most often. |
RUST_BACKTRACE | compose | 1 — full backtraces on panic. |
GIT_CONFIG_* | compose | Whitelists /workspace for git’s “dubious ownership” check (the bind-mounted host source is owned by a non-root UID; the container runs as root). |
Variables we deliberately do not read
A few standard variables aozora intentionally ignores:
| Variable | Why ignored |
|---|---|
LANG / LC_ALL | aozora handles its own encoding via --encoding. Locale-driven byte interpretation would make the parser non-reproducible across machines. |
RUSTFLAGS (in non-build context) | The release / bench / PGO profiles set their own flags; per-invocation RUSTFLAGS would defeat sccache hits for unrelated crates. |
CARGO_BUILD_JOBS | Cargo’s default (CPU count) is what we want. Overriding usually fights the bench harness’s own parallelism control. |
See also
- CLI reference → Environment — the CLI’s per-invocation env.
- Performance → Corpus sweeps — the `AOZORA_CORPUS_ROOT` setup.
- Performance → Profiling with samply — the `AOZORA_PROFILE_REPEAT` knob.
Conformance suite
aozora ships a WPT-style conformance corpus so other implementations of the Aozora Bunko notation (the tree-sitter reference grammar, third-party ports, alternate parsers in other languages) can measure their adherence against the same set of cases the Rust parser is held to.
Tier model
| Level | Meaning | Effect on xtask conformance run |
|---|---|---|
must | Required for any conforming implementation. | A failure here exits non-zero. |
should | Recommended but not strictly required. | A failure here logs a warning. |
may | Optional; implementations decide. | Pure information; never fails. |
The tier is declared per case in
crates/aozora-conformance/fixtures/render/<case>/meta.toml
alongside a feature tag (ruby, bouten, composite, recovery,
…). The runner aggregates pass / fail counts by (feature, level).
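A `meta.toml` for a hypothetical ruby case might look like the following — the field names are inferred from the tier/feature description above, not copied from the fixtures:

```toml
# crates/aozora-conformance/fixtures/render/<case>/meta.toml (illustrative)
level = "must"     # must | should | may
feature = "ruby"   # aggregation key for the (feature, level) report
```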
Running
just conformance # full suite, exits non-zero on must-fail
just render-gate # the byte-identical render gate, K3-style
xtask conformance run # invoke the runner directly
A successful run also writes
crates/aozora-book/src/conformance-results.json with per-case
detail. The JSON shape is stable; downstream dashboards / shields
parse it.
What gets compared
The runner checks two outputs per fixture:
- `tree.to_html()` byte-identical to `expected.html`.
- `tree.serialize()` byte-identical to `expected.serialize.txt`.
Both goldens regenerate via
UPDATE_GOLDEN=1 cargo test -p aozora-conformance --test render_gate
after intentional output changes. The runner does not yet compare
diagnostics or wire-format output; both are future extensions of the
same manifest.
Implementations
The runner currently targets a single implementation — the Rust
parser itself. The results.json format carries an implementation
field so external runs can append their own results without
disturbing the canonical Rust pass-rate.
See also
- Architecture → Error recovery — what the parser does after each diagnostic fires; the recovery-feature fixtures pin those semantics.
- Node reference — per-`NodeKind` documentation.
AST query DSL
A tree-sitter-flavoured pattern DSL selects nodes / tokens from the
concrete syntax tree. Editor surfaces (LSP
textDocument/documentHighlight, “find all ruby annotations”,
refactoring filters, syntax-aware search) compose against the DSL
instead of re-implementing tree walks.
The DSL ships behind the query Cargo feature on the aozora
crate; that feature also enables cst since queries run against
SyntaxNode.
Quickstart
use aozora::Document;
use aozora::query::compile;
let doc = Document::new("|青梅《おうめ》と|青空《あおぞら》");
let cst = aozora::cst::from_tree(&doc.parse());
let query = compile("(Construct @ruby)").expect("compile");
for capture in query.captures(&cst) {
println!("{} -> {:?}", capture.name, capture.node);
}
Grammar
query := pattern ('\n' pattern)* '\n'?
pattern := '(' kind capture? ')'
| '(' '_' capture? ')'
kind := SyntaxKind ident // e.g. `Construct`, `Container`
capture := '@' ident
ident := [A-Za-z_][A-Za-z0-9_-]*
- `(Construct)` — match every `Construct` node.
- `(Construct @ruby)` — capture each `Construct` under the name `ruby`.
- `(_)` — match any kind (node or token).
- `(_ @any)` — combined; tour every kind in preorder.
- Multiple patterns separated by newlines run as an OR — every matching node yields one `Capture` per pattern that hits.
Execution model
The DSL compiles once into a Vec<Pattern>; the engine then tests
every pattern at every preorder step (O(nodes × patterns)). The
small capture-only surface keeps the implementation tight while the
predicate / field-access / alternation extensions wait for a
concrete consumer ask.
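That execution model is small enough to sketch in full. The following toy engine uses simplified, hypothetical types (`Node`, `Pattern`) — the real `SyntaxNode` and compiled query are richer — but shows the same O(nodes × patterns) preorder walk:

```rust
/// Toy stand-in for a CST node (hypothetical; not aozora's SyntaxNode).
struct Node {
    kind: &'static str,
    children: Vec<Node>,
}

/// A compiled pattern: `(Kind @name)`, with `kind: None` for the `_` wildcard.
struct Pattern {
    kind: Option<&'static str>,
    capture: Option<&'static str>,
}

/// Test every pattern at every preorder step — O(nodes × patterns).
fn captures<'a>(node: &'a Node, patterns: &[Pattern], out: &mut Vec<(&'static str, &'a Node)>) {
    for p in patterns {
        let kind_ok = p.kind.map_or(true, |k| k == node.kind);
        if kind_ok {
            if let Some(name) = p.capture {
                out.push((name, node));
            }
        }
    }
    for child in &node.children {
        captures(child, patterns, out);
    }
}

fn main() {
    let tree = Node {
        kind: "Document",
        children: vec![
            Node { kind: "Construct", children: vec![] },
            Node { kind: "Construct", children: vec![] },
        ],
    };
    // Equivalent of compiling "(Construct @ruby)".
    let patterns = [Pattern { kind: Some("Construct"), capture: Some("ruby") }];
    let mut out = Vec::new();
    captures(&tree, &patterns, &mut out);
    assert_eq!(out.len(), 2); // both Construct nodes captured as @ruby
    println!("{} captures", out.len());
}
```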
Not yet supported
- Predicates (`#eq?`, `#match?`) — the tree-sitter query language exposes per-capture filters. The DSL ships without them; consumers filter the resulting `Capture` vec in Rust.
- Field accessors (`(Container body: (Construct))`) — the CST has no named fields yet.
- Quantifiers (`(...)?`, `(...)*`, `(...)+`).
- Alternation (`[...]`) between patterns.
These extensions are forward-compatible with the existing API
shape (compile → captures); a future release can land them
without breaking existing queries.
Cross-references
- Architecture → Concrete syntax tree — the CST the DSL queries.
- Node reference — `NodeKind` / `SyntaxKind` documentation.
Wire format
aozora ships a stable JSON wire format used by every binding —
aozora-ffi (C ABI), aozora-wasm (npm), aozora-py (PyO3) —
to project the parser’s output across language boundaries.
aozora::wire
is the single authority for that projection; downstream drivers
call into it and receive bit-identical output.
Envelope shape
Every wire JSON has the form
{ "schema_version": 1, "data": [ /* … entries … */ ] }
where schema_version is the major version of the wire contract and
data is the per-endpoint payload array.
The four endpoint envelopes are:
| Endpoint | Entry shape | JSON Schema |
|---|---|---|
serialize_diagnostics | { kind, severity, source, span, codepoint? } | schema-diagnostics.json |
serialize_nodes | { kind, span: { start, end } } | schema-nodes.json |
serialize_pairs | { kind, open: { start, end }, close: { … } } | schema-pairs.json |
serialize_container_pairs | { kind, open: { offset }, close: { offset } } | schema-container-pairs.json |
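Putting the envelope and the `serialize_nodes` entry shape together, a payload might look like this — the `kind` value and span offsets here are made-up illustrations, not pinned output:

```json
{
  "schema_version": 1,
  "data": [
    { "kind": "ruby", "span": { "start": 0, "end": 21 } }
  ]
}
```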
SCHEMA_VERSION
The schema_version integer (aozora::wire::SCHEMA_VERSION)
bumps on any breaking change to the serialised shape — variant
additions that surface as a new kind value, field renames, envelope
restructuring. Clients should branch on the version and handle
unknown values defensively; schema 1 makes no forward-compatibility
guarantees with later schemas.
Stability vs. non_exhaustive
Diagnostic
and AozoraNode
are #[non_exhaustive] — minor releases can add variants. The wire
format protects callers in two ways:
- Unrecognised variants emit `kind: "unknown"` rather than failing to serialise, so an old client never sees parse-time data loss.
- `SCHEMA_VERSION` bumps when new variants ship in the wire surface, giving version-branching clients a chance to react before `"unknown"` shows up in production traffic.
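A client following that advice gates on the version before trusting `kind` tags. A sketch, assuming the envelope has already been deserialised into plain values — the concrete tag names (`"ruby"`, `"bouten"`) and the return strings are illustrative:

```rust
/// The newest wire schema this hypothetical client understands.
const SUPPORTED_SCHEMA: u64 = 1;

/// What a defensive client does with one entry's `kind` tag.
fn classify(schema_version: u64, kind: &str) -> &'static str {
    if schema_version > SUPPORTED_SCHEMA {
        // Newer producer: shapes may have changed; bail to a safe path.
        return "refetch-with-newer-client";
    }
    match kind {
        // Tags this client was built against (assumed names).
        "ruby" | "bouten" => "handle",
        // Includes the producer's "unknown" escape hatch.
        _ => "skip-and-log",
    }
}

fn main() {
    assert_eq!(classify(1, "ruby"), "handle");
    assert_eq!(classify(1, "unknown"), "skip-and-log");
    assert_eq!(classify(2, "ruby"), "refetch-with-newer-client");
    println!("defensive wire handling ok");
}
```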
See also
- Diagnostics catalogue — the source-code identifiers each `DiagnosticWire` entry's `kind` field carries.
- Architecture → Error recovery — what the parser actually does after each diagnostic fires.
- Node reference — per-`NodeKind` documentation for every wire `kind` tag emitted by `serialize_nodes`.
- `aozora::wire` rustdoc — Rust API surface (envelope structs, the `schema_*` introspection helpers behind the `schema` Cargo feature).
Development loop
aozora’s development workflow is built around three rules:
- Docker-only execution. The host toolchain is never invoked.
- `just` is the entry point. Every operation goes through a `just` recipe that wraps the underlying tool inside the dev container.
- Lint gates run automatically. lefthook installs git hooks that run `fmt` + `clippy` + `typos` pre-commit and `test` + `deny` pre-push, so a passing local commit roughly mirrors a passing CI run.
First-time setup
git clone git@github.com:P4suta/aozora.git
cd aozora
docker compose build dev # ~5 min the first time, cached afterwards
just hooks # install lefthook git hooks
just test # confirm green
Daily loop
just shell # drop into the dev container
just build # cargo build --workspace --all-targets
just test # workspace nextest
just lint # fmt + clippy + typos + strict-code
just prop # property-based sweep (128 cases / block)
just ci # full CI replica (lint + build + test + prop + deny + audit + udeps + coverage + book-build)
just --list enumerates everything available; just --list --unsorted
preserves the topical grouping (build → test → lint → deps → bench →
docs → release → dev-helpers).
Watch mode (bacon)
just watch # default `check` job
just watch clippy
just watch test
Inside bacon: t test, c clippy, d doc, f failing-only,
esc previous job, q quit, Ctrl-J list jobs. The watcher runs
inside the dev container so file change detection works against the
bind-mounted source.
For headless usage (no TTY, e.g. piping to tee):
just watch-headless check # plain output, no TUI
Why Docker for everything?
Three reasons.
- Toolchain reproducibility. The dev image pins `rust:1.95.0-bookworm` plus exact versions of `cargo-nextest`, `cargo-llvm-cov`, `cargo-deny`, `cargo-audit`, `cargo-udeps`, `cargo-semver-checks`, `cargo-fuzz`, `mdbook`, `mdbook-mermaid`, `lychee`, `git-cliff`, `bacon`, and `lefthook`. A fresh checkout on any machine produces identical tool behaviour.
- sccache hits. The compose file mounts a named volume at `/workspace/.sccache` and sets `RUSTC_WRAPPER=sccache`. Across sessions and across branches, the cache stays warm.
- Host insulation. Nothing in the workspace touches `~/.cargo`, `~/.rustup`, or any global state. Removing the project means `docker compose down -v && rm -rf aozora/`.
The two exceptions to Docker-only:
- samply profiling. `perf_event_open(2)` doesn’t survive the container seccomp profile; the `samply-*` recipes invoke the host toolchain (see Profiling with samply).
- Release builds. GitHub Actions runners build the release binaries natively per OS (each release binary needs to match its runner OS exactly).
Editor / IDE setup
The repository includes a .devcontainer/ config, so:
- VS Code with Dev Containers extension — “Reopen in Container” picks up the dev image, the rust-analyzer toolchain, and the `aozora-*` workspace at once. No host-side Rust install needed.
- Anything else — point your editor’s rust-analyzer at the dev container via `docker exec`. The cleanest approach is symlinking `target/` from the named volume to a host-visible path; the alternative is the editor’s own remote-LSP support.
sccache stats
After a build cycle, check that the cache is actually warm:
just sccache-stats
Healthy steady state: 80%+ hit rate during normal iteration. A
sub-50% hit rate usually means RUSTC_WRAPPER got defeated — the
likely culprit is a stray env override or an [env] in
.cargo/config.toml. To reset counters before a measurement window:
just sccache-zero && just clean && just build && just sccache-stats
Pre-commit hooks (lefthook)
lefthook.yml configures:
- pre-commit (parallel): `fmt`, `clippy`, `typos`.
- commit-msg: Conventional Commits regex.
- pre-push (parallel): `test`, `deny`.
The hooks shell into docker compose run --rm dev … so they’re
identical to the just recipes you ran manually. To skip a hook
temporarily, push from the dev container’s shell directly (the
hooks attach to the host git, not the container’s git).
Why lefthook over husky / pre-commit / cargo-husky?
- husky — Node-only ecosystem; would force a Node dep into a Rust workspace.
- pre-commit (Python framework) — Python-only ecosystem; same issue inverted.
- cargo-husky — abandoned upstream.
- lefthook — single Go binary, language-neutral, parallel execution, ships from a small upstream that’s actively maintained. Mainstream choice for polyglot Rust workspaces in 2026.
Conventional commits
The commit-msg hook enforces:
<type>(<scope>): <subject>
Where <type> ∈ feat | fix | docs | style | refactor | perf | test | build | ci | chore | revert,
and <scope> is typically a crate name without the aozora- prefix
(e.g. feat(render): add aozora-tcy class hook).
git-cliff turns these into the CHANGELOG on release.
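The commit-msg rule is essentially one pattern. Here is a plain-Rust sketch of the check — the real hook uses a regex configured in `lefthook.yml`, and this simplified version ignores details like the `!` breaking-change marker:

```rust
const TYPES: [&str; 11] = [
    "feat", "fix", "docs", "style", "refactor", "perf",
    "test", "build", "ci", "chore", "revert",
];

/// Check `<type>(<scope>): <subject>` with the scope optional.
fn is_conventional(msg: &str) -> bool {
    // Split "<type>(<scope>)" from "<subject>" at the first ": ".
    let Some((head, subject)) = msg.split_once(": ") else { return false };
    if subject.is_empty() {
        return false;
    }
    // Peel an optional "(scope)" off the type.
    let ty = match head.split_once('(') {
        Some((ty, rest)) => {
            if !rest.ends_with(')') || rest.len() < 2 {
                return false; // unclosed or empty scope
            }
            ty
        }
        None => head,
    };
    TYPES.contains(&ty)
}

fn main() {
    assert!(is_conventional("feat(render): add aozora-tcy class hook"));
    assert!(is_conventional("docs: refresh CHANGELOG for v0.2.7"));
    assert!(!is_conventional("update stuff"));
    println!("commit-msg sketch ok");
}
```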
Adding a new 青空文庫 notation
End-to-end TDD flow:
- Spec fixture. Add an `(input, html, serialise)` triple under `spec/aozora/cases/`.
- AST variant. Add a borrowed-arena variant to `AozoraNode` in `crates/aozora-syntax/src/borrowed.rs`.
- Lexer test (red). Add a case to the relevant phase test under `crates/aozora-lexer/tests/`.
- Lexer impl (green). Wire the recogniser into the appropriate phase (sanitize → tokenize → pair → classify).
- Renderer. Emit the new HTML shape in `crates/aozora-render/src/html.rs` and the canonical serialisation in `crates/aozora-render/src/serialize.rs`.
- Cross-layer invariants. Extend the property test or corpus predicate that the new shape interacts with (escape-safety, round-trip, span well-formedness).
See also
- Testing strategy — what each test layer asserts.
- Release process — how a tag becomes a published release.
Testing strategy
aozora targets 100% C1 (branch) coverage — but coverage is the floor, not the ceiling. Every invariant is asserted from multiple angles so a single missed test path doesn’t silently hide a regression.
The five test layers
flowchart TD
A["1. Spec cases<br/>(spec/aozora/cases/*.json)"]
B["2. Property tests<br/>(crates/*/tests/property_*.rs)"]
C["3. Corpus sweep<br/>(every Aozora Bunko work)"]
D["4. Fuzz harness<br/>(cargo-fuzz)"]
E["5. Sanitizers<br/>(Miri / TSan / ASan)"]
A --> B --> C --> D --> E
Each layer catches a different kind of bug:
| Layer | Catches |
|---|---|
| Spec cases | Per-feature contract regressions (the (input, html, canonical) triple). |
| Property tests | Invariant violations in the space of inputs (round-trip, escape-safety, span well-formedness). |
| Corpus sweep | Real-world distribution effects the property generator missed. |
| Fuzz | Latent panics on adversarial inputs the corpus doesn’t contain. |
| Sanitizers | UB / data race / heap-corruption issues the language can’t catch. |
When you add a new invariant, land all five touchpoints in the same PR, or split them into a chain of PRs that explicitly references the invariant.
Layer 1: spec cases
spec/aozora/cases/
├── ruby-nested-gaiji.json
├── emphasis-bouten.json
├── emphasis-double-ruby.json
├── kunten-kaeriten.json
├── page-break.json
└── …
Each case pins an `(input, html, serialise)` triple:
{
"input": "|青梅《おうめ》",
"html": "<ruby>青梅<rt>おうめ</rt></ruby>",
"serialise": "|青梅《おうめ》"
}
The unit test runner (cargo nextest run -p aozora-render) loads
every case, parses, renders, serialises, and compares against the
pinned strings. The property harness also uses these cases as
seed inputs for shrinking.
The flagship in-tree fixture lives at
spec/aozora/fixtures/56656/ — the Japanese translation of Crime
and Punishment (Aozora Bunko card 56656). It exercises 1000+ ruby
annotations, forward-reference bouten, JIS X 0213 gaiji, and
accent decomposition edge cases.
Layer 2: property tests
proptest generators in
crates/aozora-proptest drive parse / render / round-trip
invariants. Default 128 cases per proptest! block (CI budget);
just prop-deep runs 4096 per block (release-cut budget).
just prop # 128 cases
just prop-deep # 4096 cases
AOZORA_PROPTEST_CASES=10000 cargo nextest run --workspace --test 'property_*'
Why proptest over quickcheck:
- Proptest’s shrinker is structural (reduces by the generator’s ops), so a counterexample collapses to a minimal reproduction that still fails. Quickcheck shrinks per-type, which produces noisier outputs.
- Proptest persists failure seeds to `proptest-regressions/` — every reproduced failure becomes a permanent regression test. Quickcheck has nothing like this.
Why a separate generator crate (aozora-proptest):
The generators are non-trivial (they have to produce valid 青空文庫 source — random byte streams would just stress the parser’s error path, which the fuzz harness already covers). Centralising them means every property test in every crate gets the same generator quality, and the generator itself can be unit-tested.
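As a flavour of what such an invariant looks like, here is a hand-rolled round-trip check over a toy ruby-only subset of the notation. This is a deliberately tiny stand-in — the real generators and parser cover the full grammar and malformed inputs, which this sketch does not:

```rust
/// Toy event: plain text or a |base《reading》 ruby pair.
#[derive(Debug, PartialEq)]
enum Ev {
    Text(String),
    Ruby { base: String, reading: String },
}

/// Minimal parse of the ruby subset: `|base《reading》` anywhere in `src`.
/// Assumes well-formed input (every `|` is followed by 《…》).
fn parse(src: &str) -> Vec<Ev> {
    let mut out = Vec::new();
    let mut text = String::new();
    let mut chars = src.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '|' {
            let mut base = String::new();
            while let Some(&d) = chars.peek() {
                if d == '《' { break; }
                base.push(d);
                chars.next();
            }
            chars.next(); // consume 《
            let mut reading = String::new();
            while let Some(d) = chars.next() {
                if d == '》' { break; }
                reading.push(d);
            }
            if !text.is_empty() {
                out.push(Ev::Text(std::mem::take(&mut text)));
            }
            out.push(Ev::Ruby { base, reading });
        } else {
            text.push(c);
        }
    }
    if !text.is_empty() {
        out.push(Ev::Text(text));
    }
    out
}

/// Canonical serialisation back to notation.
fn serialize(evs: &[Ev]) -> String {
    evs.iter()
        .map(|e| match e {
            Ev::Text(t) => t.clone(),
            Ev::Ruby { base, reading } => format!("|{base}《{reading}》"),
        })
        .collect()
}

fn main() {
    for seed in ["|青梅《おうめ》", "雨|青空《あおぞら》晴"] {
        // The round-trip invariant: serialize(parse(s)) == s.
        assert_eq!(serialize(&parse(seed)), seed);
    }
    println!("round-trip holds on seeds");
}
```

A property harness replaces the fixed seed list with a generator of well-formed documents and shrinks any counterexample it finds.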
Layer 3: corpus sweep
export AOZORA_CORPUS_ROOT=$HOME/aozora-corpus
just corpus-sweep
Walks every .txt under $AOZORA_CORPUS_ROOT, parses, verifies
the round-trip property holds, no panics. ~17 000 works in active
rotation; ~90 s sweep on a modern x86_64 desktop using the parallel
loader.
The sweep catches what the property generator can’t — every weird real-world idiom the maintained corpus has accumulated over 25 years of volunteer encoding choices. It’s the parser’s truth-from-the-field.
See Performance → Corpus sweeps for the corpus structure, archive format, and parallel loader details.
Layer 4: fuzz
just fuzz parse_render -- -runs=10000
Targets under crates/*/fuzz/fuzz_targets/:
- `parse_render` — feed arbitrary bytes through `Document::new ∘ to_html`.
- `serialize_roundtrip` — `parse ∘ serialize ∘ parse` stability.
- `sjis_decode` — `aozora_encoding::sjis::decode_to_string` on arbitrary byte streams.
Fuzz failures auto-shrink to a minimal byte sequence and land in
crates/<crate>/fuzz/artifacts/. Add the failing input to
spec/aozora/cases/ as a regression case after diagnosing.
Why libFuzzer / cargo-fuzz:
Mainstream Rust fuzzing runs on libFuzzer via cargo-fuzz; it has
the broadest crate-ecosystem support (most upstream crates ship
fuzz targets), the corpus-management tooling is mature, and the
crash artefacts are diff-able with git diff.
Layer 5: sanitizers
bash scripts/sanitizers.sh miri # UB on FFI / scan intrinsics
bash scripts/sanitizers.sh tsan # data races (parallel corpus loader)
bash scripts/sanitizers.sh asan # heap correctness
Sanitizer runs are slower (~10× under Miri), so they don’t run on every PR — they run nightly via the dev-image cron in CI, plus at release cut. The slow path catches the slow class of bugs.
Why all three:
- Miri catches undefined behaviour the compiler couldn’t see (out-of-bounds slice access, dangling references, transmute mismatches). The FFI driver and the SIMD scanner have unsafe surfaces; Miri is the only fully-checked oracle for them.
- TSan catches race conditions in the parallel corpus loader. We use `rayon` correctly as far as we know, but TSan is the backstop.
- ASan catches the small set of heap-correctness bugs that get through Miri (typically C-side issues in the FFI smoke test).
Coverage measurement
just coverage # cargo llvm-cov branch coverage; CI gate
just coverage-html # local HTML report at coverage/html/index.html
just coverage-branch # nightly toolchain, branch-coverage detail
cargo llvm-cov over tarpaulin: tarpaulin is x86_64-linux
only and uses ptrace-based instrumentation that misses some
optimised-out branches. llvm-cov uses LLVM’s source-based
coverage instrumentation — works on every target and gives accurate
branch numbers.
The CI gate is region coverage; branch coverage is informational (it requires the nightly compiler, which the workspace doesn’t pin on the hot path).
Test naming and structure
- Unit tests in `mod tests {}` at the bottom of each module.
- Integration tests in `crates/<crate>/tests/`. One file per area (e.g. `tests/lexer_phase0.rs`, `tests/lexer_phase3.rs`).
- Property tests prefixed `property_` (the `prop` recipe globs on this).
- Doc tests inside fenced `rust` blocks in rustdoc comments. CI runs `just test-doc` separately because nextest skips them.
Snapshot testing
Where the output is a multi-line string that’s tedious to inline
(rendered HTML, diagnostic-formatted text), we use
insta:
insta::assert_snapshot!(tree.to_html());
The first run writes tests/snapshots/<test>.snap; subsequent runs
compare against it. Updates happen via cargo insta review (the
interactive UI inside the dev container), never by manually editing
the .snap file.
See also
- Development loop — `just test` and friends.
- Performance → Corpus sweeps — how the layer-3 corpus sweep works in practice.
Release process
aozora releases are git-tag-driven: push an annotated v<semver>
tag, and .github/workflows/release.yml builds the cross-platform
binaries, generates release notes from Conventional Commits, and
publishes the GitHub Release.
Cutting a release
# 1. Pre-flight (everything green locally)
just ci # lint + build + test + prop + deny + audit + udeps + coverage + book-build
just prop-deep # 4096 cases per proptest block
AOZORA_CORPUS_ROOT=… just corpus-sweep
# 2. Bump workspace version
cargo set-version --workspace 0.2.7
git commit -am "chore(release): bump workspace to v0.2.7"
# 3. Refresh CHANGELOG (Unreleased → version)
just changelog # runs git-cliff with --unreleased --prepend
git add CHANGELOG.md && git commit -m "docs: refresh CHANGELOG for v0.2.7"
# 4. Tag (annotated)
git tag -a v0.2.7 -m "v0.2.7"
git push origin main v0.2.7
release.yml reacts to the tag: builds release binaries on three
runners (linux x86_64, macOS arm64, windows x86_64), assembles
tarballs / zips with the aozora binary + LICENSE-MIT +
LICENSE-APACHE + NOTICE + README.md, and publishes the
archives plus SHA256SUMS to the GitHub Release.
Sanity check after release
# Verify checksums
curl -L -O https://github.com/P4suta/aozora/releases/download/v0.2.7/SHA256SUMS
curl -L -O https://github.com/P4suta/aozora/releases/download/v0.2.7/aozora-v0.2.7-x86_64-unknown-linux-gnu.tar.gz
sha256sum --check SHA256SUMS
# Verify the binary
tar -xzf aozora-v0.2.7-*.tar.gz
./aozora --version # prints "aozora 0.2.7"
Why annotated tags?
git tag -a creates an annotated tag object with a message; git tag
alone creates a lightweight tag (a bare ref). git-cliff’s release
note extraction only walks annotated tags, and the standard
ecosystem expectation (cargo-release, cargo-dist) is that release
tags are annotated. Using lightweight tags would silently break the
changelog generator.
Why git-tag-driven, not branch-driven?
A release/v0.2.7 branch model is the alternative. We don’t use
it because:
- Single-author workflow doesn’t benefit from the parallel-tracks model that branch-driven releases enable.
- An annotated tag is the release artefact — anything you need to retroactively understand about a release lives in `git show v0.2.7`. A branch loses that locality.
- Rollback is `git tag -d` plus deleting the GitHub release. Trivial.
CHANGELOG generation
git-cliff consumes Conventional Commits
and produces Keep-a-Changelog formatted output:
just changelog # incremental: --unreleased --prepend CHANGELOG.md
just changelog-full # rebuild from scratch
cliff.toml configures the grouping:
| Commit type | Section in CHANGELOG |
|---|---|
feat: | Added |
fix: | Fixed |
perf: | Performance |
refactor: | Changed |
docs: | Documentation |
test: | Tests |
build: | Build |
ci: | CI |
chore: | (skipped unless scope is release) |
revert: | Reverted |
Non-conventional commits are silently skipped (they survive in
git log but don’t pollute the changelog).
Why --unreleased --prepend over -o CHANGELOG.md:
The full-rebuild form (-o) regenerates the entire changelog from
git history every time, which churns the diff for past releases
even when nothing about them changed (whitespace, footer
formatting). The incremental form only writes the new “Unreleased”
section between the latest release and HEAD, leaving past entries
byte-stable.
Why three release targets and not five?
The CI matrix builds:
- `x86_64-unknown-linux-gnu` (linux x86_64)
- `aarch64-apple-darwin` (macOS arm64)
- `x86_64-pc-windows-msvc` (windows x86_64)
We don’t build `x86_64-apple-darwin` (macOS Intel — Apple has deprecated the platform; the arm64 build covers all current Macs) or `aarch64-unknown-linux-gnu` (linux arm64 — covered by `cargo install` from source for the niche ARM Linux deployment case).
Adding a target is one line in release.yml; we add them when a
real consumer asks for a binary build of one. Pre-emptive coverage
isn’t worth the CI minutes.
Why not cargo-dist / release-plz?
Both are mainstream choices; we use a hand-written release.yml
because:
- `cargo-dist` is opinionated about archive layout (assumes you ship `bin/` + `share/`); aozora’s archive is flat (`aozora` + `LICENSE-*` + `NOTICE` + `README.md`).
- `release-plz` automates the version-bump + PR flow; for a single-author repo the manual `cargo set-version` + `git tag` is two commands and one fewer integration to debug.
When the workspace grows past three release targets or aozora goes multi-author, both will be worth re-evaluating.
Pre-1.0 SemVer
aozora is currently in the 0.x series. The contract:
- `0.x.y` → `0.x.y+1`: patches and additions, no breaks. Always safe to upgrade.
- `0.x.y` → `0.x+1.0`: may break the API. `cargo-semver-checks` flags the breaks during CI; the version-bump commit references the break in its body.
- `0.x.y` → `1.0.0`: the API freeze. Post-1.0, breaking changes collect on a `next` branch and ship in a major bump.
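The contract reduces to a tiny predicate — a sketch for intuition, not an API aozora ships (`upgrade_is_safe` is a made-up name):

```rust
/// True when upgrading `old` -> `new` is guaranteed safe under the
/// pre-1.0 contract above: only the 0.x.y -> 0.x.(y+n) step is.
fn upgrade_is_safe(old: (u64, u64, u64), new: (u64, u64, u64)) -> bool {
    old.0 == 0 && new.0 == 0 && old.1 == new.1 && new.2 >= old.2
}

fn main() {
    assert!(upgrade_is_safe((0, 2, 6), (0, 2, 7)));  // patch: always safe
    assert!(!upgrade_is_safe((0, 2, 7), (0, 3, 0))); // minor: may break
    assert!(!upgrade_is_safe((0, 9, 9), (1, 0, 0))); // 1.0: the freeze
    println!("pre-1.0 contract sketch ok");
}
```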
The MSRV pin (rust-toolchain.toml) advances on its own cadence,
roughly quarterly. MSRV bumps are not breaking under our pre-1.0
contract — consumers that need a frozen MSRV pin a release tag.
Publishing to crates.io
Deferred until v1.0. The reasoning:
- Pre-1.0, every minor bump may break the API; publishing those churns the registry for downstream `Cargo.lock` consumers.
- Once published, the crate name becomes load-bearing — name changes cost goodwill. Holding the name unpublished keeps the option to refactor the crate boundary.
When v1.0 lands, the publication workflow will run from a tag:
cargo publish per crate in topological order
(aozora-spec first, aozora last), driven from release.yml.
See also
- Development loop — the local pre-flight commands.
- Testing strategy — `prop-deep` and corpus sweep details.