Welcome
aozora is a pure-functional Rust parser for 青空文庫記法 (Aozora Bunko notation) — the in-text annotation language used by 青空文庫, the long-running volunteer digital library of Japanese literature in the public domain.
It handles ruby (|青梅《おうめ》), bouten / bousen
([#「X」に傍点]), 縦中横, gaiji references
(※[#…、第3水準1-85-54]), kunten / kaeriten, indent and align
containers ([#ここから2字下げ]… [#ここで字下げ終わり]), and
page / section breaks — every notation that appears in a real Aozora
Bunko .txt source.
The repository is CommonMark-free, Markdown-free: aozora deals only with the 青空文庫 notation. The renderer emits semantic HTML5; the lexer reports structured diagnostics; the AST is a borrowed-arena tree that can be walked in O(n) without copying source bytes. If you want a Markdown dialect that also understands aozora notation, see the sibling project afm, which is built on top of this parser.
What this handbook is for
A practical tour and a deep reference, in one document.
- Tour — install the CLI, drop the library into a Rust project, or call it from WASM, C, Python, Go, or the Extism host-SDK. Not sure which? Start with Choosing a binding.
- Notation reference — every annotation aozora recognises, with examples, output, edge cases, and the diagnostics that fire when authors get them subtly wrong.
- Architecture — what makes aozora fast and small: the borrowed-arena AST, the four-phase lexer, the SIMD scanner backends (Teddy, structural bitmaps, Hoehrmann-style multi-pattern DFA), Eytzinger-layout sorted-set lookup, and the Shift_JIS + 外字 resolver. Every choice is motivated against the alternative we didn’t take.
- Performance — the release-profile decisions, PGO pipeline, samply workflow, criterion benchmarks, and the parallel corpus sweep that exercises the parser against every Aozora Bunko work.
- Reference & contributing — CLI, env vars, rustdoc API, and how the dev loop / TDD policy / release pipeline fit together.
Project shape
aozora is a single-author, green-field project that takes the opportunity to reach for a well-suited algorithm and data structure for each problem rather than the obvious naive one. That orientation permeates every chapter: when you read about the scanner, the arena, or the gaiji table, you’ll see why each technique was chosen spelled out, not just what the code does.
Status
Released versions track GitHub Releases; the bindings — the CLI, the Rust library, WASM, the C ABI, Go, Python, and the Extism host-SDK — all build and pass CI smoke tests. Public crates.io publication is gated on the v1.0 API freeze; in the meantime, depend on a tagged commit (see Install for the current pin).
A live build of this site lives at https://p4suta.github.io/aozora/; the rustdoc API reference is layered underneath at https://p4suta.github.io/aozora/api/aozora/.
Install
aozora ships in five shapes — pick the one that matches how you want to consume the parser.
CLI binary (release archive)
Pre-built aozora binaries for the three Tier-1 platforms ride on
every GitHub Release:
- aozora-vX.Y.Z-x86_64-unknown-linux-gnu.tar.gz
- aozora-vX.Y.Z-aarch64-apple-darwin.tar.gz
- aozora-vX.Y.Z-x86_64-pc-windows-msvc.zip
Each archive is shipped with a SHA256SUMS companion. Browse them at
https://github.com/P4suta/aozora/releases.
curl -L -O \
https://github.com/P4suta/aozora/releases/latest/download/aozora-x86_64-unknown-linux-gnu.tar.gz
tar -xzf aozora-*.tar.gz
sudo install -m 0755 aozora /usr/local/bin/
aozora --version
CLI binary (build from source)
The released CLI is on crates.io — cargo install compiles it from the
published source:
cargo install aozora-cli --locked
The --locked flag is non-negotiable — it pins to the exact
Cargo.lock we shipped, which matters because the workspace uses fat
LTO (mismatched dep versions silently change inlining behaviour).
To track the development tip instead, install from git:
cargo install --git https://github.com/P4suta/aozora --locked aozora-cli
Or pin a specific release tag (the current value is on the releases page):
cargo install --git https://github.com/P4suta/aozora \
--tag v0.4.1 --locked aozora-cli
Rust library
aozora is on crates.io. Depend on the umbrella crate alone — it is the
single front door, and the build-block crates (aozora-encoding, …)
are reached through its re-exports (aozora::encoding, …):
[dependencies]
aozora = "0.4"
Bleeding-edge alternative — to track unreleased fixes on main,
pin a git tag instead. This block is the single source of truth for
the recommended git pin — every other doc links here, so a new
release only needs this one tag updated:
[dependencies]
aozora = { git = "https://github.com/P4suta/aozora.git", tag = "v0.4.1" }
The current tag is whichever release GitHub Releases marks as
Latest. Either way the repo follows Conventional Commits and
SemVer: breaking changes advance the major version (post-1.0) or the
minor version (during 0.x), so a "0.4" requirement stays safe.
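To make the 0.x rule concrete, the following is a minimal sketch of how a caret-style requirement such as "0.4" accepts or rejects versions. It is a simplified model of Cargo's default requirement semantics, not Cargo's implementation: `caret_accepts` and the `Version` alias are hypothetical names, and pre-release tags are ignored.

```rust
/// A (major, minor, patch) triple; pre-release tags are ignored in this sketch.
type Version = (u64, u64, u64);

/// Simplified caret rule for a "MAJOR.MINOR" requirement:
/// - MAJOR > 0: same major, any minor at or above the requested one.
/// - MAJOR == 0: same minor only, because a minor bump may break during 0.x.
fn caret_accepts(req_major: u64, req_minor: u64, version: Version) -> bool {
    let (major, minor, _patch) = version;
    if major != req_major {
        return false;
    }
    if req_major == 0 {
        minor == req_minor
    } else {
        minor >= req_minor
    }
}
```

Under this model, `caret_accepts(0, 4, (0, 4, 1))` holds while `caret_accepts(0, 4, (0, 5, 0))` does not, which is why a breaking 0.5 release would not be picked up by a "0.4" requirement.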
WASM (browser / Node)
The browser package is on npm as
aozora-wasm:
npm install aozora-wasm
To build it from a checkout instead:
rustup target add wasm32-unknown-unknown # one-time
wasm-pack build --target web --release crates/aozora-wasm
The post-wasm-opt artifact has a 500 KiB size budget. See
Bindings → WASM for the JS surface and the
post-build wasm-opt invocation we recommend.
C ABI
cargo build --release -p aozora-ffi
# → target/release/libaozora_ffi.{so,dylib,a}
# → target/release/aozora.h (cbindgen-generated)
Link with -laozora_ffi and include aozora.h. See
Bindings → C ABI for the API surface and memory
ownership rules.
Python
The wheel is on PyPI as
aozora_py:
pip install aozora_py
For local development against a checkout, build with maturin instead:
pip install maturin # one-time
cd crates/aozora-py
maturin develop -F extension-module # install in current venv
maturin build -F extension-module --release # produce a redistributable wheel
See Bindings → Python for the API and the
unsendable thread-safety contract.
Toolchain pin
aozora pins Rust 1.95.0 as its MSRV (rust-toolchain.toml). CI
enforces it via a dedicated msrv job. If you run rustup show
inside the repo and see something else, your local override needs
updating.
CLI Quickstart
The aozora binary covers three operations:
aozora check FILE.txt # lex + report diagnostics on stderr
aozora fmt FILE.txt # round-trip parse ∘ serialize, print to stdout
aozora render FILE.txt # render to HTML on stdout
A path argument of - (or no path argument) reads from stdin. --encoding sjis (alias
-E sjis) decodes Shift_JIS source — Aozora Bunko’s distributed
.txt files are Shift_JIS, so this flag is the common case for real
corpus work.
Common invocations
# Lex an Aozora Bunko file and print diagnostics
aozora check -E sjis crime_and_punishment.txt
# Render to HTML (stdout)
aozora render -E sjis crime_and_punishment.txt > out.html
# Pipe from stdin
cat src.txt | aozora render -
# CI gate: fail if format is not idempotent
aozora fmt --check src.txt
Flag reference
| Flag | Subcommand | Effect |
|---|---|---|
| -E sjis, --encoding sjis | all | Decode Shift_JIS source. Default is UTF-8. |
| --strict | check | Exit non-zero on any diagnostic. |
| --check | fmt | Exit non-zero if formatted output differs from input. |
| --write | fmt | Overwrite the input file with the canonical form. (Ignored when reading from stdin.) |
| --no-color | all | Disable ANSI colour in diagnostics output. |
| --verbose | all | Print parse phase timings to stderr. |
Exit codes
| Code | Meaning |
|---|---|
| 0 | Success. |
| 1 | Diagnostics emitted under --strict, or formatting mismatch under --check. |
| 2 | Usage error (bad flag, missing file, decode error). |
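A wrapper script or build tool can turn the table above into a typed result. The sketch below is one way to do that in Rust; the `Outcome` enum and `classify_exit` function are hypothetical names for illustration and are not part of the aozora crates.

```rust
/// Outcome categories taken from the exit-code table above.
#[derive(Debug, PartialEq, Eq)]
enum Outcome {
    /// Exit code 0.
    Success,
    /// Exit code 1: diagnostics under --strict, or a mismatch under --check.
    CheckFailed,
    /// Exit code 2: bad flag, missing file, or decode error.
    UsageError,
    /// Any code the table does not list.
    Unknown(i32),
}

/// Maps a raw process exit code to an `Outcome`.
fn classify_exit(code: i32) -> Outcome {
    match code {
        0 => Outcome::Success,
        1 => Outcome::CheckFailed,
        2 => Outcome::UsageError,
        other => Outcome::Unknown(other),
    }
}
```

When the CLI is launched through `std::process::Command`, the integer comes from `ExitStatus::code()`, which returns an `Option<i32>`; a `None` there means the process ended without an exit code (for example, killed by a signal on Unix) and needs its own handling.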
Diagnostics format
aozora check prints diagnostics in
miette style — a coloured source snippet
with carets pointing at the byte range, a short message, and (where
applicable) a help line:
× ruby reading mismatch: target spans 3 chars but |《》 reading is empty
╭─[input.txt:42:9]
42 │ |青梅《》
· ───┬───
· ╰── empty reading
╰────
help: provide a reading inside 《…》 or remove the | marker
Every diagnostic carries a stable dotted code
(aozora::lex::empty_ruby_reading, aozora::lex::unresolved_gaiji, …);
see the Diagnostics catalogue for the
full list.
Why not a single subcommand?
check / fmt / render are intentionally separate so each one has
a single, predictable failure mode in shell pipelines:
- check exits 0 on parse success, regardless of warnings (use --strict for “no diagnostics allowed”).
- fmt is a pure-text transform: stdin in, canonical text out. --check upgrades it to a CI gate without forking a second binary.
- render is a pure-text-to-HTML transform with the same exit-code shape.
Combining them behind flags would make the exit-code semantics
ambiguous (does --check mean format-check or strict-check?). Keeping
them split is the same logic that splits gofmt from vet from
go build.
Library Quickstart
The minimal Rust use of aozora fits in a handful of lines:
use aozora::Document;
fn main() {
let source = std::fs::read_to_string("src.txt").unwrap();
let doc = Document::new(source);
let tree = doc.parse();
println!("{}", tree.to_html());
}
That’s enough to get HTML out of any UTF-8 青空文庫 source. The rest of this page covers the lifetime model, the diagnostic stream, and the AST walk — three things you’ll need once you do anything beyond “render to HTML”.
The lifetime model
Document owns two things: a bumpalo::Bump
arena and the source Box<str>. AozoraTree<'a> borrows from both:
let doc = aozora::Document::new(source); // Document: 'static
let tree = doc.parse(); // AozoraTree<'_> bound to &doc
let html = tree.to_html(); // walks the borrow
// dropping doc releases every node in a single Bump::reset()
drop(doc);
That is: hand the Document around, not the tree. If you need
to keep a parse result alive across function boundaries, the function
takes ownership of (or borrows) the Document, and re-derives the
tree on the inside. This is unusual for Rust libraries — most parse
APIs hand back an owned tree — but it’s what makes aozora’s
zero-copy AST safe. See Architecture → Borrowed-arena AST
for why this trade is worth it.
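The ownership shape described above can be reproduced with the standard library alone. The sketch below uses hypothetical stand-ins (`Doc`, `Tree`, `count_lines`) rather than the real aozora types, and a trivial line split in place of a parser; it only illustrates why the tree cannot outlive its owner and how a function re-derives the tree from a borrowed owner.

```rust
/// Hypothetical stand-in for the owning document: it keeps the source alive.
struct Doc {
    source: Box<str>,
}

/// Hypothetical stand-in for the borrowed tree: it holds slices into `Doc`.
struct Tree<'a> {
    lines: Vec<&'a str>,
}

impl Doc {
    fn new(source: String) -> Self {
        Doc { source: source.into_boxed_str() }
    }

    /// The returned tree borrows from `self`, so the borrow checker
    /// refuses any use of the tree after the `Doc` is dropped.
    fn parse(&self) -> Tree<'_> {
        Tree { lines: self.source.lines().collect() }
    }
}

/// Functions accept the owner and re-derive the tree on the inside,
/// instead of trying to pass the borrowed tree across the boundary.
fn count_lines(doc: &Doc) -> usize {
    let tree = doc.parse();
    tree.lines.len()
}
```

Returning a `Tree` from a function that created its own `Doc` would not compile, which is the compile-time guarantee the zero-copy design relies on.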
Shift_JIS input
Aozora Bunko ships its corpus as Shift_JIS. Decode through the umbrella
aozora::encoding module first (consumers depend on aozora alone —
never on the internal aozora-encoding crate directly):
use aozora::Document;
use aozora::encoding::decode_sjis;
let bytes = std::fs::read("src.sjis.txt")?;
let utf8 = decode_sjis(&bytes)?; // -> String; Err(DecodeError) on bad input
let doc = Document::new(utf8);
let tree = doc.parse();
decode_sjis handles BOM stripping, JIS X 0213 codepoints, and the
Aozora-specific 外字 references that survive the decode pass as
private-use sentinels (resolved later in the parser). It is strict —
malformed bytes return Err(DecodeError) rather than silently
substituting replacement characters. A runnable version is
just example sjis.
Diagnostics
use aozora::Diagnostic;
let diags: &[Diagnostic] = tree.diagnostics();
for d in diags {
let span = d.span();
// `Diagnostic` is an enum — reach its parts through the accessors.
// `Display` ({d}) renders the human message; there is no `.message`.
eprintln!("[{:?}] {} @ {}..{}", d.severity(), d.code(), span.start, span.end);
}
Each Diagnostic carries a stable code(), a span(), and a
severity() (Error / Warning / Note). A runnable version is
just example diagnostics.
Diagnostics are non-fatal by design: the parser always produces a
tree, even from malformed input. Callers that want strict behaviour
treat any diagnostic as an error themselves. See the
Diagnostics catalogue for the code list.
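A strict policy can be layered on top of the lenient parse in a few lines. The sketch below assumes a simplified, hypothetical `Diag` struct instead of the real `Diagnostic` enum, and `strict` is an illustrative helper name, not an aozora API.

```rust
/// Hypothetical, simplified diagnostic: a stable code plus a byte span.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Diag {
    code: &'static str,
    start: usize,
    end: usize,
}

/// Strict policy over a lenient parse: the value is returned only when the
/// diagnostic list is empty; otherwise the diagnostics become the error.
fn strict<T>(value: T, diags: &[Diag]) -> Result<T, Vec<Diag>> {
    if diags.is_empty() {
        Ok(value)
    } else {
        Err(diags.to_vec())
    }
}
```

A caller that tolerates warnings could filter the slice by severity before passing it in, keeping the policy decision entirely on the caller's side.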
Walking the AST
AozoraTree::source_nodes() returns a source-ordered side table — one
SourceNode per classified Aozora / container span (plain-text runs
between constructs round-trip verbatim and are not listed). It is the
surface editor tooling uses for semantic tokens and document symbols:
for entry in tree.source_nodes() {
let span = entry.source_span; // byte range into the source
// `entry.node` is a `NodeRef`: Inline / BlockLeaf / BlockOpen /
// BlockClose, each wrapping the borrowed AST node or container kind.
println!("{}..{} {:?}", span.start, span.end, entry.node);
}
Match on entry.node (NodeRef) to destructure a specific construct —
e.g. NodeRef::Inline(AozoraNode::Ruby(r)) gives you the ruby base and
reading. A runnable version is just example walk_ast.
The borrowed nodes are cheap to copy (they’re effectively
(tag, &str, &Bump-slice) triples), so you can keep references around
freely as long as the Document lives.
Round-trip and canonicalisation
Every parse should round-trip:
let parsed = doc.parse();
let canonical: String = parsed.serialize();
assert_eq!(canonical, doc.source()); // for *canonical* input
Real Aozora Bunko sources contain stylistic variations (CRLF vs LF,
NFC vs NFD around accents, half-width vs full-width punctuation) that
the lexer normalises before tokenising. For those the assertion above
holds after aozora fmt has been applied once.
The pure round-trip property is what aozora fmt --check exercises in
CI, and what the corpus sweep verifies across the full Aozora Bunko
catalogue (~17 000 works).
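The property behind a --check style gate is idempotence: formatting text that is already canonical changes nothing. The sketch below demonstrates it with a toy canonicaliser that only rewrites line endings; `canonicalise` and `is_canonical` are hypothetical names, and the real formatter normalises more than newlines.

```rust
/// Toy canonicaliser: rewrites CRLF and lone CR line endings to LF.
/// The output never contains '\r', so applying it twice equals applying it once.
fn canonicalise(src: &str) -> String {
    src.replace("\r\n", "\n").replace('\r', "\n")
}

/// The check a CI gate performs: the input already equals its canonical form.
fn is_canonical(src: &str) -> bool {
    canonicalise(src) == src
}
```

A gate built this way fails on the first run against CRLF input and passes on every run after the canonical form has been written back.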
Where to next
- Notation reference for what each node type represents.
- Architecture → Pipeline overview for what happens between Document::new and Document::parse.
- API reference for the rustdoc-generated surface.
Node reference
aozora exposes 19 NodeKind variants. Each is documented
on its own page with source examples, the rendered HTML, the
serialize round-trip output, the in-memory AST shape, and the
diagnostics it can fire alongside.
The page layout matches the aozora explain <kind> CLI subcommand:
once you find the variant in the table, the deep dive is one click —
or one shell invocation — away.
| Variant | Wire tag | Notation |
|---|---|---|
| Ruby | ruby | |base《reading》 |
| Bouten | bouten | [#「target」に傍点] |
| TateChuYoko | tateChuYoko | [#「12」は縦中横] |
| Gaiji | gaiji | ※[#...、第3水準1-85-54] |
| Indent | indent | [#2字下げ] |
| AlignEnd | alignEnd | [#地から2字上げ] |
| Warichu | warichu | [#割り注]... |
| Keigakomi | keigakomi | [#罫囲み] |
| PageBreak | pageBreak | [#改ページ] |
| SectionBreak | sectionBreak | [#改丁] |
| AozoraHeading | heading | [#見出し] |
| HeadingHint | headingHint | [#「対象」は中見出し] |
| Sashie | sashie | [#挿絵(path.png)入る] |
| Kaeriten | kaeriten | [#返り点 一・二] |
| Annotation | annotation | [#任意のコメント] |
| AngleQuote | angleQuote | ≪重要≫ → 《重要》 |
| Container | container | [#ここから...]...[#ここで...終わり] |
| ContainerOpen | containerOpen | (NodeRef projection) |
| ContainerClose | containerClose | (NodeRef projection) |
How to read these pages
Every node page follows the same skeleton:
| Section | Content |
|---|---|
| Source examples | One or two minimal Aozora-notation strings that produce this variant. |
| Rendered HTML | What Document::new(src).parse().to_html() emits. |
| Serialize output | What serialize() emits — typically the canonical form of the source. |
| AST shape | The borrowed-AST struct fields the variant carries. |
| When emitted | Phase 3 classification rule that produces this variant. |
| Diagnostics | Codes that may accompany this variant. |
| Related kinds | Cross-links to neighbours (Bouten ↔ Bousen, Indent ↔ Container::Indent, etc.). |
#[non_exhaustive] on NodeKind: a future minor release adding a
new variant lands here without a breaking change. Downstream
consumers that match on NodeKind exhaustively must include a _
arm.
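The wildcard requirement looks like this in practice. The enum below is a hypothetical three-variant mirror, not the real NodeKind, and `describe` is an illustrative function name. Note that inside the crate that defines the enum the attribute has no effect on matching; the compiler demands the `_` arm only in downstream crates.

```rust
/// Hypothetical cut-down mirror of a `#[non_exhaustive]` kind enum.
#[non_exhaustive]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind {
    Ruby,
    Bouten,
    PageBreak,
}

/// Handles the variants the caller cares about and routes everything else,
/// including variants added by a future release, through the wildcard arm.
fn describe(kind: Kind) -> &'static str {
    match kind {
        Kind::Ruby => "ruby",
        Kind::Bouten => "bouten",
        // Covers PageBreak today and any variant added later.
        _ => "other",
    }
}
```

Choosing a safe default for the wildcard arm (skip, pass through, or log) is what keeps a consumer working unchanged when a new variant appears.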
NodeKind::Ruby
Wire tag: ruby — base text + reading annotation. The most common
non-trivial variant in Aozora Bunko.
Source examples
|青梅《おうめ》
青梅《おうめ》
Both forms classify as Ruby; the leading | (U+FF5C) makes the
delimiter explicit and lets the parser disambiguate the base run
when ambiguous neighbours could otherwise extend the base.
Rendered HTML
<ruby>青梅<rp>(</rp><rt>おうめ</rt><rp>)</rp></ruby>
<rp> parens are emitted so HTML clients without ruby support
still display a readable fallback.
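As a rough illustration of the mapping from source to markup, the sketch below renders a single explicit-delimiter ruby run with only the standard library. `render_explicit_ruby` is a hypothetical helper, not the aozora renderer: it handles only the explicit form, expects the whole input to be one ruby run, and performs no HTML escaping.

```rust
/// Renders one explicit-delimiter ruby run (U+FF5C base U+300A reading U+300B)
/// to the HTML shape shown above. Returns `None` when the input is not exactly
/// that form, or when the base or the reading is empty.
fn render_explicit_ruby(src: &str) -> Option<String> {
    let rest = src.strip_prefix('\u{FF5C}')?; // leading fullwidth bar
    let (base, tail) = rest.split_once('\u{300A}')?; // opening 《
    let reading = tail.strip_suffix('\u{300B}')?; // closing 》
    if base.is_empty() || reading.is_empty() || reading.contains('\u{300B}') {
        return None;
    }
    Some(format!(
        "<ruby>{}<rp>(</rp><rt>{}</rt><rp>)</rp></ruby>",
        base, reading
    ))
}
```

The implicit form, where the base is inferred from the preceding glyph run, needs the classification rules described under "When emitted" and is out of scope for this sketch.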
Serialize output
serialize() always emits the explicit-delimiter form
(|base《reading》), so a parse → serialize → parse round-trip is
a fixed point regardless of which form the source used.
AST shape
pub struct Ruby<'src> {
pub base: NonEmpty<Content<'src>>,
pub reading: NonEmpty<Content<'src>>,
pub delim_explicit: bool,
}
Both base and reading are NonEmpty<Content>;
an empty base or reading is rejected upstream and never produces a
Ruby node.
When emitted
Phase 3 classifies a 《…》 pair as ruby when the preceding run is a
sequence of CJK / kana / latin glyphs and the close is followed by
neither a glyph (which would extend the base further) nor a stray
opener.
Diagnostics
- aozora::lex::unclosed_bracket — unbalanced 《 reaches EOF.
- aozora::lex::unmatched_close — stray 》 with no matching open.
Related kinds
- AngleQuote — ≪…≫ double-angle quotation (displays as 《…》).
- Annotation::InvalidRubySpan — fallback when the ruby pair could not be parsed cleanly.
NodeKind::Bouten
Wire tag: bouten — emphasis dots / sidelines over a target span.
Source examples
青空に[#「青空」に傍点]
青空に[#「青空」に丸傍点]
The bracketed annotation refers backwards to the literal text
quoted with 「…」, so the parser resolves the target by string
match against the preceding line(s).
Rendered HTML
<em class="aozora-bouten aozora-bouten-goma aozora-bouten-right">青空</em>に
The two trailing class slots carry the bouten kind (goma,
circle, wavy-line, …) and the position (right for vertical
text, left for the rare under-side variant).
Serialize output
Round-trips to the explicit [#「target」に<kind>傍点] form.
AST shape
pub struct Bouten<'src> {
pub kind: BoutenKind,
pub target: NonEmpty<Content<'src>>,
pub position: BoutenPosition,
}
BoutenKind enumerates the 11 visual variants (Goma, WhiteSesame,
Circle, …); BoutenPosition is Right (default for vertical text)
or Left.
When emitted
Phase 3 sees [#「QUOTE」に <slug>傍点] / [#「QUOTE」に <slug>傍線],
walks back through the recent text to find QUOTE, and emits the
node with the matched span.
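The backward lookup can be sketched with a reverse substring search. The function below is a hypothetical illustration, not the parser's actual resolver: it assumes the nearest preceding occurrence is the intended target and works on byte offsets into the source.

```rust
use std::ops::Range;

/// Finds the byte range of the nearest occurrence of `quote` that lies
/// entirely before `annotation_start`, searching backwards through `text`.
/// Returns `None` for an empty quote, an invalid offset, or no match.
fn resolve_target(text: &str, annotation_start: usize, quote: &str) -> Option<Range<usize>> {
    if quote.is_empty() {
        return None;
    }
    // `get` rejects offsets that are out of range or not on a char boundary.
    let preceding = text.get(..annotation_start)?;
    let start = preceding.rfind(quote)?;
    Some(start..start + quote.len())
}
```

When the lookup returns `None`, the behaviour described in this chapter is to fall back to an Annotation node rather than to drop the directive.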
Diagnostics
- aozora::lex::unclosed_bracket — annotation [# opened with no matching ].
- Annotation (fallback) — quote target unresolved.
Related kinds
- Annotation — fallback when the target cannot be matched.
NodeKind::TateChuYoko
Wire tag: tateChuYoko — horizontal text inside a vertical
writing-mode run (縦中横, “vertical-with-horizontal-inside”).
Source examples
昭和[#「12」は縦中横]年
Rendered HTML
<span class="aozora-tcy">12</span>
Downstream CSS gives the span text-combine-upright: all for proper
vertical-writing display.
Serialize output
Round-trips to [#「target」は縦中横].
AST shape
pub struct TateChuYoko<'src> {
pub text: NonEmpty<Content<'src>>,
}
When emitted
Phase 3 matches the directive [#「TARGET」は縦中横] and resolves
TARGET in preceding text, then emits with the matched span.
Diagnostics
aozora::lex::unclosed_bracket if [# is unmatched.
Related kinds
- Annotation — fallback if target resolution fails.
NodeKind::Gaiji
Wire tag: gaiji — out-of-character-set glyph reference. The
historical Aozora-Bunko notation for characters Shift_JIS could
not encode; modern files mostly use them for genuine non-Unicode
glyphs.
Source examples
※[#「木+吶のつくり」、第3水準1-85-54]
The ※ (U+203B) flags the construct; [#description、mencode]
carries the human description and a structured Mojikyō / JIS / U+
identifier.
Rendered HTML
<span class="aozora-gaiji" title="木+吶のつくり" data-mencode="第3水準1-85-54">〓</span>
The fallback glyph 〓 (U+3013, “geta mark”) is the conventional
Japanese typesetting placeholder for missing glyphs. When the
resolver finds a Unicode mapping the inner text becomes the
resolved character instead of the geta mark.
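The fallback rule amounts to an `Option` with a default glyph. The sketch below is a hypothetical helper (`gaiji_html` is not an aozora function) that reproduces the span shape shown above; attribute values are inserted verbatim, so a real renderer would need to escape them.

```rust
/// Renders a gaiji span. `resolved` carries the Unicode character when a
/// mapping is known; otherwise the geta mark U+3013 is used as the inner text.
/// No HTML escaping is performed in this sketch.
fn gaiji_html(description: &str, mencode: &str, resolved: Option<char>) -> String {
    let glyph = resolved.unwrap_or('\u{3013}');
    format!(
        "<span class=\"aozora-gaiji\" title=\"{}\" data-mencode=\"{}\">{}</span>",
        description, mencode, glyph
    )
}
```

Keeping the description and the mencode on the element in both cases lets downstream tools recover the original reference even when a resolved character is displayed.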
Serialize output
Round-trips to ※[#description、mencode].
AST shape
pub struct Gaiji<'src> {
pub description: &'src str,
pub ucs: Option<Resolved>,
pub mencode: Option<&'src str>,
}
Resolved is either a single Unicode scalar or one of 25
predefined static combining sequences (e.g. か゚, which is か followed by
the combining semi-voiced sound mark U+309A), kept as static constants
so the borrowed AST stays Copy.
When emitted
Phase 3 sees the ※[#…] digraph and parses the description /
mencode payload. The encoding crate’s gaiji resolver lifts the
mencode reference into a Unicode character when one exists.
Diagnostics
None on a well-formed ※[#...]. Ambiguous descriptions land as
Annotation::Unknown instead of Gaiji.
Related kinds
- Annotation — fallback when description is malformed.
NodeKind::Indent
Wire tag: indent — single-line [#N字下げ] indent marker.
Source examples
[#2字下げ]
[#3字下げ]もう一段下げる
Rendered HTML
<span class="aozora-indent" data-amount="2"></span>
CSS controls the actual padding (typically padding-inline-start: Nem).
Serialize output
Round-trips to [#N字下げ].
AST shape
pub struct Indent {
pub amount: u8,
}
When emitted
Phase 3 matches the digraph plus a numeric prefix and emits a
single inline marker. For paired indent regions ([#ここから2字下げ]
… [#ここで字下げ終わり]), see Container.
Diagnostics
None on well-formed input.
Related kinds
- Container — paired indent / dedent regions (ContainerKind::Indent).
- AlignEnd — right-edge alignment counterpart.
NodeKind::AlignEnd
Wire tag: alignEnd — right-edge alignment marker (字上げ).
Source examples
[#地付き]
[#地から3字上げ]
Rendered HTML
<span class="aozora-align-end" data-offset="0"></span>
offset is 0 for 地付き, N for 地から N 字上げ.
Serialize output
Round-trips to [#地付き] / [#地からN字上げ].
AST shape
pub struct AlignEnd {
pub offset: u8,
}
When emitted
Phase 3 matches the directive form. Paired alignment regions
([#ここから地から N 字上げ] … [#ここで字上げ終わり]) are
Container instead.
Diagnostics
None.
Related kinds
NodeKind::Warichu
Wire tag: warichu — split-line annotation (割注). Two text runs
are stacked into a single line of the surrounding text.
Source examples
[#割り注]上の段/下の段[#割り注終わり]
Rendered HTML
<span class="aozora-warichu">
<span class="aozora-warichu-upper">上の段</span>
<span class="aozora-warichu-lower">下の段</span>
</span>
Serialize output
Round-trips to the explicit [#割り注].../...[#割り注終わり].
AST shape
pub struct Warichu<'src> {
pub upper: Content<'src>,
pub lower: Content<'src>,
}
upper / lower are plain Content;
empty halves are valid (one-sided warichu).
When emitted
The single-line [#割り注]...[#割り注終わり] form is
inline-classified; multi-line [#割り注] containers become a
Container of kind Warichu.
Diagnostics
None on well-formed input.
Related kinds
- Container — multi-line counterpart.
NodeKind::Keigakomi
Wire tag: keigakomi — ruled-box annotation (罫囲み).
Source examples
[#罫囲み]本文[#罫囲み終わり]
Rendered HTML
<span class="aozora-keigakomi"></span>
(Inline marker; the multi-line container form yields a
<div class="aozora-container-keigakomi"> wrapper instead — see
Container.)
Serialize output
Round-trips to [#罫囲み]...[#罫囲み終わり].
AST shape
pub struct Keigakomi;
Marker struct with no payload — the surrounding text carries the content.
When emitted
Phase 3 sees the inline form. Multi-line keigakomi blocks classify
as Container Keigakomi.
Diagnostics
None on well-formed input.
Related kinds
- Container — multi-line counterpart.
NodeKind::PageBreak
Wire tag: pageBreak — [#改ページ] page break marker.
Source examples
end of chapter
[#改ページ]
beginning of next chapter
Rendered HTML
<div class="aozora-page-break"></div>
CSS gives the div a page-break-before: always for paged media
(EPUB / print).
Serialize output
Round-trips to [#改ページ]\n.
AST shape
AozoraNode::PageBreak is a unit variant — no payload.
When emitted
Phase 3 sees [#改ページ] and emits a single BlockLeaf
classification covering the whole bracket span.
Diagnostics
None on well-formed input.
Related kinds
- SectionBreak — [#改丁] family.
NodeKind::SectionBreak
Wire tag: sectionBreak — section breaks (改丁 / 改段 / 改見開き).
Source examples
[#改丁]
[#改段]
[#改見開き]
Rendered HTML
<div class="aozora-section-break aozora-section-break-kaicho"></div>
The second class slot carries the variant slug (kaicho, kaidan,
kaimihiraki, other).
Serialize output
Round-trips to [#改丁] etc.
AST shape
AozoraNode::SectionBreak(SectionKind)
SectionKind is Choho (改丁) / Dan (改段) / Spread (改見開き).
When emitted
Phase 3 matches each directive; the kind enum captures which.
Diagnostics
None on well-formed input.
Related kinds
- PageBreak — finer-grained [#改ページ] variant.
NodeKind::AozoraHeading
Wire tag: heading — Aozora 見出し (window / sub heading).
Source examples
[#見出し]序章[#見出し終わり]
Rendered HTML
<h2 class="aozora-heading aozora-heading-window">序章</h2>
The Pandoc projection uses level 2 for Window, level 3 for Sub.
Serialize output
Round-trips to [#<kind>見出し]...[#<kind>見出し終わり].
AST shape
pub struct AozoraHeading<'src> {
pub kind: AozoraHeadingKind,
pub text: NonEmpty<Content<'src>>,
}
AozoraHeadingKind is Window (窓見出し) or Sub (副見出し).
When emitted
Phase 3 matches the keyword 見出し family and binds the body run.
Diagnostics
None on well-formed input.
Related kinds
- HeadingHint — forward-reference style heading hint.
NodeKind::HeadingHint
Wire tag: headingHint — forward-reference heading hint
([#「target」は中見出し]).
Source examples
序章
[#「序章」は中見出し]
The hint refers to a quoted target string in the preceding line(s); downstream renderers pick this up as “promote the matched run to a heading.”
Rendered HTML
The marker itself emits no visible content; renderers that honour
the hint retroactively elevate the previously matched span to an <h2> or
<h3>. The default HTML renderer in aozora-render
emits a structural marker comment.
Serialize output
Round-trips to [#「target」は<level>見出し].
AST shape
pub struct HeadingHint<'src> {
pub level: u8,
pub target: NonEmptyStr<'src>,
}
level follows the Aozora convention: 1=大見出し, 2=中見出し,
3=小見出し.
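The level convention is a three-entry lookup. A minimal sketch, with `heading_level` as a hypothetical helper name that takes the keyword as it appears in the directive:

```rust
/// Maps the heading keyword of a hint directive to its numeric level,
/// following the convention above. Unknown keywords yield `None`.
fn heading_level(keyword: &str) -> Option<u8> {
    match keyword {
        "大見出し" => Some(1),
        "中見出し" => Some(2),
        "小見出し" => Some(3),
        _ => None,
    }
}
```

For the directive in the source example, the keyword 中見出し resolves to level 2.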
When emitted
Phase 3 matches the directive and records the level + target. Empty target is rejected and falls through to plain text.
Diagnostics
None on well-formed input.
Related kinds
- AozoraHeading — direct heading-marker variant.
NodeKind::Sashie
Wire tag: sashie — illustration reference (挿絵).
Source examples
[#挿絵(cover.png)入る]
[#挿絵(pages/03.jpg、第3章扉絵)入る]
Rendered HTML
<figure class="aozora-sashie">
<img src="cover.png" alt="">
</figure>
When a caption is present it lands as a <figcaption> next to the
<img>.
Serialize output
Round-trips to [#挿絵(path[、caption])入る].
AST shape
pub struct Sashie<'src> {
pub file: NonEmptyStr<'src>,
pub caption: Option<Content<'src>>,
}
Empty file is rejected upstream — the construct cannot ship a
nameless image.
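Splitting the directive body into file and caption can be sketched with prefix, suffix, and separator handling from the standard library. `parse_sashie` is a hypothetical helper, not the aozora recogniser; it assumes fullwidth parentheses (U+FF08, U+FF09) and the ideographic comma (U+3001) as the separator, as in the examples above.

```rust
/// Parses the body of a sashie directive (the text between the brackets)
/// into a file name and an optional caption. Returns `None` when the shape
/// does not match or the file name is empty.
fn parse_sashie(body: &str) -> Option<(&str, Option<&str>)> {
    let inner = body
        .strip_prefix("挿絵\u{FF08}")? // 挿絵 followed by fullwidth (
        .strip_suffix("\u{FF09}入る")?; // fullwidth ) followed by 入る
    let (file, caption) = match inner.split_once('\u{3001}') {
        Some((file, caption)) => (file, Some(caption)),
        None => (inner, None),
    };
    if file.is_empty() {
        return None; // a nameless image is rejected
    }
    Some((file, caption))
}
```

Both returned slices borrow from the input, which matches the zero-copy style of the `'src` fields in the struct above.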
When emitted
Phase 3 matches the 挿絵(…)入る digraph and parses out the path
and optional caption.
Diagnostics
None on well-formed input.
Related kinds
- Annotation — fallback when the directive is malformed.
NodeKind::Kaeriten
Wire tag: kaeriten — kanbun reading-order marker (返り点).
Source examples
読[#返り点 一・二]本
Rendered HTML
<sup class="aozora-kaeriten" data-mark="一・二"></sup>
CSS positions the sup glyph appropriately for vertical / horizontal writing mode.
Serialize output
Round-trips to [#返り点 mark].
AST shape
pub struct Kaeriten<'src> {
pub mark: NonEmptyStr<'src>,
}
When emitted
Phase 3 matches 返り点 keyword + marker payload. Empty marker
rejected upstream.
Diagnostics
None on well-formed input.
Related kinds
None.
NodeKind::Annotation
Wire tag: annotation — generic [#...] annotation that no
specific recogniser claimed.
Source examples
text[#任意のメモ]more
text[#ふりがな付きの説明]more
Rendered HTML
<span class="aozora-annotation" title="..."></span>
The default renderer suppresses the body; downstream filters can
match on aozora-annotation to surface the comment.
Serialize output
Round-trips to [#<raw>].
AST shape
pub struct Annotation<'src> {
pub raw: NonEmptyStr<'src>,
pub kind: AnnotationKind,
}
AnnotationKind discriminates the recognised sub-variants
(Unknown, AsIs, TextualNote, InvalidRubySpan, …); raw
carries the raw bracket body for any further analysis.
When emitted
Phase 3 reaches [#...] after no specific recogniser matched.
Annotation is the fallback that always preserves the user’s
content rather than dropping it.
Diagnostics
None — Annotation is the recovery path for unrecognised
directives. A genuine invalid-bracket diagnostic
(unclosed_bracket / unmatched_close) appears separately.
Related kinds
NodeKind::AngleQuote
Wire tag: angleQuote — double-angle quotation (二重山括弧).
A 底本’s twin angle brackets 《…》 would collide with the ruby
markers 《…》 (U+300A/U+300B), so Aozora Bunko input encodes them
as ≪…≫ (U+226A/U+226B). The renderer restores the display form
《…》.
Source examples
≪重要≫
底本 《重要》 → aozora text ≪重要≫ → display 《重要》.
Rendered HTML
<span class="aozora-angle-quote">《重要》</span>
The 《…》 display glyphs (U+300A/U+300B) are restored inside the span;
stylesheets target .aozora-angle-quote for any further treatment.
Serialize output
Round-trips to the input form ≪content≫ (U+226A/U+226B).
AST shape
pub struct AngleQuote<'src> {
pub content: NonEmpty<Content<'src>>,
}
content is NonEmpty — empty ≪≫ is rejected upstream and falls
through to plain text rather than producing an empty node.
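The input-to-display substitution can be sketched as follows. `render_angle_quote` is a hypothetical helper, not the aozora renderer: it expects the whole input to be one ≪…≫ run and does not escape HTML.

```rust
/// Renders one double-angle quotation written with U+226A / U+226B into the
/// span shown above, restoring the U+300A / U+300B display glyphs.
/// Returns `None` for input of another shape or for an empty quotation.
fn render_angle_quote(src: &str) -> Option<String> {
    let content = src
        .strip_prefix('\u{226A}')? // input opener ≪
        .strip_suffix('\u{226B}')?; // input closer ≫
    if content.is_empty() {
        return None; // an empty pair falls through to plain text
    }
    Some(format!(
        "<span class=\"aozora-angle-quote\">\u{300A}{}\u{300B}</span>",
        content
    ))
}
```

Serialisation goes the other way and keeps U+226A / U+226B, so the display glyphs never re-enter the source where they would collide with ruby markers.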
When emitted
Phase 1 tokenises ≪ / ≫ (U+226A/U+226B) as ordinary single-character
triggers; Phase 3 pairs ≪…≫ into one AngleQuote node. A stray
底本-style 《《…》》 is not this node — it is two ruby openers and
yields a nested-ruby diagnostic with plain fallback.
Diagnostics
- aozora::lex::unclosed_bracket — ≪ reaches EOF without ≫.
- aozora::lex::unmatched_close — stray ≫ with no matching open.
Related kinds
- Ruby — 《…》 reading marker (the colliding notation).
NodeKind::Container
Wire tag: container — paired-container wrapping
([#ここから...]...[#ここで...終わり]).
Source examples
[#ここから2字下げ]
第一節
第二節
[#ここで字下げ終わり]
[#罫囲み]
本文
[#罫囲み終わり]
[#地から3字上げ]
寄付者一覧
[#字上げ終わり]
Rendered HTML
<div class="aozora-container-indent" data-amount="2">
...
</div>
The wrapping div carries the kind-specific class
(aozora-container-indent, aozora-container-warichu,
aozora-container-keigakomi, aozora-container-align-end) plus
any structural data (indent amount, align offset) on data-*.
Serialize output
Round-trips to the explicit-paired directive form.
AST shape
pub struct Container {
pub kind: ContainerKind,
}
pub enum ContainerKind {
Indent { amount: u8 },
Warichu,
Keigakomi,
AlignEnd { offset: u8 },
}
The Container payload wraps the content: the walker driver fires
visit_container_open on enter and visit_container_close on exit,
so renderers wrap the body cleanly.
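The enter and exit callbacks can be modelled as a flat event stream. The sketch below uses hypothetical `Kind` and `Event` types and a free `render` function in place of the real visitor trait; it shows only how an open event emits the wrapper start and a close event emits its end.

```rust
/// Hypothetical subset of container kinds.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind {
    Indent { amount: u8 },
    Keigakomi,
}

/// Hypothetical flat event stream produced while walking a document.
enum Event<'a> {
    Open(Kind),
    Text(&'a str),
    Close(Kind),
}

/// Emits the wrapper start on `Open`, the body on `Text`, and the wrapper
/// end on `Close`. Text is inserted verbatim; no escaping in this sketch.
fn render(events: &[Event<'_>]) -> String {
    let mut out = String::new();
    for event in events {
        match event {
            Event::Open(Kind::Indent { amount }) => out.push_str(&format!(
                "<div class=\"aozora-container-indent\" data-amount=\"{}\">",
                amount
            )),
            Event::Open(Kind::Keigakomi) => {
                out.push_str("<div class=\"aozora-container-keigakomi\">")
            }
            Event::Text(text) => out.push_str(text),
            Event::Close(_) => out.push_str("</div>"),
        }
    }
    out
}
```

Because open and close arrive as separate events, nested containers need no recursion in the renderer: balanced events yield balanced markup.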
When emitted
Phase 2 pairs the [#ここから…] / [#ここで…終わり] openers
and closers; Phase 3’s BlockOpen / BlockClose events project to
this variant.
Diagnostics
unclosed_bracket for unbalanced opens.
Related kinds
- ContainerOpen — NodeRef projection of the open boundary.
- ContainerClose — NodeRef projection of the close boundary.
- Indent, AlignEnd, Warichu, Keigakomi — single-line counterparts.
NodeKind::ContainerOpen
Wire tag: containerOpen — paired-container open boundary marker.
This variant only appears in NodeRef-flavoured wire output (e.g.
serialize_nodes); the structural AozoraNode::Container
payload covers the wrapping construct itself.
Source examples
[#ここから2字下げ] <- ContainerOpen
indented body
[#ここで字下げ終わり] <- ContainerClose
Rendered HTML
The default HTML renderer routes the open / close pair through
visit_container_open / visit_container_close and emits the
opening <div class="aozora-container-..."> wrapping the body.
Serialize output
Round-trips together with the matching close to the
[#ここから…]...[#ここで…終わり] form.
AST shape
NodeRef::BlockOpen(ContainerKind) — see
ContainerKind.
When emitted
Phase 2 pairs the open / close brackets; Phase 3’s normalised text
emits a BlockOpen PUA sentinel at the position of the opener so
the registry can dispatch the open event during walking.
Diagnostics
unclosed_bracket if the open never finds a matching close.
Related kinds
- ContainerClose — paired close-side counterpart.
- Container — the structural payload variant.
NodeKind::ContainerClose
Wire tag: containerClose — paired-container close boundary marker.
NodeRef-only counterpart of ContainerOpen.
Source examples
[#ここから2字下げ] <- ContainerOpen
body
[#ここで字下げ終わり] <- ContainerClose
Rendered HTML
Routed through visit_container_close; the default renderer emits
the closing </div> of the
<div class="aozora-container-..."> opened by the matching
ContainerOpen.
Serialize output
Round-trips with the matching open.
AST shape
NodeRef::BlockClose(ContainerKind).
When emitted
Phase 3’s normalised text emits a BlockClose PUA sentinel at the
matching close position.
Diagnostics
unmatched_close if the close has no open partner — in which case
no ContainerClose is emitted and the close-bracket bytes flow
through as plain.
Related kinds
- ContainerOpen — open-side counterpart.
- Container — structural payload.
Notation overview
青空文庫記法 is a small, line-oriented annotation language layered inside a plain-text Japanese document. Authors mark up the text in two distinct registers:
- Inline markers — single-character sigils (|, 《, 》, ※) that fence inline annotations directly inside the prose.
- Block annotations — [#…] brackets containing a Japanese directive in natural language (“ここから2字下げ”, “「X」に傍点”, …) that act as openers, closers, or self-contained directives.
aozora recognises every annotation that survives in real Aozora Bunko sources — the volunteer corpus has ~17 000 works in active rotation, and the parser is exercised against the entire archive in CI as part of the corpus sweep.
Notations covered
| Chapter | What it marks |
|---|---|
| Ruby | Pronunciation glosses (|青梅《おうめ》, 青梅《おうめ》). |
| Bouten / bousen | Emphasis dots and lines: 傍点 (sesame, white sesame, filled circle, open circle, …) and 傍線 (single, double, dashed, …). |
| 縦中横 | Horizontally-set runs inside vertical text ([#「数字」は縦中横]). |
| Gaiji | Out-of-Shift_JIS character references (※[#…、第3水準1-85-54]) and accented-Latin decomposition. |
| Kunten | 漢文 reading marks: 返り点 (レ, 一, 二, 上, 中, 下), 再読文字, 送り仮名. |
| Indent containers | [#ここから2字下げ]… [#ここで字下げ終わり] and the geji / 地付き / 地寄せ family. |
| Page & section breaks | 改ページ, 改丁, 改見開き, 改段. |
| Diagnostics | The catalogue of structured diagnostics the parser emits. |
Spec source of truth
The authoritative spec lives at
https://www.aozora.gr.jp/annotation/index.html. A snapshot is
vendored at docs/specs/aozora/
in the repo so that every page in this handbook can link to a stable
fragment (the upstream HTML reorganises occasionally; the snapshot
shields chapter cross-references from rot).
When this handbook says “the spec says X”, that means that snapshot. Where the live spec drifts, we update the snapshot, then update the parser, then update this handbook — in that order.
How a sample input looks
|青梅《おうめ》街道を歩いて、※[#「魚+師のつくり」、第3水準1-94-37]を見た。
[#ここから2字下げ]
[#「平和」に傍点]という言葉は、もう古い。
[#ここで字下げ終わり]
[#改ページ]
That single sample exercises ruby, gaiji, indent containers, bouten, and a page break. The parser turns it into a flat node stream — see the per-chapter pages for the exact AST shapes.
Notation we deliberately omit
Aozora Bunko’s spec mentions a handful of annotations that don’t appear in the maintained corpus:
- Image references beyond [#挿絵] — covered up to the caption, no actual image rendering.
- キャプション alignment edge cases that the spec lists but no active work uses (verified against the corpus sweep).
These are kept as a generic Annotation{Unknown} and rendered
best-effort (the “no bare [#” guarantee still holds); a ここから…
opener that names no known container also emits
unrecognised_container_directive.
Adding full support is a one-PR job once a real corpus document needs it.
Ruby (|青梅《おうめ》)
Ruby is a pronunciation gloss attached to a run of base text. In 青空文庫 source it appears in two shapes:
|青梅《おうめ》 ← explicit-base form
青梅《おうめ》 ← implicit-base form (auto-detect)
Both forms render the same HTML:
<ruby>青梅<rt>おうめ</rt></ruby>
Explicit base (|…《…》)
The full-width vertical bar | (U+FF5C) marks the start of the
base text; 《…》 (U+300A / U+300B) wraps the reading. The base
runs from | to the 《. Use this form when:
- The base contains characters that the auto-detect heuristic would otherwise skip (kana, ASCII letters, mixed scripts).
- The boundary between base and surrounding text is ambiguous.
|山田《やまだ》さん → <ruby>山田<rt>やまだ</rt></ruby>さん
|HTTP《ハイパー・テキスト》 → <ruby>HTTP<rt>ハイパー・テキスト</rt></ruby>
Implicit base
When 《…》 follows a run of kanji without a leading |, the
parser auto-detects the base by scanning backwards through the kanji
run. The auto-detect terminates at the first non-kanji character
(kana, punctuation, ASCII, full-width digit).
青梅《おうめ》 → <ruby>青梅<rt>おうめ</rt></ruby>
お青梅《おうめ》 → お<ruby>青梅<rt>おうめ</rt></ruby>
The “kanji” predicate is CJK Unified Ideographs + CJK Compatibility Ideographs + CJK Unified Ideographs Extension A–F + the iteration mark 々. JIS X 0213 plane-2 ideographs not in Unicode are represented as gaiji references (see Gaiji) and likewise terminate the auto-detect.
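The backwards scan can be sketched as follows. This is a minimal illustration, not aozora's actual code: the function names `is_ruby_base_kanji` and `implicit_base` are hypothetical, and the ranges are simply the Unicode block boundaries for the sets named above.

```rust
/// Hypothetical predicate: true for characters the implicit-base scan
/// treats as "kanji" (Unicode block boundaries for the sets named above).
fn is_ruby_base_kanji(c: char) -> bool {
    matches!(
        c as u32,
        0x4E00..=0x9FFF           // CJK Unified Ideographs
            | 0xF900..=0xFAFF     // CJK Compatibility Ideographs
            | 0x3400..=0x4DBF     // Extension A
            | 0x20000..=0x2A6DF   // Extension B
            | 0x2A700..=0x2B73F   // Extension C
            | 0x2B740..=0x2B81F   // Extension D
            | 0x2B820..=0x2CEAF   // Extension E
            | 0x2CEB0..=0x2EBEF   // Extension F
            | 0x3005              // 々 iteration mark
    )
}

/// Given the text immediately before a `《`, return the trailing kanji run.
/// An empty result means there is no base to anchor the reading to.
fn implicit_base(before: &str) -> &str {
    let mut start = before.len();
    // Walk characters from the end; stop at the first non-kanji character.
    for (idx, c) in before.char_indices().rev() {
        if !is_ruby_base_kanji(c) {
            break;
        }
        start = idx;
    }
    &before[start..]
}
```

Calling `implicit_base("お青梅")` yields `"青梅"`, which matches the second example above: the leading kana stays outside the ruby.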
Empty reading
|青梅《》 supplies a base but an empty reading. The lexer emits
aozora::lex::empty_ruby_reading
(an Error) and the construct degrades to plain text — no Ruby node is
built.
The implicit-base form silently skips a 《》 with empty contents — the
parser can’t be sure a base was intended, so it treats the bare 《》 as
literal text and stays silent.
Nested ruby (forbidden)
The spec disallows ruby inside ruby. A reading whose body opens another
《…》 (e.g. |漢《か《ん》じ》) fires
aozora::lex::nested_ruby; the outer ruby
is still parsed best-effort. (An adjacent 《《…》》 is a different
construct — double-bracket bouten — not nested ruby.)
AST shape
pub struct Ruby<'src> {
pub base: NonEmpty<Content<'src>>, // never empty
pub reading: NonEmpty<Content<'src>>, // never empty
pub delim_explicit: bool, // true for the |…《…》 form
}
base and reading are Content (a Plain(&str) fast path or a
Segments run carrying nested gaiji / annotations), wrapped in
NonEmpty so an empty payload is unrepresentable — Phase 3 only emits a
Ruby once both sides have content (an empty reading takes the
empty-reading path instead). delim_explicit records
whether the source used the |…《…》 form so the serializer re-emits the
| only when the original did.
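A minimal sketch of how delim_explicit can drive re-serialisation. The `RubySketch` type and `serialize_ruby` function are hypothetical stand-ins that use plain `&str` fields instead of `NonEmpty<Content>`; they are not the crate's API.

```rust
/// Simplified stand-in for the Ruby node (plain &str instead of
/// NonEmpty<Content>); hypothetical, for illustration only.
struct RubySketch<'src> {
    base: &'src str,
    reading: &'src str,
    delim_explicit: bool,
}

/// Re-emit the source form, adding the leading bar only when the
/// original source used the explicit-base form.
fn serialize_ruby(r: &RubySketch<'_>) -> String {
    let mut out = String::new();
    if r.delim_explicit {
        out.push('\u{FF5C}'); // full-width vertical bar
    }
    out.push_str(r.base);
    out.push('《');
    out.push_str(r.reading);
    out.push('》');
    out
}
```

Both forms render to the same HTML, so the flag only matters for round-tripping the source text.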
Edge cases
| Input | Output |
|---|---|
| 青梅《おうめ》 | <ruby>青梅<rt>おうめ</rt></ruby> |
| |青梅《おうめ》 | <ruby>青梅<rt>おうめ</rt></ruby> (canonical-equivalent) |
| |山田《やまだ》 | <ruby>山田<rt>やまだ</rt></ruby> |
| |HTTP《ハイパー・テキスト》 | <ruby>HTTP<rt>ハイパー・テキスト</rt></ruby> |
| お青梅《おうめ》 | お<ruby>青梅<rt>おうめ</rt></ruby> (auto-detect skips kana) |
| 1青梅《おうめ》 | 1<ruby>青梅<rt>おうめ</rt></ruby> (auto-detect skips digit) |
| |青梅《》 | plain text + empty_ruby_reading |
| 《おうめ》 | literal text (no preceding kanji to anchor) |
| |漢《か《ん》じ》 | best-effort ruby + nested_ruby |
See also
- Bouten / bousen — emphasis annotations that share the 「X」に… indirection idiom.
- Architecture → Seven-phase lexer — where ruby recognition fits in the classifier pipeline.
Bouten / bousen (傍点・傍線)
Bouten (傍点) are emphasis dots placed beside characters in vertical text — the Japanese typographic equivalent of italic or bold. Bousen (傍線) are the same idea with a line instead of dots. aozora recognises eleven variants in total: eight dot variants and three line variants, listed in the variant catalogue.
Notation forms
Two target-by-quoting spellings and a range form all appear in the real corpus:
[#「平和」に傍点] ← target-by-quoting
平和[#「平和」に傍点] ← redundant explicit copy (also accepted)
[#傍点]平和[#傍点終わり] ← range form (bare opener / closer)
The target-by-quoting form is by far the most common: the inline annotation looks backwards in the text for the most recent occurrence of the quoted string and applies the bouten to that run.
Variant catalogue
aozora recognises eleven variants — eight 点 (dot) families and three 線 (line) families:
| Slug | Source keyword | Family |
|---|---|---|
| goma | 傍点 | 点 |
| white-sesame | 白ゴマ傍点 | 点 |
| circle | 丸傍点 | 点 |
| white-circle | 白丸傍点 | 点 |
| double-circle | 二重丸傍点 | 点 |
| janome | 蛇の目傍点 | 点 |
| cross | ばつ傍点 | 点 |
| white-triangle | 白三角傍点 | 点 |
| wavy-line | 波線 | 線 |
| under-line | 傍線 | 線 |
| double-under-line | 二重傍線 | 線 |
Each variant has a stable slug that the HTML renderer emits as a class
name (e.g. <em class="aozora-bouten-goma">). The 点/線 family boundary is
what mismatched_bouten_container
checks for the range form below.
Default rendering
aozora emits <em class="aozora-bouten-<slug>">…</em> so that an
external stylesheet can pick the visual treatment per variant.
Default CSS hooks live at the consumer side; the parser ships no
stylesheet of its own.
<!-- 平和[#「平和」に傍点] -->
平和<em class="aozora-bouten aozora-bouten-goma aozora-bouten-right">平和</em>
(The redundant copy is intentional — the [#…] indirection
re-emits the target wrapped in <em>, leaving the original run
in place. The HTML rendering matches what print Aozora Bunko output
does in practice.)
Range form
To emphasise a run directly (rather than by quoting it), wrap it between a
bare opener and its matching closer — note there is no ここから /
ここで (those prefixes are for block layout / 太字 / 斜体, not 傍点):
彼は[#傍点]必ず[#傍点終わり]来る
本文[#二重傍線]乙[#二重傍線終わり]
[#左に傍線]丙[#左に傍線終わり]
Renders inline as <em>:
彼は<em class="aozora-bouten aozora-bouten-goma aozora-bouten-right">必ず</em>来る
The opener can be any variant keyword (傍点, 白丸傍点, 二重傍線, …), with
an optional 左に prefix for left-side marks; the closer is the same
keyword plus 終わり. The closer’s family must match the opener’s: a
点 opener ([#傍点]) pairs with a 点 closer ([#傍点終わり]), a 線 opener
([#傍線]) with a 線 closer ([#傍線終わり]). A family mismatch fires
mismatched_bouten_container.
AST shape
Both the indirect ([#「X」に傍点]) and range ([#傍点]…[#傍点終わり])
forms produce Bouten nodes:
pub struct Bouten<'src> {
pub kind: BoutenKind, // one of 11 variants (点 / 線)
pub target: NonEmpty<Content<'src>>, // the emphasised run
pub position: BoutenPosition, // Right (default) | Left (左に…)
pub consumed_predecessor: bool, // whether it absorbed the run before it
}
BoutenKind is a flat enum (BoutenKind::is_line splits 点 from 線); see
the rustdoc for the exact variant list.
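As a rough illustration of the flat-enum shape, the sketch below mirrors the variant catalogue. The variant names are hypothetical (derived from the slugs; the rustdoc has the real ones), and `same_family` is an illustrative helper for the 点 / 線 check, not a crate function.

```rust
/// Hypothetical variant names derived from the catalogue slugs.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum BoutenKind {
    Goma, WhiteSesame, Circle, WhiteCircle,
    DoubleCircle, Janome, Cross, WhiteTriangle,
    WavyLine, UnderLine, DoubleUnderLine,
}

impl BoutenKind {
    /// Splits the 線 (line) family from the 点 (dot) family.
    fn is_line(self) -> bool {
        matches!(self, Self::WavyLine | Self::UnderLine | Self::DoubleUnderLine)
    }

    /// Stable slug used in the HTML class name.
    fn slug(self) -> &'static str {
        match self {
            Self::Goma => "goma",
            Self::WhiteSesame => "white-sesame",
            Self::Circle => "circle",
            Self::WhiteCircle => "white-circle",
            Self::DoubleCircle => "double-circle",
            Self::Janome => "janome",
            Self::Cross => "cross",
            Self::WhiteTriangle => "white-triangle",
            Self::WavyLine => "wavy-line",
            Self::UnderLine => "under-line",
            Self::DoubleUnderLine => "double-under-line",
        }
    }
}

/// Range-form pairing rule: opener and closer must share a family,
/// but need not be the same variant.
fn same_family(open: BoutenKind, close: BoutenKind) -> bool {
    open.is_line() == close.is_line()
}
```

Because the `match` in `slug` is exhaustive, adding a twelfth variant without a slug would fail to compile, which is the main practical benefit of the flat enum.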
See also
- Notation overview — how this fits with the other inline annotations.
- Diagnostics catalogue — mismatched_bouten_container and bouten_target_ambiguous.
縦中横 (tate-chū-yoko)
縦中横 (tate-chū-yoko, “horizontal in vertical”) is a typographic construct that lays a short run — usually digits, Latin letters, or mixed punctuation — horizontally inside otherwise vertical text. In print, it is the common treatment for two- or three-digit numbers in a vertical paragraph.
Notation
The annotation always uses the indirect-quoting form:
昭和27年生まれ[#「27」は縦中横]
Renders as:
昭和<span class="aozora-tcy">27</span>年生まれ
The [#…] directive looks back through the preceding text and
applies the tcy treatment to the most recent occurrence of the
quoted run. The target text is not re-emitted — the wrapper is
applied in place, unlike bouten.
Container form
For longer mixed-orientation runs (multi-line table data, Latin abbreviations spanning a paragraph), the container form sits inside an outer indent block:
[#ここから縦中横]
27 / 100 = 0.27
[#ここで縦中横終わり]
Renders as:
<div class="aozora-tcy-block">
27 / 100 = 0.27
</div>
Common targets
| Source | Output |
|---|---|
| 27[#「27」は縦中横] | <span class="aozora-tcy">27</span> |
| 100%[#「100」は縦中横] | <span class="aozora-tcy">100</span>% |
| A4[#「A4」は縦中横] | <span class="aozora-tcy">A4</span> |
| &[#「&」は縦中横] | <span class="aozora-tcy">&amp;</span> |
(HTML escapes are handled by the renderer, not the AST.)
Anchor lookup
The lookup that finds the target run:
- Scans backwards from the [#…] directive through the current line.
- Stops at the first match for the quoted run.
- Falls through to the previous line if no match (with an upper bound of 64 KiB or one paragraph break, whichever comes first).
If no match is found, diagnostic
aozora::lex::tcy_target_not_found
fires and the directive degrades to a plain Annotation{Unknown}.
Authors get the same look-back semantics they’d get from bouten — see
Bouten for the symmetric case.
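The look-back can be approximated with a bounded reverse search. The sketch below is an assumption-laden illustration: `find_lookback_target` is a hypothetical name, a paragraph break is assumed to be a blank line (`"\n\n"`), and a single `rfind` over the window stands in for the line-by-line scan (both select the nearest preceding match).

```rust
use std::ops::Range;

/// Upper bound on the look-back window, per the description above.
const LOOKBACK_LIMIT: usize = 64 * 1024;

/// `preceding` is the text before the directive. Returns the byte range
/// of the most recent occurrence of `target` inside the window, if any.
fn find_lookback_target(preceding: &str, target: &str) -> Option<Range<usize>> {
    if target.is_empty() {
        return None;
    }
    // Bound the window by size, snapping forward to a char boundary.
    let mut start = preceding.len().saturating_sub(LOOKBACK_LIMIT);
    while !preceding.is_char_boundary(start) {
        start += 1;
    }
    // Bound the window by the nearest paragraph break (assumed: blank line).
    if let Some(brk) = preceding[start..].rfind("\n\n") {
        start += brk + 2;
    }
    // The nearest match to the directive wins.
    preceding[start..]
        .rfind(target)
        .map(|pos| (start + pos)..(start + pos + target.len()))
}
```

A `None` result corresponds to the tcy_target_not_found case, where the directive degrades to an unknown annotation.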
Why a span, not a flow rotation?
Web renderers reach for writing-mode: horizontal-tb inside a
writing-mode: vertical-rl parent, but that has poor browser support
and breaks line-break propagation. aozora’s HTML output uses a
single class hook (<span class="aozora-tcy">) so the consuming
stylesheet can decide:
- print stylesheet → font-feature-settings: "vert"; text-combine-upright: all;
- screen stylesheet → leave horizontal, set monospace
- e-book renderer → use the renderer’s native tcy primitive
Pushing this decision into the HTML output (e.g. emitting an inline SVG with rotated glyphs) would lock consumers into a specific typographic model. The class-hook output keeps the HTML semantic and defers presentation to the consumer.
AST shape
pub struct Tcy<'src> {
pub text: &'src str,
pub form: TcyForm, // Inline | Container
pub span: Span,
}
See also
- Indent containers — tcy commonly appears inside 字下げ blocks; the parser applies tcy after the indent fence is established so the look-back search is bounded by the inner block.
Gaiji (外字 references)
Aozora Bunko predates ubiquitous Unicode support; many works still ship as Shift_JIS source. Characters that don’t fit in Shift_JIS — JIS X 0213 plane-2 ideographs, accented Latin letters, ad-hoc combining marks — appear in source as gaiji references:
※[#「魚+師のつくり」、第3水準1-94-37]
※[#「彳+寺」、U+5F85、393-13]
※[#濁点付き片仮名ヰ]
The leading ※ (U+203B, reference mark) opens the annotation; the
[#…] body describes the character in three orthogonal ways:
- A descriptive name in Japanese (「魚+師のつくり」 — “魚 plus the right-hand side of 師”) for human readers.
- A JIS X 0213 plane / row / cell triple (第3水準1-94-37 — plane 1, row 94, cell 37).
- A Unicode codepoint (U+5F85) when the character has one.
aozora resolves gaiji references through a compile-time PHF lookup table built from the JIS X 0213 official mapping plus the Unicode UCS register, with the descriptive name as a tertiary fallback.
Why a compile-time table?
The gaiji table has ~14 000 entries. Loading it at runtime from a JSON / TOML asset would:
- Add a startup cost on every Document::new (the parser is supposed to start reading bytes within microseconds).
- Force every binding (CLI, WASM, FFI, Python wheel) to ship the table as a separate asset, complicating distribution.
- Defeat dead-code elimination — the linker can’t strip entries the consumer’s input never references if they’re loaded behind an opaque file read.
A phf::Map baked into the binary at compile time wins on every
axis: zero-allocation lookup, single-binary distribution, full
DCE and LTO visibility. The build cost is real (~40 s the first
time, ~0 s incremental) but happens once per workspace build, not
per-invocation.
phf is preferred over a static HashMap (which would require runtime
construction in a OnceLock) because phf produces a true compile-time
perfect-hash table: O(1) lookup with no first-call cost and no
synchronisation on the hot path.
Resolution order
For a reference like ※[#「魚+師のつくり」、第3水準1-94-37]:
- Unicode codepoint if the source explicitly provided one (U+XXXX) — used directly.
- JIS X 0213 plane-row-cell lookup (第N水準P-R-C) — most ideographs land here.
- Descriptive name — the parser ships a curated mapping plus a single-character fallback (a description that is itself one glyph resolves to it).

A reference that matches none of these resolves to nothing: the
aozora::lex::unresolved_gaiji warning fires and the gaiji renders as its
description text.
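The order above can be sketched as a small fall-through function. Everything here is illustrative: `resolve_gaiji` is a hypothetical name, the JIS table is passed in as a caller-supplied closure instead of the real PHF map, and the curated name mapping and multi-scalar combining sequences are left out.

```rust
/// Sketch of the resolution order. `jis_lookup` stands in for the
/// compile-time table; the curated name mapping is omitted.
fn resolve_gaiji(
    description: &str,
    mencode: Option<&str>,
    jis_lookup: impl Fn(&str) -> Option<char>,
) -> Option<char> {
    if let Some(m) = mencode {
        // 1. An explicit U+XXXX reference is used directly.
        if let Some(hex) = m.strip_prefix("U+") {
            if let Some(c) = u32::from_str_radix(hex, 16).ok().and_then(char::from_u32) {
                return Some(c);
            }
        }
        // 2. Otherwise try the JIS X 0213 plane-row-cell table.
        if let Some(c) = jis_lookup(m) {
            return Some(c);
        }
    }
    // 3. Single-character fallback: a description that is itself one glyph.
    let mut chars = description.chars();
    match (chars.next(), chars.next()) {
        (Some(c), None) => Some(c),
        _ => None, // unresolved: the caller would raise unresolved_gaiji
    }
}
```

With a lookup that knows nothing, `resolve_gaiji("架空の外字", Some("第3水準99-99-99"), |_| None)` returns `None`, which is the case the unresolved_gaiji warning covers.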
AST shape
pub struct Gaiji<'src> {
/// Free-form description from the source (e.g. "魚+師のつくり").
pub description: &'src str,
/// Resolved Unicode value — a single scalar or a static combining
/// sequence — or `None` when no path matched.
pub ucs: Option<Resolved>,
/// Raw mencode reference (e.g. "第3水準1-85-54", "U+XXXX").
pub mencode: Option<&'src str>,
}
Resolved is Char(char) for the 99%+ single-scalar case or
Multi(&'static str) for the 25 JIS X 0213 plane-1 combining-sequence
cells. ucs == None is the unresolved case the
unresolved_gaiji warning flags.
Render output
| ucs | HTML |
|---|---|
| Some(_) | <span class="aozora-gaiji" data-codepoint="U+20B9B">𠮛</span> — the resolved glyph as content, the scalar(s) as space-separated U+XXXX in data-codepoint. |
| None | <span class="aozora-gaiji" data-description="魚+師のつくり">魚+師のつくり</span> — the description as both attribute and content. |
Accent decomposition
Aozora Bunko also encodes accented Latin letters (è, ñ, ä) using a
separate notation that does not go through ※[#…]:
M¡cher ← in some sources
me-zin ← in others
The full table is at https://www.aozora.gr.jp/accent_separation.html — 114 ASCII digraphs / ligatures mapping to Unicode. aozora applies this decomposition during the lexer’s Phase 0 (sanitize), so by the time classification runs the source is pure Unicode. See Architecture → Seven-phase lexer for the phase ordering.
See also
- Architecture → Shift_JIS + 外字 resolver — the encoding pipeline and the PHF table internals.
- Diagnostics → aozora::lex::unresolved_gaiji — unresolved gaiji reference.
Kunten / kaeriten (訓点・返り点)
Kunten are the marginal annotations Japanese readers add to classical Chinese (漢文) source so that it can be read in Japanese word order. aozora recognises kaeriten (返り点) — the reading-order return marks — in their bracketed form. The recognised marks are:
- single: レ, 一, 二, 三, 四, 上, 中, 下, 甲, 乙, 丙, 丁
- Xレ compounds: 一レ, 二レ, 三レ, 上レ, 中レ, 下レ
- 送り仮名: the parenthesised (…) form
(Re-reading marks — 再読文字 like 未 / 将 / 当 — and any other kunten that
do not match the above are carried as generic [#…] annotations.)
A handful of late-Edo / Meiji Aozora Bunko works carry these. In real
source the marks sit between characters as [#…] annotations:
有[#二]朋自遠方来[#一]
Notation forms
Bracketed (the recognised form)
aozora recognises the bracketed form only — the mark in a [#…]
annotation:
有[#二]朋自遠方来[#一]
Renders as:
有<sup class="aozora-kaeriten">二</sup>朋自遠方来<sup class="aozora-kaeriten">一</sup>
Inline (not recognised)
A bare reading-mark glyph written directly between characters
(有レ朋自遠方来) is left as plain text — the parser cannot tell a
genuine 返り点 from an ordinary 一 / 上 / レ in running prose, which is
exactly why the bracketed form exists. Use [#…] for any mark you want
recognised.
Okurigana
Kunten 送り仮名 (reading-aid kana) use the parenthesised form, also inside a
[#…] annotation:
有[#(リ)]
These are classified as kaeriten nodes but are not ladder marks (they take no part in the pairing check).
AST shape
The recognised marks (single 一 二 三 四 上 中 下 甲 乙 丙 丁 レ, the Xレ
compounds, and (…) okurigana) all produce one node that stores the raw
mark text:
pub struct Kaeriten<'src> {
pub mark: NonEmptyStr<'src>, // the raw mark, e.g. "二" / "一レ" / "(リ)"
}
The renderer wraps it in <sup class="aozora-kaeriten">…</sup>. The
bracketed_kaeriten_no_pair / kaeriten_outside_kanbun checks classify the
mark’s family and rank from this string at diagnostic time rather than
storing a typed enum.
Diagnostics
| Code | Condition |
|---|---|
| kaeriten_outside_kanbun | A lone kaeriten in kana prose (conservative lookahead heuristic) |
| bracketed_kaeriten_no_pair | A rank-≥2 mark whose family base (一 / 上 / 甲) is absent from the document |
See also
- Notation overview — the orientation map for all the inline annotations.
Indent & align containers (字下げ)
Aozora Bunko uses paired [#ここから…] / [#ここで…終わり]
brackets to delimit blocks of text with custom layout. The block
container families aozora recognises:
| Family | Opener | Closer | Effect |
|---|---|---|---|
| 字下げ (indent) | [#ここから2字下げ] | [#ここで字下げ終わり] | Indent every line by N full-width chars |
| 地付き / 地上げ (align-end) | [#ここから地付き] / [#ここから地から2字上げ] | [#ここで地付き終わり] | Flush right (vertical: 地 = ground = bottom) |
| 罫囲み (boxed) | [#罫囲み] | [#罫囲み終わり] | Draw a rule frame around the block |
The HTML renderer maps them to <div class="aozora-container …"> wrappers.
Two more container kinds are inline, not block: 割り注
([#割り注]…[#割り注終わり]) and the 傍点 / 傍線 range form
([#傍点]…[#傍点終わり], see bouten).
Single-line forms
The 字下げ / 地付き / 地上げ directives also have a single-line form
(no ここから prefix, no closer) that applies to the rest of the line:
[#地付き]平和への誓い
In the borrowed AST a single-line directive is a zero-width marker
node (AozoraNode::Indent / AlignEnd), not a wrapping container — it
renders as an empty span and the following text stays a sibling:
<span class="aozora-align-end aozora-align-end-0" data-offset="0"></span>平和への誓い
A page / section break sharing the line with such a marker drops it —
see break_in_single_line_container.
AST shape
A paired block container is one Container node tagging the wrapped
children (the lexer splices the enclosed siblings under it during
post-processing); single-line forms and breaks are leaf nodes:
pub struct Container {
pub kind: ContainerKind,
}
pub enum ContainerKind {
Indent { amount: u8 }, // [#ここからN字下げ]
AlignEnd { offset: u8 }, // [#ここから地付き / 地からN字上げ]
Keigakomi, // [#罫囲み]
Warichu, // [#割り注] (inline)
BoutenRange { kind: BoutenKind, position: BoutenPosition }, // [#傍点]… (inline)
}
Why a small flat enum?
ContainerKind is closed by spec. A flat enum (vs a trait object or
string tag) gives the parser O(1) variant dispatch in the classify phase
and the renderer’s HTML walk, and lets the compiler’s exhaustiveness
check enforce that every variant has a render path. The payloads are tiny
(u8 / BoutenKind / BoutenPosition), so the whole enum stays within a
few bytes — pinned by the container_kind_is_copy_and_fits_in_a_word
assertion.
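A rough sketch of what such a size pin and exhaustive dispatch might look like. The two bouten enums are abbreviated stand-ins, and `container_label` and `fits_in_a_word` are hypothetical names; only the `ContainerKind` shape is taken from the listing above.

```rust
use std::mem::size_of;

// Abbreviated stand-ins for the real enums (illustrative only).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum BoutenKind { Goma, UnderLine }
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum BoutenPosition { Right, Left }

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ContainerKind {
    Indent { amount: u8 },
    AlignEnd { offset: u8 },
    Keigakomi,
    Warichu,
    BoutenRange { kind: BoutenKind, position: BoutenPosition },
}

/// Exhaustive match: adding a variant without a label is a compile error.
fn container_label(kind: ContainerKind) -> &'static str {
    match kind {
        ContainerKind::Indent { .. } => "indent",
        ContainerKind::AlignEnd { .. } => "align-end",
        ContainerKind::Keigakomi => "keigakomi",
        ContainerKind::Warichu => "warichu",
        ContainerKind::BoutenRange { .. } => "bouten-range",
    }
}

/// Size pin in the spirit of the assertion named above.
fn fits_in_a_word() -> bool {
    size_of::<ContainerKind>() <= size_of::<usize>()
}
```

The labels returned here are placeholders, not the renderer's class names.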
Composition
Containers nest:
[#ここから2字下げ]
通常の段落。
[#ここから地付き]
右寄せの行。
[#ここで地付き終わり]
通常に戻る。
[#ここで字下げ終わり]
Renders as nested divs:
<div class="aozora-indent-2">
通常の段落。
<div class="aozora-align-end">
右寄せの行。
</div>
通常に戻る。
</div>
Mismatched closers (e.g. [#ここから地付き] … [#ここで字下げ終わり])
fire diagnostic
aozora::lex::mismatched_container_close
and the parser auto-closes the offending opener at the closer’s position.
The check compares container families, so closing a 2字下げ opener
with a plain 字下げ終わり (both indent) is fine — only a different
family (indent vs align-end vs 罫囲み vs 割り注) is flagged.
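The family comparison and auto-close recovery can be pictured with a small pairing stack. This is a simplified sketch under stated assumptions: `ContainerFamily`, `Event`, and `check_pairing` are hypothetical names, the diagnostics are reduced to their short code names, and spans and the emitted nodes are ignored.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ContainerFamily { Indent, AlignEnd, Keigakomi, Warichu }

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Event { Open(ContainerFamily), Close(ContainerFamily) }

/// Walks open / close events and returns the diagnostics that would fire,
/// in order. Amounts (2字下げ vs 3字下げ) are already erased: only the
/// family is compared.
fn check_pairing(events: &[Event]) -> Vec<&'static str> {
    let mut stack: Vec<ContainerFamily> = Vec::new();
    let mut diags = Vec::new();
    for ev in events {
        match *ev {
            Event::Open(family) => stack.push(family),
            Event::Close(family) => match stack.pop() {
                Some(open) if open == family => {}
                // Different family: the opener is still closed here
                // (auto-close), but the mismatch is reported.
                Some(_) => diags.push("mismatched_container_close"),
                // Nothing open at all.
                None => diags.push("unmatched_close"),
            },
        }
    }
    // Openers still on the stack at end of input never found a close.
    diags.extend(stack.iter().map(|_| "unclosed_bracket"));
    diags
}
```

Popping the opener even on a mismatch is what keeps one bad closer from cascading into further errors on the closers that follow.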
Why containers, not stack-based push/pop tokens?
The spec describes these as opener / closer brackets, but the natural implementation in Rust is a recursive container node. That choice:
- Lets the renderer walk the tree once with a single match on ContainerKind, instead of maintaining a render-time stack.
- Surfaces shape errors (mismatched closers, dangling openers) at parse time — the lexer’s classify phase already has all the information to decide.
- Makes the canonical-serialise pass trivial (each container prints its opener, walks its children, prints its closer).
The trade-off is one extra heap touch per container — a single
bumpalo slice for children. The arena is already hot, so the cost
is negligible (bumpalo returns aligned pointers in O(1) bumps).
See also
- Architecture → Borrowed-arena AST — how container child slices are laid out in the arena.
- Diagnostics → aozora::lex::mismatched_container_close — mismatched closer.
Page & section breaks (改ページ・改丁)
Aozora Bunko inherits print conventions for page-level structure. Four annotations split a work into pages, signatures, and openings:
| Notation | Renders as | Meaning |
|---|---|---|
| [#改ページ] | <div class="aozora-page-break"></div> | Begin a new page |
| [#改丁] | <div class="aozora-section-break aozora-section-break-kaicho"></div> | Begin a new 丁 (leaf / recto) |
| [#改段] | <div class="aozora-section-break aozora-section-break-kaidan"></div> | Section break (smaller than a page) |
| [#改見開き] | <div class="aozora-section-break aozora-section-break-kaimihiraki"></div> | Begin a new two-page spread |
All four are self-contained directives — no opener / closer pair, no inner content. They appear on their own line in the source.
AST shape
[#改ページ] is its own borrowed-AST node; the three 段 / 丁 / 見開き
breaks share one SectionBreak node tagged by SectionKind:
// borrowed::AozoraNode variants
AozoraNode::PageBreak, // [#改ページ]
AozoraNode::SectionBreak(SectionKind), // [#改丁 / 改段 / 改見開き]
pub enum SectionKind {
Choho, // 改丁
Dan, // 改段
Spread, // 改見開き
}
Why distinct variants for each break flavour?
The flavours render to identical HTML structure (an empty <div>) but
different class hooks (aozora-page-break,
aozora-section-break-{kaicho,kaidan,kaimihiraki}). Keeping PageBreak separate
and tagging the section flavours with a SectionKind enum (rather than a
string) means:
- The renderer never plumbs the original notation through to the output, preserving the AST’s role as a normalised IR.
- The compiler’s exhaustiveness check guarantees every flavour has a render path.
- Tooling can count breaks of a specific flavour at the AST level without a string match.
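To make the exhaustiveness point concrete, here is a minimal sketch of a render path keyed on the enum. `BreakSketch` and `render_break` are hypothetical stand-ins for the two AozoraNode variants and the renderer; the class strings are the ones from the table at the top of this page.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum SectionKind { Choho, Dan, Spread }

/// Stand-in for the PageBreak / SectionBreak node variants.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum BreakSketch { Page, Section(SectionKind) }

/// Every flavour must pick a class here, or the match fails to compile.
fn render_break(b: BreakSketch) -> String {
    let class = match b {
        BreakSketch::Page => "aozora-page-break",
        BreakSketch::Section(SectionKind::Choho) => {
            "aozora-section-break aozora-section-break-kaicho"
        }
        BreakSketch::Section(SectionKind::Dan) => {
            "aozora-section-break aozora-section-break-kaidan"
        }
        BreakSketch::Section(SectionKind::Spread) => {
            "aozora-section-break aozora-section-break-kaimihiraki"
        }
    };
    format!("<div class=\"{class}\"></div>")
}
```

Counting breaks of one flavour is then a plain enum comparison, for example filtering on `BreakSketch::Section(SectionKind::Dan)`.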
Composition with other annotations
Breaks unconditionally close any open inline annotation (ruby, bouten, tcy) at their line. They do not close container directives (字下げ, 地付き, etc.) — those persist across page boundaries, which matches print typography.
[#ここから2字下げ]
第一節
[#改ページ]
第二節 (still 2字下げ)
[#ここで字下げ終わり]
Diagnostics
| Code | Condition |
|---|---|
| break_in_single_line_container | A page / section break sharing a line with a single-line container (or inside a warichu range), which drops it |
See also
- Indent containers — containers persist across breaks.
Diagnostics catalogue
aozora is non-fatal by design: the parser always produces a tree, even from malformed input, and reports what it noticed through structured diagnostics that callers choose how to treat. This page is the catalogue.
Each Diagnostic carries:
- a stable code — a dotted string such as aozora::lex::unclosed_bracket. The string is pinned by a test and never changes within a major release; new diagnostics add new codes.
- a severity: Error / Warning / Note.
- a source axis: Source (your input tripped it) or Internal (a library-bug sanity check — see Internal).
- a span — a byte range in the sanitized source (the Phase 0 output: BOM stripped, CRLF→LF, 〔…〕 accents decomposed). For input with none of those, the sanitized bytes equal the original bytes.
Rendering them
The aozora check CLI renders diagnostics three ways, chosen with
--diagnostic-format:
- human (the default on a terminal) — a graphical miette report: the source line, a caret under the offending span, the label, the help text, and a link back to this page.
- json (the default when stderr is piped) — the aozora::wire diagnostics envelope, byte-identical to what the WASM / FFI / Python / Extism front doors emit. This is the machine / agent path.
- short — one grep-able line per diagnostic: path:offset: severity[code]: message.
Exit codes: 0 (diagnostics printed but tolerated), 1 (--strict with
at least one diagnostic), 2 (CLI usage error), 3 (an Internal
diagnostic fired — a library bug). See the CLI reference.
Library consumers get tree.diagnostics() -> &[Diagnostic] and reach the
parts through code(), severity(), source(), and span(). All
bindings carry the same structured data.
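For orientation, the sketch below models the short format and the exit-code rules on a local stand-in type. It does not use the aozora crate: `DiagSketch`, `Origin`, `short_line`, and `exit_code` are hypothetical, the lowercase severity spelling is an assumption, and so is the choice to let an Internal diagnostic take precedence over --strict. Exit code 2 (usage error) arises before parsing and is not modelled.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Severity { Error, Warning, Note }

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Origin { Source, Internal }

/// Local stand-in for a diagnostic; not the crate's type.
struct DiagSketch {
    code: &'static str,
    severity: Severity,
    origin: Origin,
    offset: usize,
    message: &'static str,
}

/// One grep-able line: path:offset: severity[code]: message
fn short_line(path: &str, d: &DiagSketch) -> String {
    let sev = match d.severity {
        Severity::Error => "error",     // assumed spelling
        Severity::Warning => "warning", // assumed spelling
        Severity::Note => "note",       // assumed spelling
    };
    format!("{}:{}: {}[{}]: {}", path, d.offset, sev, d.code, d.message)
}

/// 3 if any Internal diagnostic fired, 1 under --strict with at least one
/// diagnostic, otherwise 0. The precedence of 3 over 1 is an assumption.
fn exit_code(diags: &[DiagSketch], strict: bool) -> i32 {
    if diags.iter().any(|d| d.origin == Origin::Internal) {
        3
    } else if strict && !diags.is_empty() {
        1
    } else {
        0
    }
}
```

A wrapper script that only needs pass / fail can rely on the exit code alone and leave the short lines for log output.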
Source diagnostics
These trace back to your input. The parser emits exactly these — the authoring-error catalogue is complete (no diagnostic is specified-but-unimplemented).
Source contains PUA
aozora::lex::source_contains_pua · Warning
…… (a literal U+E001..=U+E004 codepoint in the source)
The source contains a codepoint in U+E001..=U+E004, which the lexer
reserves as inline / block placeholder sentinels. A source-side
occurrence collides with the lexer’s own markers and would confuse the
placeholder registry. Fix: remove the private-use codepoint from the
source (these are not normal text characters and effectively never occur
in real 青空文庫 files).
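A pre-flight check for this condition is easy to write outside the parser. The sketch below is a standalone helper (the name `find_reserved_pua` is hypothetical) that reports the byte offset of every codepoint in the reserved range.

```rust
/// Byte offsets of every codepoint in the reserved U+E001..=U+E004 range.
fn find_reserved_pua(src: &str) -> Vec<usize> {
    src.char_indices()
        .filter(|(_, c)| ('\u{E001}'..='\u{E004}').contains(c))
        .map(|(offset, _)| offset)
        .collect()
}
```

An empty result means the source cannot trip this diagnostic; other private-use codepoints such as U+E000 are outside the reserved range and are not reported.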
Unclosed bracket
aozora::lex::unclosed_bracket · Error
[#ここから2字下げ (no matching [#ここで字下げ終わり])
An Aozora open delimiter (ruby |, annotation [#, quote, …) reached
end-of-input with no matching close on the pairing stack. The label
points at the opener. The region degrades to plain text — no pair link
is emitted. Fix: add the missing close delimiter, or remove the
dangling opener.
Unmatched close
aozora::lex::unmatched_close · Error
青空]》 (a close with no matching open on the stack)
A close delimiter was seen with an empty pairing stack, or against a
stack top of a different PairKind. The label points at the stray close.
Fix: add the matching open delimiter, or remove the stray close.
Accent decomposition applied
aozora::lex::accent_decomposition_applied · Note
〔cafe'〕 (decomposed to 〔café〕)
A 〔…〕 accent digraph was rewritten to its Unicode-combined form during
Phase 0 sanitize (cafe' → café, fune + backtick → funè, …). This is
intended behaviour, not an error — it is surfaced as a Note so an
editor can show what changed. One note fires per 〔…〕 span that actually
contained a digraph; a 〔…〕 with no accent digraph is silent. The span is
in sanitized (post-decomposition) coordinates. The transform is loss-free:
the serializer reconstructs the original 〔…〕 source form. See
ADR-0003.
No action required.
Unresolved gaiji
aozora::lex::unresolved_gaiji · Warning
※[#「架空の外字」、第3水準99-99-99] (men-ku-ten out of range)
A 外字 (gaiji) reference — ※[#…] — resolved to neither a Unicode
scalar nor a JIS X 0213 cell: no 第N水準P-R-C men-ku-ten or U+XXXX
reference matched, and the description is not itself a single resolvable
character. The construct still parses; the renderer falls back to the
description text (<span class="aozora-gaiji" data-description="…">…</span>)
rather than the intended glyph. The label points at the ※[#…] reference.
Fix: correct the men-ku-ten / U+XXXX reference, or accept the
description-only rendering. (Fires for top-level references; gaiji nested
inside a ruby / bouten reading is not yet flagged.)
Mismatched container close
aozora::lex::mismatched_container_close · Error
[#ここから2字下げ]…[#ここで地付き終わり] (indent opened, align-end closed)
A paired container opened with one family (indent / warichu /
keigakomi / align-end) was closed by a closer of a different family.
The comparison is by family, so closing a 2字下げ opener with a plain
字下げ終わり (both indent, differing only in amount) is not flagged —
only a genuine family mismatch is. The label points at the close marker.
The parser recovers by auto-closing the opener at the closer’s position
(the container pair is still emitted, keyed by the open family). Fix:
match the closer to the opener — ここから字下げ ↔ ここで字下げ終わり,
ここから地付き ↔ ここで地付き終わり, etc.
Empty ruby reading
aozora::lex::empty_ruby_reading · Error
|青梅《》 (base given, reading empty)
An explicit-base ruby supplied a base (a | precedes the 《) but an
empty 《》 reading. Because the | marks the base unambiguously, this is
a genuine authoring slip rather than a literal 《》 run — so a bare
青梅《》 with no | is not flagged (the parser can’t be sure a base
was intended and treats it as text). The construct degrades to plain text.
The label spans the whole |青梅《》. Fix: supply a reading, or drop
the |…《》 markers to keep the base as plain text.
Nested ruby
aozora::lex::nested_ruby · Error
|漢《か《ん》じ》 (the reading body opens another 《…》)
A ruby reading body itself opened another ruby. Ruby does not nest; the
label points at the inner 《. The outer ruby is still parsed
best-effort. Note that an adjacent 《《…》》 is not nested ruby — the
tokenizer reads 《《 / 》》 as double-bracket bouten, a
separate construct — so this fires only when the inner 《…》 closes
before the outer (text between the two closes, as in the catalogue shape
|…《…《…》…》). Fix: close the outer reading before the inner 《, or
remove the inner 《…》.
Unrecognised container directive
aozora::lex::unrecognised_container_directive · Warning
[#ここからナントカ] (no such container kind)
A [#ここから…] directive looked like a paired-container opener but
named no known container kind (字下げ, 地付き, 地から N 字上げ). The
bracket is kept as a plain Annotation{Unknown} (so output is preserved
and the “no bare [#” guarantee holds) but is not treated as a
container — any matching [#ここで…終わり] will not pair with it. The
label spans the directive. Fix: use a recognised opener, e.g.
[#ここから2字下げ] or [#ここから地付き].
TCY target not found
aozora::lex::tcy_target_not_found · Warning
あ[#「い」は縦中横] (no 「い」 earlier in the line)
A 縦中横 forward reference ([#「X」は縦中横]) named a target that does
not appear anywhere in the preceding text, so it has no run to rotate. The
directive degrades to an Annotation{Unknown}. The label spans the
directive. Fix: check the spelling of the quoted target, or place the
[#「X」は縦中横] after the run it should style.
Bouten target ambiguous
aozora::lex::bouten_target_ambiguous · Warning
青空青空[#「青空」に傍点] (「青空」 occurs twice before the directive)
A forward-reference bouten ([#「X」に傍点]) named a target that occurs
more than once in the preceding look-back window, so which run it
emphasises is ambiguous. The parser still applies it (to the match its
look-back rule selects) but the chosen run may not be the intended one.
The label spans the directive. Fix: reword so the quoted target is
unique before the directive. (Multi-target brackets like [#「A」「B」に傍点]
name distinct runs and are never flagged.)
Mismatched bouten container
aozora::lex::mismatched_bouten_container · Error
彼は[#傍点]必ず[#傍線終わり]来る (傍点 opened, 傍線 closed)
A 傍点 / 傍線 range form ([#傍点] … [#傍点終わり]) was opened with one
family — 点 (dots) or 線 (line) — and closed by the other, e.g. a [#傍点]
opener closed by [#傍線終わり]. The two families render differently (dots
beside the text vs a line alongside it), so the run’s emphasis is
ambiguous. The parser recovers by keying the run to the opener’s variant.
A same-family variant difference (白丸傍点 closed by 丸傍点終わり) is
tolerated. The label points at the close marker. Fix: match the closer’s
family to the opener — [#傍点終わり] for any 点 variant, [#傍線終わり]
for any 線 variant.
Bracketed kaeriten no pair
aozora::lex::bracketed_kaeriten_no_pair · Error
怪物[#二] ([#二] with no [#一] anywhere in the document)
A bracketed kaeriten of rank ≥ 2 ([#二] / [#下] / [#乙]) appears in a
document whose matching family base — [#一] / [#上] / [#甲] — is
absent entirely, so the return mark has nothing to pair back to. The check
is document-wide and base-only by design: real 漢文 return-mark groups span
、 / 。 and line boundaries (and write 二 before 一), and 上下点 may
use just 上 … 下 (skipping 中), so any narrower scope would wrongly
flag valid kanbun. レ (re-ten) is standalone and never flagged; 送り仮名
([#(ス)]) is not a ladder mark. Fix: add the missing base mark, or
check the mark is a genuine 返り点.
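The document-wide, base-only rule can be sketched in a few lines. This is an illustration only, not the aozora implementation: the FAMILIES table, bracketed_marks and marks_without_base are hypothetical names, the family tables are an illustrative subset, and the scan assumes the full-width bracket forms used in Aozora Bunko sources.

```rust
/// Illustrative ladder families (assumption); index 0 of each is the base mark.
const FAMILIES: [&[char]; 3] = [
    &['一', '二', '三', '四'],
    &['上', '中', '下'],
    &['甲', '乙', '丙', '丁'],
];

/// Full-width annotation opener and closer, written as escapes to avoid ambiguity.
const OPEN: &str = "\u{FF3B}\u{FF03}";
const CLOSE: char = '\u{FF3D}';

/// Collect every single-character bracketed mark found anywhere in `doc`.
fn bracketed_marks(doc: &str) -> Vec<char> {
    let mut marks = Vec::new();
    let mut rest = doc;
    while let Some(start) = rest.find(OPEN) {
        rest = &rest[start + OPEN.len()..];
        let mut chars = rest.chars();
        if let (Some(mark), Some(CLOSE)) = (chars.next(), chars.next()) {
            marks.push(mark);
        }
    }
    marks
}

/// Return each mark of rank >= 2 whose family base never occurs in the document.
fn marks_without_base(doc: &str) -> Vec<char> {
    let marks = bracketed_marks(doc);
    marks
        .iter()
        .copied()
        .filter(|m| {
            FAMILIES
                .iter()
                .any(|fam| fam[1..].contains(m) && !marks.contains(&fam[0]))
        })
        .collect()
}
```

Because the base is looked up across the whole mark list rather than only to the left, a 二 written before its 一 is accepted, which matches the scope described above.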
Kaeriten outside kanbun
aozora::lex::kaeriten_outside_kanbun · Warning
これは[#レ]と書いた。 (a lone kaeriten in kana prose)
A kaeriten ([#二] / [#レ] / …) is the only one in the entire document
and its surroundings read as ordinary kana prose, so it is most likely a
stray [#…] annotation rather than a genuine 返り点. The lookahead
heuristic is deliberately conservative — a document carrying a cluster of
kaeriten (real 漢文) is never flagged. The label points at the lone mark.
(Only the bracketed [#…] form is recognised; a bare reading-mark glyph in
running text is left as plain text.) Fix: confirm the mark is intended;
remove it if it is not a reading mark.
Break in single line container
aozora::lex::break_in_single_line_container · Warning
[#地付き]本文[#改ページ] (single-line directive shares its line with a break)
A single-line layout directive ([#地付き], [#N字下げ]) or a warichu
range ([#割り注] … [#割り注終わり]) governs only the rest of its line. A
page / section break sharing that line — or, for warichu, falling between
the open and close — drops the container: the break starts a new block, so
the directive’s run is cut short. Paired block forms ([#ここから…] … [#ここで…終わり]) persist across breaks and are not flagged (print
typography keeps the layout across pages). The label points at the break.
Fix: move the break off the line, or use the paired block form.
Internal
aozora::internal · Error · source = Internal
Pipeline-internal sanity checks. A correct build never emits these —
their appearance means a bug in aozora itself, not a problem with your
input. The specific check is identified by an InternalCheckCode:
| Check code | Fires when |
|---|---|
| aozora::lex::residual_annotation_marker | an [# digraph survived classification into the normalized text (a missing recogniser) |
| aozora::lex::unregistered_sentinel | a PUA sentinel sits at a normalized position not recorded in the placeholder registry |
| aozora::lex::registry_out_of_order | a placeholder-registry vector is not strictly ordered by position |
| aozora::lex::registry_position_mismatch | a registry entry references a position whose character is not the expected sentinel |
aozora check exits 3 when one fires. Please
report it with the source that
triggered it.
Planned diagnostics
None outstanding. Every authoring-error diagnostic in the catalogue —
including the four model-dependent ones (mismatched_bouten_container,
bracketed_kaeriten_no_pair, kaeriten_outside_kanbun,
break_in_single_line_container) — is now emitted; see the
Source diagnostics above. New 記法 work adds new
codes here as it lands.
Why a stable string code, not just a message?
- Test stability. The corpus sweep and conformance gate count diagnostics by code; a test like “this corpus emits at most N unresolved_gaiji warnings” survives message-wording tweaks and localisation. A test that greps the message string does not.
- Tool integration. Editors / LSPs / CI lints filter by code (e.g. “treat every Error-severity code as fatal, ignore unrecognised_container_directive for legacy files”). String matching on prose is fragile.
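A minimal sketch of both uses, assuming a hypothetical Diag record (the real aozora diagnostic type may look different). Counting and filtering read only the code and severity, so the message text can change freely.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Severity {
    Warning,
    Error,
}

/// Hypothetical diagnostic record, for illustration only.
struct Diag {
    code: &'static str,
    severity: Severity,
    /// Human-readable text; deliberately never inspected below.
    message: String,
}

/// Count diagnostics per stable code, the way a corpus gate might.
fn count_by_code(diags: &[Diag]) -> BTreeMap<&'static str, usize> {
    let mut counts = BTreeMap::new();
    for d in diags {
        *counts.entry(d.code).or_insert(0) += 1;
    }
    counts
}

/// CI policy: Error severity is fatal unless the code is on the ignore list.
fn is_fatal(d: &Diag, ignored: &[&str]) -> bool {
    d.severity == Severity::Error && !ignored.contains(&d.code)
}
```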
See also
- Architecture → Error recovery — what the parser does after each diagnostic fires (preserved output, dropped tokens, where the bytes go).
- CLI reference — aozora check --diagnostic-format and the exit-code contract.
- Library Quickstart → Diagnostics
- Bindings → Diagnostics as JSON
Pipeline overview
aozora is a pure-functional parser: given the same input, the same
arena, and the same compile-time configuration, the output is
bit-for-bit identical. There are no thread-locals, no OnceCell
caches in the parse path, no environmental side effects. The only
state the parser owns is the arena and a string interner, both reset
per Document.
Three layers
flowchart TD
src["source text<br/>(UTF-8 or Shift_JIS)"]
decode["Shift_JIS decode<br/>(aozora-encoding)"]
lex["Lex<br/>(aozora-pipeline::lex_into_arena)<br/>sanitize → events → pair → classify"]
tree["AozoraTree<'arena><br/>(borrowed AST)"]
render["Render<br/>(aozora-render)<br/>html / serialize"]
out["HTML / canonical 青空文庫 source"]
src --> decode --> lex --> tree --> render --> out
Each arrow is a pure function. The arena is threaded through lex;
nothing else holds state.
Crate dependency graph
flowchart TD
spec["aozora-spec<br/>shared types"]
encoding["aozora-encoding<br/>SJIS + 外字 PHF"]
scan["aozora-scan<br/>SIMD multi-pattern"]
veb["aozora-veb<br/>Eytzinger sorted-set"]
syntax["aozora-syntax<br/>AST node types"]
pipeline["aozora-pipeline<br/>4-phase lexer +<br/>lex_into_arena"]
render["aozora-render<br/>html / serialize"]
facade["aozora<br/>public facade"]
cli["aozora-cli"]
ffi["aozora-ffi"]
wasm["aozora-wasm"]
py["aozora-py"]
spec --> encoding
spec --> scan
spec --> veb
spec --> syntax
encoding --> syntax
scan --> pipeline
veb --> pipeline
syntax --> pipeline
pipeline --> render
render --> facade
facade --> cli
facade --> ffi
facade --> wasm
facade --> py
aozora-spec is the foundation — every other crate depends on it.
The dependency graph forms a strict DAG; circular deps are forbidden
by cargo deny’s bans config and by the cargo metadata check
in just lint.
What each layer does
Sanitize → Events → Pair → Classify
The lexer pipeline is split into four phases because each stage has a different cost / cache profile:
| Phase | Input | Output | Why separate |
|---|---|---|---|
| Sanitize | raw &str | normalised &str + Phase-0 diagnostics | BOM / CRLF / accent decomposition / decorative-rule isolation / PUA collision pre-scan all happen here, once, before any expensive lookahead. Keeps later phases linear-time. |
| Events | sanitised &str | Iterator<Token> | SIMD trigger scan (aozora-scan) fires here; the linear tokenise that follows fuses with the scan so no per-event vector is allocated. |
| Pair | Iterator<Token> | Iterator<PairEvent> | Balanced-stack bracket matching across all opener / closer pairs (|》《, [], 〔〕, 「」, 《《》》). Recovery diagnostics for unclosed / unmatched fire here. |
| Classify | Iterator<PairEvent> | Iterator<ClassifiedSpan> (→ AozoraNode<'arena>) | Decides “is this [#…] an indent opener, a bouten directive, a tcy directive, …” via the slug-canonicalised dispatch table. |
Splitting them lets the parser ship two surface APIs without code duplication:
- lex_into_arena — fused, allocates one borrowed-AST tree.
- Per-phase calls (sanitize, tokenize, pair, classify) — used by the bench harness’s per-phase probes and the integration tests in crates/aozora-pipeline/tests/.
Sanitize details
Phase 0 sanitize covers:
- BOM strip — UTF-8 BOM detection at the head.
- CRLF normalisation — CRLF → LF in one memchr2 pass.
- Decorative rule isolation — separates long horizontal-rule patterns from neighbouring text so Phase 1’s trigger scan does not split them mid-glyph.
- Accent decomposition — ASCII digraphs / ligatures → Unicode (see Gaiji).
- PUA collision pre-scan — emits Diagnostic::SourceContainsPua for stray U+E001..U+E004 codepoints in the source so they can never be confused with the lexer’s own sentinel insertions later.
Events: SIMD scan
Trigger byte detection runs the SIMD multi-pattern scanner from
aozora-scan. Multiple backends share a common
trait; selection happens once via runtime CPU detection and is
cached for the process lifetime. See
Architecture → SIMD scanner backends for the dispatch
order and what each backend looks like in samply.
Pair → Classify
Bracket matching is a single linear-time stack walk over the trigger
event stream. Classify then does the actual recognition: each
opener type maps via the SLUGS dispatch table to a recogniser,
and the recogniser produces the borrowed AozoraNode<'arena> that
lex_into_arena then registers and substitutes a PUA sentinel for.
The slug canonicalisation makes prefix collisions
(ここから2字下げ vs ここから2字下げ、地寄せ) deterministic without
relying on declaration order. Look-back targets (bouten / tcy)
resolve in the same walk against the sanitised text.
Render
Two render walkers:
- html::render_to_string — a single O(n) tree walk emitting semantic HTML5 with aozora-* class hooks.
- serialize::serialize — re-emits canonical 青空文庫 source.
Both are pure functions; both allocate exactly the output buffer and nothing else.
What the pipeline does not do
No tree mutation between layers. No optimisation passes. No
“resolver” stage that mutates the AST. The lexer produces the
final tree; the renderer consumes it; that’s it. This is the same
shape as a functional reactive pipeline, and it’s what lets the
borrowed-arena AST (next chapter) work without RefCell or
UnsafeCell.
See also
- Borrowed-arena AST — what AozoraTree<'arena> actually points at.
- Four-phase lexer — the inside of the Lex box.
- Crate map — every crate, its purpose, what depends on what.
Borrowed-arena AST
AozoraTree<'a> is not an owned tree. It’s a borrow into two
things owned by Document:
- the source Box<str>,
- a bumpalo::Bump arena that holds every intermediate node and child slice.
flowchart LR
subgraph Document
src["Box<str> source"]
bump["bumpalo::Bump arena"]
end
tree["AozoraTree<'a>"]
walk["render / serialize / iterate"]
src -.borrows.-> tree
bump -.borrows.-> tree
tree --> walk
When the Document drops, the source Box<str> and the arena’s
single backing buffer drop in two free() calls — every node, every
container, every interned string releases together. There is no
per-node destructor and no walk-the-tree-to-free pass.
Why an arena and not Box<Node> everywhere?
The naive Rust shape — enum Node { Ruby { target: String, … }, … }
— would allocate per node, per String, per Vec<Node> for
container children. For a typical Aozora Bunko work (~500 KiB
source, ~50 000 nodes) that’s:
- ~50 000 individual heap allocations,
- ~50 000 individual frees on drop (each one a separate call into the heap allocator’s free-list bookkeeping),
- 16+ bytes of allocator metadata per allocation,
- random-access fragmentation that defeats prefetch.
The arena variant produces:
- ~16 bump allocations (4 KiB pages, refilled on overflow),
- 1 free on drop (Bump::reset rewinds the bump pointer and hands surplus chunks back to the global allocator, which typically recycles them for later allocations).
- Sequential layout: nodes that were lexed near each other live near each other in memory, which is exactly the order the renderer walks them.
Measured on the corpus sweep: the arena variant
parses 6.4× faster than the equivalent Box<Node> shape, and the
peak RSS is 30% lower. The win is cumulative — every binding
(CLI / WASM / FFI / Python) inherits it.
Why bumpalo over typed-arena, slotmap, or hand-rolled?
| Crate | Shape | Why aozora doesn’t use it |
|---|---|---|
| typed-arena | One arena per type (Arena<Ruby>, Arena<Bouten>, …) | aozora has 30+ node types; managing 30 arenas is operationally awkward and forces lifetime-bound &'a per type. |
| slotmap | Index-keyed nodes; arena owns; access via SlotMap::get | Adds an indirection (key → slot → node) on every walk, regressing render throughput by ~25% on the bench harness. Also forces Copy keys, which for variable-length text fields means re-interning. |
| id-arena / index_vec | Index-typed, &str borrowing | Same indirection cost as slotmap. |
| Hand-rolled bump | Custom; tightest control | Correct, but bumpalo is already a stable, mainstream, allocator-aware bump arena with bumpalo::collections::Vec for child slices. Reinventing wins nothing. |
| bumpalo | Single arena, type-erased; allocate any T with bump.alloc(T) | One arena per Document; allocate-then-borrow gives &'a T for the lifetime of the arena. Matches aozora’s “one arena per Document” need exactly. |
bumpalo’s collections::Vec<'bump, T> (used for container child
slices) is Vec-shaped but allocated inside the arena — child
slices get the same arena lifetime as the parent without a separate
allocation strategy.
How the AST shape interacts with the lifetime
pub enum AozoraNode<'src> {
Plain(&'src str),
Ruby(Ruby<'src>),
Bouten(Bouten<'src>),
Tcy(Tcy<'src>),
Gaiji(Gaiji<'src>),
Container(&'src Container<'src>), // boxed in the arena
BreakNode(BreakNode),
// … 30+ variants
}
The 'src lifetime is the arena lifetime (re-using 'src because
all node text borrows from the source buffer, which lives at least
as long as the arena). Each variant either:
- holds a &str slice into the source (zero copy), or
- is a small Copy struct (BreakNode, Saidoku, …), or
- is &'src Container<'src> — boxed in the arena because Container itself contains a &'src [AozoraNode<'src>] child slice.
The whole AozoraNode is Copy (it’s a tagged union of references
and small primitives), so iterating the tree never needs & — just
deref the reference, copy the node, walk on.
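The shape can be tried out with a cut-down stand-in. This is a sketch under stated assumptions: Node and plain_len are hypothetical, only three variants are modelled, and an ordinary slice plays the role of the arena-allocated child slice.

```rust
/// Cut-down stand-in for the borrowed node type (illustration only).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Node<'src> {
    /// Zero-copy slice into the source text.
    Plain(&'src str),
    /// Both fields are slices into the source as well.
    Ruby { base: &'src str, reading: &'src str },
    /// Children live in a slice owned elsewhere (the arena, in aozora).
    Container(&'src [Node<'src>]),
}

/// Sum the byte length of the visible base text, copying nodes as we go.
fn plain_len(node: Node<'_>) -> usize {
    match node {
        Node::Plain(text) => text.len(),
        Node::Ruby { base, .. } => base.len(),
        Node::Container(children) => children.iter().map(|&child| plain_len(child)).sum(),
    }
}
```

Every field is a reference or a small value, so deriving Copy succeeds and passing a node by value costs a few machine words rather than a deep clone.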
What you trade
The big trade-off: the tree can’t outlive the Document. A
Vec<AozoraNode<'_>> cannot be stored or returned beyond the Document’s
scope, because the '_ lifetime is bound to the arena, which is bound
to the Document.
In practice this rarely matters — consumers either:
- Render the tree immediately and discard (tree.to_html() returns String, which has no lifetime tie).
- Walk the tree once and emit their own owned IR (most editor backends do this).
- Hold the Document itself across function boundaries and re-derive the tree on the inside.
For consumers that genuinely need an owned tree, the visitor trait
on AozoraTree makes the conversion trivial — walk the tree once
and emit your own owned IR. We resist shipping a built-in
aozora::owned because doing so would push consumers toward it
even when an immediate to_html() or per-walk transcription would
serve them better.
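A walk-once transcription might look like the sketch below. It does not use the real visitor trait, whose signature is not shown here; Node, OwnedNode and to_owned_ir are hypothetical stand-ins with two variants each.

```rust
/// Borrowed stand-in: slices point into the source text.
#[derive(Clone, Copy)]
enum Node<'src> {
    Plain(&'src str),
    Ruby { base: &'src str, reading: &'src str },
}

/// Owned counterpart that carries its own Strings and no lifetime.
#[derive(Debug, Clone, PartialEq)]
enum OwnedNode {
    Plain(String),
    Ruby { base: String, reading: String },
}

/// One pass over the borrowed nodes, copying text out as we go.
fn to_owned_ir(nodes: &[Node<'_>]) -> Vec<OwnedNode> {
    nodes
        .iter()
        .map(|node| match *node {
            Node::Plain(text) => OwnedNode::Plain(text.to_owned()),
            Node::Ruby { base, reading } => OwnedNode::Ruby {
                base: base.to_owned(),
                reading: reading.to_owned(),
            },
        })
        .collect()
}
```

The returned Vec has no lifetime parameter, so it can be stored or sent anywhere after the source buffer is gone; the price is one String allocation per text field.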
Lifetime safety
The 'src parameter prevents these shapes at compile time:
fn bad() -> AozoraTree<'static> {
let doc = aozora::Document::new("…".into());
doc.parse() // ERROR: cannot return value referencing local
}
Borrow-checker enforcement; no runtime Drop ordering bugs possible.
See also
- Pipeline overview — where the arena is created.
- Crate map — aozora-syntax defines the node types; aozora-pipeline does the allocation via lex_into_arena.
Four-phase lexer
aozora-pipeline runs the lexer as four pure-functional phases,
each fn(input) -> output with no shared mutable state. The split
keeps the dominant hot path (Phase 1 events / Phase 3 classify)
tight, lets the bench harness measure each phase independently, and
maps every diagnostic to a single phase boundary.
The single public entry lex_into_arena drives all four phases
and lands the resulting borrowed AST inside an
aozora_syntax::borrowed::Arena provided by the caller. The legacy
“phase 4 normalize / phase 5 registry / phase 6 validate” steps
disappeared into a fused walk inside lex_into_arena; they no
longer have standalone phase functions.
Phase ordering
flowchart LR
p0["Phase 0<br/>sanitize"]
p1["Phase 1<br/>events"]
p2["Phase 2<br/>pair"]
p3["Phase 3<br/>classify"]
fused["lex_into_arena<br/>(fused walk:<br/>normalize + registry + validate)"]
p0 --> p1 --> p2 --> p3 --> fused
Each arrow carries a small data structure (sanitised text, trigger events, pair events, classified spans); no phase reads back into a previous phase’s output.
| Phase | Input | Output | Responsibility |
|---|---|---|---|
| 0 — Sanitize | raw &str | SanitizeOutput { sanitized: &str, .. } | BOM strip, CRLF → LF, accent decomposition, decorative-rule isolation, PUA collision pre-scan |
| 1 — Events | sanitised &str | Iterator<Item = Token> | SIMD trigger scan (aozora-scan) followed by linear tokenise into Plain / trigger events |
| 2 — Pair | Iterator<Token> | Iterator<Item = PairEvent> | Balanced-stack pairing for all opener/closer pairs (|》《, [], 〔〕, 「」, 《《》》) |
| 3 — Classify | Iterator<PairEvent> | Iterator<Item = ClassifiedSpan> | Full-spec Aozora classification into AozoraNode variants (ruby, bouten, gaiji, tcy, kaeriten, sashie, annotation, …) |
The orchestrator lex_into_arena consumes the Phase 3 stream,
substitutes PUA sentinels into the normalised text, builds the
side-table registry that maps sentinel positions back to
classified AozoraNode values, and accumulates diagnostics — all
in a single fused walk over the classified-span stream.
Phase 0: sanitize
The most varied phase by what it touches. Sub-passes (in order):
- bom_strip — UTF-8 BOM detection and removal at the head.
- normalize_line_endings — CRLF → LF in one memchr2 pass.
- rewrite_accent_spans — ASCII digraph / ligature decomposition for accent gaiji.
- isolate_decorative_rules — long horizontal-rule lines (────────── patterns) get separated from neighbouring text so Phase 1’s trigger scan does not split them mid-glyph.
- scan_for_sentinel_collisions — pre-scan for stray PUA codepoints (U+E001..U+E004); any hit emits Diagnostic::SourceContainsPua and the colliding bytes flow through verbatim (the registry has no entry for them, so they degrade to plain text).
Each sub-pass is independent and runs over the same buffer. The
output SanitizeOutput carries the rewritten text alongside any
diagnostics emitted along the way.
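Three of the sub-passes are simple enough to sketch with the standard library alone. This is a simplified illustration, not the real sanitize: the Sanitized struct is hypothetical, the accent and decorative-rule passes are left out, and a plain String replace stands in for the memchr2 pass.

```rust
/// Hypothetical result type: rewritten text plus byte offsets of stray
/// codepoints in the sentinel range U+E001..=U+E004.
struct Sanitized {
    text: String,
    pua_hits: Vec<usize>,
}

fn sanitize(raw: &str) -> Sanitized {
    // 1. Strip a leading UTF-8 BOM (U+FEFF) if present.
    let body = raw.strip_prefix('\u{FEFF}').unwrap_or(raw);
    // 2. Normalise CRLF line endings to LF.
    let text = body.replace("\r\n", "\n");
    // 3. Pre-scan for codepoints that would collide with lexer sentinels.
    //    Hits are only reported; the text itself is left untouched.
    let pua_hits = text
        .char_indices()
        .filter(|&(_, c)| ('\u{E001}'..='\u{E004}').contains(&c))
        .map(|(offset, _)| offset)
        .collect();
    Sanitized { text, pua_hits }
}
```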
Phase 1: events
The hot path. SIMD multi-pattern scan from
aozora-scan finds every trigger byte position; a
single linear walk converts those positions into Token events:
pub enum Token<'src> {
Plain(&'src str),
Trigger(TriggerKind, Span),
}
The trigger scan and the tokenise loop fuse so the output stream allocates no per-event vector — downstream phases consume the iterator directly. See SIMD scanner backends for the runtime backend selection.
Throughput on a typical mid-size work (crime_and_punishment.txt,
~600 KiB UTF-8): on the order of GB/s for the SIMD backends, which
is well above the rest of the pipeline’s throughput; Phase 1 is
essentially free at the corpus level. Concrete numbers are pinned
by cargo bench -p aozora-bench --bench crime_and_punishment and
the synthetic corpus bench.
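The tokenise walk can be illustrated without any SIMD. The sketch below is a scalar stand-in under assumptions: the Trigger variant carries the character and a byte offset instead of the real TriggerKind and Span, and the trigger list uses the full-width forms of the bar, hash and square brackets found in Aozora Bunko sources.

```rust
/// The 11 trigger characters; full-width ASCII look-alikes are written as escapes.
const TRIGGERS: [char; 11] = [
    '\u{FF5C}', '《', '》', '\u{FF03}', '※', '\u{FF3B}', '\u{FF3D}', '〔', '〕', '「', '」',
];

#[derive(Debug, PartialEq)]
enum Token<'src> {
    /// Run of text between triggers, borrowed from the source.
    Plain(&'src str),
    /// Trigger character and its byte offset (simplified payload).
    Trigger(char, usize),
}

/// Lazily yield tokens; no intermediate vector of events is built.
fn tokenize<'src>(src: &'src str) -> impl Iterator<Item = Token<'src>> + 'src {
    let mut pos = 0;
    std::iter::from_fn(move || {
        let rest = &src[pos..];
        let first = rest.chars().next()?;
        if TRIGGERS.contains(&first) {
            let at = pos;
            pos += first.len_utf8();
            return Some(Token::Trigger(first, at));
        }
        // Extend the plain run up to the next trigger or the end of input.
        let end = rest
            .find(|c: char| TRIGGERS.contains(&c))
            .unwrap_or(rest.len());
        pos += end;
        Some(Token::Plain(&rest[..end]))
    })
}
```

In the real lexer the position search is what the SIMD scan replaces; the iterator shape around it is the part this sketch is meant to show.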
Phase 2: pair
Balanced-stack bracket matching. Walk the trigger event stream,
push openers onto a SmallVec<[(PairKind, Span); 8]> (inline
capacity 8 covers 99th-percentile bracket nesting in real corpus),
pop on closers, and emit a PairEvent::Solo / Matched /
Unmatched / Unclosed for every trigger.
Phase 2 is also the first place recovery semantics fire: stray closers and unmatched openers each emit a structured diagnostic but never abort, so downstream consumers see a complete event stream regardless of input well-formedness.
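A minimal sketch of the stack walk, assuming simplified event payloads (byte offsets only) and a plain Vec where the text above uses a SmallVec. Solo events and the double-bracket form are omitted, and the recovery rule shown (report a stray closer, leave the stack alone) is an assumption rather than the documented behaviour.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum PairEvent {
    /// Opener at `open` closed by the closer at `close`.
    Matched { open: usize, close: usize },
    /// Closer that does not match the opener on top of the stack.
    Unmatched { close: usize },
    /// Opener still on the stack at end of input.
    Unclosed { open: usize },
}

/// Map an opener to the closer it expects, if `c` is an opener at all.
fn closer_for(c: char) -> Option<char> {
    match c {
        '《' => Some('》'),
        '〔' => Some('〕'),
        '「' => Some('」'),
        '\u{FF3B}' => Some('\u{FF3D}'), // full-width square brackets
        _ => None,
    }
}

fn pair(src: &str) -> Vec<PairEvent> {
    // Each entry is (expected closer, byte offset of the opener).
    let mut stack: Vec<(char, usize)> = Vec::new();
    let mut events = Vec::new();
    for (at, c) in src.char_indices() {
        if let Some(closer) = closer_for(c) {
            stack.push((closer, at));
        } else if matches!(c, '》' | '〕' | '」' | '\u{FF3D}') {
            match stack.last().copied() {
                Some((expected, open)) if expected == c => {
                    stack.pop();
                    events.push(PairEvent::Matched { open, close: at });
                }
                // Never abort: report the stray closer and keep walking.
                _ => events.push(PairEvent::Unmatched { close: at }),
            }
        }
    }
    events.extend(stack.into_iter().map(|(_, open)| PairEvent::Unclosed { open }));
    events
}
```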
Phase 3: classify
The most code-heavy phase. The classifier maps PairEvents to
AozoraNode variants via a slug-canonicalised dispatch table
(SLUGS / canonicalise_slug). Recognisers are organised per
construct family:
- Ruby (|青梅《おうめ》, with implicit-base auto-glob)
- Bouten / forward-bouten ([#「平和」に傍点], with look-back target resolution)
- Tate-chu-yoko ([#「12」は縦中横])
- Gaiji (※[#説明、ページ-行])
- Kaeriten (Chinese-text reading marks)
- Sashie (illustrations)
- Indent / alignment / line-length annotations
- Section / page breaks
The recogniser dispatch is deterministic and slug-canonicalised so
prefix collisions (ここから2字下げ vs ここから2字下げ、地寄せ)
resolve via the SLUGS entry’s family + arity, not by recogniser
ordering. Look-back targets (bouten / tcy) resolve against the
sanitised text in the same walk.
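One way to get order-independent dispatch is to canonicalise the directive body and then require an exact table match. The sketch below is a guess at the general idea, not the real SLUGS table or canonicalise_slug: the Family enum, the N placeholder for numbers and the three entries are all assumptions made for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Family {
    IndentOpen,
    IndentOpenAligned,
    IndentClose,
}

/// Hypothetical table: (canonical slug, family, number of numeric arguments).
const SLUGS: &[(&str, Family, usize)] = &[
    ("ここからN字下げ", Family::IndentOpen, 1),
    ("ここからN字下げ、地寄せ", Family::IndentOpenAligned, 1),
    ("ここで字下げ終わり", Family::IndentClose, 0),
];

/// Replace each run of ASCII or full-width digits with one 'N' and collect the values.
fn canonicalise_slug(body: &str) -> (String, Vec<u32>) {
    let mut slug = String::new();
    let mut args: Vec<u32> = Vec::new();
    let mut in_number = false;
    for c in body.chars() {
        let digit = match c {
            '0'..='9' => Some(c as u32 - '0' as u32),
            '\u{FF10}'..='\u{FF19}' => Some(c as u32 - 0xFF10),
            _ => None,
        };
        match digit {
            Some(d) if in_number => {
                let last = args.last_mut().expect("a number is in progress");
                *last = *last * 10 + d;
            }
            Some(d) => {
                in_number = true;
                args.push(d);
                slug.push('N');
            }
            None => {
                in_number = false;
                slug.push(c);
            }
        }
    }
    (slug, args)
}

/// Exact match on slug and arity, so a longer directive never falls into a shorter entry.
fn classify(body: &str) -> Option<(Family, Vec<u32>)> {
    let (slug, args) = canonicalise_slug(body);
    SLUGS
        .iter()
        .find(|entry| entry.0 == slug && entry.2 == args.len())
        .map(|entry| (entry.1, args))
}
```

Reordering the table entries cannot change the result, which is the property a chain of starts_with checks would not give.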
Fused finishing walk
After Phase 3, lex_into_arena runs a single output-build walk
that does what was once three separate phases:
- Normalise — substitute each Aozora span with its PUA sentinel (U+E001/E002/E003/E004 for inline / block-leaf / block-open / block-close) so the downstream CommonMark parser sees a flat text with single-codepoint placeholders.
- Register — build the Registry (an EytzingerMap<u32, NodeRef<'src>>, see van Emde Boas / Eytzinger layout) keyed by sentinel byte position so the post-process walk can recover the borrowed-AST node from a normalised position in O(log n).
- Validate + diagnostics — collect every Phase-0 / Phase-2 / Phase-3 diagnostic, sort by span, and pin stable codes (aozora::lex::source_contains_pua, aozora::lex::unclosed_bracket, …; see diagnostics).
Performing all three in one walk avoids three extra passes over
the (potentially MB-class) source and keeps the Registry’s
EytzingerMap build amortised.
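The normalise and register steps can be sketched together. This is an illustration under assumptions: a sorted Vec with binary search stands in for the EytzingerMap, a string label stands in for NodeRef, only the inline sentinel is used, and the spans are assumed to arrive sorted and non-overlapping.

```rust
use std::ops::Range;

const INLINE_SENTINEL: char = '\u{E001}';

/// Replace each span of `src` with one sentinel and record where it landed.
/// `spans` holds (byte range in `src`, node label), sorted and non-overlapping.
fn normalise(
    src: &str,
    spans: &[(Range<usize>, &'static str)],
) -> (String, Vec<(u32, &'static str)>) {
    let mut out = String::with_capacity(src.len());
    let mut registry = Vec::with_capacity(spans.len());
    let mut cursor = 0;
    for (range, node) in spans {
        out.push_str(&src[cursor..range.start]);
        // Key the node by the sentinel's byte position in the normalised text.
        registry.push((out.len() as u32, *node));
        out.push(INLINE_SENTINEL);
        cursor = range.end;
    }
    out.push_str(&src[cursor..]);
    (out, registry)
}

/// Recover the node registered at a normalised byte position, in O(log n).
fn node_at<'a>(registry: &[(u32, &'a str)], pos: u32) -> Option<&'a str> {
    registry
        .binary_search_by_key(&pos, |&(p, _)| p)
        .ok()
        .map(|i| registry[i].1)
}
```

Positions are pushed in increasing order, so the registry is sorted by construction and needs no separate sort before lookup.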
Why four phases, not one big function?
Three reasons.
- Bench-driven optimisation. Per-phase boundaries let cargo bench -p aozora-bench measure each phase’s wall time independently. Knowing that “this document spends 80 % of parse time in Phase 3 classify” tells you where the next perf PR belongs. A monolithic lex() would force re-instrumentation in every PR.
- Spec compliance. Each phase corresponds to a discrete transformation the spec describes. Spec gaps in production almost always land in one phase, and the conformance suite can pin regression fixtures targeting that phase only.
- Composability. aozora-pipeline exposes both the fused lex_into_arena entry and the per-phase functions (sanitize, tokenize / tokenize_in, pair / pair_in, classify). Production code uses the fused entry; benchmarks and the type-state Pipeline state machine use per-phase calls to isolate regressions.
The cost is conceptual (more API surface internal to the crate); the win is that every perf decision in the parser has a measurement attached.
See also
- Pipeline overview — how the lexer fits into the full parse layer.
- SIMD scanner backends — Phase 1’s trigger scan.
- Error recovery — what each phase does when a diagnostic fires.
- Performance → Profiling with samply — how to measure the per-phase cost on your own workload.
SIMD scanner backends
Phase 1 of the lexer is a multi-pattern byte scan: find every
occurrence of the 11 Aozora trigger characters (|《》#※[]〔〕「」)
in the source. On a typical Japanese corpus document, where nearly every
codepoint is a 3-byte UTF-8 sequence and trigger characters account for
roughly 1–2 % of bytes, the scan has to cover every byte while the later
interpretation only touches the hits, so the scan outweighs the
interpretation by an order of magnitude. This is the place where SIMD
pays for itself.
Architecture: outer driver × inner kernel
aozora-scan ships a single algorithm — Hyperscan-style Teddy with
nibble LUTs — implemented once as a platform-agnostic outer driver
and plugged into per-ISA inner kernels. The split is the spine of
the crate:
- crate::kernel::teddy — algorithm side. Defines the const-built bucket LUTs (one bit per pattern; the 11 triggers fit comfortably in the 16-bit mask), the verify table, the TeddyInner trait every kernel implements, and teddy_outer — the platform-agnostic chunk loop + verify pass.
- crate::arch::* — platform side. One file per ISA; each implements TeddyInner::lead_mask_chunk using the appropriate 16-byte LUT shuffle: pshufb on x86 SSSE3, _mm256_shuffle_epi8 on AVX2, vqtbl1q_u8 on NEON, i8x16_swizzle on WASM SIMD.
Adding a new SIMD ISA is one file under arch/. Adding a new
algorithm (e.g. SHIFT-OR baseline, AVX-512 64-byte chunk) is one
file under kernel/. The two axes never tangle.
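The nibble-LUT idea can be shown one byte at a time. The sketch below is a scalar analogue under assumptions, not the aozora-scan kernel: NibbleLuts, build_luts and scan are hypothetical names, only the first byte of each pattern is fingerprinted, and the SIMD kernels would do the two table lookups for a whole chunk with one shuffle each.

```rust
/// Bit i of lo[n] / hi[n] is set when pattern i's first byte has low / high nibble n.
struct NibbleLuts {
    lo: [u16; 16],
    hi: [u16; 16],
}

fn build_luts(patterns: &[&[u8]]) -> NibbleLuts {
    assert!(patterns.len() <= 16, "one bit per pattern in a u16 mask");
    let mut luts = NibbleLuts { lo: [0; 16], hi: [0; 16] };
    for (i, p) in patterns.iter().enumerate() {
        let first = p[0]; // patterns are assumed non-empty
        luts.lo[(first & 0x0F) as usize] |= 1 << i;
        luts.hi[(first >> 4) as usize] |= 1 << i;
    }
    luts
}

/// Return (byte offset, pattern index) for every occurrence of any pattern.
fn scan(haystack: &[u8], patterns: &[&[u8]]) -> Vec<(usize, usize)> {
    let luts = build_luts(patterns);
    let mut hits = Vec::new();
    for (at, &b) in haystack.iter().enumerate() {
        // Candidate mask: patterns whose first byte agrees on both nibbles.
        let mut mask = luts.lo[(b & 0x0F) as usize] & luts.hi[(b >> 4) as usize];
        while mask != 0 {
            let i = mask.trailing_zeros() as usize;
            // Verify pass: confirm the remaining bytes of the candidate.
            if haystack[at..].starts_with(patterns[i]) {
                hits.push((at, i));
            }
            mask &= mask - 1; // clear the lowest set bit
        }
    }
    hits
}
```

With one bit per pattern, a bit survives the AND only when the byte equals that pattern's first byte, so candidates are never produced by two different patterns sharing a bucket.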
BackendChoice + static dispatch
BackendChoice
is a Copy enum carrying one variant per inner kernel currently
compiled into the build. BackendChoice::detect() runs once at
process start, picks the fastest variant the host CPU supports
(cached in OnceLock), and the match-based
BackendChoice::scan gives static dispatch straight into the
monomorphised teddy_outer<I> instantiation. No &dyn, no virtual
call on the hot path.
Static dispatch is the whole point: a trait object cannot carry a
generic S: OffsetSink method, so a &dyn-based dispatcher would
force every parse to allocate a heap Vec<u32> and memcpy it into
the lex pipeline’s bumpalo arena. The enum-and-match shape gives
us the same runtime-CPU adaptation a single binary needs without
that detour.
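The enum-and-match shape can be sketched as follows. Everything here is a stand-in: the Backend enum, the two toy kernels and the record method are hypothetical, and a cfg! check takes the place of real CPU feature detection. The point is that scan stays generic over the sink, which a trait-object dispatcher could not offer because a trait with a generic method cannot be made into a dyn object.

```rust
use std::sync::OnceLock;

/// Anything that can receive match offsets from a kernel.
trait OffsetSink {
    fn record(&mut self, offset: u32);
}

impl OffsetSink for Vec<u32> {
    fn record(&mut self, offset: u32) {
        self.push(offset);
    }
}

/// A second sink that only counts, to show kernels are sink-agnostic.
struct CountSink(usize);

impl OffsetSink for CountSink {
    fn record(&mut self, _offset: u32) {
        self.0 += 1;
    }
}

/// Toy kernel 1: byte-at-a-time.
fn scan_bytewise<S: OffsetSink>(haystack: &[u8], needle: u8, sink: &mut S) {
    for (i, &b) in haystack.iter().enumerate() {
        if b == needle {
            sink.record(i as u32);
        }
    }
}

/// Toy kernel 2: walks 8-byte chunks (stand-in for a wider SIMD kernel).
fn scan_chunked<S: OffsetSink>(haystack: &[u8], needle: u8, sink: &mut S) {
    for (chunk_index, chunk) in haystack.chunks(8).enumerate() {
        for (i, &b) in chunk.iter().enumerate() {
            if b == needle {
                sink.record((chunk_index * 8 + i) as u32);
            }
        }
    }
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Backend {
    Bytewise,
    Chunked,
}

impl Backend {
    /// Pick once, cache for the process lifetime.
    fn detect() -> Backend {
        static CHOICE: OnceLock<Backend> = OnceLock::new();
        *CHOICE.get_or_init(|| {
            // Placeholder policy; a real build would query CPU features here.
            if cfg!(target_pointer_width = "64") {
                Backend::Chunked
            } else {
                Backend::Bytewise
            }
        })
    }

    /// Static dispatch: each arm calls a monomorphised kernel.
    fn scan<S: OffsetSink>(self, haystack: &[u8], needle: u8, sink: &mut S) {
        match self {
            Backend::Bytewise => scan_bytewise(haystack, needle, sink),
            Backend::Chunked => scan_chunked(haystack, needle, sink),
        }
    }
}
```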
Backends compiled into the build
| Variant | Target gate | Kernel size | Notes |
|---|---|---|---|
| TeddyAvx2 | x86_64 | 32-byte chunk | Production winner on every modern dev / CI host. _mm256_shuffle_epi8 per-lane LUT shuffle. |
| TeddySsse3 | x86_64 | 16-byte chunk | Selected when AVX2 is unavailable but SSSE3 is. _mm_shuffle_epi8 (pshufb). |
| TeddyNeon | aarch64 | 16-byte chunk | aarch64 ABI mandates NEON, so always selected on that target. vqtbl1q_u8. |
| TeddyWasm | wasm32 | 16-byte chunk | WASM SIMD128 baseline since 2022. i8x16_swizzle. |
| ScalarTeddy | always | 16-byte chunk, no SIMD | Pure-Rust reference; the no_std last-resort dispatch target and the proptest oracle for SIMD ports. |
NaiveScanner
(brute-force PHF walker) is #[doc(hidden)] — kept reachable for
the integration proptests and the bake-off bench, never the
dispatch target.
Why a self-rolled Teddy
The previous production stack drove three scanners:
aho_corasick::packed::teddy (SSSE3-only), regex_automata (DFA), and a
hand-rolled simdjson-style structural bitmap (AVX2). Coverage gaps
forced redundant fallback code on every commit, and the trio carried
~1.4 MB of compiled dependency surface.
Switching to a self-rolled Teddy:
- One algorithm, four ISAs. The outer driver is ~120 LOC; each ISA inner kernel is ~30 LOC. NEON / WASM SIMD ports compile natively rather than waiting on upstream aho_corasick.
- No external SIMD deps. aho_corasick and regex_automata are gone from the default dep tree. The aozora-scan build no longer pulls in regex-automata’s ~600 KB of state-table code.
- One-bit-per-pattern bucket layout. The 11 triggers fit in the lower 11 bits of a u16; we don’t pay for the collision-verify pass Hyperscan’s “fat-finger” packing requires.
- OffsetSink visitor. Every kernel writes through the same generic sink, so the lex pipeline’s BumpVec<'_, u32> receives offsets directly from the SIMD inner loop — the legacy heap-allocate-then-memcpy detour is gone.
Every kernel cross-validates byte-identically against NaiveScanner
in proptest, both in-source (chunk-level) and in
tests/property_backend_equiv.rs
(end-to-end across the workhorse fragment / pathological /
unicode-adversarial distributions).
Verifying the scanner is firing
println!("{}", aozora_scan::BackendChoice::detect().name());
// "teddy-avx2" | "teddy-ssse3" | "teddy-neon" | "teddy-wasm" | "scalar-teddy"
Or under samply, look for one of the per-ISA inner kernels:
- aozora_scan::arch::x86_64::lead_mask_chunk_avx2
- aozora_scan::arch::x86_64::lead_mask_chunk_ssse3
- aozora_scan::arch::aarch64::lead_mask_chunk_neon
- aozora_scan::arch::wasm32::lead_mask_chunk_wasm
- aozora_scan::kernel::teddy::ScalarTeddyKernel::lead_mask_chunk
Their parent in the call tree is always
aozora_scan::kernel::teddy::teddy_outer, where the chunk loop
lives.
See also
- Pipeline overview
- Four-phase lexer — Phase 1 events fits in here.
Eytzinger sorted-set lookup
aozora-veb is a no_std crate that provides one data structure: a
sorted-set lookup over a static byte slice, laid out in
Eytzinger order so that the binary search is cache-friendly. It
backs the placeholder registry the lexer uses to recognise the
fixed-set strings inside [#…] directives (“ここから”, “ここで”,
“傍点”, “傍線”, “字下げ”, …).
flowchart LR
needle["needle: &str"]
table["Eytzinger-laid sorted set<br/>(static &[&str])"]
cmp["compare at index, branch left/right"]
found["Some(idx) | None"]
needle --> cmp
table --> cmp
cmp --> cmp
cmp --> found
What is Eytzinger order?
A standard sorted array stores elements in their natural order:
[a, b, c, d, e, f, g]. Binary search visits indexes
3, 1 or 5, 0/2/4/6 — accesses that are spatially distant in
memory. On modern CPUs that’s a cache miss per level past L1.
Eytzinger order stores the same elements in implicit-binary-tree
order: the root at index 1 (index 0 is reserved as a sentinel),
left child at 2i, right child at 2i+1. The walk visits indexes
1, 2 or 3, 4/5/6/7 — accesses that are consecutive in memory.
For 256+ entries the cache-line packing is a measured 2–3× speedup
over std::slice::binary_search on the same data. Below 64 entries
the difference is in the noise (the whole table stays cache-resident either way).
The placeholder registry has ~120 entries — well into Eytzinger’s
favourable regime.
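The layout and the walk described above can be sketched as below. This is a runtime illustration with hypothetical helper names (eytzinger, layout, position); the crate itself is described as building the order at compile time, which this sketch does not attempt.

```rust
use std::cmp::Ordering;

/// In-order walk of the implicit tree rooted at `k`: the left subtree (2k)
/// receives the smaller keys, then slot `k`, then the right subtree (2k + 1).
fn eytzinger<T: Copy>(sorted: &[T], out: &mut [T], next: &mut usize, k: usize) {
    if k < out.len() {
        eytzinger(sorted, out, next, 2 * k);
        out[k] = sorted[*next];
        *next += 1;
        eytzinger(sorted, out, next, 2 * k + 1);
    }
}

/// Build the table from a sorted slice; slot 0 is the unused sentinel.
fn layout<'a>(sorted: &[&'a str]) -> Vec<&'a str> {
    let mut out: Vec<&'a str> = vec![""; sorted.len() + 1];
    let mut next = 0;
    eytzinger(sorted, &mut out, &mut next, 1);
    out
}

/// Search from the root at index 1, descending to 2k or 2k + 1.
fn position(table: &[&str], needle: &str) -> Option<usize> {
    let mut k = 1;
    while k < table.len() {
        match needle.cmp(table[k]) {
            Ordering::Equal => return Some(k),
            Ordering::Less => k = 2 * k,
            Ordering::Greater => k = 2 * k + 1,
        }
    }
    None
}
```

For the seven-element example, the table comes out as sentinel, d, b, f, a, c, e, g, so a lookup for e reads slots 1, 3 and 6 in turn.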
Why this and not phf::Set?
phf::Set is a perfect-hash table: O(1) lookup, but with a real
constant — one hash computation, one table probe, one strcmp. For
short strings (the placeholder registry’s median is 4 chars) the
hash dominates, and the table probe is a pointer chase to a separate
allocation.
Eytzinger search is log N — but for N=120 that’s 7 comparisons,
all in one contiguous slice, no hashing, no separate allocation.
Measured: Eytzinger is ~1.5× faster than phf::Set on this
workload.
For larger sets (the gaiji table at ~14 000 entries),
phf::Set wins — log₂(14000) is 14 comparisons and the cache
locality stops mattering. The choice is entry-count-dependent.
The aozora codebase uses Eytzinger for sub-256-entry tables and
phf::Set for larger ones; the cutoff was determined empirically.
Why not a hash table?
A HashMap<&str, ()> allocates and rehashes; phf and Eytzinger
don’t. In the lexer’s Phase 3 classify, the placeholder registry
is hit once per [#…] directive — measured as ~5 lookups per
KB of source. A HashMap’s startup cost (build the table from a
const array on first use, even with OnceLock) would dominate
the parser’s per-Document::parse cost on tiny inputs.
API
pub struct EytzingerSet<'a> {
entries: &'a [&'a str], // already in Eytzinger order
}
impl<'a> EytzingerSet<'a> {
pub const fn new(entries: &'a [&'a str]) -> Self { Self { entries } }
pub fn contains(&self, needle: &str) -> bool { … }
pub fn position(&self, needle: &str) -> Option<usize> { … }
}
new is const fn so registries are computed at compile time and
end up in .rodata. Lookup is a single function with no allocation.
Building the order
The crate ships a build-time helper that takes a sorted slice and produces the Eytzinger permutation:
const PLACEHOLDERS: &[&str] = aozora_veb::eytzinger_layout!(
"ここから", "ここで", "傍点", "傍線", "字下げ", …
);
The macro is const-evaluated; the resulting slice is what
EytzingerSet::new takes.
Why a separate crate?
The lookup is no_std and has no aozora-specific dependencies. By
extracting it, three things become true:
- The lexer can depend on aozora-veb without pulling in any workspace state, which keeps aozora-veb’s test surface small.
- aozora-veb can be reused by aozora-encoding (for the accent decomposition table) and by aozora-bench (for category slug lookups in the trace rollup) without forming a circular dependency.
- Future consumers can depend on just aozora-veb for the data structure, without taking the whole parser.
See also
- Crate map — aozora-veb is the foundation crate with no internal deps.
- Performance → Benchmarks — the Eytzinger vs phf cutoff measurement.
Shift_JIS + 外字 resolver
aozora-encoding covers the full source-decoding stack:
- Shift_JIS / Shift_JIS-2004 / cp932 byte stream → UTF-8 string.
- JIS X 0213 plane-2 ideographs → Unicode (where possible).
- 外字 references (※[#…]) → resolved Unicode codepoint, JIS triple, or descriptive-text fallback.
- Accent decomposition (114 ASCII digraph / ligature → Unicode).
All four are pure functions; the crate has no global state and nothing that varies per-call.
Decode chain
flowchart TD
raw["raw bytes<br/>(SJIS-encoded .txt from Aozora Bunko)"]
sjis["encoding_rs::SHIFT_JIS<br/>or aozora-specific JIS X 0213 patch"]
utf8["UTF-8 String"]
sanitize["Phase 0 sanitize<br/>(in aozora-pipeline)"]
pua["PUA assignment for 外字"]
classified["normalised &str ready for Phase 1 scan"]
raw --> sjis --> utf8 --> sanitize --> pua --> classified
The Shift_JIS decode itself uses encoding_rs
— the same crate Firefox uses for HTML decoding. Battle-tested,
SIMD-accelerated, and handles every Shift_JIS variant Aozora Bunko
sources have used since the 1990s. We add a thin patch layer for
JIS X 0213 plane-2 codepoints that encoding_rs’s strict cp932
mapping doesn’t cover (Aozora’s spec extends Shift_JIS into JIS
X 0213 territory; encoding_rs keeps the strict cp932 surface).
外字 (gaiji) PHF table
The reference table contains ~14 000 entries:
static GAIJI_TABLE: phf::Map<&'static str, GaijiEntry> = phf_map! {
"1-94-37" => GaijiEntry::JisX0213 { plane: 1, row: 94, cell: 37, codepoint: '⿰魚師' },
"U+5F85" => GaijiEntry::Direct { codepoint: '待' },
"魚+師のつくり" => GaijiEntry::Description { fallback: "[魚+師]" },
…
};
Why PHF (perfect hash function):
- The table is large enough (~14 000 entries) that linear scan or Eytzinger search would dominate the lookup cost.
- It’s static and known at compile time — the perfect hash is computable once.
- phf produces zero-allocation lookups with no collision chain to walk. The hash is one SipHash pass (the phf crate hashes keys with SipHash 1-3); the probe is one slice index; the comparison is one strcmp. ~25 ns per lookup on the bench harness.
Why not OnceLock<HashMap>:
- First-call cost: building a HashMap<&str, GaijiEntry> from 14 000 entries on first use takes ~5 ms. That’s longer than parsing a small document end-to-end.
- Memory: the runtime HashMap takes 2–3× the size of the static PHF (load-factor padding + RawTable metadata).
- Concurrency: OnceLock adds an atomic load on every access, even after initialisation. PHF is static, so it needs no synchronisation.
Why not load from a JSON / TOML asset:
- Adds startup cost on every Document::new (for small inputs, the file I/O alone costs about as much as the parser’s whole runtime budget).
- Forces every binding (CLI / WASM / FFI / Python wheel) to ship the asset as a separate file, complicating distribution.
- Defeats dead-code elimination: the linker can’t strip entries the consumer’s input never references.
The build-time cost of compiling the PHF (~40 s the first time, 0 s incremental) is paid once per workspace build, not per-invocation.
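The first-call cost argument can be seen in a small std-only sketch. This is not the crate's code: the two entries are illustrative, `lookup_static` and `lookup_lazy` are hypothetical names, and a binary search over a sorted static array stands in for the perfect hash (the `phf` crate itself is not used here). The point is only the shape: the static table needs no initialisation, while the lazy map is built by whichever caller arrives first.

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

// Two illustrative entries only; kept sorted by key for the binary search below.
static ENTRIES: [(&str, char); 2] = [("U+5F85", '\u{5F85}'), ("U+9752", '\u{9752}')];

/// Static lookup: nothing to build at runtime.
/// Binary search is a stand-in for the perfect hash, not how phf works.
fn lookup_static(key: &str) -> Option<char> {
    ENTRIES
        .binary_search_by(|(k, _)| k.cmp(&key))
        .ok()
        .map(|i| ENTRIES[i].1)
}

/// Lazy lookup: the first caller pays for building the whole map,
/// and every later call still goes through the OnceLock check.
fn lookup_lazy(key: &str) -> Option<char> {
    static MAP: OnceLock<HashMap<&'static str, char>> = OnceLock::new();
    MAP.get_or_init(|| ENTRIES.iter().copied().collect())
        .get(key)
        .copied()
}
```

Both functions return the same answers; they differ only in when the table is paid for.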
Resolution order
pub fn resolve(reference: &str) -> Resolved {
    // 1. Direct codepoint (U+XXXX) wins outright.
    if let Some(c) = parse_unicode_form(reference) { return Resolved::Direct(c); }
    // 2. JIS X 0213 plane-row-cell triple.
    if let Some(triple) = parse_jis_triple(reference) {
        if let Some(c) = JIS_TABLE.get(&triple) { return Resolved::Lookup(c); }
    }
    // 3. Descriptive name lookup (curated subset).
    if let Some(fallback) = DESCRIPTION_TABLE.get(reference) {
        return Resolved::Fallback(fallback);
    }
    Resolved::Unknown
}
Three layers, in order. Direct wins because the source author
explicitly wrote a Unicode codepoint — overriding it would be
wrong even if our JIS table disagreed. Lookup is the common case.
Fallback is the curated subset of characters that have no Unicode
codepoint at all (~120 entries from the 14 000); we ship a
descriptive-text rendering rather than dropping the character.
Unknown fires diagnostic unresolved_gaiji.
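The two parsing helpers named in the listing are not shown in this chapter. The following is a sketch of what they might look like, under stated assumptions: the bodies, the return types, and the accepted digit counts are guesses for illustration, not the crate's implementation. The only facts relied on are the `U+XXXX` and `plane-row-cell` shapes used above and the JIS X 0213 ranges (planes 1 to 2, rows and cells 1 to 94).

```rust
/// Hypothetical stand-in for `parse_unicode_form`: accepts `U+` followed by
/// 4 to 6 hex digits and rejects values that are not Unicode scalar values.
fn parse_unicode_form(reference: &str) -> Option<char> {
    let hex = reference.strip_prefix("U+")?;
    if !(4..=6).contains(&hex.len()) || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    char::from_u32(u32::from_str_radix(hex, 16).ok()?)
}

/// Hypothetical stand-in for `parse_jis_triple`: accepts `plane-row-cell`
/// and range-checks each part against JIS X 0213.
fn parse_jis_triple(reference: &str) -> Option<(u8, u8, u8)> {
    let mut parts = reference.split('-');
    let plane = parts.next()?.parse::<u8>().ok()?;
    let row = parts.next()?.parse::<u8>().ok()?;
    let cell = parts.next()?.parse::<u8>().ok()?;
    if parts.next().is_some() {
        return None; // more than three components
    }
    let valid = (1..=2).contains(&plane) && (1..=94).contains(&row) && (1..=94).contains(&cell);
    valid.then_some((plane, row, cell))
}
```

Because both helpers return `None` for anything that does not match their shape, a descriptive reference such as `魚+師のつくり` falls through both and reaches the description table, which is the ordering the three-layer scheme depends on.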
Accent decomposition
Older Aozora works encode accented Latin letters using a separate
notation that is not a ※[#…] reference:
M[i!]cher → Micher
M[a!]ria → Maria
[ae]on → Aeon
The full mapping (114 entries — every digraph and ligature in the
spec) is at accent_separation.html in the spec snapshot. aozora
applies this decomposition during Phase 0 sanitize, before the
trigger scan, so by Phase 1 the source is pure Unicode with no
ASCII-encoded accents.
The lookup is also Eytzinger-laid (see Eytzinger sorted-set lookup) since 114 entries is well inside its favourable regime.
Why a single crate for all of this?
encoding, gaiji, and accent are three distinct concerns, but:
- They all need to be applied once, in order, at the boundary between the source bytes and the parser proper.
- Splitting them would force three separate crate surfaces and three separate trigger points in the lexer.
- Their data tables are all built from upstream Aozora Bunko spec
pages, so a single update workflow (refresh
docs/specs/aozora/, re-extract tables) hits all three at once.
Co-locating them in one crate keeps the boundary tight and the update surface predictable.
See also
- Notation → Gaiji — author-facing notation reference.
- Four-phase lexer → Phase 0 — where the resolver is invoked.
HTML renderer & canonical serialiser
aozora-render ships two walkers over AozoraTree<'_>:
- html::render_to_string: emits semantic HTML5 with aozora-* class hooks.
- serialize::serialize: emits canonical 青空文庫 source.
Both are pure functions. Both walk the tree once, in source order,
allocating exactly the output buffer (a String pre-sized to the
arena footprint).
HTML renderer
Class-name scheme
aozora emits stable class names that downstream stylesheets can hook:
| AST node | HTML | Class hook |
|---|---|---|
| Ruby | <ruby>X<rt>Y</rt></ruby> | (no class; semantic ruby element) |
| Bouten { kind: Sesame } | <em class="aozora-bouten-sesame">…</em> | aozora-bouten-<slug> |
| Tcy | <span class="aozora-tcy">…</span> | aozora-tcy |
| Gaiji { resolution: Direct } | <span data-aozora-gaiji-jis="1-94-37">字</span> | data-aozora-gaiji-* |
| Gaiji { resolution: Fallback } | <span class="aozora-gaiji-fallback" title="…">[…]</span> | aozora-gaiji-fallback |
| Container { kind: Indent { n: 2 } } | <div class="aozora-indent-2">…</div> | aozora-indent-<n> |
| Container { kind: AlignEnd } | <div class="aozora-align-end">…</div> | aozora-align-end |
| Break::Page | <div class="aozora-page-break"/> | aozora-page-break |
| Kaeriten { mark: Re } | <span class="aozora-kaeriten" data-aozora-kaeriten="レ">レ</span> | aozora-kaeriten |
The aozora- prefix is reserved for our class names — a downstream
stylesheet can target every aozora-emitted hook with [class^="aozora-"]
without conflicting with the consumer’s own classes.
Why a class-hook output instead of inline styles?
Inline styles would force a single typographic decision for every consumer — print stylesheet, screen stylesheet, e-book renderer, and LSP/preview pane all want different presentation. The class-hook output:
- Lets each consumer ship its own stylesheet for its medium.
- Survives content-security-policy regimes that block style attrs.
- Stays diff-able (the rendered HTML is stable across runs; presentation churn doesn’t ripple into snapshot tests).
HTML escaping
The renderer escapes <, >, &, ", ' in user text exactly
once, at emission. Pre-escaped or doubly-escaped output is a
correctness bug, not a perf decision: every CI run validates that
html_unescape ∘ render_to_string is the identity on plain runs
(render first, then unescape, and the source text comes back).
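The escape-once property is easy to state in code. The sketch below is an illustration rather than the renderer's implementation: the function names mirror the ones mentioned above, but the bodies and the specific entity spellings (`&#39;` for the apostrophe in particular) are assumptions.

```rust
/// Escape the five characters named above; everything else passes through.
fn html_escape(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for c in text.chars() {
        match c {
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '&' => out.push_str("&amp;"),
            '"' => out.push_str("&quot;"),
            '\'' => out.push_str("&#39;"),
            _ => out.push(c),
        }
    }
    out
}

/// Inverse of `html_escape` for text produced by it.
fn html_unescape(html: &str) -> String {
    html.replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&#39;", "'")
        .replace("&amp;", "&") // last, so "&amp;lt;" comes back as "&lt;"
}
```

Escaping twice turns `&` into `&amp;amp;`, which a single unescape no longer maps back to the source; that is the double-escaping bug the identity check is there to catch.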
Canonical serialiser
The serialiser is the inverse of the lexer’s surface form: walk the tree, emit the source notation that would re-parse identically. It exists for three reasons:
- Round-trip property. parse ∘ serialize ∘ parse must be stable on the second iteration. The corpus sweep verifies this on every Aozora Bunko work.
- aozora fmt. The CLI’s fmt subcommand canonicalises author input (CRLF → LF, accent decomposition, container directive spacing).
- Diff-quality output. When the parser drops a malformed construct, the serialiser re-emits the surrounding text without the offending fragment, so authors can see the exact change.
Why a separate walker, not “render with a different visitor”?
The HTML and canonical-serialise outputs differ on every node type:
- HTML wraps Ruby { target, reading } in <ruby>X<rt>Y</rt></ruby>; serialise emits |X《Y》 (or auto-detect form).
- HTML wraps Container { kind: Indent { n } } in <div class="aozora-indent-N">…</div>; serialise emits the bracketed directives [#ここからN字下げ]…[#ここで字下げ終わり].
- HTML emits <span data-aozora-gaiji-jis="1-94-37">字</span> for a resolved gaiji; serialise emits the original ※[#…、第3水準1-94-37].
The transformations don’t share enough structure to fit a single “visitor with two methods per node” abstraction. Two purpose-built walkers stay clearer and slightly faster — the compiler can inline the per-node match, which a generic visitor with virtual dispatch prevents.
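A toy version of the two walkers shows how little they share. The `Node` enum here is a hypothetical three-variant stand-in for the real AST, escaping is omitted, and the serialiser always uses the explicit-bar ruby form and ASCII digits; none of that is claimed to match the crate's behaviour.

```rust
/// Hypothetical miniature AST, for illustration only.
enum Node<'a> {
    Plain(&'a str),
    Ruby { target: &'a str, reading: &'a str },
    Indent { n: u32, children: Vec<Node<'a>> },
}

/// HTML walker: element wrappers and class hooks.
fn to_html(nodes: &[Node<'_>], out: &mut String) {
    for node in nodes {
        match node {
            Node::Plain(s) => out.push_str(s), // escaping omitted in this sketch
            Node::Ruby { target, reading } => {
                out.push_str(&format!("<ruby>{target}<rt>{reading}</rt></ruby>"));
            }
            Node::Indent { n, children } => {
                out.push_str(&format!("<div class=\"aozora-indent-{n}\">"));
                to_html(children, out);
                out.push_str("</div>");
            }
        }
    }
}

/// Source walker: the notation that would re-parse to the same nodes.
fn to_source(nodes: &[Node<'_>], out: &mut String) {
    for node in nodes {
        match node {
            Node::Plain(s) => out.push_str(s),
            Node::Ruby { target, reading } => {
                out.push_str(&format!("|{target}《{reading}》"));
            }
            Node::Indent { n, children } => {
                out.push_str(&format!("[#ここから{n}字下げ]"));
                to_source(children, out);
                out.push_str("[#ここで字下げ終わり]");
            }
        }
    }
}
```

The only thing the two functions have in common is the `match` skeleton; every arm body differs, which is the case made above for two purpose-built walkers over one generic visitor.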
Walker shape
Both walkers follow the same shape:
pub fn render_to_string(tree: &AozoraTree<'_>) -> String {
    // Pre-size the buffer so the walk does not reallocate.
    let mut buf = String::with_capacity(tree.estimated_html_size());
    walk(tree, &mut buf);
    buf
}
fn walk(tree: &AozoraTree<'_>, out: &mut String) {
    for node in tree.nodes() {
        match node {
            // push_str takes &str, so borrow whatever html_escape returns.
            AozoraNode::Plain(s) => out.push_str(&html_escape(s)),
            AozoraNode::Ruby(r) => emit_ruby(r, out),
            AozoraNode::Bouten(b) => emit_bouten(b, out),
            AozoraNode::Tcy(t) => emit_tcy(t, out),
            AozoraNode::Gaiji(g) => emit_gaiji(g, out),
            AozoraNode::Container(c) => emit_container(c, out),
            AozoraNode::BreakNode(b) => emit_break(b, out),
            // … exhaustive
        }
    }
}
Single linear pass; no allocation outside the output buffer; no recursion that the compiler can’t unroll (containers recurse, but the fan-out is small — typically 1–4 children per container).
estimated_html_size heuristic
The buffer pre-size avoids String reallocations during the walk.
Empirical heuristic from the corpus sweep: 2.6 × source_byte_len
is at the 95th percentile (some HTML wraps a 3-byte ruby kanji in
30 bytes of <ruby>X<rt>Y</rt></ruby> markup). Going under leaves
~1 reallocation per render in the worst case; going over wastes
memory on every render. 2.6× is the measured optimum.
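As a rough sketch of the arithmetic, the 2.6× factor can be computed in integers as 26 / 10, rounded up. The free function below is a hypothetical stand-in (the text describes `estimated_html_size` as a method on the tree, and its real body is not shown here); `fits_without_growth` is an illustrative helper that checks whether a given output stayed inside the pre-sized buffer.

```rust
/// Hypothetical stand-in for the size heuristic: ceil(2.6 × source length).
fn estimated_html_size(source_byte_len: usize) -> usize {
    let scaled = source_byte_len.saturating_mul(26);
    scaled / 10 + usize::from(scaled % 10 != 0)
}

/// True when `html` fits into a buffer pre-sized from `source`
/// without the String having to grow.
fn fits_without_growth(source: &str, html: &str) -> bool {
    let mut buf = String::with_capacity(estimated_html_size(source.len()));
    let before = buf.capacity();
    buf.push_str(html);
    buf.capacity() == before
}
```

A 1 000-byte source gets a 2 600-byte buffer under this formula; output that stays below the estimate is appended without any reallocation.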
See also
- Notation overview — what each AST node represents.
- Borrowed-arena AST — the input shape.
- Performance → Benchmarks: the render_hot_path probe that drives the size estimate.
Concrete syntax tree (CST)
A rowan-backed lossless syntax tree lives under the cst
Cargo feature on the aozora crate. The CST is a pure projection
over the existing parse output — Phase 3 classification is unchanged,
the AST stays the perf-critical path, and the CST adds zero overhead
for consumers that don’t enable the feature.
Why a CST exists
The borrowed AST (AozoraNode<'src>) is great for renderers:
classified spans, typed payload, no whitespace noise. It is the wrong
shape for source-faithful tooling:
- A formatter rewriting 日本《にほん》 → |日本《にほん》 needs the exact whitespace and trivia between tokens.
- An LSP textDocument/foldingRange provider needs the open / close positions of every nestable region, including ones the renderer ignores.
- A refactor that renames a kanji-range [#「青空」に傍点] to [#「あおぞら」に傍点] must preserve every bracket character the user wrote, not just the parsed target.
A CST whose leaves concatenate to the parser’s input gives those tools what they need without any custom plumbing.
Lossless invariant
The contract is sharp:
Concatenating every leaf token’s text yields the sanitized source bytes the parser actually saw.
“Sanitized” matters: Phase 0 normalises CRLF→LF, strips a leading
BOM, isolates long decorative rule lines with a leading blank line,
and rewrites 〔…〕 accent spans through accent decomposition. These
transformations happen before classification, so source_nodes
coordinates address sanitized bytes. The CST tracks that coordinate
system; an editor that wants to map back to the user’s raw bytes
runs the same Phase 0 transformation and inverts where needed.
The proptest in tests/property_lossless.rs runs the invariant
across the full Aozora-shaped input distribution
(aozora_fragment / pathological_aozora /
unicode_adversarial from aozora-proptest). A regression here
breaks every editor surface that walks the CST.
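The invariant can be sketched without rowan. Everything below is a simplified model under stated assumptions: `sanitize` covers only the BOM and CRLF steps, `Cst` is a hypothetical two-variant tree, and `build_cst` is a toy splitter that treats each `[#…]` annotation as one construct. The check at the end is the invariant itself: leaf texts, concatenated in order, equal the sanitized source.

```rust
/// Simplified Phase 0: strip a leading BOM and normalise CRLF to LF.
/// (Rule-line isolation and accent rewriting are left out of this sketch.)
fn sanitize(raw: &str) -> String {
    raw.strip_prefix('\u{FEFF}').unwrap_or(raw).replace("\r\n", "\n")
}

/// Hypothetical miniature CST: leaves carry source text, nodes carry children.
enum Cst {
    Token(String),
    Node(Vec<Cst>),
}

/// Concatenate every leaf token's text in document order.
fn leaf_text(node: &Cst, out: &mut String) {
    match node {
        Cst::Token(text) => out.push_str(text),
        Cst::Node(children) => {
            for child in children {
                leaf_text(child, out);
            }
        }
    }
}

/// Toy builder: each closed `[#…]` becomes a construct node, the rest is plain.
fn build_cst(sanitized: &str) -> Cst {
    const OPEN: &str = "[#";
    const CLOSE: &str = "]";
    let mut children = Vec::new();
    let mut rest = sanitized;
    while let Some(start) = rest.find(OPEN) {
        // An unclosed opener stays in the plain tail; nothing is dropped.
        let Some(len) = rest[start..].find(CLOSE) else { break };
        let end = start + len + CLOSE.len();
        if start > 0 {
            children.push(Cst::Token(rest[..start].to_string()));
        }
        children.push(Cst::Node(vec![Cst::Token(rest[start..end].to_string())]));
        rest = &rest[end..];
    }
    if !rest.is_empty() {
        children.push(Cst::Token(rest.to_string()));
    }
    Cst::Node(children)
}
```

Note that the comparison is against `sanitize(raw)`, not `raw`: a source that starts with a BOM or uses CRLF will differ from the leaf concatenation, which is exactly the coordinate-system caveat described above.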
Architecture
The crate stays decoupled by design:
- aozora-cst depends on aozora-pipeline + aozora-spec directly, not on the aozora meta crate. Going through aozora would create a cycle (the meta crate’s cst feature re-exports aozora-cst).
- build_cst(sanitized_source, source_nodes) -> SyntaxNode takes the lower-level bits explicitly so consumers writing custom pipelines can reach in.
- aozora::cst::from_tree(&tree) -> SyntaxNode is the ergonomic entry point; it runs Phase 0 sanitize internally and forwards.
- The Phase 3 classifier sees no changes: adding / removing CST consumers cannot perturb AST perf.
SyntaxKind granularity
The CST is intentionally coarser than a token-stream re-construction:
| SyntaxKind | Role |
|---|---|
| Document | Tree root |
| Container | Paired-container region ([#ここから...]...[#ここで...終わり]) |
| Construct | Single classified Aozora construct |
| ContainerOpen / ContainerClose | Container boundary tokens |
| ConstructText | Source slice of a Construct |
| Plain | Plain text run between classifications |
Finer per-token granularity (individual punctuation, kana runs, …)
can land later once a concrete consumer needs it. The lossless
property holds at any granularity, so widening the leaf set is
non-breaking for downstream tooling that walks preorder_with_tokens.
Why rowan, not Phase 3 integration
The bumpalo-arena AST stays the hot path; the CST sits on top as an editor-grade convenience layer rather than coupling lossless-tree concerns into the perf-critical classifier. rowan (over cstree) gives the lossless tree a maintained home — rust-analyzer’s tree infrastructure with 86 reverse deps — and the bumpalo / Arc dual-allocator overhead is the price for keeping the AST untouched.
Cross-references
- Architecture → Borrowed-arena AST — the underlying perf-critical tree.
- Architecture → Four-phase lexer — where Phase 0 sanitize and Phase 3 classify do their work.
- Document::edit: the incremental-parse counterpart that reuses the same CST.
Error recovery
aozora is non-fatal by design: the parser always returns an
AozoraTree even when the input violates the spec. Every
problem is reported as a structured Diagnostic whose
code tooling can match on; nothing is ever raised as a
panic from Document::parse.
This page documents what the parser actually does when each diagnostic fires — useful when implementing editor surfaces, lint fixers, or anything else that runs over imperfect documents.
Recovery model
Every diagnostic carries two orthogonal axes:
| Axis | Values | Meaning |
|---|---|---|
| severity | Error / Warning / Note | Routing hint for downstream surfaces; does not affect parsing. |
| source | Source / Internal | Whether the issue is in the user’s input (Source) or in the library’s invariants (Internal). |
The parser keeps running regardless of severity. Error does not
short-circuit; it only marks the surrounding output region as
suspect so callers (CLI --strict, LSP) can decide policy. CI gates
typically treat any Error as failure, but the AST is still safe
to walk — the spans, classifications, and renderer all remain
consistent.
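Since severity is only a routing hint, policy lives with the caller. The sketch below shows one way a caller might act on the two axes. The types are hypothetical simplifications of the real `Diagnostic` (which also carries spans and help text), and `passes_strict` / `internal_codes` are illustrative names, not part of the library.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Severity {
    Error,
    Warning,
    Note,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Origin {
    Source,   // problem in the user's input
    Internal, // problem in the library's invariants
}

/// Hypothetical, trimmed-down diagnostic: just the code and the two axes.
struct Diagnostic {
    code: &'static str,
    severity: Severity,
    origin: Origin,
}

/// A strict gate in the spirit of `--strict`: any Error fails the run.
/// The tree itself is still walkable either way.
fn passes_strict(diags: &[Diagnostic]) -> bool {
    !diags.iter().any(|d| d.severity == Severity::Error)
}

/// Codes that point at a library bug rather than at the document.
fn internal_codes(diags: &[Diagnostic]) -> Vec<&'static str> {
    diags
        .iter()
        .filter(|d| d.origin == Origin::Internal)
        .map(|d| d.code)
        .collect()
}
```

Keeping the two checks separate mirrors the orthogonality of the axes: a CI gate can fail on severity while a bug-report prompt keys on origin.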
Source-side codes
aozora::lex::source_contains_pua
Hello, …<U+E001>… world.
A user-supplied codepoint in the range U+E001..U+E004 collides with one of the lexer’s PUA sentinel reservations. The placeholder registry keys on these codepoints, so a bare collision means the classifier could no longer tell user-text occurrences from lexer-inserted markers.
Recovery: the colliding bytes are kept verbatim in the sanitised text — Phase 0 does not delete them. Downstream the character flows through as plain text (the registry has no entry for the position so it is treated as ordinary content). Editors that want to surface the collision visually can match on this code; ordinary HTML rendering is unaffected.
aozora::lex::unclosed_bracket
|青梅《おうめ
An open delimiter (|, 《, [, 〔, 「, …) reached
end-of-input with no matching close on the pairing stack.
Recovery: no PairLink is emitted for the orphaned
opener (Unclosed opens have no partner span and would only
confuse editor highlights). Phase 3 then sees no Aozora construct
covering the unclosed open and degrades the whole region to plain
text — the bytes from the opener to EOF are preserved literally,
just without ruby / annotation classification.
aozora::lex::unmatched_close
》orphaned
A close delimiter saw an empty pairing stack, or its PairKind
mismatched the stack top.
Recovery: the stray close is not matched against any opener;
no PairLink is emitted. The bytes flow through as plain text,
preserving the user’s content; nothing on the stack pops. The
diagnostic span points at the close itself so editors can surface
it without corrupting the document tree.
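The two bracket recoveries above can be modelled with a plain pairing stack. This is a sketch under assumptions, not the lexer's code: `Diag`, `PairLink`, and `pair` are hypothetical names, only four delimiter pairs are handled, and the ruby bar `|` is left out because it has no closer of its own.

```rust
#[derive(Debug, PartialEq)]
enum Diag {
    UnclosedBracket { at: usize },
    UnmatchedClose { at: usize },
}

#[derive(Debug, PartialEq)]
struct PairLink {
    open: usize,  // byte offset of the opener
    close: usize, // byte offset of the closer
}

fn closer_for(open: char) -> Option<char> {
    match open {
        '《' => Some('》'),
        '[' => Some(']'),
        '〔' => Some('〕'),
        '「' => Some('」'),
        _ => None,
    }
}

fn is_closer(c: char) -> bool {
    matches!(c, '》' | ']' | '〕' | '」')
}

/// Pair delimiters without ever failing: bad input only produces diagnostics.
fn pair(source: &str) -> (Vec<PairLink>, Vec<Diag>) {
    let mut stack: Vec<(usize, char)> = Vec::new(); // (opener offset, expected closer)
    let mut links = Vec::new();
    let mut diags = Vec::new();
    for (at, c) in source.char_indices() {
        if let Some(expected) = closer_for(c) {
            stack.push((at, expected));
        } else if is_closer(c) {
            match stack.last().copied() {
                Some((open, expected)) if expected == c => {
                    stack.pop();
                    links.push(PairLink { open, close: at });
                }
                // Empty stack or mismatched kind: report, emit no link, pop nothing.
                _ => diags.push(Diag::UnmatchedClose { at }),
            }
        }
    }
    // Whatever is still open at end-of-input is unclosed; no phantom closer is added.
    diags.extend(stack.into_iter().map(|(at, _)| Diag::UnclosedBracket { at }));
    (links, diags)
}
```

In both failure cases the source bytes are untouched and no link is emitted, so a later classification pass simply sees plain text where the construct would have been.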
Internal codes
Internal-source diagnostics indicate library bugs — production
parses on well-formed input never emit these. They are kept
publicly visible so tooling can distinguish “user input has a
problem” from “the library has a problem”; the parse still
completes best-effort to keep editors usable.
| Code | What broke |
|---|---|
| residual_annotation_marker | An [# digraph survived classification; a recogniser is missing for the contained keyword. |
| unregistered_sentinel | A PUA sentinel is in normalised text without a registry entry. |
| registry_out_of_order | The placeholder-registry vector is not strictly position-sorted. |
| registry_position_mismatch | A registry entry references a normalised position whose codepoint is not the expected sentinel kind. |
Recovery: the parser never acts on internal diagnostics —
the problematic stretch flows through as plain text, the diagnostic
records what was wrong, and Document::parse returns normally.
Reproductions belong in aozora-spec test fixtures so the bug
surface keeps shrinking over releases.
What recovery is not
The parser does not attempt fix-it suggestions. There is no
“did you mean [#ここで字下げ終わり]?” guess; the diagnostic’s
help text describes the symptom, not the cure. Higher-level
tooling (LSPs, editor extensions) is the right place for fix-it
proposals — they have user context the parser does not.
The parser also does not try to synthesise missing tokens. A
truly unclosed bracket stays unclosed in the tree; we don’t insert
a phantom 》 to “balance” it. Synthesising tokens hides the
diagnostic from any caller that walks the AST instead of the
diagnostic list, and turns a fixable user error into a silent
correction.
Cross-references
- Diagnostics catalogue: code-by-code reference, including the [#改ページ]-family directives this page does not cover.
- Architecture → Four-phase lexer: which pipeline phase emits which code.
- Wire format → DiagnosticWire — the JSON shape every binding (FFI, WASM, Python) carries diagnostics over.
tree-sitter reference grammar
aozora ships a tree-sitter grammar at
grammars/aozora.tree-sitter/grammar.js as a reference
implementation alongside the canonical Rust parser. When the two
disagree the Rust parser wins; this grammar exists to plug Aozora
documents into the tree-sitter ecosystem (neovim, helix,
web-tree-sitter / CodeMirror) and to serve as a teaching artefact.
Why a separate grammar at all
The Rust parser is a four-phase pipeline with a hand-rolled classifier; reading it tells you how the canonical implementation works but not what the spec accepts. A declarative grammar is the language community’s preferred form for “what the spec accepts.” Shipping one alongside the parser lets external tooling consume Aozora without binding to the Rust ABI.
What it does cover
The grammar handles bracket structure faithfully:
- |base《reading》 and base《reading》: explicit / implicit ruby
- 《《content》》: double-bracket bouten
- ※[#...]: gaiji marker
- [#...]: generic bracket annotation
- 〔...〕: tortoise-bracket / accent-decomposition span
Plain text — any byte that is not one of the bracket openers —
flows through as a plain_text token, keeping the grammar lossless
against the byte stream.
What it deliberately does not cover
Three classes of behaviour are intentionally out of reach:
- Stateful container pairing. [#ここから2字下げ] matches [#ここで字下げ終わり] across intervening content; a context-free grammar without a hand-written scanner.c cannot close this. Consumers rely on the body content of the bracket annotation to recognise the pairing themselves, or fall back to the Rust parser.
- Forward 「target」に傍点 resolution. The bouten directive walks back through preceding text to bind to a quoted run. The grammar accepts the directive faithfully; the lookup stays the consumer’s job.
- Ruby base disambiguation. When the glyph run preceding 《...》 could extend further, the Rust classifier uses a more nuanced rule. The grammar accepts the greedy base match uniformly.
A scanner.c extension could plug some of these gaps, but doing
so contradicts the declarative-reference framing of the artefact
and would put the canonical-parser-replacement question on the
table prematurely.
Status
The grammar covers approximately 40 % of the canonical parser’s
constructs as measured by an unweighted variant count. The gap to
full coverage is documented; closing it would require a scanner.c
extension, which trades the declarative-reference framing for a
higher ceiling.
Cross-references
- Architecture → Concrete syntax tree — the rowan-backed in-process equivalent.
- Conformance suite: a future xtask conformance run --implementation tree-sitter will run the fixture set against this grammar to compute the per-tier pass rate against must / should / may.
- grammars/aozora.tree-sitter/README.md: build instructions.
Crate map
aozora is a 21-crate workspace. The split exists for three reasons:
narrow each crate’s compile surface (faster cargo check), pin
dependency boundaries (cycles are forbidden by the layout), and let
each binding (CLI, WASM, FFI, Python) compose only the layers it
needs.
At a glance
flowchart TD
subgraph foundation
spec
end
subgraph types
veb
syntax
encoding
scan
end
subgraph parser
pipeline
render
end
subgraph editor
cst
query
end
subgraph integration
pandoc
end
subgraph facade
aozora_facade["aozora"]
end
subgraph bindings
cli
ffi
wasm
py
end
subgraph dev
bench
conformance
corpus
proptest
trace
xtask
end
spec --> veb
spec --> syntax
spec --> encoding
spec --> scan
veb --> pipeline
syntax --> pipeline
encoding --> pipeline
scan --> pipeline
pipeline --> render
render --> aozora_facade
aozora_facade --> cli
aozora_facade --> ffi
aozora_facade --> wasm
aozora_facade --> py
aozora_facade --> bench
pipeline --> cst
cst --> query
syntax --> pandoc
aozora_facade --> conformance
corpus --> bench
proptest --> pipeline
trace --> xtask
Per-crate purpose
Foundation
| Crate | Role |
|---|---|
| aozora-spec | Single source of truth for shared types: Span, Diagnostic, TriggerKind, PairKind, PUA sentinel codepoints, SLUGS dispatch table. No internal dependencies; every other crate may depend on it. |
Types & primitives
| Crate | Role |
|---|---|
| aozora-veb | no_std Eytzinger-layout sorted-set lookup. Cache-friendly binary search for sub-256-entry registries. |
| aozora-syntax | AST node types: AozoraNode<'src>, Container<'src>, Bouten<'src>, Ruby<'src>, …. Borrows from the bumpalo arena. |
| aozora-encoding | Shift_JIS decoding, JIS X 0213 patch, 外字 PHF resolver, accent decomposition. |
| aozora-scan | SIMD-friendly multi-pattern byte scanner (Phase 1’s trigger scan). One of three crates that locally relax unsafe_code, in this case for aligned-load SIMD intrinsics. |
Parser
| Crate | Role |
|---|---|
| aozora-pipeline | Four-phase lexer (sanitize → events → pair → classify) plus the lex_into_arena orchestrator that fuses normalize + registry + diagnostics into a single output walk. |
| aozora-render | HTML and canonical-serialisation walkers. Single O(n) tree pass each; no allocation outside the output buffer. |
Editor-grade surface
| Crate | Role |
|---|---|
| aozora-cst | Lossless rowan-backed concrete syntax tree built as a pure projection over the AST. Powers formatters, LSP folding, source-faithful refactors. |
| aozora-query | Tree-sitter-flavoured pattern DSL over aozora-cst’s SyntaxNode. Selects nodes by SyntaxKind + capture name. |
Integration
| Crate | Role |
|---|---|
| aozora-pandoc | Pandoc AST projection: turns an AozoraTree into pandoc_ast::Pandoc, unlocking 50+ output formats via Pandoc’s writer matrix. |
Facade
| Crate | Role |
|---|---|
| aozora | Public facade. Document::parse() -> AozoraTree<'_>, tree.to_html(), tree.serialize(), tree.diagnostics(). The single import for library consumers. |
Bindings
| Crate | Role |
|---|---|
| aozora-cli | The aozora binary (check / fmt / schema / kinds / explain / pandoc). |
| aozora-ffi | C ABI driver. Opaque handles, JSON-encoded structured data. Locally relaxes unsafe_code; every block carries a // SAFETY: comment. |
| aozora-wasm | wasm32-unknown-unknown target with wasm-bindgen exports. |
| aozora-py | PyO3 binding shipped via maturin. |
Development-only
| Crate | Role |
|---|---|
| aozora-bench | Criterion + corpus-driven probes. Source of the PGO training data. |
| aozora-conformance | WPT-style fixture runner; pins golden HTML / serialise / diagnostics / wire output across 23 fixtures. |
| aozora-corpus | Corpus source abstraction (zstd-archived, blake3-pinned). Dev-only. |
| aozora-proptest | Shared proptest strategies (aozora_fragment, pathological_aozora, unicode_adversarial, xss_payload). Dev-only. |
| aozora-trace | DWARF symbolicator + samply gecko-trace loader. Dev-only. |
| aozora-xtask | Host-side dev tooling (samply wrapper, trace analysis, corpus pack/unpack, schema dumps). Not on the just build path. |
Why 21 crates?
Three concrete wins from the split.
1. Compile latency
A single crate holding the same code would recompile as one unit on any internal change. With the workspace split, a change in the renderer doesn’t touch the lexer, the scanner, or anything else upstream of it, and only the crates that depend on the renderer are rebuilt, so incremental compile times stay sub-second on iteration.
2. No-std reach
aozora-veb and aozora-spec are no_std-clean. aozora-scan is
no_std-clean by default; the SIMD backends opt in to the std
feature for runtime CPU detection. That matters for the wasm32
build (where std is a real cost) and would matter for embedded
targets if anyone ever needed one. Keeping them in dedicated crates
enforces the no_std discipline at the crate-graph level —
adding a std import would require depending on a std-using
crate, which is a visible Cargo.toml change.
3. Binding modularity
The C ABI driver (aozora-ffi) needs aozora + serde and nothing
else. It does not pull in the bench harness, the trace loader, or
the corpus crate. The wasm driver is similarly minimal. Each
binding’s dependency closure is exactly what it needs — which is
what keeps the wasm bundle inside its 500 KiB budget.
What we deliberately don’t split
A few things stay co-located despite plausible split points:
- HTML render and canonical serialise in aozora-render. Both are tree walkers; sharing the visitor helper between them keeps the implementation small.
- Phase 0 sanitize sub-passes in aozora-pipeline. Each sub-pass is < 100 LOC and operates on the same &str slice; pulling them out would create a 5-crate ecosystem for a transformation that’s conceptually one phase.
- Trigger-byte enum and pair-kind enum in aozora-spec. They’re used by both aozora-scan (which produces them) and aozora-pipeline (which consumes them); putting them in spec avoids a back-reference.
Splits aren’t free — every additional crate adds a Cargo.toml, a
README, doc-link reachability, and a test surface. Splits land when
the cohesion benefit (one of the three above) is real.
See also
- Pipeline overview
- Borrowed-arena AST
- Reference → API — generated rustdoc for the public surface.
Choosing a binding
aozora reaches a lot of languages, but there is only one parser behind
them. Every surface — the Rust library, the CLI, the wasm package, the
PyO3 module, the Go module, the C ABI, the Extism plugin — funnels the same
source text through the same lexer and renders it through the same
aozora::wire authority. The HTML, the canonical
serialise, and the diagnostic stream are therefore byte-identical across
every binding. What differs between them is only the host language you
write in and the overhead you pay to cross the language boundary.
So the decision is not “which binding is more correct” — they all produce the same bytes. It is “which one fits the language and runtime I already have, at the cost I’m willing to pay.”
Decision table
Find the row that describes you; the rest of the page explains the trade-offs behind it.
| You are… | Use | Why | Distribution |
|---|---|---|---|
| Writing Rust | umbrella aozora library | Zero-copy borrowed AST, full type safety, the fastest path. No serialise. | crates.io¹ |
| At a shell / in CI / scripting | the aozora binary | check / render / fmt / pandoc, reads stdin, exits with a code. | GitHub release |
| In the browser, Node, or TypeScript | aozora-wasm | wasm-bindgen Document class; runs client-side and at the edge. | npm |
| Writing Python | aozora-py (PyO3) | In-process native module via maturin; idiomatic Python API. | build-from-source² |
| Writing Go | aozora-go | Pure-Go wazero host — no cgo, no C toolchain. | go get |
| Embedding from C / C++ / another native FFI | aozora-ffi C ABI | Opaque handle + JSON over a stable C header; link it like any library. | GitHub release |
| Writing Java, PHP, Ruby, or the long tail | aozora-extism host SDK | One portable aozora.wasm loaded by any Extism SDK. | GitHub release |
| Producing anything other than HTML (EPUB, LaTeX/PDF, DOCX, …) | aozora pandoc | Projects to the Pandoc AST; 50+ output formats via Pandoc writers. | GitHub release (CLI) |
¹ crates.io publication tracks the v1.0 API freeze; until then the git-tag form in the install chapter is the canonical entry point. ² PyPI wheels are pending; pre-1.0 the Python binding builds from source via maturin.
In-process vs host-runtime
The bindings fall into two camps, and the split is the single most useful lens for choosing.
In-process / native — the Rust library, aozora-py, aozora-wasm, and
the C ABI. The parser runs inside your process’s address space. Overhead is
zero (Rust) to low (a string copy and a JSON projection at the boundary for
the others). The cost is on our side: each of these is a native artifact
that has to be built and published for every (OS × arch) pair the
ecosystem expects.
Host-runtime — aozora-extism. The parser is a single portable
aozora.wasm, and your language loads it through its
Extism host SDK. You pay a JSON round-trip at the
wasm boundary (text in, a versioned JSON envelope out), and your host gains
one runtime dependency (the Extism runtime). In exchange, we do not
maintain a native build matrix for your language — the same wasm bytes
load identically on every platform. This is the deliberate breadth strategy
for the long tail of languages where writing and shipping a bespoke native
binding is the real cost. See
ADR-0006
for the full reasoning.
Because every non-Rust binding already presents an interface of “string → JSON envelope”, the serialization round-trip Extism adds is intrinsic to the shape of the problem, not a new tax. The in-process bindings pay the same projection without leaving the address space; only the Rust library skips it entirely.
By language
A quick jump list:
- Rust → the aozora umbrella library.
- JavaScript / TypeScript (browser, Node, Deno, edge) → aozora-wasm.
- Python → aozora-py.
- Go → aozora-go.
- C / C++ / Zig / any FFI-capable native language → the aozora-ffi C ABI.
- Java, PHP, Ruby, .NET, Elixir, Haskell, … → the aozora-extism host SDK.
- None of the above / shell / CI → the aozora CLI binary.
By output format
aozora’s renderer emits semantic HTML5. The decision here is binary:
- You want HTML. Use the built-in renderer: to_html() in any library binding, or aozora render at the CLI. It is the canonical output and what the conformance suite gates on.
- You want anything else (EPUB, LaTeX/PDF, DOCX, ODT, MediaWiki, and ~50 more): use aozora pandoc. It projects the parsed tree into the Pandoc AST, where every Pandoc writer is one pipe away. Adding a new format means adding a Pandoc filter, never extending the parser.
A note on performance
If raw throughput is the deciding factor, the ordering is:
- Rust, borrowed-arena. The library hands you AozoraNodes that borrow directly from the bumpalo arena: no copies, no serialise, no JSON. Nothing is faster.
- In-process native bindings (aozora-py, aozora-wasm, C ABI). One string copy in, one JSON projection out, but all in-process. Low, constant overhead.
- Extism. A wasm-boundary JSON round-trip on top of the in-process cost. The slowest of the three transports, and still the right choice when the alternative is no binding for your language at all.
For the overwhelming majority of documents this difference is invisible against I/O. Reach for the Rust library’s borrowed AST only when you are parsing at scale (the corpus sweep over ~17 000 works is the motivating case); otherwise pick the binding that fits your language and let the constant overhead disappear into the noise.
See also
- Rust library — the first-class, zero-copy binding.
- WASM (wasm-pack) — browser / Node / edge.
- Python (PyO3 / maturin) — the native Python module.
- Go — pure-Go wazero host, no cgo.
- C ABI — opaque handle over a stable C header.
- Extism plugin — one wasm for the polyglot long tail.
- Pandoc AST projection — every non-HTML output format.
- Wire format — the shared JSON envelope every binding agrees on.
Rust library
The first-class binding. Full type safety, zero copy, and the borrowed-arena AST exposed directly.
Adding to a project
The recommended Cargo.toml snippet (with the current release tag)
lives in the install chapter.
Keeping the pin in one place avoids drift between this doc and the
install page when a new release lands.
crates.io publication tracks the v1.0 API freeze; until then, the git tag form documented there is the canonical entry point.
Surface
The public surface is small by design: three types and a handful of methods cover everything:
pub struct Document { /* opaque */ }

impl Document {
    pub fn new(source: String) -> Self;
    pub fn parse(&self) -> AozoraTree<'_>;
    pub fn source(&self) -> &str;
}

pub struct AozoraTree<'a> { /* borrows from Document */ }

impl<'a> AozoraTree<'a> {
    pub fn nodes(&self) -> impl Iterator<Item = AozoraNode<'a>>;
    pub fn to_html(&self) -> String;
    pub fn serialize(&self) -> String;
    pub fn diagnostics(&self) -> &[Diagnostic];
}

pub enum AozoraNode<'src> { Plain(&'src str), Ruby(Ruby<'src>), … }
See Library Quickstart for the walk-through.
Feature flags
aozora exposes one optional feature:
| Feature | Default | What it enables |
|---|---|---|
| serde | off | serde::Serialize / Deserialize impls on AozoraNode, Diagnostic, Span. Useful for downstream tools that need to ship the AST over a wire. |
The default-off policy keeps a plain build of the aozora crate slim: the JSON
encoders that the bindings need live in the bindings themselves
(aozora-ffi, aozora-wasm, aozora-py), not in the core crate.
Error handling
Three philosophies, used consistently:
- Diagnostics are not errors. Document::parse() always returns an AozoraTree<'_>. Per-input diagnostics live in tree.diagnostics(). Callers decide whether to treat any diagnostic as fatal.
- Decoding is fallible. aozora_encoding::sjis::decode_to_string returns Result<Cow<str>, DecodeError>. Malformed Shift_JIS is the one place a function actually fails; the parser proper assumes UTF-8.
- Panics are bugs. No .unwrap() on user-data paths in non-test code; clippy's unwrap_used and expect_used lints are set to warn workspace-wide. If you ever see a panic in aozora::*, file a bug.
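Because parsing never fails, any strictness is a policy the caller applies afterwards. The sketch below shows one way such a policy could look. Level, Diagnostic, and gate are stand-in names defined locally for illustration, not aozora's real types; only the three level names and the example diagnostic code come from this handbook.

```rust
// Stand-in types for illustration only; the real `Diagnostic` lives in aozora.
// Variant order matters: the derived `Ord` gives Info < Warning < Error.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Level {
    Info,
    Warning,
    Error,
}

#[derive(Debug)]
struct Diagnostic {
    code: &'static str,
    level: Level,
}

/// Returns `Err` with the offending codes if any diagnostic is at or above
/// `threshold`; otherwise the parse result is considered acceptable.
fn gate(diags: &[Diagnostic], threshold: Level) -> Result<(), Vec<&'static str>> {
    let fatal: Vec<&'static str> = diags
        .iter()
        .filter(|d| d.level >= threshold)
        .map(|d| d.code)
        .collect();
    if fatal.is_empty() { Ok(()) } else { Err(fatal) }
}

fn main() {
    let diags = [Diagnostic {
        code: "aozora::lex::unresolved_gaiji",
        level: Level::Warning,
    }];
    // A lenient caller only rejects hard errors.
    assert!(gate(&diags, Level::Error).is_ok());
    // A strict caller (say, a CI lint) also rejects warnings.
    assert!(gate(&diags, Level::Warning).is_err());
    // Informational entries pass under both policies.
    let notes = [Diagnostic { code: "example::note", level: Level::Info }];
    assert!(gate(&notes, Level::Warning).is_ok());
}
```

The same parse result can feed both a lenient renderer and a strict linter; only the threshold differs.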
Thread safety
Document is Send but not Sync — the bumpalo arena does not
support concurrent allocation. Pass a Document between threads
freely; do not share &Document across threads.
AozoraTree<'_> borrows from &Document, so by Rust’s lifetime
rules the same shape applies: a &AozoraTree is Send + Sync (it’s
just & to immutable data), but it can’t outlive its Document.
For parallel corpus processing (e.g. the corpus sweep harness
parsing 1000s of documents concurrently), each thread creates its
own Document from its own source. The arena resets per-Document,
so there’s no contention point.
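The per-thread pattern can be sketched with the standard library alone. Doc below is a stand-in, not aozora's Document: its Cell field models non-atomic arena state, which makes the type Send but not Sync. The sweep function and its thread-per-source strategy are assumptions for illustration; a real sweep over thousands of works would more likely use a bounded worker pool.

```rust
use std::cell::Cell;
use std::thread;

// Stand-in for a `Send + !Sync` document. The `Cell` models the arena's
// non-atomic bump state; sharing `&Doc` across threads would not compile.
struct Doc {
    source: String,
    bumps: Cell<usize>,
}

impl Doc {
    fn new(source: String) -> Self {
        Doc { source, bumps: Cell::new(0) }
    }

    // Placeholder for "parse and render": mutates the interior state and
    // returns the number of characters in the source.
    fn render_len(&self) -> usize {
        self.bumps.set(self.bumps.get() + 1);
        self.source.chars().count()
    }
}

/// Each worker builds its own `Doc` from its own source, so no state is shared.
/// Results come back in input order because handles are joined in order.
fn sweep(sources: Vec<String>) -> Vec<usize> {
    let handles: Vec<_> = sources
        .into_iter()
        .map(|src| thread::spawn(move || Doc::new(src).render_len()))
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("worker panicked"))
        .collect()
}

fn main() {
    let lens = sweep(vec!["青梅".to_string(), "街道を".to_string()]);
    println!("{lens:?}");
}
```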
MSRV policy
aozora pins Rust 1.95.0. The MSRV advances roughly once per
quarter, when a new stable feature is needed and the workspace
moves to it. The msrv job in CI gates every PR; Dependabot is
configured to not auto-bump the MSRV pin (manual decision).
Public API stability
Pre-1.0: minor-version bumps may break the API. cargo-semver-checks
runs in CI to catch unintentional breakage between releases, so
vX.Y.* patch bumps are always safe; only a minor bump
(vX.Y.* → vX.Y+1.*) opens the door for breaks. The current pin to
track lives in the install chapter.
Post-1.0 (planned): semver discipline. Breaking changes accumulate
on a next branch and ship in a major bump.
See also
- Library Quickstart
- Borrowed-arena AST — the lifetime model.
- Reference → API — generated rustdoc.
WASM (wasm-pack)
The aozora-wasm crate compiles to wasm32-unknown-unknown and
exposes a Document class via wasm-bindgen. The wasm artifact has
a hard 500 KiB size budget after wasm-opt -O3 — measured on every
release.
Build
rustup target add wasm32-unknown-unknown # one-time
wasm-pack build --target web --release crates/aozora-wasm
Outputs land at crates/aozora-wasm/pkg/:
- aozora_wasm_bg.wasm: the binary module
- aozora_wasm.js: the wasm-bindgen JS shim
- aozora_wasm.d.ts: TypeScript types
- package.json: minimal npm-publishable metadata
Why wasm-opt = false in Cargo.toml?
wasm-pack ships its own bundled wasm-opt (via the binaryen crate)
which lags upstream. Recent Rust releases emit bulk-memory opcodes
(memory.copy, memory.fill) that the bundled wasm-opt mishandles
on -O3, occasionally producing artifacts that crash on init. We
disable the bundled run and recommend a fresh wasm-opt invocation
externally:
wasm-opt -O3 \
--enable-bulk-memory \
--enable-mutable-globals \
crates/aozora-wasm/pkg/aozora_wasm_bg.wasm \
-o crates/aozora-wasm/pkg/aozora_wasm_bg.wasm
The post-wasm-opt artifact has a 500 KiB size budget. CI gates on
this number — exceeding it is a release-blocking regression.
Usage
import init, { Document } from "./pkg/aozora_wasm.js";
await init(); // load the .wasm
const doc = new Document("|青梅《おうめ》");
const html = doc.to_html();
const canonical = doc.serialize();
const diagnostics = JSON.parse(doc.diagnostics_json());
console.log(html);
doc.free(); // release the bumpalo arena
In TypeScript, the .d.ts file gives you full type checking on
every method.
API surface
| Method | Returns | Notes |
|---|---|---|
| new Document(source: string) | Document | Copies the JS string into a Rust Box<str>. |
| to_html() | string | Renders to semantic HTML5 with aozora-* class hooks. |
| serialize() | string | Re-emits canonical 青空文庫 source. |
| diagnostics_json() | string | JSON-encoded array of diagnostic objects. |
| source_byte_len() | number | Source byte length, useful for progress UI. |
| free() | (none) | Explicit drop; otherwise the JS GC eventually releases. |
The diagnostics JSON shape mirrors aozora-ffi’s C ABI:
interface Diagnostic {
    code: string; // "aozora::lex::unresolved_gaiji", …
    level: "error" | "warning" | "info";
    message: string;
    span: { start: number; end: number };
    help?: string;
}
Why a hand-written JSON projection over serde-wasm-bindgen?
serde-wasm-bindgen would let us pass the Diagnostic directly to
JS as a structured object — no JSON round-trip needed. We don’t use
it because:
- It pulls in a meaningful chunk of serde_json machinery that bloats the wasm bundle by ~80 KiB.
- The wire format ({ code: "aozora::lex::unresolved_gaiji", level: "warning", … }) is exactly what every JS consumer is going to deserialise into anyway.
- It would force a serde::Serialize derivation on every diagnostic-related type in aozora-spec, which the Rust library consumers don't otherwise need (they take &[Diagnostic] directly).
A small, hand-written JSON emitter (one core::fmt::Write impl, ~60
LOC) costs nothing and keeps the bundle small.
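To make the size argument concrete, here is a minimal sketch of what a hand-written emitter for the Diagnostic wire shape could look like. It is not aozora's actual emitter: the local Diagnostic struct and the function names are assumptions, and the code writes through a generic fmt::Write bound rather than reproducing the real implementation.

```rust
use std::fmt::{self, Write};

// Local stand-in mirroring the wire fields; not the real aozora type.
struct Diagnostic<'a> {
    code: &'a str,
    level: &'a str,
    message: &'a str,
    start: u32,
    end: u32,
    help: Option<&'a str>,
}

/// Writes `s` as a JSON string literal, escaping quotes, backslashes,
/// and control characters.
fn write_json_str<W: Write>(out: &mut W, s: &str) -> fmt::Result {
    out.write_char('"')?;
    for c in s.chars() {
        match c {
            '"' => out.write_str("\\\"")?,
            '\\' => out.write_str("\\\\")?,
            '\n' => out.write_str("\\n")?,
            '\r' => out.write_str("\\r")?,
            '\t' => out.write_str("\\t")?,
            c if (c as u32) < 0x20 => write!(out, "\\u{:04x}", c as u32)?,
            c => out.write_char(c)?,
        }
    }
    out.write_char('"')
}

/// Writes one diagnostic object; `help` is omitted entirely when absent.
fn write_diagnostic<W: Write>(out: &mut W, d: &Diagnostic<'_>) -> fmt::Result {
    out.write_str("{\"code\":")?;
    write_json_str(out, d.code)?;
    out.write_str(",\"level\":")?;
    write_json_str(out, d.level)?;
    out.write_str(",\"message\":")?;
    write_json_str(out, d.message)?;
    write!(out, ",\"span\":{{\"start\":{},\"end\":{}}}", d.start, d.end)?;
    if let Some(help) = d.help {
        out.write_str(",\"help\":")?;
        write_json_str(out, help)?;
    }
    out.write_char('}')
}

/// Emits the whole array as one JSON string.
fn diagnostics_json(diags: &[Diagnostic<'_>]) -> String {
    let mut out = String::from("[");
    for (i, d) in diags.iter().enumerate() {
        if i > 0 {
            out.push(',');
        }
        write_diagnostic(&mut out, d).expect("writing to a String cannot fail");
    }
    out.push(']');
    out
}

fn main() {
    let d = Diagnostic {
        code: "aozora::lex::unresolved_gaiji",
        level: "warning",
        message: "gaiji reference could not be resolved",
        start: 0,
        end: 3,
        help: None,
    };
    println!("{}", diagnostics_json(&[d]));
}
```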
Why Document.free() and not just GC?
wasm-bindgen does wire Drop to a JS finalizer, but JS finalizers
fire on the GC’s schedule — which can be minutes after the last
reference goes out of scope, especially on Node.js where the GC
batches aggressively. For large documents this means the bumpalo
arena (potentially several MB) sits unreleased.
Explicit .free() is the same idiom every wasm-bindgen library
exposes for resource-heavy types. Consumers that want JS-native
ergonomics wrap the class in their own using (TC39 stage-3 explicit
resource management) helper.
Browser support
Tier-1 (CI-tested):
- Chrome 110+
- Firefox 110+
- Safari 16+
Tier-2 (works, not in CI):
- Node.js 18+ (use --target nodejs in wasm-pack build)
- Deno 1.30+
The bundle uses bulk-memory and mutable-globals; both have been universally supported since 2021.
Why wasm at all?
The CLI and the Rust library cover Linux / macOS / Windows native; the wasm build covers everywhere else — particularly:
- Browser-side preview / formatter for a 青空文庫 LSP front-end.
- Cloudflare Workers / Vercel Edge / Deno Deploy serverless rendering.
- Notebook environments (Jupyter via pyodide, Observable, Quarto).
The same parser, same diagnostics, same canonical-serialise — across every wasm-runtime host.
See also
- Install
- Architecture → SIMD scanner backends — the wasm32 scanner backend.
Python (PyO3 / maturin)
The aozora-py crate is a PyO3 binding shipped
via maturin.
Install
pip install maturin # one-time
cd crates/aozora-py
maturin develop -F extension-module # install in current venv
# or
maturin build -F extension-module --release # produce a redistributable wheel
The extension-module feature gates the PyO3 import-side machinery
behind a flag, so a plain cargo build --workspace succeeds without
Python development headers installed. CI has both modes covered.
Minimal Python usage
from aozora_py import Document
doc = Document("|青梅《おうめ》")
print(doc.to_html()) # <ruby>青梅<rt>おうめ</rt></ruby>
print(doc.serialize()) # |青梅《おうめ》
print(doc.diagnostics()) # JSON-encoded list of diagnostic dicts
API surface
| Method | Returns | Notes |
|---|---|---|
| Document(source: str) | Document | The constructor copies source into a Rust Box<str>. |
| to_html() -> str | str | Renders to semantic HTML5 with aozora-* class hooks. |
| serialize() -> str | str | Re-emits canonical 青空文庫 source. |
| diagnostics() -> str | str | JSON-encoded list (same schema as the WASM and FFI bindings). |
| source_byte_len() -> int | int | Source byte length. |
The diagnostics JSON shape is shared across every binding — see Bindings → WASM for the schema.
Thread safety: unsendable
The Document type is marked unsendable (PyO3 marker) because
the underlying bumpalo arena uses interior Cell state. Concurrent
access from another Python thread raises a RuntimeError:
import threading
from aozora_py import Document
doc = Document(open("src.txt").read())
def worker(): doc.to_html() # raises RuntimeError on second thread
threading.Thread(target=worker).start() # boom
For parallel corpus processing, create a Document per thread.
The arena resets per-Document, so there’s no contention point;
each thread allocates from its own arena.
Why not Send?
By default PyO3 requires a #[pyclass] type to be Send, so that any Python
thread may touch it; unsendable is the opt-out. We don't make Document
sendable because:
- Arena correctness. bumpalo::Bump is !Sync: the per-page allocator state isn't atomic. Exposing it to arbitrary Python threads would require a mutex around every allocation, which is the cost we designed the arena to avoid in the first place.
- GIL semantics. Python threads share the GIL; "concurrent" in the Python sense is rarely actually parallel. The unsendable marker turns the misuse case into a loud RuntimeError instead of a silent data race.
- Multiprocessing path. The right answer for parallel corpus work is multiprocessing (one Document per process; the arenas are independent by construction). The unsendable marker nudges users toward this.
Why JSON-encoded diagnostics?
Same reason as the WASM binding:
- The wire shape is stable across every binding.
- Avoids forcing a pyclass declaration on every diagnostic-related type.
- Downstream Python consumers json.loads() once and work with native dicts, with no second translation.
The diagnostics() method returns a str, not a list[dict], so
the json.loads is visible to the caller. Hiding it behind a
PyO3 Vec<PyDict> mapping would silently allocate one Python
object per diagnostic per call.
Wheel distribution
aozora_py is on PyPI (since v0.4.1):
pip install aozora_py
To build a wheel from a checkout instead:
maturin build -F extension-module --release # → target/wheels/*.whl
pip install target/wheels/aozora_py-*.whl
Release wheels are built in CI with maturin for every supported
(python, target) combination — the mainstream path for PyO3
projects.
See also
- Install → Python
- Bindings → C ABI — same diagnostics JSON shape.
- PyO3 user guide — the binding framework.
Go (wazero host SDK)
The Go binding is a host SDK over the portable aozora.wasm Extism
plugin, run through the pure-Go wazero runtime.
There is no cgo and no native libextism to link — go get is the
whole install:
go get github.com/P4suta/aozora-go
It is one spoke of aozora’s polyglot binding strategy: rather than a
hand-written native binding per language, every non-Rust front door
funnels through the same aozora.wasm bytes and the same aozora::wire
authority. See Choosing a binding for when to reach for Go
versus the native C ABI or the in-process Rust library,
and Extism plugin for the wasm artifact this SDK loads. The
rationale for the whole approach is recorded in ADR-0006 (linked from the
crate README).
Install & quickstart
package main

import (
	"context"
	"fmt"

	aozora "github.com/P4suta/aozora-go"
)

func main() {
	ctx := context.Background()

	// Instantiate the plugin once and reuse the Parser across calls.
	p, err := aozora.Open(ctx)
	if err != nil {
		panic(err)
	}
	defer p.Close(ctx)

	html, err := p.ToHTML("|青梅《おうめ》")
	if err != nil {
		panic(err)
	}
	fmt.Println(html) // <ruby>青梅<rt>おうめ</rt></ruby>

	nodes, err := p.Nodes("|青梅《おうめ》")
	if err != nil {
		panic(err)
	}
	for _, n := range nodes.Data {
		fmt.Printf("%s @ [%d,%d)\n", n.Kind, n.Span.Start, n.Span.End)
	}
}
Open(ctx) instantiates the plugin once; reuse the returned Parser
across calls and Close(ctx) it when done. Beyond ToHTML and Nodes,
a Parser exposes Serialize, Diagnostics, Pairs, and
ContainerPairs — each returning the matching wire envelope decoded into
the generated Go types:
| Method | Returns | Notes |
|---|---|---|
| ToHTML(src) | string | Semantic HTML5 with aozora-* class hooks. |
| Serialize(src) | wire envelope | Canonical 青空文庫 source round-trip. |
| Nodes(src) | wire envelope | Borrowed-AST nodes with Kind + Span. |
| Diagnostics(src) | wire envelope | Same diagnostic schema as every other binding (see WASM → API surface). |
| Pairs(src) | wire envelope | Matched ruby / bracket / quote pairs. |
| ContainerPairs(src) | wire envelope | Matched indent / align-end container pairs. |
Concurrency
A Parser is not safe for concurrent use — the underlying Extism
instance carries per-call wasm linear-memory state. Open one Parser per
goroutine (each Open is independent), or guard a shared one behind your
own mutex. For parallel corpus processing the per-goroutine pattern is the
intended one; instances do not contend.
How it works
Open(ctx) loads the embedded aozora.wasm plugin into a fresh wazero
runtime. Every method serialises its argument, calls the corresponding
plugin export, and decodes the JSON envelope into a Go type. Those wire
types live in wire_gen.go and are generated by just types-langs
(quicktype, fed from the wire JSON Schema) — they are not hand-maintained,
so they cannot drift from the Rust aozora::wire definitions. Because the
plugin bytes and the wire schema are shared, the Go output is
byte-identical to the Rust, WASM, Python, and C-ABI front doors:
same HTML, same canonical serialisation, same diagnostics.
Building / contributing
From the aozora workspace root, just smoke-go builds the plugin
(just extism-build), embeds the resulting aozora.wasm into the module,
and runs gofmt + go vet + go test. The aozora.wasm artifact is
git-ignored locally and dropped in by that target or by the release
workflow; wire_gen.go is regenerated by just types-langs and must not
be edited by hand.
Reference
- aozora-go README: the canonical, deeper reference for the module layout, generated files, and the ADR-0006 link.
- Choosing a binding: Go vs. the other front doors.
- Extism plugin: the aozora.wasm artifact this SDK drives.
- WASM → API surface: the shared diagnostics JSON schema.
C ABI
The aozora-ffi crate compiles to a cdylib + staticlib. The API
is opaque-handle + JSON-encoded structured data — the C side never
sees a Rust type, just opaque pointers and byte buffers.
Build
cargo build --release -p aozora-ffi
# → target/release/libaozora_ffi.{so,dylib,a}
# → target/release/aozora.h (cbindgen-generated)
The build script regenerates aozora.h automatically. After build,
the header lands at:
- target/release/aozora.h: host-side convenience copy
- $OUT_DIR/aozora.h: cargo build-script standard location
#include "aozora.h" and link with -laozora_ffi.
Smoke test
just smoke-ffi
Builds the cdylib, compiles crates/aozora-ffi/tests/c_smoke/smoke.c
against it, runs it end-to-end. CI runs this on every PR — if the
ABI shape changes accidentally, the smoke test fails before the PR
merges.
Minimal C usage
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include "aozora.h"

int main(void) {
    const char *src = "|青梅《おうめ》";

    /* Build the document handle; a non-zero status means failure. */
    AozoraDocument *doc = NULL;
    if (aozora_document_new((const uint8_t *)src, strlen(src), &doc) != 0)
        return 1;

    /* Render to HTML; release the handle on the error path too. */
    AozoraBytes html = {0};
    if (aozora_document_to_html(doc, &html) != 0) {
        aozora_document_free(doc);
        return 1;
    }
    fwrite(html.ptr, 1, html.len, stdout);

    /* Every returned buffer and handle has a matching _free call. */
    aozora_bytes_free(&html);
    aozora_document_free(doc);
    return 0;
}
API surface
typedef struct AozoraDocument AozoraDocument;

typedef struct {
    uint8_t *ptr;
    size_t len;
    size_t cap;
} AozoraBytes;

extern int32_t aozora_document_new(const uint8_t *src, size_t src_len,
                                   AozoraDocument **out_doc);
extern int32_t aozora_document_to_html(const AozoraDocument *doc,
                                       AozoraBytes *out_html);
extern int32_t aozora_document_serialize(const AozoraDocument *doc,
                                         AozoraBytes *out_canonical);
extern int32_t aozora_document_diagnostics_json(const AozoraDocument *doc,
                                                AozoraBytes *out_json);
extern void aozora_bytes_free(AozoraBytes *bytes);
extern void aozora_document_free(AozoraDocument *doc);
Status codes
| Code | Meaning |
|---|---|
| 0 | Ok |
| -1 | Null input pointer |
| -2 | Input was not valid UTF-8 |
| -3 | Allocation failed |
| -4 | Internal serialisation error |
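A host that wraps the C ABI will usually want these integers as a typed value. The sketch below shows one way a Rust-side wrapper might mirror the table; the Status enum, its variant names, and check are hypothetical and are not part of aozora.h. Codes outside the table are preserved instead of being guessed at.

```rust
/// Hypothetical mirror of the status table; not part of the generated header.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Ok,
    NullInput,
    InvalidUtf8,
    AllocationFailed,
    Serialisation,
    /// Any code the table does not list, kept verbatim for reporting.
    Unknown(i32),
}

impl From<i32> for Status {
    fn from(code: i32) -> Self {
        match code {
            0 => Status::Ok,
            -1 => Status::NullInput,
            -2 => Status::InvalidUtf8,
            -3 => Status::AllocationFailed,
            -4 => Status::Serialisation,
            other => Status::Unknown(other),
        }
    }
}

/// Turns a raw return code into a `Result`, so callers can use `?`.
fn check(code: i32) -> Result<(), Status> {
    match Status::from(code) {
        Status::Ok => Ok(()),
        err => Err(err),
    }
}

fn main() {
    assert_eq!(check(0), Ok(()));
    assert_eq!(check(-2), Err(Status::InvalidUtf8));
    println!("{:?}", Status::from(-3));
}
```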
Memory ownership
Every pointer or AozoraBytes returned by an aozora_* function
must be released by the matching _free call:
| Returned by | Free with |
|---|---|
| aozora_document_new (AozoraDocument *) | aozora_document_free |
| aozora_document_to_html (AozoraBytes) | aozora_bytes_free |
| aozora_document_serialize (AozoraBytes) | aozora_bytes_free |
| aozora_document_diagnostics_json (AozoraBytes) | aozora_bytes_free |
Dropping a handle without _free leaks; freeing then dereferencing
is undefined behaviour. This is the standard ABI contract — any
unsafe { Box::from_raw(...) } mistake on the consumer side
trips both ASan and miri (both run in CI on the FFI test suite).
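The ownership rule follows from how a (ptr, len, cap) triple is typically produced on the Rust side: a Vec<u8> is taken apart without being dropped, and the matching free call reassembles and drops it. The sketch below illustrates that general technique with local names (Bytes, into_bytes, bytes_free); it is an assumption about the shape of such code, not aozora-ffi's actual implementation. Nulling the struct after the free is a choice made here so that a second call is a no-op.

```rust
use std::mem::ManuallyDrop;
use std::ptr;

// Same field layout as the header's byte-buffer struct; the name is local.
#[repr(C)]
struct Bytes {
    ptr: *mut u8,
    len: usize,
    cap: usize,
}

/// Hands a buffer to the caller. `ManuallyDrop` stops Rust from freeing it,
/// so from here on the caller owns the allocation.
fn into_bytes(v: Vec<u8>) -> Bytes {
    let mut v = ManuallyDrop::new(v);
    Bytes { ptr: v.as_mut_ptr(), len: v.len(), cap: v.capacity() }
}

/// Reassembles the `Vec` and drops it, then nulls the struct.
///
/// # Safety
/// `b` must have been produced by `into_bytes` and its fields must be unmodified.
unsafe fn bytes_free(b: &mut Bytes) {
    if !b.ptr.is_null() {
        // SAFETY: the three parts came from a single `Vec<u8>` (caller contract).
        drop(unsafe { Vec::from_raw_parts(b.ptr, b.len, b.cap) });
        *b = Bytes { ptr: ptr::null_mut(), len: 0, cap: 0 };
    }
}

fn main() {
    let mut html = into_bytes(b"<ruby>...</ruby>".to_vec());
    {
        // Read the bytes while the buffer is still alive.
        let view = unsafe { std::slice::from_raw_parts(html.ptr, html.len) };
        println!("{}", String::from_utf8_lossy(view));
    }
    unsafe { bytes_free(&mut html) };
    assert!(html.ptr.is_null());
}
```

Freeing with anything other than the matching function (for example, the C library's free) would hand the pointer to the wrong allocator, which is why the table pairs each producer with exactly one release call.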
Why JSON for diagnostics, not a C struct?
Three reasons.
- Variant types. Diagnostic has optional fields (help, sometimes a multi-span). A flat C struct would either lose data or grow nullable pointers everywhere. JSON expresses optionality naturally.
- Schema stability. Adding a new diagnostic field is a backward-compatible JSON change. Adding a field to a C struct breaks every consumer that compiled against the old size.
- Single emitter. The same JSON shape is produced by aozora-wasm (consumed by JS) and aozora-py (consumed by Python). Aligning the C ABI on the same shape means downstream polyglot consumers don't translate between three different schemas.
The cost is one serde_json::to_string call per
aozora_document_diagnostics_json invocation — a one-shot O(N)
allocation that is a rounding error compared to the parse itself.
Why opaque handle + bytes, not a flat C struct projection?
A flat C struct projection of AozoraTree would require:
- Naming every Rust enum variant in C (not supported cleanly via cbindgen for tagged unions).
- Translating the bumpalo arena into a malloc-backed block contiguous with the tree (which means copying the tree out).
- Pinning the AST shape across the C ABI: internal refactors (e.g. adding a new AozoraNode variant) would break ABI without warning.
The opaque-handle approach keeps the AST entirely Rust-side. C consumers ask for HTML, canonical text, or JSON-encoded diagnostics — three stable shapes that don’t change with internal refactors.
Use from Go / Zig / Nim
Anything with a C FFI. The aozora.h header is plain C99 — no
inline functions, no macros that depend on a compiler-specific
extension, no #pragma. Tested in CI by the smoke test against
gcc, clang, and msvc.
See also
- Install → C ABI
- Bindings → WASM — same JSON diagnostics shape.
Extism host SDKs (Java / PHP / Ruby / … the polyglot tail)
The aozora-extism crate compiles to one portable
wasm32-unknown-unknown artifact — aozora.wasm — that any language
with an Extism host SDK can load. The bytes are
identical on every platform, so there is no per-(OS × arch) native
build to produce, sign, and publish: a Java, PHP, or Ruby host loads
the same wasm a Go host does.
This is the breadth strategy for new languages (ADR-0006). The native bindings stay where they already pay their way — Python (PyO3) and the browser WASM (wasm-bindgen) are in-process and faster — and the C ABI remains for max-performance embedders willing to ship a native library per platform. Extism covers everyone else with a single artifact and mechanically generated types.
The contract is the same “text in → bytes out” waist as the C ABI: each
export takes the Aozora source as input bytes and returns either HTML, a
round-tripped source string, or a versioned JSON envelope. Every JSON
path delegates to aozora::wire — the single
cross-driver authority — so the output is byte-identical to the C ABI,
the browser WASM, and the PyO3 drivers.
The plugin contract
aozora.wasm exports seven #[plugin_fn] entry points. Each takes the
source text as input and returns a string:
| Export | Input | Returns | Shape |
|---|---|---|---|
| to_html | source | string | Semantic HTML5 with aozora-* class hooks. |
| serialize | source | string | Canonical 青空文庫 source (round-trip). |
| diagnostics_json | source | string | Wire envelope of diagnostics. |
| nodes_json | source | string | Wire envelope of source-keyed nodes. |
| pairs_json | source | string | Wire envelope of matched open/close pairs. |
| container_pairs_json | source | string | Wire envelope of container open/close pairs. |
| schema_version | (ignored) | string | The wire schema version as a decimal string. |
The four *_json exports each emit the standard wire envelope
{ "schema_version": 1, "data": [ /* … entries … */ ] }
The per-endpoint data entry shapes — and the committed JSON Schema for
each — are documented in the Wire format chapter.
to_html and serialize return a bare string (no envelope), and
schema_version returns just the integer rendered as text (e.g.
"1"); it ignores its input, so a host calls it with an empty buffer.
A source larger than the parser’s 4 GiB (u32::MAX) span limit is
rejected on the Extism error channel rather than aborting the instance —
the same guard the C ABI and browser WASM apply.
The schema_version wire contract
Every *_json export wraps its payload in
{ "schema_version": N, "data": [...] }, where N is
aozora::wire::SCHEMA_VERSION baked into the wasm at build time.
A host MUST call schema_version at load time and assert that the
returned integer equals the version its types were generated for:
- The wasm and the host's generated types are version-locked. A mismatch means the data array may not decode into the types you compiled against.
- schema_version is a cheap, input-free probe: the canonical place to fail fast, before the first real parse.
A SCHEMA_VERSION bump is a breaking change to the wire shape (a new
kind value, a field rename, an envelope restructuring). Per
ADR-0006’s consequences, a bump forces:
- regeneration of every language's types (just types-langs, drift-gated), and
- a coordinated SDK release: the wasm release asset and the host SDKs are released together, version-locked.
So a host that asserts schema_version == <generated-for> at load can
treat any other value as “this wasm is from a different release than my
types” and refuse to proceed, rather than silently decoding against the
wrong shape.
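The load-time assertion amounts to parsing a decimal string and comparing it against a compiled-in constant. A minimal host-side sketch follows, assuming the raw string has already been obtained from the schema_version export; EXPECTED_SCHEMA_VERSION, SchemaError, and check_schema_version are illustrative names, not part of any aozora SDK.

```rust
/// The version the host's generated types were built for (assumed to be 1 here).
const EXPECTED_SCHEMA_VERSION: u32 = 1;

#[derive(Debug, PartialEq, Eq)]
enum SchemaError {
    /// The export returned something that is not a decimal integer.
    NotANumber(String),
    /// The wasm comes from a different release than the host's types.
    Mismatch { expected: u32, found: u32 },
}

/// Validates the string returned by the `schema_version` export.
fn check_schema_version(raw: &str) -> Result<u32, SchemaError> {
    let found: u32 = raw
        .trim()
        .parse()
        .map_err(|_| SchemaError::NotANumber(raw.to_owned()))?;
    if found == EXPECTED_SCHEMA_VERSION {
        Ok(found)
    } else {
        Err(SchemaError::Mismatch { expected: EXPECTED_SCHEMA_VERSION, found })
    }
}

fn main() {
    // A matching plugin passes the probe.
    assert_eq!(check_schema_version("1"), Ok(1));
    // A plugin from another release is refused before any real parse runs.
    assert_eq!(
        check_schema_version("2"),
        Err(SchemaError::Mismatch { expected: 1, found: 2 })
    );
}
```

Running this check once at load time keeps every later decode free of per-call version handling.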
Worked example: the Go SDK
The reference host SDK is aozora-go — a pure-Go host built on
the wazero runtime (no cgo, no native build). It is
the concrete instance of the language-agnostic pattern below: load
aozora.wasm, assert schema_version, call the exports, and decode the
envelopes with types generated from the committed JSON Schema. Every
other Extism host SDK follows the identical shape — only the host-SDK API
calls and the generated type syntax differ.
See aozora-go for the worked, idiomatic version; the section
below is the template every language instantiates.
Language-agnostic “call a plugin export” template
The steps are the same in every Extism host SDK; only the method names and type syntax change.
1. Obtain aozora.wasm. Download it from a GitHub release asset, or build it yourself with just extism-build (see Building the plugin).
2. Create an Extism plugin from the bytes. Hand the wasm bytes to your host SDK's plugin constructor. WASI is not required; the plugin needs no filesystem or environment access.
3. Call schema_version and assert. Invoke schema_version with an empty input, parse the returned decimal string to an integer, and assert it equals the version your types were generated for. Abort on mismatch.
4. Call to_html(source). Pass the source bytes; receive the HTML5 string.
5. Call nodes_json(source) (or any *_json export). Receive the JSON envelope string and parse it.
6. Decode data with generated types. Deserialize the envelope's data array into the types generated from the committed JSON Schema for that endpoint.
plugin = ExtismPlugin(read("aozora.wasm")) // step 2
ver = int(plugin.call("schema_version", "")) // step 3
assert ver == EXPECTED_SCHEMA_VERSION
html = plugin.call("to_html", source) // step 4
env = json_parse(plugin.call("nodes_json", source)) // step 5
assert env.schema_version == EXPECTED_SCHEMA_VERSION
nodes = decode<NodeWire[]>(env.data) // step 6
One plugin instance is not concurrency-safe. A single Extism plugin wraps a single wasm instance with its own linear memory; do not call into one instance from multiple threads at once. Use one instance per thread, or pool them.
Per-language pointers
Extism publishes host SDKs for roughly 15 languages — including
Java, PHP, Ruby, .NET, Elixir, Haskell, OCaml, C/C++, and more — plus the
pure-Go aozora-go reference. Browse the current set at the
Extism host-SDK docs.
- Types for every supported language are generated from the committed wire JSON Schema by just types-langs (the quicktype driver), wired into the same drift-gate that guards the TypeScript .d.ts. Generate once per SCHEMA_VERSION; commit the output.
- The wasm ships as a GitHub release asset (one artifact, all platforms) and is reproducible locally via just extism-build.
Building the plugin
just extism-build
Builds aozora-extism for wasm32-unknown-unknown and runs binaryen’s
wasm-opt (the pinned, bulk-memory-capable build baked into the dev
image), producing:
- crates/aozora-extism/dist/aozora.wasm: the portable plugin artifact.
To exercise it end-to-end:
just smoke-extism
Both run inside the dev image — never invoke cargo / wasm-opt on the
host.
See also
- Choosing a binding — native vs. C ABI vs. Extism, and when to reach for each.
- Go SDK — the reference Extism host SDK (pure-Go wazero).
- Wire format — the envelope shape, the four endpoint payloads, and their JSON Schemas.
- C ABI — the in-process alternative for embedders that ship a native library.
- ADR-0006 — why Extism + schema-driven type generation is the breadth strategy.
Pandoc integration
The aozora-pandoc crate (workspace-internal, available via the
aozora CLI) projects a parsed Aozora document into the
Pandoc AST. Once you have Pandoc JSON, every Pandoc
output format (HTML, EPUB, LaTeX/PDF, DOCX, ODT, MediaWiki, …) is one
shell pipe away.
This is the recommended path if you want to convert Aozora Bunko notation into anything other than the built-in HTML renderer. Adding a new output format means adding a Pandoc filter (or none, if the default Span/Div mapping is enough), not extending the parser crate.
Quickstart
# Pandoc JSON to stdout
aozora pandoc input.txt > out.json
# Or pipe through pandoc directly
aozora pandoc input.txt | pandoc -f json -t html
aozora pandoc input.txt | pandoc -f json -t epub3 -o out.epub
# `--format` is shorthand for the pipe (requires pandoc on PATH)
aozora pandoc input.txt --format html > out.html
aozora pandoc -E sjis legacy.txt -t epub > out.epub
Projection rules
Each AozoraNode variant lifts to a Pandoc construct
carrying a stable CSS class so downstream filters or stylesheets can
specialise the rendering:
| Aozora variant | Pandoc construct | Class on the construct |
|---|---|---|
| Ruby | Span | aozora-ruby |
| ↳ base text | nested Span | aozora-ruby-base |
| ↳ reading text | nested Span | aozora-ruby-reading |
| Bouten | Span over target text | aozora-bouten |
| TateChuYoko | Span | aozora-tate-chu-yoko |
| Gaiji | Span carrying mencode | aozora-gaiji |
| Indent, AlignEnd | empty Span (marker) | aozora-indent / align-end |
| Warichu | Span with two children | aozora-warichu |
| AngleQuote | Span | aozora-angle-quote |
| Annotation, Kaeriten, HeadingHint | empty Span carrying raw | aozora-annotation / etc. |
| PageBreak | HorizontalRule block | (n/a: semantic block) |
| SectionBreak | empty Div | aozora-section-break |
| AozoraHeading | Header block | aozora-heading |
| Sashie | Para with Image | aozora-sashie |
| Container (字下げ等) | Div wrapping inner blocks | aozora-container-indent / etc. |
The structural attribute kvs (Pandoc’s third Attr tuple) carries
non-textual metadata (bouten kind / position, gaiji description /
mencode, indent amount, container kind). Filters that want
format-native rendering pattern-match on the class + kvs.
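For readers who have not seen Pandoc's JSON encoding, the sketch below builds the inline JSON for a Span by hand: an Attr triple of identifier, classes, and key/value pairs, followed by the child inlines. The json_str, str_inline, and span helpers are written here for illustration, and the ("kind", "goma") pair is a hypothetical kv, not a documented aozora attribute. The real projection goes through aozora_pandoc and serde, not string formatting.

```rust
/// Minimal JSON string escaper (quotes, backslashes, control characters).
fn json_str(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Pandoc `Str` inline: {"t":"Str","c":"text"}.
fn str_inline(text: &str) -> String {
    format!("{{\"t\":\"Str\",\"c\":{}}}", json_str(text))
}

/// Pandoc `Span` inline: {"t":"Span","c":[[id,[classes],[[k,v],…]],[inlines]]}.
/// The identifier is left empty; one class and any number of kvs are attached.
fn span(class: &str, kvs: &[(&str, &str)], children: &[String]) -> String {
    let kvs: Vec<String> = kvs
        .iter()
        .map(|(k, v)| format!("[{},{}]", json_str(k), json_str(v)))
        .collect();
    format!(
        "{{\"t\":\"Span\",\"c\":[[\"\",[{}],[{}]],[{}]]}}",
        json_str(class),
        kvs.join(","),
        children.join(",")
    )
}

fn main() {
    // Ruby: an outer Span with nested base and reading Spans, per the table.
    let base = span("aozora-ruby-base", &[], &[str_inline("青梅")]);
    let reading = span("aozora-ruby-reading", &[], &[str_inline("おうめ")]);
    println!("{}", span("aozora-ruby", &[], &[base, reading]));
}
```

A Pandoc filter would match on the class string in the second slot of the Attr triple and read any metadata from the third.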
Why a Pandoc projection at all
Aozora notation has rich semantic markup (ruby, bouten, tate-chu-yoko,
gaiji…) that no single Pandoc native construct captures. The naive
shortcut of emitting RawInline("html", "<ruby>…</ruby>") would only
work for the HTML writer; every other Pandoc output format would
strip the raw HTML and lose the meaning.
By lifting each Aozora variant to a Span / Div with a stable
class, the same JSON renders sensibly across every Pandoc format
today (each format’s writer renders Span as a stylable container)
and stays open for richer format-native rendering tomorrow via
filters. That’s the same pattern Pandoc itself uses for
[content]{.smallcaps} — semantic in the AST, format-specific in the
writer.
Architecture
The library entry point is aozora_pandoc::to_pandoc:
use aozora::Document;
use aozora_pandoc::to_pandoc;
let doc = Document::new(std::fs::read_to_string("input.txt")?);
let pandoc = to_pandoc(&doc.parse());
let json = serde_json::to_string(&pandoc)?;
aozora-cli wires that into aozora pandoc so binary consumers
don’t need to write Rust.
Recipes
Task-shaped, copy-paste answers to “how do I do X with aozora?”. Each recipe is a single problem stated in one sentence, the minimal correct code to solve it, the output you should expect, and a jump list to the deeper chapters.
The Rust snippets use the umbrella aozora
crate and nothing else: downstream consumers depend on aozora
alone, never the internal building-block crates. The shell snippets use
the aozora binary. If you do not yet have either
in scope, start at Install, then the
Library or
CLI quickstart.
Each recipe that has a Rust solution maps to a runnable example under
crates/aozora/examples/, so you can read the whole program and run
it rather than reassembling fragments. Where that applies the recipe
says so — e.g. run with just example walk_ast.
The recipes
| I want to… | Recipe |
|---|---|
| Pull every ruby base + reading pair out of a document | Extract ruby pairs |
| Get diagnostics as machine-readable JSON | Diagnostics as JSON |
| Walk the parsed tree node by node | Walk the AST |
| Parse a Shift_JIS file and resolve 外字 | Shift_JIS & gaiji |
| Convert to EPUB / LaTeX / DOCX | EPUB via Pandoc |
| Check that a file is already canonical | Round-trip & fmt --check |
| Call aozora from Go / Java / Python / JS | Call from another language |
The example programs
The recipes mirror these runnable examples (authored under
crates/aozora/examples/); each is launched with just example <name>:
| Example | Mirrors |
|---|---|
| hello | The six-line render in the Library quickstart |
| walk_ast | Walk the AST, Extract ruby pairs |
| diagnostics | Diagnostics as JSON |
| round_trip | Round-trip & fmt --check |
| sjis | Shift_JIS & gaiji |
See also
- Library Quickstart: the lifetime model and the core Document → AozoraTree flow every recipe assumes.
- Choosing a binding: picking the surface (Rust / CLI / wasm / Python / Go / Extism) before you start.
- Node reference: what each AST node represents.
- Wire format: the JSON envelope the aozora::wire serialisers emit.
Extract ruby pairs
Problem. You want every ruby annotation in a document as
(base, reading) string pairs — to build a furigana glossary, audit
readings, or feed a dictionary.
Solution
Walk source_nodes() (see Walk the AST), keep only the
Ruby nodes, and read each node’s base and reading. Both are
NonEmpty<Content>; call .get() to get the Content, then
.as_plain() for the common case where the text carries no nested
constructs.
use aozora::{Document, AozoraNode, NodeRef};
fn main() {
let source = "|青梅《おうめ》街道を|逢《お》う";
let doc = Document::new(source);
let tree = doc.parse();
for sn in tree.source_nodes() {
// Ruby is always an inline construct.
if let NodeRef::Inline(AozoraNode::Ruby(ruby)) = sn.node {
// `base` / `reading` are NonEmpty<Content>; `.get()` is the
// Content, `.as_plain()` its text when there are no nested nodes.
let base = ruby.base.get().as_plain().unwrap_or("<mixed>");
let reading = ruby.reading.get().as_plain().unwrap_or("<mixed>");
println!("{base}\t{reading}");
}
}
}
Expected output
青梅 おうめ
逢 お
Notes
- Why NonEmpty. The parser only emits a Ruby node once both base and reading have content, so the fields are NonEmpty<Content>: an empty side is unrepresentable, and you never have to guard against it. .get() unwraps to the inner Content.
- The <mixed> arm. Content::as_plain() returns None when the run carries nested constructs (a gaiji reference or annotation inside the base, for instance). That is rare for readings but does happen for bases. To flatten those too, iterate the segments instead of bailing (Segment lives under the syntax module since it is not in the top-level re-export set):
    use aozora::syntax::borrowed::Segment;
    fn text_of(content: aozora::Content<'_>) -> String {
        let mut out = String::new();
        for seg in content.iter() {
            if let Segment::Text(s) = seg {
                out.push_str(s);
            }
            // Segment::Gaiji / Segment::Annotation carry non-plain payloads;
            // handle them here if your glossary needs them.
        }
        out
    }
  Content::iter() yields a Segment per logical run; the Plain case yields exactly one Text segment, so the loop is uniform.
- delim_explicit. ruby.delim_explicit records whether the source used the explicit | base delimiter. It does not affect the base/reading text; see the Ruby node chapter for why both source forms classify identically.
See also
- Runnable example: just example walk_ast (crates/aozora/examples/walk_ast.rs) shows the full node walk this recipe narrows.
- Walk the AST: the general traversal.
- Ruby node reference: the Ruby struct, the two source forms, and the rendered HTML.
- Ruby notation: the |青梅《おうめ》 syntax itself.
Diagnostics as JSON
Problem. You want the parser’s diagnostics as a stable, machine-readable JSON document — to feed an editor, a CI annotation, or a cross-language tool.
Solution (library)
The parser always produces a tree, even from malformed input;
diagnostics ride alongside it. AozoraTree::diagnostics is the typed
slice, and aozora::wire::serialize_diagnostics projects that slice
into the shared wire envelope — the exact JSON
every binding (FFI, wasm, Python, Extism) emits.
use aozora::Document;
use aozora::wire::serialize_diagnostics;
fn main() {
// U+E001 is a private-use sentinel the parser reserves; feeding one
// in raises a diagnostic without aborting the parse.
let doc = Document::new("abc\u{E001}def");
let tree = doc.parse();
let json = serialize_diagnostics(tree.diagnostics());
println!("{json}");
}
The wire module is behind the wire Cargo feature on aozora.
Expected output
{"schema_version":1,"data":[{"kind":"source_contains_pua","severity":"warning","source":"source","span":{"start":3,"end":6},"codepoint":""}]}
Each entry is { kind, severity, source, span: { start, end }, codepoint? }. schema_version lets a consumer branch before an added
variant shows up; see the Wire format chapter
for the full schema and the "unknown" fallback contract.
Walking diagnostics without serialising
If you are staying in Rust, you usually do not need JSON at all — read the typed slice directly:
for d in tree.diagnostics() {
// `Diagnostic` is an enum: `{d}` is the human message (thiserror),
// `code()` the stable id, `span()` the byte range.
let span = d.span();
eprintln!("[{}] {d} @ {}..{}", d.code(), span.start, span.end);
}
Diagnostics are non-fatal by design: callers that want strict behaviour treat any diagnostic as an error themselves. The Diagnostics catalogue lists every stable code.
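A strict gate of that kind is a few lines on the caller's side. The sketch below assumes nothing about aozora's real types: Diag and StrictError are hypothetical stand-ins, and a real caller would build the same check over tree.diagnostics() using the code() and span() accessors shown above.

```rust
use std::fmt;

/// Hypothetical stand-in for a parser diagnostic; not aozora's real type.
#[derive(Debug, Clone, PartialEq)]
struct Diag {
    code: &'static str,
    start: usize,
    end: usize,
}

/// Error returned by the strict gate: how many diagnostics fired, and the first one.
#[derive(Debug, PartialEq)]
struct StrictError {
    count: usize,
    first: Diag,
}

impl fmt::Display for StrictError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "{} diagnostic(s); first: [{}] @ {}..{}",
            self.count, self.first.code, self.first.start, self.first.end
        )
    }
}

/// Treat any diagnostic at all as a failure.
fn strict(diags: &[Diag]) -> Result<(), StrictError> {
    match diags.first() {
        None => Ok(()),
        Some(first) => Err(StrictError { count: diags.len(), first: first.clone() }),
    }
}

fn main() {
    let diags = [Diag { code: "source_contains_pua", start: 3, end: 6 }];
    match strict(&diags) {
        Ok(()) => println!("clean"),
        Err(e) => println!("rejected: {e}"),
    }
}
```

Keeping the policy in the caller means the same parse can feed both a lenient editor and a strict CI job.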
Solution (CLI)
For shell / CI use, aozora check lexes a file and reports
diagnostics, exiting non-zero under --strict:
aozora check src.txt # human-readable; exit 0 even with warnings
aozora check --strict src.txt # warnings → exit 1 (the CI gate)
cat src.txt | aozora check # reads stdin
A JSON output mode for check (--diagnostic-format json, emitting
the same serialize_diagnostics envelope) is planned so scripts get
the structured stream without writing Rust. Until it lands, the
library path above is the supported way to obtain the JSON; the CLI’s
current output is the human-readable form documented in the
CLI reference.
See also
- Runnable example: just example diagnostics (crates/aozora/examples/diagnostics.rs).
- Diagnostics catalogue: every code, severity, and what triggers it.
- Wire format: the envelope schema and version contract.
- CLI reference → aozora check: flags and exit codes.
Walk the AST
Problem. You have parsed a document and want to visit every classified Aozora construct in source order — to count node kinds, build an index, or drive a custom renderer.
Solution
AozoraTree::source_nodes returns a slice of SourceNode, one per
classified construct, sorted by source position. Each carries a
source_span (byte offsets into the source) and a node, which is a
NodeRef tagging the sentinel kind that fired.
use aozora::{Document, NodeRef};
fn main() {
let source = "|青梅《おうめ》の[#ここから2字下げ]街道《かいどう》[#ここで字下げ終わり]";
let doc = Document::new(source);
let tree = doc.parse();
for sn in tree.source_nodes() {
let span = sn.source_span;
match sn.node {
NodeRef::Inline(node) | NodeRef::BlockLeaf(node) => {
// `node` is an AozoraNode; `.kind()` is the cross-cutting tag.
println!("{:>3}..{:<3} {:?}", span.start, span.end, node.kind());
}
NodeRef::BlockOpen(kind) => {
println!("{:>3}..{:<3} open {kind:?}", span.start, span.end);
}
NodeRef::BlockClose(kind) => {
println!("{:>3}..{:<3} close {kind:?}", span.start, span.end);
}
}
}
}
Expected output
0..21 Ruby
24..45 open Indent { amount: 2 }
45..72 Ruby
72..105 close Indent { amount: 2 }
(Byte offsets are over the full-width UTF-8 source; the exact numbers depend on your input.)
How the surface is shaped
source_nodes() is the source-coordinate view — the one editor
features and indexers want. The NodeRef variant tells you where the
construct landed:
- Inline: an inline construct (ruby, bouten, gaiji, 縦中横, …) carrying an AozoraNode.
- BlockLeaf: a standalone block construct (page break, section break, heading) carrying an AozoraNode.
- BlockOpen / BlockClose: the two ends of a paired container ([#ここから…] / [#ここで…終わり]), each carrying a ContainerKind.
NodeRef::kind() collapses all four into a single
NodeKind tag when you only need the discriminant;
NodeRef::sentinel_kind() gives the sentinel family.
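Because the slice is sorted by source position, an editor-style query such as "which construct sits under this byte offset?" can be answered with a binary search rather than a scan. The following is a sketch under stated assumptions: Span is a hypothetical stand-in for a node's source span, spans are half-open, and no two spans overlap. It does not use aozora's own types.

```rust
/// Hypothetical stand-in for a node's byte span (half-open: start..end).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Span {
    start: usize,
    end: usize,
}

/// Index of the span containing `offset`, if any.
/// Assumes `spans` is sorted by `start` and spans do not overlap.
fn span_at(spans: &[Span], offset: usize) -> Option<usize> {
    // Number of spans starting at or before `offset`; the candidate is the last of them.
    let idx = spans.partition_point(|s| s.start <= offset);
    let candidate = idx.checked_sub(1)?;
    (offset < spans[candidate].end).then_some(candidate)
}

fn main() {
    let spans = [Span { start: 0, end: 24 }, Span { start: 27, end: 48 }];
    for offset in [10, 25, 27] {
        println!("offset {offset}: {:?}", span_at(&spans, offset));
    }
}
```

The lookup is O(log n) per query, which matters when an editor asks on every cursor move.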
Matching container open/close pairs
The walk above sees opens and closes as independent events. When you
need them paired — “where does this [#ここから…] close?” —
read AozoraTree::container_pairs instead, which yields one entry per
balanced pair (in normalized coordinates). The inline-delimiter
analogue (ruby 《…》, brackets) is AozoraTree::pairs. See
Indent & align containers for the container
model.
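If you are curious what pairing involves, the classic technique is a stack over the event stream. This is only an illustration of that technique, not a description of how container_pairs is implemented: Event is a hypothetical type carrying a byte offset, and unmatched closes are simply dropped here.

```rust
/// Hypothetical open/close event; the payload is a byte offset in the source.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Event {
    Open(usize),
    Close(usize),
}

/// Pair each close with the most recent unmatched open.
/// Returns (open_offset, close_offset) in the order the closes appear.
/// Unmatched closes are ignored; unmatched opens are left unpaired.
fn pair_up(events: &[Event]) -> Vec<(usize, usize)> {
    let mut stack = Vec::new();
    let mut pairs = Vec::new();
    for ev in events {
        match *ev {
            Event::Open(at) => stack.push(at),
            Event::Close(at) => {
                if let Some(open) = stack.pop() {
                    pairs.push((open, at));
                }
            }
        }
    }
    pairs
}

fn main() {
    let events = [Event::Open(0), Event::Open(10), Event::Close(20), Event::Close(30)];
    println!("{:?}", pair_up(&events));
}
```

Nested containers come out innermost first, because the inner close is seen before the outer one.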
Reaching inside a node
AozoraNode is a borrowed enum; its payload fields hold the
construct’s content. To pull text out of a specific variant — say the
base and reading of a ruby node — match the variant and read its
Content; that is the next recipe,
Extract ruby pairs.
See also
- Runnable example: just example walk_ast (crates/aozora/examples/walk_ast.rs).
- Extract ruby pairs: the same walk, narrowed to one node kind, reading its content.
- Library Quickstart → Walking the AST.
- Node reference: every NodeKind and what it carries.
- Borrowed-arena AST: why nodes borrow from the Document’s arena and what that means for lifetimes.
Shift_JIS & gaiji
Problem. Aozora Bunko ships its corpus as Shift_JIS, and those
files contain 外字 (gaiji) references like
※[#「木+吶のつくり」、第3水準1-85-54]. You want to decode the
bytes and see how each gaiji reference resolved.
Two concerns, two layers
- Encoding is not the parser’s job: the parser is strictly UTF-8. Decode Shift_JIS first with aozora::encoding, then hand the resulting String to Document::new.
- Gaiji resolution is the parser’s job. As it classifies a ※[#…] reference it resolves the mencode against the bundled JIS X 0213 tables, attaching the result to the Gaiji node. You read it off the node; you do not call the resolver yourself.
Solution (library)
use aozora::{Document, AozoraNode, NodeRef};
use aozora::encoding::decode_sjis;
fn main() -> Result<(), Box<dyn std::error::Error>> {
// Decode the Shift_JIS archive file to UTF-8 (strict — errors on
// malformed bytes rather than substituting replacement chars).
let bytes = std::fs::read("crime_and_punishment.txt")?;
let utf8 = decode_sjis(&bytes)?;
let doc = Document::new(utf8);
let tree = doc.parse();
for sn in tree.source_nodes() {
if let NodeRef::Inline(AozoraNode::Gaiji(g)) = sn.node {
match g.ucs.and_then(|r| r.as_char()) {
Some(ch) => println!("{} → {ch}", g.description),
None => println!("{} → (unresolved)", g.description),
}
}
}
Ok(())
}
Expected output
木+吶のつくり → 枘
Gaiji carries three fields: description (the free-form source
text), ucs (the resolved Resolved, None when no table matched),
and mencode (the raw reference such as 第3水準1-85-54). Resolved
is either a single Char — recovered via as_char() above — or a
Multi combining sequence for the handful of plane-1 cells that need
one; see the Gaiji node chapter.
Picking the decoder
aozora::encoding offers more than one entry point:
- decode_sjis(&[u8]) -> Result<String, _>: force Shift_JIS. Use it when you know the input is the canonical archive encoding.
- decode_auto(&[u8]) -> Result<Cow<str>, _>: sniff. Valid UTF-8 is returned borrowed (zero-copy); otherwise the bytes decode as Shift_JIS. Use it for a mixed corpus where some files are pre-converted UTF-8 mirrors.
Both are strict — neither substitutes replacement characters — so you learn when you are looking at corrupted source rather than silently absorbing it.
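The sniffing strategy is easy to picture with the standard library alone. The sketch below shows the borrowed-versus-owned split with Cow; it is not the aozora::encoding implementation. sniff_decode is a hypothetical name, and the fallback is passed in as a closure because std has no Shift_JIS decoder (the Latin-1 mapping in main is only a placeholder).

```rust
use std::borrow::Cow;

/// Return valid UTF-8 borrowed (no copy); otherwise hand the bytes to `fallback`.
/// Hypothetical helper; a real caller would pass a strict Shift_JIS decoder.
fn sniff_decode<'a, F>(bytes: &'a [u8], fallback: F) -> Result<Cow<'a, str>, String>
where
    F: FnOnce(&[u8]) -> Result<String, String>,
{
    match std::str::from_utf8(bytes) {
        Ok(s) => Ok(Cow::Borrowed(s)),
        Err(_) => fallback(bytes).map(Cow::Owned),
    }
}

fn main() {
    // Placeholder fallback: maps each byte to the same-numbered code point (Latin-1).
    // It stands in for a legacy decoder and is NOT Shift_JIS.
    let latin1 = |b: &[u8]| -> Result<String, String> {
        Ok(b.iter().map(|&x| x as char).collect())
    };

    let utf8 = sniff_decode("青梅".as_bytes(), latin1).unwrap();
    println!("{utf8}: borrowed = {}", matches!(utf8, Cow::Borrowed(_)));

    // 0x93 0xFA is not valid UTF-8, so the fallback runs and the result is owned.
    let legacy = sniff_decode(&[0x93, 0xFA], latin1).unwrap();
    println!("fallback ran: owned = {}", matches!(legacy, Cow::Owned(_)));
}
```

Returning Cow lets the common pre-converted case skip an allocation while keeping one return type for both paths.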
Solution (CLI)
The aozora binary decodes Shift_JIS with -E sjis (alias
--encoding sjis); the default is UTF-8:
aozora render -E sjis crime.txt > crime.html
aozora check -E sjis crime.txt # diagnostics on the decoded text
aozora pandoc -E sjis crime.txt -t epub > crime.epub
See also
- Runnable example: just example sjis (crates/aozora/examples/sjis.rs).
- Gaiji node reference: the Gaiji struct and the Resolved shapes.
- Gaiji notation: the ※[#…] reference syntax.
- Shift_JIS + 外字 resolver: the decode + resolution architecture.
- Library Quickstart → Shift_JIS input.
EPUB via Pandoc
Problem. You want an EPUB (or LaTeX/PDF, DOCX, ODT, …) out of an Aozora Bunko source — anything the built-in HTML renderer does not produce.
Solution
The aozora binary projects a parsed document into the
Pandoc AST as JSON. Pipe that JSON into
pandoc, and every Pandoc output format is one writer away. For EPUB:
aozora pandoc input.txt | pandoc -f json -t epub3 -o out.epub
That is the whole recipe. The same pipe reaches the other formats by
swapping the -t writer:
aozora pandoc input.txt | pandoc -f json -t latex -o out.tex
aozora pandoc input.txt | pandoc -f json -t docx -o out.docx
aozora pandoc input.txt | pandoc -f json -t html > out.html
Shift_JIS source decodes with -E sjis, exactly as for render /
check:
aozora pandoc -E sjis crime.txt | pandoc -f json -t epub3 -o crime.epub
aozora pandoc also has a --format / -t shorthand that runs the
pipe for you when pandoc is on PATH:
aozora pandoc input.txt -t epub > out.epub
Expected output
out.epub — a valid EPUB 3 container. Each Aozora construct lifts to
a Pandoc Span / Div carrying a stable CSS class (aozora-ruby,
aozora-bouten, …), so you can style or filter them per format. The
projection-rules table lists
every variant’s mapping.
Why a Pandoc projection at all
Aozora notation has rich semantic markup (ruby, bouten, 縦中横,
gaiji) that no single Pandoc native construct captures. Emitting raw
HTML would only survive the HTML writer; every other format would
strip it. Lifting each variant to a classed Span/Div instead means
the same JSON renders sensibly across every Pandoc format today and
stays open to richer format-native rendering via filters tomorrow.
Adding a new output format is a Pandoc filter, never a parser change.
See also
- Pandoc AST projection: the full projection table, the kvs metadata contract, and the library entry point.
- Choosing a binding → By output format.
- CLI reference: aozora pandoc flags.
Round-trip & fmt --check
Problem. You want to confirm a file is already in canonical Aozora form — or to canonicalise it — and to rely on parse ∘ serialize being lossless.
The property
AozoraTree::serialize re-emits Aozora source from the parsed tree.
The guarantee is a fixed point: parsing a canonical document and
serialising it returns the same bytes, and serialising again changes
nothing.
use aozora::Document;
fn main() {
let source = "|青梅《おうめ》";
let once = Document::new(source).parse().serialize();
let twice = Document::new(once.clone()).parse().serialize();
assert_eq!(once, twice, "serialize is a fixed point");
println!("{twice}");
}
Expected output
|青梅《おうめ》
Canonical vs. raw input
Real Aozora Bunko sources carry stylistic variation the lexer
normalises before tokenising — CRLF vs LF, NFC vs NFD around accents,
and the bare-vs-explicit ruby delimiter (青梅《おうめ》 vs
|青梅《おうめ》). For raw input, therefore:
// Not guaranteed for arbitrary raw input:
assert_eq!(Document::new(raw).parse().serialize(), raw); // may differ
// Guaranteed: the SECOND pass is a fixed point.
let canonical = Document::new(raw).parse().serialize();
assert_eq!(Document::new(canonical.clone()).parse().serialize(), canonical);
The first serialize() is the canonical form (e.g. it always emits
the explicit | ruby delimiter — see the
Ruby node chapter); from there it is stable. This
fixed-point property is what the corpus sweep verifies across the full
~17 000-work catalogue.
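The property itself is generic: for any canonicaliser f, the check is f(f(x)) == f(x). Here is a small sketch of that check with a toy canonicaliser that only drops carriage returns. Both function names are hypothetical, and the toy stands in for parse-then-serialize purely to keep the example free of dependencies.

```rust
/// Toy canonicaliser standing in for parse-then-serialize: drops every carriage return.
fn canonicalise(input: &str) -> String {
    input.replace('\r', "")
}

/// True when a second pass of `f` changes nothing, i.e. f(f(x)) == f(x).
fn reaches_fixed_point(f: impl Fn(&str) -> String, raw: &str) -> bool {
    let once = f(raw);
    let twice = f(&once);
    once == twice
}

fn main() {
    let raw = "青梅\r\n街道\r\n";
    // The first pass may change raw input...
    println!("first pass changed input: {}", canonicalise(raw) != raw);
    // ...but the second pass must not change the first pass's output.
    println!("fixed point: {}", reaches_fixed_point(canonicalise, raw));
}
```

A function that keeps changing its own output, such as one that appends a character on every pass, fails the same check.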
Solution (CLI)
aozora fmt is the round-trip at the shell. With --check it is a
read-only gate — exit 0 if the file is already canonical, 1 if it
would change:
aozora fmt --check src.txt # CI gate: nonzero if not canonical
aozora fmt src.txt > out.txt # write the canonical form to stdout
aozora fmt --write src.txt # rewrite in place
cat src.txt | aozora fmt # stdin → stdout
Exit codes: 0 on success (or no diff under --check), 1 on a
formatting mismatch under --check, 2 on a usage error. aozora fmt --check is exactly what this project runs in CI to keep fixtures
canonical.
See also
- Runnable example: just example round_trip (crates/aozora/examples/round_trip.rs).
- Library Quickstart → Round-trip.
- HTML renderer & canonical serialiser: how the canonical form is defined.
- CLI reference → aozora fmt: flags and exit codes.
Call from another language
Problem. You are not writing Rust — you want to parse Aozora notation from Go, Java, Python, JavaScript, Ruby, PHP, or something further down the long tail.
One parser, many front doors
There is exactly one parser. Every binding funnels the same source text through the same lexer and emits the same HTML, the same canonical serialise, and the same wire-envelope JSON — byte-identical across every language. So the decision is not “which binding is more correct”; it is “which fits the language and runtime I already have.” Choosing a binding is the full decision table; this recipe is the short jump list.
Pick your language
- JavaScript / TypeScript (browser, Node, Deno, edge) → aozora-wasm. A wasm-bindgen Document class; runs client-side and at the edge, distributed on npm.
- Python → aozora-py. An in-process PyO3 native module built with maturin:
    from aozora_py import Document
    doc = Document("|青梅《おうめ》")
    print(doc.to_html())  # <ruby>青梅<rt>おうめ</rt></ruby>
- Go → aozora-go. A pure-Go wazero host over aozora.wasm (no cgo, no C toolchain):
    go get github.com/P4suta/aozora-go
- C / C++ / Zig / any FFI-capable native language → the aozora-ffi C ABI: an opaque handle plus JSON over a stable C header (aozora.h).
- Java, PHP, Ruby, .NET, Elixir, Haskell, … the long tail → the aozora-extism host SDK. One portable aozora.wasm that any Extism host SDK loads; see below.
- Anything other than HTML (EPUB, LaTeX/PDF, DOCX, …) → the aozora pandoc pipe, regardless of host language.
The Extism template (the breadth strategy)
For the languages without a bespoke native binding, the answer is the
single aozora.wasm artifact loaded through that language’s Extism
host SDK. The steps are identical in every SDK — only the method names
change:
- Obtain aozora.wasm (a GitHub release asset).
- Load it with your host SDK’s plugin constructor (no WASI needed).
- Assert schema_version matches the wire schema you compiled against.
- Call an export with the source string:
  - to_html / serialize → a bare string;
  - diagnostics_json / nodes_json / pairs_json / container_pairs_json → a { schema_version, data } wire envelope.
- Parse the envelope data with types generated from the committed JSON Schema.
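The version assertion in the third step is worth writing as a gate that the payload cannot bypass. The sketch below assumes the envelope has already been deserialised: Envelope, open_envelope, and SUPPORTED_SCHEMA are hypothetical names, and the value 1 mirrors the schema_version shown in the Diagnostics as JSON recipe.

```rust
/// Hypothetical in-memory form of the `{ schema_version, data }` wire envelope.
#[derive(Debug)]
struct Envelope<T> {
    schema_version: u32,
    data: T,
}

/// The schema version this consumer was written against (assumed to be 1).
const SUPPORTED_SCHEMA: u32 = 1;

/// Release the payload only when the envelope's version matches.
fn open_envelope<T>(env: Envelope<T>) -> Result<T, String> {
    if env.schema_version == SUPPORTED_SCHEMA {
        Ok(env.data)
    } else {
        Err(format!(
            "wire schema {} is not the supported {}",
            env.schema_version, SUPPORTED_SCHEMA
        ))
    }
}

fn main() {
    let ok = Envelope { schema_version: 1, data: vec!["source_contains_pua"] };
    println!("{:?}", open_envelope(ok));

    let newer = Envelope { schema_version: 2, data: vec!["unknown"] };
    println!("{:?}", open_envelope(newer));
}
```

Because the data field is only reachable through the gate, a consumer cannot read a payload whose shape it was never compiled for.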
The reference host SDK (aozora-go) is exactly
this template instantiated in Go; every other Extism SDK follows the
same shape. The full export list and the language-agnostic walkthrough
live in the Extism chapter. Why a wasm plugin
for the tail rather than a native binding per language is
ADR-0006;
the short version is in
Choosing a binding → In-process vs host-runtime.
See also
- Choosing a binding — the decision table and the performance ordering.
- Extism host SDKs — the wasm exports and the per-language template.
- Go · Python · WASM · C ABI — the native / in-process bindings.
- Wire format — the JSON envelope every binding agrees on.
Release profile & PGO
aozora’s [profile.release] is tuned for cross-crate inlining at
the expense of compile time:
[profile.release]
lto = "fat" # full LTO across the whole workspace
codegen-units = 1 # single CGU so LTO sees everything
strip = "symbols" # smaller binary, faster cold start
panic = "abort" # no unwinding tables in the binary
opt-level = 3
Why fat LTO over thin
A thin LTO build keeps each crate’s IR isolated; the cross-crate inliner only inlines through summary stubs. Fat LTO concatenates every crate’s IR into one module before optimisation, so the inliner can see across the whole pipeline.
For aozora that pays off because the lex pipeline is deep:
aozora-render → aozora → aozora-pipeline::lex_into_arena →
per-phase functions, each living behind a crate boundary or a
module boundary that LLVM treats the same way under thin LTO. A
function call across that depth under thin LTO costs several
indirect calls and stack frames; the fat LTO build folds the chain
into ~40 inlined instructions on the hot per-byte path.
Measured on the corpus sweep: fat LTO is 30%+ faster than thin LTO once the lex orchestrator is split across crates. Compile-time cost is real (release builds take ~3 minutes vs ~1 minute for thin), but release builds happen at tag time, not on every iteration.
Why codegen-units = 1
codegen-units = N splits each crate into N parallel codegen jobs
during compilation. Each unit optimises independently, then the
linker stitches them together. With N > 1 the LLVM inliner can’t
see across unit boundaries inside a single crate — which under fat
LTO defeats half the point.
codegen-units = 1 ensures fat LTO actually sees every function in
every crate. Compile time grows; runtime wins back.
Why panic = "abort"
aozora is a parser, not a server. There’s no panic handler to
recover into — a panic on user input would be a parser bug, not a
recoverable error. panic = "abort":
- Drops the unwinding tables from the binary (~80 KiB savings on the CLI).
- Removes the panic-handling overhead from every function call (the compiler doesn’t insert landing pads).
- Surfaces parser bugs as SIGABRT immediately, which is what we want: a panic always indicates an invariant violation that needs fixing, not a state to gracefully degrade through.
For library consumers that want unwinding (e.g. embedding in a long-running server), the dependency-mode build inherits the consumer’s profile, so this only affects the binaries we publish.
Profile-guided optimisation (PGO)
The release pipeline supports PGO via scripts/pgo-build.sh:
./scripts/pgo-build.sh
Three-stage build:
- Instrumented build: cargo build --release with RUSTFLAGS="-Cprofile-generate=/tmp/pgo-data". The resulting binary is slower than vanilla release because of the instrumentation overhead.
- Profile collection: run the corpus sweep against the instrumented binary. The corpus must contain a representative spread of document sizes and notation density. The aozora-bench throughput_by_class probe handles this.
- Final build: cargo build --release with RUSTFLAGS="-Cprofile-use=/tmp/pgo-data/merged.profdata", where merged.profdata is the raw .profraw output of the previous stage combined by llvm-profdata merge. LLVM uses the profile to drive its inliner, branch-prediction hints, and basic-block ordering decisions.
Measured win on the corpus sweep: 8–12% faster than non-PGO release build. The cost is operational complexity (the build-script needs a real corpus available); the win compounds with fat LTO, since both target the same hot paths.
BOLT (post-link optimisation)
BOLT is the next layer after PGO: it reorders basic blocks in the
final binary based on the same profile. scripts/pgo-build.sh ends
with an optional BOLT pass when llvm-bolt is on PATH.
BOLT wins another ~3% on top of PGO, mostly by improving I-cache density for the lex hot path. The win is smaller than PGO’s because PGO already used the profile during compilation; BOLT only refines the final binary’s layout.
Why we don’t use specific tricks
- -Cforce-frame-pointers=yes: would help samply unwind on some platforms, but the workspace [profile.bench] covers the profiling case (debug = 1 + strip = none). Release builds get the smaller binary.
- unsafe perf shortcuts: unsafe_code = "forbid" at the workspace level. Three crates locally relax it (FFI / scan / xtask), each with // SAFETY: comments and #[deny(unsafe_op_in_unsafe_fn)]. Where a perf opportunity needs unsafe, we measure it first and cite the win in the comment.
- #[inline(always)]: used sparingly. The compiler’s default heuristics have improved enough that forcing inlining usually costs binary size for negligible win. Where it does help (e.g. the per-byte scanner inner loop), the call site has a measurement comment.
See also
- Profiling with samply — how to measure whether a perf change helped.
- Benchmarks — the harness that produces the PGO profile.
- Corpus sweeps — the input the bench harness consumes.
Profiling with samply
samply is the workspace’s
sampling profiler. It produces .json.gz traces in the
Firefox-Profiler gecko format
that can be loaded into the web UI for visual analysis, or fed to
the in-tree aozora-trace crate for automated rollups.
Quick commands
# Single corpus document
AOZORA_CORPUS_ROOT=/path/to/corpus \
just samply-doc 001529/files/50685_ruby_67979/50685_ruby_67979.txt
# Full corpus, parser-bound (5 parse passes after the one-time load)
AOZORA_CORPUS_ROOT=/path/to/corpus just samply-corpus
# Full corpus, render-bound
AOZORA_CORPUS_ROOT=/path/to/corpus just samply-render
# Open in Firefox-Profiler
samply load /tmp/aozora-corpus-<timestamp>.json.gz
All three are wrappers over the aozora-xtask samply subcommand,
which:
- Builds the bench probe with --profile=bench (debug info preserved).
- Runs samply against the resulting binary.
- Drops the .json.gz in /tmp/.
Why these run on the host (not Docker)
samply uses perf_event_open(2) for kernel sampling. Docker’s
default seccomp profile blocks that syscall. The xtask binary
therefore runs on the host (not via docker compose run) and the
Justfile recipes are exempt from the workspace’s normal
“everything in Docker” policy.
The recipes check /proc/sys/kernel/perf_event_paranoid on entry
and print the fix-up command if the value is too high (default 2;
needs to be ≤ 1 for unprivileged sampling):
echo 1 | sudo tee /proc/sys/kernel/perf_event_paranoid
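The same pre-flight check is easy to reproduce in your own tooling. This is a sketch, not the recipes' actual code: parse_paranoid and allows_sampling are hypothetical helpers, and the threshold of 1 is the one stated above.

```rust
/// Parse the contents of /proc/sys/kernel/perf_event_paranoid (an integer plus newline).
fn parse_paranoid(contents: &str) -> Option<i32> {
    contents.trim().parse().ok()
}

/// Unprivileged sampling needs a level of 1 or lower.
fn allows_sampling(level: i32) -> bool {
    level <= 1
}

fn main() {
    // The file only exists on Linux; elsewhere the read fails and we say so.
    match std::fs::read_to_string("/proc/sys/kernel/perf_event_paranoid") {
        Ok(raw) => match parse_paranoid(&raw) {
            Some(level) if allows_sampling(level) => println!("ok: level {level}"),
            Some(level) => println!("too strict: level {level}; lower it to 1"),
            None => println!("unrecognised value: {raw:?}"),
        },
        Err(e) => println!("cannot read perf_event_paranoid: {e}"),
    }
}
```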
Why --profile=bench and not --release
cargo build --release uses [profile.release], which has
debug = 0 + strip = "symbols". Samply still records samples,
but they show up as raw addresses (0x8fb61) instead of function
names — every sample becomes useless to a human reader.
The workspace [profile.bench] inherits from release but sets
debug = 1 + strip = "none". The xtask wrappers automatically
build with --profile=bench. If you launch samply manually, do the
same.
Corpus load dominates a single-pass trace
throughput_by_class and render_hot_path spend most wall time in
Shift_JIS decode + filesystem I/O during the one-time corpus load.
A single-pass samply trace puts __memmove_avx_unaligned and
encoding_rs::ShiftJisDecoder at the top — not the parser.
Fix: set AOZORA_PROFILE_REPEAT=K (or pass K to
just samply-corpus) so the parse pass runs K times after the
load. The xtask defaults to 5; raise to 10+ for very small corpora.
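For a probe of your own, reading a repeat count with a default might look like the sketch below. The variable name and the default of 5 come from the text above; the repeat_count helper and the choice to fall back to the default on a missing, unparsable, or zero value are assumptions, not the xtask's documented behaviour.

```rust
/// Resolve the number of parse passes from an optional raw setting.
/// Falls back to 5 when the value is absent, not a number, or zero (an assumed policy).
fn repeat_count(raw: Option<&str>) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .filter(|&k| k > 0)
        .unwrap_or(5)
}

fn main() {
    let raw = std::env::var("AOZORA_PROFILE_REPEAT").ok();
    let k = repeat_count(raw.as_deref());
    println!("running {k} parse passes after the one-time load");
}
```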
Trace analysis from the CLI
aozora-xtask trace … (and the just trace-* shortcuts) load
saved .json.gz traces, symbolicate them via the aozora-trace
crate (DWARF lookup is pure-Rust through addr2line::Loader), and
run the bundled analyses.
# 1. One-time per trace: write the symbol cache next to it
just trace-cache /tmp/aozora-corpus-<ts>.json.gz
# 2. Analyses (cache is auto-loaded if present)
just trace-libs /tmp/aozora-corpus-<ts>.json.gz # binary vs libc vs vdso
just trace-hot /tmp/aozora-corpus-<ts>.json.gz 25 # top-25 hot leaf frames
just trace-rollup /tmp/aozora-corpus-<ts>.json.gz # bucketed by aozora's built-in categories
just trace-stacks /tmp/aozora-corpus-<ts>.json.gz 'teddy' 5 # full call chains hitting any frame matching `teddy`
just trace-compare /tmp/before.json.gz /tmp/after.json.gz 25 # before/after diff
just trace-flame /tmp/aozora-corpus-<ts>.json.gz | flamegraph.pl > flame.svg
Each analysis returns a typed report — HotReport, LibraryReport,
RollupReport, ComparisonReport, MatchedStacksReport,
FlameReport — whose module docstring explains the algorithm.
Why a pure-Rust DWARF symbolicator?
The mainstream alternative is shelling out to addr2line(1) from
binutils. We don’t because:
- Process spawn cost. A typical trace has 5 000+ unique addresses; spawning addr2line per address is unworkable. Pipelining through a single subprocess works but ties symbolisation to the presence of binutils on PATH (not always true on minimal containers).
- Build-id verification. The aozora-trace::Symbolicator checks the binary’s gnu-build-id against the trace’s codeId so rebuilding between recording and analysis fails loudly rather than producing wrong symbol names. addr2line(1) has no such check.
- Caching. The symbolicator writes a sidecar <trace>.symbols.json on first call (~100 ms per binary) and reads from it on every subsequent call (instant). Re-running addr2line per analysis would re-walk DWARF every time.
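The build-id check and the cache combine naturally: a cache is only trustworthy for the binary it was built from. The sketch below illustrates that idea with a hypothetical SymbolCache type; it does not reflect the real aozora-trace data structures or the sidecar's on-disk format.

```rust
use std::collections::HashMap;

/// Hypothetical symbol cache: address → function name, valid for one build id only.
struct SymbolCache {
    build_id: String,
    symbols: HashMap<u64, String>,
}

impl SymbolCache {
    /// Look up `addr`, refusing to answer when the trace was recorded
    /// against a different build than the one this cache describes.
    fn lookup(&self, trace_code_id: &str, addr: u64) -> Result<Option<&str>, String> {
        if self.build_id != trace_code_id {
            return Err(format!(
                "build-id mismatch: cache {} vs trace {}",
                self.build_id, trace_code_id
            ));
        }
        Ok(self.symbols.get(&addr).map(String::as_str))
    }
}

fn main() {
    let mut symbols = HashMap::new();
    symbols.insert(0x8fb61, "lex_into_arena".to_string());
    let cache = SymbolCache { build_id: "abc123".to_string(), symbols };

    println!("{:?}", cache.lookup("abc123", 0x8fb61)); // same build: resolves
    println!("{:?}", cache.lookup("def456", 0x8fb61)); // rebuilt binary: fails loudly
}
```

Failing on a mismatch is deliberate: a stale cache would otherwise return plausible but wrong names.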
Verifying the SIMD scanner is firing
// In any binary or test
println!("{}", aozora_scan::BackendChoice::detect().name());
// "teddy-avx2" | "teddy-ssse3" | "teddy-neon" | "teddy-wasm" | "scalar-teddy"
Or under samply, look for aozora_scan::arch::x86_64::lead_mask_chunk_avx2
in the trace’s call tree. If the trace shows
aozora_scan::arch::x86_64::lead_mask_chunk_ssse3 instead, the
SSSE3 fallback is firing because the host lacked AVX2;
aozora_scan::kernel::teddy::ScalarTeddyKernel::lead_mask_chunk
indicates the pure-Rust last resort fired.
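If you want to see what runtime feature detection of this kind looks like in isolation, the standard library exposes it directly. The sketch below is not aozora_scan's selection logic: pick_backend is a hypothetical function that reuses the backend names listed above and only covers the x86_64 cases, falling back to the scalar name everywhere else.

```rust
/// Choose a backend name from the CPU features available at runtime.
/// Hypothetical stand-in; only the x86_64 branches are modelled.
fn pick_backend() -> &'static str {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            return "teddy-avx2";
        }
        if std::arch::is_x86_feature_detected!("ssse3") {
            return "teddy-ssse3";
        }
    }
    // Any other architecture, or an x86_64 host with neither feature.
    "scalar-teddy"
}

fn main() {
    println!("{}", pick_backend());
}
```

Detection happens at runtime, so one binary can serve hosts with and without AVX2.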
Workflow recipes
“I changed something, did I regress?”
# Microbench the per-band tokenizer throughput
cargo bench -p aozora-pipeline --bench tokenize_compare
# Macrobench the full pipeline end-to-end
AOZORA_CORPUS_ROOT=… cargo run --release --example throughput_by_class -p aozora-bench
AOZORA_CORPUS_ROOT=… cargo run --release --example render_hot_path -p aozora-bench
# Check the worst doc didn't regress
AOZORA_CORPUS_ROOT=… AOZORA_PROBE_DOC=000286/files/49178_ruby_58807/49178_ruby_58807.txt \
cargo run --release --example pathological_probe -p aozora-bench
“Where is lex_into_arena spending its time?”
# Macroscopic per-phase split
AOZORA_CORPUS_ROOT=… cargo run --release --example phase_breakdown -p aozora-bench
# Latency tail shape
AOZORA_CORPUS_ROOT=… cargo run --release --example latency_histogram -p aozora-bench
# Microscopic: which classify recogniser dominates a specific doc?
AOZORA_CORPUS_ROOT=… AOZORA_PROBE_DOC=… \
cargo run --release --features instrument --example pathological_probe -p aozora-bench
See also
- Benchmarks: the per-probe descriptions.
- Corpus sweeps: corpus setup and AOZORA_* env vars.
Benchmarks (criterion)
aozora ships two layers of perf measurement:
- Criterion microbenchmarks in crates/aozora-pipeline/benches/, crates/aozora-syntax/benches/, crates/aozora-scan/benches/, and crates/aozora-bench/benches/. Reproducible per-function timings with statistical confidence intervals.
- Corpus probes in crates/aozora-bench/examples/. Each probe is a cargo run --release --example <name> binary that reports per-band statistics across a real corpus.
Criterion microbenchmarks
Run a specific bench:
cargo bench -p aozora-pipeline --bench tokenize_compare
cargo bench -p aozora-pipeline --bench classify_kaeriten
cargo bench -p aozora-syntax --bench accent_decompose
cargo bench -p aozora-scan --bench scanner_bakeoff
cargo bench -p aozora-bench --bench crime_and_punishment
cargo bench -p aozora-bench --bench synthetic_corpus
Criterion writes HTML reports under target/criterion/. Each bench
reports throughput in MB/s, ns/byte, and a confidence interval; the
HTML reports include violin plots that surface multi-modal latency
distributions (which often indicate cache-line or page-fault
effects we’d otherwise miss).
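The two throughput units are reciprocal views of the same measurement, which the following sketch spells out. The helper names are hypothetical, MB is taken as 10^6 bytes (an assumption), and the byte-summing loop in main is only a placeholder workload, not a parser benchmark.

```rust
use std::time::{Duration, Instant};

/// Throughput in MB/s, with MB taken as 1_000_000 bytes (an assumption).
fn mb_per_s(bytes: usize, elapsed: Duration) -> f64 {
    (bytes as f64 / 1_000_000.0) / elapsed.as_secs_f64()
}

/// Cost in nanoseconds per input byte.
fn ns_per_byte(bytes: usize, elapsed: Duration) -> f64 {
    elapsed.as_nanos() as f64 / bytes as f64
}

fn main() {
    // Placeholder workload: sum the bytes of a repeated ruby annotation.
    let input = "|青梅《おうめ》".repeat(100_000);
    let start = Instant::now();
    let checksum: u64 = input.bytes().map(u64::from).sum();
    let elapsed = start.elapsed();
    println!(
        "checksum {checksum}: {:.1} MB/s, {:.3} ns/byte",
        mb_per_s(input.len(), elapsed),
        ns_per_byte(input.len(), elapsed)
    );
}
```

With these definitions, 1 ns/byte corresponds to 1000 MB/s, which is a handy sanity check when reading a report.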
Why criterion over #[bench]
Three reasons.
- Statistical rigour. #[bench] reports the minimum of N iterations; criterion fits a model and reports a confidence interval. The minimum is a known-bad estimator on a system with any noise (which is every real machine).
- Iteration count auto-tuning. Criterion picks the iteration count to reach a target precision; #[bench] requires a hand-picked count.
- Stability. #[bench] is unstable Rust, only works on nightly. Criterion is stable Rust.
Corpus probes
Each probe under crates/aozora-bench/examples/ reports a different
slice of the workload. All read AOZORA_CORPUS_ROOT; most accept
AOZORA_PROFILE_LIMIT=N to cap the sweep.
| Probe | Question it answers | Output shape |
|---|---|---|
| throughput_by_class | Per-band MB/s for lex_into_arena | 4-band table + p50 / p90 / p99 / max + ns/byte |
| phase_breakdown | Per-phase ms for sanitize / events / pair / classify | per-doc latencies + top-5 worst classify / sanitize |
| latency_histogram | Log-bucketed latency distribution per phase | bar histogram, 10 buckets, 1 µs … 1 s |
| pathological_probe | Single-doc 100-iter avg per phase | tight per-call numbers; takes AOZORA_PROBE_DOC for any corpus path |
| phase0_breakdown | Per-sub-pass cost inside Phase 0 sanitize | bom_strip / crlf / rule_isolate / accent / pua_scan |
| phase0_impact | Does Phase 0 sub-pass firing change Phase 1 cost? | bucketed by which sub-passes fired |
| phase3_subsystems | Per-recogniser ms inside classify | requires --features instrument |
| diagnostic_distribution | What fraction of docs emit diagnostics? | histogram by diag count; latency-by-diag-bucket |
| allocator_pressure | Arena bytes / source byte ratio + intern dedup | per-doc histograms |
| fused_vs_materialized | Does the deforestation actually win? | per-band gap % between fused (lex_into_arena) and materialized (per-phase collect) |
| intern_dedup_ratio | How well does the interner dedup short strings? | corpus-aggregate (cache + table) / calls |
| render_hot_path | Per-band MB/s for HTML render | 4-band MB/s + render/parse ratio + out/in size ratio |
Each probe is invoked directly:
AOZORA_CORPUS_ROOT=… cargo run --release --example <name> -p aozora-bench
For phase3_subsystems, build with the instrumentation feature:
AOZORA_CORPUS_ROOT=… cargo run --release --features instrument \
--example phase3_subsystems -p aozora-bench
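The output shapes in the probe table (size bands, p50 / p90 / p99, MB/s) reduce to a few small helpers. The sketch below is illustrative only: the band thresholds are invented for the example, and the function names are hypothetical rather than taken from aozora-bench.

```rust
/// Size bands used to split a corpus. The byte thresholds are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Band {
    Small,
    Medium,
    Large,
    Huge,
}

pub fn band_for(len_bytes: usize) -> Band {
    match len_bytes {
        0..=16_383 => Band::Small,
        16_384..=262_143 => Band::Medium,
        262_144..=2_097_151 => Band::Large,
        _ => Band::Huge,
    }
}

/// Nearest-rank percentile over an ascending-sorted slice; `p` is in 0..=100.
pub fn percentile(sorted: &[f64], p: f64) -> Option<f64> {
    if sorted.is_empty() {
        return None;
    }
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    // Rank is 1-based; clamp so p = 0 maps to the first element.
    Some(sorted[rank.clamp(1, sorted.len()) - 1])
}

/// Throughput in MB/s, taking 1 MB as 1_000_000 bytes.
pub fn mb_per_s(bytes: usize, nanos: u64) -> f64 {
    (bytes as f64 / 1e6) / (nanos as f64 / 1e9)
}
```

Reporting a per-band percentile rather than one corpus-wide mean keeps a handful of huge documents from hiding a regression on the many small ones.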
Why corpus probes and criterion benches?
Different questions.
- Criterion answers “is function X faster after my change?” on a fixed input. Microscopic, reproducible, the right tool for optimising a single hot loop.
- Corpus probes answer “is the parser faster on the real Aozora Bunko catalogue after my change?” Macroscopic, includes every distribution effect (small-doc dispatch overhead, large-doc cache pressure, gaiji-density variation). The right tool for validating a perf PR end-to-end.
A perf PR that wins on criterion but loses on the corpus is suspicious — usually it’s optimised the small-input path at the cost of the large-input path. The corpus probe catches it.
Phase 3 instrumentation caveat
The instrument feature wraps every recogniser entry in a
SubsystemGuard that calls Instant::now() on construction and again
on drop. For the dominant inner-loop recognisers this adds enough
overhead that the report’s own timing is significantly skewed.
Use the instrumentation to compare relative costs between
subsystems, not as an absolute number. For absolute numbers, run
phase_breakdown (no instrumentation).
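The guard pattern described above can be sketched as an RAII timer. This is an illustration of the technique, not the crate's code: only the SubsystemGuard name comes from the text; its fields, the TOTALS accumulator, and the recognise_ruby and calls helpers are hypothetical.

```rust
use std::cell::RefCell;
use std::collections::BTreeMap;
use std::time::{Duration, Instant};

thread_local! {
    /// Per-thread totals: subsystem name -> (call count, accumulated time).
    static TOTALS: RefCell<BTreeMap<&'static str, (u64, Duration)>> =
        RefCell::new(BTreeMap::new());
}

/// RAII timer: reads the clock on construction and again on drop.
pub struct SubsystemGuard {
    name: &'static str,
    start: Instant,
}

impl SubsystemGuard {
    pub fn enter(name: &'static str) -> Self {
        Self { name, start: Instant::now() } // first clock read
    }
}

impl Drop for SubsystemGuard {
    fn drop(&mut self) {
        let elapsed = self.start.elapsed(); // second clock read
        TOTALS.with(|totals| {
            let mut map = totals.borrow_mut();
            let entry = map.entry(self.name).or_insert((0, Duration::ZERO));
            entry.0 += 1;
            entry.1 += elapsed;
        });
    }
}

/// A stand-in recogniser: the guard covers the whole function body.
pub fn recognise_ruby(input: &str) -> usize {
    let _guard = SubsystemGuard::enter("ruby");
    input.matches('《').count()
}

/// Number of recorded calls for a subsystem on the current thread.
pub fn calls(name: &str) -> u64 {
    TOTALS.with(|totals| totals.borrow().get(name).map_or(0, |e| e.0))
}
```

Two clock reads plus a map update per call is cheap in isolation, but a recogniser whose body is only a few instructions long ends up measuring mostly the guard. That is why the figures are only meaningful relative to each other.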
Where to look in samply
If a corpus probe regresses, sample-profile the same workload:
AOZORA_CORPUS_ROOT=… just samply-corpus 5
samply load /tmp/aozora-corpus-<ts>.json.gz
# or
just trace-rollup /tmp/aozora-corpus-<ts>.json.gz
The trace-rollup analysis groups samples into aozora’s built-in
categories (Phase 0/1/2/3 + corpus_load + intern + alloc + …) so a
regression’s category jumps out at a glance.
See also
- Profiling with samply: the trace workflow.
- Corpus sweeps: what AOZORA_CORPUS_ROOT should point at.
- Release profile & PGO: the build profile that produces these numbers.
Corpus sweeps
aozora’s tier-A acceptance gate is a corpus sweep: every Aozora
Bunko work parses without panicking, and the
parse ∘ serialize ∘ parse round-trip is stable. The corpus has
~17 000 works in active rotation; sweeping the lot takes ~90 s on a
modern x86_64 desktop.
Setting up the corpus
AOZORA_CORPUS_ROOT should point at a directory containing the
unpacked Aozora Bunko tarball:
$AOZORA_CORPUS_ROOT/
├── 000001/
│ └── files/
│ └── 18310_ruby_01058/
│ └── 18310_ruby_01058.txt ← Shift_JIS .txt source
├── 000002/
│ └── files/
│ └── …
└── …
The structure mirrors the upstream aozorabunko repo. Set the env var once in your shell:
export AOZORA_CORPUS_ROOT=/path/to/aozorabunko
Every probe, every sample-profile recipe, and the corpus sweep test suite reads it.
Running the sweep
just corpus-sweep
The recipe wraps the aozora-corpus crate’s ParallelSweep runner. It iterates
over every .txt file under $AOZORA_CORPUS_ROOT, parses it, and verifies:
- No panic.
- tree.diagnostics() count is within an expected envelope.
- parse(serialize(parse(source))) == parse(source) (round-trip property).
- Render emits valid UTF-8 HTML (no broken byte sequences).
On failure, it prints the offending document path and diagnostic, then exits non-zero.
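The round-trip check in the list above is a generic property, so it can be sketched independently of the real parser. The toy parse and serialize functions below are stand-ins invented for the example; they are not aozora's API.

```rust
/// Returns true when `parse(serialize(parse(src))) == parse(src)`.
pub fn round_trip_stable<T, P, S>(src: &str, parse: P, serialize: S) -> bool
where
    T: PartialEq,
    P: Fn(&str) -> T,
    S: Fn(&T) -> String,
{
    let first = parse(src);
    let second = parse(&serialize(&first));
    first == second
}

/// Toy parser (hypothetical): one trimmed entry per non-empty line.
pub fn toy_parse(src: &str) -> Vec<String> {
    src.lines()
        .map(str::trim)
        .filter(|l| !l.is_empty())
        .map(str::to_owned)
        .collect()
}

/// Toy serializer (hypothetical): the canonical form joins entries with LF.
pub fn toy_serialize(tree: &Vec<String>) -> String {
    tree.join("\n")
}
```

Note that the property compares trees, not bytes: the serializer is free to canonicalise whitespace or line endings as long as reparsing yields the same tree.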
Why blake3 / zstd for the archive variant?
aozora-corpus ships an archive mode: the corpus packed into a
single .zst file with a blake3 manifest. This is what CI uses
(the corpus is downloaded once per workflow run and unpacked
in-memory).
- blake3 for per-entry content-addressed hashing. Used so the archive packer can detect “this work hasn’t changed since the last build” and skip re-encoding it. blake3 over sha256: ~10× faster on the same data, no security trade-off for our use case (we’re not signing anything, just diffing).
- zstd for compression. Frame-level random access matters because the ParallelSweep runner wants to mmap individual works on demand without decompressing the whole archive. zstd over gzip / xz: 5–10× faster decompression at comparable ratios.
Both crates are mainstream pure-Rust APIs (the underlying libzstd
is C, but the boundary is hidden behind the zstd crate’s safe API).
Why parallel sweep?
A serial sweep runs through every work one at a time, so its
wall-clock time is the sum of the per-doc parse times; on a 16-core
machine that is up to 16× slower than the hardware allows. The
ParallelSweep runner uses rayon to parse documents in parallel,
sized to physical cores via num_cpus::get_physical(), not
logical cores.
The reason is memory bandwidth. The parser is bandwidth-bound, not ALU-bound (the SIMD scanner streams the source through L1 once per trigger byte, then the lexer touches each token a few more times). SMT siblings starve each other for cache lines and bus bandwidth, so oversubscribing logical cores actively slows the sweep. Sized to physical, the throughput peaks where the bandwidth ceiling does.
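As a rough illustration of the scheduling idea (a fixed worker count, largest documents first), here is a sketch that uses only std::thread::scope instead of rayon. It is a swapped-in technique, not the ParallelSweep implementation; the function name and signature are hypothetical, and the caller supplies the worker count because std::thread::available_parallelism reports logical rather than physical parallelism.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Run `work` over `docs` on `workers` threads, largest documents first.
/// Returns one result per document, in the original input order.
pub fn parallel_sweep<R, F>(docs: &[String], workers: usize, work: F) -> Vec<R>
where
    R: Send,
    F: Fn(&str) -> R + Sync,
{
    // Largest first, so the long jobs start early and the tail stays short.
    let mut order: Vec<usize> = (0..docs.len()).collect();
    order.sort_by_key(|&i| std::cmp::Reverse(docs[i].len()));

    let next = AtomicUsize::new(0);
    let mut indexed: Vec<(usize, R)> = thread::scope(|scope| {
        let handles: Vec<_> = (0..workers.max(1))
            .map(|_| {
                scope.spawn(|| {
                    let mut local = Vec::new();
                    loop {
                        // Claim the next unprocessed slot in the shared queue.
                        let slot = next.fetch_add(1, Ordering::Relaxed);
                        let Some(&i) = order.get(slot) else { break };
                        local.push((i, work(docs[i].as_str())));
                    }
                    local
                })
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker panicked"))
            .collect()
    });
    // Restore input order so callers can zip results with `docs`.
    indexed.sort_by_key(|&(i, _)| i);
    indexed.into_iter().map(|(_, r)| r).collect()
}
```

Passing the physical core count as workers reproduces the sizing rule described above; passing the logical count is the oversubscribed case the text warns against.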
posix_fadvise(POSIX_FADV_DONTNEED) for honest cold-cache numbers
The xtask corpus uncache command evicts every corpus file from
the kernel page cache before a measurement run:
cargo run -p aozora-xtask --release -- corpus uncache
It uses posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) per file —
no sudo required (unlike echo 1 > /proc/sys/vm/drop_caches, which
needs root and drops every cache, defeating the purpose).
Why this matters: a “fresh” benchmark run that finds the corpus
already warm in the page cache reports throughput numbers that no
cold start can ever achieve. The uncache step makes “cold
benchmark” a real, repeatable thing.
Probes that go corpus-wide
| Probe | What |
|---|---|
| throughput_by_class | Per-band MB/s for lex_into_arena. Splits the corpus by document size (small / medium / large / huge). |
| phase_breakdown | Per-phase ms per doc. |
| latency_histogram | Log-bucketed latency distribution per phase. |
| diagnostic_distribution | What fraction of docs emit diagnostics? Histogram by diag count. |
| allocator_pressure | Arena bytes / source byte ratio + intern dedup ratio. |
| render_hot_path | Per-band render MB/s. |
See Benchmarks for the full list.
Why a dedicated aozora-corpus crate?
Three concerns kept apart from aozora-bench:
- Corpus discovery and loading. Walking the directory, decoding Shift_JIS, applying any per-work filters. This is shared by every probe + by the xtask corpus pack/unpack tooling.
- Archive format. The blake3 + zstd packing/unpacking lives here so the bench harness doesn’t pull in compression libraries.
- Parallel sweep runner. A reusable rayon::par_iter wrapper with the right ordering (largest documents first to balance load).
aozora-bench then builds on this — each probe is a thin
for doc in corpus { measure(doc) } loop, with the corpus crate
handling all the I/O.
Why a separate AOZORA_PROFILE_REPEAT?
samply traces of probes that include corpus loading get dominated
by I/O and Shift_JIS decode (see
Profiling with samply).
Running the parse pass K times per document after the one-time
load gives samply enough parse-bound wall time to catch the
parser hot frames. Default K = 5; raise to 10+ for very small
corpora.
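A minimal sketch of the load-once, parse-K-times shape described above. The default of 5 comes from the text; how the real recipes treat empty, zero, or non-numeric values is not documented here, so the fallback behaviour is an assumption, and both function names are hypothetical.

```rust
/// Parse a repeat count from the raw variable value. Missing, non-numeric,
/// or zero values fall back to the default of 5 (fallback is an assumption).
pub fn repeat_count(raw: Option<&str>) -> usize {
    const DEFAULT_REPEAT: usize = 5;
    raw.and_then(|v| v.trim().parse::<usize>().ok())
        .filter(|&k| k > 0)
        .unwrap_or(DEFAULT_REPEAT)
}

/// Documents are loaded once by the caller; the parse pass then runs
/// `repeat` times per document so parser frames dominate the trace.
pub fn profile_pass<F: FnMut(&str)>(docs: &[String], repeat: usize, mut parse: F) {
    for doc in docs {
        for _ in 0..repeat {
            parse(doc.as_str());
        }
    }
}
```

A caller would typically feed it std::env::var("AOZORA_PROFILE_REPEAT").ok().as_deref().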
See also
- Benchmarks — the per-probe descriptions.
- Profiling with samply — the trace workflow.
Phase D — Sentinel enum + single-table registry
The single-table registry collapsed four per-kind sentinel position
tables into one position-keyed EytzingerMap dispatched through a
NodeRef enum. Before the refactor the registry held independent
inline / block_leaf / block_open / block_close
EytzingerMaps and Registry::node_at(pos) swept them in
declaration order with four if let Some(...) = table.get(&pos)
chains; the current shape is one binary search per lookup, with the
variant tag carried on the entry itself.
Structural changes
old : Registry { inline, block_leaf, block_open, block_close } // 4× EytzingerMap
node_at(pos) → 4-way if-let chain, ~4 binary searches worst-case
now : Registry { table: EytzingerMap<u32, NodeRef<'src>> } // 1× EytzingerMap
node_at(pos) → one binary search, NodeRef variant tags the kind
Renderers (crates/aozora-render/src/html.rs,
crates/aozora-render/src/serialize.rs) replaced the parallel
4-way if let Some(...) = registry.<kind>.get(...) chains with
a single (Structural, NodeRef) cross-product match — the
compiler now enforces variant coverage at the call site.
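The single-table shape can be sketched as follows. To stay self-contained, a sorted Vec with a standard binary search stands in for EytzingerMap, and the u32 payloads are placeholders (the real NodeRef carries a 'src lifetime); only the four variant names are taken from the former table names.

```rust
/// Kind tag carried on each entry. Payloads are hypothetical placeholders.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NodeRef {
    Inline(u32),
    BlockLeaf(u32),
    BlockOpen(u32),
    BlockClose(u32),
}

/// One position-keyed table, kept sorted by position.
pub struct Registry {
    table: Vec<(u32, NodeRef)>,
}

impl Registry {
    pub fn new(mut entries: Vec<(u32, NodeRef)>) -> Self {
        entries.sort_by_key(|&(pos, _)| pos);
        Self { table: entries }
    }

    /// One binary search per lookup; the variant says which kind was hit.
    pub fn node_at(&self, pos: u32) -> Option<NodeRef> {
        self.table
            .binary_search_by_key(&pos, |&(p, _)| p)
            .ok()
            .map(|i| self.table[i].1)
    }
}

/// Exhaustive match: adding a variant fails to compile until it is handled.
pub fn describe(node: NodeRef) -> &'static str {
    match node {
        NodeRef::Inline(_) => "inline",
        NodeRef::BlockLeaf(_) => "block_leaf",
        NodeRef::BlockOpen(_) => "block_open",
        NodeRef::BlockClose(_) => "block_close",
    }
}
```

The match in describe is the point of the refactor at call sites: with four separate tables a forgotten branch is a silent miss, whereas with one enum it is a compile error.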
Expected runtime impact
Theoretical: per-lookup binary search count drops from ≤ 4 to 1.
Render hot path is dominated by registry lookups inside the
memchr2_iter loop in html::render_into (one lookup per PUA
sentinel hit), so the savings scale with sentinel density. Aozora
corpus profiling against the four-table layout showed registry
lookups at ~12 % of render time on bouten-heavy documents; the
unified dispatch should absorb roughly that fraction.
Measuring before / after
The repro recipe lives in perf/samply.md.
Numerical comparisons against the previous release are produced as
release-PR artefacts (the corpus-sweep run output in
/tmp/aozora-corpus-<timestamp>.json.gz, plus the diff produced by
xtask trace compare) and summarised in the CHANGELOG entry for
the release that lands the change. Pinned numbers in this page
would rot; the recipe + per-release artefact pair stays current
without an editing step here.
CLI reference
Full reference for the aozora binary. For a guided tour, see
CLI Quickstart.
Synopsis
aozora <SUBCOMMAND> [OPTIONS] [ARGS]
| Subcommand | What it does |
|---|---|
| check | Lex + report diagnostics. |
| fmt | Round-trip parse ∘ serialize (canonicalise). |
| render | Render to HTML on stdout. |
| pandoc | Project to a Pandoc AST (JSON, or pipe through pandoc). |
| kinds | Tabulate every NodeKind / PairKind / Severity / … wire tag. |
| schema | Print the JSON Schema for a wire envelope. |
| explain | Print short prose for a NodeKind tag. |
There are no global options beyond clap’s -h/--help and
-V/--version; the input-shaping flags below are per-subcommand. All
document subcommands accept - (or no path) to read stdin.
| Common flag | Subcommands | Effect |
|---|---|---|
| -E, --encoding {auto,utf8,sjis} | check / fmt / render / pandoc | Source encoding. Default auto: UTF-8 if the bytes are valid UTF-8, else Shift_JIS. |
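The auto rule in the table is simple enough to sketch directly with the standard library. The Encoding enum and detect function are hypothetical names, not the CLI's internals.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Encoding {
    Utf8,
    ShiftJis,
}

/// The `auto` rule as described: UTF-8 if the bytes validate, else Shift_JIS.
pub fn detect(bytes: &[u8]) -> Encoding {
    if std::str::from_utf8(bytes).is_ok() {
        Encoding::Utf8
    } else {
        Encoding::ShiftJis
    }
}
```

Pure-ASCII input validates as UTF-8, so it takes the UTF-8 path; any byte sequence that fails UTF-8 validation falls through to Shift_JIS.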
Colour follows the terminal and the NO_COLOR environment variable
(miette honours it); there is no --no-color flag.
aozora check
aozora check [OPTIONS] [PATH]
Lex the source and report diagnostics. PATH of - (or omitted) reads
from stdin.
| Option | Effect |
|---|---|
| --strict, -s | Exit non-zero (1) on any diagnostic. |
| --encoding, -E | Source encoding (see above). |
| --diagnostic-format {human,json,short} | How to render diagnostics. Default auto: human when stderr is a terminal, json when piped. |
The three formats:
- human: a graphical miette report showing the source line, a caret under the span, the label, the help, and a link to the diagnostics catalogue.
- json: the aozora::wire diagnostics envelope, byte-identical to every other binding. The machine / agent path (the default when piped).
- short: one grep-able line, path:offset: severity[code]: msg.
Exit codes: 0 (parse succeeded; diagnostics may have been printed but
were tolerated), 1 (--strict and at least one diagnostic), 2 (usage
error), 3 (an Internal-source diagnostic fired — a library bug, not
bad input; please report it).
aozora check src.txt # human on a TTY, json when piped
aozora check --strict src.txt # any diagnostic -> exit 1
aozora check -E sjis crime.txt # Shift_JIS source
aozora check --diagnostic-format short - # one line per diagnostic, from stdin
cat src.txt | aozora check # json envelope (stderr is piped)
aozora fmt
aozora fmt [OPTIONS] [PATH]
Round-trip the source through parse ∘ serialize. Default prints the
canonical form on stdout.
| Option | Effect |
|---|---|
| --check | Exit non-zero if the formatted output differs from the input (after Phase 0 sanitize: BOM strip, CRLF→LF). Mutually exclusive with --write. |
| --write | Overwrite the input file with the canonical form. Ignored when reading from stdin. |
| --encoding, -E | Source encoding (see above). |
Exit codes: 0 (success, or no diff under --check), 1 (formatting
mismatch under --check), 2 (usage error).
aozora fmt src.txt > formatted.txt
aozora fmt --check src.txt # CI gate
aozora fmt --write src.txt # in-place
cat src.txt | aozora fmt # stdin -> stdout
aozora render
aozora render [OPTIONS] [PATH]
Render the parsed tree to HTML on stdout. Accepts --encoding/-E.
aozora render src.txt > out.html
aozora render -E sjis crime.txt > crime.html
cat src.txt | aozora render -
The output is semantic HTML5 with aozora-* class hooks (no inline
styles). See HTML renderer for
the class-name reference.
aozora pandoc
aozora pandoc [OPTIONS] [PATH]
Project the parsed document to a Pandoc AST. Without --format/-t,
prints Pandoc JSON to stdout (consumable by pandoc -f json -t …); with
--format, spawns pandoc and pipes the JSON through it. Accepts
--encoding/-E.
aozora pandoc src.txt | pandoc -f json -t epub3 -o out.epub
aozora pandoc src.txt -t latex > src.tex # spawns pandoc directly
See Bindings → Pandoc.
Introspection subcommands
kinds, schema {diagnostics|nodes|pairs|container-pairs}, and
explain <tag> print typed contracts and need no input file. They back
the drift-gated wire artefacts; see Wire format.
Exit codes
| Code | Meaning |
|---|---|
| 0 | Success. |
| 1 | Diagnostics under --strict, or a formatting mismatch under fmt --check, or a spawned tool (pandoc) exited non-zero. |
| 2 | Usage error (bad flag, unreadable file, decode failure). |
| 3 | An Internal-source diagnostic fired during check (a library bug). |
Environment
| Variable | Effect |
|---|---|
| NO_COLOR | If set (any value), disable ANSI colour in diagnostics output. |
| AOZORA_LOG | tracing-subscriber filter (e.g. aozora_pipeline=debug). Internal debugging; not part of the stable surface. |
See Reference → Environment variables for the full matrix.
See also
- CLI Quickstart — examples and the subcommand rationale.
- Notation overview — what the parser recognises.
- Diagnostics catalogue: the codes you’ll see in check’s output and how --diagnostic-format renders them.
API reference (rustdoc)
The full rustdoc surface for every crate in the workspace is auto-deployed alongside this handbook, under /api/ on this site.
The landing page redirects to the top-level facade (aozora); from there
every workspace crate is reachable via the side panel.
Why /api/ instead of docs.rs?
aozora is on crates.io (since v0.4.1), so
docs.rs/aozora hosts the released API
reference. We also build and deploy the full rustdoc under /api/
on every main push: the in-tree copy tracks the development tip —
ahead of whatever the latest crates.io release renders on docs.rs —
and presents the umbrella plus every building-block crate as one
cross-linked set.
Read docs.rs for the version you depend on; use the /api/ mirror
here when you need unreleased main.
Layout
| Path | What |
|---|---|
| /aozora/ (this site) | Handbook (this mdbook) |
| /aozora/api/aozora/ | Public facade crate |
| /aozora/api/aozora_pipeline/ | Four-phase lexer + lex_into_arena orchestrator |
| /aozora/api/aozora_render/ | HTML / serialise renderers |
| /aozora/api/aozora_syntax/ | AST node types |
| /aozora/api/aozora_spec/ | Shared types + SLUGS dispatch table |
| /aozora/api/aozora_scan/ | SIMD trigger scanner |
| /aozora/api/aozora_veb/ | Eytzinger sorted-set |
| /aozora/api/aozora_encoding/ | SJIS + 外字 |
| /aozora/api/aozora_cst/ | rowan-backed lossless CST |
| /aozora/api/aozora_query/ | tree-sitter-flavoured pattern DSL |
| /aozora/api/aozora_pandoc/ | Pandoc AST projection |
| /aozora/api/aozora_cli/ | CLI binary internals |
| /aozora/api/aozora_ffi/ | C ABI driver |
| /aozora/api/aozora_wasm/ | WASM driver |
| /aozora/api/aozora_py/ | Python binding |
| /aozora/api/aozora_bench/ | Bench probes |
| /aozora/api/aozora_conformance/ | Conformance fixture runner |
| /aozora/api/aozora_corpus/ | Corpus runner |
| /aozora/api/aozora_proptest/ | Proptest strategies |
| /aozora/api/aozora_trace/ | Samply trace loader |
| /aozora/api/aozora_xtask/ | Dev tooling |
Doc-link discipline
The workspace [workspace.lints.rustdoc] block denies every
documentation lint:
- broken_intra_doc_links = "deny": every [name] link in a doc comment must resolve.
- private_intra_doc_links = "deny": links to pub(crate) items flagged so the public docs don’t dangle into private structures.
- invalid_codeblock_attributes = "deny": typos in ```rust,no_run style attributes get caught.
- invalid_html_tags = "deny": accidental <foo> in prose flagged.
- invalid_rust_codeblocks = "deny": every ```rust block must parse as Rust.
- bare_urls = "deny": links must be <https://...> or [label](url), not bare URLs (which markdown parses inconsistently).
- redundant_explicit_links = "deny": [x](x) where the autolink form would do.
- unescaped_backticks = "deny": stray backticks flagged.
Every workspace-internal pub item that lands in rustdoc is
verified by cargo doc --workspace --no-deps running with
RUSTDOCFLAGS="-D warnings".
Local rustdoc build
just doc # workspace-wide rustdoc (no deps)
just doc-open # rustdoc + open in default browser
Both run inside the dev container; output lands at
target/doc/aozora/index.html.
Building this handbook
just book-build # render to crates/aozora-book/book/
just book-serve # live-preview at localhost:3000
just book-linkcheck # lychee link verification
See Contributing → Development loop for the full toolchain.
Environment variables
A central reference for every env var aozora reads. Variables fall into three groups: parser configuration, dev / bench harness, and container plumbing.
Parser configuration
| Variable | Read by | Effect |
|---|---|---|
| NO_COLOR | aozora-cli | If set (any value), disable ANSI colour output. There is no --no-color flag; the variable is the only switch. Standard convention from https://no-color.org. |
| AOZORA_LOG | aozora-cli, library opt-in | tracing-subscriber filter directive (e.g. aozora_pipeline=debug,aozora_render=info). For internal debugging; not part of the stable surface. |
Dev / bench harness
| Variable | Read by | Effect |
|---|---|---|
| AOZORA_CORPUS_ROOT | aozora-corpus, every probe, every sample-profile recipe, the corpus sweep | Directory of 青空文庫 source files (UTF-8 or Shift_JIS). Required for any corpus-driven operation. |
| AOZORA_PROFILE_LIMIT | aozora-bench probes | Cap the number of corpus documents per probe. Useful for fast iteration; set to 100 for a sub-second sweep. |
| AOZORA_PROFILE_REPEAT | samply-corpus, samply-render | Number of parse / render passes per document after the one-time corpus load. Default 5; raise to give samply enough parser-bound wall time to attach to. |
| AOZORA_PROBE_DOC | pathological_probe | Single corpus path to probe in tight per-call mode. Path is relative to $AOZORA_CORPUS_ROOT. |
| AOZORA_PROPTEST_CASES | aozora-proptest::config | Override default proptest case count (default 128 per block). 4096 for just prop-deep. |
Container plumbing
These are set by docker-compose.yml and don’t need manual handling
unless you’re invoking cargo directly outside the dev container.
| Variable | Set by | Purpose |
|---|---|---|
| CARGO_HOME | compose | /workspace/.cargo: registry + git deps cached on a named volume. |
| CARGO_TARGET_DIR | compose | /workspace/target: build output cached on a named volume. |
| RUSTC_WRAPPER | compose | sccache: compile cache. |
| SCCACHE_DIR | compose | /workspace/.sccache: sccache backing store on a named volume. |
| SCCACHE_CACHE_SIZE | compose | 10G: default cap. |
| CARGO_INCREMENTAL | compose | 0: incremental compile defeats sccache; turning it off lets sccache cache the very crates we build most often. |
| RUST_BACKTRACE | compose | 1: full backtraces on panic. |
| GIT_CONFIG_* | compose | Whitelists /workspace for git’s “dubious ownership” check (the bind-mounted host source is owned by a non-root UID; the container runs as root). |
Variables we deliberately do not read
A few standard variables aozora intentionally ignores:
| Variable | Why ignored |
|---|---|
| LANG / LC_ALL | aozora handles its own encoding via --encoding. Locale-driven byte interpretation would make the parser non-reproducible across machines. |
| RUSTFLAGS (in non-build context) | The release / bench / PGO profiles set their own flags; per-invocation RUSTFLAGS would defeat sccache hits for unrelated crates. |
| CARGO_BUILD_JOBS | Cargo’s default (CPU count) is what we want. Overriding usually fights the bench harness’s own parallelism control. |
See also
- CLI reference → Environment — the CLI’s per-invocation env.
- Performance → Corpus sweeps: the AOZORA_CORPUS_ROOT setup.
- Performance → Profiling with samply: the AOZORA_PROFILE_REPEAT knob.
Conformance suite
aozora ships a WPT-style conformance corpus so other implementations of the Aozora Bunko notation (the tree-sitter reference grammar, third-party ports, alternate parsers in other languages) can measure their adherence against the same set of cases the Rust parser is held to.
Tier model
| Level | Meaning | Effect on xtask conformance run |
|---|---|---|
| must | Required for any conforming implementation. | A failure here exits non-zero. |
| should | Recommended but not strictly required. | A failure here logs a warning. |
| may | Optional; implementations decide. | Pure information; never fails. |
The tier is declared per case in
crates/aozora-conformance/fixtures/render/<case>/meta.toml
alongside a feature tag (ruby, bouten, composite, recovery,
…). The runner aggregates pass / fail counts by (feature, level).
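The tier rules and the (feature, level) aggregation can be sketched in a few lines. This is an illustration under assumptions, not the runner's code: the type names are hypothetical, and the exact non-zero exit value (1 here) is a guess.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Level {
    Must,
    Should,
    May,
}

#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct Tally {
    pub pass: u32,
    pub fail: u32,
}

pub struct CaseResult {
    pub feature: &'static str,
    pub level: Level,
    pub passed: bool,
}

/// Pass / fail counts keyed by (feature, level).
pub fn aggregate(results: &[CaseResult]) -> BTreeMap<(&'static str, Level), Tally> {
    let mut out = BTreeMap::new();
    for r in results {
        let tally: &mut Tally = out.entry((r.feature, r.level)).or_default();
        if r.passed {
            tally.pass += 1;
        } else {
            tally.fail += 1;
        }
    }
    out
}

/// Only a failing `must` case makes the run non-zero; `should` and `may`
/// failures are reported but do not change the exit status.
pub fn exit_code(results: &[CaseResult]) -> i32 {
    let must_failed = results.iter().any(|r| r.level == Level::Must && !r.passed);
    if must_failed { 1 } else { 0 }
}
```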
Running
just conformance # full suite, exits non-zero on must-fail
just render-gate # the byte-identical render gate, K3-style
xtask conformance run # invoke the runner directly
A successful run also writes
crates/aozora-book/src/conformance-results.json with per-case
detail. The JSON shape is stable; downstream dashboards / shields
parse it.
What gets compared
The runner pins six axes per fixture:
1. tree.to_html() byte-identical to expected.html.
2. tree.serialize() byte-identical to expected.serialize.txt.
3. aozora::wire::serialize_diagnostics(tree.diagnostics()) byte-identical to expected.diagnostics.json.
4. aozora::wire::serialize_nodes(&tree) byte-identical to expected.nodes.json.
5. aozora::wire::serialize_pairs(&tree) byte-identical to expected.pairs.json.
6. aozora::wire::serialize_container_pairs(&tree) byte-identical to expected.container_pairs.json.
Axes 1–2 anchor the human-readable surface; axes 3–6 pin the JSON projections that drivers (FFI / WASM / PyO3) consume in production, so a regression that survives the renderer gate but breaks a wire client lights up here.
All six goldens regenerate via
UPDATE_GOLDEN=1 cargo test -p aozora-conformance --test render_gate
after intentional output changes.
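A byte-identical golden gate with a regenerate switch usually has the shape below. This is a generic sketch of the technique, not the render_gate test itself; check_golden is a hypothetical helper, and reading UPDATE_GOLDEN via std::env::var_os is an assumption about how the flag is consumed.

```rust
use std::fs;
use std::path::Path;

/// Compare `actual` with the golden file at `path`, byte for byte.
/// With `update` set, rewrite the golden instead of comparing.
pub fn check_golden(path: &Path, actual: &[u8], update: bool) -> Result<(), String> {
    if update {
        return fs::write(path, actual)
            .map_err(|e| format!("write {}: {e}", path.display()));
    }
    let expected = fs::read(path).map_err(|e| format!("read {}: {e}", path.display()))?;
    if expected == actual {
        Ok(())
    } else {
        // Report the first differing offset to make the failure actionable.
        let at = expected
            .iter()
            .zip(actual)
            .position(|(a, b)| a != b)
            .unwrap_or_else(|| expected.len().min(actual.len()));
        Err(format!("{} differs at byte {at}", path.display()))
    }
}
```

A test would call it once per axis, passing std::env::var_os("UPDATE_GOLDEN").is_some() as the update flag.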
Implementations
The runner currently targets a single implementation: the Rust
parser itself. The conformance-results.json format carries an implementation
field so external runs can append their own results without
disturbing the canonical Rust pass-rate.
See also
- Architecture → Error recovery: what the parser does after each diagnostic fires; the recovery-feature fixtures pin those semantics.
- Node reference: per-NodeKind documentation.
AST query DSL
A tree-sitter-flavoured pattern DSL selects nodes / tokens from the
concrete syntax tree. Editor surfaces (LSP
textDocument/documentHighlight, “find all ruby annotations”,
refactoring filters, syntax-aware search) compose against the DSL
instead of re-implementing tree walks.
The DSL ships behind the query Cargo feature on the aozora
crate; that feature also enables cst since queries run against
SyntaxNode.
Quickstart
use aozora::Document;
use aozora::query::compile;
let doc = Document::new("|青梅《おうめ》と|青空《あおぞら》");
let cst = aozora::cst::from_tree(&doc.parse());
let query = compile("(Construct @ruby)").expect("compile");
for capture in query.captures(&cst) {
println!("{} -> {:?}", capture.name, capture.node);
}
Grammar
query := pattern ('\n' pattern)* '\n'?
pattern := '(' kind capture? ')'
| '(' '_' capture? ')'
kind := SyntaxKind ident // e.g. `Construct`, `Container`
capture := '@' ident
ident := [A-Za-z_][A-Za-z0-9_-]*
- (Construct): match every Construct node.
- (Construct @ruby): capture each Construct under the name ruby.
- (_): match any kind (node or token).
- (_ @any): combined; tour every kind in preorder.
- Multiple patterns separated by newlines run as an OR: every matching node yields one Capture per pattern that hits.
Execution model
The DSL compiles once into a Vec<Pattern>; the engine then tests
every pattern at every preorder step (O(nodes × patterns)). The
small capture-only surface keeps the implementation tight while the
predicate / field-access / alternation extensions wait for a
concrete consumer ask.
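The compile-once, test-every-pattern-at-every-step model can be illustrated with a toy engine over a flat preorder list of kind names. It is a sketch of the execution model only: the Pattern fields, the error strings, and the slice-of-names input are hypothetical simplifications of the real SyntaxNode-based API, and identifier characters are not validated.

```rust
/// One compiled pattern: `kind == None` is the `_` wildcard.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Pattern {
    pub kind: Option<String>,
    pub capture: Option<String>,
}

/// Compile newline-separated `(kind)` / `(kind @capture)` patterns.
pub fn compile(src: &str) -> Result<Vec<Pattern>, String> {
    let mut patterns = Vec::new();
    for line in src.lines().map(str::trim).filter(|l| !l.is_empty()) {
        let inner = line
            .strip_prefix('(')
            .and_then(|s| s.strip_suffix(')'))
            .ok_or_else(|| format!("expected (...) in {line:?}"))?;
        let mut parts = inner.split_whitespace();
        let kind = parts.next().ok_or_else(|| format!("missing kind in {line:?}"))?;
        let capture = match parts.next() {
            None => None,
            Some(tok) => {
                let name = tok.strip_prefix('@').filter(|n| !n.is_empty());
                Some(name.ok_or_else(|| format!("expected @name in {line:?}"))?.to_owned())
            }
        };
        if parts.next().is_some() {
            return Err(format!("trailing tokens in {line:?}"));
        }
        let kind = if kind == "_" { None } else { Some(kind.to_owned()) };
        patterns.push(Pattern { kind, capture });
    }
    Ok(patterns)
}

/// Test every pattern at every preorder step: O(nodes × patterns).
/// Yields (node index, capture name) for each hit.
pub fn captures<'p>(
    preorder_kinds: &[&str],
    patterns: &'p [Pattern],
) -> Vec<(usize, Option<&'p str>)> {
    let mut hits = Vec::new();
    for (i, kind) in preorder_kinds.iter().enumerate() {
        for p in patterns {
            if p.kind.as_deref().map_or(true, |k| k == *kind) {
                hits.push((i, p.capture.as_deref()));
            }
        }
    }
    hits
}
```

Because patterns are OR-ed, a node that satisfies two patterns appears twice in the result, once per pattern, which matches the one-Capture-per-hit rule in the grammar notes.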
Not yet supported
- Predicates (#eq?, #match?): the tree-sitter query language exposes per-capture filters. The DSL ships without them; consumers filter the resulting Capture vec in Rust.
- Field accessors ((Container body: (Construct))): the CST has no named fields yet.
- Quantifiers ((...)?, (...)*, (...)+).
- Alternation [...] between patterns.
These extensions are forward-compatible with the existing API
shape (compile → captures); a future release can land them
without breaking existing queries.
Cross-references
- Architecture → Concrete syntax tree — the CST the DSL queries.
- Node reference: NodeKind / SyntaxKind documentation.
Wire format
aozora ships a stable JSON wire format used by every binding —
aozora-ffi (C ABI), aozora-wasm (npm), aozora-py (PyO3) —
to project the parser’s output across language boundaries.
aozora::wire
is the single authority for that projection; downstream drivers
call into it and receive bit-identical output.
Envelope shape
Every wire JSON has the form
{ "schema_version": 1, "data": [ /* … entries … */ ] }
where schema_version is the major version of the wire contract and
data is the per-endpoint payload array.
The four endpoint envelopes are:
| Endpoint | Entry shape | JSON Schema |
|---|---|---|
| serialize_diagnostics | { kind, severity, source, span, codepoint? } | schema-diagnostics.json |
| serialize_nodes | { kind, span: { start, end } } | schema-nodes.json |
| serialize_pairs | { kind, open: { start, end }, close: { … } } | schema-pairs.json |
| serialize_container_pairs | { kind, open: { offset }, close: { offset } } | schema-container-pairs.json |
SCHEMA_VERSION
The schema_version integer (aozora::wire::SCHEMA_VERSION)
bumps on any breaking change to the serialised shape: variant
additions that surface as a new kind value, field renames, envelope
restructuring. Clients should branch on the version and handle
unknown values defensively; schema 1 makes no forward-compatibility
guarantees with later schemas.
Stability vs. non_exhaustive
Diagnostic
and AozoraNode
are #[non_exhaustive] — minor releases can add variants. The wire
format protects callers in two ways:
- Unrecognised variants emit kind: "unknown" rather than failing to serialise, so an old client never sees parse-time data loss.
- SCHEMA_VERSION bumps when new variants ship in the wire surface, giving version-branching clients a chance to react before "unknown" shows up in production traffic.
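On the client side, the two protections above translate into a version check followed by a defensive tag mapping. The sketch below assumes the envelope has already been decoded from JSON; the Envelope and Kind types are hypothetical, and the "ruby" and "bouten" tag strings are placeholders rather than confirmed wire tags.

```rust
/// The schema version this hypothetical client was written against.
pub const SUPPORTED_SCHEMA: u32 = 1;

/// An already-decoded envelope; JSON decoding itself is out of scope here.
pub struct Envelope<T> {
    pub schema_version: u32,
    pub data: Vec<T>,
}

/// Node kinds this client knows about (tag strings are placeholders).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Kind {
    Ruby,
    Bouten,
    /// Covers the literal "unknown" tag and any tag newer than this client.
    Unknown(String),
}

pub fn classify(tag: &str) -> Kind {
    match tag {
        "ruby" => Kind::Ruby,
        "bouten" => Kind::Bouten,
        other => Kind::Unknown(other.to_owned()),
    }
}

/// Branch on the version first, then map each tag defensively.
pub fn read_kinds(env: &Envelope<String>) -> Result<Vec<Kind>, String> {
    if env.schema_version != SUPPORTED_SCHEMA {
        return Err(format!("unsupported schema_version {}", env.schema_version));
    }
    Ok(env.data.iter().map(|tag| classify(tag)).collect())
}
```

Rejecting an unexpected version outright is the conservative choice; a client that prefers degraded output over failure could instead proceed and rely on the Unknown arm.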
See also
- Diagnostics catalogue: the source-code identifiers each DiagnosticWire entry’s kind field carries.
- Architecture → Error recovery: what the parser actually does after each diagnostic fires.
- Node reference: per-NodeKind documentation for every wire kind tag emitted by serialize_nodes.
- aozora::wire rustdoc: Rust API surface (envelope structs, the schema_* introspection helpers behind the schema Cargo feature).
Your first PR
Not every contribution touches the parser. A typo fix, a clarified sentence, a broken link — these are real, welcome PRs, and they ride a much lighter path than the add-a-notation TDD flow. This chapter is that lightweight path. For parser changes (a new 青空文庫 notation, a lexer phase, a renderer shape), follow the full TDD flow in Development loop → Adding a new 青空文庫 notation instead.
Before your first commit: environment setup
If this is a brand-new checkout, get the environment standing first:
just setup # one-shot first-time environment bootstrap
# or, equivalently:
./bootstrap
That builds the dev image and installs the lefthook git hooks. If
hooks ever stop firing later, re-run just hooks (see
Troubleshooting → Hooks not firing).
The lightweight path (a doc / typo fix)
- Pick a small fix. A typo, a stale link, an unclear sentence in the handbook (crates/aozora-book/src/…) or a top-level doc. Keep it to a single logical change.
- Branch. main is branch-protected, so never commit to it directly.
  git switch -c docs/fix-ruby-example-typo
- Edit the .md. Use only your editor here; no parser code is involved, so there’s nothing to compile.
- Verify locally. Two quick gates cover doc-only changes:
  just typos # spelling across the tree
  just book-build # the mdbook handbook still builds
  just book-build catches broken intra-book links and bad Markdown that a typo check won’t. Both run inside the dev container, matching what CI runs.
- Commit with a signed, Conventional Commit. Doc changes take the docs: type:
  git commit -m "docs: fix ruby example in the notation chapter"
  Both requirements are enforced by hooks: the commit-msg hook rejects a non-Conventional subject, and the signing layers reject an unsigned commit. Scope is optional for cross-cutting doc edits; use one when the change is crate-local (e.g. docs(render): …). See Conventional commits for the accepted types.
- Push and open a PR. The PR title mirrors the commit subject (docs: …). The PR template walks the checklist, so keep it. CI re-runs the same gates you ran locally.
If a commit is rejected for signing, or a hook misbehaves, jump to Troubleshooting & gate recovery.
The inner loop (while you iterate)
For anything beyond a one-line fix, run a watcher in a second terminal so feedback is continuous instead of per-commit:
just watch # default check job — recompiles on save
just watch-lint # fmt + clippy on save
just watch-test # nextest on save
The watcher runs inside the dev container, so it detects saves against the bind-mounted source. See Development loop → Watch mode for the in-watcher keybindings.
How this differs from a parser change
A doc fix is intentionally cheap. A parser change is not: it lands a failing test first, then the fix, and extends every test layer the new shape touches. The contrast is deliberate.
| | Doc / typo fix | New notation / parser change |
|---|---|---|
| Touches | A .md file | Spec fixture → AST → lexer → renderer → invariants |
| Verify | just typos, just book-build | just test, just prop, just coverage |
| Commit type | docs: | feat: / fix: / perf: … |
| TDD | Not applicable | Red test first, then green — required |
Both paths share the same two hard rules: signed commits and Conventional Commits. Everything else scales with the size of the change.
See also
- Development loop — the daily loop and the full notation TDD flow.
- Troubleshooting & gate recovery — when a hook or gate blocks you.
- Testing strategy — what the parser-change path is verifying.
Development loop
aozora’s development workflow is built around three rules:
- Docker-only execution. The host toolchain is never invoked.
- just is the entry point. Every operation goes through a just recipe that wraps the underlying tool inside the dev container.
- Lint gates run automatically. lefthook installs git hooks that run fmt + clippy + typos pre-commit, plus a pre-push hook that runs the full local CI gate suite and a deep property sweep (signed-commit check first), so a passing local push roughly mirrors a passing CI run.
First-time setup
git clone git@github.com:P4suta/aozora.git
cd aozora
docker compose build dev # ~5 min the first time, cached afterwards
just hooks # install lefthook git hooks
just test # confirm green
Daily loop
just shell # drop into the dev container
just build # cargo build --workspace --all-targets
just test # workspace nextest
just lint # fmt + clippy + typos + strict-code
just prop # property-based sweep (128 cases / block)
just ci # full CI replica (lint + build + test + prop + deny + audit + udeps + coverage + book-build)
just --list enumerates everything available; just --list --unsorted
preserves the topical grouping (build → test → lint → deps → bench →
docs → release → dev-helpers).
Watch mode (bacon)
just watch # default `check` job
just watch clippy
just watch test
Inside bacon: t test, c clippy, d doc, f failing-only,
esc previous job, q quit, Ctrl-J list jobs. The watcher runs
inside the dev container so file change detection works against the
bind-mounted source.
For headless usage (no TTY, e.g. piping to tee):
just watch-headless check # plain output, no TUI
Why Docker for everything?
Three reasons.
- Toolchain reproducibility. The dev image pins rust:1.95.0-bookworm plus exact versions of cargo-nextest, cargo-llvm-cov, cargo-deny, cargo-audit, cargo-udeps, cargo-semver-checks, cargo-fuzz, mdbook, mdbook-mermaid, lychee, git-cliff, bacon, and lefthook. A fresh checkout on any machine produces identical tool behaviour.
- sccache hits. The compose file mounts a named volume at /workspace/.sccache and sets RUSTC_WRAPPER=sccache. Across sessions and across branches, the cache stays warm.
- Host insulation. Nothing in the workspace touches ~/.cargo, ~/.rustup, or any global state. Removing the project means docker compose down -v && rm -rf aozora/.
The two exceptions to Docker-only:
- samply profiling. perf_event_open(2) doesn’t survive the container seccomp profile; the samply-* recipes invoke the host toolchain (see Profiling with samply).
- Release builds. GitHub Actions runners build the release binaries natively per OS (the cross-target binary needs to match its runner OS exactly).
Editor / IDE setup
The repository includes a .devcontainer/ config, so:
- VS Code with Dev Containers extension: “Reopen in Container” picks up the dev image, the rust-analyzer toolchain, and the aozora-* workspace at once. No host-side rust install needed.
- Anything else: point your editor’s rust-analyzer at the dev container via docker exec. The cleanest approach is symlinking target/ from the named volume to a host-visible path; the alternative is the editor’s own remote-LSP support.
sccache stats
After a build cycle, check that the cache is actually warm:
just sccache-stats
Healthy steady state: 80%+ hit rate during normal iteration. A
sub-50% hit rate usually means RUSTC_WRAPPER got defeated — the
likely culprit is a stray env override or an [env] in
.cargo/config.toml. To reset counters before a measurement window:
just sccache-zero && just clean && just build && just sccache-stats
Pre-commit hooks (lefthook)
lefthook.yml configures:
- pre-commit (parallel): fmt, clippy, typos.
- commit-msg: Conventional Commits regex.
- pre-push: the full local CI gate suite plus a deep property sweep before every push (the signed-commit check runs first).
The hooks shell into docker compose run --rm dev … so they’re
identical to the just recipes you ran manually. To skip a hook
temporarily, push from the dev container’s shell directly (the
hooks attach to the host git, not the container’s git).
Why lefthook over husky / pre-commit / cargo-husky?
- husky — Node-only ecosystem; would force a Node dep into a Rust workspace.
- pre-commit (Python framework) — Python-only ecosystem; same issue inverted.
- cargo-husky — abandoned upstream.
- lefthook — single Go binary, language-neutral, parallel execution, ships from a small upstream that’s actively maintained. Mainstream choice for polyglot Rust workspaces in 2026.
Conventional commits
The commit-msg hook enforces:
<type>(<scope>): <subject>
Where <type> ∈ feat | fix | docs | style | refactor | perf | test | build | ci | chore | revert,
and <scope> is typically a crate name without the aozora- prefix
(e.g. feat(render): add aozora-tcy class hook).
git-cliff turns these into the CHANGELOG on release.
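The shape the hook enforces can be sketched in a few lines of Rust. This is a hand-written illustration, not the regex that lives in lefthook.yml: parse_subject is a hypothetical helper, it treats the scope as optional, and it does not model any extensions (such as a breaking-change marker) that the real hook may or may not accept.

```rust
/// Commit types accepted by the hook, as listed above.
const TYPES: [&str; 11] = [
    "feat", "fix", "docs", "style", "refactor", "perf",
    "test", "build", "ci", "chore", "revert",
];

/// Split `<type>(<scope>): <subject>` into (type, scope, subject).
/// Returns `None` when the line does not match that shape.
/// Hypothetical helper for illustration only.
pub fn parse_subject(line: &str) -> Option<(&str, Option<&str>, &str)> {
    // Everything before the first ": " is the `<type>(<scope>)` head.
    let (head, subject) = line.split_once(": ")?;
    if subject.trim().is_empty() {
        return None;
    }
    let (ty, scope) = match head.split_once('(') {
        Some((ty, rest)) => {
            // A scope must be closed by ')' and must not be empty.
            let scope = rest.strip_suffix(')')?;
            if scope.is_empty() {
                return None;
            }
            (ty, Some(scope))
        }
        None => (head, None),
    };
    TYPES.contains(&ty).then_some((ty, scope, subject))
}
```

For example, parse_subject("feat(render): add aozora-tcy class hook") yields the type feat, the scope render, and the subject text, while a subject such as "Fix stuff" yields None.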
Adding a new 青空文庫 notation
End-to-end TDD flow:
- Conformance fixture. Add a source + expected.* golden under crates/aozora-conformance/fixtures/render/ (and, for a normative case, a spec vector in ../aozora-notation-spec, synced via just sync-spec-vectors).
- AST variant. Add a borrowed-arena variant to AozoraNode in crates/aozora-syntax/src/borrowed.rs.
- Lexer test (red). Add a case to the relevant phase test under crates/aozora-pipeline/tests/.
- Lexer impl (green). Wire the recogniser into the appropriate phase (sanitize → events → pair → classify).
- Renderer. Emit the new HTML shape in crates/aozora-render/src/html.rs and the canonical serialisation in crates/aozora-render/src/serialize.rs.
- Cross-layer invariants. Extend the property test or corpus predicate that the new shape interacts with (escape-safety, round-trip, span well-formedness).
See also
- Testing strategy — what each test layer asserts.
- Release process — how a tag becomes a published release.
Testing strategy
aozora targets C1 100% branch coverage as a goal — but coverage is the floor, not the ceiling. Every invariant is asserted from multiple angles so a single missed test path doesn’t silently hide a regression.
The five test layers
flowchart TD
A["1. Conformance suite<br/>(crates/aozora-conformance/)"]
B["2. Property tests<br/>(crates/*/tests/property_*.rs)"]
C["3. Corpus sweep<br/>(every Aozora Bunko work)"]
D["4. Fuzz harness<br/>(cargo-fuzz)"]
E["5. Sanitizers<br/>(Miri / TSan / ASan)"]
A --> B --> C --> D --> E
Each layer catches a different kind of bug:
| Layer | Catches |
|---|---|
| Conformance suite | Per-feature contract regressions — render goldens + spec vectors. |
| Property tests | Invariant violations in the space of inputs (round-trip, escape-safety, span well-formedness). |
| Corpus sweep | Real-world distribution effects the property generator missed. |
| Fuzz | Latent panics on adversarial inputs the corpus doesn’t contain. |
| Sanitizers | UB / data race / heap-corruption issues the language can’t catch. |
When you add a new invariant, land all five touchpoints in the same PR, or split them into a chain of PRs that explicitly references the invariant.
Layer 1: conformance suite
The aozora-conformance crate is the per-feature contract layer,
with two CI-gated halves:
- Render fixtures: crates/aozora-conformance/fixtures/render/<case>/ pins a source plus expected.html / expected.nodes / expected.pairs goldens. just render-gate asserts byte-identical output; just render-gate-update (UPDATE_GOLDEN=1) refreshes the goldens after an intentional change.
- Spec vectors: crates/aozora-conformance/spec-vectors/vectors/<case>/vector.json pins a (source, expected.{html,serialize,nodes,pairs,diagnostics}) tuple. The specification repo (../aozora-notation-spec) is the single source of truth: vectors are vendored via just sync-spec-vectors, just verify-spec-vectors guards the copy against drift, and just conformance runs them across must / should / may tiers.
Both halves enumerate their fixture directories automatically, so a
new case is picked up without editing a manual list. The romaji CSS
slugs the fixtures assert are themselves centralised in
aozora-spec::RENDER_SLUGS and machine-checked against their kana
reading, so a misread slug fails a unit test before it ever reaches a
fixture.
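The byte-identical comparison and the refresh path behind the render gate can be pictured with a small sketch. It is an assumption-laden illustration rather than the code in aozora-conformance: check_golden and the Golden enum are hypothetical names, and the caller is assumed to derive the update flag from UPDATE_GOLDEN=1.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Outcome of comparing freshly rendered bytes against a golden file.
#[derive(Debug, PartialEq)]
pub enum Golden {
    Match,
    Updated,
    Mismatch,
}

/// Compare `actual` with the golden at `path`, byte for byte.
/// With `update` set, the golden is rewritten instead of compared.
/// Hypothetical helper for illustration only.
pub fn check_golden(path: &Path, actual: &[u8], update: bool) -> io::Result<Golden> {
    if update {
        fs::write(path, actual)?;
        return Ok(Golden::Updated);
    }
    let expected = fs::read(path)?;
    Ok(if expected.as_slice() == actual {
        Golden::Match
    } else {
        Golden::Mismatch
    })
}
```

Keeping the update path explicit means an intentional change shows up as a reviewable diff in the golden files, never as a silently relaxed assertion.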
The flagship corpus fixture lives at spec/aozora/fixtures/56656/ —
the Japanese translation of Crime and Punishment (Aozora Bunko card
56656). It exercises 1000+ ruby annotations, forward-reference bouten,
JIS X 0213 gaiji, and accent decomposition edge cases.
Layer 2: property tests
proptest generators in
crates/aozora-proptest drive parse / render / round-trip
invariants. Default 128 cases per proptest! block (CI budget);
just prop-deep runs 4096 per block (release-cut budget).
just prop # 128 cases
just prop-deep # 4096 cases
AOZORA_PROPTEST_CASES=10000 cargo nextest run --workspace --test 'property_*'
Why proptest over quickcheck:
- Proptest’s shrinker is structural (reduces by the generator’s ops), so a counterexample collapses to a minimal reproduction that still fails. Quickcheck shrinks per-type, which produces noisier outputs.
- Proptest persists failure seeds to proptest-regressions/, so every reproduced failure becomes a permanent regression test. Quickcheck has nothing like this.
Why a separate generator crate (aozora-proptest):
The generators are non-trivial (they have to produce valid 青空文庫 source — random byte streams would just stress the parser’s error path, which the fuzz harness already covers). Centralising them means every property test in every crate gets the same generator quality, and the generator itself can be unit-tested.
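To make the "valid source, not random bytes" point concrete, here is a dependency-free sketch of a structured generator and a property loop. It is not proptest and has no shrinking; Rng, gen_source, escape_html, and first_counterexample are all hypothetical names, and escape_html is a toy stand-in for a renderer’s text escaping, not aozora’s renderer.

```rust
/// Tiny deterministic PRNG (xorshift64) so the sketch needs no crates.
pub struct Rng(u64);

impl Rng {
    pub fn new(seed: u64) -> Self {
        Rng(seed.max(1)) // xorshift state must never be zero
    }
    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
    pub fn pick<'a>(&mut self, items: &[&'a str]) -> &'a str {
        items[(self.next_u64() % items.len() as u64) as usize]
    }
}

/// Generate structurally valid input: plain text mixed with ruby spans,
/// written the way the handbook introduction writes them.
pub fn gen_source(rng: &mut Rng, parts: usize) -> String {
    const PLAIN: [&str; 4] = ["山", "a<b", "x&y", "\"q\""];
    const BASE: [&str; 2] = ["青梅", "東京"];
    const READING: [&str; 2] = ["おうめ", "とうきょう"];
    let mut out = String::new();
    for _ in 0..parts {
        if rng.next_u64() % 2 == 0 {
            out.push_str(rng.pick(&PLAIN));
        } else {
            out.push('|');
            out.push_str(rng.pick(&BASE));
            out.push('《');
            out.push_str(rng.pick(&READING));
            out.push('》');
        }
    }
    out
}

/// Toy stand-in for HTML text escaping.
pub fn escape_html(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for ch in text.chars() {
        match ch {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            other => out.push(other),
        }
    }
    out
}

/// Run `prop` over `cases` generated inputs; return the first failing input.
pub fn first_counterexample(cases: u32, seed: u64, prop: impl Fn(&str) -> bool) -> Option<String> {
    let mut rng = Rng::new(seed);
    for _ in 0..cases {
        let parts = (rng.next_u64() % 8) as usize;
        let src = gen_source(&mut rng, parts);
        if !prop(&src) {
            return Some(src);
        }
    }
    None
}
```

An escape-safety property in this style asserts that escape_html(src) never contains a raw <, > or " for any generated src. Because the generator lives in one place, every property reuses the same input distribution.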
Layer 3: corpus sweep
export AOZORA_CORPUS_ROOT=$HOME/aozora-corpus
just corpus-sweep
The sweep walks every .txt under $AOZORA_CORPUS_ROOT, parses it, and
verifies that the round-trip property holds and that nothing panics.
~17 000 works are in active rotation; a sweep takes ~90 s on a modern
x86_64 desktop using the parallel loader.
The sweep catches what the property generator can’t — every weird real-world idiom the maintained corpus has accumulated over 25 years of volunteer encoding choices. It’s the parser’s truth-from-the-field.
See Performance → Corpus sweeps for the corpus structure, archive format, and parallel loader details.
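The overall shape of a sweep (walk, read, check, collect failures) can be sketched with the standard library alone. This is a sequential illustration under assumptions: collect_txt and sweep are hypothetical names, the check closure stands in for "parse and verify round-trip", and the real sweep uses the parallel loader described in the Performance chapter.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Recursively collect every `.txt` file under `root`.
pub fn collect_txt(root: &Path, out: &mut Vec<PathBuf>) -> io::Result<()> {
    for entry in fs::read_dir(root)? {
        let path = entry?.path();
        if path.is_dir() {
            collect_txt(&path, out)?;
        } else if path.extension().and_then(|ext| ext.to_str()) == Some("txt") {
            out.push(path);
        }
    }
    Ok(())
}

/// Run `check` over the raw bytes of every collected file and return
/// the paths for which it reported a failure.
pub fn sweep(root: &Path, check: impl Fn(&[u8]) -> bool) -> io::Result<Vec<PathBuf>> {
    let mut files = Vec::new();
    collect_txt(root, &mut files)?;
    files.sort(); // stable order keeps reports comparable between runs
    let mut failures = Vec::new();
    for path in files {
        let bytes = fs::read(&path)?;
        if !check(&bytes) {
            failures.push(path);
        }
    }
    Ok(failures)
}
```

The check receives raw bytes rather than a String because corpus files are not guaranteed to be UTF-8 before decoding.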
Layer 4: fuzz
just fuzz parse_render -- -runs=10000
Targets under crates/*/fuzz/fuzz_targets/:
- parse_render: feed arbitrary bytes through Document::new ∘ to_html.
- serialize_roundtrip: parse ∘ serialize ∘ parse stability.
- sjis_decode: aozora_encoding::sjis::decode_to_string on arbitrary byte streams.
Fuzz failures auto-shrink to a minimal byte sequence and land in
crates/<crate>/fuzz/artifacts/. Add the failing input to
crates/aozora-conformance/fixtures/render/ as a regression case
after diagnosing.
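The parse ∘ serialize ∘ parse stability property named above can be stated generically. The sketch below is an illustration, not the fuzz target itself: roundtrip_stable is a hypothetical helper, and the parse and serialize closures stand in for whatever the crate under test exposes.

```rust
/// Check `parse ∘ serialize ∘ parse` stability for one input:
/// parsing the serialised form must reproduce the first parse result.
/// Hypothetical helper for illustration only.
pub fn roundtrip_stable<T: PartialEq>(
    input: &str,
    parse: impl Fn(&str) -> T,
    serialize: impl Fn(&T) -> String,
) -> bool {
    let first = parse(input);
    let second = parse(&serialize(&first));
    first == second
}
```

Note that the property compares parse results, not source text: the serialiser is free to normalise the input as long as the normalised form parses to the same tree.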
Why libFuzzer / cargo-fuzz:
Mainstream Rust fuzzing runs on libFuzzer via cargo-fuzz; it has
the broadest crate-ecosystem support (most upstream crates ship
fuzz targets), the corpus-management tooling is mature, and the
crash artefacts are diff-able with git diff.
Layer 5: sanitizers
bash scripts/sanitizers.sh miri # UB on FFI / scan intrinsics
bash scripts/sanitizers.sh tsan # data races (parallel corpus loader)
bash scripts/sanitizers.sh asan # heap correctness
Sanitizer runs are slower (~10× under Miri) so they don’t run on every PR — they’re nightly via the dev-image cron in CI, plus release-cut. The slow path catches the slow-class of bugs.
Why all three:
- Miri catches undefined behaviour the compiler couldn’t see (out-of-bounds slice access, dangling references, transmute mismatches). The FFI driver and the SIMD scanner have unsafe surfaces; Miri is the only fully-checked oracle for them.
- TSan catches race conditions in the parallel corpus loader. We use rayon correctly as far as we know, but TSan is the backstop.
- ASan catches the small set of heap-correctness bugs that get through Miri (typically C-side issues in the FFI smoke test).
Coverage measurement
just coverage # cargo llvm-cov region coverage; CI gate
just coverage-html # local HTML report at coverage/html/index.html
just coverage-branch # nightly toolchain, branch-coverage detail
cargo llvm-cov over tarpaulin: tarpaulin is x86_64-linux
only and uses ptrace-based instrumentation that misses some
optimised-out branches. llvm-cov uses LLVM’s source-based
coverage instrumentation — works on every target and gives accurate
branch numbers.
The CI gate is region coverage; branch coverage is informational (it requires the nightly compiler, which the workspace doesn’t pin on the hot path).
Test naming and structure
- Unit tests in mod tests {} at the bottom of each module.
- Integration tests in crates/<crate>/tests/. One file per area (e.g. tests/lexer_phase0.rs, tests/lexer_phase3.rs).
- Property tests prefixed property_ (the prop recipe globs on this).
- Doc tests inside rust code blocks in rustdoc comments. CI runs just test-doc separately because nextest skips them.
Snapshot testing
Where the output is a multi-line string that’s tedious to inline
(rendered HTML, diagnostic-formatted text), we use
insta:
insta::assert_snapshot!(tree.to_html());
The first run writes tests/snapshots/<test>.snap; subsequent runs
compare against it. Updates happen via cargo insta review (the
interactive UI inside the dev container), never by manually editing
the .snap file.
See also
- Development loop: just test and friends.
- Performance → Corpus sweeps: how layer 3, the corpus sweep, works in practice.
Troubleshooting & gate recovery
Most first-run friction is environmental, not code. This chapter collects the failures people actually hit on a fresh checkout and the shortest path back to a green tree. If you haven’t set up the environment yet, start with Development loop → First-time setup; if a commit is being rejected for signing, see Your first PR.
First-run failures
Docker daemon not running
Every just target shells into the dev container via docker compose run …, so a stopped daemon makes all of them fail at once — usually
with Cannot connect to the Docker daemon at unix:///var/run/docker.sock.
Start (or restart) Docker Desktop / the docker service and re-run the
target. Nothing in aozora runs on the host toolchain; the daemon is a
hard prerequisite for build, test, lint, and ci alike.
Disk full / image build fails midway
docker compose build dev pulls a rust:*-bookworm base and layers a
pinned toolchain on top. A build that dies partway through — or a
no space left on device from just build — almost always means the
Docker volume is out of room, not a Dockerfile bug.
docker system prune # reclaim dangling images / layers / build cache
Keep roughly 5 GB free for the dev image plus the named cargo /
sccache volumes. After pruning, re-run docker compose build dev; the
layer cache resumes from the last good step.
Commit signing fails
Signed commits are mandatory. If a commit is rejected — or the
post-commit re-amend rolls your commit back because the signer was
unavailable — your SSH/GPG signing key isn’t reachable from the
container’s git context. Walk through the signing setup in
CONTRIBUTING.md → First-time setup
and confirm git config commit.gpgsign is true with a configured
user.signingkey.
This is the three-layer defense working as designed: a post-commit
re-amend, the signing-check pre-push command
(scripts/check-signed-commits.sh), and GitHub’s “require signed
commits” ruleset. Do not weaken any layer — the redundancy is
intentional. Fix the key, don’t disable the gate.
Hooks not firing
If fmt / clippy / typos aren’t running on commit, or the signing
re-amend never happens, lefthook isn’t installed for this clone. Hooks
live in .git/hooks/, which is per-clone and never committed, so a
fresh checkout always needs:
just hooks # (re)install the lefthook git hooks
Re-run it any time hooks go quiet (e.g. after git init-level surgery
or switching the hooks path).
Reading lefthook output
Lefthook prints one icon per command in its post-run summary. The non-obvious one:
- 🥊 is a failure, not decoration. Lefthook falls back to its branding glyph when the underlying tooling (a docker compose run, or a multi-step just recipe with background jobs) buries the real exit status. Treat 🥊 exactly like a plain failure mark and scroll up: the actual error line is in the command output above the summary, not in the summary itself.
Each command in lefthook.yml also carries a fail_text: hint
naming the recipe responsible, so a failing push prints both the raw
output and a pointer at what to fix.
When a gate fails
just ci runs the full pipeline; the pre-push hook runs the same jobs
plus a deep property sweep. When one trips, this table maps the symptom
to its recovery recipe:
| Gate | Symptom | Recovery |
|---|---|---|
| coverage | Region coverage below the floor | just coverage-html, open coverage/html/index.html, add tests for the uncovered regions |
| clippy / fmt | cargo fmt --check diff or clippy denial | just fmt to auto-format, fix any clippy findings, then re-run just lint |
| drift-gate (schema) | wire JSON Schema is stale | just schema to regenerate, then commit the diff |
| drift-gate (types) | TypeScript .d.ts drift | just types to regenerate, then commit the diff |
| drift-gate (langs) | Generated host-SDK wire types are stale | just types-langs to regenerate, then commit the diff |
| typos | Spelling hit | just typos to see every hit; fix, or add a genuine term to typos.toml |
| deny / audit | License / advisory failure | Read the captured log under /tmp (the recipe writes the full cargo deny / cargo audit output there), then update the dependency or the deny.toml exception |
For the schema / types gates, the regenerate-then-commit step is the
fix — the gate only checks that the committed artefact matches what the
generator would emit, so a stale checkout fails until you regenerate
and stage it. See Wire format for what wire /
.d.ts / langs each cover.
Escape hatches
Two exist, and they are not interchangeable:
- SKIP_TAGS=deep git push: the narrow hatch. Skips only the tagged deep command (the 4096-case prop-deep sweep) while leaving signing-check, ci, and everything else in force. Use this when a deep-sweep regression is unrelated to your change, and file an issue against the failing crate so it doesn’t stay hidden.
- LEFTHOOK=0: the nuclear hatch. Disables all hooks, including signing-check. An unsigned commit pushed this way is rejected server-side by the ruleset anyway, so you gain nothing but a later, more confusing failure. Avoid it. Reach for SKIP_TAGS instead.
See also
- Development loop: the daily just recipes and watch mode.
- Your first PR: the lightweight doc-fix path.
- Testing strategy: what each coverage layer asserts.
Release process
aozora releases are git-tag-driven: push an annotated v<semver>
tag, and .github/workflows/release.yml builds the cross-platform
binaries, generates release notes from Conventional Commits, and
publishes the GitHub Release.
Cutting a release
# 1. Pre-flight (everything green locally)
just ci # lint + build + test + prop + deny + audit + udeps + coverage + book-build
just prop-deep # 4096 cases per proptest block
AOZORA_CORPUS_ROOT=… just corpus-sweep
just smoke-py # host-side: abi3 wheel build + mypy + pytest (not in `just ci`)
# 2. Bump workspace version
cargo set-version --workspace 0.2.7
git commit -am "chore(release): bump workspace to v0.2.7"
# 3. Refresh CHANGELOG (Unreleased → version)
just changelog # runs git-cliff with --unreleased --prepend
git add CHANGELOG.md && git commit -m "docs: refresh CHANGELOG for v0.2.7"
# 4. Tag (annotated)
git tag -a v0.2.7 -m "v0.2.7"
git push origin main v0.2.7
release.yml reacts to the tag: builds release binaries on three
runners (linux x86_64, macOS arm64, windows x86_64), assembles
tarballs / zips with the aozora binary + LICENSE-MIT +
LICENSE-APACHE + NOTICE + README.md, and publishes the
archives plus SHA256SUMS to the GitHub Release.
Sanity check after release
# Verify checksums
curl -L -O https://github.com/P4suta/aozora/releases/download/v0.2.7/SHA256SUMS
curl -L -O https://github.com/P4suta/aozora/releases/download/v0.2.7/aozora-v0.2.7-x86_64-unknown-linux-gnu.tar.gz
sha256sum --check SHA256SUMS
# Verify the binary
tar -xzf aozora-v0.2.7-*.tar.gz
./aozora --version # prints "aozora 0.2.7"
Why annotated tags?
git tag -a creates an annotated tag object with a message; git tag
alone creates a lightweight tag (a bare ref). git-cliff’s release
note extraction only walks annotated tags, and the standard
ecosystem expectation (cargo-release, cargo-dist) is that release
tags are annotated. Using lightweight tags would silently break the
changelog generator.
Why git-tag-driven, not branch-driven?
A release/v0.2.7 branch model is the alternative. We don’t use
it because:
- Single-author workflow doesn’t benefit from the parallel-tracks model that branch-driven releases enable.
- An annotated tag is the release artefact: anything you need to retroactively understand about a release lives in git show v0.2.7. A branch loses that locality.
- Rollback is git tag -d + delete the GitHub release. Trivial.
CHANGELOG generation
git-cliff consumes Conventional Commits
and produces Keep-a-Changelog formatted output:
just changelog # incremental: --unreleased --prepend CHANGELOG.md
just changelog-full # rebuild from scratch
cliff.toml configures the grouping:
| Commit type | Section in CHANGELOG |
|---|---|
| feat: | Added |
| fix: | Fixed |
| perf: | Performance |
| refactor: | Changed |
| docs: | Documentation |
| test: | Tests |
| build: | Build |
| ci: | CI |
| chore: | (skipped unless scope is release) |
| revert: | Reverted |
Non-conventional commits are silently skipped (they survive in
git log but don’t pollute the changelog).
Why --unreleased --prepend over -o CHANGELOG.md:
The full-rebuild form (-o) regenerates the entire changelog from
git history every time, which churns the diff for past releases
even when nothing about them changed (whitespace, footer
formatting). The incremental form only writes the new “Unreleased”
section between the latest release and HEAD, leaving past entries
byte-stable.
Why three release targets and not five?
The CI matrix builds:
- x86_64-unknown-linux-gnu (linux x86_64)
- aarch64-apple-darwin (macOS arm64)
- x86_64-pc-windows-msvc (windows x86_64)
We don’t build x86_64-apple-darwin (macOS Intel — Apple
deprecated the platform; arm64 covers all current Apple Silicon
machines) or aarch64-unknown-linux-gnu (linux arm64 — covered by
cargo install from source for the niche ARM Linux deployment
case).
Adding a target is one line in release.yml; we add them when a
real consumer asks for a binary build of one. Pre-emptive coverage
isn’t worth the CI minutes.
Why not cargo-dist / release-plz?
Both are mainstream choices; we use a hand-written release.yml
because:
- cargo-dist is opinionated about archive layout (assumes you ship bin/ + share/); aozora’s archive is flat (aozora + LICENSE-* + NOTICE + README.md).
- release-plz automates the version-bump + PR flow; for a single-author repo the manual cargo set-version + git tag is two commands and one fewer integration to debug.
When the workspace grows past three release targets or aozora goes multi-author, both will be worth re-evaluating.
Pre-1.0 SemVer
aozora is currently in the 0.x series. The contract:
- 0.x.y → 0.x.y+1: patches and additions, no breaks. Always safe to upgrade.
- 0.x.y → 0.x+1.0: may break the API. cargo-semver-checks flags the breaks during CI; the version-bump commit references the break in its body.
- 0.x.y → 1.0.0: the API freeze. Post-1.0, breaking changes collect on a next branch and ship in a major bump.
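The three transitions in the contract can be written down as a small classifier. It is a sketch under assumptions: Bump, parse_version, and classify are hypothetical names, only plain MAJOR.MINOR.PATCH strings are handled (no pre-release or build metadata), and any transition the contract does not list is reported as Other.

```rust
/// How a version change reads under the pre-1.0 contract.
#[derive(Debug, PartialEq)]
pub enum Bump {
    /// 0.x.y -> 0.x.y+1: patches and additions, no breaks.
    Compatible,
    /// 0.x.y -> 0.x+1.0: may break the API.
    MayBreak,
    /// 0.x.y -> 1.0.0: the API freeze.
    Freeze,
    /// Anything the contract does not list.
    Other,
}

/// Parse "MAJOR.MINOR.PATCH"; `None` on anything else.
pub fn parse_version(v: &str) -> Option<(u64, u64, u64)> {
    let mut parts = v.split('.').map(|p| p.parse::<u64>().ok());
    let out = (parts.next()??, parts.next()??, parts.next()??);
    // Reject trailing components such as "0.1.0.0".
    parts.next().is_none().then_some(out)
}

/// Classify the step from `from` to `to`.
pub fn classify(from: &str, to: &str) -> Option<Bump> {
    let from = parse_version(from)?;
    let to = parse_version(to)?;
    Some(match (from, to) {
        ((0, x, y), (0, x2, y2)) if x == x2 && y2 == y + 1 => Bump::Compatible,
        ((0, x, _), (0, x2, 0)) if x2 == x + 1 => Bump::MayBreak,
        ((0, _, _), (1, 0, 0)) => Bump::Freeze,
        _ => Bump::Other,
    })
}
```

Under this reading, classify("0.2.6", "0.2.7") is Compatible and classify("0.2.7", "0.3.0") is MayBreak, which is the point at which cargo-semver-checks output matters.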
The MSRV pin (rust-toolchain.toml) advances on its own cadence,
roughly quarterly. MSRV bumps are not breaking under our pre-1.0
contract — consumers that need a frozen MSRV pin a release tag.
When you raise the MSRV, bump the Dockerfile FROM rust: base in the
same commit so the dev image keeps building on exactly the pinned
channel (one toolchain, no dead second one). Dependabot deliberately
ignores the rust base image (.github/dependabot.yml) precisely so it
cannot drift ahead of rust-toolchain.toml, so this base bump is manual.
Resolve the new digest with docker buildx imagetools inspect rust:<ver>-bookworm.
Publishing to crates.io
Live since v0.4.1. The whole workspace publishes through the manual
.github/workflows/publish-crates.yml workflow:
gh workflow run publish-crates.yml -f dry_run=false
It runs cargo publish --workspace (cargo 1.90+), which publishes
every publishable member in topological order — aozora-encoding /
aozora-spec first, aozora and aozora-cli last — and waits for
crates.io index propagation between dependent crates itself. Members
marked publish = false (aozora-corpus, aozora-conformance,
aozora-bench, aozora-trace, aozora-xtask, plus the
aozora-wasm / aozora-ffi / aozora-py drivers that ship through
npm / GitHub Releases / PyPI) are skipped automatically.
The default dry_run: true runs cargo publish --workspace --dry-run
only — a safe metadata gate that succeeds even on a first publish
because --workspace resolves intra-workspace deps locally. A live
run needs the CARGO_TOKEN repo secret populated with a crates.io API
token carrying both the publish-new and publish-update scopes (the
first run creates brand-new crates).
Single front door, still. The parser is built from many internal
crates (aozora-spec, aozora-syntax, aozora-pipeline,
aozora-render, aozora-encoding, aozora-scan, aozora-veb, plus
aozora-cst / aozora-query / aozora-proptest). They are now on
crates.io so the umbrella aozora crate can depend on them, but they
carry no API-stability contract — their crate descriptions say so,
and downstream consumers should depend on aozora alone.
Why we publish before v1.0
Earlier this was deferred to v1.0 (every pre-1.0 minor may break the
API; a published name is load-bearing). We publish now because the
crate boundary has stabilised and claiming the aozora* namespace is
itself worth doing. The pre-1.0 SemVer contract above still holds — a
0.x → 0.x+1 bump may break the API and is flagged by
cargo-semver-checks.
Publishing to npm and PyPI
The browser (WASM) and Python drivers ship through their own
manual workflows, same dry_run: true default as crates:
# npm — aozora-wasm (needs the NPM_TOKEN repo secret)
gh workflow run publish-npm.yml -f dry_run=false
# PyPI — aozora_py wheels (OIDC trusted publishing; no token secret)
gh workflow run publish-pypi.yml -f dry_run=false
publish-npm.yml builds the package with wasm-pack build --target web --release and npm publishes crates/aozora-wasm/pkg/.
publish-pypi.yml builds one cp311-abi3 wheel per OS (pyo3
abi3-py311, so a single wheel covers CPython 3.11 → 3.14 and future
3.x — no per-Python-version matrix) plus an sdist, and uploads via PyPI
trusted publishing (configure the project’s trusted publisher once,
pointing at this repo + publish-pypi.yml). Run just smoke-py first.
Linux aarch64, macOS universal2, and free-threaded (3.13t/3.14t,
which abi3 cannot target) wheels are a future cibuildwheel addition.
Cut these from the same vX.Y.Z tag as the GitHub Release so every
channel ships the same version. Run each workflow once with the
default dry_run: true first and confirm it’s green before flipping
to dry_run=false.
Code signing
Release binaries are not CA code-signed (no Authenticode on the
Windows .exe, no Apple Developer ID / notarization on the macOS
build). This is a deliberate pre-1.0 decision.
What we ship instead — and why it covers the current audience:
- Build provenance attestation (actions/attest-build-provenance, since v0.4.0): every archive carries a Sigstore-backed SLSA provenance statement, verifiable with gh attestation verify <archive> --repo P4suta/aozora, with no certificates and no CA. It proves which CI built which artefact from which source: a supply-chain control, not an OS-level execution-trust signal.
- SHA256SUMS for integrity; signed git tags / commits for authorship.
CA code signing solves a different problem — suppressing the Windows
SmartScreen / macOS Gatekeeper “unknown publisher” prompt for end users
who double-click a downloaded binary. For a parser library + developer
CLI installed via cargo install / package managers, that prompt is
low-friction, so the recurring cost and operational overhead (HSM-stored
keys mandatory since 2023-06; ≤458-day cert validity since 2026-03) is
not justified yet.
When we revisit this (post-1.0, if desktop double-click installs become a real distribution path):
- Windows → SignPath Foundation free OSS code signing (Sectigo-issued, HSM-backed, CI-integrated). Note the 2024 SmartScreen change: EV no longer buys instant trust — both OV and EV build reputation organically over downloads.
- macOS → Apple Developer ID ($99/yr Apple Developer Program) + notarization. Third-party CA certs (e.g. ssl.com) do not satisfy Gatekeeper; only an Apple-issued Developer ID does.
- A paid CA (ssl.com eSigner, etc.) was evaluated and rejected: it covers Windows only, no longer removes the first-run warning on day one, and adds a yearly cost the project does not need pre-1.0.
See also
- Development loop — the local pre-flight commands.
- Testing strategy —
prop-deepand corpus sweep details.