Gaiji (外字 references)

Aozora Bunko predates ubiquitous Unicode support; many works still ship as Shift_JIS source. Characters that don’t fit in Shift_JIS — JIS X 0213 plane-2 ideographs, accented Latin letters, ad-hoc combining marks — appear in source as gaiji references:

※［＃「魚＋師のつくり」、第3水準1-94-37］
※［＃「彳＋寺」、U+5F85、393-13］
※［＃濁点付き片仮名ヰ］

The leading ※ (U+203B, reference mark) opens the annotation; the ［＃…］ body can describe the character in up to three independent ways (a given reference may carry only some of them):

A descriptive name in Japanese (「魚＋師のつくり」 — “魚 plus the right-hand side of 師”) for human readers.
A JIS X 0213 plane / row / cell triple, prefixed by the kanji level (第3水準1-94-37: level 3, plane 1, row 94, cell 37).
A Unicode codepoint (U+5F85) when the character has one.
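
As a rough illustration of the shape described above, the sketch below splits one reference into its description and its raw code fields. It is a simplified assumption, not the parser's actual code: `RawGaiji` and `parse_gaiji` are hypothetical names, and the real lexer works on a token stream rather than on an isolated string.

```rust
/// Hypothetical, simplified view of one gaiji reference.
#[derive(Debug, PartialEq)]
struct RawGaiji<'a> {
    /// Text of the description, without the 「」 quotes if present.
    description: &'a str,
    /// Everything after the first 、 separator, left unparsed.
    mencode: Option<&'a str>,
}

/// Splits `※［＃…］` into description and raw code fields.
/// Returns `None` if the input is not wrapped in `※［＃` and `］`.
fn parse_gaiji(src: &str) -> Option<RawGaiji<'_>> {
    let body = src.strip_prefix("※［＃")?.strip_suffix('］')?;

    // Quoted form: 「description」 optionally followed by 、code fields.
    if let Some(rest) = body.strip_prefix('「') {
        let (description, tail) = rest.split_once('」')?;
        let mencode = tail.strip_prefix('、').filter(|m| !m.is_empty());
        return Some(RawGaiji { description, mencode });
    }

    // Unquoted form: the whole body is the description.
    Some(RawGaiji { description: body, mencode: None })
}
```

Applied to the three examples above, this yields a description plus `第3水準1-94-37`, a description plus `U+5F85、393-13`, and a bare description with no code fields.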

aozora resolves gaiji references through a compile-time PHF lookup table built from the JIS X 0213 official mapping plus the Unicode UCS register, with the descriptive name as a tertiary fallback.

Why a compile-time table?

The gaiji table has ~14 000 entries. Loading it at runtime from a JSON / TOML asset would:

Add a startup cost on every Document::new (the parser is supposed to start reading bytes within microseconds).
Force every binding (CLI, WASM, FFI, Python wheel) to ship the table as a separate asset, complicating distribution.
Defeat dead-code elimination — the linker can’t strip entries the consumer’s input never references if they’re loaded behind an opaque file read.

A phf::Map baked into the binary at compile time wins on every axis: zero-allocation lookup, single-binary distribution, full DCE and LTO visibility. The build cost is real (~40 s the first time, ~0 s incremental) but happens once per workspace build, not per-invocation.

phf is preferred over a static HashMap, which would require runtime construction in a OnceLock: phf produces a true compile-time perfect-hash table, giving O(1) lookup with no first-call cost and no synchronisation on the hot path.
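
For contrast, the runtime-constructed alternative that this paragraph rules out might look like the sketch below. It uses only the standard library; the function name `gaiji_table` and the single entry are illustrative assumptions, and the key scheme (description text) is chosen only to keep the example small.

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

/// Runtime-built alternative (the approach the handbook rejects).
/// The map does not exist until the first call constructs it.
fn gaiji_table() -> &'static HashMap<&'static str, char> {
    static TABLE: OnceLock<HashMap<&'static str, char>> = OnceLock::new();
    TABLE.get_or_init(|| {
        // The first caller pays for allocating and hashing every entry;
        // every later caller still passes through the OnceLock's
        // "already initialised?" check before reaching the map.
        HashMap::from([
            // 彳＋寺 is given as U+5F85 in the example reference above.
            ("彳＋寺", '\u{5F85}'),
        ])
    })
}
```

With roughly 14 000 entries, that first call is where the allocation and hashing cost would land; a table generated at build time has no equivalent step.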

Resolution order

For a reference like ※［＃「魚＋師のつくり」、第3水準1-94-37］:

Unicode codepoint if the source explicitly provided one (U+XXXX) — used directly.
JIS X 0213 plane-row-cell lookup (第N水準P-R-C) — most ideographs land here.
Descriptive name — the parser ships a curated mapping plus a single-character fallback (a description that is itself one glyph resolves to it). A reference that matches none of these resolves to nothing: the aozora::lex::unresolved_gaiji warning fires and the gaiji renders as its description text.
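
The three steps above can be sketched as a single function. This is a minimal illustration under stated assumptions, not the crate's implementation: the two lookup tables are passed in as closures (`jis` and `named` are hypothetical parameters standing in for the generated tables), the code fields are assumed to be separated by 、, and the result is a plain `char` rather than the richer resolved type described under AST shape.

```rust
/// Parses a `U+XXXX` field into a scalar value.
fn parse_ucs(field: &str) -> Option<char> {
    let hex = field.strip_prefix("U+")?;
    char::from_u32(u32::from_str_radix(hex, 16).ok()?)
}

/// Parses `第N水準P-R-C` into a (plane, row, cell) triple.
fn parse_jis(field: &str) -> Option<(u8, u8, u8)> {
    // Skip the level prefix; the triple follows "水準".
    let (_, triple) = field.split_once("水準")?;
    let mut parts = triple.split('-').map(|p| p.parse::<u8>().ok());
    let cell = (parts.next()??, parts.next()??, parts.next()??);
    if parts.next().is_some() {
        return None; // more than three components
    }
    Some(cell)
}

/// A description that is exactly one character resolves to that character.
fn single_char(description: &str) -> Option<char> {
    let mut chars = description.chars();
    match (chars.next(), chars.next()) {
        (Some(c), None) => Some(c),
        _ => None,
    }
}

/// Applies the documented order: explicit codepoint, then JIS cell,
/// then descriptive name with the single-character fallback.
fn resolve(
    mencode: Option<&str>,
    description: &str,
    jis: impl Fn(u8, u8, u8) -> Option<char>,
    named: impl Fn(&str) -> Option<char>,
) -> Option<char> {
    let fields = mencode
        .map(|m| m.split('、').collect::<Vec<_>>())
        .unwrap_or_default();

    // 1. An explicit U+XXXX wins outright.
    if let Some(c) = fields.iter().find_map(|f| parse_ucs(f)) {
        return Some(c);
    }
    // 2. Otherwise try the plane-row-cell table.
    if let Some(c) = fields
        .iter()
        .filter_map(|f| parse_jis(f))
        .find_map(|(p, r, c)| jis(p, r, c))
    {
        return Some(c);
    }
    // 3. Finally the curated names, then the one-glyph fallback.
    named(description).or_else(|| single_char(description))
}
```

A `None` result corresponds to the unresolved case, where the warning fires and the description text is rendered instead.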

AST shape

pub struct Gaiji<'src> {
    /// Free-form description from the source (e.g. "魚＋師のつくり").
    pub description: &'src str,
    /// Resolved Unicode value — a single scalar or a static combining
    /// sequence — or `None` when no path matched.
    pub ucs: Option<Resolved>,
    /// Raw mencode reference (e.g. "第3水準1-85-54", "U+XXXX").
    pub mencode: Option<&'src str>,
}

Resolved is Char(char) for the 99%+ single-scalar case or Multi(&'static str) for the 25 JIS X 0213 plane-1 combining-sequence cells. ucs == None is the unresolved case the unresolved_gaiji warning flags.

Render output

`ucs`	HTML
`Some(_)`	`<span class="aozora-gaiji" data-codepoint="U+20B9B">𠮛</span>` — the resolved glyph as content, the scalar(s) as space-separated `U+XXXX` in `data-codepoint`.
`None`	`<span class="aozora-gaiji" data-description="魚＋師のつくり">魚＋師のつくり</span>` — the description as both attribute and content.
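
The two rows above can be reproduced with a small function. This is a sketch only: the `Resolved` enum is re-declared here from its description in AST shape so the block is self-contained, `render` and `escape` are hypothetical names, and the HTML escaping is an assumption about what a renderer would reasonably do, not something the handbook states.

```rust
/// Re-declared from the AST description so the sketch stands alone.
enum Resolved {
    Char(char),
    Multi(&'static str),
}

/// Minimal HTML escaping for text and attribute values (assumed).
fn escape(text: &str) -> String {
    text.replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
        .replace('"', "&quot;")
}

/// Renders one gaiji span following the table above.
fn render(description: &str, ucs: Option<&Resolved>) -> String {
    match ucs {
        Some(resolved) => {
            let glyph = match resolved {
                Resolved::Char(c) => c.to_string(),
                Resolved::Multi(s) => (*s).to_string(),
            };
            // One U+XXXX per scalar, space-separated.
            let codepoints: Vec<String> = glyph
                .chars()
                .map(|c| format!("U+{:04X}", c as u32))
                .collect();
            format!(
                "<span class=\"aozora-gaiji\" data-codepoint=\"{}\">{}</span>",
                codepoints.join(" "),
                escape(&glyph)
            )
        }
        None => {
            // Unresolved: the description is both attribute and content.
            let d = escape(description);
            format!("<span class=\"aozora-gaiji\" data-description=\"{d}\">{d}</span>")
        }
    }
}
```

For a `Multi` value such as a base kana followed by a combining mark, `data-codepoint` carries both scalars, for example `U+304B U+309A`.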

Accent decomposition

Aozora Bunko also encodes accented Latin letters (è, ñ, ä) using a separate notation that does not go through ※［＃…］:

M&iexcl;cher    ← in some sources
me-zin       ← in others

The full table is at https://www.aozora.gr.jp/accent_separation.html — 114 ASCII digraphs / ligatures mapping to Unicode. aozora applies this decomposition during the lexer’s Phase 0 (sanitize), so by the time classification runs the source is pure Unicode. See Architecture → Seven-phase lexer for the phase ordering.
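
A table-driven substitution pass of this kind could be sketched as follows. Everything here is an assumption for illustration: `ACCENTS` holds three example pairs chosen to match the letters mentioned above (è, ñ, ä), not entries copied from the published 114-row table, and `decompose` is a hypothetical name. The sketch also rewrites every match it finds; it does not model how the real lexer decides which spans of the source are accent-encoded.

```rust
/// Illustrative entries only (assumed, not taken from the published table).
const ACCENTS: &[(&str, char)] = &[
    ("e`", '\u{E8}'), // e with grave
    ("n~", '\u{F1}'), // n with tilde
    ("a:", '\u{E4}'), // a with diaeresis
];

/// Replaces each known ASCII digraph with its Unicode letter,
/// copying every other character through unchanged.
fn decompose(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    'outer: while let Some(first) = rest.chars().next() {
        for (digraph, ch) in ACCENTS {
            if let Some(tail) = rest.strip_prefix(*digraph) {
                out.push(*ch);
                rest = tail;
                continue 'outer;
            }
        }
        // No digraph starts here: copy one character and advance.
        out.push(first);
        rest = &rest[first.len_utf8()..];
    }
    out
}
```

A full table that mixes two- and three-character keys would need to try longer keys first so that a ligature is not consumed as a shorter digraph.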
