Gaiji (外字) resolution — mapping ※[#…、mencode] references
to real Unicode characters.
Two incoming shapes per the Aozora annotation manual:
※[#「description」、第3水準1-85-54] ← JIS X 0213 plane-row-cell
※[#「description」、U+XXXX、page-line] ← explicit Unicode codepoint

The lexer’s Phase 3 recogniser (aozora-lexer::phase3_classify::recognize_gaiji)
captures description and mencode verbatim and leaves ucs = None;
this module turns that reference into a concrete Resolved by
consulting two phf::Maps compiled into the binary
(one for the single-codepoint majority and one for the 25
combining-sequence cells) and, for U+XXXX shaped mencodes,
parsing the hex digits directly.
§Why a Resolved enum
25 cells in JIS X 0213:2004 plane 1 (Ainu か゚ family, IPA tone marks,
a handful of accented Latin) decode to a combining sequence — two
Unicode scalars that must travel together. A single char cannot
carry them, so the resolved value is either a char (the
~99.4% common path) or a &'static str borrowed from the
generated combo table. Both variants are Copy, so embedding
Option<Resolved> in the parser’s Gaiji payload does not
perturb its Copy-able tree.
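A minimal sketch of such a two-variant type follows. The variant names `Char` and `Multi` and the helpers `write_to`, `render`, and `assert_copy` are assumptions made for illustration; the crate's actual definitions may differ.

```rust
/// Sketch of a two-variant resolution result.
/// `Char` and `Multi` are hypothetical variant names, not taken from the crate.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Resolved {
    /// A cell that maps to exactly one Unicode scalar.
    Char(char),
    /// A cell that maps to a combining sequence held in static data.
    Multi(&'static str),
}

impl Resolved {
    /// Append the resolved text to `out`, whichever variant it is.
    pub fn write_to(self, out: &mut String) {
        match self {
            Resolved::Char(c) => out.push(c),
            Resolved::Multi(s) => out.push_str(s),
        }
    }
}

/// Compiles only when `T` is `Copy`; used as a static check.
pub fn assert_copy<T: Copy>() {}

/// Hypothetical helper: render a resolved value into an owned `String`.
pub fn render(r: Resolved) -> String {
    let mut s = String::new();
    r.write_to(&mut s);
    s
}
```

Because both payloads (`char` and `&'static str`) are `Copy`, the derive succeeds, and `Option<Resolved>` is `Copy` as well.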
§Lookup order
- existing: the caller-provided codepoint (e.g. extracted by an earlier escape recogniser); short-circuit identity.
- Combo table: checked first for mencode because it is the only way to honour a 2-codepoint cell.
- Single-char table: the bulk path; one perfect-hash probe in .rodata.
- U+XXXX prefix: U+ followed by 1–6 hex digits. Parsed as a hex integer, validated via char::from_u32.
- Description fallback: small secondary table keyed by the literal description text (well-known shapes like 〓, 〻).
- None: unresolved. Renderer falls back to the raw description bytes.
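The order above can be sketched as a single function. This is an illustration under stated assumptions, not the crate's code: `std::collections::HashMap` stands in for the compiled-in `phf::Map`s, the `Tables` struct, the `lookup` signature, and the `Resolved` variant names are hypothetical, and the table keys in `sample_tables` are placeholders rather than real mencodes.

```rust
use std::collections::HashMap;

/// Hypothetical variant names; see the module docs for the real enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Resolved {
    Char(char),
    Multi(&'static str),
}

/// Stand-in for the tables that the real module compiles in as `phf::Map`s.
pub struct Tables {
    pub combo: HashMap<&'static str, &'static str>,
    pub single: HashMap<&'static str, char>,
    pub description: HashMap<&'static str, char>,
}

/// Placeholder data only: the keys are NOT real mencodes or descriptions.
pub fn sample_tables() -> Tables {
    Tables {
        combo: HashMap::from([("demo-combo", "\u{304B}\u{309A}")]),
        single: HashMap::from([("demo-single", '\u{3013}')]),
        description: HashMap::from([("demo-description", '\u{303B}')]),
    }
}

/// Walks the documented lookup order; the signature is an assumption.
pub fn lookup(
    t: &Tables,
    existing: Option<char>,
    mencode: Option<&str>,
    description: &str,
) -> Option<Resolved> {
    // Caller-provided codepoint wins outright.
    if let Some(c) = existing {
        return Some(Resolved::Char(c));
    }
    if let Some(m) = mencode {
        // Combo table first, so two-codepoint cells are honoured.
        if let Some(&s) = t.combo.get(m) {
            return Some(Resolved::Multi(s));
        }
        // Single-char table: the bulk path.
        if let Some(&c) = t.single.get(m) {
            return Some(Resolved::Char(c));
        }
        // `U+` followed by 1 to 6 hex digits, validated via `char::from_u32`.
        if let Some(hex) = m.strip_prefix("U+") {
            let well_formed =
                (1..=6).contains(&hex.len()) && hex.bytes().all(|b| b.is_ascii_hexdigit());
            if well_formed {
                if let Some(c) = u32::from_str_radix(hex, 16).ok().and_then(char::from_u32) {
                    return Some(Resolved::Char(c));
                }
            }
        }
    }
    // Description fallback; a miss here yields `None` (unresolved).
    t.description.get(description).copied().map(Resolved::Char)
}
```

The explicit hex-digit check matters because `u32::from_str_radix` also accepts a leading `+` sign, which would otherwise let a string such as `U++41` through.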
§Why two PHF maps rather than one enum-valued map
The single-char map is 4 329 entries; the combo map is 25.
Storing the common path as phf::Map<&str, char> keeps each value
at 4 bytes (vs 16-byte &str) and the cache footprint of the hot
lookup path tight. The combo map is consulted second; misses
there cost a single probe.
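The per-value sizes quoted above can be checked directly. The sketch below assumes a 64-bit target for the 16-byte figure and counts value bytes only, ignoring keys and any padding the map's entry layout adds; `value_sizes` and `value_bytes` are hypothetical helper names.

```rust
use std::mem::size_of;

/// Returns (size of `char`, size of `&'static str`) in bytes.
/// A `&str` is a fat pointer: data pointer plus length.
pub fn value_sizes() -> (usize, usize) {
    (size_of::<char>(), size_of::<&'static str>())
}

/// Total bytes taken by the values alone for `entries` map entries.
pub fn value_bytes(entries: usize, per_value: usize) -> usize {
    entries * per_value
}
```

For the 4 329-entry single-char map this gives 17 316 bytes of values as `char` versus 69 264 bytes if each value were a 16-byte `&str`.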
Enums§
- Resolved: Resolution outcome, either a single Unicode scalar or a static string covering a combining sequence.
Functions§
- lookup: Pure-function lookup used by aozora-lexer’s Phase 3 classifier to populate borrowed::Gaiji::ucs at construction time.
- parse_u_plus 🔒: Parse a U+XXXX style mencode (1 to 6 hex digits after the literal U+ prefix) and validate the result via char::from_u32. Returns None for surrogates, non-characters, and out-of-range integers, rather than panicking, so malformed input falls cleanly through to the description fallback.
- table_sizes: Pretty-printer for tests and diagnostics. Returns (single_char_count, combo_count, description_count).
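The validation described for parse_u_plus can be sketched as below. This is not the crate's private implementation. One assumption is worth calling out: `char::from_u32` on its own rejects surrogates and values above U+10FFFF but accepts non-characters such as U+FFFE, so the sketch adds an explicit non-character check to match the documented behaviour.

```rust
/// Sketch of a `U+XXXX` parser; the real function is private to the crate.
pub fn parse_u_plus(mencode: &str) -> Option<char> {
    let hex = mencode.strip_prefix("U+")?;
    // Require 1 to 6 ASCII hex digits; this also rejects signs and whitespace.
    if !(1..=6).contains(&hex.len()) || !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    let cp = u32::from_str_radix(hex, 16).ok()?;
    // Assumed extra check: Unicode non-characters are U+FDD0..=U+FDEF and
    // the last two code points of every plane (U+xxFFFE, U+xxFFFF).
    let is_nonchar = (0xFDD0..=0xFDEF).contains(&cp) || (cp & 0xFFFE) == 0xFFFE;
    if is_nonchar {
        return None;
    }
    // Rejects surrogates (U+D800..=U+DFFF) and anything above U+10FFFF.
    char::from_u32(cp)
}
```

Every failure path returns `None` instead of panicking, so a caller can move on to the description fallback.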