Skip to main content

Module gaiji

Module gaiji 

Source
Expand description

Gaiji (外字) resolution — mapping ※[#…、mencode] references to real Unicode characters.

Two incoming shapes per the Aozora annotation manual:

  ※[#「description」、第3水準1-85-54]    ← JIS X 0213 plane-row-cell
  ※[#「description」、U+XXXX、page-line] ← explicit Unicode codepoint

The lexer’s Phase 3 recogniser (aozora-lexer::phase3_classify::recognize_gaiji) captures description and mencode verbatim and leaves ucs = None; this module turns that reference into a concrete Resolved by consulting two phf::Maps compiled into the binary (one for the single-codepoint majority and one for the 25 combining-sequence cells) and, for U+XXXX shaped mencodes, parsing the hex digits directly.

§Why a Resolved enum

25 cells in JIS X 0213:2004 plane 1 (Ainu か゚ family, IPA tone marks, a handful of accented Latin) decode to a combining sequence — two Unicode scalars that must travel together. A single char cannot carry them, so the resolved value is either a char (the ~99.4% common path) or a &'static str borrowed from the generated combo table. Both variants are Copy, so embedding Option<Resolved> in the parser’s Gaiji payload does not perturb its Copy-able tree.

§Lookup order

  1. existing — the caller-provided codepoint (e.g. extracted by an earlier escape recogniser); short-circuit identity.
  2. Combo table — checked first for mencode because it is the only way to honour a 2-codepoint cell.
  3. Single-char table — the bulk path; one perfect-hash probe in .rodata.
  4. U+XXXX prefixU+ followed by 1–6 hex digits. Parsed as a hex integer, validated via char::from_u32.
  5. Description fallback — small secondary table keyed by the literal description text (well-known shapes like 〓, 〻).
  6. None — unresolved. Renderer falls back to the raw description bytes.

§Why two PHF maps rather than one enum-valued map

The single-char map is 4 329 entries; the combo map is 25. Storing the common path as phf::Map<&str, char> keeps each value at 4 bytes (vs 16-byte &str) and the cache footprint of the hot lookup path tight. The combo map is consulted second; misses there cost a single probe.

Enums§

Resolved
Resolution outcome — either a single Unicode scalar or a static string covering a combining sequence.

Functions§

lookup
Pure-function lookup used by aozora-lexer’s Phase 3 classifier to populate borrowed::Gaiji::ucs at construction time.
parse_u_plus 🔒
Parse a U+XXXX style mencode — 1 to 6 hex digits after the literal U+ prefix — and validate the result via char::from_u32. Returns None for surrogates, non-characters, and out-of-range integers, rather than panicking, so malformed input falls cleanly through to the description fallback.
table_sizes
Pretty-printer for tests and diagnostics. Returns (single_char_count, combo_count, description_count).