Aozora Bunko accent decomposition — ASCII digraph → Unicode letter.
Spec: https://www.aozora.gr.jp/accent_separation.html
The scheme encodes accented Latin letters using a base ASCII letter followed by a one-character marker. The full 118-entry table from the spec is encoded here as a compile-time slice so the lexer (for pre-parse rewriting) and downstream tools share the same authoritative lookup.
use aozora_syntax::accent::decompose_fragment;
assert_eq!(decompose_fragment("fune`bre"), "funèbre");
assert_eq!(decompose_fragment("ae&on"), "æon");
assert_eq!(decompose_fragment("plain"), "plain");
§Invariants
- The table is closed: no ASCII digraph maps to more than one Unicode codepoint. Longest-match on ligatures first (ae&, AE&, oe&, OE&), then single-letter digraphs.
- decompose_fragment may grow the byte length of some substrings (m' = ḿ and e~ = ẽ are BMP codepoints ≥ U+1E00 whose UTF-8 forms are 3 bytes, larger than their 2-byte ASCII digraphs). Callers that back-map diagnostic spans across the rewrite must record a per-position delta.
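The per-position delta mentioned in the second invariant could be recorded roughly as follows. This is a minimal sketch, not the crate's implementation: `SAMPLE` is a hypothetical three-entry stand-in for the real table, and `decompose_with_deltas` is an illustrative name that does not exist in `aozora_syntax`.

```rust
/// Hypothetical three-entry stand-in for the real table (assumption).
/// The 3-byte ligature key is listed first so the longest key wins.
const SAMPLE: &[(&str, char)] = &[("ae&", 'æ'), ("e`", 'è'), ("m'", 'ḿ')];

/// Rewrites `input` and records, for every replacement, the byte offset in
/// the output where the replacement starts and the signed change in byte
/// length (output bytes minus input bytes) that it introduced.
fn decompose_with_deltas(input: &str) -> (String, Vec<(usize, isize)>) {
    let mut out = String::with_capacity(input.len());
    let mut deltas = Vec::new();
    let mut i = 0;
    while i < input.len() {
        let rest = &input[i..];
        if let Some((key, ch)) = SAMPLE.iter().find(|(k, _)| rest.starts_with(*k)) {
            // Keys are pure ASCII, so advancing by key.len() stays on a
            // char boundary.
            deltas.push((out.len(), ch.len_utf8() as isize - key.len() as isize));
            out.push(*ch);
            i += key.len();
        } else {
            // No table entry here: copy one char through unchanged.
            let c = rest.chars().next().unwrap();
            out.push(c);
            i += c.len_utf8();
        }
    }
    (out, deltas)
}
```

A caller mapping an output offset back to the input would subtract the sum of the deltas recorded before that offset. In this sketch `m'` grows by one byte, `ae&` shrinks by one, and `` e` `` is length-neutral.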
§Scope of use
The function is only safe to call on the body of a 〔...〕 span:
aozora restricts accent decomposition to that convention to avoid
false-matching English text like "text," (which would otherwise be
decomposed to texţ via the legitimate-in-Romanian t, = ţ entry).
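One way a caller might honour this restriction is to rewrite only the bodies of 〔...〕 spans and copy everything else through untouched. The sketch below is an assumption about how such a wrapper could look: `rewrite_bracketed` is a hypothetical helper, the brackets are kept in the output purely for illustration, and the decomposition step is passed in as a closure so the block stays self-contained.

```rust
/// Applies `decompose` to the body of every 〔...〕 span in `input` and
/// leaves text outside the brackets (and the brackets themselves) as-is.
/// An opening bracket with no closing partner is left unmodified.
fn rewrite_bracketed(input: &str, decompose: impl Fn(&str) -> String) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while let Some(open) = rest.find('〔') {
        let body_start = open + '〔'.len_utf8();
        // Stop rewriting if the span is never closed.
        let Some(len) = rest[body_start..].find('〕') else { break };
        // Copy everything up to and including the opening bracket.
        out.push_str(&rest[..body_start]);
        // Rewrite only the body between the brackets.
        out.push_str(&decompose(&rest[body_start..body_start + len]));
        // Continue from the closing bracket, which is copied later.
        rest = &rest[body_start + len..];
    }
    out.push_str(rest);
    out
}
```

With the real function this would be called as `rewrite_bracketed(text, decompose_fragment)`, assuming `decompose_fragment` returns an owned `String`; if it returns a borrowed or `Cow` value, the closure would need an extra conversion.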
Constants§
- ACCENT_MARKERS - ASCII characters that act as accent markers in the spec.
- ACCENT_MARKER_MASK 🔒 - 128-bit bitmap of ACCENT_MARKERS for branchless ASCII membership testing. Bit n is 1 iff byte n is an accent marker. Computed at compile time from ACCENT_MARKERS so the two stay in lockstep.
- ACCENT_TABLE - The full accent decomposition table in spec-page order.
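The bitmap idea behind ACCENT_MARKER_MASK can be sketched as below. The names `SAMPLE_MARKERS`, `build_mask`, `SAMPLE_MASK`, and `is_sample_marker` are hypothetical, and the marker list is only the subset that appears in this page's examples, not the crate's full ACCENT_MARKERS.

```rust
/// Subset of markers seen in this page's examples (assumption: the real
/// ACCENT_MARKERS constant contains more entries).
const SAMPLE_MARKERS: &[u8] = b"`'~&,";

/// Builds a 128-bit mask with bit n set iff byte n is in `markers`.
/// Evaluated at compile time, so the mask cannot drift from the list.
const fn build_mask(markers: &[u8]) -> u128 {
    let mut mask = 0u128;
    let mut i = 0;
    while i < markers.len() {
        // Every marker must be ASCII (< 128) for the shift to be valid.
        mask |= 1u128 << (markers[i] as u32);
        i += 1;
    }
    mask
}

const SAMPLE_MASK: u128 = build_mask(SAMPLE_MARKERS);

/// Membership test written without short-circuit operators. Masking the
/// shift amount keeps it in range for bytes >= 128, and the non-lazy `&`
/// then rejects those bytes.
fn is_sample_marker(b: u8) -> bool {
    let bit_set = ((SAMPLE_MASK >> ((b & 127) as u32)) & 1) == 1;
    bit_set & (b < 128)
}
```

The source form avoids data-dependent branches; whether the generated machine code is branch-free depends on the compiler and target.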
Statics§
- ACCENT_DIGRAPHS 🔒 - 2-byte digraphs as a compile-time perfect hash table. 110 entries, &[u8] keys (the 2 ASCII bytes), char values. phf::Map::get is O(1) and constant-comparison-bounded, replacing the 110-entry linear scan that the old ACCENT_TABLE lookup used.
Functions§
- decompose_fragment - Decompose Aozora accent digraphs anywhere inside fragment.
- is_accent_marker - Branchless membership test against ACCENT_MARKERS.
- match_ligature 🔒 - 3-byte ligatures (ASCII keys → Latin char). Only four entries, so a match beats phf::Map here: the compiler lowers it to a small jump table, branch prediction nails the common ASCII miss path, and the match keeps the keys inlined as immediates rather than reaching out to a static array.
- try_match 🔒 - Attempt to match a table entry starting at bytes[i]. Longest-first (the spec rule): try 3-byte ligatures before 2-byte digraphs.
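The longest-first order described for try_match might look roughly like this. It is a sketch under stated assumptions: the `_sketch` functions are hypothetical, the return shape `(char, consumed_bytes)` is a guess, the ligature values for AE&, oe&, and OE& are inferred from the ae& = æ example, and the digraph lookup uses a small `match` over entries quoted on this page in place of the real 110-entry phf map.

```rust
/// The four 3-byte ligatures named on this page.
fn match_ligature_sketch(key: &[u8]) -> Option<char> {
    match key {
        b"ae&" => Some('æ'),
        b"AE&" => Some('Æ'),
        b"oe&" => Some('œ'),
        b"OE&" => Some('Œ'),
        _ => None,
    }
}

/// Stand-in for the digraph map, limited to entries quoted on this page.
fn match_digraph_sketch(key: &[u8]) -> Option<char> {
    match key {
        b"e`" => Some('è'),
        b"m'" => Some('ḿ'),
        b"e~" => Some('ẽ'),
        b"t," => Some('ţ'),
        _ => None,
    }
}

/// Tries a 3-byte ligature at `bytes[i]` first, then a 2-byte digraph.
/// Returns the replacement char and the number of input bytes consumed.
fn try_match_sketch(bytes: &[u8], i: usize) -> Option<(char, usize)> {
    // `get` returns None when fewer than 3 bytes remain, so short tails
    // fall through to the digraph attempt without panicking.
    if let Some(ch) = bytes.get(i..i + 3).and_then(match_ligature_sketch) {
        return Some((ch, 3));
    }
    bytes
        .get(i..i + 2)
        .and_then(match_digraph_sketch)
        .map(|ch| (ch, 2))
}
```

A scanning loop would call this at each position, advance by the consumed length on a hit, and copy one byte or char through on a miss.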