Module accent


Aozora Bunko accent decomposition — ASCII digraph → Unicode letter.

Spec: https://www.aozora.gr.jp/accent_separation.html

The scheme encodes each accented Latin letter as an ASCII base (one letter, or two letters for a ligature) followed by a one-character marker. The full 118-entry table from the spec is encoded here as a compile-time slice, so the lexer (for pre-parse rewriting) and downstream tools share the same authoritative lookup.

use aozora_syntax::accent::decompose_fragment;
assert_eq!(decompose_fragment("fune`bre"), "funèbre");
assert_eq!(decompose_fragment("ae&on"), "æon");
assert_eq!(decompose_fragment("plain"), "plain");

§Invariants

  • The table is unambiguous: no ASCII key maps to more than one Unicode code point. Matching is longest-first: the 3-byte ligatures (ae&, AE&, oe&, OE&) are tried before the 2-byte digraphs.
  • decompose_fragment may grow the byte length of some substrings: m' → ḿ (U+1E3F) and e~ → ẽ (U+1EBD) are BMP code points above U+1E00 whose UTF-8 forms take 3 bytes, one more than their 2-byte ASCII digraphs. Callers that back-map diagnostic spans across the rewrite must record a per-position delta.
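The second invariant can be illustrated with a minimal sketch of delta tracking. Everything named here (`lookup`, `Rewrite`, `decompose_with_deltas`, `to_input_offset`) is hypothetical and not part of this module; the three-entry lookup stands in for the real table, and ligatures are ignored for brevity.

```rust
/// Hypothetical stand-in for the real table: only the entries used in this sketch.
fn lookup(base: u8, marker: u8) -> Option<char> {
    match (base, marker) {
        (b'e', b'`') => Some('è'),  // U+00E8: 2 bytes in UTF-8, same as the digraph
        (b'm', b'\'') => Some('ḿ'), // U+1E3F: 3 bytes, one more than the digraph
        (b'e', b'~') => Some('ẽ'),  // U+1EBD: 3 bytes, one more than the digraph
        _ => None,
    }
}

/// Rewritten text plus the bookkeeping needed to map offsets back.
pub struct Rewrite {
    pub text: String,
    /// (byte offset in `text` where a replacement ends,
    ///  cumulative output-minus-input byte delta after that replacement)
    pub deltas: Vec<(usize, isize)>,
}

pub fn decompose_with_deltas(input: &str) -> Rewrite {
    let bytes = input.as_bytes();
    let mut text = String::with_capacity(input.len());
    let mut deltas = Vec::new();
    let mut cumulative: isize = 0;
    let mut i = 0;
    while i < bytes.len() {
        if i + 1 < bytes.len() {
            if let Some(ch) = lookup(bytes[i], bytes[i + 1]) {
                text.push(ch);
                // Each digraph consumed exactly 2 input bytes.
                cumulative += ch.len_utf8() as isize - 2;
                deltas.push((text.len(), cumulative));
                i += 2;
                continue;
            }
        }
        // No match: copy one whole UTF-8 character unchanged.
        let ch = input[i..].chars().next().unwrap();
        text.push(ch);
        i += ch.len_utf8();
    }
    Rewrite { text, deltas }
}

/// Map a byte offset in the rewritten text back to the original input.
/// Offsets that fall inside a replaced character are not adjusted.
pub fn to_input_offset(rw: &Rewrite, out_offset: usize) -> usize {
    let delta = rw
        .deltas
        .iter()
        .rev()
        .find(|(end, _)| *end <= out_offset)
        .map_or(0, |(_, d)| *d);
    (out_offset as isize - delta) as usize
}
```

For the input `m'a`, the rewritten text `ḿa` is 4 bytes long, and the `a` at output offset 3 maps back to input offset 2.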

§Scope of use

The function is only safe to call on the body of a 〔...〕 span. Aozora restricts accent decomposition to that convention to avoid false matches in ordinary English text: a word followed by a comma, such as "text,", would otherwise be decomposed to texţ via the t, = ţ entry, which is legitimate for Romanian.
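One way a caller might honour this restriction is sketched below. The helper `decompose_in_brackets` is hypothetical, and so is its choice to keep the 〔 and 〕 delimiters in the output. It takes the decomposer as a closure so the sketch stays self-contained; in real use that closure would presumably wrap decompose_fragment.

```rust
/// Apply `decompose` only to the bodies of 〔...〕 spans, leaving all other
/// text untouched. Hypothetical helper, not part of this module.
pub fn decompose_in_brackets(text: &str, decompose: impl Fn(&str) -> String) -> String {
    const OPEN: char = '〔';
    const CLOSE: char = '〕';
    let mut out = String::with_capacity(text.len());
    let mut rest = text;
    while let Some(open) = rest.find(OPEN) {
        let body_start = open + OPEN.len_utf8();
        // An unclosed 〔 ends the scan; the remainder is copied verbatim below.
        let Some(len) = rest[body_start..].find(CLOSE) else { break };
        let body_end = body_start + len;
        out.push_str(&rest[..body_start]); // text before the span, plus 〔
        out.push_str(&decompose(&rest[body_start..body_end]));
        out.push(CLOSE);
        rest = &rest[body_end + CLOSE.len_utf8()..];
    }
    out.push_str(rest);
    out
}
```

With a toy decomposer that only knows `t,`, the input `text, 〔t,〕` becomes `text, 〔ţ〕`: the comma after "text" outside the brackets is left alone.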

Constants§

ACCENT_MARKERS
ASCII characters that act as accent markers in the spec.
ACCENT_MARKER_MASK 🔒
128-bit bitmap of ACCENT_MARKERS for branchless ASCII membership testing. Bit n is 1 iff byte n is an accent marker. Computed at compile time from ACCENT_MARKERS so the two stay in lockstep.
ACCENT_TABLE
The full accent decomposition table in spec-page order.
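The bitmap approach described for ACCENT_MARKER_MASK can be sketched as follows. The names `MARKERS`, `build_mask`, `MARKER_MASK`, and `is_marker` are hypothetical, and the marker set is only the five markers that appear on this page, not necessarily the full contents of ACCENT_MARKERS.

```rust
/// Partial, illustrative marker set (the markers mentioned on this page).
const MARKERS: &[u8] = b"`&'~,";

/// Build a 128-bit bitmap at compile time: bit n is set iff byte n is in `markers`.
const fn build_mask(markers: &[u8]) -> u128 {
    let mut mask = 0u128;
    let mut i = 0;
    // `for` loops are not allowed in const fn, so iterate with `while`.
    while i < markers.len() {
        assert!(markers[i] < 128, "markers must be ASCII");
        mask |= 1u128 << (markers[i] as u32);
        i += 1;
    }
    mask
}

/// Derived from MARKERS, so the two cannot drift apart.
const MARKER_MASK: u128 = build_mask(MARKERS);

/// Membership test written without short-circuiting operators.
/// Masking with 127 keeps the shift in range; the `byte < 128` term
/// rejects non-ASCII bytes whose low 7 bits happen to match a marker.
pub const fn is_marker(byte: u8) -> bool {
    ((MARKER_MASK >> ((byte & 127) as u32)) & 1 == 1) & (byte < 128)
}
```

Deriving the mask from the marker slice in a `const fn` means that adding a marker to the slice updates the bitmap automatically at compile time.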

Statics§

ACCENT_DIGRAPHS 🔒
2-byte digraphs as a compile-time perfect hash table. 110 entries, &[u8] keys (the 2 ASCII bytes), char values. phf::Map::get runs in O(1) with a bounded, constant number of comparisons, replacing the 110-entry linear scan that the old ACCENT_TABLE lookup used.

Functions§

decompose_fragment
Decompose Aozora accent digraphs anywhere inside fragment.
is_accent_marker
Branchless membership test against ACCENT_MARKERS.
match_ligature 🔒
3-byte ligatures (ASCII keys → Latin char). Only four entries, so a match beats phf::Map here: the compiler lowers it to a small jump table, branch prediction nails the common ASCII miss path, and the match keeps the keys inlined as immediates rather than reaching out to a static array.
try_match 🔒
Attempt to match a table entry starting at bytes[i]. Longest-first (the spec rule): try 3-byte ligatures before 2-byte digraphs.
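The longest-first rule can be sketched as below. The signatures are assumptions, since these items are private and their real shapes are not shown on this page. `match_digraph` is a hypothetical stand-in that uses a plain `match` over a handful of entries instead of the phf map behind ACCENT_DIGRAPHS.

```rust
/// The four 3-byte ligatures, matched with slice patterns.
fn match_ligature(key: &[u8]) -> Option<char> {
    match key {
        [b'a', b'e', b'&'] => Some('æ'),
        [b'A', b'E', b'&'] => Some('Æ'),
        [b'o', b'e', b'&'] => Some('œ'),
        [b'O', b'E', b'&'] => Some('Œ'),
        _ => None,
    }
}

/// Hypothetical subset of the 2-byte digraphs (the real lookup is a phf map).
fn match_digraph(key: &[u8]) -> Option<char> {
    match key {
        [b'e', b'`'] => Some('è'),
        [b'e', b'~'] => Some('ẽ'),
        [b'm', b'\''] => Some('ḿ'),
        [b't', b','] => Some('ţ'),
        _ => None,
    }
}

/// Try to match a table entry starting at `bytes[i]`.
/// Returns the decoded character and the number of input bytes consumed.
fn try_match(bytes: &[u8], i: usize) -> Option<(char, usize)> {
    // Longest first: a 3-byte ligature wins over any 2-byte digraph.
    if let Some(ch) = bytes.get(i..i + 3).and_then(match_ligature) {
        return Some((ch, 3));
    }
    // `get` returns None when fewer than 2 bytes remain, so no bounds check is needed.
    bytes.get(i..i + 2).and_then(match_digraph).map(|ch| (ch, 2))
}
```

Returning the consumed length alongside the character lets a caller advance its cursor by 3 or 2 bytes without repeating the lookup.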