Function decompose_fragment 

pub fn decompose_fragment(fragment: &str) -> Cow<'_, str>

Decompose Aozora accent digraphs anywhere inside fragment.

Call this only on the body of a 〔...〕 span. The transform is restricted to that convention so that ordinary English text (isn't, text,, word's) is not mistaken for legitimate spec entries (n' = ń, t, = ţ, and friends).
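As a rough illustration of that restriction, the sketch below applies a transform to 〔...〕 bodies only and copies everything else through untouched. `map_accent_spans` is a hypothetical helper, not part of this crate's documented API, and keeping the brackets in the output is an assumption of the sketch rather than documented behaviour.

```rust
/// Hypothetical helper: apply `transform` to the body of every 〔...〕 span
/// in `line`, leaving all text outside the spans unchanged.
/// Assumption: the brackets themselves are kept in the output.
fn map_accent_spans<F>(line: &str, transform: F) -> String
where
    F: Fn(&str) -> String,
{
    const OPEN: char = '〔';
    const CLOSE: char = '〕';

    let mut out = String::with_capacity(line.len());
    let mut rest = line;

    while let Some(start) = rest.find(OPEN) {
        let body_start = start + OPEN.len_utf8();
        // An unclosed span is copied through verbatim.
        let Some(body_len) = rest[body_start..].find(CLOSE) else {
            break;
        };
        let body_end = body_start + body_len;

        out.push_str(&rest[..body_start]); // text before the span, plus 〔
        out.push_str(&transform(&rest[body_start..body_end]));
        out.push(CLOSE);
        rest = &rest[body_end + CLOSE.len_utf8()..];
    }

    out.push_str(rest);
    out
}
```

With the real function, the closure would presumably be `|body| decompose_fragment(body).into_owned()`; the n' in a surrounding "isn't" is then never seen by the transform.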

Guarantees:

  • Returns Cow::Borrowed(fragment) when no accent marker byte appears (no allocation in the common Japanese-only case).
  • Greedy longest-match: ligatures (3-byte, e.g. ae& = æ) beat the 2-byte digraphs that share a prefix (a& = å would otherwise apply).
  • Output byte length is not preserved. It can grow to 3 bytes per 2-byte digraph for the few entries that land in U+1Exx (m' = ḿ, e~ = ẽ). Most entries shrink (3-byte ligature → 2-byte UTF-8). The invariant that does hold: the result is always a valid UTF-8 string.

The implementation is linear in fragment.len(): it walks the bytes left to right, peeks at most 3 bytes at a time, and commits the longest match found in the table.
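A minimal sketch of such a scan is shown here, under stated assumptions. `decompose_sketch`, `TABLE`, `is_marker`, and `lookup` are hypothetical names; the table holds only the entries quoted on this page, and the marker set is reduced to the bytes those entries use. The crate's real table and internals are likely different.

```rust
use std::borrow::Cow;

/// Hypothetical excerpt of the accent table (only entries quoted on this page).
const TABLE: &[(&str, char)] = &[
    ("ae&", 'æ'),
    ("a&", 'å'),
    ("n'", 'ń'),
    ("t,", 'ţ'),
    ("m'", 'ḿ'),
    ("e~", 'ẽ'),
];

/// Assumed marker bytes, limited to those used by the excerpt above.
fn is_marker(b: u8) -> bool {
    matches!(b, b'&' | b'\'' | b',' | b'~')
}

/// Try the 3-byte key first, then the 2-byte key (greedy longest match).
fn lookup(bytes: &[u8]) -> Option<(usize, char)> {
    for len in [3usize, 2] {
        if let Some(key) = bytes.get(..len) {
            if let Some(&(_, ch)) = TABLE.iter().find(|(k, _)| k.as_bytes() == key) {
                return Some((len, ch));
            }
        }
    }
    None
}

pub fn decompose_sketch(fragment: &str) -> Cow<'_, str> {
    let bytes = fragment.as_bytes();

    // Fast path: no marker byte means nothing can match, so borrow the input.
    if !bytes.iter().any(|&b| is_marker(b)) {
        return Cow::Borrowed(fragment);
    }

    let mut out = String::with_capacity(fragment.len());
    let mut i = 0; // current scan position
    let mut copied = 0; // start of the unmatched run not yet copied to `out`

    while i < bytes.len() {
        match lookup(&bytes[i..]) {
            Some((len, ch)) => {
                // Keys are ASCII, so `i` and `i + len` are char boundaries.
                out.push_str(&fragment[copied..i]);
                out.push(ch);
                i += len;
                copied = i;
            }
            None => i += 1,
        }
    }

    out.push_str(&fragment[copied..]);
    Cow::Owned(out)
}
```

Because every key is pure ASCII, a match can never begin inside a multi-byte character, which is why slicing the original `&str` at match positions is safe and the output stays valid UTF-8. The sketch also shows the length behaviour described above: `m'` (2 bytes) becomes ḿ (3 bytes), while `ae&` (3 bytes) becomes æ (2 bytes).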