Module wtf8


WTF-8 name encoding and search-oriented case folding.

NTFS file names are arbitrary u16 sequences and may contain unpaired surrogates. Storing them as lossy UTF-8 would corrupt such names and make them impossible to open from the UI (docs/ARCHITECTURE.md). WTF-8 encodes each unpaired surrogate as the 3-byte sequence that the UTF-8 bit layout yields for that code point, is a superset of UTF-8, and round-trips back to the original UTF-16.
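To make the encoding concrete, here is a minimal sketch of a UTF-16 to WTF-8 conversion. It is not this module's implementation: `encode_code_point` and `utf16_to_wtf8` are hypothetical names chosen for illustration, and the sketch produces only the original form, not the folded one.

```rust
/// Hypothetical helper: append the WTF-8 bytes of one code point.
/// Unlike strict UTF-8, a lone surrogate (0xD800..=0xDFFF) is accepted
/// and takes the ordinary 3-byte form.
fn encode_code_point(cp: u32, out: &mut Vec<u8>) {
    match cp {
        0..=0x7F => out.push(cp as u8),
        0x80..=0x7FF => {
            out.push(0xC0 | (cp >> 6) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
        0x800..=0xFFFF => {
            out.push(0xE0 | (cp >> 12) as u8);
            out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
        _ => {
            out.push(0xF0 | (cp >> 18) as u8);
            out.push(0x80 | ((cp >> 12) & 0x3F) as u8);
            out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
    }
}

/// Hypothetical helper: convert possibly ill-formed UTF-16 to WTF-8.
fn utf16_to_wtf8(units: &[u16]) -> Vec<u8> {
    let mut out = Vec::with_capacity(units.len() * 3);
    let mut i = 0;
    while i < units.len() {
        let hi = units[i] as u32;
        let mut cp = hi;
        // Combine a high surrogate with a following low surrogate;
        // anything else (including a lone surrogate) is encoded as-is.
        if (0xD800..=0xDBFF).contains(&hi) && i + 1 < units.len() {
            let lo = units[i + 1] as u32;
            if (0xDC00..=0xDFFF).contains(&lo) {
                cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
                i += 1;
            }
        }
        encode_code_point(cp, &mut out);
        i += 1;
    }
    out
}
```

For well-formed input the output is plain UTF-8. For a name such as `[0x61, 0xD800, 0x62]`, which `String::from_utf16` rejects, the lone surrogate becomes the bytes `ED A0 80` and the surrounding ASCII is untouched.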

The index keeps two pools with shared offsets (docs/ARCHITECTURE.md, ADR-0003), so the folded form of a name must have exactly the same byte length as the original. Folding therefore lowercases a code point only when the result is a single code point of identical encoded length; everything else is kept as-is. The same rule must be applied to query needles (fold_str) or case-insensitive matches would misalign.
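The length-preserving rule can be sketched with the standard library alone. This is an illustration under assumptions, not the module's code: `fold_same_len` and `fold_same_len_str` are hypothetical names, and the sketch relies on `char::to_lowercase` for the case mapping, which the real `fold_char` may or may not use.

```rust
/// Hypothetical sketch: lowercase `c` only when the mapping yields exactly
/// one char whose UTF-8 length equals that of `c`; otherwise keep `c`.
fn fold_same_len(c: char) -> char {
    let mut lower = c.to_lowercase();
    match (lower.next(), lower.next()) {
        // Exactly one resulting char, same encoded length: safe to fold.
        (Some(l), None) if l.len_utf8() == c.len_utf8() => l,
        // Multi-char expansions or length changes would shift offsets.
        _ => c,
    }
}

/// Hypothetical sketch: fold a query needle with the same rule, so its
/// bytes line up with the folded pool.
fn fold_same_len_str(s: &str) -> String {
    s.chars().map(fold_same_len).collect()
}
```

Under this rule `'A'` and `'É'` fold normally, while three kinds of characters stay unchanged: U+0130 (its lowercase is two code points), U+212A KELVIN SIGN (3 bytes, lowercases to the 1-byte `k`), and U+023A (2 bytes, lowercases to the 3-byte U+2C65). The cost is that such characters only match case-sensitively; the benefit is that an offset into one pool is always valid in the other.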

Functions§

fold_char 🔒: Lowercase c only if the result is a single char with the same encoded length; otherwise return c unchanged.
fold_str: Fold a valid UTF-8 string (query needle) with the same rule as the pool.
has_uppercase: True if folding would change s — i.e. the needle benefits from the case-insensitive pool at all.
push_code_point 🔒: Append the WTF-8 encoding of a single code point (may be a lone surrogate).
push_wtf8_pair: Decode UTF-16 (with possible unpaired surrogates) and append both the WTF-8 original and its folded form. The two outputs always grow by the same number of bytes.
utf8_len 🔒
wtf8_to_utf16: Decode WTF-8 back to UTF-16 units (inverse of push_wtf8_pair’s name output). Used when handing names across the FFI boundary.
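The decoding direction can be sketched as follows. This is a simplified illustration, not the module's `wtf8_to_utf16`: the name `wtf8_to_utf16_sketch` is hypothetical, and the code assumes its input is well-formed WTF-8 (it panics on a truncated sequence instead of reporting an error).

```rust
/// Hypothetical sketch: decode well-formed WTF-8 into UTF-16 code units.
/// Lone surrogates come back as the single u16 they started as.
fn wtf8_to_utf16_sketch(bytes: &[u8]) -> Vec<u16> {
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        let b0 = bytes[i] as u32;
        // Continuation bytes contribute their low six bits.
        let cont = |k: usize| (bytes[i + k] as u32) & 0x3F;
        let (cp, len) = if b0 < 0x80 {
            (b0, 1)
        } else if b0 < 0xE0 {
            (((b0 & 0x1F) << 6) | cont(1), 2)
        } else if b0 < 0xF0 {
            (((b0 & 0x0F) << 12) | (cont(1) << 6) | cont(2), 3)
        } else {
            (((b0 & 0x07) << 18) | (cont(1) << 12) | (cont(2) << 6) | cont(3), 4)
        };
        if cp >= 0x10000 {
            // Supplementary code point: split into a surrogate pair.
            let v = cp - 0x10000;
            out.push(0xD800 + (v >> 10) as u16);
            out.push(0xDC00 + (v & 0x3FF) as u16);
        } else {
            // BMP code point or lone surrogate: one unit, unchanged.
            out.push(cp as u16);
        }
        i += len;
    }
    out
}
```

Feeding it the bytes `61 ED A0 80 62` yields the units `[0x61, 0xD800, 0x62]`, which is the round-trip property the module description relies on.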
