WTF-8 name encoding and search-oriented case folding.
NTFS file names are arbitrary u16 sequences and may contain unpaired surrogates. Storing them as lossy UTF-8 would corrupt such names and make them impossible to open from the UI (docs/ARCHITECTURE.md). WTF-8 encodes each unpaired surrogate as the 3-byte sequence that the UTF-8 bit layout assigns to its code point, is a superset of UTF-8, and round-trips back to the original UTF-16.
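To illustrate the encoding described above, here is a minimal sketch of a UTF-16 to WTF-8 encoder built on the standard library's `char::decode_utf16`. The name `encode_wtf8_sketch` is hypothetical and is not one of this module's functions; the module's own implementation may differ.

```rust
/// Hypothetical helper (not part of this module): encode possibly
/// ill-formed UTF-16 as WTF-8.
fn encode_wtf8_sketch(units: &[u16]) -> Vec<u8> {
    let mut out = Vec::with_capacity(units.len() * 3);
    for decoded in char::decode_utf16(units.iter().copied()) {
        match decoded {
            // Well-formed scalar values use ordinary UTF-8.
            Ok(c) => {
                let mut buf = [0u8; 4];
                out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
            }
            // A lone surrogate (U+D800..=U+DFFF) gets the 3-byte layout
            // that UTF-8 itself refuses to produce.
            Err(e) => {
                let s = e.unpaired_surrogate() as u32;
                out.push(0xE0 | (s >> 12) as u8);
                out.push(0x80 | ((s >> 6) & 0x3F) as u8);
                out.push(0x80 | (s & 0x3F) as u8);
            }
        }
    }
    out
}
```

For the units `[0x0041, 0xD800, 0x0042]` this sketch yields the bytes `41 ED A0 80 42`, whereas `String::from_utf16_lossy` would replace the lone surrogate with U+FFFD and lose the original name.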
The index keeps two pools with shared offsets (docs/ARCHITECTURE.md,
ADR-0003), so the folded form of a name must have exactly the same byte
length as the original. Folding therefore lowercases a code point only
when the result is a single code point of identical encoded length;
everything else is kept as-is. The same rule must be applied to query
needles (fold_str) or case-insensitive matches would misalign.
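The length-preserving rule above can be sketched as follows. This is an illustration using only `char::to_lowercase` and `char::len_utf8` from the standard library. The `_sketch` names are hypothetical stand-ins, and the module's real fold_char and fold_str may be implemented differently.

```rust
/// Hypothetical stand-in for the folding rule: lowercase `c` only when the
/// lowercase mapping is exactly one char with the same UTF-8 length.
fn fold_char_sketch(c: char) -> char {
    let mut lower = c.to_lowercase();
    match (lower.next(), lower.next()) {
        (Some(l), None) if l.len_utf8() == c.len_utf8() => l,
        // Multi-char mappings or a different encoded length: keep as-is.
        _ => c,
    }
}

/// Apply the same rule to a whole query needle.
fn fold_str_sketch(s: &str) -> String {
    s.chars().map(fold_char_sketch).collect()
}
```

Under this rule `README.TXT` folds to `readme.txt`, while U+0130 (capital I with dot above, which lowercases to two code points) and U+212A (Kelvin sign, 3 bytes, which lowercases to the 1-byte `k`) are left unchanged, so byte offsets in the folded text stay aligned with the original.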
Functions
- fold_char 🔒 - Lowercase c only if the result is a single char with the same encoded length; otherwise return c unchanged.
- fold_str - Fold a valid UTF-8 string (query needle) with the same rule as the pool.
- has_uppercase - True if folding would change s, i.e. the needle benefits from the case-insensitive pool at all.
- push_code_point 🔒 - Append the WTF-8 encoding of a single code point (may be a lone surrogate).
- push_wtf8_pair - Decode UTF-16 (with possible unpaired surrogates) and append both the WTF-8 original and its folded form. The two outputs always grow by the same number of bytes.
- utf8_len 🔒
- wtf8_to_utf16 - Decode WTF-8 back to UTF-16 units (inverse of push_wtf8_pair's name output). Used when handing names across the FFI boundary.
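
As a rough illustration of the decoding direction described for wtf8_to_utf16, the sketch below turns WTF-8 bytes back into UTF-16 units. The function name and signature are hypothetical, not this module's actual API. It assumes the input is well-formed WTF-8 and will panic on truncated sequences.

```rust
/// Hypothetical sketch (not the module's signature): decode well-formed
/// WTF-8 bytes into UTF-16 units. Panics if a sequence is truncated.
fn wtf8_to_utf16_sketch(bytes: &[u8]) -> Vec<u16> {
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        let b0 = bytes[i] as u32;
        // Sequence length and leading payload bits come from the first byte.
        let (len, mut cp) = if b0 < 0x80 {
            (1, b0)
        } else if b0 < 0xE0 {
            (2, b0 & 0x1F)
        } else if b0 < 0xF0 {
            (3, b0 & 0x0F)
        } else {
            (4, b0 & 0x07)
        };
        // Each continuation byte contributes six more bits.
        for &b in &bytes[i + 1..i + len] {
            cp = (cp << 6) | (b as u32 & 0x3F);
        }
        i += len;
        if cp < 0x10000 {
            // BMP code point, including a lone surrogate: one unit.
            out.push(cp as u16);
        } else {
            // Supplementary code point: split into a surrogate pair.
            let v = cp - 0x10000;
            out.push(0xD800 | (v >> 10) as u16);
            out.push(0xDC00 | (v & 0x3FF) as u16);
        }
    }
    out
}
```

With this sketch the bytes `41 ED A0 80 42` decode back to the units `[0x0041, 0xD800, 0x0042]`, which is the round-trip property the FFI boundary relies on.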