Module wtf8

WTF-8 name encoding and search-oriented case folding.

NTFS file names are arbitrary u16 sequences and may contain unpaired surrogates. Storing them as lossy UTF-8 would corrupt such names and make them impossible to open from the UI (docs/ARCHITECTURE.md). WTF-8 encodes each unpaired surrogate as the 3-byte sequence that the UTF-8 bit layout yields for its code point, is a superset of UTF-8, and round-trips back to the original UTF-16.
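As an illustration of the round trip described above, here is a minimal sketch of encoding UTF-16 units (including a lone surrogate) to WTF-8 and decoding them back. It is not this module's source: the signatures are assumptions, `encode_wtf8` is a hypothetical helper, and the decoder assumes well-formed input.

```rust
/// Append the WTF-8 encoding of one code point. Unlike `char`, `cp` may be
/// a lone surrogate (0xD800..=0xDFFF), which takes the 3-byte form.
fn push_code_point(out: &mut Vec<u8>, cp: u32) {
    match cp {
        0..=0x7F => out.push(cp as u8),
        0x80..=0x7FF => {
            out.push(0xC0 | (cp >> 6) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
        0x800..=0xFFFF => {
            out.push(0xE0 | (cp >> 12) as u8);
            out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
        _ => {
            out.push(0xF0 | (cp >> 18) as u8);
            out.push(0x80 | ((cp >> 12) & 0x3F) as u8);
            out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
    }
}

/// Hypothetical helper: encode possibly ill-formed UTF-16 as WTF-8.
/// Valid surrogate pairs are combined; unpaired ones are kept as-is.
fn encode_wtf8(units: &[u16]) -> Vec<u8> {
    let mut out = Vec::new();
    for r in char::decode_utf16(units.iter().copied()) {
        let cp = match r {
            Ok(c) => c as u32,
            Err(e) => e.unpaired_surrogate() as u32,
        };
        push_code_point(&mut out, cp);
    }
    out
}

/// Decode WTF-8 back to UTF-16 units. Assumes `bytes` is well-formed WTF-8;
/// truncated input would panic on the index operations.
fn wtf8_to_utf16(bytes: &[u8]) -> Vec<u16> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        let b0 = bytes[i] as u32;
        let cont = |k: usize| (bytes[i + k] as u32) & 0x3F;
        let (cp, len) = if b0 < 0x80 {
            (b0, 1)
        } else if b0 < 0xE0 {
            (((b0 & 0x1F) << 6) | cont(1), 2)
        } else if b0 < 0xF0 {
            (((b0 & 0x0F) << 12) | (cont(1) << 6) | cont(2), 3)
        } else {
            (((b0 & 0x07) << 18) | (cont(1) << 12) | (cont(2) << 6) | cont(3), 4)
        };
        i += len;
        if cp >= 0x10000 {
            // Supplementary code point: split into a surrogate pair.
            let v = cp - 0x10000;
            out.push(0xD800 | (v >> 10) as u16);
            out.push(0xDC00 | (v & 0x3FF) as u16);
        } else {
            // BMP code point or lone surrogate: one unit, unchanged.
            out.push(cp as u16);
        }
    }
    out
}
```

With this sketch, the units `[0x0061, 0xD800, 0x0062]` encode to the bytes `61 ED A0 80 62`, and decoding those bytes returns the original three units, lone surrogate included.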

The index keeps two pools with shared offsets (docs/ARCHITECTURE.md, ADR-0003), so the folded form of a name must have exactly the same byte length as the original. Folding therefore lowercases a code point only when the result is a single code point of identical encoded length; everything else is kept as-is. The same rule must be applied to query needles (fold_str) or case-insensitive matches would misalign.
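The length-preserving rule can be sketched as follows. This is an illustrative reading of the rule stated above, not the module's actual code; the signatures are assumptions, and it relies only on the standard library's `char::to_lowercase` and `char::len_utf8`.

```rust
/// Lowercase `c` only when the result is exactly one `char` whose UTF-8
/// length equals that of `c`; otherwise return `c` unchanged.
fn fold_char(c: char) -> char {
    let mut lower = c.to_lowercase();
    match (lower.next(), lower.next()) {
        // Exactly one code point, same encoded length: safe to fold.
        (Some(l), None) if l.len_utf8() == c.len_utf8() => l,
        // Multi-code-point or length-changing lowercase: keep the original.
        _ => c,
    }
}

/// Fold a query needle with the same per-character rule, so that byte
/// offsets in the folded text line up with the original text.
fn fold_str(s: &str) -> String {
    s.chars().map(fold_char).collect()
}
```

Under this sketch `'A'` folds to `'a'` and `'É'` to `'é'`, while U+0130 (whose lowercase is two code points), U+212A KELVIN SIGN (3 bytes, lowercase `k` is 1 byte) and U+023A (2 bytes, lowercase U+2C65 is 3 bytes) are left unchanged. Because every character keeps its encoded length, an offset found in the folded text can be used directly to slice the original.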

Functions

fold_char 🔒
Lowercase c only if the result is a single char with the same encoded length; otherwise return c unchanged.
fold_str
Fold a valid UTF-8 string (query needle) with the same rule as the pool.
has_uppercase
True if folding would change s, i.e. the needle benefits from the case-insensitive pool at all.
push_code_point 🔒
Append the WTF-8 encoding of a single code point (may be a lone surrogate).
push_wtf8_pair
Decode UTF-16 (with possible unpaired surrogates) and append both the WTF-8 original and its folded form. The two outputs always grow by the same number of bytes.
utf8_len 🔒
wtf8_to_utf16
Decode WTF-8 back to UTF-16 units (inverse of push_wtf8_pair's name output). Used when handing names across the FFI boundary.
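To show how the pieces listed above might fit together, here is a hedged sketch of a `push_wtf8_pair`-style function that appends a name to both pools in lockstep. The parameter list and pool types (`Vec<u8>`) are assumptions made for illustration and are not taken from the module's source.

```rust
/// Same length-preserving fold rule as described in the module docs.
fn fold_char(c: char) -> char {
    let mut lower = c.to_lowercase();
    match (lower.next(), lower.next()) {
        (Some(l), None) if l.len_utf8() == c.len_utf8() => l,
        _ => c,
    }
}

/// Decode UTF-16 units and append the WTF-8 original to `names` and its
/// folded form to `folded`. Both pools grow by the same number of bytes.
/// (Hypothetical signature.)
fn push_wtf8_pair(units: &[u16], names: &mut Vec<u8>, folded: &mut Vec<u8>) {
    let mut buf = [0u8; 4];
    for r in char::decode_utf16(units.iter().copied()) {
        match r {
            Ok(c) => {
                // Scalar value: ordinary UTF-8, folded copy has equal length.
                names.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
                folded.extend_from_slice(fold_char(c).encode_utf8(&mut buf).as_bytes());
            }
            Err(e) => {
                // Unpaired surrogate: 3-byte form, identical in both pools.
                let s = e.unpaired_surrogate() as u32;
                let bytes = [
                    0xE0 | (s >> 12) as u8,
                    0x80 | ((s >> 6) & 0x3F) as u8,
                    0x80 | (s & 0x3F) as u8,
                ];
                names.extend_from_slice(&bytes);
                folded.extend_from_slice(&bytes);
            }
        }
    }
}
```

A caller could record `names.len()` before each call and use that single offset for both pools, since the two vectors stay the same length after every push.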