Expand description
Arena-backed string interner.
Open-addressing hash table over &'a str slots whose payloads
point into a paired Arena. The first call to Interner::intern
with a given byte content allocates the string in the arena and
records the resulting &'a str in the table; every subsequent
call with byte-equal content reuses that pointer in O(1) amortised
time, with no further allocation.
§Why an interner here
Aozora corpora are dense in short, frequently repeated strings
— single-mora ruby readings (の, に, を, で, が), bouten
kind keywords, container kind labels, kaeriten marks. A naive
arena.alloc_str(s) for every visit copies the same bytes into
the arena dozens to hundreds of times per document. Interning
collapses those duplicates into a single allocation; the table’s
per-string overhead is one &str slot (16 bytes) plus a fraction
of probe load, which is more than recovered by the eliminated
duplicate-byte allocations on any document with non-trivial
repetition.
§Algorithm
- Open addressing with linear probing.
- Power-of-two capacity (initial 256) so slot index is a single
bitmask, no
%modulo. - Resize at 7/8 load factor; growth doubles capacity and rebuilds the table via fresh probing.
- FxHash-style multiply-then-xor mix on the byte stream:
one
wrapping_mul+ one xor-shift per byte. Fast on short strings (< 32 bytes — the dominant length class for Aozora ruby readings); no per-call state allocation that stdSipHashwould incur. - Inline cache for the most recent intern result. Long runs
of identical interns (e.g. 100 consecutive
《の》ruby readings in a poem) skip both the hash and the probe. - Length threshold: strings longer than 64 bytes bypass the
table entirely and go straight to
arena.alloc_str. The hash + probe cost is dominated by the byte scan for long strings, and long strings are very rarely repeated in practice (raw annotations, gaiji descriptions). Skipping them avoids polluting the table with unique entries that will never be queried again.
§Memory model
The interner does not own the underlying arena — it borrows it
immutably. This is sound because Arena::alloc_str takes
&self (bumpalo allows interior mutation through a shared
reference). A single arena can therefore back multiple interners
(e.g. one per worker in a parallel parse path), each with its own
probe table.
Structs§
- Intern
Stats - Diagnostic counters surfaced by
Interner::stats. - Interner
- Open-addressing intern table over arena-allocated strings.
Constants§
- FX_
PRIME 🔒 - FxHash-style mix constant. The same constant rustc internally uses
for
FxHasher; chosen for fast diffusion on short inputs. - INITIAL_
CAPACITY 🔒 - Initial table capacity. Power of two so probe-index =
hash & mask. 256 covers the typical small-to-medium document without growth; large (multi-MB) documents grow once or twice on the way to ~2048. - INTERN_
LENGTH_ 🔒LIMIT - Length threshold beyond which strings bypass the intern table. Picked from corpus profile: ruby readings, kaeriten marks, and container kind labels are all under 32 bytes; raw annotations and gaiji descriptions can exceed 64 bytes and almost never repeat.
Functions§
- fx_hash 🔒
wrapping_mul-and-xor mix loop. Fast on short inputs (the dominant case for Aozora ruby readings); avoids the per-call state setup cost of stdSipHash.