Skip to main content

Module intern

Module intern 

Source
Expand description

Arena-backed string interner.

Open-addressing hash table over &'a str slots whose payloads point into a paired Arena. The first call to Interner::intern with a given byte content allocates the string in the arena and records the resulting &'a str in the table; every subsequent call with byte-equal content reuses that pointer in O(1) amortised time, with no further allocation.

§Why an interner here

Aozora corpora are dense in short, frequently repeated strings — single-mora ruby readings (, , , , ), bouten kind keywords, container kind labels, kaeriten marks. A naive arena.alloc_str(s) for every visit copies the same bytes into the arena dozens to hundreds of times per document. Interning collapses those duplicates into a single allocation; the table’s per-string overhead is one &str slot (16 bytes) plus a fraction of probe load, which is more than recovered by the eliminated duplicate-byte allocations on any document with non-trivial repetition.

§Algorithm

  • Open addressing with linear probing.
  • Power-of-two capacity (initial 256) so slot index is a single bitmask, no % modulo.
  • Resize at 7/8 load factor; growth doubles capacity and rebuilds the table via fresh probing.
  • FxHash-style multiply-then-xor mix on the byte stream: one wrapping_mul + one xor-shift per byte. Fast on short strings (< 32 bytes — the dominant length class for Aozora ruby readings); no per-call state allocation that std SipHash would incur.
  • Inline cache for the most recent intern result. Long runs of identical interns (e.g. 100 consecutive 《の》 ruby readings in a poem) skip both the hash and the probe.
  • Length threshold: strings longer than 64 bytes bypass the table entirely and go straight to arena.alloc_str. The hash + probe cost is dominated by the byte scan for long strings, and long strings are very rarely repeated in practice (raw annotations, gaiji descriptions). Skipping them avoids polluting the table with unique entries that will never be queried again.

§Memory model

The interner does not own the underlying arena — it borrows it immutably. This is sound because Arena::alloc_str takes &self (bumpalo allows interior mutation through a shared reference). A single arena can therefore back multiple interners (e.g. one per worker in a parallel parse path), each with its own probe table.

Structs§

InternStats
Diagnostic counters surfaced by Interner::stats.
Interner
Open-addressing intern table over arena-allocated strings.

Constants§

FX_PRIME 🔒
FxHash-style mix constant. The same constant rustc internally uses for FxHasher; chosen for fast diffusion on short inputs.
INITIAL_CAPACITY 🔒
Initial table capacity. Power of two so probe-index = hash & mask. 256 covers the typical small-to-medium document without growth; large (multi-MB) documents grow once or twice on the way to ~2048.
INTERN_LENGTH_LIMIT 🔒
Length threshold beyond which strings bypass the intern table. Picked from corpus profile: ruby readings, kaeriten marks, and container kind labels are all under 32 bytes; raw annotations and gaiji descriptions can exceed 64 bytes and almost never repeat.

Functions§

fx_hash 🔒
wrapping_mul-and-xor mix loop. Fast on short inputs (the dominant case for Aozora ruby readings); avoids the per-call state setup cost of std SipHash.