Pre-/post-process pass that hides 青空文庫 trigger characters inside CommonMark fenced code blocks.
§Why this exists
aozora_pipeline recognises every | / 《 / 》 / [ / ] / ※ /
〔 / 〕 / 「 / 」 as a candidate trigger and rewrites it
into a PUA sentinel before comrak ever sees the source. That is
exactly what we want for prose; it is exactly what we don’t want
inside a fenced code block, where every byte is supposed to flow
through to <pre><code> literally.
aozora_pipeline is intentionally CommonMark-blind (ADR-0010 — the
parser core has no opinion on Markdown), so the responsibility for
teaching it about code-block context lives here. We:
- Scan the source line by line and locate every fenced code block (CommonMark info-string fence: a run of three or more backticks or three or more tildes after at most three leading spaces, closed by a same-character run that is at least as long).
- Replace each Aozora trigger character inside a fence with MASK_CHAR (U+E000, Private Use Area, distinct from the four sentinels U+E001..U+E004) and stash the original char in insertion order.
- After comrak::format_html, restore the trigger characters in the HTML output by walking the originals list in the same order. comrak’s HTML escape only touches <, >, &, "; MASK_CHAR flows through untouched.
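The mask-then-restore round trip described in the steps above can be sketched as follows. This is a minimal illustration, not the module's implementation. The names `mask_sketch` and `unmask_sketch` are hypothetical, and the `TRIGGERS` array is copied from the list at the top of this page and may not match the real AOZORA_TRIGGERS. Fence detection is deliberately simplified to "a line that starts with three backticks toggles the fence".

```rust
/// Stand-in for the module's MASK_CHAR (U+E000, as documented above).
const MASK_CHAR: char = '\u{E000}';

/// Illustrative trigger set taken from the prose above (an assumption;
/// the real AOZORA_TRIGGERS constant may differ).
const TRIGGERS: [char; 10] = ['|', '《', '》', '[', ']', '※', '〔', '〕', '「', '」'];

/// Simplified masking pass: a line starting with three backticks toggles
/// the "inside a fence" state. Tildes, indentation and fence widths are
/// ignored here to keep the focus on the mask/restore bookkeeping.
fn mask_sketch(source: &str) -> (String, Vec<char>) {
    let mut out = String::with_capacity(source.len());
    let mut originals = Vec::new();
    let mut in_fence = false;
    for line in source.split_inclusive('\n') {
        if line.starts_with("```") {
            // Fence lines themselves are copied through unchanged.
            in_fence = !in_fence;
            out.push_str(line);
            continue;
        }
        for c in line.chars() {
            if in_fence && TRIGGERS.contains(&c) {
                originals.push(c); // remember the original, in scan order
                out.push(MASK_CHAR);
            } else {
                out.push(c);
            }
        }
    }
    (out, originals)
}

/// Restore pass: each MASK_CHAR consumes the next stashed original.
/// A MASK_CHAR with no original left is passed through as-is.
fn unmask_sketch(html: &str, originals: &[char]) -> String {
    let mut next = originals.iter();
    html.chars()
        .map(|c| if c == MASK_CHAR { next.next().copied().unwrap_or(c) } else { c })
        .collect()
}
```

Because both passes walk their input front to back, the n-th MASK_CHAR in the output always corresponds to the n-th stashed character, which is why a plain ordered list is enough and no position table is needed.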
§What about \u{E000} collisions?
aozora_pipeline’s Phase 0 already scans for source-supplied PUA
characters and emits a Diagnostic::SourceContainsPua for any
encountered. We pre-scan for MASK_CHAR in the original source
and skip masking entirely if any is present (returning the source
verbatim and an empty originals list). That preserves the lexer’s
diagnostic on the user’s pristine input and avoids an
ambiguity-of-origin in the unmask step.
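A minimal sketch of that guard, assuming the masking routine can be passed in as a closure. The wrapper name `mask_unless_collision` and its signature are hypothetical and chosen only for illustration:

```rust
/// Stand-in for the module's MASK_CHAR (U+E000, as documented above).
const MASK_CHAR: char = '\u{E000}';

/// Hypothetical wrapper: `mask` stands in for the real masking routine.
/// If the source already contains MASK_CHAR, return it verbatim with an
/// empty originals list so that the later unmask step has nothing to do.
fn mask_unless_collision<F>(source: &str, mask: F) -> (String, Vec<char>)
where
    F: FnOnce(&str) -> (String, Vec<char>),
{
    if source.contains(MASK_CHAR) {
        return (source.to_owned(), Vec::new());
    }
    mask(source)
}
```

With an empty originals list, any U+E000 that reaches the HTML is known to come from the user's input, so the unmask step cannot mistake it for a masked trigger.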
Structs§
Enums§
Constants§
- AOZORA_TRIGGERS 🔒 - Every char aozora_pipeline treats as a recogniser trigger. Mirrors the upstream aozora_pipeline Phase 1 event tokeniser; if the upstream list grows, this list must follow.
- MASK_CHAR 🔒 - Private-use code point used to stand in for an Aozora trigger character that lives inside a fenced code block. Distinct from aozora_pipeline::INLINE_SENTINEL (U+E001) and the three block sentinels (U+E002..U+E004), so the masking pass cannot collide with the lexer’s own sentinels.
Functions§
- is_fence_close 🔒 - Recognise a closing fence: same marker char as open, at least open.width repetitions, optional leading indent up to 3 spaces, nothing but whitespace after the run.
- mask_code_block_triggers 🔒 - Mask every Aozora trigger character that appears inside a fenced code block. Returns the modified source and the ordered list of original characters that were replaced (for use by unmask_html).
- parse_fence_open 🔒 - Recognise the opening of a fenced code block on this line. CommonMark allows up to 3 leading spaces before the fence run. Returns the fence shape if line is a valid open fence.
- trim_leading_indent 🔒 - Strip up to max leading ASCII spaces from line. Tabs are not expanded: CommonMark allows them inside the indent budget, but our masking pass is a pre-pass for trigger char masking, not a CommonMark conformance check; tabs flow through untouched and the fence-detector simply fails on lines that lead with a tab. That is a strict subset of valid fences but matches every real-world afm source we have seen.
- unmask_html 🔒 - Reverse the masking. For every MASK_CHAR in html, take the next entry from originals (in source-scan order, which matches the order they appear in HTML).
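The fence helpers listed above might look roughly like the sketch below. The signatures and the `Fence` struct are assumptions: the descriptions only confirm that the fence shape carries a marker char and a `width`. The check that rejects backticks in a backtick fence's info string follows the CommonMark rule and may or may not be present in the real code.

```rust
/// Hypothetical fence shape (field names are assumptions).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Fence {
    marker: char,
    width: usize,
}

/// Strip up to `max` leading ASCII spaces. Tabs are not expanded, so a
/// line that leads with a tab is returned unchanged.
fn trim_leading_indent(line: &str, max: usize) -> &str {
    let spaces = line.bytes().take(max).take_while(|&b| b == b' ').count();
    &line[spaces..] // only ASCII spaces were counted, so this is a char boundary
}

/// Recognise an opening fence: at most 3 leading spaces, then a run of
/// three or more backticks or tildes.
fn parse_fence_open(line: &str) -> Option<Fence> {
    let rest = trim_leading_indent(line, 3);
    let marker = rest.chars().next().filter(|&c| c == '`' || c == '~')?;
    let info = rest.trim_start_matches(marker);
    let width = rest.len() - info.len(); // markers are ASCII: bytes == chars
    if width < 3 {
        return None;
    }
    // CommonMark: a backtick fence's info string may not contain backticks.
    if marker == '`' && info.contains('`') {
        return None;
    }
    Some(Fence { marker, width })
}

/// Recognise a closing fence: same marker, at least `open.width`
/// repetitions, at most 3 leading spaces, only whitespace after the run.
fn is_fence_close(line: &str, open: Fence) -> bool {
    let rest = trim_leading_indent(line, 3);
    let after = rest.trim_start_matches(open.marker);
    let width = rest.len() - after.len();
    width >= open.width && after.trim().is_empty()
}
```

In this sketch a four-space or tab-led line never opens or closes a fence, which mirrors the "strict subset of valid fences" behaviour described for trim_leading_indent.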