Module code_block_mask


Pre-/post-processing pass that hides Aozora Bunko (青空文庫) trigger characters inside CommonMark fenced code blocks.

§Why this exists

aozora_pipeline recognises every ｜ / 《 / 》 / ［ / ］ / ※ / 〔 / 〕 / 「 / 」 as a candidate trigger and rewrites it into a PUA sentinel before comrak ever sees the source. That is exactly what we want for prose; it is exactly what we don’t want inside a fenced code block, where every byte is supposed to flow through to <pre><code> literally.

aozora_pipeline is intentionally CommonMark-blind (ADR-0010 — the parser core has no opinion on Markdown), so the responsibility for teaching it about code-block context lives here. We:

1. Scan the source line by line and locate every fenced code block (CommonMark fence: a run of three or more backticks or three or more tildes after at most three leading spaces, closed by a same-character run that is at least as long).
2. Replace each Aozora trigger character inside a fence with MASK_CHAR (U+E000, in the Private Use Area, distinct from the four sentinels U+E001..U+E004) and stash the original char in insertion order.
3. After comrak::format_html, restore the trigger characters in the HTML output by walking the originals list in the same order. comrak’s HTML escape only touches <, >, &, "; MASK_CHAR flows through untouched.
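The three steps above can be sketched as follows. This is a minimal illustration, not the module's actual code: the names `fence_run`, `mask`, and `unmask` are hypothetical, the trigger list is copied from the description above, and fence detection is deliberately simplified (it does not check the info string).

```rust
/// Stand-in for a masked trigger (U+E000), as described above.
const MASK_CHAR: char = '\u{E000}';

/// Trigger characters as listed in this module's description.
const AOZORA_TRIGGERS: [char; 10] = ['｜', '《', '》', '［', '］', '※', '〔', '〕', '「', '」'];

/// Hypothetical helper: returns (marker, run length) if `line` starts with a
/// run of three or more backticks or tildes after at most three spaces.
fn fence_run(line: &str) -> Option<(char, usize)> {
    let indent = line.chars().take_while(|&c| c == ' ').count();
    if indent > 3 {
        return None;
    }
    let rest = &line[indent..];
    let marker = rest.chars().next().filter(|&c| c == '`' || c == '~')?;
    let width = rest.chars().take_while(|&c| c == marker).count();
    if width >= 3 { Some((marker, width)) } else { None }
}

/// Steps 1 and 2: mask triggers inside fences, recording the originals in order.
fn mask(source: &str) -> (String, Vec<char>) {
    let mut out = String::with_capacity(source.len());
    let mut originals = Vec::new();
    let mut open: Option<(char, usize)> = None;

    for line in source.split_inclusive('\n') {
        let body = line.trim_end_matches(|c| c == '\r' || c == '\n');
        match open {
            None => {
                // Outside a fence: copy verbatim and check for an opening fence.
                open = fence_run(body);
                out.push_str(line);
            }
            Some((marker, width)) => {
                // A closing fence uses the same marker, is at least as long,
                // and has only whitespace after the run.
                let closes = fence_run(body).map_or(false, |(m, w)| {
                    m == marker && w >= width && body.trim_start()[w..].trim().is_empty()
                });
                if closes {
                    open = None;
                    out.push_str(line);
                } else {
                    for c in line.chars() {
                        if AOZORA_TRIGGERS.contains(&c) {
                            originals.push(c);
                            out.push(MASK_CHAR);
                        } else {
                            out.push(c);
                        }
                    }
                }
            }
        }
    }
    (out, originals)
}

/// Step 3: put the originals back, one per MASK_CHAR, in order.
fn unmask(html: &str, originals: &[char]) -> String {
    let mut pending = originals.iter().copied();
    html.chars()
        .map(|c| if c == MASK_CHAR { pending.next().unwrap_or(c) } else { c })
        .collect()
}
```

In this sketch, triggers outside a fence are left alone for the lexer, and `unmask` relies on the HTML preserving the relative order of the masked characters.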

§What about `\u{E000}` collisions?

aozora_pipeline’s Phase 0 already scans for source-supplied PUA characters and emits a Diagnostic::SourceContainsPua for each one it encounters. We pre-scan the original source for MASK_CHAR and skip masking entirely if any is present, returning the source verbatim and an empty originals list. That preserves the lexer’s diagnostic on the user’s pristine input and avoids ambiguity in the unmask step about whether a given MASK_CHAR came from the user or from the masking pass.
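A minimal sketch of that guard, assuming the masking logic is passed in as a closure so the example stays self-contained. The wrapper name `mask_unless_collision` is hypothetical and is not part of this module.

```rust
/// Stand-in for a masked trigger (U+E000), as described above.
const MASK_CHAR: char = '\u{E000}';

/// Hypothetical wrapper: run `mask` only when the source contains no
/// user-supplied MASK_CHAR; otherwise return the source untouched together
/// with an empty originals list, so unmasking becomes a no-op.
fn mask_unless_collision<F>(source: &str, mask: F) -> (String, Vec<char>)
where
    F: FnOnce(&str) -> (String, Vec<char>),
{
    if source.contains(MASK_CHAR) {
        return (source.to_owned(), Vec::new());
    }
    mask(source)
}
```

Because the originals list is empty in the collision case, a later unmask pass has nothing to substitute and the user's own U+E000 reaches the lexer unchanged.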

Structs§

FenceOpen 🔒

Enums§

MaskState 🔒

Constants§

AOZORA_TRIGGERS 🔒: Every char aozora_pipeline treats as a recogniser trigger. Mirrors the upstream aozora_pipeline Phase 1 event tokeniser; if the upstream list grows, this list must follow.
MASK_CHAR 🔒: Private-use code point used to stand in for an Aozora trigger character that lives inside a fenced code block. Distinct from aozora_pipeline::INLINE_SENTINEL (U+E001) and the three block sentinels (U+E002..U+E004), so the masking pass cannot collide with the lexer’s own sentinels.

Functions§

is_fence_close 🔒: Recognise a closing fence: same marker char as open, at least open.width repetitions, optional leading indent up to 3 spaces, nothing but whitespace after the run.
mask_code_block_triggers 🔒: Mask every Aozora trigger character that appears inside a fenced code block. Returns the modified source and the ordered list of original characters that were replaced (for use by unmask_html).
parse_fence_open 🔒: Recognise the opening of a fenced code block on this line. CommonMark allows up to 3 leading spaces before the fence run. Returns the fence shape if line is a valid open fence.
trim_leading_indent 🔒: Strip up to max leading ASCII spaces from line. Tabs are not expanded: CommonMark allows them inside the indent budget, but this pass only masks trigger characters and is not a CommonMark conformance check. Tabs flow through untouched, and the fence detector simply fails on lines that lead with a tab. The detector therefore accepts a strict subset of valid fences, which still covers every real-world afm source we have seen.
unmask_html 🔒: Reverse the masking. For every MASK_CHAR in html, take the next entry from originals (in source-scan order, which matches the order they appear in HTML).
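The fence helpers listed above might look roughly like this. The signatures are assumptions, since the listing does not show them. The `marker` field name on FenceOpen is also assumed; only `width` is named in the descriptions. The rejection of backticks in a backtick fence's info string comes from the CommonMark specification; whether this module enforces it is not stated here.

```rust
/// Shape of an opening fence. `width` is named in the docs above;
/// `marker` is an assumed field name.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct FenceOpen {
    marker: char,
    width: usize,
}

/// Strip up to `max` leading ASCII spaces. Tabs are not expanded.
fn trim_leading_indent(line: &str, max: usize) -> &str {
    let spaces = line.bytes().take(max).take_while(|&b| b == b' ').count();
    &line[spaces..]
}

/// Recognise an opening fence: at most 3 leading spaces, then a run of
/// three or more backticks or tildes.
fn parse_fence_open(line: &str) -> Option<FenceOpen> {
    let rest = trim_leading_indent(line, 3);
    let marker = rest.chars().next().filter(|&c| c == '`' || c == '~')?;
    // The marker is ASCII, so the char count equals the byte length of the run.
    let width = rest.chars().take_while(|&c| c == marker).count();
    if width < 3 {
        return None;
    }
    // CommonMark rule (assumed to apply here): the info string of a
    // backtick fence may not itself contain a backtick.
    if marker == '`' && rest[width..].contains('`') {
        return None;
    }
    Some(FenceOpen { marker, width })
}

/// Recognise a closing fence: same marker, at least `open.width`
/// repetitions, at most 3 leading spaces, only whitespace after the run.
fn is_fence_close(line: &str, open: FenceOpen) -> bool {
    let rest = trim_leading_indent(line, 3);
    let run = rest.chars().take_while(|&c| c == open.marker).count();
    run >= open.width && rest[run..].trim().is_empty()
}
```

A line with four leading spaces or a leading tab fails both checks in this sketch, because the first character after trimming is not a fence marker.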