Ruby (`｜青梅《おうめ》`)

Ruby is a pronunciation gloss attached to a run of base text. In 青空文庫 source it appears in two shapes:

｜青梅《おうめ》            ← explicit-base form
青梅《おうめ》              ← implicit-base form (auto-detect)

Both forms render the same HTML:

<ruby>青梅<rt>おうめ</rt></ruby>

Explicit base (`｜…《…》`)

The full-width vertical bar ｜ (U+FF5C) marks the start of the base text; 《…》 (U+300A / U+300B) wraps the reading. The base is everything between the ｜ and the opening 《. Use this form when:

The base contains characters that the auto-detect heuristic would otherwise skip (kana, ASCII letters, mixed scripts).
The boundary between base and surrounding text is ambiguous.

｜山田《やまだ》さん         → <ruby>山田<rt>やまだ</rt></ruby>さん
｜HTTP《ハイパー・テキスト》 → <ruby>HTTP<rt>ハイパー・テキスト</rt></ruby>
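
The explicit form can be split with plain string operations. The following is a minimal sketch, not the handbook's lexer: `split_explicit_ruby` is a hypothetical helper, it assumes the input starts exactly at the ｜, and it returns `None` for an empty base or reading instead of emitting a diagnostic. Nested 《 inside the reading is not handled.

```rust
/// Hypothetical helper: splits `｜base《reading》rest` into
/// `(base, reading, rest)`. Returns `None` if the input does not start
/// with ｜ (U+FF5C), if either bracket is missing, or if the base or the
/// reading is empty.
fn split_explicit_ruby(input: &str) -> Option<(&str, &str, &str)> {
    // The base starts right after the full-width vertical bar.
    let after_bar = input.strip_prefix('｜')?;
    // The base runs up to the opening 《 (U+300A).
    let (base, tail) = after_bar.split_once('《')?;
    // The reading runs up to the closing 》 (U+300B).
    let (reading, rest) = tail.split_once('》')?;
    if base.is_empty() || reading.is_empty() {
        return None;
    }
    Some((base, reading, rest))
}
```

Because the base is delimited on both sides, it may hold any script: `split_explicit_ruby("｜HTTP《ハイパー・テキスト》")` yields the base `HTTP` even though ASCII letters would never be picked up by auto-detection.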

Implicit base (`…《…》`)

When 《…》 follows a run of kanji without a leading ｜, the parser auto-detects the base by scanning backwards through the kanji run. The scan stops at the first non-kanji character (kana, punctuation, ASCII, full-width digit), and the base is the kanji run between that character and the 《.

青梅《おうめ》     → <ruby>青梅<rt>おうめ</rt></ruby>
お青梅《おうめ》   → お<ruby>青梅<rt>おうめ</rt></ruby>
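
The backward scan can be sketched as follows. This is an illustration under assumptions, not the parser's actual code: `implicit_base_start` is a hypothetical name, `open` is assumed to be the byte offset of the opening 《, and the "kanji" test is passed in as a closure so the sketch stays independent of the exact predicate.

```rust
/// Hypothetical helper: returns the byte offset where the implicit base
/// starts, or `None` when no base character directly precedes the 《.
/// `open` must be the byte offset of the opening 《 in `text`.
fn implicit_base_start(
    text: &str,
    open: usize,
    is_base_char: impl Fn(char) -> bool,
) -> Option<usize> {
    let mut start = open;
    // Walk backwards one char at a time; stop at the first character
    // that is not part of the kanji run.
    for (idx, ch) in text[..open].char_indices().rev() {
        if !is_base_char(ch) {
            break;
        }
        start = idx;
    }
    if start == open {
        None // nothing to anchor the reading to
    } else {
        Some(start)
    }
}
```

For `お青梅《おうめ》` the scan accepts 梅 and 青, stops at お, and the slice `text[start..open]` is `青梅`. For a bare `《おうめ》` it returns `None`, which corresponds to the literal-text case.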

The “kanji” predicate is CJK Unified Ideographs + CJK Compatibility Ideographs + CJK Unified Ideographs Extension A–F + the iteration mark 々. JIS X 0213 plane-2 ideographs not in Unicode are represented as gaiji references (see Gaiji) and likewise terminate the auto-detect.
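
As a rough sketch, the predicate can be written as a range match over the Unicode blocks named above. The function name `is_ruby_base_kanji` is hypothetical, and the ranges are the Unicode block boundaries, which is an assumption about how the real predicate is defined; gaiji references are not characters and are therefore out of scope here.

```rust
/// Hypothetical predicate: true for characters that may belong to an
/// auto-detected ruby base. Ranges follow Unicode block boundaries.
fn is_ruby_base_kanji(c: char) -> bool {
    matches!(
        c,
        '\u{4E00}'..='\u{9FFF}'         // CJK Unified Ideographs
            | '\u{F900}'..='\u{FAFF}'   // CJK Compatibility Ideographs
            | '\u{3400}'..='\u{4DBF}'   // Extension A
            | '\u{20000}'..='\u{2A6DF}' // Extension B
            | '\u{2A700}'..='\u{2B73F}' // Extension C
            | '\u{2B740}'..='\u{2B81F}' // Extension D
            | '\u{2B820}'..='\u{2CEAF}' // Extension E
            | '\u{2CEB0}'..='\u{2EBEF}' // Extension F
            | '\u{3005}'                // iteration mark 々
    )
}
```

Kana, ASCII, and full-width digits all fall outside these ranges, so each of them ends the backward scan.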

Empty reading

｜青梅《》 supplies a base but an empty reading. The lexer emits aozora::lex::empty_ruby_reading (an Error) and the construct degrades to plain text — no Ruby node is built.

In the implicit-base form, a 《》 with empty contents is skipped without a diagnostic: the parser cannot be sure a base was intended, so it treats the bare 《》 as literal text.

Nested ruby (forbidden)

The spec disallows ruby inside ruby. A reading whose body opens another 《…》 (e.g. ｜漢《か《ん》じ》) fires aozora::lex::nested_ruby; the outer ruby is still parsed best-effort. (An adjacent 《《…》》 is a different construct — double-bracket bouten — not nested ruby.)
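
One way to tell these cases apart is to look at what follows an opening 《. The sketch below is illustrative only: `OpenKind` and `classify_after_open` are hypothetical names, the `Unclosed` variant is an addition for completeness, and the real lexer may make the decision differently.

```rust
/// Hypothetical classification of the text that follows an opening 《.
#[derive(Debug, PartialEq)]
enum OpenKind {
    Ruby,          // a plain reading closed by 》
    NestedRuby,    // another 《 opens before the reading closes
    DoubleBracket, // 《《…》》, which is bouten rather than ruby
    Unclosed,      // no closing 》 found (not covered by the text above)
}

/// `after_open` is the source text immediately after an opening 《.
fn classify_after_open(after_open: &str) -> OpenKind {
    // An adjacent second 《 means the double-bracket construct.
    if after_open.starts_with('《') {
        return OpenKind::DoubleBracket;
    }
    for ch in after_open.chars() {
        match ch {
            '《' => return OpenKind::NestedRuby, // opened inside the reading
            '》' => return OpenKind::Ruby,       // reading closed normally
            _ => {}
        }
    }
    OpenKind::Unclosed
}
```

For `｜漢《か《ん》じ》` the text after the first 《 is `か《ん》じ》`, which classifies as `NestedRuby`; `《強調》》` classifies as `DoubleBracket`.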

AST shape

pub struct Ruby<'src> {
    pub base:           NonEmpty<Content<'src>>,  // never empty
    pub reading:        NonEmpty<Content<'src>>,  // never empty
    pub delim_explicit: bool,                     // true for the ｜…《…》 form
}

base and reading are [Content] (a Plain(&str) fast path or a Segments run carrying nested gaiji / annotations), wrapped in NonEmpty so an empty payload is unrepresentable — Phase 3 only emits a Ruby once both sides have content (an empty reading takes the empty-reading path instead). delim_explicit records whether the source used the ｜…《…》 form so the serializer re-emits the ｜ only when the original did.
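
The role of delim_explicit is easiest to see in a round-trip sketch. The code below uses a simplified stand-in, `RubySketch`, with plain `&str` fields instead of `NonEmpty<Content>`; the type and both functions are hypothetical, and HTML escaping is omitted for brevity.

```rust
/// Simplified stand-in for the AST node; the real node wraps its
/// fields in NonEmpty<Content>.
struct RubySketch<'src> {
    base: &'src str,
    reading: &'src str,
    delim_explicit: bool,
}

/// Re-emits 青空文庫 source, writing ｜ only if the original had it.
fn to_aozora(r: &RubySketch) -> String {
    let bar = if r.delim_explicit { "｜" } else { "" };
    format!("{}{}《{}》", bar, r.base, r.reading)
}

/// Renders HTML; the delimiter flag has no effect on the output.
/// Escaping of the base and reading is left out of this sketch.
fn to_html(r: &RubySketch) -> String {
    format!("<ruby>{}<rt>{}</rt></ruby>", r.base, r.reading)
}
```

With this shape, `｜青梅《おうめ》` and `青梅《おうめ》` serialize back to their original spelling while producing identical HTML.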

Edge cases

Input	Output
`青梅《おうめ》`	`<ruby>青梅<rt>おうめ</rt></ruby>`
`｜青梅《おうめ》`	`<ruby>青梅<rt>おうめ</rt></ruby>` (canonical-equivalent)
`｜山田《やまだ》`	`<ruby>山田<rt>やまだ</rt></ruby>`
`｜HTTP《ハイパー・テキスト》`	`<ruby>HTTP<rt>ハイパー・テキスト</rt></ruby>`
`お青梅《おうめ》`	`お<ruby>青梅<rt>おうめ</rt></ruby>` (auto-detect stops at the kana)
`1青梅《おうめ》`	`1<ruby>青梅<rt>おうめ</rt></ruby>` (auto-detect stops at the digit)
`｜青梅《》`	plain text + `empty_ruby_reading`
`《おうめ》`	literal text (no preceding kanji to anchor)
`｜漢《か《ん》じ》`	best-effort ruby + `nested_ruby`

aozora — 青空文庫記法 Parser Handbook