Ruby (|青梅《おうめ》)

Ruby is a pronunciation gloss attached to a run of base text. In 青空文庫 source it appears in two shapes:

|青梅《おうめ》            ← explicit-base form
青梅《おうめ》              ← implicit-base form (auto-detect)

Both forms render the same HTML:

<ruby>青梅<rt>おうめ</rt></ruby>

Explicit base (|…《…》)

The full-width vertical bar (U+FF5C) marks the start of the base text; 《…》 (U+300A / U+300B) wraps the reading. The base runs from the bar to the opening 《. Use this form when:

  • The base contains characters that the auto-detect heuristic would otherwise skip (kana, ASCII letters, mixed scripts).
  • The boundary between base and surrounding text is ambiguous.
|山田《やまだ》さん         → <ruby>山田<rt>やまだ</rt></ruby>さん
|HTTP《ハイパー・テキスト》 → <ruby>HTTP<rt>ハイパー・テキスト</rt></ruby>
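The explicit form can be located with plain string searches, since the bar and the opening bracket delimit the base on both sides. The following is a minimal sketch, not the parser's actual code: `render_first_explicit_ruby` and the three constants are hypothetical names, it handles only the first ruby in the input, and it does no HTML escaping.

```rust
/// Full-width vertical bar that opens an explicit base (U+FF5C).
const BAR: char = '\u{FF5C}';
/// Brackets that wrap the reading (U+300A / U+300B).
const OPEN: char = '\u{300A}';
const CLOSE: char = '\u{300B}';

/// Hypothetical helper: render the first explicit-base ruby in `src` as HTML,
/// copying the surrounding text through unchanged. Returns `None` when no
/// complete bar/base/reading construct with non-empty parts is found.
fn render_first_explicit_ruby(src: &str) -> Option<String> {
    // Byte offsets returned by `find` always sit on char boundaries.
    let bar = src.find(BAR)?;
    let after_bar = bar + BAR.len_utf8();
    let open = after_bar + src[after_bar..].find(OPEN)?;
    let after_open = open + OPEN.len_utf8();
    let close = after_open + src[after_open..].find(CLOSE)?;

    // The base is everything between the bar and the opening bracket.
    let base = &src[after_bar..open];
    let reading = &src[after_open..close];
    if base.is_empty() || reading.is_empty() {
        return None;
    }

    Some(format!(
        "{}<ruby>{}<rt>{}</rt></ruby>{}",
        &src[..bar],
        base,
        reading,
        &src[close + CLOSE.len_utf8()..]
    ))
}
```

Because the base is taken verbatim between the two delimiters, kana, ASCII, and mixed-script bases need no special handling in this form.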

Implicit base

When 《…》 follows a run of kanji without a leading vertical bar, the parser auto-detects the base by scanning backwards through the kanji run. The auto-detect terminates at the first non-kanji character (kana, punctuation, ASCII, full-width digit).

青梅《おうめ》     → <ruby>青梅<rt>おうめ</rt></ruby>
お青梅《おうめ》   → お<ruby>青梅<rt>おうめ</rt></ruby>

The “kanji” predicate covers CJK Unified Ideographs, CJK Compatibility Ideographs, CJK Unified Ideographs Extension A–F, and the iteration mark 々 (U+3005). JIS X 0213 plane-2 ideographs not in Unicode are represented as gaiji references (see Gaiji) and likewise terminate the auto-detect.
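The backward scan and the kanji predicate can be sketched as follows. This is an illustration under stated assumptions rather than the parser's implementation: `is_kanji` and `split_implicit_base` are hypothetical names, the block ranges are the standard Unicode ones for the listed blocks, the iteration mark is assumed to be 々 (U+3005), and gaiji references are not modelled.

```rust
/// Sketch of the "kanji" predicate using Unicode block ranges.
fn is_kanji(ch: char) -> bool {
    matches!(
        ch as u32,
        0x3005                  // iteration mark (assumed to be U+3005)
        | 0x3400..=0x4DBF       // CJK Unified Ideographs Extension A
        | 0x4E00..=0x9FFF       // CJK Unified Ideographs
        | 0xF900..=0xFAFF       // CJK Compatibility Ideographs
        | 0x20000..=0x2A6DF     // Extension B
        | 0x2A700..=0x2B73F     // Extension C
        | 0x2B740..=0x2B81F     // Extension D
        | 0x2B820..=0x2CEAF     // Extension E
        | 0x2CEB0..=0x2EBEF     // Extension F
    )
}

/// Hypothetical helper: split the text that precedes the opening bracket
/// into (untouched prefix, detected base). The scan walks backwards and
/// stops at the first non-kanji character. An empty base means there is
/// no kanji run to anchor the ruby.
fn split_implicit_base(before_open: &str) -> (&str, &str) {
    let mut start = before_open.len();
    for (idx, ch) in before_open.char_indices().rev() {
        if !is_kanji(ch) {
            break;
        }
        // `idx` is the byte offset where this kanji starts.
        start = idx;
    }
    before_open.split_at(start)
}
```

For the input text お青梅 the scan stops at お, so the prefix is お and the base is 青梅; for a kana-only input the base comes back empty, which corresponds to the literal-text case.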

Empty reading

|青梅《》 supplies a base but an empty reading. The lexer emits aozora::lex::empty_ruby_reading (an Error) and the construct degrades to plain text — no Ruby node is built.

In the implicit-base form, a 《》 with empty contents is skipped without a diagnostic: the parser can’t be sure a base was intended, so it treats the bare 《》 as literal text.
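The two empty-reading behaviours can be summarised as a small decision table. The sketch below is illustrative only: `ReadingOutcome` and `classify_reading` are hypothetical names, and the diagnostic is carried as a plain string rather than whatever type the lexer really uses.

```rust
/// Hypothetical outcome of inspecting a reading.
#[derive(Debug, PartialEq)]
enum ReadingOutcome {
    /// Reading has content: a Ruby node can be built.
    BuildRuby,
    /// Explicit form with empty reading: plain text plus a diagnostic.
    PlainTextWithDiagnostic(&'static str),
    /// Implicit form with empty reading: literal text, no diagnostic.
    LiteralText,
}

/// Decide what to do with a reading, given whether the source used the
/// explicit bar form.
fn classify_reading(has_explicit_bar: bool, reading: &str) -> ReadingOutcome {
    match (reading.is_empty(), has_explicit_bar) {
        (false, _) => ReadingOutcome::BuildRuby,
        (true, true) => {
            ReadingOutcome::PlainTextWithDiagnostic("aozora::lex::empty_ruby_reading")
        }
        (true, false) => ReadingOutcome::LiteralText,
    }
}
```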

Nested ruby (forbidden)

The spec disallows ruby inside ruby. A reading whose body opens another 《…》 (e.g. |漢《か《ん》じ》) fires aozora::lex::nested_ruby; the outer ruby is still parsed best-effort. (An adjacent 《《…》》 is a different construct — double-bracket bouten — not nested ruby.)
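One way to detect a nested opening bracket while still finding the outer closing bracket is to track bracket depth. This is a sketch under assumptions, not the lexer's actual strategy: `ReadingScan` and `scan_reading` are hypothetical names, and telling nested ruby apart from double-bracket bouten is left out.

```rust
/// Hypothetical result of scanning a reading body.
#[derive(Debug, PartialEq)]
struct ReadingScan<'a> {
    /// Text between the outer brackets, inner brackets included.
    body: &'a str,
    /// True when another opening bracket appeared inside the reading,
    /// which is where a nested_ruby diagnostic would fire.
    nested: bool,
}

/// `after_open` is the source text immediately after the outer opening
/// bracket. Returns `None` when the outer bracket is never closed.
fn scan_reading(after_open: &str) -> Option<ReadingScan<'_>> {
    let mut depth = 1usize;
    let mut nested = false;
    for (idx, ch) in after_open.char_indices() {
        match ch {
            '\u{300A}' => {
                depth += 1;
                nested = true;
            }
            '\u{300B}' => {
                depth -= 1;
                if depth == 0 {
                    // Outer bracket closed: the body ends just before it.
                    return Some(ReadingScan { body: &after_open[..idx], nested });
                }
            }
            _ => {}
        }
    }
    None
}
```

Matching on depth rather than on the first closing bracket is what lets the outer ruby be recovered best-effort even though the nesting itself is an error.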

AST shape

pub struct Ruby<'src> {
    pub base:           NonEmpty<Content<'src>>,  // never empty
    pub reading:        NonEmpty<Content<'src>>,  // never empty
    pub delim_explicit: bool,                     // true for the |…《…》 form
}

base and reading are Content values (a Plain(&str) fast path or a Segments run carrying nested gaiji / annotations), wrapped in NonEmpty so an empty payload is unrepresentable. Phase 3 only emits a Ruby once both sides have content (an empty reading takes the empty-reading path instead). delim_explicit records whether the source used the |…《…》 form so the serializer re-emits the bar only when the original did.
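The two ideas in this shape, an unrepresentable empty payload and a round-trip flag, can be illustrated with a simplified stand-in. The sketch below assumes plain string slices instead of Content; `NonEmptyStr`, `RubySketch`, and `serialize` are hypothetical names and do not reflect the crate's real API.

```rust
/// Simplified stand-in for NonEmpty<Content>: a string slice that can
/// only be constructed when it is non-empty.
#[derive(Debug, Clone, Copy, PartialEq)]
struct NonEmptyStr<'src>(&'src str);

impl<'src> NonEmptyStr<'src> {
    /// Returns `None` for an empty slice, so an empty payload never
    /// reaches the AST.
    fn new(text: &'src str) -> Option<Self> {
        if text.is_empty() { None } else { Some(Self(text)) }
    }

    fn as_str(&self) -> &'src str {
        self.0
    }
}

/// Simplified ruby node with the same three fields as the documented struct.
struct RubySketch<'src> {
    base: NonEmptyStr<'src>,
    reading: NonEmptyStr<'src>,
    delim_explicit: bool,
}

/// Re-emit source text, writing the full-width bar only when the original
/// used the explicit form.
fn serialize(ruby: &RubySketch<'_>) -> String {
    let mut out = String::new();
    if ruby.delim_explicit {
        out.push('\u{FF5C}');
    }
    out.push_str(ruby.base.as_str());
    out.push('\u{300A}');
    out.push_str(ruby.reading.as_str());
    out.push('\u{300B}');
    out
}
```

Keeping the flag on the node means both source forms can share one HTML rendering while still serializing back to the form the author wrote.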

Edge cases

Input                        Output
青梅《おうめ》               <ruby>青梅<rt>おうめ</rt></ruby>
|青梅《おうめ》              <ruby>青梅<rt>おうめ</rt></ruby> (canonical-equivalent)
|山田《やまだ》              <ruby>山田<rt>やまだ</rt></ruby>
|HTTP《ハイパー・テキスト》  <ruby>HTTP<rt>ハイパー・テキスト</rt></ruby>
お青梅《おうめ》             お<ruby>青梅<rt>おうめ</rt></ruby> (auto-detect stops at the kana)
1青梅《おうめ》              1<ruby>青梅<rt>おうめ</rt></ruby> (auto-detect stops at the digit)
|青梅《》                    plain text + empty_ruby_reading
《おうめ》                   literal text (no preceding kanji to anchor)
|漢《か《ん》じ》            best-effort ruby + nested_ruby

See also