Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Ruby (|青梅《おうめ》)

Ruby is a pronunciation gloss attached to a run of base text. In 青空文庫 source it appears in two shapes:

|青梅《おうめ》            ← explicit-base form
青梅《おうめ》              ← implicit-base form (auto-detect)

Both forms render the same HTML:

<ruby>青梅<rt>おうめ</rt></ruby>

Explicit base (|…《…》)

The full-width vertical bar (U+FF5C) marks the start of the base text; 《…》 (U+300A / U+300B) wraps the reading. The base runs from to the . Use this form when:

  • The base contains characters that the auto-detect heuristic would otherwise skip (kana, ASCII letters, mixed scripts).
  • The boundary between base and surrounding text is ambiguous.
|山田《やまだ》さん         → <ruby>山田<rt>やまだ</rt></ruby>さん
|HTTP《ハイパー・テキスト》 → <ruby>HTTP<rt>ハイパー・テキスト</rt></ruby>

Implicit base

When 《…》 follows a run of kanji without a leading , the parser auto-detects the base by scanning backwards through the kanji run. The auto-detect terminates at the first non-kanji character (kana, punctuation, ASCII, full-width digit).

青梅《おうめ》     → <ruby>青梅<rt>おうめ</rt></ruby>
お青梅《おうめ》   → お<ruby>青梅<rt>おうめ</rt></ruby>

The “kanji” predicate is CJK Unified Ideographs + CJK Compatibility Ideographs + CJK Unified Ideographs Extension A–F

  • the iteration mark . JIS X 0213 plane-2 ideographs not in Unicode are represented as gaiji references (see Gaiji) and likewise terminate the auto-detect.

Empty reading

|青梅《》 is a parse error. The lexer emits diagnostic E0001 (“ruby reading mismatch: target spans N chars but |《》 reading is empty”) and the node is dropped from the AST.

The implicit-base form silently skips a 《》 with empty contents — that combination cannot have arisen from valid markup, so the parser treats the bare 《》 as literal text.

Nested ruby (forbidden)

The spec disallows ruby inside ruby. Sources with |青梅《|お《お》うめ》 are rejected with diagnostic E0002.

AST shape

pub struct Ruby<'src> {
    pub target:  &'src str,   // borrowed from source
    pub reading: &'src str,   // borrowed from source
    pub span:    Span,        // byte range in the source
    pub explicit_base: bool,  // true if the input used the |…《…》 form
}

Both target and reading are &str slices into the Document-owned source — no allocation, no copy. Re-emitting canonical form is exactly:

match (ruby.explicit_base, ruby.target, ruby.reading) {
    (true,  t, r) => format!("|{t}《{r}》"),
    (false, t, r) => format!("{t}《{r}》"),
}

Edge cases

InputOutput
青梅《おうめ》<ruby>青梅<rt>おうめ</rt></ruby>
|青梅《おうめ》<ruby>青梅<rt>おうめ</rt></ruby> (canonical-equivalent)
|山田《やまだ》<ruby>山田<rt>やまだ</rt></ruby>
|HTTP《ハイパー・テキスト》<ruby>HTTP<rt>ハイパー・テキスト</rt></ruby>
お青梅《おうめ》お<ruby>青梅<rt>おうめ</rt></ruby> (auto-detect skips kana)
1青梅《おうめ》1<ruby>青梅<rt>おうめ</rt></ruby> (auto-detect skips digit)
|青梅《》parse error E0001
《おうめ》literal text (no preceding kanji to anchor)
|青梅《|お《お》うめ》parse error E0002

See also