Diagnostics catalogue

aozora is non-fatal by design: the parser always produces a tree, even from malformed input, and reports what it noticed through structured diagnostics that callers choose how to treat. This page is the catalogue.

Each Diagnostic carries:

  • a stable code — a dotted string such as aozora::lex::unclosed_bracket. The string is pinned by a test and never changes within a major release; new diagnostics add new codes.
  • a severity: Error / Warning / Note.
  • a source axis: Source (your input tripped it) or Internal (a library-bug sanity check — see Internal).
  • a span — a byte range in the sanitized source (the Phase 0 output: BOM stripped, CRLF→LF, 〔…〕 accents decomposed). For input with none of those, the sanitized bytes equal the original bytes.
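
As a rough illustration of the four fields above, the sketch below models a diagnostic with local stand-in types. The names `Diagnostic`, `Severity`, `DiagnosticSource`, and `spanned_text` are hypothetical and exist only for this example; the library's real types may be shaped differently. It shows how a byte-range span can be resolved against the sanitized text.

```rust
use std::ops::Range;

// Hypothetical stand-ins mirroring the fields listed above; the real
// aozora types may differ in names and representation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Severity {
    Error,
    Warning,
    Note,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DiagnosticSource {
    Source,
    Internal,
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct Diagnostic {
    code: &'static str,
    severity: Severity,
    source: DiagnosticSource,
    span: Range<usize>, // byte range in the sanitized source
}

/// Returns the sanitized text covered by the span, or None when the span
/// is out of bounds or does not fall on UTF-8 character boundaries.
fn spanned_text<'a>(sanitized: &'a str, d: &Diagnostic) -> Option<&'a str> {
    sanitized.get(d.span.clone())
}

fn main() {
    let sanitized = "前[#ここから2字下げ";
    let d = Diagnostic {
        code: "aozora::lex::unclosed_bracket",
        severity: Severity::Error,
        source: DiagnosticSource::Source,
        span: 3..5, // illustrative: "前" is 3 bytes, "[#" is 2 bytes
    };
    println!("{:?} covers {:?}", d, spanned_text(sanitized, &d));
    // The other variants exist so tools can branch on them.
    let _ = (Severity::Warning, Severity::Note, DiagnosticSource::Internal);
}
```

Because spans are byte offsets, slicing with `str::get` rather than indexing avoids a panic if an offset lands inside a multi-byte character.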

Rendering them

The aozora check CLI renders diagnostics three ways, chosen with --diagnostic-format:

  • human (the default on a terminal) — a graphical miette report: the source line, a caret under the offending span, the label, the help text, and a link back to this page.
  • json (the default when stderr is piped) — the aozora::wire diagnostics envelope, byte-identical to what the WASM / FFI / Python / Extism front doors emit. This is the machine / agent path.
  • short — one grep-able line per diagnostic: path:offset: severity[code]: message.
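
The short format's one-line shape can be sketched as follows. This is only an illustration of the `path:offset: severity[code]: message` layout: the helper names, the lowercase severity spelling, and the message text are assumptions, not the CLI's actual output.

```rust
/// Builds one line in the documented `path:offset: severity[code]: message`
/// shape. The severity spelling and message wording are assumptions.
fn short_line(path: &str, offset: usize, severity: &str, code: &str, message: &str) -> String {
    format!("{path}:{offset}: {severity}[{code}]: {message}")
}

/// Grep-style filter: keeps the lines of a short-format report that carry `code`.
fn lines_with_code<'a>(report: &'a str, code: &str) -> Vec<&'a str> {
    let needle = format!("[{code}]: ");
    report.lines().filter(|line| line.contains(&needle)).collect()
}

fn main() {
    let report = [
        short_line("book.txt", 42, "warning", "aozora::lex::unresolved_gaiji", "hypothetical message"),
        short_line("book.txt", 97, "error", "aozora::lex::unclosed_bracket", "hypothetical message"),
    ]
    .join("\n");

    for line in lines_with_code(&report, "aozora::lex::unclosed_bracket") {
        println!("{line}");
    }
}
```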

Exit codes: 0 (diagnostics printed but tolerated), 1 (--strict with at least one diagnostic), 2 (CLI usage error), 3 (an Internal diagnostic fired — a library bug). See the CLI reference.
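
A script wrapping the CLI might map these exit codes to named outcomes. The sketch below does only that mapping; `CheckOutcome` and `classify_exit` are hypothetical names for this example, and how the exit status is obtained (for instance from a spawned process) is left out.

```rust
// Hypothetical wrapper-side view of the documented exit codes.
#[derive(Debug, PartialEq, Eq)]
enum CheckOutcome {
    Tolerated,     // 0: diagnostics printed but tolerated
    StrictFailure, // 1: --strict with at least one diagnostic
    UsageError,    // 2: CLI usage error
    InternalBug,   // 3: an Internal diagnostic fired
    Unknown(i32),  // anything not listed in the catalogue
}

fn classify_exit(code: i32) -> CheckOutcome {
    match code {
        0 => CheckOutcome::Tolerated,
        1 => CheckOutcome::StrictFailure,
        2 => CheckOutcome::UsageError,
        3 => CheckOutcome::InternalBug,
        other => CheckOutcome::Unknown(other),
    }
}

fn main() {
    for code in [0, 1, 2, 3, 101] {
        println!("{code} -> {:?}", classify_exit(code));
    }
}
```

Keeping an explicit `Unknown` arm lets a wrapper treat unlisted codes (for example, a crash) separately from the four documented ones.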

Library consumers get tree.diagnostics() -> &[Diagnostic] and reach the parts through code(), severity(), source(), and span(). All bindings carry the same structured data.

Source diagnostics

These trace back to your input. The parser emits exactly these — the authoring-error catalogue is complete (no diagnostic is specified-but-unimplemented).

Source contains PUA

aozora::lex::source_contains_pua · Warning

…(U+E001)…        (a literal U+E001..=U+E004 codepoint in the source)

The source contains a codepoint in U+E001..=U+E004, which the lexer reserves as inline / block placeholder sentinels. A source-side occurrence collides with the lexer’s own markers and would confuse the placeholder registry. Fix: remove the private-use codepoint from the source (these are not normal text characters and effectively never occur in real 青空文庫 files).
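
A pre-flight scan for the reserved range is easy to write independently of the library. The sketch below is such a standalone check, not the lexer's own implementation; `find_reserved_pua` is a hypothetical helper name.

```rust
/// Returns the byte offset and character of every codepoint in the
/// reserved range U+E001..=U+E004.
fn find_reserved_pua(source: &str) -> Vec<(usize, char)> {
    source
        .char_indices()
        .filter(|(_, c)| ('\u{E001}'..='\u{E004}').contains(c))
        .collect()
}

fn main() {
    let source = "青空\u{E002}文庫";
    for (offset, c) in find_reserved_pua(source) {
        println!("reserved PUA U+{:04X} at byte {offset}", c as u32);
    }
}
```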

Unclosed bracket

aozora::lex::unclosed_bracket · Error

[#ここから2字下げ            (no matching [#ここで字下げ終わり])

An Aozora open delimiter (ruby 《, annotation [#, quote, …) reached end-of-input with no matching close on the pairing stack. The label points at the opener. The region degrades to plain text; no pair link is emitted. Fix: add the missing close delimiter, or remove the dangling opener.

Unmatched close

aozora::lex::unmatched_close · Error

青空]》            (a close with no matching open on the stack)

A close delimiter was seen with an empty pairing stack, or against a stack top of a different PairKind. The label points at the stray close. Fix: add the matching open delimiter, or remove the stray close.
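
The two pairing diagnostics above can be sketched with a small stack-based scanner. This is a simplified model under stated assumptions: it tracks only 《…》 and [#…], treats every bare ] as an annotation close, and leaves the stack untouched on a mismatched close. The real lexer's delimiter set and recovery rules may differ, and `PairDiag` and `check_pairs` are hypothetical names.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PairKind {
    Ruby,
    Annotation,
}

#[derive(Debug, PartialEq, Eq)]
enum PairDiag {
    UnclosedBracket { opener_at: usize },
    UnmatchedClose { close_at: usize },
}

/// Scans `src` and reports stray closes and openers left on the stack.
/// Offsets are byte offsets into `src`.
fn check_pairs(src: &str) -> Vec<PairDiag> {
    let mut stack: Vec<(PairKind, usize)> = Vec::new();
    let mut diags = Vec::new();
    let mut chars = src.char_indices().peekable();

    while let Some((at, c)) = chars.next() {
        let close = match c {
            '《' => {
                stack.push((PairKind::Ruby, at));
                continue;
            }
            '[' if matches!(chars.peek(), Some((_, '#'))) => {
                chars.next(); // consume the '#'
                stack.push((PairKind::Annotation, at));
                continue;
            }
            '》' => PairKind::Ruby,
            ']' => PairKind::Annotation,
            _ => continue,
        };
        // A close pairs only with a stack top of the same kind.
        let matches_top = stack.last().map_or(false, |(kind, _)| *kind == close);
        if matches_top {
            stack.pop();
        } else {
            diags.push(PairDiag::UnmatchedClose { close_at: at });
        }
    }

    // Whatever is still open at end-of-input is unclosed.
    diags.extend(
        stack
            .into_iter()
            .map(|(_, at)| PairDiag::UnclosedBracket { opener_at: at }),
    );
    diags
}

fn main() {
    println!("{:?}", check_pairs("青空]》"));
    println!("{:?}", check_pairs("[#ここから2字下げ"));
}
```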

Accent decomposition applied

aozora::lex::accent_decomposition_applied · Note

〔cafe'〕        (decomposed to 〔café〕)

A 〔…〕 accent digraph was rewritten to its Unicode-combined form during Phase 0 sanitize (cafe' → café, fune + backtick → funè, …). This is intended behaviour, not an error; it is surfaced as a Note so an editor can show what changed. One note fires per 〔…〕 span that actually contained a digraph; a 〔…〕 with no accent digraph is silent. The span is in sanitized (post-decomposition) coordinates. The transform is loss-free: the serializer reconstructs the original 〔…〕 source form. See ADR-0003. No action required.

Unresolved gaiji

aozora::lex::unresolved_gaiji · Warning

※[#「架空の外字」、第3水準99-99-99]   (men-ku-ten out of range)

A 外字 (gaiji) reference — ※[#…] — resolved to neither a Unicode scalar nor a JIS X 0213 cell: no 第N水準P-R-C men-ku-ten or U+XXXX reference matched, and the description is not itself a single resolvable character. The construct still parses; the renderer falls back to the description text (<span class="aozora-gaiji" data-description="…">…</span>) rather than the intended glyph. The label points at the ※[#…] reference. Fix: correct the men-ku-ten / U+XXXX reference, or accept the description-only rendering. (Fires for top-level references; gaiji nested inside a ruby / bouten reading is not yet flagged.)

Mismatched container close

aozora::lex::mismatched_container_close · Error

[#ここから2字下げ]…[#ここで地付き終わり]   (indent opened, align-end closed)

A paired container opened with one family (indent / warichu / keigakomi / align-end) was closed by a closer of a different family. The comparison is by family, so closing a 2字下げ opener with a plain 字下げ終わり (both indent, differing only in amount) is not flagged; only a genuine family mismatch is. The label points at the close marker. The parser recovers by auto-closing the opener at the closer’s position (the container pair is still emitted, keyed by the open family). Fix: match the closer to the opener: ここから字下げ with ここで字下げ終わり, ここから地付き with ここで地付き終わり, etc.
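
The family-only comparison described above can be modelled in a few lines. The types below are hypothetical stand-ins; the point is only that the indent amount takes no part in the check.

```rust
// Hypothetical model of the four container families named above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Family {
    Indent,
    Warichu,
    Keigakomi,
    AlignEnd,
}

#[derive(Debug, Clone, Copy)]
struct Opener {
    family: Family,
    amount: Option<u32>, // e.g. Some(2) for 2字下げ; ignored by the check
}

/// True when the closer belongs to a different family than the opener.
fn is_mismatched_close(opener: Opener, closer: Family) -> bool {
    opener.family != closer
}

fn main() {
    let indent2 = Opener { family: Family::Indent, amount: Some(2) };
    println!("opener amount: {:?}", indent2.amount);
    println!("closed by indent:    {}", is_mismatched_close(indent2, Family::Indent));
    println!("closed by align-end: {}", is_mismatched_close(indent2, Family::AlignEnd));
    let _ = (Family::Warichu, Family::Keigakomi);
}
```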

Empty ruby reading

aozora::lex::empty_ruby_reading · Error

|青梅《》        (base given, reading empty)

An explicit-base ruby supplied a base (a | precedes the 《) but an empty 《》 reading. Because the | marks the base unambiguously, this is a genuine authoring slip rather than a literal 《》 run, so a bare 青梅《》 with no | is not flagged (the parser can’t be sure a base was intended and treats it as text). The construct degrades to plain text. The label spans the whole |青梅《》. Fix: supply a reading, or drop the |…《》 markers to keep the base as plain text.

Nested ruby

aozora::lex::nested_ruby · Error

|漢《か《ん》じ》      (the reading body opens another 《…》)

A ruby reading body itself opened another ruby. Ruby does not nest; the label points at the inner 《. The outer ruby is still parsed best-effort. Note that an adjacent 《《…》》 is not nested ruby: the tokenizer reads 《《 / 》》 as double-bracket bouten, a separate construct. This diagnostic therefore fires only when the inner 《…》 closes before the outer (text between the two closes, as in the catalogue shape |…《…《…》…》). Fix: close the outer reading before the inner 《, or remove the inner 《…》.

Unrecognised container directive

aozora::lex::unrecognised_container_directive · Warning

[#ここからナントカ]      (no such container kind)

A [#ここから…] directive looked like a paired-container opener but named no known container kind (字下げ, 地付き, 地から N 字上げ). The bracket is kept as a plain Annotation{Unknown} (so output is preserved and the “no bare [#” guarantee holds) but is not treated as a container — any matching [#ここで…終わり] will not pair with it. The label spans the directive. Fix: use a recognised opener, e.g. [#ここから2字下げ] or [#ここから地付き].

TCY target not found

aozora::lex::tcy_target_not_found · Warning

あ[#「い」は縦中横]      (no 「い」 earlier in the line)

A 縦中横 forward reference ([#「X」は縦中横]) named a target that does not appear anywhere in the preceding text, so it has no run to rotate. The directive degrades to an Annotation{Unknown}. The label spans the directive. Fix: check the spelling of the quoted target, or place the [#「X」は縦中横] after the run it should style.

Bouten target ambiguous

aozora::lex::bouten_target_ambiguous · Warning

青空青空[#「青空」に傍点]      (「青空」 occurs twice before the directive)

A forward-reference bouten ([#「X」に傍点]) named a target that occurs more than once in the preceding look-back window, so which run it emphasises is ambiguous. The parser still applies it (to the match its look-back rule selects) but the chosen run may not be the intended one. The label spans the directive. Fix: reword so the quoted target is unique before the directive. (Multi-target brackets like [#「A」「B」に傍点] name distinct runs and are never flagged.)
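
The occurrence count behind this warning (and behind the not-found case described for 縦中横 above) can be sketched as below. It assumes the look-back window is simply all of the preceding text and counts non-overlapping matches; the parser's actual window and selection rule are not specified here. `TargetMatch` and `classify_target` are hypothetical names.

```rust
#[derive(Debug, PartialEq, Eq)]
enum TargetMatch {
    NotFound,
    Unique,
    Ambiguous(usize), // number of occurrences found
}

/// Classifies a quoted forward-reference target against the text that
/// precedes the directive. Assumption: the whole preceding text is the
/// look-back window, and occurrences are counted without overlap.
fn classify_target(preceding: &str, target: &str) -> TargetMatch {
    if target.is_empty() {
        return TargetMatch::NotFound;
    }
    match preceding.matches(target).count() {
        0 => TargetMatch::NotFound,
        1 => TargetMatch::Unique,
        n => TargetMatch::Ambiguous(n),
    }
}

fn main() {
    // 青空青空[#「青空」に傍点]
    println!("{:?}", classify_target("青空青空", "青空"));
    // あ[#「い」は縦中横]
    println!("{:?}", classify_target("あ", "い"));
}
```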

Mismatched bouten container

aozora::lex::mismatched_bouten_container · Error

彼は[#傍点]必ず[#傍線終わり]来る   (傍点 opened, 傍線 closed)

A 傍点 / 傍線 range form ([#傍点] … [#傍点終わり]) was opened with one family — 点 (dots) or 線 (line) — and closed by the other, e.g. a [#傍点] opener closed by [#傍線終わり]. The two families render differently (dots beside the text vs a line alongside it), so the run’s emphasis is ambiguous. The parser recovers by keying the run to the opener’s variant. A same-family variant difference (白丸傍点 closed by 丸傍点終わり) is tolerated. The label points at the close marker. Fix: match the closer’s family to the opener — [#傍点終わり] for any 点 variant, [#傍線終わり] for any 線 variant.

Bracketed kaeriten no pair

aozora::lex::bracketed_kaeriten_no_pair · Error

怪物[#二]   ([#二] with no [#一] anywhere in the document)

A bracketed kaeriten of rank ≥ 2 ([#二] / [#下] / [#乙]) appears in a document whose matching family base ([#一] / [#上] / [#甲]) is absent entirely, so the return mark has nothing to pair back to. The check is document-wide and base-only by design: real 漢文 return-mark groups span 。 / 、 and line boundaries (and write 二 before 一), and 上下点 may use just 上 and 下 (skipping 中), so any narrower scope would wrongly flag valid kanbun. レ (re-ten) is standalone and never flagged; 送り仮名 ([#(ス)]) is not a ladder mark. Fix: add the missing base mark, or check the mark is a genuine 返り点.
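
A document-wide, base-only check of this kind can be sketched as follows. The mark table is an assumption based on the conventional ladders (一二三四, 上中下, 甲乙丙丁); which marks the library actually recognises is not stated here, and `base_of` and `unpaired_marks` are hypothetical names. The input is the list of bracketed marks in document order, with the [# and ] already stripped.

```rust
/// Maps a rank >= 2 ladder mark to its family base. Marks outside the
/// ladders (including レ and okurigana) return None and are never flagged.
fn base_of(mark: &str) -> Option<&'static str> {
    match mark {
        "二" | "三" | "四" => Some("一"),
        "中" | "下" => Some("上"),
        "乙" | "丙" | "丁" => Some("甲"),
        _ => None,
    }
}

/// Returns every rank >= 2 mark whose family base appears nowhere in the
/// document. Order is deliberately ignored, and intermediate ranks are
/// not required.
fn unpaired_marks<'a>(marks: &[&'a str]) -> Vec<&'a str> {
    marks
        .iter()
        .copied()
        .filter(|m| match base_of(m) {
            Some(base) => !marks.contains(&base),
            None => false,
        })
        .collect()
}

fn main() {
    println!("{:?}", unpaired_marks(&["二"]));       // base 一 absent
    println!("{:?}", unpaired_marks(&["二", "一"])); // base present later in the text
    println!("{:?}", unpaired_marks(&["レ"]));       // standalone mark
}
```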

Kaeriten outside kanbun

aozora::lex::kaeriten_outside_kanbun · Warning

これは[#レ]と書いた。   (a lone kaeriten in kana prose)

A kaeriten ([#二] / [#レ] / …) is the only one in the entire document and its surroundings read as ordinary kana prose, so it is most likely a stray [#…] annotation rather than a genuine 返り点. The lookahead heuristic is deliberately conservative — a document carrying a cluster of kaeriten (real 漢文) is never flagged. The label points at the lone mark. (Only the bracketed [#…] form is recognised; a bare reading-mark glyph in running text is left as plain text.) Fix: confirm the mark is intended; remove it if it is not a reading mark.

Break in single line container

aozora::lex::break_in_single_line_container · Warning

[#地付き]本文[#改ページ]   (single-line directive shares its line with a break)

A single-line layout directive ([#地付き], [#N字下げ]) or a warichu range ([#割り注] … [#割り注終わり]) governs only the rest of its line. A page / section break sharing that line — or, for warichu, falling between the open and close — drops the container: the break starts a new block, so the directive’s run is cut short. Paired block forms ([#ここから…] … [#ここで…終わり]) persist across breaks and are not flagged (print typography keeps the layout across pages). The label points at the break. Fix: move the break off the line, or use the paired block form.

Internal

aozora::internal · Error · source = Internal

Pipeline-internal sanity checks. A correct build never emits these — their appearance means a bug in aozora itself, not a problem with your input. The specific check is identified by an InternalCheckCode:

Check code                                 Fires when
aozora::lex::residual_annotation_marker    an [# digraph survived classification into the normalized text (a missing recogniser)
aozora::lex::unregistered_sentinel         a PUA sentinel sits at a normalized position not recorded in the placeholder registry
aozora::lex::registry_out_of_order         a placeholder-registry vector is not strictly ordered by position
aozora::lex::registry_position_mismatch    a registry entry references a position whose character is not the expected sentinel

aozora check exits 3 when one fires. Please report it with the source that triggered it.

Planned diagnostics

None outstanding. Every authoring-error diagnostic in the catalogue — including the four model-dependent ones (mismatched_bouten_container, bracketed_kaeriten_no_pair, kaeriten_outside_kanbun, break_in_single_line_container) — is now emitted; see the Source diagnostics above. New 記法 work adds new codes here as it lands.

Why a stable string code, not just a message?

  1. Test stability. The corpus sweep and conformance gate count diagnostics by code; a test like “this corpus emits at most N unresolved_gaiji warnings” survives message-wording tweaks and localisation. A test that greps the message string does not.
  2. Tool integration. Editors / LSPs / CI lints filter by code (e.g. “treat every Error-severity code as fatal, ignore unrecognised_container_directive for legacy files”). String matching on prose is fragile.
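
Both points can be illustrated with a small code-keyed tally and policy. The `Finding` type and helper names below are hypothetical stand-ins rather than the library's API; the sketch only shows that counting and filtering key on the code string and never on message text.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Severity {
    Error,
    Warning,
    Note,
}

// Hypothetical stand-in for a diagnostic reduced to what a policy needs.
struct Finding {
    code: &'static str,
    severity: Severity,
}

/// Tallies findings by their stable code.
fn count_by_code(findings: &[Finding]) -> BTreeMap<&'static str, usize> {
    let mut counts = BTreeMap::new();
    for f in findings {
        *counts.entry(f.code).or_insert(0) += 1;
    }
    counts
}

/// Example policy: every Error-severity code is fatal unless its code is
/// on the ignore list.
fn is_fatal(f: &Finding, ignored_codes: &[&str]) -> bool {
    f.severity == Severity::Error && !ignored_codes.contains(&f.code)
}

fn main() {
    let findings = [
        Finding { code: "aozora::lex::unresolved_gaiji", severity: Severity::Warning },
        Finding { code: "aozora::lex::unresolved_gaiji", severity: Severity::Warning },
        Finding { code: "aozora::lex::unclosed_bracket", severity: Severity::Error },
        Finding { code: "aozora::lex::accent_decomposition_applied", severity: Severity::Note },
    ];
    for (code, n) in count_by_code(&findings) {
        println!("{code}: {n}");
    }
    let fatal = findings.iter().filter(|f| is_fatal(f, &[])).count();
    println!("fatal findings: {fatal}");
}
```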

See also