Diagnostics catalogue
aozora is non-fatal by design: the parser always produces a tree, even from malformed input, and reports what it noticed through structured diagnostics that callers choose how to treat. This page is the catalogue.
Each Diagnostic carries:
- a stable code: a dotted string such as aozora::lex::unclosed_bracket. The string is pinned by a test and never changes within a major release; new diagnostics add new codes.
- a severity: Error / Warning / Note.
- a source axis: Source (your input tripped it) or Internal (a library-bug sanity check; see Internal).
- a span: a byte range in the sanitized source (the Phase 0 output: BOM stripped, CRLF→LF, 〔…〕 accents decomposed). For input with none of those, the sanitized bytes equal the original bytes.
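The four parts listed above can be pictured as a small record. The sketch below is a hypothetical stand-in, not the library's actual type: the field layout, the DiagnosticSource name, and the spanned_text helper are assumptions made for illustration. It only shows how a byte span indexes into the sanitized text.

```rust
use std::ops::Range;

/// Severity levels named in the catalogue.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Severity {
    Error,
    Warning,
    Note,
}

/// Source axis: the input tripped it, or an internal sanity check did.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DiagnosticSource {
    Source,
    Internal,
}

/// Hypothetical stand-in for a diagnostic (field layout is assumed).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Diagnostic {
    code: &'static str,
    severity: Severity,
    source: DiagnosticSource,
    span: Range<usize>,
}

impl Diagnostic {
    pub fn new(
        code: &'static str,
        severity: Severity,
        source: DiagnosticSource,
        span: Range<usize>,
    ) -> Self {
        Diagnostic { code, severity, source, span }
    }

    pub fn code(&self) -> &'static str {
        self.code
    }

    pub fn severity(&self) -> Severity {
        self.severity
    }

    pub fn source(&self) -> DiagnosticSource {
        self.source
    }

    pub fn span(&self) -> Range<usize> {
        self.span.clone()
    }
}

/// Slice the sanitized text a diagnostic points at.
/// Returns None if the span is out of range or splits a UTF-8 character.
pub fn spanned_text<'a>(sanitized: &'a str, d: &Diagnostic) -> Option<&'a str> {
    sanitized.get(d.span())
}
```

Because spans are byte ranges in the sanitized text, slicing the original input with them is only safe when sanitization changed nothing.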
Rendering them
The aozora check CLI renders diagnostics three ways, chosen with --diagnostic-format:
- human (the default on a terminal): a graphical miette report showing the source line, a caret under the offending span, the label, the help text, and a link back to this page.
- json (the default when stderr is piped): the aozora::wire diagnostics envelope, byte-identical to what the WASM / FFI / Python / Extism front doors emit. This is the machine / agent path.
- short: one grep-able line per diagnostic, in the form path:offset: severity[code]: message.
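As an illustration of the short layout, here is a minimal sketch that assembles one such line. The format_short function is hypothetical, and two details are assumptions rather than documented behaviour: that severities print in lower case, and that offset is the span's starting byte offset.

```rust
/// Severity levels from the catalogue.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Severity {
    Error,
    Warning,
    Note,
}

impl Severity {
    /// Lower-case label. The exact casing the CLI prints is an assumption.
    fn label(self) -> &'static str {
        match self {
            Severity::Error => "error",
            Severity::Warning => "warning",
            Severity::Note => "note",
        }
    }
}

/// Build one line shaped like `path:offset: severity[code]: message`.
/// `offset` is assumed to be the span's starting byte offset.
pub fn format_short(
    path: &str,
    offset: usize,
    severity: Severity,
    code: &str,
    message: &str,
) -> String {
    format!("{}:{}: {}[{}]: {}", path, offset, severity.label(), code, message)
}
```

A line in this shape can be filtered with ordinary text tools by matching on the bracketed code.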
Exit codes: 0 (diagnostics printed but tolerated), 1 (--strict with
at least one diagnostic), 2 (CLI usage error), 3 (an Internal
diagnostic fired — a library bug). See the CLI reference.
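The exit-code contract can be sketched as a small decision function. This is an illustration, not the CLI's source: exit_code is a hypothetical name, and the precedence of an Internal diagnostic over --strict is an assumption, since the text does not say which wins when both apply. Code 2 (usage error) is decided before any parsing and is left out.

```rust
/// Source axis of a diagnostic.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DiagnosticSource {
    Source,
    Internal,
}

/// Map the diagnostics of one run to exit code 0, 1 or 3.
/// Assumption: an Internal diagnostic takes precedence over --strict.
pub fn exit_code(sources: &[DiagnosticSource], strict: bool) -> i32 {
    if sources.iter().any(|s| *s == DiagnosticSource::Internal) {
        3 // a library bug fired
    } else if strict && !sources.is_empty() {
        1 // --strict with at least one diagnostic
    } else {
        0 // diagnostics printed but tolerated, or none at all
    }
}
```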
Library consumers get tree.diagnostics() -> &[Diagnostic] and reach the
parts through code(), severity(), source(), and span(). All
bindings carry the same structured data.
Source diagnostics
These trace back to your input. The parser emits exactly these — the authoring-error catalogue is complete (no diagnostic is specified-but-unimplemented).
Source contains PUA
aozora::lex::source_contains_pua · Warning
…… (a literal U+E001..=U+E004 codepoint in the source)
The source contains a codepoint in U+E001..=U+E004, which the lexer
reserves as inline / block placeholder sentinels. A source-side
occurrence collides with the lexer’s own markers and would confuse the
placeholder registry. Fix: remove the private-use codepoint from the
source (these are not normal text characters and effectively never occur
in real 青空文庫 files).
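A pre-flight scan for the reserved range is easy to write outside the parser. The sketch below uses only the standard library; find_reserved_pua is a hypothetical helper name, not part of aozora.

```rust
/// Byte offsets of every codepoint in U+E001..=U+E004 within `src`.
pub fn find_reserved_pua(src: &str) -> Vec<usize> {
    src.char_indices()
        .filter(|(_, c)| ('\u{E001}'..='\u{E004}').contains(c))
        .map(|(offset, _)| offset)
        .collect()
}
```

An empty result means the input cannot trigger this warning.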
Unclosed bracket
aozora::lex::unclosed_bracket · Error
[#ここから2字下げ (no matching [#ここで字下げ終わり])
An Aozora open delimiter (ruby |, annotation [#, quote, …) reached
end-of-input with no matching close on the pairing stack. The label
points at the opener. The region degrades to plain text — no pair link
is emitted. Fix: add the missing close delimiter, or remove the
dangling opener.
Unmatched close
aozora::lex::unmatched_close · Error
青空]》 (a close with no matching open on the stack)
A close delimiter was seen with an empty pairing stack, or against a
stack top of a different PairKind. The label points at the stray close.
Fix: add the matching open delimiter, or remove the stray close.
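The two pairing diagnostics above both fall out of one stack walk. The following is a deliberately simplified sketch, not the lexer's implementation: it treats single characters as delimiters (the real annotation opener is the two-character [#), the PairIssue type and check_pairs function are hypothetical, and leaving the stack untouched on a mismatched close is an assumption about recovery.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PairKind {
    Ruby,
    Quote,
    Bracket,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum PairIssue {
    /// The opener at this byte offset reached end-of-input unclosed.
    UnclosedBracket(usize),
    /// The close at this byte offset did not match the stack top.
    UnmatchedClose(usize),
}

/// Classify a character as (kind, is_open), or None for ordinary text.
fn classify(c: char) -> Option<(PairKind, bool)> {
    match c {
        '《' => Some((PairKind::Ruby, true)),
        '》' => Some((PairKind::Ruby, false)),
        '「' => Some((PairKind::Quote, true)),
        '」' => Some((PairKind::Quote, false)),
        '[' => Some((PairKind::Bracket, true)),
        ']' => Some((PairKind::Bracket, false)),
        _ => None,
    }
}

/// Walk the input with a pairing stack and collect both kinds of issue.
pub fn check_pairs(src: &str) -> Vec<PairIssue> {
    let mut stack: Vec<(PairKind, usize)> = Vec::new();
    let mut issues = Vec::new();
    for (offset, c) in src.char_indices() {
        match classify(c) {
            Some((kind, true)) => stack.push((kind, offset)),
            Some((kind, false)) => {
                let matches_top = stack.last().map_or(false, |(top, _)| *top == kind);
                if matches_top {
                    stack.pop();
                } else {
                    // Empty stack, or a stack top of a different PairKind.
                    issues.push(PairIssue::UnmatchedClose(offset));
                }
            }
            None => {}
        }
    }
    // Whatever is still open at end-of-input is unclosed.
    issues.extend(
        stack
            .into_iter()
            .map(|(_, offset)| PairIssue::UnclosedBracket(offset)),
    );
    issues
}
```

Note how the reported offset is the opener for an unclosed pair and the stray close for an unmatched one, matching where each label points.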
Accent decomposition applied
aozora::lex::accent_decomposition_applied · Note
〔cafe'〕 (decomposed to 〔café〕)
A 〔…〕 accent digraph was rewritten to its Unicode-combined form during
Phase 0 sanitize (cafe' → café, fune + backtick → funè, …). This is
intended behaviour, not an error — it is surfaced as a Note so an
editor can show what changed. One note fires per 〔…〕 span that actually
contained a digraph; a 〔…〕 with no accent digraph is silent. The span is
in sanitized (post-decomposition) coordinates. The transform is loss-free:
the serializer reconstructs the original 〔…〕 source form. See
ADR-0003.
No action required.
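To make the rewrite concrete, here is a minimal sketch of digraph decomposition over the text inside one 〔…〕 span. It is illustrative only: the table holds just the two digraphs shown above (e plus apostrophe, e plus backtick), whereas the library's table is larger, and decompose_accents is a hypothetical name. The returned flag mirrors the rule that a Note fires only when a digraph was actually replaced.

```rust
/// Tiny illustrative digraph table (the real table covers many more pairs).
fn compose(base: char, mark: char) -> Option<char> {
    match (base, mark) {
        ('e', '\'') => Some('\u{E9}'), // é
        ('e', '`') => Some('\u{E8}'),  // è
        _ => None,
    }
}

/// Rewrite digraphs in the text found inside one 〔…〕 span.
/// Returns the rewritten text and whether anything changed.
pub fn decompose_accents(inner: &str) -> (String, bool) {
    let mut out = String::with_capacity(inner.len());
    let mut changed = false;
    let mut chars = inner.chars().peekable();
    while let Some(c) = chars.next() {
        let composed = chars.peek().and_then(|&mark| compose(c, mark));
        match composed {
            Some(accented) => {
                out.push(accented);
                chars.next(); // consume the accent mark
                changed = true;
            }
            None => out.push(c),
        }
    }
    (out, changed)
}
```

The rewritten text is shorter or longer in bytes than the input, which is why the span is reported in post-decomposition coordinates.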
Unresolved gaiji
aozora::lex::unresolved_gaiji · Warning
※[#「架空の外字」、第3水準99-99-99] (men-ku-ten out of range)
A 外字 (gaiji) reference — ※[#…] — resolved to neither a Unicode
scalar nor a JIS X 0213 cell: no 第N水準P-R-C men-ku-ten or U+XXXX
reference matched, and the description is not itself a single resolvable
character. The construct still parses; the renderer falls back to the
description text (<span class="aozora-gaiji" data-description="…">…</span>)
rather than the intended glyph. The label points at the ※[#…] reference.
Fix: correct the men-ku-ten / U+XXXX reference, or accept the
description-only rendering. (Fires for top-level references; gaiji nested
inside a ruby / bouten reading is not yet flagged.)
Mismatched container close
aozora::lex::mismatched_container_close · Error
[#ここから2字下げ]…[#ここで地付き終わり] (indent opened, align-end closed)
A paired container opened with one family (indent / warichu /
keigakomi / align-end) was closed by a closer of a different family.
The comparison is by family, so closing a 2字下げ opener with a plain
字下げ終わり (both indent, differing only in amount) is not flagged —
only a genuine family mismatch is. The label points at the close marker.
The parser recovers by auto-closing the opener at the closer’s position
(the container pair is still emitted, keyed by the open family). Fix:
match the closer to the opener — ここから字下げ ↔ ここで字下げ終わり,
ここから地付き ↔ ここで地付き終わり, etc.
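The by-family comparison can be sketched as follows. This is an assumption-laden illustration: the Family enum, the keyword matching in family_of, and the spellings used for warichu (割り注) and keigakomi (罫囲み) are chosen for the example and are not taken from the parser.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Family {
    Indent,
    Warichu,
    Keigakomi,
    AlignEnd,
}

/// Classify a directive body by keyword (keyword spellings are assumptions).
pub fn family_of(directive: &str) -> Option<Family> {
    if directive.contains("字下げ") {
        Some(Family::Indent)
    } else if directive.contains("割り注") {
        Some(Family::Warichu)
    } else if directive.contains("罫囲み") {
        Some(Family::Keigakomi)
    } else if directive.contains("地付き") {
        Some(Family::AlignEnd)
    } else {
        None
    }
}

/// True only when both sides are recognised and their families differ.
/// The indent amount is ignored, so 2字下げ closed by 字下げ終わり is fine.
pub fn is_mismatched_close(opener: &str, closer: &str) -> bool {
    match (family_of(opener), family_of(closer)) {
        (Some(open), Some(close)) => open != close,
        _ => false,
    }
}
```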
Empty ruby reading
aozora::lex::empty_ruby_reading · Error
|青梅《》 (base given, reading empty)
An explicit-base ruby supplied a base (a | precedes the 《) but an
empty 《》 reading. Because the | marks the base unambiguously, this is
a genuine authoring slip rather than a literal 《》 run — so a bare
青梅《》 with no | is not flagged (the parser can’t be sure a base
was intended and treats it as text). The construct degrades to plain text.
The label spans the whole |青梅《》. Fix: supply a reading, or drop
the |…《》 markers to keep the base as plain text.
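The distinction between an explicit-base and a bare empty reading can be shown with a small scanner. This is a rough sketch, not the parser's logic: empty_ruby_spans is a hypothetical helper, it accepts both the ASCII bar shown on this page and the full-width bar, and requiring the base to stay on one line with no other ruby bracket in it is an assumption.

```rust
/// Byte spans of every explicit-base ruby with an empty reading,
/// covering the bar, the base, and the empty 《》.
pub fn empty_ruby_spans(src: &str) -> Vec<(usize, usize)> {
    const EMPTY: &str = "《》";
    let mut spans = Vec::new();
    let mut from = 0;
    while let Some(rel) = src[from..].find(EMPTY) {
        let open = from + rel;
        let end = open + EMPTY.len();
        let before = &src[from..open];
        // Nearest base marker before the empty reading.
        if let Some(bar) = before.rfind(|c: char| c == '|' || c == '｜') {
            let bar_len = before[bar..].chars().next().map_or(0, char::len_utf8);
            let base = &before[bar + bar_len..];
            // Assumed: a base is non-empty, single-line, and bracket-free.
            let clean = base.chars().all(|c| c != '\n' && c != '《' && c != '》');
            if !base.is_empty() && clean {
                spans.push((from + bar, end));
            }
        }
        from = end;
    }
    spans
}
```

A bare 青梅《》 produces no span here, in line with the rule that only the explicit-base form is flagged.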
Nested ruby
aozora::lex::nested_ruby · Error
|漢《か《ん》じ》 (the reading body opens another 《…》)
A ruby reading body itself opened another ruby. Ruby does not nest; the
label points at the inner 《. The outer ruby is still parsed
best-effort. Note that an adjacent 《《…》》 is not nested ruby — the
tokenizer reads 《《 / 》》 as double-bracket bouten, a
separate construct — so this fires only when the inner 《…》 closes
before the outer (text between the two closes, as in the catalogue shape
|…《…《…》…》). Fix: close the outer reading before the inner 《, or
remove the inner 《…》.
Unrecognised container directive
aozora::lex::unrecognised_container_directive · Warning
[#ここからナントカ] (no such container kind)
A [#ここから…] directive looked like a paired-container opener but
named no known container kind (字下げ, 地付き, 地から N 字上げ). The
bracket is kept as a plain Annotation{Unknown} (so output is preserved
and the “no bare [#” guarantee holds) but is not treated as a
container — any matching [#ここで…終わり] will not pair with it. The
label spans the directive. Fix: use a recognised opener, e.g.
[#ここから2字下げ] or [#ここから地付き].
TCY target not found
aozora::lex::tcy_target_not_found · Warning
あ[#「い」は縦中横] (no 「い」 earlier in the line)
A 縦中横 forward reference ([#「X」は縦中横]) named a target that does
not appear anywhere in the preceding text, so it has no run to rotate. The
directive degrades to an Annotation{Unknown}. The label spans the
directive. Fix: check the spelling of the quoted target, or place the
[#「X」は縦中横] after the run it should style.
Bouten target ambiguous
aozora::lex::bouten_target_ambiguous · Warning
青空青空[#「青空」に傍点] (「青空」 occurs twice before the directive)
A forward-reference bouten ([#「X」に傍点]) named a target that occurs
more than once in the preceding look-back window, so which run it
emphasises is ambiguous. The parser still applies it (to the match its
look-back rule selects) but the chosen run may not be the intended one.
The label spans the directive. Fix: reword so the quoted target is
unique before the directive. (Multi-target brackets like [#「A」「B」に傍点]
name distinct runs and are never flagged.)
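Both forward-reference checks (target not found for 縦中横, ambiguous target for 傍点) reduce to counting occurrences of the quoted target in the text before the directive. The sketch below is a simplification: lookup_target is a hypothetical function, the caller decides what the look-back window is, and occurrences are counted without overlap.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TargetLookup {
    NotFound,
    Unique,
    Ambiguous,
}

/// Count non-overlapping occurrences of `target` in the look-back text.
pub fn lookup_target(preceding: &str, target: &str) -> TargetLookup {
    if target.is_empty() {
        // An empty pattern matches everywhere, so treat it as no target.
        return TargetLookup::NotFound;
    }
    match preceding.matches(target).count() {
        0 => TargetLookup::NotFound,
        1 => TargetLookup::Unique,
        _ => TargetLookup::Ambiguous,
    }
}
```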
Mismatched bouten container
aozora::lex::mismatched_bouten_container · Error
彼は[#傍点]必ず[#傍線終わり]来る (傍点 opened, 傍線 closed)
A 傍点 / 傍線 range form ([#傍点] … [#傍点終わり]) was opened with one
family — 点 (dots) or 線 (line) — and closed by the other, e.g. a [#傍点]
opener closed by [#傍線終わり]. The two families render differently (dots
beside the text vs a line alongside it), so the run’s emphasis is
ambiguous. The parser recovers by keying the run to the opener’s variant.
A same-family variant difference (白丸傍点 closed by 丸傍点終わり) is
tolerated. The label points at the close marker. Fix: match the closer’s
family to the opener — [#傍点終わり] for any 点 variant, [#傍線終わり]
for any 線 variant.
Bracketed kaeriten no pair
aozora::lex::bracketed_kaeriten_no_pair · Error
怪物[#二] ([#二] with no [#一] anywhere in the document)
A bracketed kaeriten of rank ≥ 2 ([#二] / [#下] / [#乙]) appears in a
document whose matching family base — [#一] / [#上] / [#甲] — is
absent entirely, so the return mark has nothing to pair back to. The check
is document-wide and base-only by design: real 漢文 return-mark groups span
、 / 。 and line boundaries (and write 二 before 一), and 上下点 may
use just 上 … 下 (skipping 中), so any narrower scope would wrongly
flag valid kanbun. レ (re-ten) is standalone and never flagged; 送り仮名
([#(ス)]) is not a ladder mark. Fix: add the missing base mark, or
check the mark is a genuine 返り点.
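The document-wide, base-only rule is simple enough to sketch directly. The code below is an illustration under assumptions: unpaired_kaeriten is a hypothetical helper, the list of higher-rank marks per family is a guess at a reasonable set, compound marks are ignored, and each offending mark is reported once rather than per occurrence.

```rust
/// (base mark, higher-rank marks) per bracketed kaeriten family.
/// The exact set of higher-rank marks is an assumption for this sketch.
const FAMILIES: [(&str, &[&str]); 3] = [
    ("一", &["二", "三", "四"]),
    ("上", &["中", "下"]),
    ("甲", &["乙", "丙", "丁"]),
];

/// Higher-rank marks whose family base is absent from the whole document.
pub fn unpaired_kaeriten(doc: &str) -> Vec<String> {
    let mut flagged = Vec::new();
    for (base, higher) in FAMILIES.iter() {
        let base_mark = format!("[#{}]", base);
        if doc.contains(base_mark.as_str()) {
            continue; // base present somewhere: the whole family is accepted
        }
        for mark in higher.iter() {
            let bracketed = format!("[#{}]", mark);
            if doc.contains(bracketed.as_str()) {
                flagged.push(bracketed);
            }
        }
    }
    flagged
}
```

Because only the presence of the base is tested, 二 written before 一 and 上 … 下 without 中 both pass, as the rule requires.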
Kaeriten outside kanbun
aozora::lex::kaeriten_outside_kanbun · Warning
これは[#レ]と書いた。 (a lone kaeriten in kana prose)
A kaeriten ([#二] / [#レ] / …) is the only one in the entire document
and its surroundings read as ordinary kana prose, so it is most likely a
stray [#…] annotation rather than a genuine 返り点. The lookahead
heuristic is deliberately conservative — a document carrying a cluster of
kaeriten (real 漢文) is never flagged. The label points at the lone mark.
(Only the bracketed [#…] form is recognised; a bare reading-mark glyph in
running text is left as plain text.) Fix: confirm the mark is intended;
remove it if it is not a reading mark.
Break in single line container
aozora::lex::break_in_single_line_container · Warning
[#地付き]本文[#改ページ] (single-line directive shares its line with a break)
A single-line layout directive ([#地付き], [#N字下げ]) or a warichu
range ([#割り注] … [#割り注終わり]) governs only the rest of its line. A
page / section break sharing that line — or, for warichu, falling between
the open and close — drops the container: the break starts a new block, so
the directive’s run is cut short. Paired block forms ([#ここから…] … [#ここで…終わり]) persist across breaks and are not flagged (print
typography keeps the layout across pages). The label points at the break.
Fix: move the break off the line, or use the paired block form.
Internal
aozora::internal · Error · source = Internal
Pipeline-internal sanity checks. A correct build never emits these —
their appearance means a bug in aozora itself, not a problem with your
input. The specific check is identified by an InternalCheckCode:
| Check code | Fires when |
|---|---|
| aozora::lex::residual_annotation_marker | an [# digraph survived classification into the normalized text (a missing recogniser) |
| aozora::lex::unregistered_sentinel | a PUA sentinel sits at a normalized position not recorded in the placeholder registry |
| aozora::lex::registry_out_of_order | a placeholder-registry vector is not strictly ordered by position |
| aozora::lex::registry_position_mismatch | a registry entry references a position whose character is not the expected sentinel |
aozora check exits 3 when one fires. Please
report it with the source that
triggered it.
Planned diagnostics
None outstanding. Every authoring-error diagnostic in the catalogue —
including the four model-dependent ones (mismatched_bouten_container,
bracketed_kaeriten_no_pair, kaeriten_outside_kanbun,
break_in_single_line_container) — is now emitted; see the
Source diagnostics above. New 記法 work adds new
codes here as it lands.
Why a stable string code, not just a message?
- Test stability. The corpus sweep and conformance gate count diagnostics by code; a test like “this corpus emits at most N unresolved_gaiji warnings” survives message-wording tweaks and localisation. A test that greps the message string does not.
- Tool integration. Editors / LSPs / CI lints filter by code (e.g. “treat every Error-severity code as fatal, ignore unrecognised_container_directive for legacy files”). String matching on prose is fragile.
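A code-keyed budget check of the kind described in the first point might look like the sketch below. It works on plain code strings so that it stands alone; count_by_code and within_budget are hypothetical helpers, not part of the library or its test suite.

```rust
use std::collections::HashMap;

/// Tally diagnostics by their stable code string.
pub fn count_by_code<'a>(codes: &[&'a str]) -> HashMap<&'a str, usize> {
    let mut counts = HashMap::new();
    for code in codes {
        *counts.entry(*code).or_insert(0) += 1;
    }
    counts
}

/// True when `code` occurs at most `max` times.
pub fn within_budget(codes: &[&str], code: &str, max: usize) -> bool {
    count_by_code(codes).get(code).copied().unwrap_or(0) <= max
}
```

Nothing here reads the message text, so rewording or translating a message cannot change the result.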
See also
- Architecture → Error recovery: what the parser does after each diagnostic fires (preserved output, dropped tokens, where the bytes go).
- CLI reference: aozora check --diagnostic-format and the exit-code contract.
- Library Quickstart → Diagnostics
- Bindings → Diagnostics as JSON