Welcome
Aozora Flavored Markdown (afm) is a Markdown dialect that layers
Aozora Bunko (青空文庫) typography — ruby, bouten, 縦中横, [#…]
annotations, gaiji, accent decomposition — on top of CommonMark + GFM
for Japanese vertical and horizontal writing.
Like GFM, afm is a strict superset of its base: any pure
CommonMark or GFM document parses identically under afm, and the
Aozora extensions kick in only where the input actually uses them.
The file extension remains .md.
This handbook is both a practical tour and a reference:
- Tour — install the CLI, try the CLI Quickstart, embed the library.
- Reference — walk the parse pipeline, read the architectural decisions, browse the CLI reference and API reference.
Status
100% CommonMark / GFM spec compatibility; all major Aozora Bunko annotations are implemented, with a 96% region-coverage floor.
See the project README for an at-a-glance summary and the CHANGELOG for release history.
Install
afm ships as a single afm binary and as a Rust library. The two
entry points share the same parser core, so a CLI run and a library
embed produce identical HTML for the same input.
From GitHub Releases
Pre-built binaries for the following targets are published to GitHub Releases:
| Target | Archive |
|---|---|
| x86_64-unknown-linux-gnu | .tar.gz |
| x86_64-unknown-linux-musl | .tar.gz |
| aarch64-apple-darwin | .tar.gz |
| x86_64-apple-darwin | .tar.gz |
| x86_64-pc-windows-msvc | .zip |
Each archive bundles the afm binary alongside LICENSE-MIT,
LICENSE-APACHE, NOTICE, and README.md. A release-wide
SHA256SUMS file is attached to the release for bulk verification:
```sh
# Replace vX.Y.Z with the release tag you want from the Releases page.
curl -L https://github.com/P4suta/afm/releases/download/vX.Y.Z/SHA256SUMS -o SHA256SUMS
sha256sum --check --ignore-missing SHA256SUMS
tar xzf afm-vX.Y.Z-x86_64-unknown-linux-gnu.tar.gz
afm-vX.Y.Z-x86_64-unknown-linux-gnu/afm --version
```
From source
```sh
git clone https://github.com/P4suta/afm
cd afm
just build-release
```
This produces target/release/afm. The build runs inside the dev
Docker image per ADR-0002; the host does not need a
Rust toolchain installed.
As a Rust library
afm is not on crates.io yet; depend on it directly by git URL:
```toml
[dependencies]
afm-markdown = { git = "https://github.com/P4suta/afm" }
```
See Library Usage for a minimal parse + render example.
CLI Quickstart
```text
afm [--encoding utf8|sjis] [--strict] <subcommand>
```
Subcommands
| Subcommand | Purpose |
|---|---|
| afm render <file> | Parse and emit HTML on stdout. |
| afm check <file> | Parse without rendering; exit non-zero on failure. |
Examples
Render a UTF-8 file:
```sh
afm render input.md > out.html
```
Render a Shift_JIS Aozora Bunko text directly from its published form:
```sh
afm render --encoding sjis tsumito_batsu.txt > tsumito_batsu.html
```
Validate a document under strict mode (treat every lexer diagnostic as an error — useful in CI pre-flight):
```sh
afm check --strict input.md
```
See CLI Reference for the full flag listing and exit-code semantics.
Library Usage
afm ships as a Rust library (afm-markdown) alongside the CLI. The
binary is a thin wrapper over the same public API every embedder
calls — there is no parallel “library-only” path that the CLI
bypasses, so a CLI run and a library embed produce byte-identical
HTML for the same input.
Add the dependency
afm is not on crates.io yet; depend on it directly by git URL:
```toml
[dependencies]
afm-markdown = { git = "https://github.com/P4suta/afm" }
```
The aozora-encoding sibling crate provides Shift_JIS decoding when
you need it; pin it from the same repo set:
```toml
[dependencies]
aozora-encoding = { git = "https://github.com/P4suta/aozora" }
```
Render to HTML — the simple path
```rust
use afm_markdown::{Options, render_to_string};

fn main() {
    let rendered = render_to_string(
        "彼は|青梅《おうめ》に行った。",
        &Options::afm_default(),
    );
    println!("{}", rendered.html);
    for diag in &rendered.diagnostics {
        eprintln!("warning: {diag}");
    }
}
```
Options::afm_default() enables the GFM extensions afm uses on top
of CommonMark (strikethrough, tables, autolinks, task lists),
hardbreaks (so each Aozora source newline becomes a <br> — verse /
dialogue boundaries are load-bearing in 青空文庫 source), and the
Aozora pre-pass.
For pure CommonMark or pure GFM behaviour (no Aozora recognition),
use Options::commonmark_only() or Options::gfm_only() — these are
also what the CommonMark 0.31.2 and GFM 0.29 spec runners exercise.
Render to a structured IR
render_to_ir returns the same HTML alongside a typed IrDocument
that mirrors the TypeScript IRDocument consumed by afm-obsidian:
```rust
use afm_markdown::ir::{IrBlock, IrInline};
use afm_markdown::{Options, render_to_ir};

fn main() {
    let rendered = render_to_ir(
        "# 第一章\n\n|青梅《おうめ》",
        &Options::afm_default(),
    );
    for block in &rendered.ir.blocks {
        match block {
            IrBlock::Heading { level, .. } => println!("h{level}"),
            IrBlock::Paragraph { children, .. } => {
                let ruby_count = children
                    .iter()
                    .filter(|c| matches!(c, IrInline::Ruby { .. }))
                    .count();
                println!("paragraph with {ruby_count} ruby span(s)");
            }
            other => println!("{other:?}"),
        }
    }
}
```
The IR carries every Aozora-side construct (Ruby, DoubleRuby,
Bouten, Tcy, Gaiji, Annotation, Container, PageBreak,
SectionBreak) plus the markdown-side block / inline shapes — so
JS-side renderers in afm-obsidian / afm-logseq can pick their own
output target (DOM fragment, CodeMirror RangeSet, semantic tokens)
without re-parsing the HTML.
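To make the "pick their own output target" point concrete, here is a hedged, std-only sketch of what walking a typed inline tree buys a consumer. The `Inline` enum and its field names are invented for illustration and are not the real `IrInline` definition — but the shape of the work (pattern-match the data tree, never scrape HTML) is the same:

```rust
// Hypothetical miniature of a typed inline IR (assumed names, not the
// real afm_markdown::ir types): a consumer projects its own output —
// here, plain text with ruby readings dropped — by matching variants.
#[derive(Debug)]
enum Inline {
    Text(String),
    Ruby { base: String, reading: String },
    Tcy(String), // 縦中横 run
}

fn to_plain_text(inlines: &[Inline]) -> String {
    inlines
        .iter()
        .map(|inline| match inline {
            Inline::Text(t) => t.clone(),
            Inline::Ruby { base, .. } => base.clone(), // keep base, drop reading
            Inline::Tcy(t) => t.clone(),
        })
        .collect()
}

fn main() {
    let para = vec![
        Inline::Text("彼は".into()),
        Inline::Ruby { base: "青梅".into(), reading: "おうめ".into() },
        Inline::Text("に行った。".into()),
    ];
    assert_eq!(to_plain_text(&para), "彼は青梅に行った。");
}
```

A DOM-fragment or semantic-token renderer would match the same variants and emit its own target instead of a `String`.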
Render block-by-block (streaming)
For long documents where you want to checkpoint between blocks
(afm-obsidian uses this for AbortSignal cancellation in chunked
post-processors), use render_blocks_to_ir:
```rust
use afm_markdown::{Options, render_blocks_to_ir};

fn main() {
    let (blocks, diagnostics) = render_blocks_to_ir(
        "first paragraph\n\n|second《せかんど》paragraph",
        &Options::afm_default(),
    );
    for block in blocks {
        println!("{} ir nodes at line {}", block.ir.len(), block.source_line);
        println!("{}", block.html);
    }
    assert!(diagnostics.is_empty());
}
```
The shared StreamingIrBuilder threads the sentinel cursor across
calls, so per-block IR projection stays in lockstep with the
whole-document path. A block may carry zero IR entries (e.g.
container-open paragraphs that drain at the next call boundary) or
more than one (a container that finally closes).
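The index-threading behaviour can be illustrated with a std-only toy. This is an assumed shape, not the real `StreamingIrBuilder` API — the point is only that one persistent `cursor_idx` over one flattened slice guarantees no entry is consumed twice or skipped, even when a call drains zero or several entries:

```rust
// Toy sketch of cursor-index threading across per-block walks.
// `entries` stands in for the sorted NodeRef slice; `walk_block`'s
// `count` stands in for however many entries the real walker drains.
struct StreamingBuilder<'a> {
    entries: &'a [&'a str],
    cursor_idx: usize, // persists across walk_block calls
}

impl<'a> StreamingBuilder<'a> {
    fn new(entries: &'a [&'a str]) -> Self {
        StreamingBuilder { entries, cursor_idx: 0 }
    }

    /// Consume the entries belonging to the current block; a block may
    /// yield zero entries or more than one.
    fn walk_block(&mut self, count: usize) -> &'a [&'a str] {
        let start = self.cursor_idx;
        let end = (start + count).min(self.entries.len());
        self.cursor_idx = end;
        &self.entries[start..end]
    }
}

fn main() {
    let entries = ["Ruby", "ContainerOpen", "ContainerClose"];
    let mut builder = StreamingBuilder::new(&entries);
    assert_eq!(builder.walk_block(1), ["Ruby"]);
    assert!(builder.walk_block(0).is_empty()); // container not drained yet
    assert_eq!(builder.walk_block(2), ["ContainerOpen", "ContainerClose"]);
}
```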
Reading Shift_JIS input
Aozora Bunko ships its text files in Shift_JIS. aozora-encoding
exposes a transparent decoder so your pipeline doesn’t need to know
the encoding ahead of time:
```rust
use afm_markdown::{Options, render_to_string};
use aozora_encoding::decode_sjis;

fn main() -> std::io::Result<()> {
    let bytes = std::fs::read("tsumito_batsu.txt")?;
    let utf8 = decode_sjis(&bytes).expect("decoded");
    let rendered = render_to_string(&utf8, &Options::afm_default());
    std::fs::write("tsumito_batsu.html", rendered.html)?;
    Ok(())
}
```
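One way to see why such a decoder can be transparent: valid UTF-8 is self-identifying, so bytes that already validate can pass straight through, and only the rest need a Shift_JIS decode. This std-only check is an illustration of that property, not the actual `aozora-encoding` logic:

```rust
// Hedged sketch: UTF-8 validation as a cheap "does this need a
// Shift_JIS decode?" probe. The real decoder lives in aozora-encoding.
fn needs_sjis_decode(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_err()
}

fn main() {
    // UTF-8 (including plain ASCII) validates and can pass through.
    assert!(!needs_sjis_decode("青空".as_bytes()));
    // 0x88 0x9F is Shift_JIS for 「亜」 and is not valid UTF-8.
    assert!(needs_sjis_decode(&[0x88, 0x9F]));
}
```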
Round-tripping through the lexer
afm_markdown::serialize is the inverse of the lex pre-pass: it
replays the borrowed-AST registry to reconstruct the original afm
markup byte-for-byte (modulo the lexer’s Phase-0 sanitisation). This
is what the upstream 17 k-work corpus sweep exercises as I3 (round-trip fixed point):
```rust
use afm_markdown::serialize;

fn main() {
    let source = "彼は|青梅《おうめ》に行った。";
    assert_eq!(serialize(source), source);
}
```
More examples
End-to-end snippets live under
crates/afm-markdown/examples/
in the repository:
- render-utf8.rs — UTF-8 source → HTML on stdout.
- render-sjis.rs — Shift_JIS source via aozora-encoding.
- ast-walk.rs — walk the parsed AST and tally AozoraNode variants.
- serialize-round-trip.rs — verify serialize ∘ lex ≡ id on one file.
Run any of them with:
```sh
cargo run --example <name> -p afm-markdown -- <path>
```
Pipeline Overview
afm composes three independent black boxes, each with a single responsibility, glued together by a tiny sentinel-stream cursor that keeps the two output paths (HTML and IR) in lockstep without re-running the parser.
```text
source (UTF-8 or Shift_JIS)
│
▼ aozora_encoding::decode_sjis (Shift_JIS → UTF-8, sibling repo)
│
▼ aozora_pipeline::lex_into_arena (青空文庫記法 borrowed-AST)
│ ├─ Phase 0 sanitize BOM / CRLF→LF / 〔…〕 accent / PUA collision scan
│ ├─ Phase 1 events SIMD trigger-byte tokenise
│ ├─ Phase 2 pair balanced-stack bracket / ruby / quote pairing
│ ├─ Phase 3 classify borrowed AozoraNode<'arena> + ContainerKind
│ └─ Phase 4 normalize PUA sentinels (U+E001..U+E004) + Registry
│
│ ┌──────────────────────────── Output ────────────────────────────┐
│ │ BorrowedLexOutput<'arena> { │
│ │ normalized: &str, │
│ │ registry: Registry<'arena>, // sentinel pos → NodeRef │
│ │ diagnostics: Vec<Diagnostic>, │
│ │ } │
│ └────────────────────────────────────────────────────────────────┘
│
▼ comrak::parse_document (vanilla CommonMark + GFM)
│ sentinels survive as plain UTF-8 — they aren't in the
│ `<>&"'` escape set, so format_html passes them through too.
│
▼ comrak::format_html (HTML with sentinels in body)
│
▼ afm_markdown::post_process::splice_aozora_html
│ · single-pass scan over the emitted HTML
│ · sentinel ↔ aozora_render::render_node output substitution
│ · paragraph-aware: HeadingHint promotes to <h{level}>;
│ sole-block-sentinel paragraphs become standalone blocks
│ · brand boundary: aozora-* CSS classes → afm-* (ADR-0011)
│
▼ HTML
```
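The substitution step at the bottom of the pipeline can be sketched in a few lines of std-only Rust. This is a toy illustration of the sentinel-splice idea, not the real `splice_aozora_html` (which is paragraph-aware and handles block promotion): PUA sentinels that survived comrak's HTML pass are swapped, in source order, for pre-rendered Aozora HTML:

```rust
// Toy sentinel splicer: one scan over the emitted HTML, one queue of
// pre-rendered node HTML consumed in source order.
use std::collections::VecDeque;

const SENTINEL: char = '\u{E001}'; // one of the PUA sentinels U+E001..U+E004

fn splice(html: &str, mut rendered: VecDeque<&str>) -> String {
    let mut out = String::with_capacity(html.len());
    for ch in html.chars() {
        if ch == SENTINEL {
            // Each sentinel consumes the next pre-rendered node.
            out.push_str(rendered.pop_front().unwrap_or(""));
        } else {
            out.push(ch);
        }
    }
    out
}

fn main() {
    let rendered = VecDeque::from(vec!["<ruby>青梅<rt>おうめ</rt></ruby>"]);
    assert_eq!(
        splice("<p>前\u{E001}後</p>", rendered),
        "<p>前<ruby>青梅<rt>おうめ</rt></ruby>後</p>"
    );
}
```

The single-pass shape is what lets the splicer stay O(n) in the emitted HTML regardless of how many annotations the source carries.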
How the splicer stays in lockstep
Both consumers of the lex output — the HTML splicer and the IR
projector — walk the same source-order sequence of registry
entries. The shared abstraction is
SentinelCursor
in crates/afm-markdown/src/sentinels.rs:
```text
┌──────────────── BorrowedLexOutput ────────────────┐
│ normalized = "前\u{E001}後..." │
│ registry = { 3 → Inline(Ruby{…}), … } │
└─────────────────────┬─────────────────────────────┘
│
flatten_registry_in_source_order
│
▼
┌──── &[NodeRef<'src>] (sorted by source pos) ─────┐
│ [Inline(Ruby), BlockOpen(Indent), …] │
└──────────────────────────────────────────────────┘
│ │
│ shared cursor │
┌─────────┴────────┐ ┌─────────┴────────┐
│ HTML splicer │ │ IR builder │
│ (post_process) │ │ (ir.rs) │
│ │ │ │
│ String buffer │ │ Vec<IrBlock> │
│ container_stack: │ │ container_stack: │
│ Vec< │ │ Vec< │
│ ContainerKind│ │ OpenContainer│ <- holds children
│ > │ │ > │
└──────────────────┘ └──────────────────┘
```
Both walkers consume entries linearly via cursor.next(), peek
ahead via cursor.peek(offset), and maintain their own
container-stack so paired open / close markers nest correctly. They
never interfere because each render_to_string / render_to_ir
call materialises its own cursor over its own flattened slice.
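The `next` / `peek(offset)` contract is small enough to sketch in full. This is an assumed shape, not the real `SentinelCursor` in `sentinels.rs` — it shows only the invariant both walkers rely on: linear consumption with non-consuming lookahead over one sorted slice:

```rust
// Minimal cursor over a sorted slice of entries: `next` consumes,
// `peek(offset)` looks ahead without consuming (offset 0 is the entry
// `next` would return).
struct Cursor<'a, T> {
    entries: &'a [T],
    idx: usize,
}

impl<'a, T> Cursor<'a, T> {
    fn new(entries: &'a [T]) -> Self {
        Cursor { entries, idx: 0 }
    }

    fn next(&mut self) -> Option<&'a T> {
        let entry = self.entries.get(self.idx);
        if entry.is_some() {
            self.idx += 1;
        }
        entry
    }

    fn peek(&self, offset: usize) -> Option<&'a T> {
        self.entries.get(self.idx + offset)
    }
}

fn main() {
    let nodes = ["Inline(Ruby)", "BlockOpen(Indent)", "BlockClose(Indent)"];
    let mut cursor = Cursor::new(&nodes);
    assert_eq!(cursor.next(), Some(&"Inline(Ruby)"));
    // Lookahead lets a walker pair an open marker with its close.
    assert_eq!(cursor.peek(1), Some(&"BlockClose(Indent)"));
    assert_eq!(cursor.next(), Some(&"BlockOpen(Indent)"));
    assert_eq!(cursor.next(), Some(&"BlockClose(Indent)"));
    assert_eq!(cursor.next(), None);
}
```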
The streaming path (render_blocks_to_ir) reuses this design: the
public StreamingIrBuilder owns the materialised slice and a
cursor_idx that threads across walk_block calls, so per-block
IR projection stays consistent with the whole-document path.
Dependency direction
afm depends on aozora. The reverse must not hold:
```text
┌────────────────┐ git dependency ┌─────────────────┐
│ afm (this repo)│ ─────────────────────────▶│ aozora (sibling)│
│ afm-markdown │ │ aozora-pipeline│
│ afm-cli │ │ aozora-syntax │
│ afm-wasm │ │ aozora-render │
│ afm-book │ │ aozora-encoding│
└────────────────┘ │ aozora-spec │
└─────────────────┘
```
Anything afm needs from aozora travels through aozora’s public API. Anything aozora needs from afm doesn’t exist — by construction (see ADR-0011 for the brand boundary that codifies this rule, and ADR-0010 for the original split).
What lives in the vendored comrak tree
upstream/comrak/ is a verbatim copy of comrak v0.52.0 with a
0-line diff (ADR-0001 v0.2.4). afm composes comrak as a black
box: parse_document, format_html, and the AST type tree are
imported, the sentinels survive both passes as plain UTF-8, and
post-process owns the entire afm-side surface. Upgrading comrak is
a cargo xtask upstream-sync <tag> away — no patches to re-apply.
See the architectural decisions for the full rationale and the alternatives that led here (ADR-0008 reset the design to zero-parser-hooks; ADR-0010 split parser / renderer into the sibling repo; ADR-0011 nailed down the brand boundary).
Architecture Decision Records
afm records load-bearing design decisions as MADR-formatted ADRs
under
docs/adr/. The
rationale, alternatives considered, and concrete consequences live
there in full — the table below is a map.
| # | Title | Status |
|---|---|---|
| 0001 | Fork comrak in-tree, 0-line diff budget | Accepted (budget collapsed in v0.2.4) |
| 0002 | Docker-only execution for development and CI | Accepted |
| 0003 | Initial afm-parser architecture | Superseded by ADR-0010 (v0.2.0 split) |
| 0004 | Accent decomposition inside 〔…〕 | Moved to sibling aozora repo |
| 0005 | Paired block annotation container hook | Superseded by ADR-0008 |
| 0006 | Lint profile policy and scope discipline | Mirrored in sibling aozora repo |
| 0007 | 17 k-work corpus sweep strategy | Moved to sibling aozora repo |
| 0008 | Zero-parser-hook Aozora-first pipeline | Moved to sibling aozora repo |
| 0009 | Authoring tools live in sibling repositories | Accepted |
| 0010 | Extract aozora-* core into a sibling repo | Accepted (executed v0.2.0, 2026-04-25) |
| 0011 | Brand boundary — aozora-* → afm-* HTML rewrite | Accepted (2026-05-04) |
ADRs marked Moved kept their number on this side as redirect
stubs (e.g. 0008-MOVED.md); the canonical text now lives in the
sibling P4suta/aozora repo.
What’s load-bearing today
If you change anything in these areas, read the cited ADR first:
- upstream/comrak/ — ADR-0001. The 0-line diff means any change here is a fork divergence and needs its own ADR.
- CI / dev environment — ADR-0002. A host toolchain is forbidden; every command runs through just + Docker.
- Adding a new Aozora notation — ADR-0010 + the sibling aozora repo’s CLAUDE.md. The lexer, AST, and per-node renderer all live there now.
- Splicing aozora output into HTML — ADR-0008 (zero parser hooks) + ADR-0011 (brand boundary). afm’s only afm-side rewrite of upstream HTML is the aozora-* → afm-* class pass.
- Authoring tools (formatter / LSP / VS Code extension) — ADR-0009 routes them to the sibling P4suta/aozora-tools repo.
New decisions follow the same MADR format. Scaffold one with:
```sh
cargo xtask new-adr '<title>'
```
Why ADRs live in-repo
ADRs are part of the diff budget for upstream comrak: when a PR
touches upstream/comrak/, the ADR is the contract that says why.
Keeping them next to the code — and reviewable in the same PR —
means the contract evolves with the implementation.
CLI Reference
The up-to-date reference is afm --help / afm <subcommand> --help.
The pages below mirror the same information for offline browsing.
afm
```text
afm [--encoding utf8|sjis] [--strict] <subcommand> [<args>]
```
Global flags
| Flag | Default | Effect |
|---|---|---|
| --encoding <enc> | utf8 | Input encoding: utf8 or sjis. |
| --strict | off | Promote every lexer diagnostic to a hard error. |
| --help | — | Print help and exit. |
| --version | — | Print version and exit. |
Exit codes
| Code | Meaning |
|---|---|
| 0 | Success. |
| 1 | Generic error (I/O, invalid flag, …). |
| 2 | Lexer / parser diagnostic in --strict mode. |
afm render <input>
Parse <input> and write HTML on stdout.
```sh
afm render input.md > out.html
```
<input> may be - to read from stdin.
afm check <input>
Parse <input> without emitting HTML. Useful for CI pre-flight.
```sh
afm check --strict input.md
```
Exits non-zero on parse errors or — under --strict — on any lexer
diagnostic.
API Reference
The Rust API reference is generated by cargo doc and published
alongside this book at:
afm crates (this repo)
These are the crates published from the afm workspace. They
compose the sibling aozora-* parser into a Markdown integration
layer.
- afm_markdown — public entry points: render_to_string, render_to_ir, render_blocks_to_ir, serialize, plus Options (with afm_default / commonmark_only / gfm_only factories) and the IrDocument / IrBlock / IrInline tree under afm_markdown::ir.
- afm_wasm — #[wasm_bindgen] surface used by afm-obsidian and other browser hosts. The IR is serialised through serde-wasm-bindgen and matches the TypeScript IRDocument declared in afm-obsidian/src/ir/types.ts.
- afm-cli — the afm binary. No library API; see CLI Reference for invocation details.
Sibling crates (aozora repo)
Pulled in as a git dependency from
P4suta/aozora. Their published
rustdoc lives on that repo’s GitHub Pages site; afm just embeds the
returned types in its own surface.
- aozora_pipeline — the lex driver. lex_into_arena(src, &arena) produces the BorrowedLexOutput (normalized text + Registry of borrowed AozoraNode payloads + diagnostics) that afm-markdown consumes.
- aozora_syntax — the borrowed AST: AozoraNode, Container, ContainerKind, BoutenKind, BoutenPosition, AozoraHeadingKind, Indent, AlignEnd, SectionKind, plus the arena types (Arena, NonEmpty, Registry, NodeRef).
- aozora_render — per-node HTML writer (render_node::render) and the source-level serializer (serialize::serialize). afm invokes the writer once per sentinel during HTML splicing.
- aozora_encoding — Shift_JIS decoder (decode_sjis) and the gaiji resolution table.
- aozora_spec — Diagnostic, Severity, DiagnosticSource, Span, sentinel codepoint constants.
Local preview
When viewing this handbook locally (e.g. via just book-serve) the
API link above will 404 — run cargo doc --workspace --no-deps and
mount the resulting target/doc/ at /afm/api/ to mirror the
published Pages layout, or visit the published site at
https://p4suta.github.io/afm/api/ directly.