Welcome

Aozora Flavored Markdown (afm) is a Markdown dialect that layers Aozora Bunko (青空文庫) typography — ruby, bouten, 縦中横, [#…] annotations, gaiji, accent decomposition — on top of CommonMark + GFM for Japanese vertical and horizontal writing.

Like GFM, afm is a strict superset of its base: any pure CommonMark or GFM document parses identically under afm, and the Aozora extensions kick in only where the input actually uses them. The file extension remains .md.

This handbook is both a practical tour and a reference.

Status

afm is 100% compatible with the CommonMark and GFM specs and implements all major Aozora Bunko annotations, with a 96% regions coverage floor.

See the project README for an at-a-glance summary and the CHANGELOG for release history.

Install

afm ships as a single afm binary and as a Rust library. The two entry points share the same parser core, so a CLI run and a library embed produce identical HTML for the same input.

From GitHub Releases

Pre-built binaries for the following targets are published to GitHub Releases:

Target                      Archive
x86_64-unknown-linux-gnu    .tar.gz
x86_64-unknown-linux-musl   .tar.gz
aarch64-apple-darwin        .tar.gz
x86_64-apple-darwin         .tar.gz
x86_64-pc-windows-msvc      .zip

Each archive bundles the afm binary alongside LICENSE-MIT, LICENSE-APACHE, NOTICE, and README.md. A release-wide SHA256SUMS file is attached to the release for bulk verification:

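# Replace vX.Y.Z with the release tag you want from the Releases page.
# Fetch the archive first (URL assumed to follow the same release path as SHA256SUMS).
curl -LO https://github.com/P4suta/afm/releases/download/vX.Y.Z/afm-vX.Y.Z-x86_64-unknown-linux-gnu.tar.gz
curl -L https://github.com/P4suta/afm/releases/download/vX.Y.Z/SHA256SUMS -o SHA256SUMS
# --ignore-missing skips the entries for targets you did not download.
sha256sum --check --ignore-missing SHA256SUMS
tar xzf afm-vX.Y.Z-x86_64-unknown-linux-gnu.tar.gz
afm-vX.Y.Z-x86_64-unknown-linux-gnu/afm --version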

From source

git clone https://github.com/P4suta/afm
cd afm
just build-release

This produces target/release/afm. The build runs inside the dev Docker image per ADR-0002; the host does not need a Rust toolchain installed.

As a Rust library

afm is not on crates.io yet; depend on it directly by git URL:

[dependencies]
afm-markdown = { git = "https://github.com/P4suta/afm" }

See Library Usage for a minimal parse + render example.

CLI Quickstart

afm [--encoding utf8|sjis] [--strict] <subcommand>

Subcommands

Subcommand          Purpose
afm render <file>   Parse and emit HTML on stdout.
afm check <file>    Parse without rendering; exit non-zero on failure.

Examples

Render a UTF-8 file:

afm render input.md > out.html

Render a Shift_JIS Aozora Bunko text directly from its published form:

afm render --encoding sjis tsumito_batsu.txt > tsumito_batsu.html

Validate a document under strict mode (treat every lexer diagnostic as an error — useful in CI pre-flight):

afm check --strict input.md

See CLI Reference for the full flag listing and exit-code semantics.

Library Usage

afm ships as a Rust library (afm-markdown) alongside the CLI. The binary is a thin wrapper over the same public API every embedder calls — there is no parallel “library-only” path that the CLI bypasses, so a CLI run and a library embed produce byte-identical HTML for the same input.

Add the dependency

afm is not on crates.io yet; depend on it directly by git URL:

[dependencies]
afm-markdown = { git = "https://github.com/P4suta/afm" }

The aozora-encoding sibling crate provides Shift_JIS decoding when you need it; depend on it by git URL from the sibling aozora repository:

[dependencies]
aozora-encoding = { git = "https://github.com/P4suta/aozora" }

Render to HTML — the simple path

use afm_markdown::{Options, render_to_string};

fn main() {
    let rendered = render_to_string(
        "彼は|青梅《おうめ》に行った。",
        &Options::afm_default(),
    );

    println!("{}", rendered.html);
    for diag in &rendered.diagnostics {
        eprintln!("warning: {diag}");
    }
}

Options::afm_default() enables the GFM extensions afm uses on top of CommonMark (strikethrough, tables, autolinks, task lists), hardbreaks (so each Aozora source newline becomes a <br> — verse / dialogue boundaries are load-bearing in 青空文庫 source), and the Aozora pre-pass.

For pure CommonMark or pure GFM behaviour (no Aozora recognition), use Options::commonmark_only() or Options::gfm_only() — these are also what the CommonMark 0.31.2 and GFM 0.29 spec runners exercise.

Render to a structured IR

render_to_ir returns the same HTML alongside a typed IrDocument that mirrors the TypeScript IRDocument consumed by afm-obsidian:

use afm_markdown::ir::{IrBlock, IrInline};
use afm_markdown::{Options, render_to_ir};

fn main() {
    let rendered = render_to_ir(
        "# 第一章\n\n|青梅《おうめ》",
        &Options::afm_default(),
    );

    for block in &rendered.ir.blocks {
        match block {
            IrBlock::Heading { level, .. } => println!("h{level}"),
            IrBlock::Paragraph { children, .. } => {
                let ruby_count = children
                    .iter()
                    .filter(|c| matches!(c, IrInline::Ruby { .. }))
                    .count();
                println!("paragraph with {ruby_count} ruby span(s)");
            }
            other => println!("{other:?}"),
        }
    }
}

The IR carries every Aozora-side construct (Ruby, DoubleRuby, Bouten, Tcy, Gaiji, Annotation, Container, PageBreak, SectionBreak) plus the markdown-side block / inline shapes — so JS-side renderers in afm-obsidian / afm-logseq can pick their own output target (DOM fragment, CodeMirror RangeSet, semantic tokens) without re-parsing the HTML.

Render block-by-block (streaming)

For long documents where you want to checkpoint between blocks (afm-obsidian uses this for AbortSignal cancellation in chunked post-processors), use render_blocks_to_ir:

use afm_markdown::{Options, render_blocks_to_ir};

fn main() {
    let (blocks, diagnostics) = render_blocks_to_ir(
        "first paragraph\n\n|second《せかんど》paragraph",
        &Options::afm_default(),
    );

    for block in blocks {
        println!("{} ir nodes at line {}", block.ir.len(), block.source_line);
        println!("{}", block.html);
    }
    assert!(diagnostics.is_empty());
}

The shared StreamingIrBuilder threads the sentinel cursor across calls, so per-block IR projection stays in lockstep with the whole-document path. A block may carry zero IR entries (e.g. container-open paragraphs that drain at the next call boundary) or more than one (a container that finally closes).
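
The cursor threading described above can be pictured with a small std-only sketch. Every name in it (ToyStreamingBuilder, walk_block, the string payloads) is a hypothetical stand-in, not the real StreamingIrBuilder API, and it assumes a single sentinel codepoint (U+E001) marks each registry entry:

```rust
/// Hypothetical stand-in for a builder that owns the flattened registry
/// slice and remembers how far earlier calls have consumed it.
struct ToyStreamingBuilder {
    entries: Vec<&'static str>, // registry entries in source order
    cursor_idx: usize,          // persists across walk_block calls
}

impl ToyStreamingBuilder {
    fn new(entries: Vec<&'static str>) -> Self {
        Self { entries, cursor_idx: 0 }
    }

    /// Consume one registry entry per sentinel found in `block_text`.
    /// A block without sentinels yields no entries and leaves the cursor alone.
    fn walk_block(&mut self, block_text: &str) -> Vec<&'static str> {
        let sentinels = block_text.chars().filter(|&c| c == '\u{E001}').count();
        let end = (self.cursor_idx + sentinels).min(self.entries.len());
        let taken = self.entries[self.cursor_idx..end].to_vec();
        self.cursor_idx = end;
        taken
    }
}

fn main() {
    let mut builder = ToyStreamingBuilder::new(vec!["ruby:おうめ", "ruby:せかんど"]);
    for block in ["plain paragraph", "a \u{E001} b", "c \u{E001} d"] {
        println!("{:?}", builder.walk_block(block));
    }
}
```

Because the index lives in the builder rather than in each call, a caller can stop between blocks and resume later without losing its place in the registry.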

Reading Shift_JIS input

Aozora Bunko ships its text files in Shift_JIS. aozora-encoding exposes a transparent decoder so your pipeline doesn’t need to know the encoding ahead of time:

use afm_markdown::{Options, render_to_string};
use aozora_encoding::decode_sjis;

fn main() -> std::io::Result<()> {
    let bytes = std::fs::read("tsumito_batsu.txt")?;
    let utf8 = decode_sjis(&bytes).expect("decoded");

    let rendered = render_to_string(&utf8, &Options::afm_default());
    std::fs::write("tsumito_batsu.html", rendered.html)?;
    Ok(())
}

Round-tripping through the lexer

afm_markdown::serialize is the inverse of the lex pre-pass: it replays the borrowed-AST registry to reconstruct the original afm markup byte-for-byte (modulo the lexer’s Phase-0 sanitisation). This is what the upstream 17 k-work corpus sweep exercises as I3 (round-trip fixed point):

use afm_markdown::serialize;

fn main() {
    let source = "彼は|青梅《おうめ》に行った。";
    assert_eq!(serialize(source), source);
}
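
To make the fixed-point idea concrete without pulling in the crate, here is a toy model of the same shape. It is only a sketch: toy_lex and toy_serialize are hypothetical helpers that handle just the explicit |base《reading》 ruby form, and the real lexer and serializer cover far more notation.

```rust
const SENTINEL: char = '\u{E001}';

/// Toy lex: swap each `|base《reading》` for a sentinel and record the
/// (base, reading) pair in source order.
fn toy_lex(src: &str) -> (String, Vec<(String, String)>) {
    let mut normalized = String::new();
    let mut registry = Vec::new();
    let mut rest = src;
    while let Some(bar) = rest.find('|') {
        let after_bar = &rest[bar + '|'.len_utf8()..];
        // An unmatched bar or bracket leaves the remainder verbatim.
        let Some(open) = after_bar.find('《') else { break };
        let after_open = &after_bar[open + '《'.len_utf8()..];
        let Some(close) = after_open.find('》') else { break };
        normalized.push_str(&rest[..bar]);
        normalized.push(SENTINEL);
        registry.push((after_bar[..open].to_string(), after_open[..close].to_string()));
        rest = &after_open[close + '》'.len_utf8()..];
    }
    normalized.push_str(rest);
    (normalized, registry)
}

/// Toy serialize: replay the registry, restoring the markup at each sentinel.
fn toy_serialize(normalized: &str, registry: &[(String, String)]) -> String {
    let mut out = String::new();
    let mut entries = registry.iter();
    for ch in normalized.chars() {
        match (ch == SENTINEL).then(|| entries.next()).flatten() {
            Some((base, reading)) => {
                out.push('|');
                out.push_str(base);
                out.push('《');
                out.push_str(reading);
                out.push('》');
            }
            None => out.push(ch),
        }
    }
    out
}

fn main() {
    let source = "彼は|青梅《おうめ》に行った。";
    let (normalized, registry) = toy_lex(source);
    assert_eq!(toy_serialize(&normalized, &registry), source);
    println!("{} registry entry(ies) round-tripped", registry.len());
}
```

The toy breaks if the input already contains U+E001, which is presumably why the real pipeline runs a PUA collision scan during Phase 0.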

More examples

End-to-end snippets live under crates/afm-markdown/examples/ in the repository:

  • render-utf8.rs — UTF-8 source → HTML on stdout.
  • render-sjis.rs — Shift_JIS source via aozora-encoding.
  • ast-walk.rs — walk the parsed AST and tally AozoraNode variants.
  • serialize-round-trip.rs — verify serialize ∘ lex ≡ id on one file.

Run any of them with:

cargo run --example <name> -p afm-markdown -- <path>

Pipeline Overview

afm composes three independent black boxes, each with a single responsibility, glued together by a tiny sentinel-stream cursor that keeps the two output paths (HTML and IR) in lockstep without re-running the parser.

source (UTF-8 or Shift_JIS)
   │
   ▼  aozora_encoding::decode_sjis        (Shift_JIS → UTF-8, sibling repo)
   │
   ▼  aozora_pipeline::lex_into_arena     (青空文庫記法 borrowed-AST)
   │   ├─ Phase 0  sanitize     BOM / CRLF→LF / 〔…〕 accent / PUA collision scan
   │   ├─ Phase 1  events       SIMD trigger-byte tokenise
   │   ├─ Phase 2  pair         balanced-stack bracket / ruby / quote pairing
   │   ├─ Phase 3  classify     borrowed AozoraNode<'arena> + ContainerKind
   │   └─ Phase 4  normalize    PUA sentinels (U+E001..U+E004) + Registry
   │
   │   ┌──────────────────────────── Output ────────────────────────────┐
   │   │ BorrowedLexOutput<'arena> {                                    │
   │   │     normalized: &str,                                          │
   │   │     registry: Registry<'arena>,    // sentinel pos → NodeRef   │
   │   │     diagnostics: Vec<Diagnostic>,                              │
   │   │ }                                                              │
   │   └────────────────────────────────────────────────────────────────┘
   │
   ▼  comrak::parse_document               (vanilla CommonMark + GFM)
   │   sentinels survive as plain UTF-8 — they aren't in the
   │   `<>&"'` escape set, so format_html passes them through too.
   │
   ▼  comrak::format_html                  (HTML with sentinels in body)
   │
   ▼  afm_markdown::post_process::splice_aozora_html
   │     · single-pass scan over the emitted HTML
   │     · sentinel ↔ aozora_render::render_node output substitution
   │     · paragraph-aware: HeadingHint promotes to <h{level}>;
   │       sole-block-sentinel paragraphs become standalone blocks
   │     · brand boundary: aozora-* CSS classes → afm-* (ADR-0011)
   │
   ▼  HTML
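
The last stage of the diagram, sentinel substitution over already-emitted HTML, can be sketched in a few lines of std-only Rust. This is an illustration under assumptions, not the real splice_aozora_html: the Ruby struct, render_ruby, and the `<ruby>` markup it emits are hypothetical, and only one inline sentinel (U+E001) is modelled.

```rust
const INLINE_SENTINEL: char = '\u{E001}';

/// Hypothetical registry payload for one ruby annotation.
struct Ruby {
    base: &'static str,
    reading: &'static str,
}

/// Hypothetical per-node renderer; the real output format may differ.
fn render_ruby(node: &Ruby) -> String {
    format!("<ruby>{}<rt>{}</rt></ruby>", node.base, node.reading)
}

/// Single pass over the HTML: each sentinel is replaced by the rendered
/// form of the next registry entry, everything else is copied through.
fn splice(html: &str, registry: &[Ruby]) -> String {
    let mut out = String::with_capacity(html.len());
    let mut pending = registry.iter();
    for ch in html.chars() {
        if ch == INLINE_SENTINEL {
            if let Some(node) = pending.next() {
                out.push_str(&render_ruby(node));
                continue;
            }
        }
        out.push(ch);
    }
    out
}

fn main() {
    // What the Markdown stage might emit: the sentinel passed through untouched.
    let html = "<p>彼は\u{E001}に行った。</p>";
    let registry = [Ruby { base: "青梅", reading: "おうめ" }];
    println!("{}", splice(html, &registry));
}
```

The sketch relies on the property the diagram calls out: the sentinel is ordinary UTF-8 outside the HTML escape set, so it reaches the splicer exactly where the lexer placed it.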

How the splicer stays in lockstep

Both consumers of the lex output — the HTML splicer and the IR projector — walk the same source-order sequence of registry entries. The shared abstraction is SentinelCursor in crates/afm-markdown/src/sentinels.rs:

                ┌──────────────── BorrowedLexOutput ────────────────┐
                │ normalized = "前\u{E001}後..."                    │
                │ registry   = { 3 → Inline(Ruby{…}), … }           │
                └─────────────────────┬─────────────────────────────┘
                                      │
                  flatten_registry_in_source_order
                                      │
                                      ▼
             ┌──── &[NodeRef<'src>] (sorted by source pos) ─────┐
             │   [Inline(Ruby), BlockOpen(Indent), …]           │
             └──────────────────────────────────────────────────┘
                          │                            │
                          │  shared cursor             │
                ┌─────────┴────────┐         ┌─────────┴────────┐
                │ HTML splicer     │         │ IR builder       │
                │ (post_process)   │         │ (ir.rs)          │
                │                  │         │                  │
                │ String buffer    │         │ Vec<IrBlock>     │
                │ container_stack: │         │ container_stack: │
                │   Vec<           │         │   Vec<           │
                │     ContainerKind│         │     OpenContainer│ <- holds children
                │   >              │         │   >              │
                └──────────────────┘         └──────────────────┘

Both walkers consume entries linearly via cursor.next(), peek ahead via cursor.peek(offset), and maintain their own container-stack so paired open / close markers nest correctly. They never interfere because each render_to_string / render_to_ir call materialises its own cursor over its own flattened slice.
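
A minimal sketch of that cursor-plus-stack pattern follows. ToyCursor, Entry, and max_depth are hypothetical names chosen for illustration; the real SentinelCursor and NodeRef types live in the repository and may differ in shape.

```rust
/// Toy cursor over a flattened, source-ordered slice of entries.
struct ToyCursor<'a, T> {
    entries: &'a [T],
    idx: usize,
}

impl<'a, T> ToyCursor<'a, T> {
    fn new(entries: &'a [T]) -> Self {
        Self { entries, idx: 0 }
    }

    /// Consume and return the next entry.
    fn next(&mut self) -> Option<&'a T> {
        let entry = self.entries.get(self.idx)?;
        self.idx += 1;
        Some(entry)
    }

    /// Look `offset` entries ahead without consuming anything
    /// (offset 0 is the entry `next` would return).
    fn peek(&self, offset: usize) -> Option<&'a T> {
        self.entries.get(self.idx + offset)
    }
}

#[derive(Debug, PartialEq)]
enum Entry {
    Inline(&'static str),
    BlockOpen(&'static str),
    BlockClose(&'static str),
}

/// Walk the entries with a private container stack and report the deepest
/// nesting seen, or an error when open / close markers do not pair up.
fn max_depth(entries: &[Entry]) -> Result<usize, String> {
    let mut cursor = ToyCursor::new(entries);
    let mut stack: Vec<&str> = Vec::new();
    let mut deepest: usize = 0;
    while let Some(entry) = cursor.next() {
        match entry {
            Entry::BlockOpen(kind) => {
                stack.push(*kind);
                deepest = deepest.max(stack.len());
            }
            Entry::BlockClose(kind) => match stack.pop() {
                Some(open) if open == *kind => {}
                other => return Err(format!("close {kind} does not match {other:?}")),
            },
            Entry::Inline(_) => {}
        }
    }
    if stack.is_empty() {
        Ok(deepest)
    } else {
        Err(format!("unclosed containers: {stack:?}"))
    }
}

fn main() {
    let entries = [
        Entry::Inline("ruby"),
        Entry::BlockOpen("indent"),
        Entry::Inline("bouten"),
        Entry::BlockClose("indent"),
    ];
    println!("{:?}", max_depth(&entries));
}
```

Each walker builds its own cursor and its own stack over a shared read-only slice, so two walkers over the same entries cannot disturb one another.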

The streaming path (render_blocks_to_ir) reuses this design: the public StreamingIrBuilder owns the materialised slice and a cursor_idx that threads across walk_block calls, so per-block IR projection stays consistent with the whole-document path.

Dependency direction

afm depends on aozora. The reverse must not hold:

┌────────────────┐      git dependency       ┌─────────────────┐
│ afm (this repo)│ ─────────────────────────▶│ aozora (sibling)│
│   afm-markdown │                           │  aozora-pipeline│
│   afm-cli      │                           │  aozora-syntax  │
│   afm-wasm     │                           │  aozora-render  │
│   afm-book     │                           │  aozora-encoding│
└────────────────┘                           │  aozora-spec    │
                                             └─────────────────┘

Anything afm needs from aozora travels through aozora’s public API. Anything aozora needs from afm doesn’t exist — by construction (see ADR-0011 for the brand boundary that codifies this rule, and ADR-0010 for the original split).

What lives in the vendored comrak tree

upstream/comrak/ is a verbatim copy of comrak v0.52.0 with a 0-line diff (ADR-0001 v0.2.4). afm composes comrak as a black box: parse_document, format_html, and the AST type tree are imported, the sentinels survive both passes as plain UTF-8, and post-process owns the entire afm-side surface. Upgrading comrak is a cargo xtask upstream-sync <tag> away — no patches to re-apply.

See the architectural decisions for the full rationale and the alternatives that led here (ADR-0008 reset the design to zero-parser-hooks; ADR-0010 split parser / renderer into the sibling repo; ADR-0011 nailed down the brand boundary).

Architecture Decision Records

afm records load-bearing design decisions as MADR-formatted ADRs under docs/adr/. The rationale, alternatives considered, and concrete consequences live there in full — the table below is a map.

#     Title                                             Status
0001  Fork comrak in-tree, 0-line diff budget           Accepted (budget collapsed in v0.2.4)
0002  Docker-only execution for development and CI      Accepted
0003  Initial afm-parser architecture                   Superseded by ADR-0010 (v0.2.0 split)
0004  Accent decomposition inside 〔…〕                 Moved to sibling aozora repo
0005  Paired block annotation container hook            Superseded by ADR-0008
0006  Lint profile policy and scope discipline          Mirrored in sibling aozora repo
0007  17 k-work corpus sweep strategy                   Moved to sibling aozora repo
0008  Zero-parser-hook Aozora-first pipeline            Moved to sibling aozora repo
0009  Authoring tools live in sibling repositories      Accepted
0010  Extract aozora-* core into a sibling repo         Accepted (executed v0.2.0, 2026-04-25)
0011  Brand boundary — aozora-* → afm-* HTML rewrite    Accepted (2026-05-04)

ADRs marked Moved kept their number on this side as redirect stubs (e.g. 0008-MOVED.md); the canonical text now lives in the sibling P4suta/aozora repo.

What’s load-bearing today

If you change anything in these areas, read the cited ADR first:

  • upstream/comrak/ — ADR-0001. 0-line diff means any change here is a fork divergence and needs its own ADR.
  • CI / dev environment — ADR-0002. Host toolchain is forbidden; every command runs through just + Docker.
  • Adding a new Aozora notation — ADR-0010 + the sibling aozora repo’s CLAUDE.md. The lexer, AST, and per-node renderer all live there now.
  • Splicing aozora output into HTML — ADR-0008 (zero parser hooks) + ADR-0011 (brand boundary). afm’s only rewrite of upstream HTML is the aozora-* → afm-* class pass.
  • Authoring tools (formatter / LSP / VS Code extension) — ADR-0009 routes them to the sibling P4suta/aozora-tools repo.
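
As an illustration of the brand-boundary class pass mentioned in the list above, here is a simplified sketch. rebrand_classes is a hypothetical helper, the class names in the example are made up, and it only handles double-quoted class attributes; the actual pass described in ADR-0011 may work differently.

```rust
/// Rewrite `aozora-` class tokens to `afm-`, but only inside
/// double-quoted class attribute values, never in body text.
fn rebrand_classes(html: &str) -> String {
    const ATTR: &str = "class=\"";
    let mut out = String::with_capacity(html.len());
    let mut rest = html;
    while let Some(start) = rest.find(ATTR) {
        let value_start = start + ATTR.len();
        // An unterminated attribute is copied through unchanged.
        let Some(len) = rest[value_start..].find('"') else { break };
        out.push_str(&rest[..value_start]);
        let rewritten: Vec<String> = rest[value_start..value_start + len]
            .split(' ')
            .map(|class| match class.strip_prefix("aozora-") {
                Some(suffix) => format!("afm-{suffix}"),
                None => class.to_string(),
            })
            .collect();
        out.push_str(&rewritten.join(" "));
        // Resume at the closing quote so the next search starts after this value.
        rest = &rest[value_start + len..];
    }
    out.push_str(rest);
    out
}

fn main() {
    let upstream = "<span class=\"aozora-bouten aozora-bouten-goma\">強調</span>";
    println!("{}", rebrand_classes(upstream));
}
```

Restricting the rewrite to attribute values keeps a literal "aozora-" in the document's prose from being altered by accident.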

New decisions follow the same MADR format. Scaffold one with:

cargo xtask new-adr '<title>'

Why ADRs live in-repo

ADRs are part of the diff budget for upstream comrak: when a PR touches upstream/comrak/, the ADR is the contract that says why. Keeping them next to the code — and reviewable in the same PR — means the contract evolves with the implementation.

CLI Reference

The up-to-date reference is afm --help / afm <subcommand> --help. The sections below mirror the same information for offline browsing.

afm

afm [--encoding utf8|sjis] [--strict] <subcommand> [<args>]

Global flags

Flag               Default  Effect
--encoding <enc>   utf8     Input encoding. utf8 or sjis.
--strict           off      Promote every lexer diagnostic to a hard error.
--help                      Print help and exit.
--version                   Print version and exit.

Exit codes

Code  Meaning
0     Success.
1     Generic error (I/O, invalid flag, …).
2     Lexer / parser diagnostic in --strict mode.
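
The exit-code table can be read as a small decision function. The sketch below is one plausible encoding of it; RunSummary and exit_status are hypothetical names, and the precedence (hard errors before strict-mode diagnostics) is an assumption rather than documented behaviour.

```rust
use std::process::ExitCode;

/// Hypothetical summary of one CLI run.
struct RunSummary {
    hard_error: bool,         // I/O failure, invalid flag, ...
    lexer_diagnostics: usize, // diagnostics collected while lexing
    strict: bool,             // whether --strict was passed
}

/// Map a run onto the documented exit codes 0, 1, and 2.
fn exit_status(run: &RunSummary) -> u8 {
    if run.hard_error {
        1
    } else if run.strict && run.lexer_diagnostics > 0 {
        2
    } else {
        0
    }
}

fn main() -> ExitCode {
    // Without --strict, diagnostics stay warnings and the run still succeeds.
    let run = RunSummary { hard_error: false, lexer_diagnostics: 3, strict: false };
    ExitCode::from(exit_status(&run))
}
```

A CI script can then tell "the document has problems" (2) apart from "the tool could not run" (1).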

afm render <input>

Parse <input> and write HTML on stdout.

afm render input.md > out.html

<input> may be - to read from stdin.
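
The "-" convention is easy to mirror when embedding the library in your own tool. The following sketch shows one way to do it; read_input is a hypothetical helper, not part of afm-cli, and it returns raw bytes so that Shift_JIS input can still be decoded afterwards.

```rust
use std::fs::File;
use std::io::{self, Read};

/// Read the whole input, treating "-" as "use the supplied stdin reader".
fn read_input(path: &str, mut stdin: impl Read) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    if path == "-" {
        stdin.read_to_end(&mut buf)?;
    } else {
        File::open(path)?.read_to_end(&mut buf)?;
    }
    Ok(buf)
}

fn main() -> io::Result<()> {
    // An in-memory reader stands in for stdin here; a real tool would
    // pass std::io::stdin().lock() instead.
    let fake_stdin = "彼は|青梅《おうめ》に行った。".as_bytes();
    let bytes = read_input("-", fake_stdin)?;
    println!("read {} bytes", bytes.len());
    Ok(())
}
```

Taking the reader as a parameter keeps the function testable without touching the process's real standard input.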

afm check <input>

Parse <input> without emitting HTML. Useful for CI pre-flight.

afm check --strict input.md

Exits non-zero on parse errors or — under --strict — on any lexer diagnostic.

API Reference

The Rust API reference is generated by cargo doc and published alongside this book at:

/afm/api/

afm crates (this repo)

These are the crates published from the afm workspace. They compose the sibling aozora-* parser into a Markdown integration layer.

  • afm_markdown — public entry points: render_to_string, render_to_ir, render_blocks_to_ir, serialize, plus Options (with afm_default / commonmark_only / gfm_only factories) and the IrDocument / IrBlock / IrInline tree under afm_markdown::ir.
  • afm_wasm — #[wasm_bindgen] surface used by afm-obsidian and other browser hosts. The IR is serialised through serde-wasm-bindgen and matches the TypeScript IRDocument declared in afm-obsidian/src/ir/types.ts.
  • afm-cli — the afm binary. No library API; see CLI Reference for invocation details.

Sibling crates (aozora repo)

Pulled in as a git dependency from P4suta/aozora. Their published rustdoc lives on that repo’s GitHub Pages site; afm just embeds the returned types in its own surface.

  • aozora_pipeline — the lex driver. lex_into_arena(src, &arena) produces the BorrowedLexOutput (normalized text + Registry of borrowed AozoraNode payloads + diagnostics) that afm-markdown consumes.
  • aozora_syntax — the borrowed AST: AozoraNode, Container, ContainerKind, BoutenKind, BoutenPosition, AozoraHeadingKind, Indent, AlignEnd, SectionKind, plus the arena types (Arena, NonEmpty, Registry, NodeRef).
  • aozora_render — per-node HTML writer (render_node::render) and the source-level serializer (serialize::serialize). afm invokes the writer once per sentinel during HTML splicing.
  • aozora_encoding — Shift_JIS decoder (decode_sjis) and the gaiji resolution table.
  • aozora_spec — Diagnostic, Severity, DiagnosticSource, Span, sentinel codepoint constants.

Local preview

When viewing this handbook locally (e.g. via just book-serve) the API link above will 404 — run cargo doc --workspace --no-deps and mount the resulting target/doc/ at /afm/api/ to mirror the published Pages layout, or visit the published site at https://p4suta.github.io/afm/api/ directly.