Pandoc integration

The aozora-pandoc crate (workspace-internal, available via the aozora CLI) projects a parsed Aozora document into the Pandoc AST. Once you have Pandoc JSON, every Pandoc output format (HTML, EPUB, LaTeX/PDF, DOCX, ODT, MediaWiki, …) is one shell pipe away.

This is the recommended path if you want to convert Aozora Bunko notation into anything other than the built-in HTML renderer. Adding a new output format means adding a Pandoc filter (or none, if the default Span/Div mapping is enough), not extending the parser crate.

Quickstart

# Pandoc JSON to stdout
aozora pandoc input.txt > out.json

# Or pipe through pandoc directly
aozora pandoc input.txt | pandoc -f json -t html
aozora pandoc input.txt | pandoc -f json -t epub3 -o out.epub

# `--format` is shorthand for the pipe (requires pandoc on PATH)
aozora pandoc input.txt --format html > out.html
aozora pandoc -E sjis legacy.txt -t epub > out.epub

Projection rules

Each AozoraNode variant lifts to a Pandoc construct carrying a stable CSS class so downstream filters or stylesheets can specialise the rendering:

Aozora variant	Pandoc construct	Class on the construct
`Ruby`	`Span`	`aozora-ruby`
↳ base text	nested `Span`	`aozora-ruby-base`
↳ reading text	nested `Span`	`aozora-ruby-reading`
`Bouten`	`Span` over target text	`aozora-bouten`
`TateChuYoko`	`Span`	`aozora-tate-chu-yoko`
`Gaiji`	`Span` carrying mencode	`aozora-gaiji`
`Indent`, `AlignEnd`	empty `Span` (marker)	`aozora-indent` / `align-end`
`Warichu`	`Span` with two children	`aozora-warichu`
`AngleQuote`	`Span`	`aozora-angle-quote`
`Annotation`, `Kaeriten`, `HeadingHint`	empty `Span` carrying raw	`aozora-annotation` / etc.
`PageBreak`	`HorizontalRule` block	(n/a — semantic block)
`SectionBreak`	empty `Div`	`aozora-section-break`
`AozoraHeading`	`Header` block	`aozora-heading`
`Sashie`	`Para` with `Image`	`aozora-sashie`
Container (字下げ等)	`Div` wrapping inner blocks	`aozora-container-indent` / etc.

The structural attribute kvs (Pandoc’s third Attr tuple) carries non-textual metadata (bouten kind / position, gaiji description / mencode, indent amount, container kind). Filters that want format-native rendering pattern-match on the class + kvs.

Why a Pandoc projection at all

Aozora notation has rich semantic markup (ruby, bouten, tate-chu-yoko, gaiji…) that no single Pandoc native construct captures. The naive shortcut of emitting RawInline("html", "<ruby>…</ruby>") would only work for the HTML writer; every other Pandoc output format would strip the raw HTML and lose the meaning.

By lifting each Aozora variant to a Span / Div with a stable class, the same JSON renders sensibly across every Pandoc format today (each format’s writer renders Span as a stylable container) and stays open for richer format-native rendering tomorrow via filters. That’s the same pattern Pandoc itself uses for [content]{.smallcaps} — semantic in the AST, format-specific in the writer.

Architecture

The library entry point is aozora_pandoc::to_pandoc:

use aozora::Document;
use aozora_pandoc::to_pandoc;

let doc = Document::new(std::fs::read_to_string("input.txt")?);
let pandoc = to_pandoc(&doc.parse());
let json = serde_json::to_string(&pandoc)?;

aozora-cli wires that into aozora pandoc so binary consumers don’t need to write Rust.

Keyboard shortcuts

aozora — 青空文庫記法 Parser Handbook

Pandoc integration

Quickstart

Projection rules

Why a Pandoc projection at all

Architecture