Shift_JIS & gaiji

Problem. Aozora Bunko ships its corpus as Shift_JIS, and those files contain 外字 (gaiji) references like ※[#「木+吶のつくり」、第3水準1-85-54]. You want to decode the bytes and see how each gaiji reference resolved.

Two concerns, two layers

  • Encoding is not the parser’s job — the parser is strictly UTF-8. Decode Shift_JIS first with aozora::encoding, then hand the resulting String to Document::new.
  • Gaiji resolution is the parser’s job. As it classifies a ※[#…] reference it resolves the mencode against the bundled JIS X 0213 tables, attaching the result to the Gaiji node. You read it off the node; you do not call the resolver yourself.

Solution (library)

use aozora::{Document, AozoraNode, NodeRef};
use aozora::encoding::decode_sjis;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Decode the Shift_JIS archive file to UTF-8 (strict — errors on
    // malformed bytes rather than substituting replacement chars).
    let bytes = std::fs::read("crime_and_punishment.txt")?;
    let utf8 = decode_sjis(&bytes)?;

    let doc = Document::new(utf8);
    let tree = doc.parse();

    for sn in tree.source_nodes() {
        if let NodeRef::Inline(AozoraNode::Gaiji(g)) = sn.node {
            match g.ucs.and_then(|r| r.as_char()) {
                Some(ch) => println!("{} → {ch}", g.description),
                None => println!("{} → (unresolved)", g.description),
            }
        }
    }
    Ok(())
}

Expected output

木+吶のつくり → 枘

Gaiji carries three fields: description (the free-form source text), ucs (the resolution result, an optional Resolved that is None when no table matched), and mencode (the raw reference such as 第3水準1-85-54). A Resolved is either a single Char, recovered via as_char() above, or a Multi combining sequence for the handful of plane-1 cells that need one; see the Gaiji node chapter.
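If you need to print combining-sequence gaiji rather than treat them as unresolved, one option is to match on both variants. The sketch below is self-contained and uses stand-in types: the Resolved enum, the payload of Multi (assumed here to be a string slice), and the render helper are hypothetical and only mirror the description above. It also assumes as_char() yields None for a Multi value. Check the Gaiji node chapter for the crate's actual definitions.

```rust
// Stand-in for the crate's `Resolved` type, for illustration only.
// The `Multi` payload type is an assumption, not the real definition.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Resolved {
    /// A gaiji that maps to exactly one Unicode scalar value.
    Char(char),
    /// A gaiji that needs a base character plus combining marks.
    Multi(&'static str),
}

impl Resolved {
    /// Assumed behaviour: only the single-character variant yields a `char`.
    fn as_char(self) -> Option<char> {
        match self {
            Resolved::Char(c) => Some(c),
            Resolved::Multi(_) => None,
        }
    }
}

/// Hypothetical helper: turn the optional resolution into display text,
/// keeping combining sequences instead of reporting them as unresolved.
fn render(ucs: Option<Resolved>) -> String {
    match ucs {
        Some(Resolved::Char(c)) => c.to_string(),
        Some(Resolved::Multi(seq)) => seq.to_string(),
        None => "(unresolved)".to_string(),
    }
}

fn main() {
    let single = Resolved::Char('枘');
    // "か" followed by U+309A, the combining semi-voiced sound mark.
    let combined = Resolved::Multi("か\u{309A}");

    // `as_char()` alone cannot represent the two-code-point sequence.
    assert_eq!(single.as_char(), Some('枘'));
    assert_eq!(combined.as_char(), None);

    println!("{}", render(Some(single)));
    println!("{}", render(Some(combined)));
    println!("{}", render(None));
}
```

Under these assumptions, a loop that relies on as_char() alone would report a Multi gaiji as unresolved even though a table matched, so matching on the variant explicitly is the safer choice when the distinction matters.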

Picking the decoder

aozora::encoding offers more than one entry point:

  • decode_sjis(&[u8]) -> Result<String, _> — force Shift_JIS. Use it when you know the input is the canonical archive encoding.
  • decode_auto(&[u8]) -> Result<Cow<str>, _> — sniff: valid UTF-8 is returned borrowed (zero-copy), otherwise the bytes decode as Shift_JIS. Use it for a mixed corpus where some files are pre-converted UTF-8 mirrors.

Both are strict — neither substitutes replacement characters — so you learn when you are looking at corrupted source rather than silently absorbing it.
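The sniffing strategy described for decode_auto can be sketched with the standard library alone. This is not the crate's implementation: sniff_decode and no_decoder are hypothetical names, and the legacy decoder is passed in as a parameter so that the example stays free of external dependencies. In real code that parameter would be a strict Shift_JIS decoder such as decode_sjis.

```rust
use std::borrow::Cow;

/// Hypothetical sketch of the sniffing strategy: borrow the input when it
/// is already valid UTF-8, otherwise run the supplied strict decoder.
fn sniff_decode<'a, E>(
    bytes: &'a [u8],
    decode_legacy: impl FnOnce(&[u8]) -> Result<String, E>,
) -> Result<Cow<'a, str>, E> {
    match std::str::from_utf8(bytes) {
        // Zero-copy path: no allocation, the result borrows from `bytes`.
        Ok(text) => Ok(Cow::Borrowed(text)),
        // Fallback path: the decoder's error, if any, is propagated as-is.
        Err(_) => decode_legacy(bytes).map(Cow::Owned),
    }
}

/// Placeholder decoder that always fails; it stands in for a real
/// Shift_JIS decoder so the control flow can be exercised.
fn no_decoder(bytes: &[u8]) -> Result<String, String> {
    Err(format!("{} bytes are not UTF-8 and no decoder is wired in", bytes.len()))
}

fn main() {
    // Valid UTF-8 never reaches the fallback and is returned borrowed.
    let utf8 = sniff_decode("青空文庫".as_bytes(), no_decoder).unwrap();
    assert!(matches!(utf8, Cow::Borrowed(_)));

    // 0x90 cannot start a UTF-8 sequence, so these bytes take the fallback.
    let legacy = sniff_decode(&[0x90, 0xC2], no_decoder);
    assert!(legacy.is_err());
    println!("utf8 = {utf8}, legacy = {legacy:?}");
}
```

One consequence of this ordering is that UTF-8 always wins the sniff: pure ASCII input, which is byte-identical in both encodings, comes back borrowed without ever reaching the Shift_JIS path.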

Solution (CLI)

The aozora binary decodes Shift_JIS with -E sjis (alias --encoding sjis); the default is UTF-8:

aozora render -E sjis crime.txt > crime.html
aozora check  -E sjis crime.txt          # diagnostics on the decoded text
aozora pandoc -E sjis crime.txt -t epub > crime.epub

See also