Document AI

2 posts with the tag “Document AI”

Structuring MinerU output into a clean doc tree

Last Updated: 2026-06-04

In part 1 I parsed the whole 5,039-page ECMA-376 Part 1 standard with MinerU on RunPod: 36 clause-aligned batches and about $1.15 of GPU time produced 36 content_list.json files. That’s where most write-ups stop, but parsed is not the same as usable. A vision-language model hands you a flat stream of typed blocks with OCR quirks and no document structure. For a coding agent to answer “what does §17.9.4 isLgl (legal numbering) say?” it needs one small, faithful, addressable file, not a 200-page batch of blocks.

This post is the post-processing half: cleaning, structuring, cross-linking, and verifying the output. The whole thing is distilled into a small, document-agnostic toolkit you can run on your own MinerU output: examples/doc-structuring/ (start with the README).

End state: 5,039 pages became 9,948 Markdown files, every section addressable, ~7,900 cross-references turned into relative links, and every tag and attribute name verified against the official schema and the source PDF.

What does MinerU’s content_list.json actually give you?

A flat, page-ordered list of typed blocks: text (with an optional text_level for headings), list, table (HTML in table_body), code (in code_body, not text), equation, and image (with a VLM content description). Mixed in is page_number/header noise. No tree, no cross-reference graph, and a long tail of OCR artifacts.

That shape is fine for a quick read. It’s wrong for retrieval. Three jobs turn it into something an agent can navigate: rebuild the structure, render each block faithfully to Markdown, and verify the names didn’t get garbled on the way through the model. Everything below is generic; nothing in the toolkit knows about ECMA-376 specifically.

How do you rebuild the document structure?

The blocks arrive in reading order, so structure is just segmenting that stream by heading. A single forward walk does it: each block belongs to the section whose heading appeared most recently. Boundaries are only ever set by a real heading, so a section can never steal a neighbour’s content. You inject one callback, heading_id(block), where all your domain logic lives.

That callback is the only place document specifics enter: numbered headings, styled text_level lines, the occasional heading MinerU buried inside a code caption. The forward walk is the part that doesn’t break, because it has no notion of page or position to get wrong. The segmenter lives in segment.py.
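The forward walk can be sketched in a few lines. This is a minimal illustration, not the toolkit's segment.py: the function name segment, the root_id default, and the numbered_heading_id callback are assumptions made for the example, and only the heading_id(block) callback shape comes from the description above.

```python
import re
from typing import Callable, Dict, Iterable, List, Optional

Block = dict  # one entry from content_list.json


def segment(blocks: Iterable[Block],
            heading_id: Callable[[Block], Optional[str]],
            root_id: str = "front-matter") -> Dict[str, List[Block]]:
    """Assign every block to the most recently seen heading, in one pass."""
    sections: Dict[str, List[Block]] = {root_id: []}
    current = root_id
    for block in blocks:
        sid = heading_id(block)
        if sid is not None:
            # Only a real heading moves the boundary.
            current = sid
            sections.setdefault(current, [])
        sections[current].append(block)
    return sections


# Hypothetical callback: a text block with a text_level that starts with a clause number.
_NUMBERED = re.compile(r"^(\d+(?:\.\d+)*)\s+\S")


def numbered_heading_id(block: Block) -> Optional[str]:
    if block.get("type") != "text" or not block.get("text_level"):
        return None
    match = _NUMBERED.match(block.get("text", ""))
    return match.group(1) if match else None


blocks = [
    {"type": "text", "text": "Foreword"},
    {"type": "text", "text": "17.9 Numbering", "text_level": 1},
    {"type": "text", "text": "Intro prose."},
    {"type": "text", "text": "17.9.4 isLgl (legal numbering)", "text_level": 2},
    {"type": "code", "code_body": "<w:isLgl/>"},
]
tree = segment(blocks, numbered_heading_id)
print(list(tree))
```

Because the walk never looks at page numbers or coordinates, a wrong heading_id can mislabel a section but cannot make two sections overlap.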

Then the tree itself (tree.py):

  • a section with sub-sections becomes a folder plus a barrel file (*-0-index.md) holding its intro prose and a child index;
  • a leaf becomes one file. The golden rule: never split a leaf into parts.
  • an agent walks root → barrel → barrel → leaf, reading one small index per level instead of one giant batch.

Naming encodes the section id plus a short slug (17-4-37-tbl-table.md), so any clause is glob-findable and the camelCase tag (instrText) survives for a regex lookup.

Rendering is the hard part, because faithful Markdown is where every MinerU artifact bites. The structuring is a clean algorithm; the rendering is a pile of special cases, each one traced to a real defect on this run. Skip any of them and the content silently corrupts: examples render empty, tables lose columns, prose gets promoted to a heading.

Every fix in render.py earned its place:

  • Code lives in code_body (pre-fenced) and code_caption, not text. Miss this and all 3,591 XML examples render empty.
  • [Example: … end example] markers have to bracket the code. They routinely land before or inside the fence.
  • Page-split code halves (one example broken across a page) sit directly adjacent, so merge them. Genuinely separate examples have prose between them, so they’re left alone.
  • Mislabelled fences (txt/asp/hcl that actually hold XML) get relabelled to xml, but only when the body has namespaced tags, so a text-output example stays text.
  • Tables: HTML to Markdown. Inline XML examples wrapped as $<w:…>$ math get deleted by naive tag-stripping unless you protect them. Fully-empty illustration columns get dropped.
  • Lists: no doubled bullets (- - foo), and ordered items keep their numbers.
  • A long or sentence-ending text_level block is figure text or prose, not a heading. Don’t render it as ##.
  • $§17.4.62$-wrapped references and \- escaped-dash bullets get normalized.

None of these are exotic. They’re just what VLM output looks like at scale, and each one quietly damages content if you skip it.
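Two of those fixes, merging page-split code halves and relabelling fences that actually hold XML, can be sketched as follows. This is an illustration under assumptions, not the code in render.py: the helper names are hypothetical, and it assumes code_body arrives pre-fenced as described above.

```python
import re

TICKS = "`" * 3  # built at runtime so this example never contains a literal fence
NAMESPACED_TAG = re.compile(r"</?[A-Za-z_][\w.-]*:[A-Za-z_][\w.-]*[\s>/]")


def split_fence(code_body):
    """Return (language label, body) for a pre-fenced code_body."""
    lines = code_body.strip("\n").split("\n")
    lang = ""
    if lines and lines[0].startswith(TICKS):
        lang = lines[0][3:].strip()
        lines = lines[1:]
        if lines and lines[-1].strip() == TICKS:
            lines = lines[:-1]
    return lang, "\n".join(lines)


def fence(lang, body):
    return f"{TICKS}{lang}\n{body}\n{TICKS}"


def merge_adjacent_code(blocks):
    """Join code blocks that sit directly next to each other in the stream."""
    merged = []
    for block in blocks:
        if (block.get("type") == "code" and merged
                and merged[-1].get("type") == "code"):
            lang, head = split_fence(merged[-1]["code_body"])
            _, tail = split_fence(block["code_body"])
            merged[-1] = {**merged[-1], "code_body": fence(lang, head + "\n" + tail)}
        else:
            merged.append(dict(block))
    return merged


def relabel_xml(code_body):
    """Relabel a fence as xml only when the body contains namespaced tags."""
    lang, body = split_fence(code_body)
    if NAMESPACED_TAG.search(body):
        lang = "xml"
    return fence(lang, body)


stream = [
    {"type": "code", "code_body": fence("txt", "<w:p>")},
    {"type": "code", "code_body": fence("txt", "</w:p>")},
    {"type": "text", "text": "The output is:"},
    {"type": "code", "code_body": fence("txt", "Hello")},
]
merged = merge_adjacent_code(stream)
```

The prose block between the second and third code blocks is what keeps the last example separate, which mirrors the adjacency rule in the list above.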

How do you cross-link a densely-referenced spec?

ECMA-376 cites itself constantly (§17.9.11, ST_Jc (§17.18.44), and on). Two moves in crosslink.py: normalize every reference to one canonical §N.N.N form so a single regex collects them all, then turn each resolvable one into a relative Markdown link to the target’s file or barrel.

The regex is §(\d+(?:\.\d+)*|[A-Z](?:\.\d+)+), which catches both numbered clauses and lettered annexes and gives you the citation graph for free. Links are relative on purpose: they’re computed section-to-section within the tree, so they stay valid wherever you mount it. No host path baked in, no rewrite on move. On this run, 7,877 of 7,959 references (98%) became working relative links.
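As a rough sketch of the second move, the regex above can drive a substitution that looks each reference up in a section-to-file map and emits a path relative to the citing file. The linkify function and the file paths here are hypothetical, chosen only to resemble the naming scheme described earlier; crosslink.py may work differently.

```python
import posixpath
import re

# The reference regex quoted above: numbered clauses or lettered annexes.
REF = re.compile(r"§(\d+(?:\.\d+)*|[A-Z](?:\.\d+)+)")


def linkify(text, source_path, paths):
    """Turn each resolvable reference into a link relative to source_path."""
    source_dir = posixpath.dirname(source_path)

    def replace(match):
        target = paths.get(match.group(1))
        if target is None:
            return match.group(0)  # unresolvable: leave the plain reference
        return f"[{match.group(0)}]({posixpath.relpath(target, source_dir)})"

    return REF.sub(replace, text)


# Hypothetical tree layout, with paths relative to the tree root.
paths = {
    "17.9.11": "17/17-9/17-9-11-lvl.md",
    "17.18.44": "17/17-18/17-18-44-st-jc.md",
}
linked = linkify(
    "See §17.9.11 and ST_Jc (§17.18.44); also §99.1.",
    "17/17-4/17-4-37-tbl-table.md",
    paths,
)
print(linked)
```

Since both ends of every link are expressed relative to the tree root, the result does not depend on where the tree is mounted. This sketch does not guard against re-linking text that is already a link, so it should run once on unlinked text.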

How do you verify the parse is actually correct?

Two independent signals, both in verify.py. First, a vocabulary check: build the canonical set of element, attribute, and type names plus enum values straight from the official XSDs, then flag any name in the tree that isn’t in it but closely resembles one. Second, a source cross-check against the PDF text layer, which is the definitive one.

The vocabulary check (Vocabulary.from_xsd([...])) is fast and needs no PDF. It catches the obvious garbles: fontAlign when the schema only knows fontAlgn. But it misses the nasty case where a misread happens to spell a different real name.
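A vocabulary check of this kind might look like the sketch below. It is not the toolkit's Vocabulary class: names_from_xsd and near_misses are hypothetical helpers, the inline schema is a toy, and the 0.85 similarity cutoff is an arbitrary assumption.

```python
import difflib
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"
NAMED = {XS + "element", XS + "attribute", XS + "simpleType", XS + "complexType"}


def names_from_xsd(xsd_text):
    """Collect declared names and enum values from one XSD document."""
    names = set()
    for node in ET.fromstring(xsd_text).iter():
        if node.tag in NAMED and node.get("name"):
            names.add(node.get("name"))
        elif node.tag == XS + "enumeration" and node.get("value"):
            names.add(node.get("value"))
    return names


def near_misses(tokens, vocab, cutoff=0.85):
    """Map each unknown token to the known name it closely resembles."""
    known = sorted(vocab)
    flagged = {}
    for token in sorted(set(tokens)):
        if token in vocab:
            continue  # a real name passes, even if it is wrong for this page
        close = difflib.get_close_matches(token, known, n=1, cutoff=cutoff)
        if close:
            flagged[token] = close[0]
    return flagged


XSD = """<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="align"/>
  <xsd:complexType name="CT_TextParagraphProperties">
    <xsd:attribute name="algn" type="xsd:string"/>
    <xsd:attribute name="fontAlgn" type="xsd:string"/>
  </xsd:complexType>
</xsd:schema>"""

vocab = names_from_xsd(XSD)
flagged = near_misses(["fontAlign", "align", "algn", "paragraph"], vocab)
print(flagged)
```

Note that align is not flagged, because the toy schema declares it. That is the blind spot described above.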

That’s what the source cross-check is for. A name that a file uses but that is absent from its section’s own PDF pages, while a near-miss name is present there, is a confirmed garble. The PDF text layer is independent ground truth. This catches algn misread as align: align is a real element elsewhere, so it passes the vocabulary check, but on the actual page the source says algn. The check is bounded: each token is deduped and tested only against its own section’s pages, so there’s no whole-document scan and no gap in coverage.
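The per-section cross-check reduces to a set comparison once the page text is in hand. The sketch below assumes the text layer for the section's pages has already been extracted into a string. The confirmed_garbles helper, the token regex, and the 0.8 cutoff are illustrative assumptions rather than what verify.py does.

```python
import difflib
import re

NAME = re.compile(r"[A-Za-z_]\w*")


def confirmed_garbles(section_names, page_text, cutoff=0.8):
    """Names absent from the section's own pages while a near-miss is present."""
    page_names = sorted(set(NAME.findall(page_text)))
    present = set(page_names)
    garbles = {}
    for name in sorted(set(section_names)):  # dedupe: each token is tested once
        if name in present:
            continue
        close = difflib.get_close_matches(name, page_names, n=1, cutoff=cutoff)
        if close:
            garbles[name] = close[0]
    return garbles


# Text layer of the section's own page (made-up sample).
page_text = 'The algn attribute specifies alignment. <a:pPr algn="ctr"/>'
# Names used by the rendered Markdown file for that section.
found = confirmed_garbles(["align", "pPr", "align"], page_text)
print(found)
```

Here align is a legitimate name elsewhere in the schema, but it never occurs on this page while algn does, so it is reported with its likely correction.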

Confirmed garbles feed a vetted correction map (corrections.py), applied scoped to name contexts so a garble that collides with a real name is corrected only as an attribute, never as an element. Re-run the verifier and it reports zero. On this run that fixed a consistent align→algn / fontAlign→fontAlgn class across DrawingML, plus displacedByCustomXML, t12br→tl2br, subseted, and more, each confirmed against the PDF before it was applied.

Why replace OCR’d schema dumps instead of correcting them?

Because for the annexes you already have the real source, so OCR is the wrong input. The annexes are machine-generated schema listings (the full XSD and RELAX-NG for the formats). MinerU OCR’d them like everything else, producing the same garbles: CT_Placelder for CT_Placeholder, underscores read as spaces, a dangling fragment where a split cut mid-element. The real schema files exist, so swap them in.

The generic core is schema.py. Index every declaration in the official .xsd/.rnc, work out which schema file each annex dump came from (highest declaration-name overlap), and replace each parsed declaration with the authoritative one, matched by name then kind, exact → case-insensitive → fuzzy. The authoritative kind even drives the output folder and filename, so a mis-named OCR fragment self-corrects on rebuild. Result: 99.8% (5,699 of 5,710) declarations replaced. The 11 too garbled to match confidently keep their OCR text.
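The exact → case-insensitive → fuzzy cascade could be sketched like this. The index shape (name → kind → authoritative text), the match_declaration name, and the 0.8 cutoff are assumptions for illustration. schema.py may organize this differently.

```python
import difflib


def match_declaration(name, kind, index, cutoff=0.8):
    """Return (authoritative name, kind, tier) or None if nothing matches confidently."""

    def pick(candidate, tier):
        kinds = index[candidate]
        # Prefer the parsed kind, otherwise let the authoritative kind win.
        chosen = kind if kind in kinds else sorted(kinds)[0]
        return candidate, chosen, tier

    if name in index:
        return pick(name, "exact")
    folded = {n.lower(): n for n in sorted(index)}
    if name.lower() in folded:
        return pick(folded[name.lower()], "case-insensitive")
    close = difflib.get_close_matches(name, sorted(index), n=1, cutoff=cutoff)
    if close:
        return pick(close[0], "fuzzy")
    return None  # too garbled: keep the OCR text


# Toy index standing in for declarations parsed from the official schema files.
index = {
    "CT_Placeholder": {"complexType": '<xsd:complexType name="CT_Placeholder"/>'},
    "ST_Jc": {"simpleType": '<xsd:simpleType name="ST_Jc"/>'},
}
print(match_declaration("CT_Placelder", "complexType", index))
```

Returning the tier alongside the match makes it easy to review only the fuzzy replacements by hand.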

How do you close the long tail of one-off OCR damage?

The residue is per-instance damage that won’t generalize into a correction map: a \@ date-switch read as $@$, a < read as #, a dropped ), glued attribute names. Past a point, stop writing heuristics. Let agents propose fixes and gate every one of them on the source PDF.

I ran an adversarial multi-agent fan-out. The worklist was ~100 items: every §17.16.5 field clause, enum truncations found by diffing the rendered table against the authoritative XSD, and the named defects. Each item carried its PDF ground truth. About 130 agents proposed fixes, and only PDF-verified ones were accepted: 41 patches, zero that the build couldn’t apply.

They live as a per-section overlay (apply_overlay) applied last, so each find matches the on-disk text and a stale one gets reported on the next rebuild rather than vanishing silently. After all of it, verify_against_pdf reports 0 actionable garbles. The 11 it still flags were each reviewed against the PDF as genuine, distinct OOXML names (useFirstPageNumber and firstPageNumber both exist; o:cname confirmed on p4968) and recorded as benign.
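The overlay idea, find-and-replace patches that report themselves when they no longer match, can be illustrated with a small function. The name apply_patches and the find/replace dictionary shape are assumptions here. The toolkit's apply_overlay may have a different signature.

```python
def apply_patches(text, patches):
    """Apply each patch whose 'find' occurs exactly once; report the rest as stale."""
    stale = []
    for patch in patches:
        if text.count(patch["find"]) == 1:
            text = text.replace(patch["find"], patch["replace"])
        else:
            # Missing or ambiguous: surface it instead of silently skipping.
            stale.append(patch)
    return text, stale


rendered = 'DATE $@$ "M/d/yyyy"'
patches = [
    {"find": '$@$ "M/d', "replace": '\\@ "M/d'},          # date switch read as math
    {"find": "text that is gone", "replace": "whatever"},  # stale after a rebuild
]
fixed, stale = apply_patches(rendered, patches)
print(fixed)
print(len(stale))
```

Requiring exactly one occurrence is a deliberately strict choice for the sketch: a patch that matches twice is as suspicious as one that matches nowhere.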

| Metric | Value |
| --- | --- |
| Source | 5,039 pages, 36 MinerU batches |
| Output | 9,948 Markdown files (4,245 leaf clauses, 356 barrels, 5,130 split schema declarations, 1 root index) |
| Cross-references linked | 7,877 / 7,959 (98%), relative |
| Verification | names vs official XSD vocab (2,058 elements + 1,806 attributes) and vs the source PDF text layer |

The full worked wiring is example_pipeline.py. Note how little domain code it is: an outline loader, a heading detector, a naming scheme, a reference regex, and a correction map, all driven by CLI flags with nothing hard-coded. Everything else is the library.

Honest limits, because the toolkit isn’t magic:

  • The long tail is OCR, not logic, and verification closes it, not more regex. The systematic fixes get you most of the way, the authoritative-schema swap handles the annexes, and the per-instance residue is closed by the adversarial fan-out where every proposed edit is gated by the source PDF before it lands.
  • Verification needs a schema and a text-layer PDF. Without an authoritative vocabulary you lose the first signal; without a real text layer (a scanned PDF) you lose the second.
  • Structure quality equals outline quality. The tree is only as good as the section hierarchy you feed it, here the PDF bookmarks. Garbage outline, garbage tree.

Isn’t MinerU’s Markdown output enough?

For reading, sometimes. For an addressable, agent-navigable, verified corpus, no. You need structure (the tree), a citation graph (the cross-links), and verification against ground truth. That’s the post-processing this toolkit does on top of what MinerU emits.

Why a per-section PDF cross-check instead of just trusting the schema?

Because a garble can collide with a valid name elsewhere (align is a real element), so the schema vocabulary alone passes it. The source page is the only authority on which name belongs here. Scoping the check to the section’s own pages keeps it cheap.

The toolkit is not tied to ECMA-376: it’s document-agnostic. Supply your own Section hierarchy and a few callbacks and it runs on any MinerU output. See the README. ECMA-376 is just the worked example.

The cross-links are relative, computed within the tree. They’re identical wherever the tree is mounted, so the output ships anywhere with zero rewriting.

What block types does content_list.json contain?

text (with an optional text_level for headings), list, table (HTML in table_body), code (in code_body), equation, and image (with a VLM content description), plus page_number and header noise you filter out.

How do you handle code that MinerU split across a page?

The two halves arrive directly adjacent in the block stream, so merge adjacent code blocks. Genuinely separate examples always have prose between them, so they’re left untouched.

If this saved you time, the easiest way to say thanks is signing up for RunPod through this link. Star the repo on GitHub for updates.


Disclosure: RunPod links in this post use a referral code that credits me at no cost to you. The post would read the same without it.

Clause-aligned batching for large PDFs on MinerU + RunPod

Last Updated: 2026-06-03

ECMA-376 Part 1 (the Office Open XML File Formats — Fundamentals and Markup Language Reference) is 5,039 pages of dense, table-heavy, XML-schema-laden specification in a single 35 MB PDF. It is the document that defines .docx, .xlsx, and .pptx down to the attribute. If you want a machine-readable, clause-addressable version of it, you have to parse all 5,039 pages, and almost everything about that page count makes a naive approach fall over.

This is the story of parsing the whole thing through the mineru-runpod serverless worker. The headline result: 36 batches, 5,039 pages, 46,637 content blocks, 4,174 tables, full contiguous coverage, ~$1.15 of GPU time. The interesting part is not the total. It’s how you cut a 5,000-page document into pieces without breaking it, which turns out to be a decision about clause structure, not page numbers.

Why not just send the whole 5,000-page PDF?

The worker accepts a file_url and parses front-to-back, so technically you could send all 5,039 pages as one job. You shouldn’t, for four reasons that all get worse with size:

  • All-or-nothing failure. A single job that dies at page 4,800 (OOM, a transient GPU eviction, a timeout) costs you the entire run. At ~78 minutes of GPU work (more on that below), that’s an expensive coin flip.
  • No resumability. One job has no natural checkpoint. If it fails you start over.
  • The 20 MB response cap. MinerU’s output for a few hundred pages already blows past RunPod’s ~20 MB sync-response ceiling. For 5,000 pages it isn’t close: the extracted output here was 869 MB on disk.
  • Memory. Holding the layout model output for thousands of pages in one process is a needless VRAM/RAM risk when the work is embarrassingly sliceable.

Batching fixes all four, but only if you batch at the right boundaries.

Why cut at clause boundaries instead of every N pages?

The obvious split is “every 100 pages.” The problem: a standard isn’t a stream of interchangeable pages, it’s a tree of clauses. Clause §17.4 (Tables) might start three lines from the bottom of a page and run for 40 pages. If a batch boundary lands in the middle of it, you’ve torn a logical unit across two parse jobs, and every downstream step (clause extraction, cross-referencing, chunking for retrieval) has to stitch it back together.

So I don’t cut by page count. I cut by clause:

  1. Build an outline from the PDF’s 4,600+ bookmarks, giving a clause → page index for the whole document.
  2. Place batch boundaries only at clause starts, never mid-clause.
  3. Treat the huge top-level clauses (§17 WordprocessingML, §18 SpreadsheetML, §19 PresentationML, §20/§21 DrawingML, §22 Shared MLs, and the annexes) as mandatory anchors, so a big reference section always begins a fresh batch.
  4. Aim for ~100 pages per batch, allow up to ~200, and accept whatever the nearest clause boundary gives.

The result was 36 batches averaging 140 pages (smallest 66, largest 238). Every batch starts and ends on a clause edge, so no clause is ever split across the seam between two parse jobs.
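The four rules can be sketched as one greedy pass over the clause starts. This is an illustration rather than the planner used for the run: plan_batches and its parameters are hypothetical names, the clause list is made up, and page numbers are assumed to be 0-based PDF indices with any front-matter offset already applied.

```python
def plan_batches(clause_starts, total_pages, anchors=(), target=100, soft_max=200):
    """Cut only at clause starts; returns inclusive (first_page, last_page) spans."""
    boundaries = sorted({page for _, page in clause_starts if 0 < page < total_pages})
    anchor_pages = {page for clause, page in clause_starts if clause in anchors}
    batches, start, prev = [], 0, None
    for b in boundaries:
        # This clause start overshoots the soft maximum: close at the previous one first.
        if b - start > soft_max and prev is not None:
            batches.append((start, prev - 1))
            start, prev = prev, None
        # Anchors always begin a fresh batch; otherwise cut once the target is reached.
        if b in anchor_pages or b - start >= target:
            batches.append((start, b - 1))
            start, prev = b, None
        else:
            prev = b
    batches.append((start, total_pages - 1))
    return batches


# Made-up outline: (clause id, 0-based start page).
outline = [("1", 0), ("2", 60), ("3", 110), ("4", 150), ("17", 170), ("17.2", 300),
           ("17.3", 380), ("18", 430), ("18.2", 500), ("18.3", 720)]
plan = plan_batches(outline, total_pages=800, anchors={"17", "18"})
print(plan)
```

A single clause longer than soft_max still ends up in one oversized batch, since the only hard rule is that a clause is never split.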

(One calibration gotcha specific to this PDF: printed page 1 is PDF page index 9. There’s a 9-page front-matter offset you have to fold into the bookmark→page mapping or every boundary is off by nine.)

A useful consequence: because the worker slices the PDF server-side via start_page/end_page (see the API reference), you never pre-split the PDF. You upload it once and each batch job asks for its page range out of the same source file.

Did the batches actually cover the whole document?

Yes, and this is worth verifying mechanically rather than trusting. After the run, I checked each batch’s produced page span against its planned range and confirmed the batches tile the document with no gaps and no overlaps:

| Metric | Value |
| --- | --- |
| Batches | 36 |
| Pages | 5,039 (contiguous 0–5038, 0 gaps, 0 overlaps) |
| Content blocks | 46,637 |
| Tables | 4,174 |
| Code blocks | 3,591 |
| Pages/batch | mean 140, min 66 (b35), max 238 (b25) |
| Output downloaded | ~465 MB (compressed tarballs) |
| Output on disk | 869 MB (extracted) |

The contiguity check is the one piece of validation I wouldn’t skip on a document this size: it’s the difference between “the run finished” and “the run is complete.”
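A tiling check like that fits in a dozen lines. The sketch below assumes each batch's produced span is available as an inclusive (first_page, last_page) pair of 0-based indices. The function name check_tiling is hypothetical.

```python
def check_tiling(spans, total_pages):
    """Return a list of ('gap' | 'overlap', first_page, last_page) problems."""
    problems = []
    expected = 0  # next page index that should be covered
    for first, last in sorted(spans):
        if first > expected:
            problems.append(("gap", expected, first - 1))
        elif first < expected:
            problems.append(("overlap", first, min(last, expected - 1)))
        expected = max(expected, last + 1)
    if expected < total_pages:
        problems.append(("gap", expected, total_pages - 1))
    return problems


print(check_tiling([(0, 99), (100, 199)], total_pages=200))
print(check_tiling([(0, 99), (110, 199)], total_pages=200))
```

An empty list is the only passing result, so the check can gate the rest of the pipeline with a plain assertion.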

How was the document transported in and out?

Two different transports, for two different size problems.

Input: R2 URL. At 35 MB the PDF is well over the 20 MB inline (file_b64) limit, so it can’t ride in the request body. I put it on Cloudflare R2 and passed a public URL as file_url. The worker downloads it (≤200 MB cap) and slices the requested pages itself. One upload, 36 jobs read from it.

Output: transport="s3". Per-batch output is large (the biggest batch produced a 32 MB tarball), so embedding results in the sync response was out. With transport="s3", the worker uploads each result .tar.gz back to R2 and returns a presigned URL the client downloads and extracts. The tarball carries everything: content_list.json (the flat, typed, page-indexed block list I treat as source of truth), the rendered markdown, middle.json, and a layout-overlay PDF.

The presigned URL has a 1-hour TTL, which has a real consequence for batching: you must download each batch’s result as its job finishes, not in a sweep at the end of a 78-minute run. By then the early URLs have expired.
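Put together, each batch is just a small request body pointing at the same source file. The sketch below only builds the payloads and makes no network calls. The field names file_url, start_page, end_page, and transport are the ones mentioned in this post, the top-level "input" wrapper follows RunPod's usual serverless request shape, and the rest (the build_jobs helper, the example URL, and whether end_page is inclusive) is an assumption to check against the worker's API reference.

```python
import json


def build_jobs(file_url, batches):
    """One request body per planned batch, all reading the same uploaded PDF."""
    return [
        {
            "input": {
                "file_url": file_url,   # single upload, shared by every job
                "start_page": first,
                "end_page": last,       # assumed inclusive; confirm in the API reference
                "transport": "s3",      # result comes back as a presigned URL
            }
        }
        for first, last in batches
    ]


# Placeholder URL and page ranges for illustration only.
jobs = build_jobs("https://example.com/spec.pdf", [(0, 175), (176, 275)])
print(json.dumps(jobs[1], indent=2))
```

Because the presigned URL expires after an hour, whatever submits these jobs should download each result as soon as its job completes rather than collecting URLs for later.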

What GPU and backend, and what did throughput look like?

Backend: vlm-auto-engine (MinerU 2.5 Pro, the MinerU2.5-Pro-2605-1.2B vision-language model) on a 24 GB AMPERE_24 (RTX A5000-class) RunPod serverless GPU. One parse per worker (MINERU_MAX_CONCURRENCY=1: vLLM’s KV cache isn’t safe to drive from concurrent parses on a 24 GB card). For how to pick a card, see Choosing a GPU.

Across the 35 timed batches, total GPU compute was 4,674.8 s (77.9 min) at an overall 1.04 pages/sec, with individual batches ranging 0.84–1.27 pp/s depending on table density. A few representative batches:

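| Batch | Clause | Pages | Worker time | pp/s |
| --- | --- | --- | --- | --- |
| b00 | §1 Scope (front matter) | 176 | 145.0 s | 1.21 |
| b01 | §17 WordprocessingML | 100 | 102.8 s | 0.97 |
| b17 | §18.17.7 (functions) | 176 | 147.9 s | 1.19 |
| b25 | §21.2 DrawingML – Charts | 238 | 231.8 s | 1.03 |
| b32 | Annex L Primer | 102 | 80.2 s | 1.27 |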

Cost worked out to roughly $0.00023/page, ~$1.15 for the whole standard. Before committing to that, a 3-page smoke test (cents, ~110–130 s dominated by cold start) validated the entire pipeline end-to-end (URL fetch → parse → R2 upload → download → extract), the cheapest insurance you can buy on a big run.

The parallelism lesson: it’s RunPod-side, not in the worker

This is the part that cost the most confusion: a single batch is already parallelized inside the worker (the VLM batches many page-images through the GPU at once), but running multiple batches at once is a RunPod scaling decision, not something you trigger by submitting more jobs.

I learned this the hard way. The client submitted 3 batches concurrently, and RunPod ran exactly one while two sat in the queue. The endpoint was configured workersMax=1: one GPU worker, one batch at a time, no matter how many jobs you fire. Raising workersMax to 3 (and matching the client’s concurrency) is what actually delivered 3×: the remaining 31 batches then finished in 27.8 minutes wall-clock. The scaling guide covers how concurrency and workersMax interact.

The mental-model fix:

  • Inside one job: pages are parallelized on one GPU. Already maxed.
  • workersMax: how many separate GPUs run separate jobs at once. This is your throughput dial.

A related myth worth busting: MinerU’s pipeline logs mention a window_size=64. That is a GPU throughput batch (how many page-images stream through the model at a time to bound VRAM), not a context window. Pages are recognized independently regardless of it, so it has zero effect on content continuity across pages. Which is exactly why clause-aligned batch boundaries matter and the internal window size doesn’t: continuity is something you protect at the batch layer, not by tuning a throughput knob.

Which clauses produced the most structure?

Block and table counts track the content shape of the standard almost perfectly, with the reference-material and function-catalog clauses dominating:

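| Batch | Clause | Blocks | Tables |
| --- | --- | --- | --- |
| b25 | §21.2 DrawingML – Charts | 2,829 | 378 |
| b17 | §18.17.7 (spreadsheet functions) | 2,805 | 239 |
| b26 | §22 Shared MLs | 2,509 | |
| b10 | §17.17 Miscellaneous | | 238 |
| b11 | §18 SpreadsheetML | | 224 |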

These are the dense element/attribute reference tables that make ECMA-376 what it is. They’re a good reminder to spot-check table fidelity on exactly these batches before trusting the output downstream.

The annex schema dumps look completely different

The most striking per-batch contrast is the annexes. Annex A (W3C XML Schema), Annex B (RELAX NG) and friends are long code listings, not prose with tables, and the numbers show it. Same ~150-page batch size, radically smaller output:

| Batch | Annex | Tarball |
| --- | --- | --- |
| b29 | Annex B (RELAX NG) | 1.15 MB |
| b30 | B.3 PresentationML | 1.75 MB |
| b27 | Annex A (XML Schema) | 2.06 MB |
| b28 | A.3 PresentationML | 2.23 MB |

Compare that to the prose-and-table batches that ran 31–32 MB (b10, b11) for a similar page count: roughly a 15× size difference driven entirely by content type. MinerU classifies the schema listings as code, so they compress to almost nothing relative to a table-dense reference section.

The runner keeps a manifest.json keyed by batch, and writes each batch’s result atomically: extract into a temporary directory, then rename into place. A batch is only marked ok after its download, extraction, and rename all succeed. Two payoffs:

  • Pause/resume. Midway through, I paused the run to raise workersMax (you don’t want to change cluster settings while jobs are in flight). Stopping the client abandoned the in-flight jobs, but because their downloads hadn’t completed, the manifest never marked them done, so resuming re-ran them. Completed batches were skipped. No corruption, no duplicate downloads.
  • Crash recovery is free. The same mechanism means any crash resumes from the last completed batch.
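The manifest-plus-atomic-rename pattern can be sketched with the standard library alone. This is not the runner's actual code: run_batch, the "ok" status string, and the fetch_and_extract callback (standing in for download plus extraction) are assumptions for the example.

```python
import json
import os
import shutil
import tempfile
from pathlib import Path


def load_manifest(path):
    return json.loads(path.read_text()) if path.exists() else {}


def save_manifest(path, manifest):
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(manifest, indent=2))
    os.replace(tmp, path)  # atomic swap: readers never see a half-written file


def run_batch(batch_id, out_root, manifest_path, fetch_and_extract):
    """Run one batch unless the manifest already marks it ok. Returns True if it ran."""
    manifest = load_manifest(manifest_path)
    if manifest.get(batch_id) == "ok":
        return False  # completed earlier: skip on resume
    final_dir = out_root / batch_id
    tmp_dir = Path(tempfile.mkdtemp(prefix=batch_id + ".", dir=out_root))
    try:
        fetch_and_extract(tmp_dir)       # download and unpack into the temp dir
        if final_dir.exists():
            shutil.rmtree(final_dir)     # leftover from an interrupted attempt
        os.replace(tmp_dir, final_dir)   # rename into place
    except BaseException:
        shutil.rmtree(tmp_dir, ignore_errors=True)
        raise
    manifest[batch_id] = "ok"            # marked only after everything succeeded
    save_manifest(manifest_path, manifest)
    return True
```

A quick way to exercise it is to pass a fake fetch_and_extract that writes one file, call run_batch twice for the same batch id, and confirm the second call is skipped.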

For a 36-job run that you might interrupt, the resumable manifest is what turns “a long fragile script” into “a process you can walk away from.”

Honest limitations:

  • 1-hour presign expiry forces eager download. You cannot defer pulling results to the end of a long run; download each batch as it lands. My runner does this, but it’s a constraint to design around, not a free lunch.
  • Clause boundaries are only as good as the outline. The whole scheme leans on the PDF’s bookmark tree being accurate and complete. A document with missing or wrong bookmarks needs a fallback (TOC parsing, heading detection) before this works.
  • Table/code fidelity needs spot-checking. 4,174 tables and 3,591 code blocks is a lot of structure to trust blindly; the dense reference batches (b25, b17, b11) and the annex code dumps are where I’d sample-verify first.
  • One GPU is the ceiling. Throughput is fundamentally workersMax × per-GPU rate. There’s no in-job trick to go faster: you pay for more workers or you wait. And more workers means more cold starts, so wall-clock and cost don’t scale perfectly linearly.

What I’d change next time: drive client concurrency directly from the endpoint’s live workersMax so the two never drift, and prune the middle.json + layout PDF from batches where I only need content_list.json. They were roughly half the on-disk footprint.

How long does it take to parse a 5,000-page PDF with MinerU?

About 78 minutes of single-GPU compute (~1 page/sec on a 24 GB RTX A5000-class card with the VLM backend), or ~28 minutes of wall-clock at 3× worker concurrency. Cost is roughly $1.15 total at ~$0.00023/page.

Why batch at clause boundaries instead of fixed page counts?

So no logical unit is split across two parse jobs. A clause can start mid-page and span dozens of pages; cutting by page count tears it in half and forces every downstream step to reassemble it. Cutting at clause starts keeps each clause whole within a batch.

How do you handle output larger than RunPod’s 20 MB response cap?

Use transport="s3": the worker uploads each result tarball to an S3-compatible bucket (Cloudflare R2 here) and returns a presigned URL you download. Per-batch output here reached 32 MB, far past the sync-response ceiling.

Does sending more concurrent jobs make a single endpoint faster?

No. Concurrency above the endpoint’s workersMax just fills the queue. Parallelism is the number of GPU workers RunPod runs, set by workersMax. Raise that to go wider.

Do I need to split the PDF before uploading?

No. Upload the full PDF once (or host it on R2 and pass file_url); each batch job requests its page range via start_page/end_page and the worker slices server-side.

How do I know the whole document was covered?

Verify mechanically: check each batch’s produced page span against its planned range and confirm the batches tile the document with zero gaps and zero overlaps. “The run finished” and “the run is complete” are not the same claim.

The output is 36 batches of content_list.json with page-indexed, typed blocks. The next step is joining each block’s page back to the clause outline to emit a clause-addressable tree (one compact file per clause) that an agent or a retrieval index can navigate. The clause-aligned batching is what makes that join clean: every block already lives inside exactly one clause’s batch. Part 2 walks through that post-processing: cleaning the blocks, building the tree, cross-linking, and verifying the result against the source PDF.

ECMA-376 is freely available from Ecma International; it’s used here purely as a parsing benchmark. The parsed corpus is kept in a private repository for internal use, and this post shares only the parsing process and aggregate statistics, not the standard’s content.

If you want per-phase timings (fetch / parse / package) and throughput dashboards for a run like this, the worker can ship OpenTelemetry traces and metrics to any OTLP backend. See the observability guide.
