Input formats

The worker accepts five input formats through the same three transports (file_url, file_b64, volume_path). Format is auto-detected from the input bytes — there’s no format field to set, and the field name file_* is intentionally format-agnostic. Send whatever you have; the worker figures out what it is.

Supported formats

Format	Magic bytes	Path through MinerU
PDF	`%PDF`	Passes straight to `aio_do_parse`
Image (PNG / JPEG / GIF / BMP / TIFF / WebP)	`\x89PNG`, `\xff\xd8\xff`, `GIF8`, `BM`, `II*\x00`, `MM\x00*`, `RIFF`	Converted to a single-page PDF via `images_bytes_to_pdf_bytes`, then parsed
DOCX (Word)	`PK\x03\x04` (ZIP/OOXML)	Parsed via MinerU’s `office_docx_analyze` (python-docx)
PPTX (PowerPoint)	`PK\x03\x04`	Parsed via `office_pptx_analyze` (python-pptx)
XLSX (Excel)	`PK\x03\x04`	Parsed via `office_xlsx_analyze` (openpyxl)

DOCX / PPTX / XLSX share the same ZIP magic — MinerU’s guess_suffix_by_bytes inspects the archive’s [Content_Types].xml to discriminate them downstream.
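
The detection order described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the worker's or MinerU's actual code: sniff_format and _sniff_ooxml are hypothetical names, the signatures are the standard ones for each file type, and the WebP check additionally looks for the WEBP tag at offset 8, which is an assumption about how strict the real check is.

```python
import io
import zipfile

# Standard file signatures for the image types listed in the table.
_IMAGE_MAGICS = (
    b"\x89PNG",       # PNG
    b"\xff\xd8\xff",  # JPEG
    b"GIF8",          # GIF87a / GIF89a
    b"BM",            # BMP
    b"II*\x00",       # TIFF, little-endian
    b"MM\x00*",       # TIFF, big-endian
)

# Main-part content types declared in [Content_Types].xml (OOXML).
_OOXML_MAIN_PARTS = (
    (b"wordprocessingml.document.main+xml", "docx"),
    (b"presentationml.presentation.main+xml", "pptx"),
    (b"spreadsheetml.sheet.main+xml", "xlsx"),
)


def _sniff_ooxml(data: bytes) -> str:
    """Tell DOCX / PPTX / XLSX apart by reading [Content_Types].xml."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            content_types = archive.read("[Content_Types].xml")
    except (zipfile.BadZipFile, KeyError):
        return "unknown"  # a ZIP, but not an OOXML package
    for marker, fmt in _OOXML_MAIN_PARTS:
        if marker in content_types:
            return fmt
    return "unknown"


def sniff_format(data: bytes) -> str:
    """Return 'pdf', 'image', 'docx', 'pptx', 'xlsx', or 'unknown'."""
    if data.startswith(b"%PDF"):
        return "pdf"
    if data.startswith(_IMAGE_MAGICS):
        return "image"
    # WebP is a RIFF container whose form type at offset 8 is "WEBP".
    if data.startswith(b"RIFF") and data[8:12] == b"WEBP":
        return "image"
    if data.startswith(b"PK\x03\x04"):
        return _sniff_ooxml(data)
    return "unknown"
```

Matching on the full main-part content type, rather than on a shorter substring such as spreadsheetml, avoids misclassifying a PPTX or DOCX that embeds a workbook for a chart.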

What gets returned

The output shape is the same regardless of input format: Markdown + content_list + middle.json + extracted images. The exact contents differ, but the schema doesn’t. A DOCX with a chart and a PDF with the same chart produce comparably structured output.

When format matters for the backend choice

Of the five MinerU backends (see Picking a backend), the three below handle the input pool differently:

vlm-auto-engine (default): MinerU’s end-to-end VLM. Best for PDF. MinerU’s README attaches the 109-language claim specifically to the pipeline backend’s OCR layer; the VLM’s model card carries only English and Chinese HuggingFace language tags. Verified empirically on Russian (2026-05-21, on the prior Pro-2604 weights): the Pro VLM does produce real Cyrillic output despite the limited model-card tags. The 2605 release kept the same language tag set; we have not re-run the Cyrillic check on 2605 but its lineage gives no reason to expect a regression. Coverage of other non-Latin scripts (Arabic, Devanagari, etc.) is undocumented and unverified by us; if you hit transliteration on those, pipeline with the matching script-family lang code is the documented-safe alternative.
pipeline: native multi-language OCR (109 languages). Best for images and scanned PDFs without an embedded text layer. Respects the lang parameter (script-family codes — see below).
hybrid-auto-engine: routes between pipeline and VLM based on page content. Best when you don’t know in advance what mix of layouts you have.

Office formats (DOCX/PPTX/XLSX) are parsed by MinerU’s dedicated analysers regardless of which backend you set — the choice only affects which engine handles the embedded graphics / equations / scanned regions inside the document.

The `lang` parameter (pipeline backend only)

For non-English/Chinese content, set backend: "pipeline" and use a script-family code (NOT an ISO language code):

Script family	Use for
`arabic`	Arabic, Persian, Urdu
`ch`, `ch_lite`, `ch_server`, `en`	Variants of Chinese / English defaults
`chinese_cht`	Traditional Chinese
`cyrillic`	Bulgarian, Macedonian, Mongolian, Serbian (Cyrillic outside the East Slavic group)
`devanagari`	Hindi, Marathi, Nepali
`east_slavic`	Belarusian, Russian, Ukrainian
`el`	Greek
`japan`	Japanese (kanji + kana)
`ka`, `ta`, `te`	Kannada, Tamil, Telugu
`korean`	Korean (hangul)
`latin`	French, German, Indonesian, Polish, Spanish, Vietnamese, etc.
`th`	Thai

The VLM backends (vlm-*, hybrid-* with VLM routing) ignore the lang field — model selection is not conditional on it. That said, the Pro VLM has been empirically verified to handle Cyrillic (Russian) correctly without lang being set, so for Cyrillic content either backend works. Coverage of other non-Latin scripts on the VLM is undocumented; if you see transliteration in practice, switching to pipeline with the right script-family lang code is the safer path.
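
If callers know the document language rather than the script family, a small client-side lookup can translate one into the other. The sketch below is a hypothetical helper, not part of the worker: the names lang_code_for and pipeline_input are made up for illustration, the mapping is copied from the table above, and the ch / en default variants are left out on purpose.

```python
# Languages per script-family code, copied from the table above.
# The ch, ch_lite, ch_server and en default variants are omitted on purpose.
_FAMILIES = {
    "arabic": ["Arabic", "Persian", "Urdu"],
    "chinese_cht": ["Traditional Chinese"],
    "cyrillic": ["Bulgarian", "Macedonian", "Mongolian", "Serbian"],
    "devanagari": ["Hindi", "Marathi", "Nepali"],
    "east_slavic": ["Belarusian", "Russian", "Ukrainian"],
    "el": ["Greek"],
    "japan": ["Japanese"],
    "ka": ["Kannada"],
    "ta": ["Tamil"],
    "te": ["Telugu"],
    "korean": ["Korean"],
    "latin": ["French", "German", "Indonesian", "Polish", "Spanish", "Vietnamese"],
    "th": ["Thai"],
}

# Reverse index: lowercase language name -> script-family code.
_CODE_BY_LANGUAGE = {
    name.lower(): code for code, names in _FAMILIES.items() for name in names
}


def lang_code_for(language: str) -> str:
    """Return the script-family code for a language name listed in the table."""
    try:
        return _CODE_BY_LANGUAGE[language.strip().lower()]
    except KeyError:
        raise ValueError(f"no script-family code listed for {language!r}") from None


def pipeline_input(file_url: str, language: str) -> dict:
    """Build a request body that pins the pipeline backend and the matching lang."""
    return {
        "input": {
            "file_url": file_url,
            "backend": "pipeline",
            "lang": lang_code_for(language),
        }
    }
```

Raising on an unlisted language, instead of silently falling back to a default code, keeps a wrong OCR model from being selected without the caller noticing.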

Examples

PDF (English)

{
  "input": {
    "file_url": "https://example.com/report.pdf"
  }
}

Defaults apply: backend: "vlm-auto-engine", lang: "en". Because the VLM backends ignore lang, only the backend default affects this request.

Cyrillic scan (Russian)

{
  "input": {
    "file_url": "https://example.com/russian-scan.pdf",
    "backend": "pipeline",
    "lang": "east_slavic"
  }
}

Either backend produces Russian output (Pro VLM handles Cyrillic correctly; pipeline uses PaddleOCR with explicit script-family OCR models). For scans without an embedded text layer, or for Cyrillic-script languages beyond Russian/Ukrainian/Belarusian (which take the cyrillic code rather than east_slavic), pipeline is the documented-safe choice.

Scanned image (PNG)

{
  "input": {
    "file_url": "https://example.com/page-scan.png",
    "backend": "pipeline",
    "lang": "latin"
  }
}

Image is converted to a single-page PDF internally, then routed through pipeline OCR.

DOCX with native text + embedded equations

{
  "input": {
    "file_url": "https://example.com/spec.docx",
    "formula_enable": true
  }
}

Office parser extracts native text and structure; embedded equation/image regions are sent to the chosen backend for parsing.

XLSX (spreadsheet)

{
  "input": {
    "file_url": "https://example.com/data.xlsx"
  }
}

Returns Markdown tables (one per sheet) in content_list.

Size limits

Transport	Max input size
`file_b64` (inline)	20 MB (RunPod gateway cap on `/runsync`; 10 MB on `/run`)
`file_url`	200 MB (worker download cap); fetched with a 120 s timeout
`volume_path`	No hard cap; only limited by the network volume’s free space

For files up to 200 MB, a signed S3 URL via file_url is cheapest; for larger books or image archives, pre-stage on a network volume and use volume_path — see Network volumes.
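
A client can apply the limits in the table before uploading anything. The sketch below is a rough, hypothetical helper (pick_transport and b64_len are illustrative names). It makes two assumptions that the table does not state: that 1 MB means 1,000,000 bytes, which is the conservative reading, and that the gateway cap applies to the base64-encoded text rather than to the raw file. It also ignores the few bytes of JSON wrapping around the payload.

```python
MB = 1_000_000  # assumption: decimal megabytes (the conservative reading)

# Inline caps per endpoint, and the worker download cap, from the table above.
B64_CAPS = {"runsync": 20 * MB, "run": 10 * MB}
URL_CAP = 200 * MB


def b64_len(raw_size: int) -> int:
    """Length of the padded base64 encoding of raw_size bytes."""
    return 4 * ((raw_size + 2) // 3)


def pick_transport(raw_size: int, endpoint: str = "runsync") -> str:
    """Return the first transport, in table order, that the file fits under."""
    if b64_len(raw_size) <= B64_CAPS[endpoint]:
        return "file_b64"
    if raw_size <= URL_CAP:
        return "file_url"
    return "volume_path"
```

Because base64 grows the payload by about a third, a raw file of roughly 15 MB already fills the 20 MB inline cap under these assumptions. The helper only reports what fits; as noted above, file_url may still be the cheaper choice for a file that would also fit inline.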

Page selection (PDF only)

The start_page / end_page fields apply to PDFs (including images converted to single-page PDFs). For DOCX/PPTX/XLSX, page-range semantics are interpreted by MinerU’s Office parsers — they generally process the whole document and start_page / end_page are best-effort.

Format detection edge cases

If _detect_format returns "unknown" (the bytes don’t match any known magic), the worker raises:

ValueError: input bytes do not match any supported format (PDF, PNG/JPEG/GIF/BMP/TIFF/WebP image, or DOCX/PPTX/XLSX). Check that file_b64 was base64-encoded correctly and that file_url returned the file body (not an error page).

The most common cause is file_url returning an HTML error page (e.g. a 403 from an S3 bucket with expired credentials). Such a response body starts with <!DOCTYPE or <html, neither of which is in the magic table.
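
Before submitting a job, a client can fetch the first few bytes of the URL and check for this case itself. The function below is a minimal, hypothetical sketch (looks_like_error_page is not part of the worker). It also treats a leading <?xml as suspicious, on the assumption that some storage services return XML rather than HTML error bodies; none of the supported formats starts with that prefix.

```python
# Prefixes that indicate markup rather than a supported binary format.
# Compared case-insensitively after skipping leading whitespace.
_MARKUP_PREFIXES = (b"<!doct", b"<html", b"<?xml")


def looks_like_error_page(head: bytes) -> bool:
    """Return True if the first bytes of a response look like HTML or XML."""
    probe = head.lstrip()[:16].lower()
    return probe.startswith(_MARKUP_PREFIXES)
```

For example, looks_like_error_page(body[:64]) on the first bytes of a download gives an early, local signal that the URL is serving an error document instead of the file.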