Input formats

The worker accepts five input formats through the same three transports (file_url, file_b64, volume_path). Format is auto-detected from the input bytes — there’s no format field to set, and the field name file_* is intentionally format-agnostic. Send whatever you have; the worker figures out what it is.

| Format | Magic bytes | Path through MinerU |
| --- | --- | --- |
| PDF | %PDF | Passes straight to aio_do_parse |
| Image (PNG / JPEG / GIF / BMP / TIFF / WebP) | \x89PNG, \xff\xd8\xff, GIF8, BM, II*\x00, MM\x00*, RIFF | Converted to single-page PDF via images_bytes_to_pdf_bytes, then parsed |
| DOCX (Word) | PK\x03\x04 (ZIP/OOXML) | Parsed via MinerU’s office_docx_analyze (python-docx) |
| PPTX (PowerPoint) | PK\x03\x04 | Parsed via office_pptx_analyze (python-pptx) |
| XLSX (Excel) | PK\x03\x04 | Parsed via office_xlsx_analyze (openpyxl) |

DOCX / PPTX / XLSX share the same ZIP magic — MinerU’s guess_suffix_by_bytes inspects the archive’s [Content_Types].xml to discriminate them downstream.
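The detection described above can be sketched in a few lines. This is an illustration only, not the worker's _detect_format or MinerU's guess_suffix_by_bytes: the function names sniff_format and sniff_ooxml are hypothetical, and the content-type markers are the standard OOXML main-part types rather than anything confirmed from MinerU's source.

```python
import io
import zipfile

# Magic-byte prefixes copied from the table above.
# Note: "BM" and "RIFF" are loose matches (RIFF also covers WAV/AVI).
_IMAGE_MAGICS = (
    b"\x89PNG",
    b"\xff\xd8\xff",
    b"GIF8",
    b"BM",
    b"II*\x00",
    b"MM\x00*",
    b"RIFF",
)

# Standard OOXML main-part content types, matched as substrings.
_OOXML_MAIN_PARTS = {
    "docx": b"wordprocessingml.document.main+xml",
    "pptx": b"presentationml.presentation.main+xml",
    "xlsx": b"spreadsheetml.sheet.main+xml",
}


def sniff_ooxml(data: bytes) -> str:
    """Tell DOCX / PPTX / XLSX apart by reading [Content_Types].xml."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            content_types = archive.read("[Content_Types].xml")
    except (zipfile.BadZipFile, KeyError):
        # Not a readable ZIP, or a ZIP that is not an OOXML package.
        return "unknown"
    for suffix, marker in _OOXML_MAIN_PARTS.items():
        if marker in content_types:
            return suffix
    return "unknown"


def sniff_format(data: bytes) -> str:
    """Classify input bytes as pdf, image, docx, pptx, xlsx, or unknown."""
    if data.startswith(b"%PDF"):
        return "pdf"
    if data.startswith(_IMAGE_MAGICS):
        return "image"
    if data.startswith(b"PK\x03\x04"):
        return sniff_ooxml(data)
    return "unknown"
```

A client can run a check like this before uploading to catch an unsupported file early, without waiting for the worker to reject it.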

The output shape is the same regardless of input format: Markdown + content_list + middle.json + extracted images. The exact contents differ, but the schema doesn’t. A DOCX with a chart and a PDF with the same chart produce comparably structured output.

When format matters for the backend choice

The five MinerU backends (see Picking a backend) handle these input formats differently:

  • vlm-auto-engine (default): MinerU’s end-to-end VLM. Best for PDF. MinerU’s README attaches the 109-language claim specifically to the pipeline backend’s OCR layer; the VLM’s model card carries only English and Chinese HuggingFace language tags. Verified empirically on Russian (2026-05-21, on the prior Pro-2604 weights): the Pro VLM does produce real Cyrillic output despite the limited model-card tags. The 2605 release kept the same language tag set; we have not re-run the Cyrillic check on 2605 but its lineage gives no reason to expect a regression. Coverage of other non-Latin scripts (Arabic, Devanagari, etc.) is undocumented and unverified by us; if you hit transliteration on those, pipeline with the matching script-family lang code is the documented-safe alternative.
  • pipeline: native multi-language OCR (109 languages). Best for images and scanned PDFs without an embedded text layer. Respects the lang parameter (script-family codes — see below).
  • hybrid-auto-engine: routes between pipeline and VLM based on page content. Best when you don’t know in advance what mix of layouts you have.

Office formats (DOCX/PPTX/XLSX) are parsed by MinerU’s dedicated analysers regardless of which backend you set — the choice only affects which engine handles the embedded graphics / equations / scanned regions inside the document.

The lang parameter (pipeline backend only)

For non-English/Chinese content, set backend: "pipeline" and use a script-family code (NOT an ISO language code):

| Script family | Use for |
| --- | --- |
| arabic | Arabic, Persian, Urdu |
| ch, ch_lite, ch_server, en | Variants of Chinese / English defaults |
| chinese_cht | Traditional Chinese |
| cyrillic | Bulgarian, Macedonian, Mongolian, Serbian (non-East-Slavic Cyrillic) |
| devanagari | Hindi, Marathi, Nepali |
| east_slavic | Belarusian, Russian, Ukrainian |
| el | Greek |
| japan | Japanese (kanji + kana) |
| ka, ta, te | Kannada, Tamil, Telugu |
| korean | Korean (hangul) |
| latin | French, German, Indonesian, Polish, Spanish, Vietnamese, etc. |
| th | Thai |

The VLM backends (vlm-*, hybrid-* with VLM routing) ignore the lang field — model selection is not conditional on it. That said, the Pro VLM has been empirically verified to handle Cyrillic (Russian) correctly without lang being set, so for Cyrillic content either backend works. Coverage of other non-Latin scripts on the VLM is undocumented; if you see transliteration in practice, switching to pipeline with the right script-family lang code is the safer path.
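Because the lang codes are script families and not ISO codes, a client-side lookup can help avoid sending an invalid value. The sketch below is a hypothetical helper, not part of the worker: the name pipeline_input and the language-name keys are assumptions, while the code values are copied from the script-family table above. Languages the table does not list by name are omitted on purpose.

```python
# Script-family codes from the table above, keyed by lowercase language name.
# The keys and this helper are illustrative; only the values come from the docs.
SCRIPT_FAMILY_BY_LANGUAGE = {
    "arabic": "arabic", "persian": "arabic", "urdu": "arabic",
    "traditional chinese": "chinese_cht",
    "bulgarian": "cyrillic", "macedonian": "cyrillic",
    "mongolian": "cyrillic", "serbian": "cyrillic",
    "hindi": "devanagari", "marathi": "devanagari", "nepali": "devanagari",
    "belarusian": "east_slavic", "russian": "east_slavic",
    "ukrainian": "east_slavic",
    "greek": "el",
    "japanese": "japan",
    "kannada": "ka", "tamil": "ta", "telugu": "te",
    "korean": "korean",
    "french": "latin", "german": "latin", "indonesian": "latin",
    "polish": "latin", "spanish": "latin", "vietnamese": "latin",
    "thai": "th",
}


def pipeline_input(file_url: str, language: str) -> dict:
    """Build a pipeline-backend request body with the matching lang code."""
    key = language.strip().lower()
    if key not in SCRIPT_FAMILY_BY_LANGUAGE:
        raise ValueError(f"no script-family code known for {language!r}")
    return {
        "input": {
            "file_url": file_url,
            "backend": "pipeline",
            "lang": SCRIPT_FAMILY_BY_LANGUAGE[key],
        }
    }
```

For example, pipeline_input("https://example.com/russian-scan.pdf", "Russian") yields a body with "lang": "east_slavic", while "Bulgarian" maps to "cyrillic".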

{
  "input": {
    "file_url": "https://example.com/report.pdf"
  }
}

Uses the defaults: backend "vlm-auto-engine" and lang "en". The VLM backend ignores lang, so the "en" default has no effect, including on non-Latin input.

{
  "input": {
    "file_url": "https://example.com/russian-scan.pdf",
    "backend": "pipeline",
    "lang": "east_slavic"
  }
}

Either backend produces Russian output (the Pro VLM handles Cyrillic correctly; pipeline uses PaddleOCR with explicit script-family OCR models). For scans without an embedded text layer, or for Cyrillic-script languages other than Russian, Ukrainian, and Belarusian (which take the cyrillic code, not east_slavic), pipeline is the documented-safe choice.

{
  "input": {
    "file_url": "https://example.com/page-scan.png",
    "backend": "pipeline",
    "lang": "latin"
  }
}

The image is converted to a single-page PDF internally, then routed through pipeline OCR.

DOCX with native text + embedded equations

{
  "input": {
    "file_url": "https://example.com/spec.docx",
    "formula_enable": true
  }
}

The Office parser extracts native text and structure; embedded equation and image regions are sent to the chosen backend for parsing.

{
  "input": {
    "file_url": "https://example.com/data.xlsx"
  }
}

Returns Markdown tables (one per sheet) in content_list.

| Transport | Max input size |
| --- | --- |
| file_b64 (inline) | 20 MB (RunPod gateway cap on /runsync; 10 MB on /run) |
| file_url | No hard cap; the worker downloads with a 120 s timeout |
| volume_path | No hard cap; only limited by the network volume’s free space |

For multi-hundred-MB books or large image archives, prefer file_url (signed S3 URL is cheapest) or pre-stage on a network volume — see Network volumes.
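The size limits above suggest a simple client-side rule: inline the file only when it fits, otherwise fall back to a URL. The following is a minimal sketch under stated assumptions: build_input is a hypothetical helper, the caps are treated as binary megabytes, and the cap is compared against the base64 text (roughly 4/3 the raw size), which is the conservative reading since the docs do not say which size the gateway measures.

```python
import base64
from typing import Optional

MB = 1024 * 1024  # assumption: the caps are binary megabytes

# Inline caps from the table above, keyed by RunPod endpoint path.
INLINE_CAP_BYTES = {"/runsync": 20 * MB, "/run": 10 * MB}


def build_input(data: bytes, endpoint: str = "/runsync",
                fallback_url: Optional[str] = None) -> dict:
    """Return a request body, inlining the file only when it fits the cap."""
    cap = INLINE_CAP_BYTES[endpoint]
    encoded = base64.b64encode(data).decode("ascii")
    # Assumption: compare the encoded length, not the raw length, to the cap.
    if len(encoded) <= cap:
        return {"input": {"file_b64": encoded}}
    if fallback_url is not None:
        # Too large to inline: let the worker download it instead.
        return {"input": {"file_url": fallback_url}}
    raise ValueError(
        f"{len(encoded)} encoded bytes exceeds the {cap}-byte cap on "
        f"{endpoint}; supply a file_url or stage the file on a volume"
    )
```

The same 8 MiB file fits inline on /runsync but not on /run under this reading, so the endpoint is part of the decision.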

The start_page / end_page fields apply to PDFs (including images converted to single-page PDFs). For DOCX/PPTX/XLSX, page-range semantics are interpreted by MinerU’s Office parsers — they generally process the whole document and start_page / end_page are best-effort.

If _detect_format returns "unknown" (the bytes don’t match any known magic), the worker raises:

ValueError: input bytes do not match any supported format (PDF, PNG/JPEG/GIF/BMP/TIFF/WebP image, or DOCX/PPTX/XLSX). Check that file_b64 was base64-encoded correctly and that file_url returned the file body (not an error page).

The most common cause is file_url returning an HTML error page (e.g. a 403 from an S3 bucket with expired credentials). The first 16 bytes of such a response begin with <!DOCT or <html, neither of which is in the magic table.
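A caller can spot this failure before submitting a job by inspecting the first bytes that the URL returns. The sketch below is a hypothetical client-side check, not worker code; the name looks_like_error_page is an assumption, and the extra <?xml prefix is included because S3 error bodies are XML documents, which goes beyond the two prefixes named above.

```python
def looks_like_error_page(head: bytes) -> bool:
    """Return True when the leading bytes look like markup, not a document."""
    # Inspect only the first 16 bytes, ignoring leading whitespace and case.
    prefix = head[:16].lstrip().lower()
    # <?xml is an addition for XML error bodies (an assumption, see lead-in).
    return prefix.startswith((b"<!doct", b"<html", b"<?xml"))
```

Passing the first chunk of a download to this function and aborting on True gives a clearer error than the worker's generic format message.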