Input formats
The worker accepts five input formats through the same three transports (file_url, file_b64, volume_path). Format is auto-detected from the input bytes — there’s no format field to set, and the field name file_* is intentionally format-agnostic. Send whatever you have; the worker figures out what it is.
Supported formats
| Format | Magic bytes | Path through MinerU |
|---|---|---|
| PDF | %PDF | Passes straight to aio_do_parse |
| Image (PNG / JPEG / GIF / BMP / TIFF / WebP) | \x89PNG, \xff\xd8\xff, GIF8, BM, II*\x00, MM\x00*, RIFF | Converted to single-page PDF via images_bytes_to_pdf_bytes, then parsed |
| DOCX (Word) | PK\x03\x04 (ZIP/OOXML) | Parsed via MinerU’s office_docx_analyze (python-docx) |
| PPTX (PowerPoint) | PK\x03\x04 | Parsed via office_pptx_analyze (python-pptx) |
| XLSX (Excel) | PK\x03\x04 | Parsed via office_xlsx_analyze (openpyxl) |
DOCX / PPTX / XLSX share the same ZIP magic — MinerU’s guess_suffix_by_bytes inspects the archive’s [Content_Types].xml to discriminate them downstream.
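The detection described above can be approximated in a few lines. This is a minimal sketch, not the worker's `_detect_format` or MinerU's `guess_suffix_by_bytes`: the function name `sniff_format` and the returned labels are hypothetical. The magic prefixes come from the table, and the three content-type substrings are the standard OOXML main-part types.

```python
import io
import zipfile

# Image magic prefixes transcribed from the table above.
_IMAGE_MAGICS = (
    b"\x89PNG", b"\xff\xd8\xff", b"GIF8", b"BM",
    b"II*\x00", b"MM\x00*", b"RIFF",
)

# Substrings of the OOXML main-part content types, mapped to a format label.
_OOXML_MARKERS = {
    "wordprocessingml.document.main": "docx",
    "presentationml.presentation.main": "pptx",
    "spreadsheetml.sheet.main": "xlsx",
}


def sniff_format(data: bytes) -> str:
    """Hypothetical helper: classify input bytes by their leading magic."""
    if data.startswith(b"%PDF"):
        return "pdf"
    if data.startswith(_IMAGE_MAGICS):
        return "image"
    if data.startswith(b"PK\x03\x04"):
        # All three Office formats are ZIP archives, so look inside.
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                raw = archive.read("[Content_Types].xml")
        except (zipfile.BadZipFile, KeyError):
            return "unknown"
        content_types = raw.decode("utf-8", "replace")
        for marker, fmt in _OOXML_MARKERS.items():
            if marker in content_types:
                return fmt
    return "unknown"
```

Running a check like this on the client before upload catches a wrong file early, but the worker's own detection remains authoritative.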
What gets returned
The output shape is the same regardless of input format: Markdown + content_list + middle.json + extracted images. The exact contents differ, but the schema doesn’t. A DOCX with a chart and a PDF with the same chart produce comparably structured output.
When format matters for the backend choice
The five MinerU backends (see Picking a backend) handle the input pool differently:
- vlm-auto-engine (default): MinerU’s end-to-end VLM. Best for PDF. MinerU’s README attaches the 109-language claim specifically to the pipeline backend’s OCR layer; the VLM’s model card carries only English and Chinese HuggingFace language tags. Verified empirically on Russian (2026-05-21, on the prior Pro-2604 weights): the Pro VLM does produce real Cyrillic output despite the limited model-card tags. The 2605 release kept the same language tag set; we have not re-run the Cyrillic check on 2605, but its lineage gives no reason to expect a regression. Coverage of other non-Latin scripts (Arabic, Devanagari, etc.) is undocumented and unverified by us; if you hit transliteration on those, pipeline with the matching script-family lang code is the documented-safe alternative.
- pipeline: native multi-language OCR (109 languages). Best for images and scanned PDFs without an embedded text layer. Respects the lang parameter (script-family codes; see below).
- hybrid-auto-engine: routes between pipeline and VLM based on page content. Best when you don’t know in advance what mix of layouts you have.
Office formats (DOCX/PPTX/XLSX) are parsed by MinerU’s dedicated analysers regardless of which backend you set — the choice only affects which engine handles the embedded graphics / equations / scanned regions inside the document.
The lang parameter (pipeline backend only)
For non-English/Chinese content, set backend: "pipeline" and use a script-family code (NOT an ISO language code):
| Script family | Use for |
|---|---|
| arabic | Arabic, Persian, Urdu |
| ch, ch_lite, ch_server, en | Variants of Chinese / English defaults |
| chinese_cht | Traditional Chinese |
| cyrillic | Bulgarian, Macedonian, Mongolian, Serbian (non-East-Slavic Cyrillic) |
| devanagari | Hindi, Marathi, Nepali |
| east_slavic | Belarusian, Russian, Ukrainian |
| el | Greek |
| japan | Japanese (kanji + kana) |
| ka, ta, te | Kannada, Tamil, Telugu |
| korean | Korean (hangul) |
| latin | French, German, Indonesian, Polish, Spanish, Vietnamese, etc. |
| th | Thai |
The VLM backends (vlm-*, hybrid-* with VLM routing) ignore the lang field — model selection is not conditional on it. That said, the Pro VLM has been empirically verified to handle Cyrillic (Russian) correctly without lang being set, so for Cyrillic content either backend works. Coverage of other non-Latin scripts on the VLM is undocumented; if you see transliteration in practice, switching to pipeline with the right script-family lang code is the safer path.
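One way to keep script-family codes out of calling code is a small lookup that builds the request input. This is a sketch under assumptions: the helper `build_input` and the `SCRIPT_FAMILY` mapping are hypothetical, and the mapping covers only part of the table above. The field names file_url, backend, and lang are the ones used on this page.

```python
# Language name -> script-family code, transcribed from the table above (subset).
SCRIPT_FAMILY = {
    "arabic": "arabic", "persian": "arabic", "urdu": "arabic",
    "bulgarian": "cyrillic", "serbian": "cyrillic", "mongolian": "cyrillic",
    "hindi": "devanagari", "marathi": "devanagari", "nepali": "devanagari",
    "russian": "east_slavic", "ukrainian": "east_slavic", "belarusian": "east_slavic",
    "greek": "el", "japanese": "japan", "korean": "korean", "thai": "th",
    "french": "latin", "german": "latin", "spanish": "latin", "polish": "latin",
}


def build_input(file_url: str, language: str) -> dict:
    """Hypothetical helper: build the "input" object for a given document language."""
    key = language.strip().lower()
    if key in ("english", "chinese"):
        # Leave backend and lang unset so the worker's defaults apply.
        return {"file_url": file_url}
    if key not in SCRIPT_FAMILY:
        raise ValueError(f"no script-family code known for {language!r}")
    # The lang field is only honoured by the pipeline backend.
    return {"file_url": file_url, "backend": "pipeline", "lang": SCRIPT_FAMILY[key]}
```

For example, `build_input("https://example.com/russian-scan.pdf", "Russian")` yields an input with backend "pipeline" and lang "east_slavic".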
Examples
PDF (English)
{ "input": { "file_url": "https://example.com/report.pdf" } }
Defaults apply: backend: "vlm-auto-engine" and lang: "en". The VLM backends ignore lang, so the default has no effect on this path.
Cyrillic scan (Russian)
{ "input": { "file_url": "https://example.com/russian-scan.pdf", "backend": "pipeline", "lang": "east_slavic" } }
Either backend produces Russian output (the Pro VLM handles Cyrillic correctly; pipeline uses PaddleOCR with explicit script-family OCR models). For scans without an embedded text layer, or for Cyrillic-script languages beyond Russian/Ukrainian/Belarusian, pipeline is the documented-safe choice.
Scanned image (PNG)
{ "input": { "file_url": "https://example.com/page-scan.png", "backend": "pipeline", "lang": "latin" } }
The image is converted to a single-page PDF internally, then routed through pipeline OCR.
DOCX with native text + embedded equations
{ "input": { "file_url": "https://example.com/spec.docx", "formula_enable": true } }
The Office parser extracts native text and structure; embedded equation/image regions are sent to the chosen backend for parsing.
XLSX (spreadsheet)
{ "input": { "file_url": "https://example.com/data.xlsx" } }
Returns Markdown tables (one per sheet) in content_list.
Size limits
| Transport | Max input size |
|---|---|
| file_b64 (inline) | 20 MB (RunPod gateway cap on /runsync; 10 MB on /run) |
| file_url | No hard cap; the worker downloads with a 120 s timeout |
| volume_path | No hard cap; only limited by the network volume’s free space |
For multi-hundred-MB books or large image archives, prefer file_url (signed S3 URL is cheapest) or pre-stage on a network volume — see Network volumes.
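Base64 inflates a file by about a third, so a file well under 20 MB on disk can still exceed the inline cap once encoded. The sketch below checks this before choosing file_b64. It rests on assumptions not stated on this page: that the cap applies to the whole request body, and that "MB" means 10**6 bytes (the conservative reading). The helper names are hypothetical.

```python
import base64

# Caps from the table above. Assumption: "MB" is 10**6 bytes and the cap
# covers the entire request body, not just the decoded file.
INLINE_CAP_BYTES = {"runsync": 20_000_000, "run": 10_000_000}


def b64_encoded_len(raw_len: int) -> int:
    """Length of standard padded base64 output for raw_len input bytes."""
    return 4 * ((raw_len + 2) // 3)


def fits_inline(raw_len: int, endpoint: str = "runsync", envelope_bytes: int = 1024) -> bool:
    """True if the encoded file plus a small JSON envelope stays under the cap."""
    return b64_encoded_len(raw_len) + envelope_bytes <= INLINE_CAP_BYTES[endpoint]


def encode_inline(raw: bytes) -> dict:
    """Wrap raw file bytes as a file_b64 request body."""
    return {"input": {"file_b64": base64.b64encode(raw).decode("ascii")}}
```

Under these assumptions, a 16 MB file does not fit on /runsync even though it is below the nominal 20 MB figure; file_url or volume_path is the safer transport in that case.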
Page selection (PDF only)
The start_page / end_page fields apply to PDFs (including images converted to single-page PDFs). For DOCX/PPTX/XLSX, page-range semantics are interpreted by MinerU’s Office parsers: they generally process the whole document, and start_page / end_page are best-effort.
Format detection edge cases
If _detect_format returns "unknown" (the bytes don’t match any known magic), the worker raises:
ValueError: input bytes do not match any supported format (PDF, PNG/JPEG/GIF/BMP/TIFF/WebP image, or DOCX/PPTX/XLSX). Check that file_b64 was base64-encoded correctly and that file_url returned the file body (not an error page).
The most common cause is file_url returning an HTML error page (e.g. a 403 from an S3 bucket with expired credentials). The first 16 bytes of the response will start with <!DOCT or <html, neither of which is in our magic table.
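A client can catch the HTML-error-page case before submitting a job by inspecting the start of whatever the URL returns. This is a minimal sketch; `looks_like_html` is a hypothetical helper, not part of the worker, and it only recognises the two prefixes named above.

```python
def looks_like_html(head: bytes) -> bool:
    """Hypothetical pre-flight check: do the leading bytes look like an HTML page?"""
    # Look at the first 16 bytes only, ignoring leading whitespace and case.
    probe = head[:16].lstrip().lower()
    return probe.startswith((b"<!doct", b"<html"))
```

Fetching the first few bytes of the signed URL and passing them to this check distinguishes an expired-credentials page from a real document without downloading the whole file.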