
How to self-host the MinerU API on RunPod

Jun 8, 2026

Last Updated: 2026-06-08

If you’re calling the official MinerU API (mineru.net/api/v4) in production, you’ve probably hit one of three walls: the daily quota of 1,000 high-priority pages, the per-file 200 MB cap, or a compliance review asking why your documents leave your infrastructure for a third-party cloud. You can run the exact same MinerU engine yourself on RunPod Serverless, keep documents in your own bucket, drop the quota, and pay roughly $0.001 per page warm. The migration is close to drop-in: swap the client, keep your create-task / poll / download loop.

The cloud API is the right way to try MinerU. Once you’re parsing real volume, self-hosting changes the economics and the data path. Here’s why, what it costs, and how to move your code over.

Why self-host the MinerU API instead of using mineru.net?

Three reasons: cost at volume, data residency, and control. The cloud API meters a daily high-priority page quota and then deprioritizes you; a self-hosted endpoint has no quota. Self-hosted, your PDFs and their parsed output never leave your RunPod worker and your own S3 bucket. And you pin the model version, pick the GPU, and set your own concurrency.

The official MinerU API caps each file at 200 MB and gives each account a daily quota of 1,000 high-priority pages (see the published limits) before jobs drop to lower priority. That ceiling is fine for evaluation and light use. It becomes a planning problem the moment you’re ingesting thousands of pages a day on a deadline.

The data path matters too. On the SaaS, every document round-trips through mineru.net. Self-hosted, the worker pulls the file from a URL you control (or you send the bytes inline), parses on your GPU, and writes the result to your bucket. Nothing transits a vendor you have to put in a data-processing agreement.

The trade-off is real: you run the infrastructure. There’s a cold-start cost, but FlashBoot blunts it: once a host has booted the worker, it restores from a process snapshot in ~7-8 s across scale-to-zero cycles, and only a brand-new host pays the full ~110 s (vLLM init plus loading the model into VRAM). The model is baked into the image, so nothing is downloaded at request time. If your traffic is a handful of pages a month, the cloud API’s free tier is simpler; cross into steady volume and the math flips.

What does self-hosting MinerU cost vs the cloud API?

Roughly $0.001 per page warm on a 24 GB RTX 4090, plus a ~$0.03 cold-start tax the first time a host boots the worker (≈110 s of GPU billing for vLLM init and loading the baked model into VRAM). FlashBoot then snapshots that state, so the same host restarts in ~7-8 s across scale-to-zero cycles. RunPod bills per second and scales to zero, so an idle endpoint costs nothing. The cloud API is free under its daily quota, then queues or meters you; self-hosting trades that ceiling for a steady, predictable per-page rate.

The real number depends on how well you amortize that first-boot tax. A thousand pages parsed across one warm window land near $0.001/page; a single short doc on a cold host lands closer to $0.005–$0.01 because you’re paying the one-time tax against very few pages. FlashBoot shrinks that in practice, since a host pays the full tax only once. I broke the full workload-shape math down in Serverless MinerU on RunPod: honest cost math.

One cost the marketing pages skip: the first-boot cold start (~110 s on a brand-new host) is billed GPU time. FlashBoot snapshots that boot, so the same host restarts in ~7-8 s afterward. The model is baked into the image, so there’s no model-download step at boot or request time. Budget for the first boot per host, not for every request.

How do you deploy your own MinerU endpoint on RunPod?

Deploy the open-source mineru-runpod template from the RunPod Hub, or fork it and point RunPod’s GitHub build at your fork. Create a Serverless Endpoint on a 24 GB GPU (ADA_24 / RTX 4090) with FlashBoot enabled. For full_zip_url parity with the cloud API, also set four BUCKET_* env vars pointing at an S3-compatible bucket.

The fastest path is the Hub listing: one click, fill in the deploy-time form, done. The full walkthrough (fork-and-build and bring-your-own-image included) is in the deploy guide.

The one piece worth getting right up front is object storage. The compat client returns results as a full_zip_url, which means the worker uploads the output archive to a bucket and hands back a presigned URL, exactly like the SaaS. That path needs four env vars on the endpoint:

BUCKET_ENDPOINT_URL=https://<account>.r2.cloudflarestorage.com
BUCKET_NAME=mineru-outputs
BUCKET_ACCESS_KEY_ID=<key>
BUCKET_SECRET_ACCESS_KEY=<secret>

Cloudflare R2 is a good pairing here because its egress is free, so downloading your own results costs nothing. Any S3-compatible store works (R2, Backblaze B2, MinIO, AWS S3). To get started you’ll need a RunPod account; sign up here (full disclosure: that’s a referral link). A few dollars of credit goes a long way: at the rates above, $5 covers roughly 160 cold starts or about 5,000 warm pages.

How do you migrate your code off the MinerU API?

Install the mineru-client package and swap the official requests calls for MineruApiClient. It mirrors the cloud API’s create_task / get_task surface and returns the same response dicts, so your existing poll loop keeps working. Under the hood it requests archive_format="zip", so full_zip_url comes back as a real .zip like the SaaS does.

Install it (no version pin, so it tracks the repo):

pip install "mineru-client @ git+https://github.com/sergeyshmakov/mineru-runpod"

Here’s the before and after. The official API, polling by hand:

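import os
import time

import requests

MINERU_TOKEN = os.environ["MINERU_TOKEN"]     # placeholder: wherever you keep your mineru.net token
pdf_url = "https://example.com/report.pdf"    # placeholder: any URL the API can fetch

H = {"Authorization": f"Bearer {MINERU_TOKEN}"}
task_id = requests.post(
    "https://mineru.net/api/v4/extract/task",
    headers=H, json={"url": pdf_url, "model_version": "vlm"},
).json()["data"]["task_id"]

# Poll until the task reaches a terminal state.
while True:
    data = requests.get(
        f"https://mineru.net/api/v4/extract/task/{task_id}", headers=H
    ).json()["data"]
    if data["state"] in ("done", "failed"):
        break
    time.sleep(2)
if data["state"] == "failed":
    raise RuntimeError(f"MinerU task {task_id} failed")
zip_url = data["full_zip_url"]   # then download + unzip yourself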

Self-hosted, against your own endpoint:

from mineru_client import MineruApiClient

client = MineruApiClient(endpoint_id="<your-endpoint-id>", api_key="<runpod-key>")

task_id = client.create_task(pdf_url, model_version="vlm")["data"]["task_id"]
done = client.wait_for_task(task_id)        # polls to a terminal state
client.download_results(done, "./out")      # full_zip_url is a real .zip; unpacked for you

Same lifecycle, same {"code": 0, "data": {...}} response shape. The parameter names map across cleanly too: model_version to the worker’s backend, language to lang, enable_formula / enable_table straight through. The full field-by-field mapping is in Migrate from the MinerU API, and the Clients page covers the native client if you’d rather use the worker’s own (richer) request shape once you’ve moved.

One auth note: the self-hosted endpoint authenticates with your RunPod API key, also via Authorization: Bearer, so even that part of your code barely changes.
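If you keep your own poll loop instead of calling wait_for_task, it is worth bounding it. The sketch below assumes only the response shape described above ({"code": 0, "data": {"state": ...}}) and takes the lookup as a plain callable, so it should work with the client’s get_task or with a requests wrapper. The name poll_until_terminal and the default interval and timeout are illustrative and not part of mineru-client.

```python
import time
from typing import Callable

TERMINAL_STATES = ("done", "failed")  # the two terminal states used in the loop above


def poll_until_terminal(
    get_task: Callable[[str], dict],
    task_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 900.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call get_task(task_id) until the task is done or failed, or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        data = get_task(task_id)["data"]
        if data["state"] in TERMINAL_STATES:
            return data
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"task {task_id} still {data['state']!r} after {timeout_s}s"
            )
        sleep(interval_s)
```

Assuming get_task takes the task id as its only required argument, usage would be `data = poll_until_terminal(client.get_task, task_id)`, followed by a check on `data["state"]` before reading `full_zip_url`.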

What doesn’t carry over from the cloud API?

Most of it carries; five things don’t. The compat client rejects callback (a RunPod webhook delivers a different, unsigned payload than MinerU’s signed {checksum, content} callback, so it raises instead of misleading you), and it doesn’t support extra_formats (docx/html/latex), the MinerU-HTML model, or multi-range page_ranges like "2,4-6". There’s no batch endpoint (RunPod’s queue already parallelizes individual jobs across workers, so one isn’t needed). And full_zip_url requires the BUCKET_* setup above.

The full list, so there are no surprises:

Cloud API feature	Self-hosted status
`create_task` / `get_task`, state machine, `full_zip_url`	Supported (with object storage configured)
`model_version: pipeline` / `vlm`	Supported (maps to the worker backends)
`model_version: MinerU-HTML`	Not supported, raises
`extra_formats` (docx / html / latex)	Not produced by this worker, raises
`page_ranges` multi-range (`"2,4-6"`)	One contiguous range per job, raises otherwise
`callback` + `seed`	Rejected (poll with `get_task` / `wait_for_task` instead)
Batch (`/extract/task/batch`)	Not offered; submit tasks individually and raise `workers_max` — RunPod’s queue parallelizes them across workers

Where it falls down: if your pipeline leans on webhook callbacks or exports to DOCX/LaTeX, the compat client isn’t a clean swap today. There’s no batch endpoint either, but you rarely need one: submit tasks individually and let RunPod fan them across workers_max workers (the queue parallelizes them). For everything else, the create / poll / download path behaves like the SaaS.
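One way around the single-range limit is to submit one task per contiguous range. The helper below is a sketch of that split; split_page_ranges is a hypothetical name, and it only handles the plain "N" and "N-M" forms shown in the table, raising on anything else.

```python
def split_page_ranges(page_ranges: str) -> list[str]:
    """Split a multi-range string such as "2,4-6" into contiguous ranges."""
    parts = [p.strip() for p in page_ranges.split(",") if p.strip()]
    if not parts:
        raise ValueError("page_ranges is empty")
    for part in parts:
        bounds = part.split("-")
        # Accept only "N" or "N-M" with non-negative integers.
        if len(bounds) > 2 or not all(b.strip().isdigit() for b in bounds):
            raise ValueError(f"unsupported page range: {part!r}")
        if len(bounds) == 2 and int(bounds[0]) > int(bounds[1]):
            raise ValueError(f"descending page range: {part!r}")
    return parts
```

Each returned element can then go out as its own job, with RunPod’s queue spreading the jobs across workers. Joining the per-range outputs back into one document is left to the caller.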

One more practical detail. The compat client is URL-only, mirroring the cloud API’s POST /extract/task, which doesn’t accept file uploads. For small local files you don’t need to host anything: use the native MineruClient with file_b64 instead, which sends the bytes inline (fine under RunPod’s ~20 MB request cap). I parsed a 522 KB scanned Russian invoice that way and got clean Cyrillic Markdown back, no bucket round-trip involved.
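Base64 inflates a file by about a third, so the raw size that fits inline is smaller than the cap itself. A minimal sketch for checking this before choosing file_b64 over a URL follows; the cap constant is the approximate ~20 MB figure quoted above, and the 4 KB allowance for the rest of the JSON body is an assumption.

```python
import math

REQUEST_CAP_BYTES = 20 * 1024 * 1024  # approximate request cap quoted above


def base64_size(raw_bytes: int) -> int:
    """Length in bytes of the padded base64 encoding of raw_bytes bytes."""
    return 4 * math.ceil(raw_bytes / 3)


def fits_inline(
    raw_bytes: int,
    cap_bytes: int = REQUEST_CAP_BYTES,
    envelope_bytes: int = 4096,  # assumed allowance for the surrounding JSON
) -> bool:
    """True if a file of raw_bytes should fit in one request as file_b64."""
    return base64_size(raw_bytes) + envelope_bytes <= cap_bytes
```

With these numbers a ~20 MB cap leaves room for a little under 15 MB of raw PDF; anything larger is better passed by URL.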

FAQ

Is the MinerU API free?

The official MinerU cloud API has a free daily quota of 1,000 high-priority pages, after which jobs run at lower priority. There’s no per-call charge inside the quota. Self-hosting removes the quota entirely and replaces it with per-second GPU billing (around $0.001 per page warm).

What are the MinerU API’s limits?

Each file is capped at 200 MB, with a daily quota of 1,000 high-priority pages plus rate limits of 50 files/minute and 5,000 files/day (see the published limits). Self-hosting on RunPod removes the daily quota; the per-file practical limit becomes your GPU’s memory and your endpoint’s job timeout rather than a fixed page count.

Can I self-host MinerU without RunPod?

Yes. MinerU is Apache-2.0 open source and runs anywhere with a CUDA GPU. RunPod Serverless is the path this template targets because it scales to zero, so a bursty workload doesn’t pay for an idle GPU. On a fixed VPS or your own hardware you’d run MinerU directly and skip the serverless wrapper.

Does the self-hosted output match the cloud API’s full_zip_url?

Yes, when the endpoint has object storage configured. The compat client requests archive_format="zip", so the worker uploads a .zip to your bucket and returns a presigned full_zip_url, the same container and field the SaaS returns. download_results fetches and unpacks it for you, and autodetects .tar.gz too if you change the format.

Is self-hosting actually cheaper than the MinerU API?

At steady volume, usually yes, because you stop being throttled by the daily quota and pay only for GPU seconds used. The crossover depends on cold-start amortization: dense traffic lands near $0.001/page, sparse one-doc-per-cold-start traffic closer to $0.005–$0.01. Below a few hundred pages a month, the cloud API’s free tier is the cheaper and simpler option.

Do my documents stay private when self-hosting?

Yes. The worker fetches each file from a URL you control or from bytes you send inline, parses on your own RunPod GPU, and writes output to your own S3 bucket. No document or result passes through a third-party parsing service, which is the usual blocker in a data-residency or compliance review.

Self-hosting MinerU isn’t about beating the cloud API on accuracy. The accuracy is identical: it’s the same model. It’s about removing the quota ceiling, keeping documents in your stack, and paying per-second instead of per-tier. If that’s the wall you’ve hit, fork the template, or deploy it from the RunPod Hub and grab a RunPod account to run the first parse.

RunPod 20 MB Response Cap: Fix NoneType with Cloudflare R2

May 20, 2026

Sergei Shmakov

Last Updated: 2026-05-26

If your RunPod serverless worker logs say done but your client raises unexpected handler return type: <class 'NoneType'>, you’ve hit RunPod’s bidirectional 20 MB payload cap on /runsync. The handler succeeded. The gateway dropped the response on the way back because the payload was too large.

The fix is two steps. Set return: "s3" on the job, and configure four env vars on the endpoint pointing at a Cloudflare R2 bucket. The worker uploads the result to R2 and returns a small presigned URL. Your client downloads from R2 directly. No gateway cap in the path.

I hit this on an 82-page Cyrillic fiscal report (30 MB input, ~25 MB output with embedded images) running my open-source mineru-runpod template. Two retries via return: "inline" and return: "tarball_b64" failed the same way. R2 mode worked first try. The rest of this post is the symptom, the env-var recipe, the cost comparison vs S3, and a few gotchas worth knowing.

Why does my RunPod worker return NoneType after a successful parse?

The worker handler completed and returned a valid dict. RunPod’s runtime then tried to POST that result back to RunPod’s API via /job-done, and the API returned HTTP 400 because the payload exceeded ~20 MB. The result was discarded. The SDK saw no output, returned None to the client, and the client wrapper raised the NoneType error.

The worker logs make the chain explicit:

[mineru-worker] done: elapsed=91.77s phase_ms={'fetch_input': 972, 'mineru_parse': 90789, 'package': 66}
{"requestId": "sync-fdcd03cd-...", "message": "Failed to return job results. | 400, message='Bad Request',
 url='https://api.runpod.ai/v2/<endpoint>/job-done/<worker>/sync-fdcd03cd-...?gpu=NVIDIA+RTX+A5000&isStream=false'"}

The first line shows the handler finished cleanly: 82 pages parsed in 91.8 s on the worker (this test ran on A5000; on the current 4090 default the warm parse is 2–3× faster). The second line shows the gateway rejecting the result. The handler already returned and never knows the rejection happened. The SDK sees the discarded result and returns None to your code.

If you see this NoneType error on a small doc, the diagnosis is different (worker OOM, crash, timeout). On a multi-page parse that the worker logs as done, the answer is almost always the 20 MB cap.
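On the client side, it can help to turn the bare None into an error that names the likely causes. The wrapper below is a hypothetical sketch, not part of the template or the RunPod SDK; it only inspects the value your client call handed back.

```python
from typing import Optional


class EmptyWorkerResult(RuntimeError):
    """The endpoint reported no output for a job."""


def require_output(output: Optional[dict]) -> dict:
    """Return output unchanged, or raise with the likely causes if it is None."""
    if output is None:
        raise EmptyWorkerResult(
            "Endpoint returned no output. If the worker log shows 'done', the "
            "result probably exceeded the ~20 MB response cap; otherwise check "
            "the worker for an OOM, crash, or timeout."
        )
    return output
```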

What is RunPod’s /runsync response payload limit?

RunPod’s /runsync gateway caps payloads at roughly 20 MB in both directions. The request cap affects file_b64 inline uploads. The response cap affects what the worker can return. Both are independent of execution time and memory budget. A fast, successful parse can hit the response cap simply by producing a large output.

Direction	Limit	What triggers it
Request → gateway → worker	~20 MB	`file_b64` inline transport for large PDFs
Worker → gateway → client	~20 MB	Multi-page parse outputs with embedded images

The request cap is in RunPod’s docs and widely discussed. The response cap is mentioned only in passing. I found three open issues on the runpod-workers repos where other users hit the same symptom and didn’t realise what it was, so this post is partly to make that searchable.

Practical threshold for mineru-runpod: pure-text PDFs can run much longer before hitting the cap, while image-heavy PDFs with embedded raster output hit it around 50–80 pages on inline or tarball_b64 transport.

Does `return: "tarball_b64"` get around the 20 MB cap?

No. return: "tarball_b64" gzips the output into a single .tar.gz before base64-encoding it. Gzip compresses the JSON and Markdown text well, but the page images inside the tarball are already raster bytes (PNG, JPEG) and barely compress further. Multi-page parses with embedded images keep the tarball over 20 MB.

I confirmed this on the same 82-page PDF. Same 400 from /job-done. Same NoneType in the client. Both inline and tarball_b64 route through the gateway response, so both inherit the cap. Only return: "s3" avoids it because the worker uploads out-of-band.
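The compression behaviour is easy to reproduce with the standard library. In the sketch below, random bytes stand in for already-compressed PNG/JPEG data (an approximation, since real images are not perfectly random) and a repeated JSON line stands in for the text outputs.

```python
import gzip
import os


def gzip_ratio(payload: bytes) -> float:
    """Compressed size divided by original size (lower means better compression)."""
    return len(gzip.compress(payload)) / len(payload)


text_like = b'{"type": "text", "page_idx": 0, "text": "lorem ipsum"}\n' * 20_000
image_like = os.urandom(1_000_000)  # stand-in for already-compressed image bytes

print(f"text-like:  {gzip_ratio(text_like):.3f}")
print(f"image-like: {gzip_ratio(image_like):.3f}")
```

The text-like payload shrinks to a small fraction of its size while the image-like payload stays at roughly its original size, and base64 then adds about a third on top of whatever gzip leaves.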

How do I configure Cloudflare R2 to bypass the RunPod response cap?

Set return: "s3" in the job input, then add four env vars on the RunPod endpoint pointing at a Cloudflare R2 bucket. The worker uploads the gzipped tarball directly to R2 and returns a small presigned URL (~1 h TTL). Your client downloads from R2.

The job input changes one field:

{
  "input": {
    "file_url": "https://example.com/big.pdf",
    "return": "s3"
  }
}

The four env vars go on the endpoint (not the template — they’re secrets):

Env var	Cloudflare R2 value
`BUCKET_ENDPOINT_URL`	`https://<account-id>.r2.cloudflarestorage.com`
`BUCKET_NAME`	your bucket name
`BUCKET_ACCESS_KEY_ID`	R2 API token access key
`BUCKET_SECRET_ACCESS_KEY`	R2 API token secret
`BUCKET_REGION` (optional)	`auto`

You generate the access key pair in the Cloudflare dashboard: R2 → Manage R2 API Tokens → Create API Token → Object Read & Write scoped to the bucket. The worker auto-restarts when you save endpoint env vars in RunPod. Test with one small doc before sending production traffic.
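If you stage the endpoint configuration in a local env file or CI secrets before pasting it into RunPod, a small preflight check catches a missing variable before a job fails at upload time. This is a sketch using only the four required names from the table; missing_bucket_vars is a hypothetical helper, not something the template ships.

```python
import os
from typing import Mapping

REQUIRED_BUCKET_VARS = (
    "BUCKET_ENDPOINT_URL",
    "BUCKET_NAME",
    "BUCKET_ACCESS_KEY_ID",
    "BUCKET_SECRET_ACCESS_KEY",
)


def missing_bucket_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required BUCKET_* names that are unset or blank in env."""
    return [name for name in REQUIRED_BUCKET_VARS if not env.get(name, "").strip()]
```

An empty list means all four are present. It says nothing about whether the credentials are valid, so the one-small-doc test is still worth doing.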

Why pick Cloudflare R2 over AWS S3 for RunPod output storage?

R2 has zero egress fees, a 10 GB free storage tier, 1M Class A ops and 10M Class B ops per month free, and is fully S3-compatible. AWS S3 charges egress at roughly $0.085/GB plus storage at $0.023/GB/month. For a RunPod pipeline doing dozens of GB of I/O per month, R2’s bill stays near zero while S3 lands in the $5–$15 range.

A back-of-envelope month for the workload I tested:

1,000 multi-page parses, average output 8 MB → 8 GB stored then deleted
1,000 worker→bucket uploads (Class A) + 1,000 bucket→client downloads (Class B) = 2,000 ops
Storage: free (under 10 GB). Egress: free (R2 doesn’t bill egress). Ops: free (well under the 1M Class A and 10M Class B allowances).

Same workload on S3: ~$0.18 storage + ~$0.68 egress + well under a cent in per-request fees, so a little under $1 total. Cheap, but R2’s $0 is cheaper.
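The S3 side of that comparison is simple enough to recompute for your own volumes. The sketch below uses the rough per-GB rates quoted above as defaults and leaves per-request fees out; treat the rates as placeholders and check current pricing before relying on the result.

```python
def s3_monthly_cost(
    stored_gb: float,
    egress_gb: float,
    storage_per_gb: float = 0.023,  # USD per GB-month, rough figure from above
    egress_per_gb: float = 0.085,   # USD per GB, rough figure from above
) -> dict:
    """Storage plus egress cost in USD for one month, excluding request fees."""
    storage = stored_gb * storage_per_gb
    egress = egress_gb * egress_per_gb
    return {
        "storage": round(storage, 2),
        "egress": round(egress, 2),
        "total": round(storage + egress, 2),
    }
```

For the 8 GB workload above, `s3_monthly_cost(8, 8)` gives the same storage and egress figures as the paragraph.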

S3 still makes sense if you’re already deep in AWS, if you need IAM-controlled access patterns, or if RunPod workers and your AWS region are colocated tightly enough that egress doesn’t apply. For everyone else and especially for solo / indie deploys, R2 is the right default. See R2 pricing for current rates.

What does the parse flow look like end-to-end with `return: "s3"`?

The worker fetches the input PDF, runs MinerU, gzips the outputs into a tarball, uploads to R2 via the configured BUCKET_* env vars, and returns a small JSON response with tarball_url, tarball_url_expires_in (3600 s), and bucket_key. Your client follows the URL and extracts the tarball locally. No payload ever crosses RunPod’s 20 MB-capped response path.

Concrete numbers from the 82-page test (on A5000; current default is 4090):

result = client.parse_document(
    file_url="https://pub-....r2.dev/report.pdf",
    backend="vlm-auto-engine",
    return_format="s3",
)
# result["tarball_url"]            -> presigned R2 URL, valid ~1 h
# result["tarball_url_expires_in"] -> 3600
# result["bucket_key"]             -> "report-<hash>.tar.gz"

client.save_s3_tarball(result, "./out/")
# downloads + extracts -> out/report.md, out/report_content_list_v2.json, out/images/, ...

End-to-end wall-clock: 211 s for an 82-page doc on a cold worker. Breakdown: ~112 s before MinerU started parsing (worker boot + warmup), ~92 s warm parsing (1.1 s/page on A5000), ~11 s gzip and upload to R2 (the package phase). The extracted output: 313 KB Markdown plus structured JSON plus per-page images. Roughly 3.5 minutes for a document that previously couldn’t return its output at all.

The cold-start portion is a separate concern from the response cap. The FlashBoot mechanism investigation covers why the ~112 s exists, how the boot-time warmup interacts with RunPod’s snapshot system, and when subsequent cold starts are much faster.
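If you are not using the bundled client, fetching and unpacking the presigned tarball needs only the standard library. This is a sketch rather than what save_s3_tarball does internally: the function names are hypothetical, the whole archive is held in memory, and entries other than regular files and directories are skipped.

```python
import io
import shutil
import tarfile
import urllib.request
from pathlib import Path


def extract_tarball(data: bytes, dest: str) -> list[str]:
    """Unpack a .tar.gz held in memory into dest, refusing paths that escape it."""
    root = Path(dest).resolve()
    root.mkdir(parents=True, exist_ok=True)
    extracted = []
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as tar:
        for member in tar.getmembers():
            target = (root / member.name).resolve()
            if not target.is_relative_to(root):
                raise ValueError(f"unsafe path in archive: {member.name}")
            if member.isdir():
                target.mkdir(parents=True, exist_ok=True)
            elif member.isfile():
                target.parent.mkdir(parents=True, exist_ok=True)
                with tar.extractfile(member) as src, open(target, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                extracted.append(member.name)
            # Symlinks, devices, and other special entries are skipped.
    return extracted


def download_and_extract(tarball_url: str, dest: str) -> list[str]:
    """Fetch a presigned URL (the signature is in the URL itself) and unpack it."""
    with urllib.request.urlopen(tarball_url, timeout=120) as resp:
        return extract_tarball(resp.read(), dest)
```

Usage would be `download_and_extract(result["tarball_url"], "./out")`, run before the presigned URL expires.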

What should I watch out for with the R2 bridge?

Four things the docs don’t say loudly. The presigned URL TTL is 60 minutes. R2 doesn’t auto-clean uploaded objects. One bucket can serve input and output. The 20 MB cap applies to /run (async) too, not just /runsync.

Presigned URL TTL is 60 minutes. If your client is slow to download (e.g. a job-queue worker that picks up results minutes later), bump _S3_PRESIGN_TTL_SECONDS in the handler. Don’t rely on the default in long-tail flows.
R2 doesn’t auto-clean uploaded objects. Add an R2 lifecycle rule (e.g. delete after 7 days) so your output bucket doesn’t grow forever.
One R2 bucket can serve input and output. Upload your PDFs to R2 ahead of time, pass file_url pointing at the R2 public dev URL, and the worker writes outputs to the same bucket at the root. Add BUCKET_PREFIX env var if you want outputs in a subdirectory.
The 20 MB cap applies to /run (async) too. Same gateway, same limit. Switching to async polling doesn’t help.

FAQ

How do I get the R2 access key for `BUCKET_ACCESS_KEY_ID` and `BUCKET_SECRET_ACCESS_KEY`?

In the Cloudflare dashboard: R2 → Manage R2 API Tokens → Create API Token. Set permissions to “Object Read & Write” scoped to the specific bucket. Cloudflare shows the access key ID and secret access key once; copy both into your RunPod endpoint env vars immediately. The secret isn’t retrievable later.

Does the presigned URL expire?

Yes. The default TTL is 3600 seconds (one hour). If your downstream client picks up the response asynchronously (job queue, cron, etc.), download promptly or bump _S3_PRESIGN_TTL_SECONDS in the worker handler before redeploying.

Can I reuse the same R2 bucket for input and output?

Yes. The worker doesn’t care about the bucket layout. Upload your input PDFs to bucket/inputs/ and the worker writes outputs to bucket/<basename>-<hash>.tar.gz at the root. Add BUCKET_PREFIX env var if you want outputs pushed into a subdirectory.

What if I can’t set up R2? Is there a fallback?

Page chunking. Split the parse with start_page and end_page into segments small enough that each output tarball stays under 20 MB, then concatenate the .md files client-side. Slower (you may pay multiple cold starts if the worker scales to zero between calls) and you handle joining yourself, but no infra changes needed.
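A minimal sketch of the splitting step, assuming start_page and end_page are 0-indexed and inclusive (check this against the client you use before relying on it). The chunk size is yours to tune until each tarball stays under the cap; page_chunks is a hypothetical helper.

```python
def page_chunks(total_pages: int, chunk_size: int) -> list[tuple[int, int]]:
    """Split a document into (start_page, end_page) pairs, 0-indexed and inclusive."""
    if total_pages <= 0 or chunk_size <= 0:
        raise ValueError("total_pages and chunk_size must be positive")
    return [
        (start, min(start + chunk_size, total_pages) - 1)
        for start in range(0, total_pages, chunk_size)
    ]
```

For an 82-page document in 30-page segments this yields three jobs; parse each pair as its own request and concatenate the resulting .md files in order.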

Is the 20 MB cap on `/run` too, or only `/runsync`?

Both. RunPod’s /run (async) and /runsync (synchronous) share the same gateway and the same payload limits. Switching to async doesn’t help the response-size problem. The cap is at the gateway layer, not the polling protocol.

Does using `return: "s3"` add to cold-start time?

No. The S3 upload happens at the end of the parse, not the beginning. The handler’s package phase grew from ~95 ms (in-memory tarball) to ~11 s (gzip + upload to R2) on an 82-page job, but cold-start is unchanged. The S3 mode adds a small constant to warm-job latency, not a multiplier.

How big can the R2-uploaded tarball be?

Effectively unlimited for mineru-runpod workloads. R2 supports multipart uploads up to 5 TB per object. You’ll hit the worker’s executionTimeoutMs long before you hit R2’s per-object limit.

Does R2 work for input PDFs too, or only output?

Both. The worker accepts file_url pointing at an R2 public dev URL (or a presigned R2 GET URL for private buckets) and fetches the input from R2. This avoids the inbound 20 MB cap on file_b64 for large PDFs. You can run an R2-in / R2-out setup with one bucket and avoid every payload-size limit RunPod has.

Where to next

If you’ve shipped a multi-page PDF pipeline on RunPod and you’re not using return: "s3", you’ll hit the gateway cap eventually. Set it up before you need it. The cost is ten minutes of env-var configuration and possibly zero dollars per month at indie volumes.

If you’re new to the template, the getting-started guide walks through the full deploy in about ten minutes. For the cold-start side of the picture (separate from the response cap covered here), see the FlashBoot mechanism investigation. For GPU sizing, Choosing a GPU covers when the default ADA_24 (RTX 4090) is enough and when to opt up.

If this saved you time, the easiest way to say thanks is signing up for RunPod through this link. Star the repo on GitHub for updates.

Disclosure: RunPod links in this post use a referral code that credits me at no cost to you. The post would read the same without it.

Serverless MinerU on RunPod: honest cost math (2026)

May 19, 2026

Sergei Shmakov

Last Updated: 2026-05-26

If you’re building a RAG pipeline, a document indexer, or any product that ingests PDFs at scale, you’ve probably hit the same wall I did. Hosted OCR APIs charge pennies per page that compound into thousands per million. CPU parsers are too slow for production volume. A permanent GPU pod is wasteful when traffic comes in bursts.

MinerU 2.5 is genuinely state-of-the-art for PDF → Markdown / structured JSON. Apache 2.0 license. The MinerU2.5-Pro-2604-1.2B model fits comfortably on a 24 GB GPU. RunPod Serverless scales to zero when nothing is calling. Wiring those two together is the obvious move.

Real numbers from my open-source mineru-runpod template, measured on a 24 GB RTX 4090 in May 2026: ~$0.001 per page for warm parses, plus a ~$0.03 fixed tax per cold start. The all-in per-page cost depends on how much work you do before the worker scales back to zero. Here’s the deploy, the response shape, and the workload patterns this template is the right fit for.

What does it actually cost to run MinerU on RunPod Serverless?

About $0.001 per page on an RTX 4090 once the worker is warm. Each scale-from-zero adds a ~$0.03 fixed tax: roughly 110 seconds of GPU billing for vLLM engine init plus model load. Per-page math depends entirely on amortization. Sparse traffic with one short request per cold start lands closer to $0.005–$0.01 per page.

Real workload-shape math using ADA_24 (RTX 4090, ~$1.10/hr Flex):

Workload shape	Per-page cost
1,000 pages amortized across one cold start	~$0.001
100 pages amortized across one cold start	~$0.0013
10 pages then idle out	~$0.004
One short doc per scale-from-zero (worst case)	~$0.007
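The table follows from one line of arithmetic: the warm per-page rate plus the cold-start tax spread over however many pages share that cold start. A short sketch using the two figures quoted above; the five-page case is an assumption about what a "short doc" means, chosen because it is consistent with the worst-case row.

```python
WARM_COST_PER_PAGE = 0.001  # USD, warm RTX 4090 figure quoted above
COLD_START_TAX = 0.03       # USD per scale-from-zero, quoted above


def cost_per_page(
    pages_per_cold_start: int,
    warm: float = WARM_COST_PER_PAGE,
    cold_tax: float = COLD_START_TAX,
) -> float:
    """All-in USD per page when one cold start is shared by this many pages."""
    if pages_per_cold_start <= 0:
        raise ValueError("pages_per_cold_start must be positive")
    return warm + cold_tax / pages_per_cold_start


for pages in (1000, 100, 10, 5):
    print(f"{pages:>5} pages per cold start: ${cost_per_page(pages):.4f}/page")
```

Plugging in your own average pages per cold start from production logs gives a better estimate than any of the table rows.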

Compared to alternatives:

Tool / setup	Per-page cost	Notes
Hosted OCR APIs (typical)	$0.001 – $0.01	vendor lock-in, rate limits, documents leave your stack
Permanent GPU pod (24 h on A5000)	$0.001 – $0.003	24 h of bills whether you use it or not
mineru-runpod, amortized	~$0.001 – $0.004	scales to zero; cold-start tax is real
Marker / Nougat on CPU	$0 cash, $$$ time	~30 s/page sequential (Marker docs)

The trick is RunPod’s per-second billing. No worker running, no bill. The catch is every scale-from-zero pays a real fixed cost.

How do I deploy MinerU to RunPod Serverless in ten minutes?

Fork the repo, point RunPod’s GitHub auto-build at your fork, create a Serverless Endpoint with ADA_24 (RTX 4090) and FlashBoot enabled, send a request via the included Python client. Total wall-clock from RunPod sign-up to first parse: roughly ten minutes, dominated by the image build (~5–10 min) plus the first cold start (~110 s).

1. Get a RunPod account

Sign up here. Add $5 of credit. At ~$0.03 per cold start and ~$0.001 per warm page, that covers roughly 160 cold starts or about 5,000 warm pages.

2. Fork the repo

gh repo fork sergeyshmakov/mineru-runpod --clone
cd mineru-runpod

The repo stays small. Dockerfile, handler.py, a worker/ package, a Python client (mineru_client), three GitHub Actions workflows, Hub metadata under .runpod/. MIT licensed, ~30 files.

3. Wire RunPod’s GitHub auto-build

In the RunPod dashboard:

Serverless → Templates → New → Import Git Repository
Point at your fork. Branch main, Dockerfile path Dockerfile.
RunPod clones, builds the image, stores it in its own registry, and gives you a template_id. The build runs ~5–10 minutes. Watch the log if you want.

4. Create the endpoint

Dashboard path:

Serverless → Endpoints → New
Template: the one you just created
GPU pool: ADA_24 (RTX 4090, 24 GB)
Workers min: 0, max: 3
Idle timeout: 10 seconds
FlashBoot: on
Save, grab the endpoint id

Or as code (reproducible across redeploys):

pip install -e ".[deploy]"
python deploy.py --template-id <tid>

deploy.py exposes every endpoint setting as a CLI flag.

5. Parse your first PDF

from mineru_client import MineruClient

client = MineruClient(
    endpoint_id="<your-endpoint-id>",
    api_key="<your-runpod-api-key>",
)
result = client.parse_document(
    file_url="https://example.com/report.pdf",
    end_page=4,  # smoke test on first 5 pages
)
client.save_tarball(result, "./out/doc")
# → ./out/doc/<basename>.md
# → ./out/doc/<basename>_content_list_v2.json
# → ./out/doc/<basename>_middle.json
# → ./out/doc/images/*.png

First parse pays a cold start. Subsequent parses on the same warm worker run at ~1–6 s/page on the 4090, content density dependent. After 10 s of idle, the worker scales to zero.

What does the MinerU response actually contain?

Three structured outputs plus extracted images. <basename>.md is Markdown with LaTeX equations, HTML tables, and image references. <basename>_content_list_v2.json is a flat list of typed entries (text, equation, table, image, code) each tagged with page_idx. <basename>_middle.json carries the full layout with bounding boxes and reading order. Pick the transport via return: tarball_b64, inline, or s3.

For a document indexer or RAG pipeline, content_list_v2.json is the file you’ll spend the most time with. Split the list at entries marked level: "title" to get section-based chunks. Embed each chunk and store page_idx for citation back to the source.

The Markdown is for human-readable display. middle.json has bounding boxes per span when you need page coordinates for hover-to-source UI.
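Section-based chunking over the content list can be sketched as below. The field names are assumptions to verify against a real content_list_v2.json: the code treats an entry with "level" equal to "title" as a section boundary and reads body text from a "text" key, while page_idx is the per-entry field described above. chunk_by_title is a hypothetical helper.

```python
from typing import Any


def chunk_by_title(entries: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Group a flat content list into sections that each start at a title entry."""
    chunks: list[dict[str, Any]] = []
    for entry in entries:
        text = entry.get("text", "")
        if entry.get("level") == "title":  # assumed boundary marker
            chunks.append(
                {"title": text, "page_idx": entry.get("page_idx"), "texts": []}
            )
            continue
        if not chunks:
            # Content before the first title goes into an untitled leading chunk.
            chunks.append(
                {"title": "", "page_idx": entry.get("page_idx"), "texts": []}
            )
        if text:
            chunks[-1]["texts"].append(text)
    return chunks
```

Each chunk keeps the page_idx of its first entry, which is enough for a citation that points back to the page where the section starts.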

Transport options on the request: tarball_b64 (default) for outputs under ~20 MB, inline if you want the markdown directly in the JSON response, s3 for anything that would exceed RunPod’s response cap. See the R2 bridge post for the s3 setup.

When does mineru-runpod fit your workload, and when doesn’t it?

Good fit: batch ingest jobs, bursty traffic (50 docs in a minute, then quiet), background pipelines, OCR-API replacement. Poor fit: interactive single-document apps (cold starts make users think it’s broken), sparse traffic (one job per cold start dominates the bill), strict latency SLOs without provisioning workers_min ≥ 1.

I run this template in production for a document indexer. After six months of operation, here’s the honest fit picture:

Good fit:

Batch ingest. Drop 500 PDFs into a queue. One cold start amortizes across the whole batch at ~$0.001 per page.
Bursty traffic. A user uploads 50 documents in a minute. One cold start, 49 warm parses.
Background pipelines. Nightly cron processes yesterday’s intake. Cold start cost is rounding error against a multi-hour batch.
OCR-API replacement. Comparable per-page cost without shipping documents to a third party.

Poor fit:

Interactive single-document parsing. Your user uploads one PDF and waits two minutes for the cold start. They’ll think it’s broken.
Sparse traffic (one job every 20–60 min). Almost every request is a cold start. The ~$0.03 cold-start tax dominates. Rent a permanent low-tier GPU pod and skip serverless instead.
Strict latency SLOs. Cold-start latency is partly outside your control. Provisioning workers_min ≥ 1 eliminates cold starts but you pay for the warm worker around the clock.

The repo’s defaults (workers_min=0, idle_timeout=10s) are tuned for batch-with-bursts. The dashboard’s scaling settings are where you tune for other patterns.

What’s the real cold-start cost on RunPod Serverless?

Roughly 110 seconds before MinerU starts parsing your first request after a scale-from-zero. The main phases: ~3 s fitness checks, ~20 s vLLM engine config, ~20 s model load, ~25 s torch.compile, ~5 s CUDA graph capture, ~5 s of actual parse. Those itemized phases add up to about 78 s; the remaining ~30 s is not broken out here. Billed at ~$1.10/hr on the 4090 default, the full ~110 s comes to roughly $0.03 per cold start.

The per-phase breakdown is documented in the troubleshooting guide if you want to see where the time goes. The boot-time warmup in this template loads MinerU’s model and JIT-compiles vLLM kernels before the worker accepts requests. When RunPod’s FlashBoot snapshot is available on a subsequent scale-from-zero, the wall-clock drops to ~7–8 seconds because the snapshot captured a warm process. When the snapshot isn’t available (new host, image rebuild), warmup re-runs and you pay the full ~110 s again.

The FlashBoot mechanism investigation covers when the fast path applies, with measured numbers across multiple consecutive cold starts.

What should I watch out for before going to production?

Four production gotchas the marketing won’t mention. The 20 MB response cap silently drops large outputs (symptom: NoneType after a successful parse, covered by the R2 bridge). execution_timeout defaults to 900 s and won’t cover full books. file_b64 inline payloads cap around 10 MB on the way in. And the per-page price only holds if each cold start is amortized over enough pages. None of these crash the worker; the first three manifest as confusing client-side errors.

20 MB response cap. RunPod’s /runsync gateway drops responses over ~20 MB. Multi-page parses with embedded images hit this around 50–80 pages. Worker logs done; client gets NoneType. Fix: return: "s3" + Cloudflare R2, walked through in the R2 bridge post.
Long-job timeout. The repo defaults to execution_timeout=900s (good for ~150–300 pages on the 4090). A 5,000-page book takes 80–500 minutes depending on content density. Bump execution_timeout for long jobs; the endpoint upper limit is 24 hours.
Inline payload cap on the way in. file_b64 requests cap around 10 MB. For bigger files, pass file_url and let the worker fetch from your storage. R2 public dev URLs work well.
Cold-start economics. “Pennies per page” depends on amortization. Track average pages per cold start in your logs. If it’s under 30, bump idle_timeout or run workers_min=1.
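A minimal client-side sketch of guarding against the first three gotchas, under stated assumptions: the field names file_b64, file_url, and return come from this post, the {"input": ...} envelope is RunPod’s usual request shape, and the helper names, the check against the encoded length, and the headroom factor are hypothetical choices for illustration.

```python
import base64
from typing import Optional

INLINE_LIMIT_BYTES = 10 * 1024 * 1024    # ~10 MB inline cap noted above
DEFAULT_EXECUTION_TIMEOUT_S = 900        # repo default
MAX_EXECUTION_TIMEOUT_S = 24 * 60 * 60   # endpoint upper limit (24 hours)


def build_parse_request(pdf_bytes: Optional[bytes] = None,
                        file_url: Optional[str] = None,
                        return_mode: str = "s3") -> dict:
    """Build a request body, preferring file_url over inline bytes."""
    if file_url:
        payload = {"file_url": file_url}
    elif pdf_bytes is not None:
        encoded = base64.b64encode(pdf_bytes).decode("ascii")
        # Assumption: the cap is checked against the encoded size, which is
        # the conservative reading because base64 inflates data by about 4/3.
        if len(encoded) > INLINE_LIMIT_BYTES:
            raise ValueError("too large for file_b64; upload it and pass file_url")
        payload = {"file_b64": encoded}
    else:
        raise ValueError("provide pdf_bytes or file_url")
    payload["return"] = return_mode  # "s3" keeps large outputs out of the response
    return {"input": payload}


def suggested_execution_timeout(pages: int,
                                seconds_per_page: float = 6.0,
                                headroom: float = 1.5) -> int:
    """Estimate a timeout from page count, clamped to the default and the 24 h limit."""
    estimate = int(pages * seconds_per_page * headroom)
    return min(max(estimate, DEFAULT_EXECUTION_TIMEOUT_S), MAX_EXECUTION_TIMEOUT_S)
```

For a 5,000-page book at the slow end of the warm range, suggested_execution_timeout(5000) gives 45,000 s (12.5 hours), comfortably under the 24-hour ceiling. How you apply that value to the endpoint depends on your deployment tooling.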

Where to next

The repo ships with:

Typed Python client (MineruClient)
deploy.py / destroy.py for endpoint lifecycle automation
Reference adapter pattern for wrapping MinerU output into domain models
96 unit tests, CI on every PR
Commitlint + semantic-release for automated CHANGELOG / GitHub Releases

For the deeper context that didn’t fit:

How RunPod FlashBoot actually works — four-request investigation into the cold-start mechanism and the per-host snapshot caveat.
The R2 bridge for the 20 MB response cap — fix for NoneType on multi-page outputs.
Choosing a GPU — when 24 GB is enough, when to opt up to 48 GB.

If this saved you time, the easiest way to say thanks is signing up for RunPod through this link. Star the repo on GitHub for updates.

FAQ

How does mineru-runpod compare to hosted PDF APIs?

Per-page cost is in the same ballpark ($0.001–$0.004) when amortizing cold starts across reasonable batches. The differences are control and lock-in. You deploy your own RunPod endpoint, pick your GPU and concurrency, run whichever MinerU version you want, and never send documents to a third party. The trade-off is operating a serverless template instead of consuming a managed API.

Can MinerU 2.5 handle non-English PDFs?

Yes. The vlm-auto-engine default backend handles English and Chinese well per the model card. For other scripts (Cyrillic, Arabic, Devanagari, Japanese, Korean), the pipeline backend uses PaddleOCR with script-family models, covering 109 languages. Empirically the Pro VLM also handles Cyrillic correctly even though lang is ignored on the VLM path. Switch backends per-request via the backend field.
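Per-request backend selection could be sketched as below. The backend and lang field names and the two backend values come from this post; the language labels are placeholders, and the exact values the pipeline backend accepts for lang are an assumption you should check against the repo.

```python
# Hypothetical labels for the languages the VLM backend handles well.
VLM_LANGUAGES = {"en", "zh"}


def backend_fields(language: str) -> dict:
    """Pick request fields for a document in the given language."""
    if language in VLM_LANGUAGES:
        # lang is ignored on the VLM path, so it is not sent.
        return {"backend": "vlm-auto-engine"}
    # Other scripts go to the PaddleOCR-based pipeline backend.
    return {"backend": "pipeline", "lang": language}


if __name__ == "__main__":
    print(backend_fields("en"))
    print(backend_fields("arabic"))
```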

What’s the difference between vlm-auto-engine, pipeline, and hybrid-auto-engine?

vlm-auto-engine uses MinerU’s 1.2B VLM via vLLM. Fastest on English / Chinese, ~1–6 s/page warm. pipeline uses PaddleOCR plus dedicated layout / formula / table models. Slower (~3–5 s/page) but more memory-predictable (4 GB minimum VRAM) and covers 109 languages. hybrid-auto-engine routes each page through either backend based on content. Highest quality on mixed-content docs; needs 48 GB on dense layouts.

Does the per-page cost include the cold-start tax?

No. The ~$0.001 per page is warm-worker math. Each scale-from-zero adds a roughly $0.03 fixed cost on the 4090 default. Your effective per-page cost is ((0.001 × pages) + (0.03 × cold_starts)) / pages. For 100 pages across one cold start, that’s $0.0013 per page. For 10 pages, it’s $0.004.
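The amortization can be written as a small function. This is a sketch using the approximate figures from this answer; the function and parameter names are illustrative.

```python
def effective_cost_per_page(pages: int,
                            cold_starts: int,
                            warm_cost_per_page: float = 0.001,
                            cold_start_cost: float = 0.03) -> float:
    """Total spend (warm parsing plus cold starts) divided by pages parsed."""
    if pages <= 0:
        raise ValueError("pages must be positive")
    total = warm_cost_per_page * pages + cold_start_cost * cold_starts
    return total / pages


if __name__ == "__main__":
    for pages in (10, 30, 100, 500):
        print(pages, round(effective_cost_per_page(pages, cold_starts=1), 4))
```

The cold-start term shrinks as 1/pages, so the warm rate dominates once a single cold start covers a few hundred pages.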

Can I use mineru-runpod with my own MinerU model?

Yes. Fork the repo and update the Dockerfile’s huggingface_hub.snapshot_download call to point at your model. Rebuild and redeploy. The handler is model-agnostic; MinerU’s aio_do_parse resolves whatever model is in HF_HOME at runtime.
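The download step might look roughly like this. huggingface_hub.snapshot_download is the real library call named above; the wrapper function, its injectable download parameter, and the repo id in the usage line are hypothetical, and the repo’s actual Dockerfile may pass different arguments.

```python
from typing import Callable, Optional


def bake_model(repo_id: str, download: Optional[Callable[..., str]] = None) -> str:
    """Download a model snapshot into the Hugging Face cache and return its path.

    The cache location follows HF_HOME, which is where the post says the
    handler resolves the model at runtime.
    """
    if download is None:
        # Imported lazily so this module loads without huggingface_hub installed.
        from huggingface_hub import snapshot_download
        download = snapshot_download
    return download(repo_id=repo_id)


# Usage at image build time (placeholder repo id):
# bake_model("your-org/your-mineru-model")
```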

What GPU does the template default to?

ADA_24 (RTX 4090, 24 GB). Switched from AMPERE_24 (A5000) on 2026-05-26 after measuring per-page cost. The 4090 is 2–4× faster per page than the A5000 and cheaper per page despite the higher hourly rate. See Choosing a GPU for the full math and when to opt up to 48 GB.

How do I keep my RunPod endpoint warm to avoid cold starts?

Set workers_min=1 on the endpoint. You pay for the always-on worker around the clock (~$0.000306/s on the 4090 default, so ~$26/day or ~$800/month). Worth it if your traffic is steady enough that the warm worker stays busy, or if your latency SLO can’t tolerate the cold-start window. For bursty traffic, workers_min=0 with FlashBoot enabled is usually cheaper.
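A back-of-the-envelope comparison, assuming the per-second rate and cold-start cost quoted in this post. It deliberately ignores idle_timeout billing and multi-worker scaling, and the function names are illustrative.

```python
RATE_PER_SECOND_USD = 0.000306   # 4090 default, as quoted above
SECONDS_PER_DAY = 86_400
COLD_START_COST_USD = 0.03       # approximate full cold start


def always_on_daily_cost(rate: float = RATE_PER_SECOND_USD) -> float:
    """Daily cost of one worker kept warm with workers_min=1."""
    return rate * SECONDS_PER_DAY


def scale_to_zero_daily_cost(busy_seconds: float,
                             cold_starts: int,
                             rate: float = RATE_PER_SECOND_USD,
                             cold_start_cost: float = COLD_START_COST_USD) -> float:
    """Daily cost with workers_min=0: billed parse time plus cold starts."""
    return busy_seconds * rate + cold_starts * cold_start_cost


if __name__ == "__main__":
    warm = always_on_daily_cost()
    bursty = scale_to_zero_daily_cost(busy_seconds=2 * 3600, cold_starts=20)
    print(f"always on:     ${warm:.2f}/day")
    print(f"scale to zero: ${bursty:.2f}/day")
```

With two busy hours and twenty cold starts a day, scale-to-zero comes to under $3 against roughly $26 for the always-on worker, so the warm worker only pays for itself when it is busy most of the day or when latency is the constraint.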

Disclosure: RunPod links in this post use a referral code that credits me at no cost to you. The post would read the same without it.
