Choosing a GPU

The repo’s default GPU pool is ADA_24 (24 GB Ada — RTX 4090). The MinerU2.5-Pro-2605-1.2B VLM (running on the MinerU 3.2.x runtime) was specifically sized to fit comfortably in 24 GB with KV cache headroom; among the 24 GB pools, the 4090 is the right default because it’s both faster per page and cheaper per page than the cheaper-hourly A5000/3090 (AMPERE_24). This page is for the cases when 24 GB isn’t enough or when a different trade-off makes sense.

TL;DR

You are…	Use GPU pool	Why
Parsing small-to-medium docs (≤100 pages) one at a time	`ADA_24` (default) or `AMPERE_24`	4090 is ~2–4× faster per warm page than A5000, and supply is better. A5000 has a lower $/hr if you can tolerate slower parses.
Parsing large books or huge documents (100–10,000+ pages)	`ADA_24` (default), batched	Document length doesn’t grow VRAM — a single sequential parse stays ~13 GiB at any page count. Slice the doc into page-range jobs client-side (`start_page`/`end_page`); each batch fits 24 GB. We ran 5,039 pages this way on a 24 GB A5000. See Large documents.
Doing concurrent batching (multiple docs) inside one worker	`AMPERE_48`	vLLM async engine can grow KV cache beyond 24 GB under batch pressure. This — not document size — is the real reason to size up.

If you’re unsure, start with the default ADA_24 and watch for OOM in the worker logs. Bumping to a 48 GB pool is a one-line change to deploy.py --gpu-ids or the gpuIds field in your endpoint config.

How much VRAM MinerU actually uses

These figures were originally measured on an RTX A5000 (24 GB) parsing a single-page test PDF with the default vlm-auto-engine backend. The 4090 numbers match within a few hundred MiB (same vLLM config, same KV-cache budget); the difference between the two cards is compute speed, not memory footprint.

Component	VRAM
Model weights	2.16 GiB
Peak activations	0.75 GiB
Non-torch memory	0.02 GiB
CUDA graph cache	0.23 GiB
vLLM KV cache (default `gpu_memory_utilization=0.5`)	9.50 GiB
Total active VRAM	~13 GiB / 24 GiB (≈54%)

The model itself is small (~2 GB). The dominant consumer is vLLM’s KV cache, which gets allocated upfront based on gpu_memory_utilization (default 0.5 → ~11.78 GiB target on a 24 GB card). The actual KV cache usage scales with how many tokens are in flight; on a single-page parse it sits around 9.5 GiB.
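The breakdown above can be checked with a few lines of arithmetic. This is a minimal sketch that only re-adds the figures from the table; the 24 GiB card size used for headroom is the nominal capacity (an assumption, since usable memory is slightly lower), and the function names are illustrative.

```python
# Per-component VRAM from the table above, in GiB (single-page parse, default config).
COMPONENTS_GIB = {
    "model_weights": 2.16,
    "peak_activations": 0.75,
    "non_torch": 0.02,
    "cuda_graph_cache": 0.23,
    "kv_cache": 9.50,
}


def total_active_gib(components: dict) -> float:
    """Sum the per-component figures, rounded to two decimals."""
    return round(sum(components.values()), 2)


def headroom_gib(components: dict, card_gib: float = 24.0) -> float:
    """Headroom against the card's nominal size (ASSUMPTION: nominal, not usable, capacity)."""
    return round(card_gib - total_active_gib(components), 2)


def kv_cache_share(components: dict) -> float:
    """Fraction of the active footprint taken by the KV cache."""
    return components["kv_cache"] / sum(components.values())


print("total active GiB:", total_active_gib(COMPONENTS_GIB))
print("headroom GiB:", headroom_gib(COMPONENTS_GIB))
print("KV cache share:", round(kv_cache_share(COMPONENTS_GIB), 2))
```

The components sum to 12.66 GiB, which the table rounds to ~13 GiB, and the KV cache accounts for roughly three quarters of it.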

Implications:

24 GB GPUs are the right minimum. A 16 GB card cannot fit the default config — vLLM would OOM during engine init.
Sequential parses don’t grow above ~13 GiB — at any page count. KV cache is bounded by gpu_memory_utilization; document length doesn’t push it higher. We confirmed this end-to-end: a 5,039-page PDF parsed on a 24 GB A5000 with the default vlm-auto-engine backend, sliced into page-range batches client-side (walkthrough). So don’t size up the GPU for a big document — batch it (see Large documents).
Concurrent batching grows KV cache pressure, not other components. To bound max VRAM under heavy concurrency, either move to a 48 GB pool or tune gpu_memory_utilization downward.

Rough peak VRAM by workload shape (estimates)

Workload	Peak VRAM	Fits in
Single-doc sequential (default config, any page count)	~13 GiB	24 GB ✓
Single-worker, light async batching (≤4 concurrent pages)	~14–18 GiB	24 GB ✓
Single-worker, heavy batching (8+ concurrent, large docs)	20+ GiB	48 GB recommended
Pipeline backend (`backend: "pipeline"`) — no vLLM KV cache	4 GB min (MinerU upstream)	24 GB easily ✓

Why the MinerU paper used 48 GB

The published benchmarks ran on dual RTX 4090 (effectively 48 GB combined) to maximize throughput for the leaderboard numbers — not because correctness requires it. If you’re running a single document at a time in a serverless worker, you do not need 48 GB.

Workload-to-GPU map

Concurrency comes from two independent levers — workers_max (more workers) and MINERU_MAX_CONCURRENCY (more jobs per worker); see Concurrency for the full model. Horizontal workers_max on 24 GB workers is the default throughput dial; the only shape that needs a 48 GB pool is in-worker concurrency.

Single worker, one doc at a time

This is the default RunPod serverless shape: each incoming job spins up a worker, the worker parses the document, and the worker scales to zero after idle_timeout. No concurrency inside the worker.

GPU pool: ADA_24 (default) or AMPERE_24 (cheaper hourly, slower)
VRAM peak: ~13 GiB
Backend: vlm-auto-engine (the default; resolves to vLLM async on CUDA)

Single worker, batching multiple docs

If you set MINERU_MAX_CONCURRENCY > 1 (RunPod calls this the concurrencyModifier) so one worker handles several requests at once, the vLLM async engine batches their pages through a shared KV cache, which grows roughly linearly with concurrency. Worth it only when single requests don’t saturate the GPU (small or I/O-heavy docs) — otherwise add a worker instead. See Concurrency → in-worker.

GPU pool: ADA_24 for light concurrency (≤4), AMPERE_48 for more
VRAM peak: ~14–18 GiB at light concurrency (≤4), 20+ GiB beyond that (estimates; the single-job baseline is already ~13 GiB)
Trade-off: higher utilization per worker; harder to bound max VRAM. Watch nvidia-smi in worker logs after deploy.

Multiple workers, one doc each (horizontal scaling)

This is what RunPod does when you set workers_max > 1 and several requests arrive at once. Each worker is an isolated container with its own model copy — VRAM is per-worker, not shared.

GPU pool: ADA_24 per worker
VRAM peak: ~13 GiB per worker
Scaling note: you pay per worker-second, so cost scales linearly. The default workers_max: 3 (set in deploy.py / the dashboard, not hub.json) is a sensible cap for spiky workloads.

Multiple workers, in-worker concurrency on each

The maximum-throughput shape: workers_max > 1 AND MINERU_MAX_CONCURRENCY > 1. Each worker runs several jobs through its shared engine while RunPod scales horizontally.

GPU pool: AMPERE_48
VRAM peak: 20+ GiB per worker under heavy batching (estimate)
When to use: high-volume ingest of small or I/O-heavy docs where you’ve already validated that in-worker concurrency beats simply adding 24 GB workers.
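The four shapes above reduce to one rule: the pool depends on in-worker concurrency, not on the number of workers. A minimal sketch of that rule follows; the function name is hypothetical, and the ≤4 threshold and VRAM figures are the estimates quoted on this page, not measured limits.

```python
def recommend_shape(workers_max: int, max_concurrency: int) -> dict:
    """Map (workers_max, MINERU_MAX_CONCURRENCY) to a suggested GPU pool.

    ASSUMPTION: the <=4 cut-off and the VRAM estimates come from this page's
    rough workload table; validate them against your own worker logs.
    """
    if workers_max < 1 or max_concurrency < 1:
        raise ValueError("workers_max and max_concurrency must be >= 1")

    if max_concurrency == 1:
        pool, peak = "ADA_24", "~13 GiB"
    elif max_concurrency <= 4:
        pool, peak = "ADA_24", "~14-18 GiB"
    else:
        pool, peak = "AMPERE_48", "20+ GiB"

    # workers_max changes cost and throughput, never the per-worker VRAM peak.
    return {"gpu_pool": pool, "est_peak_vram_per_worker": peak, "workers_max": workers_max}


print(recommend_shape(workers_max=3, max_concurrency=1))
print(recommend_shape(workers_max=3, max_concurrency=8))
```

Note that `workers_max` passes through unchanged: scaling horizontally never alters the recommendation.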

Document-size adjustments

Doc size	Pages	Recommended pool	Notes
Tiny	1–3	`ADA_24`	Default 24 GB pool; ~13 GiB peak. 16 GB pools won’t fit the default vLLM config.
Small	3–30	`ADA_24`	The “average document” case the defaults are tuned for.
Medium	30–100	`ADA_24`	Same default; KV cache stays well under 16 GB even on dense pages.
Large	100–1,000	`ADA_24`, batched	Page count doesn’t raise the ~13 GiB peak, so 24 GB fits. Batch into page-range jobs (below) for resumability and to stay under the 20 MB response cap — not because 24 GB runs out.
Huge	1,000–10,000	`ADA_24`, batched	Same: don’t size up. We parsed a 5,039-page PDF on a 24 GB A5000 with `vlm-auto-engine`, batched client-side.

Large documents: batch, don’t size up

A bigger document is not a reason for a bigger GPU. A single sequential parse peaks at ~13 GiB no matter how many pages you feed it, so a 1,000- or 10,000-page PDF fits the default 24 GB pool exactly like a 10-page one does. What breaks on big documents isn’t VRAM; it’s everything around the parse:

All-or-nothing failure — one job that dies at page 4,800 (OOM, transient eviction, timeout) costs the whole run.
The 20 MB response cap — a few hundred pages of output already blows past it.
No resumability — one job has no checkpoint; a failure means starting over.

The fix is client-side page-range batching, not a 48 GB card. The worker slices server-side: pass start_page/end_page (API reference) and each job parses one contiguous range out of the same uploaded file — you never pre-split the PDF. Keep MINERU_MAX_CONCURRENCY=1 and scale throughput with workers_max (more GPU workers, not a bigger GPU). See Concurrency for how workers_max and MINERU_MAX_CONCURRENCY interact.
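Client-side batching can be as small as a range generator plus one job payload per range. The sketch below is illustrative only: it assumes `start_page`/`end_page` are 0-indexed and inclusive (check the API reference for the actual convention), and the batch size of 200 and the example URL are arbitrary placeholders.

```python
from typing import Iterator, List, Tuple


def page_ranges(total_pages: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield contiguous (start_page, end_page) pairs covering the document.

    ASSUMPTION: pages are 0-indexed and end_page is inclusive.
    """
    if total_pages < 1 or batch_size < 1:
        raise ValueError("total_pages and batch_size must be >= 1")
    for start in range(0, total_pages, batch_size):
        yield start, min(start + batch_size, total_pages) - 1


def build_jobs(file_url: str, total_pages: int, batch_size: int = 200) -> List[dict]:
    """Build one job payload per page range, all pointing at the same file."""
    return [
        {"input": {"file_url": file_url, "start_page": start, "end_page": end}}
        for start, end in page_ranges(total_pages, batch_size)
    ]


jobs = build_jobs("https://example.com/big-book.pdf", total_pages=5039)
print(len(jobs), jobs[0]["input"], jobs[-1]["input"])
```

Each payload is an independent job, so a failed range can be resubmitted on its own without re-parsing the rest of the document.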

Picking a backend

MinerU 3.2.x exposes five backends; the template accepts all of them as a per-job string. You don’t need to change the endpoint or the GPU pool to switch — the same worker can handle different backends on consecutive jobs.

Backend	What it runs	When to pick it	Peak VRAM
`vlm-auto-engine` (default)	The `MinerU2.5-Pro-2605-1.2B` VLM via vLLM	English / Chinese (per model card tags), short-to-medium docs, fast wall time	~13 GiB
`pipeline`	PaddleOCR + dedicated layout/formula/table models	Non-Latin scripts (use a script-family `lang` code), constant-memory workloads, sub-24 GB GPUs	4 GB min (per MinerU upstream)
`hybrid-auto-engine`	Pipeline + VLM, auto-routed based on content	Mixed-content docs where the pipeline+VLM combination outperforms either alone	Highest
`vlm-http-client`	Same VLM via an external vLLM OpenAI-compatible server	Split-tier deploys — keep a 24 GB worker cheap, run vLLM on a separate always-on box	Low (no model on worker)
`hybrid-http-client`	Hybrid with external VLM server	Same split-tier idea, on hybrid workloads	Low

Pipeline: the constant-memory escape hatch

Pipeline trades raw speed for memory predictability: it streams page by page with no vLLM KV-cache accumulation, and needs only the 4 GB minimum from MinerU’s hardware compatibility table. Pipeline per-page time is ~3–5 s across GPUs.

The VLM backend is much more sensitive to both GPU and content: MinerU upstream claims ~0.5 s/page (2.12 fps) on an A100; we measured ~1 s/page (uniform reports) to ~6 s/page (dense forms) on an RTX 4090, and ~1–10 s/page on a 24 GB A5000 depending on content density. So VLM on an A100 is roughly 6–10× faster than pipeline (~0.5 s vs 3–5 s per page), VLM on a 4090 is meaningfully faster than pipeline, and VLM on an A5000 is roughly even with pipeline on average.

The pipeline backend uses PaddleOCR with explicit script-family models (109 languages per MinerU’s README) and is the documented-safe choice for non-Latin scripts. Empirically the Pro VLM also handles Cyrillic (Russian) correctly even though lang is ignored there, but its coverage of other non-Latin scripts is undocumented.

{
  "input": {
    "file_url": "https://example.com/russian-report.pdf",
    "backend": "pipeline",
    "lang": "east_slavic"
  }
}

Script-family lang codes (not ISO): east_slavic (Russian/Ukrainian/Belarusian), cyrillic (Serbian/Bulgarian/Mongolian), latin, arabic, devanagari, japan, korean, chinese_cht, el (Greek), th, etc. See MinerU’s language support docs for the full list of 109.
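Because the codes are script families rather than ISO language codes, a small lookup keeps callers from passing `ru` or `uk` by mistake. This is a sketch covering only the languages named above; the helper name is hypothetical and unknown languages raise instead of guessing a code.

```python
# Language -> script-family code, limited to the pairs listed on this page.
SCRIPT_FAMILY = {
    "russian": "east_slavic",
    "ukrainian": "east_slavic",
    "belarusian": "east_slavic",
    "serbian": "cyrillic",
    "bulgarian": "cyrillic",
    "mongolian": "cyrillic",
    "greek": "el",
}


def pipeline_job(file_url: str, language: str) -> dict:
    """Build a pipeline-backend job payload with the matching script-family `lang`."""
    key = language.strip().lower()
    if key not in SCRIPT_FAMILY:
        # Do not guess: the full list of codes lives in MinerU's language docs.
        raise ValueError(f"no script-family code known for {language!r}")
    return {
        "input": {
            "file_url": file_url,
            "backend": "pipeline",
            "lang": SCRIPT_FAMILY[key],
        }
    }


print(pipeline_job("https://example.com/russian-report.pdf", "Russian"))
```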

Hybrid: best quality, biggest GPU

hybrid-auto-engine routes each page through either pipeline or VLM depending on content. Best overall quality but highest VRAM footprint — needs 48 GB on dense mixed-content docs.

Empirical note (Pro-2604 + Cyrillic): we A/B-tested vlm-auto-engine against hybrid-auto-engine on Cyrillic fiscal PDFs (forms with embedded tables and codes) and got near-identical output: the same Cyrillic transcription and the same cell-collision pattern on dense form headers. The Pro-2604 VLM already handles Cyrillic correctly end-to-end; hybrid’s layout layer didn’t unlock additional accuracy for that content. The 2605 model card lists the same English/Chinese tags as 2604. We have not re-run the Cyrillic A/B on 2605, so treat the result as unverified there, although the shared lineage and unchanged tag set give us no specific reason to expect a regression.

Hybrid is most useful when layout disambiguation (multi-column reading order, complex form structure, mixed text + math + tables) is the bottleneck, not OCR per se. Don’t reach for hybrid just because your docs are non-English.

HTTP-client backends: split worker from model

vlm-http-client and hybrid-http-client don’t load the VLM into the worker. Instead they POST to an external vLLM OpenAI-compatible server (pass server_url in the job input). Useful when you want to:

Keep a fleet of small (24 GB or less) workers cheap and stateless
Run one always-on vLLM server amortising the model load across all of them
Run vLLM on hardware the serverless tier doesn’t offer (multi-GPU, NVLink, etc.)

{
  "input": {
    "file_url": "...",
    "backend": "vlm-http-client",
    "server_url": "https://your-vllm-host.example.com/v1"
  }
}

Official hardware compatibility (MinerU upstream)

Mirrored from MinerU’s official PyPI page so you don’t have to context-switch. These are MinerU’s project-wide minimums; our worker’s defaults are usually well above them.

	`pipeline`	`vlm-auto-engine`	`hybrid-auto-engine`	`*-http-client`
Backend features	Good compatibility	High HW requirements	High HW requirements	For OpenAI-compatible servers²
Accuracy¹	85+	95+	95+	(delegated to remote server)
OS support	Linux³ / Windows⁴ / macOS⁵	Linux³ / Windows⁴ / macOS⁵	Linux³ / Windows⁴ / macOS⁵	Linux³ / Windows⁴ / macOS⁵
Pure CPU support	✅	❌	❌	✅
GPU acceleration	Volta or newer GPUs / Apple Silicon	Volta or newer	Volta or newer	Not required
Min VRAM	4 GB	8 GB	8 GB	2 GB
RAM	16 GB min, 32 GB recommended	16 GB min, 32 GB recommended	16 GB min, 32 GB recommended	16 GB min
Disk	20 GB min, SSD recommended	20 GB min, SSD recommended	20 GB min, SSD recommended	2 GB min
Python	3.10 – 3.13	3.10 – 3.13	3.10 – 3.13	3.10 – 3.13

¹ End-to-End Evaluation Overall score on OmniDocBench v1.6.
² OpenAI-compatible servers: vLLM, SGLang, LMDeploy.
³ Linux distributions from 2019 or later.
⁴ Windows is limited to Python 3.10–3.12 because the ray dependency does not support Python 3.13 on Windows.
⁵ macOS requires 14.0 or later.

Our default RunPod config uses more VRAM than these floors because vLLM’s gpu_memory_utilization=0.5 allocates ~9.5 GiB of KV cache upfront on 24 GB cards. Tuning it down would let vlm-auto-engine run on smaller GPUs, at the cost of concurrent throughput. Our container image runs Linux + Python 3.11; OS rows above are MinerU’s claim, not what we ship.

How to configure GPU pool

Three places, in order of how you’d reach for them:

From `deploy.py`

# PowerShell syntax; in bash/zsh use $MINERU_TEMPLATE_ID instead of $env:MINERU_TEMPLATE_ID
python deploy.py --template-id $env:MINERU_TEMPLATE_ID --gpu-ids AMPERE_48

Accepts a comma-separated list; RunPod picks the first available pool from the list at scale-up time. The repo’s default already prefers the 4090, then falls back to the A5000/3090 pool, then to the A6000:

python deploy.py --template-id ... --gpu-ids "ADA_24,AMPERE_24,AMPERE_48"
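The "first available pool wins" behaviour described above can be modelled in a few lines. This is a sketch of the semantics as this page describes them, not RunPod's actual scheduler; the function name and the `available` set are hypothetical.

```python
from typing import Optional, Set


def pick_pool(gpu_ids: str, available: Set[str]) -> Optional[str]:
    """Return the first pool in the comma-separated preference list that is available.

    ASSUMPTION: models the ordering rule stated in this page only; the real
    scheduler may weigh other factors.
    """
    for pool in (part.strip() for part in gpu_ids.split(",")):
        if pool and pool in available:
            return pool
    return None  # nothing in the list can be scheduled right now


default_list = "ADA_24,AMPERE_24,AMPERE_48"
print(pick_pool(default_list, {"AMPERE_24", "AMPERE_48"}))
```

With the 4090 pool unavailable, the default list falls through to `AMPERE_24` before `AMPERE_48`, which is why listing cheaper pools first keeps cost down.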

From the RunPod dashboard

Endpoint detail page → GPU configuration → tick the pools you want. The semantics match the comma-separated list: RunPod uses the first available pool among those you’ve allow-listed.

From `.runpod/hub.json`

For published Hub templates, the gpuIds field declares the default pool list users will see when they deploy your template:

{
  "config": {
    "gpuIds": "ADA_24,AMPERE_24,AMPERE_48",
    ...
  }
}

This repo’s default exposes 24 GB and 48 GB tiers so users can opt up without re-publishing. You can change the pool list per-endpoint from the RunPod dashboard at any time — the hub.json value is just the suggested starting set for new deploys.

Supported GPU pools — full reference

The Hub default is intentionally conservative. The full list of RunPod pools that work (or are known not to) with this template:

Pool ID	What it contains	Compute	Status with MinerU 2.5 Pro
`ADA_24` (default)	RTX 4090, 24 GB	8.9	✅ Tested by us; ~1–6 s/page VLM warm depending on content density. Best speed-and-cost trade-off for sequential parsing.
`AMPERE_24` (default)	RTX 3090 / RTX A5000, 24 GB	8.6	✅ Tested by us; ~1–10 s/page VLM warm. Lower $/hr than ADA_24 but slower per page; usually worse $/page in practice.
`AMPERE_48` (default)	RTX A6000, 48 GB	8.6	✅ In default list; same generation as A5000, double the VRAM. Reach for it when 24 GB OOMs under your workload.
`AMPERE_80`	NVIDIA A100 80 GB	8.0	✅ The GPU MinerU benchmarks against (2.12 fps / ~0.5 s/page upstream). Compute 8.0 fully supported by vllm 0.11.2. Untested by us.
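`ADA_80_PRO`	Listed here as an RTX 6000 Ada variant; unverified, since the RTX 6000 Ada is a 48 GB card and the pool name implies 80 GB	8.9 (if Ada)	⚠️ Untested by us. If the pool is compute 8.9 it would use the same Ada kernels as `ADA_24`; confirm the pool’s contents in RunPod’s GPU types reference before relying on it.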
`ADA_48_PRO`	RTX 6000 Ada or RTX PRO 6000 Blackwell (RunPod groups them together)	8.9 or 12.0	⚠️ Avoid: if RunPod schedules you on a Blackwell SKU, the VLM backend will crash. The pipeline backend still runs on Blackwell. See the Blackwell note below.

Other pools (HOPPER_*, lower-VRAM Ampere SKUs, etc.) may exist in your RunPod account, but we haven’t verified the exact pool-ID strings. RunPod’s pool IDs have been historically inconsistent: ADA_48 vs ADA_48_PRO was a footgun we hit on this project. Before adding a non-default pool to your endpoint, check RunPod’s GPU types reference to confirm the exact ID string.

To opt into a non-default pool, edit your endpoint’s GPU configuration in the RunPod dashboard and add the pool ID to the comma-separated list. RunPod picks the first available pool from the list at scale-up time, so listing cheaper options first keeps cost lowest while making expensive options available if cheaper ones run out.

Why we don’t put A100/H100 in the Hub default: the hub.json gpuIds list is what new deployers see on the deploy form. Including AMPERE_80 by default would make it look like A100 is the expected GPU for this template — it’s not. Per-page cost on A100 is often higher than on ADA_24 because the hourly rate jump outpaces the speed-up except for sustained throughput. ADA_24 is the right default; opt into bigger only when you’ve measured a workload that needs it.

A note on `ADA_48_PRO` and Blackwell

The ADA_48_PRO pool was originally the RTX 6000 Ada Generation (compute capability 8.9). RunPod has been adding the newer RTX PRO 6000 Blackwell cards (compute capability 12.0) to the same pool, and the names are nearly identical: NVIDIA labels both as “RTX 6000” family cards. The bundled xformers / flash-attn builds in the current worker image have dedicated kernels for Ampere (8.0–8.6), Ada (8.9), and Hopper (9.0), but no Blackwell (12.0) code path yet. When run on a Blackwell card, xformers misroutes to the Hopper kernel and crashes during VLM model init:

CUDA error (...flash-attention/hopper/flash_fwd_launch_template.h:188): invalid argument

That’s why the template’s default gpuIds list excludes ADA_48_PRO. If you need a 48 GB Ada-or-newer card and are happy to use only the pipeline backend (which doesn’t touch xformers / flash-attn), you can add it back to your endpoint’s pool list manually — but the VLM backend will fail on the Blackwell variant until the worker image bumps xformers / vllm to versions that ship a Blackwell kernel.
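A worker could guard against this crash by checking the card's compute capability before selecting the VLM backend. The sketch below encodes only the kernel coverage stated above; the function name is hypothetical. With PyTorch installed, `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple of the shape this function expects.

```python
from typing import Tuple

# Kernel coverage as described on this page: Ampere 8.0-8.6, Ada 8.9, Hopper 9.0.
_EXACT_MATCHES = {(8, 9), (9, 0)}


def vlm_backend_supported(capability: Tuple[int, int]) -> bool:
    """True if the current image is described as having kernels for this capability.

    ASSUMPTION: reflects the coverage listed in this page for the current
    worker image; a newer image may add a Blackwell (12.0) code path.
    """
    if (8, 0) <= capability <= (8, 6):  # Ampere range
        return True
    return capability in _EXACT_MATCHES


def choose_backend(capability: Tuple[int, int]) -> str:
    """Fall back to the pipeline backend when the VLM kernels are missing."""
    return "vlm-auto-engine" if vlm_backend_supported(capability) else "pipeline"


print(choose_backend((8, 9)))   # RTX 4090 / RTX 6000 Ada
print(choose_backend((12, 0)))  # RTX PRO 6000 Blackwell
```

Falling back to `pipeline` matches the guidance above: that backend does not touch xformers / flash-attn, so it still runs on the Blackwell variant.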

Cost implications

GPU pricing on RunPod varies by pool, availability, and whether you’re on spot vs on-demand. Indicative serverless Flex rates we’ve seen (subject to change — confirm at RunPod’s pricing page):

Pool	$/hr	Per-page wall time (warm VLM, uniform to dense content)	Per-page cost
`ADA_24` (RTX 4090, default)	~$1.10	~1–6 s	lowest
`AMPERE_24` (A5000 / 3090)	~$0.69	~1–10 s	~1.5–2× higher than ADA_24 despite cheaper hourly — speed gap dominates
`AMPERE_48` (A6000)	~$0.79	similar to A5000	only worth it when 24 GB OOMs
`AMPERE_80` (A100 80 GB)	~$1.89	~0.5 s upstream	sustained throughput only; otherwise per-page cost is higher than ADA_24

Why ADA_24 wins on $/page despite the higher hourly: the 4090’s compute speed-up (~2–4× faster per page than A5000) outpaces its ~1.6× hourly premium. The math only flips when a workload sits on a single warm worker for very long stretches and the per-page advantage saturates.
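The $/page comparison is simple arithmetic: cost per page is the hourly rate divided by 3,600, times seconds per page. The sketch below uses the indicative rates from the table and the ~2–4× speed-up quoted above; both are rough figures, so the output is illustrative rather than a price quote.

```python
def cost_per_page(hourly_usd: float, seconds_per_page: float) -> float:
    """Dollars spent on one page at a given hourly rate and per-page wall time."""
    return hourly_usd / 3600.0 * seconds_per_page


def relative_cost(hourly_fast: float, hourly_slow: float, speedup: float) -> float:
    """Per-page cost of the slower card divided by that of the faster card.

    `speedup` is how many times faster the fast card is per page.
    """
    return (hourly_slow / hourly_fast) * speedup


ADA_24_HOURLY = 1.10     # indicative rate from the table above
AMPERE_24_HOURLY = 0.69  # indicative rate from the table above

for speedup in (2.0, 4.0):
    ratio = relative_cost(ADA_24_HOURLY, AMPERE_24_HOURLY, speedup)
    print(f"4090 {speedup:.0f}x faster -> A5000 costs {ratio:.2f}x per page")

# Speed-up at which both pools cost the same per page.
print("break-even speed-up:", round(ADA_24_HOURLY / AMPERE_24_HOURLY, 2))
```

At a 2× speed-up the A5000 comes out about 1.25× more expensive per page, and at 4× about 2.5×, which brackets the ~1.5–2× figure in the table. Below a speed-up of roughly 1.6× the cheaper hourly rate wins instead.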

When to revisit this

Watch your worker logs after deploy. The signals that mean you should bump up:

CUDA out of memory errors → bump to next VRAM tier
vLLM warning about KV cache eviction → larger pool or reduce concurrency
Per-page wall time spiking → check nvidia-smi; if at 100% memory but low utilisation, you’re memory-bound

Conversely, if your worker logs show consistent low VRAM usage (~6–8 GB peak) on a 48 GB pool, you’re paying for headroom you don’t need. Drop to 24 GB.
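These signals can be checked mechanically from `nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv,noheader,nounits`, which prints one comma-separated line per GPU (memory in MiB, utilisation in percent). The sketch below parses one such line; the thresholds and labels are assumptions chosen for illustration, not tuned values.

```python
def parse_gpu_sample(line: str) -> dict:
    """Parse one 'memory.used, memory.total, utilization.gpu' CSV line."""
    used_mib, total_mib, util_pct = (float(field.strip()) for field in line.split(","))
    return {"mem_fraction": used_mib / total_mib, "gpu_util_pct": util_pct}


def classify(sample: dict, mem_high: float = 0.95,
             util_low: float = 50.0, mem_low: float = 0.20) -> str:
    """Label a sample using the heuristics from this section.

    ASSUMPTION: mem_high, util_low and mem_low are illustrative thresholds.
    """
    if sample["mem_fraction"] >= mem_high and sample["gpu_util_pct"] < util_low:
        return "memory-bound"  # memory full, compute idle: size up or cut concurrency
    if sample["mem_fraction"] <= mem_low:
        return "oversized"     # paying for headroom: consider a smaller pool
    return "ok"


print(classify(parse_gpu_sample("24100, 24564, 12")))
print(classify(parse_gpu_sample("7200, 49140, 40")))
```

A single sample is only a snapshot; classify the peak over a representative run before changing pools.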