
Choosing a GPU

The repo’s default GPU pool is ADA_24 (24 GB Ada — RTX 4090). The MinerU2.5-Pro-2605-1.2B VLM (running on the MinerU 3.2.x runtime) was specifically sized to fit comfortably in 24 GB with KV cache headroom; among the 24 GB pools, the 4090 is the right default because it’s both faster per page and cheaper per page than the cheaper-hourly A5000/3090 (AMPERE_24). This page is for the cases when 24 GB isn’t enough or when a different trade-off makes sense.

| You are… | Use GPU pool | Why |
| --- | --- | --- |
| Parsing small-to-medium docs (≤100 pages) one at a time | ADA_24 (default) or AMPERE_24 | 4090 is ~4× faster per warm page than A5000, and supply is better. A5000 has lower $/hr if you can tolerate slower parses. |
| Doing concurrent batching inside a single worker | AMPERE_48 | vLLM async engine can grow KV cache beyond 24 GB under batch pressure. |
| Parsing large books (100–1,000 pages) | AMPERE_48 | Headroom for dense layout / long reading-order context. |
| Parsing huge documents (1,000+ pages) | AMPERE_48 + --backend pipeline | Switch off vLLM KV-cache mode; pipeline backend processes page-by-page with a 4 GB minimum (per MinerU upstream). |

If you’re unsure, start with the default ADA_24 and watch for OOM in the worker logs. Bumping to a 48 GB pool is a one-line change to deploy.py --gpu-ids or the gpuIds field in your endpoint config.

The VRAM breakdown below was originally measured on an RTX A5000 (24 GB) parsing a single-page test PDF with the default vlm-auto-engine backend. The 4090 numbers match within a few hundred MiB (same vLLM config, same KV-cache budget); the difference between the two cards is compute speed, not memory footprint.

| Component | VRAM |
| --- | --- |
| Model weights | 2.16 GiB |
| Peak activations | 0.75 GiB |
| Non-torch memory | 0.02 GiB |
| CUDA graph cache | 0.23 GiB |
| vLLM KV cache (default gpu_memory_utilization=0.5) | 9.50 GiB |
| Total active VRAM | ~13 GiB / 24 GiB (≈54%) |

The model itself is small (~2 GB). The dominant consumer is vLLM’s KV cache, which gets allocated upfront based on gpu_memory_utilization (default 0.5 → ~11.78 GiB target on a 24 GB card). The actual KV cache usage scales with how many tokens are in flight; on a single-page parse it sits around 9.5 GiB.

Implications:

  • 24 GB GPUs are the right minimum. A 16 GB card cannot fit the default config — vLLM would OOM during engine init.
  • Sequential parses don’t grow above ~13 GiB. KV cache is bounded by gpu_memory_utilization; long documents don’t push it higher.
  • Concurrent batching grows KV cache pressure, not other components. To bound max VRAM under heavy concurrency, either move to a 48 GB pool or tune gpu_memory_utilization downward.
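The component breakdown above can be re-added as a quick sanity check. This is only arithmetic on the figures quoted on this page; the "usable card memory" value is backed out from the quoted ~11.78 GiB target and the 0.5 default, and is an inference rather than a separately measured number.

```python
# Measured components from the table above, in GiB.
COMPONENTS_GIB = {
    "model_weights": 2.16,
    "peak_activations": 0.75,
    "non_torch": 0.02,
    "cuda_graph_cache": 0.23,
    "kv_cache": 9.50,
}

GPU_MEMORY_UTILIZATION = 0.5  # default quoted above
TARGET_GIB = 11.78            # memory target quoted above for a 24 GB card


def total_active_gib(components: dict) -> float:
    """Sum the per-component VRAM figures."""
    return sum(components.values())


def usable_card_gib(target_gib: float, utilization: float) -> float:
    """Back out the usable card memory implied by the quoted target."""
    return target_gib / utilization


total = total_active_gib(COMPONENTS_GIB)
card = usable_card_gib(TARGET_GIB, GPU_MEMORY_UTILIZATION)
share = total / card

print(f"total active VRAM:  {total:.2f} GiB")
print(f"usable card memory: {card:.2f} GiB")
print(f"share of the card:  {share:.0%}")
```

The sum lands at about 12.66 GiB, which is where the rounded "~13 GiB" and "≈54%" figures come from.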

Rough peak VRAM by workload shape (estimates)

| Workload | Peak VRAM | Fits in |
| --- | --- | --- |
| Single-doc sequential (default config) | ~13 GiB | 24 GB ✓ |
| Single-worker, light async batching (≤4 concurrent pages) | ~14–18 GiB | 24 GB ✓ |
| Single-worker, heavy batching (8+ concurrent, large docs) | 20+ GiB | 48 GB recommended |
| Pipeline backend (backend: "pipeline"), no vLLM KV cache | 4 GB min (MinerU upstream) | 24 GB easily ✓ |

The published benchmarks ran on dual RTX 4090 (effectively 48 GB combined) to maximize throughput for the leaderboard numbers — not because correctness requires it. If you’re running a single document at a time in a serverless worker, you do not need 48 GB.

A single worker parsing one document at a time is the default RunPod serverless shape: each incoming job spins up a worker, the worker parses the document, and the worker scales to zero after idle_timeout. There is no concurrency inside the worker.

  • GPU pool: ADA_24 (default) or AMPERE_24 (cheaper hourly, slower)
  • VRAM peak: ~13 GiB
  • Backend: vlm-auto-engine (the default; resolves to vLLM async on CUDA)

If you set concurrencyModifier > 1 on your endpoint so one worker handles several requests in parallel, the vLLM async engine batches their pages. KV cache grows linearly with batch size.

  • GPU pool: ADA_24 for small batches (≤4 concurrent), AMPERE_48 for larger
  • VRAM peak: ~14–18 GiB for light batching (≤4 concurrent pages), 20+ GiB for heavy batching (per the workload estimates above)
  • Trade-off: higher throughput per worker; harder to bound max VRAM. Watch nvidia-smi in worker logs after deploy.

Multiple workers, one doc each (horizontal scaling)


This is what RunPod does when you set workers_max > 1 and several requests arrive at once. Each worker is an isolated container with its own model copy — VRAM is per-worker, not shared.

  • GPU pool: ADA_24 per worker
  • VRAM peak: ~13 GiB per worker
  • Scaling note: you pay per worker-second, so cost scales linearly. The default workers_max: 3 in hub.json is a sensible cap for spiky workloads.

The high-throughput shape: workers_max > 1 AND concurrencyModifier > 1. Each worker batches several pages internally while RunPod scales horizontally.

  • GPU pool: AMPERE_48
  • VRAM peak: 16+ GB per worker
  • When to use: high-volume production ingest where you’ve already validated the workload mix on 24 GB.
| Doc size | Pages | Recommended pool | Notes |
| --- | --- | --- | --- |
| Tiny | 1–3 | ADA_24 | Default 24 GB pool; ~13 GiB peak. 16 GB pools won’t fit the default vLLM config. |
| Small | 3–30 | ADA_24 | The “average document” case the defaults are tuned for. |
| Medium | 30–100 | ADA_24 | Same default; KV cache stays well under 16 GB even on dense pages. |
| Large | 100–1,000 | AMPERE_48 | More headroom for dense layouts. Sequential mode still peaks at ~13 GiB, so 48 GB is safety margin. |
| Huge | 1,000–10,000 | AMPERE_48 + --backend pipeline | See next section. |
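The sizing table above can be turned into a small lookup. The sketch below is illustrative only: `recommend` and `Recommendation` are hypothetical names, not part of the template, and assigning a boundary page count (exactly 100 or exactly 1,000) to the smaller tier is an assumption, since the table's ranges overlap at their edges.

```python
from typing import NamedTuple


class Recommendation(NamedTuple):
    pool: str
    backend: str


def recommend(pages: int) -> Recommendation:
    """Map a page count to the pool and backend suggested in the table.

    Boundary values (100, 1000) fall into the smaller tier; that choice
    is an assumption, not something the table specifies.
    """
    if pages < 1:
        raise ValueError("pages must be at least 1")
    if pages <= 100:
        # Tiny, small, and medium documents all use the default pool.
        return Recommendation("ADA_24", "vlm-auto-engine")
    if pages <= 1000:
        # Large documents: same backend, more VRAM headroom.
        return Recommendation("AMPERE_48", "vlm-auto-engine")
    # Huge documents: switch to the page-by-page pipeline backend.
    return Recommendation("AMPERE_48", "pipeline")


for n in (2, 80, 400, 5000):
    print(n, recommend(n))
```

A helper like this is only a starting point; worker logs remain the real signal for whether a tier is too small.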

MinerU 3.2.x exposes five backends; the template accepts all of them as a per-job string. You don’t need to change the endpoint or the GPU pool to switch — the same worker can handle different backends on consecutive jobs.

| Backend | What it runs | When to pick it | Per-page VRAM |
| --- | --- | --- | --- |
| vlm-auto-engine (default) | The MinerU2.5-Pro-2605-1.2B VLM via vLLM | English / Chinese (per model card tags), short-to-medium docs, fast wall time | ~13 GiB |
| pipeline | PaddleOCR + dedicated layout/formula/table models | Non-Latin scripts (use a script-family lang code), very long docs, constant-memory workloads | 4 GB min (per MinerU upstream) |
| hybrid-auto-engine | Pipeline + VLM, auto-routed based on content | Mixed-content docs where the pipeline+VLM combination outperforms either alone | Highest |
| vlm-http-client | Same VLM via an external vLLM OpenAI-compatible server | Split-tier deploys: keep a 24 GB worker cheap, run vLLM on a separate always-on box | Low (no model on worker) |
| hybrid-http-client | Hybrid with external VLM server | Same split-tier idea, on hybrid workloads | Low |

Pipeline: the constant-memory escape hatch


Pipeline trades raw speed for memory predictability: page-by-page streaming with no vLLM KV-cache accumulation, 4 GB minimum per MinerU’s hardware compatibility table. Per-page time is ~3–5 s across GPUs for pipeline.

The VLM backend is much more GPU-sensitive and content-sensitive: MinerU upstream claims ~0.5 s/page (2.12 fps) on an A100; we measured ~1 s/page (uniform reports) to ~6 s/page (dense forms) on RTX 4090, and ~1–10 s/page on A5000 24 GB depending on content density. So VLM-on-A100 is ≈ 7× faster than pipeline; VLM-on-4090 is meaningfully faster than pipeline; VLM-on-A5000 is roughly even with pipeline on average.

The pipeline backend uses PaddleOCR with explicit script-family models (109 languages per MinerU’s README) and is the documented-safe choice for non-Latin scripts; empirically the Pro VLM also handles Cyrillic (Russian) correctly even though lang is ignored, but coverage of other non-Latin scripts on the VLM is undocumented.

{
  "input": {
    "file_url": "https://example.com/russian-report.pdf",
    "backend": "pipeline",
    "lang": "east_slavic"
  }
}

Script-family lang codes (not ISO): east_slavic (Russian/Ukrainian/Belarusian), cyrillic (Serbian/Bulgarian/Mongolian), latin, arabic, devanagari, japan, korean, chinese_cht, el (Greek), th, etc. See MinerU’s language support docs for the full list of 109.
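A client-side sketch of building the pipeline job shown above and guarding against ISO-style codes such as `ru`. The helper names are hypothetical, the code set contains only the codes named on this page (the upstream list is longer), and the `https://api.runpod.ai/v2/<endpoint_id>/runsync` URL with a Bearer token is an assumption about RunPod's serverless API that should be confirmed against RunPod's own docs. The request is built but not sent.

```python
import json
import urllib.request

# Only the script-family codes named on this page; MinerU documents more.
KNOWN_LANG_CODES = {
    "east_slavic", "cyrillic", "latin", "arabic", "devanagari",
    "japan", "korean", "chinese_cht", "el", "th",
}


def build_pipeline_job(file_url: str, lang: str) -> dict:
    """Build a job payload for the pipeline backend."""
    if lang not in KNOWN_LANG_CODES:
        raise ValueError(
            f"{lang!r} is not a known script-family code; "
            "ISO codes such as 'ru' are not accepted"
        )
    return {"input": {"file_url": file_url, "backend": "pipeline", "lang": lang}}


def build_request(endpoint_id: str, api_key: str, job: dict) -> urllib.request.Request:
    """Wrap the payload in an HTTP POST (URL shape and auth are assumptions)."""
    url = f"https://api.runpod.ai/v2/{endpoint_id}/runsync"
    return urllib.request.Request(
        url,
        data=json.dumps(job).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


job = build_pipeline_job("https://example.com/russian-report.pdf", "east_slavic")
request = build_request("YOUR_ENDPOINT_ID", "YOUR_API_KEY", job)
print(request.get_method(), request.full_url)
print(json.dumps(job, indent=2))
```

Passing the request to `urllib.request.urlopen` would submit the job; that step is left out so the sketch runs without credentials.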

hybrid-auto-engine routes each page through either pipeline or VLM depending on content. Best overall quality but highest VRAM footprint — needs 48 GB on dense mixed-content docs.

Empirical note (Pro-2604 + Cyrillic): we A/B-tested vlm-auto-engine against hybrid-auto-engine on Cyrillic fiscal PDFs (forms with embedded tables and codes) and got near-identical output — same Cyrillic transcription, same cell-collision pattern on dense form headers. The Pro-2604 VLM already handles Cyrillic correctly end-to-end; hybrid’s layout layer didn’t unlock additional accuracy for that content. The 2605 model card lists the same English/Chinese tags as 2604; we have not re-run the Cyrillic A/B on 2605, but its lineage and the unchanged tag set leave no reason to expect a regression. Hybrid is most useful when layout disambiguation (multi-column reading order, complex form structure, mixed text + math + tables) is the bottleneck — not OCR per se. Don’t reach for hybrid just because your docs are non-English.

HTTP-client backends: split worker from model


vlm-http-client and hybrid-http-client don’t load the VLM into the worker. Instead they POST to an external vLLM OpenAI-compatible server (pass server_url in the job input). Useful when you want to:

  • Keep a fleet of small (24 GB or less) workers cheap and stateless
  • Run one always-on vLLM server amortising the model load across all of them
  • Run vLLM on hardware the serverless tier doesn’t offer (multi-GPU, NVLink, etc.)
{
  "input": {
    "file_url": "...",
    "backend": "vlm-http-client",
    "server_url": "https://your-vllm-host.example.com/v1"
  }
}
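Because the two `*-http-client` backends depend on `server_url`, a pre-flight check on the job input can catch a missing URL before a worker is billed. This is a hypothetical client-side validator, not the worker's own validation logic, which is not described here; the backend names and the default are taken from this page.

```python
BACKENDS = {
    "vlm-auto-engine",
    "pipeline",
    "hybrid-auto-engine",
    "vlm-http-client",
    "hybrid-http-client",
}
DEFAULT_BACKEND = "vlm-auto-engine"


def validate_job_input(job_input: dict) -> dict:
    """Return a copy of the input with the backend filled in, or raise."""
    backend = job_input.get("backend", DEFAULT_BACKEND)
    if backend not in BACKENDS:
        raise ValueError(f"unknown backend: {backend!r}")
    # HTTP-client backends load no model on the worker, so they need
    # to be told where the external OpenAI-compatible server lives.
    if backend.endswith("-http-client") and not job_input.get("server_url"):
        raise ValueError(f"{backend} requires server_url")
    return {**job_input, "backend": backend}


checked = validate_job_input({
    "file_url": "https://example.com/report.pdf",
    "backend": "vlm-http-client",
    "server_url": "https://your-vllm-host.example.com/v1",
})
print(checked["backend"], "->", checked["server_url"])
```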

Official hardware compatibility (MinerU upstream)


Mirrored from MinerU’s official PyPI page so you don’t have to context-switch. These are MinerU’s project-wide minimums; our worker’s defaults are usually well above them.

| | pipeline | vlm-auto-engine | hybrid-auto-engine | *-http-client |
| --- | --- | --- | --- | --- |
| Backend features | Good compatibility | High HW requirements | High HW requirements | For OpenAI-compatible servers² |
| Accuracy¹ | 85+ | 95+ | 95+ | (delegated to remote server) |
| OS support | Linux³ / Windows⁴ / macOS⁵ | Linux³ / Windows⁴ / macOS⁵ | Linux³ / Windows⁴ / macOS⁵ | Linux³ / Windows⁴ / macOS⁵ |
| Pure CPU support | | | | |
| GPU acceleration | Volta or newer GPUs / Apple Silicon | Volta or newer | Volta or newer | Not required |
| Min VRAM | 4 GB | 8 GB | 8 GB | 2 GB |
| RAM | 16 GB min, 32 GB recommended | 16 GB min, 32 GB recommended | 16 GB min, 32 GB recommended | 16 GB min |
| Disk | 20 GB min, SSD recommended | 20 GB min, SSD recommended | 20 GB min, SSD recommended | 2 GB min |
| Python | 3.10 – 3.13 | 3.10 – 3.13 | 3.10 – 3.13 | 3.10 – 3.13 |

¹ End-to-End Evaluation Overall score on OmniDocBench v1.6.
² OpenAI-compatible servers: vLLM, SGLang, LMDeploy.
³ Linux distributions from 2019 or later.
⁴ Windows is limited to Python 3.10–3.12 because the ray dependency does not support Python 3.13 on Windows.
⁵ macOS requires 14.0 or later.

Our default RunPod config uses more VRAM than these floors because vLLM’s gpu_memory_utilization=0.5 allocates ~9.5 GiB of KV cache upfront on 24 GB cards. Tuning it down would let vlm-auto-engine run on smaller GPUs, at the cost of concurrent throughput. Our container image runs Linux + Python 3.11; OS rows above are MinerU’s claim, not what we ship.

You can change the GPU pool in three places, in order of how you’d reach for them: the deploy.py --gpu-ids flag, the RunPod dashboard, and hub.json.

python deploy.py --template-id $env:MINERU_TEMPLATE_ID --gpu-ids AMPERE_48

Accepts a comma-separated list — RunPod picks the first available pool from the list at scale-up time. The repo’s default already prefers 4090 then falls back to A5000 then to A6000:

python deploy.py --template-id ... --gpu-ids "ADA_24,AMPERE_24,AMPERE_48"

Endpoint detail page → GPU configuration → tick the pools you want. Same comma-separated semantics: RunPod tries the cheapest available pool that’s allow-listed.

For published Hub templates, the gpuIds field declares the default pool list users will see when they deploy your template:

{
  "config": {
    "gpuIds": "ADA_24,AMPERE_24,AMPERE_48",
    ...
  }
}

This repo’s default exposes 24 GB and 48 GB tiers so users can opt up without re-publishing. You can change the pool list per-endpoint from the RunPod dashboard at any time — the hub.json value is just the suggested starting set for new deploys.

The Hub default is intentionally conservative. The full list of RunPod pools that work (or are known not to) with this template:

| Pool ID | What it contains | Compute capability | Status with MinerU 2.5 Pro |
| --- | --- | --- | --- |
| ADA_24 (default) | RTX 4090, 24 GB | 8.9 | ✅ Tested by us; ~1–6 s/page VLM warm depending on content density. Best speed-and-cost trade-off for sequential parsing. |
| AMPERE_24 (default) | RTX 3090 / RTX A5000, 24 GB | 8.6 | ✅ Tested by us; ~1–10 s/page VLM warm. Lower $/hr than ADA_24 but slower per page; usually worse $/page in practice. |
| AMPERE_48 (default) | RTX A6000, 48 GB | 8.6 | ✅ In default list; same generation as A5000, double the VRAM. Reach for it when 24 GB OOMs under your workload. |
| AMPERE_80 | NVIDIA A100 80 GB | 8.0 | ✅ The GPU MinerU benchmarks against (2.12 fps / ~0.5 s/page upstream). Compute 8.0 fully supported by vllm 0.11.2. Untested by us. |
| ADA_80_PRO | RTX 6000 Ada variant | 8.9 | ✅ Should work; same kernels as AMPERE_48. Untested by us. |
| ADA_48_PRO | RTX 6000 Ada or RTX PRO 6000 Blackwell (RunPod groups them together) | 8.9 or 12.0 | ⚠️ Avoid: if RunPod schedules you on a Blackwell SKU, VLM backend will crash. Pipeline backend still runs on Blackwell. See Blackwell note below. |

Other pools (HOPPER_*, lower-VRAM Ampere SKUs, etc.) may exist in your RunPod account, but we haven’t verified the exact pool-ID strings. RunPod’s pool IDs have been historically inconsistent: ADA_48 vs ADA_48_PRO was a footgun we hit on this project. Before adding a non-default pool to your endpoint, check RunPod’s GPU types reference to confirm the exact ID string.

To opt into a non-default pool, edit your endpoint’s GPU configuration in the RunPod dashboard and add the pool ID to the comma-separated list. RunPod picks the first available pool from the list at scale-up time, so listing cheaper options first keeps cost lowest while making expensive options available if cheaper ones run out.
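The first-available behaviour described above can be modelled in a few lines. This is a toy model of the selection rule as this page describes it, not RunPod's actual scheduler; both function names and the `available` set are hypothetical.

```python
from typing import Optional


def parse_gpu_ids(gpu_ids: str) -> list:
    """Split a comma-separated pool string, ignoring stray whitespace."""
    return [pool.strip() for pool in gpu_ids.split(",") if pool.strip()]


def first_available(preferred: list, available: set) -> Optional[str]:
    """Return the first preferred pool that currently has capacity."""
    for pool in preferred:
        if pool in available:
            return pool
    return None


preferred = parse_gpu_ids("ADA_24,AMPERE_24,AMPERE_48")

# When everything has capacity, the first entry wins.
print(first_available(preferred, {"ADA_24", "AMPERE_24", "AMPERE_48"}))
# When both 24 GB pools are exhausted, the 48 GB fallback is used.
print(first_available(preferred, {"AMPERE_48"}))
```

Under this model, list order is the only lever: putting a 48 GB pool first would make it the default rather than the fallback.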

Why we don’t put A100/H100 in the Hub default: the hub.json gpuIds list is what new deployers see on the deploy form. Including AMPERE_80 by default would make it look like A100 is the expected GPU for this template — it’s not. Per-page cost on A100 is often higher than on ADA_24 because the hourly rate jump outpaces the speed-up except for sustained throughput. ADA_24 is the right default; opt into bigger only when you’ve measured a workload that needs it.

The ADA_48_PRO pool was originally the RTX 6000 Ada Generation (compute capability 8.9). RunPod has been adding the newer RTX PRO 6000 Blackwell cards (compute capability 12.0) to the same pool, and the names are nearly identical — NVIDIA labels both as “RTX 6000” family cards. The bundled xformers / flash-attn in the current worker image has dedicated kernels for Ampere (8.0-8.6), Ada (8.9), and Hopper (9.0), but no Blackwell (12.0) code path yet. When run on a Blackwell card, xformers misroutes to the Hopper kernel and crashes during VLM model init:

CUDA error (...flash-attention/hopper/flash_fwd_launch_template.h:188): invalid argument

That’s why the template’s default gpuIds list excludes ADA_48_PRO. If you need a 48 GB Ada-or-newer card and are happy to use only the pipeline backend (which doesn’t touch xformers / flash-attn), you can add it back to your endpoint’s pool list manually — but the VLM backend will fail on the Blackwell variant until the worker image bumps xformers / vllm to versions that ship a Blackwell kernel.
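The kernel coverage described above can be written as a guard that falls back to the pipeline backend on an unsupported card. This is a hypothetical helper, not something the worker image ships; the supported ranges are the ones listed in this section. In a real worker the tuple could come from `torch.cuda.get_device_capability()`, which returns `(major, minor)`; it is passed in explicitly here so the sketch runs without a GPU.

```python
def has_vlm_kernel(capability: tuple) -> bool:
    """True if the bundled attention kernels cover this compute capability.

    Ranges per this page: Ampere 8.0-8.6, Ada 8.9, Hopper 9.0.
    """
    is_ampere = (8, 0) <= capability <= (8, 6)
    is_ada_or_hopper = capability in {(8, 9), (9, 0)}
    return is_ampere or is_ada_or_hopper


def pick_backend(capability: tuple) -> str:
    """Use the VLM backend where supported, otherwise fall back to pipeline."""
    return "vlm-auto-engine" if has_vlm_kernel(capability) else "pipeline"


for name, cc in [
    ("RTX A5000", (8, 6)),
    ("RTX 4090", (8, 9)),
    ("RTX PRO 6000 Blackwell", (12, 0)),
]:
    print(f"{name}: compute {cc[0]}.{cc[1]} -> {pick_backend(cc)}")
```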

GPU pricing on RunPod varies by pool, availability, and whether you’re on spot vs on-demand. Indicative serverless Flex rates we’ve seen (subject to change — confirm at RunPod’s pricing page):

| Pool | $/hr | Per-page wall time (warm VLM, dense content) | Per-page cost |
| --- | --- | --- | --- |
| ADA_24 (RTX 4090, default) | ~$1.10 | ~1–6 s | lowest |
| AMPERE_24 (A5000 / 3090) | ~$0.69 | ~1–10 s | ~1.5–2× higher than ADA_24 despite cheaper hourly (speed gap dominates) |
| AMPERE_48 (A6000) | ~$0.79 | similar to A5000 | only worth it when 24 GB OOMs |
| AMPERE_80 (A100 80 GB) | ~$1.89 | ~0.5 s upstream | sustained throughput only; otherwise per-page cost is higher than ADA_24 |

Why ADA_24 wins on $/page despite the higher hourly: the 4090’s compute speed-up (~2–4× faster per page than A5000) outpaces its ~1.6× hourly premium. The math only flips when a workload sits on a single warm worker for very long stretches and the per-page advantage saturates.
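The break-even behind that claim is simple arithmetic: per-page cost is the hourly rate divided by 3,600 and multiplied by seconds per page. The sketch below uses the indicative rates from the table; the 3 s and 6 s page times are illustrative inputs chosen to represent a 2× speed-up, not measurements.

```python
ADA_24_HOURLY = 1.10     # indicative $/hr from the table above
AMPERE_24_HOURLY = 0.69  # indicative $/hr from the table above


def cost_per_page(hourly_usd: float, seconds_per_page: float) -> float:
    """Dollars spent on one page at a given hourly rate and page time."""
    return hourly_usd / 3600 * seconds_per_page


def break_even_speedup(fast_hourly: float, slow_hourly: float) -> float:
    """Speed-up the pricier card needs before it is cheaper per page."""
    return fast_hourly / slow_hourly


threshold = break_even_speedup(ADA_24_HOURLY, AMPERE_24_HOURLY)
print(f"ADA_24 must be more than {threshold:.2f}x faster per page to win on cost")

# Illustrative 2x speed-up: 3 s/page on ADA_24 vs 6 s/page on AMPERE_24.
ada = cost_per_page(ADA_24_HOURLY, 3.0)
ampere = cost_per_page(AMPERE_24_HOURLY, 6.0)
print(f"ADA_24:    ${ada:.5f} per page")
print(f"AMPERE_24: ${ampere:.5f} per page")
```

Any measured speed-up above the threshold keeps ADA_24 cheaper per page; below it, the cheaper hourly rate wins.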

Watch your worker logs after deploy. The signals that mean you should bump up:

  • CUDA out of memory errors → bump to next VRAM tier
  • vLLM warning about KV cache eviction → larger pool or reduce concurrency
  • Per-page wall time spiking → check nvidia-smi; if at 100% memory but low utilisation, you’re memory-bound

Conversely, if your worker logs show consistent low VRAM usage (~6–8 GB peak) on a 48 GB pool, you’re paying for headroom you don’t need. Drop to 24 GB.
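The two nvidia-smi signals above can be checked mechanically. The sketch assumes the worker logs one line from `nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv,noheader,nounits` (values in MiB and percent); the function names and every threshold below are illustrative assumptions, not values taken from the template.

```python
def parse_smi_line(line: str) -> tuple:
    """Parse 'used, total, util' from nvidia-smi CSV output (MiB, MiB, %)."""
    used, total, util = (int(field.strip()) for field in line.split(","))
    return used, total, util


def diagnose(used_mib: int, total_mib: int, util_pct: int) -> str:
    """Classify a sample; all thresholds here are assumptions to tune."""
    if used_mib / total_mib >= 0.98 and util_pct < 50:
        # Memory is full while the GPU sits mostly idle.
        return "memory-bound"
    if total_mib >= 40 * 1024 and used_mib < 8 * 1024:
        # A 48 GB-class card with under 8 GiB in use.
        return "oversized"
    return "ok"


for sample in ("24500, 24564, 12", "7000, 49140, 40", "12960, 24564, 85"):
    print(sample, "->", diagnose(*parse_smi_line(sample)))
```

A single sample can mislead, so a verdict is more useful when it holds across several jobs of the workload in question.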