Choosing a GPU
The repo’s default GPU pool is ADA_24 (24 GB Ada — RTX 4090). The MinerU2.5-Pro-2605-1.2B VLM (running on the MinerU 3.2.x runtime) was specifically sized to fit comfortably in 24 GB with KV cache headroom; among the 24 GB pools, the 4090 is the right default because it’s both faster per page and cheaper per page than the cheaper-hourly A5000/3090 (AMPERE_24). This page is for the cases when 24 GB isn’t enough or when a different trade-off makes sense.
| You are… | Use GPU pool | Why |
|---|---|---|
| Parsing small-to-medium docs (≤100 pages) one at a time | ADA_24 (default) or AMPERE_24 | 4090 is ~2–4× faster per warm page than A5000, and supply is better. A5000 has lower $/hr if you can tolerate slower parses. |
| Doing concurrent batching inside a single worker | AMPERE_48 | vLLM async engine can grow KV cache beyond 24 GB under batch pressure. |
| Parsing large books (100–1,000 pages) | AMPERE_48 | Headroom for dense layout / long reading-order context. |
| Parsing huge documents (1,000+ pages) | AMPERE_48 + --backend pipeline | Switch off vLLM KV-cache mode; pipeline backend processes page-by-page with a 4 GB minimum (per MinerU upstream). |
If you’re unsure, start with the default ADA_24 and watch for OOM in the worker logs. Bumping to a 48 GB pool is a one-line change to deploy.py --gpu-ids or the gpuIds field in your endpoint config.
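The table above can be read as a small decision function. The following is a minimal sketch, not part of the repo: `recommend_pool` is a hypothetical helper, and the boundary choices (exactly 100 pages stays on ADA_24, exactly 1,000 pages switches to pipeline) are assumptions, since the table's ranges overlap at those points.

```python
def recommend_pool(pages: int, concurrent_requests: int = 1) -> tuple[str, str]:
    """Return (gpu_pool, backend) following the pool-selection table.

    Hypothetical helper: thresholds mirror the table, boundaries are assumed.
    """
    if pages >= 1000:
        # Huge documents: constant-memory pipeline backend on a 48 GB pool.
        return "AMPERE_48", "pipeline"
    if pages > 100 or concurrent_requests > 1:
        # Large books, or batching inside one worker: extra KV-cache headroom.
        return "AMPERE_48", "vlm-auto-engine"
    # Small-to-medium docs parsed one at a time: the repo default.
    return "ADA_24", "vlm-auto-engine"


if __name__ == "__main__":
    for pages in (12, 350, 4000):
        print(pages, recommend_pool(pages))
```

Treat the result as a starting point only; the OOM check in the worker logs remains the real signal.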
How much VRAM MinerU actually uses
Originally measured on an RTX A5000 (24 GB) parsing a single-page test PDF with the default vlm-auto-engine backend. The 4090 numbers match within a few hundred MiB (same vLLM config, same KV-cache budget); the difference between the two cards is compute speed, not memory footprint.
| Component | VRAM |
|---|---|
| Model weights | 2.16 GiB |
| Peak activations | 0.75 GiB |
| Non-torch memory | 0.02 GiB |
| CUDA graph cache | 0.23 GiB |
| vLLM KV cache (default gpu_memory_utilization=0.5) | 9.50 GiB |
| Total active VRAM | ~13 GiB / 24 GiB (≈54%) |
The model itself is small (~2 GB). The dominant consumer is vLLM’s KV cache, which gets allocated upfront based on gpu_memory_utilization (default 0.5 → ~11.78 GiB target on a 24 GB card). The actual KV cache usage scales with how many tokens are in flight; on a single-page parse it sits around 9.5 GiB.
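As a quick check of the breakdown, the arithmetic can be reproduced directly. This sketch only re-adds the measured components from the table; the variable names are illustrative and nothing here queries a GPU.

```python
# Measured components from the VRAM table, in GiB.
components_gib = {
    "model weights": 2.16,
    "peak activations": 0.75,
    "non-torch memory": 0.02,
    "CUDA graph cache": 0.23,
    "vLLM KV cache": 9.50,
}

# Sum of all components; this is the figure the table rounds to ~13 GiB.
total_gib = sum(components_gib.values())

# Share of the active footprint that is KV cache rather than the model itself.
kv_share = components_gib["vLLM KV cache"] / total_gib

print(f"total active VRAM: {total_gib:.2f} GiB")
print(f"KV cache share of that total: {kv_share:.0%}")
```

The sum comes to about 12.66 GiB, of which roughly three quarters is KV cache, which is why gpu_memory_utilization is the main lever for the footprint.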
Implications:
- 24 GB GPUs are the right minimum. A 16 GB card cannot fit the default config — vLLM would OOM during engine init.
- Sequential parses don’t grow above ~13 GiB. KV cache is bounded by gpu_memory_utilization; long documents don’t push it higher.
- Concurrent batching grows KV cache pressure, not other components. To bound max VRAM under heavy concurrency, either move to a 48 GB pool or tune gpu_memory_utilization downward.
Rough peak VRAM by workload shape (estimates)
| Workload | Peak VRAM | Fits in |
|---|---|---|
| Single-doc sequential (default config) | ~13 GiB | 24 GB ✓ |
| Single-worker, light async batching (≤4 concurrent pages) | ~14–18 GiB | 24 GB ✓ |
| Single-worker, heavy batching (8+ concurrent, large docs) | 20+ GiB | 48 GB recommended |
| Pipeline backend (backend: "pipeline"), no vLLM KV cache | 4 GB min (MinerU upstream) | 24 GB easily ✓ |
Why the MinerU paper used 48 GB
The published benchmarks ran on dual RTX 4090 (effectively 48 GB combined) to maximize throughput for the leaderboard numbers, not because correctness requires it. If you’re running a single document at a time in a serverless worker, you do not need 48 GB.
Workload-to-GPU map
Single worker, one doc at a time
This is the default RunPod serverless shape: each incoming job spins up a worker, the worker parses the document, and the worker scales to zero after idle_timeout. No concurrency inside the worker.
- GPU pool: ADA_24 (default) or AMPERE_24 (cheaper hourly, slower)
- VRAM peak: ~13 GiB
- Backend: vlm-auto-engine (the default; resolves to vLLM async on CUDA)
Single worker, batching multiple docs
If you set concurrencyModifier > 1 on your endpoint so one worker handles several requests in parallel, the vLLM async engine batches their pages. KV cache grows linearly with batch size.
- GPU pool: ADA_24 for small batches (≤4 concurrent), AMPERE_48 for larger
- VRAM peak: ~14–18 GiB for light batching (≤4 concurrent pages), 20+ GiB for heavy batching (estimates, matching the workload table above)
- Trade-off: higher throughput per worker; harder to bound max VRAM. Watch nvidia-smi in worker logs after deploy.
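One way to keep batching inside the small-batch range is to cap concurrency in the worker itself. The sketch below assumes the worker is built on the RunPod Python SDK, whose `runpod.serverless.start` accepts a `concurrency_modifier` callable that receives the current concurrency and returns the new one; how this template actually wires concurrency is not shown on this page, so treat both the wiring and the cap of 4 as assumptions.

```python
# Assumption: 4 is the "small batch" ceiling that still fits a 24 GB pool.
MAX_CONCURRENCY = 4


def concurrency_modifier(current_concurrency: int) -> int:
    """Allow one more in-flight job per call, never exceeding MAX_CONCURRENCY."""
    return min(max(current_concurrency, 0) + 1, MAX_CONCURRENCY)


# Assumed wiring in the worker entry point (requires the runpod package,
# and `handler` is whatever job handler the worker already defines):
#
#   import runpod
#   runpod.serverless.start({
#       "handler": handler,
#       "concurrency_modifier": concurrency_modifier,
#   })

if __name__ == "__main__":
    print([concurrency_modifier(n) for n in range(6)])
```

A hard ceiling trades some peak throughput for a predictable upper bound on KV-cache pressure, which is usually the safer side on a 24 GB card.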
Multiple workers, one doc each (horizontal scaling)
This is what RunPod does when you set workers_max > 1 and several requests arrive at once. Each worker is an isolated container with its own model copy, so VRAM is per-worker, not shared.
- GPU pool: ADA_24 per worker
- VRAM peak: ~13 GiB per worker
- Scaling note: you pay per worker-second, so cost scales linearly. The default workers_max: 3 in hub.json is a sensible cap for spiky workloads.
Multiple workers, batching inside each
The high-throughput shape: workers_max > 1 AND concurrencyModifier > 1. Each worker batches several pages internally while RunPod scales horizontally.
- GPU pool: AMPERE_48
- VRAM peak: 16+ GB per worker
- When to use: high-volume production ingest where you’ve already validated the workload mix on 24 GB.
Document-size adjustments
| Doc size | Pages | Recommended pool | Notes |
|---|---|---|---|
| Tiny | 1–3 | ADA_24 | Default 24 GB pool; ~13 GiB peak. 16 GB pools won’t fit the default vLLM config. |
| Small | 3–30 | ADA_24 | The “average document” case the defaults are tuned for. |
| Medium | 30–100 | ADA_24 | Same default; KV cache stays well under 16 GB even on dense pages. |
| Large | 100–1,000 | AMPERE_48 | More headroom for dense layouts. Sequential mode still peaks at ~13 GiB, so 48 GB is safety margin. |
| Huge | 1,000–10,000 | AMPERE_48 + --backend pipeline | See next section. |
Picking a backend
MinerU 3.2.x exposes five backends; the template accepts all of them as a per-job string. You don’t need to change the endpoint or the GPU pool to switch: the same worker can handle different backends on consecutive jobs.
| Backend | What it runs | When to pick it | Per-page VRAM |
|---|---|---|---|
| vlm-auto-engine (default) | The MinerU2.5-Pro-2605-1.2B VLM via vLLM | English / Chinese (per model card tags), short-to-medium docs, fast wall time | ~13 GiB |
| pipeline | PaddleOCR + dedicated layout/formula/table models | Non-Latin scripts (use a script-family lang code), very long docs, constant-memory workloads | 4 GB min (per MinerU upstream) |
| hybrid-auto-engine | Pipeline + VLM, auto-routed based on content | Mixed-content docs where the pipeline+VLM combination outperforms either alone | Highest |
| vlm-http-client | Same VLM via an external vLLM OpenAI-compatible server | Split-tier deploys: keep a 24 GB worker cheap, run vLLM on a separate always-on box | Low (no model on worker) |
| hybrid-http-client | Hybrid with external VLM server | Same split-tier idea, on hybrid workloads | Low |
Pipeline: the constant-memory escape hatch
Pipeline trades raw speed for memory predictability: page-by-page streaming with no vLLM KV-cache accumulation, and a 4 GB minimum per MinerU’s hardware compatibility table. Per-page time is ~3–5 s across GPUs for pipeline. The VLM backend is much more GPU-sensitive and content-sensitive: MinerU upstream claims ~0.5 s/page (2.12 fps) on an A100; we measured ~1 s/page (uniform reports) to ~6 s/page (dense forms) on RTX 4090, and ~1–10 s/page on A5000 24 GB depending on content density. So VLM-on-A100 is ≈ 7× faster than pipeline, VLM-on-4090 is meaningfully faster than pipeline, and VLM-on-A5000 is roughly even with pipeline on average.
The pipeline backend uses PaddleOCR with explicit script-family models (109 languages per MinerU’s README) and is the documented-safe choice for non-Latin scripts. Empirically the Pro VLM also handles Cyrillic (Russian) correctly even though lang is ignored, but coverage of other non-Latin scripts on the VLM is undocumented.
    {
      "input": {
        "file_url": "https://example.com/russian-report.pdf",
        "backend": "pipeline",
        "lang": "east_slavic"
      }
    }
Script-family lang codes (not ISO): east_slavic (Russian/Ukrainian/Belarusian), cyrillic (Serbian/Bulgarian/Mongolian), latin, arabic, devanagari, japan, korean, chinese_cht, el (Greek), th, etc. See MinerU’s language support docs for the full list of 109.
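Because the lang codes are script families rather than ISO codes, it can help to map language names to codes in one place before building the job input. This is a minimal sketch: `SCRIPT_FAMILY` and `pipeline_job` are hypothetical names, and the mapping covers only the languages spelled out above, not the full list of 109.

```python
import json

# Subset taken from the script-family list above; extend it from MinerU's
# language support docs rather than guessing codes.
SCRIPT_FAMILY = {
    "russian": "east_slavic",
    "ukrainian": "east_slavic",
    "belarusian": "east_slavic",
    "serbian": "cyrillic",
    "bulgarian": "cyrillic",
    "mongolian": "cyrillic",
    "greek": "el",
}


def pipeline_job(file_url: str, language: str) -> dict:
    """Build a job input for the pipeline backend with a script-family lang code.

    Raises KeyError for languages that are not in the subset above.
    """
    lang = SCRIPT_FAMILY[language.lower()]
    return {"input": {"file_url": file_url, "backend": "pipeline", "lang": lang}}


if __name__ == "__main__":
    job = pipeline_job("https://example.com/russian-report.pdf", "Russian")
    print(json.dumps(job, indent=2))
```

Failing loudly on an unknown language is deliberate here: silently falling back to a default script family would produce plausible-looking but wrong OCR.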
Hybrid: best quality, biggest GPU
hybrid-auto-engine routes each page through either pipeline or VLM depending on content. It gives the best overall quality but has the highest VRAM footprint: it needs 48 GB on dense mixed-content docs.
Empirical note (Pro-2604 + Cyrillic): we A/B-tested vlm-auto-engine against hybrid-auto-engine on Cyrillic fiscal PDFs (forms with embedded tables and codes) and got near-identical output — same Cyrillic transcription, same cell-collision pattern on dense form headers. The Pro-2604 VLM already handles Cyrillic correctly end-to-end; hybrid’s layout layer didn’t unlock additional accuracy for that content. The 2605 model card lists the same English/Chinese tags as 2604; we have not re-run the Cyrillic A/B on 2605, but its lineage and the unchanged tag set leave no reason to expect a regression. Hybrid is most useful when layout disambiguation (multi-column reading order, complex form structure, mixed text + math + tables) is the bottleneck — not OCR per se. Don’t reach for hybrid just because your docs are non-English.
HTTP-client backends: split worker from model
vlm-http-client and hybrid-http-client don’t load the VLM into the worker. Instead they POST to an external vLLM OpenAI-compatible server (pass server_url in the job input). Useful when you want to:
- Keep a fleet of small (24 GB or less) workers cheap and stateless
- Run one always-on vLLM server amortising the model load across all of them
- Run vLLM on hardware the serverless tier doesn’t offer (multi-GPU, NVLink, etc.)
    {
      "input": {
        "file_url": "...",
        "backend": "vlm-http-client",
        "server_url": "https://your-vllm-host.example.com/v1"
      }
    }
Official hardware compatibility (MinerU upstream)
Mirrored from MinerU’s official PyPI page so you don’t have to context-switch. These are MinerU’s project-wide minimums; our worker’s defaults are usually well above them.
| | pipeline | vlm-auto-engine | hybrid-auto-engine | *-http-client |
|---|---|---|---|---|
| Backend features | Good compatibility | High HW requirements | High HW requirements | For OpenAI-compatible servers² |
| Accuracy¹ | 85+ | 95+ | 95+ | (delegated to remote server) |
| OS support | Linux³ / Windows⁴ / macOS⁵ | Linux³ / Windows⁴ / macOS⁵ | Linux³ / Windows⁴ / macOS⁵ | Linux³ / Windows⁴ / macOS⁵ |
| Pure CPU support | ✅ | ❌ | ❌ | ✅ |
| GPU acceleration | Volta or newer GPUs / Apple Silicon | Volta or newer | Volta or newer | Not required |
| Min VRAM | 4 GB | 8 GB | 8 GB | 2 GB |
| RAM | 16 GB min, 32 GB recommended | 16 GB min, 32 GB recommended | 16 GB min, 32 GB recommended | 16 GB min |
| Disk | 20 GB min, SSD recommended | 20 GB min, SSD recommended | 20 GB min, SSD recommended | 2 GB min |
| Python | 3.10 – 3.13 | 3.10 – 3.13 | 3.10 – 3.13 | 3.10 – 3.13 |
² OpenAI-compatible servers: vLLM, SGLang, LMDeploy.
³ Linux distributions from 2019 or later.
⁴ Windows is limited to Python 3.10–3.12 because the ray dependency does not support Python 3.13 on Windows.
⁵ macOS requires 14.0 or later.
Our default RunPod config uses more VRAM than these floors because vLLM’s gpu_memory_utilization=0.5 allocates ~9.5 GiB of KV cache upfront on 24 GB cards. Tuning it down would let vlm-auto-engine run on smaller GPUs, at the cost of concurrent throughput. Our container image runs Linux + Python 3.11; OS rows above are MinerU’s claim, not what we ship.
How to configure GPU pool
Three places, in order of how you’d reach for them:
From deploy.py
    python deploy.py --template-id $env:MINERU_TEMPLATE_ID --gpu-ids AMPERE_48
Accepts a comma-separated list; RunPod picks the first available pool from the list at scale-up time. The repo’s default already prefers 4090, then falls back to A5000, then to A6000:
    python deploy.py --template-id ... --gpu-ids "ADA_24,AMPERE_24,AMPERE_48"
From the RunPod dashboard
Endpoint detail page → GPU configuration → tick the pools you want. Same semantics as the comma-separated list: RunPod tries the first available pool that’s allow-listed.
From .runpod/hub.json
For published Hub templates, the gpuIds field declares the default pool list users will see when they deploy your template:
    {
      "config": {
        "gpuIds": "ADA_24,AMPERE_24,AMPERE_48",
        ...
      }
    }
This repo’s default exposes 24 GB and 48 GB tiers so users can opt up without re-publishing. You can change the pool list per-endpoint from the RunPod dashboard at any time; the hub.json value is just the suggested starting set for new deploys.
Supported GPU pools — full reference
The Hub default is intentionally conservative. The full list of RunPod pools that work (or are known not to) with this template:
| Pool ID | What it contains | Compute | Status with MinerU 2.5 Pro |
|---|---|---|---|
| ADA_24 (default) | RTX 4090, 24 GB | 8.9 | ✅ Tested by us; ~1–6 s/page VLM warm depending on content density. Best speed-and-cost trade-off for sequential parsing. |
| AMPERE_24 (default) | RTX 3090 / RTX A5000, 24 GB | 8.6 | ✅ Tested by us; ~1–10 s/page VLM warm. Lower $/hr than ADA_24 but slower per page; usually worse $/page in practice. |
| AMPERE_48 (default) | RTX A6000, 48 GB | 8.6 | ✅ In default list; same generation as A5000, double the VRAM. Reach for it when 24 GB OOMs under your workload. |
| AMPERE_80 | NVIDIA A100 80 GB | 8.0 | ✅ The GPU MinerU benchmarks against (2.12 fps / ~0.5 s/page upstream). Compute 8.0 fully supported by vllm 0.11.2. Untested by us. |
| ADA_80_PRO | RTX 6000 Ada variant | 8.9 | ✅ Should work; same kernels as AMPERE_48. Untested by us. |
| ADA_48_PRO | RTX 6000 Ada or RTX PRO 6000 Blackwell (RunPod groups them together) | 8.9 or 12.0 | ⚠️ Avoid: if RunPod schedules you on a Blackwell SKU, VLM backend will crash. Pipeline backend still runs on Blackwell. See the Blackwell note below. |
Other pools (HOPPER_*, lower-VRAM Ampere SKUs, etc.) may exist in your RunPod account, but we haven’t verified the exact pool-ID strings. RunPod’s pool IDs have been historically inconsistent: ADA_48 vs ADA_48_PRO was a footgun we hit on this project. Before adding a non-default pool to your endpoint, check RunPod’s GPU types reference to confirm the exact ID string.
To opt into a non-default pool, edit your endpoint’s GPU configuration in the RunPod dashboard and add the pool ID to the comma-separated list. RunPod picks the first available pool from the list at scale-up time, so listing cheaper options first keeps cost lowest while making expensive options available if cheaper ones run out.
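Since list order decides which pool gets picked first, it can be worth assembling the gpuIds string programmatically and refusing risky pools up front. The following is a sketch under stated assumptions: `gpu_ids` and `RISKY_POOLS` are hypothetical names, and rejecting ADA_48_PRO for every backend except pipeline is a deliberately conservative reading of the table above.

```python
DEFAULT_POOLS = ["ADA_24", "AMPERE_24", "AMPERE_48"]

# Pools that may schedule onto a Blackwell SKU, per the reference table.
RISKY_POOLS = {"ADA_48_PRO"}


def gpu_ids(pools: list[str], backend: str = "vlm-auto-engine") -> str:
    """Return a comma-separated pool list, keeping the caller's priority order.

    Duplicates are dropped (first occurrence wins). Risky pools are rejected
    unless the job will only ever use the pipeline backend.
    """
    ordered: list[str] = []
    for pool in pools:
        if pool in RISKY_POOLS and backend != "pipeline":
            raise ValueError(f"{pool} may land on Blackwell; VLM backends crash there")
        if pool not in ordered:
            ordered.append(pool)
    return ",".join(ordered)


if __name__ == "__main__":
    # Cheaper pools first, A100 as the last fallback.
    print(gpu_ids(DEFAULT_POOLS + ["AMPERE_80"]))
```

The resulting string is what you would pass to --gpu-ids or paste into the gpuIds field.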
Why we don’t put A100/H100 in the Hub default: the hub.json gpuIds list is what new deployers see on the deploy form. Including AMPERE_80 by default would make it look like A100 is the expected GPU for this template — it’s not. Per-page cost on A100 is often higher than on ADA_24 because the hourly rate jump outpaces the speed-up except for sustained throughput. ADA_24 is the right default; opt into bigger only when you’ve measured a workload that needs it.
A note on ADA_48_PRO and Blackwell
The ADA_48_PRO pool was originally the RTX 6000 Ada Generation (compute capability 8.9). RunPod has been adding the newer RTX PRO 6000 Blackwell cards (compute capability 12.0) to the same pool, and the names are nearly identical: NVIDIA labels both as “RTX 6000” family cards. The bundled xformers / flash-attn in the current worker image has dedicated kernels for Ampere (8.0-8.6), Ada (8.9), and Hopper (9.0), but no Blackwell (12.0) code path yet. When run on a Blackwell card, xformers misroutes to the Hopper kernel and crashes during VLM model init:
    CUDA error (...flash-attention/hopper/flash_fwd_launch_template.h:188): invalid argument
That’s why the template’s default gpuIds list excludes ADA_48_PRO. If you need a 48 GB Ada-or-newer card and are happy to use only the pipeline backend (which doesn’t touch xformers / flash-attn), you can add it back to your endpoint’s pool list manually, but the VLM backend will fail on the Blackwell variant until the worker image bumps xformers / vllm to versions that ship a Blackwell kernel.
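A worker could fail fast instead of crashing inside model init by checking the compute capability first. This sketch encodes only the architectures named above; `vlm_status` is a hypothetical helper, and anything outside the listed set is reported as unverified rather than guessed at. On a live worker the tuple could come from PyTorch's `torch.cuda.get_device_capability()`, which is not imported here so the example stays dependency-free.

```python
def vlm_status(major: int, minor: int) -> str:
    """Classify a CUDA compute capability for the VLM backend.

    Returns "supported", "unsupported", or "unverified".
    """
    cc = (major, minor)
    # Ampere 8.0-8.6, Ada 8.9, Hopper 9.0: kernels bundled in the worker image.
    if (8, 0) <= cc <= (8, 6) or cc in {(8, 9), (9, 0)}:
        return "supported"
    # Blackwell 12.0: no code path yet, crashes during VLM model init.
    if cc == (12, 0):
        return "unsupported"
    # Not covered by the note above; do not assume either way.
    return "unverified"


if __name__ == "__main__":
    for cc in [(8, 6), (8, 9), (12, 0)]:
        print(cc, vlm_status(*cc))
```

When the status is "unsupported", falling back to the pipeline backend matches the workaround described above.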
Cost implications
GPU pricing on RunPod varies by pool, availability, and whether you’re on spot vs on-demand. Indicative serverless Flex rates we’ve seen (subject to change; confirm at RunPod’s pricing page):
| Pool | $/hr | Per-page wall time (warm VLM, dense content) | Per-page cost |
|---|---|---|---|
| ADA_24 (RTX 4090, default) | ~$1.10 | ~1–6 s | lowest |
| AMPERE_24 (A5000 / 3090) | ~$0.69 | ~1–10 s | ~1.5–2× higher than ADA_24 despite cheaper hourly; speed gap dominates |
| AMPERE_48 (A6000) | ~$0.79 | similar to A5000 | only worth it when 24 GB OOMs |
| AMPERE_80 (A100 80 GB) | ~$1.89 | ~0.5 s upstream | sustained throughput only; otherwise per-page cost is higher than ADA_24 |
Why ADA_24 wins on $/page despite the higher hourly: the 4090’s compute speed-up (~2–4× faster per page than A5000) outpaces its ~1.6× hourly premium. The math only flips when a workload sits on a single warm worker for very long stretches and the per-page advantage saturates.
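The per-page comparison is simple arithmetic: hourly rate divided by 3,600, times seconds per page. The sketch below uses the indicative rates from the table and the slow end of each measured range as illustrative inputs; substitute your own timings, since the result is sensitive to content density.

```python
def cost_per_page(dollars_per_hour: float, seconds_per_page: float) -> float:
    """Dollar cost of one warm page at a given hourly rate and wall time."""
    return dollars_per_hour / 3600 * seconds_per_page


def break_even_speedup(fast_rate: float, slow_rate: float) -> float:
    """Speed-up the pricier GPU needs before it becomes cheaper per page."""
    return fast_rate / slow_rate


if __name__ == "__main__":
    # Indicative rates from the table; dense-content end of each range.
    ada = cost_per_page(1.10, 6.0)
    ampere = cost_per_page(0.69, 10.0)
    print(f"ADA_24:    ${ada:.5f}/page")
    print(f"AMPERE_24: ${ampere:.5f}/page")
    print(f"break-even speed-up: {break_even_speedup(1.10, 0.69):.2f}x")
```

The break-even figure is the ~1.6× hourly premium mentioned above: ADA_24 is cheaper per page whenever it is more than about 1.6× faster on your documents, and more expensive otherwise.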
When to revisit this
Watch your worker logs after deploy. The signals that mean you should bump up:
- CUDA out of memory errors → bump to next VRAM tier
- vLLM warning about KV cache eviction → larger pool or reduce concurrency
- Per-page wall time spiking → check nvidia-smi; if at 100% memory but low utilisation, you’re memory-bound
Conversely, if your worker logs show consistent low VRAM usage (~6–8 GB peak) on a 48 GB pool, you’re paying for headroom you don’t need. Drop to 24 GB.
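These checks can be automated against a single line of nvidia-smi output. The sketch assumes the sample comes from `nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv,noheader,nounits`; the `classify` helper and its thresholds (95% memory, 30% utilisation, 20% memory for over-provisioning) are illustrative assumptions, not values from the repo.

```python
def classify(sample: str) -> str:
    """Classify one 'used_mib, total_mib, util_pct' CSV line from nvidia-smi.

    Returns "memory-bound", "over-provisioned", or "ok".
    """
    used_mib, total_mib, util_pct = (float(field) for field in sample.split(","))
    mem_fraction = used_mib / total_mib

    # Memory nearly full while the GPU sits mostly idle: memory-bound.
    if mem_fraction >= 0.95 and util_pct < 30:
        return "memory-bound"
    # E.g. ~6-8 GB peak on a 48 GB card: paying for unused headroom.
    if mem_fraction <= 0.20:
        return "over-provisioned"
    return "ok"


if __name__ == "__main__":
    for line in ("24400, 24564, 12", "7200, 49140, 40", "13000, 24564, 85"):
        print(line, "->", classify(line))
```

A single sample is noisy; in practice you would look at the peak over a representative batch of jobs before changing pools.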