Scaling and tuning

This page collects the worker-level tuning knobs — the things you set once on the endpoint, not per request. For per-request options (backend selection, page range, output format), see API reference.

All values listed here are set as environment variables on the RunPod endpoint, not in the job payload. You’ll find them under the endpoint’s Environment Variables section in the RunPod dashboard, or as env entries in deploy.py if you use that.

MINERU_MAX_CONCURRENCY controls how many jobs a single worker handles in parallel. Default 1.

| Value | When to use |
|---|---|
| 1 (default) | Safe on every supported GPU type. Each worker handles one job at a time; concurrency comes from running more workers. |
| 2-3 | Only on ≥24 GB GPUs with the vlm-auto-engine backend, or on any GPU with the pipeline backend. Watch VRAM via nvidia-smi or the gpu.memory fields in worker logs before raising. |

MinerU’s own mineru-api server defaults this to 3, but that’s a guidance value tuned for their full-fat hosted setup — for a generic serverless template across all of RunPod’s GPU pool, 1 is the safe baseline.

Why not just always run higher concurrency? MinerU’s VLM backend wraps vLLM, which pre-allocates a large KV cache at engine init. Concurrent jobs share that cache, so on a 16 GB GPU you can expect out-of-memory errors at around 2 concurrent VLM parses. The pipeline backend is more frugal but still bound by per-page VRAM. See Choosing a GPU for the memory math.
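A minimal sketch of how a worker could read MINERU_MAX_CONCURRENCY and fall back to the safe default of 1. The RunPod Python SDK accepts a concurrency_modifier callable in the config passed to runpod.serverless.start; whether this worker wires the value through that hook is an assumption, and the function names here are illustrative.

```python
import os


def max_concurrency(env=None, default=1):
    """Parse MINERU_MAX_CONCURRENCY, falling back to the safe default."""
    env = os.environ if env is None else env
    raw = env.get("MINERU_MAX_CONCURRENCY", "").strip()
    if not raw:
        return default
    try:
        value = int(raw)
    except ValueError:
        # Unparseable values are treated as "not set" in this sketch.
        return default
    return max(1, value)  # never drop below one job per worker


def concurrency_modifier(current_concurrency):
    """Hypothetical hook: always report the configured ceiling."""
    return max_concurrency()
```

In a real worker this function would be passed alongside the handler, for example as the "concurrency_modifier" entry of the start config.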

Long-lived workers accumulate VRAM fragmentation and Python heap growth from MinerU + vLLM. After hundreds of parses, the same job may run noticeably slower than on a fresh worker. Two thresholds opt into automatic recycling:

| Env var | What | When to set |
|---|---|---|
| REFRESH_WORKER_AFTER_JOBS | Recycle after this many successful jobs | High-throughput workloads with many small documents. Try 100 if you notice gradual slowdown. |
| REFRESH_WORKER_AFTER_PAGES | Recycle after this many cumulative pages parsed | Mixed workloads where job count is misleading (some 1-page, some 100-page). Try 5000. |

Both default to 0 (disabled). When a threshold is crossed, the worker attaches refresh_worker: true to the response, and RunPod’s runtime kills the worker after the response is delivered. The next request lands on a fresh worker and pays a cold-start cost; pair recycling with FlashBoot to keep that cheap.

  • Both thresholds active: whichever trips first wins.
  • Unbounded parses (end_page=-1, the default): contribute 1 to the jobs counter and 0 to the pages counter. Use the jobs threshold if you mostly do full-document parses.
  • Errors and probes: don’t increment either counter. Recycling is for memory hygiene, not error recovery.
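The counting rules above can be sketched as a small tracker. This is an illustration under stated assumptions, not the worker's actual code: the class and function names are hypothetical, start_page is an assumed companion to end_page, page ranges are assumed inclusive, and a threshold is assumed to trip when the counter reaches it.

```python
from dataclasses import dataclass


@dataclass
class RecycleTracker:
    max_jobs: int = 0   # REFRESH_WORKER_AFTER_JOBS; 0 disables
    max_pages: int = 0  # REFRESH_WORKER_AFTER_PAGES; 0 disables
    jobs: int = 0
    pages: int = 0

    def record_success(self, start_page=0, end_page=-1):
        """Count one successful job; only bounded ranges add pages."""
        self.jobs += 1
        if end_page >= 0:
            # Assumption: the range is inclusive on both ends.
            self.pages += end_page - start_page + 1
        # end_page == -1 (unbounded parse) contributes 0 pages.

    def should_refresh(self):
        """True once either enabled threshold has been reached."""
        hit_jobs = self.max_jobs > 0 and self.jobs >= self.max_jobs
        hit_pages = self.max_pages > 0 and self.pages >= self.max_pages
        return hit_jobs or hit_pages


def finish_job(tracker, result, start_page=0, end_page=-1):
    """Call only for successful jobs; errors and probes skip this."""
    tracker.record_success(start_page, end_page)
    if tracker.should_refresh():
        return {**result, "refresh_worker": True}
    return result
```

Because errors and probes never reach finish_job in this sketch, they leave both counters untouched, which matches the "memory hygiene, not error recovery" intent.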

RunPod’s FlashBoot is a CRIU-style process-snapshot mechanism that captures the worker’s full state (Python interpreter, GPU VRAM, subprocess tree) when the worker scales to zero and restores it on the next scale-from-zero. Enable it in the endpoint config (default on for Hub-deployed templates).

The boot-time warmup loads the MinerU model into VRAM before serverless.start() runs, so FlashBoot captures a post-warmup snapshot. A snapshot-restored cold start is ~7-8 s wall-clock instead of the ~110 s a fresh boot would take.

Snapshots are per (worker host, image SHA), not per endpoint. The first time a worker lands on a new host, it pays the full ~110 s warmup once, and that host then has a snapshot. Subsequent scale-from-zero events that RunPod schedules onto the same host get the fast restore. A scale-from-zero onto a different host re-runs warmup (~110 s, once). See Troubleshooting → FlashBoot mechanism for the four-request investigation that pinned down this behavior.

Practical operator math: on endpoints with high enough traffic that the same hosts keep getting re-selected, you’ll see almost all ~7-8 s cold starts. On quiet endpoints with long idle gaps, expect a mix — RunPod’s scheduler may bounce between hosts and you’ll see some ~110 s starts before each host warms up.
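To put rough numbers on that mix, the expected cold-start time can be estimated as a weighted average of the two cases. This is a back-of-the-envelope sketch: the snapshot hit rate is something you would have to estimate from your own traffic, and 7.5 s is simply the midpoint of the ~7-8 s range quoted above.

```python
def expected_cold_start_seconds(snapshot_hit_rate, restore_s=7.5, warmup_s=110.0):
    """Weighted average of snapshot restores and full warmups.

    snapshot_hit_rate: fraction of cold starts that land on a host
    which already holds a snapshot (0.0 to 1.0).
    """
    if not 0.0 <= snapshot_hit_rate <= 1.0:
        raise ValueError("snapshot_hit_rate must be between 0 and 1")
    return snapshot_hit_rate * restore_s + (1.0 - snapshot_hit_rate) * warmup_s
```

For example, a 90% hit rate works out to roughly 18 s on average, while a 50% hit rate is closer to 59 s.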

If FlashBoot is off, every cold start runs the ~110 s warmup, with no per-host amortization at all. Keep it on unless you have a specific reason to disable it.

To bypass the warmup entirely (e.g., for debugging cold-start ordering), set MINERU_SKIP_WARMUP=1 on the endpoint. The worker will fall back to lazy load, paying ~110-130 s on every cold start regardless of host history.

These pass straight through to MinerU’s own configuration. Leave empty to use MinerU’s defaults.

| Env var | MinerU default | Description |
|---|---|---|
| MINERU_PROCESSING_WINDOW_SIZE | 64 | Pages per pipeline batch window. Lower (32) reduces peak VRAM on long documents; higher (128) trades VRAM for throughput. Only affects the pipeline backend. |
| MINERU_PDF_RENDER_TIMEOUT | 300 | Seconds before MinerU kills a hung PDF page render. Raise for very complex PDFs (600). |
| MINERU_PDF_RENDER_THREADS | 4 | CPU threads for PDF page rasterization. Scale up on workers with more vCPUs (8 on 16-vCPU pods) for image-heavy PDFs. |

For the hybrid-http-client backend specifically, MinerU exposes one more knob: MINERU_HYBRID_BATCH_RATIO (defaults to ~4) for VRAM reduction. Only relevant if you’re pointing the worker at an external vLLM server with limited VRAM.
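As an illustration of the "leave empty to use MinerU's defaults" rule, the sketch below assembles an environment mapping and drops any unset or empty entries. The build_env helper is hypothetical and the actual shape of the env entries in deploy.py may differ; only the variable names come from this page.

```python
def build_env(settings):
    """Keep only knobs with a real value so MinerU's defaults apply elsewhere."""
    env = {}
    for name, value in settings.items():
        if value is None or str(value).strip() == "":
            continue  # empty means "use the default"
        env[name] = str(value)  # environment values are always strings
    return env


endpoint_env = build_env({
    "MINERU_MAX_CONCURRENCY": 1,
    "REFRESH_WORKER_AFTER_JOBS": 100,
    "MINERU_PROCESSING_WINDOW_SIZE": 32,   # lower peak VRAM on long documents
    "MINERU_PDF_RENDER_TIMEOUT": "",       # left empty: MinerU default
    "MINERU_PDF_RENDER_THREADS": None,     # unset: MinerU default
})
```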

LOG_FORMAT selects the worker’s log output:

  • json (default) — one JSON object per line, with a job_id field on every emission for cross-job correlation. Easier to filter in RunPod’s log viewer, CloudWatch, Loki, or Axiom.
  • text — human-readable, key=value pairs after the message. Useful for local development with runpod.serverless.start running on your laptop.

See Troubleshooting → reading worker logs for the field reference.
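The two formats can be sketched with the standard logging module. This is an illustrative approximation, not the worker's implementation: only the job_id field and the json/text values come from this page, while the other field names and the formatter classes are assumptions.

```python
import json
import logging
import os


class JsonFormatter(logging.Formatter):
    """One JSON object per line, always carrying a job_id field."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            "job_id": getattr(record, "job_id", None),
        }
        return json.dumps(payload)


class TextFormatter(logging.Formatter):
    """Human-readable message followed by key=value pairs."""

    def format(self, record):
        job_id = getattr(record, "job_id", None)
        return f"{record.levelname} {record.getMessage()} job_id={job_id}"


def make_formatter(log_format=None):
    """Pick a formatter from LOG_FORMAT; anything but 'text' means json."""
    fmt = (log_format or os.environ.get("LOG_FORMAT", "json")).lower()
    return TextFormatter() if fmt == "text" else JsonFormatter()
```

A handler would attach the formatter with handler.setFormatter(make_formatter()), and callers would pass the job ID through the logger's extra argument, for example logger.info("parsed", extra={"job_id": job_id}).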

Two caveats for production deployments:

  • Throttling: RunPod drops logs from workers producing too much output. The worker’s defaults (~4 lines per job) are well under any practical ceiling, but verbose forks may lose data.
  • No external shipping: RunPod does not forward logs to external systems. The dashboard retains endpoint logs for 90 days; worker-level logs disappear when the worker terminates. For durable / queryable logs in Axiom, Honeycomb, etc., the worker must push them. This is what the planned OTel integration does (item 8 in the runtime improvements plan).

| Knob | Default | Most common use |
|---|---|---|
| MINERU_MAX_CONCURRENCY | 1 | Bump to 2-3 only on ≥24 GB GPUs |
| REFRESH_WORKER_AFTER_JOBS | 0 (off) | Set to 100+ if workers slow down over time |
| REFRESH_WORKER_AFTER_PAGES | 0 (off) | Alternative to jobs threshold for mixed workloads |
| LOG_FORMAT | json | Switch to text for local dev |
| MINERU_PROCESSING_WINDOW_SIZE | MinerU default (64) | Lower if OOM on long docs (pipeline backend) |
| MINERU_PDF_RENDER_TIMEOUT | MinerU default (300 s) | Raise for very complex PDFs |
| MINERU_PDF_RENDER_THREADS | MinerU default (4) | Raise on high-vCPU pods for image-heavy PDFs |