Scaling and tuning

This page collects the worker-level tuning knobs — the things you set once on the endpoint, not per request. For per-request options (backend selection, page range, output format), see API reference.

All values listed here are set as environment variables on the RunPod endpoint, not in the job payload. You’ll find them under the endpoint’s Environment Variables section in the RunPod dashboard, or as env entries in deploy.py if you use that.

Concurrency

MINERU_MAX_CONCURRENCY controls how many jobs a single worker handles in parallel. The default is 1. It is the in-worker concurrency knob, one of two concurrency levers; for how it interacts with workers_max and the request queue (and which to reach for), see Concurrency.

Value	When to use
`1` (default)	Safe on every supported GPU type. Each worker handles one job at a time; concurrency comes from running more workers.
`2-3`	Only on ≥24 GB GPUs with the `vlm-auto-engine` backend, or on any GPU with the `pipeline` backend. Watch VRAM via `nvidia-smi` or the `gpu.memory` fields in worker logs before raising.

MinerU’s own mineru-api server defaults this to 3, but that’s a guidance value tuned for their full-fat hosted setup — for a generic serverless template across all of RunPod’s GPU pool, 1 is the safe baseline.

Why not just always run higher concurrency? MinerU’s VLM backend wraps vLLM, which pre-allocates a large KV cache at engine init. Two concurrent jobs share that cache; on a 16 GB GPU you OOM around 2 concurrent VLM parses. The pipeline backend is more frugal but still bound by per-page VRAM. See Choosing a GPU for the memory math.
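
The guidance in the table above can be written down as a small helper. This is a minimal sketch, not part of the worker: the function names and the `gpu_vram_gb` parameter are hypothetical, and the ceiling of 3 is simply the top of the 2-3 range quoted above. It gives an upper bound only; VRAM measurements still decide the final value.

```python
def max_safe_concurrency(gpu_vram_gb: float, backend: str) -> int:
    """Upper bound for MINERU_MAX_CONCURRENCY, following the table above."""
    if backend == "pipeline":
        # The table allows 2-3 on any GPU with the pipeline backend.
        return 3
    if backend == "vlm-auto-engine" and gpu_vram_gb >= 24:
        # The VLM backend needs at least 24 GB before going above 1.
        return 3
    # Everything else stays at the safe default.
    return 1


def clamp_concurrency(requested: int, gpu_vram_gb: float, backend: str) -> int:
    """Clamp a requested value into the range [1, max_safe_concurrency]."""
    ceiling = max_safe_concurrency(gpu_vram_gb, backend)
    return max(1, min(requested, ceiling))
```

For example, `clamp_concurrency(3, 16, "vlm-auto-engine")` falls back to 1, which matches the 16 GB OOM case described above.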

Worker recycling

Long-lived workers accumulate VRAM fragmentation and Python heap growth from MinerU + vLLM. After hundreds of parses, the same job may run noticeably slower than on a fresh worker. Two thresholds opt into automatic recycling:

Env var	What	When to set
`REFRESH_WORKER_AFTER_JOBS`	Recycle after this many successful jobs	High-throughput workloads with many small documents. Try `100` if you notice gradual slowdown.
`REFRESH_WORKER_AFTER_PAGES`	Recycle after this many cumulative pages parsed	Mixed workloads where job count is misleading (some 1-page, some 100-page). Try `5000`.

Both default to 0 (disabled). When a threshold is crossed, the worker attaches refresh_worker: true to the response, and RunPod’s runtime kills the worker after the response is delivered. The next request lands on a fresh worker (with a cold-start cost; pair with FlashBoot to keep that cheap).

Counter rules

Both thresholds active: whichever trips first wins.
Unbounded parses (end_page=-1, the default): contribute 1 to the jobs counter and 0 to the pages counter. Use the jobs threshold if you mostly do full-document parses.
Errors and probes: don’t increment either counter. Recycling is for memory hygiene, not error recovery.
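
The counter rules above can be sketched as a small tracker. This is an illustration under stated assumptions, not the worker's actual code: `RecycleTracker` and `record` are hypothetical names, and a threshold is treated as crossed once the counter reaches it.

```python
from dataclasses import dataclass


@dataclass
class RecycleTracker:
    after_jobs: int = 0   # REFRESH_WORKER_AFTER_JOBS; 0 disables the threshold
    after_pages: int = 0  # REFRESH_WORKER_AFTER_PAGES; 0 disables the threshold
    jobs: int = 0
    pages: int = 0

    def record(self, ok: bool, pages: int = 0) -> bool:
        """Record one finished job; return True if the worker should recycle.

        Pass pages=0 for an unbounded parse (end_page=-1): it counts as
        one job and zero pages. Failed jobs and probes (ok=False) leave
        both counters untouched.
        """
        if ok:
            self.jobs += 1
            self.pages += pages
        jobs_hit = self.after_jobs > 0 and self.jobs >= self.after_jobs
        pages_hit = self.after_pages > 0 and self.pages >= self.after_pages
        # Whichever threshold trips first wins.
        return jobs_hit or pages_hit
```

A handler built on this sketch would add refresh_worker: true to its response whenever `record` returns True.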

FlashBoot

RunPod’s FlashBoot is a CRIU-style process-snapshot mechanism that captures the worker’s full state (Python interpreter, GPU VRAM, subprocess tree) when the worker scales to zero and restores it on the next scale-from-zero. Enable it in the endpoint config (default on for Hub-deployed templates).

The boot-time warmup loads the MinerU model into VRAM before serverless.start() runs, so FlashBoot captures a post-warmup snapshot. A snapshot-restored cold start is ~7-8 s wall-clock instead of the ~110 s a fresh boot would take.

Snapshots are per (worker host, image SHA), not per endpoint. The first time a worker lands on a new host, it pays the full ~110 s warmup once, and that host then has a snapshot. Subsequent scale-from-zero events that RunPod schedules onto the same host get the fast restore. A scale-from-zero onto a different host re-runs warmup (~110 s, once). See Troubleshooting → FlashBoot mechanism for the four-request investigation that pinned down this behavior.

Practical operator math: on endpoints with enough traffic that the same hosts keep getting re-selected, almost all cold starts will take ~7-8 s. On quiet endpoints with long idle gaps, expect a mix: RunPod’s scheduler may bounce between hosts, and you’ll see some ~110 s starts before each host warms up.

If FlashBoot is off, every cold start runs the ~110 s warmup, with no per-host amortization at all. Keep it on unless you have a specific reason to disable it.
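
To put rough numbers on the operator math, here is a back-of-the-envelope sketch. The two constants come from the approximate figures on this page; `snapshot_hit_rate` (the fraction of cold starts that land on a host which already holds a snapshot) is a hypothetical input you would have to estimate from your own endpoint logs.

```python
SNAPSHOT_RESTORE_S = 7.5  # midpoint of the ~7-8 s restore figure
FULL_WARMUP_S = 110.0     # approximate fresh-boot warmup


def expected_cold_start_s(snapshot_hit_rate: float) -> float:
    """Average cold-start time for a given fraction of snapshot restores."""
    if not 0.0 <= snapshot_hit_rate <= 1.0:
        raise ValueError("snapshot_hit_rate must be between 0 and 1")
    miss_rate = 1.0 - snapshot_hit_rate
    return snapshot_hit_rate * SNAPSHOT_RESTORE_S + miss_rate * FULL_WARMUP_S
```

A hit rate of 0.0 corresponds to FlashBoot being off (every cold start pays the full warmup), while a busy endpoint approaches 1.0.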

To bypass the warmup entirely (e.g., for debugging cold-start ordering), set MINERU_SKIP_WARMUP=1 on the endpoint. The worker will fall back to lazy load, paying ~110-130 s on every cold start regardless of host history.

MinerU performance knobs

These pass straight through to MinerU’s own configuration. Leave empty to use MinerU’s defaults.

Env var	MinerU default	Description
`MINERU_PROCESSING_WINDOW_SIZE`	`64`	Pages per pipeline batch window. Lower (`32`) reduces peak VRAM on long documents; higher (`128`) trades VRAM for throughput. Only affects the `pipeline` backend.
`MINERU_PDF_RENDER_TIMEOUT`	`300`	Seconds before MinerU kills a hung PDF page render. Raise for very complex PDFs (`600`).
`MINERU_PDF_RENDER_THREADS`	`4`	CPU threads for PDF page rasterization. Scale up on workers with more vCPUs (`8` on 16-vCPU pods) for image-heavy PDFs.

For the hybrid-http-client backend specifically, MinerU exposes one more knob: MINERU_HYBRID_BATCH_RATIO (defaults to ~4) for VRAM reduction. Only relevant if you’re pointing the worker at an external vLLM server with limited VRAM.
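
If you generate the endpoint environment from a script, one way to honor the "leave empty to use MinerU’s defaults" rule is to emit only the knobs you actually override. The helper below is a sketch: `mineru_env_overrides` is a hypothetical name, and how deploy.py consumes such a dict is not specified on this page.

```python
from typing import Dict, Optional


def mineru_env_overrides(
    window_size: Optional[int] = None,
    render_timeout_s: Optional[int] = None,
    render_threads: Optional[int] = None,
) -> Dict[str, str]:
    """Build env entries for the MinerU knobs that are explicitly set.

    Knobs left as None are omitted, so MinerU falls back to its defaults.
    """
    candidates = {
        "MINERU_PROCESSING_WINDOW_SIZE": window_size,
        "MINERU_PDF_RENDER_TIMEOUT": render_timeout_s,
        "MINERU_PDF_RENDER_THREADS": render_threads,
    }
    env: Dict[str, str] = {}
    for name, value in candidates.items():
        if value is None:
            continue
        if value <= 0:
            raise ValueError(f"{name} must be a positive integer, got {value}")
        env[name] = str(value)  # environment variables are always strings
    return env
```

For instance, `mineru_env_overrides(window_size=32)` yields only the window-size entry and leaves the render timeout and thread count at MinerU’s defaults.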

Worker logs

LOG_FORMAT selects the worker’s log output:

json (default) — one JSON object per line, with a job_id field on every emission for cross-job correlation. Easier to filter in RunPod’s log viewer, CloudWatch, Loki, or Axiom.
text — human-readable, key=value pairs after the message. Useful for local development with runpod.serverless.start running on your laptop.

See Troubleshooting → reading worker logs for the field reference.
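
The difference between the two formats can be illustrated with the standard library's logging module. This is a sketch of the idea only, not the worker's logger: the class names and the `level` and `message` keys are hypothetical, and only the job_id field is taken from the description above.

```python
import json
import logging


class JobJsonFormatter(logging.Formatter):
    """One JSON object per line, with job_id on every emission."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "job_id": getattr(record, "job_id", None),
        })


class JobTextFormatter(logging.Formatter):
    """Human-readable message followed by key=value pairs."""

    def format(self, record: logging.LogRecord) -> str:
        return f"{record.getMessage()} job_id={getattr(record, 'job_id', None)}"


def make_formatter(log_format: str = "json") -> logging.Formatter:
    """Pick a formatter from a LOG_FORMAT-style value; json is the default."""
    return JobTextFormatter() if log_format == "text" else JobJsonFormatter()
```

With a handler using one of these formatters, a call such as `logger.info("parsed", extra={"job_id": "abc"})` carries the job id into every line.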

Two caveats for production deployments:

Throttling: RunPod drops logs from workers producing too much output. The worker’s defaults (~4 lines per job) are well under any practical ceiling, but verbose forks may lose data.
No external shipping: RunPod does not forward logs to external systems. The dashboard retains endpoint logs for 90 days; worker-level logs disappear when the worker terminates. For durable / queryable logs in Axiom, Honeycomb, etc., the worker must push them — this is what the planned OTel integration does (item 8 in the runtime improvements plan).

Summary

Knob	Default	Most common use
`MINERU_MAX_CONCURRENCY`	`1`	Bump to `2-3` only on ≥24 GB GPUs (`vlm-auto-engine`) or with the `pipeline` backend
`REFRESH_WORKER_AFTER_JOBS`	`0` (off)	Set to `100`+ if workers slow down over time
`REFRESH_WORKER_AFTER_PAGES`	`0` (off)	Alternative to jobs threshold for mixed workloads
`LOG_FORMAT`	`json`	Switch to `text` for local dev
`MINERU_PROCESSING_WINDOW_SIZE`	MinerU default (`64`)	Lower if OOM on long docs (pipeline backend)
`MINERU_PDF_RENDER_TIMEOUT`	MinerU default (`300` s)	Raise for very complex PDFs
`MINERU_PDF_RENDER_THREADS`	MinerU default (`4`)	Raise on high-vCPU pods for image-heavy PDFs