Fix RunPod's 'no resources to deploy your pod' error

Jun 3, 2026

Last Updated: 2026-06-03

If RunPod fails a deploy with this:

This machine does not have the resources to deploy your pod. Please try a different machine.

your Docker image is fine. This is a capacity error: RunPod’s scheduler tried to place your pod on a host that didn’t have a free GPU of the type you asked for, and bailed. It’s transient. The fix is to make RunPod try again on a different host, and how you trigger that retry depends on which RunPod product you’re using. On a Serverless endpoint wired to GitHub, push any commit to the watched branch. On a RunPod Hub template, cut a new GitHub Release. Those two triggers are not interchangeable, which is the part that trips people up.

I hit this constantly while maintaining the mineru-runpod template. The rest of this post is what the error actually means, why it’s not your fault, and the exact retry mechanic for each workflow.

What does “this machine does not have the resources to deploy your pod” mean on RunPod?

It means RunPod’s scheduler picked a physical host to run your pod, found that host couldn’t satisfy the requested GPU, RAM, or disk, and refused the placement. It’s a per-machine capacity miss, not a global outage and not a problem with your image. “Try a different machine” is literal advice: another host of the same GPU type may have room.

The message fires during the scheduling phase, before your container ever starts. That timing is the tell. A broken image fails differently: you’d see an image-pull error, a non-zero container exit, or a failed health check. This message means the scheduler never got that far. It looked at the GPU type you requested, compared it against free capacity on the candidate host, and found the fit impossible.

For most people the requested GPU is the binding constraint. Popular pools like the RTX 4090 get contended, especially in a busy region. When every host of that type in that data center is full, a fresh placement attempt fails until one frees up.

Is it a RunPod bug, or did I break something?

Neither. It’s a legitimate, expected response from RunPod’s scheduler reporting that the GPU pool you targeted had no free host at that moment. Your Dockerfile, your handler, and your config are all irrelevant to this error. GPU capacity on RunPod fluctuates minute to minute, so the same deploy that fails now often succeeds 30 seconds later on a different host.

The reason it feels like a bug is that it’s non-deterministic. The exact same config fails one minute and works the next, purely because the cluster’s free capacity moved. That’s also why retrying works: you’re not changing anything about your build, you’re just asking the scheduler to roll the dice again against a pool whose occupancy has shifted.
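The dice-roll framing can be put into rough numbers. The sketch below assumes each placement attempt succeeds independently with the same probability, which is a simplification (real pool occupancy is correlated over time), and the probabilities passed in are illustrative, not measured RunPod figures.

```python
import math


def p_success_within(p_single: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries succeeds."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    return 1.0 - (1.0 - p_single) ** attempts


def attempts_for(p_single: float, target: float) -> int:
    """Smallest number of tries whose combined success chance reaches `target`."""
    if not 0.0 < p_single < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("p_single and target must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_single))
```

Under that assumption, a pool that places half of all attempts gets you past 95% within five tries. A pool that places almost none of them is the persistent case, where retrying stops being the right tool.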

Where you actually meet this string matters. RunPod throws it whenever it spins up a real pod, which in this template’s world is two places: the Hub validator test pod that runs after a release, and a GPU Pod you launch directly. A Serverless worker that can’t find capacity usually surfaces it as workers stuck in a throttled or initializing state rather than this exact sentence, but the underlying cause and the recovery are the same: force a rebuild so RunPod reschedules.

How do I fix it on a RunPod Serverless endpoint?

Push any commit to the branch your endpoint watches. Per RunPod’s docs, “every git push to your specified branch results in an updated Endpoint”, so a no-op commit triggers a fresh build and redeploy. The new workers get scheduled again, almost always landing on a host with free capacity. You can also hit Rebuild in the RunPod console to do the same thing without a commit.

This is the path most people need. If you deployed your worker by connecting a GitHub repo, your endpoint redeploys on every push, so a trivial commit is the lowest-friction way to re-roll the scheduler:

git commit --allow-empty -m "chore: re-trigger RunPod build"
git push

The --allow-empty flag is the point: you don’t need a real change to force a rebuild, you just need a new commit on the watched branch. RunPod’s layer caching means the rebuild is fast after the first one, since only the layers that changed get rebuilt (and for an empty commit, none did).

If you’d rather not pollute history, the console’s manual Rebuild button is the cleaner equivalent. Either way you’re doing the same thing: asking RunPod to provision workers again, on hosts whose occupancy has moved since the last attempt.

How do I fix it when publishing a RunPod Hub template?

Cut a new GitHub Release. The Hub does not watch commits. Per RunPod’s publishing guide, “repository integration connects with GitHub repos using releases (not commits) for versioning”, so pushing to your branch does nothing on the Hub side. Only a new Release re-runs the build and the validator test pod, which is where this error shows up for template authors.

Here’s the trick that saves you: a Release is built on a Git tag, and a tag can point at a commit you already have. You don’t need to change a single line of code to re-trigger the Hub. Tag the same HEAD you already shipped and publish a Release for it:

git tag v1.6.4          # same commit, new tag
git push origin v1.6.4
# then publish a GitHub Release for v1.6.4 in the UI, or with the gh CLI:
gh release create v1.6.4 --title "v1.6.4" --notes "Re-trigger RunPod Hub build"

RunPod treats each tag as a distinct template version and re-runs the full pipeline: build the image, spin up the validator test pod from .runpod/tests.json, and index the listing (usually within an hour). If the previous Release failed only because the validator pod couldn’t get a GPU, the new Release gives it a fresh roll of the scheduler.

The catch worth stating: each retry adds a version to your Hub listing, even if two versions are byte-identical. That’s the cost of the release-driven model. It’s cosmetic, but if you retry five times you’ll have five versions, so don’t spin on it if the failure is actually persistent (see below).
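Since every retry needs a fresh tag, picking the next one by hand gets tedious. Here is a minimal sketch of a helper that does it, assuming your tags follow the vMAJOR.MINOR.PATCH shape used in the example above. The function name is made up for illustration and nothing here is part of RunPod or GitHub tooling.

```python
import re

# Matches tags like v1.6.4; anything else (e.g. "nightly") is ignored.
_TAG = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")


def next_patch_tag(existing_tags):
    """Return the tag one patch version above the highest existing tag."""
    versions = []
    for tag in existing_tags:
        match = _TAG.match(tag.strip())
        if match:
            versions.append(tuple(int(part) for part in match.groups()))
    if not versions:
        raise ValueError("no vMAJOR.MINOR.PATCH tags found")
    # Tuples compare numerically, so v1.10.0 correctly beats v1.9.0.
    major, minor, patch = max(versions)
    return f"v{major}.{minor}.{patch + 1}"
```

Feed it the lines printed by git tag --list and use the result as the tag for the retry Release.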

Why is the retry different for Serverless vs the Hub?

Because the two products use different GitHub triggers. A Serverless endpoint rebuilds on every push to its watched branch, so a commit is your retry. The Hub builds only on a new Release tag, so a release is your retry. Pushing commits at a Hub listing does nothing; pushing releases at a Serverless endpoint isn’t how it watches for changes.

	Serverless endpoint	RunPod Hub template
Build trigger	Every push to the watched branch	New GitHub Release (tag)
Retry the deploy by	Empty commit, or Rebuild in console	New Release on the same commit
Who needs this	Anyone running a GitHub-connected worker	Template authors publishing to the Hub
Where the error appears	Workers stuck initializing / throttled	The validator test pod after a Release

Most readers are in the left column. You deployed a worker from a repo (or one-click from the Hub) and you’re iterating on it; a commit re-rolls it. The right column is for the smaller group of people authoring a Hub listing, where the validator test pod is the thing hitting the capacity wall. If you’re not publishing your own Hub template, you can ignore the release workflow entirely. For the full deploy walkthrough, the getting-started guide covers both paths.

What if re-triggering doesn’t fix it?

If the same GPU pool fails every time, the capacity miss is persistent, not transient, and rolling the scheduler won’t help. Switch to a higher-availability GPU pool, change region, or lower the resources you’re requesting. For the mineru-runpod Hub validator, that means editing the gpuTypeId in .runpod/tests.json to a pool that’s actually free, then cutting a new Release.

The template defaults its validator to "NVIDIA GeForce RTX 4090" because it has the best pool availability across RunPod’s regions. "NVIDIA RTX A5000" works too but tends to be scarcer. I’ve bounced the test pod’s GPU between the A40, the A5000, and the 4090 across releases, chasing whichever pool had capacity on a given day, and the 4090 wins most often.
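Swapping the validator's pool is a one-field edit, so it is easy to script. The sketch below assumes only that .runpod/tests.json is valid JSON containing a gpuTypeId key somewhere. It walks the whole document rather than hard-coding where that key lives, because the exact layout is not shown here. Both function names are hypothetical.

```python
import json
from pathlib import Path


def set_gpu_type(node, gpu_type_id):
    """Replace every "gpuTypeId" value under `node`; return how many changed."""
    changed = 0
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "gpuTypeId":
                node[key] = gpu_type_id
                changed += 1
            else:
                changed += set_gpu_type(value, gpu_type_id)
    elif isinstance(node, list):
        for item in node:
            changed += set_gpu_type(item, gpu_type_id)
    return changed


def retarget_tests_file(path, gpu_type_id):
    """Rewrite the tests file in place; return the number of fields updated."""
    tests_path = Path(path)
    data = json.loads(tests_path.read_text())
    changed = set_gpu_type(data, gpu_type_id)
    if changed:
        tests_path.write_text(json.dumps(data, indent=2) + "\n")
    return changed
```

A call like retarget_tests_file(".runpod/tests.json", "NVIDIA RTX A5000") followed by a new Release is the whole pool switch. A return value of 0 means the key was not found and the file was left untouched.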

Three levers when retries aren’t enough:

Change the GPU type to a less-contended pool. A 24 GB workload fits several pools; pick the one with capacity rather than the one you assumed.
Change the region if your endpoint or template pins one. Capacity is per data center, so a pool that’s full in one region can be wide open in another.
Reduce the request. Oversized container disk or volume sizes shrink the set of hosts that can fit your pod. Trim them if they’re padded.
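The "retry a few times, then change pool" policy from this section can be sketched as a small loop. The deploy callable is a placeholder for however you trigger a deploy and learn whether it was placed (an empty commit, a new Release, the console). It is not a RunPod API, and the 30-second wait is just the retry interval mentioned earlier.

```python
import time


def deploy_with_fallback(deploy, pools, attempts_per_pool=3, wait_s=30.0, sleep=time.sleep):
    """Try each GPU pool in order; return the pool that deployed, or None.

    `deploy` is a hypothetical callable: deploy(gpu_type_id) -> bool.
    `sleep` is injectable so the policy can be exercised without waiting.
    """
    for pool in pools:
        for attempt in range(attempts_per_pool):
            if deploy(pool):
                return pool
            if attempt < attempts_per_pool - 1:
                sleep(wait_s)  # give the pool's occupancy time to shift
    return None
```

Capping attempts per pool matters most on the Hub, where each retry leaves another version on the listing.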

For template authors, there’s also an escape hatch documented in the troubleshooting guide: if a release is urgent and the validator is the only blocker, rename .runpod/tests.json to .runpod/tests_.json so the Hub skips the test pod entirely. You lose all CI signal, so it’s a temporary unblock, not a default. For the GPU-pool math behind these choices, see Choosing a GPU.

FAQ

Is “this machine does not have the resources to deploy your pod” a RunPod outage?

No. It’s a per-host capacity miss, not a global outage. The scheduler tried one machine, found it full, and stopped. Other hosts of the same GPU type may have room, which is why a retry often succeeds within seconds even while RunPod is otherwise healthy.

Does pushing a commit fix the error on a RunPod Hub template?

No. The Hub builds only on new GitHub Releases, not commits. Pushing to your branch leaves the Hub listing untouched. You have to publish a new Release (a new tag, which can point at the same commit) to re-run the Hub build and its validator test pod.

How do I re-trigger a RunPod Serverless build without changing code?

Push an empty commit with git commit --allow-empty to the watched branch, or click Rebuild in the RunPod console. Both force a fresh build and redeploy, so workers get scheduled again on hosts whose free capacity has shifted since the last attempt.

Can I create a GitHub Release without a new commit?

Yes. A Release is a tag, and a tag can point at any existing commit. Tag your current HEAD and publish a Release for it. RunPod treats every tag as a new version and re-runs the build, so this re-triggers the Hub without any code change.

Why does the same deploy fail once and work the next time?

GPU capacity on RunPod fluctuates minute to minute. The same config hits a full host on one attempt and a free host on the next, with nothing about your image changing. That non-determinism is exactly why retrying is the first thing to try.

How do I stop hitting the capacity error repeatedly?

Stop targeting a contended pool. Switch gpuTypeId to a higher-availability GPU (RTX 4090 pools are usually the most available), change region, or reduce requested disk and volume sizes so more hosts can fit your pod.

Where to next

The error is annoying but harmless once you know it’s capacity and not your build. For a Serverless worker, a commit re-rolls it; for a Hub template, a Release does. If it persists, it’s a pool-availability problem, and the fix lives in your GPU choice, not your Dockerfile. The full set of Hub build failures (this one, the CUDA floor mismatch, and the 30-minute build timeout) is catalogued in the troubleshooting guide.

If this saved you a debugging session, star the repo on GitHub for updates, or open an issue if you hit a build failure that isn’t covered here.

How RunPod FlashBoot Actually Works (4-Request Test)

May 26, 2026

Sergei Shmakov

Last Updated: 2026-05-26

If you’re shipping vLLM or any heavy ML model on RunPod Serverless, you’ve probably looked at FlashBoot, ticked the checkbox, and then watched your cold starts still take 60-120 seconds. RunPod’s marketing says “1-second cold starts.” Their docs describe FlashBoot as “pre-loading container images.” Neither of those matches what most ML workloads actually see.

I ran four cold-start tests on a deployed RunPod endpoint serving a vLLM-backed PDF parser. The wall-clock numbers ranged from 7 seconds to 7 minutes. The point of this post is to explain why — what FlashBoot actually does at the systems level, when it kicks in, and how to set up your worker so it kicks in more often.

What does FlashBoot actually do?

FlashBoot is a CRIU-style process snapshot mechanism. When a worker scales to zero, RunPod captures the full process state (Python interpreter, CUDA VRAM, subprocess tree) into a snapshot on the host’s local storage. When the worker scales back up on the same host, RunPod restores from that snapshot. The restored process resumes mid-stride: model still in VRAM, vLLM engine subprocess still alive, IPC pipes still connected.

The key qualifier that RunPod’s docs don’t mention: snapshots are per (host, image SHA), not per endpoint. If the next scale-from-zero lands on a different host, there’s no snapshot to restore from. The worker boots fresh and pays the full warmup cost. Once.

The TL;DR for an ML workload: set up an eager warmup at worker boot, then let FlashBoot do its thing. Each new host pays the warmup tax once. Subsequent scale-from-zeroes on that same host get the snapshot restore and finish a typical request in single-digit seconds.
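The per-(host, image SHA) scoping is easiest to see as a toy model. The class below is a simulation of the behaviour described here, not RunPod's implementation: it only tracks which (host, image SHA) pairs have been seen before.

```python
class SnapshotModel:
    """Toy model: a snapshot exists only for (host, image SHA) pairs already booted."""

    def __init__(self):
        self._snapshots = set()

    def scale_from_zero(self, host, image_sha):
        key = (host, image_sha)
        if key in self._snapshots:
            return "restore"
        # A fresh boot pays the warmup; the snapshot is captured when the
        # worker next scales to zero, so later hits on this key can restore.
        self._snapshots.add(key)
        return "fresh"
```

Replaying hosts A, A, B, B on one image gives fresh, restore, fresh, restore. Changing the image SHA sends even a previously warm host back to fresh.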

Why do “cold” starts sometimes take 7 seconds and sometimes 110?

Because they’re hitting different parts of the per-host model. Four consecutive requests against the same endpoint, single-page parse on each, with a deliberate scale-to-zero between every one:

Request	Wall-clock	Host	Snapshot?	What the worker did
1	456 s	A (post-rebuild)	none	Image pull + fitness checks + warmup (101 s) + parse (5.6 s)
2	7.6 s	A (same as R1)	yes	Snapshot restore + parse (4.7 s)
3	122 s	B (different host)	none	Fitness checks + warmup (101.5 s) + parse (5.6 s)
4	7.4 s	B (same as R3)	yes	Snapshot restore + parse (4.6 s)

First hit on a fresh host pays ~110 s of boot work, about 101 s of which is the warmup. Every subsequent restore on that same host is ~7-8 s. A new host, when RunPod’s scheduler picks one, starts the cycle over.

The 456 s on Request 1 included a one-time image pull (the worker image is ~27 GB; this was the first time that physical host had ever seen it). Strip that off and you get ~110 s of actual boot work, in line with Request 3 (122 s wall-clock, of which 101.5 s is warmup and 5.6 s is the parse).
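The breakdown is plain subtraction over the numbers in the table: wall-clock minus warmup minus parse leaves everything else (image pull, fitness checks, scheduling, or the snapshot restore itself). A short sketch, using only the figures reported above:

```python
# Seconds, copied from the four-request table. Restored workers run no warmup.
REQUESTS = {
    1: {"wall": 456.0, "warmup": 101.0, "parse": 5.6},
    2: {"wall": 7.6, "warmup": 0.0, "parse": 4.7},
    3: {"wall": 122.0, "warmup": 101.5, "parse": 5.6},
    4: {"wall": 7.4, "warmup": 0.0, "parse": 4.6},
}


def overhead_seconds(request):
    """Wall-clock time not explained by warmup or the parse itself."""
    return round(request["wall"] - request["warmup"] - request["parse"], 1)


OVERHEAD = {number: overhead_seconds(req) for number, req in REQUESTS.items()}
```

The unexplained remainder is about 349 s on Request 1 (dominated by the image pull), about 15 s on Request 3, and under 3 s on the two restores.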

How can you tell if a request hit a snapshot restore?

By what’s missing from the worker logs. A FlashBoot-restored worker skips its boot sequence entirely — no fitness checks, no Python import logs, no vLLM engine initialization, no model load. The first log line is Jobs in queue: 1, immediately followed by your handler’s “starting job” entry.

Compare a fresh boot to a snapshot restore for the same request shape:

Fresh boot (Request 3):

04:45:45  Running 7 fitness check(s)...
04:45:46  All fitness checks passed. (1285.99ms)
04:45:46  [mineru-warmup] starting (backend=vlm-auto-engine ...)
04:45:51  Using vllm-async-engine as the inference engine for VLM.
04:46:23  Initializing a V1 LLM engine (v0.11.2) ...
04:46:47  Model loading took 2.1601 GiB memory and 18.41 seconds
04:47:14  torch.compile takes 22.81 s in total
04:47:17  init engine (profile, create kv cache, warmup model) took 30.66 seconds
04:47:18  get vllm-async-engine predictor cost: 87.26s
04:47:28  [mineru-warmup] done in 101.5s
04:47:28  Jobs in queue: 1
04:47:28  Started.
04:47:28  "starting job" {...}
04:47:34  "done" {...elapsed_seconds: 5.58...}

Snapshot restore (Request 4):

04:51:25  Jobs in queue: 1
04:51:25  Started.
04:51:25  "starting job" {...}
04:51:26  Using vllm-async-engine ...   (instant — engine handle restored from snapshot)
04:51:30  "done" {...elapsed_seconds: 4.58...}

No boot sequence. Three timestamps. The vLLM engine subprocess PID from the previous boot is reused — same EngineCore_DP0 pid=NNN from the snapshot. If you grep your own worker logs for the gap between Jobs in queue: 1 and the previous activity, you’ll see whether RunPod did a fresh boot or a snapshot restore.
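That check is easy to automate. The sketch below classifies a worker start by whichever appears first in its log: a boot-sequence marker or the Jobs in queue line. The marker strings are taken from the excerpts above and are specific to this worker, so treat them as assumptions to adapt. The input must begin at the worker's start for the answer to mean anything.

```python
# Substrings that only show up during a real boot of this particular worker.
BOOT_MARKERS = (
    "fitness check",
    "[mineru-warmup]",
    "Initializing a V1 LLM engine",
    "Model loading took",
)


def classify_start(log_lines):
    """Return "fresh boot", "snapshot restore", or "unknown" for one worker start."""
    for line in log_lines:
        if "Jobs in queue" in line:
            return "snapshot restore"  # a job arrived before any boot output
        if any(marker in line for marker in BOOT_MARKERS):
            return "fresh boot"
    return "unknown"
```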

What does the FlashBoot snapshot preserve?

Everything that lived in the worker process at snapshot time, mediated by CRIU semantics:

Python interpreter state. Module imports stay loaded. Globals (job counters, contextvars, signal handlers) keep their values. The MinerU engine registry returns the same handles it returned before the snapshot.
GPU VRAM. Model weights (~2.16 GiB for our VLM), vLLM’s KV cache (~8.17 GiB on a 24 GB card), and captured CUDA graphs (~0.3 GiB) all survive. The first request after restore parses with the same allocations it had before.
The subprocess tree. vLLM runs its engine in a child process for memory isolation. That subprocess gets captured along with the parent and restored with its IPC pipes intact. The engine PID persists.
torch.compile cache. The JIT-compiled Dynamo / Inductor output stays valid across restore. No 22-second recompile.

What doesn’t survive: the snapshot itself, eventually. RunPod doesn’t publish the eviction policy, but one clear trigger is an image rebuild (a new SHA invalidates the snapshot), and a long enough idle period on a busy host presumably pushes the snapshot out of local storage.

What broke before this worked? The asyncio gotcha

The “eager warmup at boot” idea is obvious in principle: run one throwaway parse during worker startup so the model is loaded and warm before any user request arrives. The implementation has one trap.

vLLM’s AsyncLLMEngine binds its IPC primitives (transports, queues) to the asyncio event loop that initialized it. If you call asyncio.run(warmup()) followed by runpod.serverless.start(), your warmup creates loop A, runs the parse, then tears loop A down when asyncio.run returns. Then runpod.serverless.start() creates loop B for serving. When the first user request tries to talk to the vLLM engine through loop B, the engine handle is bound to the now-dead loop A. Result:

"error_type": "EngineDeadError",
"error_message": "EngineCore encountered an issue. See stack trace (above) for the root cause."

The engine subprocess itself is still alive. It’s only the parent process’s IPC reference that’s broken.

The fix is to keep the warmup and the serve loop on the same asyncio event loop. RunPod’s runpod.serverless.start() internally calls asyncio.run(JobScaler.run()), but JobScaler (in runpod.serverless.modules.rp_scale) is constructible directly and its run() is an awaitable coroutine. So you can compose:

import asyncio
from runpod.serverless.modules import rp_ping, rp_scale
from runpod.serverless.modules.rp_fitness import run_fitness_checks

# handler, _concurrency_modifier, and warmup_async are defined elsewhere in the worker.
config = {"handler": handler, "concurrency_modifier": _concurrency_modifier, "rp_args": {}}

async def _bootstrap():
    await run_fitness_checks()
    await warmup_async()          # <- engine binds to THIS loop
    rp_ping.Heartbeat().start_ping()
    await rp_scale.JobScaler(config).run()   # <- and stays here

# One asyncio.run() for both phases, so warmup and serving share a single loop.
asyncio.run(_bootstrap())

Now both phases share one event loop. The engine handle stays valid across the warmup → serve transition. FlashBoot captures a snapshot of a process where the loop, the engine, and the IPC are all alive together. On restore, they come back together too.
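The loop-boundary failure can be reproduced without vLLM or RunPod at all. The sketch below uses a stand-in engine class that simply remembers which event loop created it and refuses calls from any other loop. That is a simplification of what a real loop-bound handle does, and every name in it is hypothetical.

```python
import asyncio


class LoopBoundEngine:
    """Stand-in for an engine handle tied to the loop that created it."""

    def __init__(self):
        self.loop = asyncio.get_running_loop()

    async def generate(self, prompt):
        if asyncio.get_running_loop() is not self.loop:
            raise RuntimeError("engine handle is bound to a different event loop")
        await asyncio.sleep(0)
        return f"parsed:{prompt}"


_engine = None


async def warmup():
    global _engine
    _engine = LoopBoundEngine()        # binds to whichever loop is running now
    await _engine.generate("warmup")


async def serve_one():
    return await _engine.generate("user-request")


def broken():
    asyncio.run(warmup())              # loop A is created, used, and closed
    try:
        return asyncio.run(serve_one())  # loop B: the handle still points at A
    except RuntimeError as exc:
        return f"failed: {exc}"


async def _both():
    await warmup()
    return await serve_one()


def fixed():
    return asyncio.run(_both())        # one loop for warmup and serving
```

Calling broken() returns the failure string, while fixed() returns the parsed result. The only difference between them is how many times asyncio.run() is called.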

This does reach into runpod-python’s internals (the runpod.serverless.modules.* submodules aren’t part of the documented public API). Cheap to guard against drift: a unit test that asserts JobScaler exists with the expected constructor and an awaitable run() method. If RunPod refactors, CI catches it before production does.
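Here is one way such a drift guard could look. The checker is written against an arbitrary class so it runs without runpod installed. The contract it enforces (constructible from one config argument, with an async run()) is the one described above, and the function name is made up for illustration.

```python
import inspect


def scaler_contract_problems(cls):
    """List the ways `cls` deviates from the JobScaler(config).run() shape."""
    problems = []
    try:
        # Can the class be called with exactly one positional argument?
        inspect.signature(cls).bind({})
    except (TypeError, ValueError):
        problems.append("constructor does not accept a single config argument")
    run = getattr(cls, "run", None)
    if run is None:
        problems.append("run() is missing")
    elif not inspect.iscoroutinefunction(run):
        problems.append("run() is not an async def coroutine function")
    return problems
```

In the real test suite you would import rp_scale and assert that scaler_contract_problems(rp_scale.JobScaler) is an empty list. An ImportError on the module path is then its own early warning.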

When does the warmup pay off and when doesn’t it?

Per host, not per endpoint. The math depends on your traffic pattern.

Scenario	Likely outcome
workers_min ≥ 1 (always-on worker)	Worker stays on its host. Every request is on a fully warm worker (~5 s parse). No cold starts at all.
High-frequency endpoint, workers scale up and down fast	Same hosts get re-selected. Most cold starts are happy-path restores (~7 s).
Quiet endpoint, infrequent requests, long idle gaps	RunPod’s scheduler may pick a different host. Some cold starts will be on new hosts (~110 s).
First request after a rebuild	Always cold path. Every endpoint’s first request after a fresh image pays ~5-7 min (image pull) + ~110 s (warmup). One-time per worker host.
MINERU_SKIP_WARMUP=1 (warmup off)	Every cold start is ~110-130 s. No per-host amortization. Don’t do this in production.

The case that stings is “quiet endpoint with sporadic traffic” — a few requests an hour, 10-minute idle gaps, RunPod bouncing between hosts. Without warmup, every cold start would be ~110-130 s. With warmup, you get a mix: some 7-second restores, some 110-second fresh boots. The mix tilts toward fast as the endpoint warms up across more hosts and RunPod’s scheduler starts re-selecting them.
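The mix can be turned into an expected cold-start time once you know what fraction of your cold starts land on a host that already holds a snapshot. That fraction depends on your traffic and is not something this post measured, so it is left as an input. The defaults are the rounded figures from the test runs (about 7.5 s for a restore, about 110 s for a fresh boot).

```python
def expected_cold_start(restore_fraction, restore_s=7.5, fresh_s=110.0):
    """Average cold-start seconds for a given share of snapshot restores."""
    if not 0.0 <= restore_fraction <= 1.0:
        raise ValueError("restore_fraction must be between 0 and 1")
    return restore_fraction * restore_s + (1.0 - restore_fraction) * fresh_s
```

The average moves linearly between the two extremes, so each additional host that has been warmed once pulls it toward the 7-second end.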

If your traffic is sustained enough that you can pin a worker (workers_min=1), you skip the entire question. You’re paying for the GPU 24/7 but never paying a cold start. For workloads with even modest cost sensitivity, the warmup + FlashBoot path is the better trade.

What this means if you’re shipping vLLM on RunPod

Three takeaways from the live measurements:

Always set up an eager warmup at worker boot. Loading the model lazily on the first request is worse than it sounds: the snapshot captures a worker without the model in VRAM, so every cold start reloads it (~110-130 s), and you forfeit the per-host amortization that makes the second hit on a host cheap. Without warmup, FlashBoot has nothing useful to snapshot.
Compose warmup and the serving loop under one asyncio.run(). If you asyncio.run() the warmup separately, the engine dies at the loop boundary. The fix is straightforward but the failure mode is opaque (EngineDeadError 75 ms into the first request) — easy to misdiagnose as a vLLM bug.
Don’t market your cold start as “X seconds” without acknowledging the per-host mix. A snapshot-restore cold start is genuinely 7-8 seconds. A new-host cold start is ~110 s. Both are big improvements over the no-warmup baseline (~110-130 s per request, every request). But your users will see the mix, and a too-clean claim makes the bad days look broken.

The whole investigation was on a 24 GB A5000 / RTX 4090 class GPU running MinerU’s 1.2B VLM via vLLM 0.11.2. The numbers will shift on larger models (more VRAM to snapshot, longer model load on cold path) but the mechanism applies the same way. If your cold start dominates wall-clock latency on a serverless GPU workload, set up boot-time warmup, watch the worker logs for the snapshot pattern, and tune your workers_min accordingly.

FAQ

Does FlashBoot snapshot the vLLM engine subprocess?

Yes. The vLLM engine runs as a child process for memory isolation, and FlashBoot’s CRIU-style mechanism captures the full process tree including subprocesses. The engine’s PID persists across snapshot/restore, and its IPC pipes back to the parent stay connected.

Why does my cold start take 60-120 seconds even with FlashBoot enabled?

Most likely your model is being loaded lazily on first request rather than at worker boot. FlashBoot only snapshots state that already exists in the worker process when it scales to zero. If your model loads on first request, the snapshot captures a worker without the model, and every cold start has to load the model again. Move the model load to worker boot (before runpod.serverless.start()) and FlashBoot will start carrying the warm state forward.

What’s the difference between FlashBoot and a network volume?

A network volume is shared file storage attached to your worker (e.g., for model weights you don’t want to bake into the Docker image). FlashBoot is process-state preservation — it captures the running Python process, including data already loaded from disk into VRAM. They solve different problems and can be used together: a network volume avoids re-downloading model files on image pull; FlashBoot avoids re-loading them into VRAM on cold start.

Does FlashBoot work for non-GPU workloads?

The mechanism (process snapshot via CRIU or equivalent) doesn’t depend on GPU memory specifically. CPU-bound workloads with significant cold-start cost (heavy library imports, large in-memory indices, JIT compilation) should benefit similarly. The framing in this post happens to use a GPU workload because that’s where the cold-start tax is most painful.

How do I know if my worker is hitting a snapshot restore vs a fresh boot?

Check the worker logs in the RunPod dashboard. A fresh boot shows fitness checks, framework init logs, and any warmup output. A snapshot restore is silent until the first Jobs in queue: 1 line, then jumps straight to your handler’s request-processing logs. The presence or absence of the boot sequence is the cleanest signal.

Is FlashBoot the same as RunPod’s “Active Workers” tier?

No. Active Workers are a billing tier where you pre-commit to a number of workers that are always on, billed at a discount in exchange for the 24/7 commitment. FlashBoot is a free runtime optimization that applies to flex (scale-to-zero) workers. The two can be combined: an Active Worker on the same endpoint can also benefit from FlashBoot when it cycles, though for a worker that never goes idle there’s nothing to snapshot.

Will FlashBoot survive a Docker image rebuild?

No. Each image gets its own SHA, and FlashBoot snapshots are scoped to (host, image SHA). When you push a new image, all existing snapshots are invalid. The first request after a rebuild on any host pays the full cold-start cost (image pull + warmup). Once each host has served the new image once, subsequent restores work normally.

What’s next

The runpod-mineru repo wraps all of this into one Docker image: MinerU 3.2.x + the MinerU2.5-Pro-2605-1.2B VLM, the JobScaler-bypass composition for warmup, structured logging, and the rest. Open source (GitHub), MIT-licensed, deploys from the RunPod Hub in two clicks.

If you want the deeper breakdown of which phases of a cold start cost what, the troubleshooting guide has the per-phase timing table from the same test runs. The scaling guide covers when to pair FlashBoot with workers_min ≥ 1 for fully predictable latency.

Disclosure: RunPod links in this post use a referral code that credits me at no cost to you. The post would read the same without it.