Observability
The worker ships with optional OpenTelemetry export of logs, traces,
and metrics. It is off by default: until you set
OTEL_EXPORTER_OTLP_ENDPOINT on your endpoint, the OTel SDK is never
loaded and the worker behaves exactly as it does today (direct-print
JSON logs to RunPod’s dashboard, no spans, no metric flush).
When you turn it on, the OpenTelemetry SDK is configured at worker boot to ship to any OTLP/HTTP-compatible backend. The template itself is vendor-neutral; the only Axiom-specific recipe lives in the Axiom blog post.
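The gating described above can be pictured as an environment check in front of a lazy setup call. This is a minimal sketch, not the worker's actual code: `telemetry_enabled` and `maybe_init_telemetry` are hypothetical names, `init` stands in for whatever performs the real SDK setup, and treating an empty value as unset is an assumption.

```python
import os
from typing import Callable, Mapping, Optional

ENDPOINT_VAR = "OTEL_EXPORTER_OTLP_ENDPOINT"


def telemetry_enabled(env: Mapping[str, str]) -> bool:
    """True only when the trigger variable holds a non-empty value (assumption)."""
    return bool(env.get(ENDPOINT_VAR, "").strip())


def maybe_init_telemetry(
    init: Callable[[], None],
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Call `init` (hypothetical SDK setup hook) only when telemetry is enabled.

    If the SDK imports live inside `init`, they never execute while the
    endpoint variable is absent.
    """
    env = os.environ if env is None else env
    if not telemetry_enabled(env):
        return False
    init()
    return True
```

Keeping the imports inside the setup hook is what makes "never loaded" literal: with the variable unset, no OpenTelemetry module is touched at all.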
What gets emitted
Traces. One span per job (mineru.job) and one span per phase
inside it: mineru.fetch_input, mineru.parse, mineru.package. The
boot-time warmup also gets its own span (mineru.warmup). Each span
carries the relevant attributes — backend, input_format, page range,
input/output sizes — for ad-hoc analysis.
Logs. Every line emitted by the worker’s structured logger is
mirrored to OTLP, additive to the existing stdout JSON. RunPod’s
dashboard remains the source of truth; the mirror lets you query the
same records from your downstream sink. Each record carries job_id
as an attribute and (when a span is active) the OTel trace and span
IDs so logs link back to traces in the UI.
Metrics. The full catalog is below, exported via the OTLP/HTTP metric exporter on a 10-second export interval. All histogram metrics use base-2 exponential bucket aggregation, not the SDK default of explicit buckets with fixed boundaries. Latency metrics span ms → minutes and byte-size metrics span KB → hundreds of MB; exponential buckets give roughly constant relative resolution across those ranges without per-metric bucket tuning. Any modern OTLP backend (Axiom, Honeycomb, Grafana Mimir, Datadog) accepts exponential histograms.
| Metric | Type | Labels | Question it answers |
|---|---|---|---|
| mineru.jobs.total | counter | status, backend, input_format | How many jobs, what outcome? |
| mineru.pages.total | counter | backend | How many pages: the $/page denominator. |
| mineru.bytes_in.total | counter | source (url/b64/volume) | Ingest volume by transport. |
| mineru.bytes_out.total | counter | transport (tarball_b64/inline/s3) | Egress volume by transport. Inline payloads are approximated as markdown text + image bytes (ignoring JSON overhead). |
| mineru.errors.total | counter | type, phase | What’s failing, where? |
| mineru.job.duration | histogram | backend, input_format | End-to-end wall-clock per job. |
| mineru.phase.duration | histogram | phase | Where time goes inside a job. |
| mineru.pages_per_second | histogram | backend | Throughput regression detector. |
| mineru.input.size_bytes | histogram | (none) | Input-size distribution. |
| mineru.output.size_bytes | histogram | transport | Output-size distribution. |
| mineru.worker.cold_starts.total | counter | (none) | Cold-start rate per endpoint. |
| mineru.worker.warmup.duration | histogram | backend, status (ok/error) | Boot-time warmup duration. |
| mineru.worker.refresh.total | counter | reason (jobs_threshold/pages_threshold/sigterm) | Why workers recycle. |
| mineru.worker.jobs_since_boot | gauge | (none) | Counts toward REFRESH_WORKER_AFTER_JOBS. |
| mineru.worker.pages_since_boot | gauge | (none) | Counts toward REFRESH_WORKER_AFTER_PAGES. |
| mineru.gpu.memory_used_bytes | gauge | device | VRAM usage; critical for tuning concurrency. |
| mineru.gpu.memory_total_bytes | gauge | device | Constant per pod; enables % math. |
| mineru.gpu.utilization_percent | gauge | device | SM utilization. GPU-bound vs CPU-bound. |
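To see why base-2 exponential buckets suit the histogram metrics above, it helps to work through the bucket mapping by hand. The sketch below follows the OpenTelemetry exponential-histogram definition, where bucket `i` covers `(base**i, base**(i + 1)]` and `base = 2 ** (2 ** -scale)`. The scale of 3 is purely illustrative (real SDKs typically pick the scale dynamically), and the sketch ignores floating-point rounding at exact bucket boundaries.

```python
import math
from typing import Tuple


def bucket_index(value: float, scale: int) -> int:
    """Index of the base-2 exponential bucket that holds a positive `value`."""
    if value <= 0:
        raise ValueError("exponential buckets only cover positive values")
    return math.ceil(math.log2(value) * 2**scale) - 1


def bucket_bounds(index: int, scale: int) -> Tuple[float, float]:
    """Lower (exclusive) and upper (inclusive) bound of bucket `index`."""
    base = 2.0 ** (2.0 ** -scale)
    return base**index, base ** (index + 1)


if __name__ == "__main__":
    # A 3 ms value and a 3 minute value (both in seconds) fall into
    # buckets whose upper/lower ratio is identical.
    for seconds in (0.003, 180.0):
        idx = bucket_index(seconds, scale=3)
        low, high = bucket_bounds(idx, scale=3)
        print(f"{seconds}s -> bucket {idx}, ratio {high / low:.4f}")
```

Because every bucket's upper bound is the same multiple of its lower bound, the relative error of a recorded value is the same whether it is a few milliseconds or a few minutes, which is what removes the need for per-metric boundaries.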
Resource attributes are attached to every signal (logs, spans, metrics):
- service.name: mineru-runpod by default; override with OTEL_SERVICE_NAME
- mineru.version: the MinerU library version baked into the image
- runpod.endpoint_id, runpod.pod_id, runpod.gpu_type, runpod.gpu_count: read from the RUNPOD_* env vars RunPod sets on every worker
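As a rough illustration of how these resource attributes could be assembled from the environment, consider the sketch below. It is not the worker's implementation: `build_resource_attributes` is a hypothetical helper, and the exact RUNPOD_* variable names in the mapping are assumptions, since this guide only says the values come from RUNPOD_* variables.

```python
import os
from typing import Dict, Mapping, Optional

# Assumed attribute -> env var mapping. The variable names are guesses for
# illustration; check the worker source for the ones it actually reads.
RUNPOD_ATTRS = {
    "runpod.endpoint_id": "RUNPOD_ENDPOINT_ID",
    "runpod.pod_id": "RUNPOD_POD_ID",
    "runpod.gpu_type": "RUNPOD_GPU_TYPE",
    "runpod.gpu_count": "RUNPOD_GPU_COUNT",
}


def build_resource_attributes(
    mineru_version: str,
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Collect the attributes attached to every log, span, and metric."""
    env = os.environ if env is None else env
    attrs = {
        # OTEL_SERVICE_NAME overrides the documented default.
        "service.name": env.get("OTEL_SERVICE_NAME") or "mineru-runpod",
        "mineru.version": mineru_version,
    }
    for attr, var in RUNPOD_ATTRS.items():
        value = env.get(var)
        if value:  # leave the attribute out when the variable is missing
            attrs[attr] = value
    return attrs
```

Omitting an attribute when its variable is missing is a design choice of this sketch, so that local runs outside RunPod still produce a valid resource.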
How to enable it
Set OTEL_EXPORTER_OTLP_ENDPOINT on your RunPod endpoint to the base
URL of an OTLP/HTTP collector. That single env var is the trigger —
without it, the SDK is never initialized.
| Env var | Purpose |
|---|---|
| OTEL_EXPORTER_OTLP_ENDPOINT | Base URL of your OTLP/HTTP backend. The SDK appends /v1/traces, /v1/logs, /v1/metrics per signal; do not append them yourself. Some vendors split management API and ingest onto different hostnames (Axiom is the obvious example: their ingest URL is the regional edge deployment, not api.axiom.co), so check your vendor’s OpenTelemetry doc, not their generic API doc, for the correct ingest hostname. |
| OTEL_EXPORTER_OTLP_HEADERS | CSV of key=value headers sent with every export (auth tokens, dataset IDs, etc.). |
| OTEL_EXPORTER_OTLP_PROTOCOL | http/protobuf (default) or http/json. Some backends require protobuf; see the per-vendor notes. |
| OTEL_SERVICE_NAME | Override the default mineru-runpod service name (useful when you run multiple endpoints into the same backend). |
| OTEL_RESOURCE_ATTRIBUTES | Comma-separated extra resource attributes (deployment.environment=prod,team=ml). Merged with the built-in attrs. |
These are the standard OpenTelemetry environment variables — every OTLP-compatible backend documents them.
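Both OTEL_EXPORTER_OTLP_HEADERS and OTEL_RESOURCE_ATTRIBUTES use the same comma-separated key=value shape, which is easy to get subtly wrong when a token itself contains `=`. The helper below is a simplified sketch for sanity-checking a value before you paste it into the endpoint config. `parse_key_value_csv` is a hypothetical name, and unlike the real SDK it does not handle percent-encoded values.

```python
from typing import Dict


def parse_key_value_csv(raw: str) -> Dict[str, str]:
    """Parse 'k1=v1,k2=v2' into a dict, splitting each pair on the first '='."""
    pairs: Dict[str, str] = {}
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate trailing or doubled commas
        key, sep, value = item.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected key=value, got {item!r}")
        pairs[key.strip()] = value.strip()
    return pairs


if __name__ == "__main__":
    print(parse_key_value_csv("deployment.environment=prod,team=ml"))
```

Splitting on the first `=` only means a header value such as a base64 token with trailing `=` padding survives intact.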
Per-signal overrides
OTel lets you route each signal type (traces, logs, metrics) through a different endpoint, protocol, or set of headers. Override on a per-signal basis with:
- OTEL_EXPORTER_OTLP_TRACES_ENDPOINT / ..._HEADERS / ..._PROTOCOL
- OTEL_EXPORTER_OTLP_LOGS_ENDPOINT / ..._HEADERS / ..._PROTOCOL
- OTEL_EXPORTER_OTLP_METRICS_ENDPOINT / ..._HEADERS / ..._PROTOCOL
The common case where this matters: a backend stores metrics in a
different dataset (or bucket, or stream) than traces and logs, and the
auth header has to identify which. The worker exposes one specific
override in its hub.json — OTEL_EXPORTER_OTLP_METRICS_HEADERS — but
all of the standard OTel per-signal env vars are read by the SDK
directly, so you can set them in your endpoint config and they take
effect immediately.
The base OTEL_EXPORTER_OTLP_ENDPOINT you set will have
/v1/traces, /v1/logs, /v1/metrics appended per signal
automatically. Set the base URL, not the full per-signal path —
otherwise the SDK will tack the suffix on a second time and the
backend will 404.
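The suffix rule is easiest to see by resolving the URLs by hand. The sketch below mimics the behavior described in the OpenTelemetry exporter specification: the base endpoint gets a per-signal path appended, while a per-signal endpoint is used exactly as written. `resolve_endpoint` is a hypothetical helper and `collector.example` is a placeholder hostname.

```python
from typing import Mapping

SIGNAL_PATHS = {"traces": "/v1/traces", "logs": "/v1/logs", "metrics": "/v1/metrics"}


def resolve_endpoint(signal: str, env: Mapping[str, str]) -> str:
    """Return the URL that `signal` would be exported to."""
    override = env.get(f"OTEL_EXPORTER_OTLP_{signal.upper()}_ENDPOINT")
    if override:
        return override  # per-signal URLs are used as-is, no suffix added
    base = env["OTEL_EXPORTER_OTLP_ENDPOINT"]
    return base.rstrip("/") + SIGNAL_PATHS[signal]


if __name__ == "__main__":
    good = {"OTEL_EXPORTER_OTLP_ENDPOINT": "https://collector.example"}
    bad = {"OTEL_EXPORTER_OTLP_ENDPOINT": "https://collector.example/v1/traces"}
    print(resolve_endpoint("traces", good))  # suffix appended once
    print(resolve_endpoint("traces", bad))   # suffix appended twice
```

The second call shows the doubled path that produces the 404. If you need a full, non-standard path for one signal, set that signal's own endpoint variable instead of the base one.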
Compatible backends
Any OTLP/HTTP collector. Tested or known-compatible:
- Axiom: OTLP-native, no agent. See the dedicated walkthrough for endpoint URLs, header layout, and the metrics-dataset gotcha.
- Honeycomb: set OTEL_EXPORTER_OTLP_ENDPOINT=https://api.honeycomb.io and OTEL_EXPORTER_OTLP_HEADERS=x-honeycomb-team=<key>.
- Grafana Tempo / Loki / Mimir: point at your Grafana Cloud OTLP gateway or self-hosted collector.
- Datadog: via their OTLP intake (requires the dedicated endpoint URL).
- Jaeger: via Jaeger’s OTLP receiver.
- Any OTel Collector: point at its OTLP/HTTP receiver and let the collector fan out to whatever downstreams you have configured.
If you write up a recipe for a backend not listed here, send a PR adding it as a blog post (one post per vendor, please — this guide stays vendor-neutral).
Performance notes
Enabling OTel adds ~200-500 ms to cold start (one-time SDK init plus the first export of resource attributes). Subsequent FlashBoot restores on the same host inherit the warm state from the process snapshot, so the cold-start cost only shows up on first-ever boots and image rebuilds.
The batch span / log / metric processors flush every 500 ms (spans, logs) or every 10 s (metrics) with a small in-memory queue, so per-request latency overhead is negligible — the export happens off the request path. If your collector is unreachable, the SDK retries with exponential backoff and drops batches that age out; the worker continues serving requests either way.
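The retry-then-drop behavior can be pictured with a small model. This is an illustrative sketch only: the 1-second first delay, doubling factor, and 60-second budget are assumed numbers, not the SDK's actual parameters, and real exporters often add jitter. `send` and `sleep` are injected so the logic can be exercised without a network or real waiting.

```python
import time
from typing import Callable, Iterator


def backoff_delays(
    first: float = 1.0, factor: float = 2.0, max_elapsed: float = 60.0
) -> Iterator[float]:
    """Yield retry delays until their running total would exceed `max_elapsed`."""
    elapsed = 0.0
    delay = first
    while elapsed + delay <= max_elapsed:
        yield delay
        elapsed += delay
        delay *= factor


def export_with_retry(
    send: Callable[[], bool],
    sleep: Callable[[float], None] = time.sleep,
    max_elapsed: float = 60.0,
) -> bool:
    """Try `send` once, then once after each backoff delay.

    Returns False when the retry budget is spent, at which point the
    caller drops the batch and moves on.
    """
    if send():
        return True
    for delay in backoff_delays(max_elapsed=max_elapsed):
        sleep(delay)
        if send():
            return True
    return False
```

In a setup like this the loop runs on the exporter's background thread, so a slow or dead collector costs dropped telemetry, not request latency.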
Failure modes
Bad endpoint URL or auth. The SDK logs Transient error HTTPConnectionPool(...) to stderr (visible in RunPod’s logs) and
retries. The worker stays healthy and continues serving — the
stdout-JSON log channel is unaffected.
Collector goes down mid-job. Same behavior as above. In-flight batches retry until they age out of the queue; new batches keep flowing. Logs in RunPod’s dashboard are unaffected (additive mirror, not replacement).
OTel SDK init crashes. The worker logs
[mineru-telemetry] init failed, continuing without OTel: ... to
stdout and proceeds to serve traffic without telemetry. This is
deliberate — a misconfigured exporter must never block worker boot.
Missing dependencies. The container always ships with the OTel packages installed; forks that strip them will fall into the “init crashes” path above (telemetry simply stays disabled). The worker still serves requests.
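Both the "init crashes" and "missing dependencies" cases reduce to the same guard: run the setup inside a broad try/except and carry on. The sketch below shows that pattern. `init_telemetry_safely` is a hypothetical name and `init` stands in for the real SDK setup; only the log message text is taken from this guide.

```python
from typing import Callable


def init_telemetry_safely(init: Callable[[], None]) -> bool:
    """Run `init` and swallow any failure so worker boot is never blocked."""
    try:
        init()
    except Exception as exc:
        # ImportError (stripped OTel packages) is caught here as well.
        print(f"[mineru-telemetry] init failed, continuing without OTel: {exc}")
        return False
    return True


if __name__ == "__main__":
    def broken_init() -> None:
        raise ImportError("No module named 'opentelemetry'")

    enabled = init_telemetry_safely(broken_init)
    print("telemetry enabled:", enabled)
```

Catching `Exception` broadly is normally discouraged, but it matches the stated goal here: no exporter misconfiguration should be able to stop the worker from serving.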
Reading worker logs
Independent of OpenTelemetry, the worker emits structured JSON to stdout on every log line; see Troubleshooting → Reading worker logs for the schema and field reference. The OTel logs export above mirrors those records to your OTLP backend; the stdout JSON is always the primary channel and remains queryable through RunPod’s log viewer regardless of whether OTel is enabled.
Set LOG_FORMAT=text on the endpoint for a human-readable single-line
format instead of JSON — useful for local development, less useful
once you’re shipping to an indexed sink.