Observability
Stages is the contract a microservice codes against; Ingest is how a result is stored. This page is for whoever runs the platform: it documents what the workflows engine (hub-workflows) writes to its logs and traces as it drives a run, so you can answer the questions that come up in practice — did this recording open a run? did it reach a stage? why not? did a stage time out? — and so the logs you forward to us are precise enough to act on.
The engine is deliberately chatty about its routing decisions: every run leaves a trace from the moment it opens to the moment it finalizes, even when it dispatches no stage at all. That is the case that used to be invisible (“a run was created but never reached a stage”) and is now explicit.
Where the logs come from
All of the lines below are emitted by the standalone hub-workflows engine — the service that consumes the classify hand-off, dispatches stages and tracks runs to completion. It is separate from your stage microservices, which log on their own (services.<name>.logLevel).
- Format — structured JSON, one object per line, written to stdout (collect it with your platform’s normal pod-log pipeline;
kubectl logs deploy/hub-workflows). - Verbosity — the
LOG_LEVELenvironment variable. Recognised values areerroranddebug; anything else (including unset) meansinfo.infoshows the full run lifecycle and routing outcomes plus all warnings and errors;debugadds the per-stage condition explanations;errorsuppresses everything but failures. - Levels you’ll see — logrus writes the level as
"level":"info","warning"or"error". Faults are emitted atwarningorerror, healthy lifecycle and routing atinfo, so you can alert on severity alone without parsing message text.
Anatomy of a log line
Every lifecycle and fault line carries a consistent set of correlation fields so you can pivot from one line to the whole run. A started line looks like:
{
"level": "info",
"msg": "workflow run started",
"time": "2026-06-24T08:30:01Z",
"traceId": "8f3a1c2b4d5e6f70",
"organisationId": "64f0a1b2c3d4e5f600112233",
"fileName": "front-gate/2026/06/12/08-30-00.mp4",
"mediaKey": "front-gate/2026/06/12/08-30-00.mp4"
}| Field | When present | What it is |
|---|---|---|
msg | always | The stable, greppable line identity — match on this exact string. |
level | always | info (healthy) / warning / error (faults). |
traceId | always | The distributed-trace id — the same id your stage receives on the WorkflowRun and the one to give us. Ties every line of one recording together across services. |
mediaKey | always | The recording reference (media key) the run is about. The primary key to grep a single recording. fileName mirrors it. |
organisationId | always | The owning organisation — scope when triaging a single tenant. |
runId | once the run is loaded | The run’s database id (workflow_runs._id) — on dispatch, routing, completion and ingest lines. |
operation | stage / routing / ingest lines | The stage’s operation id (e.g. speed). |
reason | routing & ingest-failure lines | Why a decision went the way it did (e.g. permanent persistence rejection, dropping, empty media key). |
error | error lines | The underlying error message. |
Every line in this document — the lifecycle, the routing decisions and the ingest/persistence failures — carries these correlation fields as structured keys (
traceId,mediaKey,organisationId, andrunIdonce the run is loaded). You can filter the entire life of one recording onmediaKeyortraceIdalone and group faults byoperation/reason; you never have to parse values out of the message text.
The run lifecycle (info)
A healthy run emits these, in order — they are your “happy path” trace. Seeing started then completed with no stage dispatched in between is the signature of a run created but never reached a stage.
msg | Meaning | Extra fields |
|---|---|---|
workflow run started | A new run document was seeded for this recording. | — |
workflow stage dispatched | A stage was published to its queue — the run reached a stage. One per dispatched stage. | operation, queue, runId |
workflow upstream operation resolved | A stage’s result came back and was recorded on the run. | operation |
workflow run completed | The run was finalized cleanly — every dispatched stage resolved. | runId, dispatchedStages, resolvedStages, timedOut (false) |
dispatchedStages / resolvedStages on the completion line let you confirm at a glance how much the run did: dispatchedStages: 0 is the explicit never reached a stage outcome.
Routing decisions (info & debug)
When a result lands (or a run opens), the engine re-evaluates the conditional stages and says what it decided. Every line carries mediaKey, runId and the rest of the correlation fields:
| Level | msg | Meaning | Extra fields |
|---|---|---|---|
info | workflow conditional stages matched | At least one conditional stage’s rule held and is being dispatched. | matchedOperations, runId |
info | workflow no conditional stages matched | Nothing matched — the result advanced the run without opening a new stage. The availableOperations field is the readiness gate: it lists exactly which upstream results were present when the decision was made, so you can see why a needs gate didn’t open. | availableOperations, runId |
debug | workflow conditional stage routing decision | Per-stage explanation — matched (bool) plus a reason naming which need passed or failed and the actual value seen. Turn on LOG_LEVEL=debug when a conditional stage isn’t firing and you need to know which predicate is the blocker. | operation, matched, reason, runId |
debug | workflow stage dispatch skipped | The stage was already queued for this run; the engine never dispatches a stage twice (reason: already dispatched (fire-once)). | operation, reason, runId |
Troubleshooting a stage that never fires: set
LOG_LEVEL=debugand read itsworkflow conditional stage routing decisionline —matched: falsewith thereasonnaming the failingneedand the value it saw, which is almost always a path/value mismatch in the condition or an upstream that wasn’t inavailableOperationsyet.
Faults — warnings & errors
These are the lines to alert on and the ones most useful to send us. All carry the correlation fields above; the error field holds the underlying cause.
| Level | msg | What happened / what to check |
|---|---|---|
warning | workflow run timed out with stages outstanding | The watchdog finalized a run because a dispatched stage never returned (a stuck or crashed microservice). dispatchedStages > resolvedStages; unresolvedStages is the count abandoned. Check that the named stage’s microservice is up and acking. |
warning | workflow event dropped | An inbound message was discarded before it could open or advance a run. reason says why (e.g. empty media key); operation is the message’s op. A steady stream means a malformed upstream producer. |
error | workflow stage dispatch failed | The engine could not enqueue a stage — the run never reached it. operation names the stage; error is the cause (queue publish error, or the dispatch payload failed to marshal). Check broker connectivity and the stage’s queue. |
error | workflow stage dispatched but recording it failed | The stage was published, but the engine couldn’t record it on the run’s dispatched tier. The worker will still run; the run’s bookkeeping may be slightly off. Usually a transient database blip. |
error | HandleNewRun: failed to load workflow run / … failed to insert workflow run | The engine couldn’t read or seed the run document — a database problem. The run did not open. |
error | finalizeIfComplete: failed to stamp run end | The engine couldn’t mark a run finished. The run stays open; the watchdog will retry on a later message. |
Result-persistence (ingest) failures
When a stage hands back a block envelope, the engine persists it through the shared ingest core. These lines carry the full correlation fields plus operation; the error cases use a reason field to distinguish the failure site, so you alert on msg+level and group by reason:
| Level | msg | reason / fields | What happened |
|---|---|---|---|
error | workflow stage result ingest failed | reason: transient persistence failure, retrying (+ operation, error) | A sink write failed transiently — the message is re-queued (idempotent, so safe). Repeated lines for one recording mean a sink is down. |
error | workflow stage result ingest failed | reason: permanent persistence rejection, dropping | A deterministic rejection (schema/validator or duplicate key) — the block is dropped, not retried. This is real data loss for that block; fix the block’s shape. |
error | workflow stage result ingest failed | reason: malformed block envelope | The stage’s result body didn’t parse as a block envelope — a bug in the stage’s output. Recorded without ingest. |
error | workflow stage result ingest failed | reason: could not load run, retrying | The engine couldn’t reload the run for its retention clock — a transient database blip; re-queued. |
warning | workflow stage result dropped | operation, error | An unknown/forbidden block type or a validation failure — recorded without ingest, not retried. |
warning | workflow stage result ingest warning | operation, detail | The ingest core stored the envelope but flagged a non-fatal warning. |
Reading a run end-to-end
Pick the recording’s mediaKey (or its traceId) and filter the engine logs on it; the lifecycle reads as a short story. A few signatures:
Healthy run with a conditional stage
workflow run started
workflow conditional stages matched matchedOperations=[speed] runId=…
workflow stage dispatched operation=speed queue=hub-workflows-speed runId=…
workflow upstream operation resolved operation=speed
workflow run completed dispatchedStages=1 resolvedStages=1 timedOut=falseCreated but never reached a stage (the previously-invisible case)
workflow run started
workflow no conditional stages matched availableOperations=[classify] runId=…
workflow run completed dispatchedStages=0 resolvedStages=0 timedOut=false→ Working as designed if no stage’s rule matched this recording; if you expected one to, switch to LOG_LEVEL=debug and read the workflow conditional stage routing decision line (matched: false) and its reason.
Stuck / crashed stage
workflow run started
workflow stage dispatched operation=speed …
workflow run timed out with stages outstanding dispatchedStages=1 resolvedStages=0 unresolvedStages=1→ The speed microservice received the run but never routed it back. Check that pod.
Distributed tracing
The engine participates in OpenTelemetry tracing under the service name hub-workflows. For every message it continues the trace carried on the run’s traceId and opens a span tagged with the operation and fileName, so a single recording’s spans line up across the analysis service, the engine and your stages in your tracing backend.
The same traceId is on the WorkflowRun your stage receives. Propagate it on your stage’s own logs and spans (see the traceId field) and the whole run — pipeline → engine → your microservice — stays correlatable under one id. If the tracing backend is unreachable the engine logs failed to connect to tracing backend (at startup) or tracing failed for event (per message) and keeps processing — tracing is best-effort and never blocks a run.
Metrics
The engine exposes Prometheus metrics on :8080/metrics — the hook your monitoring scrapes. Alongside the engine’s queue processing time (<queue>_processing_time), it exports lifecycle and fault counters so you can alert on rates and ratios across the fleet without parsing logs:
| Metric | Type | What it counts |
|---|---|---|
workflow_runs_started_total | counter | Runs opened (a run document was seeded). |
workflow_runs_completed_total | counter | Runs finalized cleanly (every dispatched stage resolved). |
workflow_runs_timed_out_total | counter | Runs the watchdog forced shut with a stage still outstanding (a stage never returned). |
workflow_runs_no_stage_total | counter | Runs that finalized without ever dispatching a stage (created but never reached a stage). A subset of …completed_total. |
workflow_stages_dispatched_total | counter | Stages successfully published to their queue. |
workflow_stage_dispatch_failures_total | counter | Stage dispatch attempts that failed (the stage never reached its worker). |
workflow_stage_ingest_failures_total | counter (reason) | Stage-result ingest failures, labelled reason = transient_persistence | permanent_rejection | malformed_envelope | run_load. |
workflow_stage_ingest_dropped_total | counter | Stage-result blocks dropped for a non-retryable reason (unknown/forbidden type or validation failure). |
workflow_events_dropped_total | counter | Inbound messages discarded before they could open or advance a run. |
Useful signals to alert on: a rising workflow_runs_timed_out_total (stuck stages), any workflow_stage_ingest_failures_total{reason="permanent_rejection"} (real data loss), and a high workflow_runs_no_stage_total / workflow_runs_completed_total ratio (recordings opening runs but matching no stage). Each counter has a one-to-one structured log line above, so a metric spike points straight at the lines to read.
Retries & dead-lettering
A run whose result can’t be persisted is re-queued rather than lost, but the loop is bounded so a permanently-failing stage can’t grow the queue forever:
WORKFLOWS_MAX_RETRIES(default 5) caps how many times one run is re-queued; past the cap it is sent to the dead-letter queue.- Retries are delayed by a fixed backoff (5s) and a malformed message body (not a
WorkflowRun) is dead-lettered immediately.
If runs are disappearing, check the dead-letter queue and the workflow stage result ingest failed lines (reason: transient persistence failure, retrying) — a recording that retries up to the cap and then stops is being dead-lettered.
Sending us accurate logs
When you open an issue, the single most useful thing is the engine’s JSON log lines for the affected recording, unredacted of the correlation fields:
- Find the recording’s
mediaKey(ortraceId). - Grep the
hub-workflowslogs for it and include the whole run —startedthroughcompleted/timed out. - Include any
warning/errorlines (they carry theerrorfield with the root cause). - If a
conditionalstage isn’t firing, re-run withLOG_LEVEL=debugand include theworkflow conditional stage routing decisionlines (matched: true/falsewith theirreason) for that stage. - Note the
traceIdso we can line it up with the analysis service and your stage.
That set lets us reconstruct exactly what the engine decided and why, without guessing.