Check our lag before you trust our verdict

A threat feed's verdict is only as current as its data. Most feeds won't tell you when they went stale. IntrusionLabs publishes every stage of its pipeline — ingestion, aggregation, enrichment, campaign detection, per-collector heartbeats — at a public endpoint. Point your monitoring at it; gate on it.

TL;DR

GET /api/v1/health/ returns JSON covering ten subsystems, each with a status (ok / degraded / down) and the timing details behind that status. Right now 3/3 edge collectors are reporting healthy. Every severity threshold is published below — nothing is hidden behind a "trust us, we're fresh" badge.

// Ten subsystems, one endpoint

Key	What it covers	Probe
database	PostgreSQL operational DB	Connection latency probe
timeseries_db	TimescaleDB honeypot event store	Connection latency probe
cache	Django cache layer	Set/get round-trip probe
valkey	Valkey pub/sub bus	Ping + publish/subscribe round-trip
collectors	Edge node heartbeats	Per-node last_seen_at freshness
ingestion	Event ingestion rate	Events received in last 1h / 10m
pipeline	Aggregation lag	Raw-event to actor-update gap, per-node lag
enrichment	Hostname classification	Backlog + classification age
campaign_freshness	Campaign detection	Time since last active campaign update
threats	Active threat volume	Active campaigns + 24h actor count (informational)

// Severity thresholds (the ones you can alert on)

Every threshold below is defined in apps/api/health.py. If you want your monitoring to page when our data goes stale, these are the numbers to gate on.

Metric	OK	Degraded	Down	Meaning
ingestion_lag_seconds	≤120s	120–600s	>600s	Time since newest raw event arrived
aggregation_lag_seconds	≤600s	600–900s	>900s	Gap between newest raw event and actor aggregation
collector heartbeat	≤5 min	5–15 min	>15 min	Per-node last_seen_at age
enrichment age	≤2h	2–6h	>6h	Time since most recent hostname classification
campaign age	≤2h	2–6h	>6h	Time since newest active campaign update
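
If you pin your own copy of the table, the mapping from a raw number to a status is small enough to keep in your own code. The following is a minimal sketch, not the contents of apps/api/health.py. The threshold values come from the table above; the dictionary keys for the last three rows are hypothetical labels, since the table names those rows only as "collector heartbeat", "enrichment age", and "campaign age".

```python
# Pinned copy of the published thresholds, in seconds: (ok_max, degraded_max).
# Anything above degraded_max is "down".
# ASSUMPTION: the last three keys are illustrative names, not field names
# confirmed by the health response.
THRESHOLDS = {
    "ingestion_lag_seconds": (120, 600),
    "aggregation_lag_seconds": (600, 900),
    "collector_heartbeat_age_seconds": (5 * 60, 15 * 60),
    "enrichment_age_seconds": (2 * 3600, 6 * 3600),
    "campaign_age_seconds": (2 * 3600, 6 * 3600),
}


def classify(metric: str, seconds: float) -> str:
    """Map a raw age or lag in seconds to ok / degraded / down."""
    ok_max, degraded_max = THRESHOLDS[metric]
    if seconds <= ok_max:
        return "ok"
    if seconds <= degraded_max:
        return "degraded"
    return "down"
```

Boundary values follow the table as written: exactly 120s of ingestion lag is still ok, and exactly 600s is degraded rather than down.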

// Why publish this at all

Every threat feed has outages. Sensors go down, aggregation jobs hang, feeds miss a sync. The question is whether you, the consumer, find out. When a feed's pipeline goes stale and the feed keeps serving yesterday's verdicts as though they were current, you make block/allow decisions on data that's lying to you about its freshness.

IntrusionLabs' approach: publish the lag, let you gate on it. If aggregation_lag_seconds exceeds 900, skip the current query and check back when it recovers. If a specific collector stopped reporting, the collectors.nodes block names it. If the enrichment queue is backed up, the enrichment block says so.
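
A gate like that can be a few lines on the consumer side. This sketch assumes the health JSON has already been fetched and parsed into a dict, and that the lag sits at pipeline.aggregation_lag_seconds. That path is an assumption; check the live response for the real location. The function name should_query is hypothetical.

```python
from typing import Any, Mapping, Tuple

AGGREGATION_DOWN_SECONDS = 900  # the "down" bound for aggregation lag


def should_query(health: Mapping[str, Any]) -> Tuple[bool, str]:
    """Return (go_ahead, reason) for running a query right now.

    ASSUMPTION: the lag is exposed as
    health["pipeline"]["aggregation_lag_seconds"]; the real path may differ.
    """
    pipeline = health.get("pipeline") or {}
    lag = pipeline.get("aggregation_lag_seconds")
    if lag is None:
        # Fail closed: a response without the number is treated as stale.
        return False, "aggregation lag missing from response"
    if lag > AGGREGATION_DOWN_SECONDS:
        return False, f"aggregation lag {lag}s exceeds {AGGREGATION_DOWN_SECONDS}s"
    return True, "aggregation lag within limit"
```

Failing closed on a missing field is a design choice here: a consumer that prefers availability over freshness could return True in that branch instead.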

This is what "threat intelligence with receipts" means in practice — the receipts include the timestamps.

// Per-collector detail

The collectors.nodes block of the response names every active edge collector, its location code, the timestamp of its last ingest, minutes since that ingest, and the collector's git SHA. 3/3 are healthy right now. The pipeline.per_node block cross-references that against actual event arrival lag, so a collector that's heartbeating but not forwarding events is visible as "heartbeat ok, event lag elevated."
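
The same cross-reference can be done client-side. The sketch below is illustrative only: the block names collectors.nodes and pipeline.per_node come from the response, but the shapes (a list of node objects, a mapping keyed by node name) and the field names name, minutes_since_last_seen, and lag_seconds are hypothetical. The 120-second cutoff for "elevated" is also an assumption, borrowed from the ingestion ok bound.

```python
from typing import Any, List, Mapping

HEARTBEAT_OK_MINUTES = 5    # "ok" bound for collector heartbeat
EVENT_LAG_OK_SECONDS = 120  # ASSUMPTION: reuse the ingestion "ok" bound


def silent_collectors(health: Mapping[str, Any]) -> List[str]:
    """Names of collectors that heartbeat on time but forward events late.

    ASSUMPTION: field names and shapes inside collectors.nodes and
    pipeline.per_node are hypothetical; adapt them to the real response.
    """
    nodes = health.get("collectors", {}).get("nodes", [])
    per_node = health.get("pipeline", {}).get("per_node", {})
    flagged = []
    for node in nodes:
        name = node.get("name")
        heartbeat_min = node.get("minutes_since_last_seen")
        lag = per_node.get(name, {}).get("lag_seconds")
        if heartbeat_min is None or lag is None:
            continue  # not enough data to judge this node
        if heartbeat_min <= HEARTBEAT_OK_MINUTES and lag > EVENT_LAG_OK_SECONDS:
            flagged.append(name)
    return flagged
```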

New collector geographies come online as we add them. The version field exposed at the top level of the health response reflects the exact git SHA running in production — handy for correlating a behavior change with a deploy.

// How to use it

Pre-flight every batch query. If the top-level status is not ok, defer the query or degrade gracefully.
Alert on your own infrastructure. Scrape the endpoint into Prometheus; alert when aggregation_lag_seconds or ingestion_lag_seconds exceeds its threshold above.
Correlate with deploy SHA. The version field at the root of the response is the exact commit running in production. Useful for "did something change at this time?" investigations.
Check before publishing conclusions. If you're about to write up a finding sourced from IL data, sanity-check freshness at the time the query ran.
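
Prometheus does not ingest arbitrary JSON, so the alerting step needs a small translation layer. One way to do it, sketched here under assumptions, is to convert the parsed health response into Prometheus text exposition lines and serve or write them from your own side (for example through node_exporter's textfile collector). The il_health_ metric names and the subsystem label are hypothetical, and the sketch assumes each subsystem block is a JSON object carrying a status string plus numeric fields whose names end in _seconds.

```python
from typing import Any, List, Mapping

STATUS_VALUE = {"ok": 0, "degraded": 1, "down": 2}


def to_prometheus(health: Mapping[str, Any]) -> str:
    """Render subsystem statuses and *_seconds fields as exposition text.

    ASSUMPTION: metric names and the per-subsystem layout are illustrative.
    """
    status_lines: List[str] = []
    lag_lines: List[str] = []
    for key, block in health.items():
        # Skip top-level scalars such as the overall status and version.
        if not isinstance(block, dict) or block.get("status") not in STATUS_VALUE:
            continue
        value = STATUS_VALUE[block["status"]]
        status_lines.append(f'il_health_status{{subsystem="{key}"}} {value}')
        for field, number in block.items():
            is_number = isinstance(number, (int, float)) and not isinstance(number, bool)
            if field.endswith("_seconds") and is_number:
                lag_lines.append(f'il_health_{field}{{subsystem="{key}"}} {number}')
    # Sorting keeps all samples of one metric name in a single group.
    return "\n".join(status_lines + sorted(lag_lines)) + "\n"
```

Exporting the raw seconds alongside the status lets alert rules use your own pinned thresholds instead of depending on the status string.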

// Honest about limits

The health endpoint reports only what our own code can see. If the Django process is up but a downstream dependency silently returns bad data, this endpoint won't catch it. Treat an ok status as a necessary condition for trusting the data, not a sufficient one.

Thresholds are current as of this writing; they'll change as the sensor footprint grows. If we tighten them, this page will be updated. If you want stable alerting, pin your own copy of the threshold table and alert on the raw numbers rather than on the status string.

// See also

GET /api/v1/health/: the live endpoint
Confidence scoring: the published formula for the verdicts themselves
Provenance: every verdict traces to its raw event