Features / Confidence scoring
// Deep dive — for CTI practitioners

The formula is on the page

Confidence scores come from six weighted signals, and the weights themselves are published on this page. Every score is reproducible from the actor's data. If you don't like how we weight cross-sensor visibility against external feeds, the numbers are right here and you can rescore in your own pipeline.

TL;DR

Six signals, published weights, summing to 1.0. No ML in the loop — the weights are hand-tuned, and the values you see below are the whole classifier. The median confidence on our 14,356 non-benign actors is 0.28. That low-looking median is correct, not a bug: most actors are single-sensor, scanner-shaped activity with no external corroboration, and the formula is calibrated to score that accurately.

// Six signals, six weights

Each signal maps the actor's data to a 0.0 – 1.0 value. The final confidence is the weighted sum, clamped to 0.0 – 1.0.

Signal                     Weight
Cross-sensor visibility    0.30
Interaction depth          0.25
Recency                    0.13
External corroboration     0.12
Event volume               0.12
Protocol breadth           0.08
Total                      1.00

Code: apps/threats/aggregation.py:calculate_confidence — the function is about 40 lines. Each signal is computed independently; no cross-signal interaction beyond the weighted sum. That's a design choice: every score is auditable as the sum of six independently-verifiable terms.

// What each signal rewards

Cross-sensor visibility (w=0.30)

1.0 if seen by ≥3 of our sensors, 0.5 if 2, 0.2 if 1. Cross-sensor visibility is the strongest single signal we have — an actor our Seattle sensor sees that our Singapore sensor also sees is hitting the internet broadly enough to cross both hemispheres.

Interaction depth (w=0.25)

0.0 (pure port probe) to 1.0 (malware dropper). Uses the depth score from the session classifier — scanner 0.15, credential_harvester 0.4, reconnaissance 0.6, interactive_operator 0.9, malware_dropper 1.0.

Recency (w=0.13)

Linear decay from 1.0 at last_seen=now to 0.0 at last_seen=7d ago. Past 7 days, no recency credit — the actor's score becomes purely about the other five signals. That's the default window; analysts can query historical periods separately.
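The linear decay is simple enough to sketch in a couple of lines. This is an illustration of the rule described above, not the production code — the function name and hour-based units are assumptions:

```python
# Illustrative sketch of the 7-day linear recency decay (names are hypothetical).
def recency_signal(hours_since_last_seen: float, window_hours: float = 7 * 24) -> float:
    """1.0 at last_seen=now, linearly down to 0.0 at the window edge, floor at 0.0."""
    return max(0.0, 1.0 - hours_since_last_seen / window_hours)

recency_signal(0)    # seen just now        -> 1.0
recency_signal(84)   # 3.5 days ago         -> 0.5
recency_signal(200)  # past the 168h window -> 0.0
```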

External corroboration (w=0.12)

1.0 for ≥3 feeds, 0.7 for 2, 0.4 for 1, 0.0 for none. See /features/fusion/ for the list of feeds and their licensing.

Event volume (w=0.12)

Log scale: ln(1 + events) / ln(1 + 1000). Saturates near 1000 events so a massive scanner doesn't crowd out other signals. A persistent low-volume actor and a big-volume scanner score similarly on this axis once both cross ~100 events.
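To see the saturation behavior concretely, here is the stated formula as a sketch (the clamp at 1.0 for counts past 1000 is our assumption; the published formula only defines the curve up to the saturation point):

```python
import math

# Illustrative sketch of the log-scale volume signal, clamped at 1.0 (assumption).
def volume_signal(events: int, saturation: int = 1000) -> float:
    return min(1.0, math.log1p(events) / math.log1p(saturation))

# The curve compresses large counts: ~100 events already earns most of the credit.
for n in (10, 100, 1000, 100000):
    print(n, round(volume_signal(n), 2))  # ≈ 0.35, 0.67, 1.0, 1.0
```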

Protocol breadth (w=0.08)

1.0 if the actor touches ≥2 protocols (SSH + HTTP, say), 0.0 otherwise. Multi-protocol scanning is a modest positive signal — single-protocol is neither rewarded nor penalized, and most scanners are multi-protocol anyway.
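Putting the six mappings together, a hand-rolled rescorer might look like the sketch below. This is built only from the published weights and step tables above — the signal names, the `step` helper, and the example actor are illustrative, not the production `calculate_confidence`:

```python
# Published weights, summing to 1.0.
WEIGHTS = {
    "cross_sensor": 0.30,
    "depth": 0.25,
    "recency": 0.13,
    "corroboration": 0.12,
    "volume": 0.12,
    "breadth": 0.08,
}

def step(count: int, thresholds: dict) -> float:
    """Map a count onto a stepped 0-1 scale, e.g. {3: 1.0, 2: 0.5, 1: 0.2}."""
    for floor, score in sorted(thresholds.items(), reverse=True):
        if count >= floor:
            return score
    return 0.0

def confidence(signals: dict) -> float:
    """Weighted sum of six 0-1 signals, clamped to 0-1."""
    total = sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)
    return min(1.0, max(0.0, total))

# A hypothetical actor: two sensors, recon-depth sessions, seen just now,
# one corroborating feed, ~100 events, multi-protocol.
signals = {
    "cross_sensor": step(2, {3: 1.0, 2: 0.5, 1: 0.2}),    # -> 0.5
    "depth": 0.6,                                          # reconnaissance
    "recency": 1.0,                                        # last_seen = now
    "corroboration": step(1, {3: 1.0, 2: 0.7, 1: 0.4}),    # -> 0.4
    "volume": 0.67,                                        # ~100 events, log scale
    "breadth": 1.0,                                        # SSH + HTTP
}
print(round(confidence(signals), 2))  # -> 0.64
```

Note how this actor lands just inside the high-confidence tail: no single signal is maxed out, but enough of them reinforce each other.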

// The live distribution

The current confidence distribution across our 14,356 non-benign actors. Benign actors are capped at 0.1 by the benign-override rule, so they're excluded here to avoid flattening the histogram.

Confidence    Actors
0.0 – 0.2     5,394 (37.6%)
0.2 – 0.4     6,024 (42.0%)
0.4 – 0.6     2,714 (18.9%)
0.6 – 0.8       224 (1.6%)
0.8 – 1.0         0 (0.0%)

At first glance the distribution looks discouraging — most actors score below 0.4 and very few score above 0.6. A casual reader sees this and thinks "their scoring is broken" or "their data is thin."

The reality is the opposite: the formula is calibrated to score these actors this way, and the shape is correct. Most of the internet's attack traffic is single-sensor, low-depth, low-volume scanner-shaped activity — a VPS in Vultr running masscan for a day, then rotating. That's what 0.0 – 0.4 is supposed to represent. The high-confidence tail above 0.6 is where the formula is saying "this one is different": multiple sensors, deep interaction, independent external corroboration, recent activity. You want that tail to be small — it's the thing you want to treat as real.

A CTI team evaluating our data should set their alert threshold at the shape of the tail, not at some arbitrary "50% confident = real threat" assumption. For us, 0.6 is the floor where every signal is reinforcing every other. That's a different scale than services that compress every verdict into a 0 – 100 number — a 60 on AbuseIPDB is not the same as a 0.6 here.

// The high-confidence tail right now

Actors currently at confidence 0.6 or higher — the ones where multiple signals agree. Click through to see which signals are firing for each.

Actor             Confidence
193.46.255.86     0.70
202.188.47.41     0.70
209.141.41.212    0.70
20.203.42.204     0.69
101.36.117.234    0.69
182.18.161.165    0.69

// What the formula won't do

  • Learn from new data. There's no ML in the loop. The weights are hand-tuned from watching how well each signal separated obvious threats from obvious noise during initial calibration. A machine-learned classifier might find weights that score better on some metric, but then the answer to "why did you score this actor 0.7?" becomes a gradient explanation, not a sum. We'd rather stay at a heuristic that an analyst can reproduce by hand.
  • Weight the signals differently per actor. Every actor is scored by the same six-term sum. A scanner and a malware dropper get the same formula applied to their respective inputs; the inputs differ because the behaviors differ, not because we apply different math to different actor types. That's simpler to explain and defensible.
  • Inflate for marketing. The median confidence being 0.28 is not a problem we want to solve by adjusting the weights. Anyone who makes the median confidence in their reputation product look like 0.7 is either scoring a curated subset or adjusting the weights to make the number look better. Our median tells you honestly what most internet attack traffic is: scanner-shaped, single-sensor, low-volume — and the formula is scoring it that way.

// How to use it

Set your alert floor at the tail
GET /api/v1/threats/ips?min_confidence=0.6

0.6 is where every signal is reinforcing every other — a defensible default floor for an automated blocklist consumer.

Rescore in your own pipeline

All six signal inputs are on every actor via the API. If you want to weight external corroboration higher, pull the data and recompute — the score you produce is as defensible as ours.
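As a sketch of what rescoring could look like, assuming you've already mapped an actor's API fields to the six 0-1 signal values (the field names and the reweighting below are illustrative):

```python
def rescore(signals: dict, weights: dict) -> float:
    """Recompute confidence with your own weights; keep them summing to 1.0."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights should sum to 1.0"
    total = sum(weights[k] * signals[k] for k in weights)
    return min(1.0, max(0.0, total))

# Example: bump external corroboration, trim cross-sensor, sum stays 1.0.
custom = {
    "cross_sensor": 0.22, "depth": 0.25, "recency": 0.13,
    "corroboration": 0.20, "volume": 0.12, "breadth": 0.08,
}
signals = {"cross_sensor": 1.0, "depth": 0.9, "recency": 0.8,
           "corroboration": 1.0, "volume": 0.7, "breadth": 1.0}
print(round(rescore(signals, custom), 2))  # -> 0.91
```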

Read the actual code

The formula lives in apps/threats/aggregation.py:calculate_confidence. It's ~40 lines, all six signals, all weights. No hidden state.

Combine with intent

Confidence answers "how sure?" Intent answers "of what?" Use them together: /features/scanner-classification/ covers intent, this page covers confidence.

// See also