AML's False Positive Problem: Why 95% Is the New Baseline

When we talk about the false positive rate in AML transaction monitoring, 95% is not hyperbole. It is the figure most compliance officers at mid-size digital banks will recognize from their own alert queue data. For every 100 alerts the system generates, 95 of them (on average, in a pure rules-based environment) will be reviewed and closed as non-suspicious. The analyst hours spent on those 95 alerts are not recoverable, and the few true positives buried somewhere in the batch may not get the attention they warrant.

Why the Number Is So High

Rules-based transaction monitoring works by flagging any transaction that meets a defined threshold. A velocity rule might fire when a customer sends more than 10 transfers in a 24-hour period. A dollar-amount rule fires when an ACH pull exceeds $5,000. These rules are calibrated conservatively — by design — because the cost of missing a suspicious transaction is treated as higher than the cost of reviewing a non-suspicious one.
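As a minimal sketch of the two rules described above, assuming a simplified transaction record: the `Txn` class, its field names, and the `kind` labels are hypothetical and only there for illustration. The thresholds are the ones quoted in the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Thresholds from the examples in the text; a real program tunes and documents these.
VELOCITY_LIMIT = 10          # transfers per rolling 24 hours
ACH_PULL_LIMIT = 5_000.00    # dollars


@dataclass
class Txn:
    # Hypothetical record shape, not any vendor's schema.
    customer_id: str
    timestamp: datetime
    amount: float
    kind: str  # illustrative labels: "transfer_out", "ach_pull"


def velocity_rule(history: list[Txn], current: Txn) -> bool:
    """Fire when the customer has sent more than VELOCITY_LIMIT transfers
    in the 24 hours ending at `current`. `history` must not contain `current`."""
    window_start = current.timestamp - timedelta(hours=24)
    count = sum(
        1
        for t in history + [current]
        if t.customer_id == current.customer_id
        and t.kind == "transfer_out"
        and window_start < t.timestamp <= current.timestamp
    )
    return count > VELOCITY_LIMIT


def amount_rule(current: Txn) -> bool:
    """Fire when an ACH pull exceeds ACH_PULL_LIMIT."""
    return current.kind == "ach_pull" and current.amount > ACH_PULL_LIMIT
```

Note that neither function knows anything about the customer beyond the window it is looking at. That is the property the next paragraphs take issue with.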

That logic is defensible in a legacy bank context where the transaction volume is measured in tens of thousands of events per day and the customer population has stable behavioral baselines. It breaks down in a neobank context, where transaction volumes are higher, customer demographics are younger, and behavioral patterns are different. A gig-economy worker receiving multiple same-day ACH deposits from different payers will look like a structuring pattern to a rule calibrated for traditional bank customers.

The rules haven't changed. The customer base has. That mismatch is the core driver of elevated false positive rates in digital banking environments.

The Operational Cost Nobody Models Fully

The standard argument for tolerating high false positive rates is that they represent an excess of caution. We'd push back on that framing. A false positive rate of 95% is not cautious — it is operationally unsustainable. Here's what it actually costs:

  • An analyst reviewing 100 alerts per week at 20 minutes per alert spends about 33 hours on review, and roughly 32 of those hours go to cases that will not result in SARs. That's essentially a full-time compliance position dedicated to clearing false positives.
  • The true positives buried in the queue receive triage attention, not investigation attention. A complex mule network pattern that requires 2 hours to properly document may get 20 minutes because the analyst has the rest of the 100-alert queue to clear that week.
  • Alert fatigue is a documented phenomenon in financial compliance. When analysts know from experience that 95 out of 100 alerts are noise, the psychological tendency is to look for reasons to close an alert, not reasons to investigate it.
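The arithmetic behind the first bullet can be checked directly. This is a back-of-the-envelope sketch using only the figures quoted above (100 alerts a week, 20 minutes each, a 95% false positive rate); the function name is ours.

```python
ALERTS_PER_WEEK = 100
MINUTES_PER_ALERT = 20
FALSE_POSITIVE_RATE = 0.95


def weekly_review_hours(alerts: int, minutes_per_alert: float, fp_rate: float):
    """Return (total hours, hours on false positives, hours left for everything else)."""
    total = alerts * minutes_per_alert / 60
    on_false_positives = total * fp_rate
    return total, on_false_positives, total - on_false_positives


total, fp_hours, remaining = weekly_review_hours(
    ALERTS_PER_WEEK, MINUTES_PER_ALERT, FALSE_POSITIVE_RATE
)
print(f"total: {total:.1f} h, false positives: {fp_hours:.1f} h, remaining: {remaining:.1f} h")
# prints: total: 33.3 h, false positives: 31.7 h, remaining: 1.7 h
```

Under these assumptions, fewer than two hours a week are left for the alerts that actually matter, which is the point of the second bullet.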

This isn't a criticism of compliance teams. It is a structural problem created by systems that generate more alerts than any human review capacity can meaningfully absorb.

What ML-Augmented Scoring Actually Changes

We're not saying ML replaces rules. That would be wrong both operationally and from a regulatory standpoint. FinCEN's examination guidance expects institutions to maintain an explainable, documented rationale for SAR decisions. A pure black-box ML score does not produce that. What ML-augmented scoring does provide is a prioritization layer.

A behavioral baseline model observes a customer's normal transaction patterns over a rolling 60-to-90-day lookback window. When a velocity rule fires, the ML scoring component asks: is this transaction meaningfully anomalous relative to this customer's established baseline, or is it consistent with their historical behavior? A gig-economy worker who regularly receives 8 to 12 ACH deposits per week does not score high on anomaly even if the velocity rule fires. An account that normally sends two transactions per week and suddenly sends 15 in a single day scores very differently.
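To make the baseline comparison concrete, here is a deliberately simple sketch. It stands in a per-customer z-score on daily transaction counts for the behavioral model; that choice is our assumption for illustration, not a description of any production scoring model, and the function name, the standard-deviation floor, and the sample data are all hypothetical.

```python
from statistics import mean, pstdev


def anomaly_score(daily_counts: list[int], today_count: int) -> float:
    """Z-score of today's transaction count against the customer's own
    lookback window (e.g. the last 60 to 90 days of daily counts)."""
    if not daily_counts:
        raise ValueError("lookback window is empty; no baseline to compare against")
    mu = mean(daily_counts)
    # Floor the spread at 1.0 (an assumed value) so a very regular customer
    # does not produce a division by zero or an exaggerated score.
    sigma = max(pstdev(daily_counts), 1.0)
    return (today_count - mu) / sigma


# Hypothetical 60-day windows.
gig_worker = ([2, 1, 2, 1, 2, 1, 1] * 9)[:60]      # about 10 deposits per week
quiet_account = ([1, 0, 0, 1, 0, 0, 0] * 9)[:60]   # about 2 transactions per week

gig_score = anomaly_score(gig_worker, today_count=3)        # busy day, but in character
quiet_score = anomaly_score(quiet_account, today_count=15)  # far outside the baseline
```

With these inputs the quiet account scores roughly an order of magnitude higher than the gig worker, even though a fixed velocity threshold would treat the two customers identically.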

This prioritization doesn't eliminate the rule-triggered alert — the rule still fires, the case is still opened. What it does is sort the queue so analysts review the highest-anomaly cases first. In practice, well-instrumented teams using this approach have reduced effective false positive rates to the 60-75% range from the 90-95% baseline of pure-rules environments. That's still a majority of alerts being cleared, but the reduction translates to meaningful capacity recovery.
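The queue-ordering step is small enough to sketch. The `Alert` record and its fields are hypothetical; the one property that matters is that every rule-triggered alert stays in the queue and only the review order changes.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    # Hypothetical alert record for illustration.
    alert_id: str
    rule: str             # which rule fired, e.g. "velocity"
    anomaly_score: float  # from whatever baseline model is in use


def prioritize(alerts: list[Alert]) -> list[Alert]:
    """Return all alerts, highest anomaly score first. Nothing is dropped."""
    return sorted(alerts, key=lambda a: a.anomaly_score, reverse=True)


queue = [
    Alert("A-1", "velocity", 1.6),
    Alert("A-2", "velocity", 14.7),
    Alert("A-3", "amount", 0.4),
]
ordered = prioritize(queue)
print([a.alert_id for a in ordered])
# prints: ['A-2', 'A-1', 'A-3']
```

Because the sort never filters, the documented rule logic remains the reason each case was opened, and the score only explains why it was reviewed first.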

The Examiner's View of False Positive Rates

Here's a nuance that sometimes gets lost in vendor conversations about ML and false positive reduction: regulators do not set a target false positive rate, and they do not evaluate programs primarily by that metric. What examiners evaluate is whether the program is appropriately calibrated for the institution's risk profile and whether the methodology for calibrating rules is documented and defensible.

An institution with a 90% false positive rate that has documented why its rules are calibrated at current thresholds, why those thresholds are appropriate for its customer risk profile, and how it reviews and updates calibration on a defined schedule is in a much better position than an institution with a 70% false positive rate that cannot articulate why it uses the parameters it does.

Reducing false positives is not the goal. Having a program that is appropriately calibrated and documented is the goal. False positive rate reduction is a byproduct of getting the calibration right.

Calibration Is Not a One-Time Event

The final gap we see consistently is treating rules calibration as a project rather than an ongoing process. A digital bank goes live with a set of velocity and amount thresholds calibrated at initial deployment. Transaction volumes grow. The customer mix shifts. The product adds new rails — RTP, same-day ACH, card push. The thresholds don't change. Two years later the false positive rate has drifted from 88% to 96% and no one has connected the drift to the product evolution that caused it.
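Catching that kind of drift early can be as simple as comparing the current false positive rate with the rate recorded at the last calibration. The sketch below assumes a tolerance of three percentage points, which is an illustrative value rather than a regulatory or industry figure; the function names and alert counts are ours as well.

```python
DRIFT_TOLERANCE = 0.03  # assumed: flag a rise of more than 3 percentage points


def false_positive_rate(closed_non_suspicious: int, total_alerts: int) -> float:
    """Share of reviewed alerts that were closed as non-suspicious."""
    if total_alerts <= 0:
        raise ValueError("total_alerts must be positive")
    return closed_non_suspicious / total_alerts


def has_drifted(baseline_rate: float, current_rate: float,
                tolerance: float = DRIFT_TOLERANCE) -> bool:
    """True when the rate has risen past the tolerance since calibration."""
    return current_rate - baseline_rate > tolerance


# Hypothetical counts matching the 88% -> 96% example in the text.
at_calibration = false_positive_rate(880, 1_000)
two_years_later = false_positive_rate(2_880, 3_000)

if has_drifted(at_calibration, two_years_later):
    print("False positive rate has drifted; schedule a threshold review.")
```

A check like this does not say which threshold is stale, only that the review is due. Tying the flag to a changelog of product launches (new rails, new customer segments) is what connects the drift to its cause.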

A calibration review cadence (quarterly at minimum for high-growth digital banks) and a documented rationale for each threshold are not optional infrastructure. They are what separate a BSA program that passes examination from one that generates an MRA (Matter Requiring Attention) around monitoring adequacy.
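One lightweight way to keep the rationale and the review date next to each threshold is a small register that can be checked on a schedule. This is a sketch under our own assumptions: the `ThresholdRecord` fields, the 92-day interval standing in for "quarterly", and the sample entries are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=92)  # assumed stand-in for a quarterly cadence


@dataclass
class ThresholdRecord:
    # Hypothetical register entry for one monitoring rule.
    rule_name: str
    threshold: float
    rationale: str       # why this value suits the customer risk profile
    last_reviewed: date


def needs_attention(records: list[ThresholdRecord], today: date) -> list[ThresholdRecord]:
    """Return records with no documented rationale or an overdue review."""
    return [
        r for r in records
        if not r.rationale.strip() or today - r.last_reviewed > REVIEW_INTERVAL
    ]


register = [
    ThresholdRecord("velocity_24h", 10, "Set at initial deployment from launch cohort data", date(2023, 1, 15)),
    ThresholdRecord("ach_pull_amount", 5_000, "", date(2024, 5, 1)),
    ThresholdRecord("rtp_velocity", 15, "Reviewed after RTP launch", date(2024, 5, 1)),
]
flagged = needs_attention(register, today=date(2024, 6, 1))
print([r.rule_name for r in flagged])
# prints: ['velocity_24h', 'ach_pull_amount']
```

The first entry is flagged because its review is overdue and the second because it has no rationale at all, which are the two conditions the paragraph above describes.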

The 95% false positive rate isn't a product failure to be embarrassed about — it's a calibration baseline that most programs can improve with a disciplined methodology. The starting point is acknowledging the number honestly.
