
Ongoing customer monitoring: AI vs. rules-based. What regulators actually want to see

Zenoo's Editorial Team

In 2025, FinCEN fined a US regional bank $85 million. Not for failing to monitor its customers. For monitoring them badly. The bank had a transaction monitoring system. It had rules. It generated alerts. But the rules had not been recalibrated in four years. The alert-to-SAR conversion rate was 0.3%. And when examiners asked the compliance team to demonstrate how their monitoring adapted to emerging typologies, no one could answer. The system was running. It just was not working.

This is the gap that regulators are now targeting. AMLA, the EU's new Anti-Money Laundering Authority, became operational in 2025. Its mandate is not to introduce new obligations from scratch. It is to ensure that existing obligations, particularly ongoing customer monitoring, are being met with genuine effectiveness, not just procedural compliance. FATF's updated guidance on risk-based supervision makes the same point: having a monitoring programme is necessary but not sufficient. The programme must demonstrably detect, assess, and respond to changes in customer risk.

We see this every week. Firms that passed regulatory inspections three years ago are now receiving findings on the same monitoring programmes. The rules have not changed. The expectations have.

What AMLA and FATF actually expect from ongoing monitoring

AMLA's supervisory mandate covers direct supervision of high-risk obliged entities and coordination of national supervisors across the EU. For ongoing customer monitoring specifically, three expectations are becoming non-negotiable.

First, monitoring must be risk-proportionate. FATF Recommendation 10 and its interpretive note require that the nature and frequency of monitoring reflect the customer's risk profile. A high-risk customer with complex cross-border transactions should not be monitored with the same rule set and review cycle as a low-risk domestic retail customer. This sounds obvious. In practice, most institutions apply the same rules across their entire customer base, with risk differentiation limited to the frequency of periodic reviews.

Second, monitoring must be responsive to change. The EU's Sixth Anti-Money Laundering Directive and AMLA's technical standards require that monitoring systems react to material changes: new sanctions designations, changes in corporate ownership, shifts in transaction behaviour, adverse media. Batch processing that runs overnight is no longer considered adequate for sanctions-related triggers. The expectation is near real-time for high-risk events.

Third, monitoring must be demonstrably effective. This is where most enforcement actions originate. Regulators are no longer satisfied with evidence that a monitoring system exists. They want evidence that it works: what is the alert-to-SAR conversion rate, how are thresholds calibrated, when were rules last updated, what is the false positive rate, and how does the institution measure whether its monitoring is catching what it should catch.

Static rules catch what you already know about. That is the problem.

Rules-based monitoring works on a simple principle: define a condition, generate an alert when the condition is met. If a customer makes a cash deposit over a defined threshold, flag it. If a customer transacts with a counterparty in a high-risk jurisdiction, flag it. If transaction volume exceeds a multiple of the customer's historical average, flag it.
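
The principle is simple enough to sketch. Below is a minimal, illustrative rules engine; the thresholds, jurisdiction codes, and rule names are placeholders, not regulatory values.

```python
from dataclasses import dataclass

# Illustrative thresholds only -- real values come from the institution's
# documented risk assessment, not from a blog post.
CASH_THRESHOLD = 10_000
HIGH_RISK_JURISDICTIONS = {"XX", "YY"}  # placeholder country codes
VOLUME_MULTIPLE = 3.0

@dataclass
class Transaction:
    amount: float
    is_cash: bool
    counterparty_country: str

def evaluate_rules(txn: Transaction, historical_avg: float) -> list[str]:
    """Return the name of every rule the transaction trips."""
    alerts = []
    if txn.is_cash and txn.amount > CASH_THRESHOLD:
        alerts.append("cash_over_threshold")
    if txn.counterparty_country in HIGH_RISK_JURISDICTIONS:
        alerts.append("high_risk_jurisdiction")
    if historical_avg > 0 and txn.amount > VOLUME_MULTIPLE * historical_avg:
        alerts.append("volume_spike")
    return alerts
```

Note that every condition is fixed in advance: the engine can only ever flag what its authors anticipated.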

This approach has three strengths. It is transparent: you can explain exactly why an alert was generated. It is auditable: regulators can review the rule set and understand the logic. And it is predictable: the same inputs always produce the same outputs.

It also has three fundamental weaknesses that regulators are increasingly unwilling to overlook.

The first weakness is rigidity. Rules detect only what they are designed to detect. Sophisticated actors study the rules and adapt their behaviour to stay just outside the detection parameters, and a static rule set cannot identify novel typologies or adaptive evasion tactics.

The second weakness is alert volume. The industry average false positive rate for transaction monitoring alerts sits at approximately 95%. For every 100 alerts a rules-based system generates, roughly 95 require analyst time and turn out to be benign. At £2,600 per manual case review, the operational cost is staggering. More importantly, genuine risks get buried in noise. When analysts are processing thousands of false positives per month, their attention to genuine anomalies degrades. This is not a technology problem. It is a human factors problem created by technology that is not precise enough.

The third weakness is static calibration. Rules are set at a point in time, based on the typologies and risk environment that existed when they were written. The risk environment changes. New typologies emerge. Customer behaviour shifts. Jurisdictional risk profiles evolve. A rules-based system that was well calibrated in 2023 may be materially miscalibrated by 2026, and many institutions lack the resources or processes to recalibrate at the pace the environment demands.

"We had 47 transaction monitoring rules when I joined. I asked the team when they were last reviewed. No one knew. We traced some back to 2019. The rules were generating 2,200 alerts per month. Our SAR filing rate from those alerts was 1.8%. We were spending the equivalent of three full-time analysts processing noise."
Head of Financial Crime, UK payment institution

AI-driven anomaly detection: what it adds, and what regulators demand from it

Machine learning and AI-driven monitoring take a fundamentally different approach. Instead of defining rules in advance, these systems learn patterns from historical data and flag deviations from those patterns. A customer whose transaction behaviour shifts materially from their established profile will generate an alert, even if the specific pattern does not match any predefined rule.

This approach addresses the three weaknesses of rules-based systems. It can detect novel patterns that rules were not designed to catch. It can be more precise, reducing false positive rates by learning what "normal" looks like for each customer individually rather than applying generic thresholds. And it adapts as behaviour changes, because the models retrain on new data.

The numbers bear this out. Institutions that have deployed ML-based monitoring alongside their rules engines consistently report false positive reductions of 40 to 60% on the alerts where both systems overlap. Some report higher. The EY 2024 global survey found that 43% of financial institutions now use machine learning in their detection mechanisms, up from roughly 25% two years earlier. The direction of travel is clear.

But regulators have legitimate concerns about AI in monitoring, and any institution deploying it needs to address three specific demands.

Explainability. When an AI model flags a transaction or a customer, the institution must be able to explain why. "The model scored this 0.87" is not an explanation. "The model identified a 340% increase in cross-border transfers to a jurisdiction where the customer has no declared business activity, combined with a change in counterparty concentration from 12 regular counterparties to 3 new ones over a 60-day period" is an explanation. Regulators expect human-readable rationale for every alert, and the institution must demonstrate that analysts understand and can interrogate the model's reasoning.
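One way to meet that bar is to render the model's top-weighted features as plain-language statements. The sketch below assumes per-alert feature contributions are already available (for instance from SHAP values); the feature names, figures, and templates are entirely hypothetical.

```python
# Hypothetical contributions for one alert: feature -> (value, weight).
contributions = {
    "cross_border_transfer_increase_pct": (340, 0.41),
    "new_counterparty_ratio": (0.75, 0.28),
    "declared_activity_mismatch": (1, 0.18),
}

# Plain-language template per feature (illustrative).
TEMPLATES = {
    "cross_border_transfer_increase_pct": "{}% increase in cross-border transfers",
    "new_counterparty_ratio": "{:.0%} of recent counterparties are new",
    "declared_activity_mismatch": "activity inconsistent with declared business",
}

def explain(contributions: dict, top_n: int = 2) -> str:
    """Render the top-N contributing features as an analyst-readable sentence."""
    ranked = sorted(contributions.items(), key=lambda kv: kv[1][1], reverse=True)
    parts = []
    for name, (value, _weight) in ranked[:top_n]:
        template = TEMPLATES[name]
        parts.append(template.format(value) if "{" in template else template)
    return "Alert driven by: " + "; ".join(parts)
```

The output is a sentence an analyst can interrogate, rather than a bare score.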

Validation and governance. ML models require ongoing validation. This means regular back-testing against known outcomes (did the model flag cases that subsequently became SARs?), sensitivity analysis (how does alert volume change when model parameters shift?), and bias testing (is the model disproportionately flagging customers from certain demographics or jurisdictions for reasons unrelated to risk?). AMLA's technical standards are expected to address model governance explicitly. Institutions should not wait for those standards to build their validation frameworks.
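The back-testing step can be sketched as a straightforward comparison of flagged cases against SAR outcomes; `backtest` and its inputs are illustrative names, not a standard API.

```python
def backtest(alerted_case_ids: list, sar_case_ids: list) -> tuple[float, float]:
    """Compare model alerts against cases that subsequently became SARs.

    Returns (precision, recall):
      precision -- share of alerted cases that became SARs
      recall    -- share of SAR cases the model actually flagged
    """
    alerted, sars = set(alerted_case_ids), set(sar_case_ids)
    if not alerted or not sars:
        return 0.0, 0.0
    hits = alerted & sars
    return len(hits) / len(alerted), len(hits) / len(sars)
```

Tracking both figures per model version over time is what turns "the model runs" into "the model works".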

Audit trail integrity. Every model decision must be logged with the data that informed it, the model version that produced it, and the outcome. If a model is retrained and its behaviour changes, the institution must be able to explain what changed and why. This is more demanding than the audit trail for a rules-based system, because the logic is not static. The audit trail must capture not just what the model decided, but how the model was operating at the time of the decision.
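A minimal audit record for one model decision might capture the inputs, the model version in force, and the outcome, sealed with a content hash for integrity checking. All field names here are illustrative assumptions.

```python
import hashlib
import json
from datetime import datetime, timezone

def log_decision(customer_id: str, features: dict, model_version: str,
                 score: float, outcome: str) -> str:
    """Build one immutable audit record for a model decision.

    Captures what the model decided and how it was operating at the time;
    the hash lets later reviews detect tampering with the record.
    """
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "customer_id": customer_id,
        "model_version": model_version,
        "features": features,   # the data that informed the decision
        "score": score,
        "outcome": outcome,
    }
    payload = json.dumps(record, sort_keys=True)
    record["integrity_hash"] = hashlib.sha256(payload.encode()).hexdigest()
    return json.dumps(record)
```

When the model is retrained, the `model_version` field is what lets the institution explain which logic produced a historical decision.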

This is exactly the monitoring gap that Zenoo was built to close. If your current system is generating thousands of alerts with a sub-2% conversion rate, book a demo and we will show you what risk-proportionate monitoring looks like with your own data.

OCM best practices that actually survive regulatory scrutiny

The institutions we work with that perform best under regulatory examination share four common practices. None of them rely exclusively on rules or exclusively on AI. They use both, deliberately.

Risk-scoring models that update continuously. Customer risk scores should not be static assignments that change only at periodic review. Every monitoring event, whether it originates from transaction monitoring, sanctions screening, adverse media, or corporate registry changes, should feed into a risk recalculation. When the recalculated score crosses a materiality threshold, the customer is queued for review. This means risk scores reflect the current reality, not the reality at the last scheduled review date. Institutions that do this well see a 20 to 30% increase in the proportion of reviews that result in a genuine risk decision, because reviews are triggered by actual changes rather than arbitrary schedules.
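The event-driven recalculation described above can be sketched as follows; the event weights, score cap, and materiality threshold are illustrative assumptions, not calibrated values.

```python
# Illustrative event weights -- a real model would be calibrated,
# documented, and owned, per the threshold governance practice below.
EVENT_WEIGHTS = {
    "sanctions_hit": 40,
    "adverse_media": 15,
    "ownership_change": 10,
    "transaction_anomaly": 8,
}
MATERIALITY_THRESHOLD = 20  # score delta that queues a review (assumed value)

def apply_event(current_score: int, event_type: str) -> tuple[int, bool]:
    """Recalculate a customer's risk score on a monitoring event and report
    whether the change crosses the materiality threshold for review."""
    delta = EVENT_WEIGHTS.get(event_type, 0)
    new_score = min(100, current_score + delta)
    queue_for_review = (new_score - current_score) >= MATERIALITY_THRESHOLD
    return new_score, queue_for_review
```

The point of the threshold is that reviews are triggered by material change, not by the calendar.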

Transaction pattern analysis at the customer level. Generic rules apply the same thresholds to every customer. Effective monitoring establishes a behavioural baseline for each customer and flags deviations from that baseline. A cash deposit of £50,000 from a commercial property firm is routine. The same deposit from a sole-trader consultancy is anomalous. Customer-level baselines eliminate the largest source of false positives: alerts generated because a transaction exceeded a generic threshold that was never appropriate for that customer's profile.
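A customer-level baseline can be as simple as a z-score against the customer's own transaction history. This is a deliberately minimal sketch; production systems build baselines from far richer behavioural features.

```python
from statistics import mean, stdev

def is_anomalous(history: list[float], amount: float, z_cutoff: float = 3.0) -> bool:
    """Flag a transaction relative to this customer's own baseline,
    not a generic threshold shared across the whole book."""
    if len(history) < 5:
        return False  # insufficient baseline; rules coverage still applies
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return amount != mu
    return abs(amount - mu) / sigma > z_cutoff
```

Run against a commercial property firm whose deposits cluster around £48,000, a £50,000 deposit trips nothing; the same amount against a sole-trader consultancy's four-figure history is flagged immediately.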

Behavioural clustering for peer comparison. Even with individual baselines, some anomalies are only visible in context. Behavioural clustering groups customers with similar profiles and transaction patterns, then identifies customers whose behaviour is diverging from their peer group. A customer who was onboarded as a small import/export business but whose transaction patterns now resemble a money service business will stand out in a peer comparison, even if their absolute transaction volumes have not crossed any rule-based threshold.
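A minimal sketch of peer-group deviation: compute the centroid of each peer group's behavioural feature vectors and flag customers sitting unusually far from it. The features and distance cutoff are illustrative; production clustering is considerably more sophisticated.

```python
from math import dist
from statistics import mean

def peer_outliers(peer_group: dict, cutoff: float) -> list:
    """Flag customers whose behavioural feature vector sits far from their
    peer group's centroid. Features might be, e.g., cash ratio and
    cross-border share (illustrative); vectors must share a length."""
    vectors = list(peer_group.values())
    centroid = tuple(mean(col) for col in zip(*vectors))
    return [cid for cid, vec in peer_group.items() if dist(vec, centroid) > cutoff]
```

A customer onboarded as a small import/export business whose vector has drifted towards money-service-business territory stands out here even though no absolute threshold was crossed.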

Documented threshold governance. Every monitoring threshold, whether rule-based or model-derived, should have a documented rationale, an owner, a review date, and a record of when it was last calibrated. This is the evidence regulators ask for first. When an examiner asks why a particular threshold is set at a particular level, the answer cannot be "it has always been that way." The answer must reference a risk assessment, a typology analysis, or a calibration exercise.

Enforcement actions tell you exactly where OCM programmes fail

Regulatory enforcement actions are, in effect, published case studies of what not to do. The patterns in recent OCM-related actions are remarkably consistent.

FinCEN's 2024 and 2025 enforcement actions against mid-tier banks repeatedly cite the same failures: transaction monitoring rules that were not calibrated to the institution's specific risk profile, alert backlogs that grew without remediation, and a lack of documentation showing how monitoring thresholds were set and reviewed. In one case, an institution had a 14-month backlog of unreviewed alerts. In another, the monitoring system had not been updated to reflect new correspondent banking relationships that materially changed the institution's risk exposure.

OCC consent orders in the same period tell a similar story. A common finding is that institutions could not demonstrate the effectiveness of their monitoring. They could show that monitoring was running. They could produce alert volumes. But they could not show what the monitoring was catching, what it was missing, or how they knew the difference. The absence of effectiveness testing, where the institution proactively tests whether its monitoring would detect known typologies, is cited in the majority of recent OCC orders related to BSA/AML programmes.

In the EU, the pattern is consistent but with additional emphasis on risk-based differentiation. Supervisory findings from national competent authorities, now coordinated through AMLA, increasingly focus on whether monitoring intensity matches customer risk. Institutions that apply the same monitoring parameters across all risk tiers are receiving findings, even when their overall monitoring coverage is high. Coverage without risk proportionality is not compliance.

"After our last examination, the single biggest remediation item was not technology. It was documentation. We could not demonstrate why our thresholds were set where they were, when they were last reviewed, or how we measured whether they were effective. We had the technology. We did not have the governance around it."
MLRO, European digital bank

Building a hybrid programme: rules plus intelligence for defensibility

The most defensible OCM programmes are hybrid. They use rules for known, well-defined scenarios where transparency is paramount, and AI for pattern detection, anomaly identification, and dynamic risk scoring where adaptability matters. The key is knowing which approach applies where, and documenting the rationale for both.

Rules should cover regulatory bright lines: sanctions screening matches, threshold-based reporting obligations, PEP identification triggers. These are scenarios where the detection logic must be fully transparent, the regulatory expectation is binary (flag or do not flag), and the cost of a miss is existential. Rules are appropriate here because the patterns are known, the logic is auditable, and there is no ambiguity about what constitutes a match.

AI should cover everything that requires pattern recognition across complex, multi-dimensional data: behavioural anomaly detection, peer group deviation, transaction network analysis, and dynamic risk recalculation. These are scenarios where the patterns are not known in advance, where individual rules cannot capture the complexity, and where the volume of data exceeds what rule-based systems can meaningfully process.

The hybrid model works because each approach compensates for the other's weaknesses. Rules ensure that known, high-priority scenarios are always detected with full transparency. AI ensures that novel, complex, or evolving patterns are detected despite not fitting any predefined rule. Together, they create a monitoring programme that is both defensible (regulators can see the logic for rules-based detections) and effective (AI catches what rules miss).

At Zenoo, we built our monitoring capabilities around this hybrid principle. Rules and AI run in parallel, with a unified case management layer that presents analysts with the combined output, the detection source (rule, model, or both), and the supporting evidence. The analyst sees what triggered the alert, why, and what data informed the decision, regardless of whether the trigger was a rule or a model.

Benchmarking your OCM: the numbers that matter

If you cannot measure your monitoring programme's performance, you cannot defend it to a regulator. Here are the metrics that matter, with benchmarks from what we see across the industry.

Metric | Industry baseline | Best practice target | What it measures
Alert-to-SAR conversion rate | 1 to 3% | 8 to 15% | Quality of alert generation. Below 2% suggests over-alerting; above 20% suggests thresholds are too narrow.
False positive rate | ~95% | 50 to 70% | Precision of detection. Achieved by hybrid rules plus ML-based anomaly detection.
Alert backlog age | Growing backlogs typical | Disposition within 5 business days (standard), 24 hours (high-priority) | Operational health. Alerts over 30 days old constitute a regulatory finding.
Monitoring coverage | Varies | 100% for sanctions and rules; risk-tiered for behavioural monitoring with documented gaps | Detection scope. Must be articulated and risk-assessed.
Threshold review frequency | Ad hoc or annual | Semi-annual (high-risk), annual (standard), plus ad hoc for material risk changes | Governance maturity. Every threshold must have a documented rationale and review date.
Case remediation rate | Varies | 80 to 90% of standard cases within 15 business days; 95% of high-priority within 5 business days | Case closure discipline. Demonstrates speed of risk response.
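The headline red-flag conditions in the table can be checked mechanically. The sketch below applies the thresholds cited in this article; the function name and inputs are illustrative.

```python
def benchmark_flags(alerts: int, sars: int, reviewed_benign: int) -> list[str]:
    """Compare a programme's headline numbers against the red flags cited
    in recent enforcement actions (thresholds from this article)."""
    findings = []
    conversion = sars / alerts if alerts else 0.0
    fp_rate = reviewed_benign / alerts if alerts else 0.0
    if conversion < 0.02:
        findings.append(f"alert-to-SAR conversion {conversion:.1%} is below 2%")
    if fp_rate > 0.90:
        findings.append(f"false positive rate {fp_rate:.1%} exceeds 90%")
    return findings
```

Fed the numbers from the quote earlier in this article (2,200 alerts a month, a 1.8% SAR rate), it raises both flags.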

The compliance team's role is changing, not disappearing

None of this means that compliance teams are being replaced by algorithms. The shift is from manual alert processing to model oversight, threshold governance, and complex case investigation. Analysts who currently spend 60 to 70% of their time on routine false positive disposition will spend that time on cases that genuinely require human judgement, on validating model performance, and on the governance framework that makes the whole programme defensible.

This is a better use of skilled professionals. It is also a more sustainable operating model. The industry survey data on compliance professional stress (68% report high stress) and attrition (42% are considering leaving the profession) reflects a workforce that is burning out on repetitive work that technology should be handling. Redirecting human effort towards the work that actually requires human expertise is not just an efficiency gain. It is a retention strategy.

AMLA is operational. FATF is tightening its effectiveness assessments. Regulators in every major jurisdiction are moving from checking that monitoring exists to evaluating whether monitoring works. Institutions that are still running the same rules-based monitoring programme they built five years ago are not just operationally inefficient. They are regulatorily exposed.

Key takeaways

  • Regulators now focus on monitoring effectiveness, not just existence. Alert-to-SAR conversion rates below 2% and false positive rates of 95% are red flags in enforcement actions.
  • Static rules-based systems cannot detect novel typologies or adapt to evolving risk. Hybrid programmes combining rules (for regulatory bright lines) and AI (for pattern recognition) are most defensible.
  • Explainability, validation governance, and audit trail integrity are non-negotiable for any AI-based monitoring. Regulators expect human-readable rationale for every alert.
  • Continuous risk-scoring and customer-level behavioural baselines reduce false positives by 40 to 60% compared to generic rule sets applied across all customers.
  • Documented threshold governance is the first thing regulators ask for. Every threshold must have a rationale, an owner, and evidence of calibration and review.

The question is no longer whether to move beyond pure rules-based monitoring. It is how quickly you can build a hybrid programme that is both effective and defensible. If your alert-to-SAR conversion rate is below 3%, your thresholds have not been reviewed in over a year, or your team is drowning in false positives, the regulatory risk is real and growing.

We built Zenoo to solve exactly this. If you want to see what risk-proportionate, hybrid monitoring looks like in practice, book a demo. 30 minutes. Your data. No slides.


Published by

Zenoo's Editorial Team

Practical, unbiased content on KYC, AML, and compliance operations. Written by the team building tools to make compliance work better.
