
As AI becomes embedded in nearly every compliance workflow, from hotline triage and policy analysis to screenings and program effectiveness reviews, the central question is no longer whether to use AI. It is how to use AI responsibly: selecting the right model, ensuring data integrity, and applying governance and risk controls to make sure the technology is fit for purpose.
A major new industry benchmark evaluating leading AI models across real-world compliance and ethics tasks reinforces a pivotal truth: different models excel at different things, and the differences can be substantial.[1] Some models perform exceptionally well on structured, rules-based tasks like categorization, risk ranking, and decision routing. Others shine in open-ended narrative work, such as drafting briefs or generating training content. And still others lag significantly in areas requiring deeper analytical reasoning, pattern detection, or contextual interpretation.
Perhaps the most striking insight is the variability in performance when tasks become more ambiguous or involve unstructured information. In these scenarios, even top-performing models can miss critical red flags, overlook regulatory nuances, or produce inconsistent judgments. By contrast, straightforward tasks such as mapping controls, assigning risk categories, or extracting specific fields from questionnaires are handled with far greater confidence and accuracy.
For compliance leaders, these findings are not academic; they reshape how we evaluate AI-enabled tools and how we govern third-party risk.
A new due diligence obligation: Model transparency
More vendors are quietly embedding AI into their products: screening platforms, case management systems, investigative assistants, policy engines, and analytics dashboards. But the benchmark makes one point abundantly clear: the underlying model determines the reliability of the output.
A vendor relying on an outdated or lower-performing model isn’t just behind technologically; it may be exposing your organization to avoidable risk. Third-party risk management should therefore now include AI-governance due diligence questions, such as the following (one way to track the answers is sketched after the list):
- Which AI model powers this feature, and why was it chosen for this workflow?
- Has the model been benchmarked against compliance-relevant tasks?
- How does the vendor monitor quality, drift, reliability, and hallucination risk?
- Where is human review required, and how is oversight integrated?
- Does the tool update to newer, more capable models as they become available?
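To make these questions operational rather than a one-time email exchange, some teams encode them as a structured checklist that can be compared across vendors and revisited over time. The sketch below is one illustrative way to do that in Python; the class, field, and vendor names are hypothetical and are not drawn from the benchmark or any standard questionnaire.

```python
from dataclasses import dataclass, field

@dataclass
class VendorAIAssessment:
    """Illustrative record of a vendor's answers to the due diligence
    questions above. Field names are assumptions, not a standard."""
    vendor: str
    model_disclosed: bool                 # Which model powers the feature, and why?
    benchmarked_on_compliance: bool       # Benchmarked on compliance-relevant tasks?
    monitors_drift_and_hallucination: bool  # Quality, drift, reliability monitoring?
    human_review_points: list[str] = field(default_factory=list)  # Where is oversight built in?
    upgrades_to_newer_models: bool = False  # Updates as more capable models ship?

    def open_gaps(self) -> list[str]:
        """Return unanswered or negative items for follow-up."""
        gaps = []
        if not self.model_disclosed:
            gaps.append("Model identity and selection rationale not disclosed")
        if not self.benchmarked_on_compliance:
            gaps.append("No compliance-relevant benchmarking evidence")
        if not self.monitors_drift_and_hallucination:
            gaps.append("No quality/drift/hallucination monitoring described")
        if not self.human_review_points:
            gaps.append("No defined human-review checkpoints")
        if not self.upgrades_to_newer_models:
            gaps.append("No commitment to adopt newer, more capable models")
        return gaps

# Example usage with a hypothetical vendor:
assessment = VendorAIAssessment(
    vendor="ExampleScreen Inc.",
    model_disclosed=True,
    benchmarked_on_compliance=False,
    monitors_drift_and_hallucination=True,
    human_review_points=["escalation review", "final disposition"],
)
for gap in assessment.open_gaps():
    print(f"Follow up: {gap}")
```

However the checklist is recorded, the point is the same: the answers should be documented, owned, and refreshed as vendors change models, just like any other third-party risk artifact.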
The compliance officer’s evolving role
The benchmark’s broader message is that compliance officers must become savvy evaluators of AI systems, not to replace technologists but to ask the right questions, assess AI-driven controls, and ensure the “right model for the right task” principle becomes standard practice.
Compliance can lead on responsible AI use and AI governance, and that means adopting, governing, and overseeing the technology with the same rigor we apply to every other risk domain.