There is so much excitement about Generative AI and how it will change business, finance, medicine, education, and more. It has been less than 18 months since OpenAI announced ChatGPT, with Anthropic and other new entrants following close behind. Those in the fraud prevention community – both practitioners and vendors – anticipate seeing the benefits of Gen AI at work. But with the technology still so new for actual deployments, there is concern about how it should be properly implemented. And we have historical perspective to learn from, going back more than 25 years.

A History Lesson: The UK Post Office Horizon Scandal

With Generative AI, we must think about the core of what is being built and how accurate and reliable it will be. To do this, we need to go back to the turn of the century in England to find a very troubling software deployment.

In 1999, the British Post Office deployed Horizon, a new accounting system used by its branch managers (known as sub-postmasters, sub-contracted positions). Once deployed, it began reporting that sub-postmasters were mismanaging postal funds. Between 1999 and 2015, the software's data was used to bring over 900 prosecutions. The sub-postmasters under review insisted the software results were wrong, but they were ignored. The outcome:

Over 3,500 sub-postmasters forced to make repayments based on the Horizon software data.
Over 900 innocent Post Office sub-postmasters wrongly convicted of theft, false accounting, and fraud between 1999 and 2015. As of January 2024, only “93 convictions have been overturned.”
Many of the affected employees lost their careers, while some ended up in prison, and several committed suicide.

The sub-postmasters were incensed and fearful. Even though they knew the software results were inaccurate, they wrote checks to cover the shortfalls. Many pleaded guilty to financial mismanagement. Some went to jail. According to the BBC, many were financially destroyed, saw their marriages break down, and some took their own lives over these false allegations.

In court, prosecutors argued that the Horizon software showed there was theft in all of these cases. In the UK, computer evidence was legally presumed to be reliable – the software was treated as a point of truth. And so, these people must be criminals.

It wasn’t until 2017, almost 20 years later, that the truth began to come out: the software was not accurate at all. And it wasn’t until January 2024, when the TV drama Mr Bates vs The Post Office aired on UK television, that the full scale of the scandal reached the public. It shocked the country with the gross mismanagement of these bogus criminal cases. But it was too little, too late. The damage was done. People were crushed, and many lives were ruined.

Generative AI Software Shows Promising Results

With this perspective, let’s get back to the present day. No one wants their Generative AI solution to suffer a UK Post Office-style “hallucination.” But it will happen. And it already did, in the Netherlands. A Politico article revealed that in 2019, Dutch tax authorities had used a self-learning algorithm to create risk profiles in an effort to spot child care benefits fraud. The algorithm was wrong. According to Politico, “Tens of thousands of families — often with lower incomes or belonging to ethnic minorities — were pushed into poverty because of exorbitant debts to the tax agency.” This is yet another example of the devastation that can be caused when technology and automation are deployed without the right safeguards.

So, these software disasters are not black swan events. They occur because new software is not properly tested for its core features, or red-team tested for potential outlier issues. And that is what this blog is about. Yes, new Generative AI solutions must have model governance documentation and meet state, federal, and EU requirements. And those requirements will be tough.

We are starting to see impressive results with Generative AI solutions in cybersecurity. On February 16, 2024, the Financial Times reported the following results from Google’s Generative AI cybersecurity software:

70 percent better at detecting malicious scripts
Up to 300 percent more effective at identifying files that exploit vulnerabilities
Time savings of 51 percent among detection and response teams

Results from the use of Generative AI undoubtedly show promise. However, achieving these kinds of results requires sound testing of Generative AI solutions. I mention red-team testing because the testers must have experience with AI and neural networks to know how to really test for outliers, including privacy issues, bias issues, and hallucinated results.

The testing needs to assess how hallucinations (confident but wrong outputs) are addressed and controlled. The control may be as simple as the program saying, “There is no risk score for this transaction” when it does not know, rather than emitting a bogus, undeserved high or low risk score. The testing also needs to look critically for bias, a key government concern.
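One way to make that “the program says it does not know” behavior concrete is an abstention wrapper around the model’s output. The sketch below is purely illustrative – `score_transaction`, `RiskResult`, and the confidence floor are hypothetical names and values, not any vendor’s API – but it shows the idea: when confidence is below a validated threshold, return no score at all instead of a fabricated one.

```python
# Illustrative sketch: abstain instead of guessing when the model is unsure.
# score_transaction, RiskResult, and CONFIDENCE_FLOOR are assumed names/values.
from dataclasses import dataclass
from typing import Optional

CONFIDENCE_FLOOR = 0.80  # assumed threshold; set during model validation


@dataclass
class RiskResult:
    score: Optional[float]  # None means "no risk score for this transaction"
    reason: str


def score_transaction(model_score: float, model_confidence: float) -> RiskResult:
    """Pass the model's score through only when confidence clears the floor."""
    if model_confidence < CONFIDENCE_FLOOR:
        # Abstain: surface the uncertainty rather than a bogus high or low score.
        return RiskResult(score=None,
                          reason="There is no risk score for this transaction")
    return RiskResult(score=model_score, reason="model confidence sufficient")
```

A red-team test suite can then assert that low-confidence inputs always produce an abstention, never a confident-looking number.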

Gen AI solutions can also handle much more data – that is a good thing. For example, financial transactions will carry more ISO 20022 data (faster payments, CHIPS, etc.). But more data also makes testing even more difficult. And if the data contains PII, then anonymization procedures need to be deployed.
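One common anonymization procedure is pseudonymization: replacing PII fields with a keyed hash before the data enters a test environment, so records remain linkable for testing but the raw identifiers are gone. The following is a minimal sketch under assumed conditions – the field list, key handling, and token length are all simplifications for illustration, not a production recipe.

```python
# Illustrative sketch of pseudonymizing PII fields before test data is shared.
# PII_FIELDS and SECRET_KEY are assumptions for the example; in practice the
# key would live in a secrets manager, never in source code.
import hashlib
import hmac

PII_FIELDS = {"account_number", "name", "email"}  # assumed PII columns
SECRET_KEY = b"rotate-me"                         # placeholder key


def pseudonymize(record: dict) -> dict:
    """Replace PII values with keyed-hash tokens; leave other fields intact."""
    out = {}
    for field, value in record.items():
        if field in PII_FIELDS:
            digest = hmac.new(SECRET_KEY, str(value).encode(), hashlib.sha256)
            out[field] = digest.hexdigest()[:16]  # truncated token replaces the value
        else:
            out[field] = value
    return out


txn = {"account_number": "GB29NWBK601613", "amount": 120.50, "currency": "GBP"}
safe = pseudonymize(txn)  # amount and currency survive; the account number does not
```

Because the hash is keyed and deterministic, the same account always maps to the same token, which keeps test scenarios realistic without exposing the underlying PII.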


Fraud fighters can’t wait to get Generative AI solutions – to sell them or to deploy them. And the return on investment should be quite significant. But the first step out the door will not be easy. Yes, we already have AI and machine learning solutions, and there are some lessons to be learned from those deployments. But Generative AI is different. The Wall Street Journal recently reported that Google and Anthropic acknowledge their AI systems are capable of hallucinations, where they authoritatively spit out statements that are wrong. Eli Collins, VP of Product Management at Google DeepMind, stated, “We’re not in a situation where you can just trust the model output.”

In an interview with American Banker, Eric Siegel, author of The AI Playbook, noted that most Generative AI projects will never even reach deployment. As for the Generative AI projects that do, organizations will need to be prepared with their model governance documents and aligned with the myriad regulatory requirements they will need to meet. But even before getting there, testing Generative AI software in depth is necessary, including red-team testing. No business or government agency wants to become the next UK Post Office disaster.


While writing this blog, I came across a story in which a Canadian airline blamed its chatbot for misquoting fares. The airline argued “that despite the error, the chatbot was a ‘separate legal entity’ and thus was responsible for its actions.” The judge disagreed.
