Guide to fraud prediction models

Terence Shin · August 06, 2021

Terence is a data enthusiast and professional working with top Canadian tech companies. He received his MBA from Quantic School of Business and Technology and is currently pursuing a master's in computational analytics at Georgia Tech.

Key learnings from this article:

  • Types of fraud prediction models (and the differences between them)
  • Key features to use in your fraud prediction models
  • How to evaluate a fraud prediction model

As the world becomes more digitized and people are better equipped with new technologies and tools, fraudulent activity continues to reach record highs. According to a 2020 report from PwC, fraud losses totaled US$42 billion, with 47% of all companies experiencing fraud in the preceding 24 months.

Paradoxically, the same technological advancements, like big data, the cloud, and modern prediction algorithms, allow companies to tackle fraud better than ever before. In this article, we’re focused on that last point: fraud prediction algorithms. Specifically, we’ll look at the types of fraud models, the features to use in a fraud model, and how to evaluate a fraud model.

Types of Fraud Prediction Models

“Fraud” is a wide-reaching, comprehensive term, so it should come as no surprise that you can build several types of fraud models, each serving its own purpose. Below, we’ll take a look at four of the most common models and map how they relate to each other.

Profile-specific models vs transaction-specific models

The idea of “innocent until proven guilty” isn’t just a concept for courtroom justice. It’s a philosophy that should ring true for your users too, and to ensure innocent folks aren’t put in digital jail because of the fraudulent actions of others, you need to distinguish between the following two model types:

  1. Profile-specific models, which focus on identifying fraudulent activity on a user level, meaning that these models determine whether a user is fraudulent or not.
  2. Transaction-specific models, which take a more granular approach and identify fraudulent transactions, rather than fraudulent users.

At a glance, it sounds like these models serve the same purpose, but it’s not always the case that a fraudulent transaction comes from a fraudulent user. A user shouldn’t be deemed fraudulent if their credit card was stolen and a fraudulent transaction was made on that card. Similarly, a fraudulent user doesn’t necessarily make fraudulent transactions 100% of the time (whether that user should be allowed to make any transactions at all is a topic for another time).
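To make the distinction concrete, here’s a rough sketch of how the same raw transaction log can be framed either per transaction or per user. The column names, and the idea of deriving the user-level label directly from transaction labels, are illustrative assumptions rather than a prescribed setup:

```python
import pandas as pd

# Hypothetical transaction log: one row per transaction
transactions = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 2],
    "amount":   [12.50, 980.00, 25.00, 30.00, 27.50],
    "is_fraud": [0, 1, 0, 0, 0],  # transaction-level label
})

# Transaction-specific model: each transaction is its own training example
X_txn = transactions[["amount"]]
y_txn = transactions["is_fraud"]

# Profile-specific model: aggregate to one row per user
profiles = transactions.groupby("user_id").agg(
    txn_count=("amount", "size"),
    avg_amount=("amount", "mean"),
    # In practice the user-level label should come from review, not just a max
    # over transaction flags (a stolen card doesn't make the user fraudulent)
    user_is_fraud=("is_fraud", "max"),
).reset_index()
X_user = profiles[["txn_count", "avg_amount"]]
y_user = profiles["user_is_fraud"]
```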

Rules-based models vs machine learning models

If you’ve ever traveled across the country from your home and had your local bank freeze your credit card when you try to buy a cup of coffee at your destination, you know how annoying it can be when a bank uses rules-based models vs machine learning models.

Rules-based models are models with hard-coded rules: think “if-else” statements (or case-when statements if you’re a SQL rockstar). With rules-based models, you’re responsible for coming up with the rules yourself. They’re useful if you know the exact signals that indicate fraudulent activity. However, they fall short if you either A) can’t enumerate every signal of fraudulent activity or B) can’t guarantee those signals correlate only with fraudulent activity.

For example, credit card companies often take a rules-based approach that checks the location where you use your credit card. If the distance between the transaction location and your home address passes a certain threshold (that is, you’re too far from home), the transaction may be automatically denied.
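As a minimal sketch of what such a rule might look like in code (the thresholds, the second spend-ratio rule, and the helper function below are made up for illustration):

```python
from math import asin, cos, radians, sin, sqrt

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in miles."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 3956 * 2 * asin(sqrt(a))

def flag_transaction(txn_lat, txn_lon, home_lat, home_lon,
                     amount, avg_spend,
                     max_distance_miles=500, max_spend_ratio=5.0):
    """Hard-coded if-else rules: flag if too far from home or far above average spend."""
    if haversine_miles(txn_lat, txn_lon, home_lat, home_lon) > max_distance_miles:
        return True
    if avg_spend > 0 and amount / avg_spend > max_spend_ratio:
        return True
    return False
```

The strength and the weakness are the same thing: you pick the thresholds, so the model only catches what you already know to look for.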

Machine learning fraud-detection models, on the other hand, have become increasingly popular with the emergence of data science over the past decade. Machine learning models shine when you don’t know the exact signals that indicate fraudulent activity. Instead, you provide a machine learning model with a handful of features (variables) and let the model identify the signals itself.

For example, banks feed dozens of engineered features into machine learning models to identify which transactions are likely to be fraudulent; those transactions are then moved to a second stage for further investigation. Over time, this lets the ML model learn which behaviors tend to signal fraud.
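Here’s a hedged sketch of that kind of two-stage setup using scikit-learn. The `df` DataFrame, the feature names, and the review threshold are illustrative assumptions, not any bank’s actual pipeline:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assume `df` holds engineered features plus a historical fraud label (hypothetical)
features = ["hour_of_day", "distance_from_home_miles",
            "cost_to_avg_spend_ratio", "account_age_days"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["is_fraud"],
    test_size=0.2, stratify=df["is_fraud"], random_state=42
)

# class_weight="balanced" helps with the heavy class imbalance typical of fraud data
model = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=42)
model.fit(X_train, y_train)

# Stage 1: score transactions; Stage 2: route high-scoring ones to manual review
fraud_scores = model.predict_proba(X_test)[:, 1]
needs_review = X_test[fraud_scores > 0.5]
```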

Now that we’ve broken down the four popular fraud prediction models you should know, let’s take a look at the features they should have.

Key features to use in your model

When choosing the features for your fraud prediction model, you want to include as many signals indicating fraudulent activity as possible.

To help spark some ideas, here’s a non-exhaustive list of key features commonly used in machine learning models:

  • Time of registration or transaction: The time when a user registers or makes a transaction is a good signal because it gives you an idea of your users’ normal operating hours, which helps you spot fraudulent users who act outside that window, for example by making a burst of transactions at an hour when people don’t normally transact.
  • Location of transaction: As I alluded to before, the location in which a transaction was made can sometimes indicate whether the transaction is fraudulent or not. If a transaction is made 2000 miles away from the home address within minutes of a closer-to-home transaction, that's abnormal behavior and can possibly be fraudulent.
  • Cost to average spend ratio: This looks at the amount of a given transaction compared to the average spend of the given user. The larger the ratio, the more irregular the transaction is (and more likely it's fraudulent).
  • Email information: You can check when an email account was created to flag potentially suspicious behavior (for example, if someone receives a suspicious email and the sender’s account was created earlier that day, that may point to fraud).

The more signals you can provide--and the stronger the signals are--the better your model can predict and identify fraudulent activity.
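To illustrate, here’s a rough sketch of how a few of these features might be engineered from a raw transaction table (the file name and columns below are hypothetical):

```python
import pandas as pd

# Hypothetical raw columns: user_id, amount, timestamp, account_created_at
txns = pd.read_csv("transactions.csv", parse_dates=["timestamp", "account_created_at"])

# Time of transaction: hour of day captures "normal operating hours"
txns["hour_of_day"] = txns["timestamp"].dt.hour

# Cost-to-average-spend ratio: this transaction vs. the user's average spend
txns["avg_spend"] = txns.groupby("user_id")["amount"].transform("mean")
txns["cost_to_avg_spend_ratio"] = txns["amount"] / txns["avg_spend"]

# Email/account information: how old was the account when the transaction happened?
txns["account_age_days"] = (txns["timestamp"] - txns["account_created_at"]).dt.days
```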

How to evaluate a fraud model

You have to evaluate a fraud model differently from a typical machine learning model because fraudulent activity is (hopefully) a very small slice of the overall data used to train it.

Why you shouldn’t use accuracy

Fraud detection is classified as an imbalanced classification problem. Specifically, there’s a significant imbalance between the number of fraudulent profiles/transactions and the number of non-fraudulent profiles/transactions. Because of this, you’re not going to get very far using accuracy as an evaluation metric.

To give an example of why, consider a dataset with one fraudulent transaction and 99 non-fraudulent transactions (in the real world, the proportion of fraud is even smaller). If a machine learning model were to classify every single transaction as non-fraudulent, it would be 99% accurate! Unfortunately, we’re not worried about the non-fraudulent transactions, and this accuracy-maximizing model completely fails to tackle the problem at hand.
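You can reproduce this pitfall in a few lines; the toy data below mirrors the 1-in-100 example above:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 99 non-fraudulent transactions and 1 fraudulent one
y_true = np.array([0] * 99 + [1])

# A "model" that calls everything non-fraudulent
y_pred = np.zeros(100, dtype=int)

print(accuracy_score(y_true, y_pred))  # 0.99 -- looks great
print(recall_score(y_true, y_pred))    # 0.0  -- catches none of the fraud
```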

Metrics to use instead of accuracy

Instead, here are two metrics you should consider when evaluating a fraud-prediction model:

1. Precision, also known as a positive predictive value, is the proportion of relevant instances among the retrieved instances. In other words, it answers the question “What proportion of positive identifications was actually correct?”

You should use precision when the cost of classifying a non-fraudulent transaction as fraudulent is too high and you’re okay with only catching a portion of fraudulent transactions.

2. Recall--also known as the sensitivity, hit rate, or the true positive rate (TPR)--is the proportion of the total number of relevant instances that were actually retrieved. It answers the question “What proportion of actual positives was identified correctly?”

I recommend that you use recall when it’s absolutely critical you identify every single fraudulent transaction and you feel okay with incorrectly classifying some non-fraudulent transactions as fraudulent.

Each has its own pros and cons, so weigh the strengths and weaknesses of both against the business problem at hand.
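In confusion-matrix terms, precision = TP / (TP + FP) and recall = TP / (TP + FN). Here’s a quick sketch of computing both with scikit-learn on made-up labels:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]  # actual fraud labels
y_pred = [1, 0, 1, 1, 0, 0, 0, 0, 0, 0]  # a model's predictions

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3 ≈ 0.67
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 4 = 0.50

print(f"precision={precision:.2f}, recall={recall:.2f}")
```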

Take your knowledge further

Thanks for reading! You should now know about the types of fraud models, some key features that you can use in your model, and how to evaluate your fraud model.

If you want to learn more about fraud, eCommerce, and data science, check out this podcast with Elad Cohen, the VP of data science at Riskified. Additionally, if you want to get an idea of how to build an actual fraud prediction model in Python, you can check out this Kaggle repository of a credit card fraud predictive model.

Looking for even more ways to improve your data skills overall? Check out our other tutorials here. Or, if you have questions (or want help building and leveraging your fraud prediction models), drop us a line.
