
How to test your data against Benford's Law

One of the most important aspects of data quality is being able to identify anomalies within your data. There are many ways to approach this, one of which is to test the data against Benford’s Law. This blog will take a look at what Benford’s Law is, how it can be used to detect fraud, and how the Datactics platform can be used to achieve this.

What is Benford’s Law? 

Benford’s Law is named after the physicist Frank Benford, but the phenomenon was first noticed in the 1880s by the astronomer Simon Newcomb. Newcomb was looking through logarithm tables (used to look up the logarithms of numbers before pocket calculators existed) when he spotted that the pages starting with lower digits, such as 1, were significantly more worn than the others.

Given a large set of numerical data, Benford’s Law asserts that the first digit of these numbers is more likely to be small. If the data follows Benford’s Law, the first digit will be a 1 approximately 30% of the time, whilst 9 will be the first digit only around 5% of the time. If the distribution of first digits were uniform, each digit would occur equally often (around 11% of the time). The law also proposes distributions for the second digit, the third digit, combinations of digits, and so on. According to Benford’s Law, the probability that the first digit is d is given by P(d) = log10(1 + 1/d).
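As a quick illustration (not part of the original article), the expected first-digit probabilities can be computed directly from that formula:

```python
import math

# Expected first-digit probabilities under Benford's Law: P(d) = log10(1 + 1/d)
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for digit, probability in benford.items():
    print(f"{digit}: {probability:.1%}")
# 1: 30.1%, 2: 17.6%, 3: 12.5%, ... 9: 4.6%
```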

Why is it useful? 

Many real-world datasets have been shown to follow Benford’s Law, including stock prices, population figures, and electricity bills. Because so much naturally occurring data conforms to it, checking whether a dataset follows Benford’s Law can be a good indicator of whether the data has been manipulated. While a deviation is not definitive proof that the data is erroneous or fraudulent, it can provide a good indication of problematic trends in your data.

In the context of fraud, Benford’s Law can be used to detect anomalies and irregularities in financial data, for example within large datasets such as invoices, sales records, expense reports, and other financial statements. If the data has been fabricated, the person tampering with it will probably have chosen numbers “randomly”, meaning the first digits would be roughly uniformly distributed and therefore would not follow Benford’s Law.

Below are some real-world examples where Benford’s Law has been applied:

Detecting fraud in financial accounts – Benford’s Law has been applied to many different types of financial fraud, including money laundering and the manipulation of large financial accounts. Years after Greece joined the eurozone, a Benford’s Law analysis of the economic data it had reported to the EU suggested that the figures had probably been manipulated.

Detecting election fraud – Benford’s Law was presented as evidence of fraud in the 2009 Iranian elections, was used to audit data from the 2009 German federal elections, and has also been applied to data from multiple US presidential elections.

Analysis of price digits – When the euro was introduced, all the different exchange rates meant that, while the “real” price of goods stayed the same, the “nominal” price (the monetary value) of goods was distorted. Research carried out across Europe showed that the first digits of nominal prices followed Benford’s Law. However, the second and third digits deviated from it, showing trends more commonly associated with psychological pricing: larger digits (especially 9) appear more often than expected, because prices such as £1.99 are perceived as spending £1 rather than £2.

How can Datactics’ tools be used to test for Benford’s Law? 

Using the Datactics platform, we can very easily test any dataset against Benford’s Law. Take the dataset of financial transactions shown below. We’re going to test the “pmt_amt” column to see if it follows Benford’s Law for first digits. It spans several orders of magnitude, ranging from a few dollars to 15 million, which means that Benford’s Law is more likely to apply to it accurately.

[Image: sample table of financial transaction data]

The first step of the test is to extract the first digit of the column for analysis. This can very easily be done using a small FlowDesigner project (shown below).

[Image: the FlowDesigner project used to extract and profile first digits]

Here we import the dataset and then filter out any values that are less than 1, as these aren’t relevant to our analysis. Then, we extract the first digit. Once that’s been completed, we can profile these digits to find out how many times each occurs and then save the results.
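The same steps can be sketched in plain Python with pandas. This is only an illustrative equivalent of the FlowDesigner project, not the project itself; the file name is hypothetical, while the “pmt_amt” column comes from the dataset above.

```python
import pandas as pd

# Load the transactions and keep only values of 1 or more,
# mirroring the filter step in the FlowDesigner project.
df = pd.read_csv("transactions.csv")   # hypothetical file name
amounts = df["pmt_amt"]
amounts = amounts[amounts >= 1]

# Extract the first digit of each amount, then profile how often each occurs.
first_digits = amounts.apply(lambda x: int(str(int(x))[0]))
observed = first_digits.value_counts().sort_index()
observed.to_csv("first_digit_counts.csv")   # save the profile for the next step
print(observed / observed.sum())            # proportions, for a quick check
```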

The next step is to perform a statistical test to see how confident we can be that Benford’s Law applies here. We can use our Data Quality Manager tool to orchestrate the whole process.

[Image: the Data Quality Manager workflow orchestrating the test]

Step one runs our FlowDesigner project, the second executes a simple Python script to perform the test, and the last two steps set up an automated email alert that notifies the user if the data fails the test at a specified threshold. While I’m using an email alert here, any issue-tracking platform, such as Jira, can be used. We can also show the results in a dashboard, like the one below.

[Image: dashboard comparing the observed first-digit distribution with Benford's Law]

The graph on the left shows, in green, the distribution we would expect the digits to follow if the data obeyed Benford’s Law; the red line shows the actual distribution of the digits. The bottom-right table lists the two distributions, and the top-right table shows the result of the test. In this case, it reports 100% confidence that the data is consistent with Benford’s Law.
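The blog doesn’t show the Python test itself, so the sketch below is only one reasonable way to implement it: a chi-square goodness-of-fit test of the observed first-digit counts against the Benford probabilities, with the alert firing when the p-value drops below a chosen threshold. The function name, the 0.05 threshold, and the example counts are all illustrative.

```python
import math
from scipy.stats import chisquare

def benford_first_digit_test(observed_counts, alpha=0.05):
    """observed_counts: counts for first digits 1..9, e.g. from the profiling step.
    Returns the p-value and whether the data passed at the given threshold."""
    total = sum(observed_counts)
    expected = [total * math.log10(1 + 1 / d) for d in range(1, 10)]
    statistic, p_value = chisquare(observed_counts, f_exp=expected)
    return p_value, p_value >= alpha   # False would trigger the email alert

# Example: counts that roughly follow Benford's Law, so the test passes.
p, passed = benford_first_digit_test([301, 176, 125, 97, 79, 67, 58, 51, 46])
```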

In conclusion…

The pattern popularised by physicist Frank Benford is as useful today as it has ever been. Benford’s Law is a powerful tool for detecting fraud and other irregularities in large datasets. By combining statistical analysis with expert knowledge and AI-enabled technologies, organizations can improve their ability to detect and prevent fraudulent activities, safeguarding their financial health and reputation.

Matt Neill is a Machine Learning Engineer at Datactics. For more insights from Datactics, find us on LinkedIn, Twitter or Facebook.

Outlier Detection – What Is It and How Can It Help Improve Data Quality?


Identifying outliers and errors in data is an important but time-consuming task. Depending on the context and domain, errors can have a wide range of impacts, some of them severe. One of the difficulties in detecting outliers and errors is that they come in many different forms. There are syntactic errors, where a value like a date or time is in the wrong format, and semantic errors, where a value is in the correct format but doesn’t make sense in the context of the data, like an age of 500. The biggest challenge in creating a method for detecting outliers in a dataset is identifying such a vast range of different errors with a single tool.

At Datactics, we’ve been working on a tool to solve some of these problems and enable errors and outliers to be identified quickly with minimal user input. The goal of this project is to assign a number to each value in a dataset representing the likelihood that the value is an outlier. To do this we use a number of different features of the data, ranging from quite simple methods, like looking at the frequency of a value or its length compared to others in its column, to more complex methods using n-grams and co-occurrence statistics. Once we have used these features to get a numerical representation of each value, we can apply some simple statistical tests to find the outliers.

When profiling a dataset, there are a few simple things you can do to find errors and outliers. A good place to start is to look at the least frequent values in a column, or the shortest and longest values. These will highlight some of the most obvious errors, but what then? If you are profiling numeric or time data, you could rank the values and look at both ends of the spectrum for any other obvious outliers. But what about text data, or unique values that can’t be profiled using frequency analysis? If you want to identify semantic errors, this profiling needs to be done by a domain expert. Another limitation is that all of this must be done manually. Several aspects of the outlier detection process clearly limit both its convenience and practicality, and these are some of the things we have tried to address with this project.
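As a rough illustration of those basic checks (this is not Datactics code, just a pandas sketch of the manual profiling described above, with a hypothetical function name):

```python
import pandas as pd

def quick_profile(column: pd.Series, n: int = 5) -> dict:
    """Surface the obvious candidates: the rarest values, the shortest and
    longest values, and (for numeric columns) the smallest and largest."""
    as_text = column.astype(str)
    lengths = as_text.str.len()
    report = {
        "rarest": column.value_counts().tail(n),
        "shortest": as_text.loc[lengths.nsmallest(n).index],
        "longest": as_text.loc[lengths.nlargest(n).index],
    }
    if pd.api.types.is_numeric_dtype(column):
        report["smallest"] = column.nsmallest(n)
        report["largest"] = column.nlargest(n)
    return report
```

Everything beyond these obvious checks, such as semantic errors, free text, and unique values, is exactly where a feature-based approach becomes necessary.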

Outlier Detection

When designing this tool, our objective was to create a simple, effective, universal approach to outlier detection. There are a large number of statistical methods for outlier detection, some of which have existed for hundreds of years. These are all based on identifying numerical outliers, which is useful in some of the cases listed above but has obvious limitations. Our solution is to create a numerical representation of every value in the dataset that can be used with a straightforward statistical method. We do this using features of the data. The features currently implemented and available for use are:

  • Character N-Grams 
  • Co-Occurrence Statistics 
  • Date Value 
  • Length 
  • Numeric Value 
  • Symbolic N-Grams 
  • Text Similarities 
  • Time Value 

We are also working on a feature to identify outliers in time-series data. Some of these features, such as date and numeric value, are only applicable to certain types of data. Some incorporate the very simple steps discussed above, like occurrence and length analysis. Others are more complicated and could not be done manually, like co-occurrence statistics. Then there are some, like the natural-language-processing text similarities, which make use of machine learning algorithms. While there will be some overlap in the outliers identified by these features, for the most part they will each single out different errors and outliers, acting as an antidote to the heterogeneous nature of errors discussed above.
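The implementation itself isn’t published in this post, so the following is only a toy sketch of the general idea, using two of the simplest features (length and frequency of occurrence) and a robust z-score to turn them into a single outlier score per value. The function name and scoring choices are illustrative.

```python
import pandas as pd

def outlier_scores(column: pd.Series) -> pd.DataFrame:
    """Toy illustration: represent each value with simple numeric features,
    then score how far each one sits from the column's typical behaviour."""
    as_text = column.astype(str)
    features = pd.DataFrame({
        "length": as_text.str.len(),
        "frequency": column.map(column.value_counts()),
    })

    # Robust z-score per feature (median and median absolute deviation),
    # so extreme values don't distort the baseline they are measured against.
    median = features.median()
    mad = (features - median).abs().median().replace(0, 1)
    scores = ((features - median) / mad).abs()

    # A value is suspicious if ANY of its features is extreme.
    features["score"] = scores.max(axis=1)
    features["value"] = column.values
    return features.sort_values("score", ascending=False)
```

Values with the highest scores would be reviewed first, and a confidence level can be expressed simply as a cut-off on that score.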

One of the benefits of this method of outlier detection is its simplicity, which leads to very explainable results. Once the features of our dataset have been generated, we have a number of options for the next steps. In theory, all of these features could be fed into a machine learning model, which could then be used to label data as outlier or non-outlier. However, there are a number of disadvantages to this approach. Firstly, it would require a labelled dataset to train the model, which would be time-consuming to create. Moreover, the features will differ from dataset to dataset, so it would not be a case of “one model fits all”. Finally, if you use a “black box” machine learning method, then when a value is labelled as an outlier you have no way of explaining that decision, or of providing evidence as to why this value has been labelled rather than others in the dataset.

All three of these problems are avoided by the Datactics approach. The outliers are generated using only the features of the original dataset and, because of the statistical methods being used, can be identified with nothing but the data itself and a confidence level (a numerical value representing the likelihood that a value is an outlier). There is no need for any labelling or parameter tuning with this approach. The other big advantage is that, because we assign a number to every value, we have evidence to back up every outlier identified and can demonstrate how it differs from the non-outliers in the data.

Another benefit of this approach is that it is modular and therefore completely expandable. The features the outliers are based on can be selected according to the data being profiled, which increases accuracy. This architecture also gives us the ability to seamlessly expand the number of available features: if trends or common errors are encountered that aren’t identified by the current features, it is very straightforward to create another feature to rectify this.

And for more from Datactics, find us on LinkedIn, Twitter, or Facebook.
