ML Archives - Datactics (https://www.datactics.com/tag/ml/) | Unlock your data's true potential

Outlier Detection – What Is It And How Can It Help In The Improvement Of Data Quality?
https://www.datactics.com/blog/ai-ml/outlier-detection-what-is-it-and-how-can-it-help-in-the-improvements-of-data-quality/
Fri, 27 May 2022

Outlier Detection

Identifying outliers and errors in data is an important but time-consuming task. Depending on the context and domain, errors can be impactful in a variety of ways, some very severe. One of the challenges in detecting outliers and errors is that they come in many different forms. There are syntactic errors, where a value like a date or time is in the wrong format, and semantic errors, where a value is in the correct format but doesn't make sense in the context of the data, such as an age of 500. The biggest problem in creating a method for detecting outliers in a dataset is identifying a vast range of different errors with a single tool. 

At Datactics, we've been working on a tool to solve some of these problems and enable errors and outliers to be quickly identified with minimal user input. With this project, our goal is to assign a number to each value in a dataset that represents the likelihood that the value is an outlier. To do this, we use a number of different features of the data, ranging from quite simple methods, like looking at the frequency of a value or its length compared to others in its column, to more complex methods using n-grams and co-occurrence statistics. Once we have used these features to get a numerical representation of each value, we can then use some simple statistical tests to find the outliers. 

When profiling a dataset, there are a few simple things you can do to find errors and outliers in the data. A good place to start is to look at the least frequent values in a column, or the shortest and longest values. These will highlight some of the most obvious errors, but what then? If you are profiling numeric or time data, you could rank the data and look at both ends of the spectrum to see if there are any other obvious outliers. But what about text data, or unique values that can't be profiled using frequency analysis? If you want to identify semantic errors, this profiling would need to be done by a domain expert. Another factor to consider is that this must all be done manually. It is evident that a number of aspects of the outlier detection process limit both its convenience and practicality. These are some of the things we have tried to address with this project. 
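To make these manual profiling steps concrete, here is a minimal sketch in Python, using made-up values, of the kind of frequency and length checks described above:

```python
from collections import Counter

# Illustrative profiling of a single text column (hypothetical values).
column = ["red", "red", "blue", "blue", "blue", "rde", "green", "green"]

counts = Counter(column)
min_count = min(counts.values())

# The least frequent values are candidate errors ("rde" here is a typo of "red").
least_frequent = [v for v, c in counts.items() if c == min_count]

# The shortest and longest values can also surface anomalies.
by_length = sorted(set(column), key=lambda v: (len(v), v))
shortest, longest = by_length[0], by_length[-1]

print(least_frequent, shortest, longest)
```

Even on this toy column the limits are visible: frequency analysis flags the typo, but it would say nothing about a column of unique values, which is exactly the gap the feature-based approach below targets.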

When designing this tool, our objective was to create a simple, effective, universal approach to outlier detection. There are a large number of statistical methods for outlier detection that, in some cases, have existed for hundreds of years. These are all based on identifying numerical outliers, which would be useful in some of the cases listed above but have obvious limitations. Our solution is to create a numerical representation of every value in the dataset that can be used with a straightforward statistical method. We do this using features of the data. The features currently implemented and available for use are: 

  • Character N-Grams 
  • Co-Occurrence Statistics 
  • Date Value 
  • Length 
  • Numeric Value 
  • Symbolic N-Grams 
  • Text Similarities 
  • Time Value 

We are also working on creating a feature of the data to enable us to identify outliers in time series data. Some of these features, such as date and numeric value, are only applicable to certain types of data. Some incorporate the very simple steps discussed above, like occurrence and length analysis. Others are more complicated and could not be done manually, like co-occurrence statistics. Then there are some, like the natural language processing text similarities, which make use of machine learning algorithms. While there will be some overlap in the outliers identified by these features, for the most part they will each single out different errors and outliers, acting as an antidote to the heterogeneous nature of errors discussed above. 
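As a rough illustration of the feature idea (a simplified sketch, not the Datactics implementation), each value in a column can be turned into a small numeric feature vector covering frequency, length and numeric value:

```python
from collections import Counter

def build_features(column):
    """Turn each value into a simple numeric feature vector:
    relative frequency, length, and numeric value where applicable.
    A simplified sketch of the feature idea described in the text."""
    counts = Counter(column)
    n = len(column)
    features = []
    for value in column:
        try:
            numeric = float(value)
        except ValueError:
            numeric = None  # feature not applicable to this value
        features.append({
            "value": value,
            "frequency": counts[value] / n,
            "length": len(value),
            "numeric": numeric,
        })
    return features

rows = build_features(["25", "31", "500", "28", "abc"])
print(rows)
```

Each feature is a column of numbers, so ordinary statistical tests can then be run over any of them, regardless of whether the underlying data was text, dates or numbers.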

One of the benefits of this method of outlier detection is its simplicity, which leads to very explainable results. Once the features of our dataset have been generated, we have a number of options for next steps. In theory, all of these features could be fed into a machine learning model, which could then be used to label data as outliers and non-outliers. However, there are a number of disadvantages to this approach. Firstly, it would require a labelled dataset to train the model, which would be time-consuming to create. Moreover, the features will differ from dataset to dataset, so it would not be a case of "one model fits all". Finally, if you are using a "black box" machine learning method, when a value is labelled as an outlier you have no way of explaining this decision, or of providing evidence as to why this value has been labelled as opposed to others in the dataset. 

All three of these problems are avoidable using the Datactics approach. The outliers are generated using only the features of the original dataset and, because of the statistical methods being used, can be identified with nothing but the data itself and a confidence level (a numerical value representing the likelihood that a value is an outlier). There is no need for any labelling or parameter-tuning with this approach. The other big advantage is that, because we assign a number to every value, we have evidence to back up every outlier identified and are able to demonstrate how it differs from the non-outliers in the data. 
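A minimal sketch of how such a statistical test might look, using a z-score over a single numeric feature and a user-supplied confidence level (the specific thresholding scheme here is an illustrative assumption, not the Datactics method):

```python
import statistics

def flag_outliers(values, confidence=3.0):
    """Flag values whose feature score lies more than `confidence`
    standard deviations from the mean. Returns (value, z-score) pairs,
    so every flagged outlier comes with evidence for the decision."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, nothing to flag
    return [(v, abs(v - mean) / stdev) for v in values
            if abs(v - mean) / stdev > confidence]

# The semantic error from earlier: an age of 500.
ages = [25, 31, 28, 45, 33, 500, 29, 38]
print(flag_outliers(ages, confidence=2.0))
```

The returned z-score is exactly the kind of per-value evidence the text describes: it shows how far the flagged value sits from the rest of the column, rather than a bare black-box label.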

Another benefit of this approach is that it is modular and therefore completely expandable. The features the outliers are based on can be selected according to the data being profiled, which increases accuracy. This architecture also gives us the ability to seamlessly expand the number of features available, and if trends or common errors are encountered that aren't identified using the current features, it is very straightforward to create another feature to rectify this. 

And for more from Datactics, find us on LinkedIn, Twitter, or Facebook.

Rules Suggestion – What is it and how can it help in the pursuit of improving data quality?
https://www.datactics.com/blog/ai-ml/rules-suggestion-what-is-it-and-how-can-it-help-improve-data-quality/
Wed, 15 Sep 2021

Written by Daniel Browne, Machine Learning Engineer

Defining data quality rules and collections of rules for data quality projects is often a manual, time-consuming process. It typically involves a subject matter expert reviewing data sources and designing quality rules to ensure the data complies with integrity, accuracy and/or regulatory standards. As data sources increase in volume and variety, with potential functional dependencies between them, the task of defining data quality rules becomes more difficult. Machine learning can aid with this task by identifying dependencies between datasets, uncovering patterns related to data quality, and suggesting previously applied rules for similar data. 

At Datactics, we recently undertook a Rule Suggestion Project to automate the process of defining data quality rules for datasets. We use natural language processing techniques to analyse the contents of a dataset and suggest the rules in our rule library that best fit each column. 

Problem Area and ML Solution  

Generally, there are several data quality and data cleansing rules that you would want to apply to certain fields in a dataset. An example is a consistency check on a phone number column, such as checking that the number provided is valid and formatted correctly. Unfortunately, it is not usually as simple as searching for the phrase "phone number" in a column header and going from there. A phone number column could be labelled "mobile", or "contact", or "tel", for example. Doing a string match in these cases may not uncover accurate rule suggestions. We need context embedded into this process, and this is where machine learning comes in. We've been experimenting with building and training machine learning models to categorise data, then return suggestions for useful data quality and data cleansing rules to consider applying to datasets. 
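The weakness of exact string matching can be illustrated with a toy similarity-based matcher (this is purely illustrative; the labels, rule names and similarity measure are hypothetical, and the actual Datactics models use richer NLP techniques):

```python
def bigrams(s):
    """Character bigrams of a lowercased string."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

# Hypothetical training labels mapped to rule categories.
known_labels = {
    "phone number": "phone_format_check",
    "mobile": "phone_format_check",
    "contact tel": "phone_format_check",
    "email address": "email_format_check",
    "date of birth": "date_format_check",
}

def suggest_rule(header):
    """Suggest a rule by Dice similarity of character bigrams
    between the header and the known labels."""
    hb = bigrams(header)
    best_rule, best_score = None, 0.0
    for label, rule in known_labels.items():
        lb = bigrams(label)
        score = 2 * len(hb & lb) / (len(hb) + len(lb)) if (hb or lb) else 0.0
        if score > best_score:
            best_rule, best_score = rule, score
    return best_rule, best_score

print(suggest_rule("tel no"))
```

An exact match on "phone number" would miss the header "tel no" entirely, while even this crude similarity measure recovers the right rule category; a trained model extends the same intuition with far more context.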

Human in the Loop  

The goal here is not to take away control from the user: the machine learning model isn't going to run off with your dataset and do whatever it determines to be right on its own. The aim is to assist the user and streamline the selection of rules to apply. The user has full control to accept or reject some or all of the suggestions that come from the Rule Suggestion model. Users can also add new rules not suggested by the model, and this information is captured to improve the model's suggestions. We hope that this will be a useful tool that makes the process of setting up data quality and data cleansing rules quicker and easier. 

Developers View  

I've been involved in the development of this project from the early stages, and it's been exciting to see it come together and take shape. A lot of my involvement has been around building out the systems and infrastructure that help users interact with the model and that format the model's outputs into easily understandable and useful pieces of information. This work involves enabling the software to take a dataset and process it so that the model can make its predictions, and then mapping from the model's output to the individual rules that are presented to the user. 
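The mapping step can be pictured as a simple lookup from the model's predicted category to concrete rules (the category and rule names below are hypothetical placeholders, not the actual Datactics rule library):

```python
# Hypothetical mapping from a model's predicted category to concrete rules.
CATEGORY_TO_RULES = {
    "phone": ["is_valid_phone_number", "strip_whitespace", "normalise_country_code"],
    "email": ["is_valid_email", "lowercase"],
    "date": ["is_valid_date", "standardise_date_format"],
}

def rules_for_prediction(predicted_category):
    """Map the model's output category to the rules shown to the user."""
    return CATEGORY_TO_RULES.get(predicted_category, [])

print(rules_for_prediction("phone"))
```

Keeping this mapping separate from the model means the rule library can grow without retraining: a new rule added under "phone" is immediately available to every column the model classifies as a phone number.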

One of the major focuses throughout the development of the project has been control, and we've built the project out with this in mind. For example, users can control how cautious the model should be in making suggestions by setting confidence thresholds, meaning the model will only return suggestions that meet or surpass the chosen threshold. We've also included the ability to add specific word-to-rule mappings, which help maintain a higher level of consistency and accuracy for very specific or rare categories that the model may have little or no prior knowledge of. For example, if there are proprietary fields with their own unique labels, formatting, patterns or structures, and their own unique rules, it's possible to define a direct mapping from those fields to rules, so that the Rule Suggestion system can produce accurate suggestions for any future instances of that information in a dataset. 
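These two controls, confidence thresholds and direct word-to-rule mappings, can be sketched together as a small filtering step (a minimal illustration with hypothetical column and rule names, not the production implementation):

```python
def filter_suggestions(model_suggestions, threshold=0.8, custom_mappings=None):
    """Keep model suggestions at or above the confidence threshold,
    and let user-defined word-to-rule mappings override the model."""
    custom_mappings = custom_mappings or {}
    results = {}
    for column, (rule, confidence) in model_suggestions.items():
        if column in custom_mappings:
            # A direct mapping takes priority over the model's prediction.
            results[column] = custom_mappings[column]
        elif confidence >= threshold:
            results[column] = rule
    return results

suggestions = {
    "tel": ("phone_format_check", 0.91),
    "ref_code": ("postcode_check", 0.42),   # below threshold, dropped
    "acct_id": ("generic_id_check", 0.65),  # overridden by a custom mapping
}
print(filter_suggestions(suggestions, threshold=0.8,
                         custom_mappings={"acct_id": "proprietary_account_rule"}))
```

Raising the threshold makes the system more cautious (fewer, higher-confidence suggestions), while the mapping guarantees consistent results for proprietary fields regardless of what the model predicts.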

Another aspect of the project we hope to develop further is the idea of consistently improving results as the project matures. In the future we're looking to develop a system where the model continues to adapt based on how the suggested rules are used. Ideally, this means that if the model tends to incorrectly predict that a specific rule will be useful for a given dataset column, it will learn to avoid suggesting that rule for that column, based on the fact that users tend to reject the suggestion. Similarly, if there are rules that the model tends not to suggest for a certain column but that users then manually select, the model will learn to suggest these rules in similar cases in the future. 
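One simple way such a feedback loop could work (a toy sketch of the idea described above; the weighting scheme and names are assumptions, since this system is still planned rather than built) is to nudge a per-column, per-rule weight up or down as users accept or reject suggestions:

```python
from collections import defaultdict

class FeedbackWeights:
    """Toy feedback loop: nudge a (column, rule) weight up when users
    accept a suggestion or manually add the rule, and down when they
    reject it, then blend that weight into the model's confidence."""

    def __init__(self, step=0.1):
        self.step = step
        self.weights = defaultdict(float)

    def record(self, column, rule, accepted):
        delta = self.step if accepted else -self.step
        self.weights[(column, rule)] += delta

    def adjusted_confidence(self, column, rule, model_confidence):
        return model_confidence + self.weights[(column, rule)]

fb = FeedbackWeights()
# Users reject the same suggestion twice, so its effective confidence drops.
fb.record("mobile", "phone_format_check", accepted=False)
fb.record("mobile", "phone_format_check", accepted=False)
print(fb.adjusted_confidence("mobile", "phone_format_check", 0.85))
```

Combined with the confidence threshold above, repeatedly rejected suggestions would eventually fall below the cutoff and stop appearing, while repeatedly hand-added rules would rise above it, which is exactly the adaptive behaviour the paragraph describes.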

In the same vein, one of the recent developments that I've found really interesting and exciting is a system that allows us to analyse the performance of various machine learning models on a suite of sample data. This gives us detailed insight into what makes an efficient and powerful rule prediction model, and into how we can expect models to perform in real-world scenarios. It provides a sandbox to experiment with new ways of creating and updating machine learning models, and lets us estimate baseline standards for performance, so we can be confident in the level of performance of our system. It's been really rewarding to analyse the results from this process so far, to compare different methods of processing the data and building machine learning models, and to see the areas in which one model may outperform another. 

Thanks to Daniel for talking to us about rules suggestion. If you would like to discuss further or find out more about rules suggestion at Datactics, reach out to Daniel Browne directly, or contact our Head of AI, Fiona Browne.

Get in touch or find us on LinkedIn, Twitter, or Facebook.
