Labs Archives - Datactics
https://www.datactics.com/category/labs/

The Importance of Data Quality in Machine Learning
https://www.datactics.com/blog/the-importance-of-data-quality-in-machine-learning/
18 December 2023


We are currently in an exciting era, where Machine Learning (ML) is applied across sectors from self-driving cars to personalised medicine. Although ML models have been around for a while – algorithmic trading models since the 1980s, Bayesian methods since the 1700s – we are still in the nascent stages of productionising ML.

From a technical viewpoint, this is 'Machine Learning Ops', or MLOps. MLOps involves figuring out how to build and deploy models via continuous integration and deployment, and how to track and monitor models and data in production.

From a human, risk, and regulatory viewpoint, we are grappling with big questions about ethical AI (Artificial Intelligence) systems and where and how they should be used. Risk, privacy and security of data, accountability, fairness, and adversarial AI all come into play here. Additionally, the choice between supervised, semi-supervised, and unsupervised machine learning brings further complexity to the mix.

Much of the focus is on the models themselves, such as OpenAI's GPT-4. Everyone can get their hands on pre-trained models or licensed APIs; what differentiates a good deployment is data quality.

However, the one common theme that underpins all this work is the rigour required in developing production-level systems, and especially the data necessary to ensure they are reliable, accurate, and trustworthy. This matters especially for ML systems, given the role that data and processes play and the impact that poor-quality data has on ML algorithms and learning models in the real world.

Data as a common theme 

If we shift our gaze from the model side to the data side and ask questions such as the following (a short code sketch below gives a flavour of the basic checks involved):

  • Data management – what processes do I have to manage data end to end, especially generating accurate training data?
  • Data integrity – how am I ensuring I have high-quality data throughout?
  • Data cleansing and improvement – what am I doing to prevent bad data from reaching data scientists?
  • Dataset labelling – how am I avoiding the risk of unlabelled data?
  • Data preparation – what steps am I taking to ensure my data is data science-ready?

then a far greater understanding of performance and model impact (consequences) could be achieved. However, this is often viewed as less glamorous or exciting work and, as such, is often undervalued. For example, what is the impetus for companies or individuals to invest at this level – regulatory (e.g. BCBS), financial, reputational, or legal?
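To make the data integrity and preparation questions above a little more concrete, here is a minimal data quality profiling sketch in pandas; the dataset, column names and reference values are illustrative assumptions rather than a description of any particular product.

```python
import pandas as pd

# Illustrative customer dataset; column names and values are assumptions for the example.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4, None],
    "email": ["a@x.com", "bad-email", "b@y.com", None, "c@z.com"],
    "country": ["GB", "GB", "IE", "XX", "GB"],
})

profile = {
    # Completeness: share of non-null values per column.
    "completeness": df.notna().mean().round(2).to_dict(),
    # Uniqueness: does the supposed key actually identify rows?
    "customer_id_unique": df["customer_id"].dropna().is_unique,
    # Validity: crude format check for email addresses.
    "email_valid_rate": df["email"].str.contains(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False).mean(),
    # Consistency: values drawn from an agreed reference list.
    "country_in_reference": df["country"].isin({"GB", "IE", "US"}).mean(),
}

print(profile)
```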

Yet, as researchers at Google put it,

“Data largely determines performance, fairness, robustness, safety, and scalability of AI systems…[yet] In practice, most organizations fail to create or meet any data quality standards, from under-valuing data work vis-a-vis model development.” 

This has a direct impact on people’s lives and society, where “…data quality carries an elevated significance in high-stakes AI due to its heightened downstream impact, impacting predictions like cancer detection, wildlife poaching, and loan allocations”.

What this looks like in practice

We have seen this in the past with the exam grade predictions in the UK during Covid. Teachers predicted the grades of their students, and the Office of Qualifications and Examinations Regulation then applied an algorithm to these predictions to counter potential grade inflation. The algorithm was quite complex and, at first, non-transparent. When the results were released, 39% of grades were downgraded. The algorithm captured the distribution of grades at each school in previous years, the predicted grades of past students, and the predictions for the current year.

In practice, this meant that if you were a candidate who had performed well at GCSE but attended a historically poor-performing school, it was challenging to achieve a top grade. Teachers had to rank the students in each class, resulting in a relative ranking that could not equate to absolute performance. Even if you were predicted a B and ranked fifteenth out of 30 in your class, if the pupil ranked fifteenth in each of the last three years had received a C, you would likely get a C.

The application of this algorithm caused an uproar, not least because schools with small class sizes – usually private, fee-paying schools – were exempt from the algorithm, so their teacher-predicted grades were used instead. Additionally, it baked in past socioeconomic biases, benefitting underperforming students in affluent (and previously high-scoring) areas while suppressing the results of high-performing students in lower-income regions.

A major lesson from this episode, therefore, was the need for transparency in both the process and the data that was used.

An example from healthcare

Healthcare offers a similar example of the impact of data on ML cancer prediction. IBM's 'Watson for Oncology' partnered with The University of Texas MD Anderson Cancer Center in 2013 to "uncover valuable insights from the cancer center's rich patient and research databases". The system was trained on a small number of hypothetical cancer patients rather than real patient data, which resulted in erroneous and dangerous cancer treatment advice.

Significant questions that must be asked include:

  • Where did it go wrong – certainly in the data, but also in the wider AI system?
  • Where was the risk assessment?
  • What testing was performed?
  • Where did responsibility and accountability reside?

Machine Learning practitioners know well the statistic that 80% of ML work is data preparation. Why, then, don't we focus on this 80% of the effort and deploy a more systematic approach, so that data quality is embedded in our systems and treated as important work to be performed by an ML team?

This is a view recently articulated by Andrew Ng, who urges the ML community to be more data-centric and less model-centric. Andrew demonstrated this using a steel-sheet defect detection use case in which a deep learning computer vision model achieved a baseline accuracy of 76.2%. By addressing inconsistencies in the training dataset and correcting noisy or conflicting labels, classification performance reached 93.1%. Interestingly, and compellingly from the perspective of this blog post, only minimal performance gains were achieved by working on the model side alone.

Our view is that if data quality is a key limiting factor in ML performance, then let's focus our efforts on improving data quality – and ask whether ML itself can be deployed to address this. This is the central theme of the work the ML team at Datactics undertakes. Our focus is on automating the manual, repetitive (often referred to as boring!) business processes of data quality and matching tasks, while embedding subject matter expertise into the process. To do this, most of our solutions employ a human-in-the-loop approach: we capture human decisions and expertise and use them to inform and re-train our models. This human expertise is essential in guiding the process and providing context, improving both the data and the data quality process. We are keen to free clients from mundane manual tasks and instead use their expertise on tricky cases, with simple agree/disagree options.

To learn more about an AI-driven approach to Data Quality, read our press release about our Augmented Data Quality platform here. 

How Data Quality Tools Deliver Clean Data for AI and ML
https://www.datactics.com/blog/ai-ml/how-data-quality-tools-deliver-clean-data-for-ai-and-ml/
21 February 2022

In her previous blog, Dr Fiona Browne, Head of AI and Software Development, assessed the need for the AI and Machine Learning world to prioritise the data that is being fed into models and algorithms (you can read it here). This blog goes into some of the critical capabilities data quality tools need in order to support specific AI and ML use cases with clean data.


A Broad Range of Data Quality Tool Features On Offer

The data quality tools market is full of vendors with a wide range of capabilities, as referenced in the recent Gartner Magic Quadrant. Regardless of a firm's data volumes, or whether it is a small, midsize or large enterprise, it will rely on high-quality data for every conceivable business use case, from the smallest product data problem to enterprise master data management. Consequently, data leaders should explore the competitive landscape fully to find the best fit for their data governance culture and the growth opportunities that the right vendor-client fit can offer.

Labelling Datasets

A supervised Machine Learning (ML) model learns from a training dataset consisting of features and labels.

We do not often hear about the effort required to produce a consistent, well-labelled dataset, yet this has a direct impact on the quality of a model and its predictive performance, regardless of organisation size. A recent Google research report estimates that within an ML project, data labelling can cost between 25% and 60% of the total budget.

Labelling is often a manual process requiring a reviewer to assign a tag to a piece of data e.g. to identify a car in an image, state if a case is fraudulent, or assign sentiment to a piece of text.

Succinct, well-defined labelling instructions should be provided to reduce labelling inconsistencies. Data quality solutions can be applied in this context through metrics that measure label consistency within a dataset, which can then be used to review and improve consistency scores.
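As a rough illustration of a label consistency metric, the sketch below computes raw inter-annotator agreement and Cohen's kappa over a doubly-labelled sample using scikit-learn; the labels and annotators are invented for the example.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators over the same ten records.
annotator_a = ["fraud", "ok", "ok", "fraud", "ok", "ok", "fraud", "ok", "ok", "ok"]
annotator_b = ["fraud", "ok", "fraud", "fraud", "ok", "ok", "ok", "ok", "ok", "ok"]

# Raw agreement: fraction of records where both annotators gave the same label.
agreement = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)

# Cohen's kappa corrects raw agreement for agreement expected by chance.
kappa = cohen_kappa_score(annotator_a, annotator_b)

print(f"raw agreement: {agreement:.2f}, Cohen's kappa: {kappa:.2f}")
```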

As labelling is a laborious process, and access to the resources needed to provide labels can be limited, we reduce the volume of manual labelling using an active learning approach.

Here, ML is used to identify the trickiest edge cases within a dataset to label. These prioritised cases are passed to a reviewer to annotate manually, without the need to label a complete dataset. This approach also captures the rationale from a human expert as to why a label was provided, which provides transparency in predictions further downstream.
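A minimal sketch of the uncertainty-sampling idea behind active learning, using scikit-learn; the synthetic data, model choice and batch size of ten are illustrative assumptions rather than a description of the Datactics implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small seed of labelled data plus a large pool of "unlabelled" examples.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
labelled_idx = np.arange(50)      # pretend only the first 50 labels exist
pool_idx = np.arange(50, 1000)    # the rest are treated as unlabelled

model = LogisticRegression(max_iter=1000)
model.fit(X[labelled_idx], y[labelled_idx])

# Uncertainty sampling: pick the pool examples whose predicted probability
# is closest to 0.5, i.e. the cases the model is least sure about.
proba = model.predict_proba(X[pool_idx])[:, 1]
uncertainty = np.abs(proba - 0.5)
query_idx = pool_idx[np.argsort(uncertainty)[:10]]

print("Send these 10 examples to a human reviewer:", query_idx)
```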

Entity resolution

For data matching and entity resolution, Datactics has used ML as a 'decision aid' for low-confidence matches, again reducing the burden of manual review. The approach implemented by Datactics provides information on the confidence of each prediction, through to the rationale as to why the prediction was made. Additionally, the solution has a built-in capability to accept or reject predictions, so the client can continually update and improve them, using that fully explainable, human-in-the-loop approach. You can see more information on this in our White Paper here.
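The Datactics implementation itself is not shown here, but the general 'decision aid' pattern can be sketched as follows: auto-accept high-confidence matches, auto-reject clear non-matches, and route the uncertain middle band to a reviewer. The thresholds, entity names and scores below are invented for illustration.

```python
def route_match(score: float, accept_at: float = 0.9, reject_at: float = 0.3) -> str:
    """Route a candidate entity match based on model confidence.

    The thresholds are illustrative; in practice they would be tuned against
    reviewed examples and updated as accept/reject feedback accumulates.
    """
    if score >= accept_at:
        return "auto-accept"
    if score <= reject_at:
        return "auto-reject"
    return "manual-review"   # the low-confidence band goes to a human reviewer

# Hypothetical candidate pairs with match-confidence scores from a model.
candidates = [
    ("ACME Ltd", "ACME Limited", 0.97),
    ("ACME Ltd", "ACNE Holdings", 0.12),
    ("ACME Ltd", "Acme Group plc", 0.55),
]

for left, right, score in candidates:
    print(f"{left!r} vs {right!r}: {route_match(score)}")
```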

Detecting outliers and predicting rules

This is a critical step in a fully AI-augmented data quality journey, occurring in the key data profiling stage, before data cleansing. It empowers business users, who are perhaps not familiar with big data techniques, coding or programming, to rapidly get to grips with the data they are exploring. Using ML in this way helps them to uncover relationships, dependencies and patterns which can influence which data quality rules they wish to use to improve data quality or deliver better business outcomes, for example regulatory reporting or digital transformation.

This automated approach to identifying potentially erroneous data within your dataset, and highlighting it within the context of data profiling, reduces the manual effort spent trying to find these connections across different data sources or within an individual dataset. It can remove a lot of the heavy lifting associated with data profiling, especially when complex data integration or connectivity to data lakes or data stores is required.
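One common way to automate outlier detection during profiling is an unsupervised model such as an Isolation Forest. The sketch below is a generic scikit-learn illustration with invented values, not a description of the specific technique used in the Datactics platform.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Illustrative numeric column from a dataset being profiled (e.g. transaction amounts).
df = pd.DataFrame({"amount": [10.0, 12.5, 11.2, 9.8, 10.4, 950.0, 11.9, 10.1]})

# An Isolation Forest isolates anomalies with short random partitioning paths.
model = IsolationForest(contamination=0.1, random_state=0)
df["outlier"] = model.fit_predict(df[["amount"]]) == -1

print(df[df["outlier"]])   # rows flagged for review during profiling
```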

The rule prediction element complements outlier detection. It involves reviewing a dataset and suggesting data quality rules that can be run against it, to ensure compliance both with regulations and with the standard dimensions of data quality (e.g. consistency, accuracy, timeliness), and with business dimensions or policies such as credit ratings or risk appetite.
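A deliberately simplified sketch of the rule prediction idea: profile a column and propose candidate data quality rules for a reviewer to accept or reject. The heuristics and thresholds here are assumptions made purely for illustration.

```python
import pandas as pd

def suggest_rules(series: pd.Series) -> list[str]:
    """Propose candidate data quality rules from a single column's profile."""
    rules = []
    if series.notna().mean() >= 0.95:
        rules.append(f"{series.name}: completeness >= 95%")
    if series.dropna().is_unique:
        rules.append(f"{series.name}: values must be unique")
    observed = set(series.dropna().unique())
    if len(observed) <= 10:
        rules.append(f"{series.name}: value must be one of {sorted(observed)}")
    return rules

df = pd.DataFrame({"currency": ["GBP", "GBP", "EUR", "USD", "GBP", None]})
for rule in suggest_rules(df["currency"]):
    print("Suggested rule:", rule)
```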

Fixing data quality breaks

Again, ML helps in this area, where the focus is on manual tasks for remediating erroneous or broken data. Can we detect trends in this data – for example, does the finance dataset we ingest on the first day of each month cause a spike in data quality issues? Is there an optimal path to remediation that we can predict, or are there remediation values that we can suggest?
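As a rough sketch of the trend-detection idea, the snippet below flags days whose data quality break count sits far above the recent baseline; the counts, window and threshold are invented for the example.

```python
import pandas as pd

# Hypothetical daily counts of data quality breaks for one feed.
breaks = pd.Series(
    [12, 14, 11, 13, 12, 95, 13, 12],
    index=pd.date_range("2024-03-01", periods=8, freq="D"),
)

# Compare each day against the trailing baseline of the previous days.
baseline_mean = breaks.shift(1).rolling(window=5, min_periods=3).mean()
baseline_std = breaks.shift(1).rolling(window=5, min_periods=3).std()
z_score = (breaks - baseline_mean) / baseline_std

print(breaks[z_score > 3])   # candidate spikes worth investigating
```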

For fixing breaks, we have seen rewards given to the best-performing teams, which reinforces the value of the work. This gamification approach can support business goals through optimal resolution of the key issues that matter to the business, rather than simply trying to fix everything that is wrong all at once.

Data Quality for Explainability & Bias

We hear a lot about the deployment of ML models and the societal issues around the bias and fairness of those models. Applications of models can have a direct, potentially negative impact on people, and it stands to reason that everyone involved in the creation, development, deployment and evaluation of these models should take an active role in preventing such negative impacts from arising.

Having diverse representative teams building these systems is important. For example, a diverse team could have ensured that Google’s speech recognition software was trained on a diverse section of voices. In 2016, Rachael Tatman, a research fellow in linguistics at the University of Washington, found that Google’s speech-recognition software was 70% more likely to accurately recognise male speech.

Focusing on the data quality of the data that feeds our models can help identify areas of potential bias and unfairness. Interestingly, bias isn’t necessarily a bad thing. Models need bias in the data in order to discriminate between outcomes, e.g. having a history of a disease results in a higher risk of having that disease again.

The bias we want to be able to detect is unintended bias and, accordingly, unintended outcomes (and of course, intentional bias created by bad actors). For example, we can use techniques to identify potential proxy features, such as post or ZIP code, even when overtly discriminatory variables such as race have been removed. IBM's AI Fairness 360 suggests metrics to run against datasets to highlight potential bias, e.g. using class labels such as race or gender and running metrics against the decisions made by the classifier. From this identification, different approaches can be taken to address the issues, from balancing a dataset, to penalising bias within an algorithm, through to post-processing that favours a particular outcome.
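For a flavour of the kind of metrics involved, the sketch below computes a statistical parity difference and a disparate impact ratio directly in pandas on an invented set of classifier decisions. It is a generic illustration rather than the AI Fairness 360 API mentioned above.

```python
import pandas as pd

# Hypothetical classifier decisions with a protected attribute attached.
df = pd.DataFrame({
    "group": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "approved": [1, 1, 1, 0, 1, 0, 0, 0],
})

rates = df.groupby("group")["approved"].mean()

# Statistical parity difference: gap in approval rates between groups.
parity_difference = rates["B"] - rates["A"]
# Disparate impact: ratio of approval rates (values well below 1 warrant a closer look).
disparate_impact = rates["B"] / rates["A"]

print(f"approval rates:\n{rates}")
print(f"statistical parity difference: {parity_difference:.2f}")
print(f"disparate impact ratio: {disparate_impact:.2f}")
```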

Explainable AI (XAI)’s Role In Detecting Bias

XAI is a nascent field where ML is used to explain the predictions made by a classifier. For instance, LIME (Local Interpretable Model-agnostic Explanations) provides a measure of 'feature importance'. So if we find that postcode, which can correlate with race, is a key driver in a prediction, this could highlight discriminatory behaviour within the model.

These approaches explain the local behaviour of a model by fitting an interpretable model, such as a tree or a linear regression. Again, the type of explanation will differ depending on the audience: different processes may be needed to provide an explanation at an internal or data scientist level compared with an external client or customer level. Explanations could be extended by providing reason and action codes as to why, for example, credit was refused.
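As a minimal sketch of local explanation with LIME, assuming the open-source lime package and a fitted scikit-learn classifier; the 'credit' feature names and synthetic data are invented for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

# Synthetic 'credit' data; the feature names are illustrative assumptions.
feature_names = ["income", "age", "postcode_area", "existing_debt"]
X, y = make_classification(n_samples=500, n_features=4, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X, y)

explainer = LimeTabularExplainer(
    X, feature_names=feature_names, class_names=["refused", "approved"], mode="classification"
)

# Explain one individual decision: which features drove this prediction?
explanation = explainer.explain_instance(X[0], model.predict_proba, num_features=4)
print(explanation.as_list())
```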

Transparency can also be provided through model cards, a structured framework for reporting on an ML model's provenance, usage, and ethics-informed evaluation, giving a detailed overview of a model's suggested uses and limitations. This can be extended to the data side, containing metadata such as data provenance, consent sought, and so on.
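A minimal, illustrative model card represented as structured metadata might look like the following; the fields are a subset and every value is a placeholder rather than a real model's documentation.

```python
# A minimal, illustrative model card as structured metadata; fields and values
# are placeholders, not a formal standard or a real Datactics model.
model_card = {
    "model": {
        "name": "entity-match-classifier",
        "version": "1.2.0",
        "intended_use": "Decision aid for low-confidence entity matches",
        "out_of_scope_uses": ["Fully automated sanction decisions"],
        "limitations": ["Trained on English-language company names only"],
    },
    "data": {
        "provenance": "Internal reviewed match decisions, 2021-2023",
        "consent": "Collected under client data-processing agreement",
        "known_biases": ["Under-representation of non-Latin scripts"],
    },
    "evaluation": {
        "metrics": {"precision": 0.94, "recall": 0.91},  # placeholder figures
        "fairness_checks": ["disparate impact by jurisdiction"],
    },
}
```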

That being said, there is no single ‘silver-bullet’ approach to address these issues. Instead we need to use a combination of approaches and to test often.  

Where to next – Machine Learning Ops (MLOps)

These days, the ‘-ops’ suffix is often appended to business practices right across the enterprise, from DevOps to PeopleOps, reflecting a systematic approach to how a function behaves and is designed to perform.

In Machine Learning, that same systematic approach, providing transparency and auditability, helps to move the business from brittle data pipelines to a proactive data approach that embeds human expertise.

Such an approach would identify issues within a process rather than relying on an engineer spotting an issue by chance or through individual expertise, which of course does not scale and is not robust. This system-wide approach embeds governance, security, risk and ownership at all levels. It does require the integration of expertise – for example, model developers gain an understanding of what risk is from knowledge transferred by risk officers and subject matter experts.

We need a maturing of the MLOps approach to support these processes. This is essential for high-quality and consistent flow of data throughout all stages of a project and to ensure that the process is repeatable and systematic.

It also necessitates monitoring the performance of the model once it is in production, to take into account potential data drift or concept drift, and to address this as and when it is identified. It should be said that testing for bias, robustness and adversarial attacks is still at a nascent stage, but this only highlights the importance of adopting an MLOps approach now rather than waiting until these capabilities are fully developed.
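As one simple sketch of data drift monitoring, a two-sample Kolmogorov-Smirnov test can compare a feature's training-time distribution with what is seen in production; the data and alerting threshold below are illustrative assumptions.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Feature distribution seen at training time vs. the same feature in production.
training_amounts = rng.normal(loc=100, scale=15, size=5000)
production_amounts = rng.normal(loc=120, scale=15, size=5000)   # drifted mean

# Two-sample Kolmogorov-Smirnov test: has the distribution shifted?
statistic, p_value = ks_2samp(training_amounts, production_amounts)

if p_value < 0.01:   # illustrative threshold; the alerting policy is a design choice
    print(f"Possible data drift detected (KS statistic={statistic:.3f}, p={p_value:.1e})")
```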

In practical terms, groups such as the Bank of England’s AI Public-Private Forum have significant potential to help the public and private sectors better understand the key issues, clarify the priorities and determine what actions are needed to support the safe adoption of AI in financial services.

Dataset Labelling For Entity Resolution & Beyond with Dr Fiona Browne
https://www.datactics.com/blog/ai-ml/blog-ai-dataset-labeling/
5 June 2020

In late 2019 our Head of AI, Dr Fiona Browne, delivered a series of talks to the Enterprise Data Management Council on AI-Enabled Data Quality in the context of AML operations, specifically for resolving differences in dataset labelling for legal entity data.

In this blog post, Fiona goes under the hood to explain some of the techniques that underpin Datactics’ extensible AI Framework.

Across the financial sector, Artificial Intelligence (AI) and Machine Learning (ML) have been applied to a number of areas, from the profiling of behaviour for fraud detection and Anti-Money Laundering (AML), through to the use of natural language processing to enrich data in Know Your Customer (KYC) processes.

An important part of the KYC/AML process is entity resolution: the process of identifying and resolving entities from multiple data sources. This is traditionally the space in which high-performance matching engines have been deployed, with associated fuzzy-match capabilities used to account for trivial or significant differences (indeed, this is part of Datactics' existing self-service platform).

In this arena, Machine Learning (ML) techniques have been applied to address the task of entity resolution, using approaches ranging from graph and network analysis to probabilistic matching.

Although ML is a sophisticated approach for democratising entity resolution, a limitation is the requirement for large volumes of labelled data for the model to learn from when supervised ML is used.

What is Supervised ML? 

For supervised ML, a classifier is trained using a labelled dataset: a dataset that contains example inputs paired with their correct output labels. In the case of entity resolution, this means examples of input matches and non-matches which are correctly labelled. The Machine Learning algorithm learns from these examples and identifies patterns that link to specific outcomes. The trained classifier then uses this learning to make predictions on new, unseen cases based on their input values.
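A minimal sketch of what this looks like for entity resolution, assuming each candidate record pair has already been turned into similarity features; the feature values, labels and choice of logistic regression are invented for illustration rather than a description of any production system.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row is a candidate record pair described by similarity features,
# e.g. [name_similarity, address_similarity]; the values are invented.
X_train = np.array([
    [0.95, 0.90],   # near-identical records
    [0.90, 0.70],
    [0.20, 0.10],   # clearly different records
    [0.30, 0.40],
    [0.85, 0.60],
    [0.15, 0.30],
])
# Labels: 1 = match, 0 = non-match, as decided by a reviewer.
y_train = np.array([1, 1, 0, 0, 1, 0])

classifier = LogisticRegression().fit(X_train, y_train)

# Predict on a new, unseen candidate pair.
new_pair = np.array([[0.88, 0.75]])
print("match probability:", classifier.predict_proba(new_pair)[0, 1])
```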

Dataset Labelling

As we see from the above, for supervised ML we need high-quality labelled examples for the classifier to learn from; unlabelled or poorly labelled data only makes it harder for data labelling tools and models to work well. The process of labelling raw data from scratch can be time-consuming and labour-intensive, especially if experts are required to provide labels for, in this example, entity resolution outputs. The data labelling process is repetitive in nature, and consistency is needed to ensure high-quality, correct labels are applied. It is also costly in monetary terms, as those involved in processing the entity data require a high level of understanding of the nature of entities and ultimate beneficial owners, and failure can result in regulatory sanctions and fines.

Approaches for Dataset Labelling

As AI/ML progresses across all sectors, we have seen the rise of industrial-scale dataset labelling, where companies and individuals can outsource their labelling tasks to annotation tools and labelling services, for example the Amazon Mechanical Turk service, which enables the crowdsourcing of data labelling. This can reduce data labelling work from months to hours. Machine Learning models can also be harnessed for data annotation tasks using approaches such as weak and semi-supervised learning, along with Human-In-The-Loop learning (HITL). HITL enables the improvement of ML models through the incorporation of human feedback at stages such as training, testing and evaluation.

ML approaches for Budgeted Learning

We can think of budgeted learning as a balancing act between the expense (in terms of cost, effort and time) of acquiring training data and the predictive performance of the model that you are building. For example, can we label a few hundred examples instead of hundreds of thousands? There are a number of ML approaches that can help with this question and reduce the burden of manually labelling large volumes of training data. These include transfer learning, where you reuse previously gained knowledge – for instance, leveraging existing labelled data from a related sector or similar task. The recent open-source system Snorkel uses a form of weak supervision to label datasets via programmable labelling functions.
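In the spirit of Snorkel-style labelling functions, but without depending on the Snorkel API itself, the sketch below combines a few hand-written labelling functions with a simple majority vote; the rules and the example record pair are invented for illustration (Snorkel instead learns a probabilistic label model over the votes).

```python
# Programmatic labelling functions in the spirit of weak supervision:
# each function votes MATCH (1), NON_MATCH (0) or ABSTAIN (-1) on a record pair.
MATCH, NON_MATCH, ABSTAIN = 1, 0, -1

def lf_same_registration_number(pair):
    if pair.get("reg_no_left") and pair.get("reg_no_left") == pair.get("reg_no_right"):
        return MATCH
    return ABSTAIN

def lf_different_country(pair):
    return NON_MATCH if pair["country_left"] != pair["country_right"] else ABSTAIN

def lf_name_overlap(pair):
    left = set(pair["name_left"].lower().split())
    right = set(pair["name_right"].lower().split())
    return MATCH if len(left & right) >= 2 else ABSTAIN

LABELLING_FUNCTIONS = [lf_same_registration_number, lf_different_country, lf_name_overlap]

def weak_label(pair):
    """Combine labelling-function votes with a simple majority vote."""
    votes = [vote for vote in (lf(pair) for lf in LABELLING_FUNCTIONS) if vote != ABSTAIN]
    if not votes:
        return ABSTAIN
    return MATCH if votes.count(MATCH) >= votes.count(NON_MATCH) else NON_MATCH

pair = {
    "name_left": "ACME Trading Ltd", "name_right": "ACME Trading Limited",
    "country_left": "GB", "country_right": "GB",
    "reg_no_left": None, "reg_no_right": "NI012345",
}
print("weak label:", weak_label(pair))
```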

Active learning is a semi-supervised ML approach which can be used to reduce the burden of manually labelling datasets. The 'active learner' proactively selects the training data it needs to learn from, based on the idea that an ML model can achieve good predictive performance with fewer training instances by prioritising the examples it learns from. During the training process, the active learner poses queries, which can be a selection of unlabelled instances from a dataset. These ML-selected instances are then presented to an expert to label manually.

As seen above, there are wide and varied approaches to tackling the task of dataset labelling. Which approach to select depends on a number of factors, from the prediction task through to expense and budgeted learning. The connecting tenet is ensuring high-quality labelled datasets for classifiers to learn from.

Click here for more from Datactics, or find us on LinkedIn, Twitter or Facebook for the latest news.
