Fiona Browne | Datactics | 22 January 2024

The benefits of an Augmented Data Quality Solution

In the digital era, data is essential for every organisation: good data management empowers businesses to make well-informed decisions and operate efficiently. However, the data management landscape can be challenging, encompassing catalogs, lineage, observability, master data management, and data quality. 

We’re at a point now where institutions’ data estates are rapidly expanding. Stretching from legacy systems to cloud migrations and data warehouses, and spanning relational databases to unstructured documents, the importance of data quality has never been greater. This, coupled with the decentralisation of organisational data, has made it difficult for organisations to maintain good data quality. 

 

From traditional to transformative Data Quality Solutions 

Addressing data quality issues within a business has typically involved labour-heavy, manual processes. The modern data landscape, with its complex and ever-growing data sets, demands more transformative solutions. Consequently, data quality systems must now automate processes like data profiling, rule suggestion, and time-series analysis of data issues. This is where the concept of 'augmented data quality' comes into play. 

 

Augmented Data Quality: What is it? 

In short, augmented data quality is an approach that uses machine learning (ML) and artificial intelligence (AI) to automate and enhance data quality management. The aim is to automatically improve data quality by analysing data, identifying and fixing issues, and providing clear, transparent metrics on data quality and improvement actions across your entire data estate. As a result, our users have found that an augmented data quality approach makes their data assets more valuable, allowing them to maximise the value of their data at a low cost with minimal manual effort. 

Augmented data quality promotes self-service data quality management, making it easier for business users to carry out tasks without the need for deep technical expertise and knowledge of data science techniques. Moreover, it offers many benefits, from improved data accuracy to increased efficiency and reduced costs. Rather than requiring many specific manual tasks to assess the quality of a data set, augmented data quality automates the process, making it a valuable resource for enterprises dealing with big data. 

Whilst AI and machine learning models can speed up routine DQ tasks, they cannot fully automate the whole process. In other words, augmented data quality does not eliminate the need for human oversight, decision-making, and intervention; it complements them. Using a human-in-the-loop approach, advanced algorithms perform large volumes of checks and fixes, while human expertise is reserved for reviewing only the most difficult issues, ensuring the highest levels of accuracy. 
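The division of labour described here can be sketched as a simple confidence-threshold router. This is an illustrative sketch only: the thresholds, the record format, and the function names are assumptions, not the Datactics implementation.

```python
# Illustrative human-in-the-loop routing: automation handles the
# high-confidence bulk, humans review only the ambiguous remainder.
# Thresholds and the (record, confidence, is_valid) format are
# assumptions for this sketch, not a real product API.

AUTO_FIX_THRESHOLD = 0.95   # confident enough to apply a fix automatically
AUTO_PASS_THRESHOLD = 0.90  # confident enough to accept as clean

def route_records(scored_records):
    """Split (record, confidence, is_valid) tuples into three queues."""
    auto_fixed, accepted, human_review = [], [], []
    for record, confidence, is_valid in scored_records:
        if is_valid and confidence >= AUTO_PASS_THRESHOLD:
            accepted.append(record)
        elif not is_valid and confidence >= AUTO_FIX_THRESHOLD:
            auto_fixed.append(record)    # algorithm applies the fix
        else:
            human_review.append(record)  # data steward handles the hard case
    return auto_fixed, accepted, human_review

records = [
    ("row-1", 0.99, True),   # clearly valid
    ("row-2", 0.97, False),  # clearly broken: fix automatically
    ("row-3", 0.55, False),  # ambiguous: route to a data steward
]
fixed, clean, review = route_records(records)
```

In practice the thresholds would be tuned so that the human-review queue stays small enough for stewards to clear.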

 

Datactics Augmented Data Quality Platform 

 

Datactics Augmented Data Quality Solution

 

Responding to these challenges, Datactics has developed the Augmented Data Quality platform (ADQ), which streamlines the data quality journey through a user-friendly interface. Our technology team has pioneered the use of AI/ML capabilities to make it easier for businesses to improve data quality. This includes: 

  • Automated Data Profiling: Enabling you to efficiently onboard new sources of data or analyse existing ones, this feature allows the user to quickly understand their data, identify trends and outliers, and, when errors are found, automatically suggest and apply data quality rules. 
  • DQ Insights Hub: Making use of a wide range of our machine learning capabilities, this feature provides a summarised view of data quality across many sources, allowing you to create interactive and fully customisable dashboards. These dashboards highlight and track many DQ metrics, from the number of issues found with each data element to the average time taken for issues to be remediated, and how often they recur.
  • Predictive Features: We've developed a bespoke machine learning algorithm that learns from your data quality issues, allowing you to gain a deeper understanding of the root causes of the problems and empowering you to take preventative measures to ensure they don't recur. By training this exclusively on your data, you get the most accurate predictions whilst also ensuring your data is fully secure. 
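To make the profiling and rule-suggestion idea above concrete, here is a minimal, hypothetical sketch: profile a column, then propose a format rule from the dominant value 'shape'. The shape notation and function names are invented for illustration and are not the ADQ implementation.

```python
import re

# Hypothetical sketch of automated profiling with rule suggestion:
# summarise a column, then propose a format rule from the dominant
# value 'shape' (digits -> 9, letters -> A). Names and notation are
# invented for illustration.

def profile_column(values):
    """Profile a column: null rate, distinct count, dominant shape."""
    non_null = [v for v in values if v not in (None, "")]
    shapes = {}
    for v in non_null:
        shape = re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", v))
        shapes[shape] = shapes.get(shape, 0) + 1
    return {
        "null_rate": 1 - len(non_null) / len(values),
        "distinct": len(set(non_null)),
        "dominant_shape": max(shapes, key=shapes.get) if shapes else None,
    }

def suggest_rule(profile):
    """Turn the dominant shape into a human-reviewable DQ rule."""
    if profile["dominant_shape"] is None:
        return None
    return f"value matches shape '{profile['dominant_shape']}'"

postcodes = ["BT1 2AB", "BT9 5XX", None, "BT2 8GH"]
prof = profile_column(postcodes)
rule = suggest_rule(prof)
```

In a real augmented-DQ flow the suggested rule would be surfaced to a user to accept, refine, or reject before it is applied.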
Benefits of the Datactics ADQ platform

These represent tangible benefits for our users. At the heart of ADQ's success is the new user layer that simplifies all the key components of a good data quality solution, such as connectivity, integrations, rule authoring, remediation, and insights. 

The Datactics platform is designed with all levels of users in mind. ADQ’s interface is intuitive and user-friendly, ensuring that users, regardless of their technical proficiency, can easily navigate and utilise the platform to its full potential. With support for a spectrum of different technologies, ADQ is the perfect platform for any user, from a non-technical business user to expert data scientists. This approach democratises data quality management, making it accessible and manageable for a wider range of professionals within an organisation. 

The practical benefits of ADQ are evident in our client testimonials, with users reporting significant reductions in cost and time associated with building data quality projects. Specifically, the rule suggestion feature has been a game-changer for many, identifying a substantial portion of business rules which results in considerable time savings. Essentially, it provides a pragmatic and practical real-world understanding of data quality. 

 

 

Datactics Augmented Data Quality

Empowering Organisations with Data 

In the future, we plan to enhance ADQ with more automated features, better insights, and additional integrations. New features coming this year include incorporating generative AI into the platform, allowing non-technical users to create data quality checks using natural language prompts. Suggestions for remediations, generated using historical fixes and our bespoke machine learning algorithm, will vastly boost the number of issues that can be automatically resolved, decreasing the likelihood of human error and leaving your data stewards free to tackle the most critical and problematic cases. Additionally, by enhancing our predictive capabilities, we will allow you to act pre-emptively before data quality issues occur, ensuring your organisation is always working with high-quality data. 

 The release of ADQ marks a significant milestone at Datactics, in terms of innovation and supporting our customers. It embodies our commitment to providing state-of-the-art data management solutions, enabling organisations to fully leverage their data assets. We are proud of our team’s vision and dedication to delivering a platform that not only addresses current data quality challenges but also paves the way for future innovations. 

For more information about the Datactics ADQ solution, take a look at this piece by A-Team Insight or reach out to us at www.datactics.com. 

 

 

The Importance of Data Quality in Machine Learning | 18 December 2023

The Importance of Data Quality in Machine Learning

We are currently in an exciting era, where Machine Learning (ML) is applied across sectors from self-driving cars to personalised medicine. Although ML models have been around for a while – algorithmic trading models since the 1980s, Bayesian methods since the 1700s – we are still in the nascent stages of productionising ML.

From a technical viewpoint, this is 'Machine Learning Ops', or MLOps. MLOps involves working out how to build and deploy models via continuous integration and deployment, and how to track and monitor models and data in production. 

From a human, risk, and regulatory viewpoint, we are grappling with big questions about ethical AI (Artificial Intelligence) systems and where and how they should be used. Risk, privacy and security of data, accountability, fairness, and adversarial AI all come into play here. Additionally, the choice between supervised, semi-supervised, and unsupervised machine learning brings further complexity to the mix.

Much of the focus is on the models themselves, such as OpenAI's GPT-4. Everyone can get their hands on pre-trained models or licensed APIs; what differentiates a good deployment is the data quality.

However, the one common theme that underpins all this work is the rigour required in developing production-level systems, and especially in the data necessary to ensure they are reliable, accurate, and trustworthy. This is particularly true for ML systems: the role that data and processes play, and the impact of poor-quality data on ML algorithms and learning models, matter in the real world.

Data as a common theme 

If we shift our gaze from the model side to the data side, we should consider questions such as:

  • Data management – what processes do I have to manage data end to end, especially generating accurate training data?
  • Data integrity – how am I ensuring I have high-quality data throughout?
  • Data cleansing and improvement – what am I doing to prevent bad data from reaching data scientists?
  • Dataset labeling – how am I avoiding the risk of unlabeled data?
  • Data preparation – what steps am I taking to ensure my data is data science-ready?

Answering these questions would give a far greater understanding of performance and model impact (consequences). However, this work is often viewed as less glamorous or exciting and, as such, is often undervalued. For example, what is the impetus for companies or individuals to invest at this level – regulatory (e.g. BCBS), financial, reputational, or legal?

Yet, as research by Google puts it,

“Data largely determines performance, fairness, robustness, safety, and scalability of AI systems…[yet] In practice, most organizations fail to create or meet any data quality standards, from under-valuing data work vis-a-vis model development.” 

This has a direct impact on people’s lives and society, where “…data quality carries an elevated significance in high-stakes AI due to its heightened downstream impact, impacting predictions like cancer detection, wildlife poaching, and loan allocations”.

What this looks like in practice

We have seen this in the past, with exam grading in the UK during Covid. Teachers predicted the grades of their students, and the Office of Qualifications and Examinations Regulation then applied an algorithm to these predictions to counteract potential grade inflation. The algorithm was quite complex and, at first, non-transparent. When the results were released, 39% of grades had been downgraded. The algorithm took into account the distribution of grades from previous years, the predicted distribution of grades for past students, and the predictions for the current year.

In practice, this meant that if you were a candidate who had performed well at GCSE but attended a historically poor-performing school, it was challenging to achieve a top grade. Teachers had to rank the students in each class, resulting in a relative ranking system that could not capture absolute performance. Even if you were predicted a B, if you were ranked fifteenth out of 30 in your class and the pupil ranked fifteenth in each of the last three years received a C, you would likely get a C.
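The ranking mechanism just described can be illustrated with a toy sketch. All of the data, the school, and the interpolation scheme below are invented for illustration; the real algorithm was considerably more complex.

```python
# Toy illustration of relative ranking: the awarded grade comes from
# the school's historical distribution at the pupil's rank, not from
# the teacher's prediction. All data here is invented.

def grade_at_rank(historical_grades_best_to_worst, rank, class_size):
    """Project a pupil's class rank onto the historical distribution."""
    span = len(historical_grades_best_to_worst) - 1
    index = round((rank - 1) / (class_size - 1) * span)
    return historical_grades_best_to_worst[index]

# A hypothetical school's grades from a previous year, best to worst.
past_grades = ["A", "A", "B", "B", "C", "C", "C", "D", "D", "E"]

teacher_prediction = "B"
awarded = grade_at_rank(past_grades, rank=15, class_size=30)
# Ranked 15th of 30, the pupil lands mid-distribution and is awarded
# a C, regardless of the predicted B.
```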

The application of this algorithm caused an uproar, not least because schools with small class sizes – usually private, fee-paying schools – were exempt from the algorithm and used the teacher-predicted grades instead. Additionally, it baked in past socioeconomic biases, benefitting underperforming students in affluent (and previously high-scoring) areas while suppressing high-performing students in lower-income regions.

A major lesson to learn from this, therefore, was the need for transparency in both the process and the data used.

An example from healthcare

Within the world of healthcare, poor training data affected ML cancer prediction in IBM's 'Watson for Oncology', which partnered with The University of Texas MD Anderson Cancer Center in 2013 to "uncover valuable insights from the cancer center's rich patient and research databases". The system was trained on a small number of hypothetical cancer patients rather than real patient data, which resulted in erroneous and dangerous cancer treatment advice.

Significant questions that must be asked include:

  • Where did it go wrong here – certainly in the data, but where else in the wider AI system?
  • Where was the risk assessment?
  • What testing was performed?
  • Where did responsibility and accountability reside?

Machine Learning practitioners know well the statistic that 80% of ML work is data preparation. Why then don’t we focus on this 80% effort and deploy a more systematic approach to ensure data quality is embedded in our systems, and considered important work to be performed by an ML team?

This is a view recently articulated by Andrew Ng, who urges the ML community to be more data-centric and less model-centric. In fact, Andrew was able to demonstrate this using a steel-sheet defect detection use case, in which a deep learning computer vision model achieved a baseline performance of 76.2% accuracy. By addressing inconsistencies in the training dataset and correcting noisy or conflicting dataset labels, the classification performance reached 93.1%. Interestingly and compellingly from the perspective of this blog post, minimal performance gains were achieved by addressing the model side alone.

Our view is that if data quality is a key limiting factor in ML performance, then we should focus our efforts on improving data quality – and ask whether ML itself can be deployed to address it. This is the central theme of the work the ML team at Datactics undertakes. Our focus is automating the manual, repetitive (often referred to as boring!) business processes of DQ and matching tasks, while embedding subject matter expertise into the process. To do this, most of our solutions employ a human-in-the-loop approach: we capture human decisions and expertise and use them to inform and re-train our models. This human expertise is essential in guiding the process and providing context, improving both the data and the data quality process. We are keen to free clients from mundane manual tasks and instead use their expertise on tricky cases with simple agree/disagree options.

To learn more about an AI-driven approach to Data Quality, read our press release about our Augmented Data Quality platform here. 

How Data Quality Tools Deliver Clean Data for AI and ML | 21 February 2022

In her previous blog, Dr Fiona Browne, Head of AI and Software Development, assessed the need for the AI and Machine Learning world to prioritise the data being fed into models and algorithms (you can read it here). This blog goes into some of the critical capabilities for data quality tools to support specific AI and ML use cases with clean data.

How Data Quality Tools Deliver Clean Data for AI and Machine Learning

A Broad Range of Data Quality Tool Features On Offer

The data quality tools market is full of vendors with a wide range of capabilities, as referenced in the recent Gartner Magic Quadrant. Regardless of a firm's data volumes, or whether it is a small, midsize, or large enterprise, it will rely on high-quality data for every conceivable business use case, from the smallest product data problem to enterprise master data management. Consequently, data leaders should explore the competitive landscape fully to find the best fit for their data governance culture and the growth opportunities that the right vendor-client fit can offer.

Labelling Datasets

A supervised Machine Learning (ML) model learns from a training dataset consisting of features and labels.

We do not often hear about the effort required to produce a consistent, well-labelled dataset, yet it has a direct impact on the quality of a model and its predictive performance, regardless of organisation size. A recent Google research report estimates that within an ML project, data labelling can cost between 25% and 60% of the total budget.

Labelling is often a manual process requiring a reviewer to assign a tag to a piece of data e.g. to identify a car in an image, state if a case is fraudulent, or assign sentiment to a piece of text.

Succinct, well-defined labelling instructions should be provided to reduce labelling inconsistencies. In this context, data quality solutions can apply metrics to measure label consistency within a dataset and, based on these, review and improve consistency scores.

As labelling is a laborious process, and access to resources to provide the labels can be limited, we reduce the volume of manual labelling using an active learning approach.

Here, ML is used to identify the trickiest edge cases within a data set to label. These prioritised cases are passed to a reviewer to manually annotate, without the need to label a complete data set. This approach also captures the rationale from a human expert as to why a label was provided, which provides transparency in predictions further downstream.  
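The simplest form of this prioritisation is uncertainty sampling: a minimal sketch follows, with item names and model scores invented for illustration.

```python
# Minimal sketch of uncertainty sampling: send for human labelling
# only the items whose model score is closest to the 0.5 decision
# boundary. Item names and scores are invented for illustration.

def select_for_labelling(scored_items, budget):
    """Pick the `budget` items the model is least certain about."""
    by_uncertainty = sorted(scored_items, key=lambda kv: abs(kv[1] - 0.5))
    return [item for item, _ in by_uncertainty[:budget]]

# (item_id, model probability of the positive class)
pool = [
    ("doc-1", 0.98),  # confident positive: no human needed
    ("doc-2", 0.51),  # right on the boundary: worth a human look
    ("doc-3", 0.03),  # confident negative
    ("doc-4", 0.45),  # ambiguous
]
to_label = select_for_labelling(pool, budget=2)
```

After each batch of human labels, the model is retrained and the pool is re-scored, so the budget is always spent on the current hardest cases.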

Entity resolution

For data matching and entity resolution, Datactics has used ML as a 'decision aid' for low-confidence matches, to again reduce the burden of manual review. The approach implemented by Datactics provides information on the confidence of the predictions, through to the rationale as to why a prediction was made. Additionally, the solution has built-in capability to accept or reject the predictions, so the client can continually update and improve the predictions using that fully explainable, human-in-the-loop approach. You can see more information on this in our White Paper here.
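As an illustration of the decision-aid pattern (not the Datactics model itself), a string-similarity score can be banded into auto-match, auto-non-match, and human-review outcomes. The thresholds here are assumptions for the sketch.

```python
from difflib import SequenceMatcher

# Illustration of the decision-aid pattern: score a candidate pair,
# auto-accept clear matches, auto-reject clear non-matches, and route
# the uncertain middle band to a reviewer. Thresholds are assumptions.

def match_decision(name_a, name_b, accept=0.90, reject=0.60):
    score = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    if score >= accept:
        return "match", score
    if score <= reject:
        return "non-match", score
    return "human-review", score

easy, easy_score = match_decision("Datactics Ltd", "Datactics Ltd.")
hard, hard_score = match_decision("Datactics Ltd", "Datactics Limited")
# The abbreviation variant scores just below the accept threshold, so
# it is exactly the kind of pair a human reviewer would confirm.
```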

Detecting outliers and predicting rules

This is a critical step in a fully AI-augmented data quality journey, occurring in the key data profiling stage, before data cleansing. It empowers business users, who are perhaps not familiar with big data techniques, coding or programming, to rapidly get to grips with the data they are exploring. Using ML in this way helps them to uncover relationships, dependencies and patterns which can influence which data quality rules they wish to use to improve data quality or deliver better business outcomes, for example regulatory reporting or digital transformation.

This automated approach to identifying potentially erroneous data within your dataset and highlighting these within the context of data profiling reduces manual effort spent in trying to find these connections across different data sources or within an individual data set. It can remove a lot of the heavy lifting associated with data profiling especially when complex data integration or connectivity to data lakes or data stores is required.

The rule prediction element complements the outlier detection. It involves reviewing a data set and suggesting data quality rules that can be run against it to ensure compliance with regulations, with standard dimensions of data quality (e.g. consistency, accuracy, timeliness), and with business dimensions or policies such as credit ratings or risk appetite.
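One simple, classical way to flag outliers during profiling is the interquartile-range rule. Real augmented-DQ tooling uses richer ML models, but this conveys the idea; the data is invented for illustration.

```python
from statistics import quantiles

# Classical interquartile-range outlier flagging, as a stand-in for
# richer ML-driven outlier detection: values beyond 1.5 * IQR of the
# quartiles are flagged for review. Data invented for illustration.

def iqr_outliers(values, k=1.5):
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

trade_amounts = [100, 102, 98, 101, 97, 103, 99, 5000]
suspects = iqr_outliers(trade_amounts)  # the 5000 stands out
```

Flagged values would be presented to the user in the profiling view, alongside the rule that caught them.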

Fixing data quality breaks

Again, ML helps in this area, where the focus is placed on manual tasks for remediating erroneous or broken data. Can we detect trends in this data – for example, does the finance dataset we ingest on the first day of each month cause a spike in data quality issues? Is there an optimal path to remediation that we can predict, or are there remediation values that we can suggest?
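The first-of-the-month spike question can be answered with even a simple statistical check on daily issue counts. The counts and threshold below are invented for illustration.

```python
from statistics import mean, stdev

# Simple trend check on daily data-quality issue counts: flag a day
# whose count sits several standard deviations above the trailing
# window. Counts and threshold are invented for illustration.

def is_spike(history, today, z_threshold=3.0):
    """True when today's count is a statistical spike versus history."""
    mu, sigma = mean(history), stdev(history)
    return sigma > 0 and (today - mu) / sigma > z_threshold

daily_issue_counts = [12, 9, 11, 10, 13, 11, 12]  # an ordinary week
spike_day = is_spike(daily_issue_counts, 85)       # ingest-day surge
quiet_day = is_spike(daily_issue_counts, 13)
```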

For fixing breaks, we have seen rewards given to the best-performing teams, which reinforces the value of the work. This gamification approach can support business goals through optimal resolution of the key issues that matter to the business, rather than simply trying to fix everything that is wrong all at once.

Data Quality for Explainability & Bias

We hear a lot about the deployment of ML models and the societal issues in terms of bias and fairness of a model. Applications of models can have a direct, potentially negative impact on people, and it stands to reason that everyone involved in the creation, development, deployment and evaluation of these models should take an active role in preventing such negative impacts from arising.

Having diverse representative teams building these systems is important. For example, a diverse team could have ensured that Google’s speech recognition software was trained on a diverse section of voices. In 2016, Rachael Tatman, a research fellow in linguistics at the University of Washington, found that Google’s speech-recognition software was 70% more likely to accurately recognise male speech.

Focusing on the data quality of the data that feeds our models can help identify areas of potential bias and unfairness. Interestingly, bias isn’t necessarily a bad thing. Models need bias in the data in order to discriminate between outcomes, e.g. having a history of a disease results in a higher risk of having that disease again.

The bias we want to detect is unintended bias and, accordingly, unintended outcomes (and, of course, intentional bias created by bad actors) – for example, using techniques to identify potential proxy features, such as post or ZIP code, even when discriminatory variables such as race are removed. IBM's AI Fairness 360 toolkit suggests metrics to run against datasets to highlight potential bias, e.g. using class labels such as race or gender and running metrics against the decisions made by the classifier. Once bias is identified, different approaches can address it: rebalancing a dataset, penalising bias within an algorithm, or post-processing to favour a particular outcome.
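One such dataset metric, in the spirit of AI Fairness 360, is the disparate impact ratio between groups. The loan decisions below are invented for illustration; AI Fairness 360 provides this and many richer metrics.

```python
# Sketch of a simple group-fairness dataset metric: the 'disparate
# impact' ratio of positive-outcome rates between two groups.
# The decision data is invented for illustration.

def positive_rate(decisions):
    return sum(decisions) / len(decisions)

def disparate_impact(unprivileged, privileged):
    """Ratio below ~0.8 often triggers investigation (four-fifths rule)."""
    return positive_rate(unprivileged) / positive_rate(privileged)

# 1 = loan approved, 0 = refused, split by a protected attribute
group_a = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]  # 20% approved
group_b = [1, 1, 0, 1, 1, 0, 1, 0, 1, 1]  # 70% approved
ratio = disparate_impact(group_a, group_b)
```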

Explainable AI (XAI)’s Role In Detecting Bias

XAI is a nascent field where ML is used to explain the predictions made by a classifier. For instance, LIME (Local Interpretable Model-agnostic Explanations) provides a measure of 'feature importance'. So if we find that postcode – which correlates with race – is a key driver in a prediction, this could highlight discriminatory behaviour within the model.

These approaches explain the local behaviour of a model by fitting an interpretable model, such as a tree or linear regression. The type of explanation will also differ depending on the audience; for example, different processes may be needed to provide an explanation at an internal or data-scientist level compared to an external client or customer level. Explanations could be extended by providing reason and action codes as to why credit was refused.
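As a drastically simplified stand-in for what LIME does, one can perturb features around a single instance and watch the model's score move. The toy model, its weights, and the feature names below are all invented; real LIME fits a local interpretable surrogate model instead.

```python
# Drastically simplified local sensitivity check: perturb one feature
# at a time around a single applicant and measure how far the model's
# score moves. The 'model' and feature names are invented; real LIME
# fits a local interpretable surrogate model instead.

def toy_credit_model(f):
    return 0.2 * f["income"] - 0.7 * f["postcode_risk"] + 0.1 * f["age"]

def local_importance(model, instance, delta=1.0):
    base = model(instance)
    return {
        name: abs(model({**instance, name: value + delta}) - base)
        for name, value in instance.items()
    }

applicant = {"income": 3.0, "postcode_risk": 4.0, "age": 2.0}
scores = local_importance(toy_credit_model, applicant)
top_driver = max(scores, key=scores.get)  # here: "postcode_risk"
```

If postcode_risk proxies for a protected attribute such as race, this is exactly the kind of signal a bias review should catch.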

Transparency can also be provided through model cards: a structured framework for reporting on ML model provenance, usage, and ethics-informed evaluation, giving a detailed overview of a model's suggested uses and limitations. This can be extended to the data side, containing metadata such as data provenance, consent sought, and so on.

That being said, there is no single ‘silver-bullet’ approach to address these issues. Instead we need to use a combination of approaches and to test often.  

Where to next – Machine Learning Ops (MLOps)

These days, the ‘-ops’ suffix is often appended to business practices right across the enterprise, from DevOps to PeopleOps, reflecting a systematic approach to how a function behaves and is designed to perform.

In Machine Learning, that same systematic approach, providing transparency and auditability, helps to move the business from brittle data pipelines to a proactive data approach that embeds human expertise.

Such an approach would identify issues within a process rather than relying on an engineer spotting an issue by chance or through individual expertise, which does not scale and is not robust. This system-wide approach embeds governance, security, risk, and ownership at all levels. It does require integration of expertise; for example, model developers gain an understanding of risk from knowledge transferred by risk officers and subject matter experts.

We need a maturing of the MLOps approach to support these processes. This is essential for high-quality and consistent flow of data throughout all stages of a project and to ensure that the process is repeatable and systematic.

It also necessitates monitoring the performance of the model once in production, to take into account potential data drift or concept drift, and to address these as and when identified. Testing for bias, robustness, and adversarial attacks is still in its nascent stages, but that only highlights the importance of adopting an MLOps approach now, rather than waiting until these capabilities are fully developed.
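Drift monitoring can start with something as simple as a two-sample Kolmogorov-Smirnov statistic comparing a production feature sample against its training-time baseline. The data and alert threshold below are invented for illustration.

```python
from bisect import bisect_right

# Starter drift monitor: a two-sample Kolmogorov-Smirnov statistic
# (the maximum gap between empirical CDFs) comparing a production
# feature sample against its training-time baseline. The data and
# alert threshold are invented for illustration.

def ks_statistic(sample_a, sample_b):
    a, b = sorted(sample_a), sorted(sample_b)
    def ecdf(sorted_vals, x):
        return bisect_right(sorted_vals, x) / len(sorted_vals)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in sorted(set(a + b)))

baseline = [1, 2, 2, 3, 3, 3, 4, 4, 5]    # training distribution
production = [4, 5, 5, 6, 6, 7, 7, 8, 9]  # shifted upwards
drift_score = ks_statistic(baseline, production)
drift_alert = drift_score > 0.5           # illustrative threshold
```

In a production MLOps pipeline this check would run on a schedule, with alerts feeding the same remediation queues as other data quality issues.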

In practical terms, groups such as the Bank of England’s AI Public-Private Forum have significant potential to help the public and private sectors better understand the key issues, clarify the priorities and determine what actions are needed to support the safe adoption of AI in financial services.

Datactics demonstrates rapid matching capabilities on open datasets | 17 December 2021

Rapid Matching, Software platform, open datasets

This blog from Fiona Browne, Head of Software Development & AI at Datactics, covers the subject of matching data across open datasets, a project for which the firm secured Innovate UK funding.  

The Rapid Match project is a vehicle to address the complexity of integrating and matching data at scale, providing a platform for reproducible data pipelines for current and post-COVID analysis. 

The project provides a generalised framework for data quality, preparation, and matching which is easy to use and reproducible for the integration and merging of diverse datasets at scale. 

We highlighted this capability through a Use Case on the identification of financial risk across regions in the UK. Using the Datactics platform, data quality, preparation and matching tasks were undertaken to integrate diverse UK Office of National Statistics (ONS) and UK Companies House (CH) datasets to provide a view on regional funding and sectors and the impact of COVID.  


COVID-19-related datasets are being generated at speed and volume, ranging from governmental sources such as the ONS and local authorities, through open data, to third-party datasets. Value is obtained by integrating these data to provide a view on a particular problem area – for example, fraud detection. It is estimated that British banks have lent about £68 billion through a trio of loan programs, with repayments backstopped by the Government. Concerns have been raised about the risk of fraud, and one estimate found defaults and fraud in the Bounce Back program for small businesses could reach 80% in the worst case. 

Why?  

Institutions and governments need rapid access to high-quality data to inform decision-making processes. The data must be complete, accurate, up to date, and obtained in a timely fashion, with value achieved through integration. This is often a tricky and time-consuming process. Furthermore, the processes used to perform it are often fragmented, ad hoc, non-systematic, brittle, and difficult to reproduce and maintain.   

What?  

The Rapid Match project addressed the challenges around data quality and matching at scale through a systematic process which joins large amounts of messy, incomplete data in varying formats, from multiple sources. We provide a reliable ‘match engine’ allowing government and organisations to accurately and securely integrate diverse sources of data.   

A key outcome of the project has been the data quality work applied to the UK Companies House datasets. Companies House data supports a wide range of applications, from providing a register of incorporated UK companies through to the KYC onboarding and AML checks performed by institutions. It is estimated that "millions of professionals use Companies House data daily" – for example, in due diligence to verify ultimate beneficial ownership, or in matching against financial crime and terrorism lists.   
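Matching registers like Companies House against other sources typically begins with name standardisation, so that trivially different spellings compare equal. A minimal sketch follows; the suffix map is a tiny illustrative subset, not the project's actual rules.

```python
import re

# Sketch of name standardisation ahead of register matching:
# lowercase, strip punctuation, and canonicalise legal-suffix
# variants. The suffix map is a tiny illustrative subset.

SUFFIX_MAP = {"limited": "ltd", "incorporated": "inc", "corporation": "corp"}

def normalise_company(name):
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(SUFFIX_MAP.get(tok, tok) for tok in cleaned.split())

a = normalise_company("DATACTICS, Limited")
b = normalise_company("Datactics Ltd.")
# both normalise to "datactics ltd", so they now match exactly
```

Standardised names then feed the match engine, where fuzzier comparison handles the variants that normalisation alone cannot.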

What to do next 

If you are considering how to approach your data matching strategies and would like to view the work we carried out, please get in touch with Fiona Browne on LinkedIn.

And for more from Datactics, find us on LinkedIn, Twitter or Facebook.

The post Datactics demonstrates rapid matching capabilities on open datasets appeared first on Datactics.

]]>
Artificial Intelligence can help businesses thrive https://www.datactics.com/blog/ai-ml/artificial-intelligence-can-help-businesses-thrive/ Thu, 02 Dec 2021 17:10:52 +0000 https://www.datactics.com/?p=17252 The coronavirus pandemic produced challenges not one of us could have expected. While some sense of normality is returning, many businesses still face an uphill battle to recover. Artificial Intelligence Technology, however, presents a solution for firms hoping to thrive once again.  Artificial Intelligence (AI) is being used for predictive tasks from fraud detection through […]

The post Artificial Intelligence can help businesses thrive appeared first on Datactics.

]]>

The coronavirus pandemic produced challenges not one of us could have expected. While some sense of normality is returning, many businesses still face an uphill battle to recover. Artificial Intelligence Technology, however, presents a solution for firms hoping to thrive once again. 

Artificial Intelligence (AI) is being used for predictive tasks from fraud detection through to medical analytics. A key component of AI is the underlying data. Data impacts predictions, scalability and fairness of AI systems. As we move towards data-centric AI, having good quality, fair, representative, reliable and complete data will provide firms with a strong foundation to undertake tasks such as decision making and knowledge to strengthen their competitive position. In fact, AI solutions can be used to improve data quality when applied to tasks such as data labelling, accuracy, consistency, and completeness of data.

AI can help businesses not only improve and integrate data; it can also help their business grow through cost reduction and profit enhancement by reducing manual tasks. Gartner has predicted that the business value created by AI will reach $3.9 trillion in 2022.

Businesses thrive with AI. It can automate financial forecasting, giving firms greater visibility of their future finances and in turn empowering business owners to make better decisions and take action to achieve their goals.

A key challenge for organisations is understanding the business objectives of deploying AI solutions: moving away from using AI for technology's sake towards an awareness of what is feasible and how AI can be harnessed to address those objectives. Without this understanding, businesses struggle to see the benefits AI can bring to their organisation.

The perceived lack of access to the technology, and the perceived need for copious amounts of data to train machine learning models, are other stumbling blocks. We must bust the myth that AI is hard to access: open-source projects such as TensorFlow, and services from Microsoft Azure ML through to Amazon SageMaker, are simplifying the process of building, deploying and monitoring machine learning models in production. Many companies are unaware of this, or of how to take advantage of AI's cost-effective nature.

Even though accessing the technology is easy, using it is less so. Vendors are investing heavily in making the technology more accessible to non-expert users and have overall made great strides in making AI accessible.

That is why the upcoming AI Con Conference on 3 December at Titanic Belfast is so important. It gives us the perfect opportunity to discuss the benefits of AI for local firms.

Bringing together business leaders with world-leading technology professionals, AI Con will examine how artificial intelligence is changing our world and the opportunities and challenges it presents.

The themes for this year’s conference, which hosted 450 attendees in its first year and 800 in a virtual format last year, include Applied AI, AI Next and the Business of AI. These are designed for a general audience, tech audience and business audience respectively, and encompass everything from how AI can add value to organisations to what start-ups in the space should know.

The importance of AI cannot be disputed. AI Con will provide us with an opportunity to showcase the very best of AI. With Belfast now being a recognised tech hub, AI Con provides the perfect opportunity to foster debate and discussion around the benefits AI provides for business. Engagement with key business leaders and organisations is an essential part of that.

To find out more information about this year’s AI Con visit here.

And for more from Datactics, find us on LinkedIn, Twitter or Facebook.

The post Artificial Intelligence can help businesses thrive appeared first on Datactics.

]]>
Read how AI is transforming Data Quality in this exclusive white paper https://www.datactics.com/blog/ai-ml/ai-whitepaper-data-quality/ Wed, 10 Jun 2020 20:00:43 +0000 https://www.datactics.com/ai-enabled-dq/   In this AI whitepaper, authored by our Head of AI Fiona Browne, we provide an overview of Artificial Intelligence (AI) and Machine Learning (ML) and their application to Data Quality. We highlight how tools in the Datactics platform can be used for key data preparation tasks including cleansing, feature engineering and dataset labelling for […]

The post Read how AI is transforming Data Quality in this exclusive white paper appeared first on Datactics.

]]>

 

In this AI whitepaper, authored by our Head of AI Fiona Browne, we provide an overview of Artificial Intelligence (AI) and Machine Learning (ML) and their application to Data Quality.

We highlight how tools in the Datactics platform can be used for key data preparation tasks including cleansing, feature engineering and dataset labelling for input into ML models.

A real-world application of how ML can be used as an aid to improve consistency around manual processes is presented through an Entity Resolution Use Case.

In this case study we show how using ML reduced manual intervention tasks by 45% and improved data consistency within the process.

Having good quality, reliable and complete data provides businesses with a strong foundation to undertake tasks such as decision making and knowledge to strengthen their competitive position. It is estimated that poor data quality can cost an institution on average $15 million annually. 

As we continue to move into the era of real-time analytics and Artificial Intelligence (AI) and Machine Learning (ML) the role of quality data will continue to grow. For companies to remain competitive, they must have in place flexible data management practices underpinned by quality data.

AI/ML are being used for predictive tasks from fraud detection through to medical analytics. These techniques can also be used to improve data quality when applied to tasks such as data accuracy, consistency, and completeness of data along with the data management process itself.

In this whitepaper we provide an overview of the AI/ML process and how Datactics tools can be applied to cleansing, deduplication, feature engineering and dataset labelling for input into ML models. We highlight a practical application of ML through an Entity Resolution use case which addresses inconsistencies around manual tasks in this process.
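Two of those preparation steps, cleansing and deduplication, can be sketched in a few lines of Python. The cleansing rules and field names below are hypothetical, chosen only to illustrate the idea, and are not the Datactics pipeline.

```python
def cleanse(record):
    """Trim whitespace, collapse internal spaces and lower-case every value."""
    return {k: " ".join(str(v).split()).lower() for k, v in record.items()}

def deduplicate(records, key_fields):
    """Keep the first occurrence of each cleansed key combination."""
    seen, unique = set(), []
    for rec in (cleanse(r) for r in records):
        key = tuple(rec[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

raw = [
    {"name": "  ACME Ltd ", "city": "Belfast"},
    {"name": "acme ltd", "city": "BELFAST"},
    {"name": "Zenith plc", "city": "London"},
]
clean = deduplicate(raw, key_fields=("name", "city"))
```

Here the first two records differ only in case and whitespace, so after cleansing they collapse to a single entity; real-world deduplication adds fuzzy matching on top of this exact-key step.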

The post Read how AI is transforming Data Quality in this exclusive white paper appeared first on Datactics.

]]>
Dataset Labelling For Entity Resolution & Beyond with Dr Fiona Browne https://www.datactics.com/blog/ai-ml/blog-ai-dataset-labeling/ Fri, 05 Jun 2020 10:40:43 +0000 https://www.datactics.com/blog-ai-dataset/ In late 2019 our Head of AI, Dr Fiona Browne, delivered a series of talks to the Enterprise Data Management Council on AI-Enabled Data Quality in the context of AML operations, specifically for resolving differences in dataset labelling for legal entity data. In this blog post, Fiona goes under the hood to explain some of […]

The post Dataset Labelling For Entity Resolution & Beyond with Dr Fiona Browne appeared first on Datactics.

]]>

In late 2019 our Head of AI, Dr Fiona Browne, delivered a series of talks to the Enterprise Data Management Council on AI-Enabled Data Quality in the context of AML operations, specifically for resolving differences in dataset labelling for legal entity data.

In this blog post, Fiona goes under the hood to explain some of the techniques that underpin Datactics’ extensible AI Framework.

Across the financial sector, Artificial Intelligence (AI) and Machine Learning (ML) have been applied to a number of areas, including the profiling of behaviour for fraud detection and Anti-Money Laundering (AML), through to the use of natural language processing to enrich data in Know-Your-Customer processes (KYC).

An important part of the KYC/AML process is entity resolution, which is the process of identifying and resolving entities from multiple data sources. This is traditionally the space in which high-performance matching engines have been deployed, with associated fuzzy-match capabilities used to account for trivial or significant differences (indeed, this is part of Datactics' existing self-service platform).

In this arena, Machine Learning (ML) techniques have been applied to address the task of entity resolution using different approaches from graphs and network analysis to probabilistic matching.

Although ML is a sophisticated approach for democratising entity resolution, a limitation of this approach, when supervised ML is used, is the requirement for large volumes of labelled data for the model to learn from.

What is Supervised ML? 

For supervised ML, a classifier is trained using a labelled dataset. This is a dataset that contains example inputs paired with their correct output label. In the case of entity resolution, this includes examples of input matches and non-matches which are correctly labelled. The machine learning algorithm learns from these examples and identifies patterns that link to specific outcomes. The trained classifier then uses this learning to make a prediction on new, unseen cases based on their input values.
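As a toy illustration of that training loop, the sketch below "trains" the simplest possible classifier for entity resolution: it learns a single similarity-score threshold from labelled match/non-match examples. A real classifier would use many features and a proper ML algorithm; the numbers here are invented.

```python
def train_threshold(examples):
    """Pick the score threshold that maximises training accuracy."""
    best_t, best_acc = 0.0, 0.0
    for t in sorted({score for score, _ in examples}):
        acc = sum((score >= t) == label for score, label in examples) / len(examples)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# Labelled examples: (similarity score of a record pair, is it a true match?)
training = [(0.95, True), (0.90, True), (0.85, True), (0.40, False), (0.20, False)]
threshold = train_threshold(training)

def predict(score):
    """Classify a new, unseen record pair from its similarity score."""
    return score >= threshold
```

The learned threshold separates the labelled matches from the non-matches, and `predict` then applies that learning to pairs the model has never seen, which is exactly the supervised pattern described above.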

Dataset Labelling

As we see from the above, supervised ML needs high-quality labelled examples for the classifier to learn from; unlabelled or poorly labelled data will only make it harder for data labelling tools to work. The process of labelling raw data from scratch can be time-consuming and labour-intensive, especially if experts are required to provide labels for, in this example, entity resolution outputs. The labelling process is repetitive in nature, and consistency is needed to ensure that high-quality, correct labels are applied. It is also costly in monetary terms, as those involved in processing the entity data require a high level of understanding of the nature of entities and ultimate beneficial owners, in a context where failure can result in regulatory sanctions and fines.

Approaches for Dataset Labelling

As AI/ML progresses across all sectors, we have seen the rise of industrial-scale dataset labelling, where companies and individuals can outsource their labelling tasks to annotation tools and labelling services. For example, the Amazon Mechanical Turk service enables the crowdsourcing of data labelling, which can reduce labelling work from months to hours. Machine learning models can also be harnessed for data annotation tasks using approaches such as weak and semi-supervised learning, along with Human-in-the-Loop (HITL) learning. HITL improves ML models by incorporating human feedback at stages such as training, testing and evaluation.

ML approaches for Budgeted Learning

We can think of budgeted learning as a balancing act between the expense (in terms of cost, effort and time) of acquiring training data and the predictive performance of the model you are building. For example, can we label a few hundred examples instead of hundreds of thousands? A number of ML approaches can help with this question and reduce the burden of manually labelling large volumes of training data. These include transfer learning, where you reuse previously gained knowledge, for instance by leveraging existing labelled data from a related sector or a similar task. The recent open-source system Snorkel uses a form of weak supervision to label datasets via programmable labelling functions.
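A hedged sketch of the Snorkel idea: small, programmable labelling functions vote on each candidate record pair, and a simple majority vote stands in here for Snorkel's more sophisticated generative label model. The field names and rules below are hypothetical.

```python
MATCH, NO_MATCH, ABSTAIN = 1, 0, -1

def lf_same_postcode(pair):
    """Matching postcodes weakly suggest the same entity."""
    return MATCH if pair["postcode_a"] == pair["postcode_b"] else ABSTAIN

def lf_name_overlap(pair):
    """Two or more shared name tokens weakly suggest a match."""
    a = set(pair["name_a"].lower().split())
    b = set(pair["name_b"].lower().split())
    return MATCH if len(a & b) >= 2 else ABSTAIN

def lf_different_country(pair):
    """Different registered countries weakly suggest a non-match."""
    return NO_MATCH if pair["country_a"] != pair["country_b"] else ABSTAIN

def weak_label(pair, lfs):
    """Majority vote over the labelling functions that did not abstain."""
    votes = [v for lf in lfs if (v := lf(pair)) != ABSTAIN]
    return max(set(votes), key=votes.count) if votes else ABSTAIN
```

Each function encodes one cheap heuristic rather than ground truth; the point is that many noisy, overlapping heuristics can label far more data than experts could by hand, producing a training set for a downstream classifier.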

Active learning is a semi-supervised ML approach which can be used to reduce the burden of manually labelling datasets. The 'active learner' proactively selects the training data it needs to learn from, based on the idea that an ML model can achieve good predictive performance with fewer training instances by prioritising the examples it learns from. During training, an active learner poses queries, typically a selection of unlabelled instances from a dataset, which are then presented to an expert to label manually.
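The query-selection step can be sketched with simple uncertainty sampling: the learner asks the expert to label the unlabelled instances whose predicted match probability is closest to 0.5, i.e. where the current model is least certain. The function and the numbers below are illustrative assumptions.

```python
def select_queries(unlabelled, predict_proba, budget=3):
    """Return the `budget` instances nearest the model's decision boundary."""
    by_uncertainty = sorted(unlabelled, key=lambda x: abs(predict_proba(x) - 0.5))
    return by_uncertainty[:budget]

# For simplicity, each "instance" here is just its predicted match probability.
probabilities = [0.05, 0.48, 0.93, 0.55, 0.10, 0.51]
queries = select_queries(probabilities, predict_proba=lambda p: p, budget=3)
# The borderline cases (0.51, 0.48, 0.55) are sent to the expert for labels.
```

The confidently scored pairs (0.05, 0.93, 0.10) are left alone, so expert effort is concentrated where a label changes the model the most.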

As seen above, there are wide and varied approaches to tackling the task of dataset labelling. Which approach to select depends on a number of factors, from the prediction task through to expense and budgeted learning. The connecting tenet is ensuring high-quality labelled datasets for classifiers to learn from.

Click here for more from Datactics, or find us on LinkedIn, Twitter or Facebook for the latest news.

The post Dataset Labelling For Entity Resolution & Beyond with Dr Fiona Browne appeared first on Datactics.

]]>
Explainable AI with Dr. Fiona Browne https://www.datactics.com/blog/ai-ml/blog-ai-explainability/ Tue, 26 May 2020 18:19:57 +0000 https://www.datactics.com/blog-ai-explainable/ The AI team at Datactics is building explainability from the ground up and demonstrating the “why and how” behind predictive models for client projects. Matt Flenley prepared to open his brains to a rapid education session from Dr Fiona Browne and Kaixi Yang. One of the most hotly debated tech topics of 2020 concerns model […]

The post Explainable AI with Dr. Fiona Browne appeared first on Datactics.

]]>
Dr Fiona Browne, Datactics, discusses Explainable AI

The AI team at Datactics is building explainability from the ground up and demonstrating the “why and how” behind predictive models for client projects.

Matt Flenley prepared to open his mind to a rapid education session from Dr Fiona Browne and Kaixi Yang.

One of the most hotly debated tech topics of 2020 is model interpretability: the rationale for how an ML algorithm has made a decision or prediction. Nobody doubts that AI can deliver astonishing advances in capability and corresponding efficiencies in effort, but as HSBC's Chief Data Officer Lorraine Waters shared at a recent A-Team event, "is it creepy to do this?" Conference agendas are filled with differing rationales for the interpretability and explainability of models, whether business-driven, consumer-driven, or regulatory frameworks to enforce good behaviour, but these are typically ethical conversations first rather than technological ones. It's clear we need to ensure technology is "in the room" for all of these drivers.

We need to be informed and guided by technology to see what tools are already available to help with understanding AI decision-making, how tech can help shed light on ‘black boxes’ just as much as we’re dreaming up possibilities for the use of those black boxes.

As Head of Datactics’ AI team, Dr Fiona Browne has a strong desire for what she calls ‘baked-in explainability’. Her colleague Kaixi Yang explains more about explainable models, 

"Some algorithms, such as neural networks (deep learning), are complex. Functions are calculated through approximation, and from the network's structure it is unclear how this approximation is determined. We need to understand the rationale behind the model's prediction so that we can decide when, or even whether, to trust it, turning black boxes into glass boxes within data science."

The team put their 'explain first' approach to work on a specific client project, building explainable Artificial Intelligence (XAI) from the ground up using explainability metrics including LIME, a local, interpretable, model-agnostic way of explaining individual predictions.

“Model-agnostic explanations are important because they can be applied to a wide range of ML classifiers, such as neural networks, random forests, or support vector machines” continued Ms Yang, who has recently joined Datactics after completing an MSc in Data Analytics with Queen’s University in Belfast. “They help to explain the predictions of any machine learning classifier and evaluate its usefulness in various tasks related to trust”.

For the work the team has been conducting, this range of explainability measures allows them to choose the most appropriate machine learning model and AI system, not just the one that makes the most accurate predictions based on evaluation scores. This has had a significant impact on their work on entity resolution for Know Your Customer (KYC) processes: a classic problem of large, messy datasets that are hard to match, with painful penalties for human users if it goes wrong. The project, which is detailed in a recent webinar hosted with the Enterprise Data Management Council, matched entities from the Refinitiv PermID and Global LEI Foundation datasets and relied on human validation of rule-based matches to train a machine learning algorithm.

Dr Browne again: "We applied different explainability metrics to three different classifiers that could predict whether a legal entity would match or not. We trained, validated and tested the models using an entity resolution dataset. For this analysis we selected two 'black-box' classifiers and one interpretable classifier, to illustrate how the explainability metrics were entirely agnostic and applicable regardless of the classifier chosen."

The results are shown here:

[Figure: explainability metrics in AI and ML]

“In a regular ML conversation, these results indicate two reliably accurate models that could be deployed in production,” continued Dr Browne, “but in an XAI world we want to shed light on how appropriate those models are.”

By applying, for example, LIME to a random instance in the dataset, the team can uncover the rationale behind the predictions made. Datactics’ FlowDesigner rules studio automatically labelled this record as “not a match” through its configurable fuzzy matching engines.

Dr Browne continued, "explainability methods build an interpretable classifier based on instances similar to the selected instance from the different classifiers, and summarise the features which are driving the prediction. They select those instances that are quite close to the predicted instance, depending on the model that's been built, and use those predictions from the black-box model to build a glass-box model, where you can then describe what's happening."
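The intuition behind such explainers can be sketched crudely in pure Python: perturb one feature at a time of a single instance and watch how the black-box prediction shifts. This is a deliberate simplification (LIME instead fits a local linear model to many perturbed samples), and the model and feature names below are hypothetical.

```python
def local_attribution(predict, instance, baseline=0.0):
    """Score each feature by how much replacing it with the baseline
    changes the model's prediction for this one instance."""
    base = predict(instance)
    return {name: base - predict({**instance, name: baseline}) for name in instance}

# Toy "black box": a weighted match score over similarity features.
WEIGHTS = {"name_sim": 0.7, "country_match": 0.2, "postcode_sim": 0.1}

def match_score(features):
    return sum(WEIGHTS[k] * v for k, v in features.items())

instance = {"name_sim": 0.9, "country_match": 1.0, "postcode_sim": 0.5}
explanation = local_attribution(match_score, instance)
```

Here the attribution for `name_sim` dominates, which is exactly the kind of insight a good explainer should surface, and, as the Naïve Bayes example below shows, its absence is a warning sign.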

[Figure: prediction probabilities in AI]

In this case, for the Random Forest model (see figure), the label has been correctly predicted as 0 (not a match), and LIME exposes the features driving this decision. The prediction is supported by two key features, but not by a feature based on entity name, which we know is important.

Using LIME on the multilayer perceptron model (see figure), which had the same accuracy as the Random Forest, the team found it also correctly predicted the "0" label of "not a match", but with a lower support score, and supported by slightly different features compared to the Random Forest model.

[Figure: prediction probabilities in AI]

The Naïve Bayesian model was different altogether. “It fully predicted the correct label of zero with a prediction confidence of one, the highest confidence possible,” said Dr Browne, “however it’s made this prediction supported by only one feature, a match on the entity country, disregarding all other features. This would lead you to doubt whether it’s reliable as a prediction model.”

This has significant implications in something as riddled with differences in data fields as KYC data. People and businesses move, directors and beneficial owners resign, and new ones are appointed, and that’s without considering ‘bad actors’ who are trying to hoodwink Anti-Money Laundering (AML) systems. 

[Figure: explainability and interpretability in AI]

The process of 'phoenixing', where a new entity rises from the ashes of a failed one, intentionally dodging the liabilities of the previous incarnation, frequently relies on truncations or misspellings of directors' names to avoid linking the new entity with the previous one.

Any ML model being used on such a dataset would need to have this explainability baked-in to understand the reliability of predictions that the data is informing.

Using only one explainability metric is not good practice. Dr Browne explains Datactics' approach: "Just as with classifiers, there's no single best evaluation approach or explainer to pick; the best way is to choose a number of different models and metrics to try to describe what's happening. There are always pros and cons, ranging from the scope of the explainer, to the stability of the code, to the complexity of the model and how and when it's configured."

These technological disciplines, to test, evaluate and try to understand a problem, are a crucial part of the entire conversation that businesses are having at an ethical or "risk appetite" level.

Click here for more from Datactics, or find us on LinkedIn, Twitter or Facebook for the latest news.

The post Explainable AI with Dr. Fiona Browne appeared first on Datactics.

]]>
Fundamentals of AI ethics with Dr. Fiona Browne https://www.datactics.com/blog/ai-ml/blog-ai-ethics/ Thu, 14 May 2020 10:40:01 +0000 https://www.datactics.com/blog-ai-ethics/ AI Ethics Dr Fiona Browne heads Datactics’ four-strong AI team building explainable AI solutions. Here, Fiona delves into the thorny topic of ethics, the centrepiece of any AI expedition. More specifically,  the post goes into how to ensure that the potentially negative ethical impacts of AI do not outweigh the positives it can deliver across […]

The post Fundamentals of AI ethics with Dr. Fiona Browne appeared first on Datactics.

]]>
AI Ethics

Dr Fiona Browne heads Datactics’ four-strong AI team building explainable AI solutions.

Here, Fiona delves into the thorny topic of ethics, the centrepiece of any AI endeavour. More specifically, the post explores how to ensure that the potentially negative ethical impacts of AI do not outweigh the positives it can deliver across industry and academic sectors.

AI and big data

Since 2016, the domain of AI/ML has been gathering momentum with breakthroughs in NLP and computer vision. Andrew Ng, one of the founders of Google Brain, has referred to Artificial Intelligence (AI) as “automation on steroids” and “the new electricity”. We really have come a long way since the 1950s when Alan Turing first posed the question – “Can machines think?” outlined in his seminal paper Computing Machinery and Intelligence.

AI is here and is already being applied, from email spam filters to personal assistants such as Siri or Alexa, through to social media and customer service chatbots.

One of the most interesting aspects of this technology is that it is general-purpose, and we can apply this across many diverse sectors, from agriculture to manufacturing to healthcare to finance. The potential applications are vast and can provide us with faster services, whether automating administrative tasks or developing ‘decision-aid’ tools for clinicians in analysing our medical data. This is clearly an exciting time and we can see how AI will continue to be embedded into our everyday lives from obtaining bank loans to driving our cars.

This has, quite rightly, raised ethical questions around the safety, confirmation bias and transparency of AI. Perhaps an even bigger question is: “what is an ethical AI system and how can I validate it?”

It is encouraging that such questions are being asked. Machine learning algorithms learn from data, and if an algorithm learns from data containing bias, those biases will persist through to the predictions it makes. A wide range of biases exist, from gender bias to selection bias. Such biases can be inherent in the data or extrinsic to it; that is, bias introduced by the unintentional omission of data, depending on how the data was collected. Two excellent examples from Harvard Business Review delve deeper into this subject and are well worth taking the time to read. An interesting development is the emerging discipline of AI ethics, dedicated to addressing these questions and involving experts across diverse domains including philosophy, computer science, academia and government.

We are seeing machine learning models and AI solutions move into our everyday lives, from facial recognition to real-time video analysis, replacing humans in the decision-making process.

These capabilities could be used for citizen protection, especially given the current contact-tracing demands of the coronavirus situation. The key is striking the balance between what the technology can potentially do and being responsible with it, so that our democracy and privacy are not undermined by ethical issues and different types of data bias.

The question, then, is how to develop models for the society we wish to inhabit, not merely replicate the society we have.

Having technologies that are built and informed by a diverse workforce, with different people and different points of view, is one factor that will aid this. Initiatives such as the Organisation for Economic Co-operation and Development (OECD) have developed principles to promote AI that is innovative and trustworthy. The Alan Turing Institute also has initiatives around fairness, transparency and ethics, with similar questions being considered at MIT and Harvard. However, as these technologies have begun to touch our everyday lives in increasingly unseen ways, it will be important that we are all given an equal voice in this conversation. Democratising the debate on ethics in AI needs to involve greater community understanding, political guidance and policies of inclusion to prevent, and hopefully even undo, the biases already hard-coded into human society.

Click here for more from Datactics, or find us on LinkedIn, Twitter or Facebook for the latest news.

The post Fundamentals of AI ethics with Dr. Fiona Browne appeared first on Datactics.

]]>