Data Quality Archives – Datactics
https://www.datactics.com/tag/data-quality/

Nightmare on LLM Street: How To Prevent Poor Data Haunting AI
https://www.datactics.com/blog/nightmare-on-llm-street/ – Fri, 25 Oct 2024
Why risk taking a chance on poor data for training AI? If it’s keeping you awake at night, read on for a strategy to overcome the nightmare scenarios!


It’s October, the Northern Hemisphere nights are drawing in, and for many it’s time when things take a scarier turn. But for public sector leaders exploring AI, that fright need not apply to your data. It definitely shouldn’t be something that haunts your digital transformation dreams.

With a reported £800m budget unveiled by the previous government to address ‘digital and AI’, UK public sector departments are keen to be the first to explore the sizeable benefits that AI and automation offer. The change of government in July 2024 has done nothing to indicate that this drive has lessened in any way; in fact, the Labour manifesto included the commitment to a “single unique identifier” to “better support children and families”[1].

While we await the first Budget of this Labour government, it’s beyond doubt that there is an urgent need to tackle this task amid a cost-of-living crisis, with economies still trying to recover from the economic shock of COVID and deal with energy price hikes amid several sizeable international conflicts.

However, like Hollywood’s best Halloween villains, old systems, disconnected data, and a lack of standardisation are looming large in the background.

Acting First and Thinking Later

It’s completely understandable that the pressures would lead us to this point. Societal expectations from the emergence of ChatGPT, among others, have only fanned the flames, swelling the sense that technology should just ‘work’ and leading to an overinflated belief in what is possible.

Recently, LinkedIn attracted some consternation[2] by automatically including members’ data in its AI models without seeking express consent first. For whatever reason, the possibility that people would not simply accept this change was overlooked. It took the UK’s Information Commissioner’s Office (ICO) to intervene for the change to be withdrawn – in the UK, at least.

A dose of reality is the order of the day. Government systems lack integrated data, and clear consent frameworks of the type that LinkedIn actually possesses seldom exist in one consistent form. Already short of funds, the public sector needs to act carefully and mindfully to prevent its AI experiments (which is, after all, what they are) from leading to inaccuracies and wider distrust among the general public.

One solution is for Government departments to form one, holistic set of consents concerning use of data for AI, especially Large Language Models and Generative AI – similar to communication consents under the General Data Protection Regulation, GDPR.

The adoption of a flexible consent management policy, one which can be updated and maintained for future developments and tied to an interoperable, standardised single view of citizen (SCV), will serve to support the clear, safe development of AI models into the future. The risks of building models now, on shakier foundations, will only serve to erode public faith. The evidence of the COVID-era exam grades fiasco[3] demonstrates the risk that these models present to real human lives.

Of course, it’s not easy to do. Many legacy systems contain names, addresses and other citizen data in a variety of formats. This makes it difficult to be sure that when more than one dataset includes a particular name, that name actually refers to the same individual. Traditional solutions to this problem use anything from direct matching technology to the truly awful exercise of humans manually reviewing tens of thousands of records in spreadsheets. This is one recurring nightmare that society really does need to stop having.

Taking Refuge in Safer Models

Intelligent data matching uses a variety of matching algorithms and well-established machine learning techniques to reconcile data held in old systems, new ones, documents and even voice notes. Such approaches could help the public sector to streamline its SCV processes and manage consents more effectively. The ability to understand who has opted in, marrying opt-ins and opt-outs to demographic data, is critical. It helps model creators to interpret the inherent bias in models built only on those consenting to take part, to understand how reflective of society the predictive models are likely to be – including whether or not it is actually safe to use the model at all.
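To make that concrete, below is a minimal Python sketch of the kind of fuzzy matching such reconciliation relies on, using only the standard library. The record layouts, field weights and 0.85 acceptance threshold are illustrative assumptions, not a description of how the Datactics platform works.

```python
# A minimal sketch of fuzzy record matching across two legacy datasets:
# normalise the fields, score candidate pairs, keep those above a threshold.
# Field weights and the threshold are assumptions for illustration.
from difflib import SequenceMatcher


def normalise(text: str) -> str:
    """Lower-case, trim and collapse whitespace so formatting quirks don't block matches."""
    return " ".join(text.lower().split())


def similarity(a: str, b: str) -> float:
    """Fuzzy similarity between 0 and 1."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def match_records(system_a, system_b, threshold=0.85):
    """Pair up records whose weighted name + postcode similarity clears the threshold."""
    matches = []
    for rec_a in system_a:
        for rec_b in system_b:
            score = (0.7 * similarity(rec_a["name"], rec_b["name"])
                     + 0.3 * similarity(rec_a["postcode"], rec_b["postcode"]))
            if score >= threshold:
                matches.append((rec_a["id"], rec_b["id"], round(score, 3)))
    return matches


legacy = [{"id": "A1", "name": "Jonathon  Smyth", "postcode": "BT1 1AA"}]
modern = [{"id": "B7", "name": "jonathon smyth", "postcode": "BT11AA"}]
print(match_records(legacy, modern))  # [('A1', 'B7', 0.977)]
```

A production single-view-of-citizen process would add phonetic encodings, blocking to avoid comparing every pair, and human review of borderline scores, but the principle – normalise, score, accept above a threshold – is the same.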

It’s probable that this transparency of process could also lead to greater willingness among the general public to take part in data sharing of this kind. In the LinkedIn example, the news that data was being used without explicit consent raced around like wildfire on the platform itself. This outcome cannot be what LinkedIn anticipated, which is in itself a concern about the mindset of the model creators.

It Doesn’t Have to Be a Nightmare

It’s a spooky enough season without adding more fear to the bonfire; certainly, this article isn’t intended as a reprimand. The desire to save time and money to deliver better services to a country’s citizens is a major part of many a civil servant’s professional drive. And AI and automation offer so many opportunities for much better outcomes! For just one example, NHS England’s AI tool already uses image recognition to detect heart disease up to 30 times faster than a human[4]. Mid and South Essex (MSE) NHS Foundation Trust used a predictive analytical machine learning model called Deep Medical to reduce the rate at which patients either didn’t attend appointments or cancelled at short notice (referred to as Did Not Attend, or DNA). Its pilot project identified which patients were more likely to fall into the DNA category, developed personalised reminder schedules and, by identifying frail patients who were less likely to attend an appointment, highlighted them to the relevant clinical teams.[5]

The time for taking action is now. Public sector organisations, government departments and agencies should focus on the need to develop systems that will preserve and maintain trust in the AI-led future. This blog has shown that better is possible, through a dedicated desire to align citizen data and their consents to contact. In a society where people have trust and transparency in the ways that their data will be used to train AI, the risk of nightmare scenarios can be averted and we’ll all sleep better at night.


[1] https://www.ropesgray.com/en/insights/viewpoints/102jc9k/labour-victory-the-implications-for-data-protection-ai-and-digital-regulation-i

[2] https://etedge-insights.com/in-focus/trending/linkedin-faces-backlash-for-using-user-data-in-ai-training-without-consent/

[3] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7894241/#:~:text=COVID%2D19%20prompted%20the%20UK,teacher%20assessed%20grades%20and%20standardisation.

[4] https://www.healthcareitnews.com/news/emea/nhs-rolls-out-ai-tool-which-detects-heart-disease-20-seconds

[5] https://www.nhsconfed.org/publications/ai-healthcare


Datactics Awards 2024: Celebrating Customer Innovation
https://www.datactics.com/blog/datactics-awards-2024-celebrating-customer-innovation/ – Tue, 24 Sep 2024

In 2024, our customers have been busy delivering data-driven return on investment for their respective organisations. We wanted to recognise and praise their efforts in our first-ever Datactics Customer Awards!

Datactics Customer Awards winners 2024 gather for a group photo.
(From L to R: Erikas Rimkus, RBC Brewin Dolphin; Rachel Irving, Daryoush Mohammadi-Zaniani, Nick Jones and Tony Cole, NHS BSA; Lyndsay Shields, Danske Bank UK; Bobby McClung, Renfrewshire Health and Social Care Partnership). Not pictured: Solidatus.

The oak-panelled setting of historic Toynbee Hall provided the venue for the 2024 Datactics Summit, which this year carried a theme of ‘Data-Driven Return on Investment.’

Attendees gathered for guest speaker slots covering:

  • Danske Bank UK’s Lyndsay Shields presenting a ‘Data Management Playbook’ covering the experiences of beginning with a regulatory-driven change for FSCS compliance, through to broader internal evangelisation on the benefits of better data;
  • Datactics’ own data engineer, Eugene Coakley, in a lively discussion on the data driving sport, drawing from his past career as a professional athlete and Olympic rower with Team Ireland;
  • and Renfrewshire HSCP’s Bobby McClung explaining how automation, and the saving of person-hours or even days in data remediation, is having a material impact on the level of care the organisation is now able to deliver to citizens making use of its critical services.

The Datactics Customer Awards in full

In recent months, the team at Datactics has worked to identify notable achievements in data over the past year. Matt Flenley, Head of Marketing at Datactics, presented each winner with a specific citation, quoted below.

Data Culture Champion of the Year – Lyndsay Shields, Danske Bank UK

“We’re delighted to be presenting Lyndsay with this award. As one of our longest-standing customers, Lyndsay has worked tirelessly to embed a positive data culture at Danske Bank UK. Her work in driving the data team has helped inform and guide data policy at group level, bringing up the standard of data management across Danske Bank.

“Today’s launch of the Playbook serves to showcase the work Lyndsay and her team have put into driving the culture at Danske Bank UK, and the wider culture across Danske Bank.”

Data-Driven Social Impact Award – Renfrewshire Health and Social Care Partnership

“Through targeted use of automation, Renfrewshire Health and Social Care Partnership has been able to make a material difference to the operational costs of local government care provision.

“Joe Deary’s early vision and enthusiasm for the programme, and the drive of the team under and alongside Bobby, has effectively connected data automation to societally-beneficial outcomes.”

Data Strategy Leader of the Year – RBC Brewin Dolphin

“RBC Brewin Dolphin undertook a holistic data review towards the end of 2023, culminating in a set of proposals to create a rationalised data quality estate. The firm twinned this data strategy with technology innovation, including being an early adopter of ADQ from Datactics. They overcame some sizeable hurdles, notably supporting Datactics in our early stages of deployment. Their commitment to being an ambitious, creative partner makes them stand out.

“At Datactics we’re delighted to be giving the team this award and would also like to thank them for being exemplars of patience in the way they have worked with us this year in particular.”

Datactics Award for Partner of the Year – Solidatus

“Solidatus and Datactics have been partnered for the last two years but it’s really in 2023-2024 that this partnership took off.

“Ever since we jointly supported Maybank, in Malaysia, in their data quality and data lineage programme, we have worked together on joint bids and supported one another in helping customers choose the ‘best of breed’ option in procuring data management technology. We look forward to our next engagements!”

Datactics Data Champion of the Year – NHSBSA

“For all the efforts Tony, Nick and team have made to spread the word about doing more with data, we’d like to recognize NHS Business Services Authority with our Datactics Data Champion of the Year award.

“As well as their advocacy for our platform, applying it to identify opportunities for cost savings and efficiencies across the NHS, the team has regularly presented their work to other Government departments and acted as a reference client on multiple occasions. Their continued commitment to the centrality of data as a business resource is why they’re our final champions this year, the Datactics Data Champion 2024.”

Lyndsay from Danske Bank UK
Bobby from Renfrewshire HSCP
Erikas and Clive from RBC Brewin Dolphin
Tony, Rachel, Nick and Daryoush from NHS BSA

Toasting success at Potter & Reid

The event closed with its traditional visit to Shoreditch hot spot Potter & Reid. Over hand-picked canapés and sparkling drinks, attendees networked and mingled to share in the award winners’ achievements in demonstrating what data-driven culture and return on investment looks like in practice. Keep an eye out for a taster video from this year’s event!

ISO 27001:2022 Certification Success
https://www.datactics.com/blog/datactics-achieves-certification-iso-27001/ – Fri, 30 Aug 2024


Datactics, a leader in data quality software, has achieved ISO 27001:2022 certification for its Information Security Management System.

The ISO 27001 certification is recognised globally as a benchmark for managing information security. The rigorous certification process, conducted by NQA and Vertical Structure, involved an extensive evaluation of Datactics’ security policies, procedures, people, and controls. Achieving this certification demonstrates Datactics’ dedication to safeguarding client data and maintaining information assets’ integrity, confidentiality, and availability.

Victoria Wallace, Senior DevOps & Security Specialist, stated: “Security is at the heart of everything that Datactics does and achieving ISO 27001:2022 certification is a testament to the team’s unwavering commitment in this technical field. Showcasing the extensive work that went into this prestigious achievement proves that dedication and determination can lead to significant success, both within Datactics and across our client ecosystem. Achieving and maintaining this certification is a key part of Datactics’ progress in enhancing our secure, process-driven, and powerful data quality platform.”

Tom Shields, Cyber & Information Security Consultant at Vertical Structure, said: “It was a pleasure working with the team at Datactics. Their enthusiastic approach to ISO 27001 Information Security and the associated business risk mitigation was evident in every interaction. Involvement from top to bottom was prioritised from day one, allowing us to integrate into their team from the very outset. The opportunity to guide such organisations in certifying to ISO 27001 is a privilege for us, and we look forward to continuing to work alongside their team in the future.”

About ISO 27001:2022 Certification

Datactics’ accreditation has been issued by NQA, a leading global independently accredited certification body. NQA has provided assessments (audits) of organisations to various management system standards since 1988.

Founded in 2006, Vertical Structure is an independent cyber security consultancy with a ‘people-first’ approach. Vertical Structure specialises in providing people-focused security and penetration testing services for web applications, cloud infrastructure and mobile applications.

Vertical Structure also delivers technical security training, helping companies to achieve certification to international standards such as ISO 27001, Cyber Essentials and CAIQ, and is proud to be an Amazon Web Services® Select Consulting Partner.

What is Data Quality and why does it matter?
https://www.datactics.com/glossary/what-is-data-quality/ – Mon, 05 Aug 2024


What is Data Quality and why does it matter?

 

Data Quality refers to how fit your data is for serving its intended purpose. Good quality data should be reliable, accurate and accessible.

What is Data Quality?

Good quality data allows organisations to make informed decisions and ensure regulatory compliance. Bad data should be viewed as being at least as costly as any other type of debt. In highly regulated sectors such as government and financial services, achieving and maintaining good data quality is key to avoiding data breaches and regulatory fines.

Data is arguably the most valuable asset of any organisation, and data quality can be improved through a combination of people, processes and technology. Data quality issues include data duplication, incomplete fields and manual input (human) error. Identifying these errors by eye alone can take a significant amount of time, so using technology to automate data quality monitoring helps an organisation improve operational efficiency and reduce risk.

The data quality dimensions described below apply regardless of where the data physically resides and whether measurement is conducted on a batch or real-time basis (also known as scheduling or streaming). They help provide a consistent view of data quality across data lineage platforms and into data governance tools.

How to measure Data Quality:

According to Gartner, data quality is typically measured against six main data quality dimensions: Accuracy, Completeness, Uniqueness, Timeliness, Validity (also known as Integrity) and Consistency.

Accuracy

Data accuracy is the extent to which data correctly represents the real-world scenario it describes and can be confirmed against an independently verified source. For example, an email address incorrectly recorded in a mailing list can lead to a customer not receiving information, and an inaccurate date of birth can deprive an employee of certain benefits. The accuracy of data is linked to how the data is preserved throughout its journey. Data accuracy can be supported through successful data governance and is essential for highly regulated industries such as finance and banking.

Completeness

Completeness measures whether the data can sufficiently guide and inform future business decisions. It is expressed as the proportion of required values that are actually reported – a dimension that affects not only mandatory fields but, in some circumstances, optional values as well.

Uniqueness

Uniqueness means that a given entity exists just once in a dataset. Duplication is a significant and common issue when integrating multiple data sets, and the way to combat it is to ensure the correct rules are applied when unifying candidate records. A high uniqueness score implies that few duplicates are present, which builds trust in the data and in any analysis derived from it. Data uniqueness also strengthens data governance and, in turn, speeds up compliance.

Timeliness

Timeliness measures whether data is updated frequently enough to meet business requirements. It is important to understand how often the data changes and, consequently, how often it will need to be updated. Timeliness should therefore be understood in terms of volatility.

Validity

Validity refers to whether data conforms to the expected data type, range, format or precision; it is also referred to as data integrity. Invalid data affects the completeness of the data, so it is important to define rules that either exclude or resolve invalid values.

Consistency

Inconsistent data is one of the biggest challenges facing organisations because it is difficult to assess and requires planned testing across numerous data sets. Consistency is often linked with another dimension, accuracy, and any data set scoring highly on both is likely to be a high-quality data set.
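To make these dimensions concrete, here is a minimal pandas sketch that scores a small, invented customer table for completeness, uniqueness, validity and timeliness. The column names, email pattern and 30-day freshness window are assumptions for illustration only.

```python
# A minimal sketch of dimension scoring on an invented customer table.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [101, 102, 102, 104],
    "email": ["a@example.com", None, "not-an-email", "d@example.com"],
    "last_updated": ["2024-10-01", "2024-03-15", "2024-09-30", "2024-10-20"],
})
df["last_updated"] = pd.to_datetime(df["last_updated"])

EMAIL_PATTERN = r"^[^@\s]+@[^@\s]+\.[^@\s]+$"
as_of = pd.Timestamp("2024-10-25")

scores = {
    # Completeness: proportion of email values actually populated
    "completeness_email": df["email"].notna().mean(),
    # Uniqueness: proportion of customer_id values that are not duplicates
    "uniqueness_id": 1 - df["customer_id"].duplicated().mean(),
    # Validity: proportion of populated emails matching the expected format
    "validity_email": df["email"].dropna().str.match(EMAIL_PATTERN).mean(),
    # Timeliness: proportion of records refreshed within the last 30 days
    "timeliness": (as_of - df["last_updated"] <= pd.Timedelta(days=30)).mean(),
}
print(scores)  # each value is a proportion between 0 and 1
```

Because each score is a simple proportion, the dimensions are easy to track on a dashboard and compare over time.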

How does Datactics help with measuring Data Quality?

Datactics is a core component of any data quality strategy. The Self-Service Data Quality platform is fully interoperable with off-the-shelf business intelligence tools such as PowerBI, MicroStrategy, Qlik and Tableau. This means that data stewards, Heads of Data and Chief Data Officers can rapidly integrate the platform to provide fine-detail dashboards on the health of data, measured to consistent data standards.

The platform enables data leaders to conduct a data quality assessment, understanding the health of data against business rules and highlighting areas of poor data quality against consistent data quality metrics.

These business rules can relate to how the data is to be viewed and used as it flows through an organisation, or at a policy level. For example, a customer’s credit rating or a company’s legal entity identifier (LEI).
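As one concrete illustration of a policy-level rule: an LEI is a 20-character alphanumeric code whose final two characters are check digits under ISO 7064 MOD 97-10, so a rule can test both format and checksum. The sketch below is a generic Python version of that check, not the rule syntax used in the Datactics platform, and the example value is hypothetical.

```python
# Generic LEI business rule: 20 alphanumeric characters whose ISO 7064
# MOD 97-10 check (letters mapped to 10-35) leaves a remainder of 1.
import re

LEI_FORMAT = re.compile(r"^[A-Z0-9]{18}[0-9]{2}$")


def lei_is_valid(lei: str) -> bool:
    lei = lei.strip().upper()
    if not LEI_FORMAT.match(lei):
        return False
    # Map each character to digits (A=10 ... Z=35) and apply the mod-97 test.
    numeric = "".join(str(int(ch, 36)) for ch in lei)
    return int(numeric) % 97 == 1


# Hypothetical value: returns True only if both format and check digits pass.
print(lei_is_valid("ABCDEF1234567890XY42"))
```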

Once a baseline has been established the Datactics platform can perform data cleansing, with results over time displayed in data quality dashboards. These help data and business leaders to build the business case and secure buy-in for their overarching data management strategy.

What part does Machine Learning play?

Datactics uses Machine Learning (ML) techniques to propose fixes to broken data, and uncover patterns and rules within the data itself. The approach Datactics employs is of “fully-explainable” AI, ensuring humans in the loop can always understand why or how an AI or ML model has reached a specific decision.

Measuring data quality in an ML context therefore also refers to how well an ML model is monitored. This means that in practice, data quality measurement strays into an emerging trend of Data Observability: the knowledge at any point in time or location that the data – and its associated algorithms – is fit for purpose.

Data Observability, as a theme, has been explored further by Gartner and others. This article from Forbes provides deeper insights into the overlap between these two subjects.

What Self-Service Data Quality from Datactics provides

The Datactics Self-Service Data Quality tool measures the six dimensions of data quality and more, including: Completeness, Referential Integrity, Correctness, Consistency, Currency and Timeliness.

Completeness – The DQ tool profiles data on ingestion and gives the user a report on the percentage populated, along with data and character profiles of each column, to quickly spot any missing attributes. Profiling operations to identify non-conforming code fields can be easily configured by the user in the GUI.

Referential Integrity – The DQ tool can identify links/relationships across sources with sophisticated exact/fuzzy/phonetic/numeric matching against any number of criteria and check the integrity of fields as required. 

Correctness – The DQ tool has a full suite of pre-built validation rules to measure against reference libraries or defined format/checksum combinations. New validation rules can easily be built and re-used.

Consistency – The DQ tool can measure data inconsistencies via many different built-in operations such as validation, matching, filtering/searching. The rule outcome metadata can be analysed inside the tool to display the consistency of the data measured over time. 

Currency – Measuring the difference between dates and finding inconsistencies is fully supported in the DQ tool. Dates in any format can be matched against each other or converted to POSIX time and compared against historical dates.

Timeliness – The DQ tool can measure timeliness by utilising its highly customisable reference library to insert SLA reference points and comparing any recorded action against these SLAs using the powerful matching options available (a generic sketch of this kind of date and SLA check follows below).
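For illustration, the sketch below shows the kind of date handling the Currency and Timeliness descriptions above refer to: parsing dates that arrive in mixed formats, converting them to POSIX time and flagging records that breach an SLA window. The accepted formats and the five-day SLA are assumptions for the example, not the DQ tool’s own configuration.

```python
# A generic currency/timeliness sketch: parse mixed-format dates, convert to
# POSIX time and flag records that breach an assumed 5-day SLA.
from datetime import datetime, timezone

KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d %b %Y")


def to_posix(value: str) -> float:
    """Try each known format in turn and return seconds since the Unix epoch."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc).timestamp()
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {value!r}")


def breaches_sla(received: str, processed: str, sla_days: int = 5) -> bool:
    """True if the gap between two event dates exceeds the SLA window."""
    return to_posix(processed) - to_posix(received) > sla_days * 86400


print(breaches_sla("2024-10-01", "09/10/2024"))   # True - eight days elapsed
print(breaches_sla("01 Oct 2024", "2024-10-03"))  # False - two days elapsed
```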

Our Self-Service Data Quality solution empowers business users to self-serve for high-quality data, saving time, reducing costs and increasing profitability. It helps ensure accurate, consistent, compliant and complete data, enabling businesses to make better-informed decisions.

And for more from Datactics, find us on LinkedIn, Twitter or Facebook.

Insights from techUK’s Security and Public Safety SME Forum
https://www.datactics.com/blog/panel-discussion-techuk-security-and-public-safety-sme-forum/ – Fri, 24 May 2024

Chloe O’Kane, Project Manager at Datactics, recently spoke at techUK’s Security and Public Safety SME Forum, which included a panel discussion featuring speakers from member companies of techUK’s National Security and JES programs. The forum provided an excellent opportunity to initiate conversations and planning for the future among its members.

Chloe O'Kane, Project Manager at Datactics

 

Read Chloe’s Q&A from the panel session, ‘Challenges and opportunities facing SMEs in the security and public safety sectors’, below:

What made you want to join the forum?

For starters, techUK is always a pleasure to work with – my colleagues and I at Datactics have several contacts at techUK that we speak with regularly, and it’s clear that they care about the work they’re doing. It never feels like a courtesy call – you always come away with valuable actions to follow up on. Having had such positive experiences with techUK before, I felt encouraged to join the Security and Public Safety SME Forum. Being part of the forum is exciting – you’re in a room full of like-minded people who want to make a difference.

What are your main hopes and expectations from the forum?

I’ve previously participated in techUK events where senior stakeholders from government departments have led open and honest conversations about gaps in their knowledge. It’s refreshing to see them hold their hands up and say ‘We need help and we want to hear from SMEs’.

I think it would be great to see more of this in the Security and Public Safety SME forum, with people not being afraid to ask for help and demonstrating a desire to make a change.

What are, in your opinion, the main challenges faced by the SME community in the security and public safety sectors?

One of the challenges we face as SMEs is that we have to be deliberate about the work we do. We might see an opportunity that we know we’re a good fit for, but before we can commit, we need to think about more than just ‘do we fit the technical criteria?’ We need to think about how it’s going to affect wider aspects of the company – do we have sufficient staffing? Do they need security clearance? What is the delivery timeline?

If we aren’t being intentional, we risk disrupting our current way of working. We have a loyal and happy customer base and an excellent team of engineers, developers, and PMs to manage and support them, but even if a brilliant data quality deal lands on our desk, if it would take an army to deliver it, we may not be able to commit the same resources that a big consultancy firm can and, ultimately, we may have to pass on it.  

Moreover, our expertise lies specifically in data quality. As a leading DQ vendor, we excel in this area. However, if a project requires both data quality and additional data management services, we may not be the most suitable candidate, despite being the best at delivering the data quality component.

What are your top 3 areas of focus that the forum should address?

Ultimately, I think the goal of this forum should be steered by asking the question ‘How do we make people feel safe’?

A big challenge is always going to be striking the balance between tackling the issues that affect people’s safety, whilst navigating those bigger ‘headline’ stories that can have a lasting effect on the public. For instance, if you google ‘Is the UK a safe place to live?’, largely speaking the answers will say that ‘yes, the UK is a very safe place to live’. However, people’s perceptions don’t always align with that. I remember reading an article last year about how public trust in police has fallen to the lowest levels ever, so I think that would be a good place to start.  

From a member’s perspective though, more selfishly, I’d like to get the following out of the forum – 

  • Access to more SME opportunities 
  • Greater partnership opportunities 
  • More insights into procurement and access to the market

In your opinion, why is networking and collaboration so important? Have you any success stories to share?

Our biggest success in networking and collaboration is having so many customers willing to endorse us and share our joint achievements.

We focus on understanding our customers, learning how they use our product, and listening to their likes and dislikes. This feedback shapes our roadmap and shows customers how much we value their input. This approach not only creates satisfied customers, but also turns them into advocates for our product. They mention us at conferences, in speeches, and in reference requests, and even help other customers with their data management strategies.

For us, networking is about more than just making new contacts; it’s about helping our customers connect and build relationships. Our customers’ advocacy is incredibly valuable because prospective customers like to hear success stories directly from them – perhaps even more than from salespeople.

About Datactics

Datactics specialises in data quality solutions for security and public safety. Using advanced data matching, cleansing, and validation, we help law enforcement and public safety agencies manage and analyse large datasets. This ensures critical information is accurate and accessible, improving response times, reducing errors, and protecting communities from threats.

For more information on how we support security and public safety services, visit our GovTech and Policing page, or reach out to us via our contact us page.

Got three minutes? Get all you need to know on ADQ!
https://www.datactics.com/blog/adq-in-three-minutes/ – Wed, 17 Apr 2024

To save you scrolling through our website for the essential all you need to know info on ADQ, we’ve created this handy infographic.

Our quick ADQ in three minutes guide can be downloaded from the button below the graphic. Happy reading! As always, don’t hesitate to get in touch if you’re looking for an answer that you can’t find here. Simply hit ‘Contact us’ with your query and let us do the rest.

[Infographic, part one] The augmented data quality process from Datactics: connect to data, profile data, leverage AI rule suggestion, configure controls.
[Infographic, part two] Measure data health; get alerts and remediations; generate AI-powered insights; and work towards a return on investment.

Wherever you are on your data journey, we have the expertise, the tooling and the guidance to help accelerate your data quality initiatives. From connecting to data sources, through rule building, measuring and into improving the quality of data your business relies on, let ADQ be your trusted partner.

If you would like to read some customer stories of how we’ve already achieved this, head on over to our Resources page where you’ll find a wide range of customer case studies, white papers, blogs and testimonials.

To get hold of this infographic, simply hit Download this! below.

Datactics placed in the 2024 Gartner® Magic Quadrant™ for Augmented Data Quality Solutions
https://www.datactics.com/blog/datactics-placed-in-the-2024-gartner-magic-quadrant-for-augmented-data-quality-solutions/ – Fri, 05 Apr 2024


Belfast, Northern Ireland – 5th April, 2024 – Datactics, a leading provider of data quality and matching software, has been recognised in the 2024 Gartner Magic Quadrant for Augmented Data Quality Solutions for a third year running.  

Gartner included only 13 data quality vendors in the report, where Datactics is named a Niche Player. Datactics’ Augmented Data Quality platform (ADQ) offers a unified and user-friendly experience, optimising data quality management and improving operational efficiencies. By augmenting data quality processes with advanced AI and machine learning techniques, such as outlier detection, bulk remediation, and rule suggestion, Datactics serves customers across highly regulated industries, including financial services and government.
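For readers less familiar with the terminology, the sketch below illustrates the general idea behind one of those techniques – outlier detection – using a simple interquartile-range check in Python. It is a generic illustration of the statistical technique, with invented data, not a description of how ADQ implements it.

```python
# A generic interquartile-range (IQR) outlier check on a numeric column.
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the middle 50% of the data."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


trade_amounts = [100, 102, 98, 101, 99, 103, 97, 5000]  # one suspicious value
print(iqr_outliers(trade_amounts))  # [5000]
```

In a data quality workflow, values flagged this way would typically be routed to a data steward for review rather than changed automatically.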

In an era where messy, unreliable and inaccurate data poses a substantial threat to organisations, the demand for data quality solutions has never been greater. Datactics stands out for its user-friendly, scalable, and highly efficient data quality solutions, designed to empower business users to manage and improve data quality seamlessly. Its solutions leverage AI and machine learning to automate complex data management tasks, thereby significantly enhancing operational efficiency and data-driven decision-making across various industries. 

“We are thrilled to be included in the 2024 Gartner Magic Quadrant for Augmented Data Quality Solutions,” said Stuart Harvey, CEO of Datactics. “Our team’s dedication and innovative approach is solving the complex challenges of practical data quality for customers across industries.

“We believe the report significantly highlights our distinction from traditional observability solutions, showcasing Datactics’ focus on identifying, measuring and remediating broken data. We are committed to assisting our clients to create clean, ready-to-use data via the latest techniques in AI and have invested heavily in automation to reduce the manual effort required in rule building and management while retaining human-in-the-loop supervision. It is gratifying to note that Gartner recognises Datactics for its ability to execute and completeness of vision.”

Datactics’ solutions are designed to empower data leaders to trust their data for critical decision-making and regulatory compliance. For organisations looking to enhance their data quality and leverage the power of augmented data management, Datactics offers a proven platform that stands out for its ease of use, flexibility, and comprehensive support. 

Magic Quadrant reports are a culmination of rigorous, fact-based research in specific markets, providing a wide-angle view of the relative positions of the providers in markets where growth is high and provider differentiation is distinct. Providers are positioned into four quadrants: Leaders, Challengers, Visionaries and Niche Players. The research enables you to get the most from market analysis in alignment with your unique business and technology needs.

Gartner does not endorse any vendor, product or service depicted in its research publications and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.

GARTNER is a registered trademark and service mark of Gartner and Magic Quadrant is a registered trademark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and are used herein with permission. All rights reserved.


Shaping the Future of Insurance: Insights from Tia Cheang
https://www.datactics.com/blog/shaping-the-future-of-insurance-with-tia-cheang/ – Tue, 02 Apr 2024


Tia Cheang, Director of IT Data and Information Services at Gallagher, recently gave an interview to Tech-Exec magazine, drawing on her knowledge and experience of shaping the future of the insurance industry at one of the world’s largest insurance brokers. You can read the article here.

Tia is also one of DataIQ’s Most Influential People In Data for 2024 (congratulations, Tia!). We took the opportunity to ask Tia a few questions of our own, building on some of the themes from the Tech-Exec interview.

In the article with Tech-Exec, you touched on your background, your drive and ambition, and what led you to your current role at Gallagher. What are you most passionate about in this new role?

In 2023, I started working at Gallagher after having an extensive career in data in both public and private sectors. This job was a logical next step for me, as it resonates with my longstanding interest in utilising data in creative ways to bring about beneficial outcomes. I was eager to manage a comprehensive data transformation at Gallagher to prepare for the future, aligning with my interests and expertise.

I am responsible for leading our data strategy and developing a strong data culture. We wish to capitalise on data as a route to innovation and strategic decision-making. Our organisation is therefore creating an environment where data plays a crucial role in our business operations, to allow us to acquire new clients and accomplish significant results rapidly. The role offers an exciting opportunity to combine my skills and lead positive changes in our thinking towards data and its role in the future of insurance.

The transition to making data an integral part of business operations is often challenging. How have you found the experience? 

At Gallagher, our current data infrastructure faces the typical challenges that arise when a firm is expanding. Our data warehouses collect data from many sources, which mirrors the diverse aspects of our brokerage activities. These encompass internal systems, such as customer relationship management (CRM), brokerage systems, and other business applications. We handle multiple data types in our data estate, ranging from structured numerical data to unstructured text. The vast majority of our estate is currently hosted on-premise using Microsoft SQL Server technology; however, we also manage various other departmental data platforms such as QlikView.

“…we want data capabilities that provide flexibility and agility, to enable us to quickly react to new market opportunities.”

A key challenge we face is quickly incorporating new data sources obtained through our mergers and acquisitions activity. These problems affect our data management efforts in terms of migration, seamless data integration, maintaining data quality, and providing data accessibility.

To overcome this, we want data capabilities that provide flexibility and agility, to enable us to quickly react to new market opportunities. Consequently, we are implementing a worldwide data transformation to update our data technology, processes, and skills to provide support for this initiative. This transformation will move Gallagher data to the cloud, using Snowflake to leverage the scalability and elasticity of the platform for advanced analytics. Having this flexibility gives us a major advantage, offering computational resources where and when they are required.

How does this technology strategy align with your data strategy, and how do you plan to ensure data governance and compliance while implementing these solutions, especially in a highly-regulated industry like insurance?

Gallagher’s data strategy aims to position us as the leader in the insurance sector. By integrating our chosen solutions within the Snowflake platform, we strive to establish a higher standard in data-driven decision-making. 

This strategy involves incorporating data management tools such as Collibra, CluedIn, and Datactics into our re-platforming efforts, with a focus on ensuring the compatibility and interoperability of each component. We are aligning each tool’s capabilities with Snowflake’s powerful data lake functionality with the support of our consulting partners to ensure that our set of tools function seamlessly within Snowflake’s environment.

“…we are contemplating upcoming AI and automation regulations and considering how to futureproof our products and approaches…”

We are meticulously navigating the waters of data governance and compliance. We carefully plan each stage to ensure that all components of our data governance comply with the industry regulations and legislation of the specific region. For example, we are contemplating upcoming AI and automation regulations and considering how to futureproof our products and approaches to comply with them.

The success of our programme requires cooperation across our different global regions, stakeholders, and partners. We are rethinking our data governance using a bottom-up approach tailored to the specific features of our global insurance industry. We review our documentation and test the methods we use to ensure they comply with regulations and maintain proper checks and balances. We seek to understand the operational aspects of a process in real-world scenarios and evaluate its feasibility and scalability.

Could you expand on your choice of multiple solutions for data management technology? What made you go this route over a one-stop shop for all technologies?

We have selected “best of breed” solutions for data quality, data lineage, and Master Data Management (MDM), based on a requirement for specialised, high-performance tools. We concentrated on high-quality enterprise solutions for easy integration with our current technologies. Our main priorities were security, scalability, usability, and compatibility with our infrastructure. 

By adopting this approach, we achieve enhanced specialisation and capabilities in each area, providing high-level performance. This strategy offers the necessary flexibility within the organisation to establish a unified data management ecosystem. This aligns with our strategic objectives, ensuring that our data management capability is scalable, secure, and adaptable.

Regarding the technologies we have selected, Collibra increases data transparency through efficient cataloguing and clear lineage; CluedIn ensures consistent and reliable data across systems; and Datactics is critical for maintaining high-quality data. 

“As we venture into advanced analytics, the importance of our data quality increases.”

In Datactics’ case, it provides data cleansing tools that ensure the reliability and accuracy of our data, underpinning effective decision-making and strategic planning. The benefits of this are immense, enhancing operating efficiency, reducing errors, and enabling well-informed decisions. As we venture into advanced analytics, the importance of our data quality increases. Therefore, Datactics was one of the first technologies we started using.

We anticipate gaining substantial competitive advantages from our strategic investment, such as improved decision-making capabilities, operational efficiency, and greater customer insights for personalisation. Our ability to swiftly adapt to market changes is also boosted. Gallagher’s adoption of automation and AI technologies will also strengthen our position, ensuring we remain at the forefront of technological progress.

On Master Data Management (MDM), you referred to the importance of having dedicated technology for this purpose. How do you see MDM making a difference at Gallagher, and what approach are you taking?

Gallagher is deploying Master Data Management to provide a single customer view. We expect substantial improvements in operational efficiency and customer service when it is completed. This will improve processing efficiency by removing duplicate data and offering more comprehensive, actionable customer insights. These improvements will benefit the insurance brokerage business and will enable improved data monetisation and stronger compliance, eventually enhancing client experience and increasing operational efficiency.

Implementing MDM at Gallagher is foundational to our ability to enable global analytics and automation. To facilitate it, we need to create a unified, accurate, and accessible data environment. We plan to integrate MDM seamlessly with our existing data systems, leveraging tools like CluedIn to manage reference data efficiently. This approach ensures that our MDM solution supports our broader data strategy, enhancing our overall data architecture.

“By including data quality activities in our approach, we anticipate significant benefits from the MDM initiative.”

Data quality is crucial in Gallagher’s journey to achieve this, particularly in establishing a unified consumer view via MDM. Accurate and consistent data is essential for consolidating several client data sources into a master profile; we see it as essential, as without good data quality the benefits of our transformation will be reduced. By including data quality activities in our approach, we anticipate significant benefits from the MDM initiative. We foresee a marked improvement in data accuracy and consistency throughout all business units. We want to empower users across the organisation to make more informed, data-driven decisions to facilitate growth. Furthermore, a single source of truth enables us to streamline our operations, leading to greater efficiencies by removing manual processes. Essentially, this strategic MDM implementation transforms data into a valuable asset that drives innovation and growth for Gallagher.

Looking to the future of insurance, what challenges do you foresee in technology, data and the insurance market?

Keeping up with the fast pace of technological change can be challenging. We are conducting horizon scanning on new technologies to detect emerging trends. We wish to adopt new tools and processes that will complement and improve our current systems as they become ready.

“We prioritise the security of our data assets and our clients’ privacy because it is essential for our reputation and confidence in the market.”

Next is ensuring robust data security and compliance, particularly when considering legislation changes about AI and data protection. Our approach is to continuously strengthen our data policies as we grow and proactively manage our data. We prioritise the security of our data assets and our clients’ privacy because it is essential for our reputation and confidence in the market.

Finally, we work closely with our technology partners to leverage their expertise. This collaborative approach ensures that we take advantage of new technologies to their maximum capacity while preserving the integrity and effectiveness of our current systems. 

Are there any other technologies or methodologies you are considering for improving data management in the future beyond what you have mentioned?

Beyond the technologies and strategies already mentioned, at Gallagher, we plan to align our data management practices with the principles outlined in DAMA/DMBOK (Data Management Body of Knowledge). This framework will ensure that our data management capabilities are not just technologically advanced but also adhere to the best practices and standards in the industry.

In addition to this, we are always on the lookout for emerging technologies and methodologies that could further enhance our data management. Whether it’s advancements in AI, machine learning, or new data governance frameworks, we are committed to exploring and adopting methodologies that can add value to our data management practices.

For more from Tia, you can find her on LinkedIn.



The Importance of Data Quality in Machine Learning
https://www.datactics.com/blog/the-importance-of-data-quality-in-machine-learning/ – Mon, 18 Dec 2023


We are currently in an exciting era, in which Machine Learning (ML) is applied across sectors from self-driving cars to personalised medicine. Although ML models have been around for a while – algorithmic trading models since the 1980s, Bayesian methods since the 1700s – we are still in the nascent stages of productionising ML.

From a technical viewpoint, this is ‘Machine Learning Ops’, or MLOps. MLOps involves figuring out how to build and deploy models via continuous integration and deployment, and how to track and monitor models and data in production.

From a human, risk and regulatory viewpoint, we are grappling with big questions about ethical AI (Artificial Intelligence) systems and where and how they should be used. Risk, privacy and security of data, accountability, fairness and adversarial AI all come into play. Additionally, the debate over supervised, semi-supervised and unsupervised machine learning brings further complexity to the mix.

Much of the focus is on the models themselves, such as OpenAI’s GPT-4. Everyone can get their hands on pre-trained models or licensed APIs; what differentiates a good deployment is the quality of the data.

However, the one common theme that underpins all this work is the rigour required in developing production-level systems, and especially the data necessary to ensure they are reliable, accurate and trustworthy. This is especially important for ML systems: the role that data and processes play, and the impact of poor-quality data on ML algorithms and learning models, is felt in the real world.

Data as a common theme 

Shifting our gaze from the model side to the data side means considering questions such as:

  • Data management – what processes do I have to manage data end to end, especially generating accurate training data?
  • Data integrity – how am I ensuring I have high-quality data throughout?
  • Data cleansing and improvement – what am I doing to prevent bad data from reaching data scientists?
  • Dataset labeling – how am I avoiding the risk of unlabeled data?
  • Data preparation – what steps am I taking to ensure my data is data science-ready?

Answering these questions would give a far greater understanding of performance and model impact (consequences). However, this is often viewed as less glamorous or exciting work and, as such, is often undervalued. For example, what is the impetus for companies or individuals to invest at this level (such as regulatory – e.g. BCBS – financial, reputational or legal)?

Yet, as research by Google clearly sets out,

“Data largely determines performance, fairness, robustness, safety, and scalability of AI systems…[yet] In practice, most organizations fail to create or meet any data quality standards, from under-valuing data work vis-a-vis model development.” 

This has a direct impact on people’s lives and society, where “…data quality carries an elevated significance in high-stakes AI due to its heightened downstream impact, impacting predictions like cancer detection, wildlife poaching, and loan allocations”.

What this looks like in practice

We have seen this in the past with the exam grade predictions in the UK during COVID. In this case, teachers predicted the grades of their students, and the Office of Qualifications and Examinations Regulation (Ofqual) then applied an algorithm to these predictions to downgrade any potential grade inflation. The algorithm was quite complex and, in the first instance, non-transparent. When the results were released, 39% of grades had been downgraded. The algorithm took into account the distribution of grades from previous years, the predicted distribution of grades for past students, and the predictions for the current year.

In practice, this meant that if you were a candidate who had performed well at GCSE but attended a historically poor-performing school, it was challenging to achieve a top grade. Teachers had to rank the students in each class, resulting in a relative ranking system that could not equate to absolute performance. It meant that even if you were predicted a B, if you were ranked fifteenth out of 30 in your class and the pupil ranked fifteenth in each of the last three years had received a C, you would likely get a C.
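A toy model of that ranking mechanism shows why the predictions mattered so little: each pupil’s awarded grade is simply read off the school’s historical grade distribution at their rank position. The historical grades and class below are invented purely for illustration.

```python
# A toy model of rank-based standardisation: each pupil's grade is read off the
# school's historical distribution at their rank position, regardless of the
# teacher's prediction. Historical grades and predictions are invented.
def standardised_grades(ranked_predictions, historical_grades):
    """ranked_predictions: teacher-predicted grades ordered best (rank 1) to worst."""
    results = []
    for rank, predicted in enumerate(ranked_predictions):
        # Map this pupil's rank onto the equivalent position in past cohorts.
        position = round(rank * (len(historical_grades) - 1)
                         / max(len(ranked_predictions) - 1, 1))
        results.append((rank + 1, predicted, historical_grades[position]))
    return results


history = ["A", "B", "B", "C", "C", "C", "D", "E"]  # previous cohorts, best to worst
predictions = ["A", "B", "B", "B"]                  # teacher predictions, ranked 1st to 4th

for rank, predicted, awarded in standardised_grades(predictions, history):
    print(f"rank {rank}: predicted {predicted}, awarded {awarded}")
# The pupil ranked last is predicted a B but is awarded whatever grade sits at
# the equivalent rank in previous years - however well they actually performed.
```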

The application of this algorithm caused an uproar, not least because schools with small class sizes – usually private, fee-paying schools – were exempt from the algorithm, meaning their teacher-predicted grades were used instead. Additionally, it baked in past socioeconomic biases, benefitting underperforming students in affluent (and previously high-scoring) areas while suppressing the results of high-performing students in lower-income regions.

A major lesson from this, therefore, was the need for transparency in both the process and the data that was used.

An example from healthcare

Within healthcare, poor data had a similar impact on ML cancer prediction with IBM's 'Watson for Oncology', which partnered with The University of Texas MD Anderson Cancer Center in 2013 to "uncover valuable insights from the cancer center's rich patient and research databases". The system was trained on a small number of hypothetical cancer patients rather than real patient data, which resulted in erroneous and dangerous cancer treatment advice.

Significant questions that must be asked include:

  • Where did it go wrong – certainly in the data, but also in the wider AI system?
  • Where was the risk assessment?
  • What testing was performed?
  • Where did responsibility and accountability reside?

Machine Learning practitioners know well the statistic that 80% of ML work is data preparation. Why, then, don't we focus on this 80% of the effort and deploy a more systematic approach, ensuring data quality is embedded in our systems and treated as important work for an ML team to perform?

This view has recently been articulated by Andrew Ng, who urges the ML community to be more data-centric and less model-centric. Andrew demonstrated this using a steel-sheet defect detection use case in which a deep learning computer vision model achieved a baseline accuracy of 76.2%. By addressing inconsistencies in the training dataset and correcting noisy or conflicting labels, classification performance reached 93.1%. Interestingly, and compellingly from the perspective of this blog post, minimal gains were achieved by addressing the model side alone.

Our view is that if data quality is a key limiting factor in ML performance, then let's focus our efforts on improving data quality – and ask whether ML can be deployed to address this. That is the central theme of the work the ML team at Datactics undertakes. Our focus is automating the manual, repetitive (often called boring!) business processes of DQ and matching tasks, while embedding subject matter expertise into the process. To do this, most of our solutions employ a human-in-the-loop approach: we capture human decisions and expertise and use them to inform and re-train our models. This human expertise is essential for guiding the process and providing context, improving both the data and the data quality process itself. We are keen to free clients from mundane manual tasks and instead apply their expertise to tricky cases with simple agree/disagree options.
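By way of illustration only – this is a generic sketch of the human-in-the-loop pattern, not the Datactics implementation – the code below scores candidate record matches, routes low-confidence pairs to a reviewer for an agree/disagree decision, and folds those decisions back into the training data. The feature values, threshold and review function are all invented:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented similarity features for record pairs (e.g. name and address scores)
# and initial labels (1 = match, 0 = no match), purely for illustration.
X_train = np.array([[0.9, 0.8], [0.2, 0.1], [0.85, 0.9], [0.1, 0.3]])
y_train = np.array([1, 0, 1, 0])

model = LogisticRegression().fit(X_train, y_train)

def review_with_human(pair_features, suggested_label):
    """Stand-in for a simple agree/disagree review screen."""
    answer = input(f"Pair {pair_features}: model suggests {suggested_label}. Agree? [y/n] ")
    return suggested_label if answer.strip().lower().startswith("y") else 1 - suggested_label

CONFIDENCE_THRESHOLD = 0.75  # assumed value; only uncertain cases go to a human

new_pairs = np.array([[0.55, 0.6], [0.95, 0.9]])
match_proba = model.predict_proba(new_pairs)[:, 1]

reviewed_X, reviewed_y = [], []
for features, p in zip(new_pairs, match_proba):
    label = int(p >= 0.5)
    if max(p, 1 - p) < CONFIDENCE_THRESHOLD:
        # Tricky case: capture the expert's agree/disagree decision.
        label = review_with_human(features, label)
    reviewed_X.append(features)
    reviewed_y.append(label)

# Captured decisions feed back into the training data for the next retrain.
model.fit(np.vstack([X_train] + reviewed_X), np.concatenate([y_train, reviewed_y]))
```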

To learn more about an AI-driven approach to Data Quality, read our press release about our Augmented Data Quality platform here. 

The post The Importance of Data Quality in Machine Learning appeared first on Datactics.

]]>
How to test your data against Benford’s Law  https://www.datactics.com/blog/how-to-test-your-data-against-benfords-law/ Tue, 09 May 2023 16:04:04 +0000 https://www.datactics.com/?p=22375 One of the most important aspects of data quality is being able to identify anomalies within your data. There are many ways to approach this, one of which is to test the data against Benford’s Law. This blog will take a look at what Benford’s Law is, how it can be used to detect fraud, […]

The post How to test your data against Benford’s Law  appeared first on Datactics.

]]>
How to test your data against Benford's Law

One of the most important aspects of data quality is being able to identify anomalies within your data. There are many ways to approach this, one of which is to test the data against Benford’s Law. This blog will take a look at what Benford’s Law is, how it can be used to detect fraud, and how the Datactics platform can be used to achieve this.

What is Benford’s Law? 

Benford's Law is named after the physicist Frank Benford, but it was first observed in the 1880s by the astronomer Simon Newcomb. Newcomb was looking through logarithm tables (used before pocket calculators were invented to find the logarithms of numbers) when he spotted that the pages starting with lower digits, like 1, were significantly more worn than the others.

Given a large set of numerical data, Benford’s Law asserts that the first digit of these numbers is more likely to be small. If the data follows Benford’s Law, then approximately 30% of the time the first digit would be a 1, whilst 9 would only be the first digit around 5% of the time. If the distribution of the first digit was uniform, then they would all occur equally often (around 11% of the time). It also proposes a distribution of the second digit, third digit, combinations of digits, and so on.  According to Benford’s Law, the probability that the first digit in a dataset is d is given by P(d) = log10(1 + 1/d).
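As a quick illustration of that formula, the expected first-digit distribution can be reproduced in a few lines of Python (a generic sketch, independent of any particular tool):

```python
import math

# Expected probability of each leading digit d under Benford's Law: P(d) = log10(1 + 1/d)
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for digit, p in benford.items():
    print(f"First digit {digit}: {p:.1%}")
# Prints roughly 30.1% for 1 down to 4.6% for 9, versus ~11.1% if digits were uniform.
```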

Why is it useful? 

Plenty of datasets have been shown to follow Benford's Law, including stock prices, population numbers, and electricity bills. Because so much data is known to follow Benford's Law, checking whether a dataset conforms to it can be a good indicator of whether the data has been manipulated. While this is not definitive proof that the data is erroneous or fraudulent, it can provide a good indication of problematic trends in your data.

In the context of fraud, Benford’s law can be used to detect anomalies and irregularities in financial data. For example, within large datasets such as invoices, sales records, expense reports, and other financial statements. If the data has been fabricated, then the person tampering with it would probably have done so “randomly”. This means the first digits would be uniformly distributed and thus, not follow Benford’s Law.

Below are some real-world examples where Benford’s Law has been applied:

Detecting fraud in financial accounts – Benford's Law can usefully be applied to many different types of fraud, including money laundering and fraud in large financial accounts. Many years after Greece joined the eurozone, the economic data it provided to the E.U. was shown, using this method, to be probably fraudulent.

Detecting election fraud – Benford’s Law was used as evidence of fraud in the 2009 Iranian elections and was also used for auditing data from the 2009 German federal elections. Benford’s Law has also been used in multiple US presidential elections.

Analysis of price digits – When the euro was introduced, all the different exchange rates meant that, while the “real” price of goods stayed the same, the “nominal” price (the monetary value) of goods was distorted. Research carried out across Europe showed that the first digits of nominal prices followed Benford’s Law. However, deviation from this occurred for the second and third digits. Here, trends more commonly associated with psychological pricing could be observed. Larger digits (especially 9) are more commonly found due to the fact that prices such as £1.99 have been shown to be more associated with spending £1 rather than £2. 

How can Datactics’ tools be used to test for Benford’s Law? 

Using the Datactics platform, we can very easily test any dataset against Benford's Law. Take this dataset of financial transactions (shown below). We're going to be testing the "pmt_amt" column to see if it follows Benford's Law for first digits. It spans several orders of magnitude, ranging from a few dollars to 15 million, which means that Benford's Law is more likely to apply accurately to it.

Table of data

The first step of the test is to extract the first digit of the column for analysis. This can very easily be done using a small FlowDesigner project (shown below).

Datactics Flowdesigner product

 

Here we import the dataset and then filter out any values that are less than 1, as these aren’t relevant to our analysis. Then, we extract the first digit. Once that’s been completed, we can profile these digits to find out how many times each occurs and then save the results.

The next step would be to perform a statistical test to see how confident we can be that Benford’s Law applies here. We can use our Data Quality Manager tool to architect the whole process.

Datactics Data Quality Manager product

Step one runs our FlowDesigner project, whilst the second executes a simple Python script to perform the test; the last two steps set up an automated email alert to tell the user if the data failed the test at a specified threshold. While I'm using an email alert here, any issue-tracking platform, such as Jira, can be used. We can also show the results in a dashboard, like the one below.
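For readers who want to reproduce the logic outside the platform, here is a rough, generic sketch of the extraction-plus-test steps in Python. The synthetic amounts stand in for the "pmt_amt" column, and this illustrates the idea rather than the actual FlowDesigner or Data Quality Manager configuration:

```python
import math
import numpy as np
import pandas as pd
from scipy.stats import chisquare

# Synthetic amounts standing in for the "pmt_amt" column; a log-normal sample
# spans several orders of magnitude, as the real column does.
rng = np.random.default_rng(0)
amounts = pd.Series(rng.lognormal(mean=10, sigma=3, size=5000))
amounts = amounts[amounts >= 1]                      # filter out values below 1, as above

# Extract the first digit of each amount (the FlowDesigner step described here).
first_digits = amounts.apply(lambda x: int(str(int(x))[0]))
observed = first_digits.value_counts().reindex(range(1, 10), fill_value=0).sort_index()

# Expected counts under Benford's Law, scaled to the number of observations.
expected = [len(first_digits) * math.log10(1 + 1 / d) for d in range(1, 10)]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.2f}, p-value = {p_value:.3f}")
# A small p-value (e.g. below 0.05) would suggest the first digits deviate from
# Benford's Law and the data warrants closer inspection.
```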

Datactics product shows Benford's Law

The graph on the left shows, in green, the distribution we would expect the digits to follow under Benford's Law; the red line shows the actual distribution of the digits. The bottom-right table shows the two distributions, and the top-right table shows the result of the test. In this case, the test indicates a very strong fit, so we can be highly confident that the data is consistent with Benford's Law.

In conclusion…

The physicist Frank Benford popularised a methodology that is as useful today as ever. Benford's Law is a powerful tool for detecting fraud and other irregularities in large datasets. By combining statistical analysis with expert knowledge and AI-enabled technologies, organizations can improve their ability to detect and prevent fraudulent activities, safeguarding their financial health and reputation.

Matt Neil is a Machine Learning Engineer at Datactics. For more insights from Datactics, find us on LinkedIn, Twitter or Facebook.

The post How to test your data against Benford’s Law  appeared first on Datactics.

]]>
How You’ll Know You Still Have a Data Quality Problem https://www.datactics.com/blog/marketing-insights/how-youll-know-you-still-have-a-data-quality-problem/ Mon, 17 Apr 2023 12:30:00 +0000 https://www.datactics.com/?p=13333 Despite a seemingly healthy green glow in your dashboards and exemplary regulatory reports, you can’t help but sense that something is amiss with the data. If this feeling rings true for you, don’t worry – it may be an indication of bigger issues lurking beneath the surface. You’re not alone. In this blog we’ve taken […]

The post How You’ll Know You Still Have a Data Quality Problem appeared first on Datactics.

]]>

Despite a seemingly healthy green glow in your dashboards and exemplary regulatory reports, you can’t help but sense that something is amiss with the data. If this feeling rings true for you, don’t worry – it may be an indication of bigger issues lurking beneath the surface.

Three Ways You'll Know You Have a Data Quality Problem

 

You’re not alone. In this blog we’ve taken a look at some of the most influential factors that indicate you’ve got a data quality problem. Why not use these handy pointers as a starting point to dig deeper?

1. You’re getting negative feedback from your internal business partners.

Data is the backbone of any business, so it’s no surprise that a lack of satisfaction from internal partners can often be traced back to data issues. From ensuring quality datasets are delivered at scale, through to solutions aimed towards empowering your colleagues with access to necessary information and context – there are many proactive steps you can take when aiming for better performance in this area. Taking action now will ensure everyone has what they need; fuelling success and transforming negative feedback into positive progress.

2. People keep sending you data in Microsoft Excel.

Now, we all love Excel. It's brilliant, and it has made data handling a far more widespread expectation at every level of an organisation. But it offers no way of source- or version-controlling your datasets, and it has inherent limitations in scale and size. In fact, its ubiquity and near-universal adoption mean that all your fabulous data lake investments are being undermined whenever things like remediation files or reports get downloaded into an Excel sheet. If you're seeing Excel used for these kinds of activities, you can bet you've got a data quality problem (or several) that is having a real effect on your business.

3. Your IT team has more tickets than an abandoned car.

If your business teams aren't getting the data they need, they're going to keep logging tickets for it. It's likely these tickets will include:

  • Change requests, to get the specific things they need;
  • Service requests, for a dataset or sets;
  • Issue logs because the data is wrong.

More than an identifier that the data’s not great, this actually shows that the responsibility for accessing and using the data remains in the wrong place. It’s like they’re going to a library with an idea of the plot of the story, and the genre, but they can’t actually search by those terms so they’re stuck in a cycle of guessing, of trial and error.

Conclusion

What these indicators show is that identifying data quality issues isn't just for data teams or data observability tools to own. The ability to recognise that something isn't right sits just as importantly with business lines, users and teams.

What to do next is always the key question. Ultimately, data quality can be improved if the right processes and tools are put in place to collect, cleanse, and enrich data. There are several challenges that need to be overcome when dealing with bad data. These challenges include:

  • Identifying data quality issues,
  • Deploying adequate resources and time to resolve them, and
  • Investing in advanced analytical tools.

To do this effectively, enterprise-wide data governance is essential as it provides an actionable framework for businesses to continuously manage their data quality over time. Although implementing changes across an organisation may seem daunting at first, there are a few simple steps which organisations can take today that will help them quickly improve their grip on data quality.

A very important first step is the establishment of a data quality control framework, and helpfully we’ve written about this in the following blog. Happy reading!

 

 

The post How You’ll Know You Still Have a Data Quality Problem appeared first on Datactics.

]]>
How to build an effective data quality control framework https://www.datactics.com/blog/how-to-build-an-effective-data-quality-control-framework/ Tue, 07 Mar 2023 11:23:08 +0000 https://www.datactics.com/?p=20878 Alexandra Collins, Product Owner at Datactics, recently spoke on an A-Team webinar on the subject of how to build a data quality control framework. In case you missed it, here is everything you need to know about building a data quality control framework using the latest tools and technologies. What are the drivers of change […]

The post How to build an effective data quality control framework appeared first on Datactics.

]]>
How to build an effective data quality control framework

Alexandra Collins, Product Owner at Datactics, recently spoke on an A-Team webinar on the subject of how to build a data quality control framework. In case you missed it, here is everything you need to know about building a data quality control framework using the latest tools and technologies.

What are the drivers of change for organisations needing to create a data quality framework?

The volume of data being captured is a key driver of change. The data that organisations need to process is growing at an exponential rate, and the manual effort involved in creating a data quality framework simply cannot keep up (especially when dealing with things like alternative data and unstructured data).

As a result, organisations are suffering the consequences further downstream, since the quality of the data being analysed is still poor. In order to improve the quality of data this size, organisations need to research and invest in more automated data quality checks, which is where the introduction of AI and machine learning has played a key role.

In my opinion, it is almost infeasible for big data organisations to achieve an acceptable level of data quality by relying solely on manual procedures.

What types of approaches have been taken to improve data quality in the past, and why have they fallen short?

Again, this is down to a combination of factors, including but not limited to, large volumes of data, reliance on manual data quality processes, and the difficulties of code-heavy operations.

The manual effort involved in generating good-quality data solely via SQL-based checks, or using everyday spreadsheet tooling, isn't feasible when data volumes are as large as they are today. The time involved, and the incidence of human error, often result in too large a percentage of data failing to meet an appropriate standard of data quality, which ultimately impacts the effectiveness of downstream business operations.

The same applies where code-intensive solutions, such as Python, R, and Java, are relied upon. Finding the skill sets required to implement and maintain automated processes can be a challenge: many organisations struggle to find – and retain – employees with the technical knowledge and coding capability to build and maintain the necessary automated data quality checks, because people with these skills are always in demand.

Both of these approaches create bottlenecks within the workstream: good-quality data isn't being fed into the business areas that actually analyse and extract value from it. The number of requests for ready-to-use, good-quality data from business functions further down the pipeline is increasing – whether for insights and analytics, regulatory compliance (standardisation), or complex matching activities such as data migrations, mergers, and acquisitions – and good-quality data will always be indispensable.

How can a framework be built, integrated with existing systems, and sustained?

This process can be broken down into two steps.

1) Recognising how beneficial a data quality framework is to your organisation. Once the business area implementing the framework understands its importance and associated benefits, the building, integration, and maintenance of the framework will naturally follow. Accepting that there are problems with the data is the first step. Being willing to investigate where these problems lie then plays a crucial part in constructing the initial DQ checks and the automated DQ solution that will run checks against the data.

2) Analysing, resolving, and reviewing the failing checks. The process needs to be something that business users are happy to adopt, or one which can be easily integrated into their current tools (you can build this yourself or buy it off the shelf). The framework should generate data quality metrics which can then be consumed by other existing systems, e.g. dashboards, ticketing tools, and AI/ML solutions. Having a process like this promotes wider adoption across the organisation and ultimately results in continuous data quality improvement.

In terms of creating a sustainable data quality framework, a tool that meets the needs of the business users is more likely to be sustained. Considering the fact that most business units won’t have their own coding team, it’s worth using a low-code or no-code tool. This gives users the ability to create software and applications without traditional coding, making it a great option for businesses that need an application quickly, or that don’t have the resources to hire experienced coders.

Low-code/no-code software has some major advantages over manual coding, as it is less time-consuming, more cost-effective, and highly customizable.

What technologies, tools, solutions, and services are helpful?

Depending on where the organisation sits within its data quality journey, the answer could be different in every case. A data quality control framework could be a relatively new investment for an organisation that is aware that they have data quality issues but doesn’t know where to begin with applying the DQ checks. In this case, automated profiling solutions can help business users explore the data that needs assessing and highlight outliers within the data. Moreover, an AI/ML solution that can suggest which data quality checks or rules to run against your data can be a helpful tool for organisations beginning their data quality framework journey.

Tools which can be adapted to both coders and non-coders alike are helpful for allowing data quality checks to be defined and built effectively. Similarly, a tool that allows business-specific checks to be integrated easily within the overall workflow will help in rolling out an end-to-end data quality control framework more quickly.

It’s also worth noting that an easy-to-use issue resolution application is also beneficial as part of your data quality framework. This allows non-technical users within the business, who are usually the SMEs and people who work with the data on a day-to-day basis, to locate and fix the break points within their data to avoid bottlenecks within their workstream or further down the business.

When choosing a tool for your data quality framework, it’s worth considering the following questions to determine your organisation’s needs…

  1. Does the tool meet all the data quality requirements of your organisation?
  2. Can it find the poor-quality areas of your data?
  3. Can it execute the required checks against the data?
  4. Is it easy to fix this bad data/resolve the failing checks?
  5. Does the tool dive deeper into the cause of these failing DQ checks?

Could you give us a step-by-step guide to setting up a data quality control framework, from early decision-making to understanding if it is successful?

In the early stages of creating a data quality framework, the first decision you’ll need to make is deciding which tool is fit for purpose. Can it be applied to your specific business needs? Can it be integrated easily within your current system? Does it match the data quality maturity of the business? After considering these questions, the setup process for building a data quality framework is as follows:

  1. Conduct some initial data discovery. Drill down into your data to find the initial failure points, such as invalid entries, incomplete data, etc. The aim is to get an idea of what controls need to be put in place.
  2. Define and build the data quality checks that need to be performed on your data. Ideally, these checks would then be scheduled to run in an automated fashion.
  3. Resolve the failing checks. Once resolved, ensure that the fixed data is what is being pushed further downstream, or back to the source.
  4. Record data quality metrics over time. This allows for an analysis of the breaks to be performed and as a result, business users can pinpoint exactly what is causing the poor-quality data.
  5. Maintenance. Once your data quality framework is complete, maintenance and ongoing improvements are necessary to ensure its success.

In terms of monitoring its success, recording data quality statistics over time allows the business to see if its data quality is improving or regressing at a high level. If it’s improving then great, you know the framework is operating effectively and that the failure points that are highlighted by the DQ process are being resolved within the organisation. Another good measurement of success is to dig deeper into the parts of the business that use this data to determine if they are functioning more fluidly and no longer suffering the consequences of poor data.

Even if the numbers are on a downward trajectory, the organisation is being presented with evidence that there is a problem somewhere within its workstream and can therefore allocate the required resources to investigate the specific failing DQ checks.
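As a minimal, tool-agnostic sketch of that "record metrics over time" idea – the check names, thresholds and dataset below are invented for illustration and do not describe any specific product:

```python
import datetime as dt
import pandas as pd

# Invented checks for illustration; in practice these would be the automated
# DQ rules defined and scheduled in step 2 above.
def run_checks(df):
    return {
        "completeness_email": float(df["email"].notna().mean()),
        "uniqueness_customer_id": df["customer_id"].nunique() / len(df),
        "validity_age": float(df["age"].between(0, 120).mean()),
    }

history = []  # a real framework would persist this to a database or dashboard

def record_run(df):
    results = {"run_date": dt.date.today().isoformat(), **run_checks(df)}
    history.append(results)
    return results

# One run on a tiny invented dataset; scheduling this regularly builds the trend.
df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "email": ["a@example.com", None, "c@example.com"],
    "age": [34, 29, 140],
})
print(record_run(df))

# Comparing successive runs shows whether quality is improving or regressing.
trend = pd.DataFrame(history)
```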

Finally, what three pieces of advice would you give to practitioners working on, or planning to work on, a data quality control framework?

  1. First, you need buy-in. Ensure that people within the business understand the need for the framework and how it benefits them, as well as the organisation more widely. As a result, they'll be more likely to be interested in learning how to use the tool effectively and maintaining its adoption (possibly even finding improvements to be made to the overall workflow). If there is a suitable budget, working with data quality software vendors can help achieve quick wins.
  2. Next, consider which tool to invest in. It’s important that the tool is appropriate for the team that will be using it, so there are a few questions worth thinking about. What technological capabilities does the team have in order to build data quality checks suitable for the business needs? How mature is data quality within the team? It might transpire that you need to invest in a tool that provides initial data discovery, or perhaps one that schedules your pre-built DQ checks and pushes results for resolution. Or, perhaps, the priority will go to using AI/ML to analyse the failing DQ checks.  Dig a bit deeper here into what your requirements are in order to get the most out of your investment in a tool.
  3. Finally, understand the root cause of the poor data quality. Record data quality metrics over time and analyse where and when the failure points occur, perhaps visualising this within your data lineage tooling. This could then be tracked back to an automated process that is causing invalid or incomplete data to be generated.

A data quality control framework is essential for ensuring that the data your organisation relies on is accurate, reliable, and consistent. By following the best practices outlined in this blog post, you can be sure that your data quality control framework will be effective. To help minimise the manual effort involved with building a framework, there are a number of helpful technologies, tools, solutions, and services available to assist with creating and maintaining a data quality control framework.

If you would like more advice on creating or improving a data quality control framework for your organisation, our experienced data consultants are here to help.


Alexandra Collins is a Product Owner at Datactics. For more insights from Datactics, find us on LinkedIn, Twitter or Facebook.

The post How to build an effective data quality control framework appeared first on Datactics.

]]>
Top 5 Trends in Data and Information Quality for 2023 https://www.datactics.com/blog/marketing-insights/top-5-trends-in-data-and-information-quality-for-2023/ Thu, 12 Jan 2023 12:17:27 +0000 https://www.datactics.com/?p=20993 In this blog post, our Head of Marketing, Matt Flenley, takes a closer look at the latest trends in data and information quality for 2023. He analyses predictions made by Gartner and how they’ve developed in line with expectations to provide insight into the evolution of the market and its various key players. Automation and […]

The post Top 5 Trends in Data and Information Quality for 2023 appeared first on Datactics.

]]>
Discover the latest trends in data and information quality for 2023, featuring Data profiling, Data Mesh, Data Fabric, Data Governance, and more.

In this blog post, our Head of Marketing, Matt Flenley, takes a closer look at the latest trends in data and information quality for 2023. He analyses predictions made by Gartner and how they’ve developed in line with expectations to provide insight into the evolution of the market and its various key players. Automation and AI are expected to play a central role in data management, and their impact on the industry will be examined in detail. Additionally, the importance of collaboration and interoperability in a consolidating industry will be highlighted, as well as the potential impact of macroeconomic factors such as labour shortages and recession headwinds on the implementation of these trends. Explore the impact of Data profiling, Data mesh, Data Fabric, and Data Governance on the evolving data management landscape in this analysis.

A recent article by Gartner on predictions and outcomes in technology spend took a fair assessment of market predictions its analysts had made and the extent to which they had developed in line with expectations.  

Rather than simply a headline-grabbing list of the way blockchain or AI will finally rule the world, it’s a refreshing way to explore how a market has evolved against a backdrop of expected and unexpected developments in the overall data management ecosystem and beyond. 

For instance, while it was known pretty widely that the lessening day-to-day impact of the pandemic would see economies start to reopen, it was harder to predict with certainty that Russia would invade Ukraine to ignite a series of international crises, including cost-of-living, provision of food and energy and a new era of geopolitical turmoil long absent from Europe.  

Additionally, the impacts of the UK’s decision to leave its customs union and single market with its biggest trading partner were yet to be fully realised as the year commenced. The UK’s job market has become increasingly challenging for firms attempting to recruit into professional services and technology positions. Reduced spending power in the UK’s economy, combined with rising inflation and a move into economic recession will no doubt have an impact on organisations’ ability and willingness to make capital expenditures.  

In that light, this review and preview will explore a range of topics and themes likely to prove pivotal, as well as the possible impact of macroeconomic nuances on the speed and scale of their implementation.  

1. Automation is the key (but explain your AI!) 

Any time humans have to be involved in extracting, transforming or loading (ETL) of data, it costs a firm time and money, and increases risk. It’s the same throughout an entire data value chain, wherever there are human hands on it, manipulating it for a downstream use. Human intervention adds value in complex tasks where nuance is required. For tasks which are monotonous or require high throughput, errors can creep in. 

A backdrop of labour shortages, and probable recession headwinds, means that automation is going to be first among equals when it comes to 2023’s probable market trends. Firms are going to be doing more with less, and finding every opportunity to exploit greater automation offered by their own development teams and the best of what they can find off the shelf. The advantages of this are two-fold: freeing up experts to work on more value added tasks, and reducing the reliance on technical skills which are in high demand.  

Wherever there’s automation, AI and Machine Learning are not far behind. The deployment of natural language processing has made strides in the past year in automating the extraction of data, tagging and analysis of sentiment, seen in areas such as  speech tagging and entity resolution. The impact of InstructGPT and even more so ChatGPT, late in 2022, demonstrated to a far wider audience both the potency of machine learning and its risks.  


Expect, therefore, a massive increase in the world of Explainable AI – the ability to interpret and understand why an algorithm has reached a decision, and to track models to ensure they don't drift away from their intended purpose. The EU AI Act is currently working its way through the European Parliament and the Council, proposing the first regulation of AI systems, enforced using a risk-based approach. This will be helpful for firms both building and deploying AI models, providing guidance on their use and application.

2. Collaborate, interoperate or risk isolation 

In the last few years, there has been significant consolidation across the technologies that collectively make up a fully automated, cloud-enabled, data management platform. Even within those consolidations, such as Precisely’s multiple acquisitions or Collibra purchasing OwlDQ, the need to expand beyond the specific horizons of these platforms has remained sizeable. Think integration with containerisation solutions like Kubernetes or Docker, or environments such as Databricks or dbt, where data is stored, accessed or processed. Consider how many firms leverage Microsoft products by default, so when they release something as significant as Purview for unified data governance, organisations which already offer some or most aspects of a unified data management platform will need to explore how to work alongside as-standard tooling. 

The global trend towards hybrid working has perhaps opened the eyes of many firms outside of large financial enterprises to cloud computing, remote access and the opportunities presented by a distributed workforce. At the same time, it’s brought their attention to the option to onboard data management tooling from a range of suppliers and based in a wide variety of locations. Such tooling will therefore need to demonstrate interoperability across locales and markets, alongside its immediate home market.  

3. Self-service in a data mesh and data fabric ecosystem 

Like Montagues and Capulets in a digital age, data mesh and data fabric have arisen as two rival methodologies for accessing, sharing and processing data within an enterprise. However, just as in Shakespeare’s Verona, there’s no real reason why they can’t coexist, and better still, nobody has to stage an elaborate, doomed, poison-related escape plan. 

Forrester’s Michele Goetz didn’t hold back in her assessment of the market confusion on this topic in an article well worth reading in full. Both setups are answering the question on everyone’s lips, which is “how can I make more use of all this data?” The operative word here is ‘how’, and whether your choice is fabric, mesh or some fun-loving third option stepping into the fight like a data-driven Mercutio, it’s going to be the decision to make in 2023.  

Handily, recent years have seen a rise in data consultants and their consultancies, augmenting and differentiating themselves from the Big Four-type firms by focusing purely on data strategy and implementation. Data leaders can benefit from working with such firms in scoping Requests for Information (RFIs), understanding optimal architectures for their organisation, and happily acknowledging the role of a sage – or learned Friar – in guiding their paths.

Market trend-wise, those labour shortages referenced earlier have become acutely apparent in the global technology arena. Alongside the drive towards automation and production machine learning is a growing array of no-code, self-service platforms that business users can leverage without needing programming skill. It is wise therefore to expect further increases in this transition throughout 2023, both in marketing messaging and in user interface and user experience design. 

4. Everyone’s talking about data governance 

Speaking of data governance, a recent trend has been to acknowledge that firms are embracing that title in order to do anything with data management. Whether it’s to improve quality, understand lineage, implement master data management or undertake a cloud migration programme, much of this falls to or under the auspices of someone with a data governance plan.  

Data governance has risen as a function in sectors outside financial services as firms are challenged to do more with their data. At the recent Data Governance and Information Quality event in Washington, DC, the vast majority of attendees visiting the Datactics stand held a data governance role or worked in that area.

As a data quality platform provider, it was interesting for us to hear their plans for 2023 and beyond, chiefly around the automation of data quality rules, ease of configuration, and the need to interoperate with a wide variety of systems. Many were reluctant to source every aspect of their data management estate from just one vendor, preferring to explore a combination of technologies under an overarching data governance programme, and many were recruiting the specialist services of the data governance consultants described previously.

5. It’s all about the metadata 

The better your data is classified and quantified with the correct metadata, the more useful it is across an enterprise. This has long been the case, but as in this excellent article on Forbes, if anything its reality is only just becoming known. Transitioning from a passive metadata management approach – storing, classifying, sharing – to an active one, where the metadata evolves as the data changes, is a big priority for data-driven organisations in 2023. This is especially key in trends such as Data Observability, understanding the impact of data in all its uses and not just in where it came from or where it resides.  

Firms will thus seek technologies and architectures that enable them to actively manage their metadata as it applies to various use cases, such as risk management, business reporting, customer behaviour and so on.  

In the past, one issue affecting the volume of metadata firms could store, and consider being part of an active metadata strategy, was the high cost associated with physical servers and warehouses. However, access to cloud computing has meant that the thorny issue of storing data has, to a certain extent, become far less costly – lowering the bar for firms to consider pursuing an active metadata management strategy. 

If the cost of access to cloud services was to increase in the coming years, this could be decisive in how aggressively firms explore what their metadata strategy could deliver for them in terms of real-world business results. 

6. And a bonus: Profiling data has never been more important 

Wait, I thought this was a Top 5? Well, on the basis that everyone loves a bit of a January sale, here’s a bonus sixth!

Data profiling is usually the first step in any data management process: discovering exactly what's in a dataset. The need for profiling has become even more pronounced with the advent of production machine learning and the use of associated models and algorithms. Over the past few years, AI has had a few public run-ins with society, not least the exam results debacle in the UK. For those who missed it, the UK decided to use algorithms built on past examination data to provide candidates with a fair predicted grade; in reality, almost 40% of students received grades lower than anticipated. The data used as inputs to the algorithm was as follows:

  • Historical grade distribution of schools from the previous three years 
  • The comparative rank of each student in their school for a specific subject (based on teacher evaluation) 
  • The previous exam results for a student for a particular subject 

Thus a student deemed to be halfway down the list in their school would receive a grade equivalent to what the previous halfway-ranked pupils had achieved in earlier years.

So why was this a profiling issue? Well, for one example, the model didn’t account for outliers in any given year, making it nigh-on impossible for a student to receive an A in a subject if nobody had achieved one in the previous three years. Profiling of the data in previous years could have identified these gaps and asked questions of the suitability of the algorithm for its intended use. 

Additionally, when the model started to spark outcry in the public domain, profiling the datasets involved would have revealed biases related to smaller school sizes. So while not exclusively a profiling problem, it was something that data profiling, and model drift profiling (discovering how far the model has deviated from its intent), would have helped to prevent.
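Using entirely hypothetical figures, the sketch below illustrates the kind of profiling that could have surfaced those gaps and class-size effects; the schools, grades and sizes are invented:

```python
import pandas as pd

# Entirely hypothetical historical results for one subject at two schools.
history = pd.DataFrame({
    "school":     ["A", "A", "A", "A", "B", "B", "B", "B"],
    "year":       [2017, 2018, 2019, 2019, 2017, 2018, 2019, 2019],
    "grade":      ["B", "C", "B", "C", "A*", "A", "A", "B"],
    "class_size": [28, 30, 29, 29, 8, 9, 8, 8],
})

# Has a top grade ever been awarded at each school in the profiled years?
# Schools where the answer is False are ones where the algorithm made a top
# grade effectively unreachable for current students.
top_grades = {"A*", "A"}
top_grade_ever = history.groupby("school")["grade"].apply(lambda g: bool(top_grades & set(g)))
print(top_grade_ever)

# How do class sizes compare? Small cohorts were treated differently by the
# algorithm, which profiling of this field would have highlighted.
print(history.groupby("school")["class_size"].describe())
```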

This is especially pertinent in the context of evolving data over time. Data doesn’t stand still, it’s full of values and terms which adapt and change. Names and addresses change, companies recruit different people, products diversify and adapt. Expect dynamic profiling of both data and data-associated elements, including algorithms, to be increasingly important throughout 2023 and beyond. 

And for more from Datactics, find us on LinkedIn, Twitter or Facebook.

The post Top 5 Trends in Data and Information Quality for 2023 appeared first on Datactics.

]]>
Outlier Detection – What Is It And How Can It Help In The Improvements Of Data Quality?  https://www.datactics.com/blog/ai-ml/outlier-detection-what-is-it-and-how-can-it-help-in-the-improvements-of-data-quality/ Fri, 27 May 2022 11:05:50 +0000 https://www.datactics.com/?p=18748 Identifying outliers and errors in data is an important but time-consuming task. Depending on the context and domain, errors can be impactful in a variety of ways, some very severe. One of the issues with detecting outliers and errors is that they come in many different forms. There are syntactic errors, where a value like […]

The post Outlier Detection – What Is It And How Can It Help In The Improvements Of Data Quality?  appeared first on Datactics.

]]>
Outlier Detection

Identifying outliers and errors in data is an important but time-consuming task. Depending on the context and domain, errors can be impactful in a variety of ways, some very severe. One of the issues with detecting outliers and errors is that they come in many different forms. There are syntactic errors, where a value like a date or time is in the wrong format, and semantic errors, where a value is in the correct format but doesn't make sense in the context of the data, such as an age of 500. The biggest problem in creating a method for detecting outliers in a dataset is identifying a vast range of different errors with a single tool.

At Datactics, we’ve been working on a tool to solve some of these problems and enable errors and outliers to be quickly identified with minimal user input. With this project, our goal is to assign a number to each value in a dataset which represents the likelihood that the value is an outlier. To do this we use a number of different features of the data, which range from quite simple methods like looking at the frequency of a value or its length compared to others in its column, to more complex methods using n-grams and co-occurrence statistics. Once we have used these features to get a numerical representation of each value, we can then use some simple statistical tests to find the outliers. 

When profiling a dataset, there are a few simple things you can do to find errors and outliers in the data. A good place to start could be to look at the least frequent values in a column or the shortest and longest values. These will highlight some of the most obvious errors but what then? If you are profiling numeric or time data, you could rank the data and look at both ends of the spectrum to see if there are any other obvious outliers. But what about text data or unique values that can’t be profiled using frequency analysis? If you want to identify semantic errors, this profiling would need to be done by a domain expert. Another factor to consider is the fact that this must all be done manually. It is evident that there are a number of aspects of the outlier detection process that limit both its convenience and practicality. These are some of the things we have tried to address with this project. 
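A minimal sketch of those first manual-style checks (least frequent values, shortest and longest values, and the extremes of a numeric column), assuming pandas and invented example values:

```python
import pandas as pd

# Invented example values; real use would point at columns in your own dataset.
cities = pd.Series(["Belfast", "Belfast", "London", "Lndon", "Dublin", "Dublin", "Dublin"])

# Least frequent values are often a good first place to look for errors.
print(cities.value_counts().tail(3))

# Shortest and longest values can expose truncations or concatenation issues.
lengths = cities.str.len()
print(cities[lengths == lengths.min()].unique())
print(cities[lengths == lengths.max()].unique())

# For numeric or time data, ranking and inspecting both ends of the spectrum
# highlights the most obvious outliers (such as an age of 500).
ages = pd.Series([34, 29, 41, 38, 500, 27])
print(ages.sort_values().head(3))
print(ages.sort_values().tail(3))
```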


When designing this tool, our objective was to create a simple, effective, universal approach to outlier detection. There are a large number of statistical methods for outlier detection that, in some cases, have existed for hundreds of years. These are all based on identifying numerical outliers, which would be useful in some of the cases listed above but has obvious limitations. Our solution to this is to create a numerical representation of every value in the data set that can be used with a straightforward statistical method. We do this using features of the data. The features currently implemented and available for use are: 

  • Character N-Grams 
  • Co-Occurrence Statistics 
  • Date Value 
  • Length 
  • Numeric Value 
  • Symbolic N-Grams 
  • Text Similarities 
  • Time Value 

We are also working on a feature that will enable us to identify outliers in time series data. Some of these features, such as date and numeric value, are only applicable to certain types of data. Some incorporate the very simple steps discussed above, like occurrence and length analysis. Others are more complicated and could not be done manually, like co-occurrence statistics. Then there are some, like the natural language processing text similarities, which make use of machine learning algorithms. While there will be some overlap in the outliers identified by these features, for the most part they will each single out different errors and outliers, acting as an antidote to the heterogeneous nature of errors discussed above.

One of the benefits of this method of outlier detection is its simplicity, which leads to very explainable results. Once the features of our dataset have been generated, we have a number of options for next steps. In theory, all of these features could be fed into a machine learning model which could then be used to label data as outlier or non-outlier. However, there are several disadvantages to this approach. Firstly, it would require a labelled dataset to train the model, which would be time-consuming to create. Moreover, the features differ from dataset to dataset, so it would not be a case of "one model fits all". Finally, if you use a "black box" machine learning method, then when a value is labelled as an outlier you have no way of explaining that decision, or of providing evidence as to why this value was labelled rather than others in the dataset.

All three of these problems are avoided with the Datactics approach. The outliers are generated using only the features of the original dataset and, because of the statistical methods used, can be identified with nothing but the data itself and a confidence level (a numerical value representing the likelihood that a value is an outlier). There is no need for any labelling or parameter tuning with this approach. The other big advantage is that, because we assign a number to every value, we have evidence to back up every outlier identified and can demonstrate how it differs from the non-outliers in the data.
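As a generic illustration of the idea – not the Datactics implementation – the sketch below turns each value into a couple of simple numeric features and flags values whose score exceeds an assumed confidence threshold:

```python
import pandas as pd

# Invented example column containing one syntactically odd value.
values = pd.Series([
    "2021-01-05", "2021-02-17", "2021-03-09", "2021-04-11", "2021-05-30",
    "2021-06-02", "2021-07-19", "2021-08-23", "2021-09-14", "21/10/21",
])

# Two very simple features of each value: its length and its relative frequency.
features = pd.DataFrame({
    "length": values.str.len(),
    "frequency": values.map(values.value_counts(normalize=True)),
})

def zscore(col):
    """Absolute z-score per feature; zero when a feature has no variation."""
    std = col.std()
    return (col - col.mean()).abs() / std if std else pd.Series(0.0, index=col.index)

# Each value's outlier score is its largest deviation across the features.
scores = features.apply(zscore).max(axis=1)

CONFIDENCE = 2.5  # assumed threshold: higher means fewer, more certain outliers
print(values[scores > CONFIDENCE])   # flags "21/10/21", the wrongly formatted date
```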

Another benefit of this approach is that it is modular and therefore completely expandable. The features the outliers are based on can be selected according to the data being profiled, which increases accuracy. This architecture also gives us the ability to seamlessly expand the number of features available; if trends or common errors are encountered that aren't identified by the current features, it is very straightforward to create another feature to rectify this.

And for more from Datactics, find us on LinkedIn, Twitter, or Facebook.

The post Outlier Detection – What Is It And How Can It Help In The Improvements Of Data Quality?  appeared first on Datactics.

]]>
Why should you care about data quality? https://www.datactics.com/blog/why-should-you-care-about-data-quality/ Wed, 27 Apr 2022 10:17:07 +0000 https://www.datactics.com/?p=18678 Data quality may not be viewed as a particularly attractive topic of conversation at a dinner party or a company event, yet it is a critical step for organisations to maximise the value of their information and subsequent decision-making. It is therefore imperative that data leaders make every attempt to raise awareness and educate on […]

The post Why should you care about data quality? appeared first on Datactics.

]]>
Why you should care about data quality

Data quality may not be viewed as a particularly attractive topic of conversation at a dinner party or a company event, yet it is a critical step for organisations to maximise the value of their information and subsequent decision-making. It is therefore imperative that data leaders make every attempt to raise awareness and educate on why data quality is so important to an enterprise’s success – and ultimately make people care about data quality.

To begin, what exactly is meant by data quality?

High-quality data is defined as data that is fit for purpose, reliable and trustworthy. While organisations may maintain different quality standards, the Data Management Association UK (DAMA) proposes that, in order to be considered high quality, data must satisfy the following six dimensions: accuracy, completeness, uniqueness, consistency, timeliness and validity. These are often viewed as the foundations of data quality; however, it can be argued that technical measurements and standards are determined by the use case.

Data problems can destroy business value. Recent research from Gartner shows that organisations estimate the average cost of poor data quality at over $10 million per year, a figure which will likely increase as the modern business environment becomes increasingly digitalized and unstructured data becomes harder to decipher. It’s also estimated that in 2021, eight out of ten companies admitted to struggling with data quality issues.

Senior leadership often neglects data quality as an organizational priority unless given an immediate reason to address the issue. Action is only taken when bad data quality is demonstrably proven to have a negative impact on the business – bad data that causes complications for the initiatives and processes that senior management care about and that are essential to the business.

Data quality lies at the foundation of all data management, data governance and data lineage processes; it is therefore imperative to get it right and ensure your business embraces a culture that cares about data quality.

In order to get your organization talking about the importance of good data quality, here are some recommendations:

First step: directly expose the pain caused to business operations by bad data quality;

  • This can be communicated to senior leaders and stakeholders through presenting a problem statement or a business case which they can own and align with the current strategic objectives of the firm. Connecting the business case with the organizational trajectory will help senior leaders understand the operational and financial benefits of addressing data quality issues. Also, this can help you answer the inevitable “why do we need to do this?” question.
  • Business leaders can also be engaged with end users to discuss their experiences, share anecdotes and help senior management emotionally connect with those at the firm who are regularly impacted by the consequences of poor quality data.
  • It may also be of merit to present historical events where firms have suffered from bad data quality, resulting in organizational disasters or heavy fines from financial regulators.

Completing these actions will help focus your business case on improving the health and quality of the data that matters to the stakeholders who care about the problem and are prepared to help instigate an institutional change in organizational attitudes to data quality.

Second step: Shift the company culture to one that cares about data quality management. Present key metrics which illustrate the tangible impacts of poor data quality to the organization and outline the resources that are required to make a change;

  • The key processes and process owners (needed to deliver on outcomes) must be identified and will likely span across multiple business areas. This will alleviate any concerns on siloed thinking between functional areas and elevate the role that data quality plays across the firm.
  • Work with key process owners to determine the key indicators which will be most critical to the newly identified business processes. This will help you define your critical data elements and the data quality associated with them e.g. quality of Customer Contact master data.
  • Data profiling and analysis of critical data elements can help to demonstrate their impact on business performance. This can be carried out via in-house or external data management tools, or using programming scripts such as Python or SQL. Results must be communicated and explained to business stakeholders to infer how data quality limitations on critical data elements can hinder organizational performance and how improvements can contribute to superior results.
  • Identify alternative areas of the business where the importance of good data quality is imperative as the business scales e.g. data science, advanced analytics and machine learning. Good data quality is fundamental to generate business value from these areas and therefore must be addressed.

By completing these two steps, your key business stakeholders will have developed an empathetic and rational understanding of the day to day operational benefits of data quality improvement. By creating an organizational culture that cares about data quality, business users will hopefully see data quality issues propelled to the forefront of the firm’s IT and data management strategy. Ideally, this may lead to additional funding and available resources to tackle data quality issues.

To have further conversations about the drivers and benefits of a Self-Service Data Quality platform, reach out to Brendan McCarthy.

And for more from Datactics, find us on LinkedIn, Twitter, or Facebook.

The post Why should you care about data quality? appeared first on Datactics.

]]>
AI Ethics: The Next Generation of Data Scientists https://www.datactics.com/blog/ai-ethics-the-next-generation-of-data-scientists/ Mon, 04 Apr 2022 12:54:50 +0000 https://www.datactics.com/?p=18414 In March 2022 Datactics took advantage of the offer to visit a local high school to discuss AI Ethics and Machine Learning in production.

The post AI Ethics: The Next Generation of Data Scientists appeared first on Datactics.

]]>
In March 2022, Datactics took advantage of the offer to visit a local secondary school and the next generation of Data Scientists to discuss AI Ethics and Machine Learning in production. Matt Flenley shares more from the first of these two visits in his latest blog below…

Students from Wallace High School meet Dr Fiona Browne (centre) and Matt Flenley (right)

AI Ethics is often the poster child of modern discourse about when the inevitable machine-led apocalypse will occur. Yet, as we look around at wars in Ukraine and Yemen, record water shortages in the developing world, and the ongoing struggle for the education of girls in Afghanistan, it becomes readily apparent that, as in all things, ethics starts with humans.

This was the main thrust of the discussion with the students at Wallace High School in Lisburn, NI. As Dr Fiona Browne, Head of AI and Software Development, talked the class of second-year A-Level students through data classification for training machine learning models, the question of ‘bad actors’ came up. What if, theorised Dr Browne, people can’t be trusted to label a dataset correctly, and the machine learning model learns things that aren’t true?

At this stage, a tentative hand slowly rose in the classroom; one student confessed that they had, in fact, done exactly this in a recent dataset labelling exercise in class. It was the perfect opportunity to show, in a practical way, how humans are involved in Artificial Intelligence and Machine Learning – and especially in the quality of the data underpinning both.

Humans behind the machines, and baked-in bias

As is common, the exciting part of technology is often the technology itself. What can it do? How fast can it go? Where can it take me? This applies just as much to the everyday, from home electronics through to transportation, as it does to the cutting edge of space exploration or genome mapping. However, the thought processes behind the technology, imagined up by humans, specified and scoped by humans, create the very circumstances for how those technologies will behave and interact with the world around us.

In her promotion for the book Invisible Women, the author Caroline Criado-Perez writes,

“Imagine a world where your phone is too big for your hand, where your doctor prescribes a drug that is wrong for your body, where in a car accident you are 47% more likely to be seriously injured, where every week the countless hours of work you do are not recognised or valued.  If any of this sounds familiar, chances are that you’re a woman.”

Caroline Criado-Perez, Invisible Women

One example is the comparatively high rate of anterior cruciate ligament injuries among female soccer players. While some of this can be attributed to different anatomies, it is in part caused by the lack of female-specific footwear in the sport (with most brands choosing to offer smaller sizes rather than tailored designs). Yet the anatomy of the female knee in particular is substantially different to that of the male. Has this human-led decision, to simply offer smaller sizes, taken into account the needs of the buyer, or the market? Has it been made from the point of view of creating a fairer society?

The Datactics team (L to R: Matt Flenley, Shauna Leonard, Edele Copeland) meet GCSE students from Wallace High School as part of a talk on Women in Technology Careers

If an algorithm were therefore applied to specify a female-specific football boot from the patterns and measurements of existing footwear on the market today, would it result in a different outcome? No, of course not. It takes humans to look at the world around us, detect the risk of bias, and then do something about it.

It is the same in computing. The product, in this case the machine learning model or AI algorithm, is going to be no better than the work that has gone into defining and explaining it. A core part of this is understanding what data to use, and of what quality the data should be.

Data Quality for Machine Learning – just a matter of good data?

Data quality in a business application sense is relatively simple to define. Typically, a business unit has requirements, usually around how complete the data is and to what extent it is unique (there is a wide range of additional data quality dimensions, which you can read about here). For AI and Machine Learning, however, data quality is a completely different animal. On top of the usual dimensions, the data scientist or ML engineer needs to consider whether they have all the data they need to create unbiased, explainable outcomes. Put simply, if a decision has been made, the data scientists need to be able to explain why and how that outcome was reached. This is particularly important as ML becomes part and parcel of everyday life. Turned down for credit? Chances are an algorithm has assessed a range of data sources and generated a ‘no’ decision – and if you’re the firm whose system has made that decision, you’re going to need to explain why (it’s the law!).
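To make those usual dimensions concrete, here is a minimal sketch in Python (not Datactics tooling) of how completeness and uniqueness might be profiled before a dataset is handed to a model. The toy customer records and column names are illustrative assumptions only.

```python
# Minimal sketch: profiling two common data quality dimensions,
# completeness and uniqueness, for each column of a toy dataset.
# The records and column names are illustrative only.

def profile_column(values):
    """Return simple completeness and uniqueness scores for one column."""
    total = len(values)
    non_missing = [v for v in values if v not in (None, "", "N/A")]
    completeness = len(non_missing) / total if total else 0.0
    uniqueness = len(set(non_missing)) / len(non_missing) if non_missing else 0.0
    return {"completeness": round(completeness, 2), "uniqueness": round(uniqueness, 2)}

# A toy "customer" table with one missing postcode and a duplicated ID.
records = [
    {"customer_id": "C001", "postcode": "BT1 1AA"},
    {"customer_id": "C002", "postcode": None},
    {"customer_id": "C002", "postcode": "BT2 2BB"},
]

for column in ("customer_id", "postcode"):
    print(column, profile_column([r[column] for r in records]))
```

A profile like this only covers the basic dimensions; the bias and explainability questions discussed above still need humans to ask what data is missing altogether.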


This is the point at which we return to the class in Wallace High School. The student who tentatively raised their arm would have got away with it, leaving the model to learn the wrong patterns, if they had stayed silent. There was no monitoring in place to detect which user had been the ‘bad actor’, so the flaw would have gone undetected without the student’s confession. It was, however, the perfect way to explain to this next generation of data scientists why algorithms need to be freed from bias. In the five years between now and when these students enter industry, they will need to be fully aware that every part of the society people wish to inhabit must be in the room when data is classified and models are created.
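The missing monitoring is straightforward to sketch. Below is a hedged example, assuming each item is labelled by several annotators: it compares each annotator’s labels against the per-item majority vote and flags anyone whose agreement falls below a threshold. The annotator names, labels and 0.7 cut-off are illustrative assumptions rather than a prescribed approach.

```python
# Sketch: flagging a possible 'bad actor' annotator by comparing each
# annotator's labels with the per-item majority vote.
# The data and the 0.7 threshold are illustrative assumptions.
from collections import Counter, defaultdict

# (item_id, annotator, label) triples from a hypothetical labelling exercise.
labels = [
    ("img1", "alice", "cat"), ("img1", "bob", "cat"), ("img1", "eve", "dog"),
    ("img2", "alice", "dog"), ("img2", "bob", "dog"), ("img2", "eve", "cat"),
    ("img3", "alice", "cat"), ("img3", "bob", "cat"), ("img3", "eve", "cat"),
]

# Majority label per item.
votes_per_item = defaultdict(list)
for item, annotator, label in labels:
    votes_per_item[item].append(label)
majority = {item: Counter(v).most_common(1)[0][0] for item, v in votes_per_item.items()}

# Agreement rate per annotator, flagging anyone below the threshold.
tallies = defaultdict(lambda: [0, 0])  # annotator -> [matches, total]
for item, annotator, label in labels:
    tallies[annotator][1] += 1
    if label == majority[item]:
        tallies[annotator][0] += 1

for annotator, (matches, total) in tallies.items():
    rate = matches / total
    print(f"{annotator}: {rate:.0%} agreement with majority"
          + (" - review their labels" if rate < 0.7 else ""))
```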

For an industry whose workforce is still so overwhelmingly male, it is clear that the decision to do something about what comes next lies where it always has: in the hearts, minds and hands of technology’s builders.

Gartner Blog 3 – Key Insights and Takeaways https://www.datactics.com/blog/marketing-insights/gartner-blog-3-key-insights-and-takeaways/ Tue, 25 Jan 2022 14:57:48 +0000 https://www.datactics.com/?p=17824 The previous two editions of this blog series provided an overview of the Gartner Magic Quadrant from the perspective of someone relatively new to the world of data management, defining what it really means for a scaling business like Datactics to be recognised by Gartner.


The previous two editions of this blog series provided an overview of the Gartner Magic Quadrant from the perspective of someone relatively new to the world of data management, defining what it really means for a scaling business like Datactics to be recognised by Gartner and drawing attention to the core strengths of the Datactics product, as highlighted by Gartner analysts. Building on that, this blog focuses on the key insights we derived from the research, along with market trends on both the buyer and seller side.

The data quality solutions market continues to mature at a rapid pace, with vendors from across the space innovating their offerings by making more impactful use of metadata and AI to solve customer problems and cater for increasingly complex use cases. 

Data Quality had traditionally been mandated to adhere to regulatory compliance and governance requirements and to reduce operational risks and costs. However, as referenced in Gartner’s research, senior executives in businesses across the globe are recognising the necessity of Data Quality for amplifying analytics, generating more accurate insights and enabling data-driven decision-making.

According to Gartner surveys, it is estimated that by 2023 over 60% of organisations will leverage machine-learning-enabled data quality technology to increase the automation of tasks and provide accurate recommendations, significantly reducing the bottlenecks and manual effort often associated with Data Quality improvement. Additionally, by 2024, over 50% of businesses will implement modern Data Quality solutions to better support enterprise-wide digital transformation and business initiatives. Firms are engaging with vendors from the Magic Quadrant to gain competitive advantage and ultimately achieve these goals.

One of the key trends taken from this year’s Magic Quadrant is the shift towards Self-Service Data Quality. The era of requiring heavy programming and IT resources to perform Data Quality tasks is changing, and no-code platforms will continue to rise due to their accessibility and usability. Datactics were the only vendor in this year’s Magic Quadrant accredited with this feature as a key strength, as they continue to champion the movement towards no-code functionality in Data Quality. Additionally, firms are seeking to centralise Data Quality controls and improve interoperability by simplifying integration with adjacent software tools such as MDM, metadata and data governance.

This research indicates that, to fulfil the market demand for simplified data quality management despite the increasingly complicated data landscape, vendors must offer a product that goes beyond simply fixing data errors to help clients actively manage their data right across the enterprise.

If you would like to open a conversation about any of the topics discussed in the previous three blog articles, feel free to reach out to me on LinkedIn or send me an email at brendan.mccarthy@datactics.com. 

Data Quality fundamentals driving valuable Data Insights in Insurance https://www.datactics.com/blog/self-service-data-quality/data-quality-fundamentals-driving-valuable-data-insights-in-insurance/ Wed, 19 Jan 2022 14:09:55 +0000 https://www.datactics.com/?p=17754 Data in a Changing World The Insurance industry traditionally uses data to inform decision-making and manage growth and profitability across marketing, underwriting, pricing and policy servicing processes. However, like most established financial institutions, insurance companies have many data repositories and different teams managing analytics functions. Traditionally, they also struggle to share this information or communicate […]


Data in a Changing World

The Insurance industry traditionally uses data to inform decision-making and manage growth and profitability across marketing, underwriting, pricing and policy servicing processes. However, like most established financial institutions, insurance companies have many data repositories and different teams managing analytics functions. These teams often struggle to share information or communicate with one another, and many organisations have their own processes for capturing data. These factors combine to cause poor-quality, inconsistent data, creating barriers to seamless integration.

The Insurance industry recognises the importance of maintaining a competitive edge, with many companies looking to adopt a ‘single platform’ approach using Cloud Services from AWS, Azure or Google in the short to medium term. Such a platform needs to be flexible enough to support different skill sets, react to changing market conditions and integrate alternative sources of data. Fundamental to this is the quality of data across the different sources, ensuring it is trusted, of a high degree of integrity, and complete for business decision-making purposes.

Challenges

Customer insights are isolated in silos, scattered across lines of business, functional areas and even channels. As a result, much of the work that surrounds the handling of data becomes manual and time-consuming, with no common keys or even agreed definitions of key terms such as ‘customer’. It is estimated that as much as 70% of a highly qualified analyst’s time is spent locating and fixing the data.

The challenge for Insurance companies is being able to recognize the same customer across product lines and at different stages of the policy lifecycle. Direct and agency channels may compete for the same customer, or attract a high-risk prospect that was previously turned down by underwriting. Because claims department data is not available to pricing and marketing to inform their decisions, the result is often extra expenditure and a larger-than-necessary marketing budget that could easily be streamlined if these inefficiencies were addressed. It also causes poor customer experiences, which harm the brand.

There is, however, a significant demand for customer-centric solutions which allow insurance companies to link different pieces of data about a customer. These solutions use Data Quality tools to match, merge and link records, creating a holistic view across product lines and throughout the policy lifecycle.
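As a rough illustration of the match-and-link idea (and emphatically not the Datactics matching engine), the sketch below compares normalised name and postcode strings from two hypothetical policy systems and flags likely matches. The records, normalisation and 0.85 threshold are illustrative assumptions.

```python
# Sketch: fuzzy matching of customer records held in two separate
# policy systems. The fields, normalisation and 0.85 threshold are
# illustrative assumptions, not a production matching engine.
from difflib import SequenceMatcher

def normalise(record):
    """Crude normalisation: lower-cased name plus postcode without spaces."""
    name = record["name"].lower().strip()
    postcode = record["postcode"].replace(" ", "").lower()
    return f"{name}|{postcode}"

def similarity(a, b):
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()

motor_policies = [{"name": "Jane O'Neill", "postcode": "BT9 5AB"}]
home_policies = [
    {"name": "Jane ONeill", "postcode": "BT95AB"},
    {"name": "John Smith", "postcode": "BT1 1AA"},
]

THRESHOLD = 0.85
for m in motor_policies:
    for h in home_policies:
        score = similarity(m, h)
        if score >= THRESHOLD:
            print(f"Likely same customer (score {score:.2f}): "
                  f"{m['name']} <-> {h['name']}")
```

Production matching would add blocking, survivorship and review workflows on top of a scoring step like this, but the principle of linking records across systems is the same.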

Customer-centric solutions help insurance companies realise important business goals, including more accurate targeting, longer retention, and better profitability.

Opportunity

Generating valuable insights from expanding data sets is becoming significantly harder. On top of this, bringing together the right technology, people and processes to analyse data remains a key challenge for executives. Prepping the data is often where the real heavy lifting is done, and Data Quality automation combined with a Self-Service approach can significantly reduce costs and accelerate decision-making, as the sketch below suggests.
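To give a flavour of what that automation can look like, here is a minimal, hedged sketch of a rule-based cleansing pass. The rules, field names and simplified UK postcode pattern are illustrative assumptions, far simpler than what a full data quality platform applies.

```python
# Sketch: a tiny rule-based data prep pass that validates and
# standardises records before analysis. The rules, field names and
# the simplified UK postcode pattern are illustrative assumptions.
import re

POSTCODE_PATTERN = re.compile(r"^[A-Z]{1,2}\d{1,2}[A-Z]?\s?\d[A-Z]{2}$")

def clean_record(record):
    """Return a standardised copy of the record plus any rule failures."""
    cleaned = dict(record)
    failures = []

    # Rule 1: trim and title-case the customer name.
    cleaned["name"] = record["name"].strip().title()

    # Rule 2: upper-case the postcode and check it against the pattern.
    postcode = record["postcode"].strip().upper()
    cleaned["postcode"] = postcode
    if not POSTCODE_PATTERN.match(postcode):
        failures.append("invalid postcode")

    return cleaned, failures

records = [
    {"name": "  jane o'neill ", "postcode": "bt9 5ab"},
    {"name": "John Smith", "postcode": "NOT A CODE"},
]

for record in records:
    cleaned, failures = clean_record(record)
    status = "ok" if not failures else ", ".join(failures)
    print(cleaned, "->", status)
```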

While the Insurance industry faces a plethora of challenges with data and analytics, it’s imperative that executives recognize that the quality of the data is fundamental to capitalising on market opportunities. By overcoming these barriers, the industry will be better prepared to embark on the next frontier of Data and Analytics (D&A).

About Datactics

Datactics helps Insurance companies drive valuable Data Insights and supports operational data needs and processes, including Data Governance and Compliance & Regulation, by removing the roadblocks common in data management. We specialise in class-leading, self-service data quality and fuzzy matching software solutions, designed to empower the business users who know the data to visualise and fix the data.

To have further conversations about the drivers and benefits of a Self-Service Data Quality platform in Insurance, book a quick call with Kieran Seaward.    

And for more from Datactics, find us on LinkedIn, Twitter, or Facebook.
