The Eager Analyst

Where’s your data from? - How tracking the lineage of data can add value to your analytics

Originally published on napes.co.uk

Provenance is an increasingly popular topic in the food industry as quality scandals and ethical sourcing of produce push their way to the front pages of the press. But have you considered how tracking the provenance of your data can add value to your data analytics?

Is the meat you're eating from the animal you expected? Are your potatoes more travelled than touring rock stars? We don't just demand to know more about the source of our food either, as is exemplified by the volume of nutritional information provided with each ready meal packet. We want to know its makeup, the processes the food has undergone, and how that might impact our health.

Without knowing the provenance of a product it would not be possible to manage the food supply chain so as to satisfy all of these informational labels - and given some of the latest scandals, some might say it still isn't managed well - but enough is recorded against each product to support the quality control process and provide customer information.

With a productionized process, data provenance can be carefully considered, designed and implemented into a robust solution too. In the realm of data this can be seen in well-defined feed specifications, warehouses with data dictionaries, and reporting schemas and dimensions that track versions, so that a report can be run both as of now and as of a past point in time.

The landscape of data warehousing and data lineage is well studied and defined, but it can take years to get right. The W3C recently ratified a standard for provenance representation named PROV, which builds on semantic RDF models to annotate entities, their relationships to other entity instances and agents (http://www.w3.org/TR/prov-overview/). This encapsulates one of the latest forays into categorizing information across the boundaries of systems and organizations.
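To make the idea concrete, here is a minimal sketch of what provenance statements can look like. It is not PROV's RDF serialization, just plain Python tuples that borrow relation names from the PROV data model (wasDerivedFrom, wasGeneratedBy, used, wasAssociatedWith, wasAttributedTo). The identifiers such as report_v2 and alice, and the lineage helper, are hypothetical and only there for illustration.

```python
from collections import defaultdict

# Relation names are borrowed from the W3C PROV data model.
# Every identifier (report_v2, cleaning_run, alice, ...) is made up.
statements = [
    ("report_v2", "wasDerivedFrom", "cleaned_extract"),
    ("cleaned_extract", "wasDerivedFrom", "raw_extract"),
    ("cleaned_extract", "wasGeneratedBy", "cleaning_run"),
    ("cleaning_run", "used", "raw_extract"),
    ("cleaning_run", "wasAssociatedWith", "alice"),
    ("report_v2", "wasAttributedTo", "alice"),
]


def lineage(entity, statements):
    """Return every entity that `entity` was derived from, nearest first."""
    parents = defaultdict(list)
    for subject, relation, obj in statements:
        if relation == "wasDerivedFrom":
            parents[subject].append(obj)

    found = []
    queue = list(parents[entity])
    while queue:
        current = queue.pop(0)
        if current not in found:  # also guards against accidental cycles
            found.append(current)
            queue.extend(parents[current])
    return found


print(lineage("report_v2", statements))
# prints ['cleaned_extract', 'raw_extract']
```

Even a flat list like this is enough to answer "where did this report come from?" by walking the derivation statements backwards.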

But what happens when you’re not building a warehouse, but running an analytics team? And over the last few weeks you’ve extracted 50 copies of some dataset 20 different ways and are already on your 4th version of one of dozens of analyses you’re destined to produce? How do you keep track of each version? How do you explain the differences between versions to your clients, weeks after they were produced?

It's certainly a challenge, and one that I've come up against many times in my career, particularly as many of these projects had an element of regulatory reporting, for which a degree of rigor and auditability is required. The teams on such projects are often dynamic in size, pace and composition, and as the analytics team grows it becomes ever harder to track analyses and business requirements, and how the two have evolved together.

Guerrilla Analytics is a field of research published by Enda Ridge and Edward Curry, and something that I’ve written about before. However, Enda has now published a companion book, Guerrilla Analytics - A Practical Approach to Working With Data, and halfway through I’m already a convert to the mantra that provenance is everything.

The book covers a number of data principles, risks and suggestions for managing them. However, provenance sticks out as a simple keyword that can be used to help drive methodology and to self-check that the approach undertaken is going to be robust and reliable. By placing emphasis on the organization and tagging of data, the book instills the values of provenance. It's not enough simply to have a provenance mantra, of course. The book promotes convention over configuration, outlining some ways in which these principles can be implemented without relying on heavy documentation or industrial tooling and processes: for instance, aligning working papers and code with the development process, cataloguing data extracts, and versioning central repositories of data. These techniques don’t have to rely upon specialist technologies either. File naming conventions and common work practices can go far in helping a team to track their data products.
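As an illustration of how little tooling cataloguing data extracts needs, here is a minimal sketch using only the Python standard library. It is one possible convention, not the book's: the log columns, the D001-style identifiers and the register_extract function are all assumptions made for this example. Each file received is given a sequential ID and a checksum, and a row is appended to a plain CSV log.

```python
import csv
import hashlib
from datetime import date
from pathlib import Path

# Hypothetical log columns; adapt to whatever your team agrees on.
FIELDS = ["extract_id", "file_name", "sha256", "source", "received_on"]


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def register_extract(log_path: Path, data_file: Path, source: str,
                     received_on: date) -> dict:
    """Append one row describing data_file to the CSV log and return it."""
    is_new_log = not log_path.exists() or log_path.stat().st_size == 0
    existing = 0
    if not is_new_log:
        with log_path.open(newline="") as handle:
            existing = sum(1 for _ in csv.DictReader(handle))

    row = {
        "extract_id": f"D{existing + 1:03d}",  # assumed naming convention
        "file_name": data_file.name,
        "sha256": sha256_of(data_file),
        "source": source,
        "received_on": received_on.isoformat(),
    }
    with log_path.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new_log:
            writer.writeheader()
        writer.writerow(row)
    return row
```

Calling register_extract(Path("data_log.csv"), Path("customers.csv"), "client SFTP", date.today()) each time a file arrives gives every later analysis something stable to refer to: an ID for the extract, and a checksum that shows whether two "copies" of a dataset are really the same file.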

Now that I work in the field of Internal Audit, where the buzzword is always ‘reproducibility’, provenance is aligned with my current world-view - if you understand where something has come from and how it has come to where it is now, you should have a pretty good idea of how to reproduce it.

And if you know how to reproduce it, hopefully you have a starting point for understanding it too.