The Eager Analyst

Super-charge your Data Science method - the importance of a hypothesis

January 11, 2020 · Niall Napier

It's been some time since I was in an academic setting, and even longer since I dusted off my lab book and documented the hypothesis, method, and conclusion of the prescribed experiment of the day. I never excelled at lab classes, and yet now, years later and in quite a different setting, I'm being drawn back to the scientific method. In fact, as machine learning has come to dominate the world of data science over the last few years, it feels like the scientific method has had a resurgence in popularity.

Recently I was fortunate to be invited to a nearby university to view some of the projects their analytics students had been working on over the past semester. As well as being struck by the enthusiasm and creativity of their projects, I was pleasantly surprised by the balance between scientific method and the applied nature of their work. Their hypotheses were formed from problems that companies everywhere are trying to solve for their own customers:

  • Interpreting review text to provide better recommendations;

  • Identifying objects from space to make life at ground level safer;

  • Using government data to optimize capacity issues for public bodies.

Each presentation started with a hypothesis, and while the method in each varied and was defined and reinvented over the course of the project, each ended with a conclusion on the results observed. I couldn't help but form my own follow-up hypotheses and questions for the students, challenging the source data and exploring the context of the problem for new ideas.

In the day-to-day, it's easy to get lost in the data and subsequently misinterpret results. Exploratory analysis can be wonderfully liberating and prompt ideas and lines of inquiry that would never have existed before; in isolation, however, this method risks wandering aimlessly through the data without any sense of direction. The hypothesis in each presentation helped the students stay focused on a clear objective and measure success, regardless of the outcome.

Measuring success bears repeating. One of the things I've seen analytics teams struggle with is how to demonstrate their contribution to the business. Sometimes this contribution gets lost when a path of investigation doesn't yield positive results. In fact, there is often still a lot of value in the work conducted; being able to show that the original direction wasn't the optimal one and that a course correction or a new approach is needed can be just as important as an analysis that proves you got things right the first time. Setting the question out in a hypothesis and documenting the results can help demonstrate where time and effort were spent and what was learned over the course of the project.

Having a hypothesis can be the difference between performing exploratory analysis and conducting data science. Both are important in understanding the data landscape and deriving insight, but they are very different ways of looking at the data.

In a recent seminar, Cassie Kozyrkov reminded the audience that if we’re conducting real data science, being able to independently test the hypothesis is key. Sometimes that means withholding not only a test set of data, but also a secret test set that even the scientists are not allowed to see - something to prevent them from gaming the experiment to produce optimal outcomes. It's a common trick used by Kaggle and other data science competitions: give participants enough data to experiment with, but not so much that they can cheat in the contest.
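
As a rough illustration of that idea, here is a minimal sketch of a three-way split using pandas and scikit-learn: one portion for modeling, one test set the team can see, and one withheld set scored only by an independent party. The DataFrame, split sizes, and function name are all illustrative assumptions, not a prescription.

```python
# Minimal sketch of a three-way split with a withheld "secret" test set.
# Assumes a pandas DataFrame `df`; all names and ratios are illustrative.
import pandas as pd
from sklearn.model_selection import train_test_split

def three_way_split(df: pd.DataFrame, seed: int = 42):
    # Keep 70% for modeling (training and validation) and hold back 30%.
    work, held_back = train_test_split(df, test_size=0.3, random_state=seed)
    # Split the held-back portion into a test set the team can see and a
    # "secret" set that only an independent party scores against.
    visible_test, secret_test = train_test_split(held_back, test_size=0.5,
                                                 random_state=seed)
    return work, visible_test, secret_test
```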

Whatever your motives, take a step back and ask ‘how will we know when we're done?’

How do you track value when performing your analysis? Do hypotheses feature in your approach or do you use a different technique? Let me know in the comments below, or on Twitter @eageranalyst.

In Thoughts · Tags: hypothesis, testing, students, independence

Going Beyond Accuracy and Completeness - A Deeper Dive into Data Quality

February 27, 2019 · Niall Napier

You will never have perfect data.

It’s a sad but inevitable truth. Rare enough are the cases where you can find the data you need for your project, let alone data without a variety of errors that could compromise its quality.

The silver lining is that you probably don’t need perfect data to begin with. As long as you can effectively measure and test the quality of what you do have, you can develop an understanding of your data’s limitations and develop new and creative ways to address them.

I often hear people talk about measuring the ‘accuracy’ and ‘completeness’ of their data, and while these are certainly important concepts, I wanted to go beyond that to discuss some of the other aspects of data quality that I think are equally important. After all, what does it mean for your data to be accurate and complete if it isn’t available until after you’ve made a decision you needed it for? So, let me summarize the six measures I consider for data quality - Accuracy, Completeness, Integrity, Precision, Validity and Timeliness - and share some thoughts on how to test for them in practice.

Accuracy

Data is accurate when it can be verified against an accepted version of the truth, a trusted authoritative source. Without an external validation source, it is very difficult to prove your data is accurate.

Data can be inaccurate at capture or become inaccurate due to corruption over time. An example of inaccurate data at capture would be if my address was captured as 132 Rand Street when I actually lived at 123 Rand Street. Both are valid addresses, but only one is my actual home. Over time I might move home, but not have my new address updated in my records.

How can I test it?

  • Find an external trusted data source to validate against; if one is not available, find a proxy that you can compare against. For example, you might not be able to compare individual transactions against another source, but maybe you can compare them to an aggregate total or end-of-day values (a sketch of this kind of reconciliation follows below).
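
Here is a minimal sketch of that reconciliation, assuming pandas DataFrames with illustrative column names (`date`, `amount`, `expected_total`) rather than any particular system:

```python
# Sketch of an accuracy check: reconcile our daily aggregates against
# the totals published by a trusted source. Column names are illustrative.
import pandas as pd

def reconcile_daily_totals(transactions: pd.DataFrame,
                           trusted_totals: pd.DataFrame,
                           tolerance: float = 0.01) -> pd.DataFrame:
    ours = transactions.groupby("date", as_index=False)["amount"].sum()
    merged = ours.merge(trusted_totals, on="date", how="outer")
    merged["difference"] = merged["amount"] - merged["expected_total"]
    # Flag days where the totals drift beyond the tolerance, plus days
    # that appear on only one side (their difference is NaN).
    breaks = merged["difference"].isna() | (merged["difference"].abs() > tolerance)
    return merged[breaks]
```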

Completeness

Data is complete when each entity is counted exactly once. This means that you’re not missing any records or columns in the data that you need, and that you don’t have any extra records or values either. Nothing is being under- or over-reported.

Completeness issues often occur when data is transmitted between systems and records are either sent more than once or missed altogether. In some cases, data is filtered before being shared with another system, and records are unknowingly lost.

How can I test it?

When testing for completeness you will first need to decide on the things you are trying to measure. This might sound obvious, but consider the case when a customer signs up for an account on your website. If they sign up 5 times, how many records will you hold for that customer? What if they signed up using different email addresses? You will need to think about whether to consider them as individual accounts or as a single person.

  • If you have a unique ID or reference key, check whether you have more than one record for that key to identify duplicate values (several of these checks are sketched in code after this list).

  • If you have an external source of truth, reconcile your ID/reference key against that source to identify missing values or extra ones.

  • If you have a checksum or control total, confirm that your dataset has that many records or computes to the same checksum.

  • For column completeness, identify which columns should not be missing values and analyze any records that are missing data in those columns.

  • If you don’t have a reference number, consider other things you could use as a proxy to indicate whether your data is over- or under-reported. For example, the volume of transactions per day, or the aggregate value of transactions in a month or per user.

  • If you don’t have a reference number, consider what combination of attributes you could use to identify duplicates in your data. For example, email address, physical address or phone number paired with a name or date of birth.
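
A few of the checks above can be expressed in a handful of lines of pandas. The sketch below assumes a DataFrame, a key column, a list of required columns, and an optional control total, all of which are illustrative:

```python
# Sketch of a few completeness checks on a pandas DataFrame.
# The key, required columns and control total are assumptions.
import pandas as pd

def completeness_report(df: pd.DataFrame, key: str,
                        required_columns: list, expected_count=None) -> dict:
    report = {
        # Keys appearing more than once suggest duplicate records.
        "duplicate_keys": df.loc[df.duplicated(subset=key, keep=False), key]
                            .unique().tolist(),
        # Required columns with missing values, and how many are missing.
        "missing_values": {col: int(df[col].isna().sum())
                           for col in required_columns if df[col].isna().any()},
    }
    if expected_count is not None:
        # Compare the record count against a checksum or control total.
        report["count_difference"] = len(df) - expected_count
    return report
```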

Integrity

For users to be able to leverage data across multiple systems and reports, it's important that the data is recorded in a consistent manner so that comparisons are meaningful. If two sources disagree on a particular value, or calculate the same value in different ways, the resulting inconsistencies make it hard to focus on bigger issues.

Integrity is a significant data quality risk when information flows between systems and can either be miscommunicated or become stale over time.

How can I test it?

  • Try to obtain a data dictionary for any data provided so that there is a clear explanation of what each column is expected to contain. Compare these definitions across sources.

  • Where you have two sources, compare attributes between them to identify potential omissions or inaccuracies, and define which you consider to be the authoritative source of that data (a short sketch of this comparison follows below).
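
A sketch of that comparison, assuming two pandas DataFrames that share a key column and an attribute to compare (all names are illustrative):

```python
# Sketch of an integrity check: compare a shared attribute across two
# sources on a common key and surface disagreements or gaps.
import pandas as pd

def compare_sources(source_a: pd.DataFrame, source_b: pd.DataFrame,
                    key: str, attribute: str) -> pd.DataFrame:
    merged = source_a.merge(source_b, on=key, how="outer",
                            suffixes=("_a", "_b"), indicator=True)
    # Keep records present in only one source, or where the values differ.
    issues = merged[(merged["_merge"] != "both") |
                    (merged[f"{attribute}_a"] != merged[f"{attribute}_b"])]
    return issues[[key, f"{attribute}_a", f"{attribute}_b", "_merge"]]
```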

Precision

“I can report my height as 47.239476 meters – very precise, and utterly false”

Precision measures how granular a result is. It is an important concept when assessing the reliability of measurements in their wider context, and should be communicated using units and a tolerance – for example, height being measured in meters +/- 1cm.

Precision can become a data quality risk if it is not maintained when data flows between systems, or if there is a lack of clarity about the recorded precision (or the reasoning behind a particular level of precision). In low-latency trading, for instance, decisions are made in milliseconds. If the time of a trade is recorded only to the second or minute, the information shared would be meaningless for future decisions.

Note that precision is not the same as accuracy; a result can be emphatically precise and completely inaccurate (for example, I can report my height as 47.239476 meters – very precise, and utterly false).

How can I test it?

  • Examine the values you have - do they cluster around specific values that may indicate they are being rounded to a less granular level? (A sketch of this check follows below.)

  • Be clear about the reasons why you have a particular level of precision for a recorded value. What level of precision do you need for your calculations?
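
One way to look for that clustering is to count the decimal places actually present in a column. This is a rough sketch, assuming a pandas DataFrame and a numeric column of your choosing; values in scientific notation would need extra handling:

```python
# Sketch of a precision check: how many decimal places does each value
# actually carry? A distribution skewed toward fewer places than expected
# can indicate rounding somewhere upstream.
import pandas as pd

def decimal_place_distribution(df: pd.DataFrame, column: str) -> pd.Series:
    def places(value) -> int:
        text = f"{value}"
        return len(text.split(".")[1]) if "." in text else 0
    return df[column].dropna().map(places).value_counts().sort_index()
```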

Validity

Validity measures whether a data attribute falls within certain predefined validation criteria. These rules typically reflect business requirements and can be valuable in stopping bad data from entering a system in the first place. For example, birth dates should [probably] never be in the future and a person’s weight should not be a negative number.

If the data in our systems don’t obey the rules we define for them, they may break information flows or systems that depend on that data.

How can I test it?

  • Where you’re expecting categorical values, check that your data falls within those categories.

  • If you require certain fields to be present, check they are populated.

  • For free text fields, analyze field lengths to look for instances where dummy data has been recorded (these checks are sketched below).
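
A sketch of these three checks, with the column names, allowed categories, and length threshold all invented for illustration:

```python
# Sketch of simple validity checks: unexpected categories, missing
# required fields, and suspiciously short free-text entries.
import pandas as pd

def validity_issues(df: pd.DataFrame) -> dict:
    return {
        # Categorical field containing values outside the allowed set.
        "unexpected_status": df[~df["status"].isin(["active", "closed", "pending"])],
        # Required field that should always be populated.
        "missing_email": df[df["email"].isna()],
        # Very short free-text entries that may be dummy or placeholder data.
        "dummy_comments": df[df["comments"].str.len().fillna(0) < 3],
    }
```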

Timeliness

Timeliness measures how relevant the information provided is in the timeframe you’re using it. Information quality may deteriorate over time and become less meaningful. For example, a business may prepare accounts every month to provide a picture of the company’s finances. If they did this only on an annual basis, it may be harder for the company to make financial decisions come the end of the year.

Depending on the type of decision or action being performed and the nature of the data, there may be risk if data is not being updated or provided in a timely manner.

How can I test it?

  • Look for expiration dates or last updated dates on key information fields. Things like contact information will inevitably change over time.

  • Check when a report was last refreshed or prepared (a simple staleness check is sketched below).
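
As a sketch, you can flag records that have no update date or haven't been refreshed within an acceptable window; the column name and threshold here are assumptions:

```python
# Sketch of a timeliness check: flag records that were never dated or
# whose last update is older than the accepted maximum age.
import pandas as pd

def stale_records(df: pd.DataFrame, column: str = "last_updated",
                  max_age_days: int = 365) -> pd.DataFrame:
    cutoff = pd.Timestamp.now() - pd.Timedelta(days=max_age_days)
    last_updated = pd.to_datetime(df[column], errors="coerce")
    return df[last_updated.isna() | (last_updated < cutoff)]
```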

Round-up

Accuracy, Completeness, Integrity, Precision, Validity and Timeliness - these are my top six data quality considerations and the concepts I routinely rely on to help test the quality of the data I encounter. If your responsibilities include information management more generally, you may also want to consider things like the privacy and security of your data.

Let me know if you have any other things you watch out for when performing your analysis in the comments below, or on Twitter @eageranalyst.

In Thoughts · Tags: timeliness, quality, precision, completeness, accuracy, integrity, validity

"Just give me all the data" - the importance of robust requirements and how agile can help

July 18, 2018 · Niall Napier

Last year, I was part of a team that invested heavily in making its application development pipeline ‘agile’ by adopting the Agile/Scrum methodology. Along the way I realized the value of having a pipeline of good, robust requirements, and how framing those requirements around user benefit allowed us to develop features in proportion to the benefit users would receive. Soon we were delivering real value to our users faster and were hyper-focused on building the right thing. These techniques can easily be leveraged when working with data and improving the quality of any data product.

In Thoughts · Tags: Agile, Data, Scrum, Story, Requirements

All views expressed are my own and may not reflect those of my current or prior employers.