Missing observations

When people hear the term missing data, they typically think of NULL values in database tables. However, one of the most treacherous forms of missing data is missing observations. It means that an entire record is missing from your table. It is not just full of NULL values - it is not there at all. I call it treacherous because from a purely technical perspective, there is no indication that there is anything wrong with the data. Let me give you a few real-life examples of missing observations that I came across in the past.

As a data scientist, you may encounter a situation where you must request the delivery of a dataset that follows a given set of specifications. A few years ago, I worked as a consultant on a project to investigate a suspected financial fraud. The suspects who ran a real estate investment company had supposedly misappropriated money from their customers. They had built a complex network of dozens of companies, each one with its' own set of bank accounts. After receiving money from their customers, it disappeared somewhere within this network, along with several hundred other transactions per day.

The investigators had a hard time understanding what happened to the money due to the complexity of the network, so they asked for help. But in order to answer their questions, we had to obtain the bank statements from more than a thousand bank accounts at dozens of different banks, including all transactions from the last five years. We requested this data from the corresponding banks in a machine-readable format and loaded it into an Oracle database. However, we quickly discovered that there were missing observations in the data. Sometimes, there would be months or even years missing. In many cases, the employees from the bank had simply pressed the export button on their software, not knowing that data older than two years is automatically moved to an archive database and not included in the export. In other cases, the export skipped entire months for some unknown reason.

Another example occurred a few years later when I analyzed log files from a software that was being used within the company I was working for. Data like this will usually exhibit a weekly pattern since employees will use the software mostly from Monday to Friday, but not as much during the weekend. However, I discovered that there were also some days during the week that had close to zero log messages despite being regular workdays. The reason, as it turned out, was that the software had been shut down for maintenance.

At the same company, I also analyzed web tracking data. This data was generated from tracking pixels that the developers had to put on the website. After one of the frequent updates the developers made to the side, the page views and interactions in one particular section of the website went down to zero for a few days. The product managers - always having an eye on the reports to see the effect of their new ideas and changes to the site - panicked. As part of the BI department, I looked into the sudden drop and discovered that the developers had forgotten to put the tracking pixels on this part of the website. People still visited like before - as was visible in the web server's log files - but the tracking system didn't record their visits.

The easiest way to spot gaps like in the above-mentioned transactional data is through appropriate visualizations. You can generate daily, weekly and monthly plots of the total number of transactions and search for gaps in the data.

Unfortunately, this is more difficult when there's only part of the data missing. In the web tracking example from above, the total number of page views is the sum of the views from every page of the website. When you have dozens of different pages and one of them suddenly drops to zero, it may not be directly visible in a chart showing the total daily page views. In this case, it's a good idea to generate a chart for every single section.

More generally speaking, there might be dimensions other than time where data is missing. Depending on what kind of data you're working with, it's worthwhile to think about which dimensions could lead to missing data. In website tracking data, this can be specific sections or subsections of a website. In retail transaction data, it can be individual shops or product categories causing missing values. Spend some time thinking about what dimensions your data has that could lead to missing observations.

Takeaways:

  • Check for gaps in the data, e.g. missing days, weeks, months

  • Check for gaps across different dimensions of the data. The overall counts can look fine, but data from an entire category could be missing from one day to the next

Last updated