The intangible nature of data

Data has a weird property. While you can easily look at a few records to see what's going on, it's impossible to look at a data set in its entirety. In that regard, it's like the Great Wall of China. If you looked at a small piece of the wall right in front of you, you could easily spot cracks and other kinds of damage. You could look at the individual bricks and judge what state they're in. But even if the piece of the wall you were looking at looked perfectly fine, there could be severe damage, or even entire pieces of the wall missing, a few kilometers from your current position, and you wouldn't know it.

If the emperor of China asked you to make sure that the wall is intact, you would need to send riders to every part of it and have them inspect it for you. For a data set, these riders are your database queries. The catch is that they are very stupid: they can only answer the single question you give them when you send them on their journey to inspect the wall. If you asked them to tell you about the state of the bricks, they would only tell you whether they saw any broken or damaged bricks. They wouldn't tell you if parts of the wall were missing. They wouldn't tell you if there were tunnels under the wall. And they wouldn't tell you if the wall was too thin, not high enough, or otherwise poorly built somewhere along the border. In order to confidently say that the wall is fine, you would need to think of every question necessary to make sure that there are no parts of the wall that would allow enemies to get in.
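To make the rider analogy concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `orders` table, its columns, and its rows are hypothetical and invented purely for illustration, as is the `ask` helper. Each query plays the role of one rider and reports back on exactly one question.

```python
import sqlite3

# Hypothetical table and rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "2024-01-01", 19.99),
        (2, "2024-01-02", -5.00),  # damaged brick: negative amount
        (3, "2024-01-02", None),   # damaged brick: missing amount
        # no rows at all for 2024-01-03: a missing piece of the wall
        (4, "2024-01-04", 42.00),
    ],
)


def ask(question_sql):
    """Send one rider: run one query and return its single answer."""
    return conn.execute(question_sql).fetchone()[0]


# Each rider answers exactly one question and nothing else.
negative_amounts = ask("SELECT COUNT(*) FROM orders WHERE amount < 0")
missing_amounts = ask("SELECT COUNT(*) FROM orders WHERE amount IS NULL")
days_with_rows = ask("SELECT COUNT(DISTINCT day) FROM orders")

print("negative amounts:", negative_amounts)
print("missing amounts:", missing_amounts)
print("days with rows:", days_with_rows, "of 4 expected")
```

The first query finds the negative amount but says nothing about the missing amount or the day without any rows. Each of those problems only surfaces once a separate question is asked about it.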

This is what working with data feels like. On the one hand, it's right there in your database; on the other hand, there's no way to look at all of it. Even if you stood on top of the Great Wall at the highest mountain along the border and the weather was completely clear, you would still only see a few kilometers in every direction. And even then, it would be easy to miss small cracks in the wall.

Data doesn't speak. It only answers very simple questions. If you don't know what questions to ask, you cannot completely assess the state of your data. This book is intended to help you ask the right questions.
