The chapter on type- and format-related problems would not be complete without a side note on the biggest problem in this area - Microsoft(R) Excel(R).
There is a saying that all data eventually ends up in Excel. Whether or not this is true is debatable, but this spreadsheet application is certainly the number one tool in business that people use to explore, present and exchange small datasets.
While Excel is a very powerful tool that covers the most common needs of people working with small datasets, it's a terrible format for exchanging data and should be handled with great care. Here is a list of the biggest problems you will come across when using this program:
AUTOMATIC TYPECASTING: Excel has built in mechanisms to guess the variable type of the fields you enter. Unfortunately, this automatic typecasting can backfire. I have opened many datasets with Excel only to discover that Excel has converted numbers and text fields into dates, making the data pretty much useless. Whenever something looks remotely like a date, Excel will convert it. Even worse, it doesn't even apply the same typecasting to the entire column, but instead to each individual cell, making you end up with a column containing two or more different data types. If you are unlucky enough to save the file after Excel has mutilated it with its typecasting assumptions, you can't go back to your original data and it may be lost forever.
FORMATTING: Since people use Excel for presenting data, Microsoft has added a variety of formatting options to Excel over the years. Not only can you use random cells of your spreadsheet for titles, explanations and disclaimers, you can also insert pictures, charts, colors, fonts, word art, formulas, clip arts and many other gimmicks that make looking at the data more convenient and enjoyable for humans. But while a human can easily tell where the data starts and where it ends, a machine cannot distinguish between cells that contain data and all the clutter around it.
POOR PERFORMANCE WITH LARGE DATASETS: Excel works well with small datasets, but when your data exceeds a few hundred thousand records, it breaks down. Opening the spreadsheet takes a long time and scrolling through it becomes a hassle. Let your dataset exceed a million records and Excel becomes completely unusable.
LIMITED ANALYTICS CAPABILITIES: Most data-driven questions in the business world can be solved with very basic mathematics. Doing some aggregations and calculating some sums, averages and occasionally a standard deviation will get you very far. Ask any deeper questions that require more complex statistical methods like prediction models or clustering algorithms and you will quickly hit the limits of what Excel can do.
LACK OF REPRODUCIBILITY: When you are processing data, reproducibility is an important factor. If someone would rerun your data processing and analysis on the same data, they should get to the same result. Unfortunately, Excel makes it very easy for the user to manually 'fix' little problems directly in the spreadsheet containing the source data. This type of manual manipulation is not reproducible and if someone tries to run your code on new data of the same kind, it will not work unless they apply a similar manual editing process to the raw data. When you are working with data in a legal context, this type of manual manipulation may even be forbidden and can completely invalidate your entire analysis in court.
NO ENFORCEMENT OF DATA TYPES: Every cell in Excel is basically a free text field and there is - at least by default - no form of data quality check built in. This means that the person entering the data can pretty much put anything in there. Text, numbers, dates, formatted text, dead fish, pictures of Spiderman... you name it. This causes all kinds of problems when you try to load the data into any other program.
Excel is like a Swiss army knife of data analysis in that it is used for gathering, storing, integrating, analyzing, exchanging and presenting data. The fact that it can be used for all these tasks makes it one of the most useful pieces of software out there - as proven by its widespread adoption. But a Swiss army knife, while being somewhat useful for a variety of task, is hardly ever the best tool for any specific task. It can neither replace a lumberjack's chainsaw nor a surgeon's scalpel nor Wolverine's claws. Each of the six above-mentioned tasks can be performed with Excel, but there are always better-suited tools for each individual task.
So how SHOULD you use Excel as a data scientist? If the people in your organization like to use Excel, let them use it. Just make sure that it does not become a critical element within your data analysis workflow. Try to use it exclusively at the end of your data pipelines, e.g. for data presentation when providing the results of a report or analysis to the people in the business.
If by some misfortune you must READ data from Excel, keep all the above weaknesses in mind and make sure to check all the assumptions that you may have about the data and its format. Reject all files that fail these data quality checks.
- While Excel is potentially able to perform most tasks in a data analysis pipeline, there is almost always a more suitable tool for each individual task
- Let people in your organization use Excel, but don't make it part of your data analysis pipeline