Missing or insufficient headers

There is nothing more annoying than getting data without header names. This often happens when someone exports data from a software tool without ticking the 'Include headers' check box. Also, some data storage formats store the data and the schema, i.e. the format of the data, separately from one another. You will often see this when you work on a Hadoop-based data processing infrastructure and try to read the flat data files directly.

Luckily, this problem is easy to identify by manually looking at the first row of your data file and checking if there are header names.
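If you receive many files, you may want to automate this first look. The following is a minimal sketch, not a definitive check: it assumes a comma-separated file and uses the rule of thumb that a header row contains no purely numeric cells. The function names `looks_like_header` and `first_row_of` are made up for this illustration.

```python
import csv
import io


def first_row_of(text):
    """Return the first parsed CSV row of `text`, or an empty list if there is none."""
    return next(csv.reader(io.StringIO(text)), [])


def looks_like_header(first_row):
    """Heuristic (an assumption, not a guarantee): a header row is
    non-empty and none of its cells parses as a number."""
    def is_number(cell):
        try:
            float(cell)
            return True
        except ValueError:
            return False

    return bool(first_row) and not any(is_number(c.strip()) for c in first_row)


# Two tiny made-up samples: one export with headers, one without.
with_header = "id,name,amount\n1,Alice,9.50\n"
without_header = "1,Alice,9.50\n2,Bob,3.20\n"

print(looks_like_header(first_row_of(with_header)))
print(looks_like_header(first_row_of(without_header)))
```

A heuristic like this will misjudge files whose columns are all text, so treat a negative result as a prompt to look at the file yourself. Python's standard library also ships `csv.Sniffer().has_header()`, which makes a similar guess from a sample of the file.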

If headers are missing and you are familiar with the data, you may guess what each column means, but most of the time, you will be completely lost. The only solution is to ask the person who provided the data to rerun the export with column names or to give you a list of the column names.
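Once you have been given the list of column names, you can attach it when reading the headerless file. Here is a small sketch using Python's `csv.DictReader` with its `fieldnames` parameter. The column names and sample rows are invented for the example, and `read_with_names` is a hypothetical helper. It also rejects rows whose cell count does not match the list, which is a cheap way to notice that the list you were given does not fit the file.

```python
import csv
import io


def read_with_names(text, column_names):
    """Parse headerless CSV text and label each cell with the supplied names."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=column_names)
    rows = []
    for row in reader:
        # DictReader stores surplus cells under the key None and fills
        # missing cells with the value None, so either signals a mismatch.
        if None in row or None in row.values():
            raise ValueError(
                f"line {reader.line_num}: expected {len(column_names)} columns"
            )
        rows.append(row)
    return rows


# Invented sample data and column list, for illustration only.
headerless = "1,Alice,9.50\n2,Bob,3.20\n"
names = ["id", "name", "amount"]

for record in read_with_names(headerless, names):
    print(record["name"], record["amount"])
```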

Aside from missing column names, you will also encounter cases where you have a column name but still cannot make any sense of it because it is some abbreviation that you are not familiar with. Some old databases even have restrictions on the length of the column name, enforcing the use of cryptic abbreviations. Others might simply enumerate some column ID and refer you to a manual or data dictionary where you can find what the ID stands for.
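When you do get a data dictionary, it pays to apply it in code rather than keep looking things up by hand. The sketch below assumes the dictionary is available as a simple mapping from cryptic name to readable name. The abbreviations (`CUST_NO`, `ORD_DT`, `C017`) and the helper `rename_columns` are invented for this example. Names that are missing from the dictionary are kept unchanged and reported, so you know what to ask the data provider about.

```python
def rename_columns(header, data_dictionary):
    """Return (renamed_header, unknown_names).

    Names found in `data_dictionary` are replaced by their readable form;
    names not found are left as they are and collected in `unknown_names`.
    """
    renamed = [data_dictionary.get(name, name) for name in header]
    unknown = [name for name in header if name not in data_dictionary]
    return renamed, unknown


# Hypothetical data dictionary and header row, for illustration only.
data_dictionary = {
    "CUST_NO": "customer_number",
    "ORD_DT": "order_date",
}
header = ["CUST_NO", "ORD_DT", "C017"]

renamed, unknown = rename_columns(header, data_dictionary)
print(renamed)
print("Not in the data dictionary:", unknown)
```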

Choosing good names for your tables and columns is an art in and of itself. I will go into more detail in the chapter on database-related problems later in this book.

Takeaways:

  • When requesting data, explicitly ask for the header names to be included in the export.

  • Ask for a data dictionary so that you can look up what a cryptic or abbreviated header name means.
