Column type violations
Most CSV parsers allow you to specify the data types of each column. Sometimes, they will even guess the data type based on the content of the first few rows. If the parser hits a record where the data does not fit the assumed column type, it will try to implicitly cast to that type. If this attempt fails, most parsers will abort with an error message or insert a missing value.
The latter option is something you never want to happen! Sure, it sucks when your import runs for an while and then fails with an error message, but at least you know that something went wrong. When some implicit type-casting creates a NULL value, you will probably never know that this happened. You might look at the parsed data, see a bunch of NULL values and assume that they were there from the beginning, while it was your ETL pipeline that created them.
Also, this type of automatic type-casting is somewhat error-prone because you are assuming that an entire column follows a certain format, e.g. containing a floating-point number or a certain date format. Like I have emphasized before, you should never leave your assumptions untested.
My preferred way of importing data is to import all columns as text data types (in R the character class) and run a few tests. Usually, I use regular expressions to test if a date always has the same format or if a column always contains a number. This way, you can test your assumptions and easily find out in which row and which column they don't hold up rather than simply getting an error message. If you are going to type-cast an entire column from one data type to another, always do it explicitly! Don't leave it to chance.
Takeaways:
  • Avoid implicit typecasting
  • Read columns as text and do the typecasting in an explicit step where you can easily check for problem cases
  • Check your assumption regarding the data format in each column
Copy link