Column type violations

Most CSV parsers allow you to specify the data type of each column. Some will even guess the data types based on the content of the first few rows. If the parser hits a record where the data does not fit the assumed column type, it will try to cast the value implicitly. If this attempt fails, most parsers will either abort with an error message or insert a missing value.
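
To make this concrete, here is a small sketch in R using the readr package; the file, the column names and the values are made up for illustration. The amount column is declared as a double, so the value that cannot be cast ends up as NA while the import itself still succeeds:

```r
library(readr)

# Hypothetical example data: the second record has a non-numeric amount.
csv_file <- tempfile(fileext = ".csv")
writeLines(c("id,amount", "1,10.5", "2,unknown", "3,12.0"), csv_file)

# The amount column is declared as a double. "unknown" cannot be cast,
# so readr inserts NA and merely emits a parsing warning.
df <- read_csv(csv_file, col_types = cols(id = col_integer(),
                                          amount = col_double()))

df$amount     # 10.5   NA 12.0
problems(df)  # lists the row and column where the cast failed
```

If you overlook the warning and never call problems(), that NA looks just like a missing value that was in the source file all along.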

The latter is something you never want to happen! Sure, it sucks when your import runs for a while and then fails with an error message, but at least you know that something went wrong. When implicit type-casting creates a NULL value, you will probably never know it happened. You might look at the parsed data, see a bunch of NULL values and assume they were there from the beginning, when in fact it was your ETL pipeline that created them.

Also, this kind of automatic type-casting is somewhat error-prone because you are assuming that an entire column follows a certain format, e.g. that it contains floating-point numbers or dates in a specific format. As I have emphasized before, you should never leave your assumptions untested.

My preferred way of importing data is to read all columns as text (in R, the character class) and then run a few tests. Usually, I use regular expressions to check whether a date column always has the same format or whether a column really only contains numbers. This way, you can test your assumptions and find out exactly in which row and column they don't hold up, rather than simply getting an error message. And if you are going to type-cast an entire column from one data type to another, always do it explicitly! Don't leave it to chance.
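
The sketch below shows one way to do this in R with readr; the file name and the column names (amount, created_at) are hypothetical, and the regular expressions encode the format assumptions I want to verify:

```r
library(readr)

# Read every column as text first so nothing is cast implicitly.
raw <- read_csv("orders.csv", col_types = cols(.default = col_character()))

# Test the format assumptions with regular expressions before casting.
bad_amount <- which(!grepl("^-?[0-9]+(\\.[0-9]+)?$", raw$amount))
bad_date   <- which(!grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", raw$created_at))

if (length(bad_amount) > 0 || length(bad_date) > 0) {
  stop("Format assumptions violated in rows: ",
       paste(sort(unique(c(bad_amount, bad_date))), collapse = ", "))
}

# Only cast explicitly once the assumptions have been verified.
raw$amount     <- as.numeric(raw$amount)
raw$created_at <- as.Date(raw$created_at, format = "%Y-%m-%d")
```

If a record violates an assumption, you immediately know which rows and which column to look at instead of debugging a generic parser error.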

Takeaways:

  • Avoid implicit type-casting

  • Read columns as text and do the type-casting in an explicit step where you can easily check for problem cases

  • Check your assumptions about the data format in each column
