Guidelines for working with CSV files

As you saw in this chapter, most problems that occur when reading CSV data stem from a difference between the expected and the actual format. This can happen in the first row of the data, e.g. when the CSV parser expects a different field separator, or a million lines later, when some obscure entry contains a strange character that you did not expect.

The key takeaway from this chapter is to always check your assumptions about the format and to get familiar with the error messages of your CSV parser. Be aware, though, that parsing problems do not necessarily produce error messages. Always check the imported data for consistency and plausibility to make sure it has been read correctly.
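
For illustration, here is a minimal sketch of such a defensive import in Python with pandas. The file name orders.csv and the expected columns ORDER_ID and NET_REVENUE_IN_EUR are assumptions made up for this example:

```python
import pandas as pd

# Read the file with explicit format assumptions instead of relying on defaults.
df = pd.read_csv(
    "orders.csv",      # hypothetical file name
    sep=",",           # the field separator we expect
    quotechar='"',     # the quotation character we expect
    encoding="utf-8",  # the encoding we expect
)

# Plausibility checks: silent parsing problems often show up here
# rather than as error messages.
assert list(df.columns) == ["ORDER_ID", "NET_REVENUE_IN_EUR"], "unexpected columns"
assert len(df) > 0, "no rows were read"
assert df["NET_REVENUE_IN_EUR"].dtype.kind == "f", "revenue column is not numeric"
print(df.describe())  # eyeball ranges and counts for plausibility
```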

On the other hand, when you're the one creating the CSV file to pass it on to some person or software, try to stick to these guidelines unless you have a good reason to deviate from them:

  • Include a header row with sufficiently descriptive column names. Avoid abbreviations and include units where relevant. Bad: NTRV. Good: NET_REVENUE_IN_EUR.

  • Accompany your CSV file with a data dictionary detailing the way each column was measured, which unit was used, how missing values have been encoded, etc.

  • Use quotation characters for text fields. I recommend using double quotes. Make sure that the quotation character is properly escaped in the text fields.

  • Pick a decimal separator that is different from the field separator. I recommend using commas as field separators and points as decimal separators. Avoid digit group separators.

  • Compare the number of records before export with the number of lines in the CSV file and investigate any discrepancy. (The sketch after this list includes such a check.)
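
To make these guidelines concrete, here is a minimal sketch of an export that follows them, using Python's standard csv module. The records, column names, and the file name revenue.csv are assumptions made up for this example:

```python
import csv

# Hypothetical records to export; the column names follow the guideline of
# descriptive headers with units.
records = [
    {"CUSTOMER_NAME": 'Miller, John "JJ"', "NET_REVENUE_IN_EUR": 1234.56},
    {"CUSTOMER_NAME": "Schmidt, Anna", "NET_REVENUE_IN_EUR": 99.90},
]

with open("revenue.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(
        f,
        fieldnames=["CUSTOMER_NAME", "NET_REVENUE_IN_EUR"],
        delimiter=",",                 # comma as field separator
        quotechar='"',                 # double quotes around text fields
        quoting=csv.QUOTE_NONNUMERIC,  # quote text fields, leave numbers bare
    )
    writer.writeheader()
    writer.writerows(records)

# Last guideline: compare the record count with the line count
# (the header row adds one line).
with open("revenue.csv", encoding="utf-8") as f:
    line_count = sum(1 for _ in f)
assert line_count == len(records) + 1, "record/line count mismatch"
```

Setting quoting=csv.QUOTE_NONNUMERIC quotes every text field while leaving numbers bare, and the module's default doublequote=True escapes embedded quotation characters by doubling them, which covers the escaping guideline without extra code.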
