📈
beginners-guide-to-clean-data
  • A Beginner's Guide to Clean Data
  • Introduction
    • Foreword
    • The value of data
    • The intangible nature of data
  • Missing data
    • Missing values
    • Missing value patterns
    • Missing value representations
    • Missing observations
    • Truncated exports
    • Handling missing values
  • Data range problems
    • Unexpected values
    • Outliers
    • Freak cases
  • Common CSV problems
    • CSV basics
    • Quotation characters
    • Line breaks in text fields
    • Missing or insufficient headers
    • Trailing line breaks
    • Data export and import
    • Column type violations
    • Guidelines for working with CSV files
  • Text mining problems
    • Text mining basics
    • Encoding in your data and IDE
    • Special characters
    • Character entities
    • Lookalike characters
    • Dummy words
  • Type- and format-related problems
    • Inconsistent timestamp formats
    • Whitespace-padded strings
    • Binary data
    • Semi-structured log files
    • Proprietary data formats
    • Spreadsheets
  • Database-related problems
    • Numeric overflow
    • Duplicate rows
    • Table joins
    • Huge enterprise databases
    • Case sensitivity
    • Separating DDL and DML statements
    • Database performance considerations
    • Naming tables and columns
    • Poorly written SQL
    • Large monolithic SQL scripts
    • SQL orchestration
  • Data inconsistency
    • No single point of truth
    • Non-matching aggregated data
    • Internal inconsistency
  • Data modeling
    • Business concepts
    • Handling complexity
    • Interfaces
    • Generalized data models
    • Reproducibility
    • Feature stores and feature engines
    • Thinking pragmatic
  • Monitoring and testing
    • Automated testing
    • Measuring database load
  • Bonus content
    • Checklist for new data
Powered by GitBook
On this page

Was this helpful?

  1. Common CSV problems

CSV basics

CSV (column-separated values) files are one of the most widely used formats to store and exchange data. They are plain text files that contain tabular data where the values of different columns are separated by a delimiter, usually a comma or a semicolon. Each row of data represents a row in the data table and is usually terminated with a carriage return and a line break character. The below example shows the content of a small CSV file representing a table with three rows and three columns. The carriage return character \r and the line break character \n have been explicitly printed for clarity.

"NAME","AGE","SALARY"\r\n
"John Smith",42,82000.00\r\n
"Michael Baker",35,71450.50\r\n
"Miranda Watts",40,90000.00\r\n

A major problem with the CSV format is that there are no universal standards for the details of the format and every program uses a slightly different combination of parameters. There are at least a dozen different combinations in the choice of the column delimiter, the decimal separator, the row terminator, the quotation character, whether a header is provided and how to escape which characters. As an example, consider the following CSV file that has the same data as the first one, but in a different format:

NAME;AGE;SALARY\n
John Smith;42;82,000\n
Michael Baker;35;71,450.5\n
Miranda Watts;40;90,000\n

When somebody sends you a CSV file, you should always confirm that your assumptions about the structure of the file hold up. Simply displaying the first few lines of a CSV file should tell you most of what you need to know. In the above example, you can quickly see that the semicolon ; is used to separate columns, rows are terminated with \n and no quotation characters are being used.

This chapter will show you the most common pitfalls that you will encounter when working with CSV files. Since these files are also plain text files, many of the issues in the chapter on text data, e.g. encoding-related problems or lookalike characters, also apply here.

PreviousFreak casesNextQuotation characters

Last updated 4 years ago

Was this helpful?