A Beginner's Guide to Clean Data

Common CSV problems
Trailing line breaks

Trailing line breaks are usually a minor problem, but it is useful to be aware of them. Some CSV files have a few additional blank lines at the end. This can confuse some CSV parsers into thinking that they are dealing with actual records where all the fields are empty. Importing a file like that can lead to records with many missing values and distort summary statistics.

'ID','FIRST_NAME'\r\n
1,'Adam'\r\n
2,'Billy'\r\n
3,'Elizabeth'\r\n
\r\n
\r\n

A bad parser could read the above CSV and interpret the content as:

ID    FIRST_NAME
1     Adam
2     Billy
3     Elizabeth
NULL  NULL
NULL  NULL
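A minimal sketch of this failure mode using Python's standard csv module (the raw string mirrors the example above): the trailing blank lines come back as empty rows, which naive code may turn into all-NULL records.

```python
import csv
import io

# CSV content with two trailing blank lines, as in the example above
raw = "'ID','FIRST_NAME'\r\n1,'Adam'\r\n2,'Billy'\r\n3,'Elizabeth'\r\n\r\n\r\n"

rows = list(csv.reader(io.StringIO(raw), quotechar="'"))
print(rows)
# The two trailing blank lines are parsed as empty lists ([]) at the
# end of `rows` -- a careless consumer would treat them as records.

# A simple guard: drop rows that contain no data at all
clean = [row for row in rows if any(field.strip() for field in row)]
print(len(clean))  # header plus the three real records
```

Filtering on "any non-empty field" also catches lines that contain only separators or whitespace, not just completely blank lines.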

To avoid this, check the last few lines of the CSV file manually, for example with the command line tool tail on Linux or by opening the file in Notepad++ if it isn't too big. You can also check for NULL values in a column that should never be NULL. If you find such a record and all the other columns are NULL too, that is another indicator of trailing line breaks.

You can also read the file in as plain text and count the column separators in each line. If there are N columns, every line should contain at least N-1 column separators, i.e. , or ; characters. Any line with fewer separators is suspect and should be inspected and possibly excluded.
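This separator count can be sketched in a few lines of Python. The raw string mirrors the example above; the separator and expected column count are assumptions you would set for your own file.

```python
import io

raw = "ID,FIRST_NAME\n1,Adam\n2,Billy\n3,Elizabeth\n\n\n"
SEP = ","      # assumed column separator for this example
N_COLS = 2     # expected number of columns

suspicious = []
for lineno, line in enumerate(io.StringIO(raw), start=1):
    # A valid row with N columns needs at least N-1 separators
    if line.strip() and line.count(SEP) >= N_COLS - 1:
        continue
    suspicious.append(lineno)

print(suspicious)  # line numbers to inspect and possibly exclude
```

The check says "at least N-1" rather than "exactly N-1" because separators inside quoted text fields inflate the count.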

Takeaways:

  • Inspect the last few lines of a CSV file using the tail command or Notepad++

  • Check for rows where all values are NULL

  • Count the number of column separators in each row. With N columns, there should be at least N-1 occurrences of the column separator
