Trailing line breaks

Trailing line breaks are usually a minor problem, but it is useful to be aware of them. Some CSV files have a few additional blank lines at the end. This can confuse some CSV parsers into thinking that they are dealing with actual records where all the fields are empty. Importing a file like that can lead to records with many missing values and distort summary statistics.

'ID','FIRST_NAME'\r\n
1,'Adam'\r\n
2,'Billy'\r\n
3,'Elizabeth'\r\n
\r\n
\r\n

A bad parser could read the above CSV and interpret the content as

ID

FIRST_NAME

1

Adam

2

Billy

3

Elizabeth

NULL

NULL

NULL

NULL

To avoid this, check the last few lines of the CSV file manually, for example with the command line tool tail on Linux or by looking at the CSV file in Notepad++, if it isn't too big. You can also check for NULL values in a column that should absolutely not be NULL. If you find such a record and all the other columns are NULL too, that is an indicator for trailing line breaks as well.

You can also read the file in as a plain text file and calculate the number of characters in every line. If there are N columns, there should be at least N-1 column separators, i.e. , or ; characters. Any column with less characters is obviously an error and should be inspected and possibly excluded.

Takeaways:

  • Inspect the last few lines of a CSV file using the tail command or Notepad++

  • Check for rows where all values are NULL Count the number of column separators in each row. With N columns, there should be at least N-1 occurrences of the column separator

Last updated