Line breaks in text fields
Besides quotation characters, text fields can also contain line breaks. While Microsoft Windows uses two characters to mark the end of a line, the carriage return character (CR, 0x0d or \r, respectively) and the linefeed character (LF, 0x0a or \n, respectively), all Unix-based operating systems just use the linefeed character. Since the line break \n is almost always used as a row terminator, it has a special meaning in CSV files and can therefore confuse a parser when it occurs within a text column.
Usually, wrapping text fields in quotation characters should help to avoid this kind of problem, but I have worked with CSV parsers that gave precedence to line breaks over quotes. When a parser hits a line break and does not recognize it as part of a string field, it will assume that it has reached the end of the current line and continue to read the next line. This will usually lead to parsing errors due to the inconsistent number of columns or inconsistent column types.
SENT_ON
MESSAGE_ID
MESSAGE_TEXT
2018-05-25
125
Send more bacon!
Best regards
Ben
2018-05-26
126
Houston, we have a problem!
'SENT_ON','MESSAGE_ID','MESSAGE_TEXT'\r\n
'2018-05-25',125,'Send more bacon!\r\nBest regards\r\nBen'\r\n
'2018-05-26',126,'Houston, we have a problem!'\r\n
Sometimes, if fixing the CSV file manually is an option, you can use regular expressions to identify line breaks that are not supposed to serve as row terminators. For example, if the first column always holds a date in the format YYYY-MM-DD, then you would expect four digits after every line break. On the other hand, if a line break occurs within a text field, it's unlikely that it is followed by four digits. Editors like Notepad++ (on Windows) allow searching with regular expressions and replacing the matches. If the file is small enough, you should avoid the 'replace all' option and do a stepwise search and replace, checking if the identified spurious line break is indeed not a row terminator. Of course, this is only a valid option for analyses that do not need to be repeated on a regular basis.
If manually editing the CSV file is not an option and you really need to automate this workaround, read the CSV file as a simple text file line by line and use regular expressions as described above to identify which lines are starting lines for a new record and which ones are just the continuation of the previous record. This problem will come up again later in this book when we talk about working with log files and the occurrence of multi-line logs.
Takeaways:
  • Use regular expressions to check CSV files for lines that do not start with the expected pattern, e.g. a data in a specific format
  • Check what type of line ending is used within your file, both as a row terminator and as an actual line ending in a quoted character string
Copy link