Quotation characters
The idea of comma-separated values seems simple: Just put a comma between each value from a table row so that a machine can read the data automatically row by row, putting each entry into the correct column in your data table. However, let's say you're using a comma as the field separator. How do you deal with a text field that contains this character?
The usual way is by using quotation characters like single quotes ' or double quotes " around text fields. Any sufficiently intelligent parser software trying to read the CSV file can use these quotes to determine whether a given comma is a field separator or part of a text field. After reading an opening quotation character, it simply has to ignore all commas until it hits the closing quotation character.
For this reason, many CSV parsers are configured to expect quotation marks around text fields by default. Usually, they will still work if they find a text field without quotation marks, but what if one of the text fields somewhere down the line contains a single quotation mark? In this case, the CSV parser will continue to read everything up to the next quotation mark, possibly everything up to the very end of the file, as if it was a quoted text field. It will also typically skip all line breaks, assuming that they are also part of the quoted text string that it thinks it is reading. This will usually lead to parsing errors as the number of columns read in the problematic row will be different from previous rows or the parser will reach the end of the file before finding a closing quote.
Sometimes, the parser will continue to parse the file up to the end. You should always verify that the number of lines read equals the number of records that you expect - at least if you know it beforehand. If it doesn't, chances are that a stray quotation character messed up your parsing and swallowed multiple lines. Also, check for unexpectedly long text fields as they may be parts of the CSV file that got interpreted as one long quoted string. Simply use a function like LENGTH() in SQL or nrow() in R and get the maximum length of the entries in the text field.
Wrapping text fields into quotation marks is a good idea to avoid problems with characters that have a special meaning in the CSV format: the column separator, the decimal separator or the line break character(s), for example. When wrapped in quotation marks - either single quotes or double quotes, depending on your CSV parser - they lose their special meaning and can be used freely without having to worry about the parser seeing them as instructions. Unfortunately, this won't work for the quotation character itself. After all, how is the parser supposed to know whether the quotation character is part of the text field or whether it is supposed to end the text field? That is why the quotation character needs to be escaped when using it within a text field. There are typically two ways to do this: either by writing two quotes back to back ("") or by putting a backslash character right in front of the quote (\"). So the value
He said 'Hi!'
could be represented in one of the following ways in a CSV file.
'He said ''Hi!'''
'He said \'Hi!\''
in a CSV file. Both methods are equally valid, so you need to verify your assumptions about which escape method was used in the data you are currently working with. If you create a CSV file yourself, make sure to tell the people who will be working with it which method you or your parser chose.
Takeaways:
  • Check for unexpectedly long text fields. Sometimes, a spurious quotation mark can cause the parser to read several rows from the CSV as part of a single text field
Copy link