Guidelines for working with CSV files

As you saw in this chapter, most problems that occur when reading CSV data stem from a difference between the expected and the actual format. This can happen in the first row of the data, e.g. when the CSV parser expects a different field separator, or a million lines later, when some obscure entry contains a strange character that you did not expect.

The key takeaway from this chapter is to always check your assumptions about the format and to get familiar with the error messages of your CSV parser. Be aware, though, that parsing problems do not necessarily produce error messages. Always check the imported data for consistency and plausibility to make sure it has been read correctly.
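
For illustration, here is a minimal sketch of such a defensive import in Python with pandas. The file name orders.csv and the expected columns ORDER_ID and NET_REVENUE_IN_EUR are assumptions made up for this example:

```python
import pandas as pd

# Read the file with explicit format assumptions instead of relying on defaults.
df = pd.read_csv(
    "orders.csv",      # hypothetical file name
    sep=",",           # the field separator we expect
    quotechar='"',     # the quotation character we expect
    encoding="utf-8",  # the encoding we expect
)

# Plausibility checks: silent parsing problems often show up here
# rather than as error messages.
assert list(df.columns) == ["ORDER_ID", "NET_REVENUE_IN_EUR"], "unexpected columns"
assert len(df) > 0, "no rows were read"
assert df["NET_REVENUE_IN_EUR"].dtype.kind == "f", "revenue column is not numeric"
print(df.describe())  # eyeball ranges and counts for plausibility
```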

On the other hand, when you're the one creating the CSV file to pass it on to some person or software, try to stick to these guidelines unless you have a good reason to deviate from them:

  • Include a header row with sufficiently descriptive column names. Avoid abbreviations and include units where relevant. Bad: NTRV. Good: NET_REVENUE_IN_EUR.

  • Accompany your CSV file with a data dictionary detailing the way each column was measured, which unit was used, how missing values have been encoded, etc.

  • Use quotation characters for text fields. I recommend using double quotes. Make sure that the quotation character is properly escaped in the text fields.

  • Pick a decimal separator that is different from the field separator. I recommend using commas as field separators and points as decimal separators. Avoid digit group separators.

  • Compare the number of records before export with the number of lines in the CSV file and investigate any discrepancy. (The sketch after this list includes such a check.)
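
To make these guidelines concrete, here is a minimal sketch of an export that follows them, using Python's standard csv module. The records, column names, and the file name revenue.csv are assumptions made up for this example:

```python
import csv

# Hypothetical records to export; the column names follow the guideline of
# descriptive headers with units.
records = [
    {"CUSTOMER_NAME": 'Miller, John "JJ"', "NET_REVENUE_IN_EUR": 1234.56},
    {"CUSTOMER_NAME": "Schmidt, Anna", "NET_REVENUE_IN_EUR": 99.90},
]

with open("revenue.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(
        f,
        fieldnames=["CUSTOMER_NAME", "NET_REVENUE_IN_EUR"],
        delimiter=",",                 # comma as field separator
        quotechar='"',                 # double quotes around text fields
        quoting=csv.QUOTE_NONNUMERIC,  # quote text fields, leave numbers bare
    )
    writer.writeheader()
    writer.writerows(records)

# Last guideline: compare the record count with the line count
# (the header row adds one line).
with open("revenue.csv", encoding="utf-8") as f:
    line_count = sum(1 for _ in f)
assert line_count == len(records) + 1, "record/line count mismatch"
```

Setting quoting=csv.QUOTE_NONNUMERIC quotes every text field while leaving numbers bare, and the module's default doublequote=True escapes embedded quotation characters by doubling them, which covers the escaping guideline without extra code.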
