CSV basics

CSV (column-separated values) files are one of the most widely used formats to store and exchange data. They are plain text files that contain tabular data where the values of different columns are separated by a delimiter, usually a comma or a semicolon. Each row of data represents a row in the data table and is usually terminated with a carriage return and a line break character. The below example shows the content of a small CSV file representing a table with three rows and three columns. The carriage return character \r and the line break character \n have been explicitly printed for clarity.

"John Smith",42,82000.00\r\n
"Michael Baker",35,71450.50\r\n
"Miranda Watts",40,90000.00\r\n

A major problem with the CSV format is that there are no universal standards for the details of the format and every program uses a slightly different combination of parameters. There are at least a dozen different combinations in the choice of the column delimiter, the decimal separator, the row terminator, the quotation character, whether a header is provided and how to escape which characters. As an example, consider the following CSV file that has the same data as the first one, but in a different format:

John Smith;42;82,000\n
Michael Baker;35;71,450.5\n
Miranda Watts;40;90,000\n

When somebody sends you a CSV file, you should always confirm that your assumptions about the structure of the file hold up. Simply displaying the first few lines of a CSV file should tell you most of what you need to know. In the above example, you can quickly see that the semicolon ; is used to separate columns, rows are terminated with \n and no quotation characters are being used.

This chapter will show you the most common pitfalls that you will encounter when working with CSV files. Since these files are also plain text files, many of the issues in the chapter on text data, e.g. encoding-related problems or lookalike characters, also apply here.

Last updated