📈
beginners-guide-to-clean-data
  • A Beginner's Guide to Clean Data
  • Introduction
    • Foreword
    • The value of data
    • The intangible nature of data
  • Missing data
    • Missing values
    • Missing value patterns
    • Missing value representations
    • Missing observations
    • Truncated exports
    • Handling missing values
  • Data range problems
    • Unexpected values
    • Outliers
    • Freak cases
  • Common CSV problems
    • CSV basics
    • Quotation characters
    • Line breaks in text fields
    • Missing or insufficient headers
    • Trailing line breaks
    • Data export and import
    • Column type violations
    • Guidelines for working with CSV files
  • Text mining problems
    • Text mining basics
    • Encoding in your data and IDE
    • Special characters
    • Character entities
    • Lookalike characters
    • Dummy words
  • Type- and format-related problems
    • Inconsistent timestamp formats
    • Whitespace-padded strings
    • Binary data
    • Semi-structured log files
    • Proprietary data formats
    • Spreadsheets
  • Database-related problems
    • Numeric overflow
    • Duplicate rows
    • Table joins
    • Huge enterprise databases
    • Case sensitivity
    • Separating DDL and DML statements
    • Database performance considerations
    • Naming tables and columns
    • Poorly written SQL
    • Large monolithic SQL scripts
    • SQL orchestration
  • Data inconsistency
    • No single point of truth
    • Non-matching aggregated data
    • Internal inconsistency
  • Data modeling
    • Business concepts
    • Handling complexity
    • Interfaces
    • Generalized data models
    • Reproducibility
    • Feature stores and feature engines
    • Thinking pragmatic
  • Monitoring and testing
    • Automated testing
    • Measuring database load
  • Bonus content
    • Checklist for new data
Powered by GitBook
On this page

Was this helpful?

  1. Common CSV problems

Data export and import

Many software programs offer the option to import or export data. CSV files are usually the preferred format because they are simple, human-readable and easily accessible.

Analytics software like R has a very powerful CSV parser that allows you to import CSV files of any kind and account for several different parameters. You can specify the field separator, the row terminator, the decimal separator, the quotation character, the way characters in text fields are escaped or the file encoding, among other things. This allows you to import and export files in pretty much every CSV-like format.

The code block below shows the head of the read.table() function in base R with all the parameters that you can specify, along with their default values.

read.table(file, header = FALSE, sep = "", quote = "\"'",
    dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
    row.names, col.names, as.is = !stringsAsFactors,
    na.strings = "NA", colClasses = NA, nrows = -1,
    skip = 0, check.names = TRUE, fill = !blank.lines.skip,
    strip.white = FALSE, blank.lines.skip = TRUE,
    comment.char = "#", allowEscapes = FALSE, flush = FALSE,
    stringsAsFactors = default.stringsAsFactors(),
    fileEncoding = "", encoding = "unknown", text, skipNul = FALSE
)

However, most programs use parsers that are not this powerful and customizable. In many cases, they have fixed parameters, expecting for example that the field separator is always a comma and that quotation is always used for text fields. If you are lucky, the documentation is comprehensive enough to figure out what parameters the program's parser expects, but don't count on that. In that case, there is little you can do except rely on trial-and-error.

My main advice here is to become familiar with how your program works and learn the parameters it uses. Never just assume a certain format, always test your assumptions. When you are sharing a CSV file with someone, make sure to add all the information about the format that people will need to parse it. If you are writing a program that takes CSV files as input, document all your choices regarding the format so that people know what type of CSV your program accepts.

Takeaways:

  • Know the format that your CSV parser expects

  • When you give someone a CSV file, specify exactly what format you used

PreviousTrailing line breaksNextColumn type violations

Last updated 4 years ago

Was this helpful?