📈
beginners-guide-to-clean-data
  • A Beginner's Guide to Clean Data
  • Introduction
    • Foreword
    • The value of data
    • The intangible nature of data
  • Missing data
    • Missing values
    • Missing value patterns
    • Missing value representations
    • Missing observations
    • Truncated exports
    • Handling missing values
  • Data range problems
    • Unexpected values
    • Outliers
    • Freak cases
  • Common CSV problems
    • CSV basics
    • Quotation characters
    • Line breaks in text fields
    • Missing or insufficient headers
    • Trailing line breaks
    • Data export and import
    • Column type violations
    • Guidelines for working with CSV files
  • Text mining problems
    • Text mining basics
    • Encoding in your data and IDE
    • Special characters
    • Character entities
    • Lookalike characters
    • Dummy words
  • Type- and format-related problems
    • Inconsistent timestamp formats
    • Whitespace-padded strings
    • Binary data
    • Semi-structured log files
    • Proprietary data formats
    • Spreadsheets
  • Database-related problems
    • Numeric overflow
    • Duplicate rows
    • Table joins
    • Huge enterprise databases
    • Case sensitivity
    • Separating DDL and DML statements
    • Database performance considerations
    • Naming tables and columns
    • Poorly written SQL
    • Large monolithic SQL scripts
    • SQL orchestration
  • Data inconsistency
    • No single point of truth
    • Non-matching aggregated data
    • Internal inconsistency
  • Data modeling
    • Business concepts
    • Handling complexity
    • Interfaces
    • Generalized data models
    • Reproducibility
    • Feature stores and feature engines
    • Thinking pragmatic
  • Monitoring and testing
    • Automated testing
    • Measuring database load
  • Bonus content
    • Checklist for new data
Powered by GitBook
On this page

Was this helpful?

  1. Common CSV problems

Missing or insufficient headers

There is nothing more annoying than getting data without header names. This often happens when someone exports data from a software without checking the 'Include headers' check box. Also, some data storage formats store the data and the schema, i.e. the format of the data, separately from one another. You will often see this when you are working on a Hadoop-based data processing infrastructure and try to read the flat data files.

Luckily, this problem is easy to identify by manually looking at the first row of your data file and checking if there are header names.

If headers are missing and you are familiar with the data, you may guess what each column means, but most of the time, you will be completely lost. The only solution is to ask the person who provided the data to rerun the export with column names or to give you a list of the column names.

Aside from missing column names, you will also encounter cases where you have a column name but still cannot make any sense of it because it is some abbreviation that you are not familiar with. Some old databases even have restrictions on the length of the column name, enforcing the use of cryptic abbreviations. Others might simply enumerate some column ID and refer you to a manual or data dictionary where you can find what the ID stands for.

Choosing good names for your tables and columns is an art in and of itself. I will go into more detail in the chapter on database-related problems later in this book.

Takeaways:

  • When requesting data, explicitly ask for the header names to be included in the data

  • Ask for a data dictionary to avoid not knowing what a cryptic or abbreviated header name means

PreviousLine breaks in text fieldsNextTrailing line breaks

Last updated 4 years ago

Was this helpful?