A Beginner's Guide to Clean Data

Practical advice to spot and avoid data quality problems. - Benjamin Greve

Last updated 5 years ago

This is a free version of my book "A Beginner's Guide to Clean Data: Practical advice to spot and avoid data quality problems". If you like the content, feel free to buy this book on Amazon and/or leave a positive review there.

Summary

This book will help you become a better data scientist by showing you the things that can go wrong when working with data, particularly low-quality data. A key difference between a junior and a senior data scientist is the awareness of potential pitfalls. The experienced data scientist will expect them, navigate around them and avoid costly iteration cycles. After reading this book, you will be able to spot data quality problems and deal with them before they can break your work, saving yourself a lot of time.

In the past six years of working in data science, I have made all the mistakes described in this book. Every time, it cost me hours, sometimes days, to figure out what the problem was and to fix it. This type of iterative work is what data scientists mean when they say that they spend most of their time on data preparation. Yet, for some reason, the art of preparing data and ensuring a sufficiently high level of quality is largely ignored by textbooks, university programs, online courses and industry conferences. That's why I felt the need to write this book and share some of my experiences. It is the hands-on advice that I wish I had had when I started my career as a data scientist.

About the author

My name is Benjamin Greve and I'm a data scientist, mathematician and clean data enthusiast from Germany. I teach machines how to solve complex tasks so that people can concentrate on the things that really matter.

To contact me, use one of the following channels:

  • Website: https://benjamingreve.wordpress.com/

  • Twitter: @benjamingreve

  • LinkedIn: www.linkedin.com/in/bgreve

  • Email: greve.professional@gmail.com

Disclaimer

Copyright (c) 2019 Benjamin Greve. All rights reserved.

Some of the advice from this book is opinion-based. If you disagree or your personal experience differs from mine, feel free to contact me on any of the above-mentioned channels. I'm constantly learning new things and I'd love to discuss all things data with you.