A Beginner's Guide to Clean Data
SQL orchestration
This chapter is under construction. BG, 19.02.2021
##### SQL Execution Framework
##### FOCUS ON DATA QUALITY HERE