Truncated exports

Truncated exports are a common problem in software programs that handle large amounts of data. To avoid long loading times that would ruin the usability of a frontend, software developers like to truncate data. This is a reasonable solution that works well in most use cases: most of the time a user runs the export function, they just want a few hundred rows of data to feed into some Excel report. The developers don't anticipate a situation where the user would want to export millions of rows or even the complete table, which is why truncation is often activated by default. On top of that, many programs don't even issue a warning to inform the user about the truncation. They simply limit the output to a predefined number of rows and claim to have finished the export successfully.

The same goes for many APIs that return data upon request, such as REST interfaces. APIs are often built to answer very specific queries and return a small number of results, so the default export limits are usually much smaller than in server or desktop applications. This may be fine for most use cases, but exporting a complete dataset with millions or billions of records is a completely different use case that the developers often didn't have in mind and therefore did not account for.
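
If the API exposes paging parameters, you can work around the default limit by requesting the data page by page until the last page comes back incomplete. The following is a minimal sketch, assuming a hypothetical endpoint that accepts offset and limit query parameters and returns a plain JSON list of records:

```python
import requests

def export_all_records(url, page_size=1000):
    """Page through a REST endpoint until it returns fewer rows than requested."""
    records = []
    offset = 0
    while True:
        response = requests.get(url, params={"offset": offset, "limit": page_size})
        response.raise_for_status()
        page = response.json()  # assumed: the endpoint returns a plain JSON list
        records.extend(page)
        if len(page) < page_size:  # last (possibly partial) page reached
            break
        offset += page_size
    return records

rows = export_all_records("https://example.com/api/v1/records")
print(f"Exported {len(rows)} records")
```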

When you rely on the export function of a program you are not familiar with, check whether it restricts the number of observations it can export. If you are requesting data from the users of such a program, ask them to make sure that any export-related restrictions are turned off before they run the export.

When you inspect the exported data, count the number of records. If your data has exactly 10,000 or 1,000,000 records, it is very likely that you have hit an export limit and need to go back and fix that before continuing with your analysis. Even if your output does not end on such a round number, check the last record to see if it is complete. Some programs truncate not by the number of records but by the number of bytes in the output, which can cut the export off right in the middle of a record once the byte limit is exceeded.
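
A small script can automate both checks. The sketch below assumes a hypothetical export.csv with a header row, comma separators, and no quoted fields containing commas or line breaks; it flags a suspiciously round record count or a last line with the wrong number of fields:

```python
SUSPICIOUS_COUNTS = {10_000, 100_000, 1_000_000}

def looks_truncated(path, expected_columns):
    """Flag an export that ends on a suspiciously round record count
    or whose last line does not have the expected number of fields."""
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    record_count = len(lines) - 1  # minus the header line
    last_line_fields = len(lines[-1].split(","))
    return record_count in SUSPICIOUS_COUNTS or last_line_fields != expected_columns

if looks_truncated("export.csv", expected_columns=12):
    print("Warning: export.csv may be truncated")
```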

Another potential source of errors are failed exports. Sometimes an export job fails and needs to be restarted. Depending on the software, this can produce duplicate records when the failed job didn't clean up after itself and the restarted job simply appends its records to the output of the failed run.

In an ideal world, the person running the export would make sure this does not happen, but in reality they often don't think about it. On top of that, exporting large amounts of data often means running dozens of export jobs in parallel, which divides the attention and time that can be dedicated to each individual job.
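
A quick scan for fully identical rows catches the most obvious cases of such accidental duplication. A minimal sketch with pandas, again assuming a hypothetical export.csv:

```python
import pandas as pd

# Fully identical rows can be a symptom of a restarted export job
# appending its output to the leftovers of a failed run.
df = pd.read_csv("export.csv")
duplicates = df[df.duplicated(keep=False)]

if not duplicates.empty:
    print(f"{len(duplicates)} rows in export.csv appear more than once")
```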

The safest solution is to get a record count from within the software program itself and compare it to the number of records in the exported files. If the counts differ, chances are high that something went wrong during the export.
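
When the source system offers a SQL interface, this comparison takes only a few lines. The sketch below assumes a hypothetical SQLite file source.db with an orders table and a corresponding orders_export.csv:

```python
import sqlite3
import pandas as pd

# Count the records at the source and in the exported file, then compare.
source_count = (
    sqlite3.connect("source.db")
    .execute("SELECT COUNT(*) FROM orders")
    .fetchone()[0]
)
export_count = len(pd.read_csv("orders_export.csv"))

if source_count != export_count:
    print(f"Count mismatch: {source_count} at the source, {export_count} in the export")
```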

Takeaways:

  • Compare the expected number of records in the software with the actual number of records in the export file

  • Be suspicious if the record count is a round number such as 10,000 or 1,000,000

  • Check the last record for completeness
