Truncated exports
Truncated exports are a common problem in many software programs that handle large amounts of data. In order to avoid long loading times that would ruin the usability of a frontend, software developers like to truncate data. This is a reasonable solution that works well in most use-cases. Most of the time a user uses the export function, the just want a few hundred rows of data to feed it into some Excel report. The developers don't anticipate a situation where the user would want to export millions of rows or even the complete table. That is why truncation is often activated by default. On top of that, many programs don't even issue a warning to inform the user about the truncation. They simply limit the number of returned result rows to a pre-defined number and claim to have successfully finished the export.
The same goes for many APIs that return data upon request like REST interfaces. APIs are often built to answer very specific queries and return a small number of results, so the default export limits are usually much smaller than in server or desktop applications. This may be fine for most use-cases, but when you intend to export a complete dataset with millions or billions of records, that represents a completely different use-case that the developers often didn't have in mind and subsequently did not account for.
When you rely on the export function of a program that you are not familiar with, make sure to check if it has any restrictions regarding the number of observations it can export. If you are requesting data from the users of a program that handles data, ask them to make sure that any export-related restrictions are turned off before they run the export.
When you inspect the exported data, count the number of records. When your data has exactly 10.000 or 1.000.000 records, it is very likely that you have hit an export limit and need to go back and fix that before continuing with your analysis. Even if your output does not have such an even number of records, make sure to check the last record to see if it is complete. Some programs truncate not by the number of records but by the number of bytes of the output. This can lead to the export ending right in the middle of a record because the export exceeded the total byte limit.
Another potential source of errors are failed exports. Sometimes, an export job fails and needs to be restarted. Depending on the software, there is the potential for duplicate records when the export job didn't clean up after failing and the restarted export job simply appends its records to the output from the failed job.
In an ideal world, the person running the export would make sure that this does not happen, but in reality, he or she often doesn't think about that. On top of that, when you are exporting large amounts of data, it is often necessary to run dozens of export jobs in parallel, which divides the attention and the amount of time that can be dedicated to every single job.
The safest solution is to get a record count within the software program itself. This record count can be compared to the number of records in the exported files. If the counts are not equal, chances are high that something went wrong during the export.
Takeaways:
  • Compare the expected number of records in the software with the actual number of records in the export file
  • Be suspicious if there is an even number of records
  • Check the last record for completeness
Copy link