Numeric overflow
I have named this chapter 'Numeric overflow' because that is the name of the error that occurs when the value of a numeric column exceeds the upper or lower limit allowed by the chosen data type. The typical 32-Bit (4-Byte) signed Integer[1] can hold values from βˆ’2.147.483.648 to 2.147.483.647. This sounds like a lot at first, but when the data you are working with has more than 3 Billion rows, even a simple COUNT(*) can quickly lead to a numeric overflow.
That is the reason why I always chose the biggest possible data type when I first started working with data. That way, getting a numeric overflow was almost impossible. I couldn't even understand why the developers of database software decided to include any smaller data types. Why would somebody want to use a data type that could lead to problems when the value it holds gets too big?
This worked well for me, until I started to work with very large datasets. When a database table holds more than a billion rows, you will start to notice the performance difference between choosing an 8-Bit signed Integer (-128 to 127) or a 32-Bit signed Integer. In a table with 5 Billion rows, a 32-Bit Integer column will require 18.6 GB of storage space while a 8-Bit Integer will require only about 4.7 GB. Not only does the 32-Bit Integer use more storage, it also slows down every query that includes this column. Data must be read and written back and forth between the hard drive and the main memory, so naturally a 75% volume reduction will speed things up.
Typically, the performance effect is rather small. In a complex data pipeline, there are hundreds of possible ways to unintentionally make it less efficient and introduce performance problems far bigger than oversized data types. But once you have eliminated the big problems, avoiding unnecessarily large column data types will give you a nice speed boost.
The downside is the potential numeric overflow error. As already mentioned in the chapter on data range problems and outliers, occasional freak cases will violate your expectations on how big certain attributes can get. Capping it at a certain value will help prevent numeric overflows. If 99% of your customers spend less than 100.000 EUR per year at your company, capping the feature CUSTOMER_REVENUE_LAST_YEAR at 100.000 EUR seems reasonable. You can do this by combining the GREATEST() and LEAST() functions (Teradata syntax):
SELECT
CUSTOMER_ID
, GREATEST(0, LEAST(100000, SUM(REVENUE))) AS CUSTOMER_REVENUE_LAST_YEAR
FROM SALES_DB.SALES_FROM_LAST_YEAR
GROUP BY CUSTOMER_ID
;
This will allow you to store CUSTOMER_REVENUE_LAST_YEAR in a DECIMAL(9,3) field that takes up only 4 Byte. Without the use of the capping functions, a customer with 1.000.000,00 EUR of revenue would have caused a numeric overflow error. You would have been forced to use a DECIMAL(10,3) or bigger, which would use 8 Byte per record. Of course, this is only an option when the exact value of the revenue per customer is not critically important. In a financial report, just capping the revenue would be very problematic. When using it as a feature in a machine learning product, it typically is not.
Takeaways:
  • Always use the smallest possible data type
  • Protect yourself against numeric overflow errors by capping the values of features where the exact value is not important
[1] Signed means that one of the bits is used to indicate whether the number stored in the 32 bits is positive or negative. See also https://en.wikipedia.org/wiki/Integer_(computer_science)​
Copy link