NAvalues in a data frame in R. Sometimes, empty fields are encoded by specific values that represent the missing data point. For example, an empty string, that is, a string of length zero, can represent a missing string value or it can simply be an empty string with no characters. To make things even more complicated, some databases automatically convert zero-length strings to NULL while others treat them as different objects.
'empty', among others. It's always a good idea to calculate counts for all values of a text column, look at the most frequent values and check for patterns like the above-mentioned. I also recommend checking how your database treats zero-length strings and do a quick check on how often they occur in your dataset.
' '. They are not of length zero, but they are effectively empty and could also be considered missing values.
1970-01-01this is a commonly used fallback value when a date field is empty. Another value that should catch your attention is January 1st, 1900 (
1900-01-01), which is - or used to be - the smallest possible date in Microsoft Excel. When you encounter this value, you can almost always be sure that this is not the actual value of the field, but some fallback due to missing data.
START_DATE = 2016-03-01to
END_DATE = 2018-08-31. Another product may have a
START_DATE = 2017-05-01and still be sold in your shop, so
END_DATEshould be empty, i.e. NULL. The problem with this is that you may want to select all currently valid products by using the condition
WHERE CURRENT_DATE BETWEEN PRODUCT.START_DATE AND PRODUCT.END_DATE. This obviously doesn't work because
END_DATEis NULL. Sometimes, people solve this problem by setting
END_DATEto some unrealistically high value like
9999-12-31. Even though you may often see this in practice, it's not best practice. After all, this is how the millennium bug, also known as Y2K bug, came to be.
NAin the case of R). Using values to stand in for missing data introduces a level of implicit knowledge required to work with the data. While it may be obvious to you that
CUSTOMER_AGE = -1represents a missing value, it will be very confusing to someone who works with that data for the first time.