Whitespace-padded strings

All databases have some quirks. One of the more subtle and occasionally annoying quirks I have come across are fixed-length string types. This data type is often used for text columns where the data is known to have a constant or at least bounded length, for example when the column is a 32 character long alphanumeric hash.

The advantages of fixed-length string types are usually performance benefits and having a constraint that cannot be violated, which helps to create a dataset that is more resilient to human error.

When you know that all your text entries are of the same length, the use of these data types, if available, is strongly recommended. However, when you use a type like this and insert a string that is shorter than the fixed length, it will usually be padded with spaces. This becomes a problem when you are checking the value of a string since 'hello world' is not the same as 'hello world ' with five spaces at the end. Of couse, this problem of leading or trailing whitespace characters is not limited to fixed-length string types.

A quick way to check if you have this type of problem comes with the TRIM() function in SQL or trimws() in R. These functions allow you to remove leading and trailing white spaces. Simply check if TRIM(MY_COLUMN) = MY_COLUMN. If this evaluates to FALSE at any point, you have leading or trailing white spaces in one of your text fields. Please note that this doesn't work with all databases, because they will sometimes ignore the leading and trailing spaces. If you want to be on the safe side, check if LENGTH(TRIM(MY_COLUMN)) = LENGTH(MY_COLUMN). Alternatively, you can check if the string satisfies a regular expression like '^[ ]+' or '[ ]+$'. Be aware that the space character U+0020 is not the only type of whitespace out there. In text data, you will often find the non-breaking space character U+00A0, the tab character U+0009 or even more obscure characters. Therefore, it is safer to use regular expressions like '^[\s]+' or '[\s]+$'.

Takeaways:

  • When you are working with a text column, check if is a fixed-length data type

  • Use the TRIM() function to check for leading and trailing whitespace characters

Last updated