Case sensitivity

Case sensitivity is a pretty straight-forward problem that is easily solved, but only if you know that you have a problem. It's a typical problem with text data, but it can also cause you headaches in SQL scripts.

The common situation for case sensitivity to bite your ass is in filtering data. When your data is segmented into categories, you might only be interested in one of them, so you're writing a SQL query to retrieve the data, e.g. using WHERE DEPARTMENT = 'Finance'. But what if there are some records with DEPARTMENT = 'finance'? Or DEPARTMENT = 'FINANCE'? In the best case, you don't get any results and immediately realize that you made a mistake. In the worst case, you actually do get results, but you miss a lot of records WITHOUT REALIZING IT.

As always, you need to check your assumptions. Do a GROUP BY DEPARTMENT to get all distinct values for DEPARTMENT. This will typically let you see whether there are any other versions of the category 'Finance' that you need to be aware of. Please note that this can change in the future, so even better than checking the distinct values once would be to do a simple pre-processing on the text field. You can simple use WHERE UPPER(DEPARTMENT) = 'FINANCE', which will be true for 'Finance', 'finance', 'FINANCE' and even 'FInance'.

While I recommend to us this method as a safety measure, it IS just another example of fighting the symptoms instead of getting to the root of the problem. When you come across this type of data quality problem, you should fix it as early as possible in the data pipeline. If you can't fix it in the source system that generates that data, at least fix it at the beginning of the pipeline to avoid unwanted side effects.

Another challenge related to case-sensitivity is object naming in databases. This is a problem that I came across in a Teradata database where I experimented with different SQL formatting styles. I decided to go with upper case for keywords and lower case for all database names, table names and column names. This seemed like the best code style for readability as it clearly separated reserved keywords and custom database and table names. Since Teradata SQL is case insensitive, this should not pose a problem, right? As a matter of fact, it did cause me problems a while later. I had created tables using this code style, for example:

CREATE TABLE mydb.monthly_sales AS (
    SELECT
      product_id
    , month_id
    , SUM(sales) AS sales
    FROM mydatabase.sales
    GROUP BY
      product_id
    , month_id
 ) WITH DATA
 UNIQUE PRIMARY INDEX (product_id, month_id)
 ;

When accessing the table, you can use SELECT sales FROM mydb.monthly_sales or SELECT SALES FROM MYDB.MONTHLY_SALES. As mentioned before, SQL is case insensitive, so it doesn't make a difference. However, when I tried to query the table that contains all column names for all tables, DBC.TABLESV, I couldn't find the column named 'SALES'. The following query returned zero rows.

SELECT
  databasename
, tablename
, columnname
FROM dbc.columnsv
WHERE
    databasename = 'MYDB'
    AND tablename = 'MONTHLY_SALES'
    AND columnname = 'SALES'
;

The reason is that internally, the database stores the table and column names just like you entered them, including the upper or lower case. So the actual values were TABLENAME = 'monthly_sales' and COLUMNNAME = 'sales'. This means that your code style - that should be free for you to choose - actually influences how the database stores the column names. This is very unfortunate and can lead to some annoying side effects, so I recommend writing queries in all upper case to avoid this problem.

Takeaways:

  • When you filter on text data, check for variations in spelling

  • When filtering database metadata, be aware that different coding styles can influence the metadata, e.g. lower case letters in table names

Last updated