No single point of truth

The 'single point of truth', also known as 'single source of truth', is concept from data warehousing that requires data to be stored in such a way that no element is stored more than once. This forces a level of consistency onto the data because there can - in theory - be no disagreement between two different storage points.

While almost no ensemble of IT systems and databases can claim to stick to this practice 100%, there should at least be a certain level of consistency within the data. Your task as a data scientist is usually not to assure consistency within the wider company IT infrastructure, but to take precautions to keep at least your teams' data store consistent. There are a number of ways this can go wrong.

Obviously, the easiest way to mess up and create inconsistent data is by duplicating data, especially if you also apply filters and transformations. When people discover your table later, they won't know how you derived the data from the original table. If you applied filters, they will usually notice that you table has fewer records, but they won't know what you did with the data to obtain this particular subset of data. Therefore, avoid duplicating data as much as you can.

If you ever need a different perspective on the same data, create views whenever feasible. That way, everybody can simply look at the view definition to find out how exactly you came from the original data to the data in the view.

If you are working in a contained environment like a sandbox database, determine the master data tables that you want to attach yourself to. Don't copy any tables to your sandbox, but try to fetch the data from the master data tables any time you need them. This will ensure that you always get the 'fresh' and recent records. If you only need parts of the data and you need to apply some filters, build this logic into a view on the original data so that everybody else can simply use this view. If this is not an option for whatever reason, set up an ETL that runs daily and creates your own, filtered version of the master data table.

I would also recommend to not create any tables 'on the go', i.e. without saving the scripts that create the table. A table without the instructions to build it, i.e. the SQL script that created it, is like a very complicated machine without a manual. You may be able to get it to work somehow, but you will never know if you are using it correctly. And if the table ever gets deleted, you will have no way to get it back. Every table in your database should be the end result of a script, a workflow or an ETL that creates it and can be rerun at any time if necessary.

Takeaways:

  • Do not duplicate data.

  • Avoid copying data to your sandbox environment.

Last updated