Huge enterprise databases
For a data analyst, every new dataset is a challenge. You have to understand the business context, the technicalities of how the data was collected, and all of the intricacies within the data to be able to use it in a meaningful way. This is an iterative process that takes a while. You will almost always discover something unexpected or something that you don't understand in the data. You will need to consult people on the business side of things or talk to other people who are familiar with the data in order to understand what's going on.
This is particularly challenging when you're not just working with a single data table, but with a complete database containing thousands of tables with tens of thousands of columns. This is very common when you start working for a large company, but also in medium-sized companies that are strongly data-driven.
How do you even get started in this situation? Just do what you as a data scientist would do in any other situation: do some data exploration. But instead of looking at the actual data within the tables, look at the metadata of the database like the list of databases or tables.
- How many databases/tables are there?
- What are their names? Can you guess what each database/table is for?
- Are there comment strings that describe what purpose each database/table serves?
- Are there any groups of databases/tables that belong together?
- Which ones are the largest databases/tables in terms of space in memory?
- How many tables does each database contain?
- How is the ratio of tables vs. views within each database?
- How large are they relative to each other? A tree map is a good visualization tool to compare the databases/tables by size or the databases by the number of tables/views within them.
- Are there any hierarchies or staging concepts that you can derive from the database names, e.g. DWH and DEV_DWH?
- Which tables contain the 'raw' data, which ones contain aggregations?
- Which tables contain master data like data on products, suppliers etc. Which ones contain transactional data like sales data, website tracking data or server logs?
Not all the databases will be relevant to your daily work, but it's always a good idea to take a look at the bigger picture of you company's data ecosystem to learn what data is available. In order to zoom in, ask your boss or colleagues which databases are most relevant for your daily work so you can take a closer look on the tables in those.
It's helpful to build your own database administration report from the metadata so that you can occasionally go back to look at the big picture view. Usually, this report will be a snapshot of the current data landscape as the metadata will only be available for the current status of the database. If you're administrating part of the database like your data scientists' sandbox or a project database, you might be interested in the change of the metadata over time. You might want to know who created or deleted which table at what time or how the volume of a table developed in the last few weeks. In this case, consider setting up an automated process that takes a snapshot of all relevant metadata tables every day. You can also ask your database administrators if historic metadata is available. They might have a toolbox database with additional metadata tables for database administration purposes.
I'm not going to lie, there's no way to gain a complete understanding of all the data from simply looking at the metadata. In a large company, you will be confronted with new tables that you've never looked at even years after you started. Furthermore, the only way to get enough insight into the intricacies of the data is to actually work with it and talk to people who know how it works.
However, imagine your database administration report as map of a new city you haven't been to before. It will tell you which parts of the city you can go to for shopping opportunities, where the people live, where they work and where the parks, rivers and lakes are. It will NOT tell you the location of every ice cream parlor. It will also NOT tell you where you can get the best ice cream in town and which places to avoid. In order to learn this, you have to go to the areas outlined on the city map and explore them for yourself or ask around.
Typical metadata tables in some commonly used relational database systems:
- Table of all databases: DBC.DATABASESV
- Table of all tables: DBC.TABLESV (includes views)
- Table of all columns: DBC.COLUMNSV
- Size of the tables: DBC.TABLESIZEV
- Query log: DBC.QRYLOGV
MICROSOFT SQL SERVER:
- Table of all databases: SYS.DATABASES
- Table of all tables: SYS.TABLES, SYS.VIEWS
- Table of all columns: SYS.COLUMNS
- Size of the tables: … It's complicated ;)
- Query log: SYS.DM_EXEC_CACHED_PLANS
- Table of all databases: via SELECT DISTINCT OWNER FROM ALL_OBJECTS
- Table of all tables: ALL_TABLES, ALL_VIEWS
- Table of all columns: ALL_TAB_COLUMNS
- Size of the tables: … It's complicated ;)
- Query log: Not available inside of the database
- Explore a new data environment by exploring the metadata that every database collects