Separating DDL and DML statements
A bad habit that I keep seeing in almost every SQL script written by junior data scientists is the refusal to use explicit
CREATE TABLEstatements. Within this type of statement, you define the table structure, i.e. the columns, column types, keys, indices and constraints. It's annoying because you need to think about the data type, the range of numeric columns, whether or not a column can contain NULL values, how to partition the table, whether or not you want to use compression algorithms on specific columns and many other technical details. This is stuff that you don't really want to spend time on when you're preparing data to be used in a model.
Most junior data scientists prefer implicit
CREATE TABLEstatements using the
CREATE TABLE AS SELECTsyntax where you create a table directly from a
SELECTstatement. This is convenient because the database will decide about the table structure and guess the appropriate column types. The downside is that this will almost always lead to suboptimal performance, especially on large datasets.
On numeric columns, the database will often make a safe guess, choosing bigger column types like
INTEGER(typically 4 Byte), when a much smaller data type like
BYTEINT(1 Byte) would have been sufficient. It is understandable that a 4x difference in the size of the data will negatively affect the performance of the database.
Another consideration is the
PRIMARY INDEX. This is a Teradata-specific concept, but there are comparable concepts in other relational databases like Microsoft SQL Server(R), Oracle(R) and MySQL. The primary index, which can consist of one or more columns, determines how Teradata distributes the data records across the worker nodes of the database. This is important for JOINs and affects the execution plan that the database creates before executing a query. When two tables have similar or the same primary index, JOINs are much faster because less data needs to be moved around between the worker nodes.
Another previously mentioned performance-relevant concept is called partitioning and exists in all relational databases. When you select a subset of data, the database would usually need to scan the entire table, check the filter condition from the
WHEREstatement and return the selected subset. Creating a table as partitioned by a specific column or set of columns allows it to skip this complete table scan. Imagine a partitioned table as if the records were stored in multiple different labeled directories. If a directory doesn't match the filter condition, the database can completely discard it without ever having to look at it. Whether there are ten or ten billion records in the directory doesn't make a difference! When you're creating large tables, partitioning by well-selected columns, often date columns, can significantly speed up the later use of the table. But when you create a table using the
CREATE TABLE AS SELECTsyntax, the database will by default not use partitioning.
For both concepts - the primary index and partitions - be careful to choose a column with a large number of distinct values and/or a wide distribution. If the partition column only has two distinct values, often times the performance benefits will be lost. The same goes for the primary index concept in Teradata. I once came upon a table that had only about 10 GB of data but took up 3.4 TB of memory because the primary index happened to be a binary variable that was 0 for more than 90% of the records.
Even worse than the performance problem of implicitly created tables is the potential danger of errors. The database could create a column with a data type that is too small to hold some of the values. This can apply to both numeric and text columns. Especially text columns can become a problem if you later insert new records to a table where the text is longer than the longest text known to the table up until then. This will often lead to truncated string data because the database didn't expect having to deal with longer strings.
On top of that, implicit
CREATE TABLEstatements will leave the database in a weird state if the query ever aborts due to syntax errors or load problems, both of which WILL eventually happen if you repeatedly run the query in a production-like environment. If your script aborts somewhere in the middle of execution, some tables have already been created while others have not. This means that it's not always possible to simply restart the script from the beginning because it may run into a 'table already exists' error.
It is also possible that your database user will not have
CREATE TABLErights in a productive environment. All you can do is
DELETEthe contents of a table and
INSERTnew data into the table. With this limitation, using the
CREATE TABLE AS SELECTsyntax is not possible.
I personally recommend keeping the DDL statements (
CREATE TABLE) and the DML statements (
UPDATE) in separate scripts. That way, you can simply run your DML script without having to worry about it creating or deleting any tables. At the same time, you have a script with all relevant DDL statements that you can quickly use to create all relevant tables for your workflow, even if somebody has completely wiped your database.
- Avoid implicit create table statements, even if they are convenient.
- Write explicit
CREATE TABLEstatements in a separate script.