CREATE TABLE
statements. Within this type of statement, you define the table structure, i.e. the columns, column types, keys, indices and constraints. It's annoying because you need to think about the data type, the range of numeric columns, whether or not a column can contain NULL values, how to partition the table, whether or not you want to use compression algorithms on specific columns and many other technical details. This is stuff that you don't really want to spend time on when you're preparing data to be used in a model.CREATE TABLE
statements using the CREATE TABLE AS SELECT
syntax where you create a table directly from a SELECT
statement. This is convenient because the database will decide about the table structure and guess the appropriate column types. The downside is that this will almost always lead to suboptimal performance, especially on large datasets.INTEGER
(typically 4 Byte), when a much smaller data type like BYTEINT
(1 Byte) would have been sufficient. It is understandable that a 4x difference in the size of the data will negatively affect the performance of the database.PRIMARY INDEX
. This is a Teradata-specific concept, but there are comparable concepts in other relational databases like Microsoft SQL Server(R), Oracle(R) and MySQL. The primary index, which can consist of one or more columns, determines how Teradata distributes the data records across the worker nodes of the database. This is important for JOINs and affects the execution plan that the database creates before executing a query. When two tables have similar or the same primary index, JOINs are much faster because less data needs to be moved around between the worker nodes.WHERE
statement and return the selected subset. Creating a table as partitioned by a specific column or set of columns allows it to skip this complete table scan. Imagine a partitioned table as if the records were stored in multiple different labeled directories. If a directory doesn't match the filter condition, the database can completely discard it without ever having to look at it. Whether there are ten or ten billion records in the directory doesn't make a difference! When you're creating large tables, partitioning by well-selected columns, often date columns, can significantly speed up the later use of the table. But when you create a table using the CREATE TABLE AS SELECT
syntax, the database will by default not use partitioning.CREATE TABLE
statements will leave the database in a weird state if the query ever aborts due to syntax errors or load problems, both of which WILL eventually happen if you repeatedly run the query in a production-like environment. If your script aborts somewhere in the middle of execution, some tables have already been created while others have not. This means that it's not always possible to simply restart the script from the beginning because it may run into a 'table already exists' error.CREATE TABLE
rights in a productive environment. All you can do is DELETE
the contents of a table and INSERT
new data into the table. With this limitation, using the CREATE TABLE AS SELECT
syntax is not possible.CREATE TABLE
) and the DML statements (INSERT
, DELETE
, UPDATE
) in separate scripts. That way, you can simply run your DML script without having to worry about it creating or deleting any tables. At the same time, you have a script with all relevant DDL statements that you can quickly use to create all relevant tables for your workflow, even if somebody has completely wiped your database.CREATE TABLE
statements in a separate script.