Using ETL Staging Tables

Using ETL Staging Tables

When the volume or granularity of the transformation process causes ETL processes to perform poorly, consider using a staging table on the destination database as a vehicle for processing interim data results. Staging tables are normally considered volatile tables, meaning that they are emptied and reloaded each time without persisting the results from one execution to the next. Staging tables should be used only for interim results and not for permanent storage.
When using a load design with staging tables, the ETL flow looks something more like this:
  1. Delete existing data in the staging table(s)
  2. Extract the data from the source
  3. Load this source data into the staging table(s)
  4. Perform relational updates (typically using T-SQL, PL/SQL, or another language that is specific to your RDBMS) to cleanse or apply business rules to the data, repeating this transformation stage as necessary
  5. Load the transformed data from the staging table(s) into the final destination table(s)
This load design pattern has more steps than the traditional ETL process, but it also brings additional flexibility as well. By loading the data first into staging tables, you’ll be able to use the database engine for things that it already does well. For example, joining two sets of data together for validation or lookup purposes can be done in most every ETL tool, but this is the type of task that the database engine does exceptionally well. The same thing with performing sort and aggregation operations; ETL tools can do these things, but in most cases, the database engine does them too, but much faster.

Tips for Using ETL Staging Tables

1- Separate the ETL staging tables from the durable tables
2- Use permanent staging tables, not temp tables.
3- Consider indexing your staging tables.
4- Consider emptying the staging table before and after the load.
5- Be careful if you want to run several concurrent loads at once.
6- Handle data lineage properly.

References:

Comments

Popular posts from this blog

Maxpooling vs minpooling vs average pooling

Percentiles, Deciles, and Quartiles

Understand the Softmax Function in Minutes