5 Restartability Design Pattern for Different Type ETL Loads
Restartable ETL jobs are very crucial to job failure recovery, supportability and data quality of any ETL System. So you need to build your ETL system around the ability to recover from abnormal ending of a job and restart. So a well designed ETL system should have a good restartable mechanism. In this article lets discuss ETL restartability approaches to support different type of ETL Jobs such as Dimension loads, Fact Loads etc...
What is ETL Restartability
Restartability is the ability to restart an ETL job if a processing step fails to execute properly. This will avoid the need of any manual cleaning up before a failed job can restart. You want the ability to restart processing at the step where it failed as well as the ability to restart the entire ETL session.
Lets discuss ETL restartability approaches to support commonly used ETL Jobs types.
- Slowly Changing Dimension
- Fact Table
- Snapshot Table
- Current State Table
- Very Large Table
1. Slowly Changing Dimension Load
Below diagram shows the high level steps required for SCD loading ETL Job.
Lets see this in bit more detail.
Key Design Factor : The key aspect of this design is the CHECKSUM Number comparison. As per this design, any incoming record with the same CHECKSUM number of the active record in the target will not be processed into the Dimension table, Hence we can restart the process with out impacting the partially processed data.
Step 1 : In this step, we will read all the data from the staging table. This will include joining data from different tables and applying any incremental data capturing logic.
Step 2 : Data will be compared between source and target to identify if any change in any of the attributes. CHECKSUM Number can be used to make this process simple.
Step 3 : If the check CHECKSUM Number is different, Data is processed further, else ignored.
Step 4 : Do any transformation required, including the error handling.
Step 5 : Load the data into the Dimension Table.
2. Fact Table Load
High level design for the Fact Table design is given in below image.
Key Design Factor : As per this design, FACT Table records are loaded into a TEMP table, then to the actual table. Truncate and Load design for the TEMP gives the restartability.
Note : Data movement from the TEMP table to FACT table is assumed to be very less likely to get errors. Any error in this process will require manual intervention.
Some more details on the high level design.
Step 1 : In this step, we will read all the data from the source table. This will include joining data from different tables and applying any incremental data capturing logic.
Step 2 : Perform any transformation required, including the error handling.
Step 3 : Load the data into the TEMP Table.
Step 4 :Load the data from the TEMP Table into the FACT table. This can be done either using Database script or using an Informatica PowerCenter session.
Step 4 :Load the data from the TEMP Table into the FACT table. This can be done either using Database script or using an Informatica PowerCenter session.
3. Snapshot Table Load
Many times we create snapshot tables and do build reporting on top of it. This particular restartability technique is appropriate for such scenarios. Below image shows the high level steps.
Key Design Factor : Truncate and Load design for the TEMP gives the restartability.
Key Design Factor : Truncate and Load design for the TEMP gives the restartability.
Detailed steps are as below.
Step 1 : In this step, we will read all the data from the source table. This will include joining data from different tables and applying any incremental data capturing logic.
Step 2 : Truncate the data from the target table.
Step 3 : Perform any transformation required, including the error handling.
Step 4 : Load the data into Target Table.
Step 4 : Load the data into Target Table.
4. Current State Table Load
Just like SCD Type 1, there are scenarios, where you are interested to keep only the latest state of the data. Here we are discussing a very common and simple approach to achieve restartability for such scenarios.
Key Design Factor : Update else Insert design gives the restartability.
More about the Steps.
Step 1 : In this step, we will read all the data from the source table. This will include joining data from different tables and applying any incremental data capturing logic.
Step 2 : Identify Records for INSERT/UPDATE and perform any transformations that is required, including the error handling.
Step 3 : Insert the record which is identified for Insert.Step 4 : Update the record which is identified for Update.
5. Very Large Table Load
The approach we are discussing here is appropriate for loading very large snapshot table , which is required to be available 24/7. You can read the complete design from this article.
Key Design Factor : Switching the tables using RENAME DDL Command.
Below is the high level design.
Lets see this in bit more detail.
Step 1 : In this step, we will read all the data from the source table. This will include joining data from different tables and applying any incremental data capturing logic.
Step 2 : Perform any transformations that is required, including the error handling.
Step 3 : Load the data into the TEMP Table.
Step 4 : Rename the TEMP table to the Target table. This will move the data from the TEMP table to the actual target table.
Note : Click the link to Learn more on this restartability design.
Please leave us a comment below, if you have any other thoughts or scenarios to be covered. We will be more than happy to help you.