Change Data Capture (CDC) Implementation for Multi Sourced ETL Processes
We have discussed couple of different options for Change Data Capture including a Change Data Capture Framework in our prior discussions. Implementing change capture for ETL process which involves multiple data source needs special care to capture changes from any of your data source. Here in this article lets see CDC implementation for ETL Process which involve multiple data sources.
Change Scenarios
Lets see different possible scenarios to be considered, when we implement Change Data Capture for multi-sourced ETL.
- Multiple data sources : Multiple data sources may be required to generate all the data elements required for the Dimension or Target table.
- Change in data source : Change can be in one of the data sources or in multiple data sources. Any change needs to be captured.
- Parent Child Relation : Data source can have parent child relation and all parent records may not have child records.
- Reference & Lookup Tables : Reference and Lookup tables may be used to generate the required data elements for the Dimension or Target table.
- Change identification : Changed data from a data source is identified using a date column. Any change in any of the data source will result in the change of the date column say "UPDATE_DT".
Below chart shows the different scenarios we mentioned above. Records to be pulled by the change data process from both CUST, ADDR tables are highlighted in blue based on the assumption that the last ETL run was 03/09/2013.
Preserving Last Run Timestamp
We will have to store the last ETL run timestamp, so that the subsequent ETL runs can identify any changes based on the last ETL run timestamp. We have discussed couple of different options for Change Data Capture including a Change Data Capture Framework in our prior discussions. Please visit the below links for more details.
Querying Changed Records
The important part of Change Data Capture which involves multiple data source is to build an SQL query to pull all the required data. Data sources needs to be joined and queries such that any changes discussed in the above change scenarios need to be captured.
Below SQL query is build on CUST, ADDR table, to cover all the scenarios we discusses before.SQL Query Option 1
SELECT CUST_ID,
CUST_NAME,
CUST_DOB,
CUST_UPDATE_DT,
ADDRESS_LINE,
CITY,
ZIP,
STATE,
ADDR_UPDATE_DT
FROM (SELECT C.CUST_ID AS CUST_ID,
C.CUST_NAME AS CUST_NAME,
C.CUST_DOB AS CUST_DOB,
C.UPDATE_DT AS CUST_UPDATE_DT,
A.ADDRESS_LINE AS ADDRESS_LINE,
A.CITY AS CITY,
A.ZIP AS ZIP,
A.STATE AS STATE,
A.UPDATE_DT AS ADDR_UPDATE_DT
FROM CUST C LEFT OUTER JOIN ADDR A ON C.CUST_ID = A.CUST_ID)
WHERE CUST_UPDATE_DT > TO_DATE ('03/09/2013', 'MM/DD/YYYY') OR
ADDR_UPDATE_DT > TO_DATE ('03/09/2013', 'MM/DD/YYYY')
CUST_NAME,
CUST_DOB,
CUST_UPDATE_DT,
ADDRESS_LINE,
CITY,
ZIP,
STATE,
ADDR_UPDATE_DT
FROM (SELECT C.CUST_ID AS CUST_ID,
C.CUST_NAME AS CUST_NAME,
C.CUST_DOB AS CUST_DOB,
C.UPDATE_DT AS CUST_UPDATE_DT,
A.ADDRESS_LINE AS ADDRESS_LINE,
A.CITY AS CITY,
A.ZIP AS ZIP,
A.STATE AS STATE,
A.UPDATE_DT AS ADDR_UPDATE_DT
FROM CUST C LEFT OUTER JOIN ADDR A ON C.CUST_ID = A.CUST_ID)
WHERE CUST_UPDATE_DT > TO_DATE ('03/09/2013', 'MM/DD/YYYY') OR
ADDR_UPDATE_DT > TO_DATE ('03/09/2013', 'MM/DD/YYYY')
How this SQL query Works
Part 1 : The SQL query part in RED will give all the records from both CUST and ADDR tables.
Part 2 : The SQL query part in BLUE will filter out records which do not have any changes.
Part 2 : The SQL query part in BLUE will filter out records which do not have any changes.
SQL Query Option 2
SELECT C.CUST_ID AS CUST_ID,
C.CUST_NAME AS CUST_NAME,
C.CUST_DOB AS CUST_DOB,
C.UPDATE_DT AS CUST_UPDATE_DT,
A.ADDRESS_LINE AS ADDRESS_LINE,
A.CITY AS CITY,
A.ZIP AS ZIP,
A.STATE AS STATE,
A.UPDATE_DT AS ADDR_UPDATE_DT
FROM CUST C LEFT OUTER JOIN ADDR A ON C.CUST_ID = A.CUST_ID
WHERE C.UPDATE_DT > TO_DATE ('03/09/2013', 'MM/DD/YYYY')
UNION
SELECT C.CUST_ID AS CUST_ID,
C.CUST_NAME AS CUST_NAME,
C.CUST_DOB AS CUST_DOB,
C.UPDATE_DT AS CUST_UPDATE_DT,
A.ADDRESS_LINE AS ADDRESS_LINE,
A.CITY AS CITY,
A.ZIP AS ZIP,
A.STATE AS STATE,
A.UPDATE_DT AS ADDR_UPDATE_DT
FROM CUST C LEFT OUTER JOIN ADDR A ON C.CUST_ID = A.CUST_ID
WHERE A.UPDATE_DT > TO_DATE ('03/09/2013', 'MM/DD/YYYY')
C.CUST_NAME AS CUST_NAME,
C.CUST_DOB AS CUST_DOB,
C.UPDATE_DT AS CUST_UPDATE_DT,
A.ADDRESS_LINE AS ADDRESS_LINE,
A.CITY AS CITY,
A.ZIP AS ZIP,
A.STATE AS STATE,
A.UPDATE_DT AS ADDR_UPDATE_DT
FROM CUST C LEFT OUTER JOIN ADDR A ON C.CUST_ID = A.CUST_ID
WHERE C.UPDATE_DT > TO_DATE ('03/09/2013', 'MM/DD/YYYY')
UNION
SELECT C.CUST_ID AS CUST_ID,
C.CUST_NAME AS CUST_NAME,
C.CUST_DOB AS CUST_DOB,
C.UPDATE_DT AS CUST_UPDATE_DT,
A.ADDRESS_LINE AS ADDRESS_LINE,
A.CITY AS CITY,
A.ZIP AS ZIP,
A.STATE AS STATE,
A.UPDATE_DT AS ADDR_UPDATE_DT
FROM CUST C LEFT OUTER JOIN ADDR A ON C.CUST_ID = A.CUST_ID
WHERE A.UPDATE_DT > TO_DATE ('03/09/2013', 'MM/DD/YYYY')
How this SQL query Works
Part 1 : The SQL query part in RED will give all the changes from CUST table and corresponding data from the ADDR table.
Part 2 : The SQL query part in BLUE will give all the changes from ADDR table and corresponding data from the CUST table.
Once the data is pulled correctly from the data sources, we need to apply the ETL logic to load the Dimension or Target table.
Hope you enjoyed this article. Please lets us know if you have any questions on this article or share us any of your experiences with change data capture.
Once the data is pulled correctly from the data sources, we need to apply the ETL logic to load the Dimension or Target table.
Hope you enjoyed this article. Please lets us know if you have any questions on this article or share us any of your experiences with change data capture.