Change Data Capture (CDC) Implementation for Multi Sourced ETL Processes

Johnson Cyriac Mar 9, 2013

Initial History Building Algorithm for Slowly Changing Dimensions
We have discussed couple of different options for Change Data Capture including a Change Data Capture Framework in our prior discussions. Implementing change capture for ETL process which involves multiple data source needs special care to capture changes from any of your data source. Here in this article lets see CDC implementation for ETL Process which involve multiple data sources.

Change Scenarios

Lets see different possible scenarios to be considered, when we implement  Change Data Capture for multi-sourced ETL.
  1. Multiple data sources : Multiple data sources may be required to generate all the data elements required for the Dimension or Target table.
  2. Change in data source : Change can be in one of the data sources or in multiple data sources. Any change needs to be captured.
  3. Parent Child Relation : Data source can have parent child relation and all parent records may not have child records.
  4. Reference & Lookup Tables : Reference and Lookup tables may be used to generate the required data elements for the Dimension or Target table.
  5. Change identification : Changed data from a data source is identified using a date column.  Any change in any of the data source will result in the change of the date column say "UPDATE_DT".
Below chart shows the different scenarios we  mentioned above. Records to be pulled by the change data process from both CUST, ADDR tables are highlighted in blue based on the assumption that the last ETL run was 03/09/2013.
Change Data Capture Implementation for Multi Sourced ETL Processes

Preserving Last Run Timestamp

We will have to store the last ETL run timestamp, so that the subsequent ETL runs can identify any changes based on the last ETL run timestamp. We have discussed couple of different options for Change Data Capture including a Change Data Capture Framework in our prior discussions. Please visit the below links for more details.
    1. Change Data Capture Made Easy Using Mapping Variables
    2. An ETL Framework for Change Data Capture
    3. Change Data Capture Implementation Using CHECKSUM Number

Querying Changed Records

The important part of Change Data Capture which involves multiple data source is to build an SQL query to pull all the required data. Data sources needs to be joined and queries such that any changes discussed in the above change scenarios need to be captured.

Below SQL query is build on CUST, ADDR table, to cover all the scenarios we discusses before.

SQL Query Option 1

                SELECT CUST_ID,
                       CUST_NAME,
                       CUST_DOB,
                       CUST_UPDATE_DT,
                       ADDRESS_LINE,
                       CITY,
                       ZIP,
                       STATE,
                       ADDR_UPDATE_DT
                  FROM
(SELECT C.CUST_ID AS CUST_ID,
                               C.CUST_NAME AS CUST_NAME,
                               C.CUST_DOB AS CUST_DOB,
                               C.UPDATE_DT AS CUST_UPDATE_DT,
                               A.ADDRESS_LINE AS ADDRESS_LINE,
                               A.CITY AS CITY,
                               A.ZIP AS ZIP,
                               A.STATE AS STATE,
                               A.UPDATE_DT AS ADDR_UPDATE_DT
                          FROM CUST C LEFT OUTER JOIN ADDR A ON C.CUST_ID = A.CUST_ID
)
                 WHERE CUST_UPDATE_DT > TO_DATE ('03/09/2013', 'MM/DD/YYYY') OR
                                ADDR_UPDATE_DT > TO_DATE ('03/09/2013', 'MM/DD/YYYY')

How this SQL query Works

Part 1 :  The SQL query part in RED will give all the records from both CUST and ADDR tables.
Part 2 :  The SQL query part in BLUE will filter out records which do not have any changes.

SQL Query Option 2

                SELECT C.CUST_ID AS CUST_ID,
                       C.CUST_NAME AS CUST_NAME,
                       C.CUST_DOB AS CUST_DOB,
                       C.UPDATE_DT AS CUST_UPDATE_DT,
                       A.ADDRESS_LINE AS ADDRESS_LINE,
                       A.CITY AS CITY,
                       A.ZIP AS ZIP,
                       A.STATE AS STATE,
                       A.UPDATE_DT AS ADDR_UPDATE_DT
                FROM CUST C LEFT OUTER JOIN ADDR A ON C.CUST_ID = A.CUST_ID
                WHERE C.UPDATE_DT > TO_DATE ('03/09/2013', 'MM/DD/YYYY')

                UNION
                SELECT C.CUST_ID AS CUST_ID,
                       C.CUST_NAME AS CUST_NAME,
                       C.CUST_DOB AS CUST_DOB,
                       C.UPDATE_DT AS CUST_UPDATE_DT,
                       A.ADDRESS_LINE AS ADDRESS_LINE,
                       A.CITY AS CITY,
                       A.ZIP AS ZIP,
                       A.STATE AS STATE,
                       A.UPDATE_DT AS ADDR_UPDATE_DT
                FROM CUST C LEFT OUTER JOIN ADDR A ON C.CUST_ID = A.CUST_ID
                WHERE A.UPDATE_DT > TO_DATE ('03/09/2013', 'MM/DD/YYYY')

How this SQL query Works

Part 1 :  The SQL query part in RED will give all the changes from CUST table and corresponding data from the ADDR table.
Part 2 :  The SQL query part in BLUE will give all the changes from ADDR table and corresponding data from the CUST table.

Once the data is pulled correctly from the data sources, we need to apply the ETL logic to load the Dimension or Target table.

Hope you enjoyed this article. Please lets us know if you have any questions on this article or share us any of your experiences with change data capture.





About US Contact US Advertise Guest Post Terms and Conditions Privacy Policy Disclaimer

© 2012-2013 Data Intelligence Solution, All Rights Reserved
The contents in this site is copyrighted to Data intelligence Solution and may not be reproduced on other websites.