Design Approach to Handle Late Arriving Dimensions and Late Arriving Facts

Johnson Cyriac Dec 29, 2013
|

Design Approach to Handle Late Arriving Dimensions and Late Arriving Facts
In the typical case for a data warehouse, dimensions are processed first and the facts are loaded later, with the assumption that all required dimension data is already in place. This may not be true in all cases because of nature of your business process or the source application behavior. Fact data also, can be sent from the source application to the warehouse way later than the actual fact data is created. In this article lets discusses several options for handling late arriving dimension and Facts.

What is Late Arriving Dimension

Late arriving dimensions or sometimes called early-arriving facts occur when you have dimension data arriving in the data warehouse later than the fact data that references that dimension record.

For example, an employee availing medical insurance through his employer is eligible for insurance coverage from the first day of employment. But the employer may not provide the medical insurance information to the insurance provider for several weeks. If the employee undergo any medical treatment during this time, his medical claim records will come as fact records with out having the corresponding patient dimension details.

Design Approaches

Depending on the business scenario and the type of dimension in use, we can take different design approaches.
      • Hold the Fact record until Dimension record is available.
      • 'Unknown' or default Dimension record.
      • Inferring the Dimension record.
      • Late Arriving Dimension and SCD Type 2 changes.

1. Hold the Fact record until Dimension record is available

One approach is to place the fact row in a suspense table. The fact row will be held in the suspense table until the associated dimension record has been processed. This solution is relatively easy to implement, but the primary drawback is that the fact row isn’t available for reporting until the associated dimension record has been handled.

This approach is more suitable when your data warehouse is refreshed as a scheduled batch process and a delay in loading fact records until the dimension records are available is acceptable for the business.
Late Arriving Dimension design approach

2. 'Unknown' or default Dimension record

Another approach is to simply assign the “Unknown” dimension member to the fact record. On the positive side, this approach does allow the fact record to be recorded during the ETL process. But it won’t be associated with the correct dimension value. 

The "Unknown" fact records can also be kept into a suspense table. Eventually, when the Dimension data is processed, the suspense data can be reprocessed and associate with a real, valid Dimension record.
Late Arriving Dimension design approach

3. Inferring the Dimension record

Another method is to insert a new Dimension record with a new surrogate key and use the same surrogate key to load the incoming fact record. This only works if you have enough details about the dimension in the fact record to construct the natural key. Without this, you would never be able to go back and update this dimension row with complete attributes.

In the insurance claim example explained in the beginning; it is almost certain that the "patient id" will be part of the claim fact, which is the natural key of the patient dimension. So we can create a new placeholder dimension record for the patient with a new surrogate key and the natural key "patient id".
Late Arriving Dimension design approach
Note : When you get all other attributes for the patient dimension record in a later point, you will have to do a SCD Type 1 update for the first time and SCD Type 2 going forward.

4. Late Arriving Dimension and SCD Type 2 changes

Late arriving dimension with SCD Type 2 changes gets more complex to handle.

4.1. Late Arriving Dimension with multiple historical changes

As described above, we can handle late arriving dimension by keeping an "Unknown" dimension record or an "Inferred" dimension record, which acts an a placeholder.

Even before we get the full dimension record details from the source system, there may be multiple SCD Type 2 changes to the placeholder dimension record. This leads to the creation of new dimension record with new surrogate key and modify any subsequent fact records surrogate key to point the new surrogate key.

Lets see the scenario in detail with the help medical insurance claim example.

The patient with ID 67223 have made two insurance claims. One on 9/10 and other on 9/20. As there is no patient dimension information is available for patient id 67223 yet, an 'Inferred' dimension record is created for the patient with surrogate key 1001.

Below shown is the state of the dimension and the fact table at this point.
Design Approach to Handle Late Arriving Dimensions and Late Arriving Facts
Later, by the time dimension information is made available, there has already been SCD Type 2 changes for the patient id 67223. There has been changes for the patient id 67223 on 9/10 and again on 9/12.  Below shown is the current state of the dimension and fact records. The fact record created on 9/20 is still referring to surrogate key 1001, which is not the correct representation.
image

This means the claim record created on 9/20 need to be reassigned to the correct  surrogate key, which is active for the same time period. Below shown is the correct state of the dimension and fact records.
image

4.2. Late Arriving Dimension with retro effective changes

You can get Dimension records from source system with retro effective dates. For example you might update your marital status in your HR system way later than your marriage date.  This update come to data warehouse with retro effective date.

This leads to a new dimension record with a new surrogate key and changes in effective dates for the affected dimension. You will have to scan forward in the dimension to see if there is any subsequent type 2 rows for this dimension. This further leads in modify any subsequent fact records surrogate key to point the new surrogate key.

Lets again use the medical insurance claim example for our explanation.

Below shown state of the Patient Dimension and the Claim Fact table at this point, which is perfectly good.
Design Approach to Handle Late Arriving Dimensions and Late Arriving Facts
Now we have got a Patient Dimension data from the source system say on 10/1, which is in effective from 9/15 as shown below.
Design Approach to Handle Late Arriving Dimensions and Late Arriving Facts
This new Dimension data which comes with a retro effective date makes all dimension records out of sync in terms of the effective start and end date. In addition to that, the fact records are referring to incorrect dimension records.

So in addition to inserting a new dimension record with a new surrogate key, we will have to adjust the effective dates of the prior period dimension record and propagate the dimension column value changes to the remaining records. The fact table also need to be updated to reassign the correct surrogate key.

Below shown red is the corrections required to take care of the retro effective dimension records.     
Design Approach to Handle Late Arriving Dimensions and Late Arriving Facts

What is Late Arriving Facts

Late arriving fact scenario occurs when the transaction or fact data comes to data warehouse way later than the  actual transaction occurred in the source application. If the late arriving fact need to be associated with an SCD Type 2 dimension, the situation become messy. This is because we have to search back in history within the dimensions to decide how to assign the right dimension keys that were in effect when the activity occurred in the past.

Design Approaches

Unlike late arriving dimensions, late arriving fact records can be handles relatively easily. When loading the fact record, the associated dimension table history has to be searched to find out the appropriate surrogate key which is effective at the time of the transaction occurrences. Below data flow describes the late arriving fact design approach.
Late Arriving Fact design approach
Hope you guys enjoyed this article and gave you some new insights into late arriving dimension and fact scenarios in Data Warehouse. Leave us your questions and commends. We would also like to hear how you have handled late arriving dimension and fact in your data warehouse.




About US Contact US Advertise Guest Post Terms and Conditions Privacy Policy Disclaimer

© 2012-2013 Data Intelligence Solution, All Rights Reserved
The contents in this site is copyrighted to Data intelligence Solution and may not be reproduced on other websites.