Recently, I had an interesting conversation with my project’s data architect regarding possible back-dated changes for a primary dimension — Employee — in the data warehouse. In our existing data model, Employee was maintained as a Type 2 slowly changing dimension.
Six months into deployment, it was confirmed by business management that employee changes could be back-dated. This conversation reminded me of a project that I was part of three years earlier where such back-dated scenarios happened frequently.
Many of us are familiar with design sessions where business leaders confirm the chances of a certain scenario happening is rare. The architect then designs a solution without considering this supposedly rare scenario at all. After one month of deployment, it is noticed that this scenario occurs and breaks the existing design. Back-dated changes are an example of such “rare scenarios.”
In addition to the challenge we faced in designing a solution that tackles back dated changes, such scenarios bring in additional complexities such as possible insertion of records within historical date ranges, surrogate keys of dimensions getting out of sequence, the need to update historical fact data with new surrogate keys, and altering of summarized data in fact tables/aggregate fact tables.
After the code was deployed to the production environment the team realized that this “rare scenario” happened more than anticipated. For the first few occurrences of this scenario, manual updates were performed to clean the data. Soon it was decided to re-design the architecture to handle back-dated scenarios.
With rapid technological innovations in the area of business intelligence, business management expects solutions to all their issues as a return for their investment. Often overlooked are the complexities that scenarios like backdating presents. I am aware that Oracle BI handles such back-dated changes as part of their solution, but not all systems do. As the BI field becomes more mature, vendor systems will accommodate occurrences of such back-dated scenarios as a normal situation in any data warehousing design. Everyone’s goals is to have a data warehousing solution that is committed to keeping the data as clean as possible.
In the remainder of this article, I will present a logical approach that you can use to tackle back-dated scenarios.
A conventional type 2 slowly changing dimension is modeled with an educated assumption that all changes coming from the source occur going forward. The following diagram depicts the conventional design of a type 2 dimension. To illustrate a scenario, I have given an example below.
Employee XYZ resides in California, and was hired on 1/1/2010.
On 5/1/2013, we were notified from the source system that this employee has moved to Texas. The source table src_Employee looks as follows:
In a conventional type 2 slowly changing dimension — Dim_Employee, the following records are created.
On 7/1/2013, we are informed by the user that this employee had relocated to Texas on 3/1/2013. In a scenario in which ‘location’ is the only column to trigger historical employee records, the start and end dates in ‘DIM_Employee’ could be updated for each record.
However, in a real life scenario, there will be multiple columns that trigger historical records for a dimension. In the previous example, let us consider Position to be another critical attribute for Employee.
Employee was hired on 1/1/2010 as a Software Engineer. He resides in California.
On 4/1/2013, he gets promoted to the position of a Senior Software Engineer.
On 5/1/2013, we get a trigger from the source system that this employee has relocated to Texas.
In the conventional type 2 slowly changing dimension, the following records are created for the employee.
On 7/1/2013, we are informed by the user that this employee had relocated to Texas on 3/1/2013. In this scenario, we cannot go ahead and update the effective dates of each record.
Assumption: There is a source table ‘src_Employee_location’ that maintains start and end date of each employee and his corresponding location.
Design a typical Type 2 dimension for Employee. The image of Dim_Employee after this step will be as seen in exhibit 3.
Identify all records that have changed src_Employee_location (based on Update_Date). Join the changes in src_Employee_location to DIM_Employee based on the following condition Emp_No=Emp_No and Start Date between eff_start_date and eff_end_date or End Date between eff_start_date and eff_end_date
After this join operation, the output looks as follows:
Start date (source) < eff_st_date (target) and end date(source) > eff_end_date (target)
Description- The date range of the target record is within the date range of source record.
Resolution- Update existing record with source location.
Start date (source) between eff_st_date (target) and eff_end_date (target) and End Date(source) between eff_st_date(target) and eff_end_date(target)
Description- The date range of the source record within the date range of target record.
Resolution- Insert new record that retrieves location from source record and remaining data from target record, record start date= start date (source), record end date = end date (source)
End Date (source) >= eff_end_date(target) and Start Date (source) > eff_start_date (target)
Description- Overlap of source and target records in which source start date in between target start and end dates
Resolution- Insert new record that retrieves location from source record and remaining data from target record, record start date= start date(source), record end date = eff_end_date (target)
End Date (source) < eff_end_date(target) and Start Date (source) <= eff_start_date (target)
Description- Overlap of source and target records in which source end date in between target start and end dates
Resolution- Update existing record that retrieves all data from target record, record start date remains same, record end date = end date ( source)
The data in Dim_Employee after the above pseudo code has been applied looks as follows: