Blog Posts in Series:
In earlier post when we discussed dimension it was assumed all dimension members are static in nature and they never change. But it is not always the case. Since dimensions are 1st Normal form, always there would be some attributes that would always change. Example could be “A customer moved form City A to City B”. If Customer dimension had a customer city and it is important for analysis then as a DW modeler we would want to record fact that customer moved from City A to City B, else if facts are aggregated on City, analysis would be wrong. This is slowly Changing Dimension. Dimensions whose attributes change, and that change of attributes need to be recorded, can be classified as 3 Types.
Type 1: Type 1 handling of attribute change implies, an attribute if there is a change, need not be recorded. It can be of 2 types.
Either earlier state of that attribute is retained or last state of attribute is retained. For example in customer dimension there is a attribute called weight that measures weight of customer. In our analysis we do not need that attribute extensively. But in extreme cases it may be useful and that was reason why it was included in first place. For such weight attribute we do not need to maintain history of states. In such cases either ignore fact that customer weight has ever changed or update dimension table with latest weight. Coming to dimension tables, either run update command with latest values or do not run any update command and ignore change of attribute. Typically instead of leaving earlier records, attribute is updated to latest status. Type 1 handling is pretty simple to handle and it is no update or simple statement as
Update DimTable set Attribute = '<New Attribute value>' where DimKey = 'Primary Key Column'
Assume Employee’s City Changed from London to Bangalore, in Type 1 (Notice in Type 1 existing row is modified to reflect latest status of Employee’s City Attribute)
Another important thing that needs to be kept in mind with Type 1, though this may be subtle after affects is say a city had changed from City A to City B and since all aggregations that are there on cube are on City A, post update to City B, aggregations need to be rebuilt ie. processing of Dimension and related facts need to be done to have an affective analysis.
Type 2: In Type 2 way of handling any modification to an attribute of dimension, a new row with changed attribute (new value of attribute) and other attributes unchanged (old) is added to dimension table. Let us revisit our example where Customer’s city changed from “City A” to “City B”. In this type of handling a new row is inserted into Customer dimension, where newly added row contains all customer information same as previous row but for City it is entered as “City B”.
In addition to changed attribute value when SCD of Type 2 is enabled on server, there would 2 or 3 additional columns added to dimension that would deal with “Start Date”, “End Date”, “Status”. Start Date would indicate when current status started and End Date would indicate when Customer attribute (city) in this case changed and Status would indicate which of these rows is active for customer. Even Status alone can indicate which is current row but if there were to be any analysis of when did this change occur, without dates columns it would not be possible.
Since type 2 inserts new version of updated row in the table, any primary key of Transactional system can not be made primary key in Dimension tables. So instead a new key called “Surrogate key” is added and original primary key from transactional system is either called “Alternate Key”, “Business Key” or a “Natural Key”. In Type 2, while populating fact table, for dimension key lookups always only latest (or active) dimension member is picked up and old dimension member is not used. If Customer moves from London to Bangalore, any new Dimension Keys inserted into fact would refer to this key (of customer being in Bangalore).
Assume Employee’s City Changed from London to Bangalore, in Type 2, as you can see a new row is inserted with a new Employee Key. (Note: There would be other columns like Start Date, End Date and CurrentStatus that are ignored in this case. But actual implementations would have such keys to identify latest records for lookups)
In case of Type 2, for historical information since old version of row is still maintained, a comparison between historical and current values could be done. Also for any change in value of attribute new row is inserted. In this manner there is no limit in number of values an attribute could change. For every change of attribute a new row would be inserted.
So aggregations or processing need not be done again as existing aggregations are still valid only aggregations need to be built for new facts after city change.
Type 3: Similar to Type 2, this type also handle any changes to attribute but approach is different. In Type 3 instead of adding a new row for changed attribute a new column is added for older value.
This would effect dimension table to have different attributes for each city change like
City, Prior City – 1, Prior City – 2, Prior City – 3 etc.
As you can see, it is not feasible for attributes which change very frequently or dimensions whose attributes change frequently.
In Type 3, when city is changed from London to Bangalore, in an existing dimension with only city attribute a new attribute (column) is added PriorCity and this PriorCity is updated with London and City is updated with Bangalore. Similar to Type 1, these need to be processed again to rollup correctly.
Assume Employee’s City Changed from London to Bangalore, in Type 3, as you can see a new row is inserted with a new Employee Key. (Note: There would be other columns like Start Date, End Date and that are ignored in this case. But actual implementations would have such keys to identify latest records for lookups)
Dimension tables could handle attribute changes either in Type 1 , Type 2 or Type 3. or they could be a hybrid of Type 1, Type 2 and Type 3 as well. In most implementations we would see Type 1 or Type 2 and rarely Type 3. Before deciding which Type may be some questions like below could help? And these questions are for business users and not for us to answer :)
1. Does it matter from analysis perspective if an attribute value changes?
2. If it changes do we need to get latest value of that attribute.
3. Is it possible that both states (Pre Change and Post Change) of attribute would be analyzed simultaneously and need to aggregated independent of changes. rephrase (Should be aggregate all facts with before and after change values of attribute and show for comparison).
Some of them though are good for analysis are implicitly not possible, for example a customer whose city has changed would cease (typically) to exist in older city atleast as far as model or business is concerned.
4. How frequent are these changes of such attributes?
5. Do we need to know when it had changed?
Based on all of these initial questions and lot more we decide which of these attribute is a type 1 , type 2 or a type 3.
There are additional SCDs Type 4 and Type 6.
Type 4: In Type 4 historical information is stored in a separate Table. Main table would contain current information and historical information is moved to a different table. And top of these tables a view could be created that unions data from both tables. Even Type 4 only data that has been added needs to be processed. Old historical data is already processed and aggregated.
Before Change: (Employee Table)
Assume Employee’s City Changed from London to Bangalore, in Type 4 a new is added to Employee_History Table. Type 4 is a flavor of Type 1 and Type 2, Type 1, because current row in current table is updated to latest value and type 2 because a new row is inserted into a separate table (Historical table)
Type 6: (Type 1 + Type 2 + Type 3): Type 6 is a combination of preceding Type 1, 2, 3. In Type 6 a change to row in dimension is maintained in Type 2 (A new is added) and Type 3 (A Column is updated) manner.
Type 1 because London is updated to Bangalore in Current City attribute, Type 2 because a new row with Bangalore city is inserted and Type 3 because historical information is maintained in columnar manner. In Type 6 again historical information can be stored in a different table similar to Type 4. May be Type 10..
Of all types of SCD mentioned above, understanding Type 1 and Type 3 is most important as they are most prevalent in dimension models and others are typically edge cases.
This finishes ground work for Slowly Changing Dimension Types.
Until next post, ciya
Consult Guru Group