In last series of Slowly Changing Dimensions (SCD henceforth) was discussed as three main types (Type 1, Type 2 and Type 3) and additional types (Type 4, Type 6). But use cases for each of these SCD Types are intentionally left out as use case by itself is a huge topic. Through this blog let us discuss use cases for each type of SCD.
Type 1: As Type 1 overwrites old values, such attributes where there is no need to maintain historical information may use Type 1. And questions regarding if historical information storage need to answered by Business Analyst and varies from case to case basis. May be in general some attributes inherently are Type 1. An Employee dimension is typically a parent and child dimension and a confirmed dimension as well. So, care should be taken while designing confirmed dimension.

One approach is to during design phase of any dimension list out all attributes for each dimension and then for each attribute list out different properties. And one such property would be to mark each attribute values as “Not Changing” , “Type 1”, “Type 2”, “Type 3”. Something akin to below, though it could be expanded further and all columns filled. A template as such would aid at later stages in development of Dimension, setting each attribute properties as well as creating user defined attribute hierarchies.

In Employee dimension as is obvious, some dimensions attributes like Employee Key, Employee ID are considered “Not Changing” but “First Name” , “Last Name” may change and typically very rarely and if such an event occurs it may be rarer to have historical information of names instead it may be preferred to update names to latest values. So such attributes are of Type 1.
Employee attributes like “Work City”, “Work State”, “Employee Position” are of Type 2 where it may be necessary to store historical information for business analysis. Example if there arises a question like “Are employees whose Work Location equals to their Native Location (Home City) put in more effort or is it contrary”, to answer such questions for each employee Work Location is variable but Home Location is static and can be compared to calculate efficiency of work.
Coming to Type 3, during life cycle of employment in a company, each employee may report to more than one manager (like technical manager, reporting manager) at any point in time and over the period managers may change be it technical or reporting. In such cases if business need is only to maintain last manager and current manager that an employee reports to then SCD Type 3 may be chosen.
Type 1:
-
· Advantages of Type 1:
-
No increase in Number of Columns, and No of rows thus No increase in Dimension Size.
-
Very Simple to develop and maintain.
-
SSIS Natively (Out of Box) Supports this.
-
Disadvantages of Type 1:
-
No historical information and there by analytical capabilities may be hampered if not judiciously chosen.
-
Cube need to be processed for new aggregations to be built. Say customer changed from City 1 to City 2, unless cube is processed, its aggregations still point to City 1
-
Typical Use Cases
-
Where history need not be maintained and only latest status is required.
-
Corrections of Incorrect data technical speaking is a Type 1 SCD
-
In Junk dimensions columns like Status Columns and other indicators typically would not need historical split and may be candidates for Type 1
-
In a Rapidly Changing Dimension (Which we would talk about in subsequent post), breakoff table would typically be Type 1. Maintaining Type 2 on such dimension would bloat dimension and Type 3 could be used by with rapid changing dimensions historical information stored may be too limited.
Type 2:
-
Advantages of Type 2:
-
Historical information can be stored to as much depth as possible with out impacting model.
-
No increase in Columns, that is no meta data changes needed and can increase to as many rows as possible.
-
SSIS Natively (Out of Box) supports this.
-
Only need to process that partition of fact table with new set of rows.
-
Disadvantages of Type 2:
-
If dimensions are already deep (like lots of rows) type 2 has a tendency to increase depth of dimension.
-
Each new historical row added, inserts an entire row there be increasing dimension size. Dimension size increase is maximum in type 2. So, if dimension is rapidly changing Type 2 may not be optimal options. In such situations split dimension into 2 and introduce Snow flaking.
-
With Type 2, there is a clear demarcation of historical information and current information. As new versions of rows are maintained a clear distinct state of row of dimension is maintained. And since new keys join to fact table, there is no way would could see new data as old or old data as new. (Explained this below bit more where Type 2 could not be used)
Type 3:
-
Advantage of type 3:
-
Historical information can be stored by increasing column.
-
Rows are not increased, so only columns that got modified may be added.
-
Disadvantage of Type 3:
-
Only limited when it comes to maintaining historical information.
-
Needs meta data changes to underlying schema.
-
SSIS natively (Out of Box) does not support this.
-
Processing of Cube becomes necessary as new aggregations need to be created.
Most of times we see implementation of either Type 1 and / or Type 2 but in some cases Type 3 makes sense.
Case 1:
In Type 2, for every new avatar of a dimension row, a new row is inserted into Dimension tables with new surrogate key but retaining old business key. In Type 2 SCD, a Surrogate key becomes mandatory and it needs to be independent of natural or business key. Surrogate key maintains historical versioning of record and natural key maintains relationship across all versions (as business keys never change). To link fact, new surrogate key is inserted. Example an Employee changed from City1 to City 2, new rows inserted post change would contain surrogate key value of City 2. And for older fact records before change surrogate key values would be used. This implies old facts can not be used with new dimension rows (post change) or new facts can not be joined with old dimension members if needed.
Case 2:
Say for example, a Employee in Sales department started selling Product B instead of Product A and between Products commission rates are different. If sales person managed to sell same amount of products (Post and Pre) and wanted to check if it is better to continue with new product or revert back to old product in such scenarios Type 2 many not be useful atleast from data layer. In such cases Type 3 would be useful and Type 3 model old and new city both belong to same row.
Case 3:
The other one is physical implementation where Type 2 can end up increasing size of dimension at a higher rate than Type 1 and Type 3. Just as an example, consider Customer dimension of any large Retailer, Telco or anyother org with public presence. Customer dimension would contain millions of rows. If City of Customer is maintained as Type 2 then even if 10% of customers change city every year and 20% is increase in Customer Base (new customers added) then Customer dimension size increases by at least 30% YoY and 10% is a conservative figure. May be in such scenarios Type 3 is a better option or splitting attributes that change rapidly into different dimensions.
For most of dimensions even with rows 100K it makes sense to maintain as Type 2 for those attributes where historical information needs to be maintained but if dimensions are larger than those and are rapidly changing then there could be other mechanisms. In next post we would discuss Rapidly Changing Dimensions (Small, Medium and Large)…
Until next post..
ConsultGuru Group