Surrogate Key in Data Warehouse, What, When and Why
Surrogate keys are widely used and accepted design standard in data warehouses. It is sequentially generated unique number attached with each and every record in a Dimension table in any Data Warehouse. It join between the fact and dimension tables and is necessary to handle changes in dimension table attributes.
What Is Surrogate Key
Surrogate Key (SK) is sequentially generated meaningless unique number attached with each and every record in a table in any Data Warehouse (DW).
- It is UNIQUE since it is sequentially generated integer for each record being inserted in the table.
- It is MEANINGLESS since it does not carry any business meaning regarding the record it is attached to in any table.
- It is SEQUENTIAL since it is assigned in sequential order as and when new records are created in the table, starting with one and going up to the highest number that is needed.
Surrogate Key Pipeline and Fact Table
During the FACT table load, different dimensional attributes are looked up in the corresponding Dimensions and SKs are fetched from there. These SKs should be fetched from the most recent versions of the dimension records. Finally the FACT table in DW contains the factual data along with corresponding SKs from the Dimension tables.
The below diagram shows how the FACT table is loaded from the source.
The below image shows a typical Star Schema, joining different Dimensions with the Fact using SKs.
Why Should We Use Surrogate Key
Basically it’s an artificial key that is used as a substitute for a Natural Key (NK). We should have defined NK in our tables as per the business requirement and that might be able to uniquely identify any record. But, SK is just an Integer attached to a record for the purpose of joining different tables in a Star or Snowflake schema based DW. SK is much needed when we have very long NK or the datatype of the NK is not suitable for Indexing.
- Smart, where you can tell something about the record just by looking at the key.
- Composed of natural keys glued together.
- Implemented as multiple parallel joins between the dimension table and the fact table; so-called double or triple barreled joins.
As per Thomas Kejser, a “good key” is a column that has the following properties:
- It forced to be unique
- It is small
- It is an integer
- Once assigned to a row, it never changes
- Even if deleted, it will never be re-used to refer to a new row
- It is a single column
- It is stupid
- It is not intended as being remembered by users
If the above mentioned features are taken into account, SK would be a great candidate for a Good Key in a DW.
Apart from these, few more reasons for choosing this SK approach are:
- If we replace the NK with a single Integer, it should be able to save a substantial amount of storage space. The SKs of different Dimensions would be stored as Foreign Keys (FK) in the Fact tables to maintain Referential Integrity (RI), and here instead of storing of those big or huge NKs, storing of concise SKs would result in less amount of space needed. The UNIQUE indexes built on the SK will take less space than the UNIQUE index built on the NK which may be alphanumeric.
- Replacing big, ugly NKs and composite keys with beautiful, tight integer SKs is bound to improve join performance, since joining two Integer columns works faster. So, it provides an extra edge in the ETL performance by fastening data retrieval and lookup.
- Advantage of a four-byte integer key is that it can represent more than 2 billion different values, which would be enough for any dimension and SK would not run out of values, not even for the Big or Monster Dimension.
- SK is usually independent of the data contained in the record, we cannot understand anything about the data in a record simply by seeing only the SK. Hence it provides Data Abstraction.
So, apart from the abstraction
of critical business data involved in the NK, we have the advantage of
storage space reduction as well to implement the SK in our DW. It has
become a Standard Practice to associate an SK with a table in DW irrespective of being it a Dimension, Fact, Bridge or Aggregate table.
Why Shouldn’t We Use Surrogate Key
There are myriad number of disadvantages as well while working with SK. Let’s see them one by one:
- The values of SKs have no relationship with the real world meaning of the data held in a row. Therefore over usage of SKs lead to the problem of disassociation.
- The generation and attachment of SK creates unnecessary ETL burden. Sometimes it may be found that the actual piece of code is short and simple, but generating the SK and carrying it forward till the target adds extra overhead on the code.
- During the Horizontal Data Integration (DI) where multiple source systems loads data into a single Dimension, we have to maintain a single SK Generating Area to enforce the Uniqueness of SK. This may come as an extra overhead on the ETL.
- Even query optimization becomes difficult since SK takes the place of PK, unique index is applied on that column. And any query based on NK leads to Full Table Scan (FTS) as that query cannot take the advantage of unique index on the SK.
- Replication of data from one environment to another, i.e. Data Migration, becomes difficult since SKs from different Dimension tables are used as the FKs in the Fact table and SKs are DW specific, any mismatch in the SK for a particular Dimension would result in no data or erroneous data when we join them in a Star Schema.
- If duplicate records come from the source, there is a potential risk of duplicates
About the Authorbeing loaded into the target, since Unique Constraint is defined on the SK and not on the NK.
SK should not be implemented just in the name of standardizing your code. SK is required when we cannot use an NK to uniquely identify a record or when using an SK seems more suitable as the NK is not a good fit for PK.
Reference : Ralph Kimball, Thomas Kejser
Reference : Ralph Kimball, Thomas Kejser