Surrogate Key Generation Approaches Using Informatica PowerCenter

Surrogate Key is sequentially generated unique number attached with each and every record in a Dimension table in any Data Warehouse. We discussed about Surrogate Key in in detail in our previous article. Here in this article we will concentrate on different approaches to generate Surrogate Key for different type ETL process.
Surrogate Key for Dimensions Loading in Parallel
When you have a single dimension table loading in parallel from different application data sources, special care should be given to make sure that no keys are duplicated. Lets see different design options here.
1. Using Sequence Generator Transformation
This is the simplest and most preferred way to generate Surrogate Key(SK). We create a reusable Sequence Generator transformation in the mapping and map the NEXTVAL port to the SK field in the target table in the INSERT flow of the mapping. The start value is usually kept 1 and incremented by 1.
Below shown is a reusable Sequence Generator transformation.
Below shown is a reusable Sequence Generator transformation.
NEXTVAL port from the Sequence Generator can be mapped to the surrogate key in the target table. Below shown is the sequence generator transformation.
Note : Make sure to create a reusable transformation, so that the same transformation can be reused in multiple mappings, which loads the same dimension table.
2. Using Database Sequence
We can create a SEQUENCE in the database and use the same to generate the SKs for any table. This can be invoked by a SQL Transformation or a using a Stored Procedure Transformation.
First we create a SEQUENCE using the following command.
CREATE SEQUENCE DW.Customer_SKMINVALUE 1MAXVALUE 99999999START WITH 1INCREMENT BY 1;
Using SQL Transformation
You can create a create reusable reusable SQL Transformation as shown below. It takes the name of the database sequence and the schema name as input and returns SK numbers.
Schema name (DW) and sequence name (Customer_SK) can be passed in as input value for the transformation and the output can be mapped to the target SK column. Below shown is the SQL transformation image.
Using Stored Procedure Transformation
We use the SEQUENCE DW.Customer_SK to generate the SKs in an Oracle function, which in turn called via a stored procedure transformation.Create a database function as below. Here we are creating an Oracle function.
CREATE OR REPLACE FUNCTION DW.Customer_SK_FuncRETURN NUMBERISOut_SK NUMBER;BEGINSELECT DW.Customer_SK.NEXTVAL INTO Out_SK FROM DUAL;RETURN Out_SK;EXCEPTIONWHEN OTHERS THENraise_application_error(-20001,'An error was encountered - '||SQLCODE||' -ERROR- '||SQLERRM);END;
You can import the database function as a stored procedure transformation as shown in below image.
Now, just before the target instance for Insert flow, we add an Expression transformation. We add an output port there with the following formula. This output port GET_SK can be connected to the target surrogate key column.
Note : Database function can be parametrized and the stored procedure can also be made reusable to make this approach more effective
- GET_SK =:SP. CUSTOMER_SK_FUNC()
Note : Database function can be parametrized and the stored procedure can also be made reusable to make this approach more effective
Surrogate Key for Non Parallel Loading Dimensions
If the dimension table is not loading in parallel from different application data sources, we have couple of more options to generate SKs. Lets see different design options here.
Using Dynamic LookUP
When we implement Dynamic LookUP in any mapping, we may not even need to use the Sequence Generator for generating the SK values.
For a Dynamic LookUP on Target, we have the option of associating any LookUP port with an input port, output port, or Sequence-ID. When we associate a Sequence-ID, the Integration Service generates a unique Integer value for each inserted rows in the lookup cache., but this is applicable for the ports with Bigint, Integer or Small Integer data type. Since SK is usually of Integer type, we can exploit this advantage.
The Integration Service uses the following process to generate Sequence IDs.
- When the Integration Service creates the dynamic lookup cache, it tracks the range of values for each port that has a sequence ID in the dynamic lookup cache.
- When the Integration Service inserts a row of data into the cache, it generates a key for a port by incrementing the greatest sequence ID value by one.
- When the Integration Service reaches the maximum number for a generated sequence ID, it starts over at one. The Integration Service increments each sequence ID by one until it reaches the smallest existing value minus one. If the Integration Service runs out of unique sequence ID numbers, the session fails.
Above shown is a dynamic lookup configuration to generate SK for CUST_SK.
The Integration Service generates a Sequence-ID for each row it inserts into the cache. For any records which is already present in the Target, it gets the SK value from the Target Dynamic LookUP cache, based on the Associated Ports matching. So, if we take this port and connect to the target SK field, there will not be any need to generate SK values separately, since the new SK value(for records to be Inserted) or the existing SK value(for records to be Updated) is supplied from the Dynamic LookUP.
The Integration Service generates a Sequence-ID for each row it inserts into the cache. For any records which is already present in the Target, it gets the SK value from the Target Dynamic LookUP cache, based on the Associated Ports matching. So, if we take this port and connect to the target SK field, there will not be any need to generate SK values separately, since the new SK value(for records to be Inserted) or the existing SK value(for records to be Updated) is supplied from the Dynamic LookUP.
The disadvantage of this technique lies in the fact that we don’t have any separate SK Generating Area and the source of SK is totally embedded into the code.
Select CUSTOMER_KEY as Return Port and Lookup Condition as
Suppose, we have a session s_New_Customer, which loads the Customer Dimension table. Before that session in the Workflow, we add a dummy session as s_Dummy.
In s_Dummy, we will have a mapping variable, e.g. $$MAX_CUST_SK which will be set with the value of MAX (SK) in Customer Dimension table.
In our second mapping we start generating the SK from the value $$MAX_CUST_SK + 1. But how can we pass the parameter value from one session into the other one?
Here the use of Workflow Variable comes into picture. We define a WF variable as $$MAX_SK and in the Post-session on success variable assignment section of s_Dummy, we assign the value of $$MAX_CUST_SK to $$START_SK. Now the variable $$MAX_SK contains the maximum available SK value from CUSTOMER_DIM table. Next we define another mapping variable in the session s_New_Customer as $$START_VALUE and this is assigned the value of $$MAX_SK in the Pre-session variable assignment section of s_New_Customer.
So, the sequence is:
Hope you enjoyed this article and earned some new ways to generate surrogate keys for your dimension tables. Please leave us a comment or feedback if you have any, we are happy to hear from you.
Using Expression Transformation
Suppose we are populating a CUSTOMER_DIM. So in the Mapping, first create a Unconnected Lookup for the dimension table, say LKP_CUSTOMER_DIM. The purpose is to get the maximum SK value in the dimension table. Say the SK column is CUSTOMER_KEY and the NK column is CUSTOMER_ID.Select CUSTOMER_KEY as Return Port and Lookup Condition as
- CUSTOMER_ID = IN_CUSTOMER_ID
- SELECT MAX (CUSTOMER_KEY) AS CUSTOMER_KEY, '1' AS CUSTOMER_ID FROM CUSTOMER_DIM
- VAR_COUNTER = IIF(ISNULL( VAR_INC ), NVL(:LKP.LKP_CUSTOMER_DIM('1'), 0) + 1, VAR_INC + 1 )
- VAR_INC = VAR_COUNTER
- OUT_COUNTER = VAR_COUNTER
Using Mapping & Workflow Variable
Here again we will use the Expression transformation to compute the next SK, but will get the MAX available SK in a different way.Suppose, we have a session s_New_Customer, which loads the Customer Dimension table. Before that session in the Workflow, we add a dummy session as s_Dummy.
- SELECT MAX (CUSTOMER_KEY) AS CUSTOMER_KEY FROM CUSTOMER_DIM
- INP_CUSTOMER_KEY = INP_CUSTOMER_KEY -– The MAX of SK coming from Customer Dimension table.
- OUT_MAX_SK = SETVARIABLE ($$MAX_CUST_SK, INP_CUSTOMER_KEY) –- Output Port
In our second mapping we start generating the SK from the value $$MAX_CUST_SK + 1. But how can we pass the parameter value from one session into the other one?
Here the use of Workflow Variable comes into picture. We define a WF variable as $$MAX_SK and in the Post-session on success variable assignment section of s_Dummy, we assign the value of $$MAX_CUST_SK to $$START_SK. Now the variable $$MAX_SK contains the maximum available SK value from CUSTOMER_DIM table. Next we define another mapping variable in the session s_New_Customer as $$START_VALUE and this is assigned the value of $$MAX_SK in the Pre-session variable assignment section of s_New_Customer.
So, the sequence is:
- Post-session on success variable assignment of First Session:
- $$MAX_SK = $$MAX_CUST_SK
- Pre-session variable assignment of Second Session:
- $$START_VALUE = $$MAX_SK
- VAR_COUNTER = IIF (ISNULL (VAR_INC), $$START_VALUE + 1, VAR_INC + 1)
- VAR_INC = VAR_COUNTER
- OUT_COUNTER = VAR_COUNTER
About the Author
Hope you enjoyed this article and earned some new ways to generate surrogate keys for your dimension tables. Please leave us a comment or feedback if you have any, we are happy to hear from you.