Data Integration Solutions

An ETL Parameter Framework to Deal with all sorts of Parametrization Needs

2014-10-05T15:23:00.001-07:00

We covered different ETL frameworks in our prior articles. In this article, let's talk about an ETL framework for handling the parameters we normally use across different ETL jobs and use cases. Parametrizing ETL code increases code reusability and maintainability, is critical to code quality, and reduces development cycle time.

Framework Components

Our ETL parameter framework will include primarily two components.

A Relational Table :- To store the parameter details and parameter values.
Reusable Mapplet :- Mapplet to log the parameter details and values into the relational table.

1. Relational Table

A relational table with the structure below will be used to store the parameter details. It stores the parameter name, its value, and the information needed to identify the context of the parameter, such as folder name, workflow name and session name.

ETL_PARM_ID : A unique sequence number.
FOLDER_NAME : Folder name, in which the parameter is used.
WRKFLW_NAME : Workflow name, in which the parameter is used.
SESSN_NAME : Session name, in which the parameter is used.
PARM_NAME : Name of the parameter
PARM_VAL : Value of the parameter.
ETL_CRT_DATE : Record create timestamp.
ETL_UPD_DATE : Record update timestamp.

Note : You can add the repository name to the table if the framework is to be used for workflows running in multiple repositories.
Note : Every parameter should be stored in the parameter table with its initial value to start with.
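
The structure above can be sketched as a small script. This is only an illustration, assuming SQLite as the database; the table name ETL_PARM, the uniqueness constraint on the context columns, and the helper names are hypothetical and not part of the original framework.

```python
import sqlite3

# Hypothetical table name; the column names follow the structure listed above.
# The UNIQUE constraint is an assumption so each context holds one value per parameter.
CREATE_ETL_PARM = """
CREATE TABLE IF NOT EXISTS ETL_PARM (
    ETL_PARM_ID  INTEGER PRIMARY KEY AUTOINCREMENT,
    FOLDER_NAME  TEXT NOT NULL,
    WRKFLW_NAME  TEXT NOT NULL,
    SESSN_NAME   TEXT NOT NULL,
    PARM_NAME    TEXT NOT NULL,
    PARM_VAL     TEXT,
    ETL_CRT_DATE TEXT DEFAULT CURRENT_TIMESTAMP,
    ETL_UPD_DATE TEXT DEFAULT CURRENT_TIMESTAMP,
    UNIQUE (FOLDER_NAME, WRKFLW_NAME, SESSN_NAME, PARM_NAME)
)
"""


def create_parameter_table(conn):
    """Create the parameter table if it does not exist yet."""
    conn.execute(CREATE_ETL_PARM)
    conn.commit()


def seed_parameter(conn, folder, workflow, session, name, initial_value):
    """Store a parameter with its initial value and return its ETL_PARM_ID."""
    cur = conn.execute(
        "INSERT INTO ETL_PARM "
        "(FOLDER_NAME, WRKFLW_NAME, SESSN_NAME, PARM_NAME, PARM_VAL) "
        "VALUES (?, ?, ?, ?, ?)",
        (folder, workflow, session, name, initial_value),
    )
    conn.commit()
    return cur.lastrowid
```

Seeding every parameter up front matches the second note above: a mapping can then always find a starting value.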

2. Reusable Mapplet

A mapplet to capture and load the parameter values into the database table. This mapplet takes two input values and gives all the data elements required in the parameter table mentioned above.

Mapplet Input : Parameter name, parameter value.

Mapplet Output : All the data elements required to be stored in the parameter table mentioned above. This output can be connected to the target table to store the information into the relational table.
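
To make the mapplet's contract concrete, here is a rough Python equivalent. It is a sketch, not the mapplet itself: the function name is hypothetical, and passing the folder, workflow and session names in explicitly is an assumption, since the article does not say how the mapplet obtains them.

```python
from datetime import datetime


def build_parameter_row(parm_name, parm_val, folder_name, workflow_name,
                        session_name, now=None):
    """Return one parameter-table record for the given name and value.

    ETL_PARM_ID is left out on purpose; it is assumed to be generated
    by the database sequence.
    """
    timestamp = (now or datetime.now()).strftime("%Y-%m-%d %H:%M:%S")
    return {
        "FOLDER_NAME": folder_name,
        "WRKFLW_NAME": workflow_name,
        "SESSN_NAME": session_name,
        "PARM_NAME": parm_name,
        "PARM_VAL": str(parm_val),
        "ETL_CRT_DATE": timestamp,
        "ETL_UPD_DATE": timestamp,
    }
```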

Framework Implementation in a Workflow

This framework can be implemented for both dynamically changing parameters as well as rarely changing or static parameters.

Dynamically Changing Parameters

A typical example of a dynamically changing parameter is "ETL Run Timestamp", which is used for incremental data extraction logic. Let's see how incremental data extraction is implemented using this parameter framework.

Create a mapping variable with MAX aggregation. This variable will hold the parameter value.

Note : Reset the mapping variable in the workflow using the pre-session variable assignment.

Set the mapping variable using the SETVARIABLE function in an expression, as shown in the image below. This updates the mapping variable to the greatest ETL_UPD_DATE value, which is finally stored in the parameter table using the mapplet.

Adjust the source filter to pull incremental data. Incremental data is pulled from the source based on ETL_UPD_DATE, as shown in the image below.

The mapping configuration above makes sure the correct parameter is used and sets the correct parameter value, which is to be stored in the parameter table.

Add an additional mapping pipeline, as shown in the image below, to store the parameter value in the parameter table. This pipeline updates the current value in the parameter table to the latest value. The mapplet makes sure the correct parameter and parameter value are updated in the parameter table.

Note : Set the target load order of the new pipeline to be the last one in the mapping. The source qualifier of this pipeline generates one record using the "select 'x' from dual" SQL.

The complete mapping design is shown below.
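
The logic of this mapping can be sketched outside Informatica as follows. This is a simplified illustration: the function name is hypothetical, rows are plain dictionaries, and the strictly-greater-than comparison in the filter is an assumption because the actual source filter is only shown in the image.

```python
from datetime import datetime


def extract_incremental(source_rows, last_run_ts):
    """Return the rows changed since last_run_ts and the new parameter value.

    Mirrors the mapping: the source filter keeps only rows newer than the
    stored timestamp, and the MAX-aggregated variable ends up holding the
    greatest ETL_UPD_DATE seen in this run.
    """
    changed = [row for row in source_rows if row["ETL_UPD_DATE"] > last_run_ts]
    # If nothing changed, the stored value is kept as it is.
    new_ts = max((row["ETL_UPD_DATE"] for row in changed), default=last_run_ts)
    return changed, new_ts
```

The returned timestamp is what the last pipeline in the mapping would write back to the parameter table, so that the next run starts from it.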

Static or Rarely Changing Parameters

Static parameters, or parameters that need only occasional changes, can be stored in the parameter table and retrieved in the Informatica mapping using a LookUp transformation. Any change required to the parameter value should be made as a one-time update outside the ETL process.

The lookup transformation shown below can be used to retrieve the parameter value. You just need to pass the input parameters to the lookup to get the parameter value from the parameter table.

Note : The static parameter value should already be saved into the parameter table with its static value, before it can be used in a mapping.

How Parameter Data is Stored in the Parameter table

As discussed, the parameter framework supports both static and dynamic parameters. Let's consider some sample data for the explanation.

ETL_PARM_ID	FOLDER_NAME	WRKFLW_NAME	SESSN_NAME	PARM_NAME	PARM_VAL
1	ALL	ALL	ALL	YR_BEGIN	01-JAN-2014
2	DW_SALES	ALL	ALL	REGION_NAME	USA
3	DW_SALES	wf_LOAD_CUST_DIM	s_LOAD_CUST_DIM	LST_RUN_TS	10-OCT-2014

Parameter IDs 1 and 2 are static parameters. The first parameter is defined to be used across all folders, workflows and sessions. The second parameter is also static, but applies to all workflows and sessions in the folder DW_SALES. The third parameter is a dynamic parameter specific to the session s_LOAD_CUST_DIM, which runs in the DW_SALES folder.
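
A lookup over this sample data could behave like the sketch below. The function name is hypothetical, and preferring the most specific row when several scopes match is an assumption; the article does not define a precedence rule.

```python
# Sample rows from the table above.
PARAMETERS = [
    {"FOLDER_NAME": "ALL", "WRKFLW_NAME": "ALL", "SESSN_NAME": "ALL",
     "PARM_NAME": "YR_BEGIN", "PARM_VAL": "01-JAN-2014"},
    {"FOLDER_NAME": "DW_SALES", "WRKFLW_NAME": "ALL", "SESSN_NAME": "ALL",
     "PARM_NAME": "REGION_NAME", "PARM_VAL": "USA"},
    {"FOLDER_NAME": "DW_SALES", "WRKFLW_NAME": "wf_LOAD_CUST_DIM",
     "SESSN_NAME": "s_LOAD_CUST_DIM", "PARM_NAME": "LST_RUN_TS",
     "PARM_VAL": "10-OCT-2014"},
]

SCOPE_COLUMNS = ("FOLDER_NAME", "WRKFLW_NAME", "SESSN_NAME")


def lookup_parameter(rows, folder, workflow, session, name):
    """Return the value of a parameter for the given context, or None."""
    context = dict(zip(SCOPE_COLUMNS, (folder, workflow, session)))

    def matches(row):
        # A scope column matches on the exact name or on the "ALL" wildcard.
        return row["PARM_NAME"] == name and all(
            row[col] in (context[col], "ALL") for col in SCOPE_COLUMNS
        )

    candidates = [row for row in rows if matches(row)]
    if not candidates:
        return None
    # Assumption: the row with the fewest wildcards is the most specific one.
    best = min(candidates, key=lambda r: sum(r[c] == "ALL" for c in SCOPE_COLUMNS))
    return best["PARM_VAL"]
```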

Better than Informatica Parameters and Variables

Since the parameter framework stores the values outside Informatica environment, you get much more flexibility with it.

Prevents accidental parameter value changes, which might happen to mapping variables during code migration.
Centralized storage for all parameter values, rather than storing them in different parameter files or mapping variables.
Easy to update or change the parameter value, unlike with mapping variables. When used with incremental data extraction logic, it is easy to update the parameter value to reprocess the same data set and enable restartability.
Dynamically changing parameters can be handled in the framework. Mapping variables can have only MAX or MIN operations to handle dynamically changing parameters.
Parameter framework can handle both static and dynamic parameters.
More secure than storing the parameters in a parameter file.

Please leave us a comment below, if you have any other thoughts or scenarios to be covered. We will be more than happy to help you.

Dynamic Transformation Port Linking Rules in Informatica Cloud Designer

2014-09-16T23:34:00.000-07:00

One of the coolest features missing from Informatica PowerCenter was the ability to dynamically link ports between transformations. Many other ETL tools already provide this feature. With Informatica Cloud Designer, you can build mappings with dynamic rules to connect ports between transformations.

What is Dynamic Field Linking

In a normal PowerCenter mapping, you need to explicitly connect the ports from one transformation to the next transformation in the pipeline. In the Cloud Designer, you can instead define rules to dynamically link ports between transformations in the data pipeline. Based on the rules defined, ports are connected or dropped between transformations.

This feature provides a lot of flexibility and code reusability from the developer and administrator perspective. We will see the business use case in a later section.

Field Rules and Type of Rules

Field rules define how data enters a transformation from the upstream transformation. By default, a transformation inherits all incoming fields from the upstream transformation. All transformations except Source transformations include field rule configuration. When you configure more than one field rule, the Mapping Configuration application evaluates the field rules in the specified order. Use the Actions menu to change the order of rules and delete rules.

The following image shows the field rules configured for the transformation. Based on the rules you choose, you can see the ports included in and excluded from the transformation.

All Fields :- Includes or excludes all fields from one transformation to the downstream transformation. Using the rename option, you can rename the port from one transformation to the other.

Fields by Data Type :- Includes or excludes ports of selected data types from one transformation to the downstream transformation. In the Include/Exclude Fields by Data Type dialog box, you can select the data types that you want to include or exclude. If you want to rename the ports, you can do it on the Rename tab.

Click the Configure button to open the window below and choose the port data types that need to be passed on to the downstream transformation.

Fields by Text or Pattern :- Includes or excludes fields by prefix, suffix, or pattern. You can use this option to select fields that you renamed earlier in the data flow. On the Select Fields tab, you can select prefix, suffix, or pattern, and define the rule to use. When you select the prefix option or suffix option, you enter the text to use as the prefix or suffix. When you select pattern, you can enter a regular expression.

Click the Configure button to open the window below and choose the port name pattern that needs to be passed on to the downstream transformation.

Named Fields :- Includes or excludes the selected fields. Opens the Include/Exclude Named Fields dialog box. On the Select Fields tab, you can review all incoming fields for selection. On the Rename Selected tab, you can rename selected fields individually or in bulk.

Click the Configure button to open the window below and choose the ports that need to be passed on to the downstream transformation.
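
The way ordered field rules select ports can be illustrated with a short sketch. This is a simplified model, not Informatica's implementation: the rule dictionaries, their keys and the function names are hypothetical, and renaming is left out.

```python
import re


def _matches(field, rule):
    """Check whether a (name, data_type) field is selected by one rule."""
    name, data_type = field
    kind = rule["kind"]
    if kind == "all":
        return True
    if kind == "data_type":
        return data_type in rule["types"]
    if kind == "prefix":
        return name.startswith(rule["text"])
    if kind == "suffix":
        return name.endswith(rule["text"])
    if kind == "pattern":
        return re.search(rule["text"], name) is not None
    if kind == "named":
        return name in rule["names"]
    raise ValueError(f"unknown rule kind: {kind}")


def apply_field_rules(incoming, rules):
    """Evaluate include/exclude rules in order and return the surviving fields."""
    selected = set()
    for rule in rules:
        matched = {f[0] for f in incoming if _matches(f, rule)}
        if rule["action"] == "include":
            selected |= matched
        else:
            selected -= matched
    # Keep the original field order of the upstream transformation.
    return [f for f in incoming if f[0] in selected]
```

For example, an "include all" rule followed by an "exclude by prefix" rule passes every incoming field except those carrying that prefix, which shows why the order of rules matters.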

Pros and Cons

Every approach has its own benefits and drawbacks. Here is what we see as the good and bad of dynamic column mapping.

Pros

Better code reusability. You build the mapping once and reuse the code for multiple data sources.
Better flexibility and scalability for development, by providing parametrization and reusability.
Reduce the number of objects to be maintained in the PowerCenter Repository.

Cons

Loses metadata about column mapping, hence data lineage cannot be produced.
Dynamically including all columns might lead to processing unwanted columns in the mapping pipeline.

Business Use Case

One typical use case is building stage table loading mappings. Since a typical stage table mapping does not include any unique or complex transformations, you can create just one mapping and parametrize the source table, target table, connection details and so on. This makes the development effort simple and highly reusable.
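
The idea of one generic, parametrized stage load can be sketched in plain Python. This is an analogy rather than an Informatica artifact: it assumes SQLite, source and target tables with the same column layout, and a truncate-and-load pattern; the function names are hypothetical.

```python
import sqlite3


def quote_ident(name):
    """Quote a table name so it can be safely embedded in a SQL statement."""
    return '"' + name.replace('"', '""') + '"'


def load_stage_table(conn, source_table, target_table):
    """Copy all columns from source to target and return the target row count.

    No column is named explicitly, so the same routine works for any
    source/target pair with matching structure.
    """
    src, tgt = quote_ident(source_table), quote_ident(target_table)
    conn.execute(f"DELETE FROM {tgt}")  # assumption: stage tables are reloaded in full
    conn.execute(f"INSERT INTO {tgt} SELECT * FROM {src}")
    conn.commit()
    return conn.execute(f"SELECT COUNT(*) FROM {tgt}").fetchone()[0]
```

Calling the same function with different table names plays the role of creating several tasks from one parametrized mapping.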

Hope you enjoyed this article. Please leave us your feedback and questions in the comment section below.

Informatica Cloud Mapping Tutorial for Beginners, Building the First Mapping

2014-09-06T17:53:00.000-07:00

In the last couple of articles we discussed the basics of Informatica Cloud and Informatica Cloud Designer. In this tutorial we describe how to create a basic mapping, save and validate the mapping, and create a mapping configuration task. The demo mapping reads from a source and writes to a target, and also demonstrates the parameterization technique.

The mapping we create here reads source data, filters out unwanted data, and writes data to the target. The mapping also includes parameters for the source connection and filter value. For this tutorial, you can use a sample Account source file available in the Informatica Cloud Community. You can download the sample source file from the following link Sample Source File for the Mapping Tutorial.

Step 1. Mapping Creation and Source Configuration

The following procedure describes how to create a new mapping and configure the sample Account flat file as the source.

To create a mapping, click Design > Mappings > New Mapping.

In the New Mapping dialog box, enter a name for the mapping: Account_by_State. You can use underscores in mapping and transformation names, but do not use other special characters.

To add a source to the mapping, on the Transformation Palette, click Source.

In the Properties Panel, on the General tab, enter a name for the source: FF_Account.
On the Source tab, configure the following properties:

Connection :- Source connection. Select the flat file connection for the sample Account source file. Or, create a new flat file connection for the sample source file.
Source Type :- Source type. Select Object.
Object :- Source object. Select the sample Account source file. To preview source data, click Preview Data.

To view source fields and field metadata, click the Fields tab.

To save the mapping and continue, on the toolbar, click Save > Save and Continue.

Step 2. Filter Creation and Field Rule Configuration

In the following procedure, you add a Filter transformation to the data flow and define a parameter for the value in the filter condition. When you use a parameter for the value of the filter condition, you can define the filter value that you want to use when you configure the task. And you can create a different task for the data for each state.

The sample Account source file includes a State field. When you use the State field in the filter condition, you can write data to the target by state. For example, when you use State = MD as the condition, you include accounts based in Maryland in the data flow.

To add a Filter transformation, on the Transformation palette, drag a Filter transformation to the mapping canvas.
To link the Filter transformation to the data flow, draw a link from the FF_Account source to the Filter transformation. When you link transformations, the downstream transformation inherits fields from the previous transformation.
To configure the Filter transformation, select the Filter transformation on the mapping canvas.
To name the Filter transformation, in the Properties panel, click General and enter the name: Filter_by_State.

To configure field rules, click Incoming Fields. Field rules define the fields that enter the transformation and how they are named. By default, all available fields are included in the transformation. Since we want to use all fields, do not configure additional field rules.

To configure the filter condition, click Filter.

To create a simple filter with a parameter for the value, for Filter Condition, select Simple.
Click Add New Filter Condition.
For Field Name, select State, and use Equals as the operator.
For Value, select New Parameter.

In the New Parameter dialog box, configure the following options and click OK.

Name: FConditionValue
Display Label: Filter Value for State
Description: Enter the two-character state name for the data you want to use.
Default Value: MD. Notice, you can only create a string parameter in this location.

To save your changes, click Save > Save and Continue.
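
In plain code, the parameterized filter configured in this step behaves roughly like the function below. It is only an analogy: the function name is hypothetical, and rows are modeled as dictionaries with a State key as in the sample Account file.

```python
def filter_by_state(rows, f_condition_value="MD"):
    """Keep only the accounts whose State equals the filter value.

    The default mirrors the FConditionValue parameter default; a task can
    supply a different value without changing the mapping logic.
    """
    return [row for row in rows if row["State"] == f_condition_value]
```

Because the state is a parameter, the same logic serves one task per state, for instance one with MD and another with TX.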

Step 3. Target and Source Parameter Configuration

In the following procedure, you configure the target, then replace the source connection with a parameter.

Because you plan to parametrize the source, you also need to use a parameter for the field mapping.

To add a Target transformation, on the Transformation palette, drag a Target transformation to the mapping canvas.
To link the Target transformation to the data flow, draw a link from the Filter transformation to the Target transformation.
Click the Target tab and configure the following properties:

Connection :- Target connection. Select a connection for the target. Or, create a new connection to the target.
Target Type :- Target type. Select Object.
Object :- Target object. Select an appropriate target.
Operation :- Target operation. Select Insert.

To configure the field mapping, click Field Mapping.
To map some fields and allow the remaining fields to be mapped in the task, configure the Field Map Option for Partially Parametrized.

Create a New Parameter and configure the following properties.

Name: PartialFieldMapping.
Display Label: Partial Field Mapping.
Select Allow partial mapping override. This allows you to view and edit mapped fields in the task. When you want to prevent the task developer from changing field mappings configured in the mapping, clear this option.

Map the fields that you want to show as mapped in the task.
Click Save > Save and Continue.
To edit the source to add a parameter for the source connection, click the FF_Account Source transformation, and then click the Source tab.
For Connection, click New Parameter.
In the New Parameter dialog box, configure the following parameter properties.

Name: SourceConnection.
Display Label: Sample Flat File.
Description: Select the connection to the sample file.

The completed mapping is shown below.

Step 4. Mapping Validation and Task Creation

In the following procedure, you save and validate the mapping. And you create a mapping configuration task based on the mapping.

To validate the mapping, click Save > Save and Continue.

When you save the mapping, the Mapping Designer validates the mapping. The mapping is valid when the Status in the status area shows Valid.
If the status is Invalid, in the toolbar, click the Validation icon. In the Validation panel, click Validate.

The Validation panel lists the transformations in the mapping and the mapping status. The mapping should be valid. If errors display, correct the errors. Click Validate to verify that errors are corrected.

To create a task based on the mapping, click Save > Save and New Mapping Configuration Task. The Mapping Configuration Task wizard launches as shown below.

On the Definition page, enter a name for the task, Mapping Tutorial, and select your Secure Agent. Notice that the task uses the mapping you just completed.

Click Next. On the Sources page, the source parameter displays. Notice, the tool tip for the connection displays the parameter description. For Sample Flat File, select the source connection to the sample file, and click Next.

Notice that the Targets page does not display because the target connection and object are defined in the mapping.
The Other Parameters page displays the remaining parameters for the mapping.

In the Partial Field Mapping parameter, map the target fields that you want to use.

Note that because you allowed partial mapping override, the Target Fields list displays all fields. You can keep or remove the existing links.

For the Filter Value for State parameter, delete the default value, MD, and enter TX.

To save and close the task, click Save > Save and Close.

In the next step you can schedule the mapping on a predefined schedule. Hope you guys enjoyed this article. We are curious to know about your feedback.

Informatica Incremental Aggregation Implementation and Business Use Cases

2014-07-16T22:48:00.000-07:00

Incremental aggregation is an ideal performance improvement technique when you have to do aggregate calculations on incrementally changing source data. Rather than forcing the session to process the entire source and recalculate the same data each time you run the session, incremental aggregation persists the aggregated values and applies the incremental changes to them. Let's see more details in this article.

What is Incremental Aggregation

Using incremental aggregation, you can apply changes captured from the source to aggregate calculations such as Sum, Min, Max and Average. If the source changes incrementally and you can capture the changes, you can configure the session to process only those changes. This allows the Integration Service to update the target incrementally, rather than forcing it to process the entire source and recalculate the same data each time you run the session.

When to Use Incremental Aggregation

You can capture new source data : Use incremental aggregation when you can capture new source data each time you run the session. Use a change data capture mechanism for the same.

Incremental changes do not significantly change the target : Use incremental aggregation when the changes do not significantly change the target. If processing the incrementally changed source alters more than half the existing target, the session may not benefit from using incremental aggregation. In this case, drop the table and recreate the target with complete source data.

How Incremental Aggregation Works

When the session runs with incremental aggregation enabled for the first time, it uses the entire source data. At the end of the session, the Integration Service stores aggregate data from that session run in two files, the index file and the data file, in the cache directory specified in the Aggregator transformation properties.

Each subsequent time you run the session with incremental aggregation, you use only the incremental source changes in the session. For each input record, the Integration Service checks the historical information in the index file for a corresponding aggregate group. If it finds a corresponding group, the Integration Service performs the aggregate operation incrementally, using the aggregate data for that group, and saves the incremental change. If it does not find a corresponding group, the Integration Service creates a new group and saves the record data.

Note : When incremental aggregation is enabled, it is important to read only the incremental changes from the source, to avoid double counting.

Business Use Case

Let's consider an ETL job that loads the Sales Summary table. The summary table holds the yearly sales summary by product line. The table includes the columns 'Sales Year', 'Product Line Name', 'Sales Quantity' and 'Sales Amount'.

Incremental Aggregation Implementation

Let's create a mapping that can identify the new sales data in the data source, and then enable incremental aggregation. New sales records are identified using the CREATE_DT column in the source table. The source qualifier of the mapping looks as in the image below. The source qualifier is set to read the changed data using mapping variables.

Now do the aggregation calculation using the Aggregator transformation, as shown in the image below.

Complete the mapping as shown in the image below.

Create the workflow and set the incremental aggregation setting in the session properties, as shown in the image.

Note : There is no need to use an Update Strategy transformation to implement insert-else-update logic. You can set the session properties just as for an insert-only mapping. When you use incremental aggregation, the Integration Service does the insert or update based on the primary key set in the target table.
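
The insert-or-update behavior described in the note can be pictured with the sketch below. It is an illustration of key-based upsert logic, not of how the Integration Service is implemented; it assumes SQLite, and the table name SALES_SUMMARY, its column names and the function name are hypothetical.

```python
import sqlite3


def upsert_summary(conn, sales_year, product_line, quantity, amount):
    """Update the summary row for the key if it exists, otherwise insert it."""
    cur = conn.execute(
        "UPDATE SALES_SUMMARY SET SALES_QTY = ?, SALES_AMT = ? "
        "WHERE SALES_YEAR = ? AND PRODUCT_LINE = ?",
        (quantity, amount, sales_year, product_line),
    )
    if cur.rowcount == 0:
        # No row with this primary key yet, so this is a new aggregate group.
        conn.execute(
            "INSERT INTO SALES_SUMMARY "
            "(SALES_YEAR, PRODUCT_LINE, SALES_QTY, SALES_AMT) VALUES (?, ?, ?, ?)",
            (sales_year, product_line, quantity, amount),
        )
    conn.commit()
```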

Incremental Aggregation Behind the Scene

Let's understand how incremental aggregation works behind the scenes. For better understanding, let's use the data set from the use case explained above.

Source data from Day 1

On Day 1, all data from the source is read and processed in the mapping.

Sales Date	Product Line	Sales Quantity	Sales Amount	Create Date
04-Jan-2014	Tablet	1	$450	04-Jan-2014
03-Feb-2014	Tablet	1	$500	03-Feb-2014
03-Feb-2014	Computers	1	$1,300	03-Feb-2014
13-Mar-2014	Cell Phone	2	$350	13-Mar-2014

Data from the source is read, summarized and persisted in Aggregator Cache. One row per aggregator group is persisted in the cache.

Sales Year	Product Line	Sales Quantity	Sales Amount	Note
2014	Tablet	2	$950	New In Cache
2014	Computers	1	$1,300	New In Cache
2014	Cell Phone	2	$350	New In Cache

Source data from Day 2

On Day 2, only new data is read from the source and processed in the mapping.

Sales Date	Product Line	Sales Quantity	Sales Amount	Create Date
14-Mar-2014	Tablet	1	$450	14-Mar-2014
14-Mar-2014	Tablet	1	$500	14-Mar-2014
14-Mar-2014	Video Game	1	$300	14-Mar-2014

Aggregator Cache is updated with the new values and new aggregator groups are inserted.

Sales Year	Product Line	Sales Quantity	Sales Amount	Note
2014	Tablet	4	$1,900	Update In Cache
2014	Computers	1	$1,300	No Change In Cache
2014	Cell Phone	2	$350	No Change In Cache
2014	Video Game	1	$300	New In Cache
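
The two runs above can be reproduced with a small sketch in which a dictionary stands in for the persisted aggregator cache. This models the behavior only; the function name is hypothetical and the real cache lives in the index and data files.

```python
def aggregate_incrementally(cache, rows):
    """Apply new source rows to the persisted aggregates.

    cache maps (sales_year, product_line) to (quantity, amount);
    rows are (sales_date, product_line, quantity, amount) tuples.
    """
    for sales_date, product_line, quantity, amount in rows:
        sales_year = int(sales_date.rsplit("-", 1)[1])
        key = (sales_year, product_line)
        # Existing group: add to it. Missing group: start from zero.
        old_quantity, old_amount = cache.get(key, (0, 0))
        cache[key] = (old_quantity + quantity, old_amount + amount)
    return cache


day1 = [
    ("04-Jan-2014", "Tablet", 1, 450),
    ("03-Feb-2014", "Tablet", 1, 500),
    ("03-Feb-2014", "Computers", 1, 1300),
    ("13-Mar-2014", "Cell Phone", 2, 350),
]
day2 = [
    ("14-Mar-2014", "Tablet", 1, 450),
    ("14-Mar-2014", "Tablet", 1, 500),
    ("14-Mar-2014", "Video Game", 1, 300),
]

cache = {}
aggregate_incrementally(cache, day1)  # first run reads the full source
aggregate_incrementally(cache, day2)  # second run reads only the new rows
```

After both runs the cache holds the same four groups as the second cache table, even though the Day 1 rows were read only once.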

Reinitializing the Aggregate Cache Files

Based on the use case we discussed here, we need to reset the aggregate cache files for every new year. You can reset the cache files using the settings shown in the image below. You will get a warning message about clearing the persisted aggregate values, but it can be ignored.

After you run a session that reinitializes the aggregate cache, edit the session properties to disable the Reinitialize Aggregate Cache option. If you do not clear Reinitialize Aggregate Cache, the Integration Service overwrites the aggregate cache each time you run the session.

Hope this article is useful for you guys. Please feel free to share your comments and any questions you may have.

Informatica Cloud Designer for Advanced Data Integration On the Cloud

2014-06-05T22:52:00.000-07:00

Informatica Cloud is an on-demand subscription service that provides cloud applications. It uses functionality from Informatica PowerCenter to provide easy-to-use, web-based applications. Cloud Designer is one of the applications provided by Informatica Cloud. Let's see the features of Informatica Cloud Designer in this article.

What is Informatica Cloud Designer

Informatica Cloud Designer is the counterpart of PowerCenter Designer on the cloud. Use Cloud Mapping Designer to configure mappings similar to PowerCenter mappings. When you configure a mapping, you describe the flow of data from source to target.

As in PowerCenter Designer, you can add transformations to transform data, such as an Expression transformation for row-level calculations, or a Filter transformation to remove data from the data flow. It additionally supports the Joiner transformation and the LookUp transformation. A transformation includes field rules to define incoming fields. Links visually represent how data moves through the data flow.

Cloud Designer Interface

Cloud Designer provides a web based user interface similar to what we have for PowerCenter Designer. This interface can be accessed from your Informatica Cloud Portal.

Below is a screenshot of Cloud Designer with different mapping designer areas.

Mapping Canvas :- The canvas for configuring a mapping, similar to the workspace in PowerCenter Designer.
Transformation Palette :- Lists the transformations that you can use in the mapping. You can add a transformation by clicking the transformation name. Or, drag the transformation to the mapping canvas.
Properties Panel :- Displays configuration options for the mapping or selected transformation. Different options display based on the transformation type. This is similar to different tabs available in PowerCenter Transformations.
Toolbar :- Provides different options such as Save, Cancel, Validate, Arrange All icon, Zoom In/Out.
Status Area :- Displays the status of the mapping and related tasks. It indicates if the mapping includes unsaved changes. When all changes are saved, indicates if the mapping is valid or invalid.

Transformations On Cloud Designer

Transformations are a part of a mapping that represent the operations that you want to perform on data. Transformations also define how data enters each transformation.

The Mapping Designer provides a set of active and passive transformations. 'Joiner' and 'Filter' are the two active transformations available. 'Expression' is a passive transformation, and the 'LookUp' transformation acts as passive when returning one row and active when returning more than one row.

Additionally designer supports 'Source' and 'Target' transformations to read and write data from different sources and targets.

Transformation	Type	Description
Source	N/A	Reads data from a source.
Target	N/A	Writes data to a target.
Joiner	Active	Joins two sources.
Filter	Active	Filters data from the data flow.
Expression	Passive	Modifies data based on passive expressions.
Lookup	Passive when returning one row. Active when returning more than one row.	Looks up data from a lookup object. Defines the lookup object and connection, as well as the lookup condition and return values.

Mapping Configuration Task

Mapping Configuration Task is similar to a session task in PowerCenter. The Mapping Configuration Task allows you to process data based on the data flow logic defined in a mapping.

When you create a mapping configuration task, you select the mapping for the task to use, just like you choose a mapping while you create a session task in PowerCenter. You also define the parameter value associated with the mapping.

The options you need to set for the Mapping Configuration are shown below.

Task Flows

Task Flows are similar to workflows in PowerCenter. You can create a task flow to group multiple tasks and run them in a specific order. You can run the task flow immediately or on a schedule. The task flow runs tasks serially, in the specified order.

The options you need to set for a Task Flow are shown below.

How Cloud Designer is Different

Cloud Designer is not a replacement for PowerCenter Designer; it is meant to provide more advanced data integration capability on the cloud. There are a few interesting features available in Cloud Designer that are not available in PowerCenter Designer.

1. Dynamic Field Propagation

Unlike PowerCenter Designer, you do not have to connect all the ports manually between transformations. It uses logical rules to propagate fields or ports from one transformation to other transformation.

Possible options for logical field mapping.

Include All Fields.
Include/Exclude Field by specific names.
Include/Exclude Fields by Data Types.
Include/Exclude Fields by name patterns.

The screenshot below shows the available options for logical field mapping. These options are available in the "Properties Panel".

It helps the mapping self-adapt to source or target structure changes. For example, using "All Fields" brings newly added fields into the mapping dynamically.

2. Parameterized Templates

A parameter is a placeholder for a value or values in a mapping. The Cloud Designer can be used to build reusable mappings that include parameterized values. These can be configured to create an integration workflow with specific business parameters entered at runtime.

You define the value of the parameter when you configure the mapping configuration task, as mentioned in the section above. Parameterization, along with dynamic field propagation, makes mappings built on the cloud extremely reusable templates.

You can get a free 30 day trial from here. Leave us your thoughts on Informatica Cloud Designer and other Cloud Apps and how you are using it in your enterprise.

Informatica Cloud for Dummies - Informatica Cloud, Components & Applications

2014-05-12T23:13:00.000-07:00

Informatica Cloud is an on-demand subscription service that provides cloud applications. When you subscribe to Informatica Cloud, you use a web browser to connect to Informatica Cloud. Informatica Cloud runs at a hosting facility.

Informatica Cloud Components

Informatica Cloud includes the following components.

1. Informatica Cloud :- A browser-based application that runs at the Informatica Cloud hosting facility. It allows you to configure connections, create users, and create, run, schedule, and monitor tasks.
You can log on to Informatica Cloud application using your user id and password.

2. Informatica Cloud hosting facility :- A facility where the Informatica Cloud application runs. The Informatica Cloud hosting facility stores all task and organization information, much as the PowerCenter repository does. Informatica Cloud does not store or stage source or target data.

3. Informatica Cloud applications :- Applications that you can use to perform tasks, such as data synchronization, contact validation, and data replication.

4. Informatica Cloud Secure Agent :- A component of Informatica Cloud installed on a local machine that runs all tasks and provides firewall access between the hosting facility and your organization. When the Secure Agent runs a task, it connects to the Informatica Cloud hosting facility to access task information, connects directly and securely to sources and targets, transfers data between sources and targets, and performs any additional task requirements.

Informatica Cloud Applications

Informatica Cloud provides the following applications to help with different types of data integration tasks, such as data synchronization, contact validation, data replication and more.

PowerCenter
Mapping Configuration
Data Synchronization
Data Replication
Contact Validation
Data Assessment
Data Masking

PowerCenter

The PowerCenter application allows you to import PowerCenter workflows into Informatica Cloud and run them as Informatica Cloud tasks. When you create a task, you can associate it with a schedule to run it at specified times or at regular intervals, or you can run it manually. You can monitor tasks that are currently running in the activity monitor and view logs about completed tasks in the activity log.

The screenshot below shows the options available to import a PowerCenter workflow.

Mapping Configuration

The Mapping Configuration Task is similar to a session task in PowerCenter. It allows you to process data based on the data flow logic defined in a mapping.

The screenshot below shows the options available to build a mapping configuration.

Data Synchronization

Use the Data Synchronization application to load data and integrate applications, databases, and files. It synchronizes data between a source and a target and includes add-on functionality such as saved queries and mapplets.

A data synchronization task can perform insert, update, delete and upsert operations. The options are shown below.

For example, you can read sales leads from your sales database and write them into Salesforce. You can also use expressions to transform the data according to your business logic or use data filters to filter data before writing it to targets.

Data Replication

Use the Data Replication application to replicate data from Salesforce or database sources to database or file targets. You might replicate data to archive the data, perform offline reporting, or consolidate and manage data.

The options available to set up a data replication task are shown below.

Contact Validation

The Contact Validation application is used to validate and correct postal address data, and to add geocode information to postal address data. You can also validate email addresses and check phone numbers against the Do Not Call Registry.

The Contact Validation application reads data from sources, validates and corrects the selected validation fields, and writes data to output files. In addition to validation fields, the Contact Validation application can include up to 30 additional source fields in the output files for a task.

Data Assessment

The Data Assessment application allows you to evaluate the quality of your Salesforce data. Use it to measure and monitor the quality of data in the Accounts, Contacts, Leads, and Opportunities Salesforce CRM objects. It generates graphical dashboards that measure field completeness, field conformance, record duplication, and address validity for each Salesforce object. You can run data assessment tasks on an ongoing basis to show trends in the data quality.

Data Masking

Use data masking to replace source data in sensitive columns with realistic test data for non-production environments. Data masking rules define the logic to replace the sensitive data. Assign data masking rules to the columns you need to mask.

How to Use Error Handling Options and Techniques in Informatica PowerCenter

2014-04-07T00:58:00.002-07:00

Data quality is critical to the success of every data warehouse project, so ETL Architects and Data Architects spend a lot of time defining the error handling approach. Informatica PowerCenter provides a set of options to take care of error handling in your ETL jobs. In this article, let's see how to leverage the PowerCenter options to handle exceptions.

Error Classification

You have to deal with different types of errors in an ETL job. When you run a session, the PowerCenter Integration Service can encounter fatal or non-fatal errors. Typical error handling includes:

User Defined Exceptions : Data issues critical to data quality, which might get loaded into the database unless explicitly checked. For example, a credit card transaction with a future transaction date can get loaded into the database unless the transaction date of every record is checked.
Non-Fatal Exceptions : Errors that Informatica PowerCenter ignores and that cause records to drop out of the target table unless they are handled in the ETL logic. For example, a data conversion transformation errors out and the record fails to load into the target table.
Fatal Exceptions : Errors such as database connection errors, which force Informatica PowerCenter to stop running the workflow.

I. User Defined Exceptions

Business users define the user defined exceptions, which are critical to data quality. We can set up user defined error handling using:

Error Handling Functions.
User Defined Error Tables.

1. Error Handling Functions

We can use two functions provided by Informatica PowerCenter to define our user defined error capture logic.

ERROR() : This function causes the PowerCenter Integration Service to skip a row and issue an error message, which you define. The error message is displayed in the session log or written to the error log tables, based on the error logging type configured in the session.

You can use ERROR in Expression transformations to validate data. Generally, you use ERROR within an IIF or DECODE function to set rules for skipping rows.

Eg : IIF(TRANS_DATA > SYSDATE,ERROR('Invalid Transaction Date'))

The above expression raises an error and drops any record whose transaction date (TRANS_DATA) is greater than the current date from the ETL process and the target table.

ABORT() : Stops the session and issues a specified error message, which is written to the session log file or to the error log tables based on the error logging type configured in the session. When the PowerCenter Integration Service encounters an ABORT function, it stops transforming data at that row. It processes any rows read before the session aborts.

You can use ABORT in Expression transformations to validate data.

Eg : IIF(ISNULL(LTRIM(RTRIM(CREDIT_CARD_NB))),ABORT('Empty Credit Card Number'))

The above expression aborts the session if any transaction record arrives without a credit card number.
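To make the difference between the two functions concrete, here is a minimal Python sketch that imitates the behaviour described above outside PowerCenter. It is only an illustration: `process_rows` and `SessionAborted` are hypothetical names, the rows are plain dictionaries, and a blank credit card number is treated the same as a NULL one.

```python
from datetime import date


class SessionAborted(Exception):
    """Raised to mimic ABORT(): processing stops at the offending row."""


def process_rows(rows, today):
    """Return (loaded_rows, skipped_messages) for a list of row dicts."""
    loaded, skipped = [], []
    for row in rows:
        card = (row.get("CREDIT_CARD_NB") or "").strip()
        if not card:
            # Like ABORT('Empty Credit Card Number'): stop the whole run.
            raise SessionAborted("Empty Credit Card Number")
        if row["TRANS_DATA"] > today:
            # Like ERROR('Invalid Transaction Date'): skip this row only.
            skipped.append("Invalid Transaction Date")
            continue
        loaded.append(row)
    return loaded, skipped
```

A row with a future date is skipped and the run continues, while a row without a card number ends the run at that row.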

Error Handling Function Use Case

Shown below is the configuration required in the expression transformation using the ABORT() and ERROR() functions. This transformation uses the expressions shown in the above examples.

Note :- You need to use these two functions in a mapping along with a session configuration for row error logging to capture the error data from the source system. Depending on the session configuration, source data will be collected into Informatica predefined PMERR error tables or files.

Please refer to the article "User Defined Error Handling in Informatica PowerCenter" for more detailed implementation information on user defined error handling.

2. User Defined Error Tables

Error handling functions are easy to implement with very little coding effort, but they have some disadvantages, such as poor readability of the error records in the PMERR tables and a performance impact. To avoid these disadvantages, you can create your own error log tables and capture the error records in them.

A typical approach is to create an error table that is similar in structure to the source table. The error table includes additional columns to tag the records as "error fixed" or "processed". Below is a sample error table. It includes all the columns from the source table and additional columns to identify the status of the error record.

Below is the high level design.

A typical ETL design reads error data from the error table along with the source data. During the data transformation, data quality is checked and any record violating a quality check is moved to the error table. Record flags are used to identify the records that have been reprocessed and the records that have been fixed and are ready for reprocessing.
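The flow described above can be sketched in a few lines of Python. This is an assumption-laden illustration, not the mapping itself: the flag columns `ERROR_FIXED` and `PROCESSED`, the function `route_records` and the `is_valid` callback are hypothetical names, and updating the flags of reprocessed records is left out.

```python
def route_records(source_rows, error_rows, is_valid):
    """Split incoming rows into target rows and rows for the error table."""
    target, errors = [], []
    # Reprocess only the error records that have been flagged as fixed.
    retry = [r for r in error_rows
             if r["ERROR_FIXED"] == "Y" and r["PROCESSED"] == "N"]
    for row in list(source_rows) + retry:
        # Drop the status flags so the row has the source table structure.
        data = {k: v for k, v in row.items()
                if k not in ("ERROR_FIXED", "PROCESSED")}
        if is_valid(data):
            target.append(data)
        else:
            # Failed quality check: send to the error table as a new error.
            errors.append({**data, "ERROR_FIXED": "N", "PROCESSED": "N"})
    return target, errors
```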

II. Non-Fatal Exceptions

Non-fatal exceptions cause records to be dropped during the ETL process, which makes them critical to data quality. You can handle non-fatal exceptions using:

Default Port Value Setting.
Row Error Logging.
Error Handling Settings.

1. Default Port Value Setting

Using the default value property is a good way to handle exceptions due to NULL values and unexpected transformation errors. The Designer assigns default values to handle null values and output transformation errors. PowerCenter Designer lets you override the default value in input, output and input/output ports.

Default value property behaves differently for different port types;

Input ports : Use default values if you do not want the Integration Service to treat null values as NULL.
Output ports : Use default values if you do not want to skip the row due to transformation error or if you want to write a specific message with the skipped row to the session log.
Input/output ports : Use default values if you do not want the Integration Service to treat null values as NULL. You cannot set user-defined default values for output transformation errors in an input/output port.

Default Value Use Case

Use Case 1

Shown below is the setting required to handle NULL values. This setting converts any NULL value returned by the dimension lookup to the default value -1. This technique can be used to handle late arriving dimensions.

Use Case 2

The setting below uses the default expression to convert the date if the incoming value is not in a valid date format.
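The two use cases behave roughly like the Python sketch below. It is only an analogy for the port default values: `lookup_key`, `parse_date`, the date format and the 1900-01-01 fallback date are assumptions made for illustration, while the -1 default key comes from Use Case 1.

```python
from datetime import datetime

DEFAULT_KEY = -1                      # same default as Use Case 1
DEFAULT_DATE = datetime(1900, 1, 1)   # hypothetical fallback for Use Case 2


def lookup_key(dimension, natural_key):
    """Return the dimension key, or the default when the lookup finds nothing."""
    key = dimension.get(natural_key)
    return DEFAULT_KEY if key is None else key


def parse_date(text, fmt="%Y-%m-%d"):
    """Return the parsed date, or the default when the value is not a valid date."""
    try:
        return datetime.strptime(text, fmt)
    except (ValueError, TypeError):
        return DEFAULT_DATE
```

With a default key in place, a fact row for a late arriving dimension still loads and can be corrected once the dimension record arrives.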

2. Row Error Logging

Row error logging helps capture any exception that was not considered during the design and coded in the mapping. It is the perfect way of capturing unexpected errors.

The session error handling setting shown below will capture any unhandled error into the PMERR tables.

Please refer to the article Error Handling Made Easy Using Informatica Row Error Logging for more details.

3. Error Handling Settings

Error handling properties at the session level provide options such as Stop On Errors, Stored Procedure Error, Pre-Session Command Task Error and Pre-Post SQL Error. You can use these properties to ignore such errors or to fail the session when one occurs.

Stop On Errors : Indicates how many non-fatal errors the Integration Service can encounter before it stops the session.
On Stored Procedure Error : If you select Stop Session, the Integration Service stops the session on errors executing a pre-session or post-session stored procedure.
On Pre-Session Command Task Error : If you select Stop Session, the Integration Service stops the session on errors executing pre-session shell commands.
Pre-Post SQL Error : If you select Stop Session, the Integration Service stops the session on errors executing pre-session or post-session SQL.
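The Stop On Errors threshold can be pictured with a short Python sketch. This is a simplified model rather than the Integration Service logic: `run_session` is a hypothetical function, a `ValueError` stands in for a non-fatal row error, and a threshold of 0 is taken to mean that non-fatal errors never stop the session.

```python
def run_session(rows, transform, stop_on_errors=0):
    """Apply transform to each row, stopping once the error threshold is hit."""
    loaded, error_count = [], 0
    for row in rows:
        try:
            loaded.append(transform(row))
        except ValueError:
            # Non-fatal error: the row is dropped and the counter goes up.
            error_count += 1
            if stop_on_errors and error_count >= stop_on_errors:
                return loaded, error_count, "FAILED"
    return loaded, error_count, "SUCCEEDED"
```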

III. Fatal Exceptions

A fatal error occurs when the Integration Service cannot access the source, target, or repository. When the session encounters a fatal error, the PowerCenter Integration Service terminates the session. To handle fatal errors, you can either use a restartable ETL design for your workflow or use the workflow recovery features of Informatica PowerCenter.

1. Restartable ETL Design

Restartability is the ability to restart an ETL job if a processing step fails to execute properly. This avoids the need for any manual cleanup before a failed job can restart. You want the ability to restart processing at the step where it failed as well as the ability to restart the entire ETL session.

Please refer to the article "Restartability Design Pattern for Different Type ETL Loads" for more details on restartable ETL design.
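As a rough idea of what restarting at the failed step means, the Python sketch below keeps a record of completed steps and skips them on the next run. It is a generic illustration under simple assumptions, not the design pattern from the referenced article: `run_job` and the in-memory `completed` set are hypothetical, and a real job would persist this state in a table or file.

```python
def run_job(steps, completed):
    """Run (name, action) steps in order, skipping steps already completed."""
    for name, action in steps:
        if name in completed:
            continue          # finished in an earlier run, do not repeat
        action()              # an exception here leaves `completed` unchanged
        completed.add(name)   # record success so a restart resumes after it
    return completed
```

If the second step fails, a rerun with the same `completed` set starts again at the second step instead of repeating the first.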

2. Workflow Recovery

Workflow recovery allows you to continue processing the workflow and workflow tasks from the point of interruption. During the workflow recovery process, the Integration Service accesses the workflow state of operation, which is stored in memory or on disk based on the recovery configuration. The workflow state of operation includes the status of tasks in the workflow and workflow variable values.

Please refer to the article "Informatica Workflow Recovery with High Availability for Auto Restartable Jobs" for more details on workflow recovery.

Hope this article is useful for you guys. Please feel free to share your comments and any questions you may have.

How to Avoid The Usage of SQL Overrides in Informatica PowerCenter Mappings

2014-03-16T23:28:00.005-07:00

Many Informatica PowerCenter developers tend to use SQL Override during mapping development. Developers find it easy and more productive to use SQL Override. At the same time, ETL Architects do not like SQL Overrides, as they hide the ETL logic from the metadata manager. In this article, let's see the options available to avoid SQL Override in different transformations.

What is SQL Override

Transformations such as Source Qualifier and Lookup provide an option to override the default query generated by PowerCenter. You can enter any valid SQL statement supported by the underlying database. You can enter your own SELECT statement with a list of columns in the SELECT clause that matches the transformation ports. The SQL can perform aggregate calculations, or call a stored procedure or stored function to read the data.

Source Qualifier Options to Avoid SQL Override

There are a few options available in the source qualifier that can be used effectively to avoid SQL Override.

1. User Defined Join

The user defined join option provides the most flexible way to avoid SQL Override. In the user defined join option, you enter only the contents of the WHERE clause of your SQL, not the entire query.

If the join syntax of your query is entirely within the WHERE clause, you can enter the WHERE clause of your query directly into the user defined join option, without any modification. Oracle still supports the old join style using (+), which sits within the WHERE clause, whereas most other databases use the newer JOIN syntax in the FROM clause.

The image below shows the left outer join between the CUSTOMER table and the PURCHASES table. This join uses the Oracle join syntax (+).

Note :- You cannot use the above option if the join syntax of your query is within the FROM clause.

Informatica Join Syntax

If the join syntax of your query is written within the FROM clause, you should use the Informatica join syntax in the user defined join option. When you use the Informatica join syntax, the Integration Service inserts the join syntax in the WHERE clause or the FROM clause of the query, depending on the underlying database syntax.

The Informatica join syntax supports Normal, Left Outer and Right Outer joins. Here is the join syntax.

Normal Join :- { source1 INNER JOIN source2 on join_condition }
Left Outer Join :- { source1 LEFT OUTER JOIN source2 on join_condition }
Right Outer Join :- { source1 RIGHT OUTER JOIN source2 on join_condition }

Note :- Enclose Informatica join syntax in braces { }.

The image above displays the Informatica join syntax. Using the user defined join option, the CUSTOMER table is left outer joined with the PURCHASES table.
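For readers who want to see what the left outer join returns, here is a small self-contained Python sketch using SQLite. It shows the result of a FROM clause style join, which is the kind of query the Informatica join syntax stands for. The column names CUST_ID, CUST_NAME and AMOUNT and the sample rows are assumptions for illustration, since the article only names the CUSTOMER and PURCHASES tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE CUSTOMER (CUST_ID INTEGER, CUST_NAME TEXT);
    CREATE TABLE PURCHASES (CUST_ID INTEGER, AMOUNT REAL);
    INSERT INTO CUSTOMER VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO PURCHASES VALUES (1, 25.0);
""")

# Every customer is returned, with NULL (None) where no purchase exists.
rows = conn.execute("""
    SELECT C.CUST_ID, C.CUST_NAME, P.AMOUNT
    FROM CUSTOMER C
    LEFT OUTER JOIN PURCHASES P ON C.CUST_ID = P.CUST_ID
    ORDER BY C.CUST_ID
""").fetchall()
conn.close()
```

With these assumed columns, the matching user defined join entry would be { CUSTOMER LEFT OUTER JOIN PURCHASES on CUSTOMER.CUST_ID = PURCHASES.CUST_ID }.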

2. Source Filter

The source filter option can be used to adjust the ‘WHERE’ clause of the SQL created by the Integration Service, without using the SQL Override option. You can enter a source filter to reduce the number of rows the Integration Service queries. Enter the source filter condition without the keyword ‘WHERE’.

In the example shown, the source filter option is used to filter source data based on the Customer ID.

3. Sorted Ports

Using the sorted ports option, you can sort the source data. When you use the sorted ports option, the Integration Service adds the ports to the ORDER BY clause in the default query. It adds the configured number of ports, starting at the top of the Source Qualifier transformation. Only connected ports are counted, so the sort applies to the first connected ports rather than simply the first ports listed in the transformation.

Based on the setting above, source data is sorted on the first two connected ports from the source qualifier to the downstream transformations. The data is sorted in ascending order.

4. Select Distinct

If you want the Integration Service to select unique values from a source, use the Select Distinct option. Using Select Distinct filters out unnecessary data earlier in the data flow, which might improve performance.

The 'Select Distinct' option can be set in the source qualifier as shown in the above image.
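To see how the source filter, sorted ports and select distinct options each change the default query, consider the Python sketch below. It only imitates the idea: `default_query` is a hypothetical helper, the port and table names are made up, and the exact SQL text PowerCenter generates may differ.

```python
def default_query(table, ports, source_filter=None, sorted_ports=0,
                  select_distinct=False):
    """Build a SELECT statement the way the three options shape it."""
    select = "SELECT DISTINCT" if select_distinct else "SELECT"
    sql = f"{select} {', '.join(ports)} FROM {table}"
    if source_filter:
        # The filter is entered without the WHERE keyword.
        sql += f" WHERE {source_filter}"
    if sorted_ports:
        # The first N connected ports go into the ORDER BY clause.
        sql += f" ORDER BY {', '.join(ports[:sorted_ports])}"
    return sql


query = default_query("CUSTOMER", ["CUST_ID", "CUST_NAME", "CITY"],
                      source_filter="CUST_ID > 100", sorted_ports=2,
                      select_distinct=True)
```

Here `ports` is assumed to hold only the connected ports, in the order they appear in the source qualifier.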

Advantages and Limitations of SQL Override

Pros

Utilize database optimizer techniques such as indexes and hints.
Can accommodate complex queries.

Cons

Transformation logic is hidden from metadata searches.
Unable to utilize Partitioning or Pushdown Optimization options.
Processing impacts database resources.

Hope you enjoyed this article. Feel free to ask any further questions or clarifications in the comment section below. We are happy to help you.

Data Security Using Informatica PowerCenter Data Masking Transformation

2014-02-28T00:02:00.000-08:00

You might have come across a scenario where you do not have enough good data in your Development and QA regions for testing, and you are not allowed to copy data over from the production environment for data security reasons. Using the Informatica PowerCenter Data Masking transformation, you can overcome such scenarios. In this article, let's see how to use the masking transformation.

What is Data Masking Transformation

Using Data Masking transformation, you change sensitive production data to realistic test data for non-production environments. The Data Masking transformation modifies source data based on masking rules that you configure for each column.

You can apply the following types of masking with the Data Masking transformation.

Key masking :- Produces deterministic results for the same source data.
Random masking :- Produces random, non-repeatable results for the same source data.
Expression masking :- Applies an expression to a port to change the data or create data.
Substitution :- Replaces a column of data with similar but unrelated data from a dictionary.
Special mask formats :- Applies special mask formats to change SSN, credit card number, phone number, URL, email address, or IP addresses.

Let's see each masking rule in detail.

Key Masking

A column configured for key masking returns deterministic masked data each time the source value and seed value are the same. The masked output remains the same with the same input value. Use the same seed value to generate same masked value between transformations for the same input value.

Key Masking Properties

You can configure the following masking rules and properties for key masking string values:

Seed :- Apply a seed value to generate same masked data for a column for the input between sessions. Select one of the following options:

Value :- Accept the default seed value or enter a number between 1 and 1,000.
Mapping Parameter :- Use a mapping parameter to define the seed value.

Mask Format :- Define the type of character to substitute for each character in the input data. Use this property to keep the input and masked data in the same format.
Source String Characters :- Source string characters are source characters that you choose to mask or not mask.
Result String Characters :- Substitute the characters in the target string with the characters you define in Result String Characters.

Hint :- Use the same seed value to mask a primary key in a table and the foreign key value in another table.

Example :- The masking properties for Key Masking are shown below. This transformation masks the DEPT_ID column using key masking. The masked DEPT_ID will have the format DDD+AAAAAA.
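The idea of deterministic masking with a seed can be demonstrated with a short Python sketch. Please note this is not the algorithm used by the Data Masking transformation. It is a stand-in that hashes the seed and the source value to drive a random generator, and `key_mask` is a hypothetical function that supports only the A, D and + format characters.

```python
import hashlib
import random
import string

POOLS = {"A": string.ascii_letters, "D": string.digits}


def key_mask(value, seed, mask_format):
    """Mask value so that the same seed and value always give the same result."""
    digest = hashlib.sha256(f"{seed}:{value}".encode()).hexdigest()
    rng = random.Random(digest)  # generator state depends only on seed and value
    masked = []
    # zip stops at the shorter input, so the format should cover the whole value.
    for ch, fmt in zip(value, mask_format):
        # '+' keeps the source character, 'A' and 'D' pick a letter or a digit.
        masked.append(ch if fmt == "+" else rng.choice(POOLS[fmt]))
    return "".join(masked)


masked_dept_id = key_mask("DPT-128923", 190, "DDD+AAAAAA")
```

Because the output depends only on the seed and the input, masking a primary key and the matching foreign key with the same seed keeps them joinable, which is the point of the hint above. The seed 190 is just an example value.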

Substitution Masking

Substitution masking replaces a column of data with similar but unrelated data. When you configure substitution masking, define the relational or flat file dictionary that contains the substitute values. The Data Masking transformation performs a lookup on the dictionary that you configure and replaces source data with data from the dictionary. It is an effective way to replace production data with realistic test data.

Substitution Source Dictionaries

To use substitution masking, you need a flat file or relational table that contains the substitute data and a serial number for each row in the file or the relational table. The serial number should start from one and cannot have any missing numbers.

Below is the structure of the substitution file, which has a serial number column, the department id and the corresponding masked department id.

SNO,DEPT_ID,MASKED_DEPT_ID
1,DPT-128923,ABC-999999
2,DPT-234265,LMN-888888

Substitution Masking Properties

You can configure the following masking rules for substitution masking.

Repeatable Output :- Returns same results between sessions for the same input.
Seed :- Apply a seed value to generate same masked data for a column for the input between sessions. Select one of the following options:

Value :- Accept the default seed value or enter a number between 1 and 1,000.
Mapping Parameter :- Use a mapping parameter to define the seed value.

Unique Output :- Force the PowerCenter Integration Service to create unique Data Masking output values for unique input values. No two input values are masked to the same output value.

Dictionary Information :- Configure the flat file or relational table that contains the substitute data values.

Relational Table :- Select Relational Table if the dictionary is in a database table.
Flat File :- Select Flat File if the dictionary is in a flat file delimited by commas.
Dictionary Name :- Displays the flat file or relational table name that you selected.
Serial Number Column :- Select the column in the dictionary that contains the serial number.
Output Column :- Choose the column to return to the Data Masking transformation.

Lookup condition :- When you configure a lookup condition you compare the value of a column in the source with a column in the dictionary to pick the masked value.

Input port :- Source data column to use in the lookup.
Dictionary column :- Dictionary column to compare the input port to.

Example :- The masking properties for Substitution Masking are shown below. As per the example, SNO is the serial number column and MASKED_DEPT_ID is the substitution value from the file for each DEPT_ID. The lookup condition to search the flat file is defined on DEPT_ID.
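The dictionary lookup in this example can be reproduced with a few lines of Python. This is only a sketch of the lookup condition on DEPT_ID using the sample dictionary rows from this article. The helper names `load_dictionary` and `substitute` are hypothetical, and returning None for a value missing from the dictionary is an assumption of the sketch, not documented transformation behaviour.

```python
import csv
import io

DICTIONARY_CSV = """SNO,DEPT_ID,MASKED_DEPT_ID
1,DPT-128923,ABC-999999
2,DPT-234265,LMN-888888
"""


def load_dictionary(text):
    """Read the comma delimited dictionary into a DEPT_ID -> masked value map."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["DEPT_ID"]: row["MASKED_DEPT_ID"] for row in reader}


def substitute(dept_id, dictionary):
    """Return the substitute value for dept_id, or None when there is no match."""
    return dictionary.get(dept_id)


dictionary = load_dictionary(DICTIONARY_CSV)
masked_value = substitute("DPT-128923", dictionary)
```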

Random Masking

Random masking generates random masked data. The Data Masking transformation returns different values when the same source value occurs in different rows. You can mask numeric, string or date values with random masking.

Random Masking Properties

You can configure the following masking rules for random masking.

Range :- Configure the minimum and maximum string length. The Data Masking transformation returns a string of random characters between the minimum and maximum string length.
Mask Format :- Define the type of character to substitute for each character in the input data. Use this property to keep the input and masked data in the same format.
Source String Characters :- Source string characters are source characters that you choose to mask or not mask.
Result String Characters :- Substitute the characters in the target string with the characters you define in Result String Characters.

Example :- The masking properties for Random Masking are shown below. As per the example, the masked DEPT_ID will have the format DDD+AAAAAA and the character '-' will not be masked.

Expression Masking

Expression masking applies an expression to a port to change the data or create new data. When you configure expression masking, create an expression in the Expression Editor. You can select input and output ports, functions, variables, and operators to build expressions.

Example :- The masking properties for Expression Masking are shown below.

Special Masking Formats

Applies special mask formats to change SSN, credit card number, phone number, URL, email address, or IP addresses. The Data Masking transformation returns a masked value that has a realistic format, but is not a valid value. For example, when you mask an SSN, the Data Masking transformation returns an SSN that is the correct format but is not valid. You can configure repeatable masking for Social Security numbers.

Example :- The masking properties for Special Masking are shown below.

Masking Properties in Detail

Let's see a few masking properties in detail.

1. Mask Format

Configure a mask format to limit each character in the output column to an alphabetic, numeric, or alphanumeric character. This property is used by random and key masking. Use the following characters to define a mask format:

A :- Alphabetical characters. For example, ASCII characters a to z and A to Z.
D :- Digits. 0 to 9.
N :- Alphanumeric characters. For example, ASCII characters a to z, A to Z, and 0-9.
X :- Any character. For example, alphanumeric or symbol.
+ :- No masking.
R :- Specifies that the remaining characters in the string can be any character type.
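The mask format characters can be tried out with the Python sketch below, which applies a format to a value using random replacement characters. It is an illustration only: `apply_mask_format` is a hypothetical function, the character pools are limited to ASCII, and leaving characters beyond the end of the format unchanged is an assumption of the sketch.

```python
import random
import string

POOLS = {
    "A": string.ascii_letters,
    "D": string.digits,
    "N": string.ascii_letters + string.digits,
    "X": string.ascii_letters + string.digits + string.punctuation,
}


def apply_mask_format(value, mask_format, rng=None):
    """Replace each character of value according to the mask format."""
    rng = rng or random.Random()
    masked = []
    for i, ch in enumerate(value):
        # Assumption: characters beyond the format are left unchanged.
        fmt = mask_format[i] if i < len(mask_format) else "+"
        if fmt == "+":
            masked.append(ch)  # no masking for this position
        elif fmt == "R":
            # R: every remaining character may be of any type.
            masked.extend(rng.choice(POOLS["X"]) for _ in value[i:])
            break
        else:
            masked.append(rng.choice(POOLS[fmt]))
    return "".join(masked)
```

For example, applying the format DDD+AAAAAA to DPT-128923 yields three digits, the original '-', and six letters.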

2. Source String Characters

Source string characters are source characters that you choose to mask or not mask. The position of the characters in the source string does not matter but it is case sensitive. This property is used by random and key masking.

Mask Only :- The Data Masking transformation masks characters in the source that you configure as source string characters. For example, if you enter the characters A, B, and c, the Data Masking transformation replaces A, B, or c with a different character when the character occurs in source data. A source character that is not an A, B, or c does not change. The mask is case sensitive.

Mask All Except :- Masks all characters except the source string characters that occur in the source string.
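The two source string character options can be compared with this small Python sketch. It is a simplified model with hypothetical names: `mask_source_chars` replaces a selected character with a random ASCII letter that differs from the original, which is just one possible choice of replacement.

```python
import random
import string


def mask_source_chars(value, chars, mode="mask_only", rng=None):
    """Mask only the listed characters, or everything except them."""
    rng = rng or random.Random()
    masked = []
    for ch in value:
        listed = ch in chars  # membership test is case sensitive
        mask_it = listed if mode == "mask_only" else not listed
        if mask_it:
            # Pick a replacement letter that differs from the source character.
            pool = [c for c in string.ascii_letters if c != ch]
            masked.append(rng.choice(pool))
        else:
            masked.append(ch)
    return "".join(masked)
```

Calling it with the characters "ABc" and the mode "mask_only" changes only A, B and c, while the mode "mask_all_except" with the character "-" keeps the hyphen and masks everything else.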

3. Result String Replacement Characters

Result string replacement characters are characters you choose as substitute characters in the masked data. When you configure result string replacement characters, the Data Masking transformation replaces characters in the source string with the result string replacement characters. This property is used by random and key masking.

Use Only :- Mask the source with only the characters you define as result string replacement characters. For example, if you enter the characters A, B, and c, the Data Masking transformation replaces every character in the source column with an A, B, or c. The word “horse” might be replaced with “BAcBA.”

Use All Except :- Mask the source with any characters except the characters you define as result string replacement characters. For example, if you enter A, B, and c result string replacement characters, the masked data never has the characters A, B, or c.
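The two result string options can be sketched in Python as follows. This is an illustration with hypothetical names: `mask_with_result_chars` masks every character of the source, and for the "use all except" mode it draws replacements from ASCII letters only, which is an assumption of the sketch.

```python
import random
import string


def mask_with_result_chars(value, chars, mode="use_only", rng=None):
    """Mask every character using only, or anything but, the given characters."""
    rng = rng or random.Random()
    if mode == "use_only":
        pool = list(chars)
    else:
        # Use All Except: any letter that is not in the excluded characters.
        pool = [c for c in string.ascii_letters if c not in chars]
    return "".join(rng.choice(pool) for _ in value)
```

With the characters "ABc" and the mode "use_only", the word "horse" becomes a five character string made up of A, B and c only.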

Hope you enjoyed this article. Feel free to ask any further questions or clarifications in the comment section below. We are happy to help you.

Transaction Control Transformation to Control Commit and Rollback in Your ETL

2014-01-29T22:37:00.001-08:00

In a typical Informatica PowerCenter workflow, data is committed to the target table after a predefined number of rows, specified in the session properties, are processed into the target. But there are scenarios in which you need more control over commits and rollbacks. In this article, let's see how we can achieve this using the Transaction Control Transformation.

What is Transaction Control Transformation

A transaction is the set of rows bound by commit or rollback rows. The Transaction Control Transformation lets you control the commit and rollback transactions based on an expression or logic defined in the mapping. For example, you might want to define transactions based on a group of rows ordered on a common key, such as employee ID or order entry date.

When you run the session, the Integration Service evaluates the expression defined in the transformation for each row that enters the transformation. When it evaluates a commit row, it commits all rows in the transaction to the target. When the Integration Service evaluates a roll back row, it rolls back all rows in the transaction from the target.

Configuring Transaction Control Transformation

A Transaction Control Transformation can be created and used like any other active transformation. All the properties required to configure this transformation can be provided in the Properties tab, as shown in the image below.

You can enter the transaction control expression in the Transaction Control Condition field. The transaction control expression uses the IIF function to test each row against the condition. The Integration Service evaluates the condition on a row-by-row basis. The return value determines whether the Integration Service commits, rolls back, or makes no transaction changes to the row.

You can use the following built-in variables in the Expression Editor when you create a transaction control expression.

TC_CONTINUE_TRANSACTION. The Integration Service does not perform any transaction change for this row. This is the default value of the expression.
TC_COMMIT_BEFORE. The Integration Service commits the transaction and begins a new transaction. The current row is in the new transaction.
TC_COMMIT_AFTER. The Integration Service writes the current row to the target, commits the transaction, and begins a new transaction. The current row is in the committed transaction.
TC_ROLLBACK_BEFORE. The Integration Service rolls back the current transaction and begins a new transaction. The current row is in the new transaction.
TC_ROLLBACK_AFTER. The Integration Service writes the current row to the target, rolls back the transaction, and begins a new transaction. The current row is in the rolled back transaction.
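The effect of the five variables on transaction boundaries can be traced with the Python sketch below. It is a simplified simulation, not the Integration Service: `apply_transaction_control` is a hypothetical function that takes (row, action) pairs, and committing whatever is still open at the end of the data is an assumption of the sketch.

```python
def apply_transaction_control(rows):
    """Return the list of committed transactions for (row, action) pairs."""
    committed, current = [], []
    for row, action in rows:
        if action == "TC_COMMIT_BEFORE":
            if current:
                committed.append(current)   # commit rows before this one
            current = [row]                 # current row starts a new transaction
        elif action == "TC_COMMIT_AFTER":
            current.append(row)
            committed.append(current)       # current row is committed too
            current = []
        elif action == "TC_ROLLBACK_BEFORE":
            current = [row]                 # earlier rows are discarded
        elif action == "TC_ROLLBACK_AFTER":
            current = []                    # current row is discarded as well
        else:                               # TC_CONTINUE_TRANSACTION
            current.append(row)
    if current:
        committed.append(current)           # assumption: commit at end of data
    return committed
```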

Transaction Control Transformation Use Case

Let's consider an ETL job loading data into an OLTP application. The application data is accessed by the system in real time. This means the data loaded into the target table must remain consistent and preserve integrity.

To be more specific about the use case, sales order data loaded into the OLTP application target table needs to be committed only after all the order items in a sales order are loaded into the target table.

Solution : Here let's create a Transaction Control Transformation, which is connected in the mapping pipeline after all the ETL logic is complete. The logic to define the commit points can be provided in the Transaction Control Transformation.

Step 1 :- Once the required transformation logic is built in the mapping, create a sorter transformation to group all the order items within a sales order together based on ORDER_ID, as shown below.

Step 2 :- Create an expression transformation and add new ports with the expressions below. This step lets you identify when all the records in an order have finished processing.

V_NEXT_ORDER_FLAG (Variable) :- IIF(ORDER_ID = V_PRIOR_ORDER_ID, 'N', 'Y')
V_PRIOR_ORDER_ID (Variable) :- ORDER_ID
NEXT_ORDER_FLAG (Output) :- V_NEXT_ORDER_FLAG

Hint :- This variable port technique can be used to preserve the value from a prior record.

Step 3 :- Now you can create the Transaction Control Transformation like any other active transformation and connect it to the upstream transformation as shown below. Provide the expression that defines the commit logic. The expression for our use case is given below.

IIF(NEXT_ORDER_FLAG = 'N',TC_CONTINUE_TRANSACTION,TC_COMMIT_BEFORE)

Step 4 :- Now connect all the ports from the Transaction Control transformation to the target definition.
Note :- While configuring the session, be sure to set the "Commit Type" property to "User Defined".
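
The commit logic from the steps above can be pictured outside PowerCenter with a small Python simulation. This is only an illustrative sketch, not Informatica code: the function name commit_batches and the sample rows are assumptions, the rows are assumed to arrive already sorted by ORDER_ID (as the Sorter in Step 1 guarantees), and the last open transaction is assumed to be committed at end of data.

```python
def commit_batches(rows):
    """Group sorted rows into transactions, mimicking TC_COMMIT_BEFORE on each new order.

    rows: iterable of (order_id, item) tuples, already sorted by order_id.
    Returns a list of transactions, each a list of rows committed together.
    """
    transactions = []
    current = []
    prior_order_id = None  # plays the role of the V_PRIOR_ORDER_ID variable port
    for order_id, item in rows:
        next_order_flag = 'N' if order_id == prior_order_id else 'Y'
        if next_order_flag == 'Y' and current:
            # TC_COMMIT_BEFORE: commit the open transaction, current row starts a new one
            transactions.append(current)
            current = []
        # TC_CONTINUE_TRANSACTION: the row simply joins the open transaction
        current.append((order_id, item))
        prior_order_id = order_id
    if current:
        # assumed end-of-data commit of the last open transaction
        transactions.append(current)
    return transactions


# Hypothetical sample data: three orders with their items
sample = [(100, "pen"), (100, "ink"), (101, "desk"), (102, "lamp"), (102, "bulb")]
batches = commit_batches(sample)
```

Each inner list in batches corresponds to one committed transaction, so an order is never visible in the target with only part of its items.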

Hope this tutorial was useful for your project. Please leave your questions and comments. We will be more than happy to help you.

Informatica PowerCenter Design Best Practices and Guidelines

2014-01-20T18:20:00.001-08:00

A high-level, systematic ETL design helps to build efficient and flexible ETL processes, so special care should be given to the design phase of your project. In the following sections we cover the key points to keep in mind while designing an ETL process. These recommendations can be integrated into your ETL design and development processes to simplify the effort and improve the overall quality of the finished product.

Consistency
Modularity
Reusability
Scalability
Simplicity

1. Consistency

To ensure consistency and facilitate easy maintenance after the move to production, it is important to define and agree on development standards before development work begins.

The standards define the ground rules for the development team. They can range from naming conventions to documentation standards to error handling standards. Development work should adhere to these standards throughout the life cycle, and new team members can reference them to understand the requirements placed upon the design and build activities.

Applying consistent standards for naming conventions, design patterns, error handling and change data capture reduces long-term complications and makes maintenance easy.

2. Modularity

A modular design is important for an efficient ETL design. Divide the different components of your ETL process, such as incremental data pull logic, error handling, change data capture and operational metadata logging, into separate modules. This makes the ETL processes efficient, scalable and maintainable.

3. Reusability

Reusability is a great feature in Informatica PowerCenter. Its general purpose is to reduce unnecessary coding, which ultimately reduces development time and increases supportability. It also helps you react quickly to changes required in a program.

Reuse deserves close attention during the design phase, so that modifications can be made quickly and universally. Informatica PowerCenter provides a variety of methods to achieve reusability, such as Mapplets, Worklets, Reusable Transformations, Reusable Functions, Parameters and Shared Folders.

4. Scalability

Keep volumes in mind in order to create an efficient ETL process. Estimating the data volume requirements of a data integration project is critical. Based on the volume estimates, special consideration needs to be given to caching in different transformations, running complex queries, and applying performance tuning techniques such as Pushdown Optimization, Session Partitioning, Dynamic Session Partitioning, Concurrent Workflows, Grid Deployments, Workflow Load Balancing and other available performance tips.

5. Simplicity

It is recommended to create multiple simple ETL processes, Informatica mappings and Informatica workflows instead of a few complex ones. Use a staging area and try to keep the processing logic as clear and simple as possible. Such a design is easier to develop, debug and maintain than complex ETL logic.

Design Approach to Handle Late Arriving Dimensions and Late Arriving Facts

2013-12-29T23:48:00.003-08:00

In the typical case for a data warehouse, dimensions are processed first and the facts are loaded later, with the assumption that all required dimension data is already in place. This may not be true in all cases because of the nature of your business process or the behavior of the source application. Fact data can also be sent from the source application to the warehouse much later than the actual fact is created. In this article let's discuss several options for handling late arriving dimensions and facts.

What is Late Arriving Dimension

Late arriving dimensions, sometimes called early-arriving facts, occur when dimension data arrives in the data warehouse later than the fact data that references that dimension record.

For example, an employee who gets medical insurance through his employer is eligible for insurance coverage from the first day of employment. But the employer may not provide the medical insurance information to the insurance provider for several weeks. If the employee undergoes any medical treatment during this time, his medical claim records will arrive as fact records without the corresponding patient dimension details.

Design Approaches

Depending on the business scenario and the type of dimension in use, we can take different design approaches.

Hold the Fact record until Dimension record is available.
'Unknown' or default Dimension record.
Inferring the Dimension record.
Late Arriving Dimension and SCD Type 2 changes.

1. Hold the Fact record until Dimension record is available

One approach is to place the fact row in a suspense table. The fact row will be held in the suspense table until the associated dimension record has been processed. This solution is relatively easy to implement, but the primary drawback is that the fact row isn’t available for reporting until the associated dimension record has been handled.

This approach is more suitable when your data warehouse is refreshed as a scheduled batch process and a delay in loading fact records until the dimension records are available is acceptable for the business.
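
The suspense-table idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name load_facts, the field names (claim_id, patient_id, patient_sk) and the in-memory lists standing in for the suspense and fact tables are all hypothetical.

```python
def load_facts(incoming_facts, dimension_keys, suspense):
    """Load facts whose dimension row exists; hold the rest in suspense.

    dimension_keys: dict mapping natural key -> surrogate key.
    suspense: facts held back by earlier runs; they are retried first.
    Returns (loaded_rows, still_suspended).
    """
    loaded, still_suspended = [], []
    for fact in suspense + incoming_facts:
        surrogate_key = dimension_keys.get(fact["patient_id"])
        if surrogate_key is None:
            still_suspended.append(fact)  # dimension not here yet: keep holding
        else:
            loaded.append({**fact, "patient_sk": surrogate_key})
    return loaded, still_suspended


# First batch: the claim arrives before the patient dimension row
claims = [{"claim_id": 1, "patient_id": 67223, "amount": 120.0}]
loaded_run1, held_run1 = load_facts(claims, {}, [])

# Later batch: the dimension row exists, so the held claim is released
loaded_run2, held_run2 = load_facts([], {67223: 1001}, held_run1)
```

The claim is invisible to reporting after the first run and appears only after the second, which is exactly the drawback described above.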

2. 'Unknown' or default Dimension record

Another approach is to simply assign the “Unknown” dimension member to the fact record. On the positive side, this approach does allow the fact record to be recorded during the ETL process. But it won’t be associated with the correct dimension value.

The "Unknown" fact records can also be kept in a suspense table. Eventually, when the dimension data is processed, the suspense data can be reprocessed and associated with a real, valid dimension record.

3. Inferring the Dimension record

Another method is to insert a new Dimension record with a new surrogate key and use the same surrogate key to load the incoming fact record. This only works if you have enough details about the dimension in the fact record to construct the natural key. Without this, you would never be able to go back and update this dimension row with complete attributes.

In the insurance claim example explained at the beginning, it is almost certain that the "patient id", which is the natural key of the patient dimension, will be part of the claim fact. So we can create a new placeholder dimension record for the patient with a new surrogate key and the natural key "patient id".

Note : When you get all the other attributes for the patient dimension record at a later point, you will have to do an SCD Type 1 update the first time and SCD Type 2 going forward.
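
A rough Python sketch of the inferred-member approach follows. The function names, the "inferred" flag column and the dictionary standing in for the patient dimension are assumptions made for illustration; a real implementation would work against the dimension table itself.

```python
def resolve_patient_sk(patient_dim, patient_id, next_sk):
    """Return (surrogate_key, next_sk), inserting an inferred placeholder if needed.

    patient_dim: dict mapping natural key -> dimension row (a dict).
    next_sk: the next unused surrogate key value.
    """
    row = patient_dim.get(patient_id)
    if row is None:
        # Only the natural key is known, so every other attribute stays empty
        row = {"patient_sk": next_sk, "patient_id": patient_id,
               "name": None, "inferred": True}
        patient_dim[patient_id] = row
        next_sk += 1
    return row["patient_sk"], next_sk


def apply_dimension_record(patient_dim, patient_id, attributes):
    """Overwrite the placeholder in place the first time real attributes arrive."""
    row = patient_dim[patient_id]
    if row["inferred"]:
        row.update(attributes)      # SCD Type 1 style overwrite, same surrogate key
        row["inferred"] = False
        return "type1_overwrite"
    return "type2_needed"           # later changes would add a new version row


patient_dim = {}
claim_sk, next_key = resolve_patient_sk(patient_dim, 67223, 1001)
first_result = apply_dimension_record(patient_dim, 67223, {"name": "A. Patient"})
```

Because the placeholder keeps its surrogate key when it is overwritten, fact rows loaded earlier need no change at that point.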

4. Late Arriving Dimension and SCD Type 2 changes

A late arriving dimension with SCD Type 2 changes is more complex to handle.

4.1. Late Arriving Dimension with multiple historical changes

As described above, we can handle a late arriving dimension by keeping an "Unknown" dimension record or an "Inferred" dimension record, which acts as a placeholder.

Even before we get the full dimension record details from the source system, there may be multiple SCD Type 2 changes to the placeholder dimension record. Each change leads to the creation of a new dimension record with a new surrogate key, and the surrogate key of any subsequent fact record has to be modified to point to the new surrogate key.

Let's see the scenario in detail with the help of the medical insurance claim example.

The patient with ID 67223 has made two insurance claims, one on 9/10 and the other on 9/20. As no patient dimension information is available for patient id 67223 yet, an 'Inferred' dimension record is created for the patient with surrogate key 1001.

Shown below is the state of the dimension and the fact table at this point.

Later, by the time the dimension information is made available, there have already been SCD Type 2 changes for patient id 67223. There have been changes for patient id 67223 on 9/10 and again on 9/12. Shown below is the current state of the dimension and fact records. The fact record created on 9/20 still refers to surrogate key 1001, which is not the correct representation.

This means the claim record created on 9/20 needs to be reassigned to the surrogate key that is active for that time period. Shown below is the correct state of the dimension and fact records.

4.2. Late Arriving Dimension with retro effective changes

You can get dimension records from the source system with retro effective dates. For example, you might update your marital status in your HR system much later than your marriage date. This update comes to the data warehouse with a retro effective date.

This leads to a new dimension record with a new surrogate key and changes in the effective dates of the affected dimension records. You will have to scan forward in the dimension to see if there are any subsequent Type 2 rows for this dimension member. This in turn requires modifying the surrogate key of any subsequent fact records to point to the new surrogate key.

Let's again use the medical insurance claim example for our explanation.

Shown below is the state of the Patient Dimension and the Claim Fact table at this point, which is perfectly good.

Now we get Patient Dimension data from the source system, say on 10/1, which is effective from 9/15, as shown below.

This new dimension data, which comes with a retro effective date, puts the dimension records out of sync in terms of their effective start and end dates. In addition, the fact records now refer to incorrect dimension records.

So in addition to inserting a new dimension record with a new surrogate key, we will have to adjust the effective dates of the prior period dimension record and propagate the dimension column value changes to the remaining records. The fact table also needs to be updated to reassign the correct surrogate key.

Shown below in red are the corrections required to take care of the retro effective dimension records.
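
The date adjustments described above can be sketched as follows. This is a simplified illustration, not a complete SCD engine: the function name insert_retro_version, the row layout (sk, start, end, attrs), the year 2013 and the sample attribute values are all assumptions, end dates are treated as exclusive, and the changed columns are copied to every later version without checking whether a later version changed the same column on purpose.

```python
from datetime import date


def insert_retro_version(versions, new_sk, effective_from, changes):
    """Insert a version that starts in the past and repair neighbouring rows.

    versions: list of dicts sorted by start, each with sk, start, end
              (None means open-ended) and attrs.
    The version covering effective_from is shortened, and the new row
    takes over the remainder of its date range.
    """
    for i, v in enumerate(versions):
        covers = v["start"] <= effective_from and (
            v["end"] is None or effective_from < v["end"])
        if covers:
            new_row = {"sk": new_sk, "start": effective_from, "end": v["end"],
                       "attrs": {**v["attrs"], **changes}}
            v["end"] = effective_from           # close the prior-period row earlier
            versions.insert(i + 1, new_row)
            for later in versions[i + 2:]:      # propagate the change forward
                later["attrs"].update(changes)
            return new_row
    raise ValueError("no existing version covers the effective date")


# Hypothetical patient history before the retro-effective record arrives
versions = [
    {"sk": 1001, "start": date(2013, 9, 1), "end": date(2013, 9, 20),
     "attrs": {"status": "single", "plan": "A"}},
    {"sk": 1002, "start": date(2013, 9, 20), "end": None,
     "attrs": {"status": "single", "plan": "B"}},
]
insert_retro_version(versions, 1003, date(2013, 9, 15), {"status": "married"})
```

After the call, any fact dated from 9/15 up to 9/20 that still points at surrogate key 1001 would need to be re-pointed to 1003, which is the fact table update mentioned above.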

What is Late Arriving Facts

The late arriving fact scenario occurs when the transaction or fact data comes to the data warehouse much later than the actual transaction occurred in the source application. If the late arriving fact needs to be associated with an SCD Type 2 dimension, the situation becomes messy. This is because we have to search back in history within the dimension to decide how to assign the right dimension keys, the ones that were in effect when the activity occurred in the past.

Design Approaches

Unlike late arriving dimensions, late arriving fact records can be handled relatively easily. When loading the fact record, the history of the associated dimension table has to be searched to find the surrogate key that was effective at the time the transaction occurred. The data flow below describes the late arriving fact design approach.
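
The history search can be sketched in Python as a lookup by natural key and transaction date. The function name surrogate_key_as_of, the column names, the year 2013 and the sample history rows are assumptions for illustration only, and end dates are treated as exclusive.

```python
from datetime import date


def surrogate_key_as_of(history, natural_key, event_date):
    """Return the surrogate key whose effective range contains event_date.

    history: list of dimension version rows with patient_sk, patient_id,
             start (inclusive) and end (exclusive, None means current row).
    Returns None when no version matches.
    """
    for row in history:
        if (row["patient_id"] == natural_key
                and row["start"] <= event_date
                and (row["end"] is None or event_date < row["end"])):
            return row["patient_sk"]
    return None


# Hypothetical SCD Type 2 history for one patient
patient_history = [
    {"patient_sk": 1001, "patient_id": 67223,
     "start": date(2013, 9, 1), "end": date(2013, 9, 15)},
    {"patient_sk": 1002, "patient_id": 67223,
     "start": date(2013, 9, 15), "end": None},
]

# A claim that occurred on 9/10 but reaches the warehouse weeks later
late_claim = {"claim_id": 1, "patient_id": 67223, "claim_date": date(2013, 9, 10)}
late_claim["patient_sk"] = surrogate_key_as_of(
    patient_history, late_claim["patient_id"], late_claim["claim_date"])
```

The key point is that the lookup uses the transaction date, not the load date, so the claim is tied to the version that was in effect when it occurred rather than to the current row.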

Hope you enjoyed this article and that it gave you some new insights into late arriving dimension and fact scenarios in a data warehouse. Leave us your questions and comments. We would also like to hear how you have handled late arriving dimensions and facts in your data warehouse.

SOFT and HARD Deleted Records and Change Data Capture in Data Warehouse

2013-12-08T18:16:00.001-08:00

In a couple of our prior articles we spoke about change data capture, different techniques to capture change data, and a change data capture framework as well. In this article we will dive deep into different aspects of change data in a Data Warehouse, including soft and hard deletions in source systems.

Revisiting Change Data Capture (CDC)

When we talk about Change Data Capture (CDC) in a DW, we mean capturing the changes that have happened on the source side since the last time we ran our job. In Informatica we call our ETL code a ‘Mapping’, because we MAP the source data (OLTP) into the target data (DW). The purpose of running the ETL code is to keep the source and target data in sync, with some transformations in between as per the business rules.

Now, data may get changed at source in three different ways.

NEW transactions happened at source.
CORRECTIONS happened on old transactional values or measured values.
INVALID transactions removed from source.

Usually in our ETL we take care of the 1st and 2nd cases (insert/update logic); the 3rd change is not captured in the DW unless it is specifically called for in the requirement specification. When it is specifically required, we need to devise convenient ways to track the transactions that were removed, i.e. to track the deleted records at source and accordingly DELETE those records in the DW.

One thing to make clear is that purging might be enabled in your OLTP, i.e. the OLTP keeps data only for a fixed historical period of time, but that is a different scenario. Here we are more interested in what was DELETED at source because the transactions were NOT valid.

Effects in DW for Source Data Deletion

DW tables can be divided into three categories as related to the deleted source data.

When the DW table load nature is 'Truncate & Load' or 'Delete & Reload', there is no impact, since the requirement is to keep an exact snapshot of the source table at any point in time.
When the DW table does not track history on data changes and deletes are allowed against the source table, a record deleted in the source table is also deleted in the DW.
When the DW table tracks history on data changes and deletes are allowed against the source table, the DW table retains the record that has been deleted in the source system, but the record is either expired in the DW based on the change captured date or a 'Soft Delete' is applied to it.

Types of Data Deletion

Academically, deleting records from a DW table is forbidden; however, it is a common practice in most DWs when we face this kind of situation. Again, if we are deleting records from the DW, it has to be done after proper discussion with the business. If your business requires DELETION, then there are two ways.

Logical Delete :- In this case, we have a specific flag in the source table, such as STATUS, which holds the values ‘ACTIVE’ or ‘INACTIVE’. Some OLTPs keep the field name as ACTIVE with the values ‘I’, ‘U’ or ‘D’, where ‘D’ means that the record is deleted, i.e. INACTIVE. This approach is quite safe and is also known as Soft DELETE.

Physical Delete :- In this case the records related to invalid transactions are fully deleted from the source table by issuing a DML statement. This is usually done after thorough discussion with the business users, and the related business rules are strictly followed. This is also known as Hard DELETE.

ETL Perspective on Deletion

When we have ‘Soft DELETE’ implemented at the source side, it becomes very easy to track the invalid transactions and we can tag those transactions in the DW accordingly. We just need to filter the records from source using that STATUS field and issue an UPDATE in the DW for the corresponding records. A few things need to be kept in mind in this case.

If only ACTIVE records are supposed to be used in ETL processing, we need to add specific filters while fetching source data.

Sometimes INACTIVE records are pulled into the DW and moved up to the ETL Data Warehouse level. While pushing the data into the Exploration Data Warehouse, only the ACTIVE records are sent for reporting purposes.

For ‘Hard DELETE’, if an audit table is maintained in the source system recording which transactions were deleted, we can source it, i.e. join the audit table and the source table based on the NK and logically delete the records in the DW too. But it becomes quite cumbersome and costly when no account is kept of what was deleted at all. In these cases, we need to use different ways to track the deletions and update the corresponding records in the DW.

Deletion in Data Warehouse : Dimension Vs Fact

In most cases, we see only transactional records being deleted from source systems. DELETION of Data Warehouse records is a rare scenario.

Deletion in Dimension Tables

If we have DELETION enabled for Dimensions in DW, it's always safe to keep a copy of the OLD record in some AUDIT table, as it helps to track any defects in future. A simple DELETE trigger should work fine; since DELETION hardly happens, this trigger would not degrade the performance much.

Let's take this ORDERS table into consideration. Along with this, we can have a History table for ORDERS, e.g. ORDERS_Hist, which would store the DELETED records from ORDERS.

The below Trigger will work fine to achieve this.
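
As a stand-in illustration, a delete trigger of this kind can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions, not the article's own trigger: the ORDERS columns, the trigger name TRG_ORDERS_DELETE and the single DELETED_AT audit column are hypothetical, and SQLite has no notion of a database user, so the "deleted by" audit field is left out. An Oracle version would use Oracle's own trigger syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ORDERS (ORDER_ID INTEGER PRIMARY KEY, CUSTOMER TEXT, AMOUNT REAL);

CREATE TABLE ORDERS_Hist (
    ORDER_ID   INTEGER,
    CUSTOMER   TEXT,
    AMOUNT     REAL,
    DELETED_AT TEXT   -- audit field: when the row was deleted
);

-- Copy the old row into the history table before it disappears
CREATE TRIGGER TRG_ORDERS_DELETE BEFORE DELETE ON ORDERS
BEGIN
    INSERT INTO ORDERS_Hist (ORDER_ID, CUSTOMER, AMOUNT, DELETED_AT)
    VALUES (OLD.ORDER_ID, OLD.CUSTOMER, OLD.AMOUNT, datetime('now'));
END;

INSERT INTO ORDERS VALUES (1, 'Acme', 250.0), (2, 'Globex', 90.0);
DELETE FROM ORDERS WHERE ORDER_ID = 2;
""")

# The deleted row is gone from ORDERS but preserved in ORDERS_Hist
history = conn.execute(
    "SELECT ORDER_ID, CUSTOMER, AMOUNT FROM ORDERS_Hist").fetchall()
remaining = conn.execute("SELECT COUNT(*) FROM ORDERS").fetchone()[0]
```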

The AUDIT fields convey when a particular record was deleted and by which user. But such a table needs to be created for each and every DW table where we want to keep an audit of what was DELETED. If the entire record is not needed and the fields involved in the Natural Key(NK) are enough, we can have a consolidated table for all the Dimensions.

Here the Record_IDENTIFIER field contains the values of all the columns involved in the Natural Key(NK) of the table named in the OBJECT_NAME field, separated by '#'.

Sometimes, we face a situation in the DW where a FACT table record contains a Surrogate Key(SK) from a Dimension but the Dimension table doesn't own it anymore. In those cases, the FACT table record becomes an orphan and will hardly appear in any report, since we always use an INNER JOIN between Dimensions and Fact while retrieving data in the reporting layer, and the orphan lacks Referential Integrity(RI).

Suppose we want to track the orphan records in the SALES Fact table with respect to the Product Dimension. We can use a query like the one below.
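
One common way to write such an orphan check is a LEFT JOIN that keeps only fact rows with no matching dimension row. The sketch below runs it through Python's sqlite3 module so it can be tried as-is; it is an illustration rather than the article's own query, and the table layouts, column names and sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PRODUCT_Dimension (PRODUCT_SK INTEGER PRIMARY KEY, PRODUCT_NAME TEXT);
CREATE TABLE SALES_Fact (SALE_ID INTEGER PRIMARY KEY, PRODUCT_SK INTEGER, AMOUNT REAL);

INSERT INTO PRODUCT_Dimension VALUES (10, 'Widget');
-- Sale 2 points at surrogate key 11, which the dimension no longer owns
INSERT INTO SALES_Fact VALUES (1, 10, 5.0), (2, 11, 7.5);
""")

# Fact rows whose surrogate key finds no partner in the dimension
orphans = conn.execute("""
    SELECT f.SALE_ID, f.PRODUCT_SK
    FROM SALES_Fact f
    LEFT JOIN PRODUCT_Dimension d ON d.PRODUCT_SK = f.PRODUCT_SK
    WHERE d.PRODUCT_SK IS NULL
    ORDER BY f.SALE_ID
""").fetchall()
```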

So, the above query will provide only the orphan records, but it certainly cannot provide the records DELETED from the PRODUCT_Dimension. One feasible solution is to populate the EVENT table with the SKs from PRODUCT_Dimension that are being DELETED, provided we don't reuse our Surrogate Keys. When we have both the SKs and the NKs from the PRODUCT_Dimension in the EVENT table for DELETED entries, we can achieve better control over the Data Warehouse data.

Another useful but rarely used approach is enabling auditing of DELETE on a table in an Oracle DB, using a statement like the following.

Audit DELETE on SCHEMA.TABLE;

The table DBA_AUDIT_STATEMENT will contain the details related to this deletion, for example the user who issued the statement, the exact DML statement and so on, but it cannot provide you with the record that was deleted. Since this approach cannot directly tell you which record was deleted, it is not so useful in our current discussion, so I will not go further into the topic here.

Deletion in Fact Tables

Now, this was all about DELETION in DW Dimension tables. Regarding FACT data DELETION, I would like to cite an extract of what Ralph Kimball has to say on Physical Deletion of Facts from DW.

Change Data Capture & Apply for 'Hard DELETE' in Source

Again, whether we should track the DELETED records from source or not depends on the type of table and its load nature. I will share a few genuine scenarios that are commonly faced in a DW and discuss the solutions accordingly.

1. Records are DELETED from SOURCE for a known Time Period, no Audit Trail was kept.

In this case, the ideal solution is to DELETE the entire record set in the DW target table for that time period and pull the source records once again for the same period. This will bring the DW in sync with the source, and the DELETED records will no longer be available in the DW.

Usually the time period is expressed in terms of Ship_DATE, Invoice_DATE or Event_DATE, i.e. a DATE type field from the actual dataset of the source table. Hence, the same WHERE clause filter we use to extract the records from the source table can be applied to the DW table as well.

Obviously, in this case we are NOT able to capture the 'Hard DELETE' from the source, i.e. we cannot track the history of the data, but we are at least able to bring the source and the DW in sync. Again, this approach is recommended only when the situation occurs once in a while and not on a regular basis.

2. Records are DELETED from SOURCE on regular basis with NO Timeframe, no Audit Trail was kept.

A possible solution in this case is to implement a FULL Outer JOIN between the source and the target table. The tables should be joined on the fields involved in the Natural Key(NK). This approach helps us track all three kinds of changes to the source data in one shot.

The logic can be better explained with the help of a Venn diagram.

Out of the Joiner (kept in FULL Outer Join mode),

Records that have values for the NK fields only from the Source and not from the Target go to the INSERT flow. These are all new records coming from the source.
Records that have values for the NK fields from both the Source and the Target go to the UPDATE flow. These are records that already exist on both sides.
Records that have values for the NK fields only from the Target go to the DELETE flow. These are the records that were DELETED from the Source table.
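
The three flows coming out of the full outer join can be sketched with plain Python sets. The function name classify_changes and the dictionaries keyed by natural key are assumptions for illustration; in the mapping itself this routing would be done with a Joiner and a Router.

```python
def classify_changes(source, target):
    """Split natural keys into insert, update and delete flows.

    source, target: dicts keyed by natural key; values are the row data.
    """
    source_keys, target_keys = set(source), set(target)
    return {
        "insert": sorted(source_keys - target_keys),  # only in source: new rows
        "update": sorted(source_keys & target_keys),  # on both sides: existing rows
        "delete": sorted(target_keys - source_keys),  # only in target: deleted at source
    }


# Hypothetical snapshot: order 3 was removed from the source, order 1 is new
source_rows = {1: {"amount": 10}, 2: {"amount": 25}}
target_rows = {2: {"amount": 20}, 3: {"amount": 40}}
flows = classify_changes(source_rows, target_rows)
```

Note that both full key sets have to be held at once, which is the performance cost of this approach on large tables.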

Now, what we do with those DELETED records from Source, i.e. apply 'Soft DELETE' or 'Hard DELETE' in DW, depends on our requirement specification and business scenarios.

But this approach has a severe disadvantage in terms of ETL performance. Whenever we go for a FULL Outer JOIN between source and target, we use the entire data set from both ends, and this will obviously slow down ETL processing as the data volume increases.

3. Records are DELETED from SOURCE, Audit Trail was kept.

Even though I'm calling it a DELETION, it's NOT the kind of physical DELETION that we discussed previously. This is mainly related to incorrect transactions in legacy systems, e.g. mainframes, which usually send data in flat files.

When some old transactions become invalid, the source team sends the records for those transactions to the DW again, but with inverted measures, i.e. the sales figures are the same as the old ones but negative. So the DW contains both the old set of records and the newly arrived records, but the measures net to zero in the aggregated FACT table, thus cancelling the impact of those invalid transactions in the DW.

The only disadvantage of this approach is that, while the aggregated FACT contains the correct data at the summarized level, the transactional FACT holds a dual set of records, which together represent the real scenario, i.e. at first the transaction happened (with the older record) and then it became invalid (with the newer record).
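
The effect of such reversal records on an aggregate can be shown with a short Python sketch. The function name aggregate_sales, the invoice numbers and the amounts are made up for illustration.

```python
from collections import defaultdict


def aggregate_sales(transactions):
    """Sum the amount per invoice, as an aggregated fact table would."""
    totals = defaultdict(int)
    for t in transactions:
        totals[t["invoice"]] += t["amount"]
    return dict(totals)


# Transactional fact: invoice A1 was loaded, then reversed with a negative amount
transactional_fact = [
    {"invoice": "A1", "amount": 500},
    {"invoice": "A2", "amount": 120},
    {"invoice": "A1", "amount": -500},   # reversal record sent by the source
]
aggregated_fact = aggregate_sales(transactional_fact)
```

The transactional list still carries both A1 rows, while the aggregate shows A1 with a total of zero.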

Hope you enjoyed this article and that it gave you some new insights into change data capture in a Data Warehouse. Leave us your questions and comments. We would like to hear how you have handled change data capture in your data warehouse.

Informatica Performance Tuning Guide, Performance Enhancements - Part 4

2013-11-30T22:25:00.002-08:00

In our performance tuning article series, so far we have covered performance tuning basics, identification of bottlenecks and resolving different bottlenecks. In this article we will cover the performance enhancement features available in Informatica PowerCenter. In addition to the features provided by PowerCenter, we will go over design tips and tricks for improving ETL load performance.

Performance Enhancements Features

Performance Tuning Tutorial Series
Part I : Performance Tuning Introduction.
Part II : Identify Performance Bottlenecks.
Part III : Remove Performance Bottlenecks.
Part IV : Performance Enhancements.

The main PowerCenter features for performance enhancement are:

Pushdown Optimization.
Pipeline Partitions.
Dynamic Partitions.
Concurrent Workflows.
Grid Deployments.
Workflow Load Balancing.
Other Performance Tips and Tricks.

1. Pushdown Optimization

The Pushdown Optimization Option enables data transformation processing to be pushed down into a relational database to make the best use of database processing power. It converts the transformation logic into SQL statements, which execute directly on the database. This minimizes the need to move data between servers and utilizes the power of the database engine.

Read More about Pushdown Optimization.

2. Session Partitioning

The Informatica PowerCenter Partitioning Option increases the performance of PowerCenter through parallel data processing. The Partitioning Option lets you split a large data set into smaller subsets, which can be processed in parallel to get better session performance.

Read More about Session Partitioning.

3. Dynamic Session Partitioning

Informatica PowerCenter session partitioning can be used to process data in parallel and achieve faster data delivery. Using the Dynamic Session Partitioning capability, PowerCenter can dynamically decide the degree of parallelism. The Integration Service scales the number of session partitions at run time based on factors such as source database partitions or the number of CPUs on the node, resulting in significant performance improvement.

Read More about Dynamic Session Partition.

4. Concurrent Workflows

A concurrent workflow is a workflow that can run as multiple instances concurrently. A workflow instance is a representation of a workflow. We can configure two types of concurrent workflows: workflows that run concurrently with the same instance name, or workflows that run concurrently as unique instances.

Read More about Concurrent Workflows.

5. Grid Deployments

When a PowerCenter domain contains multiple nodes, you can configure workflows and sessions to run on a grid. When you run a workflow on a grid, the Integration Service runs a service process on each available node of the grid to increase performance and scalability. When you run a session on a grid, the Integration Service distributes session threads to multiple DTM processes on nodes in the grid to increase performance and scalability.

Read More about Grid Deployments.

6. Workflow Load Balancing

Informatica Load Balancing is a mechanism which distributes the workloads across the nodes in the grid. When you run a workflow, the Load Balancer dispatches different tasks in the workflow such as Session, Command, and predefined Event-Wait tasks to different nodes running the Integration Service. Load Balancer matches task requirements with resource availability to identify the best node to run a task. It may dispatch tasks to a single node or across nodes on the grid.

Read More about Workflow Load Balancing.

7. Other Performance Tips and Tricks

Throughout this blog we have been discussing different tips and tricks to improve your ETL load performance. We list those tips and tricks here for your reference.

Read More about Other Performance Tips and Tricks.

Hope you enjoyed these tips and tricks and that they are helpful for your project needs. Leave us your questions and comments. We would like to hear any other performance tips you might have used in your projects.

Surrogate Key Generation Approaches Using Informatica PowerCenter

2013-11-21T22:14:00.004-08:00

A Surrogate Key is a sequentially generated unique number attached to each and every record in a Dimension table in a Data Warehouse. We discussed Surrogate Keys in detail in our previous article. In this article we will concentrate on different approaches to generating Surrogate Keys for different types of ETL processes.

Surrogate Key for Dimensions Loading in Parallel

When you have a single dimension table loading in parallel from different application data sources, special care should be taken to make sure that no keys are duplicated. Let's see the different design options here.

1. Using Sequence Generator Transformation

This is the simplest and most preferred way to generate a Surrogate Key(SK). We create a reusable Sequence Generator transformation in the mapping and map the NEXTVAL port to the SK field in the target table in the INSERT flow of the mapping. The start value is usually kept at 1 and incremented by 1.

Below shown is a reusable Sequence Generator transformation.

NEXTVAL port from the Sequence Generator can be mapped to the surrogate key in the target table. Below shown is the sequence generator transformation.

Note : Make sure to create a reusable transformation, so that the same transformation can be reused in multiple mappings that load the same dimension table.

2. Using Database Sequence

We can create a SEQUENCE in the database and use it to generate the SKs for any table. It can be invoked by a SQL Transformation or by using a Stored Procedure Transformation.

First we create a SEQUENCE using the following command.

CREATE SEQUENCE DW.Customer_SK
MINVALUE 1
MAXVALUE 99999999
START WITH 1
INCREMENT BY 1;

Using SQL Transformation

You can create a reusable SQL Transformation as shown below. It takes the name of the database sequence and the schema name as input and returns SK numbers.

The schema name (DW) and sequence name (Customer_SK) can be passed in as input values to the transformation, and the output can be mapped to the target SK column. Shown below is the SQL Transformation.

Using Stored Procedure Transformation

We use the SEQUENCE DW.Customer_SK to generate the SKs in an Oracle function, which is in turn called via a Stored Procedure transformation.

Create a database function as below. Here we are creating an Oracle function.

CREATE OR REPLACE FUNCTION DW.Customer_SK_Func
RETURN NUMBER
IS
    Out_SK NUMBER;
BEGIN
    SELECT DW.Customer_SK.NEXTVAL INTO Out_SK FROM DUAL;
    RETURN Out_SK;
EXCEPTION
    WHEN OTHERS THEN
        raise_application_error(-20001, 'An error was encountered - ' || SQLCODE || ' -ERROR- ' || SQLERRM);
END;

You can import the database function as a Stored Procedure transformation, as shown in the image below.

Now, just before the target instance for Insert flow, we add an Expression transformation. We add an output port there with the following formula. This output port GET_SK can be connected to the target surrogate key column.

GET_SK = :SP.CUSTOMER_SK_FUNC()

Note : The database function can be parametrized and the stored procedure can also be made reusable to make this approach more effective.

Surrogate Key for Non Parallel Loading Dimensions

If the dimension table is not loaded in parallel from different application data sources, we have a couple more options to generate SKs. Let's see the different design options here.

Using Dynamic LookUP

When we implement Dynamic LookUP in any mapping, we may not even need to use the Sequence Generator for generating the SK values.

For a Dynamic LookUP on Target, we have the option of associating any LookUP port with an input port, output port, or Sequence-ID. When we associate a Sequence-ID, the Integration Service generates a unique integer value for each row inserted into the lookup cache, but this is applicable only to ports with the Bigint, Integer or Small Integer data type. Since the SK is usually of Integer type, we can exploit this feature.

The Integration Service uses the following process to generate Sequence IDs.

When the Integration Service creates the dynamic lookup cache, it tracks the range of values for each port that has a sequence ID in the dynamic lookup cache.
When the Integration Service inserts a row of data into the cache, it generates a key for a port by incrementing the greatest sequence ID value by one.
When the Integration Service reaches the maximum number for a generated sequence ID, it starts over at one. The Integration Service increments each sequence ID by one until it reaches the smallest existing value minus one. If the Integration Service runs out of unique sequence ID numbers, the session fails.

Shown above is a dynamic lookup configuration that generates the SK for CUST_SK.

The Integration Service generates a Sequence-ID for each row it inserts into the cache. For any record that is already present in the target, it gets the SK value from the target Dynamic LookUP cache, based on the associated ports matching. So, if we connect this port to the target SK field, there is no need to generate SK values separately, since the new SK value (for records to be inserted) or the existing SK value (for records to be updated) is supplied by the Dynamic LookUP.

The disadvantage of this technique is that we do not have a separate SK generating area, and the source of the SK is totally embedded in the code.

Using Expression Transformation

Suppose we are populating CUSTOMER_DIM. In the mapping, first create an Unconnected Lookup on the dimension table, say LKP_CUSTOMER_DIM. Its purpose is to get the maximum SK value in the dimension table. Say the SK column is CUSTOMER_KEY and the NK column is CUSTOMER_ID.

Select CUSTOMER_KEY as the Return Port and set the Lookup Condition as

CUSTOMER_ID = IN_CUSTOMER_ID

Use the SQL Override as below:

SELECT MAX (CUSTOMER_KEY) AS CUSTOMER_KEY, '1' AS CUSTOMER_ID FROM CUSTOMER_DIM

Next, use an Expression transformation after the SQ in the mapping. This is where we generate the SKs for the dimension, based on the previously generated value. We create the following ports in the EXP to compute the SK value.

VAR_COUNTER = IIF(ISNULL( VAR_INC ), NVL(:LKP.LKP_CUSTOMER_DIM('1'), 0) + 1, VAR_INC + 1 )

VAR_INC = VAR_COUNTER

OUT_COUNTER = VAR_COUNTER

When the mapping starts, for the first row we look up the dimension table to fetch the maximum available SK in the table. After that, we keep incrementing the SK value stored in the variable port by 1 for each incoming row. OUT_COUNTER gives the SKs to be populated in CUSTOMER_KEY.
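To make the row-by-row behaviour easier to follow, here is a minimal Python sketch of the same idea. It is an illustration only, not PowerCenter code: the function names and the sample rows are made up, and lookup_max_sk stands in for the unconnected lookup.

```python
def assign_surrogate_keys(rows, lookup_max_sk):
    """Yield each row with a CUSTOMER_KEY continuing from the current maximum."""
    var_inc = None                      # plays the role of the VAR_INC port
    for row in rows:
        if var_inc is None:
            # First row only: fetch MAX(CUSTOMER_KEY); an empty table counts as 0.
            max_sk = lookup_max_sk()
            var_counter = (max_sk if max_sk is not None else 0) + 1
        else:
            var_counter = var_inc + 1
        var_inc = var_counter
        yield {**row, "CUSTOMER_KEY": var_counter}   # OUT_COUNTER


lookup_calls = []

def fake_lookup():
    # Hypothetical stand-in for LKP_CUSTOMER_DIM; records how often it is called.
    lookup_calls.append(1)
    return 100

source_rows = [{"CUSTOMER_ID": "C1"}, {"CUSTOMER_ID": "C2"}, {"CUSTOMER_ID": "C3"}]
keyed_rows = list(assign_surrogate_keys(source_rows, fake_lookup))
```

In this sketch the lookup is called once, for the first row, and the three rows receive the keys 101, 102 and 103.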

Using Mapping & Workflow Variable

Here again we will use the Expression transformation to compute the next SK, but will get the MAX available SK in a different way.

Suppose we have a session s_New_Customer, which loads the Customer Dimension table. Before that session in the workflow, we add a dummy session, s_Dummy.
In s_Dummy, we will have a mapping variable, e.g. $$MAX_CUST_SK, which will be set to the value of MAX (SK) in the Customer Dimension table.

SELECT MAX (CUSTOMER_KEY) AS CUSTOMER_KEY FROM CUSTOMER_DIM

We will have CUSTOMER_DIM as our source table, and the target can be a simple flat file that is not used anywhere. We pull this MAX (SK) from the SQ and then, in an EXP, assign the value to the mapping variable using the SETVARIABLE function. So, we will have the following ports in the EXP:

INP_CUSTOMER_KEY = INP_CUSTOMER_KEY -- The MAX of SK coming from the Customer Dimension table.
OUT_MAX_SK = SETVARIABLE ($$MAX_CUST_SK, INP_CUSTOMER_KEY) -- Output Port

This output port will be connected to the flat file port, while the value we assigned to the variable persists in the repository.

In our second mapping we start generating the SK from the value $$MAX_CUST_SK + 1. But how can we pass the variable value from one session to the other?

This is where the Workflow Variable comes into the picture. We define a WF variable, $$MAX_SK, and in the Post-session on success variable assignment section of s_Dummy we assign the value of $$MAX_CUST_SK to $$MAX_SK. Now the variable $$MAX_SK contains the maximum available SK value from the CUSTOMER_DIM table. Next we define another mapping variable, $$START_VALUE, in the session s_New_Customer, and assign it the value of $$MAX_SK in the Pre-session variable assignment section of s_New_Customer.

So, the sequence is:

Post-session on success variable assignment of First Session:

$$MAX_SK = $$MAX_CUST_SK

Pre-session variable assignment of Second Session:

$$START_VALUE = $$MAX_SK

Now in the actual mapping, we add an EXP with the following ports to compute the SKs one by one for each record being loaded into the target.

VAR_COUNTER = IIF (ISNULL (VAR_INC), $$START_VALUE + 1, VAR_INC + 1)

VAR_INC = VAR_COUNTER
OUT_COUNTER = VAR_COUNTER

OUT_COUNTER will be connected to the SK port of the target.
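The hand-off of the value between the two sessions can be pictured with a small Python sketch. It only models the flow described above: the function names, the sample rows and the choice to treat an empty dimension table as 0 are assumptions made for illustration.

```python
def run_s_dummy(customer_dim):
    """First session: capture MAX(CUSTOMER_KEY) in the mapping variable."""
    keys = [row["CUSTOMER_KEY"] for row in customer_dim]
    return {"$$MAX_CUST_SK": max(keys, default=0)}   # assumption: empty table -> 0


def run_s_new_customer(mapping_vars, new_rows):
    """Second session: number new rows starting at $$START_VALUE + 1."""
    start = mapping_vars["$$START_VALUE"]
    return [{**row, "CUSTOMER_KEY": start + offset}
            for offset, row in enumerate(new_rows, start=1)]


customer_dim = [{"CUSTOMER_ID": "C1", "CUSTOMER_KEY": 41},
                {"CUSTOMER_ID": "C2", "CUSTOMER_KEY": 42}]
workflow_vars = {}

dummy_vars = run_s_dummy(customer_dim)
# Post-session on success variable assignment of the first session.
workflow_vars["$$MAX_SK"] = dummy_vars["$$MAX_CUST_SK"]
# Pre-session variable assignment of the second session.
new_customer_vars = {"$$START_VALUE": workflow_vars["$$MAX_SK"]}

loaded = run_s_new_customer(new_customer_vars,
                            [{"CUSTOMER_ID": "C3"}, {"CUSTOMER_ID": "C4"}])
```

Here the two new customers receive the keys 43 and 44, continuing from the maximum key 42 found by the first session.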

Hope you enjoyed this article and learned some new ways to generate surrogate keys for your dimension tables. Please leave us a comment or feedback if you have any; we are happy to hear from you.

Surrogate Key in Data Warehouse, What, When and Why

2013-11-13T20:02:00.000-08:00

Surrogate keys are a widely used and accepted design standard in data warehouses. A surrogate key is a sequentially generated unique number attached to each and every record in a Dimension table in a Data Warehouse. It joins the fact and dimension tables and is necessary to handle changes in dimension table attributes.

What Is Surrogate Key

A Surrogate Key (SK) is a sequentially generated, meaningless, unique number attached to each and every record in a table in a Data Warehouse (DW).

It is UNIQUE since it is a sequentially generated integer for each record inserted into the table.
It is MEANINGLESS since it does not carry any business meaning regarding the record it is attached to in any table.
It is SEQUENTIAL since it is assigned in sequential order as and when new records are created in the table, starting with one and going up to the highest number that is needed.

Surrogate Key Pipeline and Fact Table

During the FACT table load, the different dimensional attributes are looked up in the corresponding Dimensions and the SKs are fetched from there. These SKs should be fetched from the most recent versions of the dimension records. Finally, the FACT table in the DW contains the factual data along with the corresponding SKs from the Dimension tables.
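As a rough illustration of this lookup step, the Python sketch below resolves each natural key to the SK of the most recent dimension version before building the fact rows. The column names EFF_DATE and SALES_AMT and all sample values are hypothetical and used only for the example.

```python
def build_sk_lookup(dimension_rows):
    """Map each natural key to the SK of its most recent dimension version."""
    latest = {}
    for row in dimension_rows:
        current = latest.get(row["CUSTOMER_ID"])
        # ISO formatted dates compare correctly as plain strings.
        if current is None or row["EFF_DATE"] > current["EFF_DATE"]:
            latest[row["CUSTOMER_ID"]] = row
    return {nk: row["CUSTOMER_KEY"] for nk, row in latest.items()}


def load_fact(source_rows, sk_lookup):
    """Replace the natural key on each source row with the dimension SK."""
    return [{"CUSTOMER_KEY": sk_lookup[row["CUSTOMER_ID"]],
             "SALES_AMT": row["SALES_AMT"]}
            for row in source_rows]


customer_dim = [
    {"CUSTOMER_KEY": 1, "CUSTOMER_ID": "C100", "EFF_DATE": "2013-01-01"},
    {"CUSTOMER_KEY": 7, "CUSTOMER_ID": "C100", "EFF_DATE": "2013-09-15"},
    {"CUSTOMER_KEY": 2, "CUSTOMER_ID": "C200", "EFF_DATE": "2013-01-01"},
]
sales = [{"CUSTOMER_ID": "C100", "SALES_AMT": 250.0},
         {"CUSTOMER_ID": "C200", "SALES_AMT": 75.5}]

fact_rows = load_fact(sales, build_sk_lookup(customer_dim))
```

Customer C100 has two versions in this made-up dimension, so its fact row picks up key 7, the SK of the later version.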

The below diagram shows how the FACT table is loaded from the source.

Why Should We Use Surrogate Key

Basically, an SK is an artificial key that is used as a substitute for a Natural Key (NK). We should have an NK defined in our tables as per the business requirement, and that might be able to uniquely identify any record. The SK, however, is just an integer attached to a record for the purpose of joining different tables in a Star or Snowflake schema based DW. An SK is much needed when we have a very long NK or when the datatype of the NK is not suitable for indexing.

The below image shows a typical Star Schema, joining different Dimensions with the Fact using SKs.

Ralph Kimball places more emphasis on the abstraction of the NK. According to him, Surrogate Keys should NOT be:

Smart, where you can tell something about the record just by looking at the key.
Composed of natural keys glued together.
Implemented as multiple parallel joins between the dimension table and the fact table; so-called double or triple barreled joins.

As per Thomas Kejser, a “good key” is a column that has the following properties:

It is forced to be unique
It is small
It is an integer
Once assigned to a row, it never changes
Even if deleted, it will never be re-used to refer to a new row
It is a single column
It is stupid
It is not intended as being remembered by users

If the above mentioned features are taken into account, an SK would be a great candidate for a Good Key in a DW.

Apart from these, a few more reasons for choosing the SK approach are:

If we replace the NK with a single integer, we should be able to save a substantial amount of storage space. The SKs of different Dimensions are stored as Foreign Keys (FK) in the Fact tables to maintain Referential Integrity (RI), and storing concise SKs there instead of big NKs requires less space. The UNIQUE index built on the SK will also take less space than a UNIQUE index built on the NK, which may be alphanumeric.
Replacing big, ugly NKs and composite keys with tight integer SKs is bound to improve join performance, since joining two integer columns works faster. So, it provides an extra edge in ETL performance by speeding up data retrieval and lookup.
A four-byte integer key can represent more than 2 billion different values, which would be enough for any dimension. The SK would not run out of values, not even for a Big or Monster Dimension.
The SK is usually independent of the data contained in the record, so we cannot understand anything about the data in a record simply by looking at the SK. Hence it provides Data Abstraction.

So, apart from the abstraction of critical business data involved in the NK, implementing the SK in our DW also gives us the advantage of reduced storage space. It has become standard practice to associate an SK with a table in the DW, irrespective of whether it is a Dimension, Fact, Bridge or Aggregate table.

Why Shouldn’t We Use Surrogate Key

There are a number of disadvantages as well when working with SKs. Let's look at them one by one:

The values of SKs have no relationship with the real world meaning of the data held in a row. Therefore overuse of SKs leads to the problem of disassociation.
The generation and attachment of the SK creates unnecessary ETL burden. Sometimes the actual piece of code is short and simple, but generating the SK and carrying it forward to the target adds extra overhead to the code.
During Horizontal Data Integration (DI), where multiple source systems load data into a single Dimension, we have to maintain a single SK generating area to enforce the uniqueness of the SK. This may come as an extra overhead on the ETL.
Query optimization can also become difficult. Since the SK takes the place of the PK, the unique index is applied on that column. Any query based on the NK then leads to a Full Table Scan (FTS), as that query cannot take advantage of the unique index on the SK.
Replication of data from one environment to another, i.e. Data Migration, becomes difficult. Since SKs from different Dimension tables are used as FKs in the Fact table and SKs are DW specific, any mismatch in the SK for a particular Dimension would result in no data or erroneous data when we join them in a Star Schema.
If duplicate records come from the source, there is a potential risk of duplicates being loaded into the target, since the Unique Constraint is defined on the SK and not on the NK.
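The last point can be seen in a small Python sketch. The de-duplication step shown here is one possible safeguard and not something this article prescribes; the function name, the columns and the sample rows are made up for the example.

```python
def deduplicate_by_natural_key(rows, nk_column="CUSTOMER_ID"):
    """Keep only the first row seen for each natural key value."""
    seen = set()
    unique_rows = []
    for row in rows:
        if row[nk_column] not in seen:
            seen.add(row[nk_column])
            unique_rows.append(row)
    return unique_rows


incoming = [{"CUSTOMER_ID": "C1", "NAME": "Ann"},
            {"CUSTOMER_ID": "C1", "NAME": "Ann"},
            {"CUSTOMER_ID": "C2", "NAME": "Bob"}]

# Without an NK check, both C1 rows get distinct SKs, so a unique
# constraint on the SK alone would not reject the duplicate.
naive_load = [{**row, "CUSTOMER_KEY": sk}
              for sk, row in enumerate(incoming, start=1)]

# With the NK check, each customer is loaded once.
safe_load = [{**row, "CUSTOMER_KEY": sk}
             for sk, row in enumerate(deduplicate_by_natural_key(incoming), start=1)]
```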

An SK should not be implemented just in the name of standardizing your code. An SK is required when we cannot use an NK to uniquely identify a record, or when using an SK seems more suitable because the NK is not a good fit for the PK.

Reference : Ralph Kimball, Thomas Kejser

Informatica PowerCenter Load Balancing for Workload Distribution on Grid

2013-11-08T23:06:00.000-08:00

When Informatica PowerCenter workflows run on a grid, the Integration Service distributes workflow tasks across the nodes in the grid. It also distributes Session, Command, and predefined Event-Wait tasks within workflows across the nodes in the grid. PowerCenter uses the Load Balancer to distribute workflows and session tasks to different nodes. This article describes how to use the Load Balancer to set up workflow priorities and how to allocate resources.

What is Informatica Load Balancing

Performance Improvement Features

Pushdown Optimization
Pipeline Partitions
Dynamic Partitions
Concurrent Workflows
Grid Deployments
Workflow Load Balancing

Informatica Load Balancing is a mechanism that distributes workloads across the nodes in the grid. When you run a workflow, the Load Balancer dispatches the different tasks in the workflow, such as Session, Command, and predefined Event-Wait tasks, to different nodes running the Integration Service. The Load Balancer matches task requirements with resource availability to identify the best node to run a task. It may dispatch tasks to a single node or across nodes on the grid.

Identifying the Nodes to Run a Task

The Load Balancer matches the resources required by the task with the resources available on each node. It dispatches tasks in the order it receives them. You can adjust the workflow priorities and assign resource requirements to tasks, so that the Load Balancer can dispatch the tasks to the right nodes with the right priority.

Assign service levels : You assign service levels to workflows. Service levels establish priority among workflow tasks that are waiting to be dispatched.

Assign resources : You assign resources to tasks. Session, Command, and predefined Event-Wait tasks require PowerCenter resources to succeed. If the Integration Service is configured to check resources, the Load Balancer dispatches these tasks to nodes where the resources are available.

Assigning Service Levels to Workflows

Service levels determine the order in which the Load Balancer dispatches tasks from the dispatch queue. When multiple tasks are waiting to be dispatched, the Load Balancer dispatches high priority tasks before low priority tasks. You create service levels and configure the dispatch priorities in the Administrator tool.

You give a higher service level to the workflows that need to be dispatched first when multiple workflows are running in parallel. Service levels are set up in the Admin console.
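The dispatch order can be pictured as a priority queue. The Python sketch below is only a conceptual model, not how the Load Balancer is implemented: the class name, the task names and the convention that a smaller priority number is dispatched first are all assumptions made for the example.

```python
import heapq
import itertools


class DispatchQueue:
    """Toy dispatch queue: priority first, then order of arrival."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()   # tie-breaker: order received

    def submit(self, task_name, dispatch_priority):
        # Assumption for this sketch: a smaller number means a higher priority.
        heapq.heappush(self._heap,
                       (dispatch_priority, next(self._arrival), task_name))

    def dispatch(self):
        _, _, task_name = heapq.heappop(self._heap)
        return task_name


queue = DispatchQueue()
queue.submit("s_Load_Audit", dispatch_priority=5)        # hypothetical low priority task
queue.submit("s_Load_CUST_DIM", dispatch_priority=1)     # hypothetical high priority task
queue.submit("s_Load_Sales_Fact", dispatch_priority=1)
dispatch_order = [queue.dispatch() for _ in range(3)]
```

In this model the two high priority sessions leave the queue first, in the order they were received, and the low priority session is dispatched last.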

You assign service levels to workflows on the General tab of the workflow properties as shown below.

Assigning Resources to Tasks

If the Integration Service runs on a grid and is configured to check for available resources, the Load Balancer uses resources to dispatch tasks. The Integration Service matches the resources required by tasks in a workflow with the resources available on each node in the grid to determine which nodes can run the tasks.

You can configure the resources required by a task as shown in the image below.

The configuration below shows that the source qualifier needs a source file from the file directory NDMSource, which is accessible from only one node. The resources available on the different nodes are configured from the Admin console.
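Conceptually, the resource check is a subset test between what a task requires and what each node offers. The Python sketch below illustrates that idea only. Which node actually provides NDMSource is an assumption here, and the function name is made up.

```python
def eligible_nodes(required_resources, node_resources):
    """Return the nodes that provide every resource a task requires."""
    required = set(required_resources)
    return sorted(node for node, available in node_resources.items()
                  if required <= set(available))


# Assumption: only Node_1 can reach the NDMSource file directory.
node_resources = {
    "Node_1": {"NDMSource"},
    "Node_2": set(),
}

file_task_nodes = eligible_nodes({"NDMSource"}, node_resources)
other_task_nodes = eligible_nodes(set(), node_resources)
```

A task that needs NDMSource can run only on Node_1 in this setup, while a task with no resource requirement can be dispatched to either node.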

Hope you enjoyed this article and that it helps you prioritize your workflows to meet your data refresh timelines. Please leave us a comment or feedback if you have any; we are happy to hear from you.

Informatica PowerCenter on Grid for Greater Performance and Scalability

2013-10-31T07:53:00.000-07:00

Informatica has developed a solution that leverages the power of grid computing for greater data integration scalability and performance. The grid option delivers load balancing, dynamic partitioning, parallel processing and high availability to ensure optimal scalability, performance and reliability. In this article, let's discuss how to set up an Informatica workflow to run on a grid.

What is PowerCenter On Grid

Domain : A PowerCenter domain consists of one or more nodes in the grid environment. PowerCenter services run on the nodes. A domain is the foundation for PowerCenter service administration.

Node : A node is a logical representation of a physical machine that runs a PowerCenter service.

Admin Console with Grid Configuration

Shown below is an Informatica Admin Console with a two node grid configuration. We can see the two nodes, Node_1 and Node_2, and the grid Node_GRID created using the two nodes. The Integration Service Int_service_GRID is running on the grid.

Setting up Workflow on Grid

When you set up a workflow to run on a grid, the Integration Service distributes workflows across the nodes in the grid. It also distributes the Session, Command, and predefined Event-Wait tasks within workflows across the nodes in the grid.

You can set up the workflow to run on a grid as shown in the image below. To run the workflow on the grid, assign it the Integration Service that is configured on the grid.

Setting up Session on Grid

When you run a session on a grid, the Integration Service distributes session threads across nodes in a grid. The Load Balancer distributes session threads to DTM processes running on different nodes. You might want to configure a session to run on a grid when the workflow contains a session that takes a long time to run.

You can set up the session to run on a grid as shown in the image below.

Workflow Running on Grid

The Workflow Monitor screenshot below shows a workflow running on a grid. In the 'Task Progress Details' window you can see that two of the sessions in the workflow wf_Load_CUST_DIM run on Node_1 and the other one on Node_2.

Key Features and Advantages of Grid

Load Balancing : When facing spikes in data processing, load balancing keeps operations smooth by switching the data processing between nodes on the grid. The node is chosen dynamically based on process size, CPU utilization, memory requirements, etc.
High Availability : Grid complements the High Availability feature of PowerCenter by switching the master node in case of a node failure. This keeps monitoring available and shortens the time needed for recovery processes.
Dynamic Partitioning : Dynamic Partitioning helps make the best use of the currently available nodes on the grid. By adapting to the available resources, it also helps increase the performance of the whole ETL process.

Hope you enjoyed this article, please leave us a comment or feedback if you have any, we are happy to hear from you.

Time Zones Conversion and Standardization Using Informatica PowerCenter

2013-10-23T22:49:00.000-07:00

When your data warehouse sources data from data sources in multiple time zones, it is recommended to capture a universal standard time as well as the local times. The same goes for transactions involving multiple currencies. This design enables analysis on the local time along with the universal standard time. The time standardization is done as part of the ETL that loads the warehouse. In this article, let's discuss the implementation using Informatica PowerCenter.

We will concentrate only on the ETL part of time zone conversion and standardization, but not the data modeling part. You can learn more about the dimensional modeling aspect from Ralph Kimball.

Business Use Case

Let's consider an ETL job that integrates sales data from different global sales regions into the enterprise data warehouse. Sales transactions happen in different time zones and come from different sales applications. The local sales applications capture sales in local time. Data in the warehouse needs to be standardized, and sales transactions need to be captured in local as well as GMT time.

Solution : Create a reusable expression to convert the local time into GMT time. This reusable transformation can be used in any mapping or ETL process that needs time zone conversion.

Building the Reusable Expression

You can create the reusable transformation in the Transformation Developer.

In the expression transformation, create the ports below with the corresponding expressions. Be sure to create the ports in the same order and with the same data type and precision.

LOC_TIME_WITH_TZ : STRING(36) (Input)

DATE_TIME : DATE/TIME (Variable)

TZ_DIFF : INTEGER (Variable)

TZ_DIFF_HR : INTEGER (Variable)

TZ_DIFF_MI : INTEGER (Variable)

GMT_TIME_HH : DATE/TIME (Variable)

GMT_TIME_MI : DATE/TIME (Variable)

GMT_TIME_WITH_TZ : STRING(36) (Output)

Now create the expressions below for the ports.

DATE_TIME : TO_DATE(SUBSTR(LOC_TIME_WITH_TZ,0,29),'DD-MON-YY HH:MI:SS.US AM')

TZ_DIFF : IIF(SUBSTR(LOC_TIME_WITH_TZ,30,1)='+',-1,1)

TZ_DIFF_HR : TO_DECIMAL(SUBSTR(LOC_TIME_WITH_TZ,31,2))

TZ_DIFF_MI : TO_DECIMAL(SUBSTR(LOC_TIME_WITH_TZ,34,2))

GMT_TIME_HH : ADD_TO_DATE(DATE_TIME,'HH',TZ_DIFF_HR*TZ_DIFF)

GMT_TIME_MI : ADD_TO_DATE(GMT_TIME_HH,'MI',TZ_DIFF_MI*TZ_DIFF)

GMT_TIME_WITH_TZ : TO_CHAR(GMT_TIME_MI,'DD-MON-YYYY HH:MI:SS.US AM') || ' +00:00'

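Note : The expression is based on the Oracle timestamp format 'DD-MON-YYYY HH:MI:SS.FF AM TZH:TZM' and reads the time zone sign, hours and minutes from fixed character positions (30, 31-32 and 34-35). If you are using a different Oracle timestamp format, the format strings and SUBSTR positions need to be adjusted, otherwise this expression might not work.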
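For readers who want to trace the logic outside PowerCenter, here is a rough Python equivalent of the port expressions. It assumes an input such as '23-OCT-13 10:49:00.000000 PM +05:30' (two-digit year, six fractional digits), a layout in which the sign, hours and minutes fall at positions 30, 31-32 and 34-35. The function name and the sample values are made up for the example.

```python
from datetime import datetime, timedelta


def local_to_gmt(loc_time_with_tz):
    """Convert a local timestamp string with a +HH:MM/-HH:MM offset to GMT."""
    # DATE_TIME: the first 29 characters hold the local date and time.
    date_time = datetime.strptime(loc_time_with_tz[0:29].strip(),
                                  "%d-%b-%y %I:%M:%S.%f %p")
    # TZ_DIFF: a '+' offset is subtracted to reach GMT, a '-' offset is added.
    tz_diff = -1 if loc_time_with_tz[29] == "+" else 1
    tz_diff_hr = int(loc_time_with_tz[30:32])     # TZ_DIFF_HR
    tz_diff_mi = int(loc_time_with_tz[33:35])     # TZ_DIFF_MI
    # GMT_TIME_HH and GMT_TIME_MI combined into one step.
    gmt_time = date_time + timedelta(hours=tz_diff_hr * tz_diff,
                                     minutes=tz_diff_mi * tz_diff)
    # GMT_TIME_WITH_TZ: format back to text with a zero offset.
    return gmt_time.strftime("%d-%b-%Y %I:%M:%S.%f %p").upper() + " +00:00"


gmt_india = local_to_gmt("23-OCT-13 10:49:00.000000 PM +05:30")
gmt_pacific = local_to_gmt("01-JAN-14 02:15:30.500000 AM -08:00")
```

The first call returns '23-OCT-2013 05:19:00.000000 PM +00:00' and the second returns '01-JAN-2014 10:15:30.500000 AM +00:00'. Python string slices are zero based, so position 30 in the expression corresponds to index 29 here.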

Below is the expression transformation with the expressions added.

You can see sample output data generated by the expression in the image below.

Expression Usage

This reusable transformation takes one input port and gives one output port. The input port should be a timestamp with time zone information. Shown below is a mapping using this reusable transformation.

Note : Timestamp with time zone is processed as STRING(36) data type in the mapping. All the transformations should use STRING(36) data type. Source and target should use VARCHAR2(36) data type.

Download

You can download the reusable expression we discussed in this article. Click here for the download link.

Hope this tutorial was helpful and useful for your project. Please leave your questions and comments; we will be more than happy to help you.

Dynamically Changing ETL Calculations Using Informatica Mapping Variable

2013-10-16T23:43:00.001-07:00

Quite often we deal with ETL logic that is very dynamic in nature, such as a discount calculation that changes every month or special weekend-only logic. It is practically difficult to push such frequent ETL changes into the production environment. The best option for dealing with this dynamic scenario is parametrization. In this article, let's discuss how we can make ETL calculations dynamic.

Business Use Case

Let's start our discussion with the help of a real life use case.

The sales department wants to build a monthly sales fact table. The fact table needs to be refreshed after the month end closure. Sales commission is one of the fact table data elements, and its calculation is dynamic in nature. It is a factor of sales, sales revenue or net sales.

Sales Commission calculation can be :

Sales Commission = Sales * 18 / 100
Sales Commission = Sales Revenue * 20 / 100
Sales Commission = Net Sales * 20 / 100

Note : The expression calculation can be as complex as the business requirement demands.

The calculation to be used by the month end ETL will be decided by the Sales Manager before the month end ETL load.

Mapping Configuration

Now that we understand the use case, let's build the mapping logic.

Here we will be building the dynamic sales commission calculation logic with the help of a mapping variable. The changing expression for the calculation will be passed into the mapping using a session parameter file.

Step 1 : As the first step, create a mapping variable $$EXP_SALES_COMM and set the IsExprVar property to TRUE, as shown in the image below.

Note : Precision for the mapping variable should be big enough to hold the whole expression.

Step 2 : In an expression transformation, create an output port and provide the mapping variable as the expression. Below shown is the screenshot of expression transformation.
Note : All the ports used in the expression $$EXP_SALES_COMM should be available as an input or input/output port in the expression transformation.

Workflow Configuration

In the workflow configuration, we will create the parameter file with the expression for Sales Commission and set it up in the session.

Step 1 : Create the session parameter file with the expression for Sales Commission calculation with the below details.

[s_m_LOAD_SALES_FACT]
$$EXP_SALES_COMM=SALES_REVENUE*20/100

Step 2 : Set the parameter in the session properties as shown below.

With that we are done with the configuration. You can update the expression in the parameter file whenever a change is required in the sales commission calculation. This eliminates the need for an ETL code change.
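The idea of keeping the formula outside the code can be illustrated with a small Python sketch. It reads the expression from text laid out like the parameter file above and evaluates it against a row. This is a conceptual model only, not how the Integration Service resolves the variable; the helper names and the sample row values are assumptions, and only +, -, * and / are supported.

```python
import ast
import configparser
import operator

_OPERATORS = {ast.Add: operator.add, ast.Sub: operator.sub,
              ast.Mult: operator.mul, ast.Div: operator.truediv}


def read_expression(param_text, session, name):
    """Read one parameter value from parameter-file style text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str            # keep parameter names case-sensitive
    parser.read_string(param_text)
    return parser[session][name]


def evaluate(expression, row):
    """Evaluate a simple arithmetic expression using the row's port values."""
    def walk(node):
        if isinstance(node, ast.BinOp) and type(node.op) in _OPERATORS:
            return _OPERATORS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name):
            return row[node.id]         # port name looked up in the row
        raise ValueError("unsupported expression element")
    return walk(ast.parse(expression, mode="eval").body)


PARAM_FILE = """\
[s_m_LOAD_SALES_FACT]
$$EXP_SALES_COMM=SALES_REVENUE*20/100
"""

expression = read_expression(PARAM_FILE, "s_m_LOAD_SALES_FACT", "$$EXP_SALES_COMM")
row = {"SALES": 6000, "SALES_REVENUE": 5000, "NET_SALES": 4500}   # hypothetical values
commission = evaluate(expression, row)
```

With the sample row, the expression from the parameter file gives a commission of 1000.0. Changing the line in PARAM_FILE to SALES*18/100 changes the result without touching the code, which is the point of the approach.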

Hope you enjoyed this article, please leave us a comment or feedback if you have any, we are happy to hear from you.

Informatica Performance Tuning Guide, Resolve Performance Bottlenecks - Part 3

2013-10-08T23:38:00.000-07:00

In our previous article in the performance tuning series, we covered different approaches to identify performance bottlenecks. In this article we will cover the methods to resolve different performance bottlenecks. We will talk about session memory, cache memory, source, target and mapping performance tuning techniques in detail.

I. Buffer Memory Optimization

When the Integration Service initializes a session, it allocates blocks of memory to hold source and target data. Sessions that use a large number of sources and targets might require additional memory blocks.

Not having enough buffer memory for the DTM process can slow down reading, transforming or writing and cause large fluctuations in performance. Adding extra memory blocks can keep the threads busy and improve session performance. You can do this by adjusting the buffer block size and the DTM Buffer size.

Note : You can identify a DTM buffer bottleneck from the Session Log File. Check here for details.

1. Optimizing the Buffer Block Size

Depending on the source and target data, you might need to increase or decrease the buffer block size.

To identify the optimal buffer block size, sum up the precision of the individual columns of each source and target. The largest row precision among all the sources and targets is the buffer space needed for one row. Ideally, a buffer block should accommodate at least 100 rows at a time.

Buffer Block Size = Largest Row Precision * 100

You can change the buffer block size in the session configuration as shown in below image.

2. Increasing DTM Buffer Size

When you increase the DTM buffer memory, the Integration Service creates more buffer blocks, which improves performance. You can identify the required DTM Buffer Size based on below calculation.

Session Buffer Blocks = (total number of sources + total number of targets) * 2
DTM Buffer Size = Session Buffer Blocks * Buffer Block Size / 0.9
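The two calculations can be worked through with a short Python sketch. The source and target names and their row precisions are made-up figures for illustration, and rounding the result up to a whole byte is a choice made for the example.

```python
import math


def buffer_block_size(row_precisions, rows_per_block=100):
    """Buffer Block Size = Largest Row Precision * 100."""
    return max(row_precisions) * rows_per_block


def dtm_buffer_size(num_sources, num_targets, block_size):
    """DTM Buffer Size = Session Buffer Blocks * Buffer Block Size / 0.9."""
    session_buffer_blocks = (num_sources + num_targets) * 2
    return math.ceil(session_buffer_blocks * block_size / 0.9)


# Hypothetical row precisions in bytes (sum of column precisions per source/target).
row_precisions = {"SRC_ORDERS": 640, "SRC_CUSTOMERS": 1024, "TGT_SALES_FACT": 512}

block_size = buffer_block_size(row_precisions.values())
dtm_size = dtm_buffer_size(num_sources=2, num_targets=1, block_size=block_size)
```

For these figures the buffer block size comes to 102,400 bytes, the session needs 6 buffer blocks, and the DTM Buffer Size works out to 682,667 bytes.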

You can change the DTM Buffer Size in the session configuration as shown in below image.

II. Caches Memory Optimization

Transformations such as Aggregator, Rank and Lookup use cache memory, which includes an index cache and a data cache, to store transformed data. If the allocated cache memory is not large enough to store the data, the Integration Service stores the data in a temporary cache file. Session performance slows each time the Integration Service reads from the temporary cache file.

Note : You can examine the performance counters to determine which transformations require cache memory tuning. Check here for details.

1. Increasing the Cache Sizes

You can increase the allocated cache sizes so that the transformation is processed in cache memory itself and the Integration Service does not have to read from the cache file.

You can calculate the memory requirements for a transformation using the Cache Calculator. Below shown is the Cache Calculator for Lookup transformation.

You can update the cache size in the session property of the transformation as shown below.

2. Limiting the Number of Connected Ports

For transformations that use data cache, limit the number of connected input/output and output only ports. Limiting the number of connected input/output or output ports reduces the amount of data the transformations store in the data cache.

III. Optimizing the Target

The most common performance bottleneck occurs when the Integration Service writes to a target database. Small database checkpoint intervals, small database network packet sizes, or problems during heavy loading operations can cause target bottlenecks.

Note : Target bottleneck can be determined with the help of Session Log File, check here for details.

1. Using Bulk Loads

You can use bulk loading to improve the performance of a session that inserts a large amount of data into a DB2, Sybase ASE, Oracle, or Microsoft SQL Server database. When bulk loading, the Integration Service bypasses the database log, which speeds performance. Without writing to the database log, however, the target database cannot perform rollback. As a result, you may not be able to perform recovery.

2. Using External Loaders

To increase session performance, you can configure PowerCenter to use an external loader. External loaders can be used for Oracle, DB2, Sybase and Teradata target databases.

3. Dropping Indexes and Key Constraints

When you define key constraints or indexes in target tables, you slow the loading of data to those tables. To improve performance, drop indexes and key constraints before you run the session. You can rebuild those indexes and key constraints after the session completes.

4. Minimizing Deadlocks

Encountering deadlocks can slow session performance. You can increase the number of target connection groups in a session to avoid deadlocks. To use a different target connection group for each target in a session, use a different database connection name for each target instance.

5. Increasing Database Checkpoint Intervals

The Integration Service performance slows each time it waits for the database to perform a checkpoint. To decrease the number of checkpoints and increase performance, increase the checkpoint interval in the database.

6. Increasing Database Network Packet Size

If you write to Oracle, Sybase ASE, or Microsoft SQL Server targets, you can improve the performance by increasing the network packet size. Increase the network packet size to allow larger packets of data to cross the network at one time.

IV. Optimizing the Source

Performance bottlenecks can occur when the Integration Service reads from a source database. Inefficient query or small database network packet sizes can cause source bottlenecks.

Note : Session Log File details can be used to identify Source bottleneck, check here for details.

1. Optimizing the Query

If a session joins multiple source tables in one Source Qualifier, you might be able to improve performance by optimizing the query with optimizer hints. Usually, the database optimizer determines the most efficient way to process the source data. However, you might know properties about the source tables that the database optimizer does not. The database administrator can create optimizer hints to tell the database how to execute the query for a particular set of source tables.

2. Increasing Database Network Packet Size

If you read from Oracle, Sybase ASE, or Microsoft SQL Server sources, you can improve the performance by increasing the network packet size. Increase the network packet size to allow larger packets of data to cross the network at one time.

V. Optimizing the Mappings

Mapping-level optimization may take time to implement, but it can significantly boost session performance. Focus on mapping-level optimization after you optimize the targets and sources.

Generally, you reduce the number of transformations in the mapping and delete unnecessary links between transformations to optimize the mapping. Configure the mapping with the least number of transformations and expressions to do the most amount of work possible. Delete unnecessary links between transformations to minimize the amount of data moved.

Note : You can identify Mapping bottleneck from Session Log File, check here for details.

1. Optimizing Datatype Conversions

You can increase performance by eliminating unnecessary datatype conversions. For example, if a mapping moves data from an Integer column to a Decimal column, then back to an Integer column, the unnecessary datatype conversion slows performance. Where possible, eliminate unnecessary datatype conversions from mappings.

2. Optimizing Expressions

You can also optimize the expressions used in the transformations. When possible, isolate slow expressions and simplify them.

Factoring Out Common Logic : If the mapping performs the same task in multiple places, reduce the number of times the mapping performs the task by moving the task earlier in the mapping.
Minimizing Aggregate Function Calls : When writing expressions, factor out as many aggregate function calls as possible. Each time you use an aggregate function call, the Integration Service must search and group the data. For example, SUM(COL_A + COL_B) performs better than SUM(COL_A) + SUM(COL_B).
Replacing Common Expressions with Local Variables : If you use the same expression multiple times in one transformation, you can make that expression a local variable.
Choosing Numeric Versus String Operations : The Integration Service processes numeric operations faster than string operations. For example, if you look up large amounts of data on two columns, EMPLOYEE_NAME and EMPLOYEE_ID, configuring the lookup around EMPLOYEE_ID improves performance.
Using Operators Instead of Functions : The Integration Service reads expressions written with operators faster than expressions with functions. Where possible, use operators to write expressions.
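
To make the aggregate function tip a little more concrete, here is a minimal plain Java sketch (not PowerCenter expression code) that mimics the two forms. The class and method names are hypothetical, and the sketch assumes neither column contains NULL values, since NULL handling could make the two forms return different results.

```java
import java.util.Arrays;
import java.util.List;

final class AggregateFactoringSketch {
    /** One aggregation pass over the rows: the shape of SUM(COL_A + COL_B). */
    static long sumOfRowTotals(List<long[]> rows) {
        long total = 0;
        for (long[] row : rows) {
            total += row[0] + row[1];
        }
        return total;
    }

    /** Two aggregation passes over the rows: the shape of SUM(COL_A) + SUM(COL_B). */
    static long sumOfColumnTotals(List<long[]> rows) {
        long totalA = 0;
        for (long[] row : rows) {
            totalA += row[0];
        }
        long totalB = 0;
        for (long[] row : rows) {
            totalB += row[1];
        }
        return totalA + totalB;
    }

    public static void main(String[] args) {
        // Each row holds {COL_A, COL_B}; the values are made up for illustration.
        List<long[]> rows = Arrays.asList(
                new long[] {10, 1}, new long[] {20, 2}, new long[] {30, 3});
        System.out.println(sumOfRowTotals(rows));
        System.out.println(sumOfColumnTotals(rows));
    }
}
```

Both methods return the same total for non-null data, but the first one touches the rows only once, which is the idea behind factoring out aggregate calls.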

3. Optimizing Transformations

Each transformation is different, and so is the tuning it requires. In general, though, you optimize transformations by reducing the number of transformations in the mapping and deleting unnecessary links between them.

Note : Tuning techniques for individual transformations will be covered in a separate article.

What is Next in the Series

The next article in this series will cover the additional features available in Informatica PowerCenter to improve session performance. We hope you enjoyed this article. Please leave us a comment or feedback if you have any; we are happy to hear from you.

Informatica HTTP Transformation, The Interface Between ETL and Web Services

2013-09-30T22:37:00.000-07:00

In a mature data warehouse environment, you will see all sorts of data sources, like Mainframe, ERP, Web Services, Machine Logs, Message Queues, Hadoop etc. Informatica provides a variety of connectors to extract data from such data sources. Using the Informatica HTTP transformation, you can make web service calls and get data from web servers. We will explain this transformation in this article with a use case.

What is HTTP Transformation

The HTTP transformation enables you to connect to an HTTP server to use its services and applications. When you run a session with an HTTP transformation, the Integration Service connects to the HTTP server and issues a request to retrieve data from or update data on the HTTP server.

For example, you can get the currency conversion rate between USD and EUR by calling this web service.

http://rate-exchange.appspot.com/currency?from=USD&to=EUR

Using HTTP Transformation you can :

Read data from an HTTP server :- It retrieves data from the HTTP server and passes the data to a downstream transformation in the mapping.
Update data on the HTTP server :- It posts data to the HTTP server and passes HTTP server responses to a downstream transformation in the mapping.

Developing HTTP Transformation

Like any other transformation, you can create HTTP transformations in the Transformation Developer or in the Mapping Designer. As shown in the image below, all the configuration required for this transformation is on the HTTP tab.

Read or Write data to HTTP server

As shown in the image, on the HTTP tab, you can configure the transformation to read data or write data to the HTTP server. Select GET method to read data and POST or SIMPLE POST method to write data to an HTTP server.

Configuring Groups and Ports

Based on the HTTP method you choose, you configure the port groups and ports of the transformation on the HTTP tab.

Output. Contains data from the HTTP response. Passes responses from the HTTP server to downstream transformations.
Input. Used to construct the final URL for the GET method or the data for the POST request.
Header. Contains header data for the request and response.

In the image shown above, we have two input ports for the GET method and the response from the server as the output port.

Configuring a URL

The web service is accessed using a URL, and the base URL of the web service needs to be provided in the transformation. The Designer constructs the final URL for the GET method based on the base URL and port names in the input group.

In the above shown image, you can see the base url and the constructed URL, which includes the query parameters. This web service call is to get the currency conversion and we are passing two parameters to the base url, "from" and "to" currency.
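
To illustrate how a final GET URL can be put together from a base URL and the input ports, here is a minimal plain Java sketch. It is not the Designer's actual implementation; the class and method names are hypothetical, and URL-encoding the port values is an assumption made for safety in the sketch.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

final class GetUrlSketch {
    /** Appends each input port as a name=value query parameter to the base URL. */
    static String buildUrl(String baseUrl, Map<String, String> inputPorts) {
        StringBuilder url = new StringBuilder(baseUrl);
        // Start the query string with '?' unless the base URL already has one.
        char separator = baseUrl.indexOf('?') >= 0 ? '&' : '?';
        for (Map.Entry<String, String> port : inputPorts.entrySet()) {
            url.append(separator)
               .append(encode(port.getKey()))
               .append('=')
               .append(encode(port.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    // Assumption: values are URL-encoded as UTF-8 before being appended.
    private static String encode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 should always be available", e);
        }
    }

    public static void main(String[] args) {
        // Port names match the URL parameter names, as in the example above.
        Map<String, String> ports = new LinkedHashMap<String, String>();
        ports.put("from", "USD");
        ports.put("to", "EUR");
        System.out.println(buildUrl("http://rate-exchange.appspot.com/currency", ports));
    }
}
```

With the "from" and "to" ports set to USD and EUR, the sketch produces the same URL used in this article.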

Connecting to the HTTP Server

If the HTTP server requires authentication, you can create an HTTP connection object in the Workflow Manager. This connection can be used in the session configuration to connect to the HTTP server.

HTTP Transformation Use Case

Let's consider an ETL job that integrates sales data from different global sales regions into the enterprise data warehouse. Data in the warehouse needs to be standardized, and all the sales figures need to be stored in US Dollars (USD).

Solution : Here in the ETL process, let us use a web service call to get the real time currency conversion rate and convert the foreign currency to USD. We will use the HTTP Transformation to call the web service.

For the demo, we will concentrate only on the HTTP transformation. We will be using the web service from http://rate-exchange.appspot.com/ for the demonstration. This web service takes two parameters, "from currency" and "to currency", and returns a JSON document with the exchange rate information.

http://rate-exchange.appspot.com/currency?from=USD&to=EUR

Step 1 :- Create the HTTP Transformation like any other transformation in the mapping designer. We need to configure the transformation for the GET HTTP method to access currency conversion data. Below shown is the configuration.

Step 2 :- Create two input ports as shown in the image below. The ports need to be of string data type, and each port name should match the corresponding URL parameter name.

Step 3 :- Now you can provide the base URL for the web service and the designer will construct the complete URL with the parameters included.

Step 4 :- The output from the HTTP transformation will look similar to what is given below.

{"to": "USD", "rate": 1.3522000000000001, "from": "EUR"}

Finally, you can plug the transformation into the mapping as shown in the image below. Parse the output from the HTTP Transformation in an Expression transformation and do the calculation to convert the currency to USD.
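
The parsing and conversion logic can be sketched in plain Java as below. In the mapping itself this logic would live in the Expression transformation; the sketch only shows the idea. The class and method names are hypothetical, the regular expression assumes the response contains a numeric "rate" field as in the sample above, and rounding to two decimals is an assumption.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RateResponseSketch {
    // Matches "rate": <number> in a flat JSON document like the sample response.
    private static final Pattern RATE = Pattern.compile(
            "\"rate\"\\s*:\\s*(-?[0-9]+(?:\\.[0-9]+)?(?:[eE][-+]?[0-9]+)?)");

    /** Extracts the exchange rate from the web service response. */
    static BigDecimal parseRate(String json) {
        Matcher matcher = RATE.matcher(json);
        if (!matcher.find()) {
            throw new IllegalArgumentException("No numeric \"rate\" field in: " + json);
        }
        return new BigDecimal(matcher.group(1));
    }

    /** Converts a foreign currency amount to USD, rounded to two decimals (assumption). */
    static BigDecimal toUsd(BigDecimal foreignAmount, BigDecimal rate) {
        return foreignAmount.multiply(rate).setScale(2, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        String response = "{\"to\": \"USD\", \"rate\": 1.3522000000000001, \"from\": \"EUR\"}";
        BigDecimal rate = parseRate(response);
        System.out.println(toUsd(new BigDecimal("100.00"), rate));
    }
}
```

A regular expression is enough for this flat response; for nested or larger JSON documents a proper JSON parser would be the safer choice.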

We hope you enjoyed this tutorial. Please let us know if you have any difficulties in trying out the HTTP transformation, or share with us any other use cases you want to implement using the HTTP transformation.

Informatica SQL Transformation, SQLs Beyond Pre & Post Session Commands

2013-09-24T00:03:00.000-07:00

SQL statements can be used as part of pre or post SQL commands in a PowerCenter workflow. These are static SQLs and can run only once, before or after the mapping pipeline is run. With the help of the SQL transformation, we can use SQL statements much more effectively to build our ETL logic. In this tutorial, let's learn more about the transformation and its usage with a real-world use case.

What is SQL Transformation

The SQL transformation can be used to process SQL queries midstream in a mapping. You can execute any valid SQL statement using this transformation. This can be external SQL scripts or SQL queries that are created within the transformation. The SQL transformation processes the query and returns rows and database errors, if any.

Configuring SQL Transformation

SQL transformation can run in two different modes.

Script mode :- Runs SQL scripts from text files that are externally located. You pass a script name to the transformation with each input row. It outputs script execution status and any script error.
Query mode :- Executes a query that you define in a query editor. You can pass strings or parameters to the query to define dynamic queries. You can output multiple rows when the query has a SELECT statement.

Script Mode

An SQL transformation running in script mode runs SQL scripts from text files. It creates an SQL procedure and sends it to the database to process. The database validates the SQL and executes the query. You cannot use scripting languages such as Oracle PL/SQL or Microsoft/Sybase T-SQL in the script.

In script mode, you pass the script file name with the complete path from the source to the ScriptName port of the SQL transformation. The ScriptResult port gives the status of the script execution, which will be either PASSED or FAILED. The ScriptError port returns errors that occur when a script fails for a row.

Shown above is an SQL transformation in script mode, which has ScriptName as input and ScriptResult and ScriptError as output.

Query Mode

When SQL transformation runs in query mode, it executes an SQL query defined in the transformation. You can pass strings or parameters to the query from the transformation input ports to change the SQL query statement or the query data. The SQL query can be static or dynamic.

Static SQL query :- The query statement does not change, but you can use query parameters to change the data, which is passed in through the input ports of the transformation.
Dynamic SQL query :- You can change the query statements and the data, which is passed in through the input ports of the transformation.

With static query, the Integration Service prepares the SQL statement once and executes it for each row. With a dynamic query, the Integration Service prepares the SQL for each input row.

The SQL transformation shown above, which runs in query mode, has two input parameters and returns one output.

SQL Transformation Use Case

Let's consider the ETL for loading dimension tables into a data warehouse. The surrogate key for each of the dimension tables is populated using an Oracle sequence. The ETL architect needs to create a reusable Informatica component, which can be reused in different dimension table loads to populate the surrogate key.

Solution : Here let's create a reusable SQL transformation in query mode, which takes the name of the Oracle sequence and returns the sequence number as the output.

Step 1 :- Once you have the Transformation Developer open, you can start creating the SQL transformation like any other transformation. It opens up a window like the one shown in the image below.

This screen will let you choose the mode, database type, database connection type and you can make the transformation active or passive. If the database connection type is dynamic, you can dynamically pass in the connection details into the transformation. If the SQL query returns more than one record, you need to make the transformation active.

Step 2 :- Now create the input and output ports as shown in the image below. We are passing in the database schema name and the sequence name. It returns the sequence number as an output port.

Step 3 :- Using the SQL query editor, we can build the query to get the next value from the sequence. Using the 'String Substitution' ports, we can make the SQL dynamic. Here we are making the query dynamic by passing the schema name and sequence name as input ports.

That is all we need for the reusable SQL transformation. Below shown is the completed SQL transformation, which can take two input values (schema name, sequence name) and returns one output value (sequence number).

Step 4 :- We can use this transformation just like any other reusable transformation. We need to pass in the schema name and sequence name as input ports, and it returns the sequence number, which can be used to populate the surrogate key of the dimension table as shown below.

As per the above example, the Integration Service will convert the SQL as follows during the session run time.

SELECT DW.S_CUST_DIM.NEXTVAL FROM DUAL;
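
The string substitution used in this use case can be sketched in plain Java as below. This is only an illustration of the idea, not how the Integration Service is implemented. The port names SCHEMA_NAME and SEQ_NAME are hypothetical, and the sketch assumes the tilde-delimited notation (~PORT~) for string substitution ports. It also checks that each value looks like a plain identifier, since substituted text becomes part of the SQL statement itself.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StringSubstitutionSketch {
    // A port reference in the query template, for example ~SEQ_NAME~.
    private static final Pattern PORT = Pattern.compile("~([A-Za-z_][A-Za-z0-9_]*)~");
    // Assumption: only simple identifiers are accepted as substituted values.
    private static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z_][A-Za-z0-9_$#]*");

    /** Replaces every ~PORT~ reference with the value of that port for the current row. */
    static String substitute(String queryTemplate, Map<String, String> portValues) {
        Matcher matcher = PORT.matcher(queryTemplate);
        StringBuffer sql = new StringBuffer();
        while (matcher.find()) {
            String port = matcher.group(1);
            String value = portValues.get(port);
            if (value == null || !IDENTIFIER.matcher(value).matches()) {
                throw new IllegalArgumentException("Invalid value for port " + port);
            }
            matcher.appendReplacement(sql, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(sql);
        return sql.toString();
    }

    public static void main(String[] args) {
        Map<String, String> row = new HashMap<String, String>();
        row.put("SCHEMA_NAME", "DW");
        row.put("SEQ_NAME", "S_CUST_DIM");
        System.out.println(substitute("SELECT ~SCHEMA_NAME~.~SEQ_NAME~.NEXTVAL FROM DUAL", row));
    }
}
```

Because the statement text changes with the input row, this is a dynamic query in the sense described earlier, and the SQL has to be prepared for each input row.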

We hope you enjoyed this tutorial. Please let us know if you have any difficulties in trying out this tutorial, or share with us any other use cases you want to implement using the SQL transformation.

Informatica Java Transformation to Leverage the Power of Java Programming

2013-09-17T22:16:00.001-07:00

Java is one of the most popular programming languages in use, particularly for client-server web applications. With the introduction of the PowerCenter Java Transformation, ETL developers can get their feet wet with Java programming and leverage the power of Java. In this article, let's learn more about the Java Transformation, its components and its usage with the help of a use case.

What is Java Transformation

With the Java transformation, you can define transformation logic in the Java programming language without advanced knowledge of the language or an external Java development environment.

The PowerCenter Client uses the Java Development Kit (JDK) to compile the Java code and generate byte code for the transformation. The PowerCenter Client stores the byte code in the PowerCenter repository. When the Integration Service runs a session with a Java transformation, the Integration Service uses the Java Runtime Environment (JRE) to execute the byte code and process input rows and generate output rows.

Developing Code in Java Transformation

You can use the code entry tabs to enter Java code snippets to define Java transformation functionality. Using the code entry tabs within the transformation, you can import Java packages, write helper code, define Java expressions, and write Java code that defines transformation behavior for specific transformation events.

Below image shows different code entry tabs under 'Java Code'.

Import Packages :- Import third-party Java packages, built-in Java packages, or custom Java packages.
Helper Code :- Define variables and methods available to all tabs except Import Packages. After you declare variables and methods on the Helper Code tab, you can use the variables and methods on any code entry tab except the Import Packages tab.
On Input Row :- Define transformation behavior when it receives an input row. The Java code in this tab executes one time for each input row
On End of Data :- Use this tab to define transformation logic when it has processed all input data.
On Receiving Transaction :- Define transformation behavior when it receives a transaction notification. You can use this only with active Java transformations.
Java Expressions :- Define Java expressions to call PowerCenter expressions. You can use this in multiple code entry tabs.

Java Transformation Use Case

Let's take a simple example for our demonstration. The employee data source contains the employee ID, name, age, employee description, and the manager ID. We need to create an ETL transformation that finds the manager name for a given employee based on the manager ID and generates an output file that contains the employee ID, name, employee description, and the manager name.

Below shown is the complete structure of the mapping to build the functionality we described above. We are using only Java Transformation other than source, target and source qualifier.

Step 1 :- Once you have the source and source qualifier pulled into the mapping, create the Java Transformation and its input and output ports as shown in the image below. Just like any other transformation, you can drag and drop ports from other transformations to create new ports.

Step 2 :- Now move to the 'Java Code' tab and, from the 'Import Packages' tab, import the external Java classes required by the Java code. This tab can be used to import any third-party Java classes or built-in Java classes.

As shown in above image here is the import code used.

import java.util.Map;
import java.util.HashMap;

Step 3 :- In the 'Helper Code' tab, define the variables, objects and functions required by the java code, which will be written in 'On Input Row'. Here we have created four objects.

Below is the code used.

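// Employee ID -> employee name, for every row processed so far.
private static Map<Integer, String> empMap = new HashMap<Integer, String>();
// Guards empMap, which is static and so may be shared between threads.
private static Object lock = new Object();
// Per-row flags, reset at the start of 'On Input Row'.
private boolean generateRow;
private boolean isRoot;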

Step 4 :- In the 'On Input Row' tab, define the ETL logic, which will be executed for every input record.

Below is the complete code we need to place in the 'On Input Row' tab.

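// Reset the per-row flags.
generateRow = true;
isRoot = false;

// Employee ID and name are mandatory; otherwise count an error and skip the row.
if (isNull("EMP_ID_INP") || isNull("EMP_NAME_INP"))
{
   incrementErrorCount(1);
   generateRow = false;
} else {
   EMP_ID_OUT = EMP_ID_INP;
   EMP_NAME_OUT = EMP_NAME_INP;
}

// The description is optional; pass a null through as a null.
if (isNull("EMP_DESC_INP"))
{
   setNull("EMP_DESC_OUT");
} else {
   EMP_DESC_OUT = EMP_DESC_INP;
}

// An employee without a manager ID is the root of the hierarchy.
boolean isParentEmpIdNull = isNull("EMP_PARENT_EMPID");
if (isParentEmpIdNull)
{
   isRoot = true;
   logInfo("This is the root for this hierarchy.");
   setNull("EMP_PARENT_EMPNAME");
}

// Look up the manager among the rows seen so far, then remember this employee.
synchronized(lock)
{
   if (!isParentEmpIdNull)
   {
      EMP_PARENT_EMPNAME = (String) (empMap.get(Integer.valueOf(EMP_PARENT_EMPID)));
   }
   empMap.put(Integer.valueOf(EMP_ID_INP), EMP_NAME_INP);
}

// Emit the output row only when the mandatory fields were present.
if (generateRow)
{
   generateRow();
}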

With this we are done with the coding required in Java Transformation and only left with code compilation. Remaining tabs in this java transformation do not need any code for our use case.
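
One thing worth noticing in the 'On Input Row' code is that the manager name is looked up among the rows already processed. The plain Java sketch below simulates that lookup outside PowerCenter to show the effect of row order. The class, the Employee type and the sample names are hypothetical and only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ManagerLookupSketch {
    /** A simplified input row: employee ID, name and (nullable) manager ID. */
    static final class Employee {
        final int id;
        final String name;
        final Integer managerId;

        Employee(int id, String name, Integer managerId) {
            this.id = id;
            this.name = name;
            this.managerId = managerId;
        }
    }

    /** Returns one "id,name,managerName" line per row, in arrival order. */
    static List<String> resolveManagers(List<Employee> rowsInArrivalOrder) {
        Map<Integer, String> empMap = new HashMap<Integer, String>();
        List<String> output = new ArrayList<String>();
        for (Employee row : rowsInArrivalOrder) {
            // The manager can only be found if its row was processed earlier.
            String managerName = row.managerId == null ? null : empMap.get(row.managerId);
            empMap.put(row.id, row.name);
            output.add(row.id + "," + row.name + "," + managerName);
        }
        return output;
    }

    public static void main(String[] args) {
        List<Employee> rows = Arrays.asList(
                new Employee(1, "Ann", null),
                new Employee(2, "Bob", 1),
                new Employee(3, "Cid", 2));
        for (String line : resolveManagers(rows)) {
            System.out.println(line);
        }
    }
}
```

If an employee row arrives before its manager row, the manager name stays empty, so this approach assumes the source delivers managers ahead of the employees who report to them.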

Compile the Java Code

To compile the full code for the Java transformation, click Compile on the Java Code tab. The Output window displays the status of the compilation. If the Java code does not compile successfully, correct the errors in the code entry tabs and recompile the Java code. After you successfully compile the transformation, save the transformation to the repository.

Completed Mapping

All the ports of the Java transformation can now be connected from the source qualifier and to the target. Shown below is the completed structure of the mapping.

We hope you enjoyed this tutorial. Please let us know if you have any difficulties in trying out this Java code and the Java transformation, or share with us any other use cases you want to implement using the Java transformation.

Informatica Performance Tuning Guide, Identify Performance Bottlenecks - Part 2

2013-09-08T22:26:00.002-07:00

In our previous article in the performance tuning series, we covered the basics of the Informatica performance tuning process and the session anatomy. In this article we will cover the methods to identify different performance bottlenecks. Here we will use session thread statistics, session performance counters and Workflow Monitor properties to help us understand the bottlenecks.

Source, Target & Mapping Bottlenecks Using Thread Statistics

Thread statistics give run time information from all three threads: reader, transformation and writer. The session log provides enough run time thread statistics to help us understand and pinpoint the performance bottleneck.

Gathering Thread Statistics

You can get thread statistics from the session log file. When you run a session, the session log file lists run time information and thread statistics with the details below.

Run Time : Amount of time the thread runs.
Idle Time : Amount of time the thread is idle. Includes the time the thread waits for other thread processing.
Busy Time : Percentage of the run time during which the thread is busy. It is (run time - idle time) / run time x 100.
Thread Work Time : The percentage of time taken to process each transformation in a thread.

Note : Session Log file with normal tracing level is required to get the thread statistics.

Understanding Thread Statistics

When you run a session, the session log lists run information and thread statistics similar to the following text.

If you read it closely, you will see the reader, transformation and writer threads, how much time is spent on each thread and how busy each thread is. In addition to that, the transformation thread shows how busy each transformation in the mapping is.

The total run time for the transformation thread is 506 seconds and the busy percentage is 99.7%. This means the transformation thread was almost never idle during the 506 seconds. The reader and writer busy percentages were significantly smaller, about 9.6% and 24%. In this session, the transformation thread is the bottleneck in the mapping.

To determine which transformation in the transformation thread is the bottleneck, view the busy percentage of each transformation in the thread work time breakdown. The transformation RTR_ZIP_CODE had a busy percentage of 53%.

Hint : Thread with the highest busy percentage is the bottleneck.
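
The busy percentage formula and the hint above can be sketched in plain Java as below. The class and method names are hypothetical, and the run and idle times in the example are made up for illustration; the busy percentages of 9.6%, 99.7% and 24% are the ones discussed above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ThreadStatisticsSketch {
    /** Busy percentage = (run time - idle time) / run time x 100. */
    static double busyPercentage(double runTimeSeconds, double idleTimeSeconds) {
        if (runTimeSeconds <= 0) {
            throw new IllegalArgumentException("Run time must be positive");
        }
        return (runTimeSeconds - idleTimeSeconds) / runTimeSeconds * 100.0;
    }

    /** Returns the name of the thread with the highest busy percentage. */
    static String findBottleneck(Map<String, Double> busyPercentageByThread) {
        String bottleneck = null;
        double highest = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> thread : busyPercentageByThread.entrySet()) {
            if (thread.getValue() > highest) {
                highest = thread.getValue();
                bottleneck = thread.getKey();
            }
        }
        return bottleneck;
    }

    public static void main(String[] args) {
        // Made-up timings: a thread that ran 200 seconds and was idle for 50 of them.
        System.out.println(busyPercentage(200.0, 50.0));

        // Busy percentages from the session discussed above.
        Map<String, Double> busy = new LinkedHashMap<String, Double>();
        busy.put("reader", 9.6);
        busy.put("transformation", 99.7);
        busy.put("writer", 24.0);
        System.out.println(findBottleneck(busy));
    }
}
```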

Session Bottleneck Using Session Performance Counters

All transformations have counters to help measure and improve performance of the transformations. Analyzing these performance details can help you identify session bottlenecks. The Integration Service tracks the number of input rows, output rows, and error rows for each transformation.

Gathering Performance Counters

You can set up the session to gather performance counters in the Workflow Manager. The image below shows the configuration required for a session to collect transformation performance counters.

Understanding Performance Counters

The image below shows the performance counters for a session, which you can see in the Workflow Monitor session run properties. You can see the transformations in the mapping and the corresponding performance counters.

Non-zero counts for readfromdisk and writetodisk indicate sub-optimal settings for transformation index or data caches. This may indicate the need to tune session transformation caches manually.

A non-zero count for Errorrows indicates you should eliminate the transformation errors to improve performance.

Errorrows : Transformation errors impact session performance. If a transformation has large numbers of error rows in any of the Transformation_errorrows counters, you should eliminate the errors to improve performance.
Readfromdisk and Writetodisk : If these counters display any number other than zero, you can increase the cache sizes to improve session performance.
Readfromcache and Writetocache : Use these counters to analyze how the Integration Service reads from or writes to cache.
Rowsinlookupcache : Gives the number of rows in the lookup cache. To improve session performance, tune the lookup expressions for the larger lookup tables.
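
The rules above for reading the counters can be sketched as a small plain Java check. The class name, the sample counter names (such as LKP_CUST_readfromdisk) and the advice texts are hypothetical; the sketch only assumes that counter names end with the suffixes described above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class CounterAdvisorSketch {
    /** Returns one piece of advice for each non-zero counter that signals a problem. */
    static List<String> advise(Map<String, Long> counters) {
        List<String> advice = new ArrayList<String>();
        for (Map.Entry<String, Long> counter : counters.entrySet()) {
            if (counter.getValue() == 0L) {
                continue; // A zero count needs no action.
            }
            String name = counter.getKey().toLowerCase(Locale.ROOT);
            if (name.endsWith("_readfromdisk") || name.endsWith("_writetodisk")) {
                advice.add(counter.getKey() + ": increase cache size");
            } else if (name.endsWith("_errorrows")) {
                advice.add(counter.getKey() + ": eliminate transformation errors");
            }
        }
        return advice;
    }

    public static void main(String[] args) {
        // Made-up counter values for illustration.
        Map<String, Long> counters = new LinkedHashMap<String, Long>();
        counters.put("LKP_CUST_readfromdisk", 12L);
        counters.put("LKP_CUST_writetodisk", 0L);
        counters.put("EXP_CLEAN_errorrows", 3L);
        counters.put("LKP_CUST_rowsinlookupcache", 5000L);
        for (String line : advise(counters)) {
            System.out.println(line);
        }
    }
}
```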

Session Bottleneck Using Session Log File

When the Integration Service initializes a session, it allocates blocks of memory to hold source and target data. Not having enough buffer memory for the DTM process can slow down reading, transforming or writing and cause large fluctuations in performance.

If the session is not able to allocate enough memory for the DTM process, the Integration Service writes a warning message into the session log file and gives you the recommended buffer size. Below is a sample message seen in the session log.

Message: WARNING: Insufficient number of data blocks for adequate performance. Increase DTM buffer size of the session. The recommended value is xxxx.

System Bottleneck Using the Workflow Monitor

You can view the Integration Service properties in the Workflow Monitor to see CPU, memory, and swap usage of the system when you are running task processes on the Integration Service. Use the following Integration Service properties to identify performance issues:

CPU% : The percentage of CPU usage includes other external tasks running on the system. High CPU usage indicates that the server needs additional processing power.
Memory Usage : The percentage of memory usage includes other external tasks running on the system. If the memory usage is close to 95%, check if the tasks running on the system are using the amount indicated in the Workflow Monitor or if there is a memory leak. To troubleshoot, use system tools to check the memory usage before and after running the session and then compare the results to the memory usage while running the session.
Swap Usage : Swap usage is a result of paging due to possible memory leaks or a high number of concurrent tasks.

What is Next in the Series

The next article in this series will cover how to remove bottlenecks and improve session performance. We hope you enjoyed this article. Please leave us a comment or feedback if you have any; we are happy to hear from you.