Informatica Incremental Aggregation Implementation and Business Use Cases

ETL Design , Performance Tips

Informatica PowerCenter Incrimental Aggregation

Incremental Aggregation is the perfect performance improvement technique to implement; when you have to do aggregate calculations on your incrementally changing source data. Rather than forcing the session to process the entire source data and recalculate the same data each time you run the session, incremental aggregation persist the aggregated value and adds the incremental changes to it. Lets see more details in this article.

What is Incremental Aggregation

Using incremental aggregation, you can apply changes captured from the source to aggregate calculations such as Sum, Min, Max, Average etc... If the source changes incrementally and you can capture changes, you can configure the session to process those changes. This allows the Integration Service to update the target incrementally, rather than forcing it to process the entire source and recalculate the same data each time you run the session.

When to Use Incremental Aggregation

You can capture new source data : Use incremental aggregation when you can capture new source data each time you run the session. Use a change data capture mechanism for the same.

Incremental changes do not significantly change the target : Use incremental aggregation when the changes do not significantly change the target. If processing the incrementally changed source alters more than half the existing target, the session may not benefit from using incremental aggregation. In this case, drop the table and recreate the target with complete source data.

How Incremental Aggregation Works

When the session runs with incremental aggregation enabled for the first time, it uses the entire source data. At the end of the session, the Integration Service stores aggregate data from that session run in two files, the index file and the data file, in the cache directory specified in the Aggregator transformation properties.

Each subsequent time you run the session with incremental aggregation, you use the incremental source changes in the session. For each input record, the Integration Service checks historical information in the index file for a corresponding aggregate group. If it finds a corresponding group, the Integration Service performs the aggregate operation incrementally, using the aggregate data for that group, and saves the incremental change. If it does, the Integration Service creates a new group and saves the record data.

Note : Before enabling incremental aggregation, it is important to read incremental changes from source to avoid double count.

Business Use Case

Lets consider an ETL job, which is used to load the Sales Summary Table. The summary table generates yearly sales summary by product line. The table includes the columns 'Sales Year', 'Product Line Name', 'Sales Quantity', 'Sales Amount'

Incremental Aggregation Implementation

Lets create a mapping, which can identify the new sales data from the data source and set the incremental aggregation. New sales data records are identified using the CREATE_DT column in the source table. The source qualifies of the mapping looks as in below image. The source qualifier is set to read the changed data using mapping variables.

Now do the aggregation calculation using the aggregator transformation as shown in below image.

Complete the mapping as shown in below image.

Create the Workflow and set the incremental aggregation setting in the session property as shown in the image.

Note : No need to use an update strategy transformation to implement Insert else Update logic. You can set the session properties just like 'Insert' only mapping. When you use the incremental aggregation, Integration Service does the Insert or Update based on the primary key set in the target table.

Incremental Aggregation Behind the Scene

Lets understand how incremental aggregator works behind the scene. For the better understanding lets use the data set from the use case explained above.

Source data from Day I

On Day 1, all data from the source is read and processed in the mapping.

Sales Date	Product Line	Sales Quantity	Sales Amount	Create Date
04-Jan-2014	Tablet	1	$450	04-Jan-2014
03-Feb-2014	Tablet	1	$500	03-Feb-2014
03-Feb-2014	Computers	1	$1,300	03-Feb-2014
13-Mar-2014	Cell Phone	2	$350	13-Mar-2014

Data from the source is read, summarized and persisted in Aggregator Cache. One row per aggregator group is persisted in the cache.

Sales Year	Product Line	Sales Quantity	Sales Amount	Note
2014	Tablet	2	$950	New In Cache
2014	Computers	1	$1,300	New In Cache
2014	Cell Phone	2	$350	New In Cache

Source data from Day 2

On Day 2, only new data is read from the source and processed in the mapping.

Sales Date	Product Line	Sales Quantity	Sales Amount	Create Date
14-Mar-2014	Tablet	1	$450	14-Mar-2014
14-Mar-2014	Tablet	1	$500	14-Mar-2014
14-Mar-2014	Video Game	1	$300	14-Mar-2014

Aggregator Cache is updated with the new values and new aggregator groups are inserted.

Sales Year	Product Line	Sales Quantity	Sales Amount	Note
2014	Tablet	4	$1,900	Update In Cache
2014	Computers	1	$1,300	No Change In Cache
2014	Cell Phone	2	$350	No Change In Cache
2014	Video Game	1	$300	New In Cache

Reinitializing the Aggregate Cache Files

Based on the use case we discussed here, we need to reset the aggregate cache file for every new year. You can reset the cache file using the settings shown in below image. You get a warning message about clearing the persisted aggregate values, but can be ignored.

After you run a session that reinitializes the aggregate cache, edit the session properties to disable the Reinitialize Aggregate Cache option. If you do not clear Reinitialize Aggregate Cache, the Integration Service overwrites the aggregate cache each time you run the session.

Hope this article is useful for you guys. Please feel free to share your comments and any questions you may have.

Popular Posts

Random Posts

Posts Being Viewed

Posts Category