Saturday, August 22, 2015

How To Extract Duplicate Data from a Source Excel/CSV File in SSIS

Scenario: The source excel/csv file must have unique case numbers.Since the alphanumeric case numbers are not auto generated there can be duplicate case numbers with different details.

Requirement: The SSIS package has to identify the duplicate case numbers and load their details in a separate destination file 





Case 1: Only the duplicate Case_numbers are required




 SSIS Package Transformation: Aggregate



 SSIS Package Transformation: Conditional Split



Destination File with Duplicate Case Numbers




Case 2: Duplicate Case_numbers with details are required

 The SSIS package will be different from the previous one.
Additional data transformations Multicast,Sort,Merge Join will be used.


Merge Join transformation will use a left inner join and extract details from master file for the matching data from conditional split transformation


Sort transformation is required for both input sources of merge join.
After execution of the package the destination file has all the details for duplicate case numbers




The important question here is why did I use a merge join  over a cache transform & lookup approach?

A cache transform & lookup approach uses a equi-join and is followed if for each input row there exists a single row in the lookup/cache reference.It returns back the first found matching row.
In our case there were multiple rows for the same case numbers.When lookup is used the destination file looks like as below which is not the desired output.




















 

1 comment:

  1. I feel there is a need to look for more and more information about SSIS and many other components.

    SSIS Upsert

    ReplyDelete