Thursday, 6 September 2012

FUZZY GROUPING TRANSFORMATION






Introduction:
            Using “Fuzzy grouping” transformation we can find the duplicate records.

            Fuzzy grouping transformation will add few additional columns to the “transformation” output along with source columns. In those, with the help of “_Key_in and _Key_out” columns we can find the duplicate records.

Example:
  • Find the below table with source data and corresponding “Key_in and Key_out” values.



  • _Key_in returns the unique number for each record.



  • _Key_out returns unique number for each group (duplicate record set), and same number for a group of records.



  • Key_out value for a group of duplicate records is the Minimum “_Key_in” value of the same group.



  • If “_Key_in is equal to _Key_out” then that record is valid record otherwise duplicate record.


Sample package creation:
Requirement:-
            I have a source table that contains duplicate records. I want load the source data into target table, while the time of loading skip the duplicate records.




Package creation:-
  • From control flow items, drag and drop “Data flow task” into designer surface.
  • Right click on “Data Flow Task” and select “Edit” option then it will go to “Data flow”
  • From data flow items, drag and drop “OLEDB source” into designer surface. Create the source table mappings.
  • After source mappings, drag and drop the “Fuzzy Grouping” transformation into designer surface. Connect the source green arrow to the transformation shown below screen shot.
 

 
  • Edit the “Fuzzy grouping” transformation, then “Fuzzy grouping transformation editor” window will open.



  • Specify the connection manager. The OLEDB connection is required to create temp table for performing “Fuzzy grouping” operation.
  • Click on “Columns” tab.



  • Select the required columns to find the duplicate records.



Four your information: ID column data type is “INT” so match type is “Exact”. “Name” column data type is “varchar” so matched type is “Fuzzy”.

  • Click on “Advanced” tab.




  • Name column is having string value so we need to find the fuzzy match.
  • If we give “Similarity threshold:” as “0.80” the string column values should match more than 80% then only it will treat as duplicate record other wise it will treat as a valid record.
  • Click “OK” button.
  • Drag and drop the “Conditional split” into designer surface. Connect the “Fuzzy grouping” transformation green arrow to the “Conditional split”. Then write the required condition to filter duplicate records.
    • Condition: [_key_in]== [_key_out]



  • Click “OK”.
  • Drag and drop the “OLEDB destination” into designer surface, then map the target table. After that execute the package.



Target table output:

No comments:

Post a Comment