Introduction:
Fuzzy grouping transformation will add few additional
columns to the “transformation” output along with source columns. In those,
with the help of “_Key_in and _Key_out” columns we can find the
duplicate records.
Example:
- Find the below table with source data and corresponding “Key_in and Key_out” values.
- _Key_in returns the unique number for each record.
- _Key_out returns unique number for each group (duplicate record set), and same number for a group of records.
- Key_out value for a group of duplicate records is the Minimum “_Key_in” value of the same group.
- If “_Key_in is equal to _Key_out” then that record is valid record otherwise duplicate record.
Sample package creation:
Requirement:-
I have a source table that contains duplicate
records. I want load the source data into target table, while the time of
loading skip the duplicate records.
Package creation:-
- From control flow items, drag and drop “Data flow task” into designer surface.
- Right click on “Data Flow Task” and select “Edit” option then it will go to “Data flow”
- From data flow items, drag and drop “OLEDB source” into designer surface. Create the source table mappings.
- After source mappings, drag and drop the “Fuzzy Grouping” transformation into designer surface. Connect the source green arrow to the transformation shown below screen shot.
- Edit the “Fuzzy grouping” transformation, then “Fuzzy grouping transformation editor” window will open.
- Specify the connection manager. The OLEDB connection is required to create temp table for performing “Fuzzy grouping” operation.
- Click on “Columns” tab.
- Select the required columns to find the duplicate records.
Four
your information: ID column data type is “INT” so match type is “Exact”. “Name”
column data type is “varchar” so matched type is “Fuzzy”.
- Click on “Advanced” tab.
- Name column is having string value so we need to find the fuzzy match.
- If we give “Similarity threshold:” as “0.80” the string column values should match more than 80% then only it will treat as duplicate record other wise it will treat as a valid record.
- Click “OK” button.
- Drag and drop the “Conditional split” into designer surface. Connect the “Fuzzy grouping” transformation green arrow to the “Conditional split”. Then write the required condition to filter duplicate records.
- Condition: [_key_in]== [_key_out]
- Click “OK”.
- Drag and drop the “OLEDB destination” into designer surface, then map the target table. After that execute the package.
Target table output: