Thursday, 6 September 2012

FUZZY GROUPING TRANSFORMATION






Introduction:
            Using “Fuzzy grouping” transformation we can find the duplicate records.

            Fuzzy grouping transformation will add few additional columns to the “transformation” output along with source columns. In those, with the help of “_Key_in and _Key_out” columns we can find the duplicate records.

Example:
  • Find the below table with source data and corresponding “Key_in and Key_out” values.



  • _Key_in returns the unique number for each record.



  • _Key_out returns unique number for each group (duplicate record set), and same number for a group of records.



  • Key_out value for a group of duplicate records is the Minimum “_Key_in” value of the same group.



  • If “_Key_in is equal to _Key_out” then that record is valid record otherwise duplicate record.


Sample package creation:
Requirement:-
            I have a source table that contains duplicate records. I want load the source data into target table, while the time of loading skip the duplicate records.




Package creation:-
  • From control flow items, drag and drop “Data flow task” into designer surface.
  • Right click on “Data Flow Task” and select “Edit” option then it will go to “Data flow”
  • From data flow items, drag and drop “OLEDB source” into designer surface. Create the source table mappings.
  • After source mappings, drag and drop the “Fuzzy Grouping” transformation into designer surface. Connect the source green arrow to the transformation shown below screen shot.
 

 
  • Edit the “Fuzzy grouping” transformation, then “Fuzzy grouping transformation editor” window will open.



  • Specify the connection manager. The OLEDB connection is required to create temp table for performing “Fuzzy grouping” operation.
  • Click on “Columns” tab.



  • Select the required columns to find the duplicate records.



Four your information: ID column data type is “INT” so match type is “Exact”. “Name” column data type is “varchar” so matched type is “Fuzzy”.

  • Click on “Advanced” tab.




  • Name column is having string value so we need to find the fuzzy match.
  • If we give “Similarity threshold:” as “0.80” the string column values should match more than 80% then only it will treat as duplicate record other wise it will treat as a valid record.
  • Click “OK” button.
  • Drag and drop the “Conditional split” into designer surface. Connect the “Fuzzy grouping” transformation green arrow to the “Conditional split”. Then write the required condition to filter duplicate records.
    • Condition: [_key_in]== [_key_out]



  • Click “OK”.
  • Drag and drop the “OLEDB destination” into designer surface, then map the target table. After that execute the package.



Target table output:

Tuesday, 4 September 2012

FUZZY LOOKUP TRANSFORMATION





Introduction:
            Using Fuzzy lookup we can find the percentage matches between two strings.

Sample package:
Requirement:-

Source table:-
            Source table is having persons information (ID, Name, and Gender) shown below.



Lookup table:-
            Lookup table is having valid person’s names, shown below.


Target table:-
            We need to populate target table with valid records from source, using lookup table.


            Source table is having person’s information (Valid and invalid). Lookup table is having valid person’s names. I want to extract valid person’s information from source with the help of lookup table, and load it into target table.

            Source and lookup table person names (data in the Name column) are similar but not exactly matched, so we need to find percentage matched between the name columns. Based on that percentage we need to take decision weather that particular record is valid or invalid.

            Lookup transformation creates additional columns, in those “_Similarity” column returns the percentage matches between the mapped columns. Using conditional split we can split the data into different sets (valid and invalid records).


Package creation:-
  • From control flow items, drag and drop “Data flow task” into designer surface.
  • Right click on “Data flow task” and select edit option then it will go to “Data flow”
  • From data flow items, drag and drop “OLEDB source” into designer surface. Create the source table mappings.
  • After source mappings, drag and drop the “Fuzzy Lookup” transformation into designer surface. Connect the source green arrow to the transformation shown below screen shot.



  • Edit the “Fuzzy Lookup” transformation. Select the reference table (Lookup table) shown below.



  • Go to “Columns” tab.



  • In the above screen shot, two mappings are available between the ID and Name columns. We need to delete the ID mapping, because we need to find the valid records based on name field only.
  • Click on the ID mapping line and press delete button from the key board then mapping will get delete. Select “Name” column check box, from lookup columns. See the below screen shot for reference.



  • Click “OK” button, then “Fuzzy Lookup Transformation Editor” window will disappear.
  • Drag and drop the “Conditional split” transformation into designer surface.
  • Connect the “Fuzzy lookup” transformation green arrow to the “Conditional split”.
  • Edit the “Conditional split” and expand the columns.



  • Drag and drop the “_Similarity” column into “Condition” box.



  • If the strings are matched more than 60% then I want to consider that particular record as valid other wise we should ignore.

Condition: _Similarity>0.6



  • Click “OK” Button.
  • Drag and drop the “OLEDB destination” into designer surface.
  • Connect the conditional split green arrow to the “OLEDB destination” then “Input and output selection” window will open.



  • In that window “Output:” dropdown list select the case name (Case 1).



  • Click “OK” button.
  • Edit the “OLEDB target” and map the target table. After that execute the package.



Target table output:


Note: R King is not loaded into target table because the record is invalid record (Reference table doesn’t have the “R King” information).