Using Checksum in SSIS for Delta Loads

The CHECKSUM function in SQL Server is a great way to compare two rows to see if the data matches. CHECKSUM returns an integer computed over the values in the row. When the checksum values of two rows match, the rows almost certainly contain the same data (collisions are possible, since many inputs map to a single integer, but they are rare). If a change has occurred on one of the rows, the checksum value will change.
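As a quick illustration of this behavior, here is a small standalone query (the literal values are made up for the example, not taken from the tables below): two identical sets of values produce the same checksum, and changing any one value changes it.

```sql
-- Two identical value lists produce the same checksum;
-- changing any column value changes the result.
SELECT CHECKSUM('Mike', 'Davis', '123 Main St') AS OriginalRow
      ,CHECKSUM('Mike', 'Davis', '123 Main St') AS IdenticalRow
      ,CHECKSUM('Mike', 'Davis', '456 Oak Ave') AS ChangedRow;
-- OriginalRow and IdenticalRow return the same integer; ChangedRow differs.
```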

Here is a table that has some data that I am going to load into a second database. The ID column is the primary key and therefore unique.

image

Here is the table that needs to be updated.

image

Notice that Mike Davis has moved to a different address in the same city, so this is the only row that needs to be updated. If I use an SSIS package to load this data, I could truncate and reload the entire table. If the table has only a few rows this would be fine, but if it contains millions of rows it would be a very long process. Alternatively, we could use a lookup and compare each column to see if there are any changes. That is fine if there are only a few columns, but if the table has hundreds of columns, mapping every column in the lookup transform would take a long time and would perform badly due to all of the comparisons. With the checksum value we can compare just one column in a conditional split transform.

Here is the dataflow in an SSIS package that is using the checksum value to compare the input table to the table to update. Below we will look at each transform in detail.

image

The source is an OLE DB source using the following SQL command.

SELECT CHECKSUM(ID, First, Last, Address1, Address2, City, State, ZipCode) AS [Checksum]
      ,ID ,First ,Last ,Address1 ,Address2 ,City ,State ,ZipCode
FROM InputData

The results from this query are shown below.

image

Notice the checksum value in the first column. The CHECKSUM function is wrapped around every column in the table. If you only want to compare a subset of the columns, include only that subset in the function. The syntax is CHECKSUM(Col1, Col2, Col3, …) AS Alias.

The next transform is a lookup. It matches on the ID columns and retrieves the checksum from the table that needs to be updated.

I set rows with no matching entries to redirect to the no match output.

image

This lookup transform also uses a query that computes the checksum value in SQL.

image
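A likely shape for that lookup query (a sketch, assuming the target table is [Testing].[dbo].[TableToUpdate] as in the update command later on, with the same column list as the source):

```sql
-- Lookup query: checksum plus the join key from the table to update.
SELECT CHECKSUM(ID, First, Last, Address1, Address2, City, State, ZipCode) AS [Checksum]
      ,ID
FROM [Testing].[dbo].[TableToUpdate]
```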

The results of this query are shown below. Notice that all the checksums match the input table's checksums from above, except for the row that has changed, which is the row with ID 1.

image

In the Columns node of the lookup transform, I mapped the ID columns together and placed a check next to the Checksum column on the lookup table. This way we can compare the checksums in the next transform, the conditional split.

image

If the lookup does not find a match, then the record is new and needs to be added to the table regardless of the checksum value. The no match output goes to an OLE DB destination.

The matching rows go to the conditional split, which compares the checksum values. The conditional split has two outputs: the “No Changes” output for matching checksums, and the “Updates Needed” output for non-matching checksums. If the checksums do not match, something on the incoming row is different from the existing table and the row needs to be updated. We know the table has a matching ID because the preceding lookup found the matching IDs.

image
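The split conditions described above would look something like this in the SSIS expression language (the lookup output column name, here assumed to be Lookup Checksum, depends on the alias you chose in the lookup transform):

```
Output "Updates Needed" :  [Checksum] != [Lookup Checksum]
Output "No Changes"     :  [Checksum] == [Lookup Checksum]
```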

The “Updates Needed” rows go to an OLE DB Command transform. This transform runs an update statement for every column in the row with the matching ID.

Here is the update command.

UPDATE [Testing].[dbo].[TableToUpdate]
SET
[First] = ?
,[Last] = ?
,[Address1] = ?
,[Address2] = ?
,[City] = ?
,[State] = ?
,[ZipCode] = ?
WHERE [ID] = ?

Here is the parameter mapping in the OLEDB Command transform.

image

Each parameter is mapped to the proper column to update. The ID column is mapped to the WHERE clause parameter. Go back to the picture of the entire data flow and see if it makes sense now.

This is not the only way to accomplish this. I could have written the updated rows to a staging table and done a set-based update against the target table to improve performance.
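That set-based alternative could be sketched like this (the staging table name dbo.Staging_Updates is an assumption; the target table and columns are the ones used throughout the article):

```sql
-- After the data flow lands changed rows in a staging table,
-- one joined UPDATE replaces the row-by-row OLE DB Command.
UPDATE t
SET  t.[First]    = s.[First]
    ,t.[Last]     = s.[Last]
    ,t.[Address1] = s.[Address1]
    ,t.[Address2] = s.[Address2]
    ,t.[City]     = s.[City]
    ,t.[State]    = s.[State]
    ,t.[ZipCode]  = s.[ZipCode]
FROM [Testing].[dbo].[TableToUpdate] AS t
JOIN dbo.Staging_Updates AS s
  ON s.[ID] = t.[ID];
```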

On a side note, all of this work can be done with one simple add-on. Task Factory has an Upsert destination that does all of this work plus the batch update, so it is very fast. If you used the Task Factory Upsert destination, your data flow would look like this.

image

Let me know if you have any questions.


7 Responses to Using Checksum in SSIS for Delta Loads

  1. lala says:

thanks, this is very informative..

  2. John says:

    Hi Mike, Thank you for the excellent article.

    Can you please provide a download link for the package mentioned in the article?

    Thanks…

  3. guru says:

thanks, for this article

  4. jERRY says:

    Hi Mike,
This is a great example of the insert and update process. I just wonder: if I use the checksum transformation in SSIS instead of using the CHECKSUM() function in the SQL query like you did, what is the difference between those two methods? Does the performance differ? I understand the checksum transformation can use the MD5 algorithm, while the CHECKSUM function uses a simpler hash.

    Thanks,Jerry

    • MikeDavisSQL says:

That is a great question. I have not tested this out. I have seen others use a hash algorithm, which is less likely to have collision issues than CHECKSUM. Some have even stored the hash in the same table so it is just part of the query. This is better for performance, but takes up more space.

  5. jERRY says:

    Hi Mike,
    Thanks for your response. Just like you mentioned,I just using hashbytes function inside query and test at SSIS package. it works fine. However, there are not much performance difference between hashbytes and check-sum when the data is not real huge..FYI, i test one table with around 200,000 rows…. From my research, hashbytes even return more correct data than checksum… so not sure if we can see the difference if having more than millions rows for process
