Using Checksum in SSIS for Delta Loads

The checksum function in SQL is a great way to compare two rows to see if the data matches. The checksum command returns a number that represents the value of the data in the row. When you compare the checksum value of two rows, and the values match, this confirms the data on the rows match. If a change has occurred on one of the rows then the checksum value will change.

Here is a table that has some data that I am going to load into a second database. The ID column is the primary key and therefore unique.

Here is the table that needs to be updated.

Notice that Mike Davis has moved to a different address in the same city. So this is the only row that needs to be updated. If I use an SSIS package to load this data I could truncate and reload the entire table. If the table only has a few rows this would be fine. But if the table contains millions of rows this would be a very long process. We can use a lookup and compare each column to see if there are any changes. This is fine if there are only a few columns. If the table has hundreds of columns then it would take a long time to map every column in the lookup transform and would perform badly due to all of the comparisons. With the checksum value we can compare just one column in a conditional split transform.

Here is the dataflow in an SSIS package that is using the checksum value to compare the input table to the table to update. Below we will look at each transform in detail.

The source is an OLEDB source and is using the following SQL command.

SELECT Checksum(ID,First,Last,Address1,Address2,City,State,ZipCode) as ‘Checksum’
,ID ,First ,Last ,Address1 ,Address2 ,City ,State,ZipCode
FROM InputData

The results from this query are shown below.

Notice the checksum value in the first column. The checksum function is wrapped around every column in the table. If you only want to compare a subset of the columns then only add this subset in the checksum function. The checksum Syntax is Checksum(Col1, Col2, Col3…) as Alias.

The next transform is a lookup and is doing a compare of the ID columns and getting the checksum from the table that needs to be updated.

I set the no matching rows to redirect to the no match output.

This lookup transform is using a query with the checksum value in the SQL query also.

The results of this query are shown below. Notice all the checksums match the input table’s checksums from above, except for the row that has changed, which is the row with ID 1.

In the Columns node of the lookup transform I mapped the ID columns together and I placed a check next to the Checksum on the lookup table. This way we can compare the Checksums in the next transform, the conditional split.

If the lookup does not find a match then the record is new and needs to be added to the table regardless of the checksum value. The no match output goes to an OLEDB destination.

The matching rows go to the conditional Split which compares the Checksum Values. The conditional split has two outputs. The “No Changes” output is the matching checksums. The “Updates Needed” output is the non matching checksums. If the checksums do not match this indicates that something on the incoming row is different from the existing table and needs to be updated. We know the table has a matching ID because of the preceding lookup finding the matching ID’s.

The “Update Needed” rows go to an OLEDB Command transform. This transform is running an update statement for every column in the row with the matching ID.

Here is the update command.

UPDATE [Testing].[dbo].[TableToUpdate]
SET
[First] = ?
,[Last] = ?
,[Address1] = ?
,[Address2] = ?
,[City] = ?
,[State] = ?
,[ZipCode] = ?
WHERE [ID] =?

Here is the parameter mapping in the OLEDB Command transform.

Each row is mapped to the proper column to update. The ID column is mapped to the where clause. Go back to the picture of the entire data flow and see if it makes sense now.

This is not the only way to accomplish this. I could have written the updated rows to a staging table and done a set based update to the table to improve performance.

On a side note all of this work could be done with one simple add-on. Task Factory has an Upsert destination and does all of this work and the batch update so it is very fast. If you used the task factory Upsert destination your data flow would look like this.

Let me know if you have any questions.

7 Responses to Using Checksum in SSIS for Delta Loads

lala says:

September 14, 2011 at 3:57 am

thanks,this is very informative..

John says:

January 5, 2012 at 5:23 am

Hi Mike, Thank you for the excellent article.

Can you please provide a download link for the package mentioned in the article?

Thanks…

- MikeDavisSQL says:
  
  January 5, 2012 at 9:54 am
  
  Sorry I did not save that package.
  
guru says:

January 25, 2012 at 9:04 am

thanks,for this article

jERRY says:

October 22, 2012 at 3:28 pm

Hi Mike,
This is great example for insert and update process. I just wonder, if I use checksum transformation in SSIS instead of using checksum() function like you did at SQL query , what different between those 2 methods? is the performance making different ? I know checksum transformation is using M5 algorithm while checksum function using hash index.

Thanks,Jerry

- MikeDavisSQL says:
  
  October 24, 2012 at 8:45 am
  
  That is a great question. I have not tested this out. I have seen others use the hash algorithm, which is less likely to have issues like checksum. Some have even stored the hash in the same table so it is just a part of the query. This is better for performance, but takes up a lot more space.
  
jERRY says:

October 29, 2012 at 4:47 pm

Hi Mike,
Thanks for your response. Just like you mentioned,I just using hashbytes function inside query and test at SSIS package. it works fine. However, there are not much performance difference between hashbytes and check-sum when the data is not real huge..FYI, i test one table with around 200,000 rows…. From my research, hashbytes even return more correct data than checksum… so not sure if we can see the difference if having more than millions rows for process