Checksum vs Hashbytes

I recently worked on a project in which I redesigned a sales data warehouse as a STAR schema, using daily file partitions with an automatic sliding window and applying data compression at the page level. I ended up reducing a 5 terabyte database to less than 750 GB. I will be writing several articles on the lessons I learned during the process.

Today, I want to talk about how to generate a hash key by using two built-in SQL Server functions. A hash function is any algorithm that maps a large data set of variable-length keys to a smaller data set of fixed-length keys.

One of the business requirements in the data warehouse was to have 15 different reporting levels, where each unique combination of the 15 columns represents one level. The maximum size of an index key in SQL Server is 16 columns and 900 bytes. Indexing all of the columns is not feasible, since their combined size can easily exceed that limit.

The prior BI developer was joining on all 15 columns in the SSIS package. Executing the package resulted in a full table scan when joining the source data to the reporting level dimension to generate a surrogate key. This can be a major performance issue on large tables. How do we speed up the join?

The solution to this join problem is to use a hash key, which should allow the query optimizer to choose an Index Seek for the join. Basically, we apply a hash function to the 15 columns to produce a single number or binary string. This hash key will be indexed and used as the natural key in the reporting levels dimension table.

Expanding the BASIC TRAINING database, I am going to use the following T-SQL snippet to create a reporting levels dimension table, defining the hash key as a computed column that uses the CHECKSUM() function.
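Something along these lines (the fifteen level columns are abbreviated to three here, and the object names are illustrative):

-- Reporting levels dimension with the hash key defined as a computed column
-- that applies CHECKSUM() to the level columns (only three shown for brevity).
CREATE TABLE dbo.dim_reporting_levels
(
    level_key INT IDENTITY(1, 1) NOT NULL,
    level_01  VARCHAR(128) NOT NULL,
    level_02  VARCHAR(128) NOT NULL,
    level_03  VARCHAR(128) NOT NULL,
    hash_key  AS CHECKSUM(level_01, level_02, level_03) PERSISTED,
    CONSTRAINT pk_dim_reporting_levels PRIMARY KEY CLUSTERED (level_key)
);
GO

-- Index the hash key so joins against it can use an index seek.
CREATE NONCLUSTERED INDEX ix_dim_reporting_levels_hash_key
    ON dbo.dim_reporting_levels (hash_key);
GO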

If you read Books Online closely, you will note that the CHECKSUM() function does not guarantee uniqueness. The function takes a list of columns as input and returns a single integer as output.

Books Online does not document the algorithm behind CHECKSUM(), but the size of the output (a 4-byte integer) limits the number of possible values. I initially used this function in the data warehouse and found over 300 duplicates in 160,000 reporting levels, where distinct column combinations generated the same hash key.
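Collisions like these can be surfaced with a simple grouping query against the computed column (again using the illustrative table above):

-- List hash values shared by more than one reporting level combination.
SELECT hash_key, COUNT(*) AS level_count
FROM dbo.dim_reporting_levels
GROUP BY hash_key
HAVING COUNT(*) > 1
ORDER BY level_count DESC;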

On the other hand, the HASHBYTES() function with the MD5 algorithm is far less likely to produce duplicates, since it generates a 16-byte binary output. The function can generate hash keys using seven different algorithms, with outputs ranging in size from 16 to 64 bytes.

The code below drops the hash key column and recomputes it using the HASHBYTES() function. HASHBYTES() takes a character or binary input of up to 8,000 bytes (the limit was removed in SQL Server 2016). I suggest making sure none of the columns are NULL and concatenating all of the columns into a single delimited string.
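Against the illustrative table above, the change would look something like this; the columns are delimited so adjacent values cannot run together, and wrapped in ISNULL() so a single NULL cannot wipe out the whole hash:

-- Drop the CHECKSUM() based hash key (and its index) and replace it with
-- a 16 byte MD5 hash computed by HASHBYTES().
DROP INDEX ix_dim_reporting_levels_hash_key ON dbo.dim_reporting_levels;
ALTER TABLE dbo.dim_reporting_levels DROP COLUMN hash_key;
GO

-- Delimit the columns so adjacent values cannot run together, and wrap them
-- in ISNULL() so a single NULL cannot turn the whole hash into NULL.
ALTER TABLE dbo.dim_reporting_levels ADD hash_key AS
    CONVERT(BINARY(16),
        HASHBYTES('MD5',
            ISNULL(level_01, '') + '|' +
            ISNULL(level_02, '') + '|' +
            ISNULL(level_03, ''))) PERSISTED;
GO

CREATE NONCLUSTERED INDEX ix_dim_reporting_levels_hash_key
    ON dbo.dim_reporting_levels (hash_key);
GO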

In summary, a hash function can be used when multiple columns have to be compressed into a single key column. While the CHECKSUM() function is available in SQL Server, I would avoid it, since its output is not guaranteed to be unique.

A better choice is to design the computed column with the HASHBYTES() function.


5 Thoughts to “Checksum vs Hashbytes”

  1. Kaze

    Thank you for this tip. The HASHBYTES() function is a good way to compare records in two tables to detect needed updates: just compare the stored hash to the newly computed hash.

  2. Joey Moelands

    Hello,

    You need to add a “splitter” (delimiter) to your hashbyte calculation. For example, if you have two columns, age and amount:
    17 and 5000
    1 and 7500

    both will result in 175000 if you concatenate the columns, so your checksum hash is the same :-)

    Regards Joey.

  3. Joey Moelands

    Sorry, forgot a zero -> 1 and 7500 should be 1 and 75000 :) but I think my point is clear

  4. abc

    why are null columns an issue with hashbytes?

    1. Any value concatenated with NULL results in NULL.
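To illustrate the two points raised in the comments, a quick sketch: without a delimiter the combinations 17/5000 and 1/75000 both concatenate to the string 175000 and hash identically, and a NULL column makes the whole concatenation, and therefore the hash, NULL.

-- Without a delimiter both age/amount combinations collapse to '175000'
-- and produce identical hashes; with a '|' delimiter they stay distinct.
SELECT HASHBYTES('MD5', '17' + '5000')        AS no_delimiter_a,
       HASHBYTES('MD5', '1' + '75000')        AS no_delimiter_b,
       HASHBYTES('MD5', '17' + '|' + '5000')  AS with_delimiter_a,
       HASHBYTES('MD5', '1' + '|' + '75000')  AS with_delimiter_b;

-- Concatenating a NULL column makes the whole expression NULL,
-- so the hash is NULL as well; wrapping the column in ISNULL() avoids this.
DECLARE @amount VARCHAR(10) = NULL;
SELECT HASHBYTES('MD5', '17' + '|' + @amount)             AS hash_is_null,
       HASHBYTES('MD5', '17' + '|' + ISNULL(@amount, '')) AS hash_is_not_null;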
