{"id":3005,"date":"2012-11-17T02:36:43","date_gmt":"2012-11-17T02:36:43","guid":{"rendered":"http:\/\/craftydba.com\/?p=3005"},"modified":"2017-10-11T19:32:30","modified_gmt":"2017-10-11T19:32:30","slug":"checksum-vs-hashbytes","status":"publish","type":"post","link":"https:\/\/craftydba.com\/?p=3005","title":{"rendered":"Checksum vs Hashbytes"},"content":{"rendered":"<p><a href=\"https:\/\/craftydba.com\/wp-content\/uploads\/2012\/11\/golden-keys.jpg\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2012\/11\/golden-keys-150x150.jpg\" alt=\"\" title=\"golden-keys\" width=\"150\" height=\"150\" class=\"alignleft size-thumbnail wp-image-2978\" srcset=\"https:\/\/craftydba.com\/wp-content\/uploads\/2012\/11\/golden-keys-150x150.jpg 150w, https:\/\/craftydba.com\/wp-content\/uploads\/2012\/11\/golden-keys.jpg 192w\" sizes=\"auto, (max-width: 150px) 100vw, 150px\" \/><\/a><\/p>\n<p>I recently worked on a project in which I redesigned a sales data warehouse as a <a href=\"http:\/\/en.wikipedia.org\/wiki\/Star_schema\">STAR schema<\/a>, using daily file partitions with an automatic sliding window, and applying data compression at the page level. I ended up reducing a 5 terabyte database to less than 750 GB. I will be writing several articles on the lessons that I learned during the process.<\/p>\n<p>Today, I want to talk about how to generate a <a href=\"http:\/\/en.wikipedia.org\/wiki\/Hash_function\">hash key<\/a> by using two built-in SQL Server functions.  A hash function is any algorithm that maps a large data set of variable-length keys to a smaller data set of fixed-length keys.  <\/p>\n<p>One of the business requirements in the data warehouse was to have 15 different reporting levels.  Each unique combination represents one reporting level.  The <a href=\"http:\/\/msdn.microsoft.com\/en-us\/library\/ms143432.aspx\">maximum size<\/a> of an index in SQL Server is 16 columns and 900 bytes.  
Adding an index is not feasible since the combined size of all columns can easily exceed this value. <\/p>\n<p>The prior BI developer was joining on all 15 columns in the SSIS package.  The execution of the SSIS package results in a <a href=\"http:\/\/en.wikipedia.org\/wiki\/Full_table_scan\">full table scan<\/a> when joining the source data to the reporting level dimension in an attempt to generate a <a href=\"http:\/\/en.wikipedia.org\/wiki\/Surrogate_key\">surrogate key<\/a>.  This can be a major performance issue on large tables.  How do we speed up the join?<\/p>\n<p>The solution to this join problem is to use a <a href=\"http:\/\/en.wikipedia.org\/wiki\/Hash_function\">hash key<\/a>.  This should allow the query optimizer to choose an Index Seek for the join.  Basically, we apply the hash function to the 15 columns to come up with a unique number or binary string.  This hash key will be indexed and used as the <a href=\"http:\/\/en.wikipedia.org\/wiki\/Natural_key\">natural key<\/a> in the reporting levels dimension table.<\/p>\n<p>Expanding the BASIC TRAINING database, I am going to use the following T-SQL snippet to create a reporting levels dimension table.  
I am going to create the hash key as a computed column using the <a href=\"http:\/\/msdn.microsoft.com\/en-us\/library\/ms189788.aspx\">CHECKSUM<\/a>() function.<\/p>\n<pre class=\"lang:TSQL theme:familiar mark:1,2-3\" title=\"basic database schema - checksum() surrogate key\">\r\n-- Use the correct database\r\nUSE [BASIC];\r\nGO\r\n\r\n-- Delete existing table\r\nIF  OBJECT_ID(N'[TRAINING].[DIM_REPORTING_LEVELS]') > 0\r\n    DROP TABLE [TRAINING].[DIM_REPORTING_LEVELS]\r\nGO\r\n\r\n-- Create new table\r\nCREATE TABLE [TRAINING].[DIM_REPORTING_LEVELS]\r\n(\r\n  RPTLVL_KEY INT IDENTITY(1,1) NOT NULL CONSTRAINT PK_DIM_REPORTING_LEVELS PRIMARY KEY,\r\n  RPTLVL_PRODUCT_ID VARCHAR(240) NOT NULL,\r\n  RPTLVL_CURRENCY VARCHAR(240) NOT NULL,\r\n  RPTLVL_ORGANIZATION VARCHAR(240) NOT NULL,\r\n  RPTLVL_PRODUCT_FAMILY VARCHAR(240) NOT NULL\r\n);\r\nGO\r\n\r\n-- Record one\r\nINSERT INTO [TRAINING].[DIM_REPORTING_LEVELS]\r\n(\r\nRPTLVL_PRODUCT_ID,\r\nRPTLVL_CURRENCY,\r\nRPTLVL_ORGANIZATION,\r\nRPTLVL_PRODUCT_FAMILY\r\n)\r\nVALUES ('34HM118-1', 'USD', 'FTW', '34HM');\r\n\r\n-- Record two\r\nINSERT INTO [TRAINING].[DIM_REPORTING_LEVELS]\r\n(\r\nRPTLVL_PRODUCT_ID,\r\nRPTLVL_CURRENCY,\r\nRPTLVL_ORGANIZATION,\r\nRPTLVL_PRODUCT_FAMILY\r\n)\r\nVALUES ('35HM118-1', 'USD', 'FTW', '35HM');\r\n\r\n-- Add a hash key using checksum()\r\nALTER TABLE [TRAINING].[DIM_REPORTING_LEVELS] ADD RPTLVL_HASH_KEY \r\nAS CHECKSUM(RPTLVL_PRODUCT_ID, RPTLVL_CURRENCY, RPTLVL_ORGANIZATION, RPTLVL_PRODUCT_FAMILY);\r\n<\/pre>\n<\/p>\n<p>If you read Books Online closely, you will note that the <a href=\"http:\/\/msdn.microsoft.com\/en-us\/library\/ms189788.aspx\">CHECKSUM<\/a>() function does not guarantee uniqueness.  The function takes a bunch of columns as an input and returns one integer as an output.<\/p>\n<p>The internal algorithm is not documented; the real problem is that the size of the output (4 bytes) limits the number of possible outputs.  
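<\/p>\n<p>It is easy to check a computed hash key for collisions by grouping on it.  As a quick sanity check, the following query against the table above returns any hash value that is shared by more than one reporting level:<\/p>\n<pre class=\"lang:TSQL theme:familiar\" title=\"find duplicate checksum() keys\">\r\n-- Find hash key values shared by more than one row\r\nSELECT RPTLVL_HASH_KEY, COUNT(*) AS ROW_CNT\r\nFROM [TRAINING].[DIM_REPORTING_LEVELS]\r\nGROUP BY RPTLVL_HASH_KEY\r\nHAVING COUNT(*) > 1;\r\n<\/pre>\n<p>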
I initially used this function in the data warehouse and found over 300 duplicates in 160,000 levels.  The above rows generate the same hash key.<\/p>\n<pre class=\"lang:TSQL theme:familiar mark:1,2-3\" title=\"duplicate keys with checksum()\">\r\n-- Show duplicate data\r\nSELECT * FROM [TRAINING].[DIM_REPORTING_LEVELS];\r\n<\/pre>\n<\/p>\n<p><a href=\"https:\/\/craftydba.com\/wp-content\/uploads\/2012\/11\/check-sum-key-has-duplicates-for-unique-rows.jpg\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2012\/11\/check-sum-key-has-duplicates-for-unique-rows.jpg\" alt=\"\" title=\"check-sum-key-has-duplicates-for-unique-rows\" width=\"827\" height=\"82\" class=\"aligncenter size-full wp-image-3022\" srcset=\"https:\/\/craftydba.com\/wp-content\/uploads\/2012\/11\/check-sum-key-has-duplicates-for-unique-rows.jpg 827w, https:\/\/craftydba.com\/wp-content\/uploads\/2012\/11\/check-sum-key-has-duplicates-for-unique-rows-300x29.jpg 300w\" sizes=\"auto, (max-width: 827px) 100vw, 827px\" \/><\/a><\/p>\n<p>On the other hand, the <a href=\"http:\/\/msdn.microsoft.com\/en-us\/library\/ms174415.aspx\">HASHBYTES<\/a>() function using MD5 is far less likely to collide since it generates a 16-byte binary output.  The function can generate hash keys using seven different algorithms with output ranging in size from 16 to 64 bytes.<\/p>\n<p>The code below drops the hash key column and recomputes it using the <a href=\"http:\/\/msdn.microsoft.com\/en-us\/library\/ms174415.aspx\">HASHBYTES<\/a>() function.  It takes an input of characters or bytes up to 8K in size.  
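<\/p>\n<p>Since the input is capped at 8K, it is worth verifying that the concatenated columns stay under the limit.  Here is a quick check against the table above; with four VARCHAR(240) columns the input can never exceed 960 bytes, so we are safe:<\/p>\n<pre class=\"lang:TSQL theme:familiar\" title=\"check concatenated input size for hashbytes()\">\r\n-- Largest concatenated input that hashbytes() will receive\r\nSELECT MAX(DATALENGTH(RPTLVL_PRODUCT_ID + RPTLVL_CURRENCY +\r\n       RPTLVL_ORGANIZATION + RPTLVL_PRODUCT_FAMILY)) AS MAX_INPUT_BYTES\r\nFROM [TRAINING].[DIM_REPORTING_LEVELS];\r\n<\/pre>\n<p>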
I suggest making sure the columns are NOT NULL and concatenating all of the columns into one string, since a NULL in any column would make the entire concatenation NULL.<\/p>\n<pre class=\"lang:TSQL theme:familiar mark:1,2-3\" title=\"basic database schema - hashbytes() surrogate key\">\r\n-- Drop the column\r\nALTER TABLE [TRAINING].[DIM_REPORTING_LEVELS] DROP COLUMN RPTLVL_HASH_KEY;\r\n\r\n-- Add a hash key using hashbytes()\r\nALTER TABLE [TRAINING].[DIM_REPORTING_LEVELS] ADD RPTLVL_HASH_KEY \r\nAS HASHBYTES('MD5', RPTLVL_PRODUCT_ID + RPTLVL_CURRENCY + RPTLVL_ORGANIZATION + RPTLVL_PRODUCT_FAMILY);\r\n\r\n-- Index the column for speed\r\nCREATE INDEX IX_DIM_REPORTING_LEVELS ON [TRAINING].[DIM_REPORTING_LEVELS] (RPTLVL_HASH_KEY);\r\n\r\n-- Show unique data\r\nSELECT * FROM [TRAINING].[DIM_REPORTING_LEVELS];\r\n<\/pre>\n<\/p>\n<p><a href=\"https:\/\/craftydba.com\/wp-content\/uploads\/2012\/11\/hash-bytes-key-is-unique-for-unique-rows.jpg\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/craftydba.com\/wp-content\/uploads\/2012\/11\/hash-bytes-key-is-unique-for-unique-rows.jpg\" alt=\"\" title=\"hash-bytes-key-is-unique-for-unique-rows\" width=\"938\" height=\"86\" class=\"aligncenter size-full wp-image-3021\" srcset=\"https:\/\/craftydba.com\/wp-content\/uploads\/2012\/11\/hash-bytes-key-is-unique-for-unique-rows.jpg 938w, https:\/\/craftydba.com\/wp-content\/uploads\/2012\/11\/hash-bytes-key-is-unique-for-unique-rows-300x27.jpg 300w\" sizes=\"auto, (max-width: 938px) 100vw, 938px\" \/><\/a><\/p>\n<p>In summary, a hash function can be used when multiple columns have to be compressed into one unique column.  While the <a href=\"http:\/\/msdn.microsoft.com\/en-us\/library\/ms189788.aspx\">CHECKSUM<\/a>() function is available in SQL Server, I would avoid it since it is not guaranteed to be unique.  
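<\/p>\n<p>The duplicates I found are not bad luck; they are expected from the birthday problem.  As a rough estimate, with a 4 byte (32 bit) hash the odds of at least one collision reach 50% at about the square root of 2 * 2^32 * ln(2) random keys:<\/p>\n<pre class=\"lang:TSQL theme:familiar\" title=\"birthday problem estimate for a 32 bit hash\">\r\n-- LOG() in T-SQL is the natural logarithm\r\n-- Result is roughly 77,000 keys, far below the 160,000 levels above\r\nSELECT SQRT(2.0 * POWER(2.0, 32) * LOG(2.0)) AS KEYS_AT_50_PCT_COLLISION;\r\n<\/pre>\n<p>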
<\/p>\n<p>A better choice is to design the computed column with the <a href=\"http:\/\/msdn.microsoft.com\/en-us\/library\/ms174415.aspx\">HASHBYTES<\/a>() function.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I recently worked on a project in which I redesigned a sales data warehouse as a STAR schema, using daily file partitions with an automatic sliding window, and applying data compression at the page level. I ended up reducing a 5 terabyte database to less than 750 GB. I will be writing several articles on the lessons that I learned during the process. Today, I want to talk about how to generate a hash key by using two built-in SQL Server functions. A hash function is any algorithm that maps&hellip;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4],"tags":[618,620,602,12,621,619,617,616,622,15,28,603,29],"class_list":["post-3005","post","type-post","status-publish","format-standard","hentry","category-db-admin","tag-checksum","tag-computed-column","tag-dba","tag-free-code","tag-full-table-scran","tag-hash-bytes","tag-hash-function","tag-hash-key","tag-index-seek","tag-john-f-miner-iii","tag-sql-server","tag-surrogate-key","tag-tsql"],"_links":{"self":[{"href":"https:\/\/craftydba.com\/index.php?rest_route=\/wp\/v2\/posts\/3005","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/craftydba.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/craftydba.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/craftydba.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/craftydba.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=3005"}],"version-history":[{"count":0,"href":"https:\/\/craftydba.com\/index.php?rest_route=\/wp\/v2\/posts\/3005\/revisions"}],"wp:attachment":[{"href":"https:\/\/craftydba
.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=3005"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/craftydba.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=3005"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/craftydba.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=3005"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}