Surrogate Keys

I recently worked on a project I which I redesigned sales data warehouse as a STAR schema, using daily file partitions, with a automatic sliding window, and applying data compression at the page level. I ended up reducing a 5 terabyte database to less than 750 GB. I will be writing several articles on the lessons that I learned during the process.

Today, I want to talk about how surrogate keys can be used to tie the dimension tables to the fact tables. I need to start the talk off with some definitions grabbed from IBM’s designing and modeling data warehouse webpage.

A dimension table in a star or snowflake schema stores details about the facts. For example, a DATE dimension table stores the various aspects of time such as year, quarter, month, and day. A fact table in a star or snowflake schema stores business measures such as sales, cost of goods, or profit.

The concept of a STAR is obtained by placing the fact table in the middle of the ER Diagram, having the dimension on the outside, and the relationships are spokes in a wheel.

Let’s get back to the DATE dimension since this table appears in almost every data model. A 10 digit date formatted at YYYY-MM-DD is a natural key of the table. However, joining tables on a character field is slower than joining them on integers. That is where the speed of a STAR schema comes from.

Expanding the BASIC TRAINING database, I am going to use the following T-SQL snippet to create the DATE dimension and populate it with data for each day in November 2012.

A surrogate key can be created by defining an tiny integer column with a IDENTITY() property. I purposely used a TINYINT with a range of 0 to 255 to illustrate a point later on in the talk.

I used some interesting date and conversion functions to get the date variable into the correct format for the dimension. I will be exploring functions in a future BASIC TRAINING article.

From the image below, we can see that our table is populated with a increasing surrogate key.

 

Let’s make believe that the business line comes back and asks for a years worth of date entries Because we chose a TINYINT at the start, we would have to drop and recreate the table. Therefore, it is very important to choose a large enough integer to satisfy the business need. A typical INT has a positive range up to 2 billion +. This used to be a large number. But if you are counting x-mass light bulbs made in China, this range might be exceeded now a-days.

Try to stick with integers when defining surrogates keys. This will lend to speedy joins. On the other hand, a UNIQUEIDENTIFIER can be used. It is twice as large as an BIGINT data type (8 bytes). Use the NEWSEQUENTIALID() function to generate ascending keys so that you can use a clustered index, the default index for a primary key.

In summary, surrogates key can be used to join dimension tables to fact tables. Try to define the surrogate key using a integer data type that is large enough to hold current and future data.
You will be suprised on how fast reporting queries execute using this simple type of key.

Related posts

Leave a Comment