# Surrogate Keys

I recently worked on a project I which I redesigned sales data warehouse as a STAR schema, using daily file partitions, with a automatic sliding window, and applying data compression at the page level. I ended up reducing a 5 terabyte database to less than 750 GB. I will be writing several articles on the lessons that I learned during the process.

Today, I want to talk about how surrogate keys can be used to tie the dimension tables to the fact tables. I need to start the talk off with some definitions grabbed from IBM’s designing and modeling data warehouse webpage.

A dimension table in a star or snowflake schema stores details about the facts. For example, a DATE dimension table stores the various aspects of time such as year, quarter, month, and day. A fact table in a star or snowflake schema stores business measures such as sales, cost of goods, or profit.

The concept of a STAR is obtained by placing the fact table in the middle of the ER Diagram, having the dimension on the outside, and the relationships are spokes in a wheel.

Let’s get back to the DATE dimension since this table appears in almost every data model. A 10 digit date formatted at YYYY-MM-DD is a natural key of the table. However, joining tables on a character field is slower than joining them on integers. That is where the speed of a STAR schema comes from.

Expanding the BASIC TRAINING database, I am going to use the following T-SQL snippet to create the DATE dimension and populate it with data for each day in November 2012.

A surrogate key can be created by defining an tiny integer column with a IDENTITY() property. I purposely used a TINYINT with a range of 0 to 255 to illustrate a point later on in the talk.

I used some interesting date and conversion functions to get the date variable into the correct format for the dimension. I will be exploring functions in a future BASIC TRAINING article.

From the image below, we can see that our table is populated with a increasing surrogate key.

Let’s make believe that the business line comes back and asks for a years worth of date entries Because we chose a TINYINT at the start, we would have to drop and recreate the table. Therefore, it is very important to choose a large enough integer to satisfy the business need. A typical INT has a positive range up to 2 billion +. This used to be a large number. But if you are counting x-mass light bulbs made in China, this range might be exceeded now a-days.

Try to stick with integers when defining surrogates keys. This will lend to speedy joins. On the other hand, a UNIQUEIDENTIFIER can be used. It is twice as large as an BIGINT data type (8 bytes). Use the NEWSEQUENTIALID() function to generate ascending keys so that you can use a clustered index, the default index for a primary key.

In summary, surrogates key can be used to join dimension tables to fact tables. Try to define the surrogate key using a integer data type that is large enough to hold current and future data.
You will be suprised on how fast reporting queries execute using this simple type of key.