I recently worked on a project I which I redesigned sales data warehouse as a STAR schema, using daily file partitions, with a automatic sliding window, and applying data compression at the page level. I ended up reducing a 5 terabyte database to less than 750 GB. I will be writing several articles on the lessons that I learned during the process.
Today, I want to talk about how surrogate keys can be used to tie the dimension tables to the fact tables. I need to start the talk off with some definitions grabbed from IBM’s designing and modeling data warehouse webpage.
A dimension table in a star or snowflake schema stores details about the facts. For example, a DATE dimension table stores the various aspects of time such as year, quarter, month, and day. A fact table in a star or snowflake schema stores business measures such as sales, cost of goods, or profit.
The concept of a STAR is obtained by placing the fact table in the middle of the ER Diagram, having the dimension on the outside, and the relationships are spokes in a wheel.
Let’s get back to the DATE dimension since this table appears in almost every data model. A 10 digit date formatted at YYYY-MM-DD is a natural key of the table. However, joining tables on a character field is slower than joining them on integers. That is where the speed of a STAR schema comes from.
Expanding the BASIC TRAINING database, I am going to use the following T-SQL snippet to create the DATE dimension and populate it with data for each day in November 2012.
A surrogate key can be created by defining an tiny integer column with a IDENTITY() property. I purposely used a TINYINT with a range of 0 to 255 to illustrate a point later on in the talk.
<span style="color: #008000;font-size: small;">-- Use the correct database
-- Delete existing table
IF OBJECT_ID(N'[TRAINING].[DIM_DATE]') > 0
DROP TABLE [TRAINING].[DIM_DATE]
-- Create new table
CREATE TABLE [TRAINING].[DIM_DATE]
DATE_KEY TINYINT IDENTITY(1,1) NOT NULL CONSTRAINT PK_DIM_DATE PRIMARY KEY,
DATE_DATE SMALLDATETIME NOT NULL,
DATE_STRING VARCHAR(10) NOT NULL,
DATE_MONTH VARCHAR(3) NOT NULL,
DATE_DAY VARCHAR(3) NOT NULL,
DATE_INT_QTR INT NOT NULL,
DATE_INT_DOY INT NOT NULL,
DATE_INT_YYYY INT NOT NULL,
DATE_INT_MM INT NOT NULL,
DATE_INT_DD INT NOT NULL
-- Create and intialize loop variable
DECLARE @VAR_DATE SMALLDATETIME;
SELECT @VAR_DATE = '2012-11-01';
-- Add data to our new table
WHILE (DATEDIFF(D, @VAR_DATE, '2012-12-01') <> 0)
-- Add row to dimension table
INSERT INTO [TRAINING].[DIM_DATE]
-- As small date time
-- Date as string
SUBSTRING(CONVERT(CHAR(10), @VAR_DATE, 120) + REPLICATE(' ', 10), 1, 10),
-- Month as string
UPPER(SUBSTRING(CONVERT(CHAR(10), @VAR_DATE, 100) + REPLICATE(' ', 3), 1, 3)),
-- Day as string
UPPER(SUBSTRING(DATENAME(DW, @VAR_DATE), 1, 3)),
-- As quarter of year
-- As day of year
-- Year as integer
-- Month as integer
-- Day as integer
-- Increment the counter
SELECT @VAR_DATE = DATEADD(D, 1, @VAR_DATE);
I used some interesting date and conversion functions to get the date variable into the correct format for the dimension. I will be exploring functions in a future BASIC TRAINING article.
From the image below, we can see that our table is populated with a increasing surrogate key.
Let’s make believe that the business line comes back and asks for a years worth of date entries Because we chose a TINYINT at the start, we would have to drop and recreate the table. Therefore, it is very important to choose a large enough integer to satisfy the business need. A typical INT has a positive range up to 2 billion +. This used to be a large number. But if you are counting x-mass light bulbs made in China, this range might be exceeded now a-days.
Try to stick with integers when defining surrogates keys. This will lend to speedy joins. On the other hand, a UNIQUEIDENTIFIER can be used. It is twice as large as an BIGINT data type (8 bytes). Use the NEWSEQUENTIALID() function to generate ascending keys so that you can use a clustered index, the default index for a primary key.
In summary, surrogates key can be used to join dimension tables to fact tables. Try to define the surrogate key using a integer data type that is large enough to hold current and future data.
You will be suprised on how fast reporting queries execute using this simple type of key.