I recently worked on a project I which I redesigned sales data warehouse as a STAR schema, using daily file partitions, with a automatic sliding window, and applying data compression at the page level. I ended up reducing a 5 terabyte database to less than 750 GB. I will be writing several articles on the lessons that I learned during the process.
Today, I want to talk about how surrogate keys can be used to tie the dimension tables to the fact tables. I need to start the talk off with some definitions grabbed from IBM’s designing and modeling data warehouse webpage.
A dimension table in a star or snowflake schema stores details about the facts. For example, a DATE dimension table stores the various aspects of time such as year, quarter, month, and day. A fact table in a star or snowflake schema stores business measures such as sales, cost of goods, or profit.
The concept of a STAR is obtained by placing the fact table in the middle of the ER Diagram, having the dimension on the outside, and the relationships are spokes in a wheel.
Let’s get back to the DATE dimension since this table appears in almost every data model. A 10 digit date formatted at YYYY-MM-DD is a natural key of the table. However, joining tables on a character field is slower than joining them on integers. That is where the speed of a STAR schema comes from.
Expanding the BASIC TRAINING database, I am going to use the following T-SQL snippet to create the DATE dimension and populate it with data for each day in November 2012.
A surrogate key can be created by defining an tiny integer column with a IDENTITY() property. I purposely used a TINYINT with a range of 0 to 255 to illustrate a point later on in the talk.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 |
<span style="color: #008000;font-size: small;">-- Use the correct database USE [BASIC]; GO -- Delete existing table IF OBJECT_ID(N'[TRAINING].[DIM_DATE]') > 0 DROP TABLE [TRAINING].[DIM_DATE] GO -- Create new table CREATE TABLE [TRAINING].[DIM_DATE] ( DATE_KEY TINYINT IDENTITY(1,1) NOT NULL CONSTRAINT PK_DIM_DATE PRIMARY KEY, DATE_DATE SMALLDATETIME NOT NULL, DATE_STRING VARCHAR(10) NOT NULL, DATE_MONTH VARCHAR(3) NOT NULL, DATE_DAY VARCHAR(3) NOT NULL, DATE_INT_QTR INT NOT NULL, DATE_INT_DOY INT NOT NULL, DATE_INT_YYYY INT NOT NULL, DATE_INT_MM INT NOT NULL, DATE_INT_DD INT NOT NULL ); GO -- Create and intialize loop variable DECLARE @VAR_DATE SMALLDATETIME; SELECT @VAR_DATE = '2012-11-01'; -- Add data to our new table WHILE (DATEDIFF(D, @VAR_DATE, '2012-12-01') <> 0) BEGIN -- Add row to dimension table INSERT INTO [TRAINING].[DIM_DATE] ( DATE_DATE, DATE_STRING, DATE_MONTH, DATE_DAY, DATE_INT_QTR, DATE_INT_DOY, DATE_INT_YYYY, DATE_INT_MM, DATE_INT_DD ) VALUES ( -- As small date time @VAR_DATE, -- Date as string SUBSTRING(CONVERT(CHAR(10), @VAR_DATE, 120) + REPLICATE(' ', 10), 1, 10), -- Month as string UPPER(SUBSTRING(CONVERT(CHAR(10), @VAR_DATE, 100) + REPLICATE(' ', 3), 1, 3)), -- Day as string UPPER(SUBSTRING(DATENAME(DW, @VAR_DATE), 1, 3)), -- As quarter of year DATEPART(QQ, @VAR_DATE), -- As day of year DATEPART(DY, @VAR_DATE), -- Year as integer YEAR(@VAR_DATE), -- Month as integer MONTH(@VAR_DATE), -- Day as integer DAY(@VAR_DATE) ); -- Increment the counter SELECT @VAR_DATE = DATEADD(D, 1, @VAR_DATE); END;</span> |
I used some interesting date and conversion functions to get the date variable into the correct format for the dimension. I will be exploring functions in a future BASIC TRAINING article.
From the image below, we can see that our table is populated with a increasing surrogate key.
Let’s make believe that the business line comes back and asks for a years worth of date entries Because we chose a TINYINT at the start, we would have to drop and recreate the table. Therefore, it is very important to choose a large enough integer to satisfy the business need. A typical INT has a positive range up to 2 billion +. This used to be a large number. But if you are counting x-mass light bulbs made in China, this range might be exceeded now a-days.
Try to stick with integers when defining surrogates keys. This will lend to speedy joins. On the other hand, a UNIQUEIDENTIFIER can be used. It is twice as large as an BIGINT data type (8 bytes). Use the NEWSEQUENTIALID() function to generate ascending keys so that you can use a clustered index, the default index for a primary key.
In summary, surrogates key can be used to join dimension tables to fact tables. Try to define the surrogate key using a integer data type that is large enough to hold current and future data.
You will be suprised on how fast reporting queries execute using this simple type of key.