Surrogate Keys

I recently worked on a project I which I redesigned sales data warehouse as a STAR schema, using daily file partitions, with a automatic sliding window, and applying data compression at the page level. I ended up reducing a 5 terabyte database to less than 750 GB. I will be writing several articles on the lessons that I learned during the process.

Today, I want to talk about how surrogate keys can be used to tie the dimension tables to the fact tables. I need to start the talk off with some definitions grabbed from IBM’s designing and modeling data warehouse webpage.

A dimension table in a star or snowflake schema stores details about the facts. For example, a DATE dimension table stores the various aspects of time such as year, quarter, month, and day. A fact table in a star or snowflake schema stores business measures such as sales, cost of goods, or profit.

The concept of a STAR is obtained by placing the fact table in the middle of the ER Diagram, having the dimension on the outside, and the relationships are spokes in a wheel.

Let’s get back to the DATE dimension since this table appears in almost every data model. A 10 digit date formatted at YYYY-MM-DD is a natural key of the table. However, joining tables on a character field is slower than joining them on integers. That is where the speed of a STAR schema comes from.

Expanding the BASIC TRAINING database, I am going to use the following T-SQL snippet to create the DATE dimension and populate it with data for each day in November 2012.

A surrogate key can be created by defining an tiny integer column with a IDENTITY() property. I purposely used a TINYINT with a range of 0 to 255 to illustrate a point later on in the talk.

<span style="color: #008000;font-size: small;">-- Use the correct database
USE [BASIC];
GO

-- Delete existing table
IF  OBJECT_ID(N'[TRAINING].[DIM_DATE]') > 0
    DROP TABLE [TRAINING].[DIM_DATE]
GO

-- Create new table
CREATE TABLE [TRAINING].[DIM_DATE]
(
  DATE_KEY TINYINT IDENTITY(1,1) NOT NULL CONSTRAINT PK_DIM_DATE PRIMARY KEY,
  DATE_DATE SMALLDATETIME NOT NULL,
  DATE_STRING VARCHAR(10) NOT NULL,
  DATE_MONTH VARCHAR(3) NOT NULL,
  DATE_DAY VARCHAR(3) NOT NULL,
  DATE_INT_QTR INT NOT NULL,
  DATE_INT_DOY INT NOT NULL,
  DATE_INT_YYYY INT NOT NULL,
  DATE_INT_MM INT NOT NULL,
  DATE_INT_DD INT NOT NULL
);
GO

-- Create and intialize loop variable
DECLARE @VAR_DATE SMALLDATETIME;
SELECT @VAR_DATE = '2012-11-01';

-- Add data to our new table
WHILE (DATEDIFF(D, @VAR_DATE, '2012-12-01') <> 0)
BEGIN

    -- Add row to dimension table
    INSERT INTO [TRAINING].[DIM_DATE]
    (
        DATE_DATE,
        DATE_STRING,
        DATE_MONTH,
        DATE_DAY,
        DATE_INT_QTR,
        DATE_INT_DOY,
        DATE_INT_YYYY,
        DATE_INT_MM,
        DATE_INT_DD
    )
    VALUES
	(
	   -- As small date time
	   @VAR_DATE,

	   -- Date as string
	   SUBSTRING(CONVERT(CHAR(10), @VAR_DATE, 120) + REPLICATE(' ', 10), 1, 10), 

	   -- Month as string
	   UPPER(SUBSTRING(CONVERT(CHAR(10), @VAR_DATE, 100) + REPLICATE(' ', 3), 1, 3)),

	   -- Day as string
	   UPPER(SUBSTRING(DATENAME(DW, @VAR_DATE), 1, 3)),

	   -- As quarter of year
	   DATEPART(QQ, @VAR_DATE),

	   -- As day of year
	   DATEPART(DY, @VAR_DATE),

	   -- Year as integer
	   YEAR(@VAR_DATE),

	   -- Month as integer
	   MONTH(@VAR_DATE),

	   -- Day as integer
	   DAY(@VAR_DATE)

    );

    -- Increment the counter
    SELECT @VAR_DATE = DATEADD(D, 1, @VAR_DATE);
END;</span>

<span style="color: #008000;font-size: small;">-- Use the correct database

USE [BASIC];

-- Delete existing table

IF OBJECT_ID(N'[TRAINING].[DIM_DATE]') > 0

DROP TABLE [TRAINING].[DIM_DATE]

-- Create new table

CREATE TABLE [TRAINING].[DIM_DATE]

(

DATE_KEY TINYINT IDENTITY(1,1) NOT NULL CONSTRAINT PK_DIM_DATE PRIMARY KEY,

DATE_DATE SMALLDATETIME NOT NULL,

DATE_STRING VARCHAR(10) NOT NULL,

DATE_MONTH VARCHAR(3) NOT NULL,

DATE_DAY VARCHAR(3) NOT NULL,

DATE_INT_QTR INT NOT NULL,

DATE_INT_DOY INT NOT NULL,

DATE_INT_YYYY INT NOT NULL,

DATE_INT_MM INT NOT NULL,

DATE_INT_DD INT NOT NULL

);

-- Create and intialize loop variable

DECLARE @VAR_DATE SMALLDATETIME;

SELECT @VAR_DATE = '2012-11-01';

-- Add data to our new table

WHILE (DATEDIFF(D, @VAR_DATE, '2012-12-01') <> 0)

BEGIN

-- Add row to dimension table

INSERT INTO [TRAINING].[DIM_DATE]

(

DATE_DATE,

DATE_STRING,

DATE_MONTH,

DATE_DAY,

DATE_INT_QTR,

DATE_INT_DOY,

DATE_INT_YYYY,

DATE_INT_MM,

DATE_INT_DD

)

VALUES

(

-- As small date time

@VAR_DATE,

-- Date as string

SUBSTRING(CONVERT(CHAR(10), @VAR_DATE, 120) + REPLICATE(' ', 10), 1, 10),

-- Month as string

UPPER(SUBSTRING(CONVERT(CHAR(10), @VAR_DATE, 100) + REPLICATE(' ', 3), 1, 3)),

-- Day as string

UPPER(SUBSTRING(DATENAME(DW, @VAR_DATE), 1, 3)),

-- As quarter of year

DATEPART(QQ, @VAR_DATE),

-- As day of year

DATEPART(DY, @VAR_DATE),

-- Year as integer

YEAR(@VAR_DATE),

-- Month as integer

MONTH(@VAR_DATE),

-- Day as integer

DAY(@VAR_DATE)

);

-- Increment the counter

SELECT @VAR_DATE = DATEADD(D, 1, @VAR_DATE);

END;</span>

I used some interesting date and conversion functions to get the date variable into the correct format for the dimension. I will be exploring functions in a future BASIC TRAINING article.

From the image below, we can see that our table is populated with a increasing surrogate key.

Let’s make believe that the business line comes back and asks for a years worth of date entries Because we chose a TINYINT at the start, we would have to drop and recreate the table. Therefore, it is very important to choose a large enough integer to satisfy the business need. A typical INT has a positive range up to 2 billion +. This used to be a large number. But if you are counting x-mass light bulbs made in China, this range might be exceeded now a-days.

Try to stick with integers when defining surrogates keys. This will lend to speedy joins. On the other hand, a UNIQUEIDENTIFIER can be used. It is twice as large as an BIGINT data type (8 bytes). Use the NEWSEQUENTIALID() function to generate ascending keys so that you can use a clustered index, the default index for a primary key.

In summary, surrogates key can be used to join dimension tables to fact tables. Try to define the surrogate key using a integer data type that is large enough to hold current and future data.
You will be suprised on how fast reporting queries execute using this simple type of key.

Related posts

Leave a Comment Cancel reply

Related posts

Staying Connected

Transferring Logins Between Servers

Page Anatomy – Part 2

Leave a Comment Cancel reply