I am continuing my series of talks on fundamental topics like data types. I am proud to be a United States Army Reservist (USAR) Veteran.
Just like boot camp I went to so long ago, I am going nick name the series BASIC TRAINING.
The most basic unit of any database is a TABLE which consists of COLUMNS. A vital decision during the initial database design is to choose the data types that will capture the information you want in the least amount of space.
Today, I am exploring character strings which can be fixed or variable length in size.
I am extenting the sample database named [BASIC] that contains a sample schema named [TRAINING]. The snippet below creates a sample table named [CHAR_STRINGS] that contains one or more fields for each data type.
Books online describes two data types that are categorized as character strings: CHAR(N) is defined as fixed length of N and VARCHAR(N) is defined as variable length from zero to N. The overhead for VARCHAR is 2 bytes.
In SQL Server versions prior to 2005, N was defined from 0 to 8000 so that the record could fit on a single data page. The MAX key word was introduced in SQL Server 2005 in preparation for removing the TEXT data type in the future. Both VARCHAR(MAX) and TEXT can hold textual data
up to 2 GB in size.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
<span style="color: #008000; font-size: small;">-- -- Create test tables (character strings) -- -- Delete existing table IF EXISTS ( SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[TRAINING].[CHAR_STRINGS]') AND type in (N'U')) DROP TABLE [TRAINING].[CHAR_STRINGS] GO -- Create new table CREATE TABLE [TRAINING].[CHAR_STRINGS] ( CS1 CHAR(8), -- 8 BYTES CS2 VARCHAR(8) -- DATA SIZE (1..8) + 2 BYTES OVERHEAD ); GO </span> |
We should test the newly crafted table by inserting data. The first test below will work since all columns are less than or eqaul to eight characters. The second test will fail since truncation will occur due to constant being greater than the field size.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
<span style="color: #008000; font-size: small;">-- Test data set 1 INSERT INTO [TRAINING].[CHAR_STRINGS] VALUES ( '12345678', 'FOUR' ); GO -- Test data set 2 INSERT INTO [TRAINING].[CHAR_STRINGS] VALUES ( '00001111', 'MY TRUNCATED STRING' ); GO </span> |
We can tell the database engine to ignore the truncation warning by using the SET command.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
<span style="color: #008000; font-size: small;">-- Allow truncation to occur SET ANSI_WARNINGS OFF; -- Test data set 2 INSERT INTO [TRAINING].[CHAR_STRINGS] VALUES ( '00001111', 'MY TRUNCATED STRING' ); GO -- Return the data from the table SELECT * FROM [TRAINING].[CHAR_STRINGS]; GO </span> |
I am going to alter the table definition to experiment with TEXT data type. By default, data for TEXT is stored out of row in seperate data pages.
1 2 3 4 5 6 7 8 9 10 |
<span style="color: #008000; font-size: small;">-- Add a text type to table ALTER TABLE [TRAINING].[CHAR_STRINGS] ADD CS3 TEXT; GO -- Update first row with data UPDATE [TRAINING].[CHAR_STRINGS] SET CS3 = 'OUT OF ROW STORAGE' WHERE CS1 = '12345678'; GO </span> |
The sp_tableoption stored procedure can be used to force the storage of TEXT data in row. The default for this setting is to store any string that is 256 bytes or less in row. Again, this stored procedure is slated to be discontinued in the future. It is suggested that you use VARCHAR(MAX) instead.
1 2 3 4 5 6 7 8 |
<span style="color: #008000; font-size: small;">-- By default, 256 bytes are stored in row when on EXEC sp_tableoption 'TRAINING.CHAR_STRINGS', 'text in row', 'ON'; -- Update first row with data UPDATE [TRAINING].[CHAR_STRINGS] SET CS3 = 'IN ROW STORAGE' WHERE CS1 = '00001111'; </span> |
As a database designer, you should always question the components that make up your database.
One question that you might have is ‘What is the maximum number of bytes that a row can have?’.
This is important because data is stored in the *.MDF or *.NDF files as pages ~ 8k.
Since a page can only save 8060 bytes, you can figure out how many records can fit on a page and how many bytes are wasted space.
The following code uses the dynamic management view, sys.dm_db_index_physical_stats, to count the number of fields and calculate the maximum row size.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
<span style="color: #008000; font-size: small;"> -- Valid for SQL Server 2005 + SELECT s.database_id, s.object_id, s.index_type_desc, s.page_count, s.record_count, s.min_record_size_in_bytes, s.max_record_size_in_bytes FROM sys.dm_db_index_physical_stats (DB_ID(N'BASIC'), OBJECT_ID(N'TRAINING.CHAR_STRINGS'), NULL, NULL , 'DETAILED') AS s; GO </span> |
The sp_help stored procedure displays the details of the table. This includes many different settings that can be choosen as a DDL designer such as computed column, field length, nullabilty, and collation to mention a few.
1 2 3 |
<span style="color: #008000; font-size: small;">-- Display size details of table EXEC sp_help 'TRAINING.CHAR_STRINGS'; </span> |
You might have noticed that the index physical statistics DMV displayed two heap pages and three total records. How can we have a total of three records when we did two inserts and two updates? The answer is the out of row storage for BLOBS.
The undocumented DBCC IND command will show the pages being used by the table or HEAP.
1 2 3 |
<span style="color: #008000; font-size: small;">-- Show pages of table DBCC IND('BASIC', 'TRAINING.CHAR_STRINGS', -1); </span> |
The DBCC TRACEON flag 3604 will dump pages to the output window. The DBCC page command will dump the contents of each page. The DBCC TRACEOFF flag 3604 will restore the flag to its orginal setting.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
<span style="color: #008000; font-size: small;">-- Turn on output DBCC TRACEON(3604) GO -- Dump data page (2 records) DBCC page('BASIC', 1, 156, 3); GO -- Dump page (out of row data) DBCC page('BASIC', 1, 158, 3); GO -- Turn off output DBCC TRACEOFF(3604) GO </span> |
The in row data page entry shows the actual text.
The out of row data page entry shows a text pointer to the BLOB.
In summary, use the VARCHAR data type when length of the data is double digits or more in size. While there is a two byte overhead for the datatype, the odds are the empty space reclaimed by the table will make up a total positive savings. Remember, the TEXT data type is on the depreciated list. Use the VARCHAR(MAX) instead.
Next time, I will be reviewing unicode character strings.