Tonight, I want to revisit the [WILD LIFE] database that we were using last time. The main goal is to correctly search the [ANIMALS] table returning all [NAME]’s that contain the word fly. We want the query plan associated with our query to use a index to speed up overall execution time.
The first solution to this business problem resulted in FULL TABLE SCANS or CLUSTERED INDEX SCANS. Please see the desciptions of graphical icons that can show up in a query plan. Since every record in the table has to be parsed, this was not a good scalable solution.
The second solution lead us to believe that a full text index was going to save the day. The TSQL query used an INDEX SEEK to retrieve data; However, the results were not what was expected.
The third time is a charm. I suggest that we categorize the data during the load process. This can be done with a ETL tool or DML table TRIGGERS. I am going to option for the later since the amount of data we are talking about
is relatively samll.
Drop and recreate the [WILD LIFE] database from scratch. We are going to add a column called [CATERGORY] to the table. It can be defined as an INT or a CHAR. When using a integer, store matches as powers of 2. The bitwise or (&) function can be used to retrive any combination of words you are categorizing. For larger sets of values, use a character array. X’ and O’s and be used to mark the existence of a word. The new table is shown below.
1 2 3 4 5 6 7 8 9 10 11 12 |
-- Create the animals table CREATE TABLE ANIMALS ( ID INT NOT NULL IDENTITY (1, 1), NAME VARCHAR(200) NOT NULL, CATEGORY INT DEFAULT (0) ) GO -- Add index on name (clustered) CREATE CLUSTERED INDEX IDX_ANIMAL_CATEGORY ON DBO.ANIMALS(CATEGORY); GO |
A user defined FUNCTION is going to be designed to take a [NAME] as an input and return a [CATEGORY] value as a result. The FUNCTION will parse the [NAME] looking for 4 key animals.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 |
-- Create a function, map name -> code CREATE FUNCTION UFN_MY_CATEGORIES(@NAME VARCHAR(200)) RETURNS INT AS BEGIN -- Default value DECLARE @RETVAL INT = 0; -- 2 POWER 0 IF (LOWER(@NAME) LIKE '%fly%') SET @RETVAL = @RETVAL + 1; -- 2 POWER 1 IF (LOWER(@NAME) LIKE '%pig%') SET @RETVAL = @RETVAL + 2; -- 2 POWER 2 IF (LOWER(@NAME) LIKE '%bat%') SET @RETVAL = @RETVAL + 4; -- 2 POWER 3 IF (LOWER(@NAME) LIKE '%dog%') SET @RETVAL = @RETVAL + 8; -- RETURN THE CODE RETURN(@RETVAL); END; |
A user defined TRIGGER is going to be added to the [ANIMALS] table. When INSERTS or UPDATES happen to the table, the data will be categorized. Esentially, we are moving the text parsing from the QUERY execution to the DATA loading phase. This technique is only good if the data changes seldomly. Large record changes can make this choice worse than the first solution. Since animal names do not change that often, we have a good solution.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 |
-- Create a trigger to categorize name CREATE TRIGGER TRG_CATEGORIZE_ANIMALS ON ANIMALS AFTER INSERT, UPDATE AS BEGIN -- Detect inserts (1 .. n) IF EXISTS (SELECT * FROM inserted) AND NOT EXISTS(SELECT * FROM deleted) BEGIN -- Update records UPDATE ANIMALS SET CATEGORY = dbo.UFN_MY_CATEGORIES(I.NAME) FROM ANIMALS A JOIN INSERTED I ON A.ID = I.ID; -- Debugging PRINT 'INSERT DETECTED, CREATED CATEGORY'; END; -- Detect deletes (1 .. n) IF EXISTS (SELECT * FROM deleted) AND NOT EXISTS(SELECT * FROM inserted) BEGIN -- Debugging PRINT 'DELETE DETECTED, NOTHING TO DO'; END; -- Detected updates (1 .. n) IF EXISTS (SELECT * FROM inserted) AND EXISTS(SELECT * FROM deleted) BEGIN -- Update records UPDATE ANIMALS SET CATEGORY = dbo.UFN_MY_CATEGORIES(I.NAME) FROM ANIMALS A JOIN INSERTED I ON A.ID = I.ID; -- Debugging PRINT 'UPDATE DETECTED, ADJUST CATEGORY'; END; END GO |
Please run the rest of the full script to load the table with data.
I am going to search for all animals that contain the word fly. We are going to look at how the query analyzer creates different cost based execution plans.
For a given plan, we are going to look at the time in milliseconds (SET STATISTICS TIME) and the io counts (SET STATISTICS IO).
To get a good test, we should write dirty pages to disk (CHECKPOINT), drop clean buffers from memory (DBCC DROPCLEANBUFFERS) and free the procedure cache of any query plans (DBCC FREEPROCCAHE). These commands should be completed before executing the query.
The sql snippet below does just that.
1 2 3 4 5 6 7 8 9 10 |
-- Show time & i/o SET STATISTICS TIME ON SET STATISTICS IO ON GO -- Remove clean buffers & clear plan cache CHECKPOINT DBCC DROPCLEANBUFFERS DBCC FREEPROCCACHE GO |
Here is the query to select all records that have the word ‘fly’ in the name.
1 2 3 |
-- Select everything with word 'fly' SELECT * FROM dbo.ANIMALS WHERE NAME LIKE '%FLY%' GO |
Looking at the query results, query plan and io statistics, we have a CLUSTERED INDEX SEEK that returns 8 correct results with a 18 ms compile time and a 1 ms run time.
In summary, categorizing the data ahead of time resulted in the quickest solution yet. Next time, I want to revisit Full Text Indexing (FTI) to discuss how it works internally. This talk will go over all the system stored procedures you need to know when working with FTI’s.