Full Text Search – Part 3

Tonight, I want to revisit the [WILD LIFE] database that we were using last time. The main goal is to correctly search the [ANIMALS] table returning all [NAME]’s that contain the word fly. We want the query plan associated with our query to use a index to speed up overall execution time.

The first solution to this business problem resulted in FULL TABLE SCANS or CLUSTERED INDEX SCANS. Please see the desciptions of graphical icons that can show up in a query plan. Since every record in the table has to be parsed, this was not a good scalable solution.

The second solution lead us to believe that a full text index was going to save the day. The TSQL query used an INDEX SEEK to retrieve data; However, the results were not what was expected.

The third time is a charm. I suggest that we categorize the data during the load process. This can be done with a ETL tool or DML table TRIGGERS. I am going to option for the later since the amount of data we are talking about
is relatively samll.

Drop and recreate the [WILD LIFE] database from scratch. We are going to add a column called [CATERGORY] to the table. It can be defined as an INT or a CHAR. When using a integer, store matches as powers of 2. The bitwise or (&) function can be used to retrive any combination of words you are categorizing. For larger sets of values, use a character array. X’ and O’s and be used to mark the existence of a word. The new table is shown below.

A user defined FUNCTION is going to be designed to take a [NAME] as an input and return a [CATEGORY] value as a result. The FUNCTION will parse the [NAME] looking for 4 key animals.

A user defined TRIGGER is going to be added to the [ANIMALS] table. When INSERTS or UPDATES happen to the table, the data will be categorized. Esentially, we are moving the text parsing from the QUERY execution to the DATA loading phase. This technique is only good if the data changes seldomly. Large record changes can make this choice worse than the first solution. Since animal names do not change that often, we have a good solution.

Please run the rest of the full script to load the table with data.

I am going to search for all animals that contain the word fly. We are going to look at how the query analyzer creates different cost based execution plans.

For a given plan, we are going to look at the time in milliseconds (SET STATISTICS TIME) and the io counts (SET STATISTICS IO).

To get a good test, we should write dirty pages to disk (CHECKPOINT), drop clean buffers from memory (DBCC DROPCLEANBUFFERS) and free the procedure cache of any query plans (DBCC FREEPROCCAHE). These commands should be completed before executing the query.

The sql snippet below does just that.

Here is the query to select all records that have the word ‘fly’ in the name.

Looking at the query results, query plan and io statistics, we have a CLUSTERED INDEX SEEK that returns 8 correct results with a 18 ms compile time and a 1 ms run time.

In summary, categorizing the data ahead of time resulted in the quickest solution yet. Next time, I want to revisit Full Text Indexing (FTI) to discuss how it works internally. This talk will go over all the system stored procedures you need to know when working with FTI’s.

Related posts

Leave a Comment