Full-Text Search – Part 1

Today, I want to go over why text searching for patterns is an expensive query operation.

I am going to create a [WILD LIFE] database that contains a table called [ANIMALS]. A identity column called [ID] will be populated by the system automatically and a [NAME] column will be loaded with 445 animals names that I grabbed from WIKIPEDIA. The idea is to search for animals that have a root word in common.

The snippet below creates the database and table. The full SQL script including insert statements is at the end of this article.

I am going to search for all animals that contain the word fly. We are going to look at how the query analyzer creates different cost based execution plans.

For a given plan, we are going to look at the time in milliseconds (SET STATISTICS TIME) and the io counts (SET STATISTICS IO).

To get a good test, we should write dirty pages to disk (CHECKPOINT), drop clean buffers from memory (DBCC DROPCLEANBUFFERS) and free the procedure cache of any query plans (DBCC FREEPROCCAHE). These commands should be completed before executing the query.

The sql snippet below does just that.

Here is the query to select all records that have the word ‘fly’ in the name.

A FULL TABLE SCAN is performed because no index was defined. The data is stored as a unordered HEAP structure. This execution plan is very expensive when the table contains 1 Million rows.

We can also see 8 records returned in the result set after 1 scans and 2 logical reads.

Let us create a nonclustered index (CREATE INDEX) on the [NAME] column. Hopefully, the query optimizer will use the index. The data is stored in NONCLUSTERED INDEX structure in which leaf nodes are not data pages.

Oh no, we can see the same plan being used and the same execution results.

Lets drop the index and create a clustered index on [NAME]. Again, we are hoping the query optimizer will use the index. The data is stored in CLUSTERED INDEX structure in which leaf nodes are data pages.

This time, we do get the query optimizer to use the index but it is not optimal since it is a FULL CLUSTERED INDEX SCAN. Again, this is very expensive plan when the table contains alot of rows. A SEEK would be a better operation.

The main issue with this query is that we are looking for a pattern in a string. Indexing basically orders (logical or physical) the data given a sort collation. However, we are searching for part of a word. Thus, every [NAME] field needs to be examined to look for the ‘fly’ pattern.

The SQL snippet below turns off the messaging for io and time since we are done for now.

Next time, I am going to introduce setting up a full-text index using the GUI. Full-text indexes are great for searching on a single word or phrase (and optionally ranking the result set), searching on a word or phrase close to another word or phrase, and/or searching on synonymous forms of a specific word.

While this speeds up our searching, we will see that it does not completely solve the problem at hand.

Sample Code:
     TSQL

Related posts

3 Thoughts to “Full-Text Search – Part 1”

  1. A lot of thanks for your own labor on this blog. Ellie delights in engaging in investigation and it’s easy to see why. My spouse and i know all relating to the dynamic way you make efficient techniques on this website and as well increase participation from the others on that area of interest plus our own child is really learning so much. Take advantage of the rest of the new year. You’re carrying out a fabulous job.

  2. heya, superior weblog, and a decent understand! at least one for my bookmarks.

  3. I like this post, enjoyed this one appreciate it for posting .

Leave a Comment