Database by Doug: How indexes can help even without a where clause

I always see examples on indexing that focus on the WHERE clause, and writing SARG-able queries.

And rightly so. It's easy to demonstrate huge benefits when you can match an index to a particular WHERE clause. In other words, it's easy to convert a Table Scan operation to an Index Seek operation by the addition of a specific index.

But, indexes can help quite a bit even without a WHERE clause. In other words, if you can convert a Table Scan to an Index Scan, that can be quite helpful to.

So let me demonstrate some of this with a table that I have described below.

CREATE TABLE [dbo].[Person](
 [ID] [int] IDENTITY(1,1) NOT NULL,
 [salutory] [nvarchar](15) NULL,
 [firstName] [nvarchar](25) NULL,
 [middleInitial] [nchar](1) NULL,
 [lastName] [nvarchar](25) NOT NULL,
 [nameSuffix] [nvarchar](8) NULL,
 [email] [nvarchar](255) NULL,
 [entryDate] [datetime] NOT NULL,
 [lastUpdateDate] [datetime] NOT NULL,
 [weight] [float] NULL,
 [gender] [char](1) NULL,
 [dateOfBirth] [date] NULL,
 CONSTRAINT [PK_Person] PRIMARY KEY CLUSTERED 
(
 [ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 100) ON [PRIMARY]
) ON [PRIMARY]

To start with, the only index is the clustered index on the primary key, ID. And there is a lot of cooked / simulated data in the table:

SELECT COUNT(*)
FROM   Person

Which yields: 677,025

First, I'll make sure that the table is defragmented as much as possible:

DBCC DBREINDEX (Person, PK_Person, 100)

And to make sure that happened:

DBCC SHOWCONTIG (Person, PK_Person)

Which gives:

DBCC SHOWCONTIG scanning 'Person' table...
Table: 'Person' (2137058649); index ID: 1, database ID: 6
TABLE level scan performed.
- Pages Scanned................................: 6884
- Extents Scanned..............................: 861
- Extent Switches..............................: 860
- Avg. Pages per Extent........................: 8.0
- Scan Density [Best Count:Actual Count].......: 100.00% [861:861]
- Logical Scan Fragmentation ..................: 0.01%
- Extent Scan Fragmentation ...................: 0.93%
- Avg. Bytes Free per Page.....................: 40.6
- Avg. Page Density (full).....................: 99.50%
DBCC execution completed. If DBCC printed error messages, contact your system administrator.

From this, we can tell a few things about this table I've cooked up:

The table has 6884 pages in it
The pages are almost entirely full, 99.50% page density
677,025 records spread across 6884 pages gives about 98 records per page
Since each page is 8092 bytes, each record is about 8092/98 = 82 bytes per record

So let's see what it will take to do a basic query:

SET STATISTICS IO ON

SELECT lastName
FROM   Person

The SET command will return some IO statistics for the query. In this case, the statistics message is:

(677033 row(s) affected)
Table 'Person'. Scan count 1, logical reads 6898, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Here's what this tells me:

zero physical reads tells me that all my records are in cache (this is not particularly realistic - normally you'd have more data than RAM)
6898 logical reads tells me that 6898 pages that were in cache were accessed to give the result
In a more realistic setting, I'd have experienced somewhere between 0 and 6898 physical reads

For purposes of discussion, we'll assume that both "logical reads" and "physical reads" are equivalently undesirable. In other words, less reads is better in terms of performance.

Also, see the query plan that the SQL Engine came up with:

In this case, the Clustered Index is the Person table: a table scan. So the engine decided to go to the leaf level of the Clustered Index (the data pages of the index), and scan across from data page to data page.

Now let's add an index on lastname:

CREATE NONCLUSTERED INDEX [IX_Person_LastName] ON [dbo].[Person] 
(
 [lastName] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
GO

and look at some statistics on it:

DBCC SHOWCONTIG (Person, IX_Person_LastName)

DBCC SHOWCONTIG scanning 'Person' table...
Table: 'Person' (2137058649); index ID: 13, database ID: 6
LEAF level scan performed.
- Pages Scanned................................: 1980
- Extents Scanned..............................: 250
- Extent Switches..............................: 251
- Avg. Pages per Extent........................: 7.9
- Scan Density [Best Count:Actual Count].......: 98.41% [248:252]
- Logical Scan Fragmentation ..................: 0.20%
- Extent Scan Fragmentation ...................: 1.60%
- Avg. Bytes Free per Page.....................: 23.4
- Avg. Page Density (full).....................: 99.71%
DBCC execution completed. If DBCC printed error messages, contact your system administrator.

Now let's estimate some things about the index:

The index has 1980 pages in it
The pages are almost entirely full, 99.71% page density
677,025 records spread across 1980 pages gives about 341 records per page
Since each page is 8092 bytes, each record is about 8092/341 = 23 bytes per record

Now let's do the same select statement, and see what happens:

SELECT lastName
FROM   Person

Which gives this message about the IO:

(677041 row(s) affected)
Table 'Person'. Scan count 1, logical reads 1996, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

And this query plan:

So, here is what the index accomplished for us:

1996 reads vs 6898 reads without the index (about 29% of the cost)
A scan of the smaller non-clustered index, and no use of the base table at all

OK, so maybe you saw that coming.

If we can get everything we need from a smaller data structure, then it should cost less. This is known as a Covering Index. In this case, our SELECT statement that only references the lastName column is covered by the index that has lastName in it.

Here's one that you might not see coming. Consider this statement:

SELECT COUNT(*)
FROM   Person

Guess what? Without the index, it costs 6898 reads (table scan), and with the index it costs 1979 (clustered index scan).

Let's consider a different index. This time, I'm going to index on lastName, and weight.

CREATE NONCLUSTERED INDEX [IX_Person_LastName] ON [dbo].[Person] 
(
 [lastName] ASC,
 [weight] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
GO

And here is a SELECT statement to consider:

SELECT  AVG(weight)
FROM    Person

This results in 2919 reads, and this query plan:

Hey, that's an index Scan on IX_Person_LastName! And look no WHERE clause!

Again, this shows the power of a pure covering index. The database engine is able to get every weight in the Person table, by scanning a smaller data structure, i.e., the index.

Make special note that the index had a benefit, even though there was no WHERE clause, and no Index Seek. Indexes can help, even without the WHERE clause, and even without accomplishing a seek.

But if you really want to see the Seek, and a really big performance improvement, here you go:

SELECT  AVG(weight)
FROM    Person
WHERE   LastName = 'Adams'

9 reads, and the Index Seek that is so IO-efficient:

Database by Doug

Tuesday, September 24, 2013

How indexes can help even without a where clause

No comments:

Post a Comment

Blog Archive

Followers

About Me