Chapter 1: Using Proper Normalization Techniques

In This Chapter

    Normalizing your database

    Removing repeating groups

    Insuring each column is dependent on the whole primary key

    Removing transitive dependencies

    Improving performance

Databases took a great leap forward when Edgar F. Codd defined the first three normal forms in his 1970 paper “A Relational Model of Data for Large Shared Data Banks.” Over the years, many more normal forms have been defined, but even today, the first three normal forms are the most commonly used. Today’s popular Relational Database Management Systems
(including SQL Server 2008) are based on the relational model first identified by Codd.

 

Normalizing Your Database
Normalizing a database is dividing tables into their simplest forms and creating relationships between the tables. Instead of a single table of many columns to hold all your data, you can create multiple tables and spread the data between them in a logical manner.

 

Tables are connected to each other by creating relationships. Relationships between tables are primarily created between the primary key in one table and a foreign key in another table.

 

A primary key is a column within a table that is used to uniquely identify each row within that table. Primary keys are usually numbers, but you might see them as a combination of characters and numbers in some databases. Allowing the system to generate primary keys automatically is easiest, but developers might sometimes choose to do this within their database application.
 

 

Foreign keys are used to relate one table to another table and can be a combination of numbers and letters. A foreign key is in the related column and matches the same format (numeric, character, or a combination) of the primary key in the related table.

 

Book II, Chapter 4 covers relationships between tables in-depth, including the different types of relationships, how to create relationships, and maintaining relationships.

 

 

By normalizing a database, many problems can be prevented. Consider a company that sells products to customers. Information on each customer is maintained in a database, including a name, address, phone number, and e-mail address. In a non-normalized database (one single table), you could have multiple problems when you try to query or modify this data. These problems include:
 

 

    ✦ Duplication of data: In a non-normalized database, a customer’s information might have to be entered every time he makes a purchase, storing the customer’s data in an Orders table. The result is duplication of work. Additionally, odds are very low that this information would be entered the same way each time. Consider how many different ways employees might enter the name, Brandie Johansen, or a city, such as San Francisco (SF, San Fran, SFO, and so on). A normalized database prevents the duplication of work and the associated typos.

 

 

    ✦ Update problems: If a customer’s address needs to be updated but the customer’s personal data has been entered in multiple places, the update has to occur in multiple places within the database. If the update isn’t done in multiple places, then the data will be inconsistent — correct in some places and incorrect in others.

 

 

✦ Query problems: Data in a database that isn’t normalized is difficult to retrieve. Additionally, the data that is retrieved can’t be relied on as being the most accurate. Consider the problem where a customer’s data is stored in the Orders table and  must be repeated for every order. If the customer moved, the only place where the customer’s address is accurate is in the last order. If the address is retrieved from anywhere else, it
would be incorrect.

 

Although as many as eight normal forms are sometimes listed, the most commonly used normal forms are the first three.

 

    ✦ First normal form: The table has no repeating groups, and each column is atomic. (Atomic doesn’t mean explosive; instead, it means the columns are reduced to the smallest possible value. More about that in the upcoming “Atomicity” section.)

 

    ✦ Second normal form: Each column is dependent on the entire primary key. The second normal form applies only when a table has a composite primary key.

 

    ✦ Third normal form: Each column is directly dependent on the primary key. That is, non-primary key columns aren’t directly dependent on other non-primary key columns.

 

The normal forms are progressively stricter for each higher normal form. The third normal form is stricter than the second normal form. The second normal form is stricter than the first normal form. Additionally, higher normal forms must comply with lower normal forms. If a table is in the second normal form, then by definition it’s also in the first normal form.

 

First Normal Form: No Repeating Groups

The first normal form states that no repeating elements or groups of elements can exist within the tables. Additionally, it states that each column is atomic. These rules can be summarized as:
 

 

    ✦ More than one value is not allowed in any cell (a repeating group within a column). A column for the product purchased, can include only one product purchased. Two products can’t be in the same column.

 

 

    ✦ More than one column is not used to define the same piece of data (a repeating group across columns). If more than one product is purchased, it’s not in the same table as Product1, Product2, and so on.

 

 

    ✦ Columns must be atomic. Data within a column is broken down into the smallest meaningful element. A Name column would be divided into three columns as First Name, Middle Name, and Last Name.

 

 

No repeating groups within a column


A repeating group within a column is a group where more than one value is included in a single column.

 

 

For example, consider a table used to record sales information. For every sale, the table owner desires to record the customer’s name, the date, and the item purchased.

 

 

 

Violating the first normal form, a table similar to Figure 1-1 could be created. Three columns (ItemDescription, ItemQty, and ItemPrice) have repeating groups. The customer purchased two books, and both book descriptions are in the ItemDescription column. Additionally, both quantities are in the ItemQty column, and both prices are in the ItemPrice column.

 

Creating the database this way defeats many of the strengths of a database. Creating queries to identify how many specific books have been sold, how many books have been sold in a specific price range, or the average price of books sold is quite difficult.

 

 

To comply with this portion of the first normal form, each column can have only one piece of data, such as only one item description in the ItemDescription column, only one item quantity in the ItemQty column, and only one price in the ItemPrice column.
 

 

However, limiting data entry to only one value per column won’t be enough to fix this problem.

 

 

No repeating groups across columns
 

 

A repeating group across columns is a group where multiple columns are added to the table for the same group of data. Say a beginning database developer wanted to solve the problem from the previous example.
 

 

Instead of allowing more than one value in any column, he might choose to add additional columns to the table.

 

Figure 1-2 shows how this might be done. Because the table has repeating groups across columns, it’s still violating the first normal form. A second column for ItemDescription, ItemQty, and ItemPrice has been added as ItemDescription1, ItemQty2, and ItemPrice2.

 

This approach doesn’t work from a practical sense because if a customer wanted to purchase 3 products (or 30 products), you couldn’t add enough repeating groups across the columns to meet each possible condition.

 

Can you imagine a grocery store that limits customers to only two product purchases at a time? “Sorry. Please come back tomorrow for your next two items.”

No. That’s no way to run a business and that’s no way to create a database in the first normal form.

A solution to this problem is to create two or more related tables. A possible solution is shown in Figure 1-3 where the repeating groups are taken from the first table and added to a related column named OrderDetails. The two tables are related on the OrderDetailsID column.

The OrderDetails table has columns to describe the product, the quantity, and the price. Because the OrderDetails table is related to the Orders table, you can have as many rows in the OrderDetails table as needed. Each of the rows in the OrderDetails table would have the same OrderID value relating them to a single order in the Orders table.

The icons on the connector between the two tables further define the relationship. The small key icon indicates one. The infinity icon (° ) indicates many. The Orders and OrderDetails are related in a one-to-many relationship. In other words, any order can have many order detail rows, each referring to a purchased item, such as a book, a magazine, a pen, and so on.
 

Atomicity
Atomic values are values that can’t be reduced to a smaller meaningful value. For example, a name value could include someone’s entire name (first name, middle initial, and last name), such as Harry J. Potter. However, this value isn’t atomic because the name isn’t reduced to the smallest meaningful values.
 

For a name, the smallest meaningful values would be the first name, the middle name (or middle initial), and the last name. Table 1-1 shows how you separate a name into smaller atomic values.
 

Why is this atomic? The term atomic comes from a time when it was believed that atoms were the smallest particle and that they couldn’t be reduced to anything smaller. Of course, today, we know the atom can be split.

Applying this concept to the first normal form, values within a column can’t be split into a smaller meaningful value.

You don’t break down the names into the smallest possible values — a column for every letter in the name. How ridiculous is that? Instead, you break down the column into the smallest meaningful values — the different parts of a name.

Atomic value columns make it much easier to create queries based on the individual attributes of a name. For example, you can easily search based on the first name or last name, and just as easily create queries that alphabetically order the output based on the first name or last name.

Second Normal Form: Dependent on the Whole Key

The second normal form states that the table must be in the first normal form and each column must be dependent on the entire primary key. The second normal form comes into play only on tables that have more than one column designated as the primary key. Having more than one column as the primary key is also known as a composite key.

If a table doesn’t have a composite key (only one column is designated as the primary key), and the table is in the first normal form, then it’s automatically considered to be in the second normal form.

Comparatively, if a table has a composite primary key, is in the first normal form, and has columns that aren’t dependent on each of the columns in the primary key, then it’s not in the second normal form.

For example, consider Figure 1-4. Only the OrderDetails table has a composite key (composed of the OrderID and ProductID columns). Because the Orders and Products table both have only one column in the primary key, they’re automatically considered compliant with the second normal form as long as they’re in the first normal form.
 

For the OrderDetails table to be compliant with the second normal form,each of the non-primary key columns (ItemQty and Manufacturer) must be fully dependent on the entire primary key (both OrderID and ProductID columns). The ItemQty is fully dependent. It’s identifying how many of the products (ProductID) are being ordered on this order (OrderID).

However, the Manufacturer column isn’t dependent on the entire primary key. The manufacturer is dependent on the product (ProductID) but not on this current order (OrderID).

Moving the Manufacturer column from the OrderDetails table to the Products table makes it compliant with the second normal form.


Third Normal Form

A table in third normal form must be in the second normal form and every non-primary key column must be non-transitively (or directly) dependent on only the primary key. In other words, if any non-primary key columns are directly dependent on other non-primary key columns, then the table isn’t in the third normal form.

The difference between the second normal form and the third normal form is sometimes lost on people learning about normal forms for the first time. The biggest difference is that the second normal form focuses on the primary key as a composite key and requires non-primary key columns to be dependent on the whole primary key, not just a part of it. The third normal form states that non-primary key columns must be dependent on the primary key and not other non-primary key columns. That is, columns must be dependent on the whole key (second normal form) and nothing but the key (third normal form).

Consider Figure 1-5, which isn’t in the third normal form because of the CustomerPhone column. The CustomerPhone column is directly dependent on the CustomerName column for meaning. Because this customer placed
 

this order, then the phone number is indirectly associated with the primary key of OrderID, showing a transitive relationship between the CustomerPhone and the OrderID.

Think about the salesperson filling in the data for this order. For a customer that placed 50 orders, the phone number needs to be entered 50 different times. In a perfect world, the phone number is entered the same way every 50 times, but more than likely, some typos might occur.

To make the Orders table compliant with the third normal form, you need to create a Customers table and put the associated customer data into the table.

As shown in Figure 1-6, the CustomerName and CustomerPhone columns have been replaced with a CustomerID column. Additionally, instead of just the customer name and phone number, the Customers table holds all associated customer data.

Denormalizing Your Database

Denormalizing a database is deliberately adding redundant data to a database to improve performance. Although Online Analytical Processing (OLAP) databases are highly denormalized, it’s not uncommon for an Transaction Processing (OLTP) database to have some level of denormalization designed to improve the performance of specific queries.

OLTP databases have a high level of transactions (UPDATE, INSERT, and DELETE statements). OLAP databases are highly queried (SELECT statements), but have very little, if any, transactions. Adding redundant data in an OLTP database has a performance cost because any updates to the redundant data must be made in multiple places. Because an OLAP database is rarely updated, the redundant data doesn’t add any additional maintenance costs.

For a simple example of denormalization in a database, imagine that you have a Products table that includes the columns shown in Figure 1-7. This table is normalized.

Imagine that management has recently become very interested in the cost of on-hand inventory. On a weekly basis, and often on a daily basis, queries are run that identify the total value of on-hand inventory.

Every row in the Products table must be examined. Multiplying the OnHand inventory column by the ProductCost column calculates the on-hand cost of each individual product. To calculate a total value of on-hand inventory, each multiplied value is then added together.

If you have 200 products, this isn’t a big deal. However, if you have more than 2 million products, running this query takes a lot of time and processing power, which might interfere with other processes on your server.

You can improve the performance of this query by denormalizing this table. The table can be denormalized by adding an OnHandCost column. The value of this column would be calculated by multiplying the OnHand value with the ProductCost column. An UPDATE trigger can be used to recalculate the value of the column any time the OnHand or ProductCost columns are changed.

You can also create the OnHandCost column as a computed column in the table definition. By default, a computed column is a virtual column. Whenever I hear “virtual,” I think, “Ah, it doesn’t really exist.” That’s exactly the case with a computed column. Whenever the data is retrieved, it calculates the value on the fly.

However, it is possible to mark a computed column as “persisted.” A persisted computed column does exist in the table. When either the OnHand or the ProductCost values change, the persisted column would be recalculated.

UPDAsTE statements to the OnHand or ProductCost columns take a little longer because the extra column must be calculated; however, you’ll never be updating these columns on 2 million rows at the same time.

Because the value of the OnHandCost is in the table for each row, the query that returns this information on two million rows runs much quicker. The calculation doesn’t need to be done on each of the 2 million rows each time the query is run because the calculation is stored within the table.

Chapter 2: The SQL Server Optimizer


In This Chapter

    Knowing what the Optimizer can do for you
    Understanding the Optimizer
    Using execution plans
    Creating and maintaining statistics
 

For speed and performance of any database management system, an optimizer identifies the best method to run queries. The best method includes identifying the best indexes to use and the most efficient methods of retrieving, joining, and sorting the data.

SQL Server uses its optimizer to evaluate and choose the best indexes and methods for individual queries. All queries are submitted to the optimizer, which in turn checks a variety of variables to create the best possible query plan. The query plan can be observed as text or in graphical mode by looking at the execution plan.
 

The most important thing you can do to support the optimizer is to ensure that statistics are automatically created and automatically maintained on all your indexes.
 

Why You Need the Optimizer

The SQL Server Optimizer (also called the Query Optimizer) has undergone several improvements through the years. In the days of SQL Server 6.5, a database administrator needed to tweak queries constantly by providing SQL Server an endless assortment of hints, such as which index to use or which type of join to use.

Although using hints helped the database management system (DBMS) perform better, they were a lot of work. The database administrator needed to fully understand the data and the database activity. In a dynamic database, 188 How the Optimizer Works the queries often had to be monitored while the data changed to determine whether the hints needed to be changed.

Currently, the Optimizer can analyze queries and quickly create a query plan without hints. Although hints can still be used, their use is generally discouraged except in the most advanced applications. The Optimizer automatically analyzes many aspects of the queries to create the query plan. Additionally, because the Optimizer optimizes queries when they’re executed, the query plan can change when the data changes.

For example, you might want to optimize a Sales database query that pulls data from the Orders, OrderDetails, and Products tables. By analyzing the execution plan, you could determine the best indexes to use and provide hints in the query. Over time, the data changes, the indexes change, and the statistics for the indexes change. Therefore, instead of an optimal query, you have a very slow-running query because the indexes you picked are no longer the best indexes for this query.

However, even if hints aren’t provided, the Query Optimizer can analyze the query and the elements of the database to determine the best possible query plan. Additionally, the same query can run later with a different query plan, based on the best possible query plan for that moment.
 

How the Optimizer Works

Query Optimizer analyzes an executed query and creates several possible query plans for the query. For each query plan, the Optimizer determines the estimated cost to run each query. The query plan with the smallest cost is then used to run the query. The query plans aren’t actually run to determine the cost but instead are just estimated. Estimated costs derive from input and output requirements of the query, CPU requirements, memory requirements, and more.

Figure 2-1 shows how the Optimizer works. A query is submitted to the database management system. After the query is parsed (verified that the syntax is correct), it’s sent to the Optimizer. The Optimizer analyzes the query and the database to determine different indexes to use and different join methods.

The Optimizer then creates several query plans. For each of these plans, it determines an estimated cost. The query is then compiled using the query plan with the lowest estimated cost — a cost-based optimization.

The cost of a query
 

Within SQL Server 2008, query plans are evaluated based on the costs. Costs aren’t dollars, but instead refer to time used to access resources. The cost is given in seconds but because cost is an estimated value, it doesn’t equate to actual seconds. Instead, think of the cost as a measurement for comparison.
When a query is issued, query plans are created. One query plan could be estimated to complete in one second, and a second query plan could be estimated to complete in three seconds. The Optimizer compares these measurements and chooses the quicker plan (in one second).

For every query run on SQL Server, the Optimizer strives to use the least amount of resources possible. This goal equates to having the lowest cost, or taking the shortest amount of time.

The following resources have an associated cost and are evaluated in the cost of a query:
    ✦ CPU or processor
    ✦ Memory usage
    ✦ Input/output operations
    ✦ Disk buffer space
    ✦ Disk storage time
 

Each of these resources is assigned an estimated cost. All the costs are added to identify an estimated total cost of the query. For simple queries with a very small cost, it’s possible for only a single query plan to be created and used. For example, if the first query plan created
for a simple query is estimated to complete in 8 milliseconds, the Optimizer could use this query plan instead of creating multiple plans and picking the best one. Spending another 30 milliseconds to find a query plan that could run 1 or 2 milliseconds quicker isn’t cost effective.

The Optimizer utilizes a sophisticated costing code to determine the actual cost based on a number of variables. The costing code considers
 

    ✦ How large the tables are
    ✦ What indexes are available
    ✦ How useful the indexes are for this query
    ✦ The best join methods
    ✦ The usefulness of statistics

Examining a query plan
A query plan can be viewed as text by using the SET SHOWPLAN_TEXT option. Viewing the query plan provides you a better understanding of how the queries are being run, and what indexes are being used.

The following steps show how to view a query plan:
 

1. Launch SQL Server Management Studio (SSMS) by choosing Start➪All Programs➪Microsoft SQL Server 2008➪SQL Server Management
Studio.

2. On the Connect to Server screen, click Connect.

3. Click the New Query button to open a new query window.

4. Enter the following query into the query window and press the F5 key to execute it:

SET SHOWPLAN_TEXT ON

When turned on, the SHOWPLAN_TEXT option remains on for the rest of the session until it’s turned off. While it’s on, you can’t get results from queries other than the query plan. To turn off the SHOWPLAN_TEXT option, enter the following query and execute it by pressing the F5 key:

SET SHOWPLAN_TEXT OFF.
 

5. Change the output from a grid format to a text format by right-clicking within the query pane and choosing Results To➪Results To Text.

The resulting query plan will be easier to view in a text format than in a grid format.

6. Enter a query against a database on your system.

If you have AdventureWorks2008 installed on your SQL Server, use the following query:
 

----------------------------------Q

This simple query plan merely states to use a clustered index scan when retrieving the data for this query. The clustered index is named
PK_Employee_BusinessEntityID and is in the AdventureWorks2008 database, in the Human Resources schema, in the Employee table.

Queries that are more complex have more complex query plans. The following query accesses the data from a view named vEmployee based on a more complex query:
 

-----------------------------Q

Although this query of the vEmployee view looks very similar to the previous query of the Employee table, the resulting query plan shows the complexity of the view. The query plan for the view includes a variety of indexes (clustered index scan, clustered index seek, and index scan), and different join methods (nested loops and hash matches).

Listing 2-1 shows the resulting query plan. I’ve bolded each of the methods chosen by the Query Optimizer (such as nested loops, hash match, and so on).

-------Q

This query plan allows you to peer into the inner workings of the Query Optimizer. Of course, you may be wondering, “Why would I ever want to peer in there?” Sometimes you might just want to know whether the Query Optimizer is using an index you’ve created.

For example, you might have created a composite index that included the address lines, city, state, and postal code columns in the Person.Address table specifically to optimize this query. Now you want to ensure that the index is being used. Based on the query plan, you can verify whether the index is or isn’t being used. The Person.Address table includes a composite index named:

Because the index isn’t named in the query plan, you know it isn’t being used by this query. You could either modify the index to make it more useful or delete it to eliminate the overhead required to maintain the index.

You can easily become overwhelmed with the quantity of data in a query plan. However, most of the data can be ignored. You typically use this to verify only that a specific index you’ve created is being used in this query plan.

For example, you could use the Database Engine Tuning Advisor to identify and create the best indexes to create for specific queries. Later, you might want to verify that the indexes you created are still being used. By viewing the query plan, you can easily determine whether the indexes are being used. If not, it’s probably time to run the Database Engine Tuning Advisor again.

Using Execution Plans to Figure Out What’s Happening

Execution plans are graphical representations of a query plan. You view an execution plan to determine how the query optimizer is building the query. Execution plans are often easier to view and interpret than the all-text query plans. They also provide a lot more information on the cost of a query.

To display an execution plan, perform the following steps:

1. Launch SQL Server Management Studio (SSMS) by choosing Start➪All Programs➪Microsoft SQL Server 2008➪SQL Server Management
Studio.

2. On the Connect to Server screen, click Connect. 194 Using Execution Plans to Figure Out What’s Happening

3. Click the New Query button to open a new query window.

4. Right-click in the query window and choose Include Actual Execution Plan.

You can also choose Display Estimated Execution Plan from the samemenu as Include Actual Execution Plan. The estimated plan displays
immediately based on the Optimizer’s best guesses. The actual plan appears after the query runs and shows the actual times for the query
and each sub element.

5. Enter a query into the query window.

If you have AdventureWorks2008 installed, use the following query:
-------Q

6. Press F5 to execute the query.

7. Click the Execution Plan tab to view the execution plan.

As shown in Figure 2-2, the execution plan has four nodes with the majority of the work done in the Clustered Index Scan node.
Execution plans are read from top to bottom, right to left. In this simple query, the first part of the plan is the clustered index scan, and the last part of the query is the SELECT statement.

Additionally, execution plans can easily become quite complex. Figure 2-3 shows the resulting execution plan from a query on the HumanResources. vEmployee view in the AdventureWorks2008 database. The details can’t be read on this graphic, but you can see that the execution plan is composed of 18 different nodes. When analyzing this query, you drill down into the icon at the top right, and read from top to bottom and right to left.

Each of the steps has associated costs, which represent the steps’ portion of the total cost. When you hover over the icon of an individual step, a ToolTip graphic appears showing the costs associated with that step.
 

Figure 2-4 shows a ToolTip for one of the Clustered Index Scan nodes within the execution plan. The information that appears in the ToolTip is dependent upon the operation performed by the node.

The following list describes the information that appears in the ToolTip:

    ✦ Physical Operation: Indicates the operation that will be used to implement the specified logical operation. Physical and logical operations (see the following bullet) are sometimes the same. Common physical operators are sort, nested loop, hash match, clustered index seek, clustered index scan, and index scans.

    ✦ Logical Operation: Conceptually describes what operation needs to be performed. Common logical operators are sort, inner join, clustered index seek, clustered index scan, outer join, and index scans.

    ✦ Actual Number of Rows: No surprise here. This identifies the number of rows returned from the query.
 

    ✦ Estimated I/O Cost: The estimated total input/output cost for this node (in seconds). Input/output cost is typically referring to disk activity but can also include memory activity.

    ✦ Estimated CPU Cost: Estimated cost of all processor activity for this node (in seconds).

    ✦ Estimated Operator Cost: The total cost (in seconds) of this node. This is the sum of the I/O cost and the CPU cost.

    ✦ Estimated Sub tree Cost: The cost of this node and all nodes preceding it (to the right of this node).

    ✦ Estimated Number of Rows: The number or rows produced by this node.

    ✦ Estimated Row Size: Estimated size of the row (in bytes) produced by this node.

    ✦ Actual Rebinds: Physical operators that initialize a connection with the data, collect the data, and then close. A rebind is the number of times the physical operator resets and repeats the initialize phase.

    ✦ Actual Rewinds: Rewind indicates that the inner result set for a join query can be reused (and a rebind isn’t necessary).

    ✦ Ordered: Either true or false. If true, it indicates the data is ordered, such as alphabetically or numerically in ascending or descending order.
For example, a clustered index would be ordered when first accessed, but might not be ordered after being merged with results of other nodes.
 

    ✦ Node ID: A number identifying a node. Nodes are numbered from left to right and from top to bottom (not right to left, as you would read the nodes) and start with the number 0.

Client Statistics: Helping the Optimizer Do Its Job

 The Optimizer uses statistics to help it create estimated query plans. Statistics within SQL Server work the same way as they do in other applications.

For example, consider an orange buyer purchasing a truckload of oranges. The price of oranges varies depending on how sweet or how acidic the oranges are, so the first thing she needs to do is determine how sweet the oranges are. She can do this in one of two ways:

    ✦ Take a sample from every orange on the truck and measure the sweetness and acidity.
 

    ✦ Remove a statistical sampling of oranges and measure the sweetness and acidity of the sample.

Bet that the latter choice is picked. By calculating the average and deviation of the sample of oranges, accurate predictions of the entire truckload of oranges can be made.

The Optimizer uses statistics when trying to determine the usefulness of available indexes. When determining the usefulness of an index, the Optimizer must determine two things:

    ✦ What’s the density of this index for this query? A low density is desired.
   

    ✦ What’s the selectivity of this index for this query? A high selectivity is desired.

Understanding the density of an index

Density refers to the number of duplicate rows in a column. Consider a business that does business in Virginia Beach, Virginia. More than
90 percent of the customers live in Virginia Beach. The other 10 percent of the customers live in nearby cities or are tourists from around the country.

The business has a database with a table named Customers. On the Customers table, an index has been created on the City column. If the following query were executed, the Optimizer would not use the index on the City column because the index is too dense.
 

------------------Q

On a Customers table with 10,000 entries, about 9,000 entries are identical. Using the index would not increase the performance of the query.

However, this same index might be useful for another query. The following query retrieves a list of customers who do not live in Virginia Beach.

PAGE 198

By using the index, the entire table of 10,000 rows can quickly be whittled to only 1,000 rows. With the index, the Optimizer is able to reduce the amount of data that needs to be searched to only 10 percent of the total. In this case, the index isn’t considered dense for the query.

Understanding the selectivity of an index

Selectivity is the number of rows returned by a query. The goal is to be highly selective (return the least number of rows).
 

Primary keys are created to enforce uniqueness on a table. With a primary key, every row is guaranteed to be unique (one of kind without any duplicates).

A query that retrieves a single row based on the primary key is highly selective. For example, the following query would return a single row based on the primary key of CustomerID:

-------------------Q

Using statistics
How does the Optimizer know whether an index has either high selectivity or low density and should be included in a query plan? Consider the earlier example of the Customers table with 10,000 rows.

The same as the earlier example of the orange buyer who had two choices to determine the sweetness of the oranges, the Optimizer also has two possible choices to determine the usefulness of the indexes:

    ✦ Look at each of the 10,000 rows and make a determination on the selectivity and density.

    ✦ Look at a statistical sample of the 10,000 rows and make a determination on the selectivity and density.

For speed and performance, the Optimizer chooses the latter. Statistics are automatically created on indexes by default, and the Optimizer has these available for use.
 

PAGE 199

With a 10,000-row Customers table, the index would also be 10,000 rows, but the statistics would be considerably less. The distance between rows in the statistics in the index is referred to as steps.

For example, on an index with 10,000 rows, the database management system might choose to create 300 statistics samples. When you do the math, 10,000 / 300 = 33.33, which rounds up to 34. With this example, the statistics would be created with the following data from the index:

    ✦ The first row of the statistics is always the first row of the index.

    ✦ Middle rows for the statistics are every 34th row after the first.

    ✦ The last row of the statistics is always the last row of the index.
 

Therefore, if the Optimizer needs to determine whether the index is useful for a given query, instead of having to search through 10,000 rows to determine the selectivity and density, it has to search through only 300 rows.

Automatically creating and maintaining statistics

One of the most important concepts to remember with statistics is to let the system do the work. As long as statistics are created and maintained by the system, you won’t have to do anything else on the majority of databases.

Statistics are set to be automatically created and updated by default with the following two settings:

    ✦ Auto Create Statistics: This setting specifies that the Optimizer automatically create any missing statistics during optimization of a query. Statistics could be missing if they weren’t created when the index was created, or they were deleted afterward.

    ✦ Auto Update Statistics: This setting specifies that SQL Server will automatically update statistics. If the Optimizer evaluates statistics that it determines are out of date, the Optimizer automatically updates them. Statistics can become out of date when the underlying data changes significantly.

If you suspect that statistics aren’t being maintained on a database, check the properties of the database with these steps:

1. Launch SQL Server Management Studio (SSMS) by choosing Start➪All Programs➪Microsoft SQL Server 2008➪SQL Server Management
Studio.

2. On the Connect to Server screen, click Connect.

PAGE 200

3. Browse to your database in the Databases container.

4. Right-click your database and choose Properties.

5. In the Database Properties dialog box, select the Options page.

As shown in Figure 2-5, Auto Create Statistics is set to True, and Auto Update Statistics is set to True. These are the defaults for both settings.