in General

Understanding Database Indexes

Imagine you had a piece of paper with a list of numbers and full names on it – first name, then surname. The names aren’t in any particular order – it’s just a random bunch of full names written out on a page with a number that uniquely identifies that name. The list might start like this:

1   Matt Durrant
2   Oscar Stapleton
3   Gemma Thomas
4   Amelia Durrant
5   Matt Swarbrick

You’re told to search the list and draw a circle around every example of a certain first name. How might you approach searching the list? You’d have to go through every single name in turn. You couldn’t stop when you found the first name you cared about – you’d still have to carry on all the way through to the very last name on the list to make sure you’d found every instance of the name on there.

If the list was short (say 10 or so names) then this process might seem trivial. Scanning through the list of names takes no time or effort at all. But the longer the list of names becomes, the longer it’s going to take you to search all the way through. Once you’re into hundreds or even thousands of names, searching the list starts to become a major headache.

How about if you put the names in alphabetical order (by first name) and tried the search again? This would make the search a lot easier. You could skip straight through the names at the start to the beginning of the block of names you cared about. Then you could read through all the examples of the first name you were searching for, and you could ignore any of the remaining names in the list as you knew you wouldn’t have to care about any them. Much faster.

However sometimes you also want to search the list by surname, so scrapping the original list and re-ordering it alphabetically doesn’t always help you. Instead you’re given a separate list of first names, in alphabetical order along with the numbers that allows you to quickly find the names in the original list. This is an index. By searching the index for the first name you can easily grab the list of names from the original list.

Amelia     4
Gemma   3
Matt        1, 5
Oscar      2

The index doesn’t contain the surnames, just the first name as that’s the thing you’ll be using to search on. You’d still have to read the surname back off the original list. If you also wanted to search by surname, you could create another index, this time with the surnames in alphabetical order.

What would happen if you then wanted to change the original list? Maybe you wanted to add a few more names at the end, or update a few of the names in the middle somewhere? Well, you’d also have to make a corresponding update to any of the indexes that were affected.

You may be starting to see some of the upsides and downsides of using an index. An index helps you search a list quickly, but it doesn’t come for free. You need some extra paper to write your index on, and you have to do a bit of work to maintain your index every time you make a change to the original list. If your list is quite short this might be more hastle than it’s worth. And if you have a list that you update all the time, you’re also going to have to constantly updating the indexes on your list.

Let’s apply this to SQL and see how it works. Firstly lets create a table called “Customers” that mirrors our example above:

CREATE TABLE Customers
(Id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
FirstName NVARCHAR(100) NULL,
LastName NVARCHAR(100) NULL)

This creates a table with three columns. There is “Id” column that starts at 1 and increases by 1 for every row we add to the table. This is the primary key on the table. There is also a “FirstName” column, and a “LastName” column. Both of these can hold some text.

Just by creating a table with a primary key we’ve already got our first index on the table. When you create a primary key on a table (in this case for our “Id” column”) it will automatically be used to create an Implicit Index. It makes sense that we’d want to use our primary key to look up rows quickly.

We can see that index has been created by using the query:

SHOW INDEXES IN Customers
Table Non_unique Key_name Seq_in_index Column_name Collation Cardinality Sub_part Packed Null Index_type Comment Index_comment
customers 0 PRIMARY 1 Id A 0 (NULL) (NULL) BTREE

This tells us a bit about the index that was created. The name of the index is “PRIMARY”, and as you might expect it is a Unique index. This means the index will enforce uniqueness across each of it’s entries (which of course makes complete sense for a Primary Key). The Cardinality is the number of unique rows in the index. In this case because we have 0 zero rows in the table, it returns 0. The BTree value refers to the way the actual index is stored. This may be interesting to you, but the way the data in the index is stored is invisible to you when you’re using the table so you can definitely get away without knowing too much about it.

Next let’s run the sql block below to insert 100 dummy names:

INSERT INTO `Customers` (`FirstName`,`LastName`) VALUES ("Evangeline","Hurley"),("Ruth","Daugherty"),("Mira","Caldwell"),("Bell","Wolf"),("Leandra","Parks"),("Ian","Lane"),("Hannah","Zamora"),("Todd","Clemons"),("Brandon","Hoover"),("Rhonda","Bolton");
INSERT INTO `Customers` (`FirstName`,`LastName`) VALUES ("Sawyer","Wells"),("Dean","Hewitt"),("Imelda","Gonzalez"),("Tobias","Mccarty"),("Debra","Aguilar"),("Hayfa","Cabrera"),("Ori","Raymond"),("Karina","Hebert"),("Lael","Hall"),("Maggie","Mccarty");
INSERT INTO `Customers` (`FirstName`,`LastName`) VALUES ("Alfreda","Matthews"),("Ezekiel","Graham"),("Aaron","Downs"),("Odessa","Pacheco"),("Quincy","Humphrey"),("Marny","Wong"),("Sylvia","Pitts"),("Stuart","Davenport"),("Norman","Burton"),("Harriet","Cobb");
INSERT INTO `Customers` (`FirstName`,`LastName`) VALUES ("Jacqueline","Knowles"),("Isadora","Delgado"),("Britanney","Wiley"),("Raya","Rodriquez"),("Chava","Wilkerson"),("Barclay","Clay"),("Emmanuel","Stark"),("Blaze","Townsend"),("Georgia","Powell"),("Avram","Hunter");
INSERT INTO `Customers` (`FirstName`,`LastName`) VALUES ("Chava","Mcfadden"),("Jessamine","Cox"),("Bert","Sims"),("Idola","Chambers"),("Colton","Golden"),("Kyle","Boyle"),("Aaron","Carver"),("Deirdre","Parrish"),("Blaine","Blackburn"),("Carla","Olson");
INSERT INTO `Customers` (`FirstName`,`LastName`) VALUES ("Axel","Pace"),("Ramona","Reynolds"),("Cullen","Whitaker"),("Kendall","Hatfield"),("Rhoda","Davis"),("Gretchen","Mason"),("Omar","Reese"),("Wang","Rasmussen"),("Mohammad","Foster"),("Delilah","May");
INSERT INTO `Customers` (`FirstName`,`LastName`) VALUES ("Ramona","Harrell"),("Sebastian","Dennis"),("Sade","Livingston"),("Destiny","Fernandez"),("Genevieve","Lewis"),("Brennan","Kelly"),("Jesse","Frederick"),("Dane","Beard"),("Rowan","Quinn"),("Shellie","Lambert");
INSERT INTO `Customers` (`FirstName`,`LastName`) VALUES ("Julie","Norris"),("Emmanuel","Franco"),("Dalton","Tanner"),("Herman","England"),("Justina","Mcdowell"),("Gretchen","Vinson"),("Quentin","Payne"),("Demetria","Hopkins"),("Ria","Wynn"),("Bethany","Carey");
INSERT INTO `Customers` (`FirstName`,`LastName`) VALUES ("Valentine","Weeks"),("Leigh","Hutchinson"),("Dylan","Burton"),("Adrian","Garrett"),("Alec","Haley"),("Raymond","Gay"),("Theodore","Myers"),("Colton","Wiggins"),("Ivy","Bauer"),("Mariko","Paul");
INSERT INTO `Customers` (`FirstName`,`LastName`) VALUES ("Janna","Shannon"),("Ella","Yates"),("Illiana","Hays"),("Cade","Weiss"),("Elizabeth","Foster"),("Luke","Clemons"),("Serina","Cote"),("Theodore","Lawrence"),("Gray","Boyer"),("Britanni","Cook");

(By the way, I’ve created this list using a site called GenerateData which is a useful tool for easily creating dummy data like this).

Note that if you check the indexes again on the table, the cardinality for the PRIMARY index is now 100 – as there are 100 unique ID values in the table.

Now lets recreate our first example of having to search the piece of paper (now our Customer table) for a certain first name. We can write a simple Select statement to do this, but that won’t tell us how the search was performed. To do that we need to add the EXPLAIN keyword to the start of the search. This will tell us exactly how much work needed to be done in the search:

EXPLAIN SELECT *
FROM Customers
WHERE FirstName = 'Ramona'
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE Customers ALL (NULL) (NULL) (NULL) (NULL) 100 Using where

As we had expected this reveals that our Select query performed a search of all 100 rows in the table. We can see from the Type value of All that a full table scan was done. Every row in the Customer table had to be searched to make sure all the Ramona’s had been found.

Contrast this with the same query against our primary key:

EXPLAIN SELECT *
FROM Customers
WHERE Id = '30'
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE Customers const PRIMARY PRIMARY 4 const 1

Only a single row was read to perform this search. The possible_keys column shows the list of indexes the search could have made use of, and the key column shows the actual index it chose (in this case there is only a single index to choose from – PRIMARY).

(If you’re interested in reading the other outputs of EXPLAIN, this article may be useful).

Now let’s add a new index for our First Name column. I’ve called it FirstName_IDX. The syntax for this is simple enough:

CREATE INDEX FirstName_IDX 
ON Customers (Firstname)

This will create a Non-Unique Index – there is no restriction on only having one of each first name, which is exactly what we want in this case. You could try and create a unique index (using the syntax CREATE UNIQUE INDEX) but this would fail due to their being duplicate first names already in the database.

Let’s retry our search for Ramona and see what’s changed:

EXPLAIN SELECT *
FROM Customers
WHERE FirstName = 'Ramona'
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE Customers ref FirstName_IDX FirstName_IDX 303 const 2 Using index condition

Now we only search the two rows that have Ramona as a first name. Much better. We can see that our new index FirstName_IDX is being used as expected.

What else could we use the index for? Could we use it to speed up searches for a first name that starts with a certain letter? What about first names that end with a certain letter? From what you now know about indexes you can probably guess that an index will help us with the first search, but not the second. And we know that we can use EXPLAIN to prove this:

EXPLAIN SELECT *
FROM Customers  
WHERE FirstName LIKE 'R%'
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE Customers range FirstName_IDX FirstName_IDX 303 (NULL) 9 Using index condition
EXPLAIN SELECT *
FROM Customers
WHERE FirstName LIKE '%R'
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE Customers ALL (NULL) (NULL) (NULL) (NULL) 100 Using where

You can also create indexes that use multiple columns.

Hopefully this article has given you a decent overview of what problem an index helps solve, alongside some of the pros and cons to using them. The database is commonly a bottleneck in web development. Use EXPLAIN to look for places in slow-running queries where full table scans are being undertaken, and consider whether an Index would drastically cut down the number of rows that need to be searched. On larger tables this can lead to massive gains in the speed of your queries. Adding indexes everywhere is not a great idea however as they do come with an overhead. You need to understand what your queries are doing and only add the appropriate Indexes.

Write a Comment

Comment