MySQL Forums
Forum List  »  General

Performance and indexing, when to use multi-column indexes and more
Posted by: Jim Change
Date: March 16, 2010 08:29AM

I am trying to find the best way to structure a query for performance when dealing with hundreds of thousands to potentially millions of records. Speed of query execution is of the utmost concern in this case. The SQL to create the table is included at the end of this post, as well as my current query to retrieve data from the table. This question is optionally up for some paid consulting if you are an expert ninja and want to optimize this thing to the max. I can pay a little bit for some much needed improvement.

I have set up indexes on the columns which are used in the select statement, although I have some questions regarding when I should be using multi-column indexes and some other questions too.

I will give a detailed description of what I am trying to accomplish, with a few explanations / examples and then finally my questions. I will also post my current query and the create statement for my database, including sample data for the query.

I have some basic issues that I am trying to address, but I am not sure of the correct solutions, so here's the gist of what I'm trying to do:

I am developing a web-based to-do list system for use in a large internal network. The system itself is very simple, but it requires the data to be returned very quickly (I can't have any 3 or 5 second wait times), so I need the retrieval query to be lightning fast. This system will be heavily used daily.

The basic idea is that a user logs into the system and is landed on a page which displays a list of all to-do items from the to-do lists that are assigned to them.

So, I have a database with 4 tables:
users (userID,email,displayName)
lists (listID,title,senderID,receiverID,notes,createDate,type)
items (itemID,listID,senderID,recieverID,label,postDate,parentItem,groupID)
listuserlinks (listuserlinkID,userID,listID) (ignore this table for now, I'll get to it in a bit)

Each item belongs to a to-do list (linked by its listID) and each to-do list has a creator (linked by senderID) and a recipient (linked by receiverID). You'll notice that I have put the senderID and recieverID in both the items table and the list table (an item's senderID and receiverID are the same as the senderID and receiverID of the list it belongs to, I'll explain the reason for duplicating the fields below).

So, an example...

example #1
1.) "User 1" creates a new to-do list (listID of 1, senderID of 1, receiverID of 3) for "User 3" and adds three items to the list. Each item is marked with 1 in the listID column
2.) "User 2" creates a new to-do list (listID of 2, senderID of 2, receiverID of 3) for "User 3" and adds two items to the list. Botm items are marked with 2 in the listID column
3.) When "User 3" logs in, they will see 5 items displayed, 3 items from listID=1 and 2 items from listID = 2

So, as an example of myquestion, I'll use the query for the example I've just given for when user 3 logs in:

SELECT DISTINCT items.itemID, items.listID, items.senderID, items.receiverID, items.label,
lists.listID, lists.senderID, lists.receiverID
FROM items
LEFT OUTER JOIN lists ON lists.listID = items.listID
WHERE items.senderID = 3 OR items.receiverID = 3

So, earlier I mentioned that I have the senderID and recieverID in both the items table and the lists table. I didn't know if it was more effecient to select this way:

WHERE items.senderID = 3 OR items.receiverID = 3

or this way, since the list the item is from is joined in the query:

WHERE lists.senderID = 3 OR lists.receiverID = 3

So, that's my first question... is it better to store the sender and receiver IDs in the items table and use those to select, or is it better to use the join on the list the items are linked to and use the sender and receiver IDs from the lists table to do the WHERE clause? I want to get all of the items from the database that are on lists where userID 3 is either the sender or the receiver of the list. So, is storing the sender/receiver ID in the item better or is joining to the list and using the sender/receiver ID of the list better?

So, on to question #2...

For this query:

SELECT DISTINCT items.itemID, items.listID, items.senderID, items.receiverID, items.label,
lists.listID, lists.senderID, lists.receiverID
FROM items
LEFT OUTER JOIN lists ON lists.listID = items.listID
WHERE items.senderID = 3 OR items.receiverID = 3

I currently have an index on the items table called "senderID" with senderID as the only index column, and I also have an index called "receiverID" that uses receiverID as its only index column. I also have an index called "listID" that uses listID as its only index column.

So, that's question number two... when I use "WHERE items.senderID = 3 OR items.receiverID = 3", would it be better to use an index called "users" that has both senderID and receiverID as index columns or to have what I have now with two separate indexes? (This also depends greatly on my previous question, since these columns would be lists.senderID and lists.receiverID instead, if using the joined table were the proper approach. If using the lists instead of the items for the WHERE clause were more efficient, what would be the proper way to index the columns?)

I know that I am required to use the leftmost column in a multi-column index, I have more complexity to add to my query, so this is kind of confusing for me.

When I run EXPLAIN on the above query I see "ALL" for table "items" under "type" and key is null.

I'll go ahead and explain the last bit of complexity now.

With the examples above I have been looking at selecting all items that are on lists where userID=3 is either the sender or receiver, but there is one more set of items I need to retrieve. Those are items that are on lists that "User 3" has been assigned to as a list user.

For example, let's imagine that everything from the above examples where "User 1" and "User 2" have created to-do lists for "User 3" has happened and now "User 5" creates a to-do list (listID of 3, senderID of 5, receiverID of 6) for "User 6". This won't affect the items that "User 3" sees when they log in. However, let's say "User 5" adds a record to the listuserlinks table for (userID=3,listID=3). This tells the system that "User 3" is supposed to see items from list 3, even though they aren't the sender or reciever. So, the query becomes:

SELECT DISTINCT items.itemID, items.listID, items.senderID, items.receiverID, items.label,
lists.listID, lists.senderID, lists.receiverID
FROM items
LEFT OUTER JOIN lists ON lists.listID = items.listID
WHERE (items.senderID = 3 OR items.receiverID = 3) OR lists.listID IN (SELECT listID FROM listuserlinks WHERE userID = 3)

So, this query selects all items where the sender or receiver is "User 3", and also those where the list that the item is on has a record in the listuserlink table that is associated with "User 3"

Since "User 5" added a record to the tabel for (userID=3,listID=3), the query will select all items where senderID=3 OR receiverID=3 OR those from listID=3, even though it belongs to "User 5" and the recipient is "User 6"

The final piece of complexity is that each item selected by the query that is selected because of this part of the query:

"OR lists.listID IN (SELECT listID FROM listuserlinks WHERE userID = 3)"

Can optionally have a second item attached to it whose item label is shown to the user instead of the original item label.

Example:

Imagine on the list above (listID=3) that there are 2 items on that list, so when "User 3" logs in, they see the 7 items displayed:

3 items from listID=1 and 2 items from listID = 2, plus 2 items from listID = 3 since even though list 3 doesn't belong to "User 3", there is a record in listuserlinks that tells the system to show items from listID = 3 to "User 3".

For any of the links from list 3, since User 3 is not the actual recipient, "User 3" can insert an item into the database with its parentID set to the itemID of the item it is attached to. By doing so, "User 3" can set a different label for the item without changing the label that the actual sender and receiver of the item see. This record will be selected along with the query. So:

SELECT DISTINCT items.itemID, items.listID, items.senderID, items.receiverID, items.label,
lists.listID, lists.senderID, lists.receiverID,
edited.label as editedLabel
FROM items
LEFT OUTER JOIN lists ON lists.listID = items.listID
LEFT OUTER JOIN items edited ON edited.parentItem = items.itemID
WHERE items.parentItem = 0 AND ((items.senderID = 3 OR items.receiverID = 3) OR lists.listID IN (SELECT listID FROM listuserlinks WHERE userID = 3))

So, in the query above I've simply told it to join an additional table (the items table) and call it "edited" if there were any items that have their parentItem column marked with the itemID of the item we are selecting. I've also selected edited.label as editedLabel, so I'll have a column available in my query that contains the label that "User 3" chose for the item they inserted. Lastly I've added "WHERE items.parentItem = 0". This tells the query only to get items who don't have a parent item, so it won't select the item that "User 3" inserted since it has a parentID, but it will select the label from that item when it selects its parent.

So, if List 3 contains 2 items, let's say itemID = 10 and itemID = 11 with labels "Item 10" and "Item 11", if "User 3" inserts a record with the label "My custom label for Item 11" with parentItem = 11, then when the query selects item 11 for List 3 along with the rest of the items, it will fill in the "editedLabel" column with the label field from the record with parentItem=11, but it won't select that item with parentItem=11 in the main query because it doesn't have parentItem=0 like the query requires.

Sooooo... in plain English...

I want to select to-do items from a database where the senderID or receiverID of the item (or of the list, whichever is more efficient) matches a passed in userID (3 in the examples we've been using). I also want to get items from to-do lists where there is a matching record for that user to that list. (So if there is a record for user 3, list 55, show items from list 55 too). For any items on those matched lists, to also select any additional items as "edited" that have the parentItem field equal to that item.

There are also cases where I will only need to use the senderID or the receiverID, but not both, so I'm not sure how that changes how the indexes should be set up in that case. If I just do "WHERE items.parentItem = 0 AND ((items.senderID = 3)" or "WHERE items.parentItem = 0 AND ((items.receiverID = 3)", then I'm not sure if using single-column indexes is better than multi-column indexes, etc

So, that's the best explanation I can give.

I'm not a database guy by nature, so it was a stretch for me to get this far.

Eventually I'll want to do things like filter by date range or filter by specific listID, like this:

SELECT DISTINCT items.itemID, items.listID, items.senderID, items.receiverID, items.label,
lists.listID, lists.senderID, lists.receiverID,
edited.label as editedLabel
FROM items
LEFT OUTER JOIN lists ON lists.listID = items.listID
LEFT OUTER JOIN items edited ON edited.parentItem = items.itemID
WHERE lists.listID = 125 AND items.parentItem = 0 AND ((items.senderID = 3 OR items.receiverID = 3) OR lists.listID IN (SELECT listID FROM listuserlinks WHERE userID = 3))

Or add in additional fields, like "status = 0 or hidden = false", so I don't know how much maneuvering that really requires, so I wanted to focus on the primary issue which was this query:

SELECT DISTINCT items.itemID, items.listID, items.senderID, items.receiverID, items.label,
lists.listID, lists.senderID, lists.receiverID,
edited.label as editedLabel
FROM items
LEFT OUTER JOIN lists ON lists.listID = items.listID
LEFT OUTER JOIN items edited ON edited.parentItem = items.itemID
WHERE items.parentItem = 0 AND ((items.senderID = 3 OR items.receiverID = 3) OR lists.listID IN (SELECT listID FROM listuserlinks WHERE userID = 3))

With the option of adding the listID part:

"...WHERE lists.listID = 125 AND items.parentItem = 0 AND ..."

So, there you have it.

How can I perform my query most efficiently?

With 5,000 records there is no problem, but when I insert 100,000 test records, I see query times of 2.7 seconds. 500,000 records results in 3.9 to 4.2 seconds usually, even for users that only have 5 to-do items total. If I try to retrieve even a LIMIT of 0,25 on a user that has 10,000 records themselves, it takes 5 - 6 seconds.

When I simplify the query down and remove the join of "edited" it speeds up quite a bit to .797 seconds, using the current index configuration that I have (which I know isn't optimal at all).

Also, if I add in things like:

inner join users sender ON sender.userID = items.senderID
inner join users receiver ON receiver.userID = items.receiverID

The query slows down even more, but I need these tables to show information about the sender and receiver.

I'll eventually want to be applying grouping, LIMIT, etc to this query, so that's another thing I'm not sure how to figure in to performance.

Here is the SQL to create the table with indexes and sample data. The test query to run is also included at the very end of this post. The example user is user 2, for whom there are some items already on a list, and 1 item on one list that doesn't belong to user 2 but for which user 2 is set see items from because of a record in the listuserlinks table. If you run the query you will see the result set, you will see. Insert 50,000 to 100,000 records and you'll see the slow down.

So, does anyone have any bright ideas for optimizing the query below for the table below?


-- MySQL Administrator dump 1.4
--
-- ------------------------------------------------------
-- Server version 5.1.32-community-log


/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8 */;

/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;


--
-- Create schema todo
--

CREATE DATABASE IF NOT EXISTS todo;
USE todo;

--
-- Definition of table `items`
--

DROP TABLE IF EXISTS `items`;
CREATE TABLE `items` (
`itemID` int(10) unsigned NOT NULL AUTO_INCREMENT,
`listID` int(10) unsigned DEFAULT NULL,
`senderID` int(10) unsigned DEFAULT NULL,
`receiverID` int(10) unsigned DEFAULT NULL,
`label` varchar(45) DEFAULT NULL,
`postDate` datetime DEFAULT NULL,
`parentItem` int(10) unsigned DEFAULT NULL,
`groupID` int(10) unsigned DEFAULT NULL,
PRIMARY KEY (`itemID`),
KEY `senderID` (`senderID`),
KEY `receiverID` (`receiverID`),
KEY `listID` (`listID`),
KEY `group` (`groupID`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=10 DEFAULT CHARSET=latin1;

--
-- Dumping data for table `items`
--
INSERT INTO `items` (`itemID`,`listID`,`senderID`,`receiverID`,`label`,`postDate`,`parentItem`,`groupID`) VALUES
(1,1,1,2,'Item 1','2010-03-16 02:55:55',0,1),
(2,1,1,2,'Item 2','2010-03-16 02:55:55',0,2),
(3,1,1,2,'Item 3','2010-03-16 02:55:55',0,3),
(4,3,2,1,'Item 4','2010-03-16 02:55:55',0,4),
(5,3,2,1,'Item 5','2010-03-16 02:55:55',0,5),
(6,3,2,1,'Item 6','2010-03-16 02:55:55',0,6),
(7,5,3,4,'Item 7','2010-03-16 02:55:55',0,7),
(8,5,3,4,'Item 8','2010-03-16 02:55:55',0,8),
(9,5,2,2,'Item 8 Edited Label by User 2','2010-03-16 02:55:55',8,9);

--
-- Definition of table `lists`
--

DROP TABLE IF EXISTS `lists`;
CREATE TABLE `lists` (
`listID` int(10) unsigned NOT NULL AUTO_INCREMENT,
`title` varchar(45) DEFAULT NULL,
`senderID` int(10) unsigned DEFAULT NULL,
`receiverID` int(10) unsigned DEFAULT NULL,
`notes` varchar(45) DEFAULT NULL,
`createDate` datetime DEFAULT NULL,
`type` int(10) unsigned DEFAULT NULL,
PRIMARY KEY (`listID`),
KEY `senderID` (`senderID`),
KEY `receiverID` (`receiverID`)
) ENGINE=InnoDB AUTO_INCREMENT=7 DEFAULT CHARSET=latin1;

--
-- Dumping data for table `lists`
--
INSERT INTO `lists` (`listID`,`title`,`senderID`,`receiverID`,`notes`,`createDate`,`type`) VALUES
(1,'List 1',1,2,'some notes','2010-03-16 02:55:55',0),
(2,'List 2',1,3,'some notes','2010-03-16 02:55:55',0),
(3,'List 3',2,1,'some notes','2010-03-16 02:55:55',0),
(4,'List 4',2,3,'some notes','2010-03-16 02:55:55',0),
(5,'List 5',3,4,'some notes','2010-03-16 02:55:55',1),
(6,'List 6',3,4,'some notes','2010-03-16 02:55:55',1);

--
-- Definition of table `listuserlinks`
--

DROP TABLE IF EXISTS `listuserlinks`;
CREATE TABLE `listuserlinks` (
`listuserlinkID` int(10) unsigned NOT NULL AUTO_INCREMENT,
`userID` int(10) unsigned DEFAULT NULL,
`listID` int(10) unsigned DEFAULT NULL,
PRIMARY KEY (`listuserlinkID`),
KEY `userID` (`userID`),
KEY `listID` (`listID`)
) ENGINE=InnoDB AUTO_INCREMENT=3 DEFAULT CHARSET=latin1;

--
-- Dumping data for table `listuserlinks`
--
INSERT INTO `listuserlinks` (`listuserlinkID`,`userID`,`listID`) VALUES
(1,2,5),
(2,2,6);

--
-- Definition of table `users`
--

DROP TABLE IF EXISTS `users`;
CREATE TABLE `users` (
`userID` int(10) unsigned NOT NULL AUTO_INCREMENT,
`email` varchar(255) DEFAULT NULL,
`displayName` varchar(45) DEFAULT NULL,
PRIMARY KEY (`userID`)
) ENGINE=InnoDB AUTO_INCREMENT=5 DEFAULT CHARSET=latin1;

--
-- Dumping data for table `users`
--
INSERT INTO `users` (`userID`,`email`,`displayName`) VALUES
(1,'user1@foo.bar','Some One'),
(2,'user2@foo.bar','Some Two'),
(3,'user3@foo.bar','Some Three'),
(4,'user4@foo.bar','Some Four');



/*!40101 SET SQL_MODE=@OLD_SQL_MODE */;
/*!40014 SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS */;
/*!40014 SET UNIQUE_CHECKS=@OLD_UNIQUE_CHECKS */;
/*!40101 SET CHARACTER_SET_CLIENT=@OLD_CHARACTER_SET_CLIENT */;
/*!40101 SET CHARACTER_SET_RESULTS=@OLD_CHARACTER_SET_RESULTS */;
/*!40101 SET COLLATION_CONNECTION=@OLD_COLLATION_CONNECTION */;
/*!40101 SET CHARACTER_SET_CLIENT=@OLD_CHARACTER_SET_CLIENT */;





Query to be run on this databae:

SELECT DISTINCT items.itemID, items.listID, items.senderID, items.receiverID, items.label, items.parentItem,
lists.listID, lists.senderID, lists.receiverID, lists.type,
sender.userID as senderID, sender.email as senderEmail, sender.displayName as senderName,
receiver.userID as receiverID, receiver.email as receiverEmail, receiver.displayName as receiverName,
edited.label as editedLabel

FROM items

LEFT OUTER JOIN lists ON lists.listID = items.listID
LEFT OUTER JOIN items edited ON edited.parentItem = items.itemID
inner join users sender ON sender.userID = items.senderID
inner join users receiver ON receiver.userID = items.receiverID

WHERE items.parentItem = 0

AND (
(items.senderID = 2 OR items.receiverID = 2) OR lists.listID IN (SELECT listID FROM listuserlinks WHERE userID = 2)
);

Options: ReplyQuote


Subject
Written By
Posted
Performance and indexing, when to use multi-column indexes and more
March 16, 2010 08:29AM


Sorry, you can't reply to this topic. It has been closed.

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.