Structuring tables can improve database performance-- here's how to do it.
by Reuven M. Lerner
Relational databases are becoming increasingly popular for web applications. This is generally a good thing, allowing us to focus on the way in which our data is structured, rather than the way it is stored on disk. Offloading data storage and retrieval tasks to a relational database server means our programs can be smaller and easier to maintain.
However, incorporating a database server on a web site is not a cure-all. The database might take care of many necessary tasks, but it cannot design your tables for you, nor determine the best way in which to work with them.
This month, we will look at the art of database design and how we can structure tables to improve performance. Getting the most out of a database is something of a black art, which is why good database administrators (DBAs) are always in high demand. But with a few simple techniques, we can overcome some of the most basic performance problems experienced by web programmers. We will design a database that can handle train schedules. In so doing, we will touch on a number of issues database programmers face when deciding how to design tables in a database.
I love trains, and often take the train between Tel Aviv and Haifa when I must travel between those two cities. One day, after leafing through the small paper schedule that Israel's Rail Authority distributes, I realized the implementation of a computerized train schedule is not as obvious as it would appear at first.
Rail schedules often come in the form of a printed table, with the columns representing individual trains and each station in its own row. Each table lists trains on one rail line, in a single direction.
Since relational databases store all of their data in tables, you might think this is a perfect way in which to store the information. In order to allow us to add and delete trains more easily, we will swap the axes from the printed schedule, putting the individual trains in the rows and the stations in the columns.
To define such a table in SQL, we could use a query like this:
CREATE TABLE HaifaToTelAviv ( haifa_central TIME NOT NULL, haifa_bat_galim TIME NOT NULL, binyamina TIME NOT NULL, hof_hacarmel TIME NOT NULL, ta_central TIME NOT NULL, ta_hashalom TIME NOT NULL );Given such a table, we could enter our trains as follows:
INSERT INTO HaifaToTelAviv (haifa_central, haifa_bat_galim, binyamina, hof_hacarmel, ta_central, ta_hashalom) VALUES ("12:05", "12:10", "12:17", "12:37", "13:16", "13:21");If you have any experience with databases, you can quickly see the terrible problems in store for us here. For starters, what happens if a new station is built between Haifa and Tel Aviv? That would require us to redefine our table, adding a new column, and that's only the beginning. It is a bit absurd that each train line requires two tables, one for each direction. And there isn't any way for me to determine whether a particular rail line serves any two cities--if the cities are represented by columns. What can we do about Tel Aviv? If two cities are close to each other and I can take a train to either one, I will have to query two tables in order to find the answer.
In addition, trying to query information from the above HaifaToTelAviv table would be difficult, requiring us to know the name of the column corresponding to each station. The problems just continue from there--for instance, what do we enter if the express train passes Binyamina? We could define the ``binyamina'' column to be NULL and enter a NULL value in that column. However, NULL normally indicates that a value is unknown or missing, whereas the reason in this case is much simpler.
Finally, what happens if a new schedule comes out, making each train later by a different amount of time? Editing the schedule in this format would be quite difficult.
How should we model the train schedule, then, if we cannot do so from the printed schedule? The solution is to break the information into smaller tables, bringing them together to answer questions. Relational databases specialize in this sort of operation, allowing us to ``join'' two or more tables together.
Breaking the single large table into many smaller tables makes the database more flexible, allowing us to ask many more questions than would otherwise be possible. For example, we should be able to ask questions like:
These examples all use MySQL, a ``mostly free'' database popular with many web sites. MySQL lacks some of the advanced features of other databases, such as transactions and referential integrity. However, it is easy to install and administer and is extremely fast. You can learn more about MySQL at http://www.mysql.com/.
For example, here is a definition of the RailStations table:
CREATE TABLE RailStations ( id TINYINT UNSIGNED AUTO_INCREMENT PRIMARY KEY, name VARCHAR(50) NOT NULL, UNIQUE(name) );The only reason for RailStations to exist is to associate a numeric ID with each station. It might seem silly to create such a table, when we could enter station names directly wherever we need them.
However, giving each station an ID number gives us two advantages. First of all, we can be sure the station names will be spelled consistently, without variations in spelling, capitalization and abbreviations. Second, an integer consumes less space than the name to which it points. Each tinyint consumes a single byte, whereas a 20-character station name will consume 20 bytes. Referring to the full name would thus consume 20 times as much RAM and disk space.
Notice that we define id to be a column of type TINYINT UNSIGNED. This allows us to assign values between 0 and 255. Large rail systems, with more than 255 stations, would need to use a SMALLINT UNSIGNED, which ranges between 0 and 65535.
We ensure each station name in RailStations is unique by giving it the UNIQUE qualifier. The ID numbers are already guaranteed to be unique because they have been declared the primary key. Better yet, because we specified AUTO_INCREMENT, MySQL will automatically assign an ID number if an INSERT query ignores it. For example:
INSERT INTO RailStations (name) VALUES ("Nahariya");If we now query the database:
SELECT id FROM RailStations WHERE name = "Nahariya";we learn that Nahariya has been automatically assigned an ID of 1.
We can insert one or more new rows into the table with a single INSERT statement. For example, the following adds several more rows to RailStations:
INSERT INTO RailStations (name) VALUES ("Akko"), ("Hof Hacarmel"), ("Tel Aviv Central"), ("Tel Aviv Hashalom"), ("Lod"), ("Rehovot"), ("Herzliya") ;
Unlike cars, buses and airplanes, trains run along fixed lines. Each line must have at least two stations, and each station is on one or more lines.
We could have included an additional ``lines'' column in RailStations, identifying the line with which each train is associated. But given such a table, how would we handle stations that sit on more than one line? It would not make sense to treat the station as two separate stations, particularly if people will need to switch trains.
A better solution uses a separate RailLines table, defined similarly to RailStations:
CREATE TABLE RailLines ( id TINYINT UNSIGNED AUTO_INCREMENT PRIMARY KEY, name VARCHAR(50) NOT NULL, UNIQUE(name) );Now that we have a list of lines and stations, we will create a third table that describes the intersection between the two:
CREATE TABLE StationLines ( station_id TINYINT UNSIGNED NOT NULL, line_id TINYINT UNSIGNED NOT NULL, north_to_south TINYINT UNSIGNED NOT NULL, UNIQUE(station_id, line_id), INDEX(station_id), INDEX(line_id), INDEX(north_to_south) );StationLines is a table that brings together the train stations and the lines on which they sit. station_id and line_id contain values from the ``id'' columns in RailStations and RailLines, respectively. And north_to_south is an integer value that counts the number of stops between the beginning of the line and the named station. Thus, the northernmost station on a rail line would be assigned 1, the next station would be assigned 2, the station after that would be assigned 3, and so forth.
Because each station can sit on more than one rail line, and each rail line contains more than one station, we should not use the UNIQUE modifier with those columns. However, we do not want the combination of any station and line to appear in the table more than once. We enforce this by naming both station_id and line_id as arguments to UNIQUE. Either of those columns may appear multiple times, but the combination of any two values may appear only once.
For example, the following row places station_id 1 as the northernmost station on line_id 1:
INSERT INTO StationLines (station_id, line_id, north_to_south) VALUES (1, 1, 1);The following indicates that station_id 7 is 11 stops from the beginning of line_id 4:
INSERT INTO StationLines (station_id, line_id, north_to_south) VALUES (7, 4, 11);
With StationLines defined, we can begin to ask the database basic questions. For instance, we can list the stations on line two:
SELECT station_id FROM StationLines WHERE line_id = 2 ORDER BY north_to_south;This query produces the following result:
+------------+ | station_id | +------------+ | 6 | | 4 | | 5 | +------------+While this answer is indeed correct, it is not very useful. After all, why should I have to remember various stations' ID numbers in order to use the system?
Fortunately, relational databases permit us to join two tables, allowing us to connect the station's ID number with its name. To avoid confusing columns from the two tables, we name each column using the table.column syntax, separating the two with a period. And to reduce the amount of typing we must do, we give each table a nickname.
For example, we can structure our query such that it selects information from both RailStations and StationLines:
SELECT S.name FROM RailStations S, StationLines L WHERE L.line_id = 2 AND S.id = L.station_id ORDER BY north_to_south;The query now produces the following results:
+-------------------+ | name | +-------------------+ | Lod | | Tel Aviv Central | | Tel Aviv Hashalom | +-------------------+Beginning database programmers often make the mistake of not qualifying their joins enough; that is, not putting enough statements in the WHERE clause. This is because a database server produces a join by combining every row in RailStations with every row in StationLines. The WHERE clause tells the server which rows to remove from the resulting table.
In the example above, the database server first creates a table of 112 rows (8 rows in RailStations x 14 rows in StationLines). It then removes all rows in which L.line_id is not 2, producing 24 rows. It then applies the final criterion, throwing out those rows in which S.id and L.station_id are unequal. The result is three rows.
Because we have broken down the data into three tables and we can join any combination of tables with any set of criteria, our database can already help us answer some basic questions. For example, which rail lines connect to the Tel Aviv Central station? Knowing the ID of that station is 4, I can compose the following query:
SELECT L.name FROM RailLines L, StationLines SL WHERE SL.station_id = 4 AND L.id = SL.line_id;That query produces the following results:
+-------------------------------+ | name | +-------------------------------+ | Nahariya - Tel Aviv | | Tel Aviv - Be'er Sheva | | Binyamina - Tel Aviv suburban | +-------------------------------+If I join a third table in my query, I can use the station's name, rather than its ID number:
SELECT L.name FROM RailLines L, StationLines SL, RailStations S WHERE L.id = SL.line_id AND SL.station_id = S.id AND S.name = "Tel Aviv Central";You might think the latter query, in which we name the station explicitly, would be the more common and useful when designing database applications for the Web. In fact, it's not. <select> lists and other HTML form elements distinguish between the value passed to the server and the value displayed to the user. For example:
<select name="station"> <option value="4">Tel Aviv Central </select>The above one-element <select> list gives us the best of both worlds--it displays the station name to the user, but actually passes the ID associated with that station. This means our query can join two tables rather than three, which reduces the amount of memory it uses, as well as the speed with which results are returned to the client.
In addition to the column definitions and the UNIQUE qualifier, our definition for the StationLines table included three INDEX lines--one for each of the station_id, line_id and north_to_south columns.
While it often helps to think of a relational database table as a glorified spreadsheet with rows and columns, there are some important differences. One is that a database table does not store its rows in any particular order. If we are interested in retrieving rows from the table in a certain order, we must specify it with the ORDER BY clause in our query.
Because rows are not ordered in any particular way, a SELECT query can often take quite a while to fulfill. For example, take the following query:
SELECT id FROM RailStations WHERE name = "Tel Aviv Central";This might not seem like a time-consuming query, given that it involves a single table and a simple WHERE clause. But since the rows of RailStations are not stored in any particular order, finding the rows where the name is ``Tel Aviv Central'' can take quite a while. This might be a negligible amount of time in the case of a 100-row table, but when a table contains 1,000 or 10,000 rows, the time can become noticeable. In this particular example, the database server is probably smart enough to realize that RailStations.name has been declared UNIQUE, meaning our query will return one row, if it returns anything. This means the server will, on average, have to search through only half of the rows--but that can still take quite a while.
An index changes this picture by adding a pointer to each column value. If RailStations.name is indexed, the MySQL server can almost immediately find those rows containing a particular value. It can also determine whether a value exists at all.
If indexes can increase query speeds so dramatically, why are rows unindexed by default? The main answer is that indexes are written and updated each time an INSERT or UPDATE operation is performed on a table. Since the majority of database queries are SELECTs, in which the index can substantially improve performance, this is normally an acceptable trade-off. However, certain applications must INSERT and UPDATE at maximum speed, in which case creating an index can cause problems.
Since indexes are used in locating columns of a certain value, they are necessary for only those columns that will be named in WHERE clauses. There is no need to index a column that is displayed, but rarely used as a search criterion.
In some cases, it is enough to index the first part of each column rather than the entire column. For example, if we are indexing a column of type VARCHAR(50), then we might be able to index only 10 of those characters. This will retain most of the advantages of a full index (since the first ten characters are rarely identical in such a text field), while reducing the amount of information the index must store.
Now that we have thoroughly examined the tables describing the train system, it is time to put some trains on those tracks. The question of how to model this data is a tough one, since there are a number of ways in which to accomplish it. I decided to split this information into two tables, Trains and DepartureTimes.
Each row of Trains describes a particular train, indicating the line on which it runs, the ID numbers of its origin and destination stations, and the time it departs from its origin:
CREATE TABLE Trains ( id SMALLINT UNSIGNED AUTO_INCREMENT PRIMARY KEY, line_id TINYINT UNSIGNED NOT NULL, origin_id TINYINT UNSIGNED NOT NULL, destination_id TINYINT UNSIGNED NOT NULL, depart_origin_time TIME NOT NULL, UNIQUE(line_id, origin_id, destination_id, depart_origin_time), INDEX(line_id), INDEX(origin_id), INDEX(destination_id), INDEX(depart_origin_time) );The first column is a primary key, allowing us to describe each train with a single number. The combination of a rail line, origin, destination and hour should be unique, so we ask the database server to enforce this condition with the UNIQUE keyword.
Finally, we define the DepartureTimes table, which stores information on when a train will leave from a particular station:
CREATE TABLE DepartureTimes ( train_id SMALLINT UNSIGNED NOT NULL, station_id TINYINT UNSIGNED NOT NULL, departure_time TIME NOT NULL, INDEX(train_id), INDEX(station_id), INDEX(departure_time) );Once we enter information into these tables, we can start to perform sophisticated queries. For example, which trains arrive at ``Tel Aviv Central'' before 8 a.m.?
SELECT train_id FROM DepartureTimes WHERE departure_time < "08:00" AND station_id = 4;Sure enough, this query returns a table containing two rows:
+----------+ | train_id | +----------+ | 1 | | 2 | +----------+Now we know two trains will arrive in Tel Aviv early enough for us to catch a morning meeting. But which trains are those? It would be nice to get more information than that. One possibility is to print the name of the origin station and the hour at which the train leaves:
SELECT S.name, T.depart_origin_time FROM DepartureTimes DT, Trains T, RailStations S WHERE DT.departure_time < "08:00" AND DT.station_id = 4 AND DT.train_id = T.id AND S.id = T.origin_id;Notice how SQL allows us to use < and > when handling dates and times, for columns declared as DATE, TIME or DATETIME. Given the contortions one must use in order to compare dates and times in nearly any programming language, this built-in date comparison is still one of my favorites.
Assuming we want to take the first train of the day (ID 1), we can print the list of when it will arrive at each station:
SELECT T.id, S.name, DT.departure_time FROM RailStations S, DepartureTimes DT, Trains T, StationLines SL WHERE T.id = DT.train_id AND T.id = 1 AND T.line_id = SL.line_id AND SL.station_id = DT.station_id AND DT.station_id = S.id ORDER BY T.id, SL.north_to_south ;We can even print a full schedule for trains to Tel Aviv (ID 5):
SELECT T.id, S.name, DT.departure_time FROM RailStations S, DepartureTimes DT, Trains T, StationLines SL WHERE T.id = DT.train_id AND T.line_id = SL.line_id AND SL.station_id = DT.station_id AND DT.station_id = S.id AND T.destination_id = 5 ORDER BY T.id, SL.north_to_south ;Finally, we can retrieve a full schedule for trains to Tel Aviv (ID 5) that leave after 9 AM:
SELECT T.id, S.name, DT.departure_time FROM RailStations S, DepartureTimes DT, Trains T, StationLines SL WHERE T.id = DT.train_id AND T.line_id = SL.line_id AND SL.station_id = DT.station_id AND DT.station_id = S.id AND T.destination_id = 5 AND T.depart_origin_time > "09:00" ORDER BY T.id, SL.north_to_south ;
While it is often a good idea to use a database for storing and retrieving information in a web application, it is not always obvious how to go about structuring the tables in that database. Splitting information into separate tables, as we have seen, makes it possible to mix and match data in a wide variety of ways. By using numeric primary keys and indexing the columns we will need most, we can make our queries efficient as well as flexible.
Now that we have seen how to define our database tables in an intelligent way, it is time to create some applications to use them. Next month, we will look at a variety of applications that can use these tables, giving them interfaces appropriate for web users.
Reuven M. Lerner, an Internet and Web consultant, recently moved to Modi'in, Israel following his marriage to Shira Friedman-Lerner. His book Core Perl will be published by Prentice-Hall in the spring. Reuven can be reached at reuven@lerner.co.il. The ATF home page, including archives and discussion forums, is at http://www.lerner.co.il/atf/.