Mr. Lerner shows us how to add a full-text search to our BBS.
by Reuven M. Lerner
Over the last two months, we have looked at a simple bulletin-board system that can be incorporated into a web site. This BBS groups messages by subject or thread, and stores the messages in a relational database.
As we have seen in the last two installments of ATF, using a database for information storage and retrieval significantly reduces the amount of development work necessary to implement such a system. Given the ease with which I was able to implement the BBS, I decided to add several more features which can help users navigate their way through and use the BBS.
This month, we will look at how to implement several of these features. The most important is full-text search, which allows users to find interesting postings based on keywords. This saves them from having to search through a thread, which might not have an appropriate or easy-to-understand title. Then we will look at a tool that allows administrators to remove inappropriate postings without having to go into the guts of the database.
If you did not catch the last two installments of ``At the Forge'', let's take a quick look at how the BBS is implemented. I used MySQL (see Resources), a relational database that has gained quite a bit of popularity among web programmers. Information in a relational database is stored in tables, in which the rows represent records and the columns represent fields.
We define the columns with SQL, the Structured Query Language that is a well-known standard for working with relational databases. Our BBS will contain two separate tables, ATFThreads (for keeping track of the individual threads, including the initial posting) and ATFMessages (for keeping track of individual messages). SQL is sent to a database server from a database client; this can be a programmatic client (e.g., a CGI program) or an interactive client (e.g., the mysql program that comes with MySQL). I normally use interactive clients for table creation, maintenance and debugging, but we can create our two tables with the following SQL:
CREATE TABLE ATFThreads ( id SMALLINT UNSIGNED AUTO_INCREMENT PRIMARY KEY, subject VARCHAR(255) NOT NULL, author VARCHAR(60) NOT NULL, email VARCHAR(60) NOT NULL, text TEXT NOT NULL, date DATETIME NOT NULL, UNIQUE(subject) ) CREATE TABLE ATFMessages ( id MEDIUMINT UNSIGNED AUTO_INCREMENT PRIMARY KEY, thread SMALLINT UNSIGNED NOT NULL, subject VARCHAR(60) NOT NULL DEFAULT "No subject", date DATETIME NOT NULL, author VARCHAR(60) NOT NULL DEFAULT "Mr. Nobody", email VARCHAR(60) NOT NULL DEFAULT "atf@lerner.co.il", text TEXT NOT NULL )If you enter the above into the interactive mysql program, you will want to put a semicolon (;) after each query to indicate that you want mysql to execute the query right away, rather than wait for additional input.
Once we have created the tables, we will have to write some programs--in our case, CGI programs--that manipulate the data. We will not write programs with SQL directly, but will create it within our CGI programs. Using Perl for such CGI programs is particularly easy once we use CGI.pm (the standard Perl module for CGI programs) and DBI (Perl's generic database interface).
The entire bulletin board system consists of about seven programs, each of which handles a different aspect of the system. You can download the programs from the FTP sites ftp://ftp.ssc.com/pub/lj/listings/issue57/3193.tgz and ftp://ftp.ssc.com/pub/lj/listings/issue58/3252.tgz Listings for this article will be in ftp://ftp.ssc.com/pub/lj/listings/issue59/3296.tgz
Now that we have given the basic functionality a quick discussion, let's start to add some advanced functionality to our BBS.
The first new function we will add to the BBS is full-text searching. I have long been a fan of full-text search, both on the Web and elsewhere. According to Jakob Nielsen, probably the best-known web-usability researcher, many users are ``search-dominant''. This means they prefer to search through a site for content, rather than traverse through trees of hyperlinks. (See Resources for more information on Nielsen, including the URL of his article.)
While the threaded structure of our BBS makes it relatively easy to find postings on a particular topic, undoubtedly there will be times when subject lines fail to reflect the actual content or when discussions veer into unexpected or unusual directions. Allowing users to search for words or phrases makes it easier for them to find what they want. Best of all, because the search functionality is in a separate program, it slows down the system only when someone uses it. If no one ever searches through the BBS, the system will not be slowed down.
MySQL allows for two kinds of searches through a table, with either SQL regular expressions or UNIX-style regular expressions. SQL regular expressions might seem silly to someone used to working with UNIX, but they are guaranteed to work on any database system that adheres to SQL standards. SQL regular expressions have two special characters: % (which matches zero or more characters) and _ (which matches exactly one character). To escape these special characters, you insert a leading backslash (\). To get a literal backslash, you insert two backslashes (\\).
To search for a match with SQL regular expressions, you use the LIKE operator within a SELECT statement. This returns all of the rows for which the regular expression finds a match. For instance:
SELECT text FROM ATFMessages WHERE text LIKE "a%"will retrieve all message texts that begin with the letter a, followed by zero or more characters.
If you prefer UNIX-style regular expressions, MySQL allows you to use the REGEXP (or RLIKE) operator, as in
SELECT text FROM ATFMessages WHERE text RLIKE "a.*"This will perform the same function as above.
Rather than force one system on all users, I decided to allow for both literal text and UNIX regular expressions. UNIX regular expressions are hard for most people to learn and understand, so the default is to allow for literal text searches. We perform a literal text search by using the LIKE operator, escaping the two SQL regular expression meta-characters with backslashes.
The search form itself (search-form.shtml, see Listing 1 in the archive file for an expanded version) is thus very short:
<P>Search for: <input type="text" name="term"></P> <input type="radio" name="regexp" checked value="no">Literal search <input type="radio" name="regexp" value="yes">Use regular expressions <input type="submit" value="Search!">This form is submitted to the CGI program search.pl (Listing 2 in the archive file), which performs the actual searching. Search.pl is also fairly simple, although the SQL query is the most complex we have seen in this project. That's because we have to search through ATFMessages to find matches. We also need the message's thread ID number in order to create a hyperlink to view-thread.pl, which allows users to look at that thread.
We perform what is known as a ``join'' between the two tables, selecting several columns from ATFMessages and one column from ATFThreads. Joins allow us to take only the most interesting columns from two or more tables, grabbing only those matching our criteria. Always remember to set up a relationship among the tables, otherwise you will get the ``Cartesian product'' of the results, with each of the rows in table A matched up with each of the rows in table B. We thus avoid selection like this:
SELECT M.id, M.thread, M.subject, M.author, T.subject FROM ATFMessages M, ATFThreads Twhich will produce the Cartesian product. Instead, we use
SELECT M.id, M.thread, M.subject, M.author, T.subject FROM ATFMessages M, ATFThreads T AND M.thread = T.id ORDER BY M.date descwhich qualifies the relationship between ATFMessages (given the nickname ``M'' in this query) and ATFThreads (with the nickname T), then lists the resulting rows in descending date order.
We also test the value of $regexp, which is set to the value of the ``regexp'' radio button. If $regexp is ``yes'', we use the REGEXP operator in our SQL query and perform a regular expression search. Otherwise, we escape the SQL characters % and _ and use the LIKE operator. The Perl code to make this query possible looks like this:
my $sql = "SELECT M.id, M.thread, M.subject, M.author, T.subject "; $sql .= "FROM ATFMessages M, ATFThreads T "; if ($regexp eq "yes") { $sql .= "WHERE M.text REGEXP \"$term\" "; } else { $term =~ s|%|\\\%|g; $term =~ s|_|\\\_|g; $sql .= "WHERE M.text LIKE \"%$term%\" "; } $sql .= "AND M.thread = T.id "; $sql .= "ORDER BY M.date desc";Because we build the SQL query by combining text strings, we can conditionally modify parts of the query, as we saw above.
Results from a SQL SELECT query are always returned in a table, in which the columns are the rows requested in the query and the rows are those matching the criteria from the query. With DBI, reading the results from a query usually means iterating through the rows from within a Perl while loop.
DBI provides a number of methods for retrieving the values returned by a SELECT, but perhaps the easiest one to understand is the simple fetchrow_array method. This method is defined for $sth, the ``statement handle'' through which we submit our query and retrieve its results.
Several methods can be used to retrieve the results from our SELECT, but the easiest one to understand is $sth->fetchrow_array, which returns one row from the response. Each time we invoke $sth->fetchrow_array, the next row from the response table is returned. After $sth->fetchrow_array returns the last row of the response table, it returns ``false''. By putting $sth->fetchrow_array inside of a while loop, we can iterate through each of the rows in the response table.
Here, then, is the code from search.pl that iterates through the results table:
while (my @row = $sth { ($message_id, $thread_id, $subject, $author, $thread_name) = @row; print "<li><a href=\"/cgi-bin/view-thread.pl?"; print "$thread_id#$message_id\">$subject</a>, "; print "by $author in "; print "<a href=\"/cgi-bin/view-thread.pl?$thread_id\">"; print "$thread_name</a>\n"; }As you can see, we assign a number of scalars to the individual elements in @row. DBI returns NULL elements (that is, elements that lack a value, rather than the C/Perl notion of ``true'' being non-zero) as undefined, so you can test for a value with Perl's built-in defined function. Once we have extracted the elements of @row into a number of easy-to-identify scalars, we can then use them to print results to the user's browser.
Notice how each of the hyperlinks we create does not simply point to a thread, but also to a message. We can do this by taking advantage of named anchors within a link, which allow us to force the user's browser to scroll to a particular point. If you are unfamiliar with named anchors, here is a quick lesson: in the link http://www.ssc.com/test.html#testing, ``testing'' is the named anchor and points to a location in test.html marked with the tag <a name="testing">. If no such tag exists, adding the named anchor to the URL has no effect.
Because our program view-thread.pl (discussed last month) places such a named anchor at the beginning of each message header within a thread, we can thus point users directly to the message that matched their search string, rather than to the thread.
By the way, if you are interested in getting the greatest possible speed out of your application, you might want to consider using $sth->fetchrow_arrayref rather than $sth->fetchrow_array. The difference, as you might guess from their names, is that the former method returns an array reference, while the latter returns an array.
Passing a reference will always be faster than passing an array, since it involves manipulating fewer bytes. I chose to work with $sth->fetchrow_array partly because it simplified the rest of the code, and partly because I felt that we would be handling small amounts of data anyway and that the speed difference would not be too significant.
With these two files--search-form.html and search.pl--installed on our server, we now have the ability to search through the text of any message. With a few new links to the search form from our main page, this functionality is integrated into our system.
Whenever I work on a web site that involves databases, I almost always include one or two web-based administrative tools. Such tools are useful for a number of reasons, the most obvious of which is their utility for people who want to manipulate the database without learning SQL. (They also allow you to shield the database from people who think they understand SQL, only to discover there is not any way to undo a DROP TABLE command.)
Which tools are necessary and which functions they must perform will depend on the web application you have written, as well as the needs of your individual users. We will look at a simple application called zap-thread.pl, which allows administrators to delete one or more discussion threads.
In order to ensure that only authorized users can zap threads, we will include a password field, adding the variable $zap_password to ATFConstants.pm, the module containing all of our global variables. (See Listing 4 in the archive file.)
While there are a number of ways in which we could implement zap-thread.pl, I found it easiest to write a single program that has two different personalities. When invoked with the GET method, zap-thread.pl produces an HTML form that can be used to delete threads, including a text field into which the user must enter a password.
When invoked with POST, zap-thread.pl assumes it was invoked by the form produced by its GET personality. It expects to receive two different sorts of HTML form elements: a password in the password form element, which must be compared with the $zap_password variable, and one or more check boxes named ``thread-x'', where x is the ID number of the thread in the database.
Before doing anything else, our program compares the password it received with the $zap_password variable. If they match, we continue without comment. If they fail to match, we produce an error message telling the user that the password did not match.
We can iterate through the thread-x elements in a number of ways, but I found the easiest way was to iterate through each element, ignoring any elements that failed to match the ``thread-x'' pattern. By capturing the number within parentheses:
next unless ($element =~ m/^thread-(\d+)$/);we can then grab it using the $1 variable, which retrieves whatever was in the first set of parentheses in the last match:
my $thread_id = $1;Once we have retrieved the thread ID number, we need to delete matching rows from ATFThreads, which contains the master list of threads, and from ATFMessages, which contains the messages themselves. If we were to delete rows from only ATFThreads, we would run a substantial risk of MySQL reusing the thread ID number with a new thread--which would effectively put all of our supposedly deleted messages in the new thread.
We delete them by sending two separate SQL queries, printing a brief status message to the user's browser:
my $sql = "DELETE FROM ATFThreads WHERE id = $thread_id "; warn "SQL: \"$sql\"\n"; my $sth = $dbh->prepare ($sql); my $result = $sth->execute; die("Error deleting from ATFThreads: " . $sth->errstr) unless $result; print "<P>Deleted the thread.</P>\n"; # Delete messages for this thread from ATFMessages $sql = "DELETE FROM ATFMessages WHERE thread = $thread_id "; warn "SQL: \"$sql\"\n"; $sth = $dbh->prepare ($sql); $result = $sth->execute; die("Error deleting from ATFMessages: " . $sth->errstr) unless $result; print "<P>Deleted messages in the thread.</P>\n"; }By putting the DELETE commands inside of a foreach loop, we make it possible to delete more than one thread. If the user indicates that three threads should be deleted, we will enter the loop three times, deleting each thread in sequence.
There is at least one problem with this implementation of zap-thread.pl, and that is the lack of an undo function. What happens if you delete the wrong thread? Perhaps the easiest way to implement such an undelete function would be to change the underlying tables, adding a new ``active'' column to ATFThreads and ATFMessages. This column would contain a single true/false value, perhaps implemented with a TINYINT or an ENUM type.
With such table definitions in place, deleting a thread would involve an UPDATE query (rather than the current DELETE query) to modify the value in that column. Such a change would also require some adjustments to list-threads.pl and view-thread.pl, so that they only SELECT those rows WHERE active="true", or a similar condition.
Another potential issue with zap-thread.pl is that it operates on a per-thread basis. There are undoubtedly times when we will want to delete an individual message, rather than a whole thread. Creating such a program would be a bit more difficult than zap-thread.pl, but the true challenge is in the user interface. Using zap-thread.pl will be difficult enough with 100 threads; trying to find an interface appropriate for zapping individual messages would be even more difficult. The best approach might involve breaking the program into two smaller programs, one for selecting a thread and another for selecting a message.
As with most software projects, this BBS has almost unlimited potential for improvement and expansion. A number of ways remain in which this software could be improved--allowing for sub-threads, providing for moderated discussions, sending e-mail to some or all participants in a thread when a new posting is added, allowing users to edit their own postings and even an integrated spell-checker.
Finally, notice how our web BBS application used separate CGI programs, rather than one single large program. Using a suite of related programs is not the only way to design such applications, but I find it to be the easiest way to create such functionality. Not only does it allow breaking the problem into separate slices, but it also allows for incremental implementation--which comes in handy when a client is breathing down your neck, wanting to see results right away.
Reuven M. Lerner is an Internet and Web consultant living in Haifa, Israel, who has been using the Web since early 1993. His book Core Perl will be published by Prentice-Hall in the spring. Reuven can be reached at reuven@lerner.co.il. The ATF home page, including archives and discussion forums, is at http://www.lerner.co.il/atf/.