Learning to be more efficient in searching by pre-indexing files.
by Reuven M. Lerner
Last month, we looked at a simple search engine for web sites. The program was little more than a CGI program strapped to the File::Find Perl module: each time a user would enter a search term in the HTML form, the search program would dutifully open and examine each of the files under the web hierarchy.
While this sort of search engine works, it is exceedingly inefficient. A site containing several dozen files will not feel too much of a hit when its documents are searched repeatedly by a CGI program, but a site with hundreds or thousands of files, attracting thousands of hits per day, will watch its server's load average skyrocket without very much difficulty.
This month, we will explore ways of making a search engine more efficient. In the end, we will have a search engine which might not work as efficiently as other software, but is simple to install and use. Most importantly, we will get a chance to explore an interesting type of software with inner workings usually invisible to us.
Searching through files sequentially, trying to find matches for a user's input, is an inherently inefficient business. Each file must be opened, read, scanned and closed, which takes time. Perl programs tend to consume a fair amount of memory, so the slow execution speed means more copies of the CGI program will be running at once. This in turn increases the risk that the web server will have to use virtual memory, rather than physical RAM. Slow web servers make for unhappy users, and often convince users not to return at all.
To solve this problem, we must reduce or remove the need for the search program to read through files. If the CGI search program did not have to open each individual file, things would speed up quite a bit.
A tried-and-true solution is to divide it up between two programs. Once or twice each day, an indexing program traverses the web-document tree, reading through each document and analyzing its word use. This program runs behind the scenes without user intervention or knowledge. Rather than sending its results to a user, the indexer dumps all information it has about word frequency and usage and places it in a file on disk.
This means the search program the user invokes via CGI does not actually have to search. Instead, the search program merely opens the index file, finds those files where the user's search term appears the greatest number of times, and displays that list in the user's browser.
Indexing a page is not difficult in Perl, because of its rich library of regular expressions. The m// operator normally matches the regular expression between its delimiters. When invoked with the /g modifier and when operating in list context, it returns all matches it can find. Thus, in the expression
my $found = join ' ', ("encyclopedia" =~ m/[aeiou]/g); print "$found\n";the first statement finds all vowels in ``encyclopedia'' and returns them as a list to the caller. In this case, the caller is Perl's join operator, which pushes the elements together, separated by spaces. Executing the two lines of code above displays the following on the user's screen:
e o a e i aUsing the built-in character class for non-whitespace characters, \S, we can apply the same algorithm in order to extract words from a text string. For example:
my $sentence = "The rain in Spain falls mainly\n\n on the plain."; my $found = join '|', ($sentence =~ m/\b\S+\b/g); print "$found\n";The code above prints the following results:
The|rain|in|Spain|falls|mainly|on|the|plainNotice how using \b (which matches a word boundary) means our program need not worry about newline characters, extra spaces or punctuation.
Indexers have to consider whether to keep case relevant. My personal preference is to ignore case, since users do not necessarily remember, and it also removes an obstacle to finding desired text. We can thus turn all of the words into lowercase letters:
my $sentence = "The rain in Spain falls mainly\n\n on the plain."; my $found = join '|', map {lc $_} ($sentence =~ m/\b\S+\b/g); print "$found\n";
To store index information, we will use a hash, %IGNORE. The keys will be words we wish to ignore when indexing. Any non-zero value will indicate this word should be ignored when indexing:
%IGNORE = ("the" => 1, "in" => 1, "on" => 1); my $sentence = "The rain in Spain falls mainly\n\n on the plain."; my $found = join '|', grep {!$IGNORE{$_}} map {lc $_} ($sentence =~ m/\b\S+\b/g); print "$found\n";Notice how we can stack different items together: m// returns a list, which is passed to map, which returns a list, which is fed to grep, which is in turn fed to join, and which is in turn assigned to $found.
Finally, we will index the words by creating a hash (%index) in which the collected words are the keys. The value will be a hash reference, where the key is the name of the current file, and the value is the number of times this word appears in the file. In other words,
$index{spain}->{foo.html} = 5;means the word ``spain'' appears in foo.html five times. Here is some code that performs the indexing in this way:
%IGNORE = ("the" => 1, "in" => 1, "on" => 1); my $sentence = "The rain in Spain falls mainly\n\n on the plain."; my @found = grep {!$IGNORE{$_}} map {lc $_} ($sentence =~ m/\b\S+\b/g); foreach my $word (@found) { $index{$word}->{$filename}++; }
Now that we know how to index the words from a string, we will use File::Find to perform this task on the entire web hierarchy. File::Find comes with the standard Perl distribution, and exports a subroutine named find. This subroutine takes a subroutine reference and a list of directories as arguments. It invokes the named subroutine (known as a ``callback'') on every file it finds in the named directories. It is the subroutine's responsibility to open the file, retrieve its contents, and index it as necessary.
Listing 3 (see Resources) contains a simple program (index-files.pl) that traverses the web hierarchy, searches through the contents, and stores word frequency in the %index hash. The callback subroutine ignores files in any directory named .noindex, as well as any subdirectories of such directories. This is accomplished with another hash named %ignore_directory. It also ignores non-HTML files and removes HTML tags from the contents of each file before indexing the words it contains. Rather than removing the HTML tags altogether, index-files.pl turns them into whitespace. This means that two words separated by an HTML tag (but no whitespace) will not be jammed together, but will remain separate words.
To avoid problems with HTML entities such as >, & and other items needed for symbols and accented Latin characters, we use the HTML::Entities module. This module exports a decode_entities subroutine that turns the entities back into their regular characters. By running this and then removing the HTML tags, we are able to index a clean document containing only text.
While hashes are convenient and fast, they tend to consume a lot of memory. A hash that points to another hash consumes even more memory, and storing each individual word as a key in the primary hash means the memory usage will skyrocket even more. Running index-file.pl is thus not for the faint of heart or for computers without a significant amount of RAM. This particular version of index-file.pl caused my Linux system (with 72MB of RAM and another 64MB of swap) to run out of memory, particularly when piping the output through less.
While index-files.pl performs an admirable job of indexing the files in a web hierarchy, it does not write the information to disk. Once index-files.pl exits, all indexing information is lost forever.
The solution is clearly to write the index to disk. But how? The most obvious answer is to use a DBM file. DBM files are most easily described as an on-disk version of hashes. Each entry in a DBM file has a key and a value, both of which may be any character string. As with hashes, DBM files are optimized for speed, making them much faster to use than straight text files. There are several different kinds of DBM files out there, ranging from plain old DBM to the GNU Project's GDBM to Berkeley DB, which has gained quite a bit of popularity in the last few years partly as a result of its integration into Sendmail.
Perl supports many types of DBM files through its ``tie'' interface, each of which has its own object module. Tying is a means of making a variable magical, attaching additional behavior every time a value is stored or retrieved. Thus, tying a hash to a DBM class means that every time a value is stored to the hash, the value is written to a corresponding DBM file on disk. By the same token, every value retrieval reads the information from disk.
There is one problem here, though: DBM files can store text strings, but nothing more complicated than that. Because each value of %index is a reference to another hash, the normal DBM classes will not work correctly. Attempting to store references in a normal DBM file will actually store their printed representation, which is not at all useful.
The solution is to use the MLDBM class, which is designed precisely for this purpose. MLDBM works in conjunction with another DBM class to store references in a format that can be retrieved later.
To use MLDBM, import it at the top of a program with use, specifying the type of DBM file to use with it:
use MLDBM qw(DB_File);In this particular case, we will use DB_File, which uses the Berkeley DB module that comes with Perl. (Most Linux systems come with Berkeley DB, as well.) It is also possible to specify the underlying method used to store the references; we will use the default, which takes advantage of the Data::Dumper module available on CPAN. Other options include FreezeThaw and Storable, which perform similar tasks in different ways.
Our hash can then be tied to a DBM file with the tie command:
# In what file should the index be stored? my $index_file = "/tmp/lj.db"; # Create the word index hash my %data; tie %data, "MLDBM", $index_file, O_RDWR|O_CREAT or die "Error tying %data: $! ";From this point onward, any value stored to %index will be stored in the file /tmp/lj.db. Any value retrieved from %index will actually be retrieved from /tmp/lj.db. Storing index data in /tmp is a mistake on a production web server, since the file can be deleted by the system.
While MLDBM attempts to function as transparently as possible, it will sometimes cause confusing errors. For example, there is normally no problem with the following Perl code:
$data{$word}->{$url}++;As we saw earlier, this will increment the counter for a particular word in a particular file. When %data is tied to MLDBM, the above code will no longer work. Instead, the assignment must be performed in two stages, retrieving the hash reference, assigning to it, and placing it back inside of %data:
my $hashref = $data{$word}; $hashref->{$url}++; $data{$word} = $hashref;The index should be regenerated on a regular basis to ensure it is as fresh as possible. Consider using cron, which comes with all Linux and UNIX systems, to run index-files.pl every day at 4 AM or at another convenient time when people are unlikely to be using the computer.
While our version of index-files.pl does not support such functionality, it would not be difficult to add a hash key indicating when the file system was last indexed. index-files.pl could then be changed to index only those files that were created or modified after a particular date.
Now that we have created an index, we will need a program to retrieve results from it. Indeed, that's just the beginning--if we want, we can write multiple programs that read through the DBM file, searching in a variety of ways.
Our first and simplest search will accept input from an HTML form. The form will allow users to specify a single word to search for. Unfortunately, the way we have indexed the files makes it easy to retrieve single words, but impossible to retrieve phrases. A different data structure would make it easier to perform such searches, but would undoubtedly increase the size of the index.
Listing 1 is a simple HTML form we can use for searching. This form expects to submit its results to a CGI program called dbsearch.pl, which is in Listing 4 (see Resources). dbsearch.pl opens the DBM file--again, using MLDBM and DB_File--and retrieves the keys from the hash associated with the search term. By sorting the keys (which contain URLs) by their values (which are integers indicating how many times a word appeared in a file), it's easy to create a simple frequency chart. In this particular case, dbsearch.pl displays all of the results, but it would not be difficult to display only the top 20.
Searching for a single term might be easy to implement, but the best search engines allow for AND and OR searches. Listing 2 is an HTML form that differs from the previous version by adding a radio button. If the radio button is set to the default OR position, any one of the search terms may appear in the document. An AND search requires that both or all of the search terms appear in the document in order for it to match.
A version of better-dbsearch.pl, which allows for AND and OR searches, is in Listing 5 (see Resources).
If the user chooses OR, dbsearch.pl will have to find the highest total for a combination of hashes, rather than just one. A simple example would be the hashes %odds and %evens, as follows:
%odds = (one => 1, three => 3, five => 5); %evens = (two => 2, four => 4, six => 6);We can turn this into a simple hash by mixing them together:
%numbers = (%odds, %evens);This technique will not work in our case, because it is quite possible that both words will appear in the same file. If that happens, there will be duplicate keys--unlike %odds and %evens, where no key appears in both hashes. For our OR search to work, we will need to combine the values in a master hash:
foreach my $word (@terms) { my $frequency = $data{$word}; foreach my $filename (keys %$frequency) { $full_results{$filename} += $frequency->{$filename}; } }The above code is not very different from its predecessor in dbsearch.pl. The main change is that it introduces a new hash, %full_results, where the keys are file names and the values reflect the total scores for each of the search terms. This algorithm means a file containing one copy of both search terms will be ranked the same as one in which the same search term appears twice.
To handle AND searches, we must keep track of how many search terms appear in each file. If a file contains the same number of search terms as are in @terms, the file is acceptable. If it contains fewer matches than @terms, the file should be dropped from consideration.
We can implement this by using the %match_counter hash, where the keys are file names and the values are integers representing the number of search terms contained in a file. By adding a single line to the foreach loop, we can keep track of how many search terms appear in each file:
$match_counter{$filename}++ if ($boolean eq "and");Once we have exited from the foreach loop, but before we announce the results to the user, we prune file names from %full_results. Because the keys for %full_results are the same as for %match_counter, the code is relatively simple:
foreach my $filename (keys %full_results) { delete $full_results{$filename} unless ($match_counter{$filename} == @terms); }Notice how we use delete to remove the hash element. Using undef on $full_results{$filename} would make the value undefined, but would leave the hash key unaffected. delete removes the key and its corresponding value from the hash completely.
Also notice how we can get the size of @terms simply by naming the array itself. The == forces scalar context on both of its operands, so there is no need to use the scalar keyword before @terms.
Our pre-indexing solution is, indeed, faster than the search program we examined last month. I did not perform any serious benchmarks on this set of programs, but the difference was readily apparent to the naked eye. Not only that, but our pre-indexed implementation made it easy to rank files according to the number of matches.
However, pre-indexing is not a panacea. For starters, it requires a good deal of disk space, weighing in at 2.6MB for an index of fewer than 400 files. However, given the rapidly dropping price of disk space, even a 10MB index should not prove to be much of a deterrent for most systems.
A bigger problem is the lack of flexibility that pre-indexing forces on a search system. For example, our index programs all used Perl's lc operator to force all of the letters to lowercase. Now that we have removed any trace of case from our index, how can we offer a case-sensitive search? The only real answer is to build the DBM file in a slightly different way, perhaps by storing an additional literal element in the hash for each file. Then we would know that abc appeared five times in a particular file, but that two of these were ABC and the remaining three were abc.
Pre-indexing also means we can no longer offer a phrase search, in addition to AND and OR searches. There are ways to solve this problem; the easiest might be to store ``next word'' information in the database hash. Then we could search for the first word of a phrase and use the ``next word'' information to see if the other words were found, one by one.
Finally, pre-indexing always means the index will lag behind other content on a web site. If the index is generated once every 24 hours, it might take up to one day for a new document to be read into the index. One solution is to run the indexer more often, such as every three or six hours.
Searching is an essential part of every good web site. It means users can find what interests them quickly, and can perform analyses that the site's administrators never expected. But as we saw last month, a straight search through a site's files can take a long time to execute. Pre-indexing is the standard solution to this problem, trading additional disk space and a slightly out-of-date index in exchange for faster execution speed. Understanding the trade-offs involved in writing a search engine makes it easier to evaluate free and commercial offerings, and thus make your site a more enjoyable place for users to visit.
Reuven M. Lerner is an Internet and Web consultant living in Haifa, Israel, who has been using the Web since early 1993. His book Core Perl will be published by Prentice-Hall in the spring. Reuven can be reached at reuven@lerner.co.il. The ATF home page, including archives and discussion forums, is at http://www.lerner.co.il/atf/.