CGI Programming on the World Wide Web

Chapter 10
Gateways to Internet Information Servers

10.7 Network News on the Web

NNTP (Network News Transfer Protocol) is the most popular protocol used to transmit Usenet news over the Internet. It lets the receiving (client) system tell the sending (server) system which newsgroups to send, and which articles from each group. NNTP accepts commands in a fairly simple format and sends back a stream of text consisting of the requested articles and occasional status information.

This CGI gateway communicates with an NNTP server directly by using socket I/O. The program displays lists of newsgroups and articles for the user to choose from. You will be able to read news from the specified newsgroups in a threaded fashion (all the replies to each article are grouped together).

#!/usr/local/bin/perl
require "sockets.pl";
$webmaster = "Shishir Gundavaram (shishir\@bu\.edu)";
$error = "CGI NNTP Gateway Error";
%groups = ( 'cgi',     'comp.infosystems.www.authoring.cgi',
            'html',    'comp.infosystems.www.authoring.html',
            'images',  'comp.infosystems.www.authoring.images',
            'misc',    'comp.infosystems.www.authoring.misc',
            'perl',    'comp.lang.perl.misc' );

The groups associative array contains a list of the newsgroups that will be displayed when the form is dynamically created.

$all_groups = '(cgi|html|images|misc|perl)';

The all_groups variable contains a regular expression listing all of the keys of the groups associative array. This will be used to ensure that a valid newsgroup is specified by the user.
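Keeping the pattern in step with the array by hand is easy to forget. As a minimal sketch (not part of the original program), the pattern could instead be built from the keys themselves. The use of quotemeta is a precaution I have added in case a key ever contains a regular expression metacharacter.

```perl
use strict;
use warnings;

my %groups = ( 'cgi',    'comp.infosystems.www.authoring.cgi',
               'html',   'comp.infosystems.www.authoring.html',
               'images', 'comp.infosystems.www.authoring.images',
               'misc',   'comp.infosystems.www.authoring.misc',
               'perl',   'comp.lang.perl.misc' );

# Join the sorted keys with "|" and wrap them in parentheses.
my $all_groups = '(' . join ('|', map { quotemeta } sort keys %groups) . ')';

print $all_groups, "\n";
```

With the five keys shown, this produces the same pattern as the hand-written one.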

$nntp_server = "nntp.bu.edu";

The NNTP server is set to "nntp.bu.edu". If you do not want users from domains other than "bu.edu" to access this form, you can set up a simple authentication scheme like this:

$allowed_domain = "bu.edu";
$remote_host = $ENV{'REMOTE_HOST'};
($remote_domain) = ($remote_host =~ /([^.]+\.[^.]+)$/);
if ($remote_domain ne $allowed_domain) {
    &return_error (500, $error, "Sorry! You are not allowed to read news!");
}

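The regular expression used above extracts the last two components of the host name (for example, "bu.edu" from "cs.bu.edu"). It is useful only when the server supplies a resolved host name in REMOTE_HOST; a numeric IP address carries no domain name.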
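Here is a small sketch that applies the same pattern to a few sample host names. The names are made up for illustration, and the helper name last_two_labels is hypothetical. As the third name suggests, a pattern that keeps two components would not suit registries such as "co.uk".

```perl
use strict;
use warnings;

# Return the last two dot-separated components of a host name,
# or undef when the name has fewer than two components.
sub last_two_labels {
    my ($host) = @_;
    my ($domain) = ($host =~ /([^.]+\.[^.]+)$/);
    return $domain;
}

foreach my $host ('cs.bu.edu', 'bu.edu', 'www.example.co.uk', 'localhost') {
    my $domain = last_two_labels ($host);
    $domain = '(no match)' unless defined $domain;
    print "$host -> $domain\n";
}
```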


Or, you can allow multiple domains like this:

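# The dots are escaped and the match must start at a label boundary,
# so a host such as "notbu.edu" is rejected.
$allowed_domains = '(bu\.edu|mit\.edu|perl\.com)';
$remote_host = $ENV{'REMOTE_HOST'};
if ($remote_host !~ /(^|\.)$allowed_domains$/o) {
    &return_error (500, $error, "Sorry! You are not allowed to read news!");
}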

To continue with the program:

&parse_form_data (*NEWS);
$group_name = $NEWS{'group'};
$article_number = $NEWS{'article'};

There is no form front end to this CGI gateway. Instead, all parameters are passed as query information (GET method). If you access this application without a query, a document listing all the newsgroups is displayed. Once you select a newsgroup from this list, the program is invoked again, this time with a query that specifies the newsgroup you want. For instance, if you want the newsgroup whose key is "images", this query is passed to the program:

http://some.machine/cgi-bin/nntp.pl?group=images

The groups associative array associates the string "images" with the actual newsgroup name. This is a more secure way of handling things--much like the way the Archie server names were passed instead of the actual IP names in the previous example. If the program receives a query like the one above, it displays a list of the articles in the newsgroup. When the user chooses an article, the query information will look like this:

http://some.machine/cgi-bin/nntp.pl?group=images&article=18721

This program will then display the article.
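The parse_form_data subroutine itself is not shown in this section. As a rough sketch of what decoding such a query involves (an assumption for illustration, not the author's code), the hypothetical parse_query subroutine below splits the string on "&" and "=" and decodes the URL escapes:

```perl
use strict;
use warnings;

# Decode a query string such as "group=images&article=18721" into a hash.
sub parse_query {
    my ($query) = @_;
    my %form;
    foreach my $pair (split (/&/, $query)) {
        next unless length $pair;
        my ($key, $value) = split (/=/, $pair, 2);
        $value = '' unless defined $value;
        foreach ($key, $value) {
            tr/+/ /;                                # "+" encodes a space
            s/%([0-9A-Fa-f]{2})/chr (hex ($1))/ge;  # "%XX" encodes one byte
        }
        $form{$key} = $value;
    }
    return %form;
}

my %NEWS = parse_query ('group=images&article=18721');
print "group=$NEWS{'group'} article=$NEWS{'article'}\n";
```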

if ($group_name =~ /^$all_groups$/o) {
    $selected_group = $groups{$group_name};

This block of code will be executed only if the group field consists of a valid newsgroup name, as stored in all_groups. The actual newsgroup name is stored in the selected_group variable.

    &open_connection (NNTP, $nntp_server, "nntp") ||
          &return_error (500, $error, "Could not connect to NNTP server.");
    &check_nntp ();

A socket is opened to the NNTP server. The server usually runs on port 119. The check_nntp subroutine checks the header information that is output by the server upon connection. If the server issues any error messages, the script terminates.

    ($first, $last) = &set_newsgroup ($selected_group);

The NNTP server keeps track of all the articles in a newsgroup by numbering them in ascending order, starting at some arbitrary number. The set_newsgroup subroutine returns the identification number for the first and last articles.

    if ($article_number) {
        if (($article_number < $first) || ($article_number > $last)) {
            &return_error (500, $error,
                 "The article number you specified is not valid.");
        } else {
            &show_article ($selected_group, $article_number);
        }

If the user selected an article from the list that was dynamically generated when a newsgroup is selected, this branch of code is executed. The article number is checked to make sure that it lies within the valid range. You might wonder why we need to check this, since the list that is presented to the user is based on the range generated by the set_newsgroup subroutine. The reason for this is that the NNTP server lets articles expire periodically, and articles are sometimes deleted by their author. If sufficient time passes between the time the list is displayed and the time the user makes a selection, the specified article number could be invalid. In addition, I like to handle the possibility that a user hardcoded a query.
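Because a hardcoded query can contain anything, it may also be worth confirming that the value consists only of digits before comparing it numerically. The sketch below shows one way to do this. The subroutine name valid_article is hypothetical, and the range is taken from the sample group response used later in this section.

```perl
use strict;
use warnings;

# True when $number is a plain decimal number within [$first, $last].
sub valid_article {
    my ($number, $first, $last) = @_;
    return 0 unless defined $number && $number =~ /^\d+\z/;
    return ($number >= $first && $number <= $last) ? 1 : 0;
}

foreach my $candidate ('4800', '99999', '18721abc') {
    my $verdict = valid_article ($candidate, 4776, 14059) ? 'accepted' : 'rejected';
    print "$candidate: $verdict\n";
}
```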

    } else {
        &show_all_articles ($group_name, $selected_group, $first, $last);
    }

If no article is specified, which happens when the user selects a newsgroup from the main HTML document, the show_all_articles subroutine is called to display a list of all the articles for the selected newsgroup.

    print NNTP "quit", "\n";
    &close_connection (NNTP);

Finally, the quit command is sent to the NNTP server, and the socket is closed.

} else {
    &display_newsgroups ();
}
exit (0);

If this program is accessed without any query information, or if the specified newsgroup is not in the list stored in the groups associative array, the display_newsgroups subroutine is called to output the valid newsgroups.

The following print_header subroutine displays a MIME header, and some HTML to display the title and the header.

sub print_header
{
    local ($title) = @_;
    print "Content-type: text/html", "\n\n";
    print "<HTML>", "\n";
    print "<HEAD><TITLE>", $title, "</TITLE></HEAD>", "\n";
    print "<BODY>", "\n";
    print "<H1>", $title, "</H1>", "\n";
    print "<HR>", "<BR>", "\n";
}

The print_footer subroutine outputs the webmaster's address.

sub print_footer
{
    print "<HR>", "\n";
    print "<ADDRESS>", $webmaster, "</ADDRESS>", "\n";
    print "</BODY></HTML>", "\n";
}

The escape subroutine "escapes" all characters except for alphanumeric characters, underscores, and whitespace. This ensures that "special" characters are displayed properly.

sub escape
{
    local ($string) = @_;
    $string =~ s/([^\w\s])/sprintf ("&#%d;", ord ($1))/ge;
    return ($string);
}

For example, if an article in a newsgroup contains:

From: joe@test.net (Joe Test)
Subject: I can't get the <H1> headers to display correctly

The browser will actually interpret the "<H1>", and the rest of the document will be messed up. This subroutine escapes the text so that it looks like this:

From&#58; joe&#64;test&#46;net &#40;Joe Test&#41;
Subject&#58; I can&#39;t get the &#60;H1&#62; headers to display correctly

A web client can interpret any string in the form &#n;, where n is the ASCII code of the character. This might slow down the display slightly, but it is much safer than escaping specific characters only.
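As a quick check of this behavior, the sketch below runs the subroutine on the Subject line from the example and prints the escaped text. It is written with my instead of local so that it runs under strict, which is an adaptation on my part.

```perl
use strict;
use warnings;

# Replace every character that is not a word character or whitespace
# with its numeric character reference.
sub escape {
    my ($string) = @_;
    $string =~ s/([^\w\s])/sprintf ("&#%d;", ord ($1))/ge;
    return $string;
}

my $escaped = escape ("Subject: I can't get the <H1> headers to display correctly");
print $escaped, "\n";
```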

The check_nntp subroutine reads the output from the NNTP server line by line until it sees either a success status (200 or 201) or a failure status (4xx or 5xx). You might have noticed that these status codes are very similar to HTTP status codes. In fact, many Internet protocols, such as FTP and SMTP, use three-digit status codes of this kind.

sub check_nntp
{
    while (<NNTP>) {
        if (/^(200|201)/) {
            last;
        } elsif (/^[45]\d\d/) {
            &return_error (500, $error, "The NNTP server returned an error.");
        }
    }
}
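The first digit of the reply code is what separates success from failure. As an illustrative sketch (the helper name reply_class is hypothetical, and the sample reply lines are invented), a status line can be classified like this. Anchoring the pattern with ^ and using a character class for the first digit avoids matching a stray digit elsewhere in the line.

```perl
use strict;
use warnings;

# Classify an NNTP reply line by the first digit of its three-digit code.
sub reply_class {
    my ($line) = @_;
    return 'unknown' unless $line =~ /^([1-5])\d\d\b/;
    return 'error' if $1 >= 4;    # 4xx and 5xx report failures
    return 'ok';                  # 1xx, 2xx and 3xx are not failures
}

foreach my $line ('200 news.example.com ready (posting ok)',
                  '502 access restriction or permission denied',
                  'not a status line') {
    my $class = reply_class ($line);
    print "$class: $line\n";
}
```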

The set_newsgroup subroutine returns the first and last article numbers for the newsgroup.

sub set_newsgroup
{
    local ($group) = @_;
    local ($group_info, $status, $first_post, $last_post);
    print NNTP "group ", $group, "\n";

The group command is sent to the NNTP server. In response to this, the server sets its current newsgroup to the one specified, and outputs information in the following format:

group comp.infosystems.www.authoring.cgi
211 1289 4776 14059 comp.infosystems.www.authoring.cgi

The first column indicates the status of the operation (211 being a success). The total number of articles, the numbers of the first and last articles, and the newsgroup name make up the rest of the line, in that order. As you can see, the number of articles is not equal to the numerical difference between the first and last article numbers. This is due to article expiration and deletion (as mentioned above).

    $group_info = <NNTP>;
    ($status, $first_post, $last_post) = (split (/\s+/, $group_info))[0, 2, 3];

The server output is split on whitespace, and the first, third, and fourth elements are stored in status, first_post, and last_post, respectively. Remember, arrays are zero based; the first element is zero, not one.
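To see the list slice at work outside the program, the following sketch applies the same split to the sample response shown earlier. The trailing "\r\n" is included on the assumption that the server terminates its lines that way.

```perl
use strict;
use warnings;

my $group_info = "211 1289 4776 14059 comp.infosystems.www.authoring.cgi\r\n";

# Take elements 0, 2 and 3 of the split list with a list slice.
my ($status, $first_post, $last_post) = (split (/\s+/, $group_info))[0, 2, 3];

print "status=$status first=$first_post last=$last_post\n";
```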

    if ($status != 211) {
        &return_error (500, $error,
                            "Could not get group information for $group.");
    } else {
        return ($first_post, $last_post);
    }
}

If the status is not 211, an error message is displayed. Otherwise, the first and last article numbers are returned.

In the show_article subroutine, the actual news article is retrieved and printed.

sub show_article
{
    local ($group, $number) = @_;
    local ($useful_headers, $header_line);
    
    $useful_headers = '(From:|Subject:|Date:|Organization:)';
    print NNTP "head $number", "\n";
    $header_line = <NNTP>;

The head command displays the headers for the specified article. Here is the format of the NNTP output:

221 14059 <47hh6767ghe1$d09@nntp.test.net> head
Path: news.bu.edu!decwrl!nntp.test.net!usenet
From: joe@test.net (Joe Test)
Newsgroups: comp.infosystems.www.authoring.cgi
Subject: I can't get the <H1> headers to display correctly
Date: Thu, 05 Oct 1995 05:19:03 GMT
Organization: Joe's Test Net
Lines: 17
Message-ID: <47hh6767ghe1$d09@nntp.test.net>
Reply-To: joe@test.net
NNTP-Posting-Host: my.news.test.net
X-Newsreader: Joe Windows Reader v1.28
.

The first line contains the status, the article number, the article identification, and the NNTP command, respectively. The status of 221 indicates success. All of the other lines constitute the various article headers, and are based on how and where the article was posted. The list of headers ends with a line containing only the "." character.
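If you wanted the headers in a form that is easier to look up than a stream of lines, one option is to collect them into an associative array. The sketch below is an illustration only: it reads a shortened copy of the response above from an in-memory filehandle that stands in for the socket, and it does not handle headers that continue onto a second line.

```perl
use strict;
use warnings;

my $response = join ("\r\n",
    '221 14059 <47hh6767ghe1$d09@nntp.test.net> head',
    'From: joe@test.net (Joe Test)',
    'Subject: I can\'t get the <H1> headers to display correctly',
    'Lines: 17',
    '.', '');

# An in-memory filehandle stands in for the NNTP socket.
open (my $nntp, '<', \$response) or die "Cannot open in-memory handle: $!";

my $status_line = <$nntp>;                # "221 ..." status line
my %header;
while (my $line = <$nntp>) {
    $line =~ s/\r?\n$//;                  # strip the line terminator
    last if $line eq '.';                 # a lone "." ends the header list
    my ($name, $value) = split (/:\s*/, $line, 2);
    $header{$name} = $value;
}
close ($nntp);

print "$header{'From'}\n$header{'Subject'}\n";
```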

    if ($header_line =~ /^221/) {
        &print_header ($group);
        print "<PRE>", "\n";

If the server returns a success status of 221, the print_header subroutine is called to display the MIME header, followed by the usual HTML.

        while (<NNTP>) {
            if (/^$useful_headers/) {
                $_ = &escape ($_);
                print "<B>", $_, "</B>";
            } elsif (/^\.\s*$/) {    
                last;
            }
        }

This loop iterates through the header body, and escapes and displays the From, Subject, Date, and Organization headers.

        print "\n";
        print NNTP "body $number", "\n";
        <NNTP>;

If everything is successful up to this point, the body command is sent to the server. In response, the server outputs the body of the article in the following format:

body 14059
222 14059 <47hh6767ghe1$d09@nntp.test.net> body
I am trying to display headers using the <H1> tag, but it does not
seem to be working. What should I do? Please help.
Thanks in advance,
-Joe
.

There is no need to check the status of this command if the head command executed successfully. The server returns a status of 222 to indicate success.

        while (<NNTP>) {    
            last if (/^\.\s*$/);
            $_ = &escape ($_);
            print;
        }

The while loop iterates through the body, escapes all the lines, and displays them. If the line starts with a period and contains nothing else but whitespace, the loop terminates.
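One detail the loop does not handle: so that a body line beginning with a period is not mistaken for the terminator, NNTP servers send such a line with the period doubled, and the client is expected to remove the extra period. Here is a minimal sketch of that step, using an in-memory filehandle in place of the socket (the sample text is invented):

```perl
use strict;
use warnings;

my $response = join ("\r\n",
    'First line of the article.',
    '..signature starts with a period',
    '.', 'never reached', '');

open (my $nntp, '<', \$response) or die "Cannot open in-memory handle: $!";

my @body;
while (my $line = <$nntp>) {
    last if $line =~ /^\.\s*$/;    # a lone "." ends the body
    $line =~ s/^\.\././;           # undo the doubled leading period
    $line =~ s/\r?\n$//;           # strip the line terminator
    push (@body, $line);
}
close ($nntp);

print join ("\n", @body), "\n";
```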

        print "</PRE>", "\n";
        &print_footer ();
    } else {
        &return_error (500, $error,
            "Article number $number could not be retrieved.");
    }
}

If the specified article is not found, an error message is displayed.

The following subroutine reads all of the articles for a particular group into memory, threads them--all replies to a specific article are grouped together for reading convenience--and displays the article numbers and subject lines.

sub show_all_articles
{
    local ($id, $group, $first_article, $last_article) = @_;
    local ($this_script, %all, $count, @numbers, $article,
           $subject, @threads, $query);
    $this_script = $ENV{'SCRIPT_NAME'};
    $count = 0;

This is the most complicated (but the most interesting) part of the program. Before your eyes, you will see a nice web interface grow from some fairly primitive output from the NNTP server.

    print NNTP "xhdr subject $first_article-$last_article", "\n";
    <NNTP>;

The xhdr subject command lists the subjects of all the articles in the specified range in the following format:

xhdr subject 4776-14059
221 subject fields follow
4776 Re: CGI Scripts (guestbook ie)
4831 Re: Access counter for CERN server
12769 Re: Problems using sendmail from Perl script
12770 File upload, Frames and BSCW
-
- (More Articles)
-
.

The first line contains the status. Again, there is no need to check this, as we know the newsgroup exists. Each article is listed with its number and subject.

    &print_header ("Newsgroup: $group");
    print "<UL>", "\n";
    while (<NNTP>) {
        last if (/^\.\s*$/);
        ($article, $subject) = split (/\s+/, $_, 2);
        $subject =~ s/^\s+//;
        $subject =~ s/\s+$//;
        $subject =~ s/^[Rr][Ee]:\s*//;
        $subject = &escape ($subject);    # escape last, or the ":" in "Re:" is lost

The loop iterates through all of the subjects. The split command separates each entry into the article number and subject. Leading and trailing spaces, as well as "Re:" at the beginning of the line, are removed from the subject, so that a reply ends up with the same subject string as the article it answers.

        if (defined ($all{$subject})) {
            $all{$subject} = join ("-", $all{$subject}, $article);
        } else {
            $count++;
            $all{$subject} = join ("\0", $count, $article); 
        }
    }

This is responsible for threading the articles. Each new subject is stored in an associative array, %all, keyed by the subject itself. The $count variable gives a unique number to start each value in the array. If the subject already exists, the article number is simply appended to the end of the element with the same subject. For example, if the subjects look like this:

2020 What is CGI?
2026 How do you create counters?
2027 Please help with file locking!!!
2029 Re: What is CGI?
2030 Re: What is CGI?
2047 Re: How do you create counters?
.
.
.

Then this is how the associative array will look:

$all{'What is CGI?'} = "1\0" . "2020-2029-2030";
$all{'How do you create counters?'} = "2\0" . "2026-2047";
$all{'Please help with file locking!!!'} = "3\0" . "2027";

Note that we assigned a $count of 1 to the first thread we see ("What is CGI?"), 2 to the second thread, and so on. Later we sort by these numbers, so the user will see threads in the order that they came in to the newsgroup.
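The threading step can be tried on its own with the sample subjects above. The sketch below is a standalone illustration rather than the program's own loop: it leaves out the HTML escaping, and it prints the entries alphabetically by subject, with the "\0" separator made visible, purely so the stored values can be inspected.

```perl
use strict;
use warnings;

my @lines = ('2020 What is CGI?',
             '2026 How do you create counters?',
             '2027 Please help with file locking!!!',
             '2029 Re: What is CGI?',
             '2030 Re: What is CGI?',
             '2047 Re: How do you create counters?');

my %all;
my $count = 0;
foreach my $line (@lines) {
    my ($article, $subject) = split (/\s+/, $line, 2);
    $subject =~ s/^[Rr][Ee]:\s*//;           # replies share the original's key
    if (defined $all{$subject}) {
        $all{$subject} = join ("-", $all{$subject}, $article);   # existing thread
    } else {
        $count++;
        $all{$subject} = join ("\0", $count, $article);          # new thread
    }
}

foreach my $subject (sort keys %all) {
    (my $shown = $all{$subject}) =~ s/\0/\\0/;   # make the separator visible
    print "$subject => $shown\n";
}
```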

    @numbers = sort by_article_number keys (%all);

What you see here is a common Perl technique for sorting. The sort command invokes a subroutine repeatedly (in this case, one that I wrote called by_article_number). Using a fast algorithm, it hands pairs of keys from the %all array to the subroutine in the variables $a and $b.

    foreach $subject (@numbers) {
        $article = (split("\0", $all{$subject}))[1];

The loop iterates through all of the subjects. The list of article numbers for each subject is stored in article. Thus, the $article variable for "What is CGI?" would be:

2020-2029-2030

Now, we work on the string of articles.

        @threads = split (/-/, $article);

The string containing all of the articles for a particular subject is split on the "-" delimiter and stored in the threads array.

        foreach (@threads) {        
            $query = join ("", $this_script, "?", "group=", $id, 
                       "&", "article=", $_);
            print qq|<LI><A HREF="$query">$subject</A>|, "\n";
        }
    }
    print "</UL>", "\n";
    &print_footer ();
}

The loop iterates through each article number (or thread), and builds a hypertext link containing the newsgroup name and the article number (see Figure 10.3).

Figure 10.3: News articles

[Graphic: Figure 10-3]

The following is a simple subroutine that compares two values of an associative array.

sub by_article_number
{
    $all{$a} <=> $all{$b};
}

This statement is equivalent to the following:

if ($all{$a} < $all{$b}) {
    return (-1);
} elsif ($all{$a} == $all{$b}) {
    return (0);
} elsif ($all{$a} > $all{$b}) {
    return (1);
}

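The $a and $b variables hold two keys of the associative array, and the subroutine compares the values stored under them. Since each value begins with its $count number, the numeric comparison orders the subjects by the order in which their threads first appeared.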
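The comparison relies on Perl converting the leading digits of each value to a number, which works but draws "isn't numeric" warnings when warnings are enabled. As a sketch of a more explicit variant (the subroutine names thread_number and by_thread_number are hypothetical), the thread number can be extracted before it is compared:

```perl
use strict;
use warnings;

my %all = ('What is CGI?'                     => join ("\0", 1, '2020-2029-2030'),
           'How do you create counters?'      => join ("\0", 2, '2026-2047'),
           'Please help with file locking!!!' => join ("\0", 3, '2027'));

# Extract the thread number that precedes the "\0" separator.
sub thread_number {
    my ($subject) = @_;
    return (split (/\0/, $all{$subject}))[0];
}

# Compare two keys of %all by their thread numbers.
sub by_thread_number {
    thread_number ($a) <=> thread_number ($b);
}

my @ordered = sort by_thread_number keys (%all);
print join ("\n", @ordered), "\n";
```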

The display_newsgroups subroutine creates a dynamic HTML document that lists all the newsgroups contained in the groups associative array.

sub display_newsgroups
{
    local ($script_name, $keyword, $newsgroup, $query);
    &print_header ("CGI NNTP Gateway");
    $script_name = $ENV{'SCRIPT_NAME'};
    print "<UL>", "\n";
    foreach $keyword (keys %groups) {
        $newsgroup = $groups{$keyword};
        $query = join ("", $script_name, "?", "group=", $keyword);
        print qq|<LI><A HREF="$query">$newsgroup</A>|, "\n";
    }
    print "</UL>";
    &print_footer ();
}

Each newsgroup is listed as an item in an unordered list, with the query consisting of the specific key from the associative array. Remember, the qq|...| notation is exactly like the "..." notation, except that "|" is the delimiter instead of the double quotation mark.
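The sketch below isolates the link-building step so that the effect of qq|...| is easy to see. It uses a two-entry version of the array and a fixed script path in place of the SCRIPT_NAME environment variable, both of which are stand-ins for illustration. Since keys returns the keys in no particular order, the sketch sorts them, which is one way to make the listing appear in the same order every time.

```perl
use strict;
use warnings;

my %groups = ('cgi',  'comp.infosystems.www.authoring.cgi',
              'perl', 'comp.lang.perl.misc');
my $script_name = '/cgi-bin/nntp.pl';    # stand-in for $ENV{'SCRIPT_NAME'}

my @items;
foreach my $keyword (sort keys %groups) {     # sorted for a predictable order
    my $query = join ("", $script_name, "?", "group=", $keyword);
    # qq|...| interpolates like "...", so the inner double quotes
    # need no backslashes.
    push (@items, qq|<LI><A HREF="$query">$groups{$keyword}</A>|);
}

print join ("\n", @items), "\n";
```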

