The Hypertext Transport Protocol (CGI Programming with Perl)

If you are interested in understanding more about HTTP than we provide here, visit the World Wide Web Consortium's web site at http://www.w3.org/Protocols/. On the other hand, if you are eager to get started writing CGI scripts, you may be tempted to skip this chapter. We encourage you not to. Although you can certainly learn to write CGI scripts without learning HTTP, without the bigger picture you may end up memorizing what to do instead of understanding why. This is certainly the most challenging chapter, however, because we cover a lot of material without many examples. So if you find it a little dry and want to peek ahead to the fun stuff, we'll forgive you. Just be sure to return here later.

2.1. URLs

During our discussion of HTTP and CGI, we will be often be referring to URLs , or Uniform Resource Locators. If you have used the Web at all, then you are probably familiar with URLs. In web terms, a resource represents anything available on the web, whether it be an HTML page, an image, a CGI script, etc. URLs provide a standard way to locate these resources on the Web.

Note that URLs are not actually specific to HTTP; they can refer to resources in many protocols. Our discussion here will focus strictly on HTTP URLs.

What About URIs?

You may have also encountered the term URI and wondered about the difference between a URI and a URL. Actually, the terms are often interchangeable because all URLs are URIs. Uniform Resource Identifiers (URIs) are a more generalized class which includes URLs as well as Uniform Resource Names (URNs). A URN provides a name that sticks to an object even though the location of the object may move around. You can think of it this way: your name is similar to a URN, while your address is similar to a URL. Both serve to identify you in some way, and in this manner both are URIs.

Because URNs are just a concept and are not used on the Web today, you can safely think of URIs and URLs as interchangeable terms and not let the terminology throw you. Since we are not interested in other forms of URIs, we will try to avoid confusion altogether by just using the term URL in the text.

2.1.1. Elements of a URL

HTTP URLs consist of a scheme, a host name, a port number, a path, a query string, and a fragment identifier, any of which may be omitted under certain circumstances (see Figure 2-1).

Figure 2-1. Components of a URL

HTTP URLs contain the following elements:

Scheme

2.1.2. Absolute and Relative URLs

Many of the elements within a URL are optional. You may omit the scheme, host, and port number in a URL if the URL is used in a context where these elements can be assumed. For example, if you include a URL in a link on an HTML page and leave out these elements, the browser will assume the link applies to a resource on the same machine as the link. There are two classes of URLs:

Absolute URL

2.1.3. URL Encoding

Many characters must be encoded within a URL for a variety of reasons. For example, certain characters such as ?, #, and / have special meaning within URLs and will be misinterpreted unless encoded. It is possible to name a file doc#2.html on some systems, but the URL http://localhost/doc#2.html would not point to this document. It points to the fragment 2.html in a (possibly nonexistent) file named doc. We must encode the # character so the web browser and server recognize that it is part of the resource name instead.

Characters are encoded by representing them with a percent sign (%) followed by the two-digit hexadecimal value for that character based upon the ISO Latin 1 character set or ASCII character set (these character sets are the same for the first eight bits). For example, the # symbol has a hexadecimal value of 0x23, so it is encoded as %23.

The following characters must be encoded:

Control characters: ASCII 0x00 through 0x1F plus 0x7F
Eight-bit characters: ASCII 0x80 through 0xFF
Characters given special importance within URLs: ; / ? : @ & = + $ ,
Characters often used to delimit (quote) URLs: < > # % "
Characters considered unsafe because they may have special meaning for other protocols used to transmit URLs (e.g., SMTP): { } | \ ^ [ ] `

Additionally, spaces should be encoded as + although %20 is also allowed. As you can see, most characters must be encoded; the list of allowed characters is actually much shorter:

Letters: a-z and A-Z
Digits: 0-9
The following characters: - _ . ! ~ * ' ( )

It is actually permissible and not uncommon for any of the allowed characters to also be encoded by some software. Thus, any application that decodes a URL must decode every occurrence of a percentage sign followed by any two hexadecimal digits.

The following code encodes text for URLs:

sub url_encode {
    my $text = shift;
    $text =~ s/([^a-z0-9_.!~*'(  ) -])/sprintf "%%%02X", ord($1)/ei;
    $text =~ tr/ /+/;
    return $text;
}

Any character not in the allowed set is replaced by a percentage sign and its two-digit hexadecimal equivalent. The three percentage signs are necessary because percentage signs indicate format codes for sprintf, and literal percentage signs must be indicated by two percentage signs. Our format code thus includes a percentage sign, %%, plus the format code for two hexadecimal digits, %02X.

Code to decode URL encoded text looks like this:

sub url_decode {
    my $text = shift;
    $text =~ tr/\+/ /;
    $text =~ s/%([a-f0-9][a-f0-9])/chr( hex( $1 ) )/ei;
    return $text;
}

Here we first translate any plus signs to spaces. Then we scan for a percentage sign followed by two hexadecimal digits and use Perl's chrfunction to convert the hexadecimal value into a character.

Neither the encoding nor the decoding operations can be safely repeated on the same text. Text encoded twice differs from text encoded once because the percentage signs introduced in the first step would themselves be encoded in the second. Likewise, you cannot encode or decode entire URLs. If you were to decode a URL, you could no longer reliably parse it, for you may have introduced characters that would be misinterpreted such as / or ?. You should always parse a URL to get the components you want before decoding them; likewise, encode components before building them into a full URL.

Note that it's good to understand how a wheel works but reinventing it would be pointless. Even though you have just seen how to encode and decode text for URLs, you shouldn't do so yourself. The URI::URL module (actually it is a collection of modules), available on CPAN (see Appendix B, "Perl Modules"), provides many URL-related modules and functions. One of the included modules, URI::Escape, provides the url_escape and url_unescape functions. Use them. The subroutines in these modules have been vigorously tested, and future versions will reflect any changes to HTTP as it evolves.[2] Using standard subroutines will also make your code much clearer to those who may have to maintain your code later (this includes you).

[2]Don't think this could happen? What if we told you the tilde character (~) was not always allowed in URLs? This restriction was removed after it became common practice for some web servers to accept a tilde plus username in the path to indicate a user's personal web directory.

If, despite these warnings, you still insist on writing your own decoding code yourself, at least place it in appropriately named subroutines. Granted, some of these actions take only a line or two of code, but the code is quite cryptic, and these operations should be clearly labeled.