Book HomeCGI Programming with PerlSearch this book

Chapter 2. The Hypertext Transport Protocol

Contents:

URLs
HTTP
Browser Requests
Server Responses
Proxies
Content Negotiation
Summary

The Hypertext Transport Protocol (HTTP) is the common language that web browsers and web servers use to communicate with each other on the Internet. CGI is built on top of HTTP, so to understand CGI fully, it certainly helps to understand HTTP. One of the reasons CGI is so powerful is because it allows you to manipulate the metadata exchanged between the web browser and server and thus perform many useful tricks, including:

We won't cover all of the details of HTTP, just what is important for our understanding of CGI. Specifically, we'll focus on the request and response process: how browsers ask for and receive web pages.

If you are interested in understanding more about HTTP than we provide here, visit the World Wide Web Consortium's web site at http://www.w3.org/Protocols/. On the other hand, if you are eager to get started writing CGI scripts, you may be tempted to skip this chapter. We encourage you not to. Although you can certainly learn to write CGI scripts without learning HTTP, without the bigger picture you may end up memorizing what to do instead of understanding why. This is certainly the most challenging chapter, however, because we cover a lot of material without many examples. So if you find it a little dry and want to peek ahead to the fun stuff, we'll forgive you. Just be sure to return here later.

2.1. URLs

During our discussion of HTTP and CGI, we will be often be referring to URLs , or Uniform Resource Locators. If you have used the Web at all, then you are probably familiar with URLs. In web terms, a resource represents anything available on the web, whether it be an HTML page, an image, a CGI script, etc. URLs provide a standard way to locate these resources on the Web.

Note that URLs are not actually specific to HTTP; they can refer to resources in many protocols. Our discussion here will focus strictly on HTTP URLs.

2.1.1. Elements of a URL

HTTP URLs consist of a scheme, a host name, a port number, a path, a query string, and a fragment identifier, any of which may be omitted under certain circumstances (see Figure 2-1).

Figure 2-1

Figure 2-1. Components of a URL

HTTP URLs contain the following elements:

Scheme

The scheme represents the protocol, and for our purposes will either be http or https. https represents a connection to a secure web server. Refer to the sidebar "The Secure Sockets Layer" later in this chapter.

Host

The host identifies the machine running a web server. It can be a domain name or an IP address, although it is a bad idea to use IP addresses in URLs and is strongly discouraged. The problem is that IP addresses often change for any number of reasons: a web site may move from one machine to another, or it may relocate to another network. Domain names can remain constant in these cases, allowing these changes to remain hidden from the user.

Port number

The port number is optional and may appear in URLs only if the host is also included. The host and port are separated by a colon. If the port is not specified, port 80 is used for http URLs and port 443 is used for https URLs.

It is possible to configure a web server to answer other ports. This is often done if two different web servers need to operate on the same machine, or if a web server is operated by someone who does not have sufficient rights on the machine to start a server on these ports (e.g., only root may bind to ports below 1024 on Unix machines). However, servers using ports other than the standard 80 and 443 may be inaccessible to users behind firewalls. Some firewalls are configured to restrict access to all but a narrow set of ports representing the defaults for certain allowed protocols.

Path information

Path information represents the location of the resource being requested, such as an HTML file or a CGI script. Depending on how your web server is configured, it may or may not map to some actual file path on your system. As we mentioned last chapter, the URL path for CGI scripts generally begin with /cgi/ or /cgi-bin/ and these paths are mapped to a similarly-named directory in the web server, such as /usr/local/apache/cgi-bin.

Note that the URL for a script may include path information beyond the location of the script itself. For example, say you have a CGI at:

http://localhost/cgi/browse_docs.cgi

You can pass extra path information to the script by appending it to the end, for example:

http://localhost/cgi/browse_docs.cgi/docs/product/description.text

Here the path /docs/product/description.text is passed to the script. We explain how to access and use this additional path information in more detail in the next chapter.

Query string

A query string passes additional parameters to scripts. It is sometimes referred to as a search string or an index. It may contain name and value pairs, in which each pair is separated from the next pair by an ampersand (&), and the name and value are separated from each other by an equals sign (=). We discuss how to parse and use this information in your scripts in the next chapter.

Query strings can also include data that is not formatted as name-value pairs. If a query string does not contain an equals sign, it is often referred to as an index. Each argument should be separated from the next by an encoded space (encoded either as + or %20; see Section 2.1.3, "URL Encoding" below). CGI scripts handle indexes a little differently, as we will see in the next chapter.

Fragment identifier

Fragment identifiers refer to a specific section in a resource. Fragment identifiers are not sent to web servers, so you cannot access this component of the URLs in your CGI scripts. Instead, the browser fetches a resource and then applies the fragment identifier to locate the appropriate section in the resource. For HTML documents, fragment identifiers refer to anchor tags within the document:

<a name="anchor" >Here is the content you're after...</a>

The following URL would request the full document and then scroll to the section marked by the anchor tag:

http://localhost/document.html#anchor

Web browsers generally jump to the bottom of the document if no anchor for the fragment identifier is found.

2.1.2. Absolute and Relative URLs

Many of the elements within a URL are optional. You may omit the scheme, host, and port number in a URL if the URL is used in a context where these elements can be assumed. For example, if you include a URL in a link on an HTML page and leave out these elements, the browser will assume the link applies to a resource on the same machine as the link. There are two classes of URLs:

Absolute URL

URLs that include the hostname are called absolute URLs. An example of an absolute URL is http://localhost/cgi/script.cgi.

Relative URL

URLs without a scheme, host, or port are called relative URLs. These can be further broken down into full and relative paths:

Full paths

Relative URLs with an absolute path are sometimes referred to as full paths (even though they can also include a query string and fragment identifier). Full paths can be distinguished from URLs with relative paths because they always start with a forward slash. Note that in all these cases, the paths are virtual paths, and do not necessarily correspond to a path on the web server's filesystem. An example of an absolute path is /index.html.

Relative paths

Relative URLs that begin with a character other than a forward slash are relative paths. Examples of relative paths include script.cgi and ../images/photo.jpg.

2.1.3. URL Encoding

Many characters must be encoded within a URL for a variety of reasons. For example, certain characters such as ?, #, and / have special meaning within URLs and will be misinterpreted unless encoded. It is possible to name a file doc#2.html on some systems, but the URL http://localhost/doc#2.html would not point to this document. It points to the fragment 2.html in a (possibly nonexistent) file named doc. We must encode the # character so the web browser and server recognize that it is part of the resource name instead.

Characters are encoded by representing them with a percent sign (%) followed by the two-digit hexadecimal value for that character based upon the ISO Latin 1 character set or ASCII character set (these character sets are the same for the first eight bits). For example, the # symbol has a hexadecimal value of 0x23, so it is encoded as %23.

The following characters must be encoded:

  • Control characters: ASCII 0x00 through 0x1F plus 0x7F

  • Eight-bit characters: ASCII 0x80 through 0xFF

  • Characters given special importance within URLs: ; / ? : @ & = + $ ,

  • Characters often used to delimit (quote) URLs: < > # % "

  • Characters considered unsafe because they may have special meaning for other protocols used to transmit URLs (e.g., SMTP): { } | \ ^ [ ] `

Additionally, spaces should be encoded as + although %20 is also allowed. As you can see, most characters must be encoded; the list of allowed characters is actually much shorter:

  • Letters: a-z and A-Z

  • Digits: 0-9

  • The following characters: - _ . ! ~ * ' ( )

It is actually permissible and not uncommon for any of the allowed characters to also be encoded by some software. Thus, any application that decodes a URL must decode every occurrence of a percentage sign followed by any two hexadecimal digits.

The following code encodes text for URLs:

sub url_encode {
    my $text = shift;
    $text =~ s/([^a-z0-9_.!~*'(  ) -])/sprintf "%%%02X", ord($1)/ei;
    $text =~ tr/ /+/;
    return $text;
}

Any character not in the allowed set is replaced by a percentage sign and its two-digit hexadecimal equivalent. The three percentage signs are necessary because percentage signs indicate format codes for sprintf, and literal percentage signs must be indicated by two percentage signs. Our format code thus includes a percentage sign, %%, plus the format code for two hexadecimal digits, %02X.

Code to decode URL encoded text looks like this:

sub url_decode {
    my $text = shift;
    $text =~ tr/\+/ /;
    $text =~ s/%([a-f0-9][a-f0-9])/chr( hex( $1 ) )/ei;
    return $text;
}

Here we first translate any plus signs to spaces. Then we scan for a percentage sign followed by two hexadecimal digits and use Perl's chr function to convert the hexadecimal value into a character.

Neither the encoding nor the decoding operations can be safely repeated on the same text. Text encoded twice differs from text encoded once because the percentage signs introduced in the first step would themselves be encoded in the second. Likewise, you cannot encode or decode entire URLs. If you were to decode a URL, you could no longer reliably parse it, for you may have introduced characters that would be misinterpreted such as / or ?. You should always parse a URL to get the components you want before decoding them; likewise, encode components before building them into a full URL.

Note that it's good to understand how a wheel works but reinventing it would be pointless. Even though you have just seen how to encode and decode text for URLs, you shouldn't do so yourself. The URI::URL module (actually it is a collection of modules), available on CPAN (see Appendix B, "Perl Modules"), provides many URL-related modules and functions. One of the included modules, URI::Escape, provides the url_escape and url_unescape functions. Use them. The subroutines in these modules have been vigorously tested, and future versions will reflect any changes to HTTP as it evolves.[2] Using standard subroutines will also make your code much clearer to those who may have to maintain your code later (this includes you).

[2]Don't think this could happen? What if we told you the tilde character (~) was not always allowed in URLs? This restriction was removed after it became common practice for some web servers to accept a tilde plus username in the path to indicate a user's personal web directory.

If, despite these warnings, you still insist on writing your own decoding code yourself, at least place it in appropriately named subroutines. Granted, some of these actions take only a line or two of code, but the code is quite cryptic, and these operations should be clearly labeled.



Library Navigation Links

Copyright © 2001 O'Reilly & Associates. All rights reserved.