If you are familiar with HTML, a simple XML document shouldn't be too difficult to understand.
<?xml version="1.0" standalone="no"?>
<!DOCTYPE accounts SYSTEM "simple.dtd">
<accounts>
  <customer>
    <name>Bobby Five</name>
    <accountNumber>4456</accountNumber>
    <balance>111.32</balance>
  </customer>
</accounts>
In the first line, the code between the <?xml and the ?> is called an XML declaration. This declaration contains special information for the XML processor (the program reading the XML) indicating that this document conforms to Version 1.0 of the XML standard. In addition, the standalone="no" attribute informs the program that an outside DTD is needed to correctly interpret the document.
The second line is the DOCTYPE declaration. It identifies the root element (accounts in our example) and the DTD for the document. The root element is the element in the document that contains all other elements. It must be unique, which means it may be used only once in the document. All XML documents must have a root element. The root element in HTML and XHTML documents is html, since the whole document is contained within <html> tags.
The last part of the declaration is a pointer to the DTD itself. The SYSTEM identifier points to the DTD resource by location (its URL). In our example, the DTD of the document resides in a separate local file named simple.dtd. As an alternative, some declarations use the PUBLIC identifier to point to the DTD (or other resource) by a unique name. The advantage to using PUBLIC is that it is still valid if the location of the resource changes. Unfortunately, current browsers do not handle PUBLIC identifiers well, so it is always good at least to provide a URL as a backup.
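To make the parts of the DOCTYPE declaration concrete, here is a minimal sketch that reads them back with Python's standard xml.dom.minidom module. The helper name describe_doctype is invented for this illustration, and the parser is assumed not to fetch simple.dtd; it only reports what the declaration says.

```python
from xml.dom import minidom

# The example document from this section, including its DOCTYPE declaration.
DOCUMENT = """<?xml version="1.0" standalone="no"?>
<!DOCTYPE accounts SYSTEM "simple.dtd">
<accounts>
  <customer>
    <name>Bobby Five</name>
    <accountNumber>4456</accountNumber>
    <balance>111.32</balance>
  </customer>
</accounts>"""


def describe_doctype(xml_text):
    """Return the root name and identifiers given in the DOCTYPE declaration.

    Hypothetical helper for illustration; returns None when the document
    has no DOCTYPE declaration at all.
    """
    document = minidom.parseString(xml_text)
    doctype = document.doctype
    if doctype is None:
        return None
    return {
        "root": doctype.name,            # root element named in the DOCTYPE
        "system_id": doctype.systemId,   # location given after SYSTEM
        "public_id": doctype.publicId,   # name given after PUBLIC, if any
    }


info = describe_doctype(DOCUMENT)
print(info["root"])
print(info["system_id"])
```

Because the example uses SYSTEM rather than PUBLIC, only the system identifier is filled in.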
Together, the XML and DOCTYPE declarations are often referred to as the document prolog, which is optional in an XML document. The remainder of the example document contains content tagged according to the elements and rules of the specified DTD.
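As a rough sketch of how an XML processor might consume the content that follows the prolog, the following uses Python's standard xml.etree.ElementTree module to walk the example document. The function name read_customers and the choice to convert the balance to a number are assumptions made for this illustration, not something the DTD requires.

```python
import xml.etree.ElementTree as ET

# The example document from this section.
DOCUMENT = """<?xml version="1.0" standalone="no"?>
<!DOCTYPE accounts SYSTEM "simple.dtd">
<accounts>
  <customer>
    <name>Bobby Five</name>
    <accountNumber>4456</accountNumber>
    <balance>111.32</balance>
  </customer>
</accounts>"""


def read_customers(xml_text):
    """Return the root element name and one dictionary per customer element."""
    root = ET.fromstring(xml_text)
    customers = []
    for customer in root.findall("customer"):
        customers.append({
            "name": customer.findtext("name"),
            "account_number": customer.findtext("accountNumber"),
            # Treating the balance as a number is an assumption of this sketch.
            "balance": float(customer.findtext("balance")),
        })
    return root.tag, customers


root_name, customers = read_customers(DOCUMENT)
print(root_name)
print(customers)
```

Everything the function returns comes from inside the single root element, which is why a document needs exactly one.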
Browsers often recover from sloppily written or illegal HTML. This is not the case with XML documents. Because XML languages vary, the rules for coding the document need to be followed to the letter in order to ensure proper interpretation by the XML client. When a document follows the XML markup rules, it is said to be well-formed.
The primary rules for a well-formed XML document are:
There may be no white space (character spaces or line returns) before the XML declaration.
All element attribute values must be in quotation marks (either single or double quotes).
Tags and attributes are case-sensitive; for example, <par>, <PAR>, and <Par> are considered to be three different tags.
An element must have both an opening and closing tag, unless it is an empty element.
If a tag is a standalone empty element, it must contain a closing slash before the end of the tag (for example, <img/>).
All opening and closing tags must nest correctly and not overlap.
The document must have a single root element, a unique element that encloses the entire document. The root element may be used only once in the document.
Isolated markup characters (e.g., <, &, and >) are not allowed in text; use the equivalent standard character entities instead. Table 30-1 lists the predefined character entities in XML.
Entity | Char | Notes |
---|---|---|
&amp; | & | Must not be used inside processing instructions |
&lt; | < | |
&gt; | > | Use after ]] in normal text and inside processing instructions |
&quot; | " | Use inside attribute values quoted with " |
&apos; | ' | Use inside attribute values quoted with ' |
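Replacing markup characters by hand is error-prone, so it is usually left to a library. The sketch below uses escape and unescape from Python's standard xml.sax.saxutils module; the wrapper names to_xml_text and from_xml_text are made up for this example. By default escape handles only &, <, and >, so the two quote entities are passed in as an extra mapping.

```python
from xml.sax.saxutils import escape, unescape

# escape() always replaces &, < and >; these mappings add the two quote entities.
QUOTES_TO_ENTITIES = {'"': "&quot;", "'": "&apos;"}
ENTITIES_TO_QUOTES = {"&quot;": '"', "&apos;": "'"}


def to_xml_text(text):
    """Replace the five markup characters with their predefined entities."""
    return escape(text, QUOTES_TO_ENTITIES)


def from_xml_text(text):
    """Turn the five predefined entities back into plain characters."""
    return unescape(text, ENTITIES_TO_QUOTES)


original = 'Fish & "Chips" <now>'
encoded = to_xml_text(original)
print(encoded)
print(from_xml_text(encoded))
```

The ampersand is replaced before the other characters, so entities that are already being written out are not escaped a second time.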
You can check whether the syntax of your XML document is correct using a well-formedness checker (also called a nonvalidating parser). Parsers are built into Netscape 6 and Internet Explorer 5.5. You may also want to check out the list of nonvalidating parsers provided by the Web Developer's Virtual Library at http://wdvl.com/Software/XML/parsers.html.
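If you would rather check a document from a script, a nonvalidating parser can be called directly. The following is a minimal sketch built on Python's standard xml.etree.ElementTree module; the function name is_well_formed is an invention for this example. It reports only whether parsing succeeded and does not check the document against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False if it reports an error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Each sample breaks (or follows) one of the rules listed above.
samples = {
    "nested correctly": "<a><b></b></a>",
    "overlapping tags": "<a><b></a></b>",
    "unquoted attribute": "<a href=top></a>",
    "mismatched case": "<par></PAR>",
    "two root elements": "<a/><b/>",
    "isolated ampersand": "<p>AT&T</p>",
    "escaped ampersand": "<p>AT&amp;T</p>",
    "space before declaration": ' <?xml version="1.0"?><a/>',
}

for label, text in samples.items():
    print(label, is_well_formed(text))
```

A real checker would normally also report the line and column of the first error, which the caught ParseError makes available.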
With XML, your document may use tags that come from different "types" of XML documents. For example, you might have an XHTML document that contains some math expressions written using the MathML XML dialect. But in this case, how can you differentiate between an <a> tag coming from XHTML (an anchor) and an <a> tag that might come from MathML (an absolute value)?
The W3C anticipated such "collisions" and responded by creating the namespace convention. A namespace is a group of element and attribute names that is unique for each XML dialect. Namespaces take names that look just like URLs (they are not links to actual documents, however) to ensure uniqueness and provide information about the organization that maintains the namespace. When you reference elements and attributes in your document, the browser looks them up in the namespace to find out how they should be used.
Namespaces are declared in an XML document using the xmlns attribute. You can establish the namespace for a whole document or an individual element. Typically, the value of the xmlns attribute is a reference to the URL-like namespace. This example establishes XHTML as the default namespace for the document:
<html xmlns="http://www.w3.org/1999/xhtml">
If you need to include math markup, you can apply the xmlns attribute within the specific tag, so the browser knows to look up the element in the MathML namespace (not XHTML):
<div xmlns="http://www.w3.org/1998/Math/MathML">46/100</div>
If you plan to refer to a namespace repeatedly within a document, you can declare the namespace and give it a label just once at the beginning of the document. Then refer to it in each tag by placing the label before the tag name, separated by a colon (:). For example:
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:math="http://www.w3.org/1998/Math/MathML">
The full namespace can now be shortened to math later in the document. The result is much tidier code (and smaller file sizes!):
<math:div>46/100</math:div>
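To see how a namespace-aware parser keeps same-named elements apart, consider this sketch using Python's standard xml.etree.ElementTree module. The small document is assembled for illustration from the declarations above, and the x and m prefixes in the lookup table are arbitrary choices for this example; only the namespace names themselves have to match. The math:div element simply follows this section's example.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
MATHML = "http://www.w3.org/1998/Math/MathML"

# One default namespace (XHTML) and one labeled namespace (math).
DOCUMENT = f"""<html xmlns="{XHTML}" xmlns:math="{MATHML}">
  <body>
    <div>An XHTML division</div>
    <math:div>46/100</math:div>
  </body>
</html>"""

root = ET.fromstring(DOCUMENT)

# The parser stores each name together with its namespace: {namespace}name.
print(root.tag)

# Prefixes used for searching are local to this script, not to the document.
prefixes = {"x": XHTML, "m": MATHML}
xhtml_divs = root.findall(".//x:div", prefixes)
math_divs = root.findall(".//m:div", prefixes)

print([element.text for element in xhtml_divs])
print([element.text for element in math_divs])
```

Although both elements are written as div, the parser treats them as different names because each one is paired with a different namespace.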
Copyright © 2002 O'Reilly & Associates. All rights reserved.