Chapter 1. What is XML?

Table of Contents
1.1. Introduction
1.2. Highlights of XML
1.3. A complete example: The readme DTD

1.1. Introduction

XML (short for Extensible Markup Language) generalizes the idea that text documents are typically structured in sections, sub-sections, paragraphs, and so on. The format of the document is not fixed (as it is, for example, in HTML), but can be declared by a so-called DTD (document type definition). The DTD describes only the rules for how the document may be structured, not how the document is to be processed. For example, if you want to publish a book that uses XML markup, you will need a processor that converts the XML file into a printable format such as PostScript. On the one hand, the structure of XML documents is configurable; on the other hand, there is no longer a canonical interpretation of the elements of the document. For example, one XML DTD might require that paragraphs are delimited by para tags, while another DTD expects p tags for the same purpose. As a result, every DTD requires its own processor.

Although XML can be used to express structured text documents, it is not limited to this kind of application. For example, XML can also be used to exchange structured data over a network, or simply to store structured data in files. Note that XML documents cannot contain arbitrary binary data because some characters are forbidden; for some applications you need to encode binary data as text (e.g. using the base 64 encoding).
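The remark about binary data can be illustrated with a small Python sketch. It is not part of PXP; the element name "blob" and the two helper functions are made up for this illustration, and only the Python standard library is assumed.

```python
import base64
import xml.etree.ElementTree as ET


def wrap_binary(payload: bytes) -> str:
    """Return a small XML document that carries payload as base 64 text."""
    element = ET.Element("blob")  # hypothetical element name
    element.text = base64.b64encode(payload).decode("ascii")
    return ET.tostring(element, encoding="unicode")


def unwrap_binary(xml_text: str) -> bytes:
    """Recover the original bytes from a document made by wrap_binary."""
    element = ET.fromstring(xml_text)
    # An empty payload is serialized as an empty element, so text may be None.
    return base64.b64decode(element.text or "")


# Byte 0 is one of the characters that may not appear in an XML document.
raw = bytes([0, 1, 2, 255])
xml_text = wrap_binary(raw)
restored = unwrap_binary(xml_text)
```

The encoded form uses only letters, digits, and a few punctuation characters, so it can be stored as ordinary character data.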

1.1.1. The "hello world" example

The following example shows a very simple DTD and a corresponding document instance. The document consists of sections, sections consist of paragraphs, and paragraphs contain plain text:

<!ELEMENT document (section)+>
<!ELEMENT section (paragraph)+>
<!ELEMENT paragraph (#PCDATA)>

The following document is an instance of this DTD:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE document SYSTEM "simple.dtd">
<document>
  <section>
    <paragraph>This is a paragraph of the first section.</paragraph>
    <paragraph>This is another paragraph of the first section.</paragraph>
  </section>
  <section>
    <paragraph>This is the only paragraph of the second section.</paragraph>
  </section>
</document>

As in HTML (and, of course, in its grandfather SGML), the "pieces" of the document are delimited by tags, i.e. such a piece begins with <name-of-the-type-of-the-piece> and ends with </name-of-the-type-of-the-piece>, and the pieces are called elements. Unlike in HTML and SGML, neither start tags nor end tags (i.e. the delimiters written in angle brackets) can ever be left out. For example, HTML calls paragraphs simply p, and because paragraphs never contain paragraphs, a sequence of several paragraphs can be written as:

<p>First paragraph
<p>Second paragraph

This is not possible in XML; continuing our example above we must always write

<paragraph>First paragraph</paragraph>
<paragraph>Second paragraph</paragraph>

The rationale behind this is (1) to simplify the development of XML parsers (you need not convert the DTD into a deterministic finite automaton, which is required to detect omitted tags), and (2) to make it possible to parse the document regardless of whether the DTD is known.
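Both points can be tried out with any XML parser that does not read the DTD. The following sketch uses Python's standard xml.etree.ElementTree module instead of PXP; the helper is_well_formed is a name chosen for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if text can be parsed as XML without consulting a DTD."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# HTML-style markup with omitted end tags is rejected ...
html_style = "<document><p>First paragraph<p>Second paragraph</document>"

# ... while fully tagged markup is accepted, although no DTD is known.
xml_style = (
    "<document>"
    "<paragraph>First paragraph</paragraph>"
    "<paragraph>Second paragraph</paragraph>"
    "</document>"
)

paragraphs = [p.text for p in ET.fromstring(xml_style).iter("paragraph")]
```

Because every element is closed explicitly, the parser can build the element tree from the tags alone.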

The first line of our sample document,

<?xml version="1.0" encoding="ISO-8859-1"?>

is the so-called XML declaration. It states that the document follows the conventions of XML version 1.0, and that the document is encoded using characters from the ISO-8859-1 character set (often known as "Latin 1", mostly used in Western Europe). Although the XML declaration is not mandatory, it is good style to include it; everybody sees at first glance that the document uses XML markup and not the similar-looking HTML or SGML markup. If you omit the XML declaration, the parser will assume that the document is encoded as UTF-8 or UTF-16 (there is a rule that makes it possible to distinguish between UTF-8 and UTF-16 automatically); these are encodings of Unicode's universal character set. (Note that PXP, unlike its predecessor "Markup", fully supports Unicode.)
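The effect of the encoding declaration can be observed with a short experiment. The sketch below again uses Python's xml.etree.ElementTree rather than PXP; the sample text and the helper parse_text are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# A paragraph containing one non-ASCII character (e with acute accent).
text = "<paragraph>caf\u00e9</paragraph>"

# Latin 1 bytes with a matching XML declaration.
declared = ('<?xml version="1.0" encoding="ISO-8859-1"?>' + text).encode("iso-8859-1")

# The same paragraph without a declaration, once as Latin 1 and once as UTF-8.
undeclared_latin1 = text.encode("iso-8859-1")
undeclared_utf8 = text.encode("utf-8")


def parse_text(data: bytes):
    """Return the text of the root element, or None if parsing fails."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError:
        return None


results = [parse_text(d) for d in (declared, undeclared_utf8, undeclared_latin1)]
```

The first two inputs yield the same text. The third fails because the parser assumes UTF-8 when there is no declaration, and the Latin 1 byte for the accented letter is not valid UTF-8 in that position.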

The second line,

<!DOCTYPE document SYSTEM "simple.dtd">

names the DTD that is going to be used for the rest of the document. In general, it is possible that the DTD consists of two parts, the so-called external and the internal subset. "External" means that the DTD exists as a second file; "internal" means that the DTD is included in the same file. In this example, there is only an external subset, and the system identifier "simple.dtd" specifies where the DTD file can be found. System identifiers are interpreted as URLs; for instance this would be legal:

<!DOCTYPE document SYSTEM "http://host/location/simple.dtd">

Please note that PXP cannot interpret HTTP identifiers by default, but it is possible to change the interpretation of system identifiers.

The word immediately following DOCTYPE determines which of the declared element types (here "document", "section", and "paragraph") is used for the outermost element, the root element. In this example it is document, so the outermost element must be delimited by <document> and </document>.
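The relationship between the DOCTYPE line and the root element can be inspected programmatically. The following sketch uses Python's xml.dom.minidom, which records the DOCTYPE information but does not load or apply the DTD; it is meant as an illustration only and says nothing about the PXP interface.

```python
from xml.dom import minidom

source = """<!DOCTYPE document SYSTEM "simple.dtd">
<document>
  <section>
    <paragraph>This is a paragraph.</paragraph>
  </section>
</document>"""

dom = minidom.parseString(source)

declared_root = dom.doctype.name          # the word following DOCTYPE
system_id = dom.doctype.systemId          # where the external subset lives
actual_root = dom.documentElement.tagName # the outermost element

# For a valid document the two names must agree.
root_matches = declared_root == actual_root
```

Note that this parser does not complain if the names differ; detecting that is the job of a validating parser.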

The DTD consists of three declarations for element types: document, section, and paragraph. Such a declaration has two parts:

<!ELEMENT name content-model>

The content model is a regular expression which describes the possible inner structure of the element. Here, document contains one or more sections, and a section contains one or more paragraphs. Note that these two element types are not allowed to contain arbitrary text. Only the paragraph element type is declared such that parsed character data (indicated by the symbol #PCDATA) is permitted.

See below for a detailed discussion of content models.
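To make the statement "the content model is a regular expression" concrete, the three declarations of simple.dtd can be rewritten by hand as regular expressions over the names of an element's children. The following Python sketch does this; the table CONTENT_MODELS and the function is_valid are hypothetical names, and the check covers only this one DTD. It is not how PXP validates documents.

```python
import re
import xml.etree.ElementTree as ET

# The content models of simple.dtd as regular expressions over the
# space-terminated names of the child elements.
# None stands for (#PCDATA): character data only, no child elements.
CONTENT_MODELS = {
    "document": re.compile(r"(section )+"),
    "section": re.compile(r"(paragraph )+"),
    "paragraph": None,
}


def is_valid(element) -> bool:
    """Check element and its descendants against CONTENT_MODELS."""
    if element.tag not in CONTENT_MODELS:
        return False  # undeclared element type
    model = CONTENT_MODELS[element.tag]
    children = list(element)
    if model is None:
        return not children  # #PCDATA: text only
    # Only white space may appear between the children.
    texts = [element.text] + [child.tail for child in children]
    if any(t and t.strip() for t in texts):
        return False
    names = "".join(child.tag + " " for child in children)
    if model.fullmatch(names) is None:
        return False
    return all(is_valid(child) for child in children)


good = ET.fromstring(
    "<document><section><paragraph>Text</paragraph></section></document>"
)
bad = ET.fromstring("<document><section>Text</section></document>")
```

Here is_valid(good) holds, while is_valid(bad) does not, because section is declared to contain paragraphs and not text.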

1.1.3. Discussion

As we have seen, there are two levels of description: on the one hand, XML can define rules about the format of a document (the DTD); on the other hand, XML expresses structured documents. There are a number of possible applications: