Programs vs. markup or why HTML authoring is not programming

The words “program” and “programming” are often used confusingly. This document tries to characterize what computer programs and programming languages are and how they differ from markup, both presentational and logical markup. This hopefully helps in understanding, for example, the different roles of HTML and programming languages like JavaScript and Perl in HTML authoring. The difference has some legal impact, too.

The simple (?) question: is HTML a programming language?

It is not uncommon to see people call HTML a programming language, or call HTML authoring programming. In various classifications, HTML might be placed in a section titled “Programming”. Even documents purporting to be HTML tutorials or references may say so. However, few people are consistent enough in such usage to call HTML documents programs; this might indicate that they don’t really mean that HTML authoring is programming.

It’s hard to tell what is behind this, but calling HTML a programming language might reflect the use of the word “programming” to mean just writing something that will be processed by computers rather than people; after all, we might (reasonably) say that writing HTML is coding. Alternatively, perhaps the varying meanings of the word “program” in everyday language (“TV program”, “study program” etc.) confuse things. Or it might reflect a misunderstanding caused by the fact that HTML is used in conjunction with programming. In particular, an HTML document might contain a program embedded in it, typically a JavaScript program, or an HTML document might be generated by a program, typically a Perl program (script). People may miss the essential distinction between HTML and constructs embedded in it. For example, the HTML markup

<input name="x" size="30" style="width:100%" onclick="check()">

contains, as the value of the onclick attribute, a function invocation in a programming language. The attribute itself is part of HTML syntax, but its value is something external to HTML, just as image formats are. (Moreover, the markup contains an expression, width:100%, in yet another language, a style sheet language. This is just to illustrate that HTML can be confused not only with programming languages but with other things external to it.)
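The point can be illustrated with a minimal sketch in Python, using the standard library module html.parser. The class name AttributeCollector and the function collect_attributes are hypothetical names chosen for this illustration. The sketch assumes nothing more than that an HTML parser treats attribute values as strings: it never calls check() and never interprets width:100%.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collects the attributes of every start tag as plain name/value pairs."""

    def __init__(self):
        super().__init__()
        self.attributes = {}

    def handle_starttag(self, tag, attrs):
        # The parser hands over attribute values as opaque strings.
        # Nothing here runs check() or applies the style declaration.
        self.attributes.update(attrs)


def collect_attributes(markup):
    """Parse a piece of markup and return its attributes as a dictionary."""
    parser = AttributeCollector()
    parser.feed(markup)
    parser.close()
    return parser.attributes


attributes = collect_attributes(
    '<input name="x" size="30" style="width:100%" onclick="check()">'
)
for name, value in attributes.items():
    print(f"{name} = {value!r}")
```

To the HTML parser, the value of onclick is just text, exactly as the value of name is. Only a separate component, such as a JavaScript interpreter in a browser, would treat it as something to execute.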

What HTML specifications say about the language

No HTML specification has ever called HTML a programming language, or anything like that.

There are somewhat different views on what HTML is, or should be. The first (!) HTML specification, HTML 2.0, characterized the language as follows:

The HyperText Markup Language (HTML) is a simple data format used to create hypertext documents that are portable from one platform to another. HTML documents are SGML documents with generic semantics that are appropriate for representing information from a wide range of domains.

That specification is a great improvement in conceptual clarity over its successors, HTML 3.2, HTML 4 and XHTML 1.0. But none of those specifications calls HTML a programming language. Instead, they say it is a markup language.

Let’s see how the American National Standard Dictionary of Information Technology (ANSDIT) defines what a markup language is:

markup
Text added to the data of a document to convey information about the document; for example: tags, processing instructions, and hyperlinks.

markup language
(1) A text-formatting language designed to transform raw text into structured documents, by inserting procedural and descriptive markup into the raw text. (2) A language designed to describe or transform in space or time data, text, or objects into structured data, text, or objects, for example: SGML, HTML, VRML.

So markup is, to put it briefly, information, not instructions. While it could contain “processing instructions”, in the SGML sense for example, these wouldn’t really be comparable to programming. As the HTML 4 specification mentions, in a passage that in effect lists SGML features not supported by HTML 4 user agents, “Processing instructions are a mechanism to capture platform-specific idioms”. The examples there suggest things like font face and page eject, i.e. things quite comparable to presentational markup.

But doesn’t markup mean instructions to computers?

Markup can be divided into two major categories: descriptive (or logical, or structural) markup, which describes the structure of a document in some way, and procedural (or physical, or presentational) markup, which specifies how the document should be presented physically. Obviously, procedural markup is inevitably device-dependent in some sense, or at least dependent on some general properties of the presentation medium. Page eject does not make sense in speech. On the other hand, descriptive markup which e.g. divides a document into major sections could be mapped to different presentations (say, page eject or pause or a divider rule or image between sections).

For more on this, see Dmitri Kirsanov’s Procedural and Descriptive Markup and the deeper explanation by Robin Cover in SGML: A Textual Representation for Information Structure; Part 2: The Axiological Foundations of SGML.

Neither descriptive nor procedural markup is programming, though procedural markup might be somewhat comparable to programming in some respects. And HTML is essentially descriptive; attempts to use it for procedural markup can have only rather limited success, no matter how popular such attempts might be.

To compare with natural language constructs, “This car has four doors” is descriptive; “open all the doors!” is procedural. Neither implies the other. And HTML constructs are generally descriptive, saying things like “this is a heading”, instead of saying “show this in such-and-such a way”. A browser can be programmed to process descriptive markup in a particular way. And for obvious reasons, a browser generally displays headings in some emphatic way; and similarities between browsers in this respect may lead people into thinking that markup like <h1>...</h1> means some particular font size etc. – but it really doesn’t. A browser might just as well be programmed or configured to display first level headings in normal font but distinctive color, and this might actually be better in a very small (handheld) device. To take another example, a browser could present a link as underlined blue text that can be clicked on so that a new page then appears. But that’s just one possibility. Another possibility is that an indexing robot has been programmed to follow all links in a document for indexing purposes (without anyone clicking or displaying anything).
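As a rough sketch of this idea, the following Python code (standard library only) lets two different programs process the same descriptive markup. The sample document and the names PlainTextPresenter, LinkCollector, present and collect_links are hypothetical, invented for this illustration. The first program stands for one possible presentation, and the second for an indexing robot that displays nothing at all.

```python
from html.parser import HTMLParser

# A hypothetical sample document, used only for this illustration.
DOCUMENT = '<h1>News</h1><p>See <a href="/archive.html">the archive</a>.</p>'


class PlainTextPresenter(HTMLParser):
    """One possible presentation: headings in capitals, links in brackets."""

    def __init__(self):
        super().__init__()
        self.output = []
        self.in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_heading = True
        elif tag == "a":
            self.output.append("[")

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_heading = False
            self.output.append("\n")
        elif tag == "a":
            self.output.append("]")

    def handle_data(self, data):
        # The choice of capitals for headings is made here, in the program,
        # not in the markup.
        self.output.append(data.upper() if self.in_heading else data)


class LinkCollector(HTMLParser):
    """An indexing robot's view: only the link targets matter."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def present(markup):
    presenter = PlainTextPresenter()
    presenter.feed(markup)
    presenter.close()
    return "".join(presenter.output)


def collect_links(markup):
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.links


print(present(DOCUMENT))
print(collect_links(DOCUMENT))
```

The markup is identical in both cases. What differs is the program that reads it, and all the behavior (capitals, brackets, link following) lives in that program.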

It is true that there are some (currently deprecated) constructs in HTML that can be regarded as “commands” or “instructions” in a sense. One might say that <font color="red"> is an “instruction” to turn font color to red. I wouldn’t say so – it is more natural to interpret it as a suggestion, or hint, concerning presentation – but if you do, then you might say that a browser is an interpreter that executes such instructions. But that would be only very remotely, if at all, analogous to, say, a Perl interpreter executing a Perl program (script), which is written in a full-blooded programming language.

Note that for the width attribute in HTML, for example, the specifications explicitly say that it gives the “suggested” or “recommended” width (of a table cell, for example). And experience shows that browsers actually treat such values that way, often overriding the values suggested by authors, sometimes for good reasons, sometimes not.

And the categorization of a language is to be judged according to its characteristic and typical constructs. As a whole, HTML is clearly a markup language which is declarative (“here’s a block quotation…”, “here’s a heading”, …), not procedural/imperative (“indent so-and-so”, “increase font size”, …). Even if we regarded and used HTML as a procedural markup language (and for such a purpose, HTML is remarkably limited), this wouldn’t make it a programming language or turn HTML documents into programs. An MS Word document contains procedural markup – in a specific binary format – for document appearance. (The use of binary format is not essential here; we might just as well consider the RTF format, which is a physical markup language based on textual tags.) If HTML documents were programs, MS Word documents would be programs with much stronger reason – and I’m not even referring to macros. So would PDF documents, TeX documents, nroff documents, etc. Even procedural markup is not programming; so surely structural markup isn’t either.

It might be argued that a presentational markup language is an interpreted programming language. And it is true that the concepts “programming language” and “program” are somewhat vague. A program in the strictest sense of the word, a binary program, is a sequence of machine instructions directly executed by computer hardware. In a broader sense, a program could be a “source program” written in a language like Fortran, C, or Cobol and compiled (translated) into a binary program. In an even broader sense, we might dispense with the compilation, if there is a program (in the strictest sense of the word!), an interpreter, that reads a “source” program and executes it interpretively, i.e. performing the actions prescribed in the source program. This makes things somewhat relative. The same source program could be “run” either via compilation or by an interpreter. And since virtually anything can be interpreted, after assigning some meaning to it, we could go to the extremes and say that, for example, any piece of text is a program. After all, we could interpret the letter “a” as an instruction to print the letter “a” or, to make it more exciting, the letter “b”, or some image.
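The extreme case can be made concrete with a small Python sketch. The function interpret and its table of meanings are hypothetical, made up for this illustration. It treats any piece of text as a “program” by assigning an arbitrary meaning to individual letters, here the rule that “a” means “print the letter b”.

```python
def interpret(text, meanings):
    """Interpret each character of text according to an arbitrary meaning table."""
    output = []
    for character in text:
        # Characters without an assigned meaning are simply copied.
        output.append(meanings.get(character, character))
    return "".join(output)


# Hypothetical rule from the text: the letter "a" means 'print the letter "b"'.
print(interpret("a plain piece of data", {"a": "b"}))
```

All the behavior resides in interpret and in the table passed to it. The text being “executed” contributes nothing but data, which is why this kind of interpretability cannot by itself make something a program.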

This reductio ad absurdum hopefully indicates that we need to draw the line between interpreted programs and mere data structures somewhere. We might say that at the very minimum, a programming language has some control structures for sequentiality, conditionality, and repetition as well as some methods for storing and retrieving data during processing. Is there any doubt about where markup languages belong then? In HTML, you cannot compute 1+1, or do much branching, or repeat anything. HTML has tables, but only as static collections of data.
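For comparison, here is a minimal sketch of those features in Python, which is just one arbitrary choice of programming language. The function sum_of_even_numbers is a hypothetical example invented for this illustration. The comments point out the sequencing, conditionality, repetition and data storage, none of which has a counterpart in HTML.

```python
def sum_of_even_numbers(limit):
    """Add up the even numbers from 1 to limit, inclusive."""
    # Storing data during processing.
    total = 0
    # Repetition: the body runs once for each number.
    for number in range(1, limit + 1):
        # Conditionality: only even numbers are counted.
        if number % 2 == 0:
            # Retrieving the stored value and storing an updated one.
            total = total + number
    # Sequentiality: this statement runs only after the loop has finished.
    return total


print(1 + 1)
print(sum_of_even_numbers(10))
```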

Is HTML coding, then?

It is reasonable to say that HTML markup is code (and writing HTML markup is coding), provided that people understand that it is comparable to using coded notations when talking or writing. Think about the use of product codes, or using special code books when sending telegrams, so that short coded presentations stand for long statements, or using colors as codes so that red means “stop” or “warning” or “hot”. It’s a matter of using some notational system which has been specifically agreed upon. (Actually, natural languages are not completely different from codes; they too are based on agreements, just more vague and implicit.)

Computer programs are often called “code” – we often say “source code” and “object code” (i.e. program in machine language) – so care must be taken to avoid the idea that being code means being program code. Even the phrase “source code” makes sense in conjunction with a markup language: it can be used to clarify that we refer to an HTML document as containing markup, rather than the way it might be displayed (or spoken). Some people also say that HTML is compiled, but this is quite misleading. It cannot be compiled in the sense that programming languages are compiled. Without going into details here, let’s just say this: it’s a matter of putting an HTML document as data into a browser or some other program and perhaps saving the program in such a state for efficiency. It’s packaging, not compilation.

Programs and data

Considering the distinction between programs and data, where does HTML markup fall in that categorization? Since the markup applies to some textual data, isn’t it program rather than data?

The categorization, though often useful, can be misleading. Programs are just a special case of data – they can be processed in various ways, such as being copied onto diskettes or sent over the Internet, just as other data can. But programs are data that can be executed as machine instructions, or executed in interpretive mode by an interpreter, or compiled into machine instructions by a compiler. (We can of course decide to use the word “data” in a limited meaning too, as ‘any data which is not a program’.)

In particular, “data” (in the general sense, or as opposed to programs) does not mean only the simple constituents like characters and numbers on which data processing operates at the low level. Some confusion may have been caused by the use of the term “data character” in conjunction with markup, denoting the plain text content of a document as opposed to those characters which are part of markup. In the HTML element <h2>XYZ</h2>, only XYZ are data characters in this sense while the rest constitutes markup. But this does not turn markup into programs. It’s comparable to writing a margin note “this is a 2nd level heading”. Similarly, markup like <ins>the</ins> is comparable to using brackets, i.e. [the], in some styles of writing to indicate inserted text. Surely you don’t do any programming if you put brackets around words that you add into a quotation for clarity.

Write down a hundred times:
HTML is a data format, not a programming language.
You are allowed to use any programming language to write a loop that writes that text 100 times. You are also allowed to try to do that in HTML.
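For the first option, a minimal sketch in Python might look like this. The names SENTENCE and write_lines are hypothetical, and Python is just one arbitrary choice among programming languages.

```python
SENTENCE = "HTML is a data format, not a programming language."


def write_lines(sentence, times):
    """Return a list containing the sentence repeated the given number of times."""
    lines = []
    # The loop is the part that has no counterpart in HTML.
    for _ in range(times):
        lines.append(sentence)
    return lines


lines = write_lines(SENTENCE, 100)
print("\n".join(lines))
```

In HTML, the only way to get the same result is to write the sentence out a hundred times, or to have a program such as this one generate the markup.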

Date of creation: 2000-08-25. Last updated: 2002-10-14 and 2005-08-08 and 2005-11-16 (minor fixes).

This page belongs to the free information site IT and communication by Jukka “Yucca” Korpela.