The Virtue of Forgiving HTML Parsers
Most of the time, you want code to fail fast if it receives junk data. HTML on the early web is a powerful counterexample.
By Ryan McGreal
Posted January 08, 2010 in Blog (Last Updated January 08, 2010)
Russell makes the important point that it was the ability to view source - especially in the early days of the internet - that made web content easily accessible to anyone who took the time to read the code and to experiment with it to discover how it works.
View-source provides a powerful catalyst for creating a culture of shared learning and learning-by-doing, which in turn helps formulate a mental model of the relationship between input and output faster. Web developers get started by taking some code, pasting it into a file, loading it in a browser and switching between editor and browser after even the most minor changes.
This is a stark contrast with other types of development, notably those that impose a compilation step, in which the process of seeing what was done requires an intermediate action.
In other words, immediacy of output helps build an understanding of how the system will behave, and ctrl-r becomes a seductive and productive way for developers to accelerate their learning in the copy-paste-tweak loop.
The only required equipment is a text editor and a web browser, tools that are free and work together instantly. That is to say, there's no waiting between when you save the file to disk and when you can view the results. It's just a ctrl-r away. [paragraph breaks added]
On reading this, a parallel observation occurred to me: The power of view-source is multiplied when generous parsers forgive errors and render anyway.
The value of a big, broad internet is far greater than the value of a clean, pure internet.
Most programmers will agree that programs or data streams with malformed syntax should fail, and fail fast. The worst thing that can happen with a software application - particularly a complex one - is for errors to pass silently while malformed data flows into data storage, only to produce gibberish - or worse, subtly wrong results - on export.
Passing a string into a function that expects an integer should throw an exception; passing a set of four values into a method that expects three parameters should raise a red flag; and so on.
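As a minimal sketch of this fail-fast principle (the function name and checks here are illustrative, not taken from any particular library):

```python
def scale(value, factor):
    """Fail fast: reject junk input at the boundary instead of
    letting it flow silently into storage or computation."""
    if not isinstance(value, int) or not isinstance(factor, int):
        raise TypeError(
            f"expected integers, got {type(value).__name__} and {type(factor).__name__}"
        )
    return value * factor

print(scale(3, 4))  # -> 12

try:
    scale("3", 4)   # malformed input fails loudly, right away
except TypeError as err:
    print("rejected:", err)
```

The error surfaces at the moment the bad data arrives, which is exactly the behavior most application code should want.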
By contrast, the HTML parsers in every browser are extremely forgiving. If an HTML parser encounters an opening tag for an element that is missing a corresponding closing tag, it simply treats the close tag as implied and renders as if it were there.
Likewise, if the parser encounters a malformed or non-standard element, it will either try to render it somehow by guessing at the code's intention (naturally, different parsers approach such matters differently) or will just ignore it completely and move on to the next element.
As a result, even badly malformed HTML can still produce a readable web page.
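That forgiveness is easy to demonstrate with Python's standard-library html.parser, which shares the tolerant spirit of browser parsers (this is a sketch of the behavior, not how any browser engine is actually implemented):

```python
from html.parser import HTMLParser

# Badly malformed markup: an unclosed <b>, an unknown <blink2> tag,
# an unclosed <p>, and no <html> or <body> wrapper at all.
junk = "<p>Hello <b>world<p>Second paragraph<blink2>???</p>"

class TextExtractor(HTMLParser):
    """Collects whatever text the forgiving parser can recover."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

parser = TextExtractor()
parser.feed(junk)               # no exception, despite all the junk
print("".join(parser.chunks))   # -> Hello worldSecond paragraph???
```

Every scrap of readable content survives; the parser never gives up, it just keeps going.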
Purists bristle at this, insisting that it makes the web a worse place by forcing browser makers to code for errors, encouraging sloppy coding practices, causing the same content to be rendered differently on different browsers, and so on.
They're missing the point. The virtue of forgiving parsers is that they vastly increase the pool of people able and willing to create web content.
If you're already a programmer, HTML syntax is easy enough to understand and produce in valid form.
However, most people who created web content weren't programmers - particularly during the early days of the internet, when it grew exponentially and established the virtuous cycle of positive network externalities that ultimately dragged more professional developers onto the platform.
Rather, they were amateur enthusiasts exploring a new technological domain. Thanks to HTML, view-source and forgiving parsers, the number of people who could create web pages was vastly higher than the number of people who could write computer programs.
It was the rapid democratization of HTML made possible by view-source and forgiving parsers that accounts for much of its success as a language - and of the success of the internet as a platform.
When an HTML parser finds code so bad that it can't render it, the parser just skips it and moves to the next line, in the manner of VB's On Error Resume Next (programmers are welcome to cringe here). Contrast the strictness of XML parsers, which are obliged to fail on encountering malformed code and produce no output at all.
Since HTML rendering in response to an HTTP GET request is essentially idempotent and free of side effects, there's no real harm in continuing to parse after encountering an error - and the positive network effects of doing so are huge.
If producing well-formed, valid markup had remained an arcane ability possessed only by programmers, the web would never have experienced the early growth that transformed it into the reigning standard of a huge and growing public network.
One more thing: HTML provided a gently sloping pathway into programming for many people who might otherwise never have overcome the steep barriers to entry of, say, C, with its verbose syntax and elaborate requirements. (Compiler? What's a compiler? And why does it hate me?)
Again, there are some people who consider this to be a terrible thing - a watering down of an industry that ought to have high barriers to entry. Setting aside my own interest in the matter, I disagree. Getting more computing and networking capability directly into the hands of more people can only increase the rate of technical innovation by deploying more expressive power more widely.