The lost art of XML

Preamble

There exists a peculiar amnesia in software engineering regarding XML. Mention it in most circles and you will receive knowing smiles, dismissive waves, the sort of patronizing acknowledgment reserved for technologies deemed passé. "Oh, XML," they say, as if the very syllables carry the weight of obsolescence. "We use JSON now. Much cleaner."

This is nonsense.

XML was not abandoned because it was inadequate; it was abandoned because JavaScript won. The browser won. And in that victory, we collectively agreed to pretend that a format designed for human readability in a REPL was suitable for machine-to-machine communication, for configuration, for anything requiring rigor. We relinquished logical formalism for the convenience of our tooling.

The Case for XML

Consider what XML actually offers, what we surrendered in our rush toward minimalism:

Schemas. XML Schema Definition (XSD) provides genuine type checking at the document level. You can specify that an element must contain an integer, that it must appear exactly once, that certain attributes are required. The schema is itself a document; it can be validated, versioned, referenced. When you receive an XML document, you can verify its structure before you ever parse its content. This is not a luxury. This is basic engineering hygiene.
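
To make this concrete, here is a minimal sketch of such a schema (the vocabulary is illustrative, not drawn from any real system):

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="person">
    <xs:complexType>
      <xs:sequence>
        <!-- Must appear exactly once, and must contain a string. -->
        <xs:element name="name" type="xs:string" minOccurs="1" maxOccurs="1"/>
        <!-- Must contain an integer; the validator enforces this, not your code. -->
        <xs:element name="age" type="xs:integer"/>
      </xs:sequence>
      <!-- Required attribute; documents lacking it fail validation. -->
      <xs:attribute name="id" type="xs:ID" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>

A validating parser rejects any document that violates these constraints before your application code ever sees it.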

JSON has no such mechanism built into the format. Yes, JSON Schema exists, but it is an afterthought, a third-party addition that never achieved universal adoption. Most JSON is validated (if at all) through ad-hoc code that checks for the presence of expected keys and hopes for the best. This is insanity masquerading as pragmatism.

Namespaces. XML allows you to compose documents from multiple schemas without collision. You can embed XHTML inside a custom vocabulary, reference external definitions, maintain clear boundaries between different semantic domains. This is not theoretical; this is how standards like SVG, MathML, and SOAP actually work in practice.
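
For instance, a custom report vocabulary can embed XHTML without any name collisions (the report namespace here is invented for illustration; the XHTML namespace is the real one):

<report xmlns="urn:example:reports"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <title>Quarterly Summary</title>
  <body>
    <xhtml:p>Revenue grew in <xhtml:em>every</xhtml:em> region.</xhtml:p>
  </body>
</report>

Each vocabulary keeps its own definitions; a processor that understands XHTML can render the paragraph without knowing anything about reports.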

JSON has no answer to this. If two libraries use the same key name, you improvise. You prefix. You nest arbitrarily. You pray.

Comments. XML supports comments as a first-class feature. You can annotate your configuration, explain why a particular value exists, leave notes for future maintainers. JSON does not support comments. The official specification forbids them. The rationale, as I have repeatedly heard, is that comments would make parsing more complex.
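
The feature is as mundane as it is useful; a hypothetical configuration entry:

<!-- Timeout in seconds. Raised from 30 after repeated failures under load. -->
<timeout>60</timeout>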

Self-description. An XML document carries its schema with it, or references it explicitly. The structure is declarative. The types are manifest. You can hand someone an XML file and they can, with reasonable effort, understand what it represents without consulting external documentation.
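
The reference travels with the document itself. For a schema without a target namespace, like the one sketched earlier, the instance document simply points at it:

<person xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:noNamespaceSchemaLocation="person.xsd"
        id="p1">
  <name>Alice</name>
  <age>30</age>
</person>

Any validating parser can fetch the schema, check the document against it, and report precisely which constraint failed.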

JSON is a series of nested dictionaries and arrays with string keys. What does "status": 1 mean? What values are valid for "type"? Is "timestamp" an integer or a string? You will need to read the API documentation. You will need to hope that documentation exists and is current.

The S-Expression Connection

For those who have spent time with Lisp, XML's structure is immediately familiar. It is essentially s-expressions with angle brackets instead of parentheses. An element is a tagged list; attributes are metadata; nesting is composition. The mapping is direct:

(person (name "Alice") (age 30))

<person>
  <name>Alice</name>
  <age>30</age>
</person>

Or with attributes:

<person name="Alice" age="30" />

This is not accidental. XML inherits from SGML, which inherits from earlier markup traditions, but the fundamental insight (that data can be represented as nested, tagged structures) is the same insight that drives Lisp's power. Code as data. Structure as meaning.

JSON, by contrast, is an object literal from JavaScript. It is a notation for initializing dictionaries. It was never designed to be a data interchange format; it was promoted to that role because it was already in the browser and developers were already familiar with it. Convenience over correctness.

Why We Chose Poorly

The abandonment of XML in favor of JSON and other lobotomized formats like YAML is a case study in how developer experience can override technical merit. XML is verbose. It requires closing tags. It looks "heavy" compared to JSON's minimalism. These are aesthetic complaints dressed up as engineering concerns.

Verbosity is a cost, certainly, but it is not a vice when it serves clarity; the two are not mutually exclusive. Closing tags make structure explicit; they eliminate ambiguity in parsing. The angle brackets are not there to annoy you; they are there to separate markup from content, to make the document's structure immediately visible.

YAML, the other pretender, manages to be both ambiguous and fragile. Indentation-sensitive syntax in a data format? Implicit typing that guesses whether "no" means a boolean or a string? A specification so complex that implementations disagree on edge cases? This is the context-dependent parsing we supposedly left behind.

On Developer Convenience and Self-Deception

There is a distinction that the industry refuses to acknowledge: developer convenience and correctness are different concerns. They are not opposed, necessarily, but they are not the same thing. A format can be inconvenient to type and still be the right choice. A format can be pleasant to work with and still be fundamentally inadequate.

We have spent billions of dollars and countless engineering hours making terrible technologies fast. The JVM is perhaps the canonical example; a virtual machine originally designed for remote controls, turned into the foundation of enterprise software through sheer force of optimization, millions of developer hours and central bank funny money. Decades of work went into JIT compilation, garbage collection algorithms, escape analysis, all to make a fundamentally awkward platform perform acceptably. And it worked! The JVM is now genuinely fast.

But imagine if a fraction of that effort had gone into something better from the start. Imagine if we had chosen a platform designed for the problems we actually needed to solve, rather than retrofitting a toy into production use. We spent billions making wrong choices work, when we could have spent millions making the right choice pleasant.

This is the pattern with JSON. We chose it because it was convenient, because it was already in the browser, because developers already understood object literals. Then, when its limitations became apparent, we spent enormous effort working around them: creating validation libraries, inventing type systems (TypeScript), building code generators for API clients, developing entire frameworks to manage the chaos of untyped data structures.

We could have just used XML. The schema validation was already there. The type systems were already there. The tooling was already there. But XML looked ugly, and closing tags felt verbose, so we chose JSON and then spent years rebuilding what XML already provided.

The rationalization is remarkable. "JSON is simpler", they say, while maintaining thousands of lines of validation code. "JSON is more readable", they claim, while debugging subtle bugs caused by typos in key names that a schema would have caught immediately. "JSON is lightweight", they insist, while transmitting megabytes of redundant field names that binary XML would have compressed away.

This is not engineering. This is fashion masquerading as technical judgment.

The Physical and the Conceptual

Here is another confusion: we have allowed the physical representation to dictate the conceptual model. XML's angle brackets are a serialization choice; they are not the essence of what XML is. The essence is the Information Set, the abstract model of elements, attributes, and content. How that model is physically encoded (text with angle brackets, binary with Fast Infoset, compressed with EXI) is a separate concern.

But because the text serialization looks "heavy", we rejected the entire model. We threw away schemas, namespaces, validation, self-description, all because we didn't like angle brackets. This is like rejecting relational databases because you don't like SQL. SQL is not the relational model. Truth be told, most of the hatred of XML comes from the protocols and arcane tools people had to endure over the years: a friend once told me about the horrors of building an integration with the central bank of Nigeria, which apparently used SOAP. But these are implementation problems. It is worth examining how much we have committed to a model, but it is also important to realize that the formalism does not bear the blame for how the central bank of Nigeria applies it.

JSON conflates these layers. It is simultaneously a data model (nested objects and arrays) and a serialization format (braces and brackets with string keys). There is no abstract model separate from the text representation. This means you cannot have binary JSON that preserves the same semantics, because the semantics are the syntax. Every binary JSON-like format (MessagePack, BSON, etc.) is a different model with different tradeoffs.

XML separates these concerns properly. The Information Set is the model. Text, binary, compressed binary are all just serializations of that model. You can choose the serialization appropriate to your constraints without changing what the data means. This is correct layering. This is how systems should be designed.

But correct layering is not convenient when you're writing a quick API endpoint, so we chose the conflated mess and called it progress.

What We Lost

When we discarded XML, we lost:

  • First-class validation through schemas
  • Namespacing and composition
  • Comments and self-documentation
  • A separation between structure and content
  • Tooling that could verify correctness before runtime
  • A format that could evolve without breaking existing consumers

What we gained:

  • Fewer characters
  • Easier hand-writing for trivial cases
  • Native parsing in JavaScript

Fantastic trade.

The Binary Answer

One of the common complaints about XML is its verbosity, particularly for network transmission. "All those closing tags waste bandwidth". But even accepting this concern at face value (we shouldn't), XML addressed it years ago.

Fast Infoset, standardized as ITU-T Rec. X.891 and ISO/IEC 24824-1, is a binary encoding of the XML Information Set. It provides the same logical structure, the same schema validation, the same semantic richness, but with dramatically reduced size and parsing overhead. An XML document can be serialized to Fast Infoset for transmission, then deserialized back to standard XML at the receiving end. The schema remains the same. The tooling remains the same. Only the wire format changes.

EXI (Efficient XML Interchange), another W3C standard, goes further. It uses schema-informed compression to achieve sizes competitive with hand-tuned binary formats, while maintaining XML's semantic model. You get type safety, validation, and self-description, with none of the supposed bandwidth penalty.

These formats exist. They are standardized. They have implementations in multiple languages. And yet the industry collectively ignores them in favor of JSON, then invents Protocol Buffers and other binary formats to solve the performance problems that binary XML already solved.

The pattern is instructive: we abandoned XML for being verbose, then, when dealing with JSON proved too painful, we created new binary formats that lack XML's semantic richness, groping toward "we need something, but what is it!?". We could have simply used the binary XML encodings that already existed. But that would require admitting that XML was right all along.

On Practicality

I am not arguing that XML should be used everywhere. There are cases where other formats are appropriate: small data transfers between cooperating services and scenarios where schema validation would be overkill. But these are the exceptions, not the rule.

For anything requiring durability, for anything that will be consumed by multiple systems, for anything where correctness matters more than convenience, XML remains the superior choice. The fact that we collectively pretend otherwise is a testament to our capacity for self-deception.

We chose the format that was easier to type over the format that was harder to misuse. We chose developer convenience over system reliability. And now we act surprised when our JSON APIs drift, when our configurations break silently, when our data lacks the structure we assumed it had.

A Final Point

Microsoft, for all their faults, understood this. MSBuild uses XML. WPF uses XML. .NET's configuration system was XML until they caved to JSON pressure in .NET Core. These were not arbitrary choices; they were recognition that complex systems require complex representations, and that formality in data representation prevents entire classes of errors.

The fact that we now consider this "old-fashioned" says more about our current priorities than about XML's utility. We value keystroke economy over semantic precision. We value familiarity over rigor. We value the appearance of simplicity over actual simplicity, which is the simplicity that comes from clear rules and consistent structure.

XML is not perfect. XPath is baroque; XSLT is its own circle of hell; the various XML-based "standards" spawned by enterprise committees are monuments to over-engineering. But the core format (elements, attributes, schemas, namespaces) remains sound. We threw out the mechanism along with its abuses.

I am tired of lobotomized formats like JSON being treated as the default, as the modern choice, as the obviously correct solution. They are none of these things. They are the result of path dependence and fashion, not considered engineering judgment.

Sometimes the old way was the right way. This is one of those times.