Serializing on the shoulders of giants

One of the under-appreciated shifts in the web over the past quarter of a century is the slow drift from “HTML as something people enter as part of the document authoring process” to “HTML as a serialization format for browser instructions.”

The history of markup languages is fascinating to me because there are so many different convergent threads. The GML cluster (which gave rise to SGML, HTML, XML…) was human-editable but concerned with capturing document-level meaning…

Setext and its siblings/descendants were about “paving the paths” of markup people used when communicating on Usenet, in email, BBS conversations, etc. Markdown is the inheritor of that, with bold and italics and ## Heading …

And then in the background, relational databases evolved into the place where the vast majority of web content lived (as opposed to discrete HTML documents)… which is where things became particularly interesting and tangled.

…Because those relational databases were often being manipulated via software that used an object-oriented “entities of particular types that have properties and behaviors” model.

Result: Software with an object-oriented entity-relationship model reading content out of a tables-with-cross-references storage model that contains text in a format meant to capture document semantics… and all of it being slammed together into “documents” for a browser to read.

Next-gen rich-structured-content tools tend to favor JSON for content schemas and storage, which amounts to “turning the software object model into the content structure”. Content can be output as text, but it’s not markup meant for human editors, just a serialization format.
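
To make that concrete, here is a minimal sketch of the idea in Python. The node types ("heading", "paragraph") and field names are hypothetical, invented for illustration rather than taken from any particular CMS. The point is that the stored content is just a typed object tree, and HTML only appears at the very end as one possible output.

```python
import json
from html import escape

# Hypothetical content model: the article is a tree of typed nodes, not markup.
article = {
    "type": "article",
    "title": "Serializing on the shoulders of giants",
    "body": [
        {"type": "heading", "level": 2, "text": "History"},
        {"type": "paragraph", "text": "GML gave rise to SGML, HTML & XML."},
    ],
}


def render_node(node):
    """Turn one content node into an HTML fragment, escaping all text."""
    if node["type"] == "heading":
        level = int(node["level"])
        return f"<h{level}>{escape(node['text'])}</h{level}>"
    if node["type"] == "paragraph":
        return f"<p>{escape(node['text'])}</p>"
    raise ValueError(f"unknown node type: {node['type']!r}")


def render_article(doc):
    """Serialize the whole object tree to HTML for a browser."""
    parts = [f"<h1>{escape(doc['title'])}</h1>"]
    parts.extend(render_node(node) for node in doc["body"])
    return "\n".join(parts)


# Storage is JSON; HTML is produced only when a browser needs it.
stored = json.dumps(article)
html_out = render_article(json.loads(stored))
print(html_out)
```

Nobody is expected to hand-edit either the JSON or the resulting HTML here; both are serializations of the same object model, aimed at different consumers.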

I’ve written before about the problems that spring up when we ask simple author-focused semantics like Markdown to solve these translation problems; the Setext family is best understood as an editing interface, not a markup format. (https://medium.com/@eaton/markdown-wont-solve-your-content-problems-db9854fbb9b9)

Now, this “Let’s just store it in JSON and provide rich editors” school of thought is taking the trend to its logical conclusion, decoupling the content semantics from the browser-communication semantics.

It’s a conceptual shift, and bringing the wrong mental models to systems that work this way causes lots of problems down the line. The shift itself isn’t bad (a lot of good comes from it, structurally), but it’s critical to keep in mind when adopting new decoupled content tools.

In practical terms, these distinctions are usually ignorable, but they become show-stoppers in unexpected ways. Take “Can users add superscript text in the title of an article?” for example.

“NO!” says the database engineer who doesn’t want HTML polluting ‘pure text.’ “Of course!” says the content strategist, who realizes they need the trademark symbol in article titles. “Uh-oh,” says the CMS dev who hard-coded strip_tags() for titles to avoid security issues.
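
Here is a rough sketch of that collision in Python. The strip_tags() below is a simplified regex stand-in for the PHP function of the same name, not a faithful port. The "spans with marks" title model and the ALLOWED_MARKS whitelist are hypothetical, one possible way to let all three people get what they want.

```python
import re
from html import escape


def strip_tags(text):
    """Simplified stand-in for PHP's strip_tags(): drop anything tag-shaped."""
    return re.sub(r"<[^>]*>", "", text)


# The content strategist's title, entered as HTML.
title_html = "Acme<sup>TM</sup> launches new widget"

# The hard-coded sanitizer removes the tag, and the meaning goes with it.
print(strip_tags(title_html))  # the superscript is gone

# Hypothetical alternative: store the title as spans with an optional mark,
# so the stored text stays free of HTML and formatting is explicit data.
ALLOWED_MARKS = {"sup": "sup", "sub": "sub"}

title_spans = [
    {"text": "Acme"},
    {"text": "TM", "mark": "sup"},
    {"text": " launches new widget"},
]


def render_title(spans):
    """Serialize spans to HTML, escaping text and allowing only known marks."""
    out = []
    for span in spans:
        text = escape(span["text"])
        mark = span.get("mark")
        if mark is None:
            out.append(text)
        elif mark in ALLOWED_MARKS:
            tag = ALLOWED_MARKS[mark]
            out.append(f"<{tag}>{text}</{tag}>")
        else:
            raise ValueError(f"mark not allowed in titles: {mark!r}")
    return "".join(out)


def plain_title(spans):
    """Plain-text form for places that cannot show formatting."""
    return "".join(span["text"] for span in spans)


print(render_title(title_spans))
print(plain_title(title_spans))
```

In this sketch the database holds no markup, the strategist keeps the superscript, and the security concern is handled by escaping plus a whitelist instead of blanket tag stripping. The cost is that "title" is no longer a simple string, which is exactly the kind of mental-model shift described above.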