Is The Written Word “Data”?

To further flesh out the mental model of data, let’s try an adversarial example to see how well this idea holds up.

Writing Is Not Data

Imagine a collection of words, like this blog post you’re reading. Is this data? I’d say no–at least, not in a meaningful way that represents the purpose of language directly.

For you and me, the importance of a piece of writing is the ideas it contains. And in that sense, the data simply isn’t “here”–it isn’t directly represented in any meaningful way to the machine. While these words evoke a rich network of meanings for you, the reader, they mean next to nothing to the software that is storing and displaying them; which is to say, the software isn’t interacting with the words at the same level of semantic depth that you are. (Remember the idea of semantic depth: machines can interact with data at various levels, from shallow to deep.)

OK, Sure, There’s Data In Writing

Let’s take a step back. A piece of writing (like this post) obviously is data at lower levels of semantic depth. Walking up the stack:

Clearly there’s a category of bits, which are discrete values that can be either on or off, and this post constitutes a whole bunch of instances of those, in a fixed sequence.
One level up, there’s the category of typographic characters (multi-byte strings of bits that represent a glyph like “X” or “☃”), and this post can be seen as a sequence of instances of those, too.
Up from there, this post also contains delimited strings of characters, or tokens, with spaces or line breaks between them. The software we’re using absolutely does interact with tokens as data at this level, as evidenced by the fact that it knows how to wrap left-to-right text sensibly: when it gets to the end of a line, it doesn’t break up a word arbitrarily in the middle; it pushes the entire word to the next line. This kind of textual layout is actually quite sophisticated; notice how hyphenated words like “multi-byte” can be broken over lines, not just space-delimited words. The same goes for other rendering capabilities, like page breaks (most editors won’t strand a single line of a paragraph after a page break). The category here is “flowable tokens”, and this post is a sequence of those.
This post also contains hyper-textual tokens: things that aren’t rendered in the same way as the words in the post, but still affect the behavior of the document: links to other articles, headings, formatting, etc. These provoke even more sophisticated behaviors in your web browser, like navigating to new pages, so the category in question is not just “flowable tokens” but “renderable tokens”, some of which are flowable visible tokens, and others perform other functions.
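To make the lower levels of this stack concrete, here is a minimal Python sketch that views one short string as bits, characters, tokens, and flowable tokens. The sample sentence and the line width of 10 are arbitrary choices for illustration, and the standard library's textwrap module stands in for a real layout engine; this is not how any particular editor or browser does it.

```python
import textwrap

text = "A multi-byte glyph like ☃ is one character."

# Level 1: bits. Encode the text as UTF-8 bytes and spell out each byte's bits.
raw = text.encode("utf-8")
bits = "".join(f"{byte:08b}" for byte in raw)

# Level 2: characters. The snowman is a single character,
# but it takes three bytes in UTF-8.
chars = list(text)
snowman_bytes = "☃".encode("utf-8")

# Level 3: tokens. Split the character sequence on whitespace.
tokens = text.split()

# Level 4: flowable tokens. Wrap to a narrow column; textwrap never splits
# a word arbitrarily, but it may break a hyphenated word after its hyphen.
lines = textwrap.wrap(text, width=10)

print(len(bits), len(chars), len(tokens))
for line in lines:
    print(line)
```

Each level is the same underlying sequence, grouped into progressively larger units that the software knows how to act on.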

That’s quite a bit of semantic depth, but it still falls far short of representing the ideas that are the real content of writing in a meaningful way.

Deeper Parsing: Words In A Language

Is that where it stops? Not remotely! There are a couple more levels of semantic depth that our software regularly gives us for writing: words and grammar.

As a writer, in the editor software I’m using, this stream of tokens actually exists as instances of a more sophisticated category: English words. If I write a misspelled word like “Enhglish” (with an extra letter interposed), I get a dashed red underline in my editor (which happens to be Ulysses).

How is this happening? My editor has a canonical list of English words (aka a dictionary), and if it doesn’t find a match, it helpfully lets me know that I may have made a spelling mistake. This isn’t perfect, of course, because language is open-ended, but most people find it helpful most of the time (including me). The level of confidence is so high, in fact, that in some cases (like writing on a mobile device, such as this iPad I’m typing on) the default isn’t just to underline a misspelling, it’s to automatically correct it as you type (which, again, is mostly helpful, but not always). (To split hairs, on mobile devices without physical keyboards, like a mobile phone, the autocorrect model isn’t just using data about which letters you did or didn’t type, it’s also incorporating the specific placement of your finger on the virtual keyboard; if your tap lands near the boundary between “o” and “p” while typing the word “toy”, it can be very confident you meant “toy” and not “tpy”.)
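The dictionary lookup described above can be sketched in a few lines of Python. The word list here is a tiny, hypothetical stand-in for a real dictionary, and the suggestion step uses difflib from the standard library, which is an assumption for illustration and not how Ulysses or any mobile keyboard actually ranks corrections.

```python
import difflib

# Hypothetical, tiny word list; a real editor ships a far larger dictionary.
DICTIONARY = {"english", "words", "are", "data", "the", "toy"}


def check_spelling(text):
    """Return (word, suggestions) pairs for tokens missing from DICTIONARY."""
    problems = []
    for token in text.split():
        # Strip surrounding punctuation and normalize case before the lookup.
        word = token.strip('.,;:!?"()').lower()
        if word and word not in DICTIONARY:
            suggestions = difflib.get_close_matches(
                word, sorted(DICTIONARY), n=3, cutoff=0.8
            )
            problems.append((word, suggestions))
    return problems


print(check_spelling("Enhglish words are data"))
```

The key step is the membership test: once tokens are treated as instances of the category “English words”, a failed lookup becomes something the software can act on.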

That level of semantic depth–the category of English words–isn’t always apparent to you, as someone reading this post, because web browsers don’t usually highlight spelling mistakes in published pages, because that would be rude and pointless. (Though, your web browser probably does have a spelling dictionary included in it, for times when you’re using the browser as an editor, to author something like an email.)

But it’s not entirely absent; you might have noticed that my depiction above of line-wrapping was actually underselling how sophisticated it is. Notice that in this post, some words are actually hyphenated mid-word in service of line wrapping! This capability uses the English dictionary to recognize that words are composed of sub-structures called syllables, and that it’s permissible to break up a word at syllable boundaries if it makes text wrapping work more nicely. Details like this mostly go unnoticed, but they make a big difference in terms of the experience of a reader.
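As a rough sketch of syllable-aware line breaking, the Python below looks up a word's permitted break points and picks the longest head that fits in the space left on a line. The SYLLABLES table and the split_for_line helper are hypothetical names invented for this illustration; real layout engines use much larger hyphenation dictionaries or pattern files.

```python
# Hypothetical hyphenation table: each word maps to its syllables,
# and a break is permitted only between two of them.
SYLLABLES = {
    "sophisticated": ["so", "phis", "ti", "cat", "ed"],
    "syllable": ["syl", "la", "ble"],
}


def split_for_line(word, room):
    """Split word so the head plus a trailing hyphen fits in `room` columns.

    Returns (head, tail). The head is "" when the word is unknown or when
    no permitted break fits, meaning the whole word moves to the next line.
    """
    parts = SYLLABLES.get(word.lower())
    if parts is None:
        return "", word
    best = 0
    for i in range(1, len(parts)):
        head_len = sum(len(part) for part in parts[:i])
        if head_len + 1 <= room:  # +1 leaves space for the hyphen
            best = head_len
    if best == 0:
        return "", word
    return word[:best] + "-", word[best:]


print(split_for_line("sophisticated", 8))
```

With 8 columns of room, the word breaks after its second syllable; a word missing from the table is never broken, which is the safe fallback.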

And, of course, everything I’m saying in this section doesn’t apply only to English words; this same capability exists for people working in any language (though your mileage may vary as to the quality of the dictionary if you’re working in a less widely used language).

Deeper Still: Grammar And Style

The level of semantic depth on offer can go even one level deeper than this! Many modern text editors, like Google Docs, can actually parse text at a deeper level to give you feedback about your grammar. In Google Docs, if I write “The loved girl her father”, I get a helpful blue underline because the software recognizes the parts of speech of “the”, “loved”, and “girl”.

This is a lot more subtle than simple spelling correction; consider the similar sentence “The loved girl is happier than the unloved girl”. In this case, “the loved girl” isn’t a mistake at all, so proper grammar correction has to take into account the entire context of the sentence.

The actual mechanics of doing this are extremely sophisticated, as you can probably guess. For Google Docs, we can assume that it’s based on statistical models of language structure extracted through machine learning (rather than just a set of human-authored rules, which turns out to be really, really hard to get right). It’s a lot less reliable than spell-checking, but still very impressive.

But wait, there’s more! If you’re a professional writer, you might use even more specialized software like ProWritingAid or Hemingway. These look for common “defects” in writing, like overly long sentences or obscure words. This is more than just correctness of grammar; it’s writing style and comprehensibility.
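One of these style checks, flagging overly long sentences, is simple enough to sketch. The 25-word threshold and the punctuation-based sentence splitter below are assumptions chosen for illustration; they are not taken from ProWritingAid, Hemingway, or any other product.

```python
import re

MAX_WORDS = 25  # assumed threshold, not taken from any real tool


def long_sentences(text, max_words=MAX_WORDS):
    """Return the sentences in text that contain more than max_words words."""
    # Naive splitter: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if len(s.split()) > max_words]


sample = "Short one. " + " ".join(["word"] * 30) + "."
print(long_sentences(sample))
```

Notice that even this check only counts tokens between punctuation marks. It says nothing about what the sentence means.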

Grounding Out

This is an impressive stack! But notice that at least as of today, in most software that interacts with textual content, this is where it stops. The software we use to write and read text doesn’t generally treat the words in your writing as representatives of ideas, or as claims about reality, or anything of that sort.

This is counterintuitive, because your experience of reading is full of ideas. How can this happen if the software isn’t representing them? It happens because you fill in the gaps. You’re reading this post, and your human brain is connecting these English words at a deeper semantic level, as representatives of ideas that are connected by a grammar into assertions. You spent years learning to do this, building up your own internal representation of semantics, grammar, and shared consensual ideas like “counterintuitive” and “reading”. So while writing is a good conduit of these ideas between humans, it’s really just that: a conduit. Like a length of PVC pipe, it doesn’t know the difference between water and sewage.

Compare this with the level of claims you get in a structured data model, like a database. If I write the sentence “The name of employee 234546 is Jane Smith”, then the existential claim here is that an employee exists and has a certain name. Textual software can tell you whether this is proper English, but any notion of higher-level categories is only latent. As readers, we recognize that there are a couple of categories here (“employee” and “name”), and instances of those categories (“employee 234546” and “Jane Smith”) with relationships among them. But there’s no meaningful data in the system about that if all you have is text.

In a database, conversely, the whole point is to associate instances with categories, in the form of existential claims; by having a record in the database table “Employee” with the ID attribute of “234546” and the Name attribute of “Jane Smith”, I’m establishing that someone has claimed there’s a real instance of these categories that fulfills these relationships. (The fact that it’s been claimed doesn’t make it “true”, but that’s a distinction for another day.)
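Here is a minimal sketch of that contrast, using Python's built-in sqlite3 module with an in-memory database. The table and attribute names follow the example above; the column types are an assumption.

```python
import sqlite3

# As text, the claim is just a string; the system sees tokens, not categories.
sentence = "The name of employee 234546 is Jane Smith"

# As a record, the same claim ties an instance to a category and its attributes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (ID INTEGER PRIMARY KEY, Name TEXT NOT NULL)")
conn.execute(
    "INSERT INTO Employee (ID, Name) VALUES (?, ?)", (234546, "Jane Smith")
)

# The machine can now answer a question about the claim directly.
row = conn.execute("SELECT Name FROM Employee WHERE ID = ?", (234546,)).fetchone()
name = row[0]
print(name)
```

Answering the same question from the sentence alone would require the software to parse English, which is exactly the semantic depth that plain text does not provide.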

My point is: armed with a formal concept of what it means to be “data”, we have a new lens to evaluate examples of data and its semantic depth. And in that sense, much of the “data” in the world is very “dumb”, i.e. it exists at a lower level of semantic depth in the machine than its value to humans presumes.

This isn’t necessarily a bad thing, in that these lower level capabilities are insanely useful. But is that as far as we can go?

Is More Possible?

Just because this is where we stop today doesn’t mean it’s the limit of what’s possible. Indeed, some of the most exciting developments in computer science, at the forefront of AI and machine learning, are pushing the boundaries of how deeply we can represent ideas. And if we can develop tools that reach more deeply into the semantic meaning of existing texts, think about what powers that might give us!

In future articles, I’ll talk about some of these ideas: linked data, Wikidata, NLP, semantic folding, and other approaches that allow textual data to exist at a deeper semantic depth in machines.
