As a foundational post for this blog, I’d like to propose a mental model that I think is worthy of your attention. I suppose I’m being a bit trendy by using the term “mental model”, as that’s in vogue these days (with recent books on the topic by both Scott Page and Shane Parrish). But it’s actually a very nice framing for some thoughts that have been kicking around in my head for years without a proper form. I’ve never seen anyone talk about it in this way before, so it seemed useful to write it down.

What is data?

In an intuitive sense, we all know what data is: it’s structured information with a meaning. You could have data about the movement of the planets, or your company payroll, or economic activity in Sweden in the 1980s. Anything that can be stored in a computer is data–a digital representation of facts and knowledge.

Generally, we use the word “data” as a mass noun (“there’s a lot of data…”), as opposed to a count noun (“there are many data…”). This suggests that in the popular conception of data, it’s “stuff”–a bunch of ones and zeros. We talk about people collecting our “personal data”, or making “data-driven decisions”, and data is a word that means “the universal material of information”.

From an information-theoretic point of view, this is absolutely correct. If you’re talking about transmitting data, or compressing data, or storing data, you don’t care about meaning, and data really is just “stuff”. This was the great leap that Claude Shannon made (a story that James Gleick tells so clearly in The Information).

In the rest of this post, however, I’d like to propose a higher-level mental model: rather than just some “stuff” we store and manipulate with our computers, calling something “meaningful data” entails a particular relationship of concepts to each other. So I’m going to attempt to be explicit about this relationship.

Categories And Instances

I’d like to propose that “meaningful data” is actually an amalgam of two different things: “categories” and “instances”.

By the word category (which I’m using in the Hofstadter sense, not the more formal mathematical sense, for now), I basically just mean a concept–a pattern in the world that humans have recognized and named. Like: Chair, or President of the United States, or running. Categories arise when people notice similarities in their world, and they extract and abstract that similarity.

By the word instance, on the other hand, I mean one unique appearance of a concept: this chair I’m sitting on, the 44th president of the U.S., a run I went on this morning. There are lots more instances than there are concepts, obviously; “hydrogen atom” is a single concept, but there are vast numbers of actual hydrogen atoms.

So when we think about data, I’d like to suggest that these two words are best defined in terms of each other:

  • The category is whatever is shared by all the instances in that category.
  • The instance is whatever is unique, and not shared by all the instances in the category.

This might sound a little circular, so here’s an example: think of all the U.S. presidents. You could define the category, as Wikipedia does: “the head of state of the United States, indirectly elected to a four-year term by the people through the Electoral College, who leads the executive branch of the federal government ...” and so on. These facts are true of U.S. presidents collectively, and the same for all of them. If you were keeping a database of U.S. presidents, you wouldn’t need to store information about whether each one was the “head of the executive branch”; they all are. It’s part of the consensually shared definition of “president”, and storing it for each one would be redundant. This is the essence of the concept!

On the other hand, each individual president is obviously different from the others–each has a different name, body, history, etc. So if you were keeping a list of the U.S. Presidents–i.e. if you wanted to store data about them–then what you’d record about each one are the facts that aren’t entailed by the category. You wouldn’t store “does this president have a human body?”, because (so far) they all have. Instead, you’d store the parts that aren’t entailed simply by the category of being a U.S. president, like: name, birth date, birthplace, list of policies, etc.
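
To make that split concrete, here is a minimal Python sketch. The field names, the `split_category_and_instances` helper, and the two sample records are assumptions chosen for illustration, not a real schema. Given a handful of records, it treats whatever is identical across all of them as the category, and whatever varies as the instance data.

```python
from typing import Any


def split_category_and_instances(
    records: list[dict[str, Any]],
) -> tuple[dict[str, Any], list[dict[str, Any]]]:
    """Split records into shared facts (category) and per-record facts (instances)."""
    if not records:
        return {}, []
    first = records[0]
    # A fact is shared if every record has the same key with the same value.
    shared = {
        key: value
        for key, value in first.items()
        if all(key in r and r[key] == value for r in records)
    }
    # What remains for each record is what makes that instance unique.
    unique = [{k: v for k, v in r.items() if k not in shared} for r in records]
    return shared, unique


# Hypothetical records; the fields are chosen only for illustration.
presidents = [
    {"name": "George Washington", "birth_year": 1732,
     "heads_executive_branch": True, "term_length_years": 4},
    {"name": "Abraham Lincoln", "birth_year": 1809,
     "heads_executive_branch": True, "term_length_years": 4},
]

category, instances = split_category_and_instances(presidents)
print(category)   # the facts you would not bother storing per president
print(instances)  # the facts worth recording for each one
```

Note that anything that merely happens to be the same across the sample also lands in `category`, so the split depends on which instances you have seen so far.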

Thus, by this definition, the totality of “meaningful data” requires both of these things. If you just have a shared definition of a concept, it isn’t meaningful data yet. Conversely, if you just have a list of records containing numbers, text strings, etc., but you don’t know what each of those things means in terms of describing instances that belong to a category, that’s not meaningful data either; it’s just meaningless bit patterns:

Value 1    Value 2
---------- ------------
56         CCHG-ATHJW
109        XOFUASIW-WN
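
The same point as a rough Python sketch: the rows above are just tuples until they are paired with a category and some attribute names. The `Schema` type and the labels passed to it are hypothetical, invented here only to show the mechanics; the values in the table carry no such meaning on their own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Schema:
    """Hypothetical pairing of a category name with attribute names."""
    category: str
    attributes: tuple[str, ...]

    def describe(self, row: tuple) -> dict:
        # Attach attribute names to the raw values of one instance.
        if len(row) != len(self.attributes):
            raise ValueError("row does not match the schema")
        return dict(zip(self.attributes, row))


# Raw rows from the table above: values with no stated meaning.
rows = [(56, "CCHG-ATHJW"), (109, "XOFUASIW-WN")]

# Made-up labels, purely for illustration.
schema = Schema(category="Widget", attributes=("weight_grams", "serial_code"))

labeled = [schema.describe(row) for row in rows]
print(schema.category, labeled[0])
```

Nothing in `rows` changes when the schema is attached. The meaning comes entirely from the category and attribute names that readers bring to it.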

(NB: my point here isn’t to be prescriptivist about how people use words like “data”; in fact, I’m usually annoyed by writing that attempts to establish a universal hierarchy (“data” -> “information” -> “knowledge” -> “wisdom”; blegh). Feel free to substitute another word for “data” here if you like; my goal is just to communicate the shape of this idea, that the dual of categories and instances is a noteworthy mental model that maps to a lot of our activities in creating, storing, and using meaningful data.)

How Novel Is This?

Does this model seem obvious? I can’t really tell. It’s clearly not rocket science, but I’ve never read anyone putting it exactly this way.

For me, the “instances” side is certainly obvious; that’s what data “is”, it’s what you see when you look at spreadsheets or databases.

But being explicit about the categories is what I find interesting, because that’s rarely how people talk about it. If I create a database table of employees, for example, I might call it “Employees” and the column names might be things like “name”, “email address”, “employee number”, etc. Looking at that alone, we gloss over the fact that there’s a category lurking there; it isn’t spelled out anywhere, other than the name of the table (“Employees”), and perhaps implicitly in the attributes we choose to store. The richness of the category is present in the minds of everyone who interacts with this database, however–they bring their own definition of what it means to be an employee, and only in light of that shared consensual category does this become meaningful data.
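
As a small illustration, here is a sketch using Python’s built-in `sqlite3` module. The column names follow the ones mentioned above, but the exact schema is an assumption. Asking the database what it knows about “Employees” returns only a list of column names; the definition of what an employee is appears nowhere.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Employees (
        employee_number INTEGER PRIMARY KEY,
        name            TEXT NOT NULL,
        email_address   TEXT NOT NULL
    )
    """
)

# Everything the database records about the category:
# a table name and the names of the instance attributes.
columns = [row[1] for row in conn.execute("PRAGMA table_info(Employees)")]
print(columns)
```

Whether a contractor, an intern, or a board member counts as an employee is a question the schema cannot answer. That part of the category lives in the heads of the people using the table.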

That’s also where all the nuance happens: on the edge between the definition of a category and the unique attributes of the instances. Going back to the U.S. Presidents example: should we store a “gender” attribute for presidents? A records-keeper from the 1790s wouldn’t have thought to include “gender”, because it would have been obvious to them that presidents are men. Today, clearly we’d be more enlightened than that, but ... well, if you just look at the category of actual U.S. Presidents, all of them so far have indeed been men.

Even if you try to be exhaustive, it’s not really possible to be explicit about everything that’s true of a category. Until recently, you might have said that being President of the U.S. implies a certain level of decorum and dignity; for example, you’d expect that mocking a disabled reporter during a press conference would not be something a U.S. President would do. Without making a moral judgement, it’s clear that the 45th president showed that this was not, in fact, a required feature of the category of U.S. President.

So our definition of the category is much more subjective than we might like to admit, and it’s always evolving.

What does this mean for our data systems that rely on categories as part of the definition of data? Here are a few other posts that explore this further: