When you make a model of the world, such as a piece of software or a mathematical model, you are choosing which aspects of the world to include ... and which to exclude. And, let’s be honest: in practice, you’re mostly excluding stuff, because the world is much more complex than any single model is going to include.
Consider a simple employee database. You’d probably include information like “name”, “position”, “date of hire”, etc. But you certainly wouldn’t include lots of other information, like:
- The employee’s exact weight as measured on a daily or hourly basis
- The number of hairs on the employee’s head
- How many times this employee has said the word “carrot” while at work
- Whether this employee is a human being
This list of what you wouldn’t include is more or less infinite (especially given that categories are just arbitrary predicates in the first place). But can you explain, in any rigorous way, why you would or wouldn’t include something in the set of claims you actually store about a category?
As I’ve thought about it, I see four broad categories of reasons for not explicitly storing some information:
- It can be left unsaid because it’s really about the category, not the instance (see also: Category As Compression)
- It can be left unsaid because it’s derived from other information you are storing. So if you store a birth date, you don’t need to also store age, as the latter can be derived from the former.
- It will be left unsaid because existential claims about it are impossible (or just too difficult or costly) to actually obtain (for example, the number of hairs on someone’s head).
- It can be left unsaid because it’s irrelevant to what I’m trying to do. I could store my employees’ favorite Disney movies, but who cares?
In the first case, you’re essentially using data compression (as I’ve explained here); you don’t have to explicitly track whether each employee in your database has an exoskeleton because we’ve defined “employee” to be a human, and 100% of humans have endoskeletons.
In the second case, we’re talking about an entailment that’s inherent in the meaning of the data. Since attributes are slippery and ultimately come down to relationships, this is like saying that “there’s more than one way to skin a cat”; the underlying claim of age is really a relationship between a person, a point in time (that person’s birth), and another point in time (now).
In the third case, there’s no conceptual conundrum; the number of hairs on a person’s head is clearly a coherent thing to store, if you choose to, but we can all agree that it’s impractical based on today’s technology. There’s nothing universal or inherent in this distinction, however; while it’s probably infeasible to track the exact number of birds on the earth today, it’s easy to imagine a near future where cheap, tiny drones can blanket the earth and record exactly this kind of information. (And given recent news, that’s not a terrible idea.)
And in the fourth case, which is by far the most common, there’s no reason for you to store a claim, so you don’t.
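The first two cases are easy to sketch in code. What follows is a minimal illustration, not a recommended schema: the `Employee` class, its field names, and the `age_on` method are all names I’m making up for the example. The category-level fact lives once on the class, and age is computed from a birth date plus a second point in time instead of being stored.

```python
from dataclasses import dataclass
from datetime import date
from typing import ClassVar


@dataclass
class Employee:
    # Case 1: a fact about the category, stated once on the class
    # instead of being stored on every instance.
    is_human: ClassVar[bool] = True

    # Claims we actually store per instance (hypothetical field names).
    name: str
    position: str
    date_of_hire: date
    birth_date: date

    def age_on(self, today: date) -> int:
        # Case 2: age is not stored; it is a relationship between the
        # person's birth date and another point in time.
        years = today.year - self.birth_date.year
        # Subtract one if this year's birthday has not happened yet.
        if (today.month, today.day) < (self.birth_date.month, self.birth_date.day):
            years -= 1
        return years


ada = Employee("Ada", "Engineer", date(2020, 1, 6), date(1990, 6, 15))
print(ada.age_on(date(2024, 6, 14)))  # 33 (the day before the birthday)
print(ada.age_on(date(2024, 6, 15)))  # 34
```

Notice that `age_on` takes “now” as an argument, which makes the second point in time explicit rather than hiding it.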
To make this clearer, let’s take a simple example: whether to store the number of eyes your employees have. If you don’t store it, it could certainly be for any of our four reasons:
- You don’t store it because ... the category of “employee” entails having two eyes, so you can “compress” the information into the category rather than the instances. All employees are humans and all humans have two eyes!
- You don’t store it because ... you store other information that gets you the same knowledge. For example, you could be storing a bit for “has left eye” and a bit for “has right eye” (which, when put together, tell you whether someone has zero, one, or two eyes).
- You don’t store it because ... you don’t have any process in employee on-boarding that explicitly asks whether someone has two eyes, so there’s just no practical way for you to accurately record the information.
- You don’t store it because ... you don’t care.
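The second of these reasons (storing other information that gets you the same knowledge) can be sketched as follows. This is only an illustration under my own assumptions; `EyeRecord` and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EyeRecord:
    # Two stored bits (hypothetical field names).
    has_left_eye: bool
    has_right_eye: bool

    @property
    def eye_count(self) -> int:
        # Derived, never stored: zero, one, or two.
        return int(self.has_left_eye) + int(self.has_right_eye)


print(EyeRecord(True, True).eye_count)    # 2
print(EyeRecord(True, False).eye_count)   # 1
print(EyeRecord(False, False).eye_count)  # 0
```

The derivation only runs one way: the two bits give you the count, but the count alone can’t tell you which eye is missing.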
It’s this fourth category that should really give us pause. What does it mean to “care”, or more broadly, for something to be “relevant”?
Domains Of Applicability
Sean Carroll has a useful phrase, which he explains at length in his book The Big Picture: “domains of applicability”. Without (hopefully) doing too much damage to Sean’s much more nuanced exposition, the basic idea is that our understanding of the world is partitioned into isolated realms: groups of theories that (a) match our observations and (b) are good explanations of those observations. But these groups of theories don’t (necessarily) all stitch together into one big unified theory.
His initial example is the difference between molecular physics and the ideal gas law; the former describes particles zooming around bumping into each other, and the latter describes macro properties like temperature and pressure. In this example, we do have a basic connection between these domains (we can mathematically model how the molecules bumping into each other produce macro effects like pressure, even if we can’t actually measure the location and behavior of all the molecules).
But such a clear relationship between domains, while convenient when it happens, isn’t required, or even the rule. At the other end of the spectrum, we have the domain of neural biology, and the domain of the experience of consciousness, and we don’t have any real idea how the two relate. But the important point is that it’s OK to talk about the latter domain on its own terms, without needing the grounding all the way down to fundamental physics, as long as you’ve got good observations and good explanations.
So, to abuse this notion with a heavy dose of hand-waving: one way to look at “relevance” in a conceptual model is to think in terms of domains of applicability. If I’m modeling employees at a company, then my domain of applicability is a theory about what kind of interactions make up the machinery of employment. An employee’s salary is clearly a valid part of this model; it’s a contract between employee and employer, and if you stop paying an employee, they stop coming to work. Conversely, someone’s favorite Disney movie, or the number of eyes they have, doesn’t usually have any status in this domain. Usually.
A Million Little Domains
This, however, is where we diverge from domains of applicability in the sciences. For something to be a “respectable” science, it generally has to address everything; that is, all the things we can make observations and explanations about. Physics is not “the physics of Utah”, it’s “the physics of everything we can observe”. A single university department can happily talk about multiple domains of applicability (like particle physics versus the ideal gas law), but arbitrarily dividing a field into lots of different local domains of applicability is probably a sign that you haven’t arrived at really good explanations of reality in that field.
In the practical world of society and business, however, we’ve got exactly the opposite: an explosion of “fiefdoms”, little domains of applicability created by evolution, consensus, or fiat. For example: every company is free, within some limits, to set its own “laws” about how entities should relate to each other, what they’re called, and which ones are relevant. The same holds for governmental policies and laws at the national or state level; the “reality” of what’s legal in China is very different from the same “reality” in Germany. The complexity of modern life (and information systems) stems largely from the fact that these domains evolve independently and gradually. We drive on one side of the road in the U.S., and the other side of the road in England.
So going back to our question of relevance, if storing the number of eyes that an employee has is “irrelevant”, that’s only according to a single domain of applicability, a local context. In some domains, it might indeed be relevant. What if you’re an eyewear company, and you’re hiring eyewear models?
This proliferation of domains is troublesome for information professionals in the first place, but it’s made even trickier because it changes over time, for at least a couple of reasons:
- The domain of a business or organization naturally changes over time. Perhaps you used to only sell shoes, but now market conditions have driven you to start selling eyewear, and “number of eyes” for the humans you employ as product models is suddenly relevant, where it never was before.
- The complexity of the world sometimes unfolds over time; perhaps number of eyes was always relevant, but you just never happened to encounter anyone with only one eye, so it never occurred to you to ask. We are all working from imperfect information, and we don’t always have a rich enough understanding of our domain to make the right call at first.
All of which is to say: this is legitimately hard, and much more locally convoluted than the universal sciences.
The Impossibility Of Perfect Description
So, there are many reasons why a model might choose to exclude some aspect of the world that it theoretically could include. However, all of these reasons manifest in the same way: as simple omissions. We can’t really know which class of omission it is. Am I leaving off “number of eyes” because I know for sure every employee will have two eyes, or am I leaving it off because it’s sufficiently safe to assume that they do? And is it safe because the alternative has a very low probability, or because if it turns out to be wrong, it doesn’t really matter to me?
But even as we might decry this omission, be cognizant that there’s not really an alternative. There are an infinite number of things that aren’t included in a model, and documenting all of them is clearly impossible (and wouldn’t be helpful anyway).
The closest we can get is to document alternatives we considered but then chose not to include. Those who come after us, looking at a system, will likely benefit from knowing our thoughts on these omissions; “Since we’re a company that employs eyewear models, we considered including the number of eyes a person has; but we’re making the assumption, for now, that everyone we employ will have two eyes, and not storing that explicitly.”
This is a simple and contrived example, of course, but in practice it comes up in much more interesting and complex cases. The boundaries of what you find “relevant” to a problem are squishy and malleable, and different people might draw them in different places. This is just a natural part of the “factoring out” activity of getting explicit about categories.
Personally, I try to err on the side of completeness; if I’ve even considered a question, that means it’s probably not total nonsense, and others who interact with a system might encounter the same questions. Better to write it down as a path not taken.
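One lightweight way to record a path not taken is to keep the considered omissions next to the model itself. The sketch below is just one possible shape for that, and every name in it (`OmissionReason`, `ConsideredOmission`, `EMPLOYEE_OMISSIONS`, `why_omitted`) is hypothetical; a comment in a schema file or a paragraph in a design document would serve the same purpose.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class OmissionReason(Enum):
    # The four broad reasons discussed earlier.
    CATEGORY_LEVEL = "implied by the category"
    DERIVABLE = "derivable from stored data"
    IMPRACTICAL = "too hard or costly to obtain"
    IRRELEVANT = "not relevant in this domain"


@dataclass(frozen=True)
class ConsideredOmission:
    attribute: str
    reason: OmissionReason
    note: str


# Only omissions somebody actually considered get written down;
# the infinite remainder stays silent.
EMPLOYEE_OMISSIONS = [
    ConsideredOmission(
        attribute="number_of_eyes",
        reason=OmissionReason.CATEGORY_LEVEL,
        note=(
            "We employ eyewear models and considered storing this; "
            "for now we assume every employee has two eyes."
        ),
    ),
]


def why_omitted(attribute: str) -> Optional[ConsideredOmission]:
    """Return the recorded omission for an attribute, or None if none was recorded."""
    for omission in EMPLOYEE_OMISSIONS:
        if omission.attribute == attribute:
            return omission
    return None
```

Here `why_omitted("number_of_eyes")` returns the note above, while `why_omitted("favorite_disney_movie")` returns `None`, which is exactly the ambiguous silence we can’t avoid for everything nobody thought to write down.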
For completeness, there’s perhaps a fifth reason for not storing something: you want to and are able to, but you’re prohibited from doing so by some convention or law. For example, you might like to store an employee’s political party or sexual orientation, but imagine that there’s a law stating that this information is private to an individual and not to be stored by another party for commercial purposes. I’m not sure where this reason fits into my taxonomy; it seems like a very tactical, non-essential reason to omit something from a model, but it’s nonetheless quite a real reason, so I suppose I should include it.