I’ve got a challenge for you. Here’s a string of characters:

1a1001a1001a1001a1001a1001a1001a1001a1001a1001a1001a1001a1001a100

Is there a shorter way to represent this string of characters?

Sure there is! It’s just the same 5 characters repeated over and over. You could represent it instead with something like:

"Repeat 1a100 13 times"

The original pattern can be condensed into the latter with no loss of precision, because you can just follow those instructions to get back to the first string. The description functions as a program, and following that program produces the original input.

At some basic level, this is how all data compression works–by factoring out common patterns. When you create a .zip file that shrinks your spreadsheet from 5MB to 1MB, this is basically what it’s doing, albeit in a much more sophisticated way–finding repetition, and creating a smaller program that will produce the original again. Because of the repetition, you could say that the “real” amount of information in the original file was actually not 5MB, but something much less; your zipped version is closer to the optimal representation.

This is the intuition behind Kolmogorov complexity, which says that the essence of any pattern is defined as the shortest program that can reproduce it. If I have a string of a billion zeros, its complexity isn’t equal to a billion bytes; it can be fully reproduced by the program implied in the English phrase “a billion zeros”. Conversely, if I have a string of a billion random bytes, there’s not likely to be a program shorter than a billion bytes that can reproduce it exactly. (The study of algorithmic information theory goes much deeper than this, of course, with decades of contributions on the mathematical nature and behavior of this idea.)
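To make this concrete, here’s a quick sketch in Python using the standard-library zlib module: a highly repetitive string compresses down to almost nothing, while the same amount of random bytes barely compresses at all.

```python
import os
import zlib

# A highly repetitive string: a million zero bytes.
repetitive = b"0" * 1_000_000

# The same amount of incompressible random bytes.
random_data = os.urandom(1_000_000)

# zlib finds the repetition and shrinks the first string dramatically...
print(len(zlib.compress(repetitive)))   # a few thousand bytes at most

# ...but can do essentially nothing with the random bytes.
print(len(zlib.compress(random_data)))  # barely smaller than a million
```

This isn’t literally Kolmogorov complexity (which is uncomputable in general), but a general-purpose compressor gives a rough, practical upper bound on it.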

Fun fact: it’s a hobby of some programmers to try to actually write the smallest possible programs that produce a given output. For a fun few minutes, check out the Code Golf Stack Exchange site, where people pit their skills (and languages) against each other to do exactly this. (Careful not to go overboard, though, or you may find yourself writing an entire language to win at code golf.)

Sharing Is Caring

As a side note, realize that this kind of compression-through-pattern-extraction is in no way limited to a single case (e.g. compressing one unique string). You can always find patterns in the patterns and set those aside as special instructions. For example, to get even more compression in our example above:

"Repeat 1a100 13 times"

You could notice that "repeat X N times" is an instruction that comes up a lot, and you could say "instead let's use a single character, like Ⓡ". (Of course, if Ⓡ comes up in the actual data, that means you'd need to do something special like escaping it, but that might be a good tradeoff if Ⓡ is uncommon.) So now you have something like:

"Ⓡ 13 1a100"

This means we've transferred a little bit of the information content from the individual instances (the input / output strings) into our interpreter program (the thing that does the compressing and uncompressing).
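Here’s a toy sketch of that idea in Python (the instruction format is invented for illustration): the interpreter itself knows what Ⓡ means, so each individual compressed string can stay short. A real version would also need the escaping mentioned above.

```python
# A toy decompressor whose interpreter carries one shared instruction:
# "Ⓡ <count> <pattern>" means "repeat <pattern> <count> times".
# The knowledge of what Ⓡ means lives in this program, not in the data.

def decompress(program: str) -> str:
    if program.startswith("Ⓡ "):
        _, count, pattern = program.split(" ", 2)
        return pattern * int(count)
    # Anything else is taken as a literal string.
    return program

print(decompress("Ⓡ 13 1a100"))  # "1a100" repeated 13 times
```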

You could do this with arbitrarily large or complex patterns, of course; like, say you wanted to compress War and Peace:

“Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes ... (2800 pages later ...)  ... and to recognize a dependence of which we are not conscious.”

You could tell your interpreter program to always compress this entire 7MB string into a single character, ⓦ. This would make it extremely efficient at compressing War and Peace, and rubbish at compressing anything else.

You don't get any free lunch here, as far as Kolmogorov complexity goes; the compressor program counts as part of your "shortest program". Which means that if you started to add lots of fixed strings to your compressor program, you'd very quickly lose your top spot. A "repeat" instruction is probably wise, whereas entire Russian novels are not.

Categories And Instances

Information compression isn’t a new idea at all. But as a mental model, it has so many interesting implications. One that dawned on me recently is its relationship to data. “Compressed” is really another way of saying “implied by category”.

I’ve previously written about the mental model of data: thinking of data as a dual of categories (shared information) and instances (unique information). If you think about it, this separation is essentially a type of compression, because you’re factoring out shared patterns that are invariant across the category instead of repeating them for each instance. That information still exists, but it lives at a different level; if you ask a question of the instance, you can still get an answer, it just might come from the category.

Here’s a silly example: let’s say I was keeping an employee database, and this was the information I was tracking about each of my employees:

Name: John Smith
SSN: 555-65-1234
Species: Human
Metabolism type: Endothermic
Skeletal structure: Endoskeleton
Number of Eyes: 2
Average Internal Temperature: 98.6 F

If I proposed this as the schema for an employee database, I’d (rightfully) get laughed at. Unless I happen to employ beetles or jellyfish, much of this information is clearly redundant. Why? Because it’s implied by the category “employee” (which we all think of as a subtype of “human”). Every employee will have an endoskeleton, an endothermic metabolism, two eyes, and so forth. Storing those facts for each employee would be pointless because I can infer them from the fact that they are employees, and thus humans. Nobody would seriously propose adding an attribute to a database that’s guaranteed to be the same for every record, because it’s a pointless waste of space.
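In Python terms, a minimal sketch of this separation might put the shared facts on the class and the unique facts on the instance (field names are just the ones from the silly example above):

```python
from dataclasses import dataclass

# "Compression by category": facts invariant across all employees live
# on the class (stored exactly once), while facts unique to each
# employee live on the instance (stored per record).

@dataclass
class Employee:
    # Instance-level: unique information.
    name: str
    ssn: str

    # Category-level: shared information, not per-record fields.
    species = "Human"
    skeletal_structure = "Endoskeleton"
    number_of_eyes = 2

john = Employee(name="John Smith", ssn="555-65-1234")

# Asking the instance still gets an answer; it just comes from the category.
print(john.number_of_eyes)  # 2
```

Note that the instance itself stores only the name and SSN; the other answers are looked up one level up, on the class.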

Now, note that just because something is “determined by a category” doesn't mean it's actually an explicit part of any given software system. I promise that if you actually look at Employee databases around the world, most of them cannot literally answer a query about how many eyes their employees have, or whether their skeletons are on the inside or outside of their bodies. This knowledge lives only in the minds of the humans who design and use the systems; it's “common sense” (which is to say, implicit human knowledge that hasn't been directly encoded in a data system). Certainly, some categorical things are discoverable in software systems, but it’s not a guarantee.

It’s also not always clear whether something is really an attribute of an entire category, as opposed to something that varies across all the instances. What if, for example, I have an employee who doesn’t have two eyes, because they lost an eye in an accident? At that point, “number of eyes” is not something I can just factor out of every employee record. It’s true that most humans have two eyes, but it’s not universally true, and so by “compressing” it (i.e. leaving it implied at the category level, rather than explicit at the instance level) I’m potentially making a mistake, diverging from reality.
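One way to sketch that tension in code is to keep the categorical fact as a default that an individual instance can override (a toy example, with invented names):

```python
# When a "category-level" fact turns out not to be universal, one option
# is to demote it to a default that individual instances can override.

class Employee:
    number_of_eyes = 2  # true for most employees, so it lives on the class

    def __init__(self, name):
        self.name = name

alice = Employee("Alice")
bob = Employee("Bob")
bob.number_of_eyes = 1  # lost an eye; the instance overrides the category

print(alice.number_of_eyes)  # 2, answered by the category
print(bob.number_of_eyes)    # 1, answered by the instance
```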

For this reason, good engineers tend to ask a lot of “what if?” questions during design. What if someone loses an eye? What if we hire a freak of nature whose body temperature is 80 F? What if we hire a boneless chicken? It can go too far, to be sure, but this act of probing the boundaries of a category is a healthy one.

Does Category Have To Be Implicit?

This points to an entirely different problem in technology: is it possible (and / or useful) to build systems that explicitly encode categorical information, such that machines could act on it? Could a machine reason that it’s not safe to poke employees with sharp objects, because they’re humans and therefore don’t have exoskeletons?
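As a toy sketch of what “explicitly encoded” might look like (all names here are invented for illustration), imagine facts attached to categories, and a machine that answers questions about an instance by walking the is-a chain:

```python
# Categorical knowledge encoded as data: "is-a" links between categories,
# and facts attached to whichever category they're invariant for.

is_a = {"employee": "human"}  # an employee is a kind of human
facts = {
    "human": {"skeletal_structure": "endoskeleton", "number_of_eyes": 2},
}

def lookup(category, attribute):
    # Walk up the is-a chain until some level can answer (or we run out).
    while category is not None:
        if attribute in facts.get(category, {}):
            return facts[category][attribute]
        category = is_a.get(category)
    return None

print(lookup("employee", "skeletal_structure"))  # "endoskeleton"
```

This is essentially the inheritance idea from knowledge-representation systems, boiled down to a dictionary walk.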

The answer to this is a qualified “yes”; as in, it seems possible in theory, but it’s mostly not what our information systems do today, and attempts to do it so far have had mixed success. This is something we’ll talk about at much more length in an upcoming post.