Attributes Are Slippery
In the world of concept modeling, there’s a sort of accepted hegemony around attributes; everyone “knows” what they are. Everyone, that is, except William Kent.
I wish I could quote the entirety of his chapter on Attributes from Data And Reality, but let’s just start here:
“As common as the term “attribute” may be, I don’t know what it means.”
When we talk about attributes (aka properties, characteristics, fields, features, dimensions, members, columns, facets, etc.) we mean something like “a description of some aspect of an instance of a category”. Categories (aka entities) could be said to “support” various attributes (carrots have a weight, but not a mailing address), and instances could be said to “have” values for those attributes (this carrot has a weight of 8.6oz), or sometimes, equivalently, to “be” those attributes (this carrot is 8.6oz).
“People have heights and birthdays and children, my car is blue, and New York is crowded.”
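That everyday picture is easy enough to sketch in code; here’s a minimal version, assuming nothing beyond the carrot example above (the class and attribute names are mine, purely for illustration):

    from dataclasses import dataclass, field

    @dataclass
    class Category:
        name: str
        supported_attributes: set[str] = field(default_factory=set)

    @dataclass
    class Instance:
        category: Category
        values: dict[str, object] = field(default_factory=dict)

        def set(self, attribute: str, value: object) -> None:
            # An instance can only "have" a value for an attribute its category "supports".
            if attribute not in self.category.supported_attributes:
                raise ValueError(f"{self.category.name} does not support {attribute!r}")
            self.values[attribute] = value

    carrot = Category("carrot", {"weight"})        # carrots have a weight...
    this_carrot = Instance(carrot)
    this_carrot.set("weight", "8.6oz")             # ...and this carrot "has" a weight of 8.6oz
    # this_carrot.set("mailing address", "...")    # would raise: carrots don't support that
    print(this_carrot.values)                      # {'weight': '8.6oz'}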
Kent then dismantles the simplistic concept of attributes, one step at a time.
Attribute Ambiguity
To start with, there are a bunch of ambiguities in how we use the word “attribute”. Writing in 1978, but (as usual) ahead of his time, Kent actually resorts to (invents?) the graph model of conceptual definition (which eventually became the foundation for semantic technologies like RDF):
“Every attribute has a subject: what it’s an attribute of. People, my car, and New York were the subjects of the attributes in the examples above. Then there are targets, which are at the other end of the attribute, such as heights, blue, and crowded. Thirdly, there are links between subjects and targets. In the last example, it isn’t “New York” or “crowded” which are important in themselves; what is being expressed is a connection between the two: New York is crowded.”
You might recognize this as very nearly the “subject / predicate / object” triple:
:NewYork :hasPopulationDensity :crowded
(RDF expands these identifiers into full URIs, to remove ambiguity, and it labels the relationship more specifically than “is”; but the spirit is the same.)
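As a rough illustration of the same idea in ordinary code (the http://example.org/ URIs below are invented placeholders, not a published vocabulary):

    # The same statement, spelled out as a (subject, predicate, object) triple.
    triple = (
        "http://example.org/city/NewYork",            # subject: one specific city
        "http://example.org/prop/populationDensity",  # predicate: which link we mean
        "http://example.org/level/crowded",           # object: one specific density level
    )

    subject, predicate, obj = triple
    print(f"<{subject}> <{predicate}> <{obj}> .")     # roughly N-Triples syntax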
Kent then points out that there’s a leap of correspondence or “aboutness” that takes place:
“When I explore some definitions of the target part of an attribute, I get the impression that authors are referring to the representations, e.g., the actual letter sequence of “6- -f-e-e-t” ... If I were to take that literally, then expressing my height as “72 inches” would be to express a different attribute from “six feet”. Maybe the authors don’t really mean that; maybe they really are willing to think of my height as the space between two points, to which many symbols might correspond as representations. But I can’t be sure what they intend.”
I think we can all agree that what’s intended is to represent the underlying reality (the length of space between two reference points, in this case the top of my head and the bottom of my feet). But information systems rarely actually represent attributes at that level of semantic depth, which gives rise to all kinds of tactical problems in data systems (language translation, character encoding, etc.).
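One hedged way to picture that gap is to store the underlying magnitude once, in a canonical unit, and treat “6 feet” and “72 inches” as two renderings of the same target; the choice of metres as the canonical unit here is mine, purely for illustration:

    # The attribute's target: an interval in space, stored once in a canonical unit.
    height_in_metres = 1.8288

    # Representations are derived from the target, not the other way around.
    def as_feet(metres: float) -> str:
        return f"{metres / 0.3048:g} feet"

    def as_inches(metres: float) -> str:
        return f"{metres / 0.0254:g} inches"

    print(as_feet(height_in_metres))    # "6 feet"
    print(as_inches(height_in_metres))  # "72 inches" -- same attribute, different symbols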
There are cases where there seems to be no gap of aboutness; when I say that my first name is the three-letter sequence “I-a-n”, there’s no ambiguity, nor any other way to represent it; that’s my name. Right? Maybe. You could argue that my name in Japanese characters (“イアン”) is still my name, because it’s the sound that matters. Indeed, any representation we make in a computer is “about” something, and is almost always just one way to represent that thing. Even a unique identifier or UUID, which is a string of bits that doesn’t inherently mean anything other than its own sequence of bits, could be represented in different ways (e.g. big-endian versus little-endian byte order). Kent deals with this distinction by just calling it something else:
“The target of an attribute is rarely a symbol directly. There is almost always a target entity distinct from the symbols. There are some notable exceptions to this rule, but then I wouldn’t call the phenomenon an “attribute”. If the target is really a pure symbol, then I prefer to call this “naming”.”
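Even that “pure symbol” case has more than one concrete rendering; the byte-order point above can be made concrete with Python’s uuid module (the identifier value is arbitrary):

    import uuid

    # An arbitrary identifier: its only "meaning" is its own sequence of bits...
    ident = uuid.UUID("12345678-1234-5678-1234-567812345678")

    # ...and yet even that sequence has more than one concrete representation.
    big_endian = ident.bytes        # most significant byte first
    little_endian = ident.bytes_le  # first three fields byte-swapped

    print(big_endian.hex())     # 12345678123456781234567812345678
    print(little_endian.hex())  # 78563412341278561234567812345678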
In any event, the ambiguity of language is skewered nicely by Kent’s list of all the things you could mean when you say “attribute”:
- The concept of color.
- The concept of blue.
- One of the character strings “blue”, “bleu”, “blau”, etc.
- The general observation that cars have colors.
- The fact that my car is blue.
The ambiguity is tamed somewhat by the language of our current technology stack, which can at least deal with the last two; the observation that cars have colors is an entity shape constraint or schema, and the fact that my car is blue is an existential claim relating one property value to one instance. Different representations of some underlying reality (“blue”, “bleu” and “blau”) aren’t fundamentally ambiguous; aboutness is hard to describe, but we can hand-wave a bit and say we all basically grok it as either symbolic pointing (as in words) or pattern similarity (as in photos).
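A rough sketch of that split, with made-up identifiers throughout: the schema-level statement records what the category supports, while the instance-level statement is one existential claim about one car:

    # Schema level: the general observation that cars have colors.
    schema = {
        "Car": {"color"},   # the category "Car" supports a "color" attribute
    }

    # Instance level: the fact that my particular car is blue.
    facts = [
        ("my_car", "is_a", "Car"),
        ("my_car", "color", "blue"),
    ]

    # A claim is only admissible if the subject's category supports that attribute.
    def admissible(fact, schema, facts):
        subject, attribute, _ = fact
        categories = {o for (s, p, o) in facts if s == subject and p == "is_a"}
        return any(attribute in schema.get(c, set()) for c in categories)

    print(admissible(("my_car", "color", "blue"), schema, facts))           # True
    print(admissible(("my_car", "mailing_address", "..."), schema, facts))  # False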
The first two (the concept of color versus the concept of blue) will still give us trouble, though; let’s push a little deeper into Kent’s analysis.
Attributes Are Relationships
“For me these problems are overshadowed by a larger concern ... I can’t tell the difference between attributes and relationships. The fact that “Henry Jones works in Accounting” has the same structure as “Henry Jones weighs 175 pounds”. “175 pounds” appears to be an entity in the category of “weights” as much as “Accounting” is the name of an entity in the category of “departments”.”
Here we get to the crux of the difficulty; we want to believe that a descriptive fact about someone, like the weight of their body on the surface of the Earth, is somehow imbued with “inherence”. But Kent maintains a strong stance that:
“There really does always seem to be an entity lurking behind the scenes somewhere ... my height is not the string of characters “6 feet”. A height (or other length) is a certain interval in space (any good reason not to think of it as an entity?)”
This is counterintuitive, and Kent really struggles with this distinction:
“We can say that something is a car, and we can say that something is red. Intuitively, I feel that the first assertion is about the intrinsic nature of the thing (hence, its category), while the second asserts additional information about its characteristics (i.e. attributes). At one time I wanted to believe in a definable difference between category and attribute, but I didn’t know how to articulate it. ... I’ve abandoned my hope of defining that distinction, too.”
Within my own mental model of data, I find it easy to agree on the unity of attributes and categories. If we say that a relationship between two instances is also an instance (of the category of “relationship”) then all we need is categories and instances; a claim about someone’s height is just two category-instance pairs:
- a category of “intervals in space”, with an instance of “6 feet”,
- a category of “relationships between a person and a height-interval at a point in time”, with an instance of “the relationship between this person, Ian, and this height, 6 feet, as measured on Wednesday at 10:19 am”.
Thus, a simple claim about the height of a person is, in reality, a compound claim: there exists an interval in space of 6 feet, and there exists a person named Ian, and there exists a relationship between the person and the interval, which implies correspondence (i.e. if you were to measure both the person and the interval by performing the same actions, such as by using a yardstick, you’d get the same result for both).
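As a sketch (again with invented identifiers), the relationship itself is reified as an instance, so the whole claim really is nothing but category-instance pairs:

    # Every row is a (category, instance) pair -- including the relationship itself.
    instances = [
        ("person", "Ian"),
        ("interval in space", "6 feet"),
        ("person-height-at-time relationship", "rel-001"),
    ]

    # The reified relationship instance points at its participants.
    relationship_details = {
        "rel-001": {
            "person": "Ian",
            "height": "6 feet",
            "measured at": "Wednesday 10:19 am",
        }
    }

    # Reading the claim back out is just walking the pairs.
    rel = relationship_details["rel-001"]
    print(f'{rel["person"]} stood {rel["height"]} tall, measured {rel["measured at"]}.')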
This kind of description feels roundabout and laborious! And yet, as Kent points out, it’s further supported by the fact that most attributes have attributes of their own. For example: the date an employee began receiving a certain salary, or the ages of an employee’s children. You can also decompose nearly every attribute into something more fine-grained; you can say an employee has a hair color, or you can say the employee has some number of hairs, each of which has some color; or, if you’re getting your ombré on, each of which has a length along which some spectrum of color is evident.
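A tiny sketch of “attributes with attributes of their own”, using Kent’s salary example with made-up figures and dates:

    # The salary "attribute" is itself an entity with attributes of its own.
    employee = {
        "name": "Henry Jones",
        "salary": {
            "amount": "50,000/year",   # invented figure, purely illustrative
            "began on": "2024-03-01",  # an attribute of the attribute
        },
    }

    print(f'{employee["name"]} has earned {employee["salary"]["amount"]} '
          f'since {employee["salary"]["began on"]}.')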
It’s categories and instances all the way down! This is ultimately the stance Kent takes:
“I will not formally distinguish between attributes and relationships, or between those two and categories. Even so, I do continue to use the terms “attribute” and “category” when they seem more natural, but I won’t be able to say why they feel more natural at the time. Most likely, it will correlate well with my implicit assumptions about the existence tests for the entities involved.”
And when it comes to how we should actually work with this:
“Let me suggest a way to satisfy our intuitions. Let us build a modeling system which only supports one basic linking convention, which we are free to call either “attribute” or “relationship”. The terms will be synonymous; we can use whichever one feels better at the moment.”
We (as in, the technology industry) have in fact built such a system: the graph-oriented approach to conceptual modeling, exemplified in RDF, OWL, etc. Unlike the hallowed relational database, whose rows and columns nudge us to see a deeper ontological difference between attributes and instances, the graph model is ultimately a more natural and flexible mental model to use.
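In such a system, Kent’s two Henry Jones facts from earlier use exactly the same linking convention, and whether you read each link as an “attribute” or a “relationship” is purely a matter of taste (the identifiers are placeholders):

    # One basic linking convention: (subject, link, target).
    links = [
        ("HenryJones", "works_in", "Accounting"),   # feels like a "relationship"
        ("HenryJones", "weighs",   "175 pounds"),   # feels like an "attribute"
    ]

    # The system treats both identically; only our intuition tells them apart.
    for subject, link, target in links:
        print(f"{subject} --{link}--> {target}")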
(This still leaves some mysteries to explore, particularly around literal values, but we’ll come back to those in a future post.)