Liberating Cohesion: RDF

May 31, 2024

What do empathy, common enemies, and small talk have in common? They are all cohesion mechanisms. And so are calendars, clocks, protocols, plans, meetings, and memes. Cohesion mechanisms make elements be a whole or work as a whole. They do so by imposing constraints. Constraints reduce freedom and, in the case of agents, decrease autonomy and agency.

All socio-technical systems need to balance autonomy and cohesion to remain viable. There is then a need for a separate action or technology to enable autonomy. For instance, in organizations, cohesion brought by a goal can be balanced by autonomy via delegation of operational decision-making. More generally, cohesion brought by why has to be balanced by autonomy of what, and cohesion of what needs to be balanced by autonomy of how.

There are, however, cohesion technologies that provide [1] both cohesion and autonomy. Like other cohesion mechanisms, they apply constraints, and constraints reduce autonomy. At the same time, through their use, this special kind of technology increases the overall autonomy and agency of its users. Standards and protocols have the potential to be technologies of this kind. Two popular examples, reviewed briefly in the previous post, are HTML and HTTP.

This special quality of bringing both autonomy and cohesion distinguishes the cohesion mechanisms in the Interoperability area of the CABIN spectrum from those in the other areas. I call this quality liberating cohesion.

The present essay, part of the Autonomy and Cohesion series, will dive deeper into another standard that has this capability: RDF. Like HTML, RDF is an open standard maintained by the W3C. While HTML is for representing content for human readers, RDF is for representing structured data for both humans and machines. Unlike HTML, RDF is not just a language; it is a whole framework.

Let's first see what RDF is and where it can be found, and then discuss how using it brings cohesion and autonomy at the same time.

What is RDF?

RDF stands for Resource Description Framework. A resource is any entity [2], such as a person [3], a place, or just any thing, concrete or abstract, existing or imaginary. RDF provides a means to describe resources using uniquely identifiable relations. These relations describe both characteristics of and relationships between resources. Lastly, it is a framework that includes a conceptual model, semantics, vocabulary, and a set of serialization formats.

RDF is a data model, but the smallest element that can be considered RDF is the triple, which is already information.

RDF provides a single model for all kinds of data structures. Everything in RDF is expressed by triples: instance data, metadata, semantics, data transformation, validation, and rules. The basic elements of the query language SPARQL are also triple patterns.

RDF is a graph model, but its unit, the triple, is not just a node-edge-node construct. It is a logical statement.

Let's go through an example. In natural English, we can say:

A cat named Pat sat on a mat.

An RDF triple has the form subject-predicate-object. Subjects and predicates must be Uniform Resource Identifiers (URIs) [4]. In other words, subjects always denote entities such as people, places and things. Objects can be URIs or literals, expressing numbers, dates and strings. When objects are URIs, they can be subjects of other triples. When objects are literals, they can have a datatype such as string, integer, date, or boolean. Triples whose objects are URIs state the relationship between two entities, such as a cat and a mat, identified by the subject and object URIs. Triples whose objects are literals describe some characteristic of the subject, for instance, cat-hasColor-white and cat-hasName-Pat.

We need three things to express in RDF the cat statements above: a namespace [5], a convention for constructing identifiers, and a vocabulary [6] for the concepts. Let's take the classic example.org as a namespace. The convention for URI construction for instances of things will be: a mnemonic string denoting their type, suffixed by a number. Applying these conventions to our two concrete entities, cat and mat, we get http://example.org/data/cat1 and http://example.org/data/mat1 as their identifiers. Although we have chosen a convention to suggest the nature of these two things, for machines these identifiers are just strings. We then need some terms to make this meaning [7] explicit, and they should have URIs too. These terms will represent the nouns and verbs from the natural language but will be formally defined to remove ambiguity for non-human readers. Since such abstract concepts are also resources, just like the instances they describe, they need prefixes and a convention for the construction of their identifiers. Let's use the domain http://example.org/ again but separate the abstract concepts in another branch, /ontology, and use # as a delimiter instead of /. The naming convention for the unique part of the URI will be to use the respective noun for the types of things in PascalCase and the respective verb for the relationship type in camelCase. Using these conventions, an RDF representation of the cat sentence, expressed in Turtle notation, will look like this:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix d: <http://example.org/data/> .
@prefix o: <http://example.org/ontology#> .

d:cat1 rdf:type o:Cat .
d:cat1 o:hasName "Pat" .
d:cat1 o:sat d:mat1 .
d:mat1 rdf:type o:Mat .

The Turtle (Terse RDF Triple Language) notation, one of the many ways RDF can be serialized (more on that later), is easy to read. By declaring prefixes, the identifiers can be written as compact URIs. When a machine reads them, it will concatenate the common and unique parts of the URI so that d:cat1 will become http://example.org/data/cat1. In addition, Turtle includes some syntactic sugar. For example, "a" stands for rdf:type and is read "is a", making it sound like the natural statement it formally represents. To resemble written natural language even more, each Turtle statement ends with a full stop. Another element of syntactic sugar is the semicolon, used much as in regular punctuation, which allows listing several statements about the same subject without repeating the subject.

Applying these two conventions, we can replace the Turtle snippet above with this equivalent:

@prefix d: <http://example.org/data/> .
@prefix o: <http://example.org/ontology#> .

d:cat1  a o:Cat ;
        o:hasName "Pat" ;
        o:sat d:mat1 .
d:mat1  a o:Mat .

Easy to read, neat and elegant.
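The same notation also covers the literal datatypes mentioned earlier. The following is a minimal sketch, not part of the original example: the properties o:hasAge, o:bornOn and o:isIndoor are hypothetical, and the values are invented purely for illustration.

```turtle
@prefix d:   <http://example.org/data/> .
@prefix o:   <http://example.org/ontology#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# o:hasAge, o:bornOn and o:isIndoor are hypothetical properties; all values are invented.
d:cat1  o:hasName   "Pat" ;                    # plain literal, datatype xsd:string
        o:hasName   "Pat"@en ;                 # language-tagged string
        o:hasAge    5 ;                        # shorthand for "5"^^xsd:integer
        o:bornOn    "2019-05-31"^^xsd:date ;   # explicit datatype
        o:isIndoor  true .                     # shorthand for "true"^^xsd:boolean
```

Bare numbers and the keywords true and false are further Turtle shorthand; any other datatype is attached explicitly with ^^.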

Every set of RDF triples is a graph by virtue of having only triples and the fact that the same URIs in different triples represent the same resources. You can find them referred to as semantic knowledge graphs since they are knowledge graphs with explicit semantics, in other words, self-describing knowledge graphs. In graph-theoretic terms, they are directed labeled graphs: directed, because the meaning of a predicate is valid only in the direction from a subject node to an object node, and labeled, since the edges have labels which are the URIs of the predicates.

"Pat" is a literal value node, and this is where the graph ends. A literal cannot be the subject of a triple.

d:mat1 is a URI, so the graph can grow from that node on by describing the entity it represents and its relations with other entities. For example, we can add a triple about the color of the mat:

@prefix d: <http://example.org/data/> .
@prefix o: <http://example.org/ontology#> .

d:cat1  a o:Cat ;
        o:hasName "Pat" ;
        o:sat d:mat1 .
d:mat1  a o:Mat ;
        o:hasColor "brown".

Visually, the graph will look like this:

Relationships in RDF are first-class citizens with the same capabilities as entities. For example, unlike property graphs, where edges cannot be linked to other edges, in RDF it is common to have edges (RDF properties) linked to other edges or nodes. What's more, such constructs enhance the expressivity and the deductive power of the RDF graphs. To illustrate this, let's extend our ontology with the following statements:


@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix o: <http://example.org/ontology#> .

o:perched  a rdf:Property ;
        rdfs:subPropertyOf o:sat .

o:sat a rdf:Property .

o:hasName a rdf:Property .
o:hasNickName a rdf:Property ;
        rdfs:subPropertyOf o:hasName .

Then, if we have more information, say:

A bird, nicknamed Nerd, perched on a birch,

in RDF, it will look like this:

@prefix d: <http://example.org/data/> .
@prefix o: <http://example.org/ontology#> .

d:bird1  a o:Bird ;
        o:hasNickName "Nerd" ;
        o:perched d:tree123 .
d:tree123  a o:Birch .

Then, if we ask, running a logical reasoner or a SPARQL query, who sat and on what, the answer would be that the cat Pat sat on a mat (as stated) and the bird Nerd sat on a birch (inferred), even though different relationships (RDF properties) link the two animals with what they are called and where they happen to be situated.
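To make the inference concrete: under RDFS entailment, if a property p is a sub-property of q and the triple s p o holds, then s q o holds as well. The sketch below lists the triples a reasoner would be expected to derive from the statements above. They are not asserted anywhere in the data.

```turtle
@prefix d: <http://example.org/data/> .
@prefix o: <http://example.org/ontology#> .

# Inferred, because o:perched rdfs:subPropertyOf o:sat
d:bird1  o:sat      d:tree123 .

# Inferred, because o:hasNickName rdfs:subPropertyOf o:hasName
d:bird1  o:hasName  "Nerd" .
```

With these two triples in place, a query for everything connected by o:sat and o:hasName returns both animals. A SPARQL endpoint without reasoning would not add them on its own; in that case the query itself has to follow the rdfs:subPropertyOf links, for instance with a property path.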

Where can one find RDF?

RDF is not used as much as its maturity and the benefits it brings would suggest. At the same time, it is spread wider than most people suspect. One manifestation of RDF is as Linked Open Data (LOD). Many organizations from the public sector, life sciences, cultural heritage and media publish RDF datasets using shared vocabularies and semantic links to other LOD datasets. Some public knowledge graphs contain general knowledge (cross-domain) and others domain-specific knowledge. Wikidata, DBpedia and YAGO are prominent examples of general knowledge LOD graphs. Of them, Wikidata is the biggest, with close to 16 billion triples today and growing fast. If you run this query, it will generate a small subset of the Wikidata graph, showing the influences in the Age of Enlightenment. The relations are of the type influenced by. If you click on a node, it will show first-degree connections.

Wikidata graph rendition of a query to get the influences in the Age of Enlightenment, in the state after clicking on the node representing Gottfried Wilhelm Leibniz.

Examples of domain-specific knowledge graphs are UniProt for protein data and Cellar for EU legislation. UniProt is currently the biggest open knowledge graph, with 112 billion triples.

There are several ways to interact with LOD publications. They usually offer keyword search, faceted browsing or some other way to explore the graph by clicking, and a SPARQL endpoint.

LOD publications are not the only Linked Data accessible on the open web. There is a lot of Linked Data embedded in web pages, and its volume increases significantly each year. The Common Crawl corpus counted close to 139 billion schema.org triples in December 2023, while there were 106 billion in 2022, 94 billion in 2021, and only 20 billion in 2018. As expected, the biggest number and increase are related to products.

Source: https://webdatacommons.org/structureddata/schemaorg/

Apart from Linked Data available in RDF stores behind open SPARQL endpoints or embedded in web pages as RDFa or JSON-LD, there is also Linked Enterprise Data (LED), available as closed enterprise knowledge graphs in corporate networks. It is used as a semantic layer to unify and integrate the data fragmented in silo applications, or as operational knowledge graphs, in other words, as a generic backend and not just as an analytical read-only layer. Well-known companies such as Bosch, IKEA, UBS, BASF, Morgan Stanley, Wells Fargo, Schneider Electric, BMW, AstraZeneca, Pfizer, and Roche use RDF to various degrees. There are also names that are less known but prominent in their industry. For example, you probably haven't heard of Grundfos, but they are the largest pump manufacturer in the world, and they use RDF extensively.

How does RDF bring autonomy and cohesion?

RDF combines universal ways to name, structure and give meaning to data using only open standards. Naming is done with URIs; the structure is always the subject-predicate-object triple, and the meaning is provided by extending RDF with shared vocabularies. These three ways, individually and in combination, enable autonomy and cohesion. Let's see how.

Naming (Identifiers)

URIs provide an efficient and effective solution to the problem of relating names and things. They are proper names.

[T]he architects of the Web have taken hold of the idea of proper names, and without purposefully altering its definition, have made naming the first supporting pillar of the Web, thus formulating an answer to the ages-old question of the relationship between words and things by combining in an original—and unintentional!—fashion the thoughts of Frege, Russell, Wittgenstein, and Kripke. [8]

One thing can be identified by multiple URIs (more on that aspect of autonomy later), but one URI uniquely identifies one referent (cohesion). At the same time, there can be many human-readable labels associated with a URI, which elegantly solves multilingualism and various UI and text-search problems. Labels are decoupled from the identifiers (autonomy).

Labels are typically linked with rdfs:label, but there can be more sophisticated relations, such as those given by skos:prefLabel, skos:altLabel, and skos:hiddenLabel. And yet, that higher expressivity doesn't cost any cohesion, since all three of them are sub-properties (rdfs:subPropertyOf) of rdfs:label. When labels need lifecycle data and other metadata, that is just a modeling problem, elegantly solved by the popular extension of SKOS, SKOS-XL.
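As a small illustration of labels decoupled from identifiers, here is a sketch that attaches several labels to the o:Cat class from the earlier example. The label strings are invented for illustration; the three sub-property statements at the end are the ones the SKOS vocabulary itself declares.

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix o:    <http://example.org/ontology#> .

# One identifier, many labels (label strings are illustrative only).
o:Cat  rdfs:label        "Cat"@en , "Katze"@de , "chat"@fr ;
       skos:prefLabel    "cat"@en ;
       skos:altLabel     "domestic cat"@en , "house cat"@en ;
       skos:hiddenLabel  "catt"@en .    # a common misspelling, useful for text search

# Declared by SKOS: every SKOS label is also an rdfs:label once inference is applied.
skos:prefLabel    rdfs:subPropertyOf  rdfs:label .
skos:altLabel     rdfs:subPropertyOf  rdfs:label .
skos:hiddenLabel  rdfs:subPropertyOf  rdfs:label .
```

Renaming or translating a label changes only a literal; the URI, and every triple that points to it, stays the same.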

URIs are global. They are not coupled to a particular data store. Such decoupling gives autonomy: things represented with URIs are not dependent on a particular database. At the same time, URIs bring interoperability and, with it, cohesion: URIs can be resolved across different data stores. In comparison, identifiers in SQL databases are unique only within a table or a database instance. Likewise, in labeled property graphs, the identifiers are database-specific.

HTTP URIs are resolvable. That allows you to perform an exploratory search [9] or browse the RDF graph in any direction you want using the so-called follow-your-nose approach. A hop in the graph is not simply a navigation step, as with regular hyperlinks. All nodes are linked in a meaningful way, always part of a bigger story. The extra benefit is finding new things and associations serendipitously.

HTTP URIs have a small drawback when it comes to autonomy: they are coupled with a host. But that's not a limitation of RDF. URIs in RDF can be host-independent, as is the case with Decentralized Identifiers (DIDs), another open standard maintained by the W3C.

URIs are globally unique identifiers, but one URI is not the only way to identify one particular thing. That would be a case of extreme centralization. On the contrary, every organization (or every person if need be) is free to identify resources independently. That autonomy is balanced with cohesion. Two URIs that identify the same thing can be linked with owl:sameAs. Such linking is one of the Linked Data principles and is the criterion for 5-star open data publishing. That's also applicable in the enterprise context, where alternative approaches are applied along with owl:sameAs. One such approach is using the property gist:identifiedBy and the class gist:ID [10].
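A minimal sketch of such a link, assuming a second, hypothetical organization that has minted its own identifier for Pat the cat (the vet.example.net URI is invented for illustration):

```turtle
@prefix d:   <http://example.org/data/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

# Two independently minted identifiers declared to denote the same individual.
# The second URI is hypothetical.
d:cat1  owl:sameAs  <http://vet.example.net/patients/4711> .
```

Each party keeps its own naming scheme, and a reasoner that supports owl:sameAs can treat statements made about either URI as statements about the same cat.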

Structure (Triple)

In most data management approaches, you need to know the structure to query and use the data. Meaning is tightly coupled with structure [11]. Even in simple cases, like having two slightly different tables, and even when they are created with the same tool, if you paste one of the tables below or next to the other, you won’t get a coherent combined table. In RDF, meaning and structure are decoupled. The structure is always a triple and can express anything. If you bring together two different RDF graphs, created in different contexts and using different tools, they become one valid RDF graph. If you have two N-Triples files and copy and paste one inside the other, the result is a valid and coherent RDF graph.
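The copy-and-paste claim can be sketched with the cat and mat data. N-Triples writes one triple per line with full URIs and no prefixes, and every N-Triples document is also valid Turtle. The split into two files is hypothetical.

```turtle
# File 1 (hypothetical): statements about the cat
<http://example.org/data/cat1> <http://example.org/ontology#hasName> "Pat" .
<http://example.org/data/cat1> <http://example.org/ontology#sat> <http://example.org/data/mat1> .
# File 2 (hypothetical), pasted directly below: statements about the mat from another source
<http://example.org/data/mat1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/ontology#Mat> .
<http://example.org/data/mat1> <http://example.org/ontology#hasColor> "brown" .
```

The result is one graph in which the two sources meet at the shared URI for the mat. One caveat: blank node labels are local to a file, so files that contain blank nodes should be merged with an RDF tool rather than by plain concatenation, to avoid accidental label clashes.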

RDF has one structure, the triple, used for everything. The instance data is RDF triples, which get meaning from ontologies, and ontologies are RDF triples too. The typical way of transforming non-RDF data to RDF is by creating RML mappings, which are also RDF triples, and so are the validation rules (SHACL) and the business rules [12]. It's triples all the way down.

RDF is a conceptual model. You have the autonomy to serialize it and store it the way you want. The same graph can be written to files as Turtle, N-Triples, JSON-LD, or RDF/XML, or embedded in HTML as RDFa.

That's about files, but of course, quite often RDF is stored in databases. And while the typical RDF store is a triplestore (or quadstore), RDF can be stored in SQL databases [13], graph DBs [14], document databases, or key-value stores, as Oxigraph does [15].

Semantics (Logic)

Decoupling structure and semantics, combined with the RDF feature that data and metadata are expressed in one and the same structure, gives the freedom to postpone schema definition. This makes RDF-based systems future-proof since the cost of change is very low.

If, in the description of Pat the cat, we add just another triple, d:cat1 a wd:Q146, where wd:Q146 represents the class of cats in Wikidata, we get:

@prefix d: <http://example.org/data/> .
@prefix o: <http://example.org/ontology#> .
@prefix wd: <http://www.wikidata.org/entity/> .

d:cat1  a o:Cat, wd:Q146 ; 
        o:hasName "Pat" ;
        o:sat d:mat1 .
d:mat1  a o:Mat ;
        o:hasColor "brown".

Using a shared vocabulary brings semantic cohesion. Now, we can ask, for instance, what the name of the oldest known animal of Pat's kind is, and we get the answer "Creme Puff," together with other data such as birth and death dates. The shared node wd:Q146 connects our data and Wikidata into one coherent graph.

Note: the edges are labeled with Wikidata property labels, not with their URIs, to be easier to read.

Using a shared vocabulary is a powerful way to bring cohesion, but it also creates a dependency. Such a dependency, however, can be avoided while keeping the benefits of the relationships. Here's an example. The Software Description Ontology (SDO) provides a way to describe software components. If you use this ontology, you will declare the applications you want to describe as sd:Software. This will also make them instances of the class schema:SoftwareApplication, because sd:Software is a subclass of schema:SoftwareApplication. The interoperability of your data will be increased since schema.org is widely used. The cohesion achieved this way, however, will not create a dependency, as would have been the case if schema:SoftwareApplication had been directly reused in SDO. If, for some reason, schema.org is gone tomorrow, nothing in your statements will break. And if another ontology, with its own class ex:Application, uses the same relationship, its designers have the autonomy to have their own, probably slightly different, notion of what a software application is. Every instance of sd:Software and ex:Application is a schema:SoftwareApplication, but not vice versa.
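In Turtle, the arrangement described above could be sketched as follows. The namespace URIs for sd: and schema: are written as I understand them and should be checked against the published ontologies; the ex: ontology and the instance d:app1 are hypothetical.

```turtle
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema: <https://schema.org/> .            # assumed namespace URI
@prefix sd:     <https://w3id.org/okn/o/sd#> .     # assumed namespace URI
@prefix ex:     <http://example.org/other-ontology#> .
@prefix d:      <http://example.org/data/> .

# The relationship described above for SDO
sd:Software     rdfs:subClassOf  schema:SoftwareApplication .

# A hypothetical second ontology using the same relationship for its own class
ex:Application  rdfs:subClassOf  schema:SoftwareApplication .

# Your data refers only to sd:Software ...
d:app1  a  sd:Software .
# ... and a reasoner can infer:  d:app1 a schema:SoftwareApplication .
```

The instance data never mentions schema.org directly, which is why nothing in it breaks if the superclass disappears.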

And this was just using one property, rdfs:subClassOf. RDF can be further extended to express complex relations using all axioms of first-order logic or a subset of it. This expressive power comes with two additional benefits: logical inconsistencies can be avoided, and inferencing over the graph can generate new statements (triples), which are the logical consequences of the asserted statements. We learn more from what we know already.

Conclusion

RDF is one of those standards that bring both autonomy and cohesion. It provides interoperability, which results in cohesion between heterogeneous data sources, and flexibility, which results in increased autonomy of its users. The balance between autonomy and cohesion is achieved through standardizing identification, structure and semantics.

[1] The use of verbs like bring, provide, reduce, increase, enable, support and so on should not be taken to imply that standards of this kind have some kind of agency. The cohesion or autonomy is not the result of an action from their side but a consequence of their use.

[2] In the context of RDF, resource and entity are synonyms, and the latter is used in the RDF semantics specification.

[3] Regarding persons as resources is common but highly problematic, as I discussed elsewhere. That doesn’t affect the value of RDF. It’s simply a reminder that here too, an alternative, such as entity, would’ve been more appropriate.

[4] A few clarifications. All URIs today are IRIs since they can accept international characters, while the original URIs were restricted to ASCII characters. But I’ll keep calling them URIs. The second clarification is that, apart from URIs, RDF 1.1 also allows blank nodes. Blank nodes are nodes that serve as placeholders for resources without assigning them a URI. They are usually used for grouping triples whose common subject is not supposed to be referred to from other resources, so it doesn’t need a global identifier. The last clarification is that, from RDF 1.2, still a draft, subjects and objects of an RDF triple can also be triple terms, in other words, nested triples.

[5] That’s only the case for HTTP URIs. DID URIs are domain-independent.

[6] I use vocabulary and ontology interchangeably in this essay. In the context of semantic web ontologies, lightweight ontologies containing only definitions of properties and classes and RDFS-level axioms are usually referred to as vocabularies. There is also the term controlled vocabulary, which encompasses reference datasets such as taxonomies, thesauri, code lists and authority tables.

[7] That something is meaningful for machines is a shorthand to point to an important utility, but taken literally it can lead to dangerous anthropomorphizing.

[8] Halpin, H., & Monnin, A. (Eds.). (2014). Philosophical engineering: Toward a philosophy of the web. Wiley/Blackwell.

[9] Watch this 18 min. lecture by Prof. Harald Sack if you want to learn more about exploratory search with RDF.

[10] The approach is well explained in this video.

[11] Dave McComb put it aptly in Data-centric Revolution: “In a traditional system, meaning is both horribly bound up in the data structure, and at the same time, not discoverable from it.”

[12] There are different standards for expressing business rules in RDF: SHACL, SPIN, SWRL, and RIF.

[13] See https://timbr.ai

[14] For example, one of the most popular graph DBs, Neo4j, can become an RDF store using the neosemantics plugin.

[15] Oxigraph is an RDF store based on the RocksDB key-value store.
