Rules on Graphs in Graphs of Rules, Part 3
Use cases and benefits
This post is part of a mini-series on inference rules, which is part of a larger series on rules, which is part of an even larger series about autonomy and cohesion.
The objective of the first essay was to understand how inference rules work. The example used a small target graph that could be fully displayed in a diagram. The resulting graph had the same number of nodes but was denser after eight inferred relations were generated. In the second part, the inferred relations were similar in kind, but they numbered in the tens of thousands. There, the inference ran over a much bigger graph of millions of facts. More importantly, it was done using six different ways to express the same logic. We learned about the consequences of different design decisions, resulting in different systems along the autonomy-cohesion axis.
In both essays, some benefits were stated, others were implied or listed in a linked slide deck, but you are most likely still left with the question: why bother?
That’s what the current post aims to answer.
The usual drivers for using inference rules are performance optimisation and the need for logical reasoning. But rules can also be used for entity reconciliation, analytics, and even for the primary task of generating knowledge graphs from heterogeneous data sources.
All of those will bring immediate benefits. But they would do so even if the rules are not themselves in a graph. And indeed, the common practice is to keep even declarative rules in the application layer. But the big long-term benefits of inference rules can come when they are kept not in the application layer but in the data layer. Such a shift is another contribution to data-application decoupling, with long-term benefits that improve adaptability by lowering the cost of change.
What follows is a quick review of the use cases and benefits of using rules on graphs, maintained in graphs of rules.
Speed
The most popular use case, and the most immediate benefit, is query simplification and performance.
In the example from the previous post, if we want to count the number of uncles, we have to use the following SPARQL query:
PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT (COUNT(DISTINCT ?uncle) AS ?totalUncles)
WHERE {
  {
    # Uncle via father's male sibling
    ?person wdt:P22 ?father .
    ?father wdt:P3373 ?uncle .
    ?uncle wdt:P21 wd:Q6581097 .   # Male
  }
  UNION
  {
    # Uncle via mother's male sibling
    ?person wdt:P25 ?mother .
    ?mother wdt:P3373 ?uncle .
    ?uncle wdt:P21 wd:Q6581097 .   # Male
  }
}

The same query over the inferred graph looks like this:
PREFIX s: <http://velitchkov.eu/shapes/rules-post#>

SELECT (COUNT(DISTINCT ?uncle) AS ?uncleCount)
WHERE { ?person s:hasUncle ?uncle . }

It is not just way simpler. It is seven times faster.
Identity
The same entity can appear with different global identifiers. This may be because the graph combines datasets from different publishers, each using its own identifiers. Or it could be that, when generating a knowledge graph from heterogeneous data structures, identifiers are missing and must be created from the content of a source file (such as JSON or XML). In cases like those, rules can infer triples that link all identifiers of the same entity. Sometimes the property owl:sameAs can be used, but there are better ways.
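To make this concrete, here is a sketch of such a rule as a plain SPARQL CONSTRUCT (the same query can be embedded in a SHACL SPARQL rule). The ex: namespace and the identifying property are made up for the example; the linking property here is skos:exactMatch, but owl:sameAs or any other linking property would work the same way:

PREFIX ex:   <http://example.org/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

CONSTRUCT {
  # Link the two representations of the same entity in both directions
  ?a skos:exactMatch ?b .
  ?b skos:exactMatch ?a .
}
WHERE {
  # Two resources with different IRIs share the same identifying value
  ?a ex:registrationNumber ?id .
  ?b ex:registrationNumber ?id .
  FILTER (?a != ?b)
}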
Graph Generation
Inference rules can be used to generate graphs from heterogeneous data structures. One reliable way is to use Façade-X to generate a raw RDF graph, then apply SHACL rules to that graph to enforce the desired identities and semantics.
Façade-X is a method for abstracting heterogeneous structures into an RDF graph. I have written a couple of essays about it previously. Currently, Façade-X is supported by the open-source tool SPARQL Anything, and it is on its way to becoming a standard, with more implementations to come; some are already in development.
SPARQL Anything can now generate RDF graphs from XML, JSON, CSV, HTML, Excel, Text, Binary, EXIF, File System, Zip/Tar, Markdown, YAML, Bibtex, DOCx, and PPTX.
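As a rough sketch of the first step (the file name here is made up), the whole raw Façade-X graph of a JSON file can be produced with a query like this:

CONSTRUCT { ?s ?p ?o }
WHERE {
  # SPARQL Anything opens the source named in the SERVICE IRI
  # and exposes its content as a Façade-X graph
  SERVICE <x-sparql-anything:location=people.json> {
    ?s ?p ?o
  }
}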
Once the raw graph is generated, SHACL rules can apply the desired approach of resource identifiers and the intended semantics for the target knowledge graph. Resource identifiers are minted by concatenating a namespace and a local name. The local name can reuse some identifier from the source data, or be constructed as a concatenation of several values, generated with a UUID algorithm or created as a hash. The last option makes this rule-based approach superior to some mainstream approaches, such as those using the RDF Mapping Language RML. RML does not support hash functions, while they are standard in SPARQL. Also, using RML requires a solid understanding of source formats and structures, whereas using Façade-X and SHACL rules requires only knowing SPARQL.
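As an illustration of the hash option, here is a sketch of such a rule as a plain SPARQL CONSTRUCT (again, the same query can sit inside a sh:SPARQLRule). The xyz: source properties and the target namespace are assumptions made for the example:

PREFIX xyz: <http://sparql.xyz/facade-x/data/>
PREFIX ex:  <http://example.org/>

CONSTRUCT {
  ?personIri a ex:Person ;
             ex:name ?name .
}
WHERE {
  # Raw Façade-X triples coming from the source file
  ?row xyz:first_name ?first ;
       xyz:last_name  ?last .
  BIND (CONCAT(?first, " ", ?last) AS ?name)
  # Mint a stable IRI: a namespace plus a hash of the identifying values
  BIND (IRI(CONCAT("http://example.org/person/",
                   SHA256(CONCAT(?first, "|", ?last)))) AS ?personIri)
}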
Data Catalogues
The approach described for generating knowledge graphs can be used to automatically catalogue data. What needs to be known is the data source and the target catalogue model. As long as there is a protocol to connect to the data store and there are read privileges, a Façade-X implementation can generate the raw graph, and SPARQL (or other SHACL) rules can transform it into the desired catalogue shape.
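For example, a rule along these lines could turn raw triples about discovered files into DCAT catalogue entries. The xyz: source properties and the catalogue namespace are assumptions for the sketch; the target terms are standard DCAT and Dublin Core:

PREFIX xyz:  <http://sparql.xyz/facade-x/data/>
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct:  <http://purl.org/dc/terms/>

CONSTRUCT {
  ?dataset a dcat:Dataset ;
           dct:title ?fileName ;
           dcat:distribution [
             a dcat:Distribution ;
             dcat:byteSize ?size
           ] .
}
WHERE {
  # Raw triples describing a discovered file
  ?file xyz:fileName ?fileName ;
        xyz:size     ?size .
  BIND (IRI(CONCAT("http://example.org/catalogue/", SHA256(?fileName))) AS ?dataset)
}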
Data Quality
The most popular use of SHACL is as an RDF validation language. A SHACL shapes graph is applied to a data graph by a SHACL engine, and the result is a validation report. The report tells us what and where the problems are, but the problems remain.
With inference rules, some of these problems can be fixed. A typical case is when a value is missing, but it can be calculated or substituted. Another occurs when a value, such as a date, is not in the required format but the error follows a consistent pattern.
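Taking the second case as an example, a rule like the sketch below rewrites dates stored as DD/MM/YYYY strings into proper xsd:date values. The ex: properties are made up; the functions are standard SPARQL:

PREFIX ex:  <http://example.org/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

CONSTRUCT {
  ?record ex:date ?fixedDate .
}
WHERE {
  ?record ex:rawDate ?raw .
  # Only touch values that follow the known wrong pattern DD/MM/YYYY
  FILTER (REGEX(?raw, "^[0-9]{2}/[0-9]{2}/[0-9]{4}$"))
  # Reorder the parts into YYYY-MM-DD and type the result as xsd:date
  BIND (STRDT(REPLACE(?raw, "^([0-9]{2})/([0-9]{2})/([0-9]{4})$", "$3-$2-$1"),
              xsd:date) AS ?fixedDate)
}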
Inference rules also do a great job with incomplete data: a value may be missing, but for analytical purposes another value is good enough and can be substituted.
What is important in all these cases is to record the change, so that the provenance is clear and it is easy to tell what comes from the source data and what comes from data enrichment. This kind of provenance information can again be recorded using inference rules.
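Staying with the date example, a companion rule might record the enrichment with standard PROV-O terms (finer-grained, per-statement provenance is also possible with reification or RDF-star). The ex: names are again illustrative:

PREFIX ex:   <http://example.org/>
PREFIX prov: <http://www.w3.org/ns/prov#>

CONSTRUCT {
  # Mark the record as enriched and say which rule influenced it
  ?record prov:wasInfluencedBy ex:dateNormalisationRule .
  ex:dateNormalisationRule a prov:Entity .
}
WHERE {
  # Any record that has both a raw source value and an inferred date
  ?record ex:rawDate ?raw ;
          ex:date    ?fixed .
}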
Long-term benefits
When business rules are kept not in the application layer but in the data layer, this contributes to application-data decoupling, reducing technical debt and the costs of change and integration. Let’s unpack this.
Corporate IT in large organizations is built with an application-centric mindset. This is a systemic problem arising from how IT investments are managed. Typical IT investments focus on immediate business needs, staking out accidental application boundaries that inevitably become silos. The chosen solutions are driven by risk aversion and comfort: either applications are purchased from large vendors or developed using technologies the teams are comfortable with. On top of that, there is a functional-requirements bias that persists because demonstrable features satisfy decision-makers, while “unsexy” non-functional needs like interoperability are ignored. In project boards, “space” (the enterprise) and “time” (the future) are not represented. All this leads to fragmented data, high costs of change, and high costs of integration. I described all this in detail in a previous essay.
The way to resolve this is, as I suggested there, to unify and decouple at the same time. To unify entities’ identity through URIs, structure via RDF, and semantics through shared ontologies. And to decouple data from applications. When data is self-describing, applications become mere “visitors,” preventing them from owning — and ultimately breaking — the information they serve.
When we have inference rules working on RDF graphs, we are halfway there. The other half is to keep these inference rules not in the application layer, but as RDF graphs themselves. This improves data governance (it is easier to add metadata and extend as needed), but the bigger benefit is that it contributes to the decoupling of data from applications, an architectural style also known as data-centricity.


