Apps Break Data
Information is not a first-class citizen in corporate information systems. Worse, it is neglected. Then why do we still call them information systems? We don't. We call them applications. And applications, quite appropriately, are built or purchased with an application-centric mindset. Consequently, data is broken into diverse fragments, tightly coupled with applications, and expensive to reassemble into coherent information entities. Software engineering and procurement practices work in sync with market forces to maintain this trend.
How does this happen? And what can be done about it?
This essay is about the corporate management of data. If an essay about corporate data management had been written a decade ago and mentioned Artificial Intelligence, that would have been a weird essay. It's the opposite today: a weird data management essay is one that is not about AI. And this will be such a weird essay. I think that AI, and more specifically LLMs, came too early, before we managed to solve the problems with data quality and interoperability.
Speaking of AI and weird, a recent paper1 reported that all LLMs have WEIRD bias. They are trained on data coming exclusively from countries that are W.E.I.R.D: Western, Educated, Industrialized, Rich, and Democratic. For LLMs, WEIRD is the norm.
In large organizations, weird is the norm as well, and here is one manifestation:
Information systems are not about information.
Now, the question is, of course,
How could that be?
How does it happen? The best way to answer this question is to review a typical IT investment process.
An IT investment is a chain of decisions framed between two realizations. The first is becoming aware of a business need. The second is making the change that is supposed to address that business need. In between, we have to justify the investment and make a set of choices.
The event that triggers the process is becoming aware of a business need. It takes some critical mass and energy to move forward. When there is enough, it leads to a business case that needs to justify the investment.
The business case defines the initial scope. The scope may change later on, but its initial state marks the application boundaries. Imagine staking out a house: the strings stretched between the stakes determine where the walls of the future silo will be built.
What is important to note here is that application boundaries are historical and accidental. They are determined by past experience and chance.
Since justification is an important factor, it's worth mentioning a common business case paradox. The more value a business case claims, the higher its chance of being supported. But the more is promised, the lower the probability that it will be delivered. Chris Potts calls this the "Project Probability Paradox."
And IT projects fail a lot. In the early 90s, the Standish Group alarmed everyone by reporting that only 16.2% of IT projects were successful, meaning they were completed on time, on budget, and with the promised functionality. In the two decades that followed, things did not change much.
Failed IT projects get all the attention, but they are not responsible for the current state of corporate IT landscapes. It is created by the successful ones. How? Projects create local optima, which end up as data silos.
Famous and familiar
To see that better, let's go back to the IT investment process. We are now at the business case stage. It’s time to decide whether to buy or build software. But whatever the choice, there is a pattern. The pattern is based on risk aversion, convenience, and habits. I call that pattern "choosing the famous and familiar." If the decision is to buy, usually that's from a big and well-established company with a good reputation. When the decision is to build, the natural bias is to choose those technologies the current IT team is comfortable with, which are not necessarily the ones most appropriate for the task.
Once we are in the project stage, two interesting things happen. One is related to requirements, and the other to representation.
Functional requirements bias
Requirements have already been outlined at the business case stage. In the project, they take definitive shape, regardless of the method used to define them, be it traditional requirements elicitation or user stories. And just as with the biases in the choice to buy or build, the pull of the famous and familiar, there is a strong bias here, too: functional requirements have much higher priority than non-functional requirements. This is natural. First, functional requirements stem from the initial business need. Second, they are directly related to the benefits claimed in the business case. And third, they are demonstrable. When some money is spent, you need to show something. Decision-makers like demos. And who doesn't?
The non-functional requirements come second. When some of them do get more attention, it is thanks to market forces. That's the case with scalability: it was neglected before the cloud. Now, many features take care of scalability, and the cloud providers make sure they are right in your face. As long as you can afford it, you are welcome to scale.
Among non-functional requirements, scalability got lucky. Security as well. Others, not so much. Worst of all is interoperability. Nobody so far has invented a way to make it sexy or to make a profit from it.
The functional requirements bias and the neglect of non-demonstrable interoperability work in synergy with another pattern: underrepresentation.
Underrepresentation
IT projects have boards and steering committees where the different stakeholders are represented. However, two "stakeholders" are always absent from the project boards: space and time.
By space, I mean that the whole of the enterprise is not represented in project decisions. Basically, there is nothing to offset the tendency of a project to achieve its KPIs at the expense of enterprise-wide benefits. It is a well-known problem, and many approaches have been proposed as solutions. Disciplines like IT Governance and Enterprise Architecture (EA) were born out of these concerns. But they have a marginal effect. IT Governance is limited to a few checkpoints and rarely focuses on data and interoperability. Enterprise Architecture is called enterprise, yet enterprise architects report to the CIO, which, in combination with some other pathologies2, makes EA dysfunctional. And even when it's not, having working software is way more important than diagrams with boxes and arrows that may point to some risk.
Not only space but also time is missing from project boards and steering committees. More precisely, the future is not represented. All projects are driven by historical functional requirements determined by concrete business cases (good in itself, but not balanced), and they are not prepared for unforeseen ones. In other words, the software that is built is not future-proof.
High cost of change and integration
The resulting corporate IT landscape looks like this. All information entities, such as customer, employee, supplier, product, contract, order, invoice and payment, are like pottery: vases, teapots, jars, bowls; you get the picture. You know what happens and how it looks if you smash them on the floor: thousands of small fragments. Each information system is like a cup containing a few scoops of these fragments. Creating a view of an information entity feels like assembling a ceramic vase or a pot out of those small fragments, puzzled about where to look for the missing pieces (and, unlike with real pottery, the application cups are opaque) and how they fit together. That's the IT landscape of most large organizations.
Why is this the case?
The information models of corporate applications sit in the physical layer, separate for each application, and their interpretation is hidden in the application code.
What does it lead to?
High cost of change, integration and migration. Or, in other words, technical debt. In each application, the data structure is fixed by the functional requirements, which, as we saw, are accidental and historical. The function scope determines the data scoop. But both the organization and the environment change. There are new user and business needs, and new compliance requirements. Implementing a change in an IT landscape built with an application-centric mindset takes a long time.3 Even a small change typically requires changing a database schema, which can take months and cost millions. Or there is a need to integrate the data from several applications. That can be done by building interfaces, buying data-integration platforms and data lakes, and implementing API platforms. Investing in them means the technical debt is paid off by taking out even bigger loans at higher interest.
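To make the fragmentation tangible, here is a deliberately simplified, hypothetical sketch in Python. The systems, identifiers and field names are all made up; the point is only to show the kind of per-pair glue code that reassembling a single entity ends up requiring.

```python
# Hypothetical illustration: the same real-world customer, scooped into two
# application databases with different local keys, field names and languages.
crm_row = {"cust_id": 10482, "name": "ACME GmbH", "segment": "B2B"}
erp_row = {"kunde_nr": "DE-7731", "firma": "ACME GmbH", "zahlungsziel": 30}

def merge_customer(crm: dict, erp: dict) -> dict:
    """Brittle, per-pair glue code: join on a name string and hope it matches."""
    assert crm["name"] == erp["firma"]
    return {
        "name": crm["name"],
        "segment": crm["segment"],
        "payment_terms_days": erp["zahlungsziel"],  # meaning guessed from the column name
    }

print(merge_customer(crm_row, erp_row))
```

Multiply this by every pair of applications that holds a scoop of the same entity, and the interest on the loan becomes visible.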
What can be done about it?
Unify and Decouple
Looking back at history, we'll see this is not a new problem. The spread of steam engines and other machinery during the Industrial Revolution necessitated the production of large quantities of screws and bolts. Each manufacturer had its own view of the pitch, depth and form of screw threads, which resulted in a large variety of threads. Exchange and replacement were limited, and repairs were difficult and expensive. To address this "evil",4 Joseph Whitworth, a prominent British engineer and inventor, gathered all types of screws and bolts and compared them. Then, in 1841, he presented his proposal for unification to the Institution of Civil Engineers. His "Paper on an Uniform System of Screw Threads" marked the birth of standardization.
That's the first history lesson: When the diversity of engineering design or measurement systems creates a problem, reduce it. Unify.
Unification is not the only way to deal with this kind of situation. As explained in detail in Stimuli & Responses, one way to match the variety of the environment is to reduce it; the other is to increase our own variety. We've been doing the latter for centuries using various information technologies. So, let's take a second look at history, but this time, the history of information technologies.
There is an interesting trend: we tend to increase the flexibility of the information technologies we use so that they can be applied in more and different ways. The history of information technologies can be read as a history of increasing decoupling. When writing was invented, it decoupled content from its only medium until then, oral speech. It opened up new possibilities: messages and stories could now travel in time and space. When symbols got decoupled from the objects they represented, this allowed for new ways of thinking. Later on, the printing press enabled individual ownership of books and, in this way, decoupled the interpretation of information from authoritative sources like priests and scholars. The decoupling of software from hardware marked the era of modern computing. The decoupling of service from implementation, brought by new protocols and architectural styles, resulted in advanced web clients and applications. In summary, an effective way to amplify our variety is to use information technologies and to find more and smarter ways to decrease the dependencies between their components.
So, the shortest answer to what can be done to improve the abysmal situation created by the application-centric mindset is to unify and decouple. What needs to be unified is identity, structure and semantics. And what needs to be decoupled is data from applications.
Regular readers of Link & Think will have recognised that this is the balance between Cohesion and Autonomy, and that it can only work if maintained at every level. When unifying identity, structure, and semantics, decoupling is also needed; and when decoupling data from applications, it should be done in a standardised way so that data and applications keep working together, and in combinations not possible before.
Unify
Data is typically stored in relational SQL databases. But it is not only the storage paradigm that creates silos; it's the software engineering culture that goes with it. A graph database may overcome some of the deficiencies of SQL databases, but it can be a silo too. One of the reasons is the use of local identifiers: entities have identifiers that can only do their job inside one database, and in a way specific to that database. To remove the dependency between an information entity and the data stores holding scoops of its characteristics, identifiers need to be unified (use the same standard) and need to work across data stores (be globally unique). It's not a coincidence that this is FAIR principle number 1:
F1. (Meta)data are assigned a globally unique and persistent identifier
The standardised, established and proven way to do so is with Uniform Resource Identifiers (URIs). Typically, that means HTTP URIs, but if there is a need to decouple from the host, we now have Decentralised Identifiers (DIDs) as another kind of URI.
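As a minimal sketch of what that looks like in practice, here is the customer from the earlier fragmentation example given a single HTTP URI that is valid in every store and every application. It uses the rdflib library; the example.com namespaces are placeholders, not a recommendation.

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS

# Hypothetical enterprise namespaces: one for identifiers, one for the ontology.
DATA = Namespace("https://data.example.com/id/")
ONT = Namespace("https://data.example.com/ontology/")

g = Graph()
g.bind("data", DATA)
g.bind("ont", ONT)

# One globally unique, persistent identifier for the customer,
# instead of CRM row 10482 in one silo and ERP key "DE-7731" in another.
customer = DATA["customer/acme-gmbh"]
g.add((customer, RDF.type, ONT.Customer))
g.add((customer, RDFS.label, Literal("ACME GmbH")))

print(g.serialize(format="turtle"))
```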
The second thing to unify is the structure. Data is stored using heterogeneous structures. Even when using the same storage paradigm, interoperability is deficient. Different proposals exist to solve this. From what I've experienced,5 the only mature standard that effectively unifies heterogeneous data structures is RDF. I explained some of RDF's benefits in the previous post.
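To hint at why, here is another small sketch with rdflib and hypothetical data: whatever shape the source systems had, once expressed as RDF everything is subject-predicate-object triples, so fragments from different applications merge by simply being added to the same graph.

```python
from rdflib import Graph

# Fragments of the same customer, as two applications might publish them.
crm_fragment = """
@prefix ont: <https://data.example.com/ontology/> .
<https://data.example.com/id/customer/acme-gmbh> ont:segment "B2B" .
"""

erp_fragment = """
@prefix ont: <https://data.example.com/ontology/> .
<https://data.example.com/id/customer/acme-gmbh> ont:paymentTermsDays 30 .
"""

g = Graph()
g.parse(data=crm_fragment, format="turtle")
g.parse(data=erp_fragment, format="turtle")  # merging is just adding triples

for s, p, o in g:
    print(s, p, o)
```

Because both fragments use the same global identifier and the same triple structure, the merge needs no mapping tables or join keys.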
The third thing that needs to be unified is semantics. When discussing the IT investment earlier, I wrote that it is framed between two realizations. It wasn't a problem for you to figure out that the same word, realization, carries two meanings. Within that single sentence, it referred to the initiation of the IT investment with its first meaning (becoming aware of something) and to the implementation with its second (causing something to happen). This shift of meaning was implicit for you, but for machines dealing with structured data, it needs to be made explicit6. If each application has its own data store and each data store has its own data model, semantic interoperability is a challenge. The interpretation is hidden in the application code or, at best, partially shared via an API. The way to overcome this is to use shared enterprise and domain ontologies instead of local data models. The unification comes from using the same standards (RDFS, OWL, SHACL) and shared, enterprise-wide ontologies. That doesn't mean using a single ontology. On the contrary, decoupling and autonomy are important here, too. There can be multiple ontologies: an upper ontology like Gist, extended first with common enterprise-specific classes and properties and then further extended with domain ontologies.
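Here is an equally minimal sketch of what one shared definition can look like. The gist namespace and the gist:Organization class are assumptions on my part (check the release you actually use), and the enterprise namespace is, again, a placeholder.

```python
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS, OWL

GIST = Namespace("https://w3id.org/semanticarts/ns/ontology/gist/")  # assumed namespace
ONT = Namespace("https://data.example.com/ontology/")                # hypothetical

g = Graph()
g.bind("gist", GIST)
g.bind("ont", ONT)

# One enterprise-wide definition of Customer, extending the upper ontology,
# reused by every application instead of each keeping its own local model.
g.add((ONT.Customer, RDF.type, OWL.Class))
g.add((ONT.Customer, RDFS.subClassOf, GIST.Organization))
g.add((ONT.Customer, RDFS.label, Literal("Customer")))
g.add((ONT.Customer, RDFS.comment, Literal(
    "An organization that has purchased, or contracted to purchase, goods or services.")))

print(g.serialize(format="turtle"))
```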
Decouple
What needs to be decoupled is data from applications. It is best captured by the principles from the data-centric manifesto.7 Let’s take three of them:
Data is self-describing and does not rely on an application for interpretation and meaning.
Applications are allowed to visit the data, perform their magic and express the results of their process back into the data layer.
Access to and security of the data is the responsibility of the enterprise data layer or the personal data vault and is not managed by applications.
When data is self-describing, it will be interpreted in the same way by different applications. The evolution of the applications will not affect the meaning of the data. No application will be a bottleneck. Old applications can be phased out with little impact. New applications can come and use the existing data. Data models will be simpler, and less programming code will be needed. Changes to the data will be made once and reused by different applications. Data being self-describing also means that validations and business rules live in the data layer too, expressed in a unified way across all three dimensions: identity, structure and semantics. Applications don't own the data and don't store the results of the processing they provide. They visit the data, use it in conformance with the policies, and store the results back into the data layer. The policies (for access, usage, and so on), just like the validations and rules, are also part of the data layer and expressed in a unified way. Since applications will no longer have their own models or control access to the data, there won't be any application-induced fragmentation. Apps won't break data.
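As a last sketch, here is what "validations and business rules in the data layer" can look like with SHACL. It assumes the pyshacl package is installed; the shape, the rule and the data are hypothetical.

```python
from rdflib import Graph
from pyshacl import validate

# A business rule expressed as data, in the data layer: payment terms of a
# Customer must be an integer of at most 90 days.
shapes_ttl = """
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ont: <https://data.example.com/ontology/> .

ont:CustomerShape a sh:NodeShape ;
    sh:targetClass ont:Customer ;
    sh:property [
        sh:path ont:paymentTermsDays ;
        sh:datatype xsd:integer ;
        sh:maxInclusive 90 ;
    ] .
"""

# Data written back by some application, whichever one it happens to be.
data_ttl = """
@prefix ont: <https://data.example.com/ontology/> .
<https://data.example.com/id/customer/acme-gmbh>
    a ont:Customer ;
    ont:paymentTermsDays 120 .
"""

shapes_graph = Graph().parse(data=shapes_ttl, format="turtle")
data_graph = Graph().parse(data=data_ttl, format="turtle")

conforms, _, report = validate(data_graph, shacl_graph=shapes_graph)
print(conforms)  # False: the rule is enforced by the data layer, not by app code
print(report)
```

The rule travels with the data, so every application that visits the data is held to the same constraint.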
1. Atari, M., Xue, M. J., Park, P. S., Blasi, D., & Henrich, J. (2023). Which Humans? OSF. https://doi.org/10.31234/osf.io/5b26t
2. See this essay from 2013. The situation with Enterprise Architecture hasn't changed much since then.
3. To readers who have witnessed or implemented such changes, this is no news; but for the rest, here's an anecdote shared by Alan Freedman that gives an illustration:
Last year I requested that the name of an entity type be changed from "MessageType" to "EventSeverity" to more accurately represent the information it contains. The effort to rename the relational database table required two engineers, their manager, a program manager, an architect, QA and deployment engineers, six sprints, and around 30 distinct messages on a Jira ticket. It took four months.
I recommend Software Wasteland by Dave McComb to both groups of readers.
4. That is indeed how it was called in the original paper:
The difficulty of ascertaining the exact pitch of a particular thread, especially when it is not a multiple or submultiple of the common inch measure, occasions extreme embarrassment. This evil would be completely obviated by uniformity of system, the thread becoming constant for a given diameter.
5. When I shared the slides from an “Apps Break Data” talk on LinkedIn, there were a few comments from people who agreed with the problem described but had different proposals for solving it.
6. In the context of semantic web technologies, it is customary to hear statements like "explicit semantics" and "ontologies bring meaning to data". Strictly speaking, meaning is something that only humans and other cognitive living agents can create when interacting with their environment (including interacting with each other). The concepts of meaning and understanding cannot be applied to machine-to-machine communication, although attempts to do so abound. Yet, talking in terms of "explicit semantics" and "ontologies bring meaning to data" has utility if understood in the sense that these technologies bring interoperability between machines as if they can understand each other. But that is combining disambiguation, logic and standardization, and not turning machines into sense-making agents.