Enemy of the State

Nowadays, I spend a lot of time thinking about system design and architecture. I spoke last year at PyParis about the lessons we learned building Salient and how to get some control of a large and complex system.

Of course, a system like Salient is always evolving, usually as a function of requirements and customer feedback (someone once used the analogy of “changing the engine of an airplane in mid-flight”). In this post, however, I’m going to focus on a different kind of insight or requirement, one that doesn’t come directly from the customer but rather, indirectly, from challenging our own understanding of software and how it should be designed and managed.

As a former theoretical physicist I’m very familiar with the idea that abstract thinking, and particularly the right abstractions, can have a huge impact in solving practical problems. In my own career I witnessed more than once how, by recasting and reformulating the same problem with new and better abstractions, we were able to achieve hundred-fold improvements in performance and accuracy.

So I was very pleasantly surprised when I discovered a paper espousing a similarly powerful set of abstractions in the much more general setting of software design. I learned about “Out of the Tar Pit” in this very beautiful talk by Rich Hickey (which warrants an entire blog post of its own) and have since been slowly digesting it and working through its implications for our software.

State and Control

The central point of the paper is that much of the complexity of large software systems is potentially avoidable (at least it is not theoretically essential) and has its origins in the handling of state and control.

While the problem of state is something many of us have thought a lot about, I was less familiar with the notion of control and particularly that it might not be essential.

Software is generally defined in terms of a series of operations that happen in a certain order to generate a result. The example the authors give is very illustrative:

a = b + 3
c = d + 2
e = f/4

While the above lines might appear in a body of code, it’s very clear that we are imposing an order on them unnaturally: none of the three statements depends on the others. I’m so used to this that it did not occur to me that this is really an artifact of the implementation rather than an essential feature of the problem.
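To make the point concrete, here is a minimal Python sketch (mine, not the paper's) that states the same three relationships as rules and evaluates them in every possible order. The input values for b, d and f and the names `rules` and `evaluate` are assumptions made up for illustration.

```python
from itertools import permutations

# Hypothetical input values; the original example leaves b, d and f unspecified.
inputs = {"b": 1, "d": 2, "f": 8}

# Each output is declared as a function of the inputs. Nothing here says
# which rule has to run first.
rules = {
    "a": lambda env: env["b"] + 3,
    "c": lambda env: env["d"] + 2,
    "e": lambda env: env["f"] / 4,
}


def evaluate(order):
    """Apply the rules in the given order and return the computed outputs."""
    return {name: rules[name](inputs) for name in order}


# Try all six possible orderings of the three rules.
results = [evaluate(order) for order in permutations(rules)]

# Every ordering produces the same outputs, so the ordering carries no meaning.
all_equal = all(result == results[0] for result in results)
```

The order in which the statements appear in a source file is one arbitrary choice among six equally valid ones, which is exactly the sense in which it is accidental.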

Programming paradigms like declarative programming challenge this. When we write an SQL statement or specify a pod spec in Kubernetes we are not specifying a set of operations but rather the desired business logic. This cleanly separates the logical constraints (e.g. relationships in a data model) or system specification from the flow of control required to achieve them (which is generally not at all unique). Think of the distinction between simply declaring that a field is a Foreign Key versus the code required to enforce this. When thought of in these terms it’s rather shocking that we systematically conflate these two aspects of system design in standard software development.
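The Foreign Key distinction can be sketched with Python's built-in sqlite3 module. The table names (`documents`, `annotations`) and both helper functions are hypothetical, chosen only for illustration. The first helper relies on the declared constraint, the second spells out the control flow by hand over plain Python collections.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE annotations ("
    " id INTEGER PRIMARY KEY,"
    " document_id INTEGER NOT NULL REFERENCES documents(id))"
)
conn.execute("INSERT INTO documents (id) VALUES (1)")


def add_annotation_declarative(document_id):
    """The rule is declared once in the schema; the database enforces it."""
    try:
        conn.execute(
            "INSERT INTO annotations (document_id) VALUES (?)", (document_id,)
        )
        return True
    except sqlite3.IntegrityError:
        return False


def add_annotation_imperative(document_ids, annotations, document_id):
    """The same rule written as explicit control flow that every caller must use."""
    if document_id not in document_ids:
        return False
    annotations.append(document_id)
    return True


ok = add_annotation_declarative(1)         # document 1 exists
rejected = add_annotation_declarative(99)  # no such document
```

In the declarative version the constraint lives in one place and holds no matter which code path writes to the table. In the imperative version it holds only as long as every writer remembers to perform the check.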

Accidental and Essential

The second dimension used in their deconstruction of the software development process is the distinction between essential and accidental complexity.  Essential complexity is, in their language, complexity that is innate to the specification of the problem.  This is to be distinguished from accidental complexity which emerges by thinking about the problem in terms of a specific implementation or paradigm.

In some sense this leads us to a very user-centric view where only user-specified aspects of the problem become truly essential.  For instance nothing about the choice of programming language, database, operating system, hardware, caching mechanism, etc. can be characterized as essential.  All these are subject to change and critical review.

By imagining a fictitious ideal world with no resource constraints they arrive at the following perhaps surprising observations:

  1. The only essential state in any system is the state provided as inputs by the users (and only if the requirements imply that this state must be available later on).
  2. There is no essential control in the system.  All control is a source of accidental complexity.  This is tantamount to the statement that the entire system can be specified in declarative terms (users don’t care how you enforce the business rules).

Reflecting on the above, we are forced into an embarrassing realization.  Very little of the state we usually associate with a system (the database, the caches, the status of various parts of the system, etc.) is user-specified.  Almost all of it is implementation detail and hence a source of “accidental” rather than “essential” complexity.  You might object that without those things the system cannot reasonably be imagined to function, but the point is not to pretend that such accidental complexity is absent. It is to carefully distinguish essential from accidental complexity and to make this distinction the overriding principle in designing the system.

So far I’ve only attempted to heavily paraphrase some parts of the problem statement of the paper.  The ideas in this paper, and their implications for our own (as well as others’) software design, can easily fuel several blog posts.  But before stopping I want to spend a moment making the above discussion more concrete, lest the reader be left with the impression that this is a purely academic exercise.

In a complex system like Salient we have a large amount of both state and control.  Documents are uploaded into the system, users annotate and interact with them, machine learning systems analyze them extracting entities, labeling types of sentences, users train the machine learning systems both directly and indirectly, etc.  These interactions are all orchestrated by processes invoked directly by users or by background tasks or even chains of tasks.  The flow of data and control logic spans multiple software components and generally runs in a distributed system composed of several physical servers.

At first glance this all sounds like an essential aspect of a system of this scale and complexity (and part of what makes it non-trivial to manage).  But by stepping back and asking ourselves some simple questions we can unravel a lot of this complexity (with the goal of eventually putting it back together in a much smarter way).

  1. What is the absolute minimum amount of data required to rebuild the system from scratch?  Thinking about this question helps a lot: it turns out that the answer to this question corresponds to a very small subset of the data actually stored and used in the system.  The rest is really “accidental state”.
  2. What drives the flow of information and processes in the system and is it canonical (i.e. the only way to do it) or just accidental?  Can we move away from user- or agent-driven processes and towards a more declarative structure, where the software robustly tries to achieve a certain state derived from the above minimal state and a set of business rules?
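The two questions above can be sketched in a few lines of Python. This is a toy illustration under my own assumptions, not how Salient is implemented: `EssentialState`, `derive` and `reconcile` are hypothetical names, and whitespace tokenization stands in for real processing and machine learning output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EssentialState:
    """Only what users gave us: uploaded text and their annotations."""
    documents: dict    # document id -> uploaded text
    annotations: dict  # document id -> list of user-supplied labels


def derive(essential):
    """Pure function of the essential state; its output can always be rebuilt."""
    return {
        doc_id: {
            "tokens": text.split(),  # stand-in for real processing output
            "labels": sorted(essential.annotations.get(doc_id, [])),
        }
        for doc_id, text in essential.documents.items()
    }


def reconcile(essential, current):
    """Compare the stored derived state with the desired one and list the fixes."""
    desired = derive(essential)
    actions = []
    for doc_id, value in desired.items():
        if current.get(doc_id) != value:
            actions.append(("rebuild", doc_id))
    for doc_id in current:
        if doc_id not in desired:
            actions.append(("drop", doc_id))
    return desired, actions


essential = EssentialState(
    documents={"d1": "hello tar pit"},
    annotations={"d1": ["paper", "design"]},
)
stale = {"d0": {"tokens": ["old"], "labels": []}}  # leftover accidental state
derived, actions = reconcile(essential, stale)
```

Here the answer to the first question is the content of `EssentialState`, and everything else can be discarded and recomputed. The answer to the second is `reconcile`: instead of a chain of tasks triggered by users, the system repeatedly compares what it has with what the essential state and the rules imply, and closes the gap.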

Thinking about Salient in these terms has helped drive some very nice design decisions on our part.

For instance an object-oriented way of thinking (something criticized heavily in the paper) would encourage us to encapsulate all data related to a document in one place, both the data uploaded by the user (the original document and annotations about it) as well as a huge amount of derived data (various processing and machine learning outputs, etc).  This tends to lead to a rather complex notion of state where essential state information about a document might be distributed over many data stores (for performance and other reasons) leading to a complex system of synchronization.  This is a good example of accidental complexity that feeds not only into the software but also into infrastructure and even system management processes.  Rethinking this with an eye towards accidental versus essential state lets us design a much cleaner version of the system.

Even in aspects of the system that have been built with a very careful separation of concerns, we are finding that using the above language helps us clarify and refine the design, mitigating potential flaws down the road.

In follow-up blog posts I hope to dig more into this paper (and related ideas that we’ve been exploring at Lore) and also into more specific examples of design patterns it has inspired.