COSS: Orchestration & Observability

November 21, 2021

Earlier I wrote about two traditional types of open source businesses: operating systems and databases. Now let’s get into a more recent emergence: what I’ll call Orchestration & Observability tooling.

These are tools that sit on top of running systems — servers, operating systems, databases, containers, and APIs.

Some of these tools’ functionality is around orchestration: allowing you to scale up or down, load-balance or throttle traffic, and coordinate across multiple types of resources. Others provide observability: allowing you to see system health and alert on errors or performance regressions.

Their precise technical names are things like API gateway, reverse proxy, monitoring, provisioning, and mesh.

These are tools like HashiCorp Terraform, Sentry, Kong, Chef, Puppet, Ansible, Sysdig, Grafana, and of course Kubernetes. Their closed-source counterparts tend to be the expanded universe of APM, logging, tracing, and observability systems, including Splunk, New Relic, Datadog, Lightstep, and Honeycomb.

Value props of these tools include:

visibility into errors and performance regressions
alerting when critical states are approaching
ability to resolve outages faster
standardized, low-effort, low-error ways to provision and manage resources
traffic management and routing

A few observations:

Architectural shifts in the 2010s shook up this space a lot. Tools like Chef and Puppet, created in a pre-container, pre-public-cloud world, proved much less commercially viable afterwards. Perhaps we’ll see a similar shake-up in the 2020s as the serverless model slowly gains traction.
There’s some overlap with databases and operating systems. Are Grafana and InfluxDB observability tools or time series DBs? Perhaps both? Docker tried to move from containers to orchestration with Docker Swarm but was unsuccessful.
The biggest commercial successes in this space, so far, have been closed-source. Datadog alone is worth significantly more than all open source tools in this space combined. Why is that? A couple of possibilities:

Product expansion: APMs expand to suck in multiple sources of data and serve a whole engineering organization, while open-source tools often function more as point solutions for subsets of DevOps and infra engineers.

GTM persona/budget: there are more well-established budgets for APM, and they’re often sold to a more senior buyer persona.