Commercial Open Source: Databases

July 11, 2021

One of my theses about open-source companies is that there are a few defined categories, and companies belonging to a category tend to follow similar paths.

I intend to write a series of blog posts on this topic, and I’ll start with one of the oldest and most common categories of software, databases.

History

Commercial database companies are as old as Oracle, which was founded in 1977 to crystallize Edgar Codd’s research around relational algebra and shift the market from less-flexible hierarchical databases to the new relational databases.

After Oracle’s software beat out a nascent competitor known as Ingres, the leader of the Ingres team, UC Berkeley professor Michael Stonebraker, carried its lessons into a successor open-source project he called Postgres.

A few years later, as the web started to take off, a Swedish company built and open-sourced an OLTP (online transaction processing) database for the growing number of web applications. They called their database MySQL, and the company behind it became the first commercially successful open-source database company. After meteoric growth, MySQL was acquired in 2008 by Sun for $1B, then a groundbreaking sum.

As the number and variety of software applications have exploded over the last four decades, there has been a similar proliferation of ways to store data, depending primarily on access patterns.

Today, there are document databases (Couchbase, MongoDB) with flexible schemas; high-availability databases (Cassandra, DynamoDB) with basically zero failed writes; columnar databases (Snowflake, Redshift) optimized for running certain types of analytics queries; and a number of other categories. Some of these categories have tilted to open-source solutions; others tilt towards closed-source and hosted solutions.

Typically, solutions within a category compete with each other for users and customers; competition across categories tends to be rarer. The wide variety of categories & workflows means that there are at least ten well-funded (>$100M raised) independent open source companies currently in this space (Databricks, Redis Labs, DataStax, Confluent, Couchbase, Neo4j, MongoDB, Elastic, MariaDB, InfluxDB), plus a large number of companies at an earlier stage.

At a high level, these companies tend to share a number of dynamics from an open-source perspective.

Common Company Dynamics

Key value proposition: typically speed and scalability (with low error rates) of reads and writes for a given workload, as well as easy integration with relevant technology stacks.

Competitive dynamics: for a given category, there are often a couple of leading open source contenders and a couple of closed-source or hosted options: for example, Cassandra vs DynamoDB for high availability.

Defensibility of these businesses tends to revolve around a particular type of switching cost known as data gravity. Databases are at the center of applications. Once a business puts data in a particular database, it starts creating practices around it: backup and disaster recovery, schema migration and rollback procedures, app code specific to a particular database. These practices — vital for running any application at scale — tend to create lock-in; migrating a database from a running system is usually a multi-month process.

Pricing vectors tend to segment around compute (per-node) and storage (per-GB), which is perhaps unsurprising, since these factors are the main drivers of costs!

Common features of commercial open-source database offerings include availability on a specific cloud (AWS/GCP/Azure) and in a particular region within those clouds (like us-east-1), backups, managed upgrades, and data visualization tools.

Types of databases

The proliferation of databases is due to the wide variety of workloads (types of queries) that are run against them.

Different types of databases are optimized for different workloads; they differ in how they physically lay out data on disk, how they handle connections between instances, and so on.

Graph databases (Neo4j) add first-class support for object relationships within databases. In relational databases, object relationship information is stored as just a foreign key (e.g., an author_id on Post); graph databases add support for directionality, relationship history, and so on.
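
To make that concrete, here is a hypothetical relational schema (all names are illustrative): the first table stores a relationship as a bare foreign key, while the second promotes the relationship to its own table to capture direction and history, which is roughly what a graph database provides natively.

    -- Minimal users table for the sketch.
    CREATE TABLE users (id serial PRIMARY KEY);

    -- A relationship stored as a bare foreign key: who wrote this post?
    CREATE TABLE posts (
      id        serial PRIMARY KEY,
      author_id integer REFERENCES users(id)
    );

    -- To capture direction and history, a relational schema has to promote
    -- the relationship to its own table; graph databases model this natively.
    CREATE TABLE follows (
      follower_id integer REFERENCES users(id),  -- direction: who follows...
      followee_id integer REFERENCES users(id),  -- ...whom
      created_at  timestamptz DEFAULT now()      -- relationship history
    );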

While graph databases have a handful of passionate advocates who see them as the eventual successor to relational databases, current common uses tend to be clustered around real-time, highly relational use-cases like product recommendations and fraud detection.

Columnar databases (Snowflake, Redshift) physically lay data out on disk column by column, rather than row by row as is more common in other databases. This optimizes for the types of read queries a business analyst will run (SELECT time, amount FROM orders), rather than the type of read queries a web application will send (SELECT * FROM users WHERE id = 30421).
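
To sketch why the layout matters: the analyst query below touches two columns out of perhaps dozens, so a columnar store reads just those two columns from disk, while a row store would have to scan every full row (table and column names are illustrative).

    -- Monthly revenue: only the time and amount columns are read in a
    -- columnar store, instead of every row in its entirety.
    SELECT date_trunc('month', time) AS month,
           sum(amount)               AS revenue
    FROM   orders
    GROUP  BY 1
    ORDER  BY 1;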

High availability databases (Cassandra, DynamoDB) are optimized for availability and for the D (durability) in ACID: when you write to this database, come hell or high water, your write will succeed and won’t be lost. This is usually implemented by replicating each write across multiple redundant nodes, so the system can fail over if any single node has problems. This is good for things you really don’t want to fail, like financial transactions; in this way, high-availability databases are the successors to the OLTP database category.
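
As a minimal sketch of how that redundancy is configured, in Cassandra’s CQL (a SQL-like dialect; the keyspace name is illustrative):

    -- Keep three copies of every row, on three different nodes.
    CREATE KEYSPACE payments
      WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

    -- At QUORUM consistency, a write succeeds once 2 of the 3 replicas
    -- acknowledge it, so any single node can fail without losing the write.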

Time series databases (InfluxDB, Prometheus) tend to specialize in handling high-volume, highly-structured data describing system behavior — stock trading prices, sensor data, system operational metrics. An arbitrage trader might run queries on bid-ask spread volatility for a specific stock ticker on a millisecond basis; an APM system may want to display the trace of all calls executed by a specific HTTP request. These systems are usually used by highly-skilled operators viewing the recent past for debugging and anomaly detection purposes.

As a result, they ingest huge amounts of data and preserve it fully for only a few hours or days, keeping highly compressed aggregates and summary statistics afterwards. For a deeper dive, InfluxDB has a good overview.
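
For a flavor of the query language, here is a minimal sketch in InfluxQL (InfluxDB’s SQL-like dialect; the measurement and field names are illustrative):

    -- Average CPU usage over the last hour in 5-minute buckets; once the
    -- aggregates are stored, the raw per-second points can be discarded.
    SELECT mean(usage)
    FROM   cpu
    WHERE  time > now() - 1h
    GROUP  BY time(5m)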

Key/value stores (Couchbase, Redis) are optimized to quickly retrieve small pieces of unstructured information. Built on the data structure also known as a dictionary or a hash table, their theoretical retrieval time is O(1) rather than O(log n) (engineers with upcoming technical interviews take note!). Key/value stores tend to be quite common in consumer chat / social / gaming applications where fast retrieval is critical but occasional temporary inconsistency (who sent which message first?) is not the end of the world.
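
Reduced to its essence (a sketch in Postgres syntax with illustrative names, not how Redis is actually implemented), a key/value store is a single two-column table with a hash index, so that point lookups by key take O(1):

    -- One key column, one value column, nothing else.
    CREATE TABLE sessions (key text PRIMARY KEY, value jsonb);
    CREATE INDEX ON sessions USING hash (key);

    -- An O(1) point lookup by key.
    SELECT value FROM sessions WHERE key = 'session:4f2a';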

For more reading, Wikipedia has fairly exhaustive detail on these and other categories; I found Bradfield School’s database course syllabus quite insightful as well.

Recurring themes

Betting on a new use-case

Typically, the growth of a particular database and/or company corresponds with the growth of a particular development trend:

  • MySQL and, later, MongoDB and Cassandra/DataStax grew with the web.

  • Elastic grew with the rise of internal text corpora large enough to need search.

  • InfluxDB has grown with the rise of DevOps and IoT.

For every startup, focusing on a specific profile of target customer is key; for open source database startups, underlying trends tend to simplify this significantly.

Competing with Postgres

There’s some amount of convergence in databases, where most new kinds of databases, as they become popular, sooner or later have their key feature added to Postgres, either natively or as an extension.

For example, Postgres added support for hierarchical data (hierarchical databases) in 2009 with ltree, added support for unstructured data (document databases) in 2012 with the json type (and the indexable jsonb in 2014), added native support for partitioning in 2017 (key/value stores), and currently has a graph database extension (AGE) in an 0.x release.
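
As a sketch of the document-database case (table and field names are illustrative), Postgres can store and index schemaless JSON directly:

    -- Store schemaless documents in a relational table...
    CREATE TABLE events (
      id      serial PRIMARY KEY,
      payload jsonb
    );
    CREATE INDEX ON events USING gin (payload);

    -- ...and query inside them, MongoDB-style.
    SELECT payload->>'user_id'
    FROM   events
    WHERE  payload @> '{"type": "signup"}';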

Of course, the Postgres version of key features is usually less rich than an entire database dedicated to a specific use-case.

Unbundling legacy incumbents

Unlike emerging commercial open source categories, databases have a ton of legacy incumbents. Data gravity keeps these companies alive long after their technologies have passed their sell-by date.

This includes the big three legacy software vendors — Oracle, SAP, and IBM, for each of whom databases are a substantial line of business — as well as smaller legacy pure-play vendors like Teradata and (sorry) Cloudera.

Unhappy customers of these companies tend to be interesting potential customers of open-source companies since they already have a budget.

Dancing with the public clouds

When Elastic IPO-ed in late 2018, they had $150M of annual revenue; at the same time, AWS was (reportedly) doing 2-3x as much business hosting Elastic clusters.

The public clouds (AWS/GCP/Azure) are the default way to host databases, and their presence strongly informs the roadmaps of open source companies.

Features like availability on AWS/GCP/Azure in the specific availability zone the rest of your infra lives in, integration with your AWS/GCP/Azure org and identity management, and so on tend to be key to commercial open source database solutions.

And then there are the licensing spats with AWS: Mongo switched to an alternate license, then AWS launched DocumentDB; Elastic made changes to its license, so AWS forked Elasticsearch, and so on.

It’s interesting to note that these kinds of spats don’t tend to be generic to all open source companies, but rather concentrated among open source database companies, who charge along the same pricing vectors as AWS — for compute and storage.



Written by Sam Bhagwat, cofounder & chief strategy officer at Gatsby; programmer, analyst, writer; follow me on Twitter!