Have you ever looked at your toolbox and realized you have three different types of hammers, and yet somehow, none of them feel like the right one for the specific nail you are looking at? You have the heavy sledge, the little finishing hammer, and the standard claw hammer, but you are trying to tap a tiny brass pin into a piece of delicate crown molding, and all of them feel like instruments of destruction rather than precision. I feel like that is the state of software engineering right now, but on a massive, global scale. We are surrounded by tools, yet we are constantly feeling like we need something just a little bit more specific.
It is exactly like that, Corn. Except in the world of data, we do not just have three hammers. We have about one thousand of them. I am Herman Poppleberry, by the way, and we are diving into a very deep rabbit hole today. Our housemate Daniel sent us a link to this project called the Database of Databases, which is maintained by the Carnegie Mellon Database Group. It is this massive, living catalog that tracks every digital storage system they can find. And the numbers are just staggering. When you look at the raw data, it is a testament to human ingenuity and, perhaps, a little bit of our collective madness.
Yeah, Daniel was telling me about this over coffee this morning. One thousand systems. That is the count. And about four hundred of them are actively maintained as of right now, here in March of two thousand twenty-six. When you think about the fact that most people can probably name maybe five or six databases, like PostgreSQL, MySQL, Oracle, maybe MongoDB or SQLite, that gap between five and one thousand is just wild. It feels like a Cambrian explosion of data storage. We went from a few massive, monolithic beasts in the nineteen eighties and nineties to this ecosystem where new species are appearing every single week.
It really is. And you know, we have lived in this era for a long time where the advice was just use Postgres for everything. And honestly, for eighty percent of use cases, that is still great advice. Postgres is the reliable workhorse. But what we are seeing in this long tail of four hundred active systems is a rejection of the one size fits all philosophy. We are moving into an era of extreme specialization. It is not just about storing rows and columns anymore. It is about how those bits are physically laid out on the silicon, how they move across the network, and how they interact with the specific mathematical queries we are throwing at them.
So that is the question we are tackling today. Why? Why do we need a thousand ways to store bits? Is this just engineers being bored and wanting to build their own shiny new toy, or is there a fundamental technical necessity driving this fragmentation? I want to look at the long tail, the weird stuff, and the people who are actually running these niche systems in production. Because if four hundred of these systems are actively maintained, someone must be paying for them, and someone must be relying on them to keep the lights on.
I love this topic because it forces us to look at the intersection of hardware, mathematics, and business needs. You cannot explain the existence of a thousand databases without looking at how hardware has changed. Back in the nineties, everything was about spinning rust. Hard drives were slow, and you had to optimize for the physical movement of a needle on a disk. You had to minimize seeks because moving that arm took forever in computer time. Now, we have NVMe drives that are essentially as fast as memory, we have GPUs doing data processing with thousands of cores, and we have massive distributed clouds where the network itself is the bottleneck. The old rules do not apply the same way. When the hardware changes, the software has to be rewritten from the ground up to take advantage of it.
Right, and before we get too deep into the hardware, I think we have to talk about the theoretical framework that governs all of this. Most of our listeners are probably familiar with the CAP theorem, which was the big talking point ten or fifteen years ago. It says you can only have two out of three: Consistency, Availability, and Partition tolerance. It was the reason why we saw the first big split between relational databases and NoSQL. But Herman, you were telling me the other day that CAP is kind of old news and that people are looking at PACELC now.
CAP is a bit of a blunt instrument. It only tells you what happens when things go wrong, specifically during a network partition. But things are not broken most of the time. PACELC, which was proposed by Daniel Abadi, is a much better way to understand why we have so many specialized databases. The acronym stands for: if there is a Partition, how do you trade off Availability versus Consistency? Else, meaning when the system is running normally, how do you trade off Latency versus Consistency?
That second part, the E-L-C, seems like where the real innovation is happening. Because most of the time, your network is not partitioned. Most of the time, things are working. So the question is, how much speed are you willing to give up to make sure every single user sees the exact same data at the exact same millisecond? If I post a comment on a photo, does my friend in Tokyo need to see it in ten milliseconds, or is it okay if it takes a full second?
Right. If you are a high frequency trading firm, you might prioritize latency above everything else. You need that data now, and you can handle a little bit of inconsistency if it means you are five milliseconds faster than the competition. But if you are a bank, you cannot have that. You need consistency. You cannot have two people withdrawing the same hundred dollars from an ATM at the same time. So you build a database that is mathematically tuned for one specific corner of that PACELC square. And once you start tuning for these extremes, a general purpose tool like Postgres starts to look like a compromise that satisfies no one at the high end. It is the cost of abstraction. When you build something to do everything, you cannot optimize the low level path for any one thing.
I was reading about specialized vector databases recently, things like Milvus or Pinecone. With the AI boom we have seen over the last few years, these have exploded. And people always ask, why can I not just store my vectors in a regular relational database? I mean, a vector is just a list of numbers, right? Why can I not just put that in a column?
And the answer is usually about the cache lines and the CPU instructions. If you are doing a similarity search across ten million high dimensional vectors, a standard B-tree index, which is what Postgres uses, is basically useless. A B-tree is designed to find a specific value or a range of values in a one dimensional space. But vectors live in hundreds or thousands of dimensions. You need specialized data structures like HNSW, which stands for Hierarchical Navigable Small Worlds. These are graphs, not trees. They are designed to be traversed in a way that minimizes memory jumps and stays within the CPU cache as much as possible. When you build a database specifically for that, you are not just ten percent faster. You can be one hundred times faster. That is the difference between a search taking three seconds and taking thirty milliseconds.
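To make that concrete, here is a toy brute-force nearest-neighbor scan in plain Python. This is the baseline an HNSW index exists to beat: a linear pass over every vector, which is exactly what a relational engine ends up doing without a specialized index. The data and function names are just illustrative, not from any real system.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def brute_force_search(query, vectors, k=1):
    # O(n * d) linear scan over every stored vector -- fine for a
    # thousand rows, hopeless for ten million. HNSW replaces this
    # with a graph walk that touches only a tiny fraction of the data.
    scored = sorted(enumerate(vectors),
                    key=lambda iv: cosine(query, iv[1]),
                    reverse=True)
    return [i for i, _ in scored[:k]]

vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(brute_force_search([0.9, 0.1], vectors, k=2))  # [0, 2]
```

The point of the sketch is the cost model: the scan is linear in the collection size, while an HNSW graph gives you roughly logarithmic search at the price of a more complex, cache-conscious data structure.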
That is a huge jump. And that brings us to the hardware software co-design aspect. We have talked about this before in different contexts, but in databases, it is becoming the primary differentiator. Look at something like DuckDB. It has become incredibly popular lately. It is an analytical database, but it is designed to run inside the process of your application, like SQLite. It is not a separate server you talk to over a network.
DuckDB is a great example of the long tail finding a massive niche. It is a columnar store. Most traditional databases are row stores, meaning they store all the data for one record together. Columnar stores store all the ages together, all the names together, all the prices together. If you are asking a question like, what is the average price of every item sold in March, a columnar store only has to read the price column. It ignores the names, the addresses, and the descriptions. It is incredibly efficient for analytics.
And because it is designed for modern CPUs, it uses vectorization. It processes chunks of data using SIMD instructions, which stands for Single Instruction, Multiple Data. It is basically doing math on entire arrays of data at once instead of looping through them one by one. You cannot easily bolt that kind of performance onto a legacy engine that was designed in the nineteen eighties for single core processors and slow disks. The entire architecture, from how it manages memory to how it schedules threads, has to be different.
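A minimal sketch of the row-versus-column layout idea, with made-up data. Real engines like DuckDB store compressed column chunks and run the aggregate with SIMD, but the access-pattern difference is the same one shown here.

```python
# Row store: each record is kept together, so an aggregate over one
# field still drags every other field through memory.
rows = [
    {"name": "anvil", "price": 12.0, "qty": 3},
    {"name": "rope",  "price": 4.0,  "qty": 10},
    {"name": "tnt",   "price": 20.0, "qty": 1},
]

# Column store: one contiguous array per column. An aggregate reads
# only the column it needs -- and a flat numeric array is also the
# layout SIMD instructions want to chew through.
columns = {
    "name":  ["anvil", "rope", "tnt"],
    "price": [12.0, 4.0, 20.0],
    "qty":   [3, 10, 1],
}

avg_price = sum(columns["price"]) / len(columns["price"])
print(avg_price)  # 12.0
```

In the columnar version, the names and quantities never leave storage for this query; that is where the analytics speedup comes from, before vectorization even enters the picture.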
It is also about the storage layer itself. Think about LSM trees versus B-trees. This is one of those classic database nerd debates that actually has huge real world consequences. B-trees, which are used by most relational databases, are great for reading data. They keep things in a very organized structure. But every time you write something, you might have to reorganize a bunch of pages on the disk. LSM trees, or Log Structured Merge-trees, are what things like Cassandra or RocksDB use. They are optimized for writes. They just append data to the end of a file and then merge things in the background later.
So if you are Google and you are logging trillions of events per second from every Android phone on the planet, you are going to pick an LSM tree based database every single time. You do not care if the reads are a little more complex because your write volume would melt a traditional B-tree database. This is why we have the long tail. Every one of these active four hundred systems is a specific answer to a specific set of constraints. It is about choosing which pain you are willing to live with.
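The LSM write path can be sketched in a few lines. This toy keeps writes in an in-memory table and flushes it to an immutable sorted segment when it fills up, which is the core move behind RocksDB or Cassandra; the class name, the tiny flush threshold, and the linear segment scan are simplifications for illustration (real systems use write-ahead logs, bloom filters, and background compaction).

```python
class TinyLSM:
    """Toy log-structured merge store: writes land in a memtable,
    which is flushed to an immutable sorted segment when full.
    Reads check the memtable first, then segments newest-first."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.segments = []  # list of sorted (key, value) lists
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # Writes never reorganize existing data on "disk" -- that is
        # why LSM engines absorb huge write volumes.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):  # newest segment wins
            for k, v in segment:
                if k == key:
                    return v
        return None

db = TinyLSM()
db.put("a", 1); db.put("b", 2)   # second put triggers a flush
db.put("a", 99)                  # newer value shadows the flushed one
print(db.get("a"), db.get("b"))  # 99 2
```

Notice where the pain moved: `put` is trivially cheap, but `get` may have to consult several segments. Background compaction exists precisely to merge those segments back down so reads do not degrade forever.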
And it is not just performance. Sometimes it is about the data model itself. We did that episode a while back, episode four hundred ninety-two, about moving beyond the folder and looking at graph based systems. In a traditional database, if you want to find a friend of a friend of a friend, you have to do these massive, expensive joins. It is like looking through five different filing cabinets to find one connection. It works, but it is slow and the code looks like a nightmare.
Right, whereas a graph database like Neo4j treats the connection as a first class citizen. The pointer is already there. You just follow the link. It is the difference between searching a map and following a physical string tied between two points. If your business is social networking or fraud detection, where the relationship between data points is more important than the data points themselves, a relational database is literally the wrong tool. It is not just slower; it is conceptually the wrong way to think about the problem.
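The friend-of-a-friend query is just a breadth-first walk when the connections are stored as pointers. A hypothetical sketch, with an invented social graph, to show what "following the string" looks like instead of a chain of self-joins:

```python
from collections import deque

def friends_within(graph, start, depth):
    # Breadth-first traversal: follow stored adjacency pointers
    # instead of joining a friendships table against itself N times.
    seen, frontier = {start}, deque([(start, 0)])
    found = set()
    while frontier:
        node, d = frontier.popleft()
        if d == depth:
            continue  # do not expand past the requested hop count
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                found.add(neighbor)
                frontier.append((neighbor, d + 1))
    return found

graph = {"ann": ["bob"], "bob": ["cat", "dee"], "dee": ["eve"]}
print(sorted(friends_within(graph, "ann", 3)))  # ['bob', 'cat', 'dee', 'eve']
```

Each hop costs time proportional to the edges actually touched, not to the size of the whole table, which is why a native graph engine wins as the hop count grows.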
But here is the thing that fascinates me, Corn. Who is actually using all of these? Because it is easy to say, oh, Google and Meta need this stuff. They have the scale where a one percent efficiency gain saves them ten million dollars in electricity. But there are one thousand databases. Meta is not using a thousand databases. There has to be a broader user base for this long tail.
The user base for the long tail is much more diverse than people realize. It is not just the tech giants. Think about edge computing. If you are running a sensor on a wind turbine in the middle of the North Sea, you cannot send every bit of data to a cloud Postgres instance. The satellite link is too slow and too expensive. You need a specialized time series database that can run on a tiny ARM processor, compress data by ninety percent using specialized algorithms like Gorilla compression, and handle intermittent connectivity. That is a niche, but it is a niche that keeps the power grid running.
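The intuition behind that kind of time-series compression can be shown with delta-of-delta encoding on timestamps. Gorilla's actual format bit-packs XORed floats and variable-length deltas; this sketch only demonstrates the core trick, that regularly sampled sensor data collapses into runs of zeros, which then compress to almost nothing.

```python
def delta_of_delta(timestamps):
    # First-order deltas between consecutive samples...
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    # ...then second-order deltas. A sensor sampling on a fixed
    # interval produces mostly zeros here, which bit-packing
    # schemes like Gorilla's shrink to one or two bits each.
    return [deltas[0]] + [b - a for a, b in zip(deltas, deltas[1:])]

ts = [1000, 1010, 1020, 1030, 1041]  # one sample arrived late
print(delta_of_delta(ts))            # [10, 0, 0, 1]
```

Four timestamps worth of information reduced to a ten, two zeros, and a one: that is how a turbine in the North Sea fits months of readings on a tiny flash chip and survives a flaky satellite link.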
Or look at high frequency trading. They use things like kdb plus. That is a database that is so specialized it has its own programming language called Q. It is designed to process millions of market events per second with sub-millisecond latency. The people using that are not web developers. They are quantitative analysts and systems architects who are fighting for nanoseconds. They do not care about SQL compatibility or easy-to-use web interfaces. They care about how many clock cycles it takes to calculate a moving average.
And then you have the whole world of embedded systems. Your car has databases in it. Your smart fridge has a database in it. Your medical devices have databases in them. These are not general purpose systems. They are often incredibly lean, read-only or write-once systems that are burned into firmware. They have to be bug-free, because you cannot exactly push a patch to a pacemaker. When you look at that list of one thousand, a lot of them are these invisible workhorses that power the modern world without anyone ever knowing their names.
There is also this concept of polyglot persistence. This is a term that gained traction a few years ago, but it is really the reality for any serious startup today. You do not just have one database anymore. You have a stack. Let's look at a hypothetical AI startup in two thousand twenty-six. They might use Postgres for their user accounts and billing because they need that rock solid ACID compliance. They need to know for a fact that if a user pays for a subscription, that record is never lost.
But then they use Redis for their real time leaderboard or their session management because it is all in-memory and lightning fast. Then they use Elasticsearch for their documentation search bar, and they use a vector store like Pinecone for their actual AI features, like finding similar images or documents. And maybe they use ClickHouse for their internal analytics to see how people are using the app. That is five different databases for one single application.
It sounds like a dream for a developer who loves new tools, but it sounds like an absolute nightmare for the person who has to keep it all running. We have to talk about the operational tax of this fragmentation. If I am a small team and I am running five different database systems, I now have five different backup strategies to manage. I have five different security patching cycles. I have five different ways that things can fail at three in the morning. If the Redis cache goes down, the app might just be slow. If the Postgres database goes down, the business stops. Understanding those failure modes for five different systems is a huge cognitive load.
That is the hidden cost of the long tail. We often talk about the technical benefits, the speed, the scale, the cool features. But we forget the human cost of complexity. Every time you add a specialized database to your stack, you are taking on technical debt in the form of operational overhead. You need someone who actually understands the internals of that specific system. If your specialized graph database starts acting up and your only expert is on vacation, you are in deep trouble. You cannot just Google the answer for a niche database the same way you can for Postgres.
It is the classic trade-off. Do you want a system that is ninety percent efficient but easy to manage, or a system that is ninety-nine percent efficient but requires a PhD to tune? I think a lot of companies are realizing that Postgres is often good enough because the ecosystem is so huge. You can find a million people who know how to optimize a Postgres query. You might only find ten people who know how to optimize a niche temporal database. The labor market itself is a factor in database choice.
And Postgres has been very clever about this. They have this system of extensions. Instead of forcing people to switch to a new database, they allow people to build things like TimescaleDB for time series data or pgvector for AI, all inside the Postgres ecosystem. It is an attempt to stop the fragmentation by being a platform rather than just a database. It is the "good enough" solution that keeps people from wandering into the long tail.
But even with that, there is a limit. You can only bolt so many things onto a foundation before the foundation starts to crack. If you are doing massive scale graph traversals, an extension is never going to beat a native graph engine that was built from the ground up for that specific purpose. The memory layout is just fundamentally different. At some point, the overhead of being "general purpose" becomes a literal physical barrier to performance.
I think we are also seeing a shift in how people evaluate these tools. In the past, you would pick a database and stick with it for a decade. You would buy the Oracle license and that was your life. Now, with the rise of serverless and managed services, the cost of trying a new database is much lower. If I can spin up a managed instance of a niche vector database in two minutes and pay by the hour, the barrier to entry for the long tail is almost zero. I do not have to learn how to install it, how to configure the Linux kernel for it, or how to set up replication. The cloud provider does that for me.
That is a great point. The cloud has acted as an accelerant for this fragmentation. If I had to buy a physical server and install a weird database myself, I would probably just stick with what I know. But when it is just an API call away, why not use the specialized tool? This is why we see four hundred active systems. There is enough of a market for even a very niche tool to survive if it can be easily deployed in the cloud. It has created a long tail economy for software.
It also changes the way we think about data sovereignty and privacy. Some of these niche databases are built specifically for zero trust environments. They encrypt every single bit of data at the storage level in a way that the database administrator cannot even see. They use things like homomorphic encryption or secure enclaves. That is a specialized use case that a general purpose tool might not prioritize, but for a healthcare company or a government agency, that is the most important feature. They will take a fifty percent performance hit if it means the data is mathematically impossible to leak.
So, looking at this list of one thousand, do you think we are at peak fragmentation? Or are we going to see two thousand databases in five years?
That is the big question. There is a theory in tech called the great re-bundling. We go through these cycles where everything fragments into specialized tools because the old tools cannot keep up, and then someone comes along and bundles them all back together into a single platform that is "good enough" at everything. We are seeing some of that with things like Snowflake or Databricks, which are trying to be the one place for all your data, whether it is relational, unstructured, or AI-driven. They want to be the one database to rule them all.
But I don't know, Herman. As long as hardware keeps evolving, we are going to keep seeing new databases. Think about the move toward CXL, which is Compute Express Link. It allows CPUs to share memory with other devices at incredible speeds. That is going to require a whole new way of thinking about how databases manage their buffers and their caches. The distinction between "local memory" and "remote storage" is blurring. Someone is going to build a database specifically for CXL, and that will be number one thousand and one.
You are right. And then you have quantum computing on the horizon. I know it is a bit of a buzzword, but the way you store and query data in a quantum system is fundamentally different from anything we have done in the last seventy years. You are not dealing with bits; you are dealing with probabilities and superposition. We are going to need a quantum database of databases eventually. The "Database of Databases" project is never going to be finished. It is a map of a territory that is constantly expanding.
It really highlights the point we discussed in episode eight hundred sixteen, about the evolution of human order. We have this inherent obsession with organizing things, from the first scrolls in Alexandria to the first SQL tables at IBM. But the more data we create, the more complex our filing systems have to become. We went from clay tablets to scrolls to SQL, and now we are in this era of extreme fragmentation. It is just the latest chapter in that story of us trying to make sense of the noise.
And it is also about survival. We talked about this in episode one thousand thirty-two, regarding ancient backups and how history survived the delete command. The reason we have so many databases is that we have realized that different types of data have different survival needs. Some data needs to be fast and ephemeral, like a cache that you can lose without much worry. Some needs to be immutable and permanent, like a blockchain or a legal ledger. You cannot treat those the same way if you want them to survive the test of time and hardware failure.
So, for the listeners who are sitting there looking at their own tech stacks and feeling overwhelmed by this sea of four hundred active databases, what is the practical takeaway? How do you navigate this without losing your mind or your budget?
My first piece of advice is: do not optimize prematurely. This is the golden rule of engineering. If you are starting a new project, ninety-nine percent of the time, the boring, reliable tech is the right choice. Start with Postgres or SQLite. They are the gold standards for a reason. They have the best documentation, the best community support, and the most predictable failure modes. You should only look into the long tail when you hit a wall that the general purpose tools cannot climb. If your Postgres queries are taking ten seconds and you have already tuned the indexes and the memory, then, and only then, should you look at a specialized engine.
That makes sense. It is like that rule in engineering: do not build a custom solution unless the off-the-shelf one is costing you significant time or money. But when you do hit that wall, how do you vet these niche systems? How do you know if a database from the long tail is actually going to be supported in two years, or if it is just a graduate student's thesis project that will be abandoned next semester?
That is where the Database of Databases project is actually very useful. You can look at the activity levels. Is it being maintained? Who is behind it? Is it a major university like Carnegie Mellon, or a company with a solid business model? You want to look at the operational maturity. Does it have a clear story for backups? Does it have monitoring hooks? Does it have a way to scale horizontally? If the documentation is just a README file on GitHub and the last commit was three years ago, stay away, no matter how cool the tech is. You are looking for a tool, not a research project.
And I would add, evaluate the maintenance cost as a first class metric. When you are doing an architecture review, do not just look at the query latency. Look at the "human latency." How many hours a week is your team going to spend managing this thing? How hard is it to hire someone who knows it? If a specialized database saves you fifty milliseconds but costs you ten hours of engineering time a week in maintenance, that is a bad trade for almost every company. Time is more expensive than CPU cycles.
The goal of technology is to serve the business, not the other way around. But at the same time, we have to appreciate the innovation. The reason we have these amazing tools today is because someone, somewhere, decided that the existing databases were not good enough and decided to build something weird. They decided to challenge the status quo.
It is a beautiful kind of chaos. It is the free market of ideas applied to data structures. Some of these systems will fail, and they will become the graveyard of the database world, but the lessons learned from them will be folded into the next generation of tools. It is an iterative process of understanding how to handle the sheer volume of information our civilization is producing. We are learning how to build better brains for our digital world.
It is a great time to be a data nerd, even if it is a bit confusing. I think we are going to see a lot more of these specialized engines as we move toward more edge computing and more integrated AI. The line between an application and a database is even starting to blur. Some of these niche systems are so tightly integrated into the app that you cannot really tell where one ends and the other begins. They are becoming more like "data-aware runtimes" than traditional databases.
Like we saw in that graph based OS episode, the idea of the database as a separate entity might eventually go away. We might just have data-aware applications that manage their own persistence in highly specialized ways, using the long tail of storage engines as a library of options.
Well, this has been a fascinating look into the long tail. I feel a lot better about my three hammers now, knowing there are a thousand other ones out there if I ever need them. It is not about having the most tools; it is about knowing which one to pick when the stakes are high.
Just do not try to use a sledgehammer to hang a picture frame, Corn. That is how you end up with a hole in the wall and a corrupted database.
Fair point. Before we wrap up, I want to remind everyone that if you are finding these deep dives helpful, please leave us a review on your favorite podcast app. Whether it is Spotify or Apple Podcasts, those ratings really help more people find the show and join our weird little community. We really appreciate the support.
Yeah, it genuinely makes a huge difference for us. And if you want to stay updated, head over to myweirdprompts dot com. You can find our RSS feed there, and you can also find us on Telegram by searching for My Weird Prompts. We post every time a new episode drops, and we often share links to the projects we discuss, like the Database of Databases.
And a quick shout out to Daniel for sending this one in. It definitely gave us a lot to chew on. I think I am going to go spend the rest of the afternoon looking at some of the more obscure entries in that catalog. There is one that claims to be a database based on DNA storage principles, which sounds like a whole other episode.
DNA storage? Now that is a long tail. I am in. Let's do it.
All right, everyone. Thanks for listening to My Weird Prompts. We will be back next time with another deep dive into whatever strange corner of the world Daniel sends our way.
Until then, keep questioning the defaults and maybe try a new hammer every once in a while. This has been My Weird Prompts.
Take care, everyone. See you in the next one.