#2476: Database Backups Without the Bloat

pg_dump, WAL archiving, and the free tools that beat expensive commercial backup software.

Episode Details

Episode ID: MWP-2634
Published:
Duration: 23:51
Audio: Direct link
Pipeline: V5
TTS Engine: chatterbox-regular
Script Writing Agent: deepseek-v4-pro

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

The Free Foundation of Database Backups

Database backups don’t require expensive commercial software. The core tools — pg_dump, pg_basebackup, WAL archiving, sqlite3, and sqlcmd — have been free for decades, and most are open-source. What’s changed is accessibility: AI coding agents now let anyone describe a backup strategy in plain language and get production-ready scripts back.

The biggest misconception about Postgres backups involves the write-ahead log (WAL). The WAL is not a backup — it’s a sequential record of changes. Without a full backup to apply those changes to, WAL segments are useless. The WAL only has value as the difference between your last full backup and the present moment.

Postgres Backup Tools

pg_dump produces logical backups as SQL statements or a custom binary format. The -Fc flag creates a compressed, selectively restorable archive. It’s not incremental — every run is a full dump — but for Docker-hosted databases under 50GB, it’s the safest bet because the output is portable across versions and architectures. Run it with docker exec -t <container> pg_dump -U postgres -Fc <database> | gzip.
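
A minimal cron sketch of that nightly dump; the container name (my-postgres), database (inventory), and backup paths are placeholders, not anything from the episode:

    # Nightly at 02:00: full logical dump, compressed, one dated file per day
    # (note: % must be escaped as \% inside a crontab entry)
    0 2 * * * docker exec -t my-postgres pg_dump -U postgres -Fc inventory | gzip > /backups/inventory-$(date +\%F).dump.gz

    # Restore later into a fresh database (pg_restore reads the archive from stdin):
    # gunzip -c /backups/inventory-2025-06-01.dump.gz | pg_restore -U postgres -d inventory_restored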

pg_basebackup performs physical file-level copies of the entire database cluster. It’s faster for large databases but not portable across Postgres versions. Combined with WAL archiving, it enables point-in-time recovery — you can restore to any moment between backups.
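
A sketch of both pieces, assuming a local archive directory at /wal-archive; the archive_command follows the standard pattern from the Postgres documentation:

    # postgresql.conf: hand each completed WAL segment to a safe location
    archive_mode = on
    archive_command = 'test ! -f /wal-archive/%f && cp %p /wal-archive/%f'

    # Physical base backup: tar format, gzipped, with WAL streamed alongside
    pg_basebackup -D /backups/base-$(date +%F) -Ft -z -X stream -P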

Postgres 17 introduced native incremental backups via the summarize_wal setting. Subsequent backups only copy changed blocks. The feature is still maturing, with overhead from WAL summarization and more complex recovery procedures.
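
A rough sketch of the Postgres 17 flow, with placeholder paths:

    # postgresql.conf (Postgres 17+): enable WAL summarization first
    summarize_wal = on

    # 1. Initial full backup
    pg_basebackup -D /backups/full

    # 2. Later runs copy only blocks changed since the backup named by the manifest
    pg_basebackup -D /backups/incr1 --incremental=/backups/full/backup_manifest

    # 3. At restore time, synthesize a complete data directory from the chain
    pg_combinebackup /backups/full /backups/incr1 -o /restore/data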

For production environments, external tools like pgBackRest, WAL-G, and Barman remain the standard. pgBackRest supports full, differential, and incremental backups, S3-compatible storage, and automatic retention policies. A typical schedule: a full backup Sunday at 2 AM, differentials on the remaining nights, and incrementals every six hours.
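
A minimal single-host pgBackRest sketch matching that schedule; the stanza name (main) and every path here are assumptions:

    # /etc/pgbackrest.conf
    [global]
    repo1-path=/var/lib/pgbackrest
    repo1-retention-full=2

    [main]
    pg1-path=/var/lib/postgresql/17/main

    # crontab: full on Sunday, differentials the other nights, incrementals every 6 hours
    0 2 * * 0    pgbackrest --stanza=main --type=full backup
    0 2 * * 1-6  pgbackrest --stanza=main --type=diff backup
    0 */6 * * *  pgbackrest --stanza=main --type=incr backup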

SQLite and SQL Server

SQLite backups are simple but fundamentally limited: every backup is a full copy. Use sqlite3 <database> .backup <backup.db> for hot binary copies, or .dump for portable SQL output. There’s no incremental option — if that’s a dealbreaker, migrate to Postgres.
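
Both variants as one-liners, with placeholder paths:

    # Hot binary copy: safe while the database is in use
    sqlite3 /data/inventory.db ".backup /backups/inventory-$(date +%F).db"

    # Portable SQL text dump, compressed; restores across SQLite versions and platforms
    sqlite3 /data/inventory.db .dump | gzip > /backups/inventory-$(date +%F).sql.gz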

SQL Server backups use T-SQL commands via sqlcmd. Full backups use BACKUP DATABASE, differentials add WITH DIFFERENTIAL, and transaction log backups enable point-in-time recovery. The backup chain must remain intact — a broken link prevents restoration past that point.
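
The three backup types as sqlcmd one-liners; the database name (Inventory) and file paths are placeholders, and -E (integrated auth) can be swapped for -U/-P:

    # Full backup: the base of the chain
    sqlcmd -S localhost -E -Q "BACKUP DATABASE Inventory TO DISK = N'/var/opt/mssql/backup/inv_full.bak'"

    # Differential: everything changed since the last full backup
    sqlcmd -S localhost -E -Q "BACKUP DATABASE Inventory TO DISK = N'/var/opt/mssql/backup/inv_diff.bak' WITH DIFFERENTIAL"

    # Transaction log: the piece that enables point-in-time restore
    sqlcmd -S localhost -E -Q "BACKUP LOG Inventory TO DISK = N'/var/opt/mssql/backup/inv_log.trn'"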

The Democratization Angle

Commercial backup tools historically sold UI, automation, and hand-holding around free CLI tools. With AI coding agents generating scripts, cron configurations, restore procedures, and verification tests, that value layer is being commoditized. For most applications — CRMs, inventory systems, home setups — open-source tools plus AI assistance now provide a completely viable production strategy.


#2476: Database Backups Without the Bloat

Corn
Daniel sent us this one — he wants to talk database backup fundamentals. He's thinking about the typical data-driven app, like a CRM or inventory system, and the fact that most people are still running SQL databases under the hood. His setup is a home inventory system with a database, media files, and Docker containers, and he points out something that trips up a lot of people — you don't back up the container, you back up the database and the files. The core question is, what are the actual CLI tools worth knowing for a conventional backup strategy — daily incrementals, maybe a weekly full, a quarterly full — and how do you pull that off without buying some bloated commercial tool when the fundamentals have been free for years.
Herman
What I love about this prompt is that Daniel is basically pointing at a weird dynamic in the industry. The tools have been there all along. pg_dump, pg_basebackup, WAL archiving — none of this is new. But the learning curve for non-DBAs was real, and companies made a lot of money selling wrappers around these things. Now with AI coding agents, you can describe what you want in plain language and get a production-ready backup script in seconds. By the way, DeepSeek V four Pro is writing our script today. So we're living that reality right now — an AI model assembling the backup knowledge that used to require a specialist.
Corn
Here we are, the AI-generated donkey explaining database backups to the AI-generated sloth. There's something poetic there. But let's get into the actual tools, because I think the first thing to clear up is a misconception that I've definitely held myself. People hear that Postgres has this WAL, the write-ahead log, and they think, oh, Postgres is always in backup mode. The log is constantly writing, so I'm covered. That's not actually true, is it.
Herman
That's exactly the myth to bust right at the start. The WAL is not a backup. The WAL is a sequential record of every change made to the database. It's the incremental piece, but it's useless without a base to apply it to. Think of it this way — if you have a hundred hours of WAL segments but no full backup from which to start replaying them, you have nothing. The WAL is the difference between your last full backup and right now. Without the full backup, those WAL files are just a very detailed diary of data you can't access.
Corn
The WAL is like having a detailed changelog but no copy of the original document. You know exactly what changed on page seven, but page seven itself is gone.
Herman
And this is where the actual backup tools come in. For Postgres, the foundational tool is pg_dump. It's been there forever, it ships with Postgres, and it produces a logical backup — meaning it exports your database as SQL statements or a custom binary format. The flag to know is dash capital F lowercase c, which gives you a compressed custom-format archive that you can restore selectively. You can dump a single table out of it if you need to. But here's the thing Daniel was hinting at — pg_dump does not do incremental backups. Every run is a full dump of whatever database you point it at.
Corn
Which is fine for a home inventory system with a few thousand records, but if you're running a CRM with millions of rows and you're trying to do this daily, you're going to be dumping for a while.
Herman
Right, and that's where the tension Daniel mentioned comes in. He said dumps don't really have much of a role in routine backup, and for large production databases he's mostly correct. If your database is north of, say, fifty gigabytes, a nightly pg_dump starts to hurt. But here's the counterpoint — for Docker-hosted databases, which is exactly Daniel's setup, pg_dump is actually recommended precisely because it produces a consistent, portable snapshot that survives container recreation. You run docker exec dash t your container name pg_dump dash U postgres dash F c your database, pipe it through gzip, and you've got a file you can restore anywhere, any version, any architecture.
Corn
The tool that's supposedly not for routine backup turns out to be the safest bet for the Docker use case. That's the kind of nuance that I think gets lost when people just google "Postgres backup best practices" and find advice written for someone managing a five-terabyte cluster.
Herman
That five-terabyte cluster is where you need the other tools. Let's talk about pg_basebackup. This is the built-in physical backup tool. Instead of exporting SQL, it copies the entire database cluster at the file level. It's much faster than pg_dump for large databases, but the output is a binary copy of the data directory — you restore it by replacing your data directory with the backup. It's not portable across versions or architectures the way a pg_dump is. But it's the foundation for point-in-time recovery, because you combine a pg_basebackup with your archived WAL segments and you can roll forward to any moment between backups.
Corn
For the CRM scenario Daniel mentioned, where this is a business-critical app and you need to be able to say "restore to fifteen minutes before someone dropped the wrong table," you're looking at pg_basebackup plus WAL archiving.
Herman
And the configuration is straightforward but it's the kind of thing that used to scare people off. You set archive_mode to on in your postgresql.conf, you set archive_command to a shell command that copies each WAL segment to a safe location — could be another server, could be an S three bucket — and then you schedule your pg_basebackup runs. The RPO, recovery point objective, drops from hours to minutes. But here's where it gets interesting for the incremental backup question Daniel raised. Postgres seventeen, which has been out for a bit now, introduced native incremental backup support.
Corn
Native incremental as in, built into Postgres itself, not requiring an external tool.
Herman
You enable a setting called summarize_wal, and then you take an initial full backup with pg_basebackup, and subsequent backups can be incremental — only the blocks that changed since the last backup. For databases over five terabytes where a full backup takes twenty-plus hours, this is a game changer. But I'll be honest — the implementation is still maturing. There's overhead from the WAL summarization, recovery is more complex, and most production deployments I'm aware of are still using external tools for incremental backups rather than the native Postgres seventeen feature.
Corn
Which external tools are we talking about?
Herman
The big three in the Postgres world are pgBackRest, WAL-G, and Barman. pgBackRest is probably the one I'd point people to first. It supports full, differential, and incremental backups, it can push to S three-compatible storage, it handles retention policies automatically, and it's been battle-tested. A typical cron schedule looks like a full backup every Sunday at two AM, a differential every other night, and incrementals every six hours. The configuration lives in a single file, slash etc slash pgbackrest dot conf. It's the kind of tool where, five years ago, setting it up correctly required reading a lot of documentation and probably getting something wrong the first time. Now you can describe your setup to an AI agent and it'll generate the configuration for you.
Corn
That's the democratization angle Daniel was getting at. The tools were always free, but the expertise to use them correctly was not. Now the expertise is increasingly available on demand. You still need to know enough to verify what the AI gives you, but the barrier to entry has dropped dramatically.
Herman
There's a concrete example of this from just last month. There's a tool called Databasus — it's a web UI for Postgres backups that wraps pg_dump and pg_basebackup. It became the most starred Postgres backup tool on GitHub last year, and in March of this year, Anthropic backed it through their Claude for Open Source Program. So you've got an AI company directly investing in making database backup more accessible. The open-source backup tools are getting attention and funding precisely because the AI layer makes them usable by people who aren't DBAs.
Corn
Which raises the question — if the CLI tools are now accessible to anyone with an AI coding agent, what exactly is the value proposition of a commercial backup tool? They've been selling these things for decades, and the core functionality was always available for free. What they were really selling was the UI, the automation, the hand-holding.
Herman
That layer is being commoditized fast. If an AI agent can generate your backup scripts, your cron configuration, your restore procedures, and even your verification tests, then the commercial tool is competing on — what, exactly? There's still a place for that in regulated industries, but for the kind of application Daniel is describing, a CRM or an inventory system, the open-source CLI tools plus AI assistance is a completely viable production strategy.
Corn
Let's step back to SQLite, because Daniel mentioned it and it's worth covering. SQLite is everywhere — it's in mobile apps, desktop apps, embedded systems, and a lot of small-to-medium web apps. What does backup look like there?
Herman
SQLite backup is beautifully simple and also fundamentally limited in one important way. There is no incremental backup for SQLite. Every backup method produces a full copy of the database file. But for most SQLite databases, that's fine — they tend to be small enough that a full copy takes seconds. The command to know is sqlite3 your database dot backup backup dot db. That does a live, hot binary copy without locking the database. It's fast, it's exact, and it's production-safe. The alternative is dot dump, which exports the entire database as SQL statements. That's slower and produces a larger file, but it's portable across SQLite versions and platforms. If you're backing up to put the data in version control, you want dot dump. If you're backing up for disaster recovery, you want dot backup.
Corn
Neither of those is incremental, so for a daily backup strategy on SQLite you're just doing a full copy every day.
Herman
Every day, and you're keeping your retention window reasonable. The good news is that SQLite databases compress extremely well, so you can keep a lot of daily snapshots without eating too much storage. There's also a newer tool called sqlite3_rsync that does safe remote syncing over SSH without locks, which is useful if you're backing up to a remote server. But the fundamental constraint remains — no incrementals. If that's a dealbreaker, you've outgrown SQLite and it's time to migrate to Postgres.
Corn
For SQL Server, which is still everywhere in enterprise environments, what's the CLI story?
Herman
SQL Server's backup is all done through T-SQL commands, and the tool you use to run those commands from the command line is sqlcmd. A full backup looks like sqlcmd dash S localhost dash E dash Q, and then in quotes, BACKUP DATABASE your database TO DISK equals the file path. For differential backups, which is SQL Server's version of incrementals, you add WITH DIFFERENTIAL to that command. And then there's transaction log backups with BACKUP LOG, which give you point-in-time recovery. The key thing with SQL Server is that the backup chain has to stay intact — full, then differentials, then log backups — and if any link in that chain is broken, you can't restore past it.
Corn
It's a similar model to Postgres with WAL archiving, just different terminology and different tools.
Herman
Same concepts, different implementation. And there's also a PowerShell alternative if you're on Windows — the Backup-SqlDatabase cmdlet from the SqlServer module. But sqlcmd works everywhere SQL Server runs, including Linux and Docker, so that's the one to know.
Corn
Now let's talk about the Docker angle specifically, because Daniel mentioned it and it's a scenario where a lot of the conventional backup advice falls apart. In a Docker setup, the container is ephemeral. You don't back it up. You back up the volumes and you back up the database. But how you back up the database inside a container is where people get tripped up.
Herman
The biggest mistake I see is people trying to back up the raw database volume by tarballing the data directory while the database is running. Don't do that. You risk getting an inconsistent copy — the database might be mid-write when you tar the files, and your backup is now corrupted in a way you won't discover until you try to restore it. The safe approach is to use the database's own backup tools from outside or inside the container. For Postgres, you run docker exec to execute pg_dump inside the container and pipe the output to a file on the host. For SQLite, if your database file is on a mounted volume, you can just run sqlite3 dot backup from the host directly. The principle is, always use the database engine's native backup mechanism, not filesystem-level copies.
Corn
Unless you stop the database first. If you bring the container down, then tar the volume, then bring it back up, that's safe. But that means downtime, and for a business-critical CRM, downtime during business hours is not an option.
Herman
Right, and that's where the logical dump approach shines — you get a consistent backup without any downtime. The tradeoff, as we discussed, is that pg_dump on a large database takes time and puts load on the database during the dump. For a home inventory system, that's irrelevant. For a CRM with hundreds of users hitting it during the day, you schedule your dumps for the middle of the night. But here's a practical tip — if you're using Docker Compose, you can define a sidecar container that runs your backup script on a cron schedule. Tools like Offen's docker-volume-backup can handle the whole thing, including stopping containers temporarily if needed, tarballing volumes, and pushing to remote storage. It's a clean way to get consistent backups without building the whole pipeline yourself.
Corn
That tool integrates with the Docker label system, so you can annotate which volumes need backing up and which don't, directly in your compose file.
Herman
And it enforces the three-two-one rule without you having to think about it — three copies of your data, on two different media types, with one copy off-site. That's still the gold standard for backup strategy, whether you're running a home inventory system or a production CRM. The implementation changes, the principle doesn't.
Corn
If I'm Daniel, sitting down to set up backups for my home inventory system with Postgres and media files in Docker, what's my actual setup look like?
Herman
I'd do three things. One, a nightly pg_dump via cron, piped through gzip, stored on the host and synced to some off-site storage — Backblaze B two, an S three bucket, whatever. Two, a separate job that backs up the media files, probably just rsync to the same off-site location. Three, a quarterly restore test where you actually spin up a clean Postgres container, load the dump, and verify that your data is intact and your application works. The restore test is the part everyone skips, and it's the part that matters most. A backup you haven't tested is a wish, not a backup.
Corn
That restore testing point is worth underlining. I've seen companies with elaborate backup pipelines that had never once attempted a restore, and when they finally needed to, they discovered that some critical piece was missing — a configuration file, an encryption key, a dependency they didn't know about. The backup worked, technically, but the restore failed.
Herman
This is another place where AI assistance changes the game. Writing a restore test used to mean writing a bunch of shell scripts and Docker commands and hoping you got the sequence right. Now you can generate a restore test script in thirty seconds, run it in a CI pipeline, and get a pass or fail notification every quarter. The barrier isn't just lower for the backup itself — it's lower for the verification too. That's the part that actually keeps you safe.
Corn
Let's circle back to something Daniel said about commercial backup tools — that companies have been selling these for a long time for no reason. I think there's a more charitable reading. The tools were free, but the knowledge wasn't. If you were a small business running a CRM and you didn't have a DBA on staff, paying for a backup tool gave you a supported, documented, tested solution. The value wasn't in the bytes being copied — it was in the confidence that someone else had thought through the edge cases.
Herman
That's fair. And there are still edge cases where the commercial tools earn their keep. Compliance requirements, audit trails, encryption at rest with key management, integration with enterprise monitoring systems. But the threshold for when you need that versus when you can roll your own with pgBackRest or WAL-G has shifted dramatically. A five-person startup in twenty twenty probably bought a backup tool. The same startup today, with AI assistance available, can absolutely run their own backup infrastructure on open-source tools and be confident it's correct.
Corn
The other thing that's shifted is that the open-source tools themselves have gotten better. pgBackRest today is not the pgBackRest of five years ago. It supports parallel backup and restore, it handles tablespaces correctly, it has built-in verification. The gap between the free tools and the paid tools has narrowed on both sides — the free tools got better, and the knowledge to use them got more accessible.
Herman
Postgres itself keeps improving. The native incremental backup in Postgres seventeen is a big deal, even if it's still maturing. Barman, the Python-based backup manager, added incremental backups with compression and encryption in version three point fifteen, which came out in August of last year. WAL-G has been solid for cloud-native deployments for years. The ecosystem is genuinely rich. You don't need to buy a tool. You need to understand your recovery objectives and pick the right combination of built-in and external tools for your situation.
Corn
Let's make this concrete. If someone is listening and thinking, I have a Postgres database, I need to set up backups, what do I actually do, what's the decision tree?
Herman
First question, how big is your database and how much downtime can you tolerate. If it's under twenty gigs and you can handle a few minutes of load during the backup window, pg_dump with the custom format flag, scheduled nightly, is probably fine. You get portability, you get selective restore, it's dead simple. If it's larger than that, or if you need point-in-time recovery with a recovery point objective measured in minutes rather than hours, you need pg_basebackup plus WAL archiving at minimum, and realistically you want pgBackRest or WAL-G to manage the complexity. If you're running in Docker, you use docker exec to run pg_dump or you set up a sidecar container that handles the backup pipeline. And regardless of which path you take, you test your restores.
Corn
For SQLite, you're running dot backup or dot dump based on whether you want binary portability or SQL portability, you're scheduling it, and you're accepting that every backup is a full copy.
Herman
Which, for ninety percent of SQLite use cases, is completely fine. The database for a home inventory system is probably a few megabytes. You can keep a year of daily backups and barely notice the storage cost.
Corn
There's one more thing I want to touch on before we move to the fun fact. Daniel mentioned that AI tools like Claude Code make it easier to use these CLI tools, and I think there's a deeper point here about what AI actually changes in operational work. It's not that AI invents new backup strategies. It's that it removes the activation energy. Previously, you knew you should set up proper backups, but the prospect of reading the pgBackRest documentation, figuring out the cron syntax, testing the restore procedure — it was enough friction that you'd put it off. Now the friction is gone. You ask, you get a working configuration, you review it, you deploy it.
Herman
That's a pattern we're seeing across infrastructure work. AI doesn't replace the need to understand what you're doing — you still need to know enough to verify the output — but it collapses the time from "I should probably set this up" to "it's done and tested" from days to hours. For database backups specifically, that means more people will actually have working backups, which is a net win for everyone who depends on their data.

And now: Hilbert's daily fun fact.
Corn
The collective noun for a group of porcupines is a prickle.
Herman
What can listeners actually do with all this? First, audit your current backup situation. If you're running a database and you don't have automated backups, fix that today. The tools are free and the AI assistance is available. Second, if you're backing up Docker-hosted databases, make sure you're using logical dumps, not raw volume copies, unless you're stopping the container first. Third, test a restore. Pick a random backup from last month and see if you can actually bring it back. You'll either sleep better or discover a problem before it's an emergency. Fourth, if you're evaluating a commercial backup tool, ask yourself what you're actually paying for. If the answer is "the UI and the support," that might still be worth it for your situation. But know that the underlying mechanisms are almost certainly pg_dump, pg_basebackup, or something very similar under the hood.
Corn
If you're running SQLite, don't overthink it. A nightly dot backup piped to your off-site storage, with a thirty-day retention policy, covers the vast majority of failure scenarios. The simplicity is the feature.
Herman
The one question I keep coming back to is whether the commercial backup market survives the next five years in its current form. When the CLI tools are this good, and the AI layer makes them this accessible, and companies like Anthropic are funding open-source backup tools directly, the value proposition of a paid backup solution gets narrower every year. There will always be edge cases — massive enterprise deployments, regulated industries, environments where you need twenty-four seven phone support. But for everyone else, the era of paying for database backup is probably ending.
Corn
Thanks to Hilbert Flumingtop for producing. This has been My Weird Prompts. You can find every episode at myweirdprompts.com, and if you're enjoying the show, leave us a review wherever you listen. We appreciate it.
Herman
Catch you next time.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.