#2465: JSON-L vs Parquet: When Each Format Wins

How far can JSON-L scale before it breaks? And why does Parquet dominate for millions of rows?

Episode Details

Episode ID: MWP-2623
Published:
Duration: 28:47
Audio: Direct link
Pipeline: V5
TTS Engine: chatterbox-regular
Script Writing Agent: deepseek-v4-pro

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

JSON-L vs Parquet: The Real Trade-offs in Modern Data Formats

When moving large datasets without spinning up a database, two formats dominate the conversation: JSON-L and Parquet. Both are surprisingly portable, but they optimize for completely different use cases. Understanding when to use each can save hours of processing time and gigabytes of memory.

What Makes JSON-L Different

JSON-L (JSON Lines) is line-delimited: every line is a complete, standalone JSON object. There's no opening bracket, no closing bracket, no commas between records. This structure unlocks streaming. A parser can read one line, process it, discard it, and move to the next — constant memory usage regardless of file size. You can stream through a hundred-million-row file on a laptop with 8GB of RAM.
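The streaming pattern can be sketched in a few lines of standard-library Python; the file name is hypothetical:

```python
import json

def stream_jsonl(path):
    """Yield one parsed record at a time; memory use stays constant
    no matter how large the file is."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)

# Usage: aggregate over a file of any size without loading it all.
# total = sum(rec["value"] for rec in stream_jsonl("events.jsonl"))
```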

JSON-L is also trivially appendable, which is critical for incremental data: logs, sensor readings, API pagination, streaming events. You simply open the file in append mode and write a new line. No rewriting, no seeking, no memory overhead.
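Appending is equally simple in standard-library Python (function and field names here are illustrative):

```python
import json

def append_record(path, record):
    """Append one record as a new line; the rest of the file
    is never rewritten or even read."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```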

The Hidden Cost of JSON-L

JSON-L has no schema enforcement. Every line is a JSON object, but nothing ensures line 17 has the same fields as line 3 million. Every tool must parse strings and infer types on every single line. There's no built-in compression; you must gzip the file yourself, forcing every reader to decompress before parsing. JSON-L scales beautifully in operational simplicity, but is computationally wasteful at query time.

Why Parquet Wins at Scale

Parquet is columnar — it stores data by column, not by row. If you have 50 columns but your query needs only 3, Parquet reads just those 3 columns and skips the other 47 entirely. Row-based formats like JSON-L or CSV must read every field in every row.

This columnar layout enables massive compression ratios. Storing similar data types together makes algorithms like run-length encoding, dictionary encoding, and bit packing dramatically more effective. Blue Yonder's benchmarks show Parquet files are typically about 5x smaller than equivalent UTF-8 CSV — savings in storage, network transfer, and cloud egress.

Parquet's Dirty Secret: The Small Files Problem

Parquet is write-once, read-many optimized. Writing Parquet is slower than CSV or JSON-L because the engine must buffer rows, build column chunks, apply encoding, compute statistics, and write metadata. It's not designed for frequent row-level updates.

But the real gotcha: Parquet's performance depends on files being at least tens of megabytes, ideally hundreds. Each file carries metadata, row group statistics, and schema overhead. Thousands of tiny Parquet files (a few kilobytes each) can perform worse than CSV because all the time is spent opening files and reading metadata. As a rough heuristic, aim for Parquet files of 50-100MB minimum — typically a few hundred thousand rows for a standard tabular dataset.

How Hugging Face Actually Uses These Formats

When you upload a dataset to Hugging Face — CSV, JSON, JSON-L, whatever — the datasets library automatically converts it to Parquet behind the scenes. Parquet is the primary on-disk format powering the dataset viewer and enabling fast access. If you upload Parquet directly, Hugging Face symlinks your original files (provided row group specs meet requirements), skipping conversion entirely. Major datasets like CIFAR10 are stored directly in Parquet.

The File-as-Database Pattern

The ParquetDB paper from West Virginia University and NIST demonstrated a lightweight Python framework using Parquet files as the storage layer — no database server, no daemon, no connection strings. On ~4.8 million complex nested records, it outperformed both SQLite and MongoDB on query workloads. For analytical workloads on static or slowly-changing data, direct file access with column pruning is a completely viable pattern.

When It Breaks

Concurrent writes break Parquet — two processes writing to the same file will fail. Point lookups (finding one specific row by ID) require scanning column chunks; there's no B-tree index. ACID transactions don't exist. The file-as-database pattern works beautifully for large scans and aggregations on static data, but falls apart for concurrent writes or fast point queries.

The Bottom Line

JSON-L is for streaming, appendability, and human readability. Parquet is for analytical queries, compression, and columnar access. Choose JSON-L when data arrives incrementally and you need constant memory usage. Choose Parquet when you're running aggregations on millions of rows and want 5x compression. And remember: converting everything to Parquet without considering file sizes will create problems JSON-L never had.


#2465: JSON-L vs Parquet: When Each Format Wins

Corn
Daniel sent us this one — he's been thinking about flat data structures, specifically JSON-L and Parquet, which he says Hugging Face uses constantly. His core questions are pretty direct. One, how far can JSON-L actually scale before it falls over? Two, why is Parquet the go-to format when you're packaging tens of millions of rows? And he makes this point that both formats can be surprisingly portable for moving massive datasets around without spinning up a database. I think there's more to that portability angle than most people realize.
Herman
There really is, and I'm glad he brought up Hugging Face because that ecosystem is where a lot of these format decisions play out in practice. Also, quick note — DeepSeek V four Pro is writing our script today, so if anything comes out especially coherent, that's why.
Corn
So where do we start? JSON-L seems like the obvious entry point since most people encounter it first.
Herman
Right, and the first thing to get straight is what JSON-L actually is, because I've seen people confuse it with regular JSON arrays constantly. JSON-L is line-delimited. Every single line is a complete, standalone JSON object. There's no opening bracket, no closing bracket, no commas between records. That structure is what unlocks everything.
Corn
Why does that matter so much? A JSON array of objects seems functionally identical if you're just thinking about the data.
Herman
It matters because of how you read it. With a regular JSON array, the parser has to swallow the entire file to validate it. Even streaming, you need to track bracket depth and handle commas between objects. With JSON-L, you can read one line at a time, process it, throw it away, move to the next. That means constant memory usage. You can stream through a hundred-million-row file on a laptop with eight gigs of RAM and it won't break a sweat.
Corn
The scaling story is really about memory, not about speed.
Herman
Speed-wise, JSON-L isn't necessarily faster than a well-tuned JSON array parser. But memory-wise, it's night and day. The Scrapfly blog had a good breakdown of this back in March — they pointed out that JSON-L is the default output format for Scrapy, the web scraping framework, precisely because scraping jobs can run for hours or days. You don't want to hold the entire dataset in memory while you're still collecting it. You just append each new scraped item as a new line.
Corn
That appendability is interesting. With a JSON array, you can't just tack a new record onto the end without rewriting the closing bracket, which means rewriting the whole file.
Herman
JSON-L lets you append trivially. That's huge for any pipeline where data arrives incrementally — logs, sensor readings, API pagination, streaming events. You just open the file in append mode and write a new line. No seeking, no rewriting, no memory overhead. But here's where it gets interesting — there's a flip side.
Corn
Of course there is.
Herman
JSON-L has no schema. Every line is a JSON object, but there's nothing enforcing that line seventeen has the same fields as line three million. Every tool that reads JSON-L has to infer types on the fly — parse the string, figure out this field is an integer and that one's a float — and it has to do that for every single line. There's no built-in compression strategy either; you have to gzip the file yourself, and then every reader has to decompress before it can even start parsing. So JSON-L scales beautifully in operational simplicity, but it's computationally wasteful at query time.
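A tiny standard-library example of the type drift Herman describes: nothing stops the same field from changing type between lines, and every reader has to discover that itself.

```python
import json

lines = [
    '{"id": 1, "price": 19.99}',
    '{"id": "2", "price": "cheap", "extra": true}',  # same fields, new types
]
records = [json.loads(line) for line in lines]

# Each line is parsed and typed independently; nothing flags the mismatch.
print(type(records[0]["id"]).__name__, type(records[1]["id"]).__name__)  # int str
```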
Corn
That's where Parquet comes in, I assume.
Herman
That's exactly where Parquet comes in. Parquet is a columnar storage format — it stores data by column, not by row. If you've got fifty columns but your query only needs three, Parquet reads just those three and skips the other forty-seven entirely. DataCamp had a great explainer on this in February. That's not possible with row-based formats like JSON-L or CSV, where you have to read every field in every row to extract the ones you want.
Corn
The physical layout on disk is completely different.
Herman
In a row-based format, all the fields for row one are stored together, then all the fields for row two. In Parquet, all the values for column A across thousands of rows are stored together, then all the values for column B. This is why Parquet gets those massive compression ratios — you're storing similar data types together, which makes compression algorithms dramatically more effective. Run-length encoding, dictionary encoding, bit packing — these work much better on a column of integers versus a mixed bag of strings, numbers, and booleans all jumbled together.
Corn
How much smaller are we talking?
Herman
Blue Yonder's engineering team benchmarked this — Parquet files are typically about five times smaller than the equivalent UTF-8 encoded CSV. That's not just storage savings, that's network transfer savings, cloud egress savings, faster loading into memory. And the compression isn't just about size — reading fewer bytes from disk is always faster, even on modern SSDs.
Corn
JSON-L scales through streaming and simplicity, Parquet scales through compression and columnar access. But you mentioned a flip side for JSON-L — does Parquet have one too?
Herman
Parquet is write-once, read-many optimized. Writing Parquet files is slower than writing CSV or JSON-L because the engine has to buffer enough rows to build column chunks, apply encoding, compute statistics, and write the metadata. It's not designed for frequent row-level updates. If you're doing real-time streaming where you're writing a few rows every second, Parquet will punish you. JSON-L handles that beautifully — just append a line. Parquet wants you to batch up your writes.
Corn
There's the small files problem, right?
Herman
Yes, and this is Parquet's dirty secret that doesn't get talked about enough. Parquet's performance depends on files being at least tens of megabytes, ideally hundreds. Each Parquet file has metadata, row group statistics, schema information — there's fixed overhead per file. If you have thousands of tiny Parquet files, each a few kilobytes, that overhead dominates and you can actually perform worse than CSV. The columnar benefits evaporate because you're spending all your time opening files and reading metadata instead of actually scanning data.
Corn
Where's the crossover? At what row count does Parquet start making sense?
Herman
It depends on row width and data types, but as a rough heuristic, you want your Parquet files to be at least fifty to a hundred megabytes. For a typical tabular dataset with maybe twenty columns of mixed types, that's probably a few hundred thousand rows minimum. Below that, JSON-L or even CSV might serve you better, especially if you're doing full-table scans anyway.
Corn
That's a nuance I don't think most people appreciate. They hear "Parquet is faster" and just convert everything.
Herman
Right, and then they end up with ten thousand tiny Parquet files from some streaming pipeline and wonder why their queries are slow. There are tools that try to paper over this — Delta Lake, Iceberg, LakeFS — they handle compaction, merging small files into larger ones behind the scenes. But that's adding infrastructure to solve a problem that JSON-L doesn't have in the first place.
Corn
Let's talk about Hugging Face specifically, since that's where Daniel's prompt is coming from. What's actually happening under the hood there?
Herman
This is something a lot of users don't realize. When you upload a dataset to Hugging Face — whether it's CSV, JSON, JSON-L, whatever — Hugging Face's datasets library automatically converts it to Parquet behind the scenes. There was a really informative thread on their forums about this, with one of their engineers explaining that Parquet is the primary on-disk format they use to power the dataset viewer and enable fast access. So you might think you're sharing a CSV, but what consumers are actually reading is Parquet.
Corn
That's almost deceptive, but in a useful way.
Herman
It's pragmatic. And here's the kicker — if you upload Parquet directly, Hugging Face just symlinks your original files, provided your row group specifications meet their requirements. No conversion overhead, no duplicate storage. You upload Parquet, they serve Parquet. That's why a lot of the major datasets on Hugging Face, like CIFAR10, are stored directly in Parquet now.
Corn
The "best" upload format might actually be Parquet from the start, even though most people probably reach for CSV or JSON-L out of habit.
Herman
You skip the conversion step entirely. And since Hugging Face uses Apache Arrow as the in-memory format, Parquet files get loaded into Arrow tables with zero-copy memory mapping. That means you can work with datasets larger than your available RAM. Arrow's columnar format in memory maps directly to Parquet's columnar format on disk — it's a beautifully designed pipeline.
Corn
That portability angle Daniel mentioned — moving data without database connections — that seems like the through-line here. Both formats enable it, but in different ways.
Herman
Right, and the ParquetDB paper from West Virginia University and NIST really drove this home. They built a lightweight Python framework that uses Parquet files as the storage layer — no database server, no daemon, no connection strings. Just a folder of Parquet files. They benchmarked it on about four point eight million complex, nested records from a materials science database, and it outperformed both SQLite and MongoDB on their query workloads.
Corn
That's surprising.
Herman
It makes sense when you think about it. MongoDB is a general-purpose document database with all the overhead of a server process, network stack, query planner. ParquetDB was doing predicate pushdown directly on the files, using column statistics to skip irrelevant row groups, and reading only the columns the query needed. No network round trips, no query parsing overhead, just direct file access with column pruning. For analytical workloads on static or slowly-changing data, that's a completely viable pattern.
Corn
What breaks first with that approach?
Herman
Concurrent writes, for sure. Parquet files are not designed for multiple writers. If you've got two processes trying to write to the same file, you're going to have a bad time. Also, point lookups — if you need to find one specific row by ID, Parquet has to scan column chunks to find it. There's no B-tree index like you'd have in a proper database. ACID transactions are out the window. So the file-as-database pattern works beautifully for analytical workloads, batch processing, research pipelines — anything where you're doing large scans and aggregations on relatively static data. It falls apart when you need concurrent writes or fast point queries.
Corn
It's not replacing Postgres anytime soon.
Herman
No, and it's not trying to. But here's the thing — a huge amount of data work doesn't need a database. Data scientists are loading a dataset, running some aggregations, training a model, and moving on. They don't need replication, concurrency control, or point-in-time recovery. They need a file they can download, load into a DataFrame, and query. Parquet is perfect for that, and it's trivially portable — you can copy it to a USB drive, upload it to S3, email it. Any tool in the ecosystem can read it immediately.
Corn
JSON-L has a similar portability story, but it's more about human readability.
Herman
JSON-L is text. You can open it in any text editor, grep it, use command-line tools like jq on individual lines. It's the format you want when a human might need to inspect the data directly or when you're debugging a pipeline. Parquet is binary — you can't just open it and read it. You need tools. That's a real trade-off.
Corn
Let's put some numbers on this. When Daniel says "millions or tens of millions of rows," what does that actually look like in practice?
Herman
Let's say you've got ten million rows with maybe fifteen columns — a mix of strings, integers, floats, maybe some nested fields. As a CSV, that's probably around two to three gigabytes uncompressed. As JSON-L, it might be a bit larger because of the field name repetition in every single row. As Parquet with Snappy compression, you're probably looking at four to six hundred megabytes. That's a factor of five to seven times smaller.
Herman
If you're doing a full scan — reading every row and every column — JSON-L and Parquet are going to be in the same ballpark, maybe a slight edge to Parquet because of compression reducing I/O. But if you're reading three columns out of fifteen and filtering on one of them, Parquet will be dramatically faster. Ten times, fifty times, sometimes more. The column pruning and predicate pushdown mean you're reading a tiny fraction of the total data.
Corn
The choice between them really comes down to the access pattern.
Herman
It comes down to three things. One, write pattern — are you appending incrementally or batching? Two, read pattern — are you doing full scans or selective queries? Three, human factors — do you need to inspect the data with basic tools, or are you always going through a DataFrame library? JSON-L wins on append-heavy, full-scan, human-in-the-loop workflows. Parquet wins on batch-write, selective-query, machine-to-machine workflows.
Corn
In the Hugging Face world, it's overwhelmingly the latter.
Herman
You upload a dataset once, thousands of people download it and run selective queries on it. That's Parquet's ideal use case. The write cost is paid once, the read benefits accrue forever.
Corn
I want to circle back to something you mentioned earlier about JSON-L lacking schema enforcement. Is that always a downside, or are there cases where schema flexibility is actually the point?
Herman
It depends on the domain. If you're dealing with heterogeneous data — say you're scraping e-commerce sites and every site has slightly different fields — JSON-L's schema flexibility is a feature, not a bug. You don't have to define a schema upfront, you don't have to handle migration when a new field appears, you just write the JSON object as you received it. The cost is that downstream consumers have to deal with that heterogeneity. With Parquet, the schema is embedded in the file metadata, which means consumers know exactly what they're getting before they read a single row. That's powerful for collaboration and pipeline reliability.
Corn
JSON-L pushes the schema problem downstream, Parquet solves it upstream.
Herman
That's a really clean way to put it. And it connects to something the ParquetDB paper touched on — schema evolution. Parquet supports adding and removing columns without breaking existing readers. Old files stay readable, new files can have different columns. It's not quite as flexible as JSON-L's anything-goes approach, but it's structured enough to give consumers confidence.
Corn
What about compression? You mentioned JSON-L needs to be gzipped externally. Is that a meaningful difference?
Herman
It is, because it affects how tools interact with the data. A gzipped JSON-L file has to be fully decompressed before any line can be read — gzip isn't splittable. So if you've got a hundred-gigabyte gzipped JSON-L file on HDFS or S3, you can't parallelize the read across workers the way you can with Parquet, where each row group is independently compressed and can be read in parallel. There are compression formats that solve this — bzip2 is splittable — but they're slower and less common. Parquet bakes splittable compression into the format itself.
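A standard-library sketch of the gzip point: lines still stream out one at a time, but the compressed stream can only be decompressed sequentially from the start, so there is no way to hand the middle of the file to another worker.

```python
import gzip
import json

# Writing: gzip-compressed JSON-L still streams line by line.
with gzip.open("events.jsonl.gz", "wt", encoding="utf-8") as f:
    for i in range(3):
        f.write(json.dumps({"i": i}) + "\n")

# Reading: lines stream fine, but decompression must start at byte zero --
# gzip is not splittable, unlike Parquet's independently compressed row groups.
with gzip.open("events.jsonl.gz", "rt", encoding="utf-8") as f:
    ids = [json.loads(line)["i"] for line in f]
print(ids)  # [0, 1, 2]
```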
Corn
Even the compression story ties back to parallelism.
Herman
Everything ties back to parallelism at scale. That's really the dividing line. JSON-L scales through simplicity — it's so simple that you can build parallelism on top of it by splitting files on line boundaries. Parquet scales through structure — it's designed from the ground up for parallel access patterns. Both work, but they make different trade-offs.
Corn
Let's talk about the ecosystem for a minute. Where does DuckDB fit into all this?
Herman
DuckDB is fascinating because it's an in-process analytical database that eats Parquet files for breakfast. You can point DuckDB at a directory of Parquet files and run SQL queries directly against them, with full predicate pushdown, column pruning, and parallel execution. No ingestion step, no schema definition, no server. It's the file-as-database pattern taken to its logical conclusion. And it's fast — like, shockingly fast. It can scan Parquet files at memory bandwidth speeds.
Corn
You get SQL without a database server.
Herman
And DuckDB can also read JSON-L, but it has to parse and infer types on the fly, which is slower and more error-prone. With Parquet, the types are already there in the metadata. DuckDB knows column five is a thirty-two-bit float before it reads a single byte of column data.
Corn
That type inference point is something I've run into personally. JSON-L will happily let you write an integer in line one and a string in line two for the same field, and then your downstream tool has to figure out what to do with that.
Herman
Different tools handle it differently. Pandas might promote everything to object type, which kills performance. DuckDB might throw an error. Polars might try to cast and fail on the mismatched rows. It's a mess. Parquet avoids that entirely because the schema is explicit.
Corn
If someone's listening and they're trying to decide what format to use for a new project — say they're building a dataset that'll be a few million rows and they want to share it with collaborators — what's your decision tree?
Herman
First question — is the data arriving incrementally or in batches? If it's streaming in row by row, start with JSON-L. It's dead simple to append, you won't lose data, and you can always convert to Parquet later. Second question — what are consumers going to do with it? If they're going to run analytical queries selecting subsets of columns and filtering on specific fields, convert to Parquet before sharing. The conversion cost is paid once, and every consumer benefits. Third question — do humans need to inspect the raw data? If yes, keep a JSON-L copy around for debugging, even if your primary distribution format is Parquet.
Corn
What about the case where you're not sure about the schema yet? You're exploring, the fields might change.
Herman
JSON-L during exploration, Parquet once things stabilize. That's a really common pattern in practice — you collect data in JSON-L because it's flexible and appendable, you do your exploratory work, you figure out what the schema should be, and then you run a conversion step to produce clean, typed Parquet files for downstream consumption. Hugging Face basically does this conversion automatically, which is why a lot of people don't even realize it's happening.
Corn
Daniel mentioned portability specifically — moving data without direct database connections. I think that's worth dwelling on because it's a shift in how people think about data infrastructure.
Herman
It really is. For decades, the assumption was that if you had serious data, you needed a serious database. Oracle, SQL Server, Postgres — something with a daemon and connection pooling and a query optimizer. But the ecosystem has evolved to the point where a file can give you most of what you need for analytical workloads. Parquet with predicate pushdown gives you the I/O efficiency of a column store. DuckDB or Polars gives you the query engine. Arrow gives you the in-memory format. You can do billions of rows on a laptop with no infrastructure whatsoever.
Corn
Where does that break, practically speaking?
Herman
Concurrency, like I said. If you need a hundred people querying the same data simultaneously with guaranteed consistency, you need a database. If you need point lookups by primary key with sub-millisecond latency, you need a database. If you need to update individual rows without rewriting entire files, you need a database. But if you're doing analytical work — loading a dataset, exploring it, training models, generating reports — the file-based approach is not just viable, it's often faster and simpler.
Corn
Cheaper, I assume.
Herman
No managed database instances, no connection pooling infrastructure, no backup daemons. You pay for storage and maybe some compute for the query engine, and that's it. For research labs, startups, and individual developers, that's a game changer.
Corn
Let's talk about the Hugging Face angle one more time. Daniel mentioned both formats being used heavily there. Is there a case where you'd want to upload JSON-L instead of Parquet to Hugging Face, knowing that they'll convert it anyway?
Herman
If your data is naturally line-delimited and you're generating it incrementally, uploading JSON-L makes sense because it's the format you already have. The conversion cost is paid by Hugging Face's infrastructure, not yours. But if you're preparing a dataset specifically for distribution, uploading Parquet directly gives you more control — you can tune the row group size, choose the compression codec, and ensure the schema is exactly what you want. And as I mentioned, you avoid the conversion overhead entirely because Hugging Face will just symlink your files.
Corn
What's a row group size, for listeners who haven't tuned Parquet files before?
Herman
A row group is a chunk of rows within a Parquet file — it's the unit of parallel reading. Each row group contains column chunks for all columns, and each row group has its own statistics. When a query engine does predicate pushdown, it checks the statistics for each row group and skips the ones that can't possibly match. The default row group size is typically around a hundred and twenty-eight megabytes, but you can tune it. Larger row groups mean better compression and less metadata overhead, but coarser granularity for skipping. Smaller row groups mean more precise skipping but more overhead. It's a trade-off.
Corn
You'd tune based on your query patterns.
Herman
If your queries are highly selective — filtering for a specific date range or user ID — smaller row groups let you skip more data. If you're doing full scans most of the time, larger row groups give you better compression and less overhead.
Corn
This is the kind of detail that separates "I use Parquet" from "I understand Parquet."
Herman
Honestly, most people don't need to think about it. The defaults are fine for the vast majority of use cases. But when you're pushing into tens of millions of rows and you care about query latency, tuning row group size and compression codec can make a noticeable difference.
Corn
We've been comparing JSON-L and Parquet as if they're competitors, but it seems like they're actually complementary in a lot of workflows.
Herman
They really are. JSON-L is the ingestion format, Parquet is the consumption format. You collect in JSON-L because it's simple and appendable and human-readable. You distribute in Parquet because it's compact and typed and queryable. The pipeline from one to the other is well-understood and well-supported. It's not an either-or decision for most real projects.
Corn
CSV is still out there, hanging on.
Herman
CSV will never die, and that's fine. It's the lowest common denominator — every tool reads it, every human can open it in Excel. But CSV has real problems at scale. Escaping is ambiguous, there's no standard way to represent types, and it's row-based so you get none of the columnar benefits. CSV is great for a few thousand rows of tabular data that a human needs to look at. It's terrible for millions of rows that a machine needs to query.
Corn
I've debugged enough CSV encoding issues to last a lifetime. The number of times I've seen a comma inside a quoted field break an entire pipeline...
Herman
Don't get me started on date formats. Is it month-day-year or day-month-year? CSV doesn't tell you. Parquet tells you. JSON-L at least gives you ISO 8601 if you're disciplined about it. CSV gives you ambiguity and sadness.
Corn
That should be on a t-shirt. "CSV: Ambiguity and Sadness."
Herman
I'd wear it.
Corn
To pull this together — Daniel asked how far JSON-L can scale and why Parquet is useful for large flat data. JSON-L scales to hundreds of millions of rows through streaming and constant-memory processing, but it's computationally wasteful at query time because every tool has to re-parse and re-infer types. Parquet solves that by baking schema, compression, and statistics into the file itself, making selective queries dramatically faster and storage dramatically smaller. Both formats are portable in the sense that they're just files — no server, no connection strings, no infrastructure — but they're optimized for different stages of the data lifecycle.
Herman
That's a perfect summary. And the Hugging Face ecosystem is a great case study because it uses both — JSON-L for ingestion flexibility, Parquet for distribution efficiency. The conversion happens automatically, which is both convenient and slightly opaque. If you know what you're doing, you can optimize your pipeline by uploading Parquet directly and tuning your row groups.
Corn
One thing we didn't touch on — are there tools that make working with these formats easier for people who aren't deep in the data engineering weeds?
Herman
For JSON-L, jq is the Swiss Army knife — you can filter, transform, and aggregate JSON-L files from the command line with surprising power. For Parquet, DuckDB is the easiest entry point — install it, point it at a Parquet file, write SQL. No configuration, no schemas, no servers. And both formats are first-class citizens in Python through Pandas and Polars. You don't need to be a data engineer to benefit from these formats.
Corn
That's a good note to end on. These aren't exotic formats for specialized use cases — they're practical tools that make real workflows faster and simpler.

And now: Hilbert's daily fun fact.

The national flag of Nepal is the only national flag in the world that is not rectangular or square. It consists of two overlapping triangular pennants.
Corn
If you're working with flat data at scale, the concrete takeaway is pretty simple. If you're collecting data incrementally, use JSON-L. If you're distributing data for analytical queries, use Parquet. If you're doing both, use JSON-L for collection and convert to Parquet for distribution. The conversion step is cheap, well-supported, and pays for itself every time someone runs a selective query on your dataset.
Herman
If you're uploading to Hugging Face, consider uploading Parquet directly. You'll skip their conversion step, you'll have more control over the file layout, and your users will get better performance. The docs on row group configuration are worth reading if you're pushing past ten million rows.
Corn
One open question I keep coming back to — as these file-based formats get more sophisticated, with predicate pushdown and column statistics and schema evolution, at what point do we stop calling them file formats and start calling them database engines? Parquet plus DuckDB is functionally a columnar database. It just doesn't have a daemon.
Herman
That's a philosophical question the industry is still wrestling with. The line between "file format" and "database" has been blurring for years. I'm not sure it matters what we call it, as long as we understand the trade-offs.
Corn
This has been My Weird Prompts, produced by the one and only Hilbert Flumingtop. If you enjoyed this episode, leave us a review wherever you get your podcasts — it genuinely helps more people find the show. We're at myweirdprompts.com for the full archive.
Herman
See you next time.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.