Daniel sent us this one — he's been thinking about flat data structures, specifically JSON-L and Parquet, which he says Hugging Face uses constantly. His core questions are pretty direct. One, how far can JSON-L actually scale before it falls over? Two, why is Parquet the go-to format when you're packaging tens of millions of rows? And he makes this point that both formats can be surprisingly portable for moving massive datasets around without spinning up a database. I think there's more to that portability angle than most people realize.
There really is, and I'm glad he brought up Hugging Face because that ecosystem is where a lot of these format decisions play out in practice. Also, quick note — DeepSeek V four Pro is writing our script today, so if anything comes out especially coherent, that's why.
So where do we start? JSON-L seems like the obvious entry point since most people encounter it first.
Right, and the first thing to get straight is what JSON-L actually is, because I've seen people confuse it with regular JSON arrays constantly. JSON-L is line-delimited. Every single line is a complete, standalone JSON object. There's no opening bracket, no closing bracket, no commas between records. That structure is what unlocks everything.
Why does that matter so much? A JSON array of objects seems functionally identical if you're just thinking about the data.
It matters because of how you read it. With a regular JSON array, the parser has to swallow the entire file to validate it. Even streaming, you need to track bracket depth and handle commas between objects. With JSON-L, you can read one line at a time, process it, throw it away, move to the next. That means constant memory usage. You can stream through a hundred-million-row file on a laptop with eight gigs of RAM and it won't break a sweat.
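To make that concrete, here's a minimal sketch of the constant-memory read loop. The filename and the per-record handling are placeholders, not anything from a specific pipeline:

```python
import json

# Stream a JSON-L file one record at a time. Memory use stays constant
# no matter how many rows the file holds.
with open("events.jsonl", encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue  # tolerate stray blank lines
        record = json.loads(line)
        # ... handle one record, then let it be garbage-collected
        print(record.get("id"))
```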
The scaling story is really about memory, not about speed.
Speed-wise, JSON-L isn't necessarily faster than a well-tuned JSON array parser. But memory-wise, it's night and day. The Scrapfly blog had a good breakdown of this back in March — they pointed out that JSON-L is the default output format for Scrapy, the web scraping framework, precisely because scraping jobs can run for hours or days. You don't want to hold the entire dataset in memory while you're still collecting it. You just append each new scraped item as a new line.
That appendability is interesting. With a JSON array, you can't just tack a new record onto the end without rewriting the closing bracket, which means rewriting the whole file.
JSON-L lets you append trivially. That's huge for any pipeline where data arrives incrementally — logs, sensor readings, API pagination, streaming events. You just open the file in append mode and write a new line. No seeking, no rewriting, no memory overhead. But here's where it gets interesting — there's a flip side.
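The append pattern is just as small. A minimal sketch, with a made-up record:

```python
import json

new_record = {"id": 42, "event": "click", "ts": "2024-01-01T00:00:00Z"}

# Appending is a single write to the end of the file: no seeking,
# no rewriting, no need to parse anything that's already there.
with open("events.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(new_record) + "\n")
```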
Of course there is.
JSON-L has no schema. Every line is a JSON object, but there's nothing enforcing that line seventeen has the same fields as line three million. Every tool that reads JSON-L has to infer types on the fly — parse the string, figure out this field is an integer and that one's a float — and it has to do that for every single line. There's no built-in compression strategy either; you have to gzip the file yourself, and then every reader has to decompress before it can even start parsing. So JSON-L scales beautifully in operational simplicity, but it's computationally wasteful at query time.
That's where Parquet comes in, I assume.
That's exactly where Parquet comes in. Parquet is a columnar storage format — it stores data by column, not by row. If you've got fifty columns but your query only needs three, Parquet reads just those three and skips the other forty-seven entirely. DataCamp had a great explainer on this in February. That's not possible with row-based formats like JSON-L or CSV, where you have to read every field in every row to extract the ones you want.
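In pandas, that column pruning is a one-line opt-in; the filename and column names here are hypothetical:

```python
import pandas as pd

# Only the three requested columns are read from disk; the bytes for the
# other columns are never touched.
df = pd.read_parquet("events.parquet", columns=["user_id", "ts", "score"])
```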
The physical layout on disk is completely different.
In a row-based format, all the fields for row one are stored together, then all the fields for row two. In Parquet, all the values for column A across thousands of rows are stored together, then all the values for column B. This is why Parquet gets those massive compression ratios — you're storing similar data types together, which makes compression algorithms dramatically more effective. Run-length encoding, dictionary encoding, bit packing — these work much better on a column of integers versus a mixed bag of strings, numbers, and booleans all jumbled together.
How much smaller are we talking?
Blue Yonder's engineering team benchmarked this — Parquet files are typically about five times smaller than the equivalent UTF-8 encoded CSV. That's not just storage savings, that's network transfer savings, cloud egress savings, faster loading into memory. And the compression isn't just about size — reading fewer bytes from disk is always faster, even on modern SSDs.
JSON-L scales through streaming and simplicity, Parquet scales through compression and columnar access. But you mentioned a flip side for JSON-L — does Parquet have one too?
Parquet is write-once, read-many optimized. Writing Parquet files is slower than writing CSV or JSON-L because the engine has to buffer enough rows to build column chunks, apply encoding, compute statistics, and write the metadata. It's not designed for frequent row-level updates. If you're doing real-time streaming where you're writing a few rows every second, Parquet will punish you. JSON-L handles that beautifully — just append a line. Parquet wants you to batch up your writes.
There's the small files problem, right?
Yes, and this is Parquet's dirty secret that doesn't get talked about enough. Parquet's performance depends on files being at least tens of megabytes, ideally hundreds. Each Parquet file has metadata, row group statistics, schema information — there's fixed overhead per file. If you have thousands of tiny Parquet files, each a few kilobytes, that overhead dominates and you can actually perform worse than CSV. The columnar benefits evaporate because you're spending all your time opening files and reading metadata instead of actually scanning data.
Where's the crossover? At what row count does Parquet start making sense?
It depends on row width and data types, but as a rough heuristic, you want your Parquet files to be at least fifty to a hundred megabytes. For a typical tabular dataset with maybe twenty columns of mixed types, that's probably a few hundred thousand rows minimum. Below that, JSON-L or even CSV might serve you better, especially if you're doing full-table scans anyway.
That's a nuance I don't think most people appreciate. They hear "Parquet is faster" and just convert everything.
Right, and then they end up with ten thousand tiny Parquet files from some streaming pipeline and wonder why their queries are slow. There are tools that try to paper over this — Delta Lake, Iceberg, LakeFS — they handle compaction, merging small files into larger ones behind the scenes. But that's adding infrastructure to solve a problem that JSON-L doesn't have in the first place.
Let's talk about Hugging Face specifically, since that's where Daniel's prompt is coming from. What's actually happening under the hood there?
This is something a lot of users don't realize. When you upload a dataset to Hugging Face — whether it's CSV, JSON, JSON-L, whatever — Hugging Face's datasets library automatically converts it to Parquet behind the scenes. There was a really informative thread on their forums about this, with one of their engineers explaining that Parquet is the primary on-disk format they use to power the dataset viewer and enable fast access. So you might think you're sharing a CSV, but what consumers are actually reading is Parquet.
That's almost deceptive, but in a useful way.
It's pragmatic. And here's the kicker — if you upload Parquet directly, Hugging Face just symlinks your original files, provided your row group specifications meet their requirements. No conversion overhead, no duplicate storage. You upload Parquet, they serve Parquet. That's why a lot of the major datasets on Hugging Face, like CIFAR10, are stored directly in Parquet now.
The "best" upload format might actually be Parquet from the start, even though most people probably reach for CSV or JSON-L out of habit.
You skip the conversion step entirely. And since Hugging Face uses Apache Arrow as the in-memory format, Parquet files get decoded into Arrow tables, and the datasets library memory-maps its Arrow cache from disk. That means you can work with datasets larger than your available RAM. Arrow's columnar layout in memory mirrors Parquet's columnar layout on disk, so that decode step is cheap — it's a beautifully designed pipeline.
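A minimal sketch of that path with the datasets library, assuming a folder of Parquet files:

```python
from datasets import load_dataset

# Load Parquet files straight into an Arrow-backed Dataset. The on-disk
# cache is memory-mapped, so the working set can exceed available RAM.
ds = load_dataset("parquet", data_files={"train": "data/*.parquet"}, split="train")

print(ds.features)  # schema comes straight from the Parquet metadata
print(ds[0])        # random access without materializing the whole table
```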
That portability angle Daniel mentioned — moving data without database connections — that seems like the through-line here. Both formats enable it, but in different ways.
Right, and the ParquetDB paper from West Virginia University and NIST really drove this home. They built a lightweight Python framework that uses Parquet files as the storage layer — no database server, no daemon, no connection strings. Just a folder of Parquet files. They benchmarked it on about four point eight million complex, nested records from a materials science database, and it outperformed both SQLite and MongoDB on their query workloads.
That's surprising.
It makes sense when you think about it. MongoDB is a general-purpose document database with all the overhead of a server process, network stack, query planner. ParquetDB was doing predicate pushdown directly on the files, using column statistics to skip irrelevant row groups, and reading only the columns the query needed. No network round trips, no query parsing overhead, just direct file access with column pruning. For analytical workloads on static or slowly-changing data, that's a completely viable pattern.
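ParquetDB's own API aside, the underlying pattern is available in plain pyarrow. A sketch with hypothetical file and column names:

```python
import pyarrow.parquet as pq

# Column pruning plus predicate pushdown: row groups whose min/max
# statistics rule out the filter are skipped without being read.
table = pq.read_table(
    "materials.parquet",
    columns=["formula", "band_gap"],
    filters=[("band_gap", ">", 2.0)],
)
```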
What breaks first with that approach?
Concurrent writes, for sure. Parquet files are not designed for multiple writers. If you've got two processes trying to write to the same file, you're going to have a bad time. Also, point lookups — if you need to find one specific row by ID, Parquet has to scan column chunks to find it. There's no B-tree index like you'd have in a proper database. ACID transactions are out the window. So the file-as-database pattern works beautifully for analytical workloads, batch processing, research pipelines — anything where you're doing large scans and aggregations on relatively static data. It falls apart when you need concurrent writes or fast point queries.
It's not replacing Postgres anytime soon.
No, and it's not trying to. But here's the thing — a huge amount of data work doesn't need a database. Data scientists are loading a dataset, running some aggregations, training a model, and moving on. They don't need replication, concurrency control, or point-in-time recovery. They need a file they can download, load into a DataFrame, and query. Parquet is perfect for that, and it's trivially portable — you can copy it to a USB drive, upload it to S3, email it. Any tool in the ecosystem can read it immediately.
JSON-L has a similar portability story, but it's more about human readability.
JSON-L is text. You can open it in any text editor, grep it, use command-line tools like jq on individual lines. It's the format you want when a human might need to inspect the data directly or when you're debugging a pipeline. Parquet is binary — you can't just open it and read it. You need tools. That's a real trade-off.
Let's put some numbers on this. When Daniel says "millions or tens of millions of rows," what does that actually look like in practice?
Let's say you've got ten million rows with maybe fifteen columns — a mix of strings, integers, floats, maybe some nested fields. As a CSV, that's probably around two to three gigabytes uncompressed. As JSON-L, it'll be noticeably larger, because every single row repeats all fifteen field names. As Parquet with Snappy compression, you're probably looking at four to six hundred megabytes. That's a factor of five to seven times smaller.
If you're doing a full scan — reading every row and every column — JSON-L and Parquet are going to be in the same ballpark, maybe a slight edge to Parquet because of compression reducing I/O. But if you're reading three columns out of fifteen and filtering on one of them, Parquet will be dramatically faster. Ten times, fifty times, sometimes more. The column pruning and predicate pushdown mean you're reading a tiny fraction of the total data.
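Those ratios depend heavily on the data, but they're easy to measure on your own table. A toy benchmark on synthetic data, purely for illustration:

```python
import os
import numpy as np
import pandas as pd

# Synthetic stand-in table; real ratios depend on your column types.
n = 1_000_000
df = pd.DataFrame({
    "user_id": np.random.randint(0, 100_000, size=n),
    "score": np.random.rand(n),
    "label": np.random.choice(["a", "b", "c"], size=n),
})

df.to_csv("sample.csv", index=False)
df.to_json("sample.jsonl", orient="records", lines=True)
df.to_parquet("sample.parquet", compression="snappy")

for path in ("sample.csv", "sample.jsonl", "sample.parquet"):
    print(f"{path}: {os.path.getsize(path) / 1e6:.1f} MB")
```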
The choice between them really comes down to the access pattern.
It comes down to three things. One, write pattern — are you appending incrementally or batching? Two, read pattern — are you doing full scans or selective queries? Three, human factors — do you need to inspect the data with basic tools, or are you always going through a DataFrame library? JSON-L wins on append-heavy, full-scan, human-in-the-loop workflows. Parquet wins on batch-write, selective-query, machine-to-machine workflows.
In the Hugging Face world, it's overwhelmingly the latter.
You upload a dataset once, thousands of people download it and run selective queries on it. That's Parquet's ideal use case. The write cost is paid once, the read benefits accrue forever.
I want to circle back to something you mentioned earlier about JSON-L lacking schema enforcement. Is that always a downside, or are there cases where schema flexibility is actually the point?
It's not always a downside; it depends on the domain. If you're dealing with heterogeneous data — say you're scraping e-commerce sites and every site has slightly different fields — JSON-L's schema flexibility is a feature, not a bug. You don't have to define a schema upfront, you don't have to handle migration when a new field appears, you just write the JSON object as you received it. The cost is that downstream consumers have to deal with that heterogeneity. With Parquet, the schema is embedded in the file metadata, which means consumers know exactly what they're getting before they read a single row. That's powerful for collaboration and pipeline reliability.
JSON-L pushes the schema problem downstream, Parquet solves it upstream.
That's a really clean way to put it. And it connects to something the ParquetDB paper touched on — schema evolution. Parquet supports adding and removing columns without breaking existing readers. Old files stay readable, new files can have different columns. It's not quite as flexible as JSON-L's anything-goes approach, but it's structured enough to give consumers confidence.
What about compression? You mentioned JSON-L needs to be gzipped externally. Is that a meaningful difference?
It is, because it affects how tools interact with the data. A gzipped JSON-L file can be decompressed as a stream, but it can't be split: no worker can seek to the middle of a gzip stream and start reading. So if you've got a hundred-gigabyte gzipped JSON-L file on HDFS or S3, you can't parallelize the read across workers the way you can with Parquet, where each row group is independently compressed and can be read in parallel. There are compression formats that solve this — bzip2 is splittable — but they're slower and less common. Parquet bakes splittable compression into the format itself.
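To be precise about what a single reader can still do, here's a sketch; the filename is hypothetical:

```python
import gzip
import json

# One reader can stream a gzipped JSON-L file line by line without ever
# holding the decompressed data in memory. What nobody can do is hand
# the second half of the gzip stream to another worker.
with gzip.open("events.jsonl.gz", "rt", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        # ... handle one record at a time
```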
Even the compression story ties back to parallelism.
Everything ties back to parallelism at scale. That's really the dividing line. JSON-L scales through simplicity — it's so simple that you can build parallelism on top of it by splitting files on line boundaries. Parquet scales through structure — it's designed from the ground up for parallel access patterns. Both work, but they make different trade-offs.
Let's talk about the ecosystem for a minute. Where does DuckDB fit into all of this?
DuckDB is fascinating because it's an in-process analytical database that eats Parquet files for breakfast. You can point DuckDB at a directory of Parquet files and run SQL queries directly against them, with full predicate pushdown, column pruning, and parallel execution. No ingestion step, no schema definition, no server. It's the file-as-database pattern taken to its logical conclusion. And it's fast — like, shockingly fast. It can scan Parquet files at memory bandwidth speeds.
You get SQL without a database server.
And DuckDB can also read JSON-L, but it has to parse and infer types on the fly, which is slower and more error-prone. With Parquet, the types are already there in the metadata. DuckDB knows column five is a thirty-two-bit float before it reads a single byte of column data.
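A minimal sketch of both paths in DuckDB's Python API; the table layout is invented for illustration:

```python
import duckdb

# SQL over a directory of Parquet files: no server, no ingestion step.
# Column types come from the Parquet metadata, so nothing is inferred.
result = duckdb.sql("""
    SELECT label, avg(score) AS mean_score
    FROM read_parquet('data/*.parquet')
    WHERE ts >= '2024-01-01'
    GROUP BY label
""").df()

# DuckDB reads JSON-L too, but it has to sample the file and infer types:
# duckdb.sql("SELECT count(*) FROM read_json_auto('events.jsonl')")
```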
That type inference point is something I've run into personally. JSON-L will happily let you write an integer in line one and a string in line two for the same field, and then your downstream tool has to figure out what to do with that.
Different tools handle it differently. Pandas might promote everything to object type, which kills performance. DuckDB might throw an error. Polars might try to cast and fail on the mismatched rows. It's a mess. Parquet avoids that entirely because the schema is explicit.
If someone's listening and they're trying to decide what format to use for a new project — say they're building a dataset that'll be a few million rows and they want to share it with collaborators — what's your decision tree?
First question — is the data arriving incrementally or in batches? If it's streaming in row by row, start with JSON-L. It's dead simple to append, you won't lose data, and you can always convert to Parquet later. Second question — what are consumers going to do with it? If they're going to run analytical queries selecting subsets of columns and filtering on specific fields, convert to Parquet before sharing. The conversion cost is paid once, and every consumer benefits. Third question — do humans need to inspect the raw data? If yes, keep a JSON-L copy around for debugging, even if your primary distribution format is Parquet.
What about the case where you're not sure about the schema yet? You're exploring, the fields might change.
JSON-L during exploration, Parquet once things stabilize. That's a really common pattern in practice — you collect data in JSON-L because it's flexible and appendable, you do your exploratory work, you figure out what the schema should be, and then you run a conversion step to produce clean, typed Parquet files for downstream consumption. Hugging Face basically does this conversion automatically, which is why a lot of people don't even realize it's happening.
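The conversion step itself is short. A constant-memory sketch with hypothetical filenames; note the caveat in the comments about per-chunk type inference:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Convert JSON-L to Parquet in chunks so memory stays bounded.
chunks = pd.read_json("events.jsonl", lines=True, chunksize=100_000)

writer = None
for chunk in chunks:
    table = pa.Table.from_pandas(chunk, preserve_index=False)
    if writer is None:
        # The first chunk's inferred schema is locked in for the whole file;
        # messy, heterogeneous data may need an explicit schema instead.
        writer = pq.ParquetWriter("events.parquet", table.schema, compression="snappy")
    writer.write_table(table)

if writer is not None:
    writer.close()
```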
Daniel mentioned portability specifically — moving data without direct database connections. I think that's worth dwelling on because it's a shift in how people think about data infrastructure.
It really is. For decades, the assumption was that if you had serious data, you needed a serious database. Oracle, SQL Server, Postgres — something with a daemon and connection pooling and a query optimizer. But the ecosystem has evolved to the point where a file can give you most of what you need for analytical workloads. Parquet with predicate pushdown gives you the I/O efficiency of a column store. DuckDB or Polars gives you the query engine. Arrow gives you the in-memory format. You can do billions of rows on a laptop with no infrastructure whatsoever.
Where does that break, practically speaking?
Concurrency, like I said. If you need a hundred people querying the same data simultaneously with guaranteed consistency, you need a database. If you need point lookups by primary key with sub-millisecond latency, you need a database. If you need to update individual rows without rewriting entire files, you need a database. But if you're doing analytical work — loading a dataset, exploring it, training models, generating reports — the file-based approach is not just viable, it's often faster and simpler.
Cheaper, I assume.
No managed database instances, no connection pooling infrastructure, no backup daemons. You pay for storage and maybe some compute for the query engine, and that's it. For research labs, startups, and individual developers, that's a game changer.
Let's talk about the Hugging Face angle one more time. Daniel mentioned both formats being used heavily there. Is there a case where you'd want to upload JSON-L instead of Parquet to Hugging Face, knowing that they'll convert it anyway?
If your data is naturally line-delimited and you're generating it incrementally, uploading JSON-L makes sense because it's the format you already have. The conversion cost is paid by Hugging Face's infrastructure, not yours. But if you're preparing a dataset specifically for distribution, uploading Parquet directly gives you more control — you can tune the row group size, choose the compression codec, and ensure the schema is exactly what you want. And as I mentioned, you avoid the conversion overhead entirely because Hugging Face will just symlink your files.
What's a row group size, for listeners who haven't tuned Parquet files before?
A row group is a chunk of rows within a Parquet file — it's the unit of parallel reading. Each row group contains column chunks for all columns, and each row group has its own statistics. When a query engine does predicate pushdown, it checks the statistics for each row group and skips the ones that can't possibly match. The default row group size is typically around a hundred and twenty-eight megabytes, but you can tune it. Larger row groups mean better compression and less metadata overhead, but coarser granularity for skipping. Smaller row groups mean more precise skipping but more overhead. It's a trade-off.
You'd tune based on your query patterns.
If your queries are highly selective — filtering for a specific date range or user ID — smaller row groups let you skip more data. If you're doing full scans most of the time, larger row groups give you better compression and less overhead.
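If you do want to tune it, pyarrow exposes the knob directly. One wrinkle worth a comment: pyarrow counts row groups in rows, not bytes. Filenames and numbers here are illustrative:

```python
import pyarrow.parquet as pq

table = pq.read_table("events.parquet")

# row_group_size is a row count in pyarrow; pick one that lands each row
# group in the tens-to-hundreds-of-megabytes range for your column widths.
pq.write_table(
    table,
    "events_tuned.parquet",
    row_group_size=500_000,
    compression="zstd",
)
```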
This is the kind of detail that separates "I use Parquet" from "I understand Parquet."
Honestly, most people don't need to think about it. The defaults are fine for the vast majority of use cases. But when you're pushing into tens of millions of rows and you care about query latency, tuning row group size and compression codec can make a noticeable difference.
We've been comparing JSON-L and Parquet as if they're competitors, but it seems like they're actually complementary in a lot of workflows.
They really are. JSON-L is the ingestion format, Parquet is the consumption format. You collect in JSON-L because it's simple and appendable and human-readable. You distribute in Parquet because it's compact and typed and queryable. The pipeline from one to the other is well-understood and well-supported. It's not an either-or decision for most real projects.
CSV is still out there, hanging on.
CSV will never die, and that's fine. It's the lowest common denominator — every tool reads it, every human can open it in Excel. But CSV has real problems at scale. Escaping is ambiguous, there's no standard way to represent types, and it's row-based so you get none of the columnar benefits. CSV is great for a few thousand rows of tabular data that a human needs to look at. It's terrible for millions of rows that a machine needs to query.
I've debugged enough CSV encoding issues to last a lifetime. The number of times I've seen a comma inside a quoted field break an entire pipeline...
Don't get me started on date formats. Is it month-day-year or day-month-year? CSV doesn't tell you. Parquet tells you. JSON-L at least gives you ISO 8601 if you're disciplined about it. CSV gives you ambiguity and sadness.
That should be on a t-shirt. "CSV: Ambiguity and Sadness."
I'd wear it.
To pull this together — Daniel asked how far JSON-L can scale and why Parquet is useful for large flat data. JSON-L scales to hundreds of millions of rows through streaming and constant-memory processing, but it's computationally wasteful at query time because every tool has to re-parse and re-infer types. Parquet solves that by baking schema, compression, and statistics into the file itself, making selective queries dramatically faster and storage dramatically smaller. Both formats are portable in the sense that they're just files — no server, no connection strings, no infrastructure — but they're optimized for different stages of the data lifecycle.
That's a perfect summary. And the Hugging Face ecosystem is a great case study because it uses both — JSON-L for ingestion flexibility, Parquet for distribution efficiency. The conversion happens automatically, which is both convenient and slightly opaque. If you know what you're doing, you can optimize your pipeline by uploading Parquet directly and tuning your row groups.
One thing we didn't touch on — are there tools that make working with these formats easier for people who aren't deep in the data engineering weeds?
For JSON-L, jq is the Swiss Army knife — you can filter, transform, and aggregate JSON-L files from the command line with surprising power. For Parquet, DuckDB is the easiest entry point — install it, point it at a Parquet file, write SQL. No configuration, no schemas, no servers. And both formats are first-class citizens in Python through Pandas and Polars. You don't need to be a data engineer to benefit from these formats.
That's a good note to end on. These aren't exotic formats for specialized use cases — they're practical tools that make real workflows faster and simpler.
And now: Hilbert's daily fun fact.
The national flag of Nepal is the only national flag in the world that is not rectangular or square. It consists of two overlapping triangular pennants.
If you're working with flat data at scale, the concrete takeaway is pretty simple. If you're collecting data incrementally, use JSON-L. If you're distributing data for analytical queries, use Parquet. If you're doing both, use JSON-L for collection and convert to Parquet for distribution. The conversion step is cheap, well-supported, and pays for itself every time someone runs a selective query on your dataset.
If you're uploading to Hugging Face, consider uploading Parquet directly. You'll skip their conversion step, you'll have more control over the file layout, and your users will get better performance. The docs on row group configuration are worth reading if you're pushing past ten million rows.
One open question I keep coming back to — as these file-based formats get more sophisticated, with predicate pushdown and column statistics and schema evolution, at what point do we stop calling them file formats and start calling them database engines? Parquet plus DuckDB is functionally a columnar database. It just doesn't have a daemon.
That's a philosophical question the industry is still wrestling with. The line between "file format" and "database" has been blurring for years. I'm not sure it matters what we call it, as long as we understand the trade-offs.
This has been My Weird Prompts, produced by the one and only Hilbert Flumingtop. If you enjoyed this episode, leave us a review wherever you get your podcasts — it genuinely helps more people find the show. We're at myweirdprompts.com for the full archive.
See you next time.