Subquadratic, a startup with eleven PhD researchers, just published benchmarks for a model with a twelve-million-token context window that claims to crack the quadratic attention bottleneck. Their architecture, Subquadratic Selective Attention, runs fifty-two times faster than dense attention at a million tokens — a complexity-class improvement, not just a constant-factor speedup. On the MRCR v2 multi-reference retrieval benchmark, they score eighty-three percent, nine points above OpenAI's GPT-5.5. The key innovation is content-dependent selection that itself scales subquadratically, avoiding the "indexer trap" that plagues DeepSeek's Native Sparse Attention where the selection step remains quadratic. At twelve million tokens, they report 92.1% needle-in-a-haystack accuracy — finding one sentence in roughly nine million words. Caveats include single-run benchmarks and a smaller model size than frontier labs, but if the architecture scales, it could enable codebase-level reasoning across entire monorepos without chunking or RAG, and legal document review across entire corpora. Subquadratic has raised twenty-nine million dollars at a five hundred million valuation and runs on neoclouds rather than hyperscalers to optimize inference cost from day one.
#2672: 12M Token Context: Subquadratic Cracks Attention Scaling
A startup claims linear attention scaling at 12M tokens, beating GPT-5.5 on retrieval benchmarks.
Episode Details
- Episode ID: MWP-2832
- Published:
- Duration: 33:06
- Audio: Direct link
- Pipeline: V5
- TTS Engine: chatterbox-regular
- Script Writing Agent: deepseek-v4-pro
AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.
#2672: 12M Token Context: Subquadratic Cracks Attention Scaling
Daniel sent us this piece from The New Stack — a startup called Subquadratic just dropped a model with a twelve-million-token context window, and it's not just a press release flex. They're claiming they've actually cracked the quadratic attention bottleneck that's been the glass ceiling for every transformer since twenty-seventeen. The piece is by Frederic Lardinois, and it's got real benchmarks, real architecture claims, and a pretty bold assertion that they're beating GPT-5.5 on retrieval. There's a lot of technical meat here. So where do we even start?
I want to start with what "subquadratic" actually means, because the name of the company is the whole thesis. Every transformer since Vaswani's twenty-seventeen paper has this fundamental scaling problem. When you have N tokens in your context window, the self-attention mechanism compares every token to every other token. That's N squared pairwise comparisons. Double your context from a hundred thousand to two hundred thousand tokens, and your compute cost doesn't double — it quadruples. That's the quadratic wall.
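To make the quadratic wall concrete, here's a back-of-the-envelope sketch in Python. It only counts query-key score computations and ignores heads, layers, and constant factors; the context sizes are just illustrative.

```python
# Back-of-the-envelope: self-attention compares every token with every other token,
# so the number of pairwise score computations grows with the square of context length.
def pairwise_comparisons(n_tokens: int) -> int:
    return n_tokens * n_tokens

for n in (100_000, 200_000, 1_000_000):
    print(f"{n:>9,} tokens -> {pairwise_comparisons(n):.2e} comparisons")

# Doubling context from 100K to 200K tokens quadruples the work.
print(pairwise_comparisons(200_000) / pairwise_comparisons(100_000))  # 4.0
```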
That's why we've been stuck around a million tokens, give or take, across the whole frontier.
At a million tokens, you're doing a trillion pairwise attention computations per layer, per head. The memory requirements for the key-value cache alone become enormous. This isn't a soft ceiling — it's a hard mathematical constraint. Every lab has been working around it with what are essentially clever hacks. RAG, agentic decomposition, sliding windows — they're all ways of saying "we can't actually pay attention to everything, so let's find a way to only pay attention to some things."
Subquadratic's claim is that they've made attention scale linearly instead of quadratically. The article says their architecture, Subquadratic Selective Attention, runs fifty-two times faster than dense attention at a million tokens. That's not a minor optimization. That's a different complexity class.
Right, and this is the part where we need to get precise. The ideal is O of N — linear scaling, where doubling context doubles cost. But subquadratic can mean O of N log N, or even O of N to the one-point-five. Anything that breaks the N-squared curve. The article's CTO, Alex Whedon, makes a really important distinction. He says hybrids give you what he calls "a scalar benefit" — you get cheaper at every context length by the same factor — but a pure subquadratic mechanism gives you a scaling-law advantage. The gap between quadratic and subquadratic widens as context grows. At a hundred and twenty-eight K tokens, they report a seven-point-two-times speedup. At a million tokens, it's fifty-two times. That curve is the whole story.
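As a sanity check on the "different complexity class" framing, you can back out the scaling exponent implied by the two reported speedups, assuming dense attention cost grows as N squared and taking the 7.2x and 52x figures at face value. This is a rough fit to two data points, nothing more.

```python
import math

# Reported speedups over dense attention (taken at face value from the article).
points = {131_072: 7.2, 1_048_576: 52.0}   # context length -> speedup

# If dense cost ~ N^2, then SSA cost ~ N^2 / speedup. Fit the exponent implied
# by the two points: exponent = log(cost2 / cost1) / log(N2 / N1).
(n1, s1), (n2, s2) = sorted(points.items())
cost1, cost2 = n1**2 / s1, n2**2 / s2
print(f"implied exponent ~ {math.log(cost2 / cost1) / math.log(n2 / n1):.2f}")  # ~1.05
```

An exponent near 1.0 is what a near-linear mechanism would produce; a constant-factor hybrid would leave it at 2.0.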
Let me play the skeptic for a second. We've heard claims like this before. Magic dot dev announced a hundred-million-token context window back in August twenty-twenty-four, raised over five hundred million dollars, and as of early twenty-twenty-six, the article notes there's no public evidence of their LTM-two-mini being used outside Magic. The category's track record is spotty.
Completely fair, and Lardinois flags that explicitly. But what makes Subquadratic's claims more interesting is that they've published actual benchmarks against frontier models, and the numbers are specific and falsifiable. They're not just saying "we have a big window." They're saying "here's our score on RULER at a hundred and twenty-eight K, here's our MRCR v-two score, here's our SWE-bench verified number." And some of these are genuinely eyebrow-raising. MRCR v-two is the multi-reference retrieval benchmark that labs use internally. GPT-5.5 scores seventy-four percent. Claude Opus four-point-seven scores thirty-two-point-two percent — a huge spread that tells you how differently models handle long context retrieval. Subquadratic claims eighty-three on that same benchmark, nine points above OpenAI's best.
That's a wild number. Nine points on a retrieval benchmark against GPT-5.5 isn't incremental. That's a step change. But we should talk about what's actually happening architecturally, because "subquadratic attention" isn't one thing. It's a family of approaches that have been evolving for years.
Let me walk through the lineage. The first wave was fixed-pattern sparse attention — models like Longformer, where each token only attends to a sliding window of nearby tokens. It scales linearly, but it breaks catastrophically when the information you need isn't nearby. If the answer to your question is in paragraph two and the context is in paragraph four thousand, a sliding window misses it entirely. So it's cheap but brittle.
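A minimal sketch of why that brittleness is structural: with a fixed sliding window, a token simply cannot see anything outside its band, no matter how relevant it is. The window size and positions below are arbitrary illustrations, not Longformer's actual configuration.

```python
# Sliding-window attention: token i may only attend to tokens within `window`
# positions of it. The cost is linear in sequence length, but anything farther
# away is invisible to the layer, however important it is.
def can_attend(query_pos: int, key_pos: int, window: int) -> bool:
    return abs(query_pos - key_pos) <= window

window = 512                  # illustrative window size
question_pos = 900_000        # the question, late in the context
answer_pos = 1_200            # the answer, near the beginning
print(can_attend(question_pos, answer_pos, window))   # False: the answer is out of reach
```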
Then you get the state-space model wave — Mamba, Mamba-two, RWKV, RetNet. These replace the all-pairs comparison with a recurrent state that compresses everything the model has seen so far into a fixed-size representation. The problem is the compression is lossy. Nvidia did a study at the eight-billion-parameter scale and found pure Mamba-two lagged transformers on MMLU and on phonebook lookup — exactly the kind of task where you need to retrieve a specific piece of information from a long context. The gap only closed when they added attention layers back in.
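A caricature of the state-space idea helps show where the lossy compression comes from. This is not Mamba's actual update rule, which is input-dependent and considerably more elaborate; it just illustrates that the entire history has to fit into a fixed-size state.

```python
import numpy as np

# Caricature of a recurrent / state-space layer: everything seen so far is folded
# into one fixed-size state vector, regardless of sequence length. Whatever that
# state cannot represent is gone, which is why exact long-range retrieval is hard.
d_state = 16                                          # fixed state size (illustrative)
rng = np.random.default_rng(0)
A = rng.normal(scale=0.1, size=(d_state, d_state))    # state transition (fixed here)
B = rng.normal(scale=0.1, size=d_state)               # input projection

state = np.zeros(d_state)
for x in rng.normal(size=100_000):                    # 100K scalar "tokens"
    state = A @ state + B * x                         # O(1) memory per step

print(state.shape)   # (16,) -- the whole history is compressed into 16 numbers
```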
Which brings us to hybrids.
Hybrids — Jamba, Kimi Linear, Qwen-three-Next, Nvidia's Nemotron v-three — keep most layers efficient and retain a few dense attention layers for retrieval. It's the pragmatic answer, and it works well at moderate context lengths. But a hybrid that's three times cheaper at thirty-two K tokens is still three times cheaper at ten million tokens. The dense layers it retains are still doing O of N squared work. You get a constant-factor improvement, not a complexity-class improvement. That's Whedon's "scalar benefit" point.
What did Subquadratic actually do differently? The article describes something called content-dependent selection without the indexer trap.
This is the key innovation they're claiming. The most recent approach before SSA was DeepSeek's Native Sparse Attention, which won the ACL twenty-twenty-five best paper award and is shipping in DeepSeek V-three-point-two Experimental. DeepSeek's idea is: instead of attending to everything, learn which positions to attend to. It uses a lightning indexer that routes attention to a small subset of selected keys. The attention over those keys is sparse and efficient. But — and this is the trap — the indexer itself has to score every query against every key to decide which ones matter. The selection step is itself quadratic. You've moved the bottleneck, not removed it.
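Here's an illustrative sketch of that trap, not DeepSeek's actual implementation: even when the final attention is restricted to a few selected keys, a selector that scores every query against every key is still doing the full N-squared work.

```python
import numpy as np

def select_then_attend(Q, K, V, k_keep):
    """Illustrative 'learned selection' attention: pick the top-k keys per query,
    then attend only over those. The catch: the scoring pass below touches every
    (query, key) pair, so the selection step itself is still O(N^2)."""
    scores = Q @ K.T                                  # <-- the indexer trap: N x N work
    top = np.argpartition(-scores, k_keep, axis=-1)[:, :k_keep]
    out = np.zeros_like(Q)
    for i, idx in enumerate(top):                     # sparse attention over selected keys only
        w = np.exp(scores[i, idx] - scores[i, idx].max())
        out[i] = (w / w.sum()) @ V[idx]
    return out

rng = np.random.default_rng(0)
n, d = 2048, 64
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
print(select_then_attend(Q, K, V, k_keep=128).shape)   # (2048, 64)
```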
You're still doing the full pairwise comparison, you're just doing it in a different place.
What Subquadratic claims is that their selection mechanism — how they decide which tokens attend to which — is itself subquadratic. Whedon describes it as content-dependent. For prompt A, words one and six matter to each other. For prompt B, it's words two and three. The routing decisions are made based on what the query and keys actually contain, and the mechanism that makes those decisions doesn't itself explode quadratically. They're calling it Subquadratic Selective Attention, and the "selective" part is doing a lot of work — it's the same word Mamba used, actually. The idea of making state updates input-dependent rather than fixed.
The article says they have eleven PhD researchers on staff. That's not a trivial research team for a startup that's raised twenty-nine million dollars.
No, and their valuation is five hundred million, which tells you investors are pricing in the possibility that this architecture is real. Justin Dangel, the CEO, says they're running on neoclouds rather than hyperscalers because the big clouds are, quote, "very expensive." That's an interesting strategic choice — it means they're optimizing for inference cost from day one, which makes sense if your whole pitch is efficiency at scale.
Let's talk about the benchmarks in more detail, because there are some caveats the article is careful to flag. Each model was run only once due to high inference cost. The SWE-bench margin — they report eighty-two-point-four percent versus Opus four-point-six at eighty-one-point-four — the paper itself acknowledges that's "harness as much as model." And Whedon admits their model is, quote, "way smaller than the big labs."
Those are important caveats. Running each model once means we don't know the variance. The SWE-bench difference is within the noise floor depending on the harness configuration. And model size matters because smaller models have an easier time with efficient attention — fewer heads, fewer layers, less representational capacity to preserve. The real test is whether this architecture scales to the hundred-billion-parameter-plus regime where the frontier labs operate.
The needle-in-a-haystack result is harder to dismiss. Ninety-two-point-one percent at twelve million tokens. No frontier model even operates at that context length, so there's nothing to compare against directly, but that's a strong absolute number. If you can actually retrieve specific information from a twelve-million-token context with ninety-two percent accuracy, that's a qualitatively different capability.
Needle-in-a-haystack has its limitations — it's a single fact retrieval task, and models can sometimes learn to game it — but at twelve million tokens, the haystack is enormous. We're talking about finding one sentence in roughly nine million words, or about thirty-five thousand pages of text. That's the entire Harry Potter series, plus the Lord of the Rings trilogy, plus the complete works of Shakespeare, and you're asking the model to find one specific sentence. If they're actually hitting ninety-two percent on that, that's not a parlor trick.
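For reference, the conversion behind those figures, using the usual rough rules of thumb (about 0.75 English words per token and about 250 words per page; both are assumptions, not numbers from the article):

```python
# Rough token-to-words-to-pages conversion. The ratios are rules of thumb, not exact.
tokens = 12_000_000
words_per_token = 0.75      # typical for English prose
words_per_page = 250        # typical manuscript page

words = tokens * words_per_token
pages = words / words_per_page
print(f"~{words:,.0f} words, ~{pages:,.0f} pages")   # ~9,000,000 words, ~36,000 pages
```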
Let me pivot to the implications. If the major labs adopted similar architectures — if we're not talking about a single startup but a broader shift — what actually opens up at ten-million-plus tokens that isn't viable today?
I think the biggest one is codebase-level reasoning without chunking or RAG. Today, if you want an AI coding agent to understand a large codebase, you chunk the repo, embed it, retrieve relevant files, feed them into context, the agent works on them, you evict them, you load new files. It works, but it's lossy. The model never has the whole codebase in its attention simultaneously. At twelve million tokens, you can fit roughly two to three million lines of code. That's most large monorepos. The Linux kernel is about thirty million lines, so you're not there yet, but you're fitting very substantial codebases. The model can trace dependencies across the entire system without ever losing context.
You go from "I retrieved the files that might be relevant" to "the model has actually read and can reason about the whole thing." That's a different debugging experience.
Imagine refactoring a cross-cutting concern across a hundred microservices. Today, you do it file by file, service by service, and hope you didn't miss an implicit dependency. With a twelve-million-token window, the model holds the entire dependency graph in attention. It can see that changing this interface in service A breaks that assumption in service B, which cascades to service C. That kind of global reasoning is currently impossible without extensive human coordination.
What about beyond code? The article mentions they're shipping a deep research tool called SubQ Search.
Legal document review is the canonical example. A complex litigation might involve millions of documents. Today you use e-discovery tools with keyword search and maybe some embedding-based retrieval. But the model never reads everything. At twelve million tokens, you can fit entire corpora of case law, all the filings, all the exhibits, and ask questions that require synthesizing information across thousands of documents. The model can spot contradictions, identify patterns, find precedents that a keyword search would never surface because the relevant language doesn't share any terms.
Scientific literature review is another one. I'm imagining feeding a model every paper published on a specific protein pathway over the last decade and asking it to identify inconsistencies or suggest novel hypotheses.
This is where the retrieval benchmark numbers actually matter. If you have a huge context window but the model can't reliably find information in it, you're worse off than using RAG with a smaller window. That's why Claude Opus four-point-seven scoring thirty-two percent on MRCR v-two is so telling. The window is there, but the model isn't effectively using it. Subquadratic's eighty-three percent suggests their architecture isn't just big — it's actually usable at scale.
The article mentions they're aiming for a fifty-million-token context window by Q4. That's roughly thirty-seven million words. You're talking about feeding in entire libraries. What breakthroughs actually enabled crossing from the one-to-two-million range into twelve million?
There are really four concurrent breakthroughs. The first is the attention mechanism itself — the move from dense to sparse to selective, where the selection is learned and content-dependent but doesn't reintroduce quadratic cost. That's the SSA claim. The second is KV cache compression. At twelve million tokens, a naive KV cache would be hundreds of gigabytes. You need techniques like grouped-query attention, multi-query attention, and more aggressive forms of KV cache quantization and eviction. Google's been doing interesting work on this, and I suspect Subquadratic is using something similar under the hood.
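To see why the naive cache breaks at this scale, here's a back-of-the-envelope KV cache calculation. The layer count, head count, and head dimension are hypothetical stand-ins, not Subquadratic's actual configuration; the point is how much grouped-query and multi-query attention shave off.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_elem: int = 2) -> int:
    """Naive KV cache size: a key and a value vector per token, layer, and KV head."""
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_elem  # 2 = K and V

GiB = 1024 ** 3
for label, kv_heads in [("multi-head (32 KV heads)", 32),
                        ("grouped-query (8 KV heads)", 8),
                        ("multi-query (1 KV head)", 1)]:
    size = kv_cache_bytes(tokens=12_000_000, layers=32, kv_heads=kv_heads,
                          head_dim=128)                 # fp16 elements
    print(f"{label}: ~{size / GiB:,.0f} GiB")
```

Quantizing the cache to 8-bit or 4-bit halves or quarters those numbers again, which is why quantization and eviction sit alongside grouped-query and multi-query attention in the list above.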
The article doesn't go into their KV cache strategy specifically, but it's implied by the efficiency numbers. Fifty-two times faster than dense attention isn't just about the attention pattern — you have to be reading and writing less memory too.
The third breakthrough is in training infrastructure. Dense attention parallelizes cleanly because every token attends to every token — you can shard across sequence length in straightforward ways. Sparse and selective attention create irregular communication patterns. You need new approaches to sequence parallelism, and the fact that they're running on neoclouds suggests they've built custom infrastructure for this.
The fourth is position encoding at extreme lengths. Rotary position embeddings, which everyone uses now, have known limitations at very long contexts — the high-frequency components can cause interference. There's been a flurry of work on extending RoPE — YaRN, NTK-aware scaling — and Subquadratic almost certainly had to solve this to get good retrieval at twelve million tokens. If your position encoding breaks down at long distances, it doesn't matter how good your attention mechanism is.
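For a sense of what "extending RoPE" looks like in practice, here's a minimal sketch of the NTK-aware variant, which enlarges the rotary base so the lowest-frequency components rotate more slowly and cover longer distances. The head dimension and scale factor are illustrative, YaRN layers further per-frequency corrections on top of this, and none of it is confirmed to be what Subquadratic does.

```python
import numpy as np

def rope_inv_freq(head_dim: int, base: float = 10_000.0) -> np.ndarray:
    """Standard RoPE inverse frequencies: one rotation rate per pair of dimensions."""
    return 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))

def ntk_aware_inv_freq(head_dim: int, scale: float, base: float = 10_000.0) -> np.ndarray:
    """NTK-aware scaling: enlarge the base so positions `scale` times farther apart
    still land in the angle range the model saw during training."""
    new_base = base * scale ** (head_dim / (head_dim - 2))
    return rope_inv_freq(head_dim, new_base)

head_dim, scale = 128, 12.0          # e.g. stretching a 1M-token model toward 12M
orig, scaled = rope_inv_freq(head_dim), ntk_aware_inv_freq(head_dim, scale)
print(orig[0] / scaled[0])     # 1.0: the highest frequency is left untouched
print(orig[-1] / scaled[-1])   # 12.0: the lowest frequency is slowed by the full scale factor
```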
You mentioned DeepSeek's Native Sparse Attention earlier. Where does the research frontier actually sit right now?
The landscape is roughly four camps. You've got the state-space model camp — Mamba, Mamba-two, and their descendants — pushing toward fully recurrent architectures that don't use attention at all. You've got the linear attention camp, which approximates the softmax attention with kernel methods to avoid the N-squared computation. You've got the sparse attention camp, which includes DeepSeek's DSA and now Subquadratic's SSA, where you learn which tokens to attend to. And you've got the hybrid camp, which is what most frontier labs are actually shipping — keep most of the efficiency of the new approaches while retaining enough dense attention to preserve quality.
Subquadratic is arguing that hybrids are a local maximum, not the endgame.
That's exactly their thesis. Whedon's point about scalar benefit versus scaling-law advantage is the intellectual core of their bet. If you can make attention subquadratic without sacrificing quality, you don't just get cheaper inference at current context lengths — you change the economics of context entirely. The cost of going from one million to ten million tokens stops being prohibitive and starts being merely expensive. And once it's merely expensive, product people figure out how to make it cheap.
Let's talk about what's not in the article. They're not open-sourcing weights. They're offering training tools for enterprises to do their own post-training, but the core model stays proprietary. How much does that limit the research community's ability to validate their claims?
It's a real limitation. The article notes that Magic dot dev made similar claims and then went dark. The only way to truly validate a new architecture is to have independent researchers run it through their own benchmarks, probe its failure modes, test its scaling behavior. Subquadratic is making an API available, which is better than nothing, but API access doesn't let you inspect the attention patterns, measure the KV cache behavior, or verify that they're actually doing what they claim architecturally.
We're in a trust-but-verify situation, and the verification tools are limited.
The history of the field counsels caution. I remember when everyone thought Reformers and Linformers were going to replace dense attention overnight. They didn't. I remember when people thought RWKV was going to make transformers obsolete. It didn't. The transformer's dense attention is extraordinarily robust and flexible, and every replacement has had some subtle failure mode that only becomes apparent at scale or on specific tasks.
That said, the benchmarks they've published are more substantive than what Magic put out. MRCR v-two, RULER, SWE-bench — these are standardized, community-accepted benchmarks. If the numbers are real, they've built something impressive.
The MRCR v-two number is the one I keep coming back to. Eighty-three versus GPT-5.5's seventy-four and Opus four-point-seven's thirty-two. That's not a marginal improvement. That's a different capability tier for long-context retrieval. And it suggests that their architecture isn't just efficient — it's actually better at the thing that context windows are supposed to enable.
One thing I want to pull on — the article mentions they were previously called Aldea and worked on speech models before pivoting. That's interesting because speech models have their own long-context challenges. Audio tokens at sixteen kilohertz sample rate generate a lot of tokens very quickly. A one-hour conversation is something like half a million tokens depending on your tokenizer. I wonder if their work on speech gave them architectural insights that transferred to text.
That's a really sharp observation. Speech foundation models have been dealing with extremely long sequences for years, and the attention bottleneck bites even harder because the token counts are so much larger. If they spent time optimizing attention for speech — where you need to track dependencies across hundreds of thousands of tokens — it makes sense that those architectural innovations would transfer. The acoustic properties of speech also have locality patterns that might inform how you design content-dependent sparse attention.
Let me ask you a forward-looking question. If Subquadratic's architecture is real and scales — and let's say the major labs adopt something similar within the next eighteen months — what does that do to the RAG industry?
It doesn't kill it, but it dramatically changes its role. RAG solves two problems: the context window limit and the knowledge cutoff problem. If context windows go to ten million or fifty million tokens, the first problem largely goes away for most use cases. You can just put your documents in context. But the knowledge cutoff problem remains — you still need retrieval for information that's newer than the model's training data, or for proprietary data that's too large even for a fifty-million-token window. What changes is that RAG becomes less about chunking and embedding and more about selection — which documents do I load into this enormous context? The retrieval step becomes coarser-grained and the reasoning step becomes finer-grained.
You go from "find the relevant paragraph" to "find the relevant bookshelf."
And the model does the fine-grained retrieval internally through its attention mechanism, which is almost certainly better than whatever embedding-based retrieval you were doing before. The MRCR numbers bear that out.
By the way, fun fact — DeepSeek V four Pro is writing our script today.
Hope it's paying attention to our context window discussion.
Ironic if it runs out of context mid-episode.
Let's talk about the SWE-bench number a bit more. Subquadratic reports eighty-two-point-four percent, which edges out Opus four-point-six at eighty-one-point-four and Gemini three-point-one Pro at eighty-point-six. But the article notes this is "harness as much as model." For listeners who aren't deep in the SWE-bench weeds, the harness is the scaffolding that translates between the benchmark task and the model — it handles things like setting up the repository, running tests, parsing output. Different harnesses can produce meaningfully different scores for the same model. A one-point margin with different harness configurations is basically a tie.
The SWE-bench claim is the weakest of their numbers.
It's the least informative, yeah. But it's also the least relevant to their core pitch. Nobody is choosing Subquadratic because it's one point better on SWE-bench. They're choosing it because it has a twelve-million-token window that actually works. The SWE-bench number is more of a sanity check — it shows the model isn't catastrophically worse at coding than frontier alternatives.
What about cost? The article says they're running on neoclouds and that the model is "way smaller than the big labs." Smaller model plus more efficient attention should mean dramatically cheaper inference, right?
Potentially, but we don't have pricing yet. The API is in beta. The efficiency numbers they're quoting — seven-point-two-times speedup at a hundred and twenty-eight K, fifty-two-times at a million tokens — those are attention-specific benchmarks, not end-to-end inference numbers. The attention mechanism is a big chunk of inference cost at long contexts, but it's not everything. You still have feed-forward layers, embedding lookups, output projection. The real question is what the cost per million tokens looks like at twelve million tokens of context versus what you'd pay OpenAI or Anthropic for a one-million-token window with RAG on top.
That comparison isn't straightforward because the capabilities aren't identical. A model with a twelve-million-token window that can actually use it is doing something qualitatively different from a model with RAG.
Right, which is why I think the pricing discussion is premature. If the capability is real and differentiated, they can charge a premium while the frontier labs catch up. The question is how long that window lasts.
Let's zoom out to the broader trend. The article mentions that every frontier model in twenty-twenty-six advertises at least a million tokens of context, but almost none of them are great at using all of it. That's the dirty secret of the current generation. The numbers are on the spec sheet, but the retrieval benchmarks tell a different story.
This is where the MRCR v-two spread is so revealing. GPT-5.5 at seventy-four percent versus Claude Opus four-point-seven at thirty-two percent. These are both frontier models from the two leading labs, and their ability to actually use their context windows is wildly different. It tells you that context window size and context utilization are two different engineering problems, and the industry has been focused more on the former because it's easier to market.
A million tokens sounds great in a press release. "How well can your model actually find things in that million tokens" is a much nerdier question.
A much more important one for actual use cases. If I'm building a legal document review system, I don't care that your model technically accepts a million tokens. I care whether it can reliably find the relevant precedents across ten thousand pages of case law. The MRCR numbers suggest that for many models, the answer is "not reliably."
Do you think the frontier labs have been caught flat-footed here? A Miami startup with twenty-nine million dollars in funding beating them on a core architecture problem?
I don't think they're flat-footed. I think they've been making different tradeoffs. The frontier labs are optimizing for general capability — MMLU scores, reasoning benchmarks, multimodal performance. Context efficiency has been a secondary concern because most users aren't saturating even million-token windows. The labs have been content to let context be "good enough" while they fought the capability wars. Subquadratic is betting that context is about to become the primary bottleneck, and they've optimized exclusively for it.
Which is classic startup strategy — find the thing the incumbents are underinvesting in and go all in.
And if they're right about the importance of context, they've got a real head start. But the incumbents have something Subquadratic doesn't: distribution. OpenAI, Anthropic, Google — they have millions of developers already building on their APIs. Subquadratic has to convince people to switch, and switching costs in AI are non-trivial. You've built your prompts, your evals, your pipelines around a specific model's behavior.
The article mentions they're not just doing an API — they're shipping SubQ Code, a CLI coding agent, and SubQ Search, a deep research tool. That's smart. You don't compete on API access alone. You build products that make the capability tangible.
The coding agent is the natural first product for a long-context model. Developers immediately understand the value of fitting their entire codebase in context. It's the use case where the limitation of current models is most painfully obvious. Every developer has had the experience of an AI coding tool losing track of what they were doing because the relevant context got evicted.
Where do you see this going in the next year? If Subquadratic hits their fifty-million-token target in Q4, and the benchmarks hold up, does that force the frontier labs to respond?
I think it does, but the response might not be immediate. The labs have their own research programs in sparse and linear attention. Google's been publishing on this for years. DeepSeek obviously has DSA. Anthropic has been characteristically quiet about their architecture work but they're not sitting still. What Subquadratic might do is accelerate the timeline. If their API gains traction and developers start building products that need ten-million-plus-token windows, the labs will have to match the capability or risk losing the high-end developer market.
The open-source angle? They're not open-sourcing weights but they're offering training tools. Is that enough to build an ecosystem?
It's a hedge. They get some of the benefits of openness — enterprise customers can fine-tune, the research community can probe the architecture through the API — without giving away the crown jewels. But it also means the community can't independently verify their claims or build on top of their work in the way they could with truly open models. Given the Magic dot dev precedent, I think the community is going to be skeptical until there's more independent validation.
One last technical question. The article talks about selection being "content-dependent." For prompt A, words one and six matter to each other. For prompt B, it's words two and three. How does the model learn to make those routing decisions without the routing decision itself being expensive?
This is the million-dollar question, and the article doesn't give us enough detail to answer it definitively. But based on the research literature, there are a few ways to do this. One approach is locality-sensitive hashing — you project queries and keys into a lower-dimensional space where proximity approximates attention score, and you only compute full attention for pairs that hash to the same bucket. Another is a small learned router network that's architecturally constrained to be subquadratic — for example, by only looking at coarse summaries of key blocks rather than individual keys. The challenge is making the routing decision good enough that you don't miss important attention pairs, while keeping the router itself cheap enough that you haven't just moved the bottleneck.
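The article doesn't reveal Subquadratic's mechanism, so here's a minimal sketch of the second option mentioned above: score each query against coarse per-block summaries of the keys instead of against every key, then attend densely only inside the selected blocks. The block size and top-k are arbitrary, and this illustrates the routing idea, not SSA itself.

```python
import numpy as np

def block_routed_attention(Q, K, V, block=128, top_blocks=4):
    """Illustrative routed attention: each query scores N/block block summaries
    (instead of N individual keys), keeps the best few blocks, and attends there."""
    n, d = Q.shape
    n_blocks = n // block
    Kb = K[: n_blocks * block].reshape(n_blocks, block, d)
    Vb = V[: n_blocks * block].reshape(n_blocks, block, d)
    summaries = Kb.mean(axis=1)                        # (n_blocks, d) coarse key summaries

    block_scores = Q @ summaries.T                     # (n, n_blocks): the cheap routing pass
    chosen = np.argpartition(-block_scores, top_blocks, axis=-1)[:, :top_blocks]

    out = np.zeros_like(Q)
    for i in range(n):                                 # dense attention inside chosen blocks only
        keys = Kb[chosen[i]].reshape(-1, d)
        vals = Vb[chosen[i]].reshape(-1, d)
        s = Q[i] @ keys.T
        w = np.exp(s - s.max())
        out[i] = (w / w.sum()) @ vals
    return out

rng = np.random.default_rng(0)
n, d = 4096, 64
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
print(block_routed_attention(Q, K, V).shape)           # (4096, 64)
```

Note that with a fixed block size the routing pass still costs N times N over block, so this alone is only a large constant-factor win; a genuinely subquadratic selector needs the number of summaries each query touches to grow sublinearly, for example through hierarchical summaries.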
That's the indexer trap Whedon was describing. DeepSeek's indexer scores every query against every key — it's quadratic — so they moved the bottleneck from the attention computation to the routing computation. Subquadratic's claim is that their routing is subquadratic.
If they've actually solved that, it's a genuine breakthrough. The research community has been chasing subquadratic attention with dense-level quality for almost a decade. It's one of the hardest open problems in ML architecture.
Alright, let me try to synthesize what we've covered. Subquadratic has launched a model with a twelve-million-token context window using an architecture they call Subquadratic Selective Attention. They claim it scales linearly in compute and memory, runs fifty-two times faster than dense attention at a million tokens, and scores eighty-three on MRCR v-two — beating GPT-5.5 by nine points. The architecture builds on years of work in sparse attention, state-space models, and hybrid architectures, but claims to solve the indexer trap that's limited previous approaches. If the claims hold up and the architecture scales, it opens up new use cases in codebase-level reasoning, legal document review, and scientific literature synthesis. But the track record of similar claims in this space — Magic dot dev being the cautionary tale — means independent validation is essential. And the frontier labs, while they may have been optimizing for other things, are not going to sit still if twelve-million-token windows become a must-have capability.
That's a solid summary. The one thing I'd add is that the MRCR v-two spread — GPT-5.5 at seventy-four, Opus four-point-seven at thirty-two, Subquadratic at eighty-three — tells you that context utilization is about to become a first-class benchmark. For years, the industry has been in a context window arms race measured by token count. I think we're entering a phase where the metric that actually matters is retrieval quality at scale. A hundred-million-token window with thirty-percent retrieval accuracy is a party trick. A ten-million-token window with ninety-percent retrieval accuracy changes what you can build.
Now: Hilbert's daily fun fact.
Hilbert: A single surviving cheese strainer made of woven horsehair, excavated from a well in the Simpson Desert and dated to roughly seven hundred CE, contains trace residues of a fermented camel-milk cheese inoculated with a bacterium that modern microbiologists identify as a now-extinct subspecies of Lactobacillus — making it the only known artifact documenting cheesemaking in pre-Islamic central Australia.
I have so many questions, and I'm not sure I want any of them answered.
Camel-milk cheese from the Simpson Desert. I did not have that on my bingo card.
Thanks to Hilbert Flumingtop for producing. This has been My Weird Prompts. Find us at myweirdprompts dot com, and if you want more episodes like this one, leave us a review wherever you listen.
We'll be back soon. Until then, keep your attention subquadratic.
This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.