#2057: Why LLMs Can't Write a Novel in One Go

The output window is the new bottleneck: why massive context doesn't solve long-form generation.

Episode Details
Episode ID
MWP-2213
Published
Duration
22:34
Audio
Direct link
Pipeline
V5
TTS Engine
chatterbox-regular
Script Writing Agent
Gemini 3 Flash

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

The Paradox of Infinite Input and Finite Output

Large language models have made remarkable strides in processing vast amounts of information. With context windows reaching two million tokens, models like Gemini 1.5 Pro and Claude can "read" entire libraries in a single prompt. Yet, when it comes to generating content, a hard ceiling remains. Most frontier models can only output between four thousand and eight thousand tokens in a single response. This creates a fundamental mismatch: an infinitely capable reader attached to a hand that cramps after writing a postcard.

This constraint becomes critical in long-horizon tasks like writing a novel or managing complex software migrations. The challenge shifts from raw model capability to architectural design. How do we maintain "task guidance" when an agent must span hundreds of API calls without losing the plot?

The Core Problem: State Management

The naive approach—simply summarizing previous context—fails quickly due to what's called the "Lossy Compression" problem. Summarizing a chapter to "John walked through a dark forest and felt scared" loses the texture: the snapping twig, the childhood trauma, the smell of pine. These details are essential for coherence but vanish in compression.

Professional developers address this with State Serialization, often called the "Story Bible" approach. Instead of raw text, the agent maintains a structured file (JSON or YAML) containing global state: character descriptions, resolved plot points, and open loops. This machine-readable format prevents hallucination and drift. For example, a JSON entry stating "Dog Status: Alive" is harder for a model to contradict than a vague memory.
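The Story Bible pattern is easiest to see as data. Below is a minimal sketch of what such a state file and its per-chapter "briefing" step might look like; all field names (`characters`, `open_loops`, and so on) are illustrative assumptions, not a standard schema.

```python
import json

# Hypothetical "Story Bible": structured ground truth the orchestrator
# maintains outside the model. Field names are illustrative only.
story_bible = {
    "characters": {
        "John": {"status": "alive", "location": "dark forest",
                 "fears": ["smell of pine (childhood trauma)"]},
        "Rex": {"species": "dog", "status": "alive"},
    },
    "resolved_plot_points": ["John finds the silver key"],
    "open_loops": ["Who left the silver key in the cabin?"],
}

def brief_for_scene(bible, active_characters):
    """Filter the Bible down to only the facts relevant to the next scene."""
    return {
        "characters": {name: bible["characters"][name] for name in active_characters},
        "open_loops": bible["open_loops"],
    }

# Injected into the system prompt before the next writing call:
briefing = brief_for_scene(story_bible, ["John", "Rex"])
prompt_state = json.dumps(briefing, indent=2)
```

An entry like `"status": "alive"` is exactly the kind of machine-checkable fact that is hard to contradict and easy to verify after each generation.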

However, this approach introduces its own challenge: Metadata Bloat. If the instructions become longer than the creative output, the agent spends more tokens reading manuals than writing. For an epic with five hundred named characters, the Bible itself could consume forty thousand tokens before a single word is generated.

External Memory and Retrieval

To solve bloat, developers turn to External Memory Stores, essentially RAG for agents. Instead of loading the entire Story Bible into every prompt, the agent uses a vector database or key-value store. When writing Chapter Twenty and needing a reference from Chapter Two, it performs a semantic search, retrieves only the relevant memory, and injects it into the current thought space.
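The retrieve-then-inject pattern can be sketched without any particular vector database. The toy example below uses bag-of-words cosine similarity as a stand-in for real embeddings; a production system would use an embedding model and a vector store, and the memory entries here are invented for illustration.

```python
from collections import Counter
import math

# Toy external memory: past scene summaries the agent wrote earlier.
memory = {
    "ch02_key": "In Chapter Two, John finds a silver key under the cabin floor.",
    "ch05_storm": "In Chapter Five, a storm traps the group in the lighthouse.",
}

def vectorize(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, k=1):
    """Return the k memories most similar to the query."""
    qv = vectorize(query)
    ranked = sorted(memory.items(),
                    key=lambda kv: cosine(qv, vectorize(kv[1])),
                    reverse=True)
    return [text for _, text in ranked[:k]]

# While drafting Chapter Twenty, pull only the relevant memory:
context = retrieve("what was the silver key John found?")
```

Only `context` is injected into the next prompt, so the full history never competes for the model's attention.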

Studies show this explicit state management makes agents three times more likely to finish long tasks without losing coherence compared to systems relying on simple summaries. It mirrors human workflow: writers use sticky notes and character sheets, not perfect memorization.

Recursive Task Decomposition

Even with good state management, the output limit remains. A novel chapter might run five thousand words, roughly six to seven thousand tokens, which already strains an eight-thousand-token output ceiling. The solution is Recursive Task Decomposition, exemplified by frameworks like "Write-HERE" (Heterogeneous Recursive Planning).

This approach uses multiple agents in a hierarchy. An Orchestrator agent breaks the goal "Write a Novel" into "Write Chapter One," then further into "Scene One," "Scene Two," and so on. Each sub-task is small enough to fit within output limits. The Orchestrator manages the "Context Budget," telling a Worker Agent to focus only on the immediate scene—like a high-speed chase—while ignoring irrelevant details from Chapter Twelve.
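The recursion described above can be sketched in a few lines. The `plan`, `write_scene`, and token-estimate functions below are placeholders for LLM calls and are purely illustrative; the point is the control flow of splitting until each leaf fits the output budget.

```python
# Sketch of recursive task decomposition under an output budget.
OUTPUT_BUDGET_TOKENS = 2000

def plan(task):
    """Stand-in for an LLM planning call: split a task into sub-tasks."""
    subtasks = {
        "Write a Novel": ["Write Chapter One", "Write Chapter Two"],
        "Write Chapter One": ["Scene One: the chase", "Scene Two: the cabin"],
        "Write Chapter Two": ["Scene Three: the key"],
    }
    return subtasks.get(task, [])

def estimated_tokens(task):
    return 1500 if task.startswith("Scene") else 50000  # toy estimate

def write_scene(task):
    return f"[prose for: {task}]"  # stand-in for a worker-agent call

def execute(task):
    if estimated_tokens(task) <= OUTPUT_BUDGET_TOKENS:
        return [write_scene(task)]
    pieces = []
    for sub in plan(task):
        pieces.extend(execute(sub))  # recurse until leaves fit the budget
    return pieces

scenes = execute("Write a Novel")
```

Each leaf call gets the model's full attention and comfortable headroom below the hard output ceiling.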

This divide-and-conquer method increases quality by narrowing scope. It's the difference between asking for "the history of the world" versus "what happened in Paris on July 14th, 1789." The latter yields vivid, usable detail because the model isn't compressing eons of history.

Maintaining Coherence Across Agents

Breaking tasks into discrete pieces risks "Context Drift" and loss of flow. If Agent A writes Scene One and Agent B writes Scene Two, prose can feel disjointed—like a committee of robots who haven't met. Serialized state introduces subtle errors: a character's anxiety might be summarized away, leading to inconsistent portrayal.

The fix is "Evaluator-Optimizer" loops. After a Worker Agent finishes a scene, an Editor Agent—equipped with the full Story Bible and raw text—reviews it for state violations. The Editor, instructed to be "pedantic and critical," checks for consistency in facts, tone, and details. If errors are found, it provides feedback for a second pass. This Multi-Agent Debate bridges short outputs with long-form coherence.
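The loop itself is simple to express. In this sketch the worker and editor are stand-ins for two LLM calls with different system instructions; the specific violation check is a hypothetical example of comparing a draft against the Bible.

```python
# Sketch of an Evaluator-Optimizer loop: a worker drafts, an editor checks
# the draft against the Story Bible, and violations trigger a revision pass.
bible = {"dog_status": "alive", "protagonist": "John"}

def worker_draft(task, feedback=None):
    """Stand-in for the creative Writer agent."""
    if feedback:
        return "John patted his very-much-alive dog and walked on."
    return "John mourned his dead dog."  # first draft contradicts the Bible

def editor_review(draft, bible):
    """Stand-in for the pedantic Editor agent. Empty list = draft passes."""
    violations = []
    if bible["dog_status"] == "alive" and "dead dog" in draft:
        violations.append("Bible says dog_status=alive; draft kills the dog.")
    return violations

draft = worker_draft("Scene One")
for _ in range(2):  # bounded revision loop to cap token spend
    problems = editor_review(draft, bible)
    if not problems:
        break
    draft = worker_draft("Scene One", feedback=problems)
```

Bounding the loop matters: each revision pass costs tokens, so production systems cap retries rather than iterating to perfection.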

Cost and Conciseness Bias

This multi-agent approach isn't free. For every thousand words of final prose, an agent might burn a hundred thousand tokens in reasoning, planning, and editing. The token bill can be astronomical, with internal-to-published ratios reaching 50-to-1. But this is the price of consistency.

Additionally, LLMs have a "Conciseness Bias." Trained to be helpful and brief, models often rush to summarize when given broad prompts. To counter this, agents must micro-manage pacing, forcing the model to write in specific beats—like "describe the room for five hundred words" then "write five hundred words of dialogue."
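Beat-by-beat prompting is just a loop over narrow, length-targeted requests. The beats and the `call_llm` placeholder below are invented for illustration; the point is that each prompt constrains both scope and length so the model cannot rush to a summary.

```python
# Sketch of beat-by-beat prompting to counter conciseness bias.
def call_llm(prompt):
    """Placeholder for a real API call."""
    return f"[~500 words responding to: {prompt}]"

beats = [
    "Describe the abandoned lighthouse room in about 500 words. No dialogue.",
    "Write about 500 words of dialogue between John and Mara. No description.",
    "Write about 500 words of John's internal monologue as the storm builds.",
]

chapter = "\n\n".join(call_llm(beat) for beat in beats)
```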

Open Questions

The discussion raises unresolved questions about the "Story Bible" as a living document. Should it be static or updated dynamically? How do agents balance creativity with consistency? These challenges highlight that while LLMs are powerful readers, building coherent long-form output requires careful architectural scaffolding.


Transcript

Corn
Imagine you are an author, and you have this brilliant idea for a hundred-thousand-word epic novel. You sit down at your desk, but there is a catch. Your typewriter only lets you see the last five pages you wrote, and it physically stops working after you hit page ten. To keep going, you have to stand up, walk out of the room, come back in, and somehow remember exactly where the plot was, what color the protagonist's eyes were, and why there was a mysterious silver key in chapter two. This is the exact paradox we are facing with Large Language Models today. We have these massive input windows where a model can read a whole library, but when it comes to actually generating content, there is a hard ceiling. It is like having the world's most sophisticated brain attached to a hand that gets a cramp after writing a postcard.
Herman
Herman Poppleberry here. That is a perfect way to frame it, Corn. We are living in this era of two-million-token context windows for models like Gemini one point five Pro or the newer Claude models, but the output limit is the stubborn anchor. Even in early twenty-six, most frontier models are still hitting a wall at around four thousand to eight thousand tokens for a single response. If you ask a model to write a novel, it cannot just stream out a quarter of a million words in one go. It physically cannot. So, the question remains: how do we build agentic workflows that maintain "task guidance" over a long horizon? How does the agent remember it is writing a mystery in chapter twenty if it finished chapter one ten API calls ago?
Corn
It is a fascinating problem because it moves the challenge from "model capability" to "architectural design." Today's prompt from Daniel hits right at the heart of this. He is asking about how that moving task context is achieved in agentic workflows when the output length is a hard constraint. By the way, today's episode is powered by Google Gemini three Flash, which is actually handling the script for us right now. It is a bit meta, considering we are talking about how these models manage complex instructions.
Herman
It really is. And to Daniel's point, we have to distinguish between "input context"—what the model can see—and "persistent task context," which is the evolving state of the job being done. When you are building an agent to do something complex, like writing a novel or managing a massive software migration, you aren't just sending a prompt. You are managing a state machine. The "memory" isn't just in the model's weights; it is in the scaffolding we build around it.
Corn
So, let's break that down. If I am an agentic orchestrator, and I need to write this novel, I can't just say "Go." I have to manage the "moving context." What is the first line of defense there? Is it just cramming a summary of the previous chapter into the next prompt?
Herman
That is the "naive" version, often called "prompt carryover," but it fails pretty quickly for complex tasks. If you just summarize, you lose the "texture." Think about the "Lossy Compression" problem. If you summarize Chapter One, you might say, "John walked through a dark forest and felt scared." But in Chapter Two, you need to know why he was scared. Was it the sound of a snapping twig? Was it a childhood trauma triggered by the smell of pine? A summary kills those details. The first real mechanism professional developers use is State Serialization. Think of this as the "Story Bible" approach. Instead of just passing raw text, the agent maintains a structured file—usually in JSON or YAML—that acts as the ground truth for the entire project. This file contains the "Global State." In a novel, that would be character descriptions, plot points already resolved, and "open loops" that still need to be addressed.
Corn
I like that because it makes the memory machine-readable. If you just ask an AI to "remember the plot," it might hallucinate that the hero's dog died in chapter three when he's actually sitting right there in chapter four. But if you have a JSON entry that says "Dog Status: Alive," it is much harder for the model to drift. But Herman, how does the agent actually interact with that JSON file? Does it read the whole thing every time?
Herman
Not necessarily. Before the agent starts writing Chapter Five, the orchestrator reads that JSON file, filters for only the "active" plot lines, and injects the relevant bits into the system prompt. It says, "Based on this state, write the next two thousand words." It’s like a briefing before a mission. But even with a Story Bible, you hit a limit. If the Bible gets too big, you are back to square one, eating up your input context with metadata. Imagine a Bible that tracks every minor character in an epic with five hundred named individuals. You’d spend forty thousand tokens just explaining who everyone is before you even write a single word of dialogue.
Corn
It’s the "Metadata Bloat" problem. If your instructions are longer than your creative output, you’re essentially paying for the AI to read a manual rather than do the work. So, what’s the move when the Story Bible itself starts to burst at the seams?
Herman
That is where Mechanism Two comes in: External Memory Stores, or what people often call "RAG for Agents." Instead of putting the whole Story Bible in every prompt, the agent uses a vector database or a key-value store. This is crucial for long-horizon tasks. Let's say the agent is writing Chapter Twenty and needs to reference a specific conversation from Chapter Two about a silver key. It doesn't need to hold Chapter Two in its active memory. It performs a semantic search against its own previous outputs, retrieves the relevant "memory," and brings it into the current "thought space." A study by the AI Engineering Institute in twenty-five showed that agentic systems using this kind of explicit state management were three times more likely to finish a long task without losing the plot compared to systems just "winging it" with simple summaries.
Corn
Three times? That is a massive delta. It basically means the difference between a coherent book and a pile of gibberish. It reminds me of how a human writer works—you have your sticky notes, your character sheets, and your previous chapters sitting on the shelf. You don't have the whole book memorized perfectly at every second; you just know where to look. But let's talk about the actual "writing" part. If the model can only output eight thousand tokens, and a chapter is five thousand, the model is already exhausted just finishing one segment. How do you handle the "Recursive Task Decomposition" Daniel mentioned?
Herman
This is where it gets really "nerdy" but incredibly cool. There is a framework called "Write-HERE" that came out recently—it stands for Heterogeneous Recursive Planning. The idea is that you don't just have one agent writing. You have an "Orchestrator" that acts like a lead editor. The Orchestrator doesn't write prose; it writes "plans." It breaks the goal "Write a Novel" into "Write Chapter One," then further breaks that into "Scene One," "Scene Two," and so on. Each of these sub-tasks is small enough to fit comfortably within the output limit. Because each sub-task is discrete, the model can dedicate all its "reasoning energy" to that one piece, rather than trying to juggle the entire architecture of the book at once.
Corn
Wait, so the Orchestrator is essentially managing the "Context Budget"? If Scene One is about a high-speed chase, the Orchestrator tells the Worker Agent, "Focus only on the physics of the cars and the tension of the driver." It doesn't tell it about the hero's grandmother's secret recipe in Chapter Twelve because that's irrelevant to the current output window.
Herman
Precisely. It’s about "Attention Management." By narrowing the scope of the output, you actually increase the quality. It’s the difference between asking someone to "tell me the history of the world" versus "tell me what happened in Paris on the morning of July 14th, 1789." The second prompt will give you much more vivid, usable detail because the model isn't trying to compress a billion years of history into a few thousand words.
Corn
It is like the "divide and conquer" algorithm for creativity. But there is a cheeky edge to this, isn't there? If you break it down too much, don't you lose the "flow"? If Agent A writes Scene One and Agent B writes Scene Two, how do you make sure the prose doesn't feel like it was written by a committee of robots who haven't met? I’ve seen this in early AI-generated scripts where one character sounds like a Victorian poet in one scene and a street-wise detective in the next.
Herman
You've hit on the "Context Drift" problem, Corn. It is a major second-order effect. When you serialize state—like turning a scene into a summary and back into a scene—you are basically playing a game of "telephone." Subtle errors accumulate. Maybe in Scene One, the character is "anxious," but the summary just says "he talked to his mom." By Scene Two, the agent thinks he is perfectly calm because the "anxiety" wasn't serialized in the state file. To fix this, you use "Evaluator-Optimizer" loops. After a "Worker" agent finishes a scene, an "Editor" agent—which has access to the full Story Bible and the previous scene's raw text—reviews it. It looks for "state violations." If it finds one, it sends it back with a "correction prompt."
Corn
So it is essentially a high-tech version of "trust but verify." You have one agent doing the work and another one acting as the "consistency police." Does the Editor agent have its own separate memory, or is it looking at the same JSON Bible?
Herman
It usually looks at the same Bible but has a different "System Instruction." While the Writer is told to be "creative and descriptive," the Editor is told to be "pedantic and critical." The Editor’s job is to look at the Writer’s output and ask: "Does this match the established facts? Is the tone consistent with Chapter One? Did the character suddenly forget they were holding a gun?" If the Editor sees a mistake, it doesn't just fix it—it provides feedback to the Writer for a second pass. This "Multi-Agent Debate" is how you bridge the gap between short outputs and long-form coherence.
Corn
I can see how that would be expensive, though. If you are doing three or four API calls for every thousand words of final output, your token bill is going to be astronomical. It is not just about the final novel; it is about the "thinking" tokens used to get there. We are talking about a 10-to-1 or even 50-to-1 ratio of "internal" tokens to "published" tokens.
Herman
Oh, the cost is a massive factor. For every thousand words of polished prose, you might be burning a hundred thousand tokens in "reasoning," "planning," and "editing." But that is the price of consistency. What I find wild is the "Conciseness Bias." LLMs are trained to be helpful and brief. If you give a model the whole plot and say "Write Chapter One," it will often try to finish the whole chapter in three paragraphs. It "rushes" to the finish line because its training data says "answer the user's question quickly." To get a real novel, you have to force the agent to write in "beats." You tell it, "Write the next five hundred words focusing only on the description of the room," then "Write the next five hundred words of dialogue." You have to micro-manage the pacing because the model's natural instinct is to summarize.
Corn
It is like trying to get a hyperactive kid to write a long-form essay. They just want to give you the TL;DR and go play outside. "The hero went to the cave, got the sword, and killed the dragon. The end. Can I have a cookie now?" You really have to hold their hand through the process. But let's look at the "Story Bible" as a living document. In these agentic workflows, is the "Bible" static, or does the agent update it as it goes? What happens if the AI has a stroke of genius mid-chapter?
Herman
It has to be dynamic. This is what the "Write-HERE" framework calls "Interleaving Execution." If, while writing Chapter Five, the agent decides it would be way cooler if the protagonist has a secret twin, it can't just write that and move on. It has to trigger a "State Update Task." The orchestrator pauses the writing, calls a reasoning model to update the Story Bible, checks for contradictions in previous chapters, and then resumes. The Bible is the "Single Source of Truth." If the text says one thing and the Bible says another, the system breaks. This is why frameworks like LangGraph or CrewAI are becoming so popular—they give developers the tools to manage this "Global State" explicitly.
Corn
I remember we touched on some of the stateful agent stuff in our discussion on LangGraph a while back—specifically how it handles those cycles. But here, the scale is just different. We are talking about thousands of state transitions. It makes me wonder about the "Lost in the Middle" phenomenon. Even if we have these massive two-million-token windows, does the model actually "see" the Story Bible clearly if it is buried in a mountain of other data? If the Bible is at the beginning of a massive prompt, does the model lose the thread by the time it gets to the end?
Herman
Research actually suggests that for agentic workflows, smaller is often better. If you give a model a hundred-thousand-token prompt, its performance on specific instructions can degrade. It gets "distracted." It’s like trying to listen to a whisper in a crowded stadium. Most high-end agent architectures actually prefer "Context Distillation." Instead of giving the model the whole book, you give it a "distilled" version of the current state. You provide the "High-Level Outline," the "Immediate Previous Scene," and the "Relevant Character Facts." This "sparse context" keeps the model's attention focused. It is a counter-intuitive insight: as context windows get larger, the most effective agents might actually use them less for raw data and more for "thinking space."
Corn
That is a sharp point. Just because you can fit a whole library in your head doesn't mean you can think clearly while trying to read every book at once. It is about "relevant attention." So, let’s play this out. If I’m building a software-coding agent rather than a novelist, how does "moving context" work there? Is the "Story Bible" just the codebase?
Herman
In coding, the "State" is the file tree and the function signatures. If the agent is refactoring a database module, it doesn't need to read the CSS for the front-end login page. The orchestrator "prunes" the context to only show the relevant API endpoints and the specific file being edited. Then, after the edit, a "Linter" or "Test" agent runs to see if the state update broke anything. If the tests pass, the "Global State"—the codebase—is updated. It’s the same pattern: Plan, Execute, Verify, Update State.
Corn
So, if someone listening wants to build something like this—maybe not a novel, but a complex agent for their business—what is the "takeaway" here? How do they avoid the "output limit" trap?
Herman
The first takeaway is: Never rely on the model's "implicit memory." If you are doing a task that takes more than one prompt, you need an "Explicit State Store." Whether that is a JSON file on your hard drive, a row in a Postgres database, or a vector store, you need a place where the "truth" lives outside the LLM. Use structured serialization like JSON or YAML. It is much easier to verify "Character_A_Location: London" than it is to parse a paragraph of natural language to see if the AI forgot where the hero is.
Corn
And the second takeaway has to be "Recursive Decomposition." Don't ask the AI to "Write the Report." Ask the AI to "Outline the Report," then "Research Section One," then "Draft Section One." By breaking the output into segments that are well below the hard limit—say, two thousand tokens each—you give the model the "headroom" to be detailed and creative without it hitting that "I need to finish this now" wall.
Herman
And don't be afraid of "Agentic Redundancy." Have an "Editor" agent. It feels like a waste of tokens, but in a long-horizon task, the "cost of failure" is cumulative. If Chapter Two is bad, Chapter Three will be worse because it is building on a shaky foundation. Fixing errors early in the "moving context" is much cheaper than trying to rewrite a whole novel because you realized at the end that the protagonist's name changed three times.
Corn
It is the "measure twice, cut once" philosophy of AI engineering. I also think there is a real "aha moment" here regarding the difference between "input" and "state." We often get blinded by these "millon-token" marketing headlines. But for an agent, the limiting factor isn't what it can "read"—it is how it "manages" what it has already "done." The "output limit" is actually a blessing in disguise because it forces us to build modular, verifiable systems instead of just throwing a giant prompt into a black box and hoping for the best.
Herman
I love that perspective. The constraint breeds better architecture. It forces us to act like engineers rather than just "prompt whisperers." We are building "Cognitive Architectures" where the LLM is just the processor, but the memory and the bus and the storage are all external. That is how you get a novel. That is how you get an agent that can code a whole app. You move the "intelligence" from the model's single response into the system's overall workflow. It's essentially building a computer where the LLM is the CPU, but you're the one who has to design the RAM and the hard drive.
Corn
It’s funny you say that, because in the early days of computing, programmers had to be incredibly clever with how they used limited memory. They had to "swap" data in and out of the processor. We’re doing the same thing with "Context Swapping" in LLMs. We’re going back to the future of computer science!
Herman
We really are. The "Context Window" is just the new "Registers" or "L1 Cache." If it’s not in the window, the model can’t think about it. So your job as the architect is to make sure the right data is in the right register at the right time.
Corn
It makes me wonder where this goes in the next year or two. Will we see "Stateful LLMs" where the "Story Bible" is actually baked into the model's KV-cache or something? Or will we always be managing this external scaffolding? There are some experimental models trying to implement "infinite" context through recurrent layers, but they always seem to lose precision compared to the transformers we use now.
Herman
There is research into "Long-Term Context" models that can "checkpoint" their state, but for now, the "Agentic Workflow" approach is winning because it is transparent. You can open that JSON file and see exactly what the agent "thinks" the plot is. You can't do that with a model's hidden weights. For production systems, that observability is everything. If the agent starts making mistakes, you can look at the "State Log" and see exactly where it went off the rails. You can say, "Ah, here in chapter seven, the state-update failed to record that the protagonist lost his keys." You can manually fix the JSON and the agent is back on track. You can't "edit" a model's internal memory like that.
Corn
It is the difference between a "wizard" and a "scientist." I'd rather have the scientist with the notebook than the wizard who might forget the spell halfway through. This has been a deep dive, but I think it really clarifies why Daniel's question is so vital. It is not about the "size" of the AI; it is about how you "orchestrate" its movements. It’s about the choreography of information.
Herman
Well said. And honestly, it makes the prospect of AI-written novels a lot more interesting. It is not just a "random text generator" anymore; it is a "collaborative planning system." I'm genuinely excited to see the first truly great "agentically authored" epic. It will be a triumph of architecture as much as a triumph of language. We might even see a new genre of "Architectural Literature" where the design of the agent’s state-space is as important as the prose itself.
Corn
Well, if it is a mystery novel, I just hope the agent remembers who the killer is by chapter thirty. If not, we are going to have a lot of very frustrated readers. "And the murderer was... wait, let me check my vector database... ah, yes, the butler!"
Herman
"The butler, who was actually a secret twin added in a state-update call in chapter twelve!"
Corn
Precisely. Well, I think we've unpacked the "moving task context" quite a bit today. It is all about the scaffolding, the "Story Bible," and the recursive breakdown. If you are building agents, don't just prompt—program the state. Think of yourself as the director of a movie, where the LLM is an actor who has a very short-term memory. You need to provide the script, the props, and the "continuity person" to make sure everything stays on track.
Herman
And keep an eye on those "thinking tokens." They are the silent engine of the whole operation. They are the "CPU cycles" of the twenty-first century.
Corn
Thanks as always to our producer, Hilbert Flumingtop, for keeping the gears turning behind the scenes. And a big thanks to Modal for providing the GPU credits that power this show's generation pipeline. This has been Episode twenty fifty-seven of My Weird Prompts.
Herman
If you found this helpful, or if you're building your own agentic workflows, we'd love to hear about it. Find us at myweirdprompts dot com for the RSS feed and all the ways to subscribe. We’re always looking for new prompts that push the boundaries of what these models can do.
Corn
Until next time, keep your context clear and your state serialized. Goodbye!
Herman
Goodbye!

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.