By the way, today's episode is powered by DeepSeek V four Pro. Corn tells me it's handling the script duties.
That's right. So Daniel sent us this one and it's actually several questions folded together, but they all orbit the same thing. He's been doing batch inference — first real use case for him was building a classification model that can separate the rambling context from the actual questions in the prompts he sends us. And he noticed something: the model type matters. A conversational model needs a lot of hand-holding to stop being chatty and just do the task. An instructional model, something tuned specifically on following instructions, might actually be a better fit for batch processing jobs. So his main question is: what's the state of instructional AI models in twenty twenty-six that complement batch processing? And before we get there, he wants us to dig into additional use cases for batch inference in AI engineering and training — beyond the classification example he already figured out.
That classification use case he described is genuinely smart, by the way. The idea of splitting prompts into strands — questions versus context — and then building a context database so the agent can retrieve what it needs without Daniel having to repeat himself every time. That's not just a batch inference use case, that's a whole architecture pattern.
He called it a knock-on effect. The reordering alone might improve results, but the long-term play is that context database. I like that he's thinking in layers.
He always does. All right, let's start with the use cases because that's the part most people skip past. Everyone sees batch inference in the API docs, sees the fifty percent discount, and thinks great, cheaper tokens. Which is true but it's the least interesting part.
What are they missing?
The thing you have to understand about batch inference is that it's not just cheaper inference — it's asynchronous inference with a fundamentally different latency profile. OpenAI's batch API, for example, gives you results within twenty-four hours, typically much faster, but the point is you're not waiting for a response. You submit a file of requests, you walk away, you come back when it's done. That changes what kinds of jobs make sense.
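For anyone who wants to see what that submit-and-walk-away flow looks like, here is a minimal sketch against OpenAI's batch API using the Python SDK. The file name and request contents are placeholders; the general shape (a JSONL file of requests, uploaded and then submitted as a batch job) is the documented workflow.

```python
from openai import OpenAI

client = OpenAI()

# requests.jsonl contains one request per line, e.g.:
# {"custom_id": "prompt-001", "method": "POST", "url": "/v1/chat/completions",
#  "body": {"model": "gpt-4o", "messages": [{"role": "user", "content": "..."}]}}
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")

job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(job.id, job.status)  # poll later; download the output file once status is "completed"
```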
That's where most people's mental model breaks, right? They think of AI as a real-time thing. You type, it responds. That's the ChatGPT experience and it's what everyone learned.
But a huge amount of what you'd actually want to do with language models has nothing to do with real-time interaction. The first big category is data annotation and labeling. Daniel's doing a version of this — he's classifying his own prompts. But scale that out. You've got a dataset of a hundred thousand support tickets and you need to tag them by category, sentiment, urgency, whatever. You could pay humans to do it, you could build a real-time pipeline that processes them one at a time, or you could fire off a batch job overnight and wake up to a fully labeled dataset at half the cost.
The half the cost part is real? I've seen the fifty percent number thrown around but I always wonder if there's fine print.
It's real and it's not fine print so much as a trade-off you need to understand. OpenAI's batch API pricing is literally fifty percent off the standard token rates. For GPT-four-o, that's a dollar twenty-five per million input tokens instead of two fifty, and five dollars per million output tokens instead of ten. Anthropic has a similar model. Google's batch pricing varies but the discount is comparable. The catch is you're giving up latency guarantees. They process your batch when compute is available.
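To make the discount concrete with the hundred-thousand-ticket example from a moment ago (the per-ticket token counts here are made-up assumptions; the per-million rates are the GPT-four-o prices just quoted):

```python
# 100,000 tickets, assuming roughly 500 input tokens and 20 output tokens each
input_tokens = 100_000 * 500
output_tokens = 100_000 * 20

standard = input_tokens / 1e6 * 2.50 + output_tokens / 1e6 * 10.00  # real-time rates
batch = input_tokens / 1e6 * 1.25 + output_tokens / 1e6 * 5.00      # batch rates

print(f"standard: ${standard:.2f}, batch: ${batch:.2f}")  # roughly $145 vs $72.50
```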
Which for a lot of jobs, who cares?
Who cares, exactly. But here's where I think the misconception lives — people hear batch and they think it's only for massive jobs. Enterprise-scale, millions of tokens. And it's great for that. But Daniel's use case is maybe a couple hundred prompts. That's still worth doing in batch if you don't need the results in the next thirty seconds.
I think there's also a psychological barrier. People are used to the tight feedback loop. You send a prompt, you see the response, you iterate. Batch breaks that loop. You have to plan ahead, structure your requests, trust the system.
Which is actually a discipline that forces better engineering practices. When you're doing real-time inference, it's easy to get lazy. You tweak the prompt, you run it again, you tweak it, you run it again. With batch, you have to think through what you're actually trying to accomplish before you hit submit.
All right, so data annotation is the obvious use case.
Synthetic data generation is huge and growing. If you're training a smaller model or fine-tuning one, you need training examples — potentially thousands or tens of thousands of them. You can generate those through batch inference. You define the format, you provide a few examples, you generate a massive dataset of synthetic training data. It runs asynchronously, costs half as much, and by the time you're back from lunch you've got your training set.
This is where it connects to what Daniel was saying about instructional models, right? Because if you're generating training data, you don't want a model that's going to get creative or conversational. You want consistency.
That's exactly the connection and we'll get there in a minute. But let me give you one more category that I think is underappreciated: content transformation at scale. You've got a corpus of documents — legal contracts, medical records, technical documentation — and you need to transform them. Summarize them, translate them, extract structured data from them, convert them from one format to another. These are deterministic-ish tasks. The input is fixed, the desired output is well-defined, and you're running the same operation over thousands of documents. That's a batch job if I've ever seen one.
The translation one is interesting because real-time translation models exist and they're good. But if you need to translate a whole documentation site from English to five languages, you're not going to sit there doing it one page at a time through a chat interface.
Right, and the quality difference matters. When you use a model optimized for instruction following, you can define the translation parameters more precisely — maintain technical terminology consistently, preserve formatting, handle edge cases the same way every time. A conversational model might be more prone to stylistic drift across a large batch.
We've got data annotation, synthetic data generation, content transformation. What about evaluation?
Oh, that's a great one. Let's say you've built a system that generates responses and you want to evaluate the quality of those responses at scale. You can use batch inference to run an LLM-as-judge pipeline. Send thousands of response pairs to a model, have it score them on relevance, accuracy, tone, whatever dimensions you care about. That's a perfect batch workload — it's embarrassingly parallel, each evaluation is independent, and you want the lowest cost per evaluation you can get.
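A sketch of what one of those judge requests might look like as a line in the batch input file. The prompt wording, model choice, and example pair are placeholders; the JSONL envelope is the same one the batch API expects.

```python
import json

JUDGE_PROMPT = (
    "Score the RESPONSE to the QUESTION on relevance and accuracy, each from 1 to 5. "
    'Reply only with JSON: {"relevance": <int>, "accuracy": <int>}.'
)

def judge_request(i, question, response):
    # Each evaluation is independent, so every (question, response) pair becomes its own request
    return {
        "custom_id": f"eval-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "temperature": 0,
            "messages": [
                {"role": "system", "content": JUDGE_PROMPT},
                {"role": "user", "content": f"QUESTION: {question}\nRESPONSE: {response}"},
            ],
        },
    }

pairs = [
    ("How do I reset my password?", "You can reset it from the account settings page."),
]  # replace with your real (question, response) tuples

with open("judge_batch.jsonl", "w") as f:
    for i, (q, r) in enumerate(pairs):
        f.write(json.dumps(judge_request(i, q, r)) + "\n")
```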
I love that term. It sounds like the problem should be ashamed of itself.
It's a real computer science term.
I know, that's what makes it better. All right, so Daniel mentioned he stumbled across a company called DoubleWord that's building a platform specifically for batch inference. What do we know about them?
DoubleWord dot A-I. They launched fairly recently and they're doing something I think is smart, which is aggregating batch inference endpoints across multiple providers. So instead of going to OpenAI for their batch API, then Anthropic for theirs, then Google for theirs, you go to one platform and route your batch jobs to whichever model makes sense. They handle the file management, the job tracking, the error handling.
Which solves the exact problem Daniel described. He was looking at OpenRouter for batch endpoints and thinking someone should build a platform just for this.
The timing is right. Batch inference has been available for a while — OpenAI launched their batch API back in April twenty twenty-four — but it's only recently that the ecosystem has matured enough for specialized platforms to make sense. The providers have stabilized their APIs, the pricing is consistent, and enough developers are hitting the limits of real-time inference that there's a real market.
I want to go back to something you said about the latency trade-off. Twenty-four hours is the official window, but what's the actual experience?
In practice, most batch jobs complete within a few hours, sometimes much less. OpenAI says they process batches as compute becomes available, which means if you submit during a low-demand period your job might be done in an hour. But the key is you can't rely on that. If you need a guaranteed turnaround time, batch isn't the right tool. If you can tolerate waiting up to a day, you get the discount.
Which brings us to the model selection question, which is really the heart of what Daniel's asking. He's noticed that for batch processing jobs — classification, extraction, transformation — a conversational model like ChatGPT is actually kind of annoying to work with. It wants to be helpful and chatty when you just want it to shut up and do the task.
This is the distinction between conversational AI and instructional AI that we've talked about before, and Daniel's absolutely right that it matters enormously for batch workloads. A conversational model is trained with reinforcement learning from human feedback to be engaging, to ask follow-up questions, to provide context and caveats. That's great when you're chatting with it. It's actively counterproductive when you're running a batch of ten thousand classification tasks and every response starts with "Sure, I'd be happy to help you classify this text!"
Not just annoying — it actually breaks downstream processing. If you're expecting a clean JSON output and instead you get a friendly preamble, your pipeline falls over.
You end up writing prompt after prompt saying "do not include any introductory text, do not say hello, do not ask if I need anything else, just output the classification." And even then it sometimes ignores you because the conversational training runs deep.
What's the alternative? Daniel mentioned instructional models — models specifically tuned on instruction following. What does the landscape actually look like right now?
Let me give you the landscape as I understand it. The term instructional model — or instruction-tuned model — refers to models that have been fine-tuned specifically to follow instructions precisely, without the conversational window dressing. They predate ChatGPT. The original GPT-three was a base model, and then InstructGPT was the instruction-tuned version that actually shipped in the API before ChatGPT existed.
The instructional approach came first, then conversational took over the public imagination.
And for a while it looked like instructional models might get absorbed into the generalist models entirely. Why maintain a separate instruction-tuned model when GPT-four can do both — chat and instruction following — depending on how you prompt it?
That's the thing, isn't it? It can do both, but it doesn't do instruction following as cleanly as a dedicated model would, especially at scale.
That's why we're seeing a resurgence of interest in instruction-specialized models, particularly for batch and API use cases. The model providers have realized that developers building pipelines don't want the same thing as end users chatting with an app.
Who are the players? What models should someone like Daniel be looking at?
OpenAI has been moving in this direction with what they call their reasoning models and their structured outputs feature. But the most interesting developments are happening in the open-weight space and with some of the smaller providers. Let me walk through a few. First, there are models like Cohere's Command R series, which were designed from the ground up for enterprise retrieval-augmented generation and tool use. They have a much more task-oriented response style. They're less likely to ramble, less likely to add conversational filler.
Cohere's been consistent about this positioning, right? They never tried to compete with ChatGPT on chat. They went straight for the enterprise API use case.
Which looked like a weird choice two years ago and now looks prescient. But beyond Cohere, there's a whole category of models that have been fine-tuned specifically for structured output and instruction following. Mistral's models, particularly the newer ones, have strong instruction-following capabilities. The Llama three point one and three point two models from Meta have instruction-tuned variants that are quite good at this. And then there are the specialized fine-tunes built on top of these base models.
The open-weight models have an advantage here, don't they? You can take Llama and fine-tune it further on your specific task format, your specific output structure. You're not stuck with whatever the provider decided.
That's the huge advantage, and it's where batch inference and instructional models really converge. Here's the workflow I think is underappreciated: you take an open-weight instruction-tuned model, you fine-tune it on a few hundred examples of your specific task — Daniel's classification task, say — and then you deploy that fine-tuned model for batch inference at scale. You get a model that's precisely calibrated to your output format, your domain, your edge cases, running at half the cost of real-time inference.
Daniel's already done the first step without realizing it. He annotated a bunch of examples, had an AI agent extrapolate from those annotations, and now he's got a training dataset. That's the fine-tuning dataset right there.
He's closer than he thinks to a very robust pipeline. The missing piece is model selection for the batch job itself. Let me be specific about what I'd recommend for his use case. He needs a model that does three things well: follows formatting instructions precisely, maintains consistent behavior across thousands of requests, and has a low default temperature — meaning it doesn't get creative when creativity isn't wanted.
Low temperature is something he mentioned specifically. He said these models inherently have a very low temperature setting. Is that actually a property of the model or is that just a parameter you can set?
It's both. You can set temperature to zero on any model, but some models are trained in a way that makes them more deterministic even at higher temperature settings. Instruction-tuned models tend to have been trained with less entropy in their output distribution — they're optimized to give the same answer to the same prompt every time, which is exactly what you want in a batch processing pipeline. Conversational models are trained to have more variety, more creativity, because that's what makes for an engaging chat experience.
Even with temperature set to zero, a conversational model might still have more variance than an instructional model?
In practice, yes, because temperature isn't the only source of randomness. There's also the sampling strategy, the token selection algorithm, and the underlying probability distributions learned during training. A model that was trained to be creative will sometimes find creative ways to be creative even when you tell it not to be.
That's a very diplomatic way of saying the model ignores you.
I prefer to think of it as the model having strong priors. But yes, practically speaking, an instructional model is more likely to respect the "just do the task" instruction because that's what it was optimized for.
We've got Cohere Command R, we've got Mistral, we've got Llama instruction-tuned variants. What about the newer entrants? Anthropic's Claude models — where do they fit?
Claude is interesting because it's somewhere in between. Claude is very good at following instructions — Anthropic has put a lot of work into what they call constitutional AI, which is essentially training the model to be helpful, honest, and harmless while following user instructions precisely. In my experience, Claude is less prone to the chatty preamble problem than GPT-four in its default configuration. But it's still fundamentally a general-purpose model that does conversation and instruction following. It's not specialized in the way a dedicated instructional model would be.
Anthropic hasn't released a batch-specific variant or an instruction-only variant?
Not as a separate model. They have their batch API with the fifty percent discount, and you can use Claude through it. In practice, lots of developers do and get good results. The question is whether a specialized model would do better for certain narrow tasks, and I think the answer is yes — but you have to weigh that against the convenience of using a model you already know and have prompts tuned for.
Daniel also mentioned structured output as part of his idea — reformatting prompts into questions and context. What's the state of structured output support across these models?
This has been one of the biggest developments of the past year. OpenAI launched their structured outputs feature, which guarantees that the model will output valid JSON matching a specified schema. Not best-effort JSON, not usually-valid JSON — they're making a hard guarantee by constraining the token sampling process. Anthropic has something similar with their tool use and structured output capabilities. And on the open-source side, projects like Outlines and Guidance let you do constrained generation with open-weight models.
Constrained generation meaning you literally restrict which tokens the model can output at each step?
Exactly. If the schema says the next field must be a boolean, the model can only output true or false. If it says the next field must be one of five categories, the model can only output those five strings. It's not relying on the model to follow instructions — it's mathematically constraining the output space.
Which solves the preamble problem entirely. The model literally can't say "Sure, I'd be happy to help" because those tokens aren't in the allowed set.
This is where batch inference gets really powerful. When you combine constrained generation with batch processing, you get a pipeline that's both cheap and reliable. You submit a thousand requests, you get back a thousand valid JSON objects with zero parsing errors, zero hallucinations in the output structure, zero friendly greetings breaking your downstream code.
For Daniel's specific use case — splitting prompts into questions and context — he could define a schema with two fields, questions and context, constrain the output to match that schema, and run it in batch. He'd get clean, machine-readable output every time.
Then the second phase — building the context database — becomes trivial because the output is already structured. No regex, no parsing hacks, no trying to figure out where the model's preamble ends and the actual output begins.
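For Daniel's two-field split specifically, here is a minimal sketch using OpenAI's structured outputs helper in the Python SDK. The field names mirror his questions-versus-context idea; the system prompt is an assumption. For a batch job you would put the equivalent JSON-schema response_format into each request body in the JSONL file rather than calling the synchronous helper.

```python
from openai import OpenAI
from pydantic import BaseModel

class PromptSplit(BaseModel):
    questions: list[str]  # the actual asks, stripped of the surrounding rambling
    context: list[str]    # reusable background worth storing in the context database

client = OpenAI()

raw_prompt = "..."  # placeholder: one of Daniel's original prompts

completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Split the user's prompt into standalone questions and reusable context."},
        {"role": "user", "content": raw_prompt},
    ],
    response_format=PromptSplit,
)

split = completion.choices[0].message.parsed  # a PromptSplit instance, guaranteed to match the schema
```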
Let me play devil's advocate for a second. If constrained generation solves the output format problem, does the model type still matter? Couldn't you just use GPT-four-o with structured outputs and get the same result?
It solves the format problem but not the content quality problem. A conversational model operating under output constraints will still produce valid JSON, but the content of that JSON — the actual classification decisions, the actual text extraction — might be influenced by its conversational training. It might be more prone to hedging, more prone to adding unnecessary qualifiers, more prone to misinterpreting the task because it's trying to be helpful in a conversational way.
Give me an example.
Let's say you're classifying customer support tickets into categories like billing, technical, account, and other. A conversational model might be reluctant to classify something as other because it wants to be helpful and find a more specific category, even when other is the correct answer. An instructional model is more likely to just follow the classification guidelines as written. The difference is subtle but it compounds across thousands of examples.
The model type still matters, but maybe less than it did before structured outputs were available.
I think that's fair. Structured outputs shrink the gap but don't eliminate it. And for batch processing specifically, where you're optimizing for consistency and reliability over thousands of requests, small differences in model behavior get magnified.
Let's talk about temperature more directly. Daniel mentioned that instructional models inherently have a low temperature setting. Is that literally true, or is it more that they're designed to work well at low temperatures?
It's more that they're designed to work well at low temperatures. Temperature is a sampling parameter, not a property of the model weights. But the way the model was trained affects how it behaves at different temperature settings. A model trained with a lot of entropy in its output distribution tends to get repetitive or stilted at very low temperatures, because you're forcing greedy, high-confidence choices out of a distribution that was shaped for variety. An instruction-tuned model is trained to have sharper probability distributions — it's more confident about what the right next token should be for a given instruction.
Sharper distributions meaning the probability mass is concentrated on fewer tokens?
If the instruction is "classify this text as positive, negative, or neutral," an instruction-tuned model might assign ninety-eight percent probability to the correct token. A conversational model might assign eighty-five percent because it's also considering tokens like "I think this is" or "Based on my analysis" or "This seems."
Even when the output is constrained to just those three labels?
If the output is constrained, the conversational model can't output those preamble tokens. But the underlying probability distribution still affects which of the allowed tokens it chooses. If the model is less confident, it might flip between positive and neutral on borderline cases more than an instruction-tuned model would. Again, small differences that compound at scale.
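For reference, temperature is just a rescaling of the model's logits before sampling, the standard softmax-with-temperature formula rather than anything provider-specific:

$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

As $T \to 0$ the probability mass collapses onto the highest-logit token, so a model whose logits already separate the intended token cleanly from the alternatives (the sharper distribution described above) shifts its answers less as you move the temperature around.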
We've covered the model landscape. Let me ask you the practical question: if Daniel is listening and wants to pick a model for his classification batch job today, what should he use?
It depends on his constraints. If he wants the simplest path and he's already in the OpenAI ecosystem, GPT-four-o with structured outputs and the batch API is a solid choice. He'll get the fifty percent discount, the output will be valid JSON, and the quality will be good. If he wants maximum reliability and consistency, I'd point him toward an instruction-tuned model — either Cohere's Command R plus through their API, or a fine-tuned Llama three point one instruct model deployed on a platform like Together AI or Fireworks that supports batch inference.
If he wants to go the open-weight route and fine-tune?
Then I'd say start with Llama three point one eight-b instruct as a base, fine-tune it on those annotated examples he already has, and deploy it for batch inference. The eight-billion-parameter model is small enough to fine-tune on a single GPU, large enough to handle classification tasks well, and the instruct variant gives you that task-oriented behavior out of the box.
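A minimal LoRA sketch of what that fine-tune might look like with Hugging Face's trl and peft libraries, assuming the annotated examples have been exported to a JSONL file with a text field. The file name, field name, and hyperparameters are placeholders, and trl's exact argument names shift between versions.

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# annotated_prompts.jsonl: one example per line, with a "text" field holding
# the full instruction plus the gold classification label
dataset = load_dataset("json", data_files="annotated_prompts.jsonl", split="train")

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",  # gated repo; requires accepting Meta's license
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, target_modules="all-linear"),
    args=SFTConfig(
        output_dir="prompt-classifier-lora",
        dataset_text_field="text",
        max_seq_length=1024,
        num_train_epochs=3,
        per_device_train_batch_size=4,
    ),
)
trainer.train()
```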
What about the cost comparison? Batch inference gives you fifty percent off, but fine-tuning has its own costs. Where's the break-even?
Rough numbers: fine-tuning Llama three point one eight-b on a few hundred examples might cost you twenty to fifty dollars in compute if you're using a cloud GPU. After that, running inference on your fine-tuned model costs whatever the hosting platform charges — typically similar to or slightly more than the base model API price. So if you're doing tens of thousands of classifications, the fine-tuning cost gets amortized quickly. If you're doing a few hundred, it's probably not worth fine-tuning and you should just use an off-the-shelf model.
Daniel's dataset sounds like it's in the hundreds, not tens of thousands. So off-the-shelf probably makes more sense for now.
But the beauty of his approach is that he's building the training dataset as a side effect of using the system. Every prompt he annotates is a potential fine-tuning example. Over time, he'll accumulate enough to do a proper fine-tune if he wants to.
I want to circle back to something Daniel said about the long-term vision. He's thinking about a context database that the agent can retrieve from, so he doesn't have to repeat context every time. That's essentially a retrieval-augmented generation system built on his own personal context.
It's a really smart pattern. The batch inference part handles the ingestion — classifying and structuring incoming prompts, separating the evergreen context from the ephemeral questions. Then the RAG part handles retrieval — when a new prompt comes in, the system pulls relevant context from the database and includes it in the prompt to the script-writing agent.
Which means the quality of the batch inference step determines the quality of the entire downstream system. If the context extraction is sloppy, the database is noisy, and the retrieval gives the agent irrelevant or incomplete context.
This is why model selection matters so much. You're not just saving money on inference — you're building a data asset. The cleaner your extraction and classification, the more valuable that asset becomes over time.
Like interest on a savings account, except the principal is your own prompt history.
Most people don't think about their prompt history as an asset. But Daniel clearly does, and he's right. Every prompt you've ever written contains information about what you care about, how you think, what context matters to you. Structuring that into a queryable database is valuable.
We've covered the use cases for batch inference — data annotation, synthetic data generation, content transformation, evaluation. We've covered the model landscape — instruction-tuned models from Cohere, Mistral, and Meta's Llama instruct variants, plus the structured outputs capabilities from the major providers. What haven't we covered?
I want to touch on one more thing Daniel mentioned, which is the email he got from DoubleWord the day after he was thinking about batch platforms. That kind of synchronicity is fun, but there's actually something substantive here: the batch inference ecosystem is growing fast enough that new tools and platforms are appearing constantly. If you're doing AI engineering, it's worth periodically checking what's available rather than assuming the landscape looks the same as it did six months ago.
The platform layer is maturing.
A year ago, batch inference meant going to each provider's API directly, managing your own file uploads, handling your own error recovery. Now you've got platforms that abstract that away, route to multiple providers, handle retries, give you dashboards. It's becoming a proper product category rather than just an API feature.
Which lowers the barrier. Someone like Daniel, who's a developer but not necessarily an AI infrastructure specialist, can use batch inference without building a bunch of plumbing.
That's the trend that matters. Batch inference has been technically available for a while. What's changing is that it's becoming accessible. The combination of mature APIs, specialized platforms, better instruction-tuned models, and structured output guarantees means you can now build reliable batch pipelines without being an expert in any of those things individually.
Let me ask you a forward-looking question. Where do you see instructional models going? Daniel said he hopes they won't be dinosaurs in the era of generalist AI. Do you share that concern?
I don't think they're going extinct. I think they're going to become more specialized, not less. The generalist models will keep getting better at everything, including instruction following. But there will always be a gap between a model that can follow instructions and a model that's optimized to follow instructions. It's the difference between a general practitioner who can do surgery and a surgeon who does nothing but surgery. For specific high-volume tasks, you want the specialist.
The economics favor specialization. If you're running millions of classification tasks, a model that's ten percent more reliable or ten percent cheaper is worth switching to.
The market will support both. Generalist models for general use, specialist models for high-volume pipelines. We're already seeing this with the rise of small, task-specific models fine-tuned from larger base models. A one-billion-parameter model fine-tuned on your specific task can outperform a hundred-billion-parameter generalist model on that task, and it costs a fraction as much to run.
Which connects back to batch inference. If you're running that small specialized model at scale, batch processing makes the economics even better.
The whole stack compounds. Specialized model plus batch inference plus structured outputs — you go from paying premium rates for real-time generalist inference to paying a fraction of that for asynchronous specialist inference, with higher reliability and cleaner outputs.
Daniel's going to end up with a very nice pipeline if he follows this through.
He's already most of the way there. The classification model, the annotated dataset, the context database idea — he's got the pieces. The remaining work is mostly integration and model selection.
Now he's got a platform that specializes in batch inference showing up in his inbox. The universe is nudging him.
The universe has good timing.
All right, I think we've given Daniel a pretty thorough answer. Use cases beyond classification: annotation, synthetic data, transformation, evaluation. Model landscape: instruction-tuned models are alive and well, Cohere Command R, Mistral, Llama instruct variants, plus structured outputs from the major providers making the model type slightly less critical but still meaningful. And the ecosystem is maturing fast enough that it's worth keeping an eye on new platforms.
One last thing I'd add: if you're doing batch inference, test your prompts on a small sample before you fire off the full batch. The tight feedback loop you lose with batch means you really want to be confident in your prompt and your output format before you commit to processing ten thousand items.
That's just good engineering. Measure twice, cut once.
And with batch, the "cut" might take a few hours and cost you actual money, so the measuring is worth the time.
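That measuring step can be as small as running the first handful of requests from the batch file synchronously before submitting the whole thing. A quick sketch, reusing the JSONL format from earlier with placeholder file names:

```python
import json
from openai import OpenAI

client = OpenAI()

# Smoke-test the first twenty requests in real time before committing to the full batch
with open("requests.jsonl") as f:
    sample = [json.loads(line) for line in f][:20]

for req in sample:
    resp = client.chat.completions.create(**req["body"])
    print(req["custom_id"], resp.choices[0].message.content[:80])
```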
Now: Hilbert's daily fun fact.
Hilbert: The collective noun for a group of sloths is a "bed" of sloths.
...right.
That's either very fitting or deeply ironic, depending on which sloth we're talking about.
I'm not engaging with that. This has been My Weird Prompts. Thanks to our producer Hilbert Flumingtop. You can find us at myweirdprompts dot com. We'll catch you next time.
See you then.