Daniel sent us this one — he's been poking around a generative AI API gateway and noticed they've got a dedicated endpoint for prompt enhancement, running what looks like a small model under the hood. His question is basically: when you're building a pipeline, especially for diffusion models that generate images or video, should you train a tiny specialized model on examples of bad prompts transformed into good ones, or is a general-purpose model with a system prompt good enough? And he's asking whether there are known models out there specifically built for this. It's a build-versus-buy question, but at the model architecture level. And I think it's actually deeper than that — it's really a question about where intelligence should live in your pipeline.
Oh, this is timely. Baidu just open-sourced a model called ERNIE-Image about, what, two weeks ago? It ships with a dedicated Prompt Enhancer — a separate three billion parameter language model that sits in front of an eight billion parameter diffusion transformer. The enhancer takes a brief user input like "a cat on a chair" and expands it into a rich, structured description before it ever hits the image generator. And here's the thing — they published benchmarks showing it actually works, but not universally. It improves reasoning scores and text rendering, but it slightly degrades basic instruction following. GenEval drops from about zero point eight eight five six to zero point eight seven two eight when you turn the enhancer on.
It's not a pure win. You're trading basic prompt adherence for richer output. That's interesting — it means the enhancer is adding detail but also potentially introducing noise or over-specifying things the user didn't actually ask for. When would you not want to enhance a prompt? Can we walk through a concrete scenario where the enhancer actually makes things worse?
That's exactly the right question. If you're doing something where compositional precision matters — like "put the red ball to the left of the blue cube and nothing else" — the enhancer might elaborate with "a vividly colored red ball resting beside a deep blue cube on a wooden table in soft afternoon light," and suddenly you've got a table and lighting conditions the user never requested. The diffusion model follows the enhanced prompt faithfully, but it's no longer what the user wanted. Baidu actually lets you toggle the enhancer on and off per use case, which is smart. They know it's situational. Think of it like an overeager assistant who hears you say "I need a sandwich" and comes back with a full five-course meal. It's impressive, but maybe you just wanted the sandwich.
Right — and the table and the lighting aren't just extra details, they're constraints the model now has to satisfy. Every added detail is another opportunity for the model to mess something up. So if your original prompt was already precise, the enhancer is introducing risk for no reason. But I'm curious — how does this play out in practice? Like, if I'm a designer iterating on a composition, I probably want the enhancer off for the first twenty generations while I nail down the layout, then maybe turn it on at the end to add richness. Is that the workflow?
That's exactly the workflow the ComfyUI community has been converging on. They'll do their structural passes with a simple prompt and the enhancer disabled — just the raw diffusion model following instructions as literally as possible. Then once the composition is locked, they'll route the prompt through an enhancer node for the detail pass. It's a two-stage process, and the toggle is essential. Without it, you're fighting the enhancer during the structural phase, constantly getting beautiful, detailed images that have the wrong layout.
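To make that concrete, here's a rough Python sketch of the toggle. The enhance and generate functions are hypothetical stand-ins for whatever enhancer model and diffusion backend you're actually running, not any specific library's API:

```python
# Minimal sketch of the two-stage workflow: structural passes with the
# enhancer off, then a detail pass with it on. enhance() and generate()
# are hypothetical placeholders, not a real library's API.

def enhance(prompt: str) -> str:
    # A real enhancer would be a model call; this just illustrates the
    # kind of expansion it performs.
    return f"{prompt}, rich detail, soft afternoon light"

def generate(prompt: str, seed: int) -> None:
    # Placeholder for the diffusion backend.
    print(f"[seed {seed}] generating: {prompt}")

def run_pass(prompt: str, seed: int, use_enhancer: bool) -> str:
    final_prompt = enhance(prompt) if use_enhancer else prompt
    generate(final_prompt, seed)
    return final_prompt

# Structural phase: iterate on layout with the raw prompt.
for seed in range(3):
    run_pass("red ball to the left of a blue cube", seed, use_enhancer=False)

# Detail phase: composition locked, let the enhancer add richness, and
# keep the enhanced prompt visible so you can audit what it added.
used = run_pass("red ball to the left of a blue cube", seed=7, use_enhancer=True)
```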
The gateway Daniel's looking at is basically productizing that same toggle. But the deeper question he's asking is whether you fine-tune a small model for this or just prompt a big one. What's the actual performance difference? And I want to push on this — is it really a ten percent gap, or does it depend on what you're measuring?
There's a paper from twenty twenty-five — I want to say it hit arXiv around May — that compared fine-tuned small language models against prompted large language models on structured generation tasks. The fine-tuned small model outperformed the prompted large model by roughly ten percent on task-specific workflows. Ten percent is meaningful. But the tradeoff is real: fine-tuning requires a labeled dataset and compute time, while prompting gives you rapid iteration with basically zero setup. If you're prototyping, you prompt. If you're shipping something that needs to be reliable and fast, you fine-tune. And that ten percent number — it's an average across tasks. In highly specialized domains, the gap can be much wider. In general-purpose enhancement, it can be narrower. The paper actually had a case study on medical report generation where the fine-tuned small model beat the prompted large model by almost thirty percent, because the large model kept introducing plausible-sounding but clinically wrong details.
That medical example is terrifying and instructive. It's the difference between knowing what words sound right together and knowing what's actually correct. A prompted general model has great linguistic intuition but no domain grounding. A fine-tuned model, if trained on the right data, internalizes the constraints of the domain. It's the difference between a generalist who's read a lot and a specialist who's done the work.
That's why I keep coming back to the dataset. The fine-tuned model is only as good as the pairs it's trained on. If your training data is "bad prompt → pretty image prompt" scraped from Reddit, you're going to get an enhancer that adds "cinematic lighting" and "8K" to everything. If your training data is "bad prompt → prompt that produced the correct medical image," you get something genuinely useful. The quality of the mapping matters more than the size of the model.
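And just to picture what that dataset actually looks like: it's pairs, nothing fancier. Here's a hypothetical slice, where "enhanced" means "the version that actually produced a correct image" — these examples are invented for illustration:

```python
# Hypothetical prompt-pair training examples. The key property: the
# "enhanced" version was validated against the generator's actual
# output, not just judged to be prettier prose.
pairs = [
    {"raw": "a cat on a chair",
     "enhanced": "a tabby cat sitting upright on a wooden kitchen chair, "
                 "full body visible, plain neutral background"},
    {"raw": "red ball left of blue cube",
     "enhanced": "a red ball positioned to the left of a blue cube, both "
                 "objects fully in frame, nothing else in the scene"},
]
```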
Latency matters here, right? If you're sticking a prompt enhancement step into an image generation pipeline, every millisecond counts. A small fine-tuned model running locally is going to be faster than calling out to a big general-purpose model. But how much faster? Are we talking about a noticeable difference for someone sitting at a keyboard?
The ERNIE-Image Prompt Enhancer runs on twenty-four gigs of VRAM — that's consumer-grade, like an RTX thirty ninety or forty ninety. You can run it alongside the diffusion model on the same card. If you were using a prompted general-purpose model, you'd either need a second GPU or you'd be making API calls with network overhead. And in a ComfyUI workflow where you're iterating rapidly, that latency adds up fast. We're talking the difference between a hundred milliseconds locally and maybe two to five seconds round-trip to an API, depending on load. When you're doing fifty generations in a session, that's the difference between a fluid creative flow and constantly checking your phone while you wait. I've seen people in the ComfyUI community building nodes that use local models like Mistral three B through Ollama for exactly this purpose. They're making the same architectural choice — small, local, specialized.
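And for anyone who wants to see how small that integration really is: calling a local model through Ollama for enhancement is a few lines of Python. This assumes Ollama is running locally with a model already pulled; the model name and the system prompt here are just examples:

```python
import requests

SYSTEM = ("You rewrite terse image prompts into detailed, unambiguous "
          "descriptions. Preserve every constraint the user stated and "
          "do not invent new objects, styles, or lighting.")

def enhance_local(prompt: str, model: str = "mistral") -> str:
    # Ollama's local REST API; assumes `ollama serve` is running and the
    # model has been pulled. Non-streaming for simplicity.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "system": SYSTEM,
              "prompt": f"Rewrite this image prompt: {prompt}",
              "stream": False},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["response"].strip()

print(enhance_local("a cat on a chair"))
```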
By the way, today's episode is powered by DeepSeek V four Pro.
DeepSeek's been doing interesting work on the reasoning side. But back to the point — there's a whole taxonomy of approaches emerging here. On one end, you've got the fine-tuned enhancer baked into the model, like ERNIE-Image and Z-Image from Alibaba. Z-Image has a six billion parameter enhancer that does what they call "reasoning-enhanced prompts." On the other end, you've got the academic approach — there's a paper called PromptEnhancer from September twenty twenty-five that uses chain-of-thought rewriting trained through reinforcement learning, with a dedicated reward model that scores prompts against twenty-four specific failure modes. It's model-agnostic, so it works with any pre-trained text-to-image model without modifying weights. They demonstrated it on HunyuanImage two point one.
Twenty-four failure modes. That's granular. What kind of failures are we talking about? And how do you even build a taxonomy like that — do you just sit down and catalog every way an image generation can go wrong?
The taxonomy includes things like object omission, attribute binding errors, spatial relationship failures, counting errors — all the classic diffusion model pain points. The reward model, which they call AlignEvaluator, scores the enhanced prompt against those categories before it ever reaches the image generator. It's a much more structured approach than just saying "make this prompt more detailed." The chain-of-thought rewriter actually reasons about what might go wrong and preemptively addresses it. As for building the taxonomy — the paper's authors basically did a systematic review of failure cases from existing benchmarks. They looked at every published evaluation of diffusion models, cataloged all the documented failure modes, deduplicated and categorized them, and ended up with twenty-four. It's exhaustive but not arbitrary. Each category has clear criteria and examples.
The system prompt approach would be something like "you are a prompt enhancer, expand this brief description into a rich, detailed prompt suitable for image generation." That's fine, but it's generic. It doesn't know about the specific failure modes of the diffusion model downstream. It's like giving someone the instruction "make this better" without telling them what "better" means.
And that's the core insight. A fine-tuned enhancer can be trained on pairs of inadequate prompts and their corrected versions, where "corrected" means "the image actually came out right." The model learns the mapping between sloppy human input and generator-friendly input. A system-prompted general model is guessing based on its general language understanding. It might produce beautiful, detailed prose that completely misses what the diffusion model actually needs. Let me give you an analogy — it's like the difference between a translator who's lived in both cultures and a translator who's only studied the dictionary. The dictionary translator will give you grammatically correct sentences that are culturally clueless. The prompted general model is the dictionary translator.
It explains why the fine-tuned model wins even when it's much smaller — it's not about raw language ability, it's about knowing the specific mapping. A small model that's seen ten thousand examples of "this prompt failed because the spatial relationship was ambiguous → here's how to make it explicit" is going to outperform a large model that's never been trained on that specific task, no matter how eloquent the large model is.
There's also a localization angle here that I don't think gets enough attention. ERNIE-Image's enhancer was trained primarily on Chinese data, and it shows — it absolutely crushes Chinese text rendering benchmarks. LongTextBench Chinese scores zero point nine seven three three with the enhancer enabled. FLUX point two, which doesn't have a dedicated enhancer and wasn't trained heavily on Chinese, scores zero point two one eight three on the same benchmark. That's not a small gap — that's the difference between usable and completely broken.
Zero point two one eight three is abysmal. That's barely above random for text rendering. And it makes sense — prompt enhancement isn't just about adding detail, it's about understanding the linguistic and cultural context of the input. A system prompt in English telling a model to "enhance this prompt" isn't going to magically give it Chinese-language reasoning capabilities. You'd need to fine-tune on Chinese prompt pairs specifically. But this raises a question — what about languages that aren't Chinese or English? Where does that leave everyone else?
That's the long tail problem, and it's brutal. If you're working in Hindi, or Arabic, or Swahili, the off-the-shelf enhancers are essentially useless. You're in fine-tuning territory by default, because nobody has built the dataset for your language. And building that dataset is a significant undertaking — you need native speakers who understand both the language and the image generation task, creating prompt pairs that capture the specific failure modes of the diffusion model in that linguistic context. It's not just translation. The way spatial relationships are expressed in Japanese is different from how they're expressed in English, and the diffusion model might handle those constructions differently. You're mapping between linguistic structures and model behaviors, and that mapping is language-specific.
Daniel's question has a hidden dimension. It's not just "small fine-tuned model versus big prompted model." It's also "which language, which domain, which failure modes are you optimizing for?" A general-purpose model with a system prompt might work fine for English-language photorealistic prompts, and completely fall apart for Chinese text-heavy illustrations. And if you're working in a less-resourced language, the question isn't even build-versus-buy — it's build-versus-nothing.
This connects to something bigger that I think is underexplored. The API gateways Daniel mentioned — Envoy AI Gateway, Grab's gateway, Gravitee's LLM proxy — they're increasingly offering prompt preprocessing, rewriting, and template injection as middleware features. They can route requests to different models based on cost or latency criteria, apply prompt decorators, integrate retrieval-augmented generation for context enrichment. The architectural question shifts from "which model do I use for enhancement?" to "at which layer do I enhance?"
Say more about that. What are the layers? And what drives the decision about which layer to use?
You've got three options. Layer one: enhance at the gateway, as a service. The gateway intercepts the prompt, runs it through an enhancer — maybe a small fine-tuned model, maybe a prompted general model — and forwards the enriched version to the image generator. The user doesn't even know it's happening. Layer two: enhance in the pipeline, as a dedicated model component. This is the ERNIE-Image pattern — the enhancer is a first-class part of the architecture, and the user can toggle it. Layer three: enhance in the prompt itself, through system instructions. You tell the main model "before generating, expand and enrich the user's prompt." Each layer has different tradeoffs in terms of latency, control, and transparency. The gateway layer maximizes convenience but minimizes user control. The pipeline layer gives you the toggle — maximum flexibility. The prompt layer is the simplest to implement but the least reliable, because you're asking a model to do two things at once: enhance and generate.
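Layer one, sketched out, is just middleware. Here's roughly what it looks like with FastAPI; the endpoint path and the enhance_prompt and generate_image calls are hypothetical stand-ins for your enhancer and diffusion service:

```python
# Hypothetical layer-one gateway: intercept the prompt, optionally
# enhance it, forward to the image backend. enhance_prompt() and
# generate_image() are illustrative stand-ins, not a real gateway API.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str
    enhance: bool = True  # expose the toggle rather than deciding silently

@app.post("/v1/images/generate")
def generate(req: GenerateRequest):
    final_prompt = enhance_prompt(req.prompt) if req.enhance else req.prompt
    image_url = generate_image(final_prompt)
    # Return the enhanced prompt too, so the rewrite is never invisible.
    return {"image_url": image_url, "prompt_used": final_prompt}
```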
The gateway approach is fascinating because it abstracts the decision away from the user entirely. But it also means the user has no visibility into what's being added to their prompt. If the enhancer introduces bias or unwanted elements, they might not even know why their images are coming out wrong. It's like a photo app that silently applies a filter to every image — most people might not notice or care, but anyone doing serious work is going to be frustrated when they can't get the raw output they expect.
Right, and that's a real concern. Imagine a prompt enhancer that's been fine-tuned on a dataset that skews toward certain aesthetic preferences — maybe it always adds "cinematic lighting" or "trending on ArtStation" because those were common in the training data. The user types "a simple sketch of a dog" and gets back a hyper-detailed, dramatically lit render. The gateway made a choice on their behalf without them knowing. And here's the insidious part — they might not even realize it's the gateway doing it. They might blame the diffusion model, or their own prompt-writing skills. The opacity creates a debugging nightmare.
Transparency becomes a feature, not just a nice-to-have. ERNIE-Image's toggle is actually a user experience win — you can see what the enhancer does and decide whether you want it. A gateway that silently rewrites your prompts is... I don't know, it feels like a privacy violation at the creative level. You're not just processing my data, you're changing my creative intent without telling me.
I think that's exactly right. And it raises a question about where this is all heading. Are we moving toward a world where prompt enhancement is just assumed to be part of the stack, like image compression or noise reduction? Or does it remain an explicit choice? I think about the evolution of photography — for decades, cameras made all the decisions about exposure and focus, and professionals fought to get manual control. Now every phone has a pro mode. I suspect prompt enhancement follows the same arc.
I suspect it bifurcates. Consumer-facing products will silently enhance prompts because most users don't know how to write good ones and just want pretty pictures. Professional and prosumer tools will expose the enhancer as a controllable component because those users want precision. The API gateways Daniel's looking at probably serve both audiences, which is why they offer it as an optional endpoint rather than baking it into every request. The endpoint is the pro mode — you have to know it exists and explicitly call it.
There's also the question of whether the "small model plus enhancer" pattern is a new paradigm for efficient generation more broadly. ERNIE-Image's architecture is explicitly designed so the three billion parameter enhancer compensates for the eight billion parameter diffusion model's limitations. That's philosophically different from the "bigger model, better understanding" approach that's dominated the field. Instead of making the generator smart enough to understand sloppy prompts, you put a specialized translator in front of it. It's a division of labor — one model understands humans, the other model makes pixels.
It's the adapter pattern, basically. In software engineering, you don't rewrite your entire application to handle a new input format — you write an adapter. The enhancer is an adapter between human communication patterns and model-optimal input patterns. And the adapter pattern has a great property: you can improve the adapter without touching the thing it adapts to. If someone invents a better diffusion model next month, your enhancer still works. You just point it at the new model.
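And the adapter framing is almost literal in code. Here's a minimal sketch, with every name invented for illustration:

```python
# The enhancer depends only on the generator's *interface*, so swapping
# in a new diffusion model never touches the enhancer. All names here
# are invented for illustration.
from typing import Protocol

class Generator(Protocol):
    def generate(self, prompt: str) -> bytes: ...

class Enhancer(Protocol):
    def enhance(self, prompt: str) -> str: ...

class EnhancedGenerator:
    """Compose any enhancer with any generator behind one interface."""
    def __init__(self, enhancer: Enhancer, generator: Generator):
        self.enhancer = enhancer
        self.generator = generator

    def generate(self, prompt: str) -> bytes:
        return self.generator.generate(self.enhancer.enhance(prompt))

# Next month's model swap leaves the enhancer untouched:
# pipeline = EnhancedGenerator(my_enhancer, shiny_new_diffusion_model)
```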
The economics work out. Training an eight billion parameter diffusion model plus a three billion parameter enhancer is still cheaper than training a single monolithic model that's large enough to handle both understanding and generation at high quality. You're decomposing the problem into two specialized components, each of which can be optimized independently. If a new prompting technique comes along, you retrain or fine-tune just the enhancer, not the entire pipeline. That's a massive economic advantage in a field where models become obsolete every six months.
Which brings us back to Daniel's build-versus-buy question. If you're building a pipeline today, do you fine-tune your own enhancer or use an off-the-shelf one?
It depends on your domain specificity. If you're doing general-purpose image generation, the off-the-shelf options are getting good. ERNIE-Image's enhancer is open-source and runs on consumer hardware. The ComfyUI community has nodes that wrap local models like Mistral. You can be up and running in an afternoon. But if you're in a specialized domain — medical imaging, architectural visualization, anything with strict terminology and constraints — you probably need to fine-tune. A general enhancer won't know that "CT scan with contrast in portal venous phase" shouldn't be expanded into "a dramatic medical image with cinematic lighting." And it definitely won't know that adding details to a medical prompt could introduce clinically incorrect elements that have real consequences.
I'm now imagining an enhancer adding "trending on ArtStation" to a CT scan prompt. That's a nightmare. But it's not just funny — it's a real category error. The enhancer has been trained to optimize for aesthetic appeal, and a CT scan has zero aesthetic requirements and very strict accuracy requirements. The optimization target is completely wrong.
That's a malpractice lawsuit waiting to happen. But it illustrates the point — domain specificity matters enormously. And this is where the PromptEnhancer paper's approach gets interesting. If you're fine-tuning with reinforcement learning against a reward model that knows your domain's failure modes, you can train an enhancer that's aligned with your specific needs. The twenty-four failure modes in that paper are a starting point — you could define your own taxonomy. For medical imaging, you might add failure modes like "introduced anatomical impossibility" or "added clinically irrelevant detail." The reward model then explicitly penalizes those, and the enhancer learns to avoid them.
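Schematically, that domain-specific reward is a scorer over failure categories with penalties you choose. This is a toy sketch of the shape of the idea; the paper's actual AlignEvaluator is a learned model, and the categories and weights here are invented:

```python
# Toy reward over failure-mode categories. Categories and weights are
# invented examples; a real AlignEvaluator-style reward model is
# learned, not a weighted checklist.
FAILURE_PENALTIES = {
    "object_omission": 1.0,
    "attribute_binding_error": 1.0,
    "spatial_relationship_failure": 1.0,
    "counting_error": 1.0,
    # Hypothetical domain additions for medical imaging:
    "anatomical_impossibility": 5.0,
    "clinically_irrelevant_detail": 3.0,
}

def reward(detected_failures: set[str]) -> float:
    # detected_failures would come from a learned evaluator that inspects
    # the enhanced prompt (or the generated image) per category.
    return 1.0 - sum(FAILURE_PENALTIES.get(f, 0.0) for f in detected_failures)

print(reward({"clinically_irrelevant_detail"}))  # -2.0
```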
The practical answer to Daniel's question is: start with a system-prompted general model for prototyping, because it's fast and flexible. But if you're shipping something, especially in a specific domain or language, fine-tune a small model. And if you're really serious, look at the reinforcement learning approaches that optimize against actual failure modes rather than just imitating human-written enhanced prompts. That's the ladder — prompt, then fine-tune, then RL-tune.
Keep an eye on the API gateway layer. If gateways are offering prompt enhancement as middleware, the economics might shift. Why maintain your own fine-tuned enhancer if the gateway does it better and cheaper? But that also means you're coupling your pipeline to a third-party service that you don't control. If they change their enhancer model, your output quality changes overnight without you changing anything. That's a real operational risk. I've seen teams burned by this with other ML services — a model update happens silently on the provider side, and suddenly your KPIs tank and you have no idea why.
That's the cloud services tradeoff in a nutshell. Convenience versus control. Same thing we've been debating since AWS launched. And the answer is always the same: it depends on how critical the component is to your product. If prompt quality is a nice-to-have, use the gateway. If it's core to your value proposition, own it.
One more thing I want to flag — there's an interesting benchmark detail from ERNIE-Image that I think is worth sitting with. The enhancer improves reasoning benchmarks and text rendering, but it slightly hurts basic generation scores. GenEval drops from zero point eight eight five six to zero point eight seven two eight. That's not a huge drop, but it's consistent. The enhancer is making the model better at the hard stuff and slightly worse at the easy stuff.
That actually makes intuitive sense. If you take a simple, clear prompt and expand it, you're adding complexity. The diffusion model now has more constraints to satisfy, more details to render correctly. There's more surface area for error. The enhancer is a net positive when the original prompt was inadequate, but it's unnecessary overhead when the original prompt was already good. It's like adding more ingredients to a recipe — if the recipe was too simple, it improves it. If the recipe was already perfect, you've just added clutter.
Which suggests the ideal system isn't one that always enhances — it's one that knows when to enhance. You'd want a classifier or a confidence score that says "this prompt is too vague, enhance it" or "this prompt is already detailed and specific, pass it through unchanged." That's the smart toggle — not a manual switch, but an automatic decision based on the prompt's characteristics.
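Even a crude version of that gate buys you something. Here's a heuristic sketch; a production gate would be a small trained classifier, and the thresholds and keyword list here are made up:

```python
# Crude heuristic gate: only enhance prompts that look underspecified.
# Thresholds and keyword list are made up for illustration; a real gate
# would be a small trained classifier.
STYLE_TERMS = {"lighting", "style", "detailed", "photorealistic", "render"}

def needs_enhancement(prompt: str) -> bool:
    words = prompt.lower().split()
    has_style_terms = any(w in STYLE_TERMS for w in words)
    # Short, style-free prompts are probably vague; long, specific ones
    # are probably deliberate and should pass through untouched.
    return len(words) < 8 and not has_style_terms

def maybe_enhance(prompt: str, enhance) -> str:
    # `enhance` is whichever enhancer you've chosen (see earlier sketches).
    return enhance(prompt) if needs_enhancement(prompt) else prompt
```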
A meta-model that decides whether to invoke the enhancer. Now we're three models deep. But honestly, that's not as crazy as it sounds. This is how complex systems work — you decompose the problem into a pipeline of specialized components, each making a specific decision. The classifier that gates the enhancer is a tiny model, probably just a few hundred million parameters. It's cheap to run and can save you from degrading good prompts.
Laugh if you want, but that's actually where this is heading. The PromptEnhancer paper's AlignEvaluator is essentially that — it scores prompts and only rewrites when there's a predicted failure mode. The architecture gets more complex, but each component is simple and specialized. It's the Unix philosophy applied to AI pipelines. Small, composable tools that each do one thing well, connected by clean interfaces.
Do one thing well. So we've got a clear recommendation emerging. For Daniel's use case — building pipelines that feed into diffusion models — the answer is: use a system-prompted general model to prototype and understand your enhancement needs, then fine-tune a small model once you've collected examples of what good enhancement looks like for your specific domain. And if you're working in a non-English language, fine-tuning is basically mandatory — the off-the-shelf enhancers are heavily English-biased except for ERNIE-Image, which is Chinese-first. Everyone else is on their own.
If you're using ComfyUI, there are already nodes that wrap local models for this. You can experiment without building anything from scratch. The comfyui-prompt-enhancer node uses Mistral three B through Ollama, and Eric's Prompt Enhancers give you five different enhancement strategies to play with. The tooling is there — the question is just whether to invest in fine-tuning for your specific needs. You can literally have this running on your machine this afternoon and start running A/B tests.
One caution I'd add: the benchmarks we've been citing are from the model developers themselves. ERNIE-Image's numbers come from Baidu. Independent benchmarks might tell a different story. I'd want to see third-party evaluations before making a final call. Self-reported benchmarks have a way of being... Not saying Baidu is fudging numbers, but every lab picks the benchmarks and the evaluation setup that shows their model in the best light. It's just good practice to verify.
But the pattern is consistent across multiple independent efforts — Baidu's ERNIE-Image, Alibaba's Z-Image, the academic PromptEnhancer paper, the grassroots ComfyUI nodes. Everyone who's actually built one of these systems has converged on the same architecture: a small, specialized enhancer in front of the generator. Nobody's shipping a system-prompted general model as their production enhancer. That's a strong signal. When you see independent teams all arriving at the same solution without coordinating, it usually means the solution space has a real attractor basin. It's not just a fad.
The market has spoken, in other words. The build-versus-buy question for Daniel becomes: do you use someone else's fine-tuned enhancer, or do you fine-tune your own? The "just prompt a big model" option is the prototyping phase, not the end state. It's the scaffolding you use while you figure out what you actually need, not the permanent structure.
And now: Hilbert's daily fun fact.
The average cumulus cloud weighs approximately one point one million pounds — roughly the same as eighty elephants floating above your head. Which, now that I think about it, is a great example of something where adding detail makes it worse. "Eighty elephants floating above your head" is a vivid image, but it's also terrifying and completely changes the emotional tone of a nice summer day. Sometimes enhancement is counterproductive.
That's a perfect callback. So what should listeners actually do with this? If you're building image or video generation pipelines, start by experimenting with a prompted general model to understand what prompt enhancement even looks like for your use case. Run some A/B tests — enhanced versus unenhanced prompts — and see where the enhancement helps and where it hurts. The ERNIE-Image data shows it's not universally beneficial, so you need to know your own failure modes. Don't assume enhancement is always good.
Once you've got a sense of what good enhancement looks like, look at the off-the-shelf options. If you're in ComfyUI, try the local LLM nodes. If you're building an API, look at whether your gateway offers prompt enhancement middleware. And if your domain is specialized or non-English, plan to fine-tune. Collect a dataset of prompt pairs — inadequate inputs and their corrected versions — and train a small model on that mapping. The academic literature suggests you'll get about a ten percent improvement over prompting a general model, plus lower latency and full control. And that ten percent compounds when you're generating thousands of images a day.
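If you do go down the fine-tuning road, the mechanics are standard supervised fine-tuning on those pairs. Here's a compressed sketch with Hugging Face transformers, where the model name and hyperparameters are placeholders and `pairs` is the dataset of validated prompt pairs from earlier:

```python
# Compressed SFT sketch: fine-tune a small causal LM on raw -> enhanced
# prompt pairs. Model name and hyperparameters are placeholders; `pairs`
# is a list of {"raw": ..., "enhanced": ...} dicts.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_name = "some-small-3b-model"  # placeholder, not a recommendation
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.pad_token or tok.eos_token

def to_text(ex):
    return {"text": f"Raw prompt: {ex['raw']}\n"
                    f"Enhanced prompt: {ex['enhanced']}{tok.eos_token}"}

ds = (Dataset.from_list(pairs)
      .map(to_text)
      .map(lambda ex: tok(ex["text"], truncation=True, max_length=512),
           remove_columns=["raw", "enhanced", "text"]))

trainer = Trainer(
    model=AutoModelForCausalLM.from_pretrained(model_name),
    args=TrainingArguments(output_dir="prompt-enhancer-sft",
                           per_device_train_batch_size=4,
                           num_train_epochs=3),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```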
The other practical takeaway: expose the toggle. Whatever enhancement approach you choose, let users turn it off. The transparency is worth it, and some prompts don't need enhancement. Don't make the creative decision on behalf of your users. Even if you think you know better — especially if you think you know better — give them the switch.
If you're evaluating models, pay attention to the benchmarks that matter for your use case. GenEval measures basic instruction following. LongTextBench measures text rendering. OneIG measures reasoning. ERNIE-Image's enhancer helps with reasoning and text but slightly hurts basic generation. Know which tradeoff you're making, and make it explicitly. Don't just look at the aggregate score and assume the enhancer is better — look at the breakdown and see if the things it improves are the things you care about.
Here's what I'm left wondering. We've been talking about prompt enhancement for diffusion models, but this pattern — small specialized model in front of a larger generative model — seems applicable much more broadly. Could you put a fine-tuned enhancer in front of a code generation model? A music generation model? The adapter pattern isn't specific to images. Are we looking at the early stages of a general architectural principle?
I think you absolutely could, and I'd expect to see it emerge across modalities. The fundamental insight is that human communication is messy and contextual, and generative models work best with structured, specific input. An enhancer bridges that gap. Whether it's images, code, music, or text, the pattern holds. The specifics of the fine-tuning dataset and the failure mode taxonomy change, but the architecture doesn't. For code generation, your enhancer might take a vague feature description and expand it into a detailed specification with edge cases and constraints. For music, it might take "something upbeat with a good bassline" and expand it into a structured prompt with tempo, key, instrumentation, and genre references. The adapter pattern is universal — it's just the mapping that's domain-specific.
Something to watch. And I think the next year is going to be really interesting on this front — as more modalities get production-grade generative models, we're going to see enhancers pop up for each of them. The question is whether we get general-purpose enhancers that work across modalities, or whether each modality needs its own specialized enhancer. My bet is on specialization, but I've been wrong before.
I'd bet on specialization too, for the same reason the fine-tuned small model beats the prompted large model — the failure modes are different across modalities. What makes a good image prompt is not what makes a good code prompt. A general-purpose enhancer would have to be enormous to cover all those cases, and at that point you're back to the "just prompt a big model" approach with all its limitations. Specialization wins on efficiency and accuracy.
Thanks to Hilbert Flumingtop for producing, as always.
This has been My Weird Prompts. Find us at myweirdprompts dot com, and if you're enjoying the show, leave us a review wherever you listen.
We'll be back with another one soon.