Daniel sent us this one — he's asking about building a conversational LLM that doesn't default to the cheerful, hedged, over-explaining assistant vibe. Something with a bit of bite, or something that assumes you already know your way around a topic instead of starting from zero. His specific question: if you wanted to go deeper than system prompting and actually shift the model's baseline behavior, could you do it by generating a few hundred synthetic prompts, running them through a strong model, rewriting the responses to match the tone you want, and training on those pairs? And if that works, what's the formal name for it, and what open-source model makes a sensible starting point?
This recipe not only works — it's basically the state of the art right now. And there's a perfect case study that dropped just a couple months ago. A developer named Benito Martin built what he called the Grumpy Italian Chef, fine-tuning a one-point-two-billion-parameter model into this dramatic, opinionated cooking assistant. The thing yells at you about rinsing pasta. Let me find the exact line.
I'm already invested.
You ask it whether to rinse pasta, and the base model says something like, quote, rinsing pasta is usually not recommended unless making cold pasta dishes. The fine-tuned version says — and I wrote this down — Rinse it? You wash away the starch, the flavor, the soul. Pasta is not laundry.
That's magnificent. And this was done with the exact pipeline Daniel's describing?
He generated two hundred ninety-nine synthetic prompt-response pairs — two hundred fifty-four for training, thirty for evaluation, fifteen held out. Supervised fine-tuning to teach the style, then a second stage called Direct Preference Optimization to refine the preferences. Training took ten to fifteen minutes on a consumer GPU. A six-gigabyte RTX thirty-sixty, using something called Unsloth for acceleration.
Ten to fifteen minutes. On a consumer card.
That's the part that gets me. He only updated about one point eight six percent of the model's parameters — twenty-two point two million trainable parameters out of one point two billion. The rest stayed frozen. And the result is a model that has an actual personality baked into the weights, not just a system prompt that falls apart the moment you push on it.
Let's unpack the pipeline, because Daniel's intuition about synthetic data and rewriting responses is basically correct, but there's a formal framework here. What's it called?
SFT plus DPO. Supervised Fine-Tuning followed by Direct Preference Optimization. Stage one, SFT — you train on prompt plus chosen-response pairs. The model sees the prompt and the response you want, and it learns the style through repetition. In the Grumpy Chef project, validation loss dropped from five point oh five to one point eight six over five epochs. That's a huge signal that the model is internalizing the pattern.
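For listeners who want to see what stage one looks like in code, here's a minimal sketch of an SFT run using Hugging Face's TRL library — assuming a recent TRL version; the model name, hyperparameters, and example pair are illustrative, not the Grumpy Chef author's exact configuration:

```python
# Stage one: supervised fine-tuning on prompt + chosen-response pairs (illustrative sketch).
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Each training example is a conversation: the user prompt plus the response
# written in the target voice. In practice you'd have a few hundred of these.
pairs = [
    {"messages": [
        {"role": "user", "content": "Should I rinse my pasta after draining it?"},
        {"role": "assistant", "content": "Rinse it? You wash away the starch, the flavor, the soul. Pasta is not laundry."},
    ]},
]
train_dataset = Dataset.from_list(pairs)

config = SFTConfig(
    output_dir="sft-personality",
    num_train_epochs=5,              # the Grumpy Chef project trained for five epochs
    per_device_train_batch_size=2,
    learning_rate=2e-4,
)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",  # any instruction-tuned causal LM works here
    args=config,
    train_dataset=train_dataset,
)
trainer.train()
```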
Stage two, DPO — this is where it gets interesting, right? Because SFT alone will teach the model a style, but it doesn't teach it to prefer that style over alternatives.
DPO trains on triplets — prompt, chosen response, rejected response. You show the model both versions and you train it to understand that the chosen one is better. What's elegant about DPO is it doesn't require a separate reward model. The original RLHF approach — reinforcement learning from human feedback — needed this whole separate model to score responses. DPO directly optimizes the policy from preference pairs. It's simpler, more stable, and according to OpenAI's own fine-tuning guide from this month, doing SFT before DPO stabilizes training and prevents overfitting.
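Concretely, a single DPO training example is just a triplet in the prompt/chosen/rejected format that TRL's DPOTrainer expects — the content below is illustrative:

```python
# One DPO preference triplet: the rejected side is simply the base model's default answer.
dpo_example = {
    "prompt": "Should I rinse my pasta after draining it?",
    "chosen": "Rinse it? You wash away the starch, the flavor, the soul. Pasta is not laundry.",
    "rejected": "Rinsing pasta is usually not recommended unless you're making a cold pasta dish.",
}
```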
You're establishing a baseline with SFT, then using DPO to dial in the preference. And the DPO stage has this knob — I saw it called the beta parameter.
Range zero to two, roughly. Lower beta means more aggressive personality shift — the model weights the preference data more heavily. Higher beta keeps the model closer to its original behavior. OpenAI recommends starting at zero point one and adjusting downward if the preference signal seems weak. It's basically the personality intensity dial. If you want your model to be mildly sarcastic versus actively insulting, beta is where you control that gradient.
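As a sketch of where that knob lives, TRL exposes beta directly on its DPO configuration — again assuming a recent TRL version, with the checkpoint path and hyperparameters as placeholders:

```python
# Stage two: DPO on preference triplets, with beta as the personality intensity dial.
from datasets import Dataset
from trl import DPOConfig, DPOTrainer

preference_data = Dataset.from_list([
    {
        "prompt": "Should I rinse my pasta after draining it?",
        "chosen": "Rinse it? You wash away the starch, the flavor, the soul. Pasta is not laundry.",
        "rejected": "Rinsing pasta is usually not recommended unless you're making a cold pasta dish.",
    },
    # ... hundreds more triplets in practice
])

config = DPOConfig(
    output_dir="dpo-personality",
    beta=0.1,                        # start at 0.1; lower it if the preference signal seems weak
    num_train_epochs=1,
    per_device_train_batch_size=2,
)

trainer = DPOTrainer(
    model="sft-personality",         # placeholder path to the saved SFT checkpoint, not the raw base model
    args=config,
    train_dataset=preference_data,
)
trainer.train()
```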
Daniel's instinct about generating synthetic prompts and rewriting responses maps perfectly onto the SFT stage. But there's a deeper question here about how much data you actually need. Two hundred ninety-nine examples seems almost absurdly small.
It does, and there's a genuine debate here. The Grumpy Chef project got dramatic results with under three hundred examples. But then there's a paper submitted to ICLR this year — Open Character Training by Maiya and colleagues — that used ten thousand examples per persona across eleven different character types. Sarcastic, poetic, loving, even a malevolent one. They fine-tuned Llama three-point-one eight-billion, Qwen two-point-five seven-billion, and Gemma three four-billion.
Ten thousand versus two hundred ninety-nine. That's a thirty-three-X difference. Is one of them wrong?
I don't think so. I think they're optimizing for different things. The Grumpy Chef is a style transfer problem — the base model already knows how pasta works. You're not teaching it new knowledge, you're teaching it the difference between "cook pasta al dente" and "if your pasta is mushy, you've committed a crime against Italy." Style fine-tuning probably requires far less data than knowledge fine-tuning. The Open Character Training paper was going for something more robust — they wanted personas that would hold up under adversarial attacks, where you're actively trying to make the character break.
That's a crucial distinction. If I just want a model that's terse and assumes expertise instead of explaining basic concepts, I'm doing style transfer. If I want a model that stays in character when I'm actively trying to jailbreak it, I need something closer to that ten-thousand-example approach.
The Open Character Training paper actually tested this directly. They compared fine-tuned models against system-prompted models and something called activation steering. Under adversarial break-character prompts, the system-prompted models collapsed immediately. The character-trained models maintained persona. The reviewers noted this as a genuine strength, even though the paper ultimately got rejected — partly because the evaluation relied too much on LLM-as-a-judge without human validation.
Which raises a fascinating measurement problem. How do you even evaluate whether a personality fine-tune worked when the ground truth is inherently subjective?
That's the meta-problem nobody's fully solved. With the Grumpy Chef, you can sort of tell — the responses are obviously different in tone. But for subtler personality shifts, you're stuck with either human evaluation, which is expensive and noisy, or using another LLM as a judge, which introduces its own biases. The Open Character Training reviewers flagged this as a weakness. If you're training a model to be slightly more direct or slightly less hedged, measuring that quantitatively is genuinely hard.
Let's get concrete about the technique, because Daniel asked specifically about the synthetic data approach and there's a variation here worth highlighting. The Open Character Training paper used something called Constitutional AI for persona shaping. What does that mean in practice?
Instead of manually rewriting every response — which is what Daniel described, and what the Grumpy Chef did — you write a short constitution. Maybe ten assertions about the persona. Things like "you are terse," "you never explain basic concepts," "you assume the user is technically competent," "you use dry humor." Then you feed that constitution to a strong teacher model and have it generate training data that follows those rules. Then you fine-tune your target model on that data.
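As a rough sketch of what that looks like in practice — the teacher model, the client, and the constitution text below are all placeholder choices, not what the paper used:

```python
# Sketch: distilling a short persona constitution into synthetic training data
# by asking a stronger teacher model to answer prompts while obeying the rules.
from openai import OpenAI

CONSTITUTION = [
    "You are terse.",
    "You never explain basic concepts unless explicitly asked.",
    "You assume the user is technically competent.",
    "You use dry humor sparingly.",
]

client = OpenAI()  # assumes an API key is configured in the environment

def generate_example(user_prompt: str) -> dict:
    """Have the teacher answer user_prompt in the persona defined by the constitution."""
    system = "Answer the user's question. Follow these rules exactly:\n" + "\n".join(
        f"- {rule}" for rule in CONSTITUTION
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder teacher model
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user_prompt},
        ],
    )
    return {"prompt": user_prompt, "chosen": response.choices[0].message.content}
```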
You're automating the rewrite step.
You're distilling the constitution into training data. The paper's ablation studies showed something interesting — neither self-reflection data nor self-interaction data alone was sufficient. You needed both for optimal robustness and coherence. The model needs to practice both reflecting on its own persona and interacting with users in that persona.
This connects to the broader industry pattern right now — model distillation. Use a large teacher model to generate synthetic training data, then fine-tune a smaller student model. You get near-teacher quality at something like a tenth of the inference cost for specific tasks and personality applications.
That's the practical answer to Daniel's question about which open-source model to start with. The answer depends on your hardware budget and your performance needs, but there are clear tiers now. At the top, Llama three-point-one eight-billion from Meta — it's the most versatile, widely supported, strong instruction-following baseline, and the Open Character Training paper used it as their primary model. Qwen two-point-five seven-billion from Alibaba is strong for multilingual personality alignment. Mistral seven-billion is an efficient dense option — if you specifically want a mixture-of-experts architecture, that's Mistral's Mixtral line rather than the seven-billion model.
If you're prototyping on a modest GPU?
Gemma three four-billion from Google runs on modest hardware — the Open Character Training paper found it the weakest of the three they tested, but still viable. And then there's the Liquid AI model the Grumpy Chef used — LFM two-point-five one-point-two-billion. That's the one that ran on a six-gigabyte card in fifteen minutes. If you're just experimenting and want to iterate fast, starting tiny makes sense. You can test your pipeline, verify that your training data is producing the tone you want, and then scale up to a larger model once you're confident.
What about the tooling? Daniel didn't ask specifically, but if someone's actually going to do this, they need to know what frameworks exist.
Three main ones right now. Unsloth — that's what the Grumpy Chef used — claims one point five times faster training with fifty percent less VRAM. LlamaFactory supports over a hundred models and has a web interface, so it's more accessible if you're not comfortable at the command line. And Axolotl is the power user option — more flexible but steeper learning curve. All three support LoRA fine-tuning, which is the technique where you only update a small percentage of the model's weights. That's how you get training times measured in minutes instead of days.
LoRA being Low-Rank Adaptation — you're adding small trainable matrices to the existing layers rather than modifying the full weight matrices.
The Grumpy Chef updated one point eight six percent of parameters. That's the difference between needing a data center GPU cluster and running it on the card you already have in your gaming PC.
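To make those small trainable matrices concrete, here's a minimal LoRA setup with the PEFT library — the rank, alpha, and target modules are typical defaults, not the Grumpy Chef project's actual values:

```python
# LoRA: attach small low-rank adapter matrices to a frozen base model.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

lora = LoraConfig(
    r=16,                           # rank of the low-rank update matrices
    lora_alpha=32,                  # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora)
model.print_trainable_parameters()  # reports trainable vs. total parameters, typically 1-2%
```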
There's a philosophical question underneath all of this that I think is worth poking at. When we talk about making a model less helpful, less hedged, giving it bite — is this alignment or misalignment? The Open Character Training paper trained a malevolent persona alongside the loving and flourishing ones, explicitly framing it as safety research. But the same technique could train an assistant that's deliberately rude, dismissive, or politically slanted.
I think that framing misses something important. The default ChatGPT helpfulness isn't neutral — it's a specific design choice. The cheerfulness, the hedging, the relentless positivity, the refusal to take positions on contested topics — that's an alignment decision someone made. Asking for a model that's direct, terse, and assumes competence isn't misalignment. It's different alignment. The question is who gets to decide what the baseline personality should be.
The system prompt ceiling that Daniel's implicitly pushing against — there's a real technical limitation there. System prompts are a veneer. They tell the model how to act, but they don't change what the model fundamentally is. Push hard enough and the mask slips.
That's exactly what the Open Character Training results showed. Under adversarial prompts designed to break character, system-prompted models reverted immediately to their base behavior. The fine-tuned models didn't. If you want a different baseline personality, you need to bake it into the weights. System prompting is like giving an actor a role. Fine-tuning is like raising someone in a different culture.
That's a useful distinction. So for Daniel's use case — an assistant that explains technical things assuming expertise rather than starting from zero — system prompting would work for casual use, but it would be fragile. The moment you asked something that triggered the model's default helpfulness instinct, it would start explaining basic concepts again.
That's the subtle failure mode. It's not that the system prompt fails catastrophically. It's that it fails in edge cases you don't notice until you're relying on the model for something important. You ask a follow-up question and suddenly it's explaining what an API is when you were discussing API design patterns. The fine-tuned model has internalized that you don't need the basics, so it doesn't revert.
Let's talk about the data generation step more concretely, because I think that's where a lot of people would get stuck. Daniel described generating a few hundred synthetic prompts, running them through a strong model, then rewriting the responses. The rewrite step is the bottleneck — that's manual labor.
There are now ways to reduce that labor. The Constitutional AI approach we mentioned is one — write ten assertions and let a teacher model generate the data. Another approach is to do iterative refinement. Generate a batch of responses, rate them for how well they match your target tone, then use the best ones as few-shot examples for the next batch. You can also use the DPO stage itself to amplify a small number of hand-written examples. If you manually rewrite fifty responses really carefully, you can use those as your chosen responses in DPO, with the original model outputs as rejected responses. The model learns the preference pattern from those fifty pairs and generalizes.
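A sketch of that amplification trick — pairing your hand-rewritten responses (chosen) against the model's original outputs (rejected); the file name is hypothetical:

```python
# Turn a small set of careful rewrites into DPO preference pairs.
import json

def build_preference_pairs(prompts, original_responses, rewritten_responses,
                           path="dpo_pairs.jsonl"):
    """Write one prompt/chosen/rejected triplet per line: your rewrite is chosen,
    the model's default answer is rejected."""
    with open(path, "w") as f:
        for prompt, rejected, chosen in zip(prompts, original_responses, rewritten_responses):
            record = {"prompt": prompt, "chosen": chosen, "rejected": rejected}
            f.write(json.dumps(record) + "\n")
```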
You don't necessarily need hundreds of hand-rewritten responses. You need enough to establish a clear preference signal, and then the DPO stage amplifies it.
The Grumpy Chef used two hundred ninety-nine total examples, but not all of those were hand-rewritten from scratch. Many were generated by prompting a larger model with style instructions, then curated. The key insight is that you're teaching preference, not knowledge. The model already knows the facts. You're teaching it which way of expressing those facts you prefer.
There's a dimension here we haven't touched — the base model's existing personality. If you start with a model that's already been instruction-tuned to be helpful and harmless, you're fighting against that training. Would you get better results starting from a base model that hasn't been instruction-tuned at all?
That's a sharp question. The Grumpy Chef started from a base model — LFM two-point-five one-point-two-billion Base, not the Instruct version. The Open Character Training paper used instruction-tuned models as their starting point. I haven't seen a direct comparison, but my instinct is that starting from a base model gives you more freedom. The instruction-tuned models have had helpfulness and harmlessness drilled into them across millions of examples. You're trying to unlearn that, which is harder than learning it fresh.
Counterpoint — the instruction-tuned models already know how to follow instructions and structure responses coherently. Starting from a base model, you might need to teach basic conversational competence alongside the personality, which would require more data.
That's fair. And the Open Character Training results suggest instruction-tuned models work fine as a starting point — their personas held up well. I think the practical answer is: if you're doing a dramatic personality shift like the Grumpy Chef, base model. If you're doing a subtler shift like making responses more terse and expert-level, instruction-tuned is fine and probably faster.
Let's get back to Daniel's specific use case, because I think there's a version of this that a lot of technical people would want. An assistant that assumes you know your field, doesn't hedge every statement, doesn't end every response with a paragraph of caveats, and doesn't sound like it's trying to be your best friend. What's the minimum viable pipeline?
Minimum viable: pick an instruction-tuned model — Llama three-point-one eight-billion Instruct if you've got the hardware, Qwen two-point-five seven-billion if you need something lighter. Generate maybe a hundred prompts in your domain — technical questions at various difficulty levels. Run them through the base instruct model to get the default helpful responses. Then rewrite fifty of those responses to match your target tone — cut the caveats, raise the assumed knowledge level, add dry humor if you want it, make the responses more direct. Use those fifty pairs for SFT, then use the same fifty as chosen with the original responses as rejected for DPO. Train with Unsloth or LlamaFactory. Beta at zero point one. See what you get and iterate.
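One step in that pipeline worth seeing in code is collecting the default responses you'll rewrite — a rough sketch using the transformers text-generation pipeline, with the model name and prompts as placeholders:

```python
# Collect the instruct model's default answers so you have something to rewrite.
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")

prompts = [
    "How should I structure retries against a flaky downstream API?",
    "When does it make sense to denormalize a relational schema?",
    # ... roughly a hundred domain prompts at varying difficulty
]

defaults = []
for p in prompts:
    # For cleaner results you'd apply the model's chat template to each prompt first.
    out = generator(p, max_new_tokens=300, return_full_text=False)
    defaults.append({"prompt": p, "original": out[0]["generated_text"]})
```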
That's an afternoon of work.
The rewrite step is the bottleneck, but fifty responses at maybe two minutes each — that's under two hours. The actual training, if you're using LoRA on a decent GPU, is maybe twenty minutes for SFT and another ten for DPO. You could go from idea to deployed model in a weekend.
That's accessible. And the cost is essentially zero beyond the electricity and the GPU you might already have.
The Grumpy Chef author did the whole thing on a card that costs a couple hundred dollars. This is not a big-budget enterprise project. This is a hobbyist-with-a-weekend project.
There's one more thing I want to pull out of the research that I think is underappreciated. The Open Character Training paper found that character-trained models preserved their general benchmark performance. They didn't get worse at reasoning or factual recall while gaining the new persona. That's not obvious — you'd expect a trade-off.
That's the LoRA effect, I think. Because you're only updating a tiny fraction of the parameters, the model's core knowledge and reasoning capabilities stay intact. You're adding a style layer on top, not rewriting the foundation. It's like changing the paint job without touching the engine.
Which brings us back to Daniel's original question about going deeper than system prompting. The system prompt approach is like putting a bumper sticker on the car. Fine-tuning is repainting it. The car is still the same car, but the color is different at a fundamental level. You can't scratch it off.
The DPO stage is what makes the paint stick. SFT alone would teach the model how to produce the new style, but DPO teaches it that the new style is better than the old one. That preference signal is what makes the personality robust rather than superficial.
If someone listening wanted to go even further — not just change the tone but create a distinct character with consistent opinions, knowledge boundaries, and interaction patterns — is the pipeline fundamentally different or just bigger?
The Open Character Training paper's approach is the template. Write a detailed constitution — not just tone instructions but beliefs, knowledge domains, conversational patterns, things the character would and wouldn't say. Use a strong teacher model to generate thousands of training examples that embody that constitution. Do SFT plus DPO on a larger model — eight-billion parameters or more. And expect to iterate on the constitution as you discover edge cases where the character doesn't behave the way you intended.
The constitution becomes the design document. You're not just specifying outputs, you're specifying a coherent persona, and the training pipeline is the implementation.
This is where the measurement problem gets really acute. For a simple style shift — make it more terse — you can evaluate by just reading the outputs. For a complex persona, you need systematic evaluation. Does the character stay consistent across long conversations? Does it break under adversarial prompts? Does it express the right opinions on topics it should have opinions about? The Open Character Training paper used LLM-as-a-judge for this, and the reviewers pushed back because you're essentially asking one AI to evaluate another AI's personality. There's no ground truth.
Which is both a research problem and a practical one. If I'm building a character-trained model for my own use, I can evaluate it by using it and seeing if I like it. If I'm building one for others to use, I need something more rigorous, and that rigor doesn't fully exist yet.
Now: Hilbert's daily fun fact.
The collective noun for a group of porcupines is a prickle.
If someone wants to actually do this, where should they start? I'd say pick a small, contained persona shift for your first project. Don't try to build a fully realized character with opinions and backstory. Just take an existing instruct model and make it more direct, less hedged, more technically assuming. Fifty to a hundred examples. Use Unsloth if you're comfortable with Python, LlamaFactory if you want a web interface. Run SFT for a few epochs until validation loss plateaus, then DPO for one or two epochs with a low beta. Test it on prompts you didn't include in training and see if the tone holds.
If it doesn't hold, your DPO beta is probably too high, or your training examples aren't distinctive enough. The model needs a clear signal about what makes the chosen response different from the rejected one. If the difference is subtle — slightly fewer hedge words, slightly more direct — the model might not pick it up from fifty examples.
That's where curating your training data carefully matters. If you're trying to reduce hedging, make sure your chosen responses have zero hedge words and your rejected responses are full of them. Exaggerate the difference in the training data. The model will learn a stronger preference signal, and in practice it'll land somewhere in the middle, which is probably what you want.
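If you want to enforce that exaggeration mechanically, a crude screen like this one — the hedge-word list and thresholds are arbitrary — can filter your pairs before training:

```python
# Keep only pairs where the preference signal is unambiguous:
# the chosen response contains no hedge phrases and the rejected one contains several.
import re

HEDGES = ("might", "may", "could", "perhaps", "generally", "it depends", "keep in mind")

def hedge_count(text: str) -> int:
    lowered = text.lower()
    return sum(len(re.findall(r"\b" + re.escape(h) + r"\b", lowered)) for h in HEDGES)

def strong_signal(pair: dict) -> bool:
    return hedge_count(pair["chosen"]) == 0 and hedge_count(pair["rejected"]) >= 2

# pairs = [p for p in pairs if strong_signal(p)]
```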
Amplify the signal in training to get a moderate effect in deployment. That's a useful heuristic.
The other practical tip: save your training data. The prompts, the original responses, your rewritten responses, your training configs. You will want to iterate, and having a clean baseline makes the second attempt much faster. The Grumpy Chef author published his entire pipeline, which is why we can talk about it in this much detail.
One last thing I want to flag — Daniel mentioned the default helpfulness being grating, and I think there's something real there that goes beyond personal preference. When an assistant over-explains something you already know, it's not just annoying. It's a failure of the model to accurately assess your level of expertise. And that failure has consequences — it erodes trust, it wastes time, it makes the interaction feel less like a conversation with a knowledgeable colleague and more like being trapped next to someone at a dinner party who assumes you know nothing.
The model doesn't know what you know, so it defaults to assuming you know nothing, because that's safer than assuming expertise and leaving you confused. But safer for whom? It's safer for the company that doesn't want users complaining that the AI was too advanced. It's not safer for the expert user who just wanted a straight answer.
That's the value proposition of doing your own fine-tune. You're not building a better general-purpose assistant. You're building an assistant that's better for you specifically, with your knowledge level and your preferences. The general-purpose model has to serve everyone, which means it serves no one particularly well. The fine-tuned model serves one person or one community, and it can be optimized for that without apology.
That's a good place to land. The pipeline works, it's accessible, and the open-source ecosystem has matured to the point where the barrier is more about willingness to spend a weekend tinkering than about technical difficulty or cost. Thanks to Hilbert Flumingtop for producing. This has been My Weird Prompts. Find us at myweirdprompts dot com or wherever you get your podcasts.
We'll be back with another one soon.