Welcome to My Weird Prompts. I'm Corn, my brother Herman is here as always, and today we are doing an AI Model Spotlight. The model is Aion-2.0, built by a lab called AionLabs. Herman, before we get into what the model actually does, let's start with who built it, because the lab backstory here is genuinely unusual.
AionLabs is an Israel-based AI venture studio, and when I say venture studio I mean that fairly precisely. They are not primarily a consumer AI company. Their core business is spinning up AI-driven drug discovery startups, backing them, and in some cases exiting them. They have pharma heavyweights on their backer list, AstraZeneca, Merck, Pfizer, and Teva, and AWS on the tech infrastructure side. That is a serious roster for a studio of this type.
This is a biopharma-adjacent AI lab. Not the typical profile for a company releasing a roleplay-optimised language model.
No, it really is not. Their portfolio work is things like CombinAble.AI, which was acquired by Insitro and was described as the first Israeli exit for the studio. They have a startup called ProPhet focused on AI drug discovery. They have been doing a global call with BioMed X around generative AI for novel target combinations in oncology. That is the day job.
Where does Aion-2.0 fit into that picture?
That is the honest question, and I do not have a clean answer from the available information. What we can say is that they have been building out a separate model family alongside the life sciences work. Aion-1.0 came out in February 2025, built on DeepSeek-R1 with a focus on reasoning and coding. Then Aion-2.0 dropped in February 2026, and it takes a sharp turn toward creative and roleplay use cases. Whether the model work is a commercial sideline, a capability demonstration, or something that feeds back into their research tooling, the lab has not said publicly.
The lab has credibility in the AI space, a real funding base, but the connection between their pharma studio work and this particular model is not spelled out.
We know who they are. We do not know exactly why they are here.
Let's get into what this model actually is under the hood. What are we looking at architecturally?
The starting point is DeepSeek V3.2. Aion-2.0 is a fine-tuned variant of that base model, optimised specifically for roleplay and storytelling. The model card does not give us a parameter count directly, so we cannot do a clean apples-to-apples comparison on size. What we can say is that DeepSeek V3.2 is in the very large model tier, and Aion-2.0 inherits that architecture.
What does that architecture actually mean in practice? DeepSeek V3.2 is not a household name for everyone listening.
Right, so DeepSeek's V-series models use a Mixture of Experts architecture, which means not all parameters are active on every forward pass. The model routes each token through a subset of specialised sub-networks rather than lighting up the whole thing every time. The practical upshot is that you get a model that behaves like a very large model in terms of capability but is more efficient to run than a dense model of equivalent total parameter count. Now, the model card for Aion-2.0 does not confirm or deny the specifics of how that architecture is preserved or modified in the fine-tune, so we are working from what we know about the base.
What about the fine-tuning itself? What did AionLabs actually do to get from DeepSeek V3.2 to something optimised for narrative fiction?
We do not know the methodology. The card does not say whether this was supervised fine-tuning, direct preference optimisation, reinforcement learning from human feedback, or some combination. That is a genuine gap. What they do describe is the output they were aiming for: a model that introduces tension, crises, and conflict into stories, handles mature and darker themes with nuance, and sustains immersive character interaction. Whether that was achieved through curated data, preference tuning, or something else entirely, they have not disclosed.
Context window, output limits, what are we working with?
One hundred and thirty-one thousand and seventy-two token context window. Max output is thirty-two thousand seven hundred and sixty-eight tokens. So you have a lot of room to run a long-form narrative session, feed in extensive character sheets, world-building documents, prior conversation history, that kind of thing. The output cap is generous enough that you are not going to hit a wall mid-scene in most realistic use cases.
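As a quick back-of-the-envelope, those two figures together set how much room you actually have for input per request. A minimal sketch in Python (the window and output cap come from the model card; the helper function is ours):

```python
# Context budgeting for a long-form narrative session.
# Figures from the Aion-2.0 model card: 131,072-token context
# window and a 32,768-token max output.
CONTEXT_WINDOW = 131_072
MAX_OUTPUT = 32_768

def input_budget(reserved_output: int = MAX_OUTPUT) -> int:
    """Tokens left for system prompt, character sheets, and prior
    history once we reserve room for the response."""
    return CONTEXT_WINDOW - reserved_output

# Even reserving the full output cap leaves roughly 98k tokens of input room.
print(input_budget())  # 98304
```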
There is something interesting on the API side around reasoning tokens.
Yes, and this is worth flagging because it is not a given on a fine-tuned model. Aion-2.0 exposes reasoning tokens through the API. There is a reasoning details array in the response and a reasoning tokens field in the usage object. So if you are building on top of this model, you can actually inspect the chain-of-thought the model produced before arriving at its output. That is useful for debugging, for understanding why a narrative went a particular direction, and potentially for building interfaces that surface that reasoning to end users. Whether that reasoning layer was part of the original DeepSeek V3.2 base or something AionLabs added or preserved through fine-tuning, the card does not say. But it is there, and it is accessible.
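To make that concrete, here is a hedged sketch of pulling those fields out of a parsed response. The key names follow the description on the card, a reasoning details array and a reasoning tokens field in usage, but treat the exact shapes here as assumptions rather than a verified schema:

```python
# Sketch: extracting the reasoning layer from an OpenRouter-style
# chat completion response. Field names ("reasoning_details",
# "usage.reasoning_tokens") follow the model card's description;
# the surrounding structure is an assumption.

def extract_reasoning(response: dict) -> tuple[list, int]:
    """Return (reasoning_details, reasoning_token_count) from a
    parsed JSON response, tolerating either field being absent."""
    choice = response.get("choices", [{}])[0]
    details = choice.get("message", {}).get("reasoning_details", [])
    tokens = response.get("usage", {}).get("reasoning_tokens", 0)
    return details, tokens

# A trimmed example payload of the kind the API might return:
sample = {
    "choices": [{"message": {
        "content": "The storm breaks over the harbour...",
        "reasoning_details": [{"type": "reasoning.text",
                               "text": "Escalate the scene tension."}],
    }}],
    "usage": {"reasoning_tokens": 42},
}
details, n = extract_reasoning(sample)
print(n)  # 42
```

Defensive `.get()` chains matter here because, as noted, there is no guarantee the fine-tune emits reasoning on every response.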
Supported sampling parameters are fairly standard?
Pretty minimal list actually. Max tokens, temperature, top P. No tool calling or function calling parameters are listed, which is worth noting. We will come back to what that means for certain workloads.
Alright, let us talk cost. Herman, before you get into the numbers, I know we have a standing caveat on this series.
Right, and it matters here. All pricing we are about to cite is as of April twentieth, twenty twenty-six. These numbers shift, sometimes weekly, so treat everything I am about to say as a point-in-time snapshot rather than a stable reference.
What are we looking at?
The rack rate is eighty cents per million input tokens and one dollar sixty per million output tokens. That two-to-one output-to-input ratio is fairly typical for this tier of model. There is no tiered pricing listed, no batch discount, no volume tier. What you see is what you pay.
There is a cache read price as well.
Yes, twenty cents per million tokens on cached input reads. That is a quarter of the standard input rate, which is a meaningful discount if your application is structured to take advantage of it. And the usage data suggests a lot of applications are doing exactly that. The weighted average effective input price over the hour prior to our snapshot was thirty-six point three cents per million tokens, which is well below the eighty cent list rate. That is being driven by a seventy-two point eight percent cache hit rate on the AionLabs provider. So in practice, workloads that are hitting this model are paying significantly less on input than the headline number suggests.
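You can reproduce that effective rate from the two list prices and the cache hit rate. A quick sanity check in Python, with prices per million tokens as of the April 2026 snapshot:

```python
# Blended effective input price given a prefix-cache hit rate.
# List prices from the snapshot: $0.80/M standard input,
# $0.20/M cached input reads.
INPUT_PRICE = 0.80    # USD per million input tokens
CACHED_PRICE = 0.20   # USD per million cached input tokens

def effective_input_price(cache_hit_rate: float) -> float:
    """Weighted average price per million input tokens."""
    return cache_hit_rate * CACHED_PRICE + (1 - cache_hit_rate) * INPUT_PRICE

# At the observed 72.8% cache hit rate:
print(round(effective_input_price(0.728), 4))  # 0.3632 -> ~36.3 cents/M
```

The arithmetic lands on the same 36.3 cents per million quoted in the usage data, which is a good sign the snapshot figures are internally consistent.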
That cache hit rate is high. What kind of usage pattern produces that?
Typically it is applications that send a consistent system prompt or a large shared context block with every request. Character roleplay platforms are a natural fit for that pattern. You have a fixed character sheet, a fixed world description, a fixed set of rules, and then the variable part is the conversation turn. The fixed portion gets cached, and you are only paying full rate on the new tokens.
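In code, that pattern just means keeping the expensive, unchanging context byte-identical at the front of every request so the provider's prefix cache can reuse it. A hedged sketch, using the standard chat-completions message shape (the character sheet content is invented for illustration):

```python
# Structure requests so the fixed prefix (character sheet, world
# rules) is identical across turns, letting prefix caching bill
# most of the payload at the cheaper cached-input rate.

FIXED_PREFIX = [
    {"role": "system", "content": "You are Mira, a smuggler-captain..."},
    {"role": "system", "content": "World rules and setting notes..."},
]

def build_messages(history: list[dict], user_turn: str) -> list[dict]:
    """Fixed prefix first, then conversation history, then the new
    turn. Only the tail varies, so only the tail misses the cache."""
    return FIXED_PREFIX + history + [{"role": "user", "content": user_turn}]

msgs = build_messages([], "We dock at the smuggler's moon. What now?")
print(len(msgs))  # 3
```

The design point is ordering: anything that changes per turn must come after everything that does not, or the cache prefix breaks at the first divergent token.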
The hosting situation is straightforward?
AionLabs, accessed through OpenRouter. You can reach it via the OpenRouter SDK or any OpenAI-compatible SDK. There is no self-hosting option listed and no indication that weights are available, so you are going through that one endpoint.
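Since the endpoint is OpenAI-compatible, a call can be sketched with the standard `openai` client pointed at OpenRouter. The model slug below follows OpenRouter's usual provider/model convention but is our assumption, so verify it against the live listing before building on it:

```python
# Sketch: calling Aion-2.0 through OpenRouter with the OpenAI SDK.
# Only the three sampling parameters the model card lists are
# passed: max_tokens, temperature, top_p.
import os

MODEL_SLUG = "aion-labs/aion-2.0"  # assumed slug; check OpenRouter

def request_kwargs(messages: list[dict]) -> dict:
    """Keyword arguments for client.chat.completions.create()."""
    return {
        "model": MODEL_SLUG,
        "messages": messages,
        "max_tokens": 2048,
        "temperature": 0.9,   # narrative work usually runs warmer
        "top_p": 0.95,
    }

if __name__ == "__main__" and os.environ.get("OPENROUTER_API_KEY"):
    from openai import OpenAI  # pip install openai
    client = OpenAI(base_url="https://openrouter.ai/api/v1",
                    api_key=os.environ["OPENROUTER_API_KEY"])
    resp = client.chat.completions.create(
        **request_kwargs([{"role": "user", "content": "Open the scene."}]))
    print(resp.choices[0].message.content)
```

Note what is deliberately absent: no `tools` or `functions` argument, matching the card's parameter list, which we come back to below.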
Let us talk about what AionLabs is actually claiming this model can do, and then look at what the independent record says.
The claims split into two categories. There are the narrative quality claims, which come directly from the model card, and then there are benchmark numbers, which come from a third-party benchmarking service called Benchable and are surfaced on the Puter developer page rather than anywhere I would call a neutral source.
What are the numbers?
Ninety-nine point five percent on general knowledge, ninety-six percent on mathematics, ninety-three point five percent on coding tasks, and the Benchable results also show perfect accuracy on email classification and eighty-two percent on reasoning. Those are striking numbers. The problem is we have essentially one source for them, and that source is downstream of the developer. I cannot point you to an independent replication.
We treat those with some caution.
Not because they are necessarily wrong, but because the methodology behind Benchable is not something I can evaluate from what is publicly available, and no major independent evaluation platform has published scored results for Aion-2.0. BenchLM, for instance, lists the model but shows benchmark data as coming soon. One aggregator, aiModelsMap, gives it a composite score of forty out of one hundred, ranking it around one hundred and eighty-ninth out of roughly two hundred and ninety tracked models, but the underlying quality and cost efficiency fields are listed as zero, which tells you that score is not based on actual benchmark runs. It is essentially a placeholder.
The benchmark picture is thin.
Very thin for an independent view. What we have more of is qualitative reception around the roleplay use case. One reviewer on designforonline gave it a score of thirty-two out of an unspecified maximum for roleplay engagement, which is not a rigorous methodology but does suggest someone sat down and tested it in its intended domain and found it worthwhile. The lab's own framing around narrative tension, introducing crises and conflict, and handling darker themes with nuance is consistent with what the top usage applications suggest. Janitor AI, SillyTavern, HammerAI, these are not general-purpose deployments. These are platforms where users are specifically stress-testing narrative capability, and they are sending substantial token volumes to this model.
That usage pattern is its own kind of signal.
It does not tell you how Aion-2.0 compares to a named competitor on a controlled narrative benchmark, and we should be honest that no such comparison exists in the public record right now. But a combined total of well over three billion tokens in a single month across roleplay-specific platforms is not nothing. People are using it for the thing it claims to be good at, and they are coming back.
Let us talk about where you would actually reach for this model. The usage data we have been looking at is pretty specific.
It is, and I think it is one of the more honest signals we have for a model that lacks independent benchmarks. The top five applications by token volume this month are Janitor AI at one point three billion tokens, HammerAI at five hundred and seventy million, Infinite Worlds at five hundred and seven million, SillyTavern at four hundred and seventy million, and Miniapps.ai at two hundred and sixty-two million. Four of those five are explicitly roleplay and interactive fiction platforms. SillyTavern in particular is a power-user frontend, the kind of tool that people who take character-driven AI seriously tend to gravitate toward. These are not casual deployments.
What does that tell you about the workload fit?
It tells you that if you are building a product in the character interaction or collaborative fiction space, this model is worth evaluating seriously. The specific claims the lab makes, that it is strong at introducing tension and conflict, that it handles mature and darker themes with nuance, those map directly to what users of those platforms are looking for. Long-form narrative sessions where the model needs to sustain a character voice, escalate stakes, and not flatten out into generic responses. That is the use case.
What about the context window? One hundred and thirty-one thousand tokens is substantial for that kind of work.
It is useful there. Long roleplay sessions accumulate context quickly, and a model that can hold a hundred thousand tokens of prior narrative without losing the thread is a practical advantage. The thirty-two thousand token max output is also generous if you are generating extended scenes or chapters.
Where does it not fit?
Anywhere that requires tool calling or function calling is a problem. The supported parameters listed on the model card are Max Tokens, Temperature, and Top P. That is it. There is no indication of function calling support, and I would not assume it is there. So if you are building an agent that needs to call external APIs, query a database, or take structured actions, this is not the model you want.
What about multimodal work?
Not addressed anywhere in the documentation. The model is text in, text out. No vision, no audio, no image generation. That is not a criticism for its intended use case, but it does define the boundary clearly.
For general enterprise tasks, coding assistants, that kind of thing?
The benchmark numbers from Benchable suggest it may be capable there, but we have already discussed why I hold those loosely. The model was not optimised for those workloads, and the usage data does not show anyone deploying it that way at scale. I would not reach for it as a coding assistant when there are models purpose-built and independently validated for that task.
Let us talk about how the broader community has received this model. What is the signal like out there?
I want to be precise about what I mean by that. The model was released in February 2026, so it is still relatively new, and independent evaluation has not caught up yet. If you go looking for sourced benchmark coverage on sites like BenchLM, you will find placeholder pages. Comparison pages exist for Aion-2.0 against models like o3-pro and Seed 1.6, but the benchmark data fields are listed as coming soon. That is not a red flag on its own, it just means the independent testing community has not prioritised this one yet.
What about the aggregator rankings that do have numbers?
There are a couple, and they are worth treating carefully. One leaderboard, aiModelsMap, gives it a score of forty out of one hundred, ranking it one hundred and eighty-ninth out of two hundred and ninety models tracked. Another page on the same site gives it seventy-three out of one hundred, ranking it seventy-fourth. Those two numbers are inconsistent enough that I would not put weight on either of them. The raw quality and cost efficiency sub-scores on both pages are listed as zero, which suggests the composite scores are being driven by metadata rather than actual evaluation runs. So those rankings tell us almost nothing about model quality.
The benchmark numbers from the developer's own site?
The Puter developer page cites figures from something called Benchable: ninety-nine point five percent on general knowledge, ninety-six percent on mathematics, ninety-three point five percent on coding, ninety-nine percent on email classification, eighty-two percent on reasoning. Those are striking numbers. But they come from the provider's own documentation, and we have no visibility into the methodology, the test sets, or whether those evaluations were independently administered. I am not saying the numbers are wrong. I am saying we cannot verify them, and that is a meaningful distinction.
What about the lab itself? AionLabs is not a household name in the LLM space.
No, and that is because their primary identity is not LLM development. AionLabs is an Israel-based venture studio focused on AI-driven drug discovery. They are backed by AstraZeneca, Merck, Pfizer, and Teva, among others. Their portfolio companies work on things like small molecule identification and antibody design. The LLM work, Aion-1.0, Aion-1.0-Mini, the roleplay model Aion-RP 1.0, and now Aion-2.0, appears to be a parallel track. There are no controversies attached to the lab that we found, and their reputation in the biopharma space is solid. But they are not a lab with a track record in consumer or enterprise LLM deployment, and that context matters when you are weighing how much to trust self-reported benchmarks.
Any chatter from the engineering community, forums, that kind of thing?
Not in any volume that we could source. The usage data we discussed earlier, the Janitor AI and SillyTavern deployments, is the most concrete signal we have that practitioners are actually using this model and finding it worth running at scale. That is real-world validation of a kind, but it is narrow. Outside the roleplay and interactive fiction community, the conversation around Aion-2.0 is essentially quiet right now.
Alright, let us land this. Herman, if someone is listening to this right now and they are deciding whether to put Aion-2.0 on their shortlist, what is the honest one-line version?
If you are building a long-form narrative experience and you need a model that handles dark themes, moral complexity, and sustained character tension without flattening everything into something safe and bland, this is worth a serious look. That is a real and underserved niche, and the usage data from Janitor AI and SillyTavern suggests the model is delivering there at scale.
The case against?
The case against is everything outside that niche. If you are doing generalist question answering, code review, document summarisation, anything where you would normally reach for a frontier model with a broad evaluation record, there is no evidence here that Aion-2.0 can compete. The benchmarks we have are self-reported, the independent evaluation community has not caught up yet, and the fine-tuning methodology is undisclosed. You would be making a bet on a model whose quality story is almost entirely told by the lab itself.
What about the pricing? We flagged it earlier, but does it factor into the verdict?
It is reasonable for what it is. Eighty cents per million input tokens, a dollar sixty on output, and a cached input rate of twenty cents per million. For a roleplay workload with high cache reuse, the effective input cost drops considerably. The seventy-two point eight percent cache hit rate we cited earlier is real, and it matters for applications where users are sending long, repeated context. But again, those prices are as of April twentieth, twenty twenty-six, and they can shift.
Any other flags before we close?
Two practical ones. First, prompt logging is on. AionLabs retains prompts for thirty days. If your application involves sensitive user content, that is a policy decision you need to make consciously, not accidentally. Second, the licence terms are not stated anywhere on the model card. If open weights or deployment flexibility matters to your use case, you need to go ask AionLabs directly before you build anything on this.
The short version: strong and specific, not broad. Know what you are using it for before you commit.
That is it. Narrow fit, genuine capability within that fit, and a lot of open questions if you try to use it outside it.
That is Aion-2.0. Thanks for listening to the Model Spotlight.