Daniel sent us this one — and I think he's trying to make our brains hurt on purpose. He's asking about this new tier of what he's calling AI clouds — providers like Nebius, Baseten, Modal, RunPod — that offer serverless GPUs or GPU-only infrastructure. The actual question is: what would make someone build on one of these instead of AWS or a traditional hyperscaler? And then he wants us to actually sort through them — when does it make sense to go Modal versus Nebius versus RunPod versus Baseten, and which one fits different types of scale and operations. There's a lot to unpack here.
By the way, today's episode is powered by DeepSeek V four Pro, which feels appropriate for a conversation about which infrastructure runs AI workloads.
It does have a certain symmetry to it. Alright, so where do we even start with this? Because I think the first thing that jumps out is that calling these companies "small" is just wrong. They're not small. They're just not hyperscalers.
Right, and that's the framing mistake a lot of coverage makes. The hyperscaler threshold is specific — we did a whole episode on what actually makes a hyperscaler — and it's about owning the entire stack from chips to fiber to buildings at a scale where you're spending tens of billions a year on capex. AWS, Azure, GCP. These AI cloud players don't do that, but some of them are running thousands of GPUs with InfiniBand networking and pulling in billions in funding. CoreWeave alone had something like seven billion in funding and was projecting a five billion dollar run rate. That's not a boutique operation.
CoreWeave is the one that's basically become the hyperscaler of the neocloud world, right? They're serving Microsoft and OpenAI.
They're in a tier of their own at this point. But the group Daniel's asking about — Nebius, Baseten, Modal, RunPod — these are where most AI startups and smaller product teams are actually making decisions right now. And the core differentiator that jumps out immediately is cost. I mean, it's not even close.
Give me the numbers. What's the actual gap?
As of April this year, AWS on-demand H100 pricing is around six dollars and eighty-eight cents an hour. Azure is even worse at about twelve twenty-nine. Meanwhile, Nebius is charging two ninety-five an hour for H100s on-demand, and RunPod is at two sixty-nine. You're looking at roughly two and a half to four and a half times cheaper for the same GPU.
Two and a half to four and a half times cheaper. That's not a marginal difference. That's structural.
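For anyone who wants the arithmetic, here's how the multiples fall out of the rates just quoted — snapshot pricing, so rerun it against current rate cards:

```python
# Quoted on-demand H100 rates, USD per GPU-hour (April snapshot from the episode).
hyperscalers = {"AWS": 6.88, "Azure": 12.29}
ai_clouds = {"Nebius": 2.95, "RunPod": 2.69}

for big, big_rate in hyperscalers.items():
    for small, small_rate in ai_clouds.items():
        print(f"{big} vs {small}: {big_rate / small_rate:.1f}x")
# AWS vs Nebius: 2.3x, AWS vs RunPod: 2.6x,
# Azure vs Nebius: 4.2x, Azure vs RunPod: 4.6x
```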
It is structural, and that's the key point. Hyperscalers have enormous overhead — they're amortizing hundreds of data centers, they're supporting thousands of services beyond GPU compute, their sales orgs are massive, their compliance certifications are expensive to maintain. The AI cloud providers are stripped down. They do GPUs, they do networking, they give you some orchestration, and that's it. The margin structure is completely different.
Here's my question. If the price gap is that massive, why isn't everyone just fleeing AWS? What's the catch?
There are a few catches. One is compliance. Nebius has ISO 27001 and SOC 2 certifications and is GDPR-compliant — which is solid, especially if you need European data residency, which is actually one of their selling points. But none of these AI clouds have HIPAA, none have FedRAMP, none have PCI DSS at the level AWS does. So if you're a healthcare AI company or you're doing anything government-adjacent or payment processing, you might not have a choice. You're on a hyperscaler regardless of the price.
That's the compliance ceiling. And I'm guessing the second catch is just breadth of services. If you're already using S3 and Lambda and DynamoDB and fifty other AWS services, moving your GPU workloads to Nebius means you now have a split architecture.
And that split architecture introduces egress costs, which is actually one of the hidden traps here. Hyperscalers charge eight to twelve cents per gigabyte for data egress. So if you're training a model on Nebius and then transferring a hundred gigabyte checkpoint back to your AWS environment, you're paying eight to twelve dollars just for that transfer. At scale, that can actually rival your GPU compute bill. Most of the AI clouds include bandwidth or charge much lower flat rates, but the problem is you're still paying hyperscaler egress on the other end.
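Here's that egress math as a sketch — the checkpoint size and cadence are made-up assumptions for illustration, and the rates are the ones quoted above:

```python
# Rough egress math for moving artifacts out of a hyperscaler.
EGRESS_PER_GB = 0.09          # hyperscaler egress, USD/GB (typically $0.08-0.12)

checkpoint_gb = 100           # one model checkpoint (hypothetical size)
checkpoints_per_week = 20     # e.g., frequent snapshots during a training run

weekly_egress = checkpoint_gb * checkpoints_per_week * EGRESS_PER_GB
print(f"Weekly egress:  ${weekly_egress:,.2f}")    # $180.00

# Compare against the compute burned in the same week:
gpu_rate = 2.95               # Nebius on-demand H100, USD/hr (quoted above)
gpus, hours = 8, 40
weekly_compute = gpu_rate * gpus * hours
print(f"Weekly compute: ${weekly_compute:,.2f}")   # $944.00
```

Even at this modest scale, egress is already a fifth of the compute bill; push the checkpoint cadence or cluster size up and the two lines converge.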
The decision isn't just "cheaper GPUs." It's "cheaper GPUs plus architectural complexity plus egress math." And that math changes depending on what you're actually doing. Which brings me to what I think is the real heart of Daniel's question — when do you pick which one? Because these four aren't interchangeable.
Not at all. Let me lay them out. And I want to start with Modal because it's the one Daniel mentioned they're actually using for serverless GPU. Modal's differentiator is pure developer experience. You write Python, you decorate a function, and Modal handles everything — containerization, scaling, cold starts, the whole thing. Their Python SDK is the product. Their cold starts are two to four seconds, which is very good for serverless.
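To make "decorate a function" concrete, here's a minimal sketch in the style of Modal's documented SDK — treat the GPU string, image setup, and entrypoint as illustrative rather than verified current signatures:

```python
import modal

app = modal.App("h100-inference")

# Dependencies are baked into a container image that Modal builds for you.
image = modal.Image.debian_slim().pip_install("torch", "transformers")

@app.function(gpu="H100", image=image)
def generate(prompt: str) -> str:
    # Model loading and inference would live here; Modal handles
    # containerization, scheduling, autoscaling, and scale-to-zero.
    return f"completion for: {prompt}"

@app.local_entrypoint()
def main():
    # .remote() ships the call to Modal's cloud instead of running locally.
    print(generate.remote("hello"))
```

Point `modal run` at that file and the function executes on an H100 you never provisioned. That's the whole pitch — and also the shape of the lock-in, since those decorators are Modal-specific rather than standard container tooling.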
They charge for it. What's their H100 pricing?
About four fifty an hour equivalent. So they're roughly seventy percent more expensive than RunPod for the same GPU. That premium is entirely for the developer experience.
Which is worth it until it isn't. I've seen takes arguing that Modal's billing can get opaque at scale, and there's a lock-in concern because you're building against their specific SDK, not against standard container workflows.
That's a real consideration. If you're a three-person startup trying to ship an AI feature and nobody on the team wants to think about Kubernetes, Modal is magical. You write Python, it runs, you don't think about infrastructure. But if you're scaling to thousands of requests per minute and optimizing every dollar of inference cost, that seventy percent premium starts to hurt, and the SDK lock-in means migrating off is a real engineering project.
Modal is the "I want to write code, not configure infrastructure" option. What about RunPod? They seem to be the budget play.
RunPod is fascinating because they're actually a hybrid. They offer serverless endpoints, but they also offer persistent GPU pods, which are essentially VMs with dedicated GPUs. Their serverless cold starts are the fastest in the space — their FlashBoot system gets forty-eight percent of cold starts under two hundred milliseconds. Their GPU variety is also the widest — they go all the way from RTX 4090s at thirty-four cents an hour up to H100s at two sixty-nine. And they do per-second billing.
If I'm cost-sensitive and I need a mix of serverless and dedicated, RunPod is probably where I land. But what's the tradeoff? There's always a tradeoff.
The developer experience isn't as polished as Modal. You're working with containers, you're configuring more things yourself. It's not raw infrastructure by any means, but it's not "decorate a function and forget about it." And their serverless offering, while fast, doesn't have quite the same warm pooling sophistication that Modal has built.
Alright, let's talk about Nebius. They seem to be going after a different segment entirely.
Nebius is the "we do serious training workloads" player. They're not really serverless-first — they offer on-demand and committed-use GPU clusters, and they're building out serverless capabilities, but their core pitch is thousand-GPU clusters with InfiniBand networking, managed Kubernetes and Slurm, and European data residency. Their pricing is aggressive — two ninety-five on-demand for H100, two dollars flat on committed use. They also have H200s at three fifty and B200s at five fifty an hour.
The InfiniBand piece is actually important. If you're doing distributed training across hundreds of GPUs, the networking fabric between those GPUs is everything. Standard ethernet doesn't cut it. InfiniBand is what the hyperscalers use internally for their large clusters.
Right, and Nebius offering that at two ninety-five an hour is a genuinely compelling proposition for teams that need to do large-scale training but don't want to negotiate enterprise contracts with AWS. The European data residency is also a real differentiator — if you're a European AI company dealing with GDPR-sensitive data, having your training infrastructure in European data centers with ISO 27001 certification matters.
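To see what committed-use pricing does to a sustained run, here's a back-of-the-envelope using the quoted Nebius rates — the cluster size and duration are hypothetical:

```python
# Committed-use vs on-demand for a sustained training run.
ON_DEMAND = 2.95   # Nebius H100 on-demand, USD/GPU-hr (quoted above)
COMMITTED = 2.00   # Nebius H100 committed-use, USD/GPU-hr (quoted above)

gpus, days = 256, 30                    # hypothetical month-long run
gpu_hours = gpus * 24 * days            # 184,320 GPU-hours

print(f"On-demand: ${gpu_hours * ON_DEMAND:,.0f}")              # $543,744
print(f"Committed: ${gpu_hours * COMMITTED:,.0f}")              # $368,640
print(f"Savings:   ${gpu_hours * (ON_DEMAND - COMMITTED):,.0f}") # $175,104
```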
Then there's Baseten, which Daniel called Base10 but I'm pretty sure he meant Baseten. They're the most inference-focused of the bunch.
Baseten is interesting because they're built specifically for production model inference. Their differentiator is something called Truss, which is an open-source framework for packaging models, and they deploy across multiple clouds — AWS, GCP, Vultr. They also do some pretty sophisticated optimization. There was an NVIDIA case study on them where they achieved up to two hundred twenty-five percent better cost-performance for high-throughput inference using TensorRT-LLM optimization. Their cold starts are slower — eight to twelve seconds — but for production inference where you're keeping endpoints warm anyway, cold start matters less.
Their H100 pricing is higher though, right? I saw something around nine ninety-eight an hour for a full H100.
Yeah, but they also offer fractional GPUs through multi-instance GPU partitioning, so you can get a piece of an H100 for lighter workloads. And the per-minute billing means you're not paying for full hours if you don't need them. Their real pitch is: if you're running a production inference API and you need it to be fast, reliable, and cost-optimized at the model level, Baseten does that out of the box.
Let me try to synthesize this into something useful. If I'm a business building an AI product, my decision tree probably looks something like this. Step one: do I have compliance requirements that force me onto a hyperscaler? If yes, stop, go to AWS or Azure, pay the premium, move on with your life.
That's the compliance ceiling we talked about. HIPAA, FedRAMP, PCI DSS — if you need those, the AI clouds aren't an option yet.
Step two: what am I actually doing? Am I training large models from scratch or doing fine-tuning on big clusters? If yes, I'm probably looking at Nebius or CoreWeave — providers with InfiniBand, large cluster orchestration, and committed-use pricing that makes the economics work for sustained compute.
The breakpoint there is utilization. If your GPUs are running above about fifteen to twenty-five percent utilization, dedicated instances are cheaper than serverless anyway. Serverless pricing is built for bursty, unpredictable workloads where you're paying a premium per second for the privilege of scaling to zero when idle.
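The break-even logic is one line of arithmetic. A sketch with illustrative rates — and note that with a narrow serverless premium the break-even lands higher than the fifteen-to-twenty-five percent rule of thumb, which bakes in extra serverless overheads:

```python
# A dedicated box costs D per hour around the clock; serverless costs S per
# hour of *actual* compute. Over a day at utilization u, dedicated wins
# when D * 24 < S * 24 * u, i.e. u > D / S. Rates below are illustrative.
def break_even_utilization(dedicated_hourly: float, serverless_hourly: float) -> float:
    return dedicated_hourly / serverless_hourly

# Committed dedicated at $2.00/hr vs an effective serverless rate of $4.50/hr
# puts break-even around 44%; the 15-25% rule of thumb assumes a wider gap
# once you fold in serverless premiums, retries, and warm-pool overhead.
print(f"{break_even_utilization(2.00, 4.50):.0%}")  # 44%
```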
Which brings us to step three. If I'm doing inference or light fine-tuning with bursty traffic patterns, now I'm choosing between Modal and RunPod's serverless offerings. And that choice comes down to: how much do I value developer experience versus raw cost?
And how much do I care about cold starts? If you've got latency-sensitive user-facing applications where a four-second cold start is going to annoy people, you might need to keep dedicated capacity warm anyway, which changes the math. RunPod hitting sub-two-hundred-millisecond cold starts on nearly half of requests is impressive for serverless.
Then step four: if I'm specifically building a production inference API that needs to serve models at scale with optimizations like TensorRT, Baseten starts looking really attractive. They're more expensive per GPU hour, but the per-token cost might actually be lower because of the optimization layer.
This is where I want to introduce something that I think gets overlooked in these comparisons. The GPU you're renting matters more than who you're renting it from for cost-per-token. A B200 on-demand at six dollars and two cents an hour yields about forty-two cents per million tokens. An H100 PCIe at two dollars and one cent an hour yields about forty-seven cents per million tokens. The B200 costs three times more per hour, but it's actually cheaper per token because it delivers roughly three point three times the throughput.
The cheapest hourly rate doesn't necessarily give you the cheapest inference. You have to do the per-token math.
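The per-token math itself is simple. A minimal helper — the throughput figures are back-solved from the numbers quoted above, so treat them as illustrative:

```python
# Cost per million tokens = hourly rate / millions of tokens produced per hour.
def cost_per_million_tokens(hourly_rate: float, tokens_per_sec: float) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_rate / (tokens_per_hour / 1_000_000)

b200 = cost_per_million_tokens(6.02, 3980)   # ~$0.42/M tokens
h100 = cost_per_million_tokens(2.01, 1190)   # ~$0.47/M tokens
print(f"B200: ${b200:.2f}/M tokens, H100 PCIe: ${h100:.2f}/M tokens")
```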
And that's where Baseten's TensorRT optimization and Nebius offering B200s become interesting even though their headline hourly rates are higher than RunPod's. If you're serving millions of tokens a day, you care about cost per token, not cost per GPU hour.
This also connects to the lock-in question with Modal. If you're building on Modal's SDK and you're happy with their H100 pricing at four fifty an hour, but then B200s become the standard for cost-effective inference, you're dependent on Modal offering B200s and pricing them competitively. With a more container-based approach on RunPod or Nebius, you can switch GPU types or even providers more easily.
Although I should say — and I want to be fair to Modal here — they do offer A100s and H100s currently, and their abstraction layer means you don't have to reconfigure your infrastructure when new GPU types come online. In theory, they handle that for you. The lock-in is real, but so is the value of not thinking about GPU types at all.
Let's talk about who's actually using these, because that grounds the conversation in reality. What do we know about the user bases?
Modal seems to be really popular with the AI startup crowd — teams that are building quickly, experimenting, shipping features. Their Python SDK is loved by developers who just want things to work. RunPod has a broader base — everyone from individual researchers fine-tuning models on RTX 4090s to small companies running production inference on H100s. Nebius is going after the mid-to-large scale training market, plus European companies that need GDPR compliance. Baseten's customers tend to be teams that have already productized their models and are optimizing for production inference at scale.
The hybrid approach is becoming the norm, right? Nobody's picking just one.
That's what I'm seeing. You use serverless for development, experimentation, and handling traffic spikes. You keep dedicated capacity for your baseline production workloads. Maybe you train on Nebius, serve inference on Baseten, and use RunPod serverless for burst capacity. The idea that you'd pick one provider and commit to it exclusively is increasingly outdated.
There's also an egress strategy question that I don't think enough teams think through upfront. If your training data lives in S3 and you're training on Nebius, you're paying to move data. If your model artifacts need to end up back in your AWS environment for serving, you're paying again. The AI clouds that offer integrated storage — Nebius has persistent storage, RunPod has network volumes — can reduce some of that friction, but the multi-cloud data movement tax is real.
It's not just the dollar cost. It's latency, it's reliability, it's the operational complexity of managing data pipelines that span providers. This is where the hyperscalers' integrated ecosystems win. If everything lives in AWS, your S3 to SageMaker to Lambda pipeline is trivial to set up and operate. The moment you split across providers, you're building and maintaining cross-cloud data pipelines, and that engineering time isn't free.
There's a total cost of ownership calculation here that goes beyond the GPU hourly rate. You've got the compute cost, the egress cost, the engineering cost of managing multi-cloud complexity, and the opportunity cost of lock-in. And different teams will weigh those differently depending on their stage, their team size, and their growth trajectory.
I think the sweet spot for these AI clouds is really clear for two profiles. Profile one: you're an early-stage AI startup, you're cost-sensitive, you don't have enterprise compliance requirements, and you need GPU compute that doesn't require a PhD in cloud architecture to operate. You're probably on Modal or RunPod. Profile two: you're a mid-stage company with real inference traffic, you've optimized your models, and you're trying to drive down per-token costs while maintaining reliability. You're probably on a mix — maybe Baseten for production inference, Nebius for training runs, and some serverless for flexibility.
Profile three is the one where hyperscalers still dominate: you're an enterprise with compliance requirements, existing cloud commitments, and an architecture that's deeply integrated with a single cloud provider's ecosystem. The GPU premium hurts, but the switching costs hurt more.
There's one more angle I want to hit on because it comes up in the comparisons. The "serverless tax" — that seventy percent premium Modal charges over RunPod for H100s — is that worth it? And I think the honest answer is: it depends on your team. If you have infrastructure engineers who are comfortable with containers and Kubernetes and GPU optimization, RunPod or Nebius give you more control at lower cost. If you're a team of ML engineers who just want to write model code and not think about infrastructure, Modal's premium might pay for itself in engineering time saved.
There's a midpoint here that's worth mentioning. RunPod's serverless offering gives you a lot of the "don't think about infrastructure" benefit without Modal's SDK lock-in, because you're deploying containers. It's not as seamless as Modal — you're still writing Dockerfiles and configuring things — but it's more portable. If you decide RunPod's pricing or GPU selection isn't working for you anymore, you can take those containers elsewhere.
That portability argument gets stronger the larger you get. At small scale, the engineering cost of switching providers is manageable regardless. At large scale, being locked into a provider-specific SDK can become a genuine business risk. I've seen discussions where teams on Modal start hitting that wall — the pricing becomes harder to predict at scale, the SDK abstraction starts feeling constraining rather than liberating, and the migration cost has grown with their usage.
That's also a good problem to have, right? If you've scaled to the point where Modal's pricing and lock-in are real concerns, you've probably built something people want. The migration pain is a consequence of success.
And Modal's team is aware of this — they're not oblivious to the lock-in critique. The question is whether they'll introduce more portability or more transparent pricing as their customers scale. That'll determine whether they remain a long-term home for growing companies or become a stepping stone that teams graduate from.
Let me try to put some concrete decision heuristics out there, because I think that's what Daniel's really asking for. If you're doing bursty inference with significant idle periods — think a hundred to five hundred requests a day, unpredictable patterns — serverless is the right model, and you're choosing between Modal and RunPod serverless based on developer experience versus cost. If you're doing steady production inference above maybe twenty percent utilization, dedicated GPUs are cheaper, and you're looking at RunPod pods, Nebius, or Baseten depending on your optimization needs. If you're training large models, you need InfiniBand and cluster orchestration, and Nebius is probably your best bet among the ones we're discussing.
If you're doing fine-tuning or smaller training runs, RunPod's GPU variety becomes really attractive. You can fine-tune on A100s at two eighteen an hour or even use RTX 4090s at thirty-four cents if your model fits. That kind of flexibility doesn't exist on Modal or Baseten.
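Boiled down to a sketch, the decision heuristics from this stretch of the conversation look roughly like this — the thresholds and provider picks mirror the discussion, but they're rules of thumb, not verdicts:

```python
# The episode's decision heuristics as a rough function, not a prescription.
def pick_provider(workload: str, utilization: float,
                  values_dx_over_cost: bool = False) -> str:
    if workload == "large_training":      # needs InfiniBand + cluster orchestration
        return "Nebius (or CoreWeave at bigger scale)"
    if workload == "fine_tuning":         # GPU variety matters most
        return "RunPod pods (A100s, or RTX 4090s if the model fits)"
    if workload == "production_inference":
        if utilization > 0.20:            # steady traffic: dedicated wins
            return "RunPod pods / Nebius / Baseten (per-token optimized)"
        return "Modal" if values_dx_over_cost else "RunPod serverless"
    return "unknown workload"

print(pick_provider("production_inference", utilization=0.05))  # RunPod serverless
```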
The RTX 4090 thing is actually worth pausing on. Thirty-four cents an hour for a GPU that's perfectly capable of fine-tuning a lot of models — that's democratizing access in a way that didn't exist even two years ago. You can experiment for pocket change.
That's the broader story here. These AI clouds are doing for GPU compute what DigitalOcean did for VPS hosting — they're taking something that was complex and expensive on hyperscalers and making it accessible and affordable for smaller teams. The hyperscalers aren't going anywhere, but they're no longer the only sensible option.
Alright, I want to zoom out for a second and ask a question that I think is lurking underneath all of this. Are these AI clouds sustainable as businesses, or are we looking at a consolidation wave in the next couple years? Because the pricing is aggressive, the margins have to be thin, and the hyperscalers could decide to get competitive on GPU pricing if they start losing enough workload.
That's the existential question for this tier. CoreWeave has the scale and the contracts to be sustainable — serving Microsoft and OpenAI gives you a revenue base that's hard to disrupt. Nebius has the European data residency angle, which is a structural moat that hyperscalers can't easily replicate without building European data centers with the right certifications. RunPod's GPU variety and low-cost positioning give them a different kind of moat — they're serving a market segment that hyperscalers don't seem interested in chasing. Modal and Baseten are more vulnerable because they're competing on software and developer experience, which hyperscalers could theoretically replicate.
Although the hyperscalers have tried to replicate good developer experiences before and it's not exactly their strong suit. AWS's attempts at making GPU compute accessible haven't been great. SageMaker exists, but nobody would describe it as a joy to use.
That's true. And there's an innovator's dilemma dynamic here. The hyperscalers' business models are built on enterprise relationships, compliance certifications, and ecosystem lock-in. They make more money when you use more of their services. A stripped-down GPU cloud that does one thing well and charges a fraction of the price is almost a different business entirely. They could compete on price, but doing so would cannibalize their existing margins across their enterprise customer base. That's a hard decision for a public company to make.
The AI clouds might actually have more runway than a naive analysis would suggest. The hyperscalers are constrained by their own business models.
I think that's right. And we're also still in the early stages of AI adoption. If the market for GPU compute grows tenfold over the next few years, there's room for multiple tiers of providers. The hyperscalers can keep the enterprise compliance market, the AI clouds can take the startup and mid-market, and everyone grows.
One last thing I want to touch on before we move to practical takeaways — Daniel mentioned Modal specifically as what they're using for serverless GPU. And I think that's instructive. For a team that's building and experimenting, the developer experience premium is worth it. But Daniel's also the kind of person who's going to be thinking about what happens at the next stage of scale. That's why he's asking about the whole landscape.
That's the right way to think about it. Pick the tool that matches your current stage, but understand the landscape well enough to know when you've outgrown it and what comes next. Too many teams pick a provider early and then never reevaluate, even as their usage patterns and requirements change dramatically.
Alright, I think we've covered the landscape. Let's do some practical takeaways.
Now: Hilbert's daily fun fact.
The national animal of Scotland is the unicorn. It has been since the twelve hundreds.
If you're building an AI product and trying to figure out which of these providers to use, here's what I'd actually do. First, calculate your current and projected GPU utilization. If you're below about twenty percent utilization, start with serverless — probably RunPod if cost matters most, Modal if developer speed matters most. If you're above twenty percent, run the numbers on dedicated instances from Nebius or RunPod pods.
Second, do the per-token math, not the per-hour math. A cheaper GPU that's slower might cost you more per token than a more expensive GPU that's faster. This is especially relevant if you're serving inference at scale. Look at B200 availability and pricing, not just H100s.
Third, map your compliance requirements before you pick a provider. If you need HIPAA or FedRAMP, you're on a hyperscaler and that's that. If you need GDPR and European data residency, Nebius becomes very interesting. If you don't have specific compliance needs, the AI clouds are wide open to you.
Fourth, factor in egress costs. If your data and your other services live in AWS, moving GPU workloads off AWS saves you on compute but costs you on data transfer. Model that out before you make the switch. At small scale it doesn't matter much, but at scale it can erase your compute savings.
Fifth, don't marry your provider. The hybrid approach is the norm now. Use serverless for experimentation and burst capacity, dedicated instances for baseline production, and don't be afraid to mix providers if it makes economic sense. The operational complexity is real, but the cost savings usually justify it once you're past the experimentation phase.
The through line here is that we're in a moment where GPU compute infrastructure is actually getting more competitive, not less. The hyperscalers had a near-monopoly on serious cloud infrastructure for years, and now there's a whole tier of focused competitors eating away at the GPU segment specifically. That's good for everyone building AI products.
The open question I'm left with is whether the hyperscalers respond by dropping GPU prices or by bundling more aggressively. If AWS decides to make GPU compute a loss leader to keep AI workloads in their ecosystem, the economics shift. I don't think that's likely in the near term given their margin structure, but it's worth watching.
Thanks to our producer Hilbert Flumingtop for keeping this operation running, and to Daniel for sending us a prompt that forced us to actually map out the whole competitive landscape instead of just talking about one provider.
This has been My Weird Prompts. You can find every episode at myweirdprompts.com or wherever you get your podcasts. If you found this useful, leave us a review — it helps other people find the show.
We'll be back with another one soon.