#1548: Modal and the End of the Serverless GPU Cold Start

Stop waiting for containers to warm up. Discover how Modal is reinventing GPU infrastructure to eliminate friction in AI development.

Episode Details
Published
Duration
20:56
Pipeline
V5
TTS Engine
chatterbox-regular
LLM

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

For years, the term "serverless" has been something of a misnomer in the world of high-performance computing. While marketed as a magical, frictionless way to deploy code, the reality often involves significant "cold start" delays. Developers frequently wait thirty seconds or more for a container to warm up and a GPU to initialize before any actual work begins. This friction is more than an inconvenience; it is a productivity killer that renders many responsive AI applications unviable.

Moving Beyond Kubernetes

While much of the industry relies on Kubernetes for container orchestration, this standard was never designed for the sub-second scaling required by modern AI workloads. Kubernetes is optimized for long-running services, making it heavy and slow when pulling large images or managing GPU memory dynamically.

To solve this, new infrastructure approaches are emerging that bypass traditional wrappers in favor of custom-built runtimes. By developing specialized file systems and schedulers from the ground up, platforms can now place tasks on a GPU in milliseconds. This transition from being a consumer of a managed API to an architect of one's own infrastructure allows developers to define Python dependencies and system libraries with precision, offering total control over the hardware stack.

The Economics of the 51% Rule

Choosing between serverless and dedicated bare-metal hardware often comes down to a financial heuristic known as the "51% rule." If a GPU's utilization is consistently above 51%—meaning the hardware is active more than half the time—it is generally more cost-effective to rent a dedicated instance or purchase hardware.

However, for workloads that are "bursty," such as inference, internal tools, or creative iteration, serverless is the clear winner. Traditional cloud providers often require users to rent GPUs by the hour or month, leading to massive waste during idle time. Modern serverless platforms offer per-second billing, allowing users to spin up a cluster of GPUs, perform work in parallel, and scale back to zero instantly.
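The arithmetic behind the 51% rule is simple enough to sketch. The $3.95/hour serverless H100 rate comes from the discussion below; the $2.00/hour dedicated rate is a made-up illustrative number (real dedicated pricing varies widely by provider and commitment term).

```python
SERVERLESS_PER_HOUR = 3.95   # billed per second (rate quoted in this episode)
DEDICATED_PER_HOUR = 2.00    # billed busy or idle (illustrative assumption)

def monthly_cost_serverless(busy_hours: float) -> float:
    """You pay only for the hours the GPU is actually working."""
    return busy_hours * SERVERLESS_PER_HOUR

def monthly_cost_dedicated(total_hours: float = 730) -> float:
    """You pay for every hour in the month, idle or not."""
    return total_hours * DEDICATED_PER_HOUR

# Break-even utilization: dedicated wins once the card is busy more than
# DEDICATED/SERVERLESS of the time -- with these rates, roughly 51%.
breakeven = DEDICATED_PER_HOUR / SERVERLESS_PER_HOUR
print(f"break-even utilization: {breakeven:.0%}")

# A bursty workload: 4 hours of GPU work per week (about 17.3 h/month).
busy = 4 * 52 / 12
print(f"serverless: ${monthly_cost_serverless(busy):.2f}/mo "
      f"vs dedicated: ${monthly_cost_dedicated():.2f}/mo")

# Per-second billing: a 45-second task costs cents, not an hour's rent.
print(f"45s task: ${SERVERLESS_PER_HOUR / 3600 * 45:.3f}")
```

With these illustrative rates, a few hours of weekly work costs tens of dollars serverless versus well over a thousand for an always-on instance, which is exactly the gap described above.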

Solving the Cold Start with Snapshots

One of the most significant hurdles in AI deployment is the time it takes to load model weights into GPU VRAM. For large models, this initialization can take upwards of twenty seconds. A breakthrough solution currently gaining traction is the use of GPU Snapshots.

Instead of reloading the entire model every time a container starts, a snapshot takes a literal "picture" of the VRAM state, including initialized CUDA kernels. This allows the system to inject the state into the GPU almost instantly, bringing cold starts for heavy models down from fifteen or twenty seconds to under three. This shift makes serverless viable for real-time, interactive applications that were previously impossible.
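Modal's actual snapshot mechanism captures VRAM and initialized CUDA state, which cannot be reproduced here; the following is only a loose CPU-side analogy showing why restoring serialized state beats re-running initialization on every cold start.

```python
import pickle
import time

def expensive_init() -> dict:
    """Stand-in for loading model weights and warming CUDA kernels."""
    time.sleep(0.2)  # pretend this takes tens of seconds
    return {"weights": list(range(1000)), "kernels_ready": True}

# First boot: pay the full initialization cost, then take the "snapshot".
state = expensive_init()
snapshot = pickle.dumps(state)

# Every subsequent cold start: restore the snapshot instead.
t0 = time.perf_counter()
restored = pickle.loads(snapshot)
restore_time = time.perf_counter() - t0

assert restored == state
print(f"restore took {restore_time * 1000:.1f} ms")
```

The restore is orders of magnitude faster than the initialization because all the work of constructing the state has already been done once and persisted.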

Challenges in Specialized Workflows

Despite these advancements, certain industries face unique hurdles. In architecture and design, many core tools like Rhino are historically tied to Windows environments. Because high-performance compute platforms are almost exclusively Linux-native, bridging this gap requires creative engineering.

Architects are increasingly using serverless clusters as "geometry-processing workers." By keeping the design interface on a local Windows machine and offloading heavy computational tasks—like structural simulations or environmental analysis—to a Linux-based GPU cluster, firms can achieve massive concurrency without abandoning their primary software ecosystem.
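The fan-out shape of that worker pattern can be sketched with a local pool standing in for the remote cluster. Everything here is hypothetical (`analyze_panel`, the panel data, the area limit); on Modal you would map the same function across containers rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_panel(panel: dict) -> dict:
    """Stand-in for a heavy per-panel simulation (structural load,
    sunlight hours, etc.) that would run on one remote worker."""
    area = panel["w"] * panel["h"]
    return {"id": panel["id"], "area": area, "ok": area < 10.0}

# A facade of 100 panels, serialized by the local (Windows) design tool.
panels = [{"id": i, "w": 1.5, "h": 0.5 + i * 0.1} for i in range(100)]

# Fan out: all 100 panels are analyzed concurrently, so wall-clock time
# tracks the slowest single task rather than the sum of all of them.
with ThreadPoolExecutor(max_workers=100) as pool:
    results = list(pool.map(analyze_panel, panels))

failed = [r["id"] for r in results if not r["ok"]]
print(f"{len(failed)} of {len(results)} panels over the area limit")
```

The design choice worth noting is that the worker function is stateless and takes plain data, which is what makes it trivial to scale from one local thread to a thousand remote containers.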


Episode #1548: Modal and the End of the Serverless GPU Cold Start

Daniel's Prompt
Daniel
Custom topic: Modal has generously provided GPU compute credits for this podcast - for which we are extremely grateful. We've covered serverless GPU before, but let's look at Modal specifically and what their plat
Corn
Have you ever noticed how the word serverless is one of the biggest lies in modern computing? It is marketed as this magical, frictionless experience where you just throw code at the cloud and things happen. But the reality for most of us is that you click a button, you wait twenty or thirty seconds for a container to warm up, and by the time the G-P-U is actually ready to do any real work, you have already checked your phone twice, looked out the window, and completely forgotten why you were there in the first place. It is a productivity killer disguised as a convenience.
Herman
It is the classic cold start problem, Corn. It is the absolute bane of anyone trying to build responsive A-I applications or interactive tools. If your user has to wait fifteen seconds for a large language model or an image generator to even begin thinking, the experience is basically broken. But today's prompt from Daniel is about a company that has spent the last few years trying to kill that specific friction entirely. He wants us to dig into Modal.
Corn
Right, Modal. We should probably start by being transparent and clarifying that we are not just talking about this because they provide the G-P-U credits that keep our show running. Daniel is actually using it for his own heavy-duty video generation projects, and he is genuinely curious about how it stacks up against the traditional cloud giants like Amazon Web Services or Google Cloud. He is seeing a level of speed that he did not think was possible with serverless.
Herman
I am Herman Poppleberry, and I have been deep in their documentation and their architecture whitepapers for the last seventy-two hours. What makes Modal fascinating is that they did not just build another thin wrapper around A-W-S or a reseller platform for spare G-P-U capacity like some of the other players. Erik Bernhardsson, the C-E-O, and Akshat Bubna, the C-T-O, decided to take the hard path. They built their own custom container runtime, their own scheduler, and their own file system from the ground up.
Corn
Which sounds like a massive amount of unnecessary engineering work unless you have a very specific, very painful problem you are trying to solve. I mean, why not just use Kubernetes like the rest of the world? It is the industry standard for a reason, right?
Herman
Because Kubernetes was never designed for sub-second scaling of G-P-U workloads. Kubernetes is great for long-running web services, but it is heavy. It is slow to pull large images, and it is not optimized for the way G-P-U memory needs to be managed for A-I. If you look at the architecture of Modal, they have a specialized file system called Modal Volumes and a custom scheduler that can place a task on a G-P-U in milliseconds. When you run a function on Modal, it is not just spinning up a generic virtual machine and hoping for the best. It is injecting your specific code into a pre-baked, optimized environment almost instantly.
Corn
I remember when Daniel first started messing with this. He was coming from the Model-as-a-Service world. You know the drill. You send a R-E-S-T A-P-I call to OpenAI or Anthropic, you get a response back, and you never have to think about the hardware or the drivers or the C-U-D-A versions. Moving to Modal was a bit of a shock for him because it is more of an Infrastructure-as-Code play. You are basically writing Python decorators like at-app-dot-function to define exactly what your environment looks like.
Herman
That shift from being a consumer of an A-P-I to being an architect of your own infrastructure is the biggest hurdle for most people. In the A-P-I world, you are just a tenant in someone else's house. In the Modal world, you are building the house. You have to define your Python dependencies, your system libraries, and even which specific G-P-U you want to use, whether it is an A-one-hundred or the new H-one-hundreds. But the payoff is that you own the entire stack. You are not at the mercy of a provider changing the model weights under your feet or hitting a rate limit that ruins your production pipeline right when you need it most.
Corn
It is that control that Daniel keeps raving about. But before we get too deep into the technical weeds, we should talk about the company itself. Erik Bernhardsson has a pretty legendary pedigree in the engineering world. He was the fortieth employee at Spotify and the creator of Luigi, which is a very popular data orchestrator, and Annoy, which is a library for approximate nearest neighbors. He was also the C-T-O of Better dot com. So when he says he wants to fix data-intensive workloads, people tend to listen.
Herman
And the market is definitely listening. Modal reached unicorn status late last year after an eighty-seven million dollar Series B led by Lux Capital. They are currently valued at over one point one billion dollars. That is a lot of pressure for a New York-based startup, but it shows how desperate the industry is for efficient, scalable compute. We are in the middle of this massive shift where everyone is trying to figure out how to run these models without going bankrupt.
Corn
Speaking of not going bankrupt, we should probably talk about how we actually use this thing for the show. Because before we moved our Text-to-Speech workflow over to Modal, we were looking at some pretty depressing bills. We were trying to run custom models like X-T-T-S-v-two to get the voices just right, and the traditional cloud options were not great.
Herman
We were looking at over five hundred dollars a month just to keep a single, mid-tier G-P-U instance running twenty-four-seven. And the tragedy was that we only actually used it for maybe three or four hours a week when we were processing episode scripts. The rest of the time, that silicon was just sitting there in a data center somewhere, burning money and generating heat for absolutely no reason. It was the definition of inefficiency.
Corn
It was like paying full rent on a luxury apartment in Manhattan that you only visit for an hour on Tuesday afternoons. It makes no sense. Now, with Modal, when we hit the button to generate the audio for this episode, Modal spins up a cluster of G-P-Us, does the work in parallel, and then instantly scales back to zero the moment the last sentence is rendered. We only pay for the literal seconds that the G-P-U is active.
Herman
And the pricing is actually quite aggressive now. As of late March twenty-twenty-six, their effective rate for an N-V-I-D-I-A H-one-hundred is roughly three dollars and ninety-five cents per hour, but remember, that is billed by the second. If you compare that to a bare-metal provider where you often have to rent the card for at least an hour or even commit to a month, the math starts to look very different for bursty workloads. If your task takes forty-five seconds, you pay for forty-five seconds. Period.
Corn
You mentioned something earlier when we were prepping for this called the fifty-one percent rule. Is that like a legal thing or just an engineering heuristic?
Herman
It is a financial heuristic that has gained a lot of traction in the industry lately. The idea is that if your G-P-U utilization is consistently above fifty-one percent—meaning the card is actually doing work more than half the time—it is actually cheaper to go buy your own hardware or rent a dedicated bare-metal instance. If you are running a massive training job that is going to last for three weeks straight, you do not want serverless. You want a dedicated rack. But for inference, for internal tools, or for anything that fluctuates throughout the day, if your utilization is below fifty-one percent, serverless is almost always the winner because you are not paying for the idle time.
Corn
I think that is where Daniel found the most value. He has been doing these video generation experiments where he needs to generate, say, one hundred different clips to find the one that actually looks good. If he used a standard A-P-I, he would be sitting there for an hour waiting for them to process one by one because of concurrency limits. On Modal, he just maps that function across one hundred containers simultaneously.
Herman
It is massive concurrency. You are essentially renting a supercomputer for sixty seconds. Generating one hundred clips takes almost the same amount of time as generating one. That is a fundamental shift in how you think about creative workflows. You stop optimizing for the cost of a single run and start optimizing for the speed of the entire batch. It changes your iteration cycle from days to minutes.
Corn
But there is still that cold start issue we talked about at the beginning. I mean, if he is doing something interactive, even a ten-second wait feels like an eternity in the modern era. I saw something in the recent updates about G-P-U Snapshots. How does that actually work? Is it just caching the image?
Herman
This is one of the most significant technical updates we have seen this year, and it is currently in alpha and beta testing. Normally, when a container starts, it has to load the model weights from the disk into the V-R-A-M of the G-P-U. If you are dealing with a large model, that can take fifteen or twenty seconds just to move the data. Modal’s G-P-U Snapshots basically take a literal picture of the V-R-A-M state, including the model weights and the C-U-D-A kernels that are already initialized.
Corn
So instead of reloading and re-initializing the whole thing, it just resumes the state?
Herman
It injects that state into the G-P-U almost instantly. They have brought cold starts for heavy models like Comfy-U-I or large Text-to-Speech engines down from fifteen seconds to under three seconds. For something like a custom application or a specialized video diffusion model, that is the difference between an application feeling broken and feeling like magic. It makes serverless viable for real-time applications that were previously impossible.
Corn
I want to pivot to something Daniel mentioned about architects. He has been talking to some friends in the design world who are trying to move their rendering and simulation workflows to the cloud. Specifically people using Rhino and Grasshopper. That seems like a much harder nut to crack than just running some Python code because those tools are so tied to specific operating systems.
Herman
It is a massive hurdle because of the operating system mismatch. Most architectural software, specifically Rhino and its compute engine, is historically tied to Windows. Modal, like almost all high-performance compute platforms, is a Linux-native environment. You cannot just drop a Windows installer into a Modal container and expect it to work. It is like trying to put a square peg in a round hole.
Corn
So are those architects just out of luck, or is there a way to bridge that gap?
Herman
There are two main ways to do it, but both require some engineering effort. One is using the rhino-three-d-m library, which is the open-source version of the Open-NURBS geometry kernel. You can run that on Linux to do geometry processing, like calculating intersections or generating complex paths. But if you want the full rendering power of Rhino’s Cycles engine, you have to use a headless version compiled for Linux.
Corn
That sounds like a lot of work for someone who just wants to see what their building looks like in the sunlight or run a structural analysis.
Herman
It is, which is why we are seeing a shift where Modal acts as a geometry-processing worker. The architect stays in their Windows environment on their local machine for the design work, but they offload the heavy computational tasks, like massive structural simulations or environmental analysis, to a Modal cluster. The local machine sends the data, Modal spins up a thousand G-P-Us to run the simulation in parallel, and sends the results back. It is about using the right tool for the right part of the job.
Corn
It is funny because we have seen this pattern before. People think serverless is just for simple web hooks or resizing images, and then someone comes along and uses it to render a city or simulate a drug molecule. It really comes down to whether you are willing to learn the Python-native way of doing things. You have to be comfortable with the code.
Herman
And that is the core of the Modal philosophy. They are betting that the future of A-I development is not going to be built by people clicking buttons in a dashboard or just calling generic A-P-Is. It is going to be built by engineers who treat their infrastructure as part of their code. That is why they have reached that unicorn status. They are providing the tools for the people who are actually building the next generation of software.
Corn
It is a bold bet. A billion-dollar valuation for a company that basically helps people run Python scripts on G-P-Us. But when you look at the scarcity of compute, it makes sense. They are also aggressively updating their hardware pool. They were among the first to get N-V-I-D-I-A Blackwell B-two-hundreds and H-two-hundreds into their serverless pool this year.
Herman
If you want to do frontier-class inference on the latest silicon without signing a three-year contract with a data center or buying a million dollars worth of hardware, there are not many other places you can go. Modal gives you access to the most powerful chips on the planet for a few dollars an hour. That democratization of hardware is a huge deal for small teams.
Corn
I guess the question for the listener is, at what point do you make the switch? If you are just playing around with Midjourney or ChatGPT, you obviously do not need this. But if you are building a product, where is the tipping point?
Herman
The tipping point is when you find yourself frustrated by the limitations of a third-party A-P-I. Maybe it is the latency, maybe it is the cost at scale, or maybe you need to use a very specific fine-tuned model that no one else hosts. Once you need that level of control, you have to choose between managing your own servers or using a platform like Modal. And if you choose the server route, you better enjoy spending your weekends debugging C-U-D-A drivers and worrying about thermal throttling.
Corn
Which is exactly what Modal is selling. They are selling you your weekends back. They handle the drivers, the orchestration, the networking, and the scaling. You just provide the logic. It is a compelling pitch for anyone who has ever lost a day to a broken environment configuration.
Herman
We should probably talk about the takeaways for people listening who are considering this. The first step is to audit your current G-P-U utilization. If you are paying for a cloud instance right now, look at your logs. How often is that card actually at one hundred percent? If it is sitting idle for more than half the day, you are literally throwing money away. You are subsidizing the power bill of a data center for no benefit to your project.
Corn
And do not be intimidated by the Infrastructure-as-Code aspect. Yes, you have to define your environment in Python, but once you do it once, you can reuse that image forever. It is a one-time tax for a lifetime of flexibility. You can define your C-U-D-A version, your specific version of PyTorch, and all your dependencies in a few lines of code.
Herman
The other thing to look at is the batching potential. If you have any process that currently runs in a loop, see if you can parallelize it. That is where the real magic of serverless G-P-U compute shines. You turn a linear problem that takes an hour into a parallel one that takes sixty seconds. That kind of velocity changes how you think about experimentation.
Corn
I think that is what Daniel liked most. He mentioned that once he got over the initial learning curve, he could go from an idea to a deployed, scaling application in about twenty minutes. That kind of speed is addictive. It allows you to try weird ideas that you would otherwise ignore because the setup time was too high.
Herman
It is the democratization of the supercomputer. We are living in the inference era now, as we talked about in episode fifteen forty-four. More than half of all A-I infrastructure spending is now going toward keeping models running rather than just training them. Platforms like Modal are the backbone of that shift.
Corn
It is a massive shift. And it makes me think about the long-term implications. If the cost of inference continues to drop and the speed continues to increase, the barrier to entry for high-end A-I applications is going to vanish. We are going to see small startups launching tools that rival what the giants are doing.
Herman
We are already seeing it. Small teams are building video editing tools and architectural simulation platforms that are incredibly powerful. They are doing it by leveraging these serverless platforms to get massive scale without the massive overhead of a traditional dev-ops team.
Corn
I do wonder, though, if we are going to see a world where this becomes even more abstract. Like, will I ever reach a point where I do not even have to write the Python decorator? Where the system just knows when I need a G-P-U and provides it seamlessly?
Herman
We are heading that way. If you look at the progress from two years ago to today, the friction has dropped significantly. We went from waiting minutes for instances to waiting seconds, and now with those V-R-A-M snapshots, we are looking at sub-three-second response times. The goal is to make a G-P-U feel as ubiquitous and as fast as a standard C-P-U function.
Corn
It is a bold vision. Especially when you consider that the hardware itself is becoming increasingly scarce and expensive. If Modal can keep their prices competitive while providing that level of performance, they are going to be a very hard target to hit for the traditional cloud providers who are still stuck in the virtual machine mindset.
Herman
One thing that often gets overlooked is the developer experience beyond just the G-P-U. Modal has things like Modal Secrets for managing A-P-I keys and Modal Volumes for persistent storage. It feels like a cohesive ecosystem rather than a collection of disjointed tools you have to stitch together with bash scripts and prayers.
Corn
I think we have covered a lot of ground here. From the custom runtime architecture to the economic fifty-one percent rule, and how this is actually changing the way we produce this very show. It is a fascinating time to be watching this space.
Herman
It really is. The democratization of this kind of hardware is going to lead to some very strange and wonderful applications in the next few years. We are just scratching the surface of what is possible when you give everyone access to a supercomputer for a few cents a second.
Corn
Well, I for one am looking forward to the day when my cold starts are so fast that I do not even have time to check my phone. I might actually have to stay focused on my work, which is a terrifying prospect for my attention span.
Herman
You could always just use that extra time to ask me more probing questions about container runtimes, Corn.
Corn
Do not tempt me, Herman. I have plenty of them. But I think we should probably wrap this one up before you start explaining the nuances of C-U-D-A kernel optimization and lose half our audience.
Herman
Probably for the best. We will save the kernel talk for the after-show.
Corn
Thanks as always to our producer Hilbert Flumingtop for keeping the gears turning behind the scenes. And a big thanks to Modal for providing the G-P-U credits that power our Text-to-Speech and automation pipelines. We genuinely could not do this show at this scale without that infrastructure.
Herman
If you found this technical deep dive useful, we have covered similar territory in the past. Check out episode four eighty-four, The Silicon Sharing Economy, for more on the early days of serverless G-P-Us, or episode twelve twenty-seven where we discussed the Mojo programming language and the quest for A-I performance.
Corn
This has been My Weird Prompts. If you are enjoying the show, a quick review on your podcast app really does help us reach new listeners who might be as nerdy as we are. It makes a huge difference in the algorithms.
Herman
You can also find us at myweirdprompts dot com for our full archive and all the ways to subscribe to the feed. We have got all the show notes and technical links there as well.
Corn
We will be back soon with another prompt from Daniel. Until then, stay curious.
Herman
Goodbye.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.