Hey Herman, I was looking at some of the hardware specs for those new Blackwell clusters that Daniel was asking about earlier this morning, and I noticed something that kind of made me laugh. Every single marketing slide, every technical white paper, and every benchmark comparison leads with this one specific number. It is always the teraflops. It is like the horsepower of the computer world. But the more I looked at it, the more I started to wonder if we are actually measuring what we think we are measuring. We are sitting here in March of twenty twenty-six, and the numbers have become so large they almost feel fake. We are talking about twenty petaflops on a single Blackwell B two hundred chip for some operations. That is twenty quadrillion operations per second. It feels like we are just adding zeros to a spreadsheet at this point.
Herman Poppleberry here, and Corn, you have hit on one of my favorite pet peeves in the industry. It is the obsession with peak theoretical performance. You are right, though. Teraflops, or T-FLOPS, have become the gold standard for how we talk about A I computing power. Whether you are looking at an H two hundred, a B two hundred, or even the early leaks we are seeing for the Rubin R one hundred architecture, the headline is always about how many quadrillions of operations per second these things can do. But as we often say on My Weird Prompts, the headline is rarely the whole story. In fact, the headline is often a carefully constructed piece of fiction designed to make a venture capitalist feel good about a hundred-million-dollar purchase order.
That is it. It feels a bit like looking at a car and saying it can go three hundred miles per hour, but then you realize it only has a two-gallon fuel tank and the tires are made of wood. Sure, the engine can spin that fast in a vacuum, but what does it actually do on the road? I want to pull this apart today because Daniel's prompt was really pushing us to look at how this one specific unit, the T-FLOP, basically conquered the entire world of high-performance computing and A I. It is the yardstick of our era. If the nineteenth century was measured in tons of steel and the twentieth in barrels of oil, the twenty-first is being measured in teraflops.
The history is actually quite something. But before we get into the weeds, we should probably ground everyone in what a FLOP actually is. A FLOP is a single floating-point operation, and FLOPS just counts how many of them happen per second. A floating-point operation is basically any math problem involving a decimal point. So, if you multiply two point five by four point seven, that is one floating-point operation. If you do that a trillion times in one second, you have one teraflop. The floating part of the name refers to the fact that the decimal point can float anywhere relative to the significant digits of the number. This allows the computer to represent very large numbers and very small numbers using the same amount of memory.
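To make the arithmetic concrete, here is a toy Python sketch of counting floating-point operations and relating them to a chip's quoted rate. The twenty-petaflop figure echoes the Blackwell number mentioned above; the vector size is an arbitrary illustration, not a benchmark.

```python
# Toy illustration: counting floating-point operations and converting
# them to chip-seconds. All numbers here are illustrative, not measurements.

def flops_for_dot(n):
    # A dot product of two length-n vectors: n multiplies + (n - 1) adds.
    return n + (n - 1)

petaflop = 10**15
teraflop = 10**12
chip_rate = 20 * petaflop            # ops/sec, the B200-class figure above

ops = flops_for_dot(1_000_000)       # ~2 million ops for one dot product
seconds = ops / chip_rate
print(f"{ops} ops at 20 PFLOPS takes {seconds:.2e} s")
```

Even a million-element dot product is over in a tenth of a nanosecond at that rate, which is why the zeros on the spreadsheet stop feeling real.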
And the reason we care about floating-point math specifically, rather than just whole numbers or integers, is because the real world and complex simulations do not happen in neat little whole numbers. If you are simulating the weather, or the way a protein folds, or how a neural network adjusts its weights, you need that precision. You need the decimals. If you only used integers, your rounding errors would accumulate so fast that your A I would basically turn into a random number generator within a few layers of the network.
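A quick way to see the rounding problem Corn describes is to simulate an artificially narrow number format. The three-significant-digit rounding below is a crude stand-in for low-precision hardware, not how FP16 actually rounds:

```python
# Sketch: why too-low precision breaks long accumulations. We fake a
# "3-significant-digit float" by rounding after every add.

def lowp(x):
    return float(f"{x:.3g}")  # keep only 3 significant digits

exact = 0.0
lossy = 0.0
for _ in range(10_000):
    exact += 0.001
    lossy = lowp(lossy + 0.001)  # round after every add, like narrow hardware

print(exact)  # ~10.0
print(lossy)  # stuck at 1.0: once the sum hits 1.0, adding 0.001 rounds away
```

The lossy accumulator gets permanently stuck the moment the running sum is a thousand times bigger than each increment, which is exactly the kind of swamping that makes precision choice a real engineering decision.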
Right. And for decades, the metric for the world's most powerful supercomputers was how many gigaflops, or billions of operations per second, they could do. Then we hit the Teraflop era. I remember reading about the first machine to actually break the one teraflop barrier. It was the Intel A-S-C-I Red back in late nineteen ninety-six, though it officially hit its stride in nineteen ninety-seven. It was a monster. It took up about sixteen hundred square feet, used eight hundred kilowatts of electricity, and was primarily built to simulate nuclear weapon tests so the United States did not have to do actual physical testing. It used over nine thousand Pentium Pro processors.
It is wild to think about that. A room-sized supercomputer in nineteen ninety-seven hitting one teraflop, and today, you can get a high-end graphics card for your gaming P C that does eighty or ninety teraflops without breaking a sweat. That is roughly a hundredfold increase in power, and it fits in the palm of your hand. But this brings up a question I have been chewing on. When A-S-C-I Red was doing a teraflop, it was doing it with what they call double precision, or F P sixty-four. When a modern A I chip says it does a thousand teraflops, it is usually not talking about that same kind of math, is it?
Not even close, Corn. And this is the first big secret of the T-FLOPS arms race. In the old days of scientific computing, which we talked about in depth back in episode one thousand thirty-four when we discussed the race for exascale, precision was everything. If you are calculating a rocket trajectory to Mars, a tiny rounding error at the start of the journey means you miss the planet by a million miles. So they used sixty-four-bit numbers. Very precise, very heavy, and very slow to calculate.
But A I is different. I have heard it described as being more like a painting than a blueprint. You do not need every single pixel to be mathematically perfect to understand the image. If a weight in a neural network is zero point five zero zero zero zero one or zero point five zero zero zero zero two, the model usually doesn't care.
You've got it. Neural networks are surprisingly resilient to noise. It turns out that when you are training a large language model, you do not need the extreme precision of F P sixty-four. Around twenty-fifteen, the industry realized we could get away with F P sixteen, which is half precision. Then Google introduced B-Float sixteen, which kept the full range of a thirty-two-bit float but sacrificed some of the precision. Lately, with the Blackwell architecture, we have moved down to F P eight and even F P four. Each time you cut the precision in half, you can effectively double or quadruple the number of T-FLOPS you can squeeze out of the same piece of silicon because the math units become smaller and simpler.
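The range-versus-precision trade-off Herman describes can be poked at directly, since Python's struct module can round-trip values through IEEE half precision (F P sixteen):

```python
import struct

def to_fp16(x):
    # Round-trip a double through IEEE half precision via struct's 'e' format.
    try:
        return struct.unpack('e', struct.pack('e', x))[0]
    except OverflowError:
        return float('inf')  # value exceeds FP16's largest normal (~65504)

print(to_fp16(65504.0))   # the largest FP16 value: survives the round trip
print(to_fp16(70000.0))   # overflows, which is why BF16 keeps a wider exponent
print(to_fp16(0.1))       # comes back slightly off from 0.1: precision loss
```

B-Float sixteen trades mantissa bits for exponent bits, so seventy thousand would survive there; the cost is that everyday values like zero point one get even fuzzier.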
So when I see a spec sheet saying a chip has four thousand T-FLOPS, I have to look at the fine print to see if that is at F P eight or F P sixty-four. If it is F P eight, it is doing much simpler math, which is why the number looks so much bigger. It is almost like saying a runner is faster because they are taking shorter steps. If you compare the Blackwell B two hundred's F P sixty-four performance to its F P four performance, the difference is staggering. It is like comparing a tricycle to a jet engine, even though it is the same piece of hardware.
Spot on. And the industry shifted to this because A I training is basically just a massive, never-ending series of matrix multiplications. It is just giant grids of numbers being multiplied by other giant grids of numbers. Around twenty-seventeen, Nvidia introduced something called Tensor Cores. These are specialized parts of the chip that are designed to do nothing but these specific matrix operations at lower precision. They are not general-purpose math units. They are specialized A I math units. That was the moment T-FLOPS stopped being a measure of general computing power and started being a measure of A I throughput.
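Since Herman says training is mostly matrix multiplication, the standard accounting is worth writing down: multiplying an M by K matrix with a K by N matrix costs about two times M times K times N operations. The layer sizes below are illustrative, not taken from any specific model:

```python
# Sketch: the FLOP accounting behind "AI is just matmuls".
# An (M x K) times (K x N) matmul does M*K*N multiplies and M*K*N adds
# (counting the accumulates), i.e. roughly 2*M*K*N operations.

def matmul_flops(m, k, n):
    return 2 * m * k * n

# Illustrative transformer-scale sizes (hypothetical, not a real model):
batch_tokens = 8192
hidden = 12288
flops = matmul_flops(batch_tokens, hidden, 4 * hidden)
print(f"{flops / 1e12:.1f} TFLOPs for one feed-forward matmul")
```

A single large layer's forward pass is already around ten trillion operations, which is why a chip rated in thousands of teraflops is not overkill.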
I remember when the A one hundred came out. That was a big shift. It had this thing called Tensor Float thirty-two, or T F thirty-two. It was a weird hybrid that tried to give you the range of a big number but the speed of a small number. It felt like the engineers were starting to realize that the old benchmarks, like the Top five hundred list, which uses the Linpack benchmark, were holding them back. Linpack requires that high-precision F P sixty-four math. But the A I world was saying, "We don't care about Linpack! We want to train Transformers!"
And that brings us to the core of why T-FLOPS became the standard despite these nuances. It was easy to market. If I tell a venture capitalist or a data center manager that my new chip has twice the T-FLOPS of the old one, they immediately understand that as twice the value. It became the proxy for intelligence. We started equating more math with more smarts. But as we have seen in recent years, specifically with the developments we discussed in episode one thousand ninety-four about the shift back toward C-P-U-first architectures for certain tasks, raw math speed is not the only bottleneck. In fact, it is often the least important part of the equation today.
Right, because you can have the fastest processor in the world, but if you cannot get the data to the processor fast enough, it just sits there waiting. You have mentioned the memory wall before. Is that why T-FLOPS can be a bit of a lie? I mean, if the chip can do a quadrillion operations but it only receives enough data to do a trillion, the other nine hundred and ninety-nine trillion operations are just wasted potential.
It is the biggest lie in hardware marketing. Think of it this way. The T-FLOPS are the speed of the chef's hands in a kitchen. If the chef can chop ten onions a second, that is great. That is his peak theoretical T-FLOP rating. But if the kitchen assistants can only bring him one onion every ten seconds, his actual output is limited by the assistants, not his hands. In a G-P-U, the chef is the compute core and the assistants are the memory bandwidth. This is what we call the Roofline Model in computer science. Your performance is limited either by how fast you can compute or how fast you can move data. For many modern A I workloads, especially inference, we are data-limited, not compute-limited.
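Herman's roofline model fits in a few lines: attainable throughput is the minimum of peak compute and memory bandwidth times arithmetic intensity. The peak and bandwidth figures below are illustrative, only loosely in the range of current accelerators:

```python
# Sketch of the roofline model: performance is capped by whichever is
# lower, the compute "roof" or the memory-traffic "slope".

def roofline_tflops(peak_tflops, bandwidth_tbps, intensity_flops_per_byte):
    # bandwidth (TB/s) * intensity (FLOPs/byte) gives attainable TFLOPS
    return min(peak_tflops, bandwidth_tbps * intensity_flops_per_byte)

peak = 2000.0        # TFLOPS, low-precision peak (illustrative)
bandwidth = 8.0      # TB/s of HBM (illustrative)

# Low arithmetic intensity (few FLOPs per byte moved): memory-bound.
print(roofline_tflops(peak, bandwidth, 10))    # 80 TFLOPS, 4% of peak
# High intensity (big matmuls reuse each byte heavily): compute-bound.
print(roofline_tflops(peak, bandwidth, 500))   # hits the 2000 TFLOPS roof
```

The crossover point, peak divided by bandwidth, is the number of operations your code must do per byte fetched before the headline T-FLOPS are even theoretically reachable.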
And in A I, the datasets are so massive that we are constantly moving hundreds of gigabytes of data back and forth. If your memory bandwidth does not grow at the same rate as your T-FLOPS, your expensive chip is spending eighty percent of its time just idling, waiting for the next batch of numbers to arrive from the H-B-M, the high-bandwidth memory. I was looking at the jump from the H one hundred to the B two hundred. The T-FLOPS went up by a huge margin, but the memory bandwidth didn't keep pace in a linear way.
We're seeing that play out right now. T-FLOPS numbers grow by five times or ten times between generations because of lower precision and more Tensor Cores, but memory bandwidth might only grow by two or three times because physics is harder to cheat than math. Moving electrons through a wire and managing the heat of a memory stack is a physical constraint. This creates a massive utilization gap. You might have a chip capable of two thousand T-FLOPS, but in a real-world scenario training a model like G P T four or the newer Claude four models, you might only be getting thirty or forty percent utilization of those T-FLOPS.
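The utilization gap can be estimated with the common rule of thumb of roughly six FLOPs per parameter per token for dense transformer training. Every figure in this sketch is hypothetical, chosen only to land in the range Herman mentions:

```python
# Sketch: estimating real utilization (often called MFU) from training
# throughput, using the ~6 FLOPs per parameter per token rule of thumb.

def utilization(params, tokens_per_sec, peak_tflops):
    achieved = 6 * params * tokens_per_sec      # FLOPs/s actually performed
    return achieved / (peak_tflops * 1e12)      # fraction of the peak rating

# A hypothetical 70-billion-parameter model on a 2000-TFLOPS chip:
u = utilization(params=70e9, tokens_per_sec=1500, peak_tflops=2000)
print(f"utilization: {u:.0%}")
```

With those made-up inputs the chip is delivering about a third of its sticker rating, which is the thirty-to-forty-percent range described above.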
That is a staggering waste of resources. If I am a company spending hundreds of millions of dollars on a cluster, and I am only using a third of the power I paid for because the data is stuck in traffic, I would be pretty frustrated. Why hasn't the industry moved to a better metric then? Why not measure something like tokens per second or even just energy per inference? Those seem like they would tell me more about my actual bill at the end of the month.
Well, some people are trying. But the problem with tokens per second is that it depends entirely on the model you are running. A chip might be incredibly fast at running a seven-billion-parameter model but struggle with a trillion-parameter model because of how the memory is structured. T-FLOPS are popular because they are a hardware property. They are objective, even if they are misleading. It is the one number the hardware manufacturers can guarantee, even if they cannot guarantee how well your specific code will run on it. It is like a car manufacturer guaranteeing the engine's horsepower on a test bench, even if they can't guarantee how fast you'll drive in rush hour traffic.
It reminds me of the megahertz myth back in the late nineties and early two thousands with C-P-Us. Everyone thought a three-gigahertz Pentium four must be better than a two-gigahertz Power P C or an Athlon, even if the lower-clocked chip was actually doing more work per cycle. We have just moved that same psychological trap over to the G-P-U world. We are obsessed with the clock speed of the math, not the efficiency of the work.
It is the exact same trap. And it is even more dangerous now because we are not just talking about single chips anymore. We are talking about clusters of tens of thousands of G-P-Us. This is where the interconnect comes in. If you have ten thousand G-P-Us, each with massive T-FLOPS, they still have to talk to each other to coordinate the training of a single model. If the wires connecting them, like N V-Link five point zero or the latest InfiniBand, are too slow, the T-FLOPS do not matter at all. The whole cluster slows down to the speed of the slowest connection. This is why Nvidia's acquisition of Mellanox years ago was so brilliant. They realized that to sell more T-FLOPS, they had to sell the pipes that connect them.
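The interconnect penalty Herman describes can be roughed out with the standard ring all-reduce cost model, where each G-P-U moves about two times (n minus one) over n copies of the gradient per synchronization. The link speeds and gradient size here are illustrative assumptions, not measured figures:

```python
# Sketch: why slow links stall a cluster. A ring all-reduce of G bytes
# across n GPUs moves roughly 2*(n-1)/n * G bytes per GPU per sync.

def allreduce_seconds(grad_bytes, n_gpus, link_bytes_per_sec):
    traffic = 2 * (n_gpus - 1) / n_gpus * grad_bytes
    return traffic / link_bytes_per_sec

grad = 140e9  # ~70B params in FP16 = 140 GB of gradients (illustrative)
fast = allreduce_seconds(grad, 8, 900e9)   # ~900 GB/s NVLink-class link
slow = allreduce_seconds(grad, 8, 50e9)    # ~50 GB/s Ethernet-class link
print(f"fast link: {fast:.2f} s, slow link: {slow:.2f} s per sync")
```

Same chips, same T-FLOPS, but the slow fabric spends nearly five seconds per gradient exchange versus a fraction of a second, and every one of those seconds is compute sitting idle.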
We touched on this in episode six hundred seventy-five when we talked about the intelligence factory. The idea that the data center itself is the computer, not the individual rack. If you are building an intelligence factory, you have to look at the plumbing as much as the processors. I think this is where the conservative worldview on industrial policy and American manufacturing really comes into play, too. We have this massive lead in designing these high-T-FLOP chips, but the physical reality of building the infrastructure to support them, the power, the cooling, the high-speed networking, that is where the real battle is happening. You can't just download more electricity.
You're right. And it is why you see companies like Nvidia moving from being just a chip company to being a full-stack systems company. They realize that if they just sell you a chip with high T-FLOPS and you plug it into a slow motherboard with a weak power supply, you are going to blame them when your A I doesn't train fast enough. So they sell you the whole rack, the switches, the cables, the liquid cooling systems, the software stack, everything. They are essentially protecting the reputation of their T-FLOPS by controlling the entire environment. They are ensuring that the chef has the best kitchen possible so he can actually hit those chopping speeds.
So, for someone who is looking at these specs, maybe a developer or a business owner trying to decide where to put their compute budget, what should they be looking at instead of just that big T-FLOPS number at the top of the page? Because if I'm looking at a cloud provider and they say "We have B two hundreds," I need to know if I'm actually going to see that performance.
The first thing I always tell people to look at is memory bandwidth, measured in gigabytes per second. For a Blackwell B two hundred, you're looking for something in the range of eight terabytes per second. If you see a chip with massive T-FLOPS but mediocre memory bandwidth, that is a red flag. It means that chip is designed for very specific, compute-heavy tasks but might struggle with the large-scale data movement required for modern large language models. The second thing is the interconnect speed. How fast can this chip talk to its neighbor? If you are planning to scale beyond one G-P-U, that number is actually more important than the T-FLOPS of the individual chip. Look for N V-Link speeds or the bandwidth of the network interface cards.
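Herman's first check can be turned into a back-of-envelope bound. During token-by-token generation, each token has to stream the model's weights from memory, so bandwidth divided by model size caps tokens per second, regardless of T-FLOPS. The model sizes are illustrative; the eight-terabyte-per-second figure is the one quoted above:

```python
# Sketch: memory bandwidth, not TFLOPS, caps LLM text generation.
# Each generated token streams every weight from memory once, so:
#   max tokens/sec ~ memory bandwidth / model size in bytes.

def decode_ceiling(params, bytes_per_param, bandwidth_bytes_per_sec):
    model_bytes = params * bytes_per_param
    return bandwidth_bytes_per_sec / model_bytes

hbm = 8e12  # 8 TB/s, the B200-class figure mentioned above
print(decode_ceiling(7e9, 2, hbm))    # small 7B model: ~571 tokens/s ceiling
print(decode_ceiling(1e12, 2, hbm))   # trillion-param model: ~4 tokens/s
```

This is exactly why the same chip can feel blazing fast on a seven-billion-parameter model and sluggish on a trillion-parameter one: the compute cores barely enter into it.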
That makes a lot of sense. It is like looking at the speed of a single worker versus the communication efficiency of a whole team. You can have a team of geniuses, but if they cannot talk to each other, they will not get anything done. I also want to go back to the precision point. You mentioned F P eight and F P four earlier. We are seeing more and more hardware support for these ultra-low precision formats. Does that mean we are eventually going to hit a limit where we cannot go any lower without the A I just becoming total nonsense? I mean, can we do math with just one bit?
We are already flirting with that limit. There is a lot of research into binary neural networks where the weights are just ones and zeros. The trick is that while you can run inference, the actual usage of the model, at very low precision, training usually requires a bit more range to capture the tiny adjustments needed for learning. But the hardware is getting so good at handling these mixed-precision workloads that it can jump back and forth between them. This is why modern T-FLOPS numbers are so high. They are essentially quoting you the speed for the absolute simplest version of the math, the F P four or F P eight, which is great for inference but might not be what you use for the entire training process.
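To give a feel for what low-precision formats are doing, here is toy symmetric int-eight quantization. Real F P eight and F P four hardware formats are considerably more sophisticated; this only shows the basic idea that nearby weights collapse onto a small set of levels:

```python
# Sketch: symmetric int8 quantization, a toy stand-in for what
# low-precision inference formats do with far more engineering.

def quantize_int8(values):
    scale = max(abs(v) for v in values) / 127   # map the largest value to 127
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.500001, -0.31, 0.87, 0.02]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q)         # small integers, cheap to store and multiply
print(restored)  # close to the originals: the network "usually doesn't care"
```

Note that zero point five zero zero zero zero one and zero point five zero zero zero zero two would land on the same integer level, which is precisely the indifference Corn described a few minutes ago.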
It is like a car manufacturer giving you the fuel efficiency rating while the car is going downhill with a tailwind and the engine is turned off. Technically possible, but not what you will experience on your morning commute. I find it fascinating how this one metric, which started in the world of nuclear physics and weather simulation, has basically become the currency of the A I revolution. It is how we measure the wealth of nations now. How many T-FLOPS of compute does a country have within its borders? We talk about the compute divide between the global north and south.
It really is a new kind of resource. In the twentieth century, it was oil and steel. In the twenty-first, it is T-FLOPS and data. And because the United States has such a dominant position in the design of these high-T-FLOP architectures, it gives us a massive geopolitical advantage. We saw this with the export controls on high-end G-P-Us. We are not just restricting chips; we are restricting the ability of other nations to perform the sheer volume of math required to build advanced A I. If you can't do the math, you can't have the intelligence. It is a very literal form of technological containment.
It is a form of mathematical containment. If you cannot do the trillions of operations per second, you cannot build the models that are going to define the next fifty years of technology. But I wonder if this focus on raw T-FLOPS might actually be a bit of a blind spot for us. If we are so focused on building the biggest, fastest math machines, could we be missing out on more efficient ways to achieve intelligence? I mean, the human brain doesn't run on T-FLOPS, does it?
That is the trillion-dollar question, Corn. There is a whole field of neuromorphic computing and alternative architectures that try to mimic the brain, which is incredibly efficient but doesn't actually do a lot of floating-point math in the way a G-P-U does. The brain operates on something more like spikes of electricity. It uses about twenty watts of power, which is less than a dim light bulb. An H one hundred cluster uses megawatts. But because our entire software ecosystem, from Nvidia's CUDA to PyTorch, is built around matrix multiplication and T-FLOPS, it is very hard for any other architecture to get a foothold. We have built a world that speaks the language of the T-FLOP.
We are locked into the T-FLOP way of thinking. It is the path of least resistance because we have spent thirty years optimizing for it. It is like our entire civilization decided to speak one specific mathematical language, and even if there is a better language out there, no one wants to learn it because everyone they know speaks T-FLOP. It is the Q W E R T Y keyboard of mathematics.
It's a textbook case of path dependency. And because the results we are getting from scaling up T-FLOPS are so impressive, there is very little incentive to change. As long as adding more T-FLOPS results in a smarter model, people will keep writing checks for more G-P-Us. It is only when we hit a wall where adding more compute doesn't result in more intelligence that we will be forced to look for other metrics. Some people think we are hitting that wall now with L-L-Ms, which is why we're seeing this shift toward reasoning models that use more compute at inference time.
Do you think we are close to that wall? We have seen some talk about the diminishing returns of scaling, though people have been predicting that for years and have been wrong every time so far. Every time someone says "Scaling is dead," someone else builds a bigger cluster and the model gets smarter.
I think we are seeing a shift from training-time compute to inference-time compute. This is a really important distinction for our listeners. For a long time, the goal was just to have the most T-FLOPS during the training phase, to bake the smartest model possible. But now, with models like the OpenAI O one series and its successors, we are seeing models that spend more time "thinking" before they answer. They are using T-FLOPS at the moment you ask the question, not just when they were being built. This might actually make T-FLOPS even more relevant, but in a more distributed way. It's not just one big burst of math in a lab; it's a constant hum of math across the entire internet.
So instead of one giant burst of T-FLOPS in a data center for six months, it is a constant, steady stream of T-FLOPS happening every time someone interacts with an A I. That changes the economics of it quite a bit. It makes efficiency and energy per T-FLOP much more important than just raw peak speed. If you're running a trillion inferences a day, you care a lot more about your electricity bill than your peak theoretical benchmark.
Precisely. If you are running a model a billion times a day for customers, a ten percent increase in T-FLOP efficiency translates directly to millions of dollars in saved electricity. This is where the competition is going to get really interesting. We are moving away from the era of brute force and into the era of refined power. We are seeing chips like the Groq L-P-U or the latest T-P-U version six from Google that are designed specifically to be efficient at moving data for inference, rather than just having the highest T-FLOP number on a marketing slide.
I like that. Refined power. It is like moving from the early days of steam engines, where they were just massive and inefficient, to the modern internal combustion engine or electric motor. We are still doing the same basic thing, moving a piston or turning a shaft, but we are doing it with so much more precision and less waste. We are learning how to make every single FLOP count.
And that is why I think the T-FLOP will remain the king of metrics for a while longer, but we will start to see it paired with other numbers. You will see T-FLOPS per watt, or T-FLOPS per dollar, or T-FLOPS per square foot of data center space. We are getting more sophisticated in how we evaluate this stuff. The BS-detector checklist for a modern hardware spec sheet should always include: One, what is the precision? Two, what is the memory bandwidth? Three, what is the interconnect speed? And four, what is the actual sustained utilization on a real-world model?
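Checklist items one through three can be collapsed into one quick ratio: quoted FLOPs divided by memory bandwidth tells you how much data reuse a workload needs before the headline number is even reachable. The spec figures below are illustrative, not vendor data:

```python
# Sketch: a "BS-detector" balance ratio for spec sheets. FLOPs-per-byte
# tells you the data reuse your code needs to hit the quoted TFLOPS.

def flops_per_byte(peak_tflops, bandwidth_tbps):
    # TFLOPS / (TB/s) reduces to FLOPs per byte of memory traffic
    return peak_tflops / bandwidth_tbps

# Two hypothetical spec-sheet entries (precision noted, per item one):
fp64_chip = flops_per_byte(peak_tflops=67, bandwidth_tbps=3.35)
fp8_chip = flops_per_byte(peak_tflops=4000, bandwidth_tbps=8.0)
print(fp64_chip)  # ~20 FLOPs per byte: modest reuse needed
print(fp8_chip)   # 500 FLOPs per byte: needs 25x more reuse to hit peak
```

The higher that ratio climbs between generations, the more the headline T-FLOPS number depends on your workload looking like one giant, perfectly tiled matmul.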
It is funny, I was thinking back to some of our earlier episodes, like four hundred eighty-four where we talked about serverless G-P-Us. Back then, the idea was just getting access to any G-P-U at all. Now, the market is so mature that people are shopping for specific T-FLOP profiles and interconnect speeds like they are picking out parts for a high-performance racing car. The level of literacy among the general tech public about things like H-B-M three-e and tensor cores is actually pretty impressive. People who aren't even engineers are talking about memory bottlenecks at cocktail parties.
It shows how central this has become to the economy. You cannot understand the stock market or the future of work without understanding the hardware that is driving it. And at the heart of that hardware is the T-FLOP. It is the heartbeat of the modern world. Every time you ask an A I to write a poem or analyze a spreadsheet, there are trillions of these little floating-point operations happening in a fraction of a second in a windowless room somewhere, probably powered by a massive array of solar panels or a nuclear plant. It is the physical manifestation of human thought.
It really puts it into perspective. It is not just magic; it is math. And it is math at a scale that is almost impossible for the human brain to truly comprehend. A trillion operations a second. If you tried to do one math problem every second, it would take you over thirty-one thousand years to do what a single one-teraflop chip does in one second. And we are talking about machines that do thousands of those every second. It's a scale of time and effort that is almost geological.
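Corn's figure checks out:

```python
# Back-of-envelope check: one hand-done operation per second versus
# what a single 1-TFLOPS chip does in one second.

teraflop = 10**12                     # operations in one chip-second
seconds_per_year = 60 * 60 * 24 * 365
years = teraflop / seconds_per_year
print(f"{years:,.0f} years")          # a bit over thirty-one thousand years
```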
When you put it that way, it is a miracle of engineering. The fact that we can move electrons around on a piece of silicon at that speed and with that level of coordination is arguably the greatest technical achievement in human history. We are essentially teaching sand how to think by doing trillions of additions and multiplications every second. And the T-FLOP is just our way of trying to put a ruler against that miracle so we can measure it and, of course, sell it to each other.
Well, I think we have successfully pulled back the curtain on the T-FLOP. It is a useful tool, a great marketing gimmick, and a deeply flawed metric all at the same time. But for now, it is the best we have. Before we wrap up, Herman, do you have any final advice for someone who is looking at a spec sheet and feeling overwhelmed by all the zeros?
My advice is simple. Always ask which precision they are quoting. If they say a thousand T-FLOPS, ask if that is F P sixty-four or F P eight. And then immediately look for the memory bandwidth. If the bandwidth isn't there, the T-FLOPS are just for show. Think of it as a balance. A healthy G-P-U is a balanced G-P-U. Don't buy a Ferrari engine if you