#2511: Measuring AI API Latency Through the Black Box

How to benchmark token throughput and debug slowdowns in closed CLI tools like Claude Code using OpenTelemetry and mitmproxy.

Episode Details
Episode ID
MWP-2669
Published
Duration
26:16
Pipeline
V5
TTS Engine
chatterbox-regular
Script Writing Agent
deepseek-v4-pro

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

Measuring AI API Latency: How to See Through the Black Box

When your AI tool feels sluggish but the provider's status page shows green, frustration sets in. You know something's off, but you can't point to a number. This episode tackles that exact problem: how to measure token throughput and wrap observability around tools designed to be black boxes.

The Core Tension

API providers have every incentive to acknowledge full outages and almost no incentive to report degradation. "Slightly slower than usual" rarely makes it onto a status dashboard. For most users without enterprise SLAs, there's no contractual language to fall back on. This makes measurement both practical and psychological — you want to know if you're being unreasonable, or if something's genuinely degraded.

The Cleanest Path: OpenTelemetry

Claude Code ships with built-in OpenTelemetry support — an observability framework that's become the standard for distributed systems tracing. By setting CLAUDE_CODE_ENABLE_TELEMETRY=1 and configuring an OTEL exporter (pointing at Jaeger, Grafana, or Honeycomb), you get structured traces of every API call. These include timing data: request sent, first token received, tokens per second, and end of response. High time-to-first-token suggests queuing delays on the provider's side. Spiking inter-token latency points to compute contention or throttling.
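
As a sketch, the setup is a handful of environment variables (names as currently documented for Claude Code; verify against the docs for your version):

```bash
# Enable Claude Code's built-in telemetry and export over OTLP.
export CLAUDE_CODE_ENABLE_TELEMETRY=1
export OTEL_METRICS_EXPORTER=otlp
export OTEL_LOGS_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318  # your collector

claude  # launch as usual; telemetry now flows to the collector
```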

The Heavy Option: mitmproxy

For tools that don't support OpenTelemetry, or when you need raw response headers, mitmproxy is the canonical solution. It sits between your tool and the API, decrypting HTTPS traffic so you can inspect everything — request bodies, response bodies, and headers. The response headers from Anthropic's API include x-ratelimit-requests-remaining, x-ratelimit-tokens-remaining, and the usage object with prompt and completion token counts.

The setup involves installing mitmproxy, generating its CA certificate, installing that certificate in your system's trust store, and setting HTTPS_PROXY=http://localhost:8080. With a small Python addon script, you can timestamp each request and response, calculate round-trip times, extract token counts, and log everything to a CSV or JSON file. This reveals patterns — like latency doubling every weekday at 2 PM Eastern — that would otherwise remain invisible.
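
A sketch of such an addon: the hook names and timestamp fields are mitmproxy's documented API, while the header names and the usage/model fields match Anthropic's current Messages API responses, so adjust the host filter and fields for other providers.

```python
# latency_log.py -- run with: mitmdump -s latency_log.py
# Logs timing, rate-limit headers, and token counts for Anthropic API calls.
import csv
import json

from mitmproxy import http

out = open("api_latency_log.csv", "a", newline="")
writer = csv.writer(out)

class LatencyLogger:
    def response(self, flow: http.HTTPFlow) -> None:
        # Only log API traffic, not update checks or other HTTPS requests.
        if flow.request.pretty_host != "api.anthropic.com":
            return
        ttfb = flow.response.timestamp_start - flow.request.timestamp_start
        rtt = flow.response.timestamp_end - flow.request.timestamp_start
        body, usage = {}, {}
        try:
            body = json.loads(flow.response.get_text())
            usage = body.get("usage", {})
        except (ValueError, TypeError):
            pass  # streaming (SSE) bodies won't parse as one JSON object
        writer.writerow([
            flow.request.timestamp_start,  # when the request left
            round(ttfb, 3),                # time to first byte
            round(rtt, 3),                 # full round trip
            flow.response.status_code,
            flow.response.headers.get("x-ratelimit-requests-remaining", ""),
            flow.response.headers.get("x-ratelimit-tokens-remaining", ""),
            usage.get("input_tokens", ""),
            usage.get("output_tokens", ""),
            body.get("model", ""),         # which model actually served it
        ])
        out.flush()

addons = [LatencyLogger()]
```

Run it with mitmdump -s latency_log.py while you work, and the CSV accumulates in the background.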

Complementary Tools

For establishing baselines, tools like llmperf (from Anyscale) and genai-perf (from NVIDIA's Triton) benchmark APIs directly. They hammer an endpoint with prompts and return latency distributions, throughput numbers, and time-to-first-token metrics. Compare these baselines against what your tool actually experiences to identify tool-induced overhead.
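
If you want a feel for what those harnesses measure without installing one, a bare-bones baseline fits in a short script. This is a sketch against Anthropic's documented Messages API: substitute your own model id, and note that the first SSE event is response metadata rather than an actual token, so this only approximates time-to-first-token.

```python
# baseline_ttft.py -- one streaming request; rough TTFT and total time.
import os
import time

import requests  # pip install requests

start = time.monotonic()
resp = requests.post(
    "https://api.anthropic.com/v1/messages",
    headers={
        "x-api-key": os.environ["ANTHROPIC_API_KEY"],
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    },
    json={
        "model": "claude-sonnet-4-20250514",  # substitute whatever you use
        "max_tokens": 256,
        "stream": True,
        "messages": [{"role": "user", "content": "Count from 1 to 50."}],
    },
    stream=True,
    timeout=120,
)
resp.raise_for_status()

first = None
chunks = 0
for line in resp.iter_lines():
    if line.startswith(b"data:"):
        chunks += 1
        if first is None:
            first = time.monotonic()  # approximate time-to-first-token

total = time.monotonic() - start
print(f"TTFT ~{first - start:.2f}s  total {total:.2f}s  ({chunks} SSE events)")
```

Run it on a schedule at different times of day for a week and you have a baseline distribution to compare against.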

What to Look For

The first diagnostic signal is time-to-first-token versus total completion time. High time-to-first-token with normal inter-token latency suggests queuing delays — a capacity issue. High inter-token latency points to compute contention during inference or throttling. Rate-limit headers tell you whether you're being throttled by policy, which can sometimes be fixed through batching or a tier upgrade.
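
Separating those two signals is mostly arithmetic once you have the log. A minimal sketch over the CSV produced by the addon above (the column positions are the ones that sketch writes):

```python
# diagnose.py -- split queuing delay from slow generation using the log.
import csv
import statistics

ttfb, total = [], []
with open("api_latency_log.csv") as f:
    for row in csv.reader(f):
        ttfb.append(float(row[1]))   # time to first byte ~ time-to-first-token
        total.append(float(row[2]))  # full round trip

def p95(xs):
    return statistics.quantiles(xs, n=20)[18]  # 95th percentile cut point

print(f"TTFB:  p50={statistics.median(ttfb):.2f}s  p95={p95(ttfb):.2f}s")
print(f"Total: p50={statistics.median(total):.2f}s  p95={p95(total):.2f}s")
# A high TTFB tail with a normal (total - TTFB) suggests queuing; the
# reverse suggests compute contention or throttling during generation.
```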

Response metadata also reveals model routing. Some providers silently route requests to different model versions based on load. Anthropic's API responses include a model field identifying the exact version serving your request. If that changes day to day, you've detected routing changes you'd otherwise never notice.
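
Detecting that drift is a one-liner over the same log (the model id is the last column in the addon sketch above):

```python
# model_drift.py -- detect silent model-version routing changes.
import csv
from collections import Counter

models = Counter()
with open("api_latency_log.csv") as f:
    for row in csv.reader(f):
        if row and row[-1]:
            models[row[-1]] += 1  # last column = model id, per the addon

print(models.most_common())  # more than one entry means routing changed
```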

The Psychological Value

The data serves a purpose beyond diagnostics. When watching a spinner, time perception distorts — five seconds feels like thirty. Having a log that says "this request took 8.3 seconds, and yesterday at the same time it took 2.1" is grounding. You're not crazy. The tool is probably slower. But without measurement, you're stuck in a vague, frustrating space of suspicion without evidence.


#2511: Measuring AI API Latency Through the Black Box

Corn
Daniel sent us this one — he's been running into something a lot of us have felt but can't quite prove. You're using an AI API through a tool like Claude Code, things feel sluggish, but the provider's status page says everything's green. He wants to know: is there actually a way to measure what's happening under the hood? Can you benchmark token throughput when you're accessing the API through a closed CLI tool? And more broadly, how do you wrap observability around something that's designed to be a black box?
Herman
Oh, this is such a good question. It's one of those things where the frustration is completely justified — you know something's off, but you can't point to a number. It's like when your ISP says your connection is fine while your video call looks like a mosaic.
Corn
By the way, DeepSeek V four Pro is writing our script today. So if this episode is particularly sharp, you know who to thank.
Herman
I'll reserve judgment until we see how it handles the mitmproxy section. But look, the core tension here is real. These API providers have every incentive to acknowledge full outages — those are impossible to hide — and basically zero incentive to tell you about degradation. "Slightly slower than usual" doesn't make it onto the status dashboard.
Corn
And for most users without an enterprise SLA, you don't even have contractual language to point at. You're just stuck. Which is where measurement becomes both practical and psychological. You want to know if you're being unreasonable, or if something's genuinely degraded.
Herman
The good news is, you absolutely can measure this. The bad news is, it's not a one-click solution. But that's what makes it interesting. Let me start with the cleanest path, because it exists and most people don't know about it. Claude Code — and this is fairly recent — ships with built-in OpenTelemetry support.
Corn
OpenTelemetry being the observability framework that's become the standard for distributed systems tracing.
Herman
So instead of having to intercept anything, you can just ask Claude Code to tell you what it's doing. You set an environment variable — CLAUDE_CODE_ENABLE_TELEMETRY=1 — and then you configure an OTEL exporter. That's OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_EXPORTER_OTLP_PROTOCOL, pointing at whatever collector you're using.
Corn
You'd point it at something like Jaeger, Grafana, or Honeycomb, and suddenly you're getting structured traces of every API call.
Herman
And those traces include timing data. Request sent, first token received, tokens per second, end of response. You can see exactly where the latency is. Is it time-to-first-token that's ballooned? That suggests the model's queuing or the inference infrastructure is backed up. Is it the inter-token latency that's spiking? That's a different problem — maybe compute contention, maybe they're throttling you.
Corn
This is all native to the tool. You're not hacking anything, you're not intercepting traffic. Claude Code just emits this stuff if you ask it to.
Herman
Which is honestly the right way to do it. But here's the catch — what if you're using a tool that doesn't support OpenTelemetry export? Or what if you want to measure something the telemetry doesn't expose, like the exact token counts coming back in the response headers?
Corn
Because the API response headers from Anthropic's endpoint include a lot of useful metadata. Request ID, token counts for the prompt and completion, sometimes latency breakdowns.
Herman
Claude Code's telemetry might or might not surface all of that. So if you want the raw data, you need to see the actual HTTPS traffic. Which brings us to the canonical tool for this: mitmproxy.
Corn
The man-in-the-middle proxy designed for exactly this kind of debugging.
Herman
The name is literal — it sits in the middle, decrypting your HTTPS traffic so you can inspect it. You install mitmproxy, generate its CA certificate, and install that certificate in your system's trust store. That's the crucial step — without it, your tools will reject the connection.
Corn
This is the part that makes people uncomfortable. You're deliberately adding a trusted certificate authority to your machine that exists solely to intercept your own traffic.
Herman
Which is exactly why you should only do this in a controlled environment, and remove that certificate when you're done. But for diagnostic purposes, it's completely legitimate. Once the certificate is trusted, you set the HTTPS_PROXY environment variable to point at your mitmproxy instance — typically localhost port 8080 — and then launch Claude Code, or whatever tool you're investigating.
Corn
Now every HTTPS request that tool makes gets routed through mitmproxy, which decrypts it, logs it, re-encrypts it, and forwards it to the real endpoint.
Herman
And the responses come back the same way. You see everything. The full request body, the full response body, all the headers. For Anthropic's API, the response headers include x-ratelimit-requests-remaining, x-ratelimit-tokens-remaining, and critically, the usage object in the response body with prompt tokens and completion tokens.
Corn
You're not just seeing timing — you're seeing exactly what the API is telling you about your rate limits and your consumption.
Herman
Which is useful beyond latency debugging. You might discover you're burning way more prompt tokens than you expected because of how the tool constructs its system prompts. But the timing piece is what Daniel was asking about, and mitmproxy can give you that too. You can run mitmdump — the non-interactive version — with a Python script that timestamps each request and response.
Corn
You'd write a small addon script that hooks into mitmproxy's event system. Request hook fires, you log the timestamp and the request ID. Response hook fires, you log the timestamp, calculate the round-trip time, extract the token counts, and dump it all to a file.
Herman
Now you've got a CSV or JSON log of every API call with precise timing. You can plot it over time, correlate it with time of day, see if there's a pattern. Maybe every weekday at 2 PM Eastern, latency doubles. That's not your imagination — that's data.
Corn
That's the thing. The psychological value of having the data is almost as important as the diagnostic value. When you're watching a spinner, your perception of time gets distorted. Five seconds feels like thirty. Having a log that says "no, that request actually took 8.3 seconds, and yesterday at the same time it took 2.1" — that's grounding.
Herman
There's a broader point here about the asymmetry of information between API providers and users. The providers have incredibly detailed internal metrics. They know exactly which inference nodes are running hot, what the p50 and p99 latencies look like across their fleet. And they surface almost none of that.
Corn
Whereas the user gets a binary status indicator. Green dot, everything's fine. Red dot, everything's broken. There's no yellow dot for "things are a bit sluggish, we're investigating."
Herman
To be fair, there's a reason for that. Degradation is hard to define precisely. If latency increases by twenty percent for five percent of users, is that a status page event? But if you're in that five percent, it matters a lot.
Corn
Which is why self-serve observability matters. If the provider won't tell you, you measure it yourself.
Herman
Now, I want to mention a couple of other approaches because mitmproxy is powerful but it's also heavy. You're decrypting and re-encrypting every packet. For a quick sanity check, there are lighter-weight options. Tools like llmperf and genai-perf are designed specifically for benchmarking LLM APIs directly.
Corn
These are more like load-testing tools. You point them at an API endpoint, give them a set of prompts, and they hammer the endpoint and give you latency distributions, throughput numbers, time-to-first-token, inter-token latency.
Herman
Llmperf was originally developed at Anyscale, and genai-perf is part of NVIDIA's Triton inference server tools. They're designed for exactly this — "what kind of performance am I actually getting from this endpoint?" The difference is, they're benchmarking the API directly, not observing what a specific tool like Claude Code is doing.
Corn
They're complementary. You use llmperf or genai-perf to establish a baseline — "when I hit the API directly with a simple request, I get these numbers." Then you use mitmproxy or OpenTelemetry to see what Claude Code is actually experiencing. If there's a gap, you know the tool is introducing overhead.
Herman
And that baseline is important because API performance varies by model, by time of day, by your geographic region, by your usage tier. The Anthropic API doesn't have a single latency number — it has a distribution. If you don't know what your distribution looks like, you can't tell if today is abnormal.
Corn
Let's talk about what you'd actually do with this data once you've collected it. What patterns would you look for?
Herman
The first thing I'd look for is time-to-first-token versus total completion time. If time-to-first-token is high but inter-token latency is normal, that suggests queuing delays. The request is sitting somewhere waiting for a compute slot. That's a capacity issue on the provider's side.
Corn
If inter-token latency is high?
Herman
That's more likely compute contention during inference, or possibly throttling. Anthropic's rate limits are documented — they're based on requests per minute and tokens per minute, and they vary by tier. If you're hitting those limits, you'll see it in the response headers.
Corn
Which brings us back to why inspecting the headers matters. You can't just time the request — you need to know if you're being rate-limited.
Herman
The x-ratelimit headers tell you exactly where you stand. If you see those numbers approaching zero, you know you're being throttled. That's not degradation — that's policy. And you can potentially fix it by batching differently or requesting a tier upgrade.
Corn
Another pattern worth watching for is model routing. Some providers silently route requests to different model versions based on load. You might be getting served by a slightly different fine-tune or quantization level without knowing it.
Herman
That's harder to detect, but if you're logging response headers carefully, some providers include a model version identifier. Anthropic's API responses include a model field that tells you exactly which model served your request — like claude-sonnet-4-20260501. If that changes day to day, you know something's going on with their routing.
Corn
Which is the kind of thing you'd never notice without observability. You'd just think "huh, Claude seems a bit off today" and move on.
Herman
That's the whole point of Daniel's question. You're not crazy. The tool is probably slower. But without measurement, you're stuck in this vague, frustrating space of suspicion without evidence.
Corn
Let's talk about the practical side of setting this up. If someone's listening and wants to try the mitmproxy approach tonight, what's the actual workflow?
Herman
Step one, install mitmproxy. It's in every package manager — brew install mitmproxy on Mac, pip install mitmproxy everywhere. Step two, run mitmproxy once interactively so it generates its certificate. The certificate lives in a dot mitmproxy directory in your home folder. Step three, install that certificate. On Mac, you double-click it and add it to your keychain, then explicitly trust it for SSL. On Linux, you copy it to the system trust store and run update-ca-certificates.
On Windows, you import it through the certificate manager MMC snap-in. Same principle — install the CA certificate, mark it as trusted. Then step four, set your environment variables: HTTPS_PROXY=http://localhost:8080. Step five, launch your tool.
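
For reference, the workflow Herman just described condenses to a few commands (macOS shown for the trust step; the certificate path is mitmproxy's default):

```bash
brew install mitmproxy          # step one (or: pip install mitmproxy)
mitmproxy                       # step two: run once, quit with q;
                                # the CA cert lands in ~/.mitmproxy/

# Step three (macOS): trust the generated CA certificate.
sudo security add-trusted-cert -d -r trustRoot \
  -k /Library/Keychains/System.keychain ~/.mitmproxy/mitmproxy-ca-cert.pem

export HTTPS_PROXY=http://localhost:8080   # step four
claude                                     # step five: launch your tool

# When you're done diagnosing, remove the certificate from the trust store.
```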
Corn
If you want the logging rather than the interactive UI, you use mitmdump with a script.
Herman
The script is the fun part. You write a Python file — maybe twenty lines — that hooks into mitmproxy's request and response events. In the request hook, you log the timestamp and the URL. In the response hook, you log the timestamp, the status code, and if the URL matches api.anthropic.com, you parse the response body as JSON and extract the usage object with the token counts.
Corn
You'd filter for the Anthropic domain specifically because you don't need to log every HTTPS request your machine makes.
Herman
Right, because Claude Code is also checking for updates, maybe fetching documentation. You only care about the API traffic. So you add a domain filter in your script. Then you run mitmdump with the -s flag pointing at your script, and it silently logs everything to a file.
Corn
Then you've got a dataset you can analyze. You could throw it into a spreadsheet, use Python with pandas, or set up a Grafana dashboard if you wanted to get fancy.
Herman
The Grafana path is worth mentioning because if you're doing this regularly, you probably want a proper observability stack. And that's where the OpenTelemetry approach really shines. Instead of writing custom mitmproxy scripts, you set up an OpenTelemetry collector — there's a Docker image for it — and configure Claude Code to export to it. Then you point the collector at Grafana's Tempo for traces and Prometheus for metrics.
Corn
Suddenly you've got a production-grade observability setup for your AI tooling. Which feels like overkill until you remember that people are building businesses on these APIs.
Herman
If your revenue depends on LLM throughput, latency isn't an annoyance — it's a cost. Measuring it is just good operations.
Corn
There's an interesting question here about whether the API providers should be doing more of this themselves. Status pages are a bit of a joke in the industry — everyone knows the green dot lags reality by hours sometimes.
Herman
The status page problem is well-documented. Most status pages are updated manually. Someone has to notice the issue, confirm it, decide it's status-page-worthy, draft the update, get it approved, and publish it. By the time all that happens, users have been suffering for a while.
Corn
There's a cultural thing too. Companies are reluctant to admit partial degradation because it raises questions about their reliability. A full outage is a clear, discrete event with a clear resolution. "Slightly slower than usual" is fuzzy, and it invites follow-up questions they might not want to answer.
Herman
Meanwhile, the actual engineers at these companies have dashboards that would make your eyes water. They know, down to the millisecond, what's happening. There's just a gap between what they can see and what they choose to share.
Corn
Which is why Daniel's instinct to measure it himself is the right one. Don't wait for the provider to tell you — build your own view.
Herman
I want to mention one more thing about mitmproxy that people might not realize. It's not just for HTTPS — it can also handle WebSocket connections. And a lot of these AI APIs support streaming responses over Server-Sent Events or WebSockets. If you're using streaming, the latency profile is different. You care about time-to-first-token, but you also care about whether tokens are arriving steadily or in bursts.
Corn
Streaming makes the measurement more nuanced. A simple round-trip time doesn't capture the experience. You need to measure the inter-arrival time of individual tokens.
Herman
Mitmproxy can do that if you write your script to handle streaming responses. You hook into the response body chunks as they arrive and timestamp each one. It's more work, but it gives you a much richer picture.
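
A sketch of that per-chunk timing. One caveat: the callable form of flow.response.stream has moved around between mitmproxy versions, so verify against the docs for the version you run.

```python
# stream_timing.py -- timestamp streamed (SSE) chunks as they pass through.
import time

from mitmproxy import http

class ChunkTimer:
    def responseheaders(self, flow: http.HTTPFlow) -> None:
        if flow.request.pretty_host != "api.anthropic.com":
            return
        t0 = time.monotonic()

        def tap(chunk: bytes) -> bytes:
            # One line per chunk: elapsed seconds and chunk size.
            print(f"{time.monotonic() - t0:8.3f}s  {len(chunk):6d} bytes")
            return chunk  # pass the data through unmodified

        flow.response.stream = tap

addons = [ChunkTimer()]
```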
Corn
Let's pull back and talk about what "good" looks like. If someone goes through all this effort and starts collecting data, what numbers should they expect? What's normal latency for an LLM API?
Herman
That's highly variable, and it depends on the model, the prompt length, the output length, the time of day, and your tier. But for rough ballpark numbers — and these are from community reports, not official figures — time-to-first-token on Claude Sonnet for a moderate-length prompt is typically in the range of one to three seconds. Inter-token latency is maybe thirty to fifty milliseconds. So a five hundred token response takes somewhere between fifteen and twenty-five seconds total.
Corn
If you're seeing time-to-first-token of eight seconds, something's wrong.
Herman
Something's definitely wrong. And that's where having your own data becomes powerful. You can open a support ticket and say "my p95 time-to-first-token has increased from 2.1 seconds to 7.8 seconds over the past three days, here's the data." That's a much harder ticket to dismiss than "Claude feels slow."
Corn
Though I suspect for most users without enterprise support, that ticket still goes into a void.
Herman
But at least you know. At least you're not guessing. And if you're making build-versus-buy decisions or evaluating whether to switch providers, you have actual numbers.
Corn
There's a broader trend here worth noting. As AI becomes infrastructure, the expectations around reliability and observability are going to converge with what we expect from other infrastructure services. Nobody would run a production database without monitoring. In five years, nobody will run production AI workloads without monitoring either.
Herman
The tools will get better. The OpenTelemetry support in Claude Code is a sign of where things are heading. I'd expect more tools to expose telemetry natively, and I'd expect the API providers themselves to offer richer observability dashboards. There's a competitive advantage in being transparent about performance.
Corn
Or at least, there should be. The countervailing force is that if you're the only provider publishing detailed latency data, and your competitor isn't, you look worse even if you're actually faster. Because the other guy's users are just guessing.
Herman
That's the market for lemons problem applied to API latency. The providers with worse performance have an incentive to hide it, which makes transparency a liability. Unless users start demanding it.
Corn
Which is where grassroots measurement comes in. If enough people are running their own observability and sharing results, the information asymmetry starts to break down.
Herman
There's a community aspect to this. Tools like the LMSYS Chatbot Arena have done a lot to make model quality transparent through crowdsourced evaluation. We don't really have an equivalent for API performance. There's no public dashboard that says "here's the current p50 latency for every major LLM API, updated hourly."
Corn
Someone should build that.
Herman
Someone probably will. The data's not that hard to collect if you've got endpoints in a few geographic regions and a modest budget for API credits.
Corn
Alright, let's bring this back to the practical. Daniel's original question was about measuring token throughput through a closed CLI tool. We've covered the three main approaches. One, use the tool's native telemetry if it exists — Claude Code's OpenTelemetry export being the cleanest example. Two, use mitmproxy to intercept and log the HTTPS traffic, which works for any tool regardless of whether it supports telemetry. Three, establish a baseline with direct API benchmarking tools like llmperf or genai-perf so you can distinguish tool overhead from API degradation.
Herman
The choice depends on what you're trying to learn. If you just want to know "is the API slow right now?", run a direct benchmark with llmperf or genai-perf. It's the quickest path to an answer. If you want to know "is Claude Code doing something that makes my experience slower than the raw API?" — use the OpenTelemetry export if available, or mitmproxy if not. If you want ongoing monitoring, set up the OpenTelemetry collector and Grafana.
Corn
The mitmproxy path is the most general, but it's also the most invasive. You're installing a trusted CA certificate that can decrypt all your traffic. That's not something to do casually.
Herman
I should emphasize — remove that certificate when you're done diagnosing. Don't leave it in your trust store. It's a powerful tool, but it's also a security risk if you forget about it.
Corn
Good practice: create a dedicated diagnostic environment. A virtual machine, a container, even just a separate user account on your machine. Install the certificate there, do your testing, then tear it down.
Herman
That's the right approach. And if you're doing this in a corporate environment, talk to your security team first. They will have opinions about you installing custom CA certificates, and those opinions will be strongly worded.
Corn
I can imagine.
Herman
One more thing about the OpenTelemetry path that I want to flag. The environment variables are CLAUDE_CODE_ENABLE_TELEMETRY=1, and then the standard OTEL variables: OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_EXPORTER_OTLP_PROTOCOL — usually http/protobuf — and optionally OTEL_SERVICE_NAME to identify your instance.
Corn
If you don't have an OpenTelemetry collector running, you can use a service like Honeycomb or Grafana Cloud that accepts OTLP directly. You don't need to self-host.
Herman
Those services have free tiers that are more than enough for personal diagnostics. You sign up, get an endpoint and an API key, set the environment variables, and you're getting traces in minutes.
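
The hosted path is the same variables as the local-collector setup, plus an auth header. Honeycomb's endpoint and header name are shown here as an example; check your provider's OTLP docs.

```bash
export CLAUDE_CODE_ENABLE_TELEMETRY=1
export OTEL_METRICS_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
export OTEL_EXPORTER_OTLP_ENDPOINT=https://api.honeycomb.io
export OTEL_EXPORTER_OTLP_HEADERS="x-honeycomb-team=YOUR_API_KEY"
```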
Corn
Which is a lot less work than setting up mitmproxy with custom scripts. If your tool supports it, native telemetry is always the better path.
Herman
But the mitmproxy knowledge is good to have in your back pocket because not every tool supports OpenTelemetry. And even for tools that do, you might want to verify what they're reporting. Trust but verify.
Corn
Spoken like a true retired pediatrician turned DJ turned AI performance sleuth.
Herman
It's a natural career progression.
Corn
Now: Hilbert's daily fun fact.
Herman
The average cumulus cloud weighs approximately 1.1 million pounds — roughly the same as eighty elephants suspended in the sky.
Corn
Where does this leave the listener who wants to actually do something with all this? I think the most actionable takeaway is: start simple. Before you set up mitmproxy or OpenTelemetry collectors, just run a quick baseline with llmperf or genai-perf. Point it at the Anthropic API with a few representative prompts and see what you get. Do it at different times of day for a week. That alone will tell you a lot.
Herman
It takes maybe an hour to set up. llmperf is open source, it's on GitHub, the documentation is decent. You give it a config file with your API key, your model, your prompts, and it spits out a latency distribution. That's your baseline. Then when things feel slow, you run it again and compare. Now you're not guessing — you're measuring.
Corn
The second takeaway is: if you're a heavy Claude Code user, enable the OpenTelemetry export. Even if you don't set up a collector immediately, just knowing it's available and understanding what it can do puts you ahead of most users. When you do hit a performance issue, you'll know where to look.
Herman
The third takeaway is: don't assume the provider's status page tells the whole story. It tells you about complete outages. It doesn't tell you about degradation, regional routing issues, model version changes, or rate limiting. Those are things you can only see if you're looking at the actual traffic.
Corn
The broader principle here is that as AI moves from novelty to infrastructure, we need to treat it like infrastructure. That means monitoring, alerting, baselines, and a healthy skepticism of green dots on status pages. The tools exist. They're not always polished, they're not always easy, but they work. And the difference between "Claude feels slow today" and "my p95 time-to-first-token has increased by 340 percent since Tuesday" is the difference between helpless frustration and actionable information.
Herman
And I'll add — this is only going to become more important. As more businesses build critical workflows on top of these APIs, the tolerance for opaque degradation is going to drop. Either the providers will step up their observability game, or users will build it themselves. Either way, the days of just hoping it'll be better tomorrow are numbered.
Corn
One open question I'm left with: what's the threshold at which self-serve observability becomes table stakes? Right now it's something power users do. But if you're running a customer-facing application backed by an LLM API, and you don't have latency monitoring, you're essentially flying blind. At what point does that become negligent?
Herman
I think we're already there for any business that's serious about AI. If your product depends on these APIs, and you don't know what your latency distribution looks like, you're not doing your job. The tools are available. The knowledge is out there. There's no excuse.
Corn
On that note — thanks to Hilbert Flumingtop for producing. This has been My Weird Prompts. You can find every episode at myweirdprompts dot com. If you've got a question like Daniel's — something that's been nagging at you about the tools you use every day — send it our way. We'll dig into it.
Herman
See you next time.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.