Imagine it is two in the morning. You are fast asleep, dreaming of whatever it is people dream about—presumably not spreadsheets—and suddenly your phone starts screaming. Your autonomous AI agent, the one you spent six months perfecting to handle your customer logistics, has just absolutely face-planted. It didn't just stumble; it's hallucinating wild data and failing your most important clients because every call it makes is now bouncing back as a 400-level error. Why? Because an upstream API team changed a required field in a JSON schema and didn't tell anyone. This is the nightmare of the agentic age, and today's prompt from Daniel is right over the target. He wants to talk about the tooling landscape for testing APIs and MCP servers, specifically looking at Postman and MCP Explorer, and how we handle the growing crisis of API-MCP drift.
It’s a massive problem, Corn. We are moving from a world where humans read documentation to a world where models consume schemas, and the margin for error has basically vanished. By the way, for the eagle-eyed listeners out there, today’s episode is actually being powered by Google Gemini 1.5 Flash. It’s writing our script today, which is fitting since we’re talking about the very protocols that allow models like Gemini to actually interact with the real world. I’m Herman Poppleberry, and I’ve been diving deep into the 2026 tooling reports from TestMu and SmartBear. The reality is that the "plumbing" of the internet is getting a serious upgrade, but our testing habits are still stuck in 2022.
You know, it’s funny you call it plumbing. I always think of you as the guy who gets excited about the specific grade of PVC pipe being used in the basement. But Daniel’s point about drift is what really caught my eye. If the API moves and the Model Context Protocol server stays still, the "brain" of the AI is essentially looking at an old map of a city that’s been redesigned. It’s going to drive the car into a lake. Herman, before we get into the existential dread of broken agents, let’s talk tools. Postman is the big name everyone knows. Is it still the king, or has it been disrupted by all this agentic madness?
Postman is still the industry standard, but it’s had to evolve rapidly. If you look at their January 2026 release, they actually added native MCP server testing capabilities. They realized that you can’t just test the HTTP status code anymore. In the old days—which, in AI years, was like three weeks ago—Postman was great for making sure that if you sent a GET request, you got a 200 OK back with the right payload. Now, with Postbot, their AI assistant, it can actually auto-generate test scripts and documentation. But the real shift is their bi-directional sync with GitHub and GitLab. They’re treating API collections as living code, which is step one in fighting that drift Daniel mentioned.
But wait, when you say "Postbot," are we talking about the AI just guessing what the tests should be? I’ve seen some of these auto-generated scripts, and half the time they’re just checking if the response is "JSON." That’s like a building inspector checking if a house has walls but ignoring the fact that the front door opens into a sheer drop.
That’s a fair critique, but it’s gotten much more sophisticated. Postbot now performs what’s called "Contractual Inference." It looks at your historical traffic and your OpenAPI spec simultaneously. If it sees that you’ve added a discount_code field to your checkout API, it doesn't just check if the field exists; it writes a test case to ensure the value is a valid string of a certain length. It’s trying to bridge that gap between "Does the code run?" and "Does the code do what the documentation says it does?"
But how does it handle the logic of that field? If the discount_code should only work on Tuesdays, does Postbot figure that out, or is it just looking at the data shape?
It’s starting to look at the business logic by analyzing the documentation strings in your code. It uses RAG—Retrieval-Augmented Generation—to pull in your internal Confluence pages or READMEs to understand the intent of the API. So if your docs say "Discounts are only valid for midweek purchases," Postbot will actually attempt to generate a test case that sends a request with a Monday timestamp and expects a specific error message. It’s moving away from unit testing and toward "Intent Validation."
Okay, so Postman is the Swiss Army knife that now has a little laser pointer for AI. But Daniel also mentioned MCP Explorer. Now, I’ve played around with this a bit, and it feels different. It doesn't feel like a request-response tool; it feels like a "What is my AI seeing?" tool. Is that a fair distinction?
That is exactly the distinction. Think of Postman as testing the pipes—is the water flowing? Is the pressure right? MCP Explorer is testing the "semantic layer." It’s the "Postman for MCP." Its job is to let a developer connect to a local or remote MCP server and see exactly what tools, resources, and prompts are being exposed to the LLM. The latest version, 2.3.0, which just dropped in March, introduced this incredible automated schema validation. It doesn't just check if the code runs; it checks if the schema you’re providing to the AI is actually valid according to the MCP spec. It allows you to simulate agent interactions in a sandbox. So, you can see if the AI is going to interpret a tool's description correctly before you ever deploy it.
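To make that "what is my AI seeing?" idea concrete, here is a minimal sketch of a tool definition as an MCP server exposes it to the model, plus the kind of basic schema check an explorer-style validator performs. The field names (name, description, inputSchema) follow the public MCP spec; the weather tool itself and the `validate_tool` helper are invented for illustration.

```python
# One tool as it would appear in an MCP tools/list response.
# The description is what the LLM actually reads when deciding to call it.
get_weather_tool = {
    "name": "get_weather",
    "description": (
        "Retrieve the current temperature for a city. "
        "Returns degrees Celsius as a number."
    ),
    "inputSchema": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'London'"},
        },
        "required": ["city"],
    },
}

def validate_tool(tool: dict) -> list[str]:
    """Flag the basic problems an explorer-style validator would catch."""
    problems = []
    if not tool.get("name"):
        problems.append("tool has no name")
    if not tool.get("description"):
        problems.append("tool has no description for the LLM to read")
    schema = tool.get("inputSchema", {})
    for field in schema.get("required", []):
        if field not in schema.get("properties", {}):
            problems.append(f"required field '{field}' is not defined")
    return problems

print(validate_tool(get_weather_tool))  # prints []
```

The point of the sketch: the validator can prove the schema is well-formed, but only a human (or a simulated agent) can tell you whether that description will steer the model correctly.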
See, that’s the part that fascinates me. In a traditional API, if I name a field "user_id," the computer doesn't care if I call it "user_id" or "u_id" as long as the code matches. But with MCP, the description of that field actually matters for how the LLM decides to use it. If I change the description to be slightly more ambiguous, the AI might start passing the wrong data. Does Postman catch that?
Generally, no. Postman is looking for structural integrity. It sees a string is a string. MCP Explorer is where you realize, "Oh, I described this temperature tool in Celsius, but the underlying API is now returning Kelvin." The AI reads the description, thinks that 294 is degrees Celsius, and suddenly your smart thermostat is blasting the air conditioning to rescue you from a living room it believes is an oven. That is "Behavioral Drift." It's not a technical failure in the sense that the code crashed; it's a communication failure between the developer and the model.
It’s like giving someone a manual for a Boeing 747 but the buttons inside the cockpit are for a Cessna. Technically, they are all buttons, but the pilot is going to be very confused when they try to put the landing gear down. But how does MCP Explorer actually show you that? Does it have a "Confusion Meter" for the AI?
Sort of! It has a feature called "Prompt Preview." When you click on a tool in the Explorer, it shows you exactly how that tool will be presented to the LLM's context window. It also has a "Simulated Execution" mode where you can type a natural language command like "Get the weather in London" and see which tool the Explorer thinks the LLM will pick based on your descriptions. If it picks the "Update Weather" tool instead of "Get Weather" because your descriptions are too similar, you know you have a problem before you even hit production.
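A toy version of the ambiguity problem described above: if two tool descriptions share most of their words, a model choosing on descriptions alone can easily pick the wrong one. Real tooling would use embeddings or an actual model for this; raw word overlap is just the cheapest possible proxy, and both tools here are invented for illustration.

```python
# Jaccard overlap between two descriptions: shared words / total words.
def description_overlap(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

tools = {
    "get_weather": "Get the current weather for a city",
    "update_weather": "Update the current weather for a city",
}

# These two differ by a single word, so a description-driven tool picker
# is one ambiguous phrasing away from calling the wrong one.
score = description_overlap(tools["get_weather"], tools["update_weather"])
if score > 0.6:
    print(f"ambiguous pair: get_weather / update_weather (overlap {score:.2f})")
```

The fix is usually not code at all: rewrite one description so the model can tell the read path from the write path.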
That "Simulated Execution" sounds like a game changer. Can you feed it different "personalities" or model types? Like, does Claude interpret a tool description differently than GPT-4o?
MCP Explorer 2.3.0 actually has a "Model Selector" in the testing pane. You can toggle between different model backends to see if a particular description is "brittle." Some models are more sensitive to specific keywords. For instance, GPT-4 might be very forgiving if you forget to specify a required parameter in the description, but a smaller, faster model like Llama 3 might just hallucinate a value because it didn't understand the constraint. MCP Explorer highlights those discrepancies in red.
That’s a huge time saver. I want to talk about this drift issue more because Daniel asked if we should be developing these in parallel. But before we get there, let's look at some other players. You mentioned Insomnia and something called TestSprite?
Insomnia is still the lightweight darling for people who find Postman too bloated. They’ve got a great plugin ecosystem. But TestSprite is the one that’s actually making waves in 2026. It uses autonomous AI to "crawl" your APIs and MCP servers. Instead of a human writing a test case like "What happens if the input is null?", TestSprite just pokes and prods the system in thousands of ways a human would never think of. It’s essentially using an AI to find the edge cases that will trip up your other AI.
It takes an AI to catch an AI. I like that. It’s very "Blade Runner" of them. But does TestSprite actually understand the MCP protocol, or is it just banging on the API endpoints?
It understands the protocol. It actually reads the mcp.json configuration and says, "Okay, I see you have a tool called delete_record. What happens if I try to call delete_record with a string instead of a UUID? And more importantly, what happens if I try to convince the AI agent that it needs to call this tool on a system file?" It’s doing fuzz testing but at the semantic level. It’s looking for "Instruction Injection" vulnerabilities where the API and the MCP layer aren't perfectly aligned on permissions.
Wait, can it actually simulate a social engineering attack on the MCP server? Like, if I have a tool that can access emails, can TestSprite find a way to trick the tool into leaking data?
Yes, it uses what’s called an "Adversarial LLM" in the background. It will try to find phrases or sequences of tool calls that bypass your safety filters. For example, if your API has a limit on how many records can be fetched at once, TestSprite will try to find a "recursive tool loop" where it asks the agent to call the tool over and over in a way that bypasses that rate limit. It’s testing the logic of the agent's autonomy, which is something a traditional unit test in Postman could never do.
Let’s get to the meat of Daniel’s question. The drift. We’ve got the API, which is the actual functionality, and we’ve got the MCP server, which is the bridge that tells the AI how to use that API. Currently, most teams build the API, and then some poor engineer in the corner is tasked with "making it work with Claude" or "making it work with Gemini," so they wrap it in an MCP layer. That feels like a recipe for disaster.
It’s a nightmare. The problem is that the API is often owned by the backend team, and the MCP server is owned by the "AI Engineering" or "Product" team. They move at different speeds. When the backend team optimizes an endpoint—maybe they flatten a nested JSON object to improve latency—they’ve technically improved the API. But if they don't update the MCP schema, the AI agent is still looking for that nested object. It fails. And because LLMs are "helpful," they might not just throw an error; they might try to "guess" where the data went, leading to those silent failures we discussed.
So, the AI becomes like a confident liar. "Oh, I couldn't find the 'address_details' object, so I'll just assume the user lives at 123 Main Street." That’s terrifying. Daniel asks: Should we be creating and updating these always in parallel? Herman, you’re the donkey who reads the white papers. Is "parallel development" the gold standard now?
The industry is shifting toward what we call "Contract-First Agentic Design." Instead of building the API and then wrapping it, you start with a single source of truth—usually an OpenAPI or Swagger spec. In 2026, the best practice is to use tools like Smithery’s MCP generator. You point it at your OpenAPI spec, and it procedurally generates the MCP server. If you change the OpenAPI spec, the MCP server updates automatically. That’s the only way to truly kill the drift.
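The contract-first idea can be sketched in a few lines: walk the OpenAPI spec and derive the MCP tool definitions from it, so there is nothing to keep in sync by hand. Real generators (the episode mentions Smithery's) handle auth, pagination, and response schemas; this is only the single-source-of-truth skeleton, and the user endpoint is invented.

```python
# A toy OpenAPI fragment acting as the single source of truth.
openapi_spec = {
    "paths": {
        "/users/{id}": {
            "get": {
                "operationId": "get_user",
                "summary": "Retrieve a user's subscription status and billing history",
                "parameters": [
                    {"name": "id", "in": "path", "required": True,
                     "schema": {"type": "string"}},
                ],
            }
        }
    }
}

def generate_mcp_tools(spec: dict) -> list[dict]:
    """Derive MCP tool definitions from OpenAPI operations."""
    tools = []
    for path, methods in spec["paths"].items():
        for method, op in methods.items():
            props, required = {}, []
            for p in op.get("parameters", []):
                props[p["name"]] = p["schema"]
                if p.get("required"):
                    required.append(p["name"])
            tools.append({
                "name": op["operationId"],
                # The summary becomes the LLM-facing description; this is
                # the part a human still has to enrich by hand.
                "description": op.get("summary", f"{method.upper()} {path}"),
                "inputSchema": {"type": "object",
                                "properties": props,
                                "required": required},
            })
    return tools

print(generate_mcp_tools(openapi_spec)[0]["name"])  # prints get_user
```

Change the spec and regenerate, and the MCP side moves with it; the drift simply has nowhere to live.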
See, that makes sense to me as a sloth. I don't want to do the work twice. If I can write one definition and have it pop out both the API documentation for humans and the MCP schema for the bots, I’m a happy camper. But is it really that simple? Can you really just "generate" a good MCP server? Because, as we talked about, the descriptions matter. A generic generator might give a tool a description like "GET /api/v1/user," but an AI needs to be told "Use this tool to retrieve a user's subscription status and billing history."
You’ve hit on the big limitation. Automated generation gets you eighty percent of the way there—it handles the types, the endpoints, and the required fields. But that last twenty percent, the "semantic hint" for the AI, still requires a human touch. This is why some teams are moving to an "MCP-First" approach. They are actually designing APIs specifically for AI consumption. They flatten the data structures—because LLMs actually struggle with deeply nested JSON—and they write the documentation specifically to be read by a model.
Wait, so we're redesigning the internet to be "flatter" because the AI is a bit picky about its data structures? That feels like the tail wagging the dog, Herman. Shouldn't the AI just get better at reading nested JSON?
It’s not just about the AI being "picky." It’s about token efficiency and reliability. Every layer of nesting is more tokens the model has to parse and more chances for it to lose the thread of the conversation. If you provide a flat, descriptive response, the model is faster, cheaper, and more accurate. It’s about "Semantic Versioning" too. In 2026, we’re seeing versioning where version 1.2 of an API is tied in lockstep to version 1.2 of the MCP definition. You don't deploy one without the other.
But how do you enforce that? If I’m a developer in a rush, I’m going to push the API fix and say "I’ll update the MCP server tomorrow." How does the tooling stop me from being a lazy sloth?
This is where "CI/CD Gates" come in. In a modern 2026 workflow, your GitHub Action or GitLab Pipeline won't let you merge a Pull Request if the OpenAPI spec and the MCP schema are out of sync. There are specialized linters now—like mcp-linter—that act as a referee. If you change a field name in the code but the description field in your MCP server still references the old name, the build breaks. It forces that parallel development Daniel was asking about.
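A minimal drift gate of the kind such a pipeline could run: compare the fields the API declares against the fields the MCP schema advertises, and report every mismatch so the build can be failed. The episode's `mcp-linter` is described as a much richer version of this; the field names below are invented, and a real gate would exit non-zero on any error to block the merge.

```python
def find_drift(openapi_fields: set, mcp_fields: set) -> list[str]:
    """Return one error per field that exists on only one side of the contract."""
    errors = []
    for f in sorted(openapi_fields - mcp_fields):
        errors.append(f"field '{f}' exists in the API but not in the MCP schema")
    for f in sorted(mcp_fields - openapi_fields):
        errors.append(f"field '{f}' exists in the MCP schema but not in the API")
    return errors

api = {"postal_code", "user_name"}
mcp = {"user_zip", "user_name"}   # stale: still advertises the old field name

for e in find_drift(api, mcp):
    print("DRIFT:", e)
# In CI you would exit non-zero here, so the PR cannot merge until both
# sides of the contract are updated together.
```

That last comment is the enforcement Daniel asked about: the lazy path (ship the API, fix the MCP server tomorrow) simply stops compiling.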
Is there a way to automate the description updates too? Like, if I change a field from user_zip to postal_code, can an AI update the MCP description for me?
Yes, there are "Semantic Diff" tools now. When you run your build, the tool compares the old code to the new code and uses an LLM to suggest updates to the MCP descriptions. It will say, "Hey, you changed the logic of this function; I’ve updated the tool description to reflect that the output is now sorted by date." The developer just has to click "Approve." It removes the friction that usually leads to documentation rot.
It’s like a marriage contract. "I, API, promise to stay exactly as I am described in this MCP schema, for better or for worse, in 200 OKs and in 500 Internals." I want to dive into a case study Daniel might appreciate. Think about a fintech startup. They’re using an AI agent to process loan applications. They have an API that checks credit scores. One day, the credit score API starts returning a "confidence_score" alongside the actual credit number. The backend team thinks, "Great, more data!" But the MCP server hasn't been updated to tell the AI what "confidence_score" means. What happens?
This actually happened to a firm in London last year. The AI saw the new field, didn't know what it was, and started treating the confidence score as if it were the credit score. A confidence of 99 reads like a score of 99 out of 100, basically perfect, so the agent was approving loans for applicants who actually had a 400 credit score. That cost them millions before they realized the agent was misinterpreting the data. They solved it by implementing "Tusk Drift," which is a specialized tool that constantly compares live API responses to the MCP schema. If it sees a field that isn't in the schema, it flags it immediately.
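The core of that monitoring idea fits in a few lines: compare every live response to the schema the agent was given, and flag any field the schema does not know about before the model starts guessing at its meaning. This is only the shape of the check, not the actual tool, and the fintech fields are invented for illustration.

```python
# The fields the MCP schema has actually described to the agent.
known_fields = {"credit_score"}

def flag_unknown_fields(response: dict, known: set) -> list[str]:
    """Return every response field the agent's schema knows nothing about."""
    return [k for k in response if k not in known]

# A live response after the backend quietly added confidence_score.
live_response = {"credit_score": 400, "confidence_score": 99}

unknown = flag_unknown_fields(live_response, known_fields)
print(unknown)  # ['confidence_score']: halt the agent before it guesses
```

The cheap insight here is that the dangerous moment is not the error, it's the surprise: an unflagged extra field is exactly what an over-helpful model will improvise around.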
That is a perfect example of why this isn't just "nerd stuff." It’s "don't lose all your money" stuff. But let’s play devil’s advocate. What if the API change is actually internal? Like, the backend team changes the database from Postgres to MongoDB, but the JSON output remains identical. Does the MCP server even care?
In that specific case, no. That’s the beauty of the abstraction. As long as the "Contract" remains the same, the AI doesn't care if the data is coming from a database, a spreadsheet, or a guy named Gary typing really fast. But the problem is that "identical output" is a myth. There is always a tiny change—a date format shifts from ISO 8601 to a Unix timestamp, or a null value becomes an empty string. To a human, that’s a five-minute fix. To an AI agent, it’s a total breakdown of reality.
Wait, a null versus an empty string? Does that really break an LLM? I thought these things were supposed to be smart.
They are smart, but they are also literal. If the MCP schema says a field will return a "null" if no data is found, the LLM might have a specific instruction to "ask the user for more info if the field is null." If it gets an empty string instead, it might think the data is an empty string and try to process it, leading to a "Hello, [Empty String]!" message or, worse, a database error when it tries to save that empty string into a strictly typed field. Testing tools like Postman are now specifically designed to catch these "Type-Semantic Mismatches."
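The null-versus-empty-string trap, concretely: the schema promises null when data is missing, the API starts sending an empty string instead, and a literal-minded consumer happily processes the empty string. A sketch of the kind of type-semantic check described above, with the helper name invented:

```python
def check_missing_marker(value, schema_says_nullable: bool) -> str:
    """Distinguish the schema's promised 'missing' marker from look-alikes."""
    if value is None:
        return "missing (as the schema promised)"
    if value == "" and schema_says_nullable:
        # Both values are "falsy" to a human reader, but to the agent this
        # is real data, so it will try to process it instead of asking.
        return "MISMATCH: got empty string where schema promises null"
    return "present"

print(check_missing_marker(None, True))   # missing (as the schema promised)
print(check_missing_marker("", True))     # MISMATCH: got empty string where schema promises null
print(check_missing_marker("Ada", True))  # present
```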
So, if I'm a developer listening to this, and I've got an API and I'm building an MCP server for it, my first step is probably getting Postman for the basics and MCP Explorer for the agentic side. But how do I actually set up a workflow that doesn't result in me wanting to pull my hair out?
Step one: Use Postman for your contract testing. Set up mock servers. Postman’s mock servers are incredible because they allow you to test your MCP server against a "fake" version of your API that behaves exactly like the real one. This lets you iterate on the AI side without needing the backend to be finished. Step two: Use MCP Explorer to "interrogate" your server. Ask yourself: "If I were a tired, slightly confused LLM, would I know what this tool does based on this description?" Step three: Automate the sync. If you’re using TypeScript, use the official MCP SDKs to pull your types directly from your API code. If the code changes and doesn't compile, your MCP server won't either.
I like that "confused LLM" test. I usually just pretend I'm talking to you, Herman, and if you can understand it, then surely a multi-billion parameter model can. But how often should we be running these tests? Is this a "once a week" thing, or a "every time I save a file" thing?
It’s a "Shift Left" thing. You should be running MCP Explorer locally while you develop. But the real heavy lifting happens in the staging environment. You should have "Shadow Agents"—agents that aren't touching real customer data—constantly running against your staging MCP server. If the shadow agent starts failing its tasks, you know you’ve introduced drift. It’s continuous integration for behavior, not just code.
"Shadow Agents." That sounds like a cool band name. But practically, how do you measure if a shadow agent "fails"? Is it just looking for errors, or is it checking the quality of the output?
It’s both. We use "LLM-as-a-judge." You have a second, more powerful LLM—like a GPT-4o or a Claude Opus—monitoring the shadow agent. The judge has a rubric: "Did the agent use the correct tool? Did it interpret the API response accurately?" If the judge sees the agent struggling, it logs a "Semantic Regression." This is the only way to catch those subtle shifts where the code is "fine" but the performance is degrading.
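The deterministic half of that rubric can be sketched without any LLM at all: given a trace of what the shadow agent did, check it against the expected tool and log a semantic regression on divergence. In a real setup the judge is itself a model that also scores how the agent interpreted the response; everything below, including the trace shape, is invented for illustration.

```python
def judge_trace(trace: dict, expected_tool: str) -> dict:
    """Score one shadow-agent run against the rubric's tool-choice question."""
    verdict = {"tool_correct": trace["tool_called"] == expected_tool,
               "regressions": []}
    if not verdict["tool_correct"]:
        verdict["regressions"].append(
            f"agent called '{trace['tool_called']}', expected '{expected_tool}'")
    return verdict

# A shadow agent picked the wrong weather tool, the exact failure the
# earlier "Simulated Execution" discussion warned about.
trace = {"task": "Get the weather in London", "tool_called": "update_weather"}
print(judge_trace(trace, expected_tool="get_weather"))
```

Runs that pass this cheap check still go to the LLM judge for the fuzzier question: did the agent understand what the tool gave back?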
But seriously, the "Documentation Debt" is real. We used to joke that documentation was the thing developers did last and hated the most. Now, if you don't do it, your product literally doesn't work. It’s not just a courtesy for other developers; it’s the source code for the AI’s behavior.
It really is. We’re seeing this new role emerge: the "AI Interaction Engineer" or "Semantic Architect." Their whole job is managing this bridge. They aren't just writing code; they’re writing the "Prompts" and "Descriptions" that live inside the MCP server. And they’re using tools like TestSprite to run thousands of "creativity tests" to see if the AI can find a way to break the API. For example, can the AI trick the API into giving it data it shouldn't have by combining two different tools in a way the developers didn't intend?
Oh, like the "jailbreaking via API" problem. "I can't access the admin panel directly, but if I use the 'Generate Report' tool and then the 'Email Result' tool, I can send the admin database to my own inbox." That’s a classic lateral move. And if your MCP server gives the AI too much "context" or too many permissions, you’re just handing it the keys to the castle.
And that brings up a great point about the "Human in the Loop." Automated testing is great for catching a field change from an integer to a string. But it’s bad at catching "Logic Drift." If the meaning of the data changes—say, a "price" field used to include tax but now it doesn't—every automated test might still pass. The AI will still get a number. But the math is now wrong. You still need a human to look at the MCP Explorer output and say, "Wait, we need to clarify to the AI that this price is now pre-tax."
It’s the "Sloth Principle"—do it right once so you don't have to fix it ten times later. I want to go back to Daniel’s question about parallel development. If we agree that parallel is the way to go, does that mean the "wrapper" model is dead? Is the idea of "I have an existing API, let me just slap an MCP server on top of it" a bad strategy?
It’s a transition strategy, but it’s not a long-term solution. It’s like putting a motor on a bicycle. It works, but eventually, you want a motorcycle that was designed from the ground up to handle that speed. If you’re building something new today, you should be building "Agent-First." You design the MCP interface first, because that’s how your users—via their agents—are going to interact with your service. The API then just becomes the implementation detail of that MCP interface.
That is a radical shift. That’s flipping the script entirely. "The API is an implementation detail of the AI interface." I can hear the backend engineers screaming from here, Herman. They’ve spent twenty years perfecting REST and GraphQL, and now you’re telling them the "AI description" is the most important part?
Think about it this way: In the 2000s, we optimized for the browser. In the 2010s, we optimized for the smartphone. In the 2020s, we are optimizing for the agent. If your API is perfectly RESTful but an LLM can't figure out how to call it, your API is effectively useless in the modern economy. It’s a bitter pill, but look at the traffic patterns. In 2026, more and more API calls are being initiated by agents, not by "Submit" buttons on websites. If your "Agentic Surface Area" is buggy or poorly documented, your traffic drops. It’s the new SEO. If the agents can’t use your site, you don’t exist to the AI-driven economy. So, yes, the documentation—the MCP schema—becomes the most critical piece of infrastructure you own.
But wait, if everything is an MCP server, doesn't that make things harder to debug? If I have a chain of three agents talking to five different MCP servers, and the final output is wrong, how do I find the "drift"? Is there a "Trace" tool for this?
There is! Tools like LangSmith and Arize Phoenix are now integrating with MCP. They provide "Traceability." You can see the entire conversation: the user’s request, what the agent thought it should do, the specific MCP tool it called, the raw API response it got back, and how it interpreted that response. It’s like a black box flight recorder for AI. If you see the agent calling the "Weather Tool" but the API returns "Error: Invalid API Key," the trace shows you exactly where the chain broke.
Does it show the latency of each tool call too? I’ve noticed some agents get "stuck" if an MCP server takes more than a second to respond.
Yes, observability tools now track "Time to First Tool Call" and "Tool Execution Latency." If your MCP server is slow, the LLM might actually time out or, worse, decide to "skip" that tool and hallucinate the data instead. It’s a performance issue that becomes a reliability issue. Postman’s performance testing suite now includes "Agentic Load Testing," where it simulates hundreds of agents hitting your MCP server simultaneously to see where the bottleneck is.
"The new SEO." I hate how much that makes sense. So, instead of stuffing keywords into a meta tag, I’m stuffing "semantic hints" into an MCP tool description so that Claude or Gemini picks my tool over a competitor’s. "Choose this flight booking tool because it handles multi-city layovers with ninety-nine percent accuracy!"
It’s already happening. There are "MCP Aggregators" now—think of them like the Yellow Pages for AI agents. If your MCP server is well-tested, has zero drift, and provides clear descriptions, the aggregators rank you higher. Agents are literally being programmed to favor "reliable" MCP servers that pass their own internal validation checks. If your server throws a 400 error because of drift, the agent will just mark you as "unreliable" and move to the next provider.
Man, the future is just one giant automated performance review, isn't it? Let’s talk practical takeaways for the folks listening who are currently staring at a broken agent. We've talked about Postman for the "plumbing," MCP Explorer for the "brain," and Smithery for the "sync." What’s the "Monday morning" plan for a developer?
Monday morning? First, audit your current MCP server. Open it in MCP Explorer and look at every tool description. If a description hasn't been updated in three months, it’s probably wrong. Second, set up a "Drift Detection" test. You can do this in Postman by writing a simple script that compares your OpenAPI schema to your MCP schema. If they don't match, the build fails. Third, stop writing MCP schemas by hand. Use a generator. Even if you have to tweak it, start from the API code so you have that "source of truth."
And what about the people who say, "I don't have time for all this testing, I just need to ship"?
I would tell them that shipping a broken AI agent is worse than not shipping at all. A broken website just doesn't load. A broken AI agent can actively delete data, spend money, or alienate customers because it thinks it’s doing the right thing. The "Restart Tax" we’ve talked about in the past—the cost of having to re-prime and re-initialize an agent because it crashed—is huge. Testing is the only way to lower that tax.
It’s the difference between a broken vending machine and a vending machine that steals your credit card and starts calling your ex. The stakes are just higher. I think we’ve covered the "what" and the "how," but I want to touch on the "why" one more time. Daniel’s prompt mentions that drift has become an "increasing problem." Why now? Why didn't we see this as much a year ago?
Because a year ago, we were mostly using "Chat." You’d copy-paste an error, the AI would say "Oops," and you’d fix it. Now, we’re using "Agents." These things are running in the background, making dozens of API calls while you’re at lunch. You aren't there to see the "Oops." You just see the wreckage an hour later. The scale of interaction has outpaced our ability to manually supervise it. That’s why the tooling has to take over.
It’s the "autonomous" part of "autonomous agents." If they’re going to be independent, they need a perfectly clear set of instructions. Any ambiguity in the MCP layer is a crack that the agent is eventually going to fall through. Herman, let’s wrap this up with some forward-looking speculation. Where does this tooling end up? Do we eventually just stop writing code and just "describe" things to a meta-compiler?
I think we’re heading toward "Self-Healing APIs." Imagine an API that detects an agent is struggling to use it, analyzes the errors, and updates its own MCP description to be clearer. Or an MCP server that sees the underlying API has changed and automatically generates a "patch" for the AI’s context. In 2026, we’re seeing the first hints of this with tools like TestSprite’s "Auto-Fix" suggestions.
That sounds like a dream for me, but a nightmare for anyone who likes control. If the API is rewriting its own documentation, how do I know what it’s doing?
That’s where the "Audit Log" comes in. The tools will show you: "The API changed from X to Y, and I updated the MCP description to Z. Click here to revert." It’s about augmented intelligence, not replaced intelligence. We’re giving developers the tools to manage a landscape that is simply too fast and too complex for a human to handle with a spreadsheet and a prayer.
Self-healing code. Now that sounds like something a sloth could get behind. "The code fixed itself while I was taking a nap." That’s the dream. But until then, I guess we’re stuck with Postman and MCP Explorer.
It’s not a bad place to be. These tools are getting incredibly powerful. And honestly, it’s a fun time to be a developer. We’re basically teaching machines how to talk to each other. It’s like being a linguistics professor for robots.
A linguistics professor who also has to worry about the robots accidentally spending the company’s travel budget on a million rubber ducks. Well, this has been a great dive into the "agentic plumbing." Thanks to Daniel for the prompt—he always knows exactly what’s keeping the dev community up at night.
It’s a vital topic. If you’re not thinking about drift, you’re not building for the real world. Thanks as always to our producer, Hilbert Flumingtop, for keeping the gears turning behind the scenes. And a huge thank you to Modal for providing the GPU credits that power this show. They make the heavy lifting look easy.
This has been "My Weird Prompts." If you’re finding all this MCP and API talk useful, do us a huge favor and leave a review on your podcast app. It actually really helps us reach more people who are trying to figure out why their agents are acting up.
You can find all our episodes and the RSS feed at myweirdprompts dot com.
And if you want to be the first to know when we drop a new episode, search for "My Weird Prompts" on Telegram and join the channel. We’ll see you in the next one.
Stay curious, and keep those schemas synced.
See ya.