#835: Red-Teaming Your UX: Using AI Agents as Model Users

Stop begging friends to break your app. Discover how AI agents are revolutionizing UI testing by acting as tireless, unbiased model users.

Episode Details
Duration: 30:51
Pipeline: V4
TTS Engine: LLM

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

The Developer’s Blind Spot

Every developer eventually hits a wall where they become too familiar with their own creation. When you know exactly which sequence of clicks leads to a result, you subconsciously avoid the buggy paths and confusing navigation menus that haunt new users. Historically, the only ways to break this "tunnel vision" were expensive focus groups and slow beta-testing phases. The rise of autonomous AI agents is now introducing a third option: agentic UI red-teaming.

From Rigid Scripts to Visual Intelligence

Traditional automated testing relied on tools like Selenium or Playwright, which required developers to write rigid scripts targeting specific element IDs. If a button moved three pixels or its ID changed, the test broke. Modern testing is shifting toward Large Action Models (LAMs) and Vision Language Models (VLMs). These models don't just read code; they "see" the pixels on the screen.

By using techniques like Set-of-Mark prompting—where an AI overlays numbers on every interactive element—agents can reason about a UI the same way a human does. They understand the semantic meaning of a "Submit" button or a "Search" icon without needing to look at the underlying HTML.
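As a concrete sketch of that numbering step, here is a minimal, hypothetical version in Python. The element dicts and prompt wording are illustrative only; a real pipeline would also draw the numbered marks onto the screenshot before sending it to the VLM.

```python
# Sketch of the Set-of-Mark numbering step: assign an index to each
# interactive element and build the text the model would see alongside
# the annotated screenshot. The element dicts are illustrative, not any
# real framework's API.

def set_of_mark_prompt(elements):
    """Number interactive elements and describe them for the model."""
    marked = []
    lines = []
    for i, el in enumerate(elements, start=1):
        marked.append({**el, "mark": i})
        lines.append(f"[{i}] <{el['role']}> \"{el['label']}\" at {el['bbox']}")
    prompt = (
        "The screenshot has numbered marks on interactive elements:\n"
        + "\n".join(lines)
        + "\nReply with the number of the element to act on."
    )
    return marked, prompt

elements = [
    {"role": "button", "label": "Submit", "bbox": (120, 400, 80, 32)},
    {"role": "textbox", "label": "Search", "bbox": (20, 10, 200, 28)},
]
marked, prompt = set_of_mark_prompt(elements)
```

The key property is that the model only ever reasons over the numbered visual elements, never over the underlying HTML.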

Simulating the Human Element

One of the most powerful applications of this technology is the ability to deploy specific user personas. A developer can program an agent to act as a "novice user" with a short attention span or a "power user" who relies entirely on keyboard shortcuts.

This allows for "adversarial" testing, where an agent’s sole goal is to find a sequence of actions that leads to a crash, a data leak, or an inconsistent state. Unlike human testers, these agents are tireless, running thousands of simulated sessions in parallel to find edge cases that would take weeks for a human to encounter.
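In practice a persona is usually just a system prompt handed to the agent before each session. A minimal sketch, with entirely hypothetical wording:

```python
# Hypothetical persona definitions for agent-based UX red-teaming.
# "Thousands of parallel sessions" then reduces to launching many
# sessions with different personas and seeds.

PERSONAS = {
    "novice": (
        "You are a first-time user with a short attention span. "
        "If the next step is not obvious within one glance, say you are lost."
    ),
    "power_user": (
        "You rely entirely on keyboard shortcuts and never touch the mouse."
    ),
    "adversarial": (
        "Your sole goal is to reach a crash, data leak, or inconsistent "
        "state. Prefer unusual action sequences over the happy path."
    ),
}

def build_session_prompt(persona: str, goal: str) -> str:
    """Combine a persona with the concrete task for one session."""
    return f"{PERSONAS[persona]}\nTask: {goal}"

prompt = build_session_prompt("adversarial", "Add an item to the inventory")
```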

Actionable Data Through Friction Logs

The output of these AI tests isn't just a "pass" or "fail." Modern frameworks generate "friction logs" that track the agent’s internal reasoning. If an agent has to scan a screen multiple times before finding a button, it logs a "high-latency cognitive interaction."

This data provides developers with a roadmap for improvement. For example, if an agent fails to distinguish between two identical inventory shelves because a unique ID is missing from the UI, it doesn't just report an error—it identifies the specific data point the user needs to see.
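A friction log can be as simple as structured records of the agent's reasoning at each step. The sketch below uses illustrative field names; no specific framework's schema is implied.

```python
# Sketch of a "friction log" record and a roll-up over one session.
from dataclasses import dataclass

@dataclass
class FrictionEvent:
    step: int
    kind: str       # e.g. "high_latency_scan", "semantic_mismatch"
    reasoning: str  # the agent's thought at that moment
    visual_scans: int = 1

def summarize(events):
    """Count how often each kind of friction occurred in a session."""
    counts = {}
    for e in events:
        counts[e.kind] = counts.get(e.kind, 0) + 1
    return counts

log = [
    FrictionEvent(3, "high_latency_scan",
                  "Scanned the screen 5 times before finding 'Save'.", 5),
    FrictionEvent(7, "missing_data",
                  "Two shelves labeled 'Shelf 3'; no unique ID visible."),
]
summary = summarize(log)
```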

The Future of Automated Optimization

As these models become smaller and more specialized, they are being integrated directly into the development workflow. We are moving toward a reality where every pull request is automatically vetted by a "chaos monkey" for UX. These agents can even perform accessibility audits, ensuring apps work perfectly with screen readers without manual intervention.

In the near future, the loop will close entirely. Some experimental tools already allow agents to not only identify UI friction but also propose the specific CSS or layout changes needed to fix it. For independent developers and large teams alike, AI agents are turning user experience from a subjective guessing game into a rigorous, automated science.


Episode #835: Red-Teaming Your UX: Using AI Agents as Model Users

Daniel's Prompt
Daniel
When developing an application, it can be difficult to identify UI/UX friction points and edge cases on your own, and organizing user focus groups is often a slow and challenging process. Is there technology available that uses an AI agent as a "model user" to mimic behavior and proactively identify gaps or friction in the UI/UX—essentially "red-teaming" the user interface? I’d like to discuss the possibility of using AI to find these issues and refine the user experience before a product’s initial release.
Corn
Hey Herman, I was just looking at some of the recent updates Daniel pushed to his inventory project—you know, Where Is My Stuff. It is really coming along, but he sent over a prompt today that hits on a struggle every single developer has faced at some point. It is that moment where you have built something, you think it is intuitive, but then you realize you are too close to the project to actually see the flaws.
Herman
Herman Poppleberry here. And yeah, I saw that. Daniel is essentially asking if we can move past the era of begging our friends to break our apps or paying strangers on the internet to record themselves getting frustrated with a navigation menu. He is talking about using A-I agents as model users to proactively red-team a user interface. It is a fascinating pivot from the traditional generative A-I we usually talk about. Instead of the A-I building the app, the A-I is the one trying to use it, and more importantly, trying to find where it fails. It is like hiring a digital private investigator to find all the skeletons in your U-I closet.
Corn
It is a great question because as Daniel mentioned, when you are the one writing the code, you develop this kind of tunnel vision. You know exactly where the buttons are, you know the exact sequence of clicks to get to a result, and you subconsciously avoid the paths that might be buggy or confusing. You are the worst person to test your own U-I because you have the map of the maze memorized. And Daniel’s example of the missing storage I-D is perfect. It is one of those things that is obvious once you realize it—you need a unique identifier to distinguish between two identical-looking shelves—but until then, it is a total blind spot because you, the developer, just know which one is which in the database.
Herman
And the traditional solution, which Daniel also pointed out, is focus groups or beta testers. But those are slow, they are expensive, and let’s be honest, they are often unreliable. People might not be able to articulate why they are frustrated, or they might just give up without telling you where they got stuck. Or worse, they are too polite to tell you your app is a mess. What Daniel is proposing is essentially an automated, tireless, and highly analytical user that can run a thousand sessions in the time it takes a human to finish one. We are talking about the democratization of high-end Q-A.
Corn
So, the big question is, where are we with this in February of twenty twenty-six? Is this still a science fiction concept, or are there actual tools and frameworks that a developer like Daniel can use right now to red-team his U-X? Because if I am Daniel, I want to know if I can run a script tonight that tells me why my inventory system is going to confuse a warehouse manager next month.
Herman
We are actually in the middle of a massive shift here. For a long time, automated U-I testing was very rigid. You had tools like Selenium or Playwright where you had to explicitly tell the script to click a specific element with a specific I-D. If you changed the I-D or moved the button three pixels to the left, the test broke. It was not intelligent; it was just a recording. But now, in twenty twenty-six, we are seeing the rise of what people are calling Large Action Models or L-A-Ms and Vision Language Models, or V-L-Ms, that can actually see and interact with a U-I the way a human does. They don't care about the underlying code as much as they care about the visual affordances.
Corn
Right, so instead of looking for a specific line of code like button-id-seven-four-nine, the A-I is literally looking at the pixels on the screen. It sees a button that says Submit and it understands the semantic meaning of that button. It knows that clicking Submit usually sends data.
Herman
And this is where it gets really interesting for what Daniel is asking. There is a framework called App-Agent that came out of research a little while back, and by now, it has evolved significantly. The way it works is that you give the agent a high-level goal, like, I want to add a new item to my inventory and assign it to a specific shelf. The agent then explores the U-I. It takes a screenshot, it looks at the visual layout, it identifies the interactive elements using something called Set-of-Mark prompting—where it overlays numbers on every clickable item—and then it decides what to do next based on its reasoning engine.
Corn
And because it is an A-I, it does not have the developer's bias. It might try to click a label that is not actually a button, or it might try to navigate to a page through a path you never intended. It is essentially poking at the bruises of your design.
Herman
That is the red-teaming aspect of it. These agents can be programmed with different user personas. This is a huge development we have seen over the last year. You could have a power user agent that tries to do everything as fast as possible using keyboard shortcuts, or you could have a novice user agent that is easily confused, has a short attention span, and clicks on everything that looks shiny. You can even have an adversarial agent whose entire goal is to find a sequence of actions that leads to a crash, an inconsistent state, or a security vulnerability like an unintended data leak through the U-I.
Corn
I love that idea of user personas. It reminds me of what we discussed in episode eight hundred and ten about the agentic interview. If an A-I can learn to know you, it can certainly learn to act like a frustrated user who cannot find the search bar. But how does the agent actually report back the friction? Does it just say, hey, I got stuck, or is it more granular? Because a developer needs actionable data, not just a complaint.
Herman
It is much more granular now. Modern frameworks for this kind of testing generate what they call friction logs. The agent tracks its own internal state and its reasoning process. If it has to look at a screen for three seconds and perform five different visual scans before it finds the next logical step, that is logged as a high-latency interaction from a cognitive perspective. It literally says, I am looking for the Save button but I am seeing too many competing visual elements. If it clicks something and the result is not what it expected based on the button label, it flags a semantic mismatch.
Corn
So, in Daniel’s case, if he ran an agent through his Where Is My Stuff app, the agent might be looking for a way to uniquely identify a shelf. If it sees two shelves labeled Shelf Three but has no way to distinguish them in the U-I, the agent’s reasoning engine would hit a wall. It would log that it cannot fulfill the goal of unique identification because the necessary data point, the storage I-D, is missing from the interface. It might even suggest, hey, you should probably put the database I-D or a QR code serial number right here next to the name.
Herman
Precisely. And what is even cooler is that these agents are now being integrated directly into the development workflow. Instead of waiting for a big release, you can have these agents running on every pull request. They are like a chaos monkey for your U-X. That is my first analogy of the day, Corn. Just as a chaos monkey randomly takes down servers to test resilience, these agents randomly try to break your user flow to test the robustness of your design. They might try to double-click a submit button to see if it creates duplicate entries, or they might try to navigate away while a file is uploading.
Corn
I will allow it, that is a good one. And it really highlights the difference between this and traditional unit testing. A unit test checks if a function works—does two plus two equal four? This agentic red-teaming checks if the product works for a human being—can a person actually find the calculator? But Herman, what about the technical overhead? Is this something an independent developer like Daniel can actually run on his laptop, or do you need a massive cluster of G-P-Us to simulate a single user?
Herman
That was the bottleneck until recently. Running a full Vision Language Model like GPT-four-o or Claude three point five for every single step of a U-I interaction was slow and expensive. But we have seen a lot of optimization in twenty twenty-five and early twenty twenty-six. There are now smaller, specialized models—sometimes called U-I-specific encoders—that are trained specifically on U-I elements. They do not need to know how to write poetry or solve physics problems; they just need to know what a hamburger menu looks like and how a dropdown menu functions. These models can run locally or on very cheap A-P-I tiers.
Corn
So, the latency has come down to the point where it feels like a real-time user?
Herman
Significantly. We are seeing these agents make decisions in under five hundred milliseconds. When you combine that with headless browsers like Playwright, you can run hundreds of simulated user sessions in parallel. There is a project called Selena that is doing exactly this. It uses a multi-agent system where one agent is the user and another agent is the observer. The observer agent watches the user agent’s session and identifies where the user agent is struggling or where the U-I is providing conflicting signals. It is like a meta-analysis of the frustration.
Corn
That is a clever setup. It is like having a researcher watching a participant in a focus group, but both are A-I, and they can work at three in the morning without needing coffee. This also seems like it would be incredibly useful for accessibility testing. A lot of developers, even the good ones, struggle with making sure their apps work well with screen readers or have proper color contrast. Could an A-I agent identify those issues as well?
Herman
In fact, that is one of the areas where this technology is most mature. An agent can be configured to interact with the U-I purely through the accessibility tree, which is the data structure that screen readers use. If the agent cannot navigate the app using only that data, it means a human using a screen reader will also be stuck. It can find missing aria-labels, buttons that are not keyboard-navigable, and images without alt text, all without a human ever having to manually check every single page. It turns accessibility from a chore into an automated report.
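The audit Herman describes can be sketched as a walk over an accessibility-tree snapshot. The nested-dict shape below mirrors the snapshot structure tools like Playwright expose; the rules checked are a deliberately tiny subset of a real audit.

```python
# Walk an accessibility-tree snapshot (nested dicts with "role", "name",
# "children") and flag elements a screen-reader user could not identify.
# The roles checked here are a minimal illustrative set.

def audit_a11y(node, path="root", issues=None):
    """Collect elements that lack an accessible name."""
    if issues is None:
        issues = []
    role = node.get("role", "")
    name = node.get("name", "")
    if role in ("button", "link", "img") and not name:
        issues.append(f"{path}: <{role}> has no accessible name")
    for i, child in enumerate(node.get("children", [])):
        audit_a11y(child, f"{path}/{i}", issues)
    return issues

snapshot = {
    "role": "WebArea", "name": "Inventory",
    "children": [
        {"role": "button", "name": "Save"},
        {"role": "img", "name": ""},   # image with no alt text
    ],
}
issues = audit_a11y(snapshot)
```

If the agent cannot complete its goal using only this tree, a screen-reader user will be stuck in the same place.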
Corn
It seems like this would also solve the problem of edge cases. When you are testing, you usually test the happy path. You assume the user will enter a valid name, click save, and move on. But an A-I agent can be told to be a bit chaotic. It can try to enter ten thousand characters into a text field, or click the back button in the middle of a database transaction, or try to upload a P-D-F where an image is expected.
Herman
And it does it systematically. That is the key. A human might try a few weird things, but an agent can exhaustively test every combination of inputs. There is a concept called Agentic Behavior Optimization that we talked about back in episode seven hundred and fifty three, and this is a direct application of that. You are optimizing the U-I by observing how an autonomous agent navigates it. It is essentially a feedback loop where the A-I finds the friction, and in some advanced setups, the A-I even proposes the code fix to remove that friction.
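The "systematically" part is easy to picture as a cross product of input fields and hostile values. This is a sketch of the idea, not any particular tool's API.

```python
# Sketch of systematic (rather than random) edge-case input generation:
# pair every field with every hostile value.
from itertools import product

HOSTILE_VALUES = [
    "",                        # empty input
    "x" * 10_000,              # oversized input
    "Robert'); DROP TABLE--",  # injection-shaped text
    "📦" * 50,                 # non-ASCII payload
]

def edge_case_plan(fields):
    """Every (field, hostile value) combination, as test steps."""
    return [
        {"field": f, "value": v}
        for f, v in product(fields, HOSTILE_VALUES)
    ]

plan = edge_case_plan(["item_name", "shelf_id"])
# 2 fields × 4 values = 8 steps
```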
Corn
Wait, so the A-I could actually suggest the layout change? Like, move this button to the top right because the agent kept missing it?
Herman
Yes, we are seeing that in some of the more experimental dev tools this year. The agent says, I failed to find the checkout button three times because it was below the fold on a mobile screen. Recommendation: Move the checkout button to a sticky footer. It can even generate the C-S-S for you.
Corn
I am curious about the feedback loop for the developer. If Daniel uses a tool like this on his inventory system, does he just get a big P-D-F report that he has to slog through, or is it more interactive?
Herman
The best tools now are giving you a visual replay of the agent's session, overlaid with its thought process. You can actually see the screenshot the agent took, see the bounding boxes it drew around elements, and read a little thought bubble that says, I wanted to find the storage I-D, but I only see a generic label, so I am clicking the edit button to see if it is hidden there. It is incredibly illuminating because you realize, oh, my navigation is not as intuitive as I thought it was. It is like watching a video of a user test, but you can see inside the user's brain.
Corn
It is almost like a heat map of confusion. Instead of a heat map of where people click, it is a heat map of where the A-I had to pause and think.
Herman
That is a great way to put it. It shows you the hotspots where the agent spent the most time processing or where it had to backtrack. If the agent takes five steps to do something that should take two, you have a friction point. And as Daniel mentioned, this is all happening before the product’s initial release. You are catching these things in the lab, not in production when a customer is already annoyed.
Corn
You know, this reminds me of the shift from chat-based A-I to action-based A-I that we explored in episode seven hundred and ninety five. We are moving from the A-I just telling us things to the A-I actually doing things. And in this case, the doing is acting as a proxy for a human user. But let’s play devil’s advocate for a second. Is there a risk that the A-I is too good? Like, it can figure out a confusing U-I because it is an A-I with a massive neural network, whereas a human would just be totally lost?
Herman
That is a valid concern, and it is why the persona tuning is so important. If you use a state-of-the-art model like Claude three point five Sonnet or G-P-T-four-o, those models are incredibly smart. They can often infer what a broken button is supposed to do because they have seen a million other websites. To get a realistic red-teaming experience, you actually have to tell the model to act with a lower level of technical proficiency. You have to explicitly instruct it not to use its advanced reasoning to bypass U-I flaws. You tell it, if it is not clearly labeled, you don't know what it does.
Corn
That is an interesting challenge. You are essentially asking a genius to act like they are confused. How do you even prompt for that?
Herman
You inject artificial constraints. You might tell the agent, you only have five seconds to look at this screen before you have to make a choice, which simulates the fast-paced way humans actually browse. Or, you are not allowed to look at the underlying H-T-M-L code; you can only look at the screenshot. This forces the agent to rely on the same visual cues that a human would. If the visual cue is missing, the agent fails, even if the answer is hidden in the code.
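Those injected constraints amount to appending rules to the persona's system prompt. A hypothetical sketch:

```python
# Hypothetical "handicap" constraints layered onto a persona prompt, per
# the discussion: a time budget per screen, screenshot-only perception,
# and no inference about unlabeled elements.

CONSTRAINTS = [
    "You have 5 seconds per screen; pick an action before time runs out.",
    "You may only use the screenshot; you cannot read the underlying HTML.",
    "If an element is not clearly labeled, you do not know what it does.",
]

def constrained_prompt(persona_prompt: str) -> str:
    """Append the handicap rules to a persona's system prompt."""
    return persona_prompt + "\nConstraints:\n- " + "\n- ".join(CONSTRAINTS)

p = constrained_prompt("You are a novice user.")
```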
Corn
I can see how that would be a powerful way to find those missing storage I-Ds. If the agent cannot find it on the screen, it cannot use it, and that failure becomes a data point for Daniel to fix the U-I. This whole concept of red-teaming the U-X is such a proactive way to think about development. It is moving from reactive bug fixing to proactive experience design. It is about building empathy for the user into the code itself.
Herman
And it is not just for small apps. We are seeing major enterprise companies using this to test complex workflows that have hundreds of steps. Imagine a massive E-R-P system—Enterprise Resource Planning—like the ones Daniel mentioned, the ones that are usually overkill for small businesses. Those systems are notorious for having terrible U-X because they are so complex. Using A-I agents to find the friction in those massive systems is going to be a game-changer for corporate productivity. It might finally make enterprise software not suck.
Corn
It is funny you mention those E-R-P systems. Daniel’s project, Where Is My Stuff, is specifically designed to be the opposite of those bloated tools. It is supposed to be approachable and simple. So for him, the U-X is the entire value proposition. If the U-X is bad, the project fails because people will just go back to using a spreadsheet or a piece of paper. The stakes are actually higher for him in a way.
Herman
Which is why this technology is so relevant for him. If he can use an agent to ensure that every common task in his app—like checking in a new box of parts—is frictionless, he has a huge competitive advantage. He can iterate faster because he does not have to wait for human feedback. He can test a new U-I layout on a Friday afternoon, have an agent run ten thousand simulations over the weekend with different personas, and have a list of friction points waiting for him on Monday morning. It is like having a Q-A team of a thousand people for the price of a few A-P-I calls.
Corn
That sounds like a dream for any developer. But let’s talk about the practical side of getting started. If Daniel, or any of our listeners, wants to try this today, what are the actual steps? Do they just point an A-I at a U-R-L and say, go?
Herman
It is becoming that simple. There are platforms now like Skyvern or MultiOn that are built for web automation and can be adapted for this. But for a more customized approach, especially for an open-source project like Daniel’s, I would look at integrating something like Playwright with an L-L-M. There are already several open-source wrappers on GitHub that do this. You write a script that says, go to this page and achieve this goal, and the wrapper handles the communication between the browser and the A-I. You just need to provide an A-P-I key.
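No specific wrapper is assumed here, but the control loop such wrappers run looks roughly like the following. The browser and the model are injected as callables, so the loop stays generic; in practice `observe` would take a Playwright screenshot and `decide` would call an LLM API with your key.

```python
# Skeleton of the observe -> decide -> act loop a Playwright+LLM wrapper
# runs. All three dependencies are injected callables, which also makes
# the loop testable with stubs.

def run_agent(goal, observe, decide, act, max_steps=20):
    """Drive the UI toward `goal`; return the trace of actions taken."""
    trace = []
    for _ in range(max_steps):
        state = observe()                    # e.g. screenshot + marks
        action = decide(goal, state, trace)  # e.g. LLM picks a marked element
        if action == "done":
            break
        act(action)                          # e.g. click the element
        trace.append(action)
    return trace

# Stub run: a fake model that clicks twice and then finishes.
script = iter(["click:1", "click:3", "done"])
trace = run_agent(
    "add an item",
    observe=lambda: "fake screenshot",
    decide=lambda g, s, t: next(script),
    act=lambda a: None,
)
# trace == ["click:1", "click:3"]
```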
Corn
And what about the cost? We have talked about how the latency is down, but what is the price tag for running these agents? If Daniel is running ten thousand simulations, is he going to wake up to a five thousand dollar bill?
Herman
Not anymore. If you are using the top-tier A-P-Is for every single step, it can add up. But for a single developer red-teaming their own app, you don't need ten thousand runs. You might need fifty. You might spend ten or twenty dollars to run a very thorough battery of tests. Compared to the cost of a single hour of a professional U-I consultant’s time, it is an absolute bargain. And as I mentioned, there are smaller, open-source models like the Llama-three-vision variants that you can run locally if you have a decent G-P-U, which would bring the cost down to just the electricity you are using.
Corn
That is a very accessible entry point. It seems like we are entering an era where the barrier to high-quality U-X is being lowered for everyone. You do not need a massive Q-A team or a huge budget anymore to find the glaring issues. You just need the right prompt and a bit of compute.
Herman
That is exactly what is happening. And it is going to lead to a much higher standard for software across the board. We are going to become much less tolerant of bad U-I because there is no longer an excuse for it. If an A-I could have found that friction point in five minutes, why didn't the developer fix it before release? It is going to change the expectations of users.
Corn
It is a fair question. It is going to put more pressure on developers to actually listen to the feedback, even if it is coming from an A-I. But I think most developers, like Daniel, will welcome it. They want their products to be good. The frustration usually comes from not knowing what is wrong, not from a lack of desire to fix it. This gives them a clear, objective mirror to look into.
Herman
Well, and that is where the agentic approach is so helpful. It does not just tell you something is wrong; it shows you exactly where the breakdown happened. It provides the context. In Daniel’s case, it would show the agent looking at two identical shelf labels and being unable to proceed. That is a very clear instruction on what needs to be changed. You need to add a unique identifier to that view. It takes the guesswork out of the refinement process.
Corn
It is like having a pair of fresh eyes on the project at all times. And those eyes never get tired, they never get bored, and they never feel bad about telling you your design is confusing. They are the ultimate objective critics.
Herman
They are the ultimate honest critics. And because they are programmable, you can make them as harsh or as forgiving as you want. You could even have an agent that specifically looks for dark patterns—those manipulative U-I elements that try to trick users into signing up for things or making it hard to cancel a subscription. Red-teaming for ethics is another huge potential application here. You could run an agent to ensure your app isn't being accidentally deceptive.
Corn
That is a fascinating angle. Imagine an agent that flags when a cancel button is too hard to find or when subscription terms and conditions are intentionally obscured. That could really help in making the web a more user-friendly and honest place. It is like an automated consumer protection agent.
Herman
It really could be. We are just scratching the surface of what is possible when we treat the U-I as something that can be analyzed and tested by an intelligent agent. It is a fundamental shift in the relationship between the developer, the interface, and the user. The A-I becomes the bridge that helps the developer understand the user's experience before the user even arrives. It is predictive empathy.
Corn
You know, we have talked a lot about the benefits, but I want to touch on one more potential pitfall. Could this lead to a kind of homogenization of U-I? If every developer is using the same A-I agents to test their apps, will all apps start to look and feel the same because they are all being optimized for the same A-I-driven metrics? Like, will everything just become a series of big, blue buttons because that is what the A-I likes?
Herman
That is a danger with any kind of optimization. If you only optimize for a specific metric, you lose the soul of the design. But I think the key is to remember that the agent is a tool for finding friction, not a tool for creating the design. The developer still has to make the creative choices. The agent just tells you if those choices are making it harder for someone to use the app. It is about removing the obstacles, not necessarily dictating the path.
Corn
So it is like a groundskeeper on a golf course.
Herman
My second analogy for the episode: It is like a groundskeeper on a golf course. The A-I agent finds the weeds and the rough patches so the players can have a smooth game, but the A-I is not the one designing the layout of the holes. The creativity and the challenge still come from the designer. The A-I just makes sure the grass is cut and the sand traps are where they are supposed to be.
Corn
Alright, I will take that one too. It is a good way to frame the collaboration between the human designer and the A-I tester. And I think that is really the takeaway for Daniel. This technology is not here to replace his intuition or his vision for Where Is My Stuff. It is here to help him realize that vision more effectively by catching the small, annoying things that get in the way of a great user experience. It lets him focus on the big ideas while the agent handles the minutiae of click-paths and label clarity.
Herman
And it is available right now. This is not something he has to wait five years for. Between frameworks like App-Agent, Selena, and the various Playwright-L-L-M integrations, he could probably have a basic red-teaming setup running this weekend. He could start by just having an agent try to perform the most basic task in his app and seeing where it gets confused.
Corn
That is a great call to action. And it is a perfect example of why we love Daniel’s prompts. They are always grounded in real-world development challenges but push us to look at how the latest A-I tech can solve them. It is that intersection of practical coding and cutting-edge research that makes this so interesting. It is not just about the tech; it is about how the tech makes us better builders.
Herman
It really is. And it connects back to so many of the themes we have been exploring lately. Whether it is how A-I learns from our feedback, which we covered in episode seven hundred and ninety eight, or the general path of A-I agents in episode seven hundred and ninety one, it all points toward a more agentic, proactive future. We are moving away from passive tools and toward active partners in the creative process.
Corn
Well, I think we have given Daniel plenty to chew on. I am really curious to see if he implements any of this in his project. If he does, I hope he sends us an update on how it went. Seeing those friction logs in action or seeing a video of an agent getting stuck on a shelf label would be a great follow-up for the show.
Herman
Definitely. And for anyone else out there who is building an app and feeling that same developer blindness, give this a shot. It might be the most valuable twenty dollars you spend on your project this month. It is certainly cheaper than a bad launch.
Corn
And hey, if you are enjoying these deep dives into the weird and wonderful world of A-I and development, we would really appreciate it if you could leave us a review on your favorite podcast app. It genuinely helps the show reach more people who are interested in these topics, and it keeps us motivated to keep digging into these prompts.
Herman
Yeah, it really does make a difference. And remember, you can find all our past episodes, including the ones we mentioned today about agentic interviews and behavior optimization, at myweirdprompts dot com. We have a search feature there that makes it easy to find specific topics or episodes.
Corn
You can also find us on Spotify, Apple Podcasts, and pretty much everywhere else you listen to podcasts. If you want to get in touch with us, you can use the contact form on the website or email us directly at show at myweirdprompts dot com. We love hearing from you, whether it is a question about an episode, a correction, or a prompt of your own that you want us to tackle.
Herman
This has been a great discussion. I am feeling inspired to go red-team some of my own projects now. I think I have a few U-Is that could use a harsh A-I critique.
Corn
Me too. Let’s see what kind of friction we can find in our own workflows. Thanks for listening, everyone. We will catch you in the next episode.
Herman
Until next time, this has been My Weird Prompts. Goodbye!

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.