#2482: When AI Chatbots Leak Your PDFs via Public S3 Buckets

A user uploaded a sensitive PDF to an AI chatbot. The chatbot stored it in a public S3 bucket with zero authentication.

Episode Details
Episode ID
MWP-2640
Published
Duration
26:11
Audio
Direct link
Pipeline
V5
TTS Engine
chatterbox-regular
Script Writing Agent
deepseek-v4-pro

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

When a user uploaded a sensitive PDF to a major AI chatbot, they expected privacy. Instead, the chatbot returned a link to the document stored in a publicly accessible S3 bucket with zero authentication. The vendor's defense: the URL was long, random, and automatically expired — nobody would guess it. The user pushed back; the vendor eventually added authentication, but no bug bounty was paid.

This case raises a fundamental question: Is security by obscurity ever legitimate?

The Bug Bounty Consensus

Bug bounty programs at HackerOne, Bugcrowd, and Intigriti have a remarkably consistent position. Pure long-URL findings without evidence of weak entropy or policy misconfiguration are rated P5 at best or marked N/A. The reasoning: unguessable IDs are obscurity, not security, and researchers must demonstrate actual enumeration exploits. This creates a perverse incentive — companies can deploy insecure-by-design systems behind unguessable URLs and face zero financial consequences when researchers find the exposure.

AWS's Explicit Warning

Amazon itself warns against this practice. The AWS Security Blog stated clearly in 2019: do not rely on object key names for security — use bucket policies. The platform provider says in writing that this approach is wrong. Yet many vendors continue the practice.
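What AWS recommends instead takes only a few lines. A sketch using boto3's S3 Block Public Access API (the bucket name is a placeholder, and the import is deferred so the snippet loads without AWS credentials configured):

```python
# Sketch: enforce bucket-level protection instead of relying on secret keys.
# Bucket name is a placeholder; boto3 and valid AWS credentials are required
# only when the function is actually called.
BLOCK_ALL_PUBLIC = {
    "BlockPublicAcls": True,       # reject new public ACLs
    "IgnorePublicAcls": True,      # neutralize existing public ACLs
    "BlockPublicPolicy": True,     # reject public bucket policies
    "RestrictPublicBuckets": True, # restrict access even if a policy slips through
}

def lock_down_bucket(bucket_name):
    """Apply S3 Block Public Access so no object is reachable anonymously."""
    import boto3  # deferred: only needed when talking to AWS
    s3 = boto3.client("s3")
    return s3.put_public_access_block(
        Bucket=bucket_name,
        PublicAccessBlockConfiguration=BLOCK_ALL_PUBLIC,
    )
```

With this in place, even a leaked object URL returns Access Denied; per-user access then goes through authenticated, expiring pre-signed URLs instead of permanent public links.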

Three Failure Modes

URLs leak through browser histories, server logs, proxy logs, and referrer headers. A 2023 Shodan scan found that about 10% of public S3 buckets use unguessable paths, but 80% of those had policy issues that made the obscurity irrelevant. The temporal problem compounds this — even with automatic expiry, documents remain accessible for days or weeks, and cleanup processes fail regularly.

Security With vs. Security By Obscurity

The distinction matters. Port knocking on a server — hitting a specific sequence before SSH responds — is security with obscurity. The real protection is the SSH key; the obscurity reduces attack surface. A non-standard SSH port cuts brute-force attempts by 90% while authentication does the real work. In the case above, the URL was the entire security model. No authentication layer existed behind the obscurity.

The Quantum Threat

Grover's algorithm provides quadratic speedup for brute-force searches, effectively halving key length. A 128-bit random URL offers only 64 bits of security against a quantum adversary — within reach of sufficiently resourced attackers. The harvest-now-decrypt-later attack compounds this: adversaries can capture obscured data today and decrypt it when quantum computers become available. For trade secrets, medical records, and legal strategies uploaded to AI chatbots, that data remains sensitive for years.

The AI Trust Gap

People upload legal documents, medical records, and financial information to AI chatbots with an implicit trust model. They think they're having a conversation, not storing files in a public bucket. When the chatbot returns a raw S3 URL, it violates that trust. The vendor eventually added authentication — proving they knew the original design was inadequate — but only because the user made enough noise.

The takeaway: security by obscurity is never a standalone defense. It can add friction in a layered approach, but when it's your only protection, you're gambling that attackers will be unlucky. Quantum computing makes that bet increasingly foolish.


#2482: When AI Chatbots Leak Your PDFs via Public S3 Buckets

Corn
Daniel sent us this one — and it's a good one. He's got a user who uploaded a PDF to a major AI chatbot, got back a link to the document, and then discovered that link was sitting in a publicly accessible S3 bucket with zero authentication. Anyone with the URL could access it. The vendor's response was basically, don't worry, the URL is long and random and it expires automatically — nobody's going to guess it. The user pushed back, the vendor eventually added proper authentication, but no bug bounty was paid. So Daniel's asking: Is security by obscurity ever legitimate? And how does quantum computing change the calculus on this?
Herman
Before we dive in, quick note — today's script is coming to us from DeepSeek V four Pro.
Corn
Keeping us on our toes.
Herman
Alright, so this scenario Daniel's describing — it's basically a textbook case of security by obscurity, and the vendor's argument is exactly the one that bug bounty programs have been rejecting for years. I've been reading through HackerOne and Bugcrowd triage guidelines, and the consensus is remarkably consistent. Pure long-URL findings without evidence of weak entropy or a policy misconfiguration — they get rated as P5 at best, or just marked N/A.
Corn
Which is interesting because that means the vendor's defense — long, unguessable URLs with automatic expiry — that's the exact reasoning the bug bounty world says doesn't count as a real vulnerability. So you've got this weird situation where the security industry says one thing and the platforms that are supposed to enforce security standards say another.
Herman
Right, and there's a specific HackerOne report — number one zero four eight five seven six — where a researcher found an S3 bucket with long random URLs and the triager rejected it as informative. The reasoning was that unguessable IDs are obscurity, not security, and you'd need to demonstrate an actual enumeration exploit for it to qualify. Bugcrowd's knowledge base says the same thing. Intigriti published a blog post on this back in twenty twenty-two — valid bugs require policy misconfigurations or demonstrable enumeration.
Corn
The incentives are completely backwards. A company can leave sensitive documents exposed behind what they call an unguessable URL, a researcher finds it, reports it, gets told it's not a real bug, and the company faces zero financial penalty. The only reason this particular vendor fixed it was because the user made enough noise. That's not a security program — that's a PR management program.
Herman
What makes this worse is that Amazon itself explicitly warns against exactly this practice. The AWS Security Blog stated clearly back in twenty nineteen — do not rely on object key names for security, use bucket policies. This isn't some edge case interpretation. The platform provider is telling you, in writing, that what you're doing is wrong.
Corn
Let's pull on that thread a bit. When AWS says don't rely on object key names for security, what's the actual failure mode they're worried about? Is it just the theoretical risk of someone guessing the URL, or is there something more concrete?
Herman
It's multiple things. First, URLs leak. They show up in browser histories, in server logs, in proxy logs, in referrer headers. If someone shares that link — even privately — it's now in their email, their chat history, their clipboard. The URL might be unguessable in a vacuum, but it doesn't stay in a vacuum. Second, S3 bucket policies can be misconfigured in ways that make enumeration possible even without guessing individual keys. A twenty twenty-three Shodan scan found that about ten percent of public S3 buckets use unguessable paths, but eighty percent of those had policy issues that made the obscurity irrelevant.
Corn
So the obscurity layer was already broken by other misconfigurations in four out of five cases. That's not a theoretical argument anymore — that's empirical.
Herman
And the third failure mode is the one that doesn't get talked about enough — temporal. The vendor in Daniel's case mentioned automatic expiry as part of their defense. But how long is that expiry window? Because if a document sits there for a week before expiring, that's a week where anyone who stumbles across that URL — whether through log leakage, accidental sharing, or a compromised browser extension — has full access.
Corn
There's a deeper problem with the expiry argument too. The vendor is essentially saying, we've made the attack window finite, therefore the risk is acceptable. But they're not telling the user what that window is, and the user has no way to verify that the document actually gets deleted. You're taking the vendor's word that their cleanup process works.
Herman
We've seen enough cloud misconfiguration incidents to know that cleanup processes fail all the time. Backups get made, logs get retained, caches don't get invalidated. The document might expire from the primary bucket and still exist in three other places.
Corn
Alright, so let's get into the core question Daniel's asking. Is security by obscurity ever legitimate? Because there's a distinction that some practitioners make that I think is worth examining.
Herman
This is the security with obscurity versus security by obscurity distinction. I saw a good piece on this from Venture in Security earlier this year. The argument is that obscurity can be a legitimate layer in a defense-in-depth strategy — like having a hidden door behind a locked one. The obscurity isn't your only protection, but it adds friction. The problem is when obscurity is the entire security model.
Corn
Port knocking on a server — you have to hit a specific sequence of ports before SSH even responds. That's security with obscurity. The real protection is still the SSH key, but the obscurity reduces your attack surface by making the service invisible to casual scans.
Herman
Or a non-standard SSH port. Moving from port twenty-two to port two two two two doesn't make your server secure, but it cuts down on automated brute-force attempts by something like ninety percent. The authentication is still doing the real work. The obscurity is just noise reduction.
Corn
Here's where the vendor's position falls apart — they had no authentication layer at all. The URL was the entire security model. There was no locked door behind the hidden door. There was just the hope that nobody would find the door.
Herman
Bruce Schneier has been making this point for decades. He calls security by obscurity fundamentally flawed. His argument — and he wrote about this just this month, actually, April twenty twenty-six — is that attacks always get better. What's unguessable today might be guessable tomorrow. Transparency forces you to build security that actually works rather than security that relies on attackers being stupid or unlucky.
Corn
There's a principle here that goes back to the eighteen hundreds — Kerckhoffs's principle. A cryptosystem should be secure even if everything about it except the key is public knowledge. The modern version is, your security should work even if the attacker knows exactly how your system is built.
Herman
NIST guidance reflects this. They explicitly recommend against relying on obscurity as a security control. It's not just Schneier being cranky — it's baked into federal standards.
Corn
Let's talk about the AI industry's specific exposure here, because I think this is where the story gets really interesting. Daniel's case involved an AI chatbot. People upload incredibly sensitive stuff to these platforms — legal documents, medical records, trade secrets, financial information. They're not uploading cat photos. They're uploading documents they wouldn't even email without encryption.
Herman
The trust model is completely different from something like Dropbox or Google Drive. With a file-sharing service, the user understands they're storing a file and sharing a link. With an AI chatbot, the user thinks they're having a conversation. The document upload feels like handing a piece of paper to someone across a desk. When the chatbot hands back a raw S3 URL, that's a violation of the implicit trust model. The user didn't consent to their document being stored in a publicly accessible bucket.
Corn
The fact that the vendor eventually added authentication tells you they knew the original design was inadequate. They didn't add authentication because they suddenly discovered a new security principle. They added it because someone made enough noise that the reputational cost of not fixing it exceeded the engineering cost of fixing it.
Herman
No bug bounty was paid. This is the part that really gets me. The user found a genuine security issue — a design flaw that exposed user documents to anyone with the URL. The vendor acknowledged the issue by fixing it. But because it didn't fit the narrow definition of a valid bug bounty finding, the researcher got nothing.
Corn
Should bug bounty programs update their scope to cover this class of finding? I think there's a strong argument that they should. The current approach creates a perverse incentive — companies can deploy insecure-by-design systems, hide behind unguessable URLs, and face no financial consequences when researchers find the exposure. The bug bounty program becomes a shield rather than a sieve.
Herman
The counterargument from the platforms is that pure URL-guessing isn't a real attack vector without evidence of weak entropy. If the URL is a hundred and twenty-eight bits of random data, brute-forcing it is computationally infeasible with classical computing. The triagers aren't being unreasonable — they're applying a consistent standard about what constitutes a practical exploit.
Corn
That's where quantum computing enters the conversation, and it changes the entire calculus. Let's get into that.
Herman
This is the part that makes the vendor's position not just bad practice but temporally naive. Grover's algorithm provides a quadratic speedup for brute-force searches. What that means in practical terms is that it effectively halves the key length of symmetric ciphers. A hundred and twenty-eight bit random URL — which is what you'd get from a properly generated UUID or random key — would offer only sixty-four bits of security against a quantum adversary.
Corn
Sixty-four bits is not theoretical anymore. That's within the realm of what a sufficiently resourced attacker could brute-force. It's not trivial, but it's not the astronomical impossibility that a hundred and twenty-eight bits represents classically.
Herman
There's a really good write-up on this from EugeneZonda, published in December twenty twenty-five, that walks through the math. Grover's algorithm doesn't give you an exponential speedup like Shor's algorithm does for factoring — it's quadratic. But quadratic is still devastating when your security model assumes exponential infeasibility. You go from a number of operations that's physically out of reach for any classical computer to something that a nation-state could plausibly throw computing resources at.
Corn
The other quantum threat that doesn't get enough attention in these discussions is the harvest now, decrypt later attack. Even if practical quantum computers capable of running Grover's algorithm at scale don't exist yet, an adversary can capture encrypted or obscured data today and store it. Then, when quantum computers become available — five years, ten years, fifteen years from now — they decrypt it retroactively.
Herman
For the kind of documents people upload to AI chatbots — legal strategies, business plans, medical research, personal financial data — that data often has a long shelf life. A trade secret is still valuable ten years later. A medical record is still sensitive ten years later. The expiry on the URL doesn't matter if someone archived the document during the window when it was accessible.
Corn
The vendor's defense — long random URL with automatic expiry — is protecting against a casual attacker today while being completely transparent to a patient attacker with storage capacity. And storage is cheap. Storing billions of URLs and their associated documents in the hope that quantum computing eventually breaks them open — that's a viable strategy for an intelligence agency.
Herman
Let me put some numbers on this. A hundred and twenty-eight bit random value has about three point four times ten to the thirty-eighth possible combinations. Classically, if you could check a billion URLs per second, it would take you about ten to the twenty-second years — vastly longer than the age of the universe. With Grover's algorithm, that drops to about two to the sixty-fourth, roughly ten to the nineteenth operations, which is still an enormous number but moves the problem from physically absurd to merely very expensive for a large-scale quantum computer.
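The arithmetic behind these figures is easy to verify with a few lines of Python:

```python
import math

BITS = 128
combos = 2 ** BITS                       # size of the URL keyspace
checks_per_sec = 1e9                     # assumed classical guessing rate
seconds_per_year = 3600 * 24 * 365

classical_years = combos / checks_per_sec / seconds_per_year
grover_ops = math.isqrt(combos)          # Grover needs ~sqrt(N) iterations

print(f"keyspace:        {combos:.2e}")           # ~3.40e+38 combinations
print(f"classical years: {classical_years:.1e}")  # ~1.1e+22 years
print(f"grover ops:      {grover_ops:.2e}")       # ~1.84e+19, i.e. 2**64
```

Note that the Grover figure counts quantum iterations, not classical guesses; how fast those iterations could run on real fault-tolerant hardware is still an open question.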
Corn
That's assuming the entropy is actually a hundred and twenty-eight bits. In practice, a lot of these random URL generators use weaker random number generators, or they derive the key from timestamps or other predictable values. The effective entropy might be much lower — maybe sixty-four bits classically, which Grover's reduces to thirty-two bits. Thirty-two bits is a joke. That's four billion combinations. You can brute-force that on a laptop.
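Corn's point about effective entropy is easy to demonstrate. A hypothetical sketch contrasting a proper CSPRNG token with a timestamp-seeded one (the latter is the anti-pattern, shown only to illustrate the failure):

```python
import random
import secrets
import time

# Proper: 128 bits from the OS CSPRNG -- the full keyspace the vendor claims.
strong_token = secrets.token_urlsafe(16)   # 16 bytes = 128 bits of entropy

# Anti-pattern: a PRNG seeded with the current second. The "random" URL is
# now a pure function of a guessable timestamp, so the effective entropy is
# only the attacker's uncertainty about the server clock.
rng = random.Random(int(time.time()))
weak_token = "%032x" % rng.getrandbits(128)

# Anyone who recovers or guesses the seed reproduces the token exactly:
assert ("%032x" % random.Random(12345).getrandbits(128)
        == "%032x" % random.Random(12345).getrandbits(128))
```

Both tokens look equally random to the eye, which is exactly why users and triagers cannot audit entropy from the URL alone.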
Herman
This is why the bug bounty triagers are right to ask for evidence of weak entropy before escalating these findings. But from the user's perspective, you have no way to audit the entropy of the URL you received. You're trusting the vendor's implementation, and we've seen repeatedly that vendors get this wrong.
Corn
Let's talk about the broader AI chatbot vulnerability landscape, because Daniel's case isn't happening in isolation. There's an active threat environment around document uploads to these platforms.
Herman
There was a significant vulnerability discovered in Open WebUI — that's CVE twenty twenty-five six four four nine six — with a CVSS score of seven point three. It allowed JWT theft and full account compromise, including access to all uploaded documents. That's not a theoretical URL-guessing attack. That's a real exploit that gives an attacker access to everything the user uploaded.
Corn
Then there's the prompt injection vector via malicious PDFs. LastPass published a blog on this in twenty twenty-five — you can embed invisible instructions in a PDF that manipulate the AI's behavior or exfiltrate chat history. So you've got two completely different attack surfaces. One is the storage security of the uploaded documents. The other is the content of the documents being used to attack the AI itself.
Herman
What connects both of these to Daniel's case is the trust model. Users are uploading documents to these platforms assuming a level of security and isolation that may not exist. The vendor's security-by-obscurity argument for document storage is just one manifestation of a broader pattern — AI chatbot platforms are handling sensitive data in ways that haven't been fully thought through from a security perspective.
Corn
There's an irony here. These AI companies are building incredibly sophisticated models — they're solving problems that were considered impossible a decade ago. And then they're storing user documents in publicly accessible S3 buckets and saying don't worry, the URL is really long.
Herman
It's the classic disconnect between product velocity and security maturity. The AI capabilities are advancing at an extraordinary pace, but the operational security practices are still catching up. You see this in every fast-growing tech sector — the infrastructure security lags behind the feature development.
Corn
Where does this leave us on the core question? Is security by obscurity ever legitimate?
Herman
I think the answer is that obscurity can be a legitimate layer, but never the only layer. If you have proper authentication, encryption, access controls, and auditing, and you also use long random URLs to make enumeration harder — that's fine. That's defense in depth. But if the URL is your entire security model, you're not doing security. You're doing hope.
Corn
The quantum computing angle makes this distinction even sharper. An obscurity layer that's acceptable today as a secondary measure might become completely transparent in a post-quantum world. If you're designing systems now that will handle sensitive data with a long shelf life, you need to be thinking about post-quantum security. Not as a nice-to-have, but as a requirement.
Herman
The harvest now, decrypt later threat means that any data accessible today — even behind an unguessable URL — could be retroactively compromised. The only real protection is to ensure the data was never publicly accessible in the first place. Proper authentication isn't just about stopping today's attackers. It's about ensuring that there's no plaintext for future attackers to harvest.
Corn
Let me push back on one thing, though. Is there a scenario where pure security by obscurity is actually the right call? Not as a permanent solution, but as a pragmatic interim measure?
Herman
I think you can make a case for it in very low-stakes scenarios where the cost of a breach is negligible and the cost of proper authentication is high relative to the value of the data. A public blog post, a publicly shared meme, a readme file for an open-source project. If the data is intended to be public anyway, the obscurity of the URL is just making it slightly less convenient to find — it's not protecting anything sensitive.
Corn
That's not what happened in Daniel's case. The user uploaded a PDF — we don't know what was in it, but people don't upload documents to AI chatbots that they intend to be public. The very act of uploading implies an expectation of privacy. The vendor's security model was mismatched to the sensitivity of the data.
Herman
That mismatch is the core failure. Security has to be proportional to the value of what you're protecting. If you're storing user documents, the default assumption should be that those documents are sensitive. The burden of proof should be on the vendor to demonstrate that a weaker security model is appropriate, not on the user to discover that the security model is inadequate.
Corn
There's also a regulatory dimension here that we haven't touched on. Depending on what was in that PDF, the vendor's approach might violate GDPR, HIPAA, or other data protection regulations. These regulations generally don't accept security by obscurity as a valid protection mechanism for personal or sensitive data.
Herman
GDPR in particular requires appropriate technical and organizational measures to ensure a level of security appropriate to the risk. Storing personal data in a publicly accessible bucket with no authentication, even behind a long URL, would be very hard to defend as appropriate under that standard. The European Data Protection Board has been pretty clear that obscurity doesn't count.
Corn
Practically speaking, what should a user do if they find themselves in the situation Daniel described? They upload a document, they get back a link, and they realize it's publicly accessible.
Herman
First, document everything. Screenshots, timestamps, the URL itself, any communication with the vendor. Second, push for a substantive response — not just a canned reply about long URLs being secure. Ask specific questions: What entropy source are you using for these URLs? What's the exact expiry window? How do you verify deletion? Third, consider whether the document contained regulated data and whether you have reporting obligations.
Corn
From the vendor side, what should AI chatbot platforms be doing differently?
Herman
Authentication should be the default for any user-uploaded document. If a document is associated with a user account, accessing it should require proof that the requestor is that user. This isn't hard to implement — S3 supports pre-signed URLs, CloudFront supports signed cookies, there are a dozen well-established patterns for serving private content. The engineering cost is minimal relative to the reputational and regulatory risk.
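The signed-URL pattern Herman mentions can be sketched with nothing but the standard library. This is a minimal illustration of the general idea, not AWS's actual SigV4 scheme (boto3's generate_presigned_url and CloudFront signed cookies are the production versions); the secret and path are placeholders:

```python
import hashlib
import hmac
import time

SECRET = b"server-side-secret"  # placeholder; never appears in the URL itself

def sign_url(path, ttl, now=None):
    """Append an expiry timestamp and an HMAC over path+expiry."""
    expires = (now if now is not None else int(time.time())) + ttl
    base = f"{path}?expires={expires}"
    sig = hmac.new(SECRET, base.encode(), hashlib.sha256).hexdigest()
    return f"{base}&sig={sig}"

def verify_url(url, now=None):
    """Reject the request if expired or if the signature doesn't match."""
    base, _, sig = url.rpartition("&sig=")
    expires = int(base.rsplit("expires=", 1)[1])
    if (now if now is not None else int(time.time())) > expires:
        return False
    expected = hmac.new(SECRET, base.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)
```

The key property: the URL is worthless without the server-side secret, and tampering with either the path or the expiry invalidates the signature, so obscurity is no longer carrying the security model.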
Corn
The bug bounty question is trickier. Should platforms pay bounties for these findings?
Herman
I think they should, but with some nuance. A pure URL-guessing finding with no evidence of weak entropy or actual exposure probably doesn't merit a full bounty. But if a researcher demonstrates that user documents are accessible without authentication — even behind long URLs — that's a design flaw worth rewarding. The current approach of rejecting everything in this category creates a blind spot that vendors exploit.
Corn
It also discourages researchers from reporting these issues at all. If you know the bug bounty program is going to reject your finding, why bother? You might as well just tweet about it and let the PR pressure do the work. That's worse for everyone — the vendor gets blindsided publicly instead of having a chance to fix it quietly, and users are exposed for longer.
Herman
That's essentially what happened in Daniel's case. The user pushed back publicly — or at least persistently — and the vendor eventually fixed it. But that's not a sustainable model for security. You can't rely on individual users being persistent and technically sophisticated enough to escalate these issues effectively.
Corn
Alright, let's pull this all together. Security by obscurity — is it ever legitimate? The answer is a qualified no. As a secondary layer in a defense-in-depth strategy, obscurity can add useful friction. But as a primary or sole security mechanism, it's inadequate today and will become increasingly inadequate as quantum computing matures. The vendor in Daniel's case was wrong on the merits, and the bug bounty ecosystem's refusal to treat these findings seriously creates perverse incentives.
Herman
The quantum dimension is the sleeper issue here. Even if you think long random URLs are sufficient protection against classical attackers — and I don't, but even if you did — the harvest now, decrypt later threat means you're betting that quantum computers will never be practical. That's not a bet I'd want to make with user data.
Corn
One last question. Do you think this problem gets better or worse as AI chatbots become more integrated into workflows? As people upload more documents, more sensitive documents, more routinely?
Herman
Worse, almost certainly. The volume of sensitive data flowing into these platforms is increasing exponentially. Every new use case — legal document review, medical diagnosis assistance, financial analysis — brings higher-stakes data. The gap between user expectations of privacy and the actual security posture of these platforms is widening, not narrowing. This is going to be a recurring issue until the industry establishes clear standards for document handling.
Corn
Now: Hilbert's daily fun fact.
Herman
The average cloud weighs about five hundred thousand kilograms — roughly the same as a fully loaded Airbus A three eighty. All that water vapor suspended in the sky, and it stays up there because the droplets are so tiny that air resistance keeps them aloft.
Corn
What can listeners actually do with all this? If you're uploading documents to AI chatbots — and most of us are at this point — here's what I'd recommend. First, check the URL you get back. If it looks like a direct S3 or cloud storage link rather than a platform-specific document viewer, that's a red flag. Second, assume anything you upload could become public and make your decisions accordingly. Don't upload anything you wouldn't want on the front page of a newspaper. Third, if you find an exposure, report it and document it. Even if the vendor doesn't pay a bounty, your report creates a paper trail that matters for accountability.
Herman
From the development side, if you're building a platform that handles user documents, the lesson here is simple. Use proper authentication. Signed URLs, session tokens, access control lists — the tools exist and they're not hard to implement. The cost of doing it right is trivial compared to the cost of a breach or a regulatory action. And for the love of everything, don't tell your users that a long random URL is security. It's not, and they're smart enough to know it.
Corn
The broader point I'd leave listeners with is this — the debate over security by obscurity isn't really about URLs and S3 buckets. It's about whether we design systems that are secure by construction or secure by assumption. And as quantum computing reshapes what's computationally feasible, the gap between those two approaches is only going to widen.
Herman
Thanks as always to our producer Hilbert Flumingtop. This has been My Weird Prompts. You can find every episode at myweirdprompts.com or wherever you get your podcasts.
Corn
If you've got a prompt like Daniel's — something you've run into in the wild that made you stop and think — send it our way. We read them all.
Herman
We'll be back soon.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.