#2586: Pseudo-Personalized Emails: The New Spam Uncanny Valley

How to detect and filter AI-generated outreach emails that fake personal connection without nuking legitimate messages.

Episode Details
Episode ID
MWP-2744
Published
Duration
33:42
Audio
Direct link
Pipeline
V5
TTS Engine
chatterbox-regular
Script Writing Agent
deepseek-v4-pro

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

The Pseudo-Personalization Problem: Why Those "Personal" Emails Feel So Wrong

There's a new kind of spam creeping into inboxes, and it's uniquely frustrating. Unlike the obvious garbage that traditional filters catch, these emails sit in a strange uncanny valley between bulk commercial email and genuine personal correspondence. They reference your GitHub repos, your blog posts, your LinkedIn activity — but the personalization feels bolted on, not organic. There's no unsubscribe link, because acknowledging they're commercial would break the fiction that this is just a friendly note from a stranger.

Why It's Different from Regular Spam

Traditional spam is easy to identify and filter. It comes from known bad domains, uses obvious patterns, and triggers standard detection systems. Pseudo-personalized emails are designed to defeat pattern-matching. Each one is slightly different because templates get filled with scraped specifics — your GitHub username, a project you worked on, a recent talk you gave. Large language models have made this easier to generate at scale, with tools that scrape public profiles and feed them into prompts that produce what looks like thoughtful outreach.

The legal gray zone is deliberate. Under CAN-SPAM in the US, commercial email requires a working opt-out mechanism. Under GDPR, consent is required upfront. By omitting unsubscribe links, senders maintain plausible deniability that this is personal correspondence rather than marketing. Enforcement against this specific tactic is essentially nonexistent, because the volume per sender is low, the harm is diffuse, and proving automation versus genuine personal outreach is nontrivial.

Detection Signals That Actually Work

Filtering these emails requires a different approach than traditional spam detection. Several high-value indicators can help identify pseudo-personalized messages:

  • The sender has never emailed you before, and there's no obvious reason they'd have your address
  • The email references publicly available information (GitHub repos, blog posts) in a way that feels like a database field was inserted
  • A soft pitch appears somewhere in the second or third paragraph
  • No physical mailing address appears in the footer
  • The sending domain doesn't match the company they claim to represent
  • The email was sent at an odd hour for the sender's claimed timezone
  • The sender's domain was registered recently (under six months)

Practical Filtering Approaches

For someone technical who wants low friction without losing legitimate messages, a tiered approach works best. Rule-based heuristics can catch the majority of cases: checking domain age via WHOIS lookup, flagging emails that reference GitHub activity from unknown senders, and looking for the absence of standard commercial email elements like unsubscribe links and physical addresses.
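A minimal sketch of that rule layer as a scoring function, assuming the python-whois package for the domain-age check; the signal weights and the 180-day cutoff are illustrative values, not a tested implementation:

```python
import re
from datetime import datetime

import whois  # python-whois, assumed installed for domain-age lookups


def spam_score(sender: str, text: str, headers: dict, known_senders: set) -> int:
    """Score one message against the signals listed above; higher = more suspicious."""
    score = 0
    domain = sender.split("@")[-1].strip(">").lower()
    body = text.lower()

    # Unknown sender, no prior correspondence
    if sender.lower() not in known_senders:
        score += 2
    # Public profile data inserted like a database field
    if re.search(r"(saw|noticed|came across) your (github|repo|blog|talk)", body):
        score += 3
    # Missing the standard commercial-email elements
    if "unsubscribe" not in body and "List-Unsubscribe" not in headers:
        score += 2
    # Recently registered sending domain (illustrative 180-day cutoff)
    try:
        created = whois.whois(domain).creation_date
        if isinstance(created, list):
            created = created[0]
        if created and (datetime.now() - created).days < 180:
            score += 3
    except Exception:
        pass  # a failed WHOIS lookup should never block delivery
    return score
```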

These checks can be automated through tools like n8n, which connects to the Gmail API and allows custom workflows. A workflow that triggers on emails from unknown senders, runs them through classification logic, and archives or labels them keeps the inbox clean while maintaining safety nets.

When to Use AI Classification

For higher accuracy, an LLM-based classifier can analyze each suspicious email with a prompt like "Is this a genuine personal email or automated outreach?" At a fraction of a cent per classification, this approach is cost-effective even for daily use. Expected accuracy ranges from 90-95%, which means some misclassifications will occur.
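A rough illustration of what that classification call might look like, assuming the OpenAI Python client and an illustrative model name; the prompt mirrors the one described above:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Is this a genuine personal email or automated outreach? "
    "Respond with only GENUINE or AUTOMATED.\n\n{email_text}"
)


def classify(email_text: str) -> str:
    """Return 'GENUINE' or 'AUTOMATED' for one suspicious email."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": PROMPT.format(email_text=email_text)}],
        max_tokens=5,
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper()
```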

The key insight is the soft-fail approach: never auto-delete. Archive and label suspicious emails so they're out of the inbox but still searchable. This reduces the review burden from "read every email from unknown senders" to "spot-check a filtered folder once a week" — while ensuring no genuine opportunity gets permanently lost.
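A sketch of the soft-fail action using the Gmail API's Python client: archiving is just removing the INBOX label, so nothing is deleted. The credentials object and label ID are assumed to be set up elsewhere:

```python
from googleapiclient.discovery import build


def archive_and_label(creds, message_id: str, label_id: str) -> None:
    """Move a message out of the inbox and tag it, without deleting anything."""
    gmail = build("gmail", "v1", credentials=creds)
    gmail.users().messages().modify(
        userId="me",
        id=message_id,
        body={
            "removeLabelIds": ["INBOX"],  # archiving = removing the INBOX label
            "addLabelIds": [label_id],    # e.g. a "Suspected Outreach" label
        },
    ).execute()
```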


#2586: Pseudo-Personalized Emails: The New Spam Uncanny Valley

Corn
Daniel sent us this one — and I think he's tapped into something that's been quietly driving everyone nuts. He's talking about those emails that land in your inbox and pretend to be personal outreach, but they're clearly automated scraping spam. No unsubscribe link, because they're trying to fake being a real one-to-one message. He gets them constantly because he publishes on GitHub, and they're making it genuinely harder to spot actual human emails from people who want to connect. His question is: what's a low-friction way to filter these out, ideally at the server level, without nuking legitimate messages in the process. He's calling it pseudo-personalization. I think that name's going to stick.
Herman
It's a perfect name for it, honestly. And before we dive into solutions, I should mention — DeepSeek V four Pro is writing our script today, so if anything comes out especially clever, credit where it's due.
Corn
Alright, so where do we even start with this? Because this isn't just regular spam. Regular spam is easy — it's obviously garbage, your filters catch most of it, and when it slips through you roll your eyes and delete it. This stuff is different.
Herman
It's fundamentally different, and Daniel's frustration is completely justified. What he's describing sits in this uncanny valley between bulk commercial email and genuine personal correspondence. The reason it's so infuriating is that it exploits a social norm — the norm that says when someone writes to you personally, you should at least glance at it. These senders are parasitizing that norm.
Corn
Parasitizing is the right word. And what makes it technically distinct from regular spam is exactly what Daniel pointed out — no unsubscribe link. Under CAN-SPAM in the US, commercial email has to include a working opt-out mechanism. Under GDPR in Europe, you need consent in the first place. These emails deliberately omit the unsubscribe link to maintain the fiction that this isn't commercial email at all, it's just a friendly note from a stranger who coincidentally has a product you might like.
Herman
Right, and that's the legal gray zone they're exploiting. If you're sending bulk commercial email, you're a marketer and you have to follow the rules. If you're sending a personal email, you're just a person. These senders are pretending to be the second thing while actually doing the first thing. And Daniel's right that it's probably illegal in many jurisdictions, but enforcement against this specific tactic is basically nonexistent.
Corn
Because who's going to prosecute? The FTC isn't going after some startup that sent four hundred pseudo-personalized emails to GitHub repo owners. The volume is too low per sender, the harm is diffuse, and proving it's automated rather than personal is actually nontrivial.
Herman
That brings us to the detection problem, which is the core technical challenge here. Regular spam filtering works on patterns — certain IP ranges, certain phrases, certain header characteristics, known bad domains. Gmail and Google Workspace are actually very good at this for traditional spam. But these pseudo-personalized emails are designed to defeat exactly those pattern-matching systems.
Corn
Because each one is slightly different. They're scraping your GitHub profile, maybe your blog, maybe your LinkedIn, and generating text that references specific things about you. "Hey, I saw your repo on Rust-based CLI tools, really impressive work, I'm building something in a similar space and thought you might be interested..." It's templated, but the template gets filled with scraped specifics.
Herman
That's where large language models come in, both as the problem and potentially as part of the solution. Let me break down what's actually happening on the sender side, because understanding the adversary helps you build the defense. There are now dozens of AI-powered outreach tools — I'm thinking of platforms like Lemlist, Instantly, Smartlead — that integrate scraping and generation. They'll pull your GitHub activity, your recent posts, your job title, and feed that into a prompt that generates what looks like a thoughtful personal note. The AI is doing the pseudo part of the pseudo-personalization.
Corn
It's a terrible use of the technology, as Daniel said. Not just ethically — it's actually counterproductive for the senders. Anyone with pattern recognition skills spots these immediately. "I noticed your work on X" when X is literally the first thing on your public profile — that's not personalization, that's a database query.
Herman
There was a study from Stanford last year that looked at AI-generated outreach emails and found that recipients were actually more annoyed by them than by traditional spam, specifically because of the uncanny valley effect Daniel's describing. When something is clearly spam, your brain goes "spam" and you move on. When something is almost personal but not quite, it triggers a much stronger negative reaction. It feels deceptive in a way that obvious spam doesn't.
Corn
That's fascinating. So the senders are not only being unethical, they're being ineffective. The pseudo-personalization is worse than no personalization at all. But that doesn't solve Daniel's problem, which is that he still has to process these things. So let's talk solutions. He mentioned an email agent that uses intelligence rather than rules. I think that's the right instinct, but I want to push on the implementation details.
Herman
Let me lay out the landscape of possible approaches, from simple to complex, and then we can figure out what actually makes sense for someone in Daniel's position. He's on Google Workspace, he's technical, he wants low friction, and he's worried about false positives. Those constraints actually narrow things down usefully.
Corn
Start with the simplest option first. What can you do without any AI at all?
Herman
The simplest thing is Gmail filters with keyword matching, but that's going to have terrible precision and recall for this use case. These emails don't share consistent phrases. One might say "Love your work on" and another might say "Impressed by your contributions to." The whole point is that they're varied. You could try filtering on common scraping artifacts — a mention of your GitHub username in a weird context, or a reference to a repo that gets the language wrong — but that's whack-a-mole.
Corn
What about filtering on the absence of an unsubscribe link? If an email is from a domain you don't recognize and it doesn't contain an unsubscribe link or a physical address, that's a signal.
Herman
That's actually a clever heuristic, but it's hard to implement in Gmail's native filter system. Gmail filters can check for the presence of text, but checking for the absence of specific text isn't straightforward. You'd need something that parses the email outside of Gmail's rules engine. Which brings us to the next tier: using Google Apps Script or a third-party tool that connects via the Gmail API.
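A minimal sketch of that absence check done outside Gmail's rules engine, assuming the raw RFC 822 message comes from the Gmail API or an IMAP fetch:

```python
import email
from email import policy


def looks_noncommercial(raw_rfc822: bytes) -> bool:
    """True if the message lacks both an unsubscribe header and unsubscribe text,
    i.e. it is posing as personal correspondence."""
    msg = email.message_from_bytes(raw_rfc822, policy=policy.default)
    if msg.get("List-Unsubscribe"):
        return False
    body = msg.get_body(preferencelist=("plain", "html"))
    text = (body.get_content() if body else "").lower()
    return "unsubscribe" not in text
```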
Corn
This is where I think n8n could be useful. Daniel's been running n8n for about eighteen months — it's the closest self-hosted Zapier replacement, and it can connect to Gmail's API. You could build a workflow that triggers on new emails from unknown senders, runs them through some classification logic, and either archives them, labels them, or flags them for review.
Herman
That's a solid approach, and it gives you full control over the logic. But the classification piece is where it gets interesting. You could start with rule-based heuristics and then layer on AI only where needed. For instance, a rule that says "if the sender's domain is less than six months old and the email mentions my GitHub, flag it." That alone would catch a lot.
Corn
Domain age is a really good signal. Most of these scraping operations are using relatively new domains, or they're rotating through domains quickly as they get blacklisted. You could integrate a WHOIS lookup into the n8n workflow, check the domain creation date, and use that as one factor.
Herman
You can get more sophisticated with the signals. Here's my list of high-value indicators that an email is pseudo-personalized scraping spam. One: the sender has never emailed you before and there's no obvious reason they'd have your email. Two: the email references something publicly available about you — your GitHub, your blog, your company's about page — in a way that feels like a database field was inserted. Three: there's a soft pitch somewhere in the second or third paragraph. Four: no physical mailing address in the footer. Five: the sending domain doesn't match the company they claim to represent. Six: the email was sent at an odd hour for the sender's claimed timezone.
Corn
Number six is underrated. If someone claims to be in San Francisco but their email arrived at three AM Pacific time, that's a tell. Not definitive, but combined with other signals it adds up.
Herman
And this is where a scoring system makes sense. Each signal adds to a score, and above a certain threshold you take action. The question is what action. Daniel mentioned blocking the domain at the server level, which is the nuclear option. I'd recommend a softer approach first: automatically archive and label these emails, so they're out of the inbox but still searchable if you ever need to find one.
Corn
That's the false positive safety net. If you accidentally filter a real person, their email isn't gone — it's just not in your face. You can periodically review the filtered folder and rescue anything that got caught incorrectly.
Herman
That brings us to the AI agent approach Daniel was wondering about. The question is whether you need a full language model to classify these emails, or whether simpler methods work well enough. I think the answer depends on volume and tolerance for false positives.
Corn
Let's talk about what an AI-based approach would actually look like. You'd have something — again, probably an n8n workflow or a custom script — that receives each email from an unknown sender, sends it to an LLM with a prompt like "Is this a genuine personal email or an automated outreach message? Respond with only GENUINE or AUTOMATED," and then acts on the response.
Herman
The cost per classification with something like GPT four point oh or Claude would be a fraction of a cent per email. If Daniel's getting one or two of these a day, we're talking maybe fifty cents a month in API costs. That's completely reasonable. The latency is also fine — you don't need real-time classification for email filtering. A thirty-second delay between the email arriving and it being classified is imperceptible to the user.
Corn
The real question is accuracy. How good are these models at distinguishing genuine personal outreach from well-crafted pseudo-personalization?
Herman
I haven't seen a published benchmark specifically for this task, but based on what we know about LLM performance on similar classification problems, I'd expect somewhere in the ninety to ninety-five percent accuracy range. The models are very good at detecting the subtle tells — the slightly generic phrasing, the way the personalization feels bolted on rather than integrated, the absence of specific shared context that a real person would naturally include.
Corn
Ninety to ninety-five percent is pretty good, but it means five to ten percent error. If Daniel gets, say, forty of these a month, that's two to four misclassifications. If even one of those is a real person whose email got filtered, that's a problem.
Herman
That's why the soft-fail approach matters. Don't auto-delete. Archive and label. The cost of reviewing a filtered folder once a week is maybe two minutes. The cost of missing a genuine opportunity because someone's email got deleted is potentially much higher. The AI reduces the review burden from "read every email from unknown senders" to "spot-check the filtered folder."
Corn
There's also a hybrid approach that I think is underrated. Use the AI for classification, but only on emails that have already passed through a set of deterministic filters. So step one: is this from a known contact? If yes, deliver normally. Step two: does this email contain an unsubscribe link or a physical business address? If yes, it's probably legitimate marketing email — still annoying, but at least it's following the rules. Step three: does it match any obvious spam patterns? Step four: only then send the remaining ambiguous emails to the LLM for classification.
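A sketch of that layered flow, with each earlier layer short-circuiting so only the ambiguous remainder costs an LLM call; spam_score, looks_noncommercial, and classify refer to the other sketches on this page, and the threshold is illustrative:

```python
def triage(sender: str, raw_bytes: bytes, text: str, headers: dict,
           known_senders: set) -> str:
    """Decide an action for one incoming email: deliver, label, archive, or review."""
    # Layer 1: known contacts are always delivered untouched
    if sender.lower() in known_senders:
        return "deliver"

    # Layer 2: rule-following marketing (unsubscribe present) is annoying but legitimate
    if not looks_noncommercial(raw_bytes):
        return "label_marketing"

    # Layer 3: conservative deterministic rules catch the obvious cases
    if spam_score(sender, text, headers, known_senders) >= 6:  # threshold illustrative
        return "archive"

    # Layer 4: only the ambiguous remainder is sent to the LLM
    verdict = classify(text)
    return "archive" if verdict == "AUTOMATED" else "review"
```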
Herman
That's the right architecture. Each layer reduces the number of emails that need AI classification, which reduces both cost and the absolute number of false positives. And you can tune the deterministic layers to be extremely conservative — they only filter things that are almost certainly spam — and let the AI handle the nuanced cases.
Corn
Let me propose a concrete setup for Daniel, since he's already on Google Workspace and already using n8n. And I want to be specific here because "set up an AI email filter" is the kind of advice that sounds helpful but is actually useless without implementation details.
Herman
Let's build it.
Corn
Step one: in Google Cloud Console, enable the Gmail API for your workspace. Create a service account with appropriate scopes — you need the ability to read messages, modify labels, and ideally also manage blocked senders if we want that option. Step two: in n8n, create a new workflow with a Gmail trigger node that watches for new messages. You can configure it to only trigger on emails from senders not in your contacts, which is the first filter right there.
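A sketch of the credential side of step one, assuming a service account with domain-wide delegation; the scope list, file path, and mailbox address are illustrative:

```python
from google.oauth2 import service_account

# gmail.modify covers reading messages and changing labels (archive = remove INBOX)
SCOPES = ["https://www.googleapis.com/auth/gmail.modify"]

creds = service_account.Credentials.from_service_account_file(
    "service-account.json", scopes=SCOPES
).with_subject("daniel@example.com")  # delegate to the target mailbox (hypothetical address)
```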
Herman
Important detail: Gmail's API has a feature called push notifications via PubSub, but for a low-volume personal inbox, polling every five or ten minutes is perfectly fine and much simpler to set up. The n8n Gmail trigger node handles this natively.
Corn
Step three: the workflow pulls the full email content — subject, body, sender address, headers. Step four: run it through the deterministic filters we talked about. Check if the sender is in your contacts, check for unsubscribe links in the body, check the domain age via a WHOIS API node. If it fails any of these checks, label it as suspected spam and archive it. If it passes all the deterministic checks, leave it in the inbox.
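A sketch of the fetch behind steps three and four, pulling raw messages through the Gmail API so the filters have something to parse; the query string is illustrative:

```python
import base64

from googleapiclient.discovery import build


def fetch_recent(creds, query: str = "in:inbox newer_than:1d"):
    """Yield (message_id, raw RFC 822 bytes) for recent inbox messages."""
    gmail = build("gmail", "v1", credentials=creds)
    listing = gmail.users().messages().list(userId="me", q=query).execute()
    for ref in listing.get("messages", []):
        full = gmail.users().messages().get(
            userId="me", id=ref["id"], format="raw"
        ).execute()
        yield ref["id"], base64.urlsafe_b64decode(full["raw"])
```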
Herman
For the WHOIS lookup, there are free APIs that work fine for this volume. You're not checking thousands of domains a day, you're checking maybe ten. The key signal is domain age — anything registered in the last ninety days is suspicious. Not conclusive, but combined with other signals, it's strong.
Corn
Step five, and this is where the AI comes in: for emails that pass the deterministic checks but still seem off — or alternatively, for all emails from unknown senders — send the content to an LLM for classification. In n8n, you'd use the HTTP Request node to call the OpenAI API or the Anthropic API. The prompt is crucial here. You want something like: "You are classifying emails to determine if they are genuine personal messages or automated outreach. Genuine personal emails reference specific shared context, ask natural questions, and don't contain a sales pitch. Automated outreach emails use scraped public information to appear personal but contain a pitch for a product, service, or meeting. Classify the following email as GENUINE or AUTOMATED. Respond with only one word."
Herman
I'd add one thing to that prompt: ask for a confidence score. "Respond with only one word and a number from one to ten indicating your confidence." That way, you can set different thresholds. High-confidence automated emails get auto-archived. Low-confidence ones get flagged for manual review. That dramatically reduces the false positive risk.
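One way the confidence routing might look once the model's reply comes back; the regex, thresholds, and action names are assumptions:

```python
import re


def route(llm_reply: str) -> str:
    """Turn 'AUTOMATED 9'-style replies into an action, erring toward review."""
    match = re.match(r"\s*(GENUINE|AUTOMATED)\D*(\d+)", llm_reply.upper())
    if not match:
        return "review"  # unparseable output goes to human eyes
    label, confidence = match.group(1), int(match.group(2))
    if label == "AUTOMATED" and confidence >= 8:
        return "archive"  # high-confidence spam, out of the inbox
    if label == "GENUINE" and confidence >= 8:
        return "deliver"
    return "review"  # everything uncertain gets flagged
```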
Corn
That's smart. And then step six: based on the classification, apply a Gmail label and either archive the email or leave it in the inbox. You could have labels like "Likely Spam — AI Classified" and "Uncertain — Review." The uncertain ones are the only ones you need to actually look at.
Herman
Step seven, which is optional but powerful: for emails classified as automated with high confidence, you could automatically add the sender's domain to a block list. Daniel mentioned wanting to block these domains at the server level, and you can do that through the Gmail API as well. But I'd recommend a cooldown — maybe the domain gets blocked only after three separate emails from that domain get classified as automated. That prevents a single false positive from permanently blocking a legitimate domain.
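A sketch of that cooldown, tracked in a small local store so one misclassification can never block a domain; the three-strike threshold and the JSON file are illustrative:

```python
import json
from pathlib import Path

STRIKES_FILE = Path("domain_strikes.json")
STRIKE_THRESHOLD = 3  # block only after three separate emails classify as AUTOMATED


def record_strike(domain: str) -> bool:
    """Record one AUTOMATED classification; return True when the domain should be blocked."""
    strikes = json.loads(STRIKES_FILE.read_text()) if STRIKES_FILE.exists() else {}
    strikes[domain] = strikes.get(domain, 0) + 1
    STRIKES_FILE.write_text(json.dumps(strikes, indent=2))
    return strikes[domain] >= STRIKE_THRESHOLD
```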
Corn
That's the kind of safeguard that makes the difference between a tool you trust and a tool you're constantly second-guessing. The whole system needs to earn trust over time, and the way it earns trust is by being transparent about its decisions and giving you easy ways to correct mistakes.
Herman
Let me address the self-hosted versus managed question, because Daniel mentioned he's technical and already running n8n, but there's also a case for using a managed service. There are tools like SaneBox, Superhuman, and Hey that do some of this classification work out of the box. The advantage is zero setup and maintenance. The disadvantage is you're trusting someone else's classification logic and you have less control.
Corn
For Daniel specifically, I think self-hosted makes sense. He's already running n8n, he's comfortable with APIs, and his use case is specific enough that a general-purpose tool probably won't handle it well. Most email filtering services are designed around newsletter management and traditional spam — they're not optimized for detecting pseudo-personalized AI-generated outreach.
Herman
The specific pattern of "scraped GitHub data used to fake personalization" is niche enough that building your own classifier with domain-specific heuristics will almost certainly outperform a generic solution. You know your own patterns. You know which repos are public, which blog posts get scraped, which conferences you've spoken at. Those are the data points that scrapers use, and you can tune your filters accordingly.
Corn
There's another angle here that I think is worth exploring. Daniel mentioned that part of his frustration is ideological — he believes in agentic AI, and he sees this as a horrible misuse of it. I think that's worth sitting with for a moment, because it connects to a broader conversation about how these tools get deployed.
Herman
AI agents that could be doing useful things — automating tedious workflows, helping people access information, making software more accessible — are instead being used to generate slightly customized spam at scale. It's like inventing the printing press and using it to write passive-aggressive post-it notes.
Corn
It creates a negative externality that affects the whole ecosystem. When enough people start getting these pseudo-personalized emails, the default assumption for any cold outreach becomes "this is probably fake." That hurts people who are doing legitimate cold outreach — the researcher who found your paper and has a genuine question, the developer who wants to collaborate on a project, the journalist who wants to quote you. Everyone gets tarred with the same brush.
Herman
This is a classic tragedy of the commons problem. Each individual sender gets a slight benefit from using AI personalization — maybe their response rate goes from one percent to one point five percent. But collectively, they're poisoning the well for everyone. And the senders don't bear the cost of that poisoning — the recipients do, in the form of a noisier inbox and eroded trust.
Corn
Which is why technical solutions like what we're describing aren't just personal convenience. They're a way of reclaiming the commons. If enough people deploy smart filters that specifically target this behavior, the ROI for the senders drops, and the practice becomes less attractive. It's a distributed defense.
Herman
That's happening organically. Gmail and other major providers are getting better at detecting these patterns. I've noticed in the last six months or so that fewer of these emails are landing in my primary inbox — more of them are getting caught by the spam filter or shuffled into the promotions tab. The classifiers are learning.
Corn
Although the promotions tab is interesting, because these emails don't want to be there. The whole tactic is to avoid being classified as marketing. If Gmail starts consistently putting them in promotions, the senders lose the inbox placement they're trying to achieve, and the tactic becomes less valuable.
Herman
Let me circle back to something practical, because I want to make sure we're giving Daniel actionable advice and not just theory. Here's what I'd actually recommend, in order of increasing complexity and effectiveness. Tier one: right now, today, create a Gmail filter that catches any email containing the phrase "saw your" plus "GitHub" or "repo" or "repository." That's a very common scraping template. It won't catch everything, but it'll catch a lot, and it has essentially zero false positive risk — no real person writes "I saw your GitHub repository and was impressed" as their opening line.
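That tier-one rule translates directly into a Gmail filter, which can be created in the Gmail UI or, as sketched below, via the Gmail API (filter creation typically needs a gmail.settings scope in addition to gmail.modify; the label ID is assumed to exist already):

```python
from googleapiclient.discovery import build


def create_tier_one_filter(creds, label_id: str) -> None:
    """Create the 'saw your' + GitHub filter directly via the Gmail API."""
    gmail = build("gmail", "v1", credentials=creds)
    gmail.users().settings().filters().create(
        userId="me",
        body={
            "criteria": {"query": '"saw your" (github OR repo OR repository)'},
            "action": {
                "removeLabelIds": ["INBOX"],  # skip the inbox
                "addLabelIds": [label_id],    # e.g. a "Suspected Outreach" label
            },
        },
    ).execute()
```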
Corn
Real people say "Hey, I was looking at your code and..." or "Quick question about your repo." The "saw your" phrasing is a tell. It's the kind of slightly stilted language that templates produce.
Herman
Tier two: set up the n8n workflow with the Gmail API as we described, starting with deterministic rules and a manual review folder. Get comfortable with the workflow, tune the rules, see what it catches. This is a weekend project for someone with Daniel's technical background. Tier three: add the LLM classification once you've validated that the deterministic rules are working and you understand the false positive rate. The LLM layer is where you go from "this catches most of them" to "this catches almost all of them."
Corn
Tier four, which is aspirational but worth mentioning: if you really want to go nuclear, you can use the Gmail API to auto-reply to classified spam with a complaint. Something like "This email appears to be unsolicited commercial email sent without consent. Please remove my address from your systems." Most of these senders will ignore it, but it creates a paper trail, and some email sending platforms will flag accounts that receive multiple complaints.
Herman
I'd be cautious with auto-replies, because they confirm to the sender that your email address is actively monitored. That can actually increase the volume of spam you receive, because your address gets marked as "verified active" in whatever database they're using. I'd stick with silent filtering.
Corn
Silent filtering is the way to go.
Herman
Let's talk about one more thing Daniel raised, which is the legal dimension. He mentioned that these emails are probably illegal but nobody prosecutes. He's right on both counts. Under CAN-SPAM, commercial email has to include a clear and conspicuous opt-out mechanism. Under GDPR, you need a lawful basis for processing someone's personal data, and "I scraped your GitHub" isn't one. But enforcement is almost entirely focused on high-volume spammers, not on the kind of low-volume pseudo-personalized outreach Daniel's describing.
Corn
The reason enforcement is so limited isn't just resource constraints, though that's part of it. It's also that proving these emails are automated rather than personal is difficult from the outside. A regulator looking at a single email can't necessarily tell if it was written by a human who did thirty seconds of research or by an AI that scraped a profile. The patterns only become visible in aggregate.
Herman
Which is another argument for technical solutions over legal ones. The law moves slowly and enforcement is spotty. A well-tuned n8n workflow moves at the speed of API calls.
Corn
Alright, let me try to synthesize this into something concrete. Daniel's problem is pseudo-personalized outreach emails — AI-generated, scraping-based, no unsubscribe link, designed to look personal. The solution we're proposing is a layered filtering system on Google Workspace using n8n and the Gmail API. Layer one: deterministic rules based on domain age, known template phrases, and structural tells. Layer two: LLM classification for ambiguous cases, with a confidence threshold and a manual review folder for low-confidence hits. Layer three: optional domain-level blocking for repeat offenders, with a cooldown to prevent false positive lockouts.
Herman
The whole thing is self-hosted, transparent, and tunable. That's the key. You're not trusting a black box — you're building something where you can see every decision and adjust every threshold. For someone like Daniel who's technical and opinionated about how his tools work, that's the right approach.
Corn
The setup time is probably a weekend. The ongoing cost is negligible — pennies a month in API fees for the LLM classification. The maintenance is minimal once it's tuned. And the peace of mind is substantial.
Herman
I want to add one more technique that's a bit unconventional but powerful. You can use the absence of certain signals as a positive indicator. A genuine personal email usually contains something that couldn't be scraped — a reference to a private conversation, a question about something that isn't publicly documented, a mention of a mutual contact. The LLM can be prompted to look for these "genuineness signals" and use their absence as evidence of automation.
Corn
That's a really good framing. It's not just "does this look like spam," it's "does this look like a real person trying to communicate." Real personal emails have a certain texture — they're messier, more specific, more contextual. They reference things that aren't in your top five Google results. They ask questions that a template wouldn't generate.
Herman
That's something AI classifiers are surprisingly good at detecting. They've been trained on vast amounts of human conversation, so they have a strong implicit model of what natural human communication looks like versus templated communication with slots filled in. The difference is often in the transitions between sentences, the way topics flow, the presence of small digressions and self-corrections that templates smooth out.
Corn
Templates are too coherent. Real emails have asides. "By the way, this reminded me of..." or "Sorry if this is random, but..." Those little disfluencies are hard to template convincingly.
Herman
And the scrapers haven't figured out how to fake that yet. They might eventually, but right now, the pseudo-personalization is still detectably pseudo. The personalization is a veneer — once you look at the structure underneath, it's clearly a template.
Corn
Let's also talk about what not to do, because I think there are some tempting but wrong approaches here. Don't try to solve this with Gmail's built-in spam reporting. Marking these as spam trains Google's classifiers, which is good for the ecosystem, but it doesn't solve your individual problem quickly enough, and the senders just rotate domains. Don't try to reply asking to be removed — as we discussed, that confirms your address is active. And don't try to build a perfect classifier that catches everything with zero false positives. That's a trap. Aim for "catches most and false positives go to manual review," not "catches all and deletes them."
Herman
The perfect is the enemy of the good enough, especially in email filtering. A system that catches eighty percent of pseudo-personalized spam with zero false positives that result in deletion is vastly better than a system that catches ninety-nine percent but occasionally nukes a real email. The manual review folder is the safety valve that makes the whole thing trustworthy.
Corn
Once you've had the system running for a few weeks, you'll have a good sense of its real-world performance. You can look at the review folder, see what's getting flagged, and tune accordingly. Maybe you find that certain domains keep showing up and you add them to a blocklist. Maybe you find that certain phrases are reliable indicators and you promote them to deterministic rules. The system gets better over time.
Herman
That's the beauty of self-hosted tools. They're not static — they evolve with your needs and your adversary's tactics. And the adversary here is adapting too. The scraping templates are getting more sophisticated, the AI generation is improving, the domain rotation is getting faster. A static defense loses effectiveness over time. An evolving defense doesn't.
Corn
Which brings us back to something Daniel said that I think is worth highlighting. He called this a "horrible misuse of AI" and said he wants to clamp down on it from an ideological perspective. I think building your own filter is a form of ideological action. It's saying: I refuse to let my inbox be a dumping ground for low-effort AI-generated outreach. I'm going to use AI defensively to protect my attention, rather than letting AI be used offensively to steal it.
Herman
Attention is the scarce resource here. These emails are attention theft at scale. They're trying to steal a few seconds of your cognitive bandwidth by pretending to be something they're not. And the cumulative effect, as Daniel described, is that it becomes harder to find the real human messages in the noise. That's a real cost. It's not just annoyance — it's degraded signal-to-noise ratio in a communication channel that matters.
Corn
For someone like Daniel who publishes open source work and presumably wants to hear from genuine collaborators and users, that degraded signal-to-noise ratio has a real productivity cost. Every pseudo-personalized email he has to read and dismiss is time he's not spending on actual work or actual relationships.
Herman
Let me bring this full circle with a practical checklist. If Daniel's listening and wants to set this up, here's what I'd do in order. One: audit your current inbox for a week. Tag every pseudo-personalized email and look for patterns. What phrases do they use? What data do they scrape? What times do they arrive? This gives you training data for your rules. Two: set up the Gmail API access and n8n workflow with the basic deterministic filters we described. Three: run it in "label only, no archive" mode for a week so you can validate that it's catching the right things without false positives. Four: add the LLM classification layer with the confidence scoring and the manual review folder. Five: after another week of validation, enable auto-archiving for high-confidence classifications. Six: periodically review the archive and tune.
Corn
That's a solid plan. And I think the fact that we can describe it in enough detail that someone could actually implement it is important. Too much advice in this space is hand-wavy — "use AI to filter your email" without specifying which API, which prompt, which workflow. The details matter.
Herman
The details are where the false positives live. If you're sloppy about the details, you end up filtering real emails and the whole thing becomes counterproductive. If you're careful about the details, you get a system that quietly and reliably protects your attention without you having to think about it.
Corn
That's the dream, right? A system that just works, that you don't have to babysit, that handles the garbage so you can focus on the signal. It's not a crushing issue, as Daniel said, but it's one of those things that, once solved, improves your quality of life in a small but consistent way every single day.
Herman
Every single day. Two or three fewer garbage emails to process, every day, for the rest of your career.
Corn
Alright, I think we've given Daniel a pretty complete answer. Layered filtering, n8n plus Gmail API, deterministic rules plus LLM classification, manual review as a safety net, domain blocking with a cooldown. Low friction, self-hosted, tunable.
Herman
If anyone else is dealing with the same problem — and I suspect a lot of people are — the same architecture works. The specific heuristics might differ based on what kind of public profile you have and what the scrapers are targeting, but the layered approach is generalizable.
Corn
Now: Hilbert's daily fun fact.

Hilbert
The national animal of Scotland is the unicorn. It has been since the twelve hundreds, when it was adopted as a symbol of purity and power on the Scottish royal coat of arms.
Corn
...right.
Corn
This has been My Weird Prompts. Thanks to our producer Hilbert Flumingtop for keeping the show running. If you enjoyed this episode, leave us a review wherever you get your podcasts — it helps other people find the show. You can also find every episode at myweirdprompts dot com. We'll be back soon.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.