Hey everyone, welcome back to My Weird Prompts. I am Corn, and I am here in our Jerusalem living room with my brother. It is a bit of a grey, drizzly afternoon here in the holy city, which honestly feels quite appropriate for the topic we are diving into today.
Herman Poppleberry here. It is good to be back at the microphones, even if the subject matter is, shall we say, less than sunny. The coffee is hot, the servers are humming in the corner, and I am ready to get technical.
It really is a heavy one today, but it is technically fascinating and, frankly, vital for the future of historical record-keeping. Today's prompt comes from Daniel, who is looking at the challenges of digitally archiving hate speech and extremism. Daniel was pointing out a massive catch-22 in the world of digital preservation. Mainstream services like the Wayback Machine or Perma.cc often explicitly prohibit archiving this kind of content in their terms of service. This leaves researchers, journalists, and historians in a massive bind.
Right, it is the classic digital preservation paradox. To study the worst parts of human history and prevent them from repeating, you have to document them. You need the primary sources. But the very platforms we use for documentation are, for very understandable legal and ethical reasons, terrified of hosting that material. We are talking about a digital "memory hole" that is opening up in real-time.
Exactly. Daniel was specifically looking at the situation in Ireland, where extremist rhetoric has been flourishing on certain platforms over the last couple of years, particularly around issues of immigration and housing. He wants to know what the best solutions are for individuals or private organizations who want to document this stuff without getting de-platformed themselves. We are going to look at the comparison between self-hosting and cloud-based options, and really get into the weeds of how you build a "dark archive" that survives.
This is such a critical topic because digital evidence is incredibly fragile. We often think of the internet as forever, but it is actually more like writing in sand at low tide. A tweet can be deleted in seconds, a whole forum can be taken down by a hosting provider, or a domain can be seized. If it is not captured the moment it happens, with the proper metadata and verification, it might as well have never existed from a legal or historical perspective. By the time a historian looks back at the twenty-twenties, the most influential—and most toxic—parts of our discourse might be completely gone.
Well, let's start with that mainstream problem. Why is it that the Wayback Machine or Perma.cc have these restrictive terms of service? I mean, they are archives. Their mission is to save the web. Shouldn't they be neutral by definition?
In an ideal world, maybe. But they operate in a complex legal and social reality that has become much more hostile over the last few years. If the Internet Archive allows itself to be used as a mirror for extremist manifestos, recruitment videos, or doxing materials, they risk being seen not as an archive, but as a distributor. Under laws like the Digital Services Act in Europe—which has really bared its teeth in twenty-twenty-four and twenty-twenty-five—being a "passive host" is a moving target. If they are notified that they are hosting illegal hate speech and they do not remove it with "undue delay," they could face massive fines—up to six percent of global turnover—or even criminal liability in certain jurisdictions.
So, for them, the risk of losing their entire service just to host a few gigabytes of extremist content is simply not worth it. They have to protect the ninety-nine percent of their archive that is benign—the old Geocities pages, the government records, the news articles.
Precisely. And then there is the "poisoning" or "weaponization" aspect. If an archive becomes a known repository for hate speech, it can be used to bypass filters. Extremists are smart; they know that social media algorithms often don't block links to the Wayback Machine. So, they post their content, archive it, and then when the original post is deleted for a terms of service violation, they just circulate the archive link. At that point, the archive is literally helping the content spread. Perma.cc, which is run by Harvard, is even more restrictive because their links are intended for legal citations. They cannot have their "perma-links" being used to host terrorist manuals.
That makes total sense from their perspective, but it leaves Daniel and his colleagues in Ireland in a tough spot. If I am a researcher or an organization trying to track how a specific extremist narrative evolved, and I cannot rely on these public utilities, I have to build my own infrastructure. Daniel mentioned the idea of a "sovereign cloud" in Israel for this purpose. What does that actually look like in practice in twenty-twenty-six?
A sovereign cloud is a concept that has really gained steam lately. It is essentially cloud infrastructure that is legally, operationally, and physically located within a specific country and governed by that country's laws, rather than being subject to the whims of a global corporation like Amazon or Google. In Israel, we have Project Nimbus, which is a massive government cloud project. For a project like archiving antisemitism or extremism, having the data sit on servers in a jurisdiction that has strong protections for research and documentation—and a deep understanding of the context of that hate speech—is a huge advantage. You are not worried about a moderator in a different time zone clicking "delete" because they don't understand the historical value of a specific document.
But most people—most small NGOs or private researchers—do not have a government-backed sovereign cloud at their disposal. So, for the average private organization, the choice is usually between standard commercial cloud storage and self-hosting their own servers. Let's break down the cloud side first. If I just open an Amazon Web Services account or a Google Cloud account and start uploading screenshots and video files of hate speech to an S3 bucket, am I safe?
Short answer? No. Long answer? Absolutely not. Even if you set your bucket to private, you are still subject to their Acceptable Use Policy. These companies use automated scanners—increasingly powerful AI models—to look for prohibited content. If their scanners find that you are storing "objectionable" material, even if it is for research, they can and will nuke your entire account. And the problem with the cloud is that it is a black box. You do not truly own the hardware. You are essentially renting a slice of someone else's computer, and they have the keys. If they decide you are a liability, you lose everything.
So, even if I am doing it for legitimate research, I could wake up one morning and find ten years of data gone because an algorithm flagged a specific keyword or a frame in a video I archived.
Exactly. That is the "de-platforming" risk. Now, you can mitigate this by using specialized cloud providers that cater to "high-risk" content or journalism, like some of the boutique providers in Switzerland or Iceland. But even those are often just resellers of the big three—Amazon, Google, and Microsoft—and are still subject to those upstream rules. If the "upstream" provider cuts the cord, the "downstream" boutique provider goes dark too.
Okay, so that brings us to the "scrappy and messy" option: self-hosting. This is what Daniel was talking about. If I set up a server in my basement or a secure office in Dublin, what are the trade-offs?
The biggest pro of self-hosting is absolute, unadulterated control. If you own the physical hard drives, nobody can delete your data remotely. You are the root administrator. You decide what stays and what goes. This is the gold standard for data sovereignty. But, as Daniel pointed out, it is a web of complications. You are moving from being a researcher to being a system administrator, a security expert, and a physical custodian.
Let's talk about the hardware first. If I am archiving video—which is where most extremist content lives now, on platforms like Telegram or Rumble—we are talking about massive amounts of data.
Yeah, video is the storage killer. You would likely start with a NAS, which is Network Attached Storage. Think of it as a specialized computer filled with high-capacity hard drives. In twenty-twenty-six, you are looking at 18-terabyte or 20-terabyte drives as the standard. Brands like Synology or QNAP are the "prosumer" choice, or you can build your own using software like TrueNAS. You want redundancy, Corn. If one drive fails, you cannot lose the data. That means using RAID configurations—Redundant Array of Independent Disks. Specifically, you want RAID 6, which allows for two simultaneous drive failures without data loss.
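To make Herman's RAID 6 point concrete: two drives' worth of capacity go to parity, so usable space is (n − 2) drives. A minimal sketch of that arithmetic (the function name is ours, purely illustrative):

```python
def raid6_usable_tb(num_drives: int, drive_tb: float) -> float:
    """Usable capacity of a RAID 6 array: two drives' worth of
    space is consumed by parity, so (n - 2) drives hold data."""
    if num_drives < 4:
        raise ValueError("RAID 6 needs at least 4 drives")
    return (num_drives - 2) * drive_tb

# An 8-bay NAS filled with 20 TB drives:
print(raid6_usable_tb(8, 20))  # prints 120
```

So an 8-bay box of 20 TB drives yields 120 TB of usable, double-fault-tolerant storage, with 40 TB spent on parity.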
But then you have the physical security and maintenance. If your office floods, or the power goes out, or there is a fire, your archive is toast. And in a place like Ireland, or even here in Israel, you have to think about the physical safety of the equipment.
Exactly. And then there is the "target" problem. If you are self-hosting an archive of extremist content and people find out where it is, you become a prime target for distributed denial-of-service attacks, or DDoS. If your home or office internet connection gets flooded with traffic by someone who doesn't want that data to exist, you are effectively offline. Or worse, if you do not secure the server properly and it gets hacked, you have just handed a curated, organized list of extremist material to someone who might want to use it for the wrong reasons. You have essentially done the work for them.
That is a scary thought. You are essentially creating a library of toxic material, and if the library is not built like a fortress, it is a liability. It is like storing hazardous waste.
It really is. And there is a technical challenge here that I think is often overlooked, which is "integrity." How do you prove that the content you archived is what you say it is? If you are a researcher or you want to use this in a legal setting—say, for a war crimes tribunal or a hate speech prosecution—you need to prove that you didn't just Photoshop a tweet or edit a video to make someone look bad.
Right, because if it is on my private server, I am the only one who can vouch for it. I could have edited the file five minutes ago.
Exactly. This is where "hashing" comes in, and it is the most important habit for any archiver. When you capture a piece of content, you run it through an algorithm to create a unique digital fingerprint, like a SHA-256 hash. If even one pixel in an image changes, or one character in a text file is altered, the hash changes completely. To have a credible archive, you need to log these hashes the moment the content is captured. Ideally, you want to "notarize" these hashes on a public ledger—like a blockchain—or a trusted third-party service. That way, you can prove the file has not been tampered with since the date of capture, even if the file itself is kept private.
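The hash-at-capture habit Herman describes can be sketched in a few lines of Python using the standard library. This is a minimal illustration, not a full workflow; the function names and the JSON-lines log file are our own choices:

```python
import hashlib
import json
import datetime

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large videos don't exhaust RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def log_capture(path: str, log_file: str = "capture_log.jsonl") -> dict:
    """Append one capture record: file, fingerprint, UTC timestamp."""
    record = {
        "file": path,
        "sha256": sha256_of(path),
        "captured_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    with open(log_file, "a") as log:
        log.write(json.dumps(record) + "\n")
    return record
```

The timestamped hash log is what you would later notarize externally; the files themselves never need to leave your server for that.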
So, even if you are self-hosting the data to keep it safe from de-platforming, you still need some kind of external connection to verify its authenticity. It is not a closed loop.
Precisely. Now, let's talk about the "how." How do you actually grab this content? You can't just right-click and save everything. That doesn't scale, and it doesn't capture the metadata.
I assume there are tools for this that have improved since the early days?
There are. One of the best ones for self-hosters is called ArchiveBox. It is an open-source tool that you can run on your own server. You give it a list of URLs, and it goes out and captures them in multiple formats simultaneously. It grabs a PDF, a screenshot, a "WARC" file—which is the Web ARChive format—and even the raw HTML. It is like a Swiss Army knife for web preservation.
And it does this automatically?
Yeah, you can set it up to crawl specific sites or social media feeds. For video, there is a tool called "yt-dlp," which is the industry standard for downloading video from thousands of different sites, including the more obscure ones where extremists often migrate. It is incredibly powerful but, again, it requires some technical know-how to run via a command line. You have to manage "user agents" and "proxies" so the sites don't realize you are a bot and block your IP address.
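As a sketch of the kind of command line Herman means, here is one hedged way to assemble a yt-dlp invocation with a custom user agent and proxy. The helper function is hypothetical, and the command is built but deliberately not executed:

```python
def build_ytdlp_cmd(url: str, out_dir: str,
                    user_agent: str = None, proxy: str = None) -> list:
    """Assemble (but do not run) a yt-dlp command line.
    --write-info-json keeps the video's metadata alongside the file."""
    cmd = ["yt-dlp", "--write-info-json",
           "-o", f"{out_dir}/%(id)s.%(ext)s"]
    if user_agent:
        cmd += ["--user-agent", user_agent]
    if proxy:
        cmd += ["--proxy", proxy]  # e.g. a SOCKS5 proxy address
    cmd.append(url)
    return cmd

# subprocess.run(build_ytdlp_cmd(...), check=True) would execute it.
```

Keeping the command construction in one place makes it easy to add or rotate proxies later without touching the rest of the pipeline.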
This sounds like a full-time job, Herman. If I am a small NGO in Ireland trying to keep track of extremist rhetoric, I need a dedicated IT person just to keep the lights on and the scrapers running. It feels like the technical barrier to entry is actually helping the extremists stay hidden.
You really do need that dedicated person. And that is why many organizations end up back at the cloud, but with a more sophisticated, "hybrid" approach.
Tell me about that. How do you get the best of both worlds?
In a hybrid model, you might do the "capture" in the cloud. Why? Because cloud servers have massive bandwidth and can handle the scraping of large sites without getting their IP addresses blocked as easily. You can spin up a virtual machine, run your ArchiveBox or your scrapers, and then, as soon as the data is captured and hashed, you pull it down to your local, self-hosted "cold storage" for long-term preservation. Once it is on your local drive, you delete it from the cloud.
So the cloud is the "arm" that reaches out into the internet, but the "brain" and the "memory" stay in your physical control.
I like that analogy. It gives you the speed and scalability of the cloud for the "ingest" phase, but the security and sovereignty of self-hosting for the "archive" phase. It minimizes the time the "objectionable" content spends on someone else's server, reducing the risk of your account being flagged.
But what about the legal side? This is where Daniel's point about Ireland is so interesting. If I have a server in my house in Dublin full of extremist manifestos or videos that incite violence, am I breaking the law just by possessing them?
That is the million-dollar question, and the answer is shifting under our feet. In Ireland, the Criminal Justice (Incitement to Violence or Hatred and Hate Offences) Bill has been a huge point of contention. While it is designed to combat hate speech, archivers worry about the "possession" clauses. If you have material that is deemed to be "likely to incite violence or hatred" with a view to it being disseminated, you could be in trouble. Now, there are usually exemptions for "bona fide" research or journalistic purposes, but "bona fide" is a subjective term that a court has to decide.
So, as an archiver, you have to be very careful that your archive doesn't look like a "distribution hub."
Exactly. This is why professional archivers are so obsessed with metadata. You don't just save the video; you save the context. You save the notes on why it was archived, who archived it, and what research project it belongs to. You are building a "chain of custody." If the police ever knock on your door, you want to be able to show them a professional, organized research database, not a folder named "Cool Extremist Videos."
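What a single chain-of-custody entry might contain can be sketched as a small record alongside each capture. All of the field names here are illustrative assumptions, not a standard schema:

```python
import datetime

def custody_record(file_hash: str, source_url: str,
                   archivist: str, project: str, rationale: str) -> dict:
    """One hypothetical chain-of-custody entry: the point is to
    capture context (who, why, which project), not just content."""
    return {
        "sha256": file_hash,
        "source_url": source_url,
        "archived_by": archivist,
        "project": project,
        "rationale": rationale,  # why this item belongs in the archive
        "recorded_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
```

A folder of files plus a log of records like this looks like a research database; a folder of files alone looks like a distribution hub.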
This brings up an ethical point, too, which we shouldn't gloss over. If you are archiving this stuff, you are essentially staring into the abyss all day. You are looking at the worst parts of humanity—the vitriol, the threats, the graphic imagery. There is a massive psychological toll on the people doing this work.
Absolutely. We have seen this with content moderators at companies like Meta or TikTok. They suffer from actual post-traumatic stress disorder. If you are a small organization, you have to think about the well-being of the people running the archive. You need protocols for "vicarious trauma." Maybe that means using AI to transcribe videos so researchers can read the text rather than watching the graphic content, or using grayscale filters to dampen the emotional impact of images. It is not just about the servers; it is about the humans.
Let's go back to the technical specifics for a second. You mentioned "WARC" files earlier. Why is that format so important? Why not just save everything as a JPEG or a PDF and be done with it?
Because a WARC file—which stands for Web ARChive—is a "container" format. It is the international standard used by the Library of Congress and the Internet Archive. It doesn't just save the "look" of the webpage; it saves the entire transaction. It records the HTTP headers, the server response, the exact time of the request, the IP address of the server you hit, and all the underlying code. If you want to prove in a court of law that a specific post existed at a specific time, a WARC file is much harder to dispute than a simple screenshot, which is trivial to fake with "Inspect Element" in a browser.
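To show roughly what "saving the whole transaction" looks like, here is a hand-rolled minimal WARC-style response record, assuming the WARC/1.1 header layout. This is for illustration only; real captures should use a dedicated library such as warcio rather than this sketch:

```python
import uuid
import datetime

def warc_response_record(target_uri: str, http_payload: bytes) -> bytes:
    """Build one minimal WARC/1.1 response record: named headers,
    a blank line, the raw HTTP payload, then a record separator."""
    now = datetime.datetime.now(datetime.timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = "\r\n".join([
        "WARC/1.1",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {now}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_payload)}",
    ])
    return headers.encode() + b"\r\n\r\n" + http_payload + b"\r\n\r\n"
```

Notice that the record carries the target URI, the capture timestamp, and the raw server response together—exactly the context a screenshot throws away.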
So, if you are serious about this, you are building a WARC-based archive. It is the difference between taking a photo of a crime scene and actually bagging the evidence.
Exactly. And you are also thinking about "link rot." A huge problem in digital archiving is that even if you save the main page, all the links on that page might point to things that no longer exist. A sophisticated archive will do what we call "recursive" archiving—it will follow the links and archive those, too, creating a snapshot of a whole "neighborhood" of the internet.
But that could lead to an explosion of data. If you archive one post, and it links to ten others, and those each link to ten more... you are trying to download the whole internet.
Exactly. You have to set "depth" limits. Usually, researchers will go one or two levels deep. But even then, the storage requirements grow exponentially. This is why "deduplication" is so important. If ten different extremist posts all link to the same manifesto, you only want to save that manifesto once. Modern file systems like ZFS—which is what you would use on a high-end self-hosted server—can do this automatically at the block level, saving you massive amounts of space.
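The two ideas Herman just combined—depth limits and deduplication—can be sketched together. This toy crawler works over an abstract link graph (the `get_links` and `get_content` callables stand in for real fetching, so the sketch stays self-contained):

```python
from collections import deque
import hashlib

def crawl(start, get_links, get_content, max_depth=2):
    """Breadth-first, depth-limited crawl that stores each distinct
    piece of content only once, keyed by its SHA-256 hash."""
    seen_urls, seen_hashes, saved = set(), set(), []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if url in seen_urls:
            continue
        seen_urls.add(url)
        digest = hashlib.sha256(get_content(url)).hexdigest()
        if digest not in seen_hashes:  # store identical content only once
            seen_hashes.add(digest)
            saved.append(url)
        if depth < max_depth:  # the depth limit stops the explosion
            for link in get_links(url):
                queue.append((link, depth + 1))
    return saved
```

In a real pipeline the hash-based dedup would happen at the storage layer (as ZFS does at the block level), but the principle is the same: ten links to the same manifesto produce one stored copy.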
Okay, so let's weigh these two options for someone like Daniel or the organizations he is talking about. If you are a small non-profit with a limited budget and maybe one technical person, what is the play?
If I were in their shoes, I would probably lean towards a "managed" self-hosting approach with a "zero-knowledge" backup.
Break that down for me. What does "zero-knowledge" mean in this context?
Okay, so you buy a high-quality NAS—something with at least four or eight drive bays—and you set it up in a secure, climate-controlled location. You use ArchiveBox and yt-dlp to do the heavy lifting. But, you still need an off-site backup. If your office burns down, your archive is gone. This is where the cloud comes back in, but safely. You use "zero-knowledge" encryption.
Like a shield?
Exactly. You use a tool like rclone or Cryptomator to encrypt the data locally on your server before it ever touches the internet. You are the only one with the keys. Then, you upload those encrypted "blobs" to a cheap cloud storage provider like Backblaze B2 or an S3 bucket. The cloud provider has no idea what is in those files. They just see a giant pile of random, encrypted data. They can't scan it for "terms of service" violations because they don't have the keys. They are just providing "dumb" storage.
Ah, so encryption is the loophole that lets you use the cloud for storage without the risk of de-platforming based on content. They can't delete what they can't see.
Precisely. You keep the "live" archive on your local server for easy access and research, but you keep the "emergency backup" encrypted in the cloud. That way, you have the physical sovereignty of self-hosting and the disaster recovery of the cloud. It is a very solid, professional-grade solution that doesn't require a million-dollar budget.
That seems like a very practical path forward. But what about the "sovereign cloud" idea Daniel mentioned? Is that something that could actually work on a national scale for a community?
It is a beautiful idea, and I think for a country like Israel, which has a very specific interest in documenting antisemitism globally, it makes a lot of sense. A state-level or large institutional archive can provide the kind of resources and legal protection that a private individual just can't. They can negotiate "peering" agreements with internet service providers to ensure their scrapers don't get throttled. They can provide high-level security against state-sponsored hacking.
But it also raises questions of trust, doesn't it? If a government or a single large institution is the one running the archive of "extremism," who defines what is extremist?
That is the million-dollar question, Corn. Today's "extremist" might be tomorrow's "mainstream" politician, or vice versa. If an archive is controlled by a single entity, it can be scrubbed or manipulated to fit a political narrative. We have seen this throughout history with physical archives. This is why I think the best solution is actually a decentralized network of private archives.
Like a "distributed" Wayback Machine?
Sort of. There is a technology called IPFS—the InterPlanetary File System. It is a peer-to-peer network for storing and sharing data. Instead of a file being located on one specific server, it is identified by its content hash and distributed across many nodes. If one node goes down—or is taken down by a government—the data can still be retrieved from other nodes that have "pinned" it.
That sounds like the ultimate defense against de-platforming. It is like the Hydra—cut off one head, and the data lives on elsewhere.
It is, in theory. But IPFS has a "persistence" problem. Just because a file is on the network doesn't mean it will stay there forever. Someone has to "pin" the data to ensure it remains available. This is where "incentivized" storage comes in, like Filecoin or Arweave. Arweave is particularly interesting because it is designed for "permanent" storage. You pay a one-time fee, and the data is supposed to be stored for two hundred years. It is essentially a "permaweb."
Is that being used for hate speech archiving?
It is being used for all sorts of high-risk archiving, including documenting human rights abuses in conflict zones. But again, it is a double-edged sword. If you put something on Arweave, it is very, very hard to get it off. That raises massive ethical questions about the "right to be forgotten" or the accidental archiving of illegal material that shouldn't be preserved, like child abuse material. The community has to build "content moderation" layers on top of the decentralized storage, which brings us right back to the original problem: who decides what stays?
It feels like we are in a digital arms race. On one side, you have the platforms trying to clean up their sites and avoid liability. On the other, you have extremist groups trying to spread their message and avoid detection. And in the middle, you have the researchers and historians trying to capture the truth before it is erased.
It really is an arms race. And the stakes are incredibly high. Think about the history of the twentieth century. If we didn't have the physical archives of the propaganda and the hate speech that led to the Holocaust or the Rwandan genocide, it would be much easier for deniers to claim those things never happened. In the twenty-first century, that evidence is digital, and it is disappearing faster than we can save it. If we don't archive it properly, we are leaving a massive hole in the historical record. We are essentially allowing the future to be gaslit.
I am thinking about the practicalities of someone listening to this who wants to start small. You mentioned SHA-256 hashes. If I am just a guy with a laptop and I see something online that I think is important to document, how do I "hash" a file?
It is actually very simple. On a Mac or Linux machine, you just open the terminal and type "shasum -a 256" followed by the filename. On Windows, you can use PowerShell with the "Get-FileHash" command. It takes two seconds. If you are archiving something important, do that immediately and save the result in a separate text file or a spreadsheet. Even if you just save the file to a USB stick, having that hash means you can prove later that the file hasn't been changed. It is the simplest, most effective way to protect the integrity of your data.
That is a great, concrete takeaway. What about the "web of complications" Daniel mentioned regarding the Irish context? He mentioned a specific high-profile tweeter who was de-platformed from many services but not from X. If you are archiving a specific person like that, who posts fifty times a day, how do you handle the sheer volume?
That is where automation is key. You don't want to be manually saving every tweet. You use a tool like "twint-zero" or other scrapers that can monitor an account in real-time. But you have to be careful—platforms like X have become very aggressive about blocking scrapers in twenty-twenty-five and twenty-twenty-six. You might need to use "rotating proxies," which essentially make it look like the requests are coming from hundreds of different computers all over the world. It is a bit of a cat-and-mouse game.
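The rotation trick Herman mentions is usually just round-robin cycling through pools of proxies and user agents. A minimal sketch, where the proxy addresses and user-agent strings are placeholder assumptions a real deployment would load from configuration:

```python
import itertools

# Hypothetical pools; a real deployment would load these from config.
PROXIES = ["socks5://10.0.0.1:1080", "socks5://10.0.0.2:1080"]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

_proxy_pool = itertools.cycle(PROXIES)
_ua_pool = itertools.cycle(USER_AGENTS)

def next_request_settings() -> dict:
    """Each call returns the next proxy/user-agent pair, so successive
    requests appear to come from different machines and browsers."""
    return {"proxy": next(_proxy_pool), "user_agent": next(_ua_pool)}
```

Each scraper request pulls its settings from `next_request_settings()` before being dispatched, which spreads the traffic across the whole pool.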
This is getting back into that "scrappy and messy" territory. You are basically using the same tools that data miners and hackers use, but for a noble purpose.
Exactly. Archiving is often a form of "counter-intelligence." You are gathering data that someone else—whether it is the platform or the poster—wants to disappear. You have to think like a librarian, but act like a digital partisan.
I want to touch on one more thing Daniel mentioned—the idea of "sovereign" as a collective. He talked about the "Jewish nation as a collective" in Israel. Does that change the approach to archiving?
I think it does. It moves the responsibility from the individual to the community. If you have a community-funded archive, you can share the costs and the risks. You can have a "legal defense fund" for the archivers. You can have multiple redundant servers in different locations—maybe one in Dublin, one in Tel Aviv, one in Reykjavik. It is much harder to "cancel" a whole community's archive than it is to shut down one person's blog or one NGO's S3 bucket.
That makes a lot of sense. Strength in numbers, even in the digital world.
Exactly. And I think we are going to see more of this—"affinity archives." Groups of people who share a common interest or a common history coming together to build their own infrastructure because they don't trust the big platforms to do it for them. It is a return to a more decentralized internet, out of necessity.
It is a bit sad, though, isn't it? That we can't trust the "universal" archives like the Wayback Machine to hold everything. It feels like the dream of a single, unified library of human knowledge is dying.
It is a sign of the times, Corn. The internet is no longer a small, academic sandbox. It is the central arena for human conflict. And in an arena, you cannot expect the referee to hold your shield for you. You have to bring your own.
Well said. So, to summarize for Daniel and everyone else interested in this: if you are serious about archiving high-risk content, self-hosting with an encrypted cloud backup is your best bet for twenty-twenty-six. Use standard formats like WARC, use hashing for integrity from day one, and if you can, work as a collective rather than an individual.
And don't forget the psychological side. Take breaks. Use tools to minimize your direct exposure to the most toxic material. Don't let the darkness you are documenting consume you. The goal is to preserve the history, not to become a victim of it.
That is probably the most important piece of advice of all. Herman, this has been a really deep dive. I feel like I have a much better handle on why this is so difficult and why it is so necessary. It is about more than just hard drives; it is about the "right to remember."
Me too. It is one of those topics that seems technical on the surface, but it is actually about the very core of how we remember who we are and what we have done. If we lose our ability to document the "bad" history, we lose our ability to learn from it.
Exactly. Well, we are coming up on our time. I want to thank Daniel for sending in this prompt. It really pushed us to look at the intersection of technology, law, and history in a way we haven't before. It is a vital conversation.
Yeah, it was a great one. And to all our listeners, if you are finding these discussions valuable, please consider leaving us a review on your podcast app or on Spotify. It really does help the show reach more people who are interested in these "weird" but important topics. We are living in a time where information literacy is a survival skill.
It really is. And remember, you can find all our past episodes—we have over seven hundred now—at myweirdprompts.com. There is an RSS feed there, and a contact form if you want to get in touch with us. You can also reach us at show@myweirdprompts.com. We love hearing your prompts, especially the ones that make us think this hard.
We are on Spotify, Apple Podcasts, and pretty much everywhere else you listen to podcasts. We even have a presence on some of the decentralized platforms we talked about today.
Alright, that is it for this episode of My Weird Prompts. Thanks for joining us in our Jerusalem living room, and we will talk to you next time.
Goodbye everyone. Stay curious, and keep those backups running.
Bye.