I think a lot of internet users still live under an older privacy model. They assume pseudonyms are reasonably safe if they avoid posting their real name, keep accounts separate, and do not accidentally paste the same bio everywhere. This new paper argues that model is dying fast.
In "Large-scale online deanonymization with LLMs," researchers show that large language models can link pseudonymous users across messy, text-heavy platforms without the neat structured datasets that older deanonymization work relied on. The paper does not pitch a distant-future problem. It says plainly that an agent with internet access can re-identify Hacker News users and Anthropic Interviewer participants, and that the best methods reached up to 68 percent recall at 90 percent precision. That is not perfect automation, but it is far beyond "probably harmless."
Why this paper is nastier than the usual privacy warning
The big shift is not just that LLMs are good at language. It is that they are good at turning scattered text into identity clues. The attack pipeline in the paper uses models to extract identity-relevant features, search for candidate matches with embeddings, and then reason over the best candidates to decide whether two profiles likely belong to the same person. In other words, the hard part is no longer collecting clean tables. The hard part is reading people, and LLMs are now good enough at that to make pseudonymity a weaker shield.
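That three-stage pipeline can be sketched in a few lines. This is a minimal toy, not the paper's code: the word-count "embedding," the `shortlist` helper, and the candidate data are all my own illustrative stand-ins, and a real system would use an LLM for feature extraction and a neural embedder for retrieval.

```python
# Toy sketch of the three-stage linking pipeline described in the paper:
# 1) extract identity-relevant features from a user's posts,
# 2) embed the features and rank candidate profiles by similarity,
# 3) hand the top candidates to a stronger model to accept or reject each match.
import math
import re
from collections import Counter

def extract_features(posts):
    """Stand-in for LLM feature extraction: distinctive lowercase tokens."""
    return Counter(re.findall(r"[a-z]{4,}", " ".join(posts).lower()))

def embed(features):
    """Toy embedding: L2-normalized term counts (a real attack would use a neural embedder)."""
    norm = math.sqrt(sum(v * v for v in features.values())) or 1.0
    return {t: v / norm for t, v in features.items()}

def cosine(a, b):
    return sum(v * b.get(t, 0.0) for t, v in a.items())

def shortlist(query_posts, candidates, k=2):
    """Rank candidate profiles by similarity; keep the top-k for closer review."""
    q = embed(extract_features(query_posts))
    scored = sorted(((cosine(q, embed(extract_features(p))), name)
                     for name, p in candidates.items()), reverse=True)
    return [name for _, name in scored[:k]]

candidates = {
    "alice": ["I maintain a zig compiler plugin", "zig comptime is underrated"],
    "bob": ["sourdough starter tips", "my bread flopped again"],
}
print(shortlist(["wrote a zig comptime deep dive"], candidates, k=1))  # ['alice']
```

The point of the sketch is the division of labor: cheap retrieval narrows millions of profiles to a handful, and expensive reasoning only runs on that handful. That is what makes the attack scale.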
I find the dataset choices especially unsettling because they feel normal. The paper links Hacker News accounts to LinkedIn profiles, matches users across Reddit movie communities, and even splits one user’s Reddit history over time to see whether the fragments can be reconnected. None of that requires spy-movie tradecraft. It requires patience, language understanding, and enough automation to do what a determined human investigator used to do slowly.

The phrase I keep coming back to
The line that sticks with me is the paper's conclusion that "practical obscurity" no longer holds. That phrase matters because most online privacy has always depended on friction more than impossibility. It was never impossible to connect a pseudonymous forum account to a real identity. It was just expensive, slow, and not worth doing at scale. If LLMs crush that cost, then a lot of old advice becomes dangerously outdated.
This is also why the results matter even though 68 percent recall falls well short of total coverage. A tool does not need perfect recall to create a serious privacy risk. If it can reliably narrow a large set of pseudonymous users into a smaller shortlist at high precision, that already changes the threat model for journalists, researchers, activists, job seekers, and ordinary users who keep different corners of their lives separate on purpose.
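To make those numbers concrete, here is a back-of-envelope calculation. The pool size of 1,000 users is my own illustrative assumption, not a figure from the paper:

```python
# What 68% recall at 90% precision implies for a hypothetical pool of
# 1,000 pseudonymous users who each have a true matching identity.
# The pool size is an illustrative assumption, not a number from the paper.
recall, precision = 0.68, 0.90
pool = 1000

true_links = recall * pool        # correct re-identifications found
flagged = true_links / precision  # total links the system asserts
false_links = flagged - true_links  # wrong links mixed in among them

print(round(true_links), round(flagged), round(false_links))  # 680 756 76
```

Roughly 680 of 1,000 people correctly unmasked, with only about 76 wrong guesses in the pile. For an attacker who just needs most links to be right, that is already a usable tool.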
The Reddit reaction was blunt
The reaction in the PrivacyGuides Reddit thread was not subtle. People immediately treated it as a warning shot for anyone who still thinks account separation alone is enough. That seems like the right response to me. The paper is not saying “be a little more careful.” It is saying the internet has a much better pattern-matching engine now, and that engine can work across the kind of writing quirks, topic overlaps, and biography fragments that people used to assume were too fuzzy to weaponize.
The Office panic GIF earns its place here because this is exactly the sort of research that should make privacy-conscious users sit up straight. Not because panic is useful, but because a lot of people’s risk models are still calibrated for a pre-LLM internet.

What this changes in practice
If this paper holds up, the new baseline advice is harsher. Writing style matters. Repeated niche interests matter. Cross-platform references matter. Old posts matter. The combination matters most. A pseudonymous account is no longer just leaking obvious identifiers. It is leaking a behavioral fingerprint that a language model may be able to connect faster than a human would.
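As a toy illustration of how writing style alone becomes a fingerprint, compare character-trigram profiles of short samples. The sample sentences and the trigram approach are my own stand-in for stylometry in general, not the paper's method:

```python
# Toy stylometric "fingerprint": character trigram profiles compared by
# cosine similarity. Illustrative only; not the paper's technique.
import math
from collections import Counter

def trigram_profile(text):
    text = " ".join(text.lower().split())
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(a, b):
    dot = sum(v * b.get(g, 0) for g, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Two samples sharing quirks (tbh, "i reckon") versus a formal writer.
same_author_a = "tbh i reckon the scheduler is fine, it's the allocator that bites you"
same_author_b = "tbh the allocator is what bites you here, the scheduler is fine i reckon"
other_author = "In my considered opinion, the memory subsystem deserves closer scrutiny."

sim_same = cosine(trigram_profile(same_author_a), trigram_profile(same_author_b))
sim_diff = cosine(trigram_profile(same_author_a), trigram_profile(other_author))
print(sim_same > sim_diff)  # the stylistically similar pair scores higher
```

Even this crude measure separates the pair; an LLM reading for topics, biography fragments, and phrasing habits together has far more signal to work with.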
That does not mean everyone should vanish from public platforms. It does mean privacy guidance needs to stop pretending that manual compartmentalization is enough on its own. People who truly need separation will probably need stricter topic boundaries, less biographical spillover, and more awareness that natural language itself has become part of the attack surface.
My main takeaway is simple. We are past the stage where LLM privacy risks are mostly about chat logs and accidental data retention. They are now becoming search-and-link machines for identity. That is a very different category of problem, and I do not think most internet users have caught up to it yet.
Sources: arXiv abstract, paper PDF, PrivacyGuides Reddit thread.