
In February, I had an interesting conversation with Zooko about anonymity and pseudonymity. I pointed out that early cypherpunks were very optimistic about the ease with which Internet users would be able to maintain multiple, independent, unlinkable personas. They could simply have different names (or no names at all) for different situations, and avoid letting out any information that might connect these different personas. It would be hard to overstate how central this hope was to traditional cypherpunk optimism: not only would the occasional whistleblower be able to send the occasional isolated message, but people would be able to carry on long-term, repeated, mutually anonymous conversations -- many of them in public.

There have been lots of nice developments in anonymity, including both theoretical advances and deployed anonymity and pseudonymity technology. We have the various generations of remailers, we have Tor and other layered proxy systems, and we have elegant ideas like Invisiblog. But actually being pseudonymous turns out to be a lot of work, because there are just so many ways to mess up!

I mentioned the "tangled web" problem to Zooko: if you have several different personas, one of which may be your real name, those personas should ideally not have any of the same communication patterns. That includes spelling, punctuation, vocabulary, phrase structure, diction, frequency of communication, time of day of communication, and much more. There should not be any time correlation among your personas' communications (or at least no clearer correlation than could be accounted for by the hypothesis that you're in nearby timezones, and even that might be more information than you want to give away). Your different personas also should appear to have different knowledge, so that one of them might be expected to know certain things of which the other would be expected to be ignorant. (This can be a terribly difficult pretense to keep up in person, because psychologists are coming up with all sorts of ways to tell whether somebody is familiar with a particular topic, from clever language games and calculated ambiguities all the way through involuntary physical reactions. But on-line, we would expect to be free of some of these difficulties.) You might therefore have to keep track of which facts one persona is supposed to know as well as the fact that another persona isn't supposed to know about them.
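To make the time-correlation point concrete, here is a minimal sketch of how an observer might compare when two personas post. It assumes the observer has already collected the hour of day (0-23, in a common timezone such as UTC) for each persona's messages; the function names and the overlap measure are illustrative choices, not a description of any particular attack tool.

```python
from collections import Counter

def hour_histogram(hours):
    """Normalized 24-bin histogram of posting hours (0-23, assumed UTC)."""
    counts = Counter(h % 24 for h in hours)
    total = len(hours)
    return [counts[h] / total if total else 0.0 for h in range(24)]

def histogram_overlap(hours_a, hours_b):
    """Overlap in [0, 1]; 1.0 means identical hour-of-day distributions."""
    a = hour_histogram(hours_a)
    b = hour_histogram(hours_b)
    # Sum, bin by bin, the share of activity the two personas have in common.
    return sum(min(x, y) for x, y in zip(a, b))
```

A high overlap proves nothing by itself, since plenty of people in the same timezone keep similar hours, but it is one more signal that can be combined with the others listed above.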

This is a lot of work, and many of these factors are difficult to control consciously. The Tor bibliography points to a paper by Rao and Rohatgi on stylometry, the use of statistical techniques to try to attribute authorship to texts using only the evidence of the texts themselves. This was done successfully, and apparently convincingly, with some of the anonymous Federalist papers, and stylometric techniques have only gotten better. They can measure people's propensity to commit particular errors, to use particular words or kinds of words, to write sentences of particular lengths, to use one kind of punctuation or another, and combine dozens of factors that are believed to be fairly stable over time to produce a plausible composite model of the way someone writes. (One of my English teachers told me about the use of concordances to show that a writer had written a book after reading another one. The new book used an extremely unusual word that appeared in the earlier book, and the second author had never used that word in print before!)
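A rough sketch of the general idea, and emphatically not the method from the Rao and Rohatgi paper: reduce each text to a vector of style features (rates of a few common function words, rates of a few punctuation marks, and mean sentence length), then compare vectors. The word list, punctuation list, scaling factor, and choice of cosine similarity are all assumptions made for illustration; real stylometric systems use many more features and more careful statistics.

```python
import math
import re
from collections import Counter

# Hypothetical feature lists, chosen only for illustration.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is", "it", "but"]
PUNCTUATION = [",", ";", ":", "-", "(", "!"]

def style_vector(text):
    """Return a small vector of style features for one text."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(len(words), 1)  # avoid division by zero on empty text
    counts = Counter(words)
    # Relative frequency of each function word.
    vec = [counts[w] / n_words for w in FUNCTION_WORDS]
    # Punctuation marks per word.
    vec += [text.count(p) / n_words for p in PUNCTUATION]
    # Mean sentence length in words, scaled down to be comparable to the rates.
    vec.append(len(words) / max(len(sentences), 1) / 100.0)
    return vec

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

An attacker would compute `style_vector` for a body of text by each candidate author and for the anonymous text, then look for the closest match. With features this crude the scores would be unreliable on short samples; the point is only that each habit mentioned above becomes one more number in the vector.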

The Rao and Rohatgi paper, "Can Pseudonymity Really Guarantee Privacy?", after discussing and demonstrating some stylometric techniques, suggests that anonymous communication channels are only a privacy solution at one layer, and that privacy can be compromised easily at another layer. They say:

"We believe (and demonstrate) that recent advances in stylometry pose a significant threat to privacy that merits the serious and immediate attention of the privacy community. For instance, using stylometry, one can link the multiple pseudonyms of a person and if one such pseudonym happens to be his/her identity, then the protection afforded by the other pseudonyms is compromised."

(Notice that the author of this paragraph uses no comma before "and if", but uses a comma before "then" in an "if ... then" construction. I wonder if we could tell whether it was Rao or Rohatgi. Anyway, if you haven't read their paper, you should, because you'll learn vastly more from it than from the rest of this post.)

Zooko says that his first attempt at on-line pseudonymity was promptly unmasked by a human being, not even using formal statistical techniques: "[W]hen, in the throes of early cypherpunk enthusiasm, I decided to try a pseudonym, 'Zooko' in 1996 or so, [...] Adam Back immediately responded to my posts to cypherpunks by asking if I were also Bryce Wilcox..." He points to the saying of Mark Twain: "If you tell the truth, you don't have to remember anything."

It's daunting to think just how much an effective pseudonymous communicator may have to remember. What time of day is the pseudonym supposed to be active? What punctuation style should the pseudonym use? What kind of vocabulary? Is the pseudonym a good typist (and if not, what particular kinds of typing errors does the pseudonym make)? Does the pseudonym regularly go on vacation at the same times as the real person behind it? Are there any idiosyncrasies in the pseudonym's writing? Is there evidence of what kind of software the pseudonym's computer is running? Is the pseudonym good at writing HTML, does the pseudonym favor particular HTML tags, does the pseudonym use a particular HTML editor? Did the pseudonym ever make any claims about itself, its location, work history, academic qualifications, etc., and will it act consistently with those claims, and can it do things to back them up if someone tries (perhaps in a devious and subtle way) to call them into question? If there's more than one pseudonym per person, how can the person who controls all of them keep the answers to all these questions -- and others -- straight?

Most of these problems are independent of the limitations of whatever kind of anonymity technology is in use. Some anonymity technologies may themselves leak information beyond the control of the user, or give the user too many options that may result in different behavior visible to someone at the other end. (Right now I'm a little anxious about all the options in Privoxy, which is commonly used with Tor or as a privacy-enhancing proxy in its own right. If different users set up their Privoxies in sufficiently varying ways, they might become distinguishable on the basis of some of those differences!)

And getting over all these problems still requires having a way to defeat stylometry, and as yet nobody even has a clear account of how hard that would be, because nobody has extensively studied how good stylometry can be when the person whose style is under examination is trying to beat stylometry. (Here "nobody" excludes the spook lords in their halls of stone.)

Anonymity and pseudonymity are obviously easier in cases where there's less variation in the messages that end users are sending, and where very high latency is acceptable. For example, if users can only send one of 10 predetermined messages, it would be hard to mount stylometry attacks against them, since they didn't compose the messages themselves. That means that anonymous networks with a very low information rate can probably be built, but it's best for the anonymity if the participants don't speak natural languages at all, and best of all if they don't try to say anything about the real world...
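To put a number on "very low information rate": a choice among N fixed messages carries at most log2(N) bits, so 10 predetermined messages give at most about 3.32 bits per message. The small sketch below works this out; the `messages_per_day` parameter is a hypothetical stand-in for whatever latency the network imposes, and the figure is an upper bound that assumes every message is equally likely.

```python
import math

def bits_per_message(num_messages):
    """Upper bound on information per message when choosing among num_messages options."""
    return math.log2(num_messages)

def bits_per_day(num_messages, messages_per_day):
    """Upper bound on daily information rate (messages_per_day is a hypothetical latency limit)."""
    return bits_per_message(num_messages) * messages_per_day
```

For instance, `bits_per_message(10)` is roughly 3.32, so a user limited to one such message a day could convey less than half a byte daily.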

On the bright side, reading anonymously is easier than writing anonymously. It might be hard to write the Federalist papers without giving away who you are (or at least which Federalist numbers you wrote), but it might be relatively straightforward to publish the Federalist without knowing (or letting other people find out) who chooses to read it.

After discussing other reasons why anonymous publishing is hard and why linkability can result from a single casual error, Zooko continues:

"The reason for my early cypherpunk enthusiasm about pseudonymity is that if a person can't be traced from on-line interaction to physical body, then that person can't be physically threatened, coerced, or attacked. Unfortunately, the easy implementations of pseudonymity give rise to another quality in addition to the 'no tracing from pseudonym to body' quality. That secondary quality is the 'no linking one of my pseudonyms to another of mine'. In theory, we could have the former quality without the latter, which would ameliorate the problem of pseudonymous folks being immune from negative reputation. Which would, maybe, eventually, cause us to view pseudonymous people with less social suspicion. That's a long chain of 'maybes'. I'm not holding my breath!"

In later conversation he suggests that cypherpunks were probably too optimistic about pseudonymity solely on the basis of this one important feature (freedom from punishment or coercion for communicating ideas), and that cypherpunks overgeneralized from this benefit to the conclusion that we don't, or shouldn't, need identity for anything. More about this question later, I hope.


Support Bloggers' Rights!


Contact: Seth David Schoen