Sharding for Privacy


I spent yesterday at Transparency Camp, a great unconference filled not only with many of the most brilliant minds in technology (especially social media), but also with a huge number of representatives from different parts of the US government, all interested in using social media to fulfill the new administration’s requirement for transparency and openness in government. It was a great time, and I’ll write more about the sessions there in a future post.

One session I attended, though, was put on by the inimitable duo of David Recordon and Chris Messina, and was essentially a long discussion of the uses and implications of OpenID, especially as it could be implemented by the government to allow citizens to comment on new ideas, regulations, or legislation. One representative from the Department of Defense was in attendance (in addition to people from many other parts of government), and discussed using OpenID to log into one gateway server that could hold identity information for the entire government. That way, just one site would have to go through privacy protection screening, and it would simply make assertions about the data it holds to the rest of the government.

I disagreed with David and Chris about the utility of a gateway on its face. I believe that OpenID should be used without names or emails, which would let people comment on every government site without ever handing over personally identifying information, and would remove the need for privacy protection screening entirely. David and Chris, for their part, noted the UX downsides: users could not get emails about comments, and would be identified by URLs rather than by their full names. Still, one point I raised seemed to resonate with many of the government attendees there, and I thought it was worthwhile to expand on it here.

The citizenry of the United States absolutely does not want one authority that contains all of their personal information and is empowered to transmit it. On a slightly less Social Web front, we just confronted this issue as a country, with the complete rejection of the REAL ID Act; when California and Texas (in addition to many smaller states, like Montana) rejected it outright, the program effectively collapsed, killed off by citizen outrage.

And yet, when we confront each new frontier, we seem to get the same idea again: “Wouldn’t it be great if I only had to go one place to get everything? That way, I’d only have to make one call, and parse one set of data, and I could just secure that one place.” It does sound compelling, especially as I’ve spent quite a bit of time writing special cases to handle the differences, from one social network to the next, in the way that providers handle UserIDs. Flickr is a good example: it has a human-readable ID (mine is “USSJoin”) that can be used to access one’s photos on the web, but cannot be used in their API calls to get, say, an RSS feed of the user’s photos. Instead, one has to make an authenticated API call to translate “USSJoin” into the much less friendly “24001683@N05”. So it’d be great to avoid all that.

At the same time, I think there’s a huge amount of value in reducing the amount of identity data that can be leaked, intentionally or unintentionally, from one source. It preserves privacy by letting people determine for themselves what they want others to correlate. As I wrote a few months ago (and related to the participants in the session),

… a friend of mine once bought two rolls of duct tape, twenty-five feet of rope, one box of condoms, and a birthday card; had he simply bought them at different stores or on different occasions, none of them would have been exceptional, nor would there be any particular thing to tie them together– but their simultaneous purchase greatly horrified the saleslady at the store in question, even though he assures me they were for four different purposes.

So then, this demonstrates what I would refer to as “sharding.” Just as with a shattered mirror, one can use small bits to useful purpose, without needing to have the entire collection assembled all in one place; I should be able to use whatever shard or combination of shards I want, to create whatever area of mirror (or representation of my digital self) I want.

We can see this idea in other areas, foremost of which is Chris Messina’s own DiSo project, which is designed to let a site pull content from a huge number of silos across the web. This is a great idea, not just for photos or microblogging, but for identity data, and for even more crucial things, like personal health information. I want to be able to give my doctor a coherent view of all my test results over time, but I don’t want other people to be able to query one source, or one group of crackers to attack one source, and get every blood test I’ve ever had. So I should be the only one who can assemble all these shards.

Chris raised the critique, in the session, that this is just security through obscurity: that though I make it harder to reassemble, all this data is still available, and could be pieced together. That’s true, if it’s all stored under the same identifier (e.g., my name). But it doesn’t have to be, particularly on the Internet; if I use a different OpenID at each information silo, there’s no unifying way to piece all that data together, unless I give someone the identifier-to-site mapping. That’s the key that unlocks my data; one could consider it almost a sort of steganography, in that it’s knowing where to look that makes the random splotches turn into meaningful data.
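
To make the identifier-to-site mapping a little more concrete, here is a minimal sketch in Python. It is purely my own illustration, not part of OpenID and not something anyone proposed in the session. It assumes I hold a single secret and derive a different opaque identifier for each silo from it; the function names, the `id.example.org` host, and the silo names are all hypothetical.

```python
import hashlib
import hmac


def shard_identifier(master_secret: bytes, site: str) -> str:
    """Derive an opaque, per-site identifier from a secret only I hold.

    Hypothetical scheme: HMAC-SHA256 over the site name, keyed with the
    secret. The id.example.org host is a placeholder, not a real provider.
    """
    digest = hmac.new(master_secret, site.encode("utf-8"), hashlib.sha256).hexdigest()
    return "https://id.example.org/" + digest[:32]


def build_mapping(master_secret: bytes, sites: list[str]) -> dict[str, str]:
    """Build the identifier-to-site mapping: the 'key' that reassembles the shards."""
    return {site: shard_identifier(master_secret, site) for site in sites}


if __name__ == "__main__":
    # Hypothetical silos; each one only ever sees its own identifier.
    secret = b"known only to me"
    for site, identifier in build_mapping(secret, ["photos", "health", "comments"]).items():
        print(site, "->", identifier)
```

Each silo sees only its own identifier, and without the secret (or the mapping built from it) the identifiers give no obvious way to tell that they belong to the same person. This covers only the identifiers themselves, of course; if I hand two silos the same email address, they can correlate me the old-fashioned way.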

I don’t trust the government, or anyone, with all my data. In this connected age, I no longer need to; handing everything to one place is no longer any more convenient for me. So let’s resist the temptation to centralize data “just because.”