Long-Term Authentication for Digital Archives

This is my final project for 600.409, Digital Preservation, here at Hopkins; it's been an incredibly fun and rewarding seminar with four professors (for our eight students): Randal Burns, Sayeed Choudhury, Tim DiLauro, and John Griffin (winner of the minimalism award for his business website). They've been great to learn from for the entire semester.

Our final project requirements are fairly open, basically requiring real thought about the preservation of digital artifacts, whatever (after a semester of discussion) we've construed that to mean. In what is clearly the best project requirements sheet ever, after a long discussion of the requirements and format, the professors state:

"Students are not required to follow the above format, and may instead propose any project leading to any deliverable that represents an equivalent level of effort."

Later, in the evaluation section, they state

"Projects will be evaluated on the process by which the results are obtained, not on the strength of the results themselves. Unusual hypotheses are encouraged and negative results are acceptable. It's the journey, not the destination."

Clearly, a great assignment. One last metanote, then I'll start with what I actually wanted to explore: after I asked how to deliver my paper (one of the suggested formats was a web page of some sort), the response said that possible acceptable formats would include:

"Delivering a PDF file by email.

Setting up a web page visible only to your friendly instructors.

Setting up a web page visible to the whole world.

Posting your writeup on your blog.

Starting a flame war in a newsgroup summarizing your major points.

Digging your results.

Etc."

Wow. :-)

So then, on to the actual work.

For my project, I wanted to explore digital archives (on which we've spent quite a lot of time), from the angle of "How should scholars get access to a digital archive?" This investigation started from my experience obtaining a reader card at the Library of Congress, so I will begin my exploration with that very real scholarly repository.

To access the Library of Congress, one needs to meet their researcher requirement:

"The Library is open to all researchers above high school age (18 years or older) possessing a valid photo identification (e.g. driver's license, passport) with a current address." Library of Congress FAQ

So this is not a highly restrictive archive-- one would hope that a visitor has a scholarly purpose, but this is not (at this time) strictly required. One then fills out a form asking for a bit of personal information (a current address is about as stringent as it gets, although they do ask survey questions about what you intend to study). Then they take your photo and hand you a card. While the website says that the card expires in two years, I was told that it never actually expires; as long as one keeps the card, it can be used in perpetuity (leading to some researchers, now of a rather advanced age, still carrying a decades-old card with the original photo of a bright-eyed graduate student). With this card, one can access most of the resources of the Library. (A few reading rooms have much more stringent requirements, in most cases to protect exceedingly rare and valuable books, but those aren't particularly relevant to the digital analogy-- data is not usually fragile in the same way, and so wouldn't be protected for those reasons.)

So, wonderful: I have a card identifying me as a reader (with the traditional terrible government-issue photograph), and I can make my librarian mother proud that her son is now a "real" scholar. If I lose it, I can go back to the registration area, and they will reconfirm my physical address, then give me a new card. They will never update my photo or any of the other information. When I go to use a reading room, I present the card; while it has a barcode on it that could be used to confirm it's real, the attendant simply compares my face to the picture on the card, and lets me in-- even forty years after I first obtained the card.

This idea boils down to a Shibboleth (in the Biblical sense; the academic sense we'll get to later):

"Gilead then cut Ephraim off from the fords of the Jordan, and whenever Ephraimite fugitives said, 'Let me cross,' the men of Gilead would ask, 'Are you an Ephraimite?' If he said, 'No,' they then said, 'Very well, say Shibboleth.' If anyone said, 'Sibboleth', because he could not pronounce it, then they would seize him and kill him by the fords of the Jordan. Forty-two thousand Ephraimites fell on this occasion." Judges 12:5-6, New Jerusalem Bible

"Speak 'friend' and enter." [Lord of the Rings]

The latter, obviously, for the more geeky / less religious in the audience; they are equivalent stories, in that I'm allowed access because I say (in the correct way for the resource I'm trying to access) that I would like to have access, and that I am a scholar who might benefit from it; I am then taken at my word.

Let's suppose, though, that the Library really did check whether I was affiliated with a legitimate research institution-- they could call Hopkins, and Hopkins would likely tell them (FERPA might get in the way, but probably) that I was a graduate student in good standing. Then they would let me in. Perhaps they would repeat that check on an ongoing basis, but given that their cards are good forever, probably not.

So then, we have three use situations, and four added advantages, to the Reader Card system as it currently stands. The three use situations are:

1. First-Time Access - Registering as a reader and obtaining the card
2. Ongoing Access - Using the Reading Rooms
3. Lost Token - Going back and getting a new card

The four additional advantages of the Reader Card at the physical Library are:

1. To get some sense of the total number of people who use the library
- In addition, of course, to the log books at the doors, but having a "total number of users" is different and useful to compare to "total visitors."
2. To collect basic statistical information on the Readers.
- Where I'm from / what university I represent might be something nice to know.
3. To help to discourage passers-by from using the facilities just to use them.
- Now, this is somewhat controversial (aren't libraries supposed to be *giving* information?), but it's not without merit to think the Library might have an interest in dissuading tourists from going in and poking around just for its own sake; as a research library (rather than a community library), it has a specific goal, and the millions of visitors to Washington, D.C. aren't necessarily interested in that work. Tourists have a glass cage they can stand in to see the Main Reading Room, if that's all they'd like to do. (As a teenager, I was highly indignant to be denied access on general principle, but I digress.) The simple act of making people go to a basement and register does perform basic throttling.
4. To help the Library provide more targeted services on an as-needed basis.
- For instance, my user account, represented by the card, can be granted access to restricted areas of the Library.

So then, now let us turn our attention to digital identity, as we create a digital scholarly repository. We must preserve the three use situations, and we'd like to preserve the additional advantages (and create more) if possible, through this transition. Let us then consider several alternatives.

1. Unfettered access.
- Anyone could have access, no need for registration. The data's there, why not use it?
- This makes some sense for archives whose contents are not restricted; for instance, Project Gutenberg is a digital library for which "checking out a book" doesn't require an account. It works much less well if the contents are sensitive, if they're restricted in some way (due to copyright, for example), or if having limitless access would hurt the archive in some way (for instance, an unbounded number of users hammering on a video archive would likely cause access issues for scholars).
- We get to skip all the overhead of having to maintain user information, and we still get to use normal techniques to get total users / total visits information (much as we get on any web page). We lose all the other advantages, however.
2. "Free Signup"
- This is the approach that, for instance, the New York Times[http://nytimes.com/] uses. Give them your email address, create a password, and they'll give you access to everything.
- There are standard methods for dealing with the three situations of normal use, and depending on the questions that are asked during signup, one can retrieve the statistical information, and possibly targeted services-- or at least, targeted ads; the latter may help maintain a digital repository by offsetting operating costs.
- If one requires a scholarly email address for signup, one can even get the scholarly affiliation requirement; this is how Facebook operated until recently, and there were some advantages to that arrangement.
- The disadvantages, though, are that the repository is now tasked with maintaining all of the information, and dealing with lost or forgotten passwords. The Library does this in real life, but as it turns out, there are additional options-- so why settle for merely sufficient?
- In addition, if there are any restrictions on the archive, it's fairly trivial to get around them; a robot can create hundreds or thousands of accounts in a day. So if you wanted to limit how much each user could consume in a day-- too bad.
3. "Somebody Else's Problem"
- Instead of creating our own authentication system (based on emails, or Reader cards, or what have you), let's instead use someone else's. Indeed, this is what the Library does in part; in order to confirm your name and address before it issues a Reader card, it checks your driver's license or passport. It is, therefore, relying on someone else to assert your identity; then, because that other entity has given you an identity, the Library will give you one, too.
- We can outsource the third standard use case (the lost token), and have only a lightweight first case; we only need to collect the information we deem important, and we don't have to mess around with passwords or usernames. If we're offering free account creation, we don't have to store *any* user information, but if we're relying on assertions from the third party, then we might wish to; again, whatever we think is important is what we go with.
- We get the four listed advantages for free; we know total users and total visits, we collect the statistical and targeting information we want (including targeting for advertising, as in #2, should we be so inclined), and needing an account elsewhere to create one at the archive is precisely as much of an impediment as we choose it to be; more on this momentarily. There are additional benefits possible as well.

There are several digital equivalents to this external identity idea (including one expressly designed for academia, called, for reasons that should be clear from above, Shibboleth), but the one that has found real and broad-based success in the "real world" is OpenID. OpenID is a federated open standard; anyone can run their own OpenID server. In addition, OpenID uses URLs as usernames, which makes them globally unique by definition (so one doesn't have to worry about two people attempting to register the same name on your site), and it allows delegated identity, so that users can use personal websites as OpenIDs without having to run their own servers. For instance, my OpenID is http://ussjoin.com. I do not run my own OpenID server; instead, I currently delegate that task to MyOpenID.com. Should I grow dissatisfied with them, I can delegate it elsewhere; these are all issues that I can handle on my own, and they will not affect the consumer site-- in this case, the digital archive. This also helps to solve the long-term authentication problem; with delegation, users can point old identities to new ones, so that a chain of identity can exist-- something that does not in any real fashion happen with photo identification. (OpenID currently has delegation limits, but they can be worked with.)
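To make delegation a little more concrete, here is a minimal sketch of how a consumer site might read the delegation hints from a personal page. It assumes the OpenID 1.x style of HTML discovery, where the page carries `<link>` tags with `rel="openid.server"` and `rel="openid.delegate"`. The URLs in the sample page are hypothetical placeholders, not my real provider endpoints, and a real consumer would fetch the page over HTTP and then run the rest of the protocol, which this sketch skips entirely.

```python
from html.parser import HTMLParser


class DelegationParser(HTMLParser):
    """Collects OpenID 1.x delegation <link> tags from an HTML page."""

    WANTED = ("openid.server", "openid.delegate")

    def __init__(self):
        super().__init__()
        self.links = {}

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        href = attr.get("href")
        # rel may hold several space-separated values, e.g. "openid.server openid2.provider"
        for rel in (attr.get("rel") or "").lower().split():
            if rel in self.WANTED and href:
                # Keep the first value seen for each rel.
                self.links.setdefault(rel, href)


def discover_delegation(html):
    """Return (server_url, delegate_url); either may be None if absent."""
    parser = DelegationParser()
    parser.feed(html)
    return parser.links.get("openid.server"), parser.links.get("openid.delegate")


# Hypothetical personal page that delegates to a hypothetical provider.
SAMPLE_PAGE = """
<html><head>
  <title>My personal site</title>
  <link rel="openid.server" href="https://provider.example/server">
  <link rel="openid.delegate" href="https://alice.provider.example/">
</head><body>Hello.</body></html>
"""

server, delegate = discover_delegation(SAMPLE_PAGE)
print("server:", server)
print("delegate:", delegate)
```

Switching providers then means editing two tags on the personal page; the identity URL the archive knows me by stays the same.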

So this seems useful on its face, but let's look at the additional advantages of outsourced identity:

We can grant trust based on someone else trusting you. This means that we could only give accounts to those logging in with academically-granted OpenIDs. Alternately (and much more powerfully), we can allow someone to create an account with one OpenID, and log in once with an academic ID to prove this link. This identity claiming is essentially how ClaimID[http://claimid.com/about] operates, and it's a powerful tool-- I can get the power of my academic affiliation, but I don't need to use it to log in. This is how we can tailor the barrier to entry to the level the archive wants.
This outsourced trust helps defeat robots, a major flaw in scheme #2; while it's true that robot overlords could register elsewhere first, then at the archive, the additional steps required add overhead for them. This has actually worked at Ma.gnolia, a social bookmarking site; they have posted an explanation at their blog.
It helps integrate the archive into the rest of the world; rather than having YATIHTCAIMEEW (Yet Another Thing I Have To Carry Around In My Ever-Expanding Wallet), the same virtual "card" gets me in everywhere. This is another cited reason at the Ma.gnolia article. Sourceforge has added OpenID support (just today, actually) for this reason.
Detaching the archive from a login system, by not creating one, allows the archive to change more easily in the future. If, in ten years, everyone has an identity chip implanted in their foreheads, OpenID might not be particularly useful-- but the layer of indirection created (and required) through externalizing authentication will allow the shim layer (what tells the archive that a user is logged in) to be simply and quickly recreated for the new ForeheadChip system; the shim still answers the same simple question (isTheUserLoggedIn()), whether they logged in with an OpenID, a ForeheadChip, or a telepathic newt. As digital archives struggle with porting data and metadata, it would be nice not to have to deal with reworking their authentication systems as well, each time technology is upgraded.
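The identity-claiming idea from the first point above (create an account with one OpenID, then log in once with an academic one to prove the link) can be sketched roughly as follows. Everything here is hypothetical: the `ArchiveAccounts` class, the trusted-prefix check, and the example URLs are illustrations of the bookkeeping only, and the sketch assumes both identifiers have already been verified by their respective providers before these methods are called.

```python
class ArchiveAccounts:
    """Hypothetical record of which login identities have a claimed academic affiliation."""

    def __init__(self, trusted_academic_prefixes):
        # Identifier prefixes the archive treats as academically granted (an assumption).
        self._trusted = tuple(trusted_academic_prefixes)
        # Maps login identifier -> academic identifier (or None if nothing claimed yet).
        self._affiliation = {}

    def register(self, login_id):
        """Create an account for an already-verified login identifier."""
        self._affiliation.setdefault(login_id, None)

    def claim_affiliation(self, login_id, academic_id):
        """Link an already-verified academic identifier to an existing account."""
        if login_id not in self._affiliation:
            return False
        if not academic_id.startswith(self._trusted):
            return False
        self._affiliation[login_id] = academic_id
        return True

    def is_affiliated(self, login_id):
        return self._affiliation.get(login_id) is not None


accounts = ArchiveAccounts(["https://id.university.example/"])
accounts.register("https://alice.example/")

# One-time proof: Alice logs in once with her academic identity.
accounts.claim_affiliation("https://alice.example/", "https://id.university.example/alice")
print(accounts.is_affiliated("https://alice.example/"))
```

After the one-time claim, everyday logins use the personal identifier alone, and the archive decides how much weight the recorded affiliation should carry.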

It seems, then, that externalizing the problem is the closest fit to what the Library is doing right now, and also provides benefits that the Library can't get today (in addition to the ones it can); our quest for a digital transition has found its goal. OpenID today seems to provide the things we want from authentication and takes away a lot of issues that we'd rather not deal with, while externalization in general allows us to migrate with the shifting winds of technology, with much less pain than data migration.
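For the curious, the shim layer described earlier might look something like this. It is only a sketch under loud assumptions: the class names are made up, and the two authenticators fake verification with in-memory lookups, where a real OpenID consumer would redirect the user to their provider and check the signed response. The point is just that the archive code calls one question and never learns which mechanism answered it.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional, Set


class Authenticator(ABC):
    """Anything that can turn a presented credential into a user identifier."""

    @abstractmethod
    def authenticate(self, credential: str) -> Optional[str]:
        """Return a stable user identifier, or None if verification fails."""


class OpenIDAuthenticator(Authenticator):
    """Stand-in: accepts identifiers from a set assumed to be provider-verified."""

    def __init__(self, verified_identifiers: Set[str]) -> None:
        self._verified = verified_identifiers

    def authenticate(self, credential: str) -> Optional[str]:
        return credential if credential in self._verified else None


class ForeheadChipAuthenticator(Authenticator):
    """Equally hypothetical future mechanism: maps chip serials to users."""

    def __init__(self, chip_owners: Dict[str, str]) -> None:
        self._chip_owners = chip_owners

    def authenticate(self, credential: str) -> Optional[str]:
        return self._chip_owners.get(credential)


class AuthShim:
    """The only piece of authentication the archive itself talks to."""

    def __init__(self, authenticator: Authenticator) -> None:
        self._authenticator = authenticator
        self._sessions: Dict[str, str] = {}

    def log_in(self, session_id: str, credential: str) -> bool:
        user = self._authenticator.authenticate(credential)
        if user is None:
            return False
        self._sessions[session_id] = user
        return True

    def is_the_user_logged_in(self, session_id: str) -> bool:
        return session_id in self._sessions


def can_read_archive(shim: AuthShim, session_id: str) -> bool:
    # Archive code asks the same question regardless of the login mechanism.
    return shim.is_the_user_logged_in(session_id)


today = AuthShim(OpenIDAuthenticator({"https://alice.example/"}))
today.log_in("session-1", "https://alice.example/")
print(can_read_archive(today, "session-1"))

someday = AuthShim(ForeheadChipAuthenticator({"chip-0042": "alice"}))
someday.log_in("session-2", "chip-0042")
print(can_read_archive(someday, "session-2"))
```

Swapping OpenID for the ForeheadChip changes one constructor argument; `can_read_archive` and everything built on it stay untouched.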

So where do we go from here? If I had a year to continue to explore these issues (in unrelated news, I'm looking for a thesis and/or project advisor for my MSECS :-) ), these are some of the directions I might go in:

Creating a general, usable system for incorporating live-verifiable authentication and authorization assertions from trusted identity brokers into a minimal-information identity system (Short title: Ident-i-Eeze In Real Life, or: Don't Give the Barkeep Your Home Address).
Extensions to OpenID to allow its use in high-importance transactions.

I hope that both my professors and anyone lucky enough to stumble in off the virtual street (and persistent enough to have read this far) have enjoyed reading this exploration. One of the advantages of this format is that though the article is now concluded, the conversation can continue at length in the comments; fittingly, my blog accepts OpenID comments, so feel free to log in with any OpenID-providing organization and tell the world your thoughts!