Requirements and Approaches for a Publicly Visible Persistent Identifier for Person Entries in the Stanford University Enterprise Directory Service

Jeff Hodges (NeuStar)
RL "Bob" Morgan (University of Washington)

Both formerly of Computer and Communication Services,
Stanford University

v1.1
original: 29-Jul-1998
revised: 23-Aug-2006

Abstract

Directory entries representing people in the Stanford University enterprise directory service have both non-unique names and unique identifiers, although not all entries have "palatable" unique identifiers. This presents issues when one is designing a user interface where queries may return result sets with more than one entry -- a user may need to view and utilize unique identifiers in order to refine such result sets. We analyze this need and identify various plausible solution options, given our naming and identification infrastructure. Two solution approaches are presented and their differentiating factors discussed. The approach we decided on, and its detailed description, close the paper.

Executive Summary

Character-based user interfaces to directories need to supply the user with some identifier to use to refine searches that return multiple entries. This concept is explicitly embodied in the Whois protocol [Whois] and termed a handle. An entry's handle is a guaranteed-unique identifier for that entry. It is typically displayed in all query result sets, whether a set of one or many.

We are migrating our enterprise whitepages directory from Whois-based technology onto LDAP-based technology [LDAP]. We anticipate that many of the user interfaces (UIs) to the new LDAP-based enterprise directory service will be graphical (i.e. GUIs) in nature. However, character-based UIs will remain in use for the foreseeable future. For example, we intend to provide Whois-based access to the LDAP-based directory via a Whois-to-LDAP gateway.

Note that a common web-based GUI approach to displaying directory query result sets is arguably a simple textual list (perhaps tabularly arranged) in a subwindow. In such a GUI, entries with identical names can be disambiguated by rendering them as hyperlinks to further information, obviating the need for an ancillary disambiguating entry identifier. However, since we plan to provide a Whois-to-LDAP gateway, we still need to provide some identifier for entries which serves the purposes that the Whois handle does.

However, the Whois handle, as instantiated in our Whois-based directory, has been problematic for users in our experience. Our Whois handles are based on people's names and are used strictly within the context of the directory. Since other people with similar names may arrive at the university, or people may change their names, one's Whois handle is not guaranteed to persist over time, i.e. it may change. But users continually mistake the Whois handles for email addresses and are perturbed when they change without their knowledge. We believe the amount of user confusion and our on-going effort to dispel it justifies looking into the issues and seeing what we can do to resolve them while we're migrating to a new technology base.

This document analyzes the requirements for the "handle" concept, which herein is termed a Publicly Visible Persistent Identifier (PVPI), and presents two solution approaches. They are summarized as...
 
A1.  Construct and utilize a guaranteed-unique form of Registered Name, i.e. a "Uniquified" * Registered Name (URN). 
 
 
A2.  Construct and utilize a unique, fixed-length, non-Registered-Name-based alphanumeric identifier. 
 

The tradeoffs involved in utilizing either approach are subtle and varied. They are summarized in the section entitled An Alternative Solution Scenario and Two Subsequent Approaches. The Bottom Line section summarizes the quandary, and Our Chosen Approach specifies the solution. We recommend that readers read at least those three sections. Readers interested in the gory details are invited to begin with the following section.

* "Uniquified" was first coined by Tim Howes in a message to Jeff Hodges about how they rendered people's names unique in the UMich X.500 directory service. It means, essentially "to have been rendered unique".

Document Conventions

The key words "MUST", "SHOULD", and "MAY" used in this document are to be interpreted as described in [ReqsKeywords].

Definitions, Background, and Motivation

Currently, person entries in the Stanford University enterprise directory service have, as attributes, various forms of "natural names" and "general identifiers" [SUNetIDReqs, SUNetIDDesign]. First, we will outline the definitions of these terms (see the referenced documents for a more detailed discussion).

A subject is a named, often real-world, entity. A person, for example. A person entry in the directory is an entry with an objectclass of "person" [LDAPattributes] which we map to a real-world subject, i.e. a person. We accomplish such mapping through the use of subjects' names and identifiers. A name is a character string that may map to zero, one, or more subjects. A natural name is a name that is based on a subject's real-world name. An identifier, in contrast, is a name that maps to exactly one subject. An identifier may be based on a subject's natural name, or it may be artificial -- a number or a bit string, for example.
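The distinction between a name and an identifier can be illustrated with a small sketch. It is only an illustration of the definitions above, not a description of how the directory stores entries; the subject keys, the sample name, and the identifier strings are all invented for the example.

```python
from collections import defaultdict

# Hypothetical sample data: subject keys, names and identifiers are invented.
name_index = defaultdict(set)   # name -> set of subjects (zero, one, or more)
identifier_index = {}           # identifier -> exactly one subject


def add_name(name, subject):
    """Record that `name` maps to `subject`; several subjects may share a name."""
    name_index[name].add(subject)


def add_identifier(identifier, subject):
    """Bind `identifier` to `subject`, refusing to bind it to a second subject."""
    existing = identifier_index.get(identifier)
    if existing is not None and existing != subject:
        raise ValueError(f"{identifier!r} already identifies {existing!r}")
    identifier_index[identifier] = subject


add_name("Pat Lee", "subject-1")
add_name("Pat Lee", "subject-2")        # a natural name may be shared
add_identifier("id-0001", "subject-1")  # an identifier may not
add_identifier("id-0002", "subject-2")
```

A lookup by name can therefore return a set of any size, whereas a lookup by identifier returns at most one subject, which is the property a result-set refinement step relies on.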

We use different forms of names and identifiers in various situations depending upon contextual requirements. For example, a user interface (UI) to some application might prompt a user for their "name", and expect that most people will enter some form of their natural name. Another UI might prompt a user for an identifier and expect that people will enter one rather than their natural name. There may be environments where one's natural name is treated as an identifier, but we explicitly do not encourage that approach in our environment. This is because, although natural names typically map to individual subjects, occasionally some map to more than one subject.

In our interwoven Stanford University Network Identifier [SUNetID], Person Registry [Registry], and Directory [Directory] environment we have defined the following types of names and identifiers...

There are both short- and long-form Registered Name-based GeneralIDs. Short forms are suitable as "login names" for most of the various currently-utilized types of operating systems (OSs) extant on the Stanford University Network (SUNet). Each form of a subject's Registered Name-based generalIDs must be unique across the entire General ID namespace, by the definition of an identifier. These forms aren't germane to this document's discussion because they aren't based on a subject's natural name. Note that person entries in the directory will have these properties... This document's topic is directly due to these properties. Here's a prime motivational example...
Some UI scenarios for directory queries require selection by the user of particular entries from a set of returned, and often summarized, entries. Examples of this arise with "whois" and "finger" command line-based UI "frontends" to the directory. With these UIs, it is typically up to the user to enter an identifier from the first result set in order to obtain a complete view of a single entry. Other examples arise in the context of graphical UIs (GUIs) to the directory. Here, though, the GUI can hide from the user the unique identifier used to make the singular selection from the original result set.

Currently Recognized High-level Requirements

In the former example, where the user must view and potentially re-enter the identifier, there are the typical requirements for the identifier to be based upon one of the subject's names, to be of "reasonable" length, or otherwise be "human palatable". E.g., a long alphanumeric string bearing no relation to the subject's natural name(s) is usually considered sub-optimal, i.e. "not terribly palatable/memorable/wieldable". However, a short numeric string, similar to a driver's license number or a University ID number, is considered by some to be acceptably palatable. Also, there are simple techniques for rendering a subject's Registered Name unique. The resultant length depends upon the length of the subject's Registered Name, of course. This approach is also considered palatable by some. And there are some who consider the former or the latter or both to be unpalatable.

Another clear requirement we've derived over the years of running the Whois-based directory service is that subjects would prefer it if all things in their entries smacking of "names" or "identifiers" would only change if they perform some explicit action that reasonably causes them to change, e.g. changing their Registered Name. This of course occurs "naturally" through various life changes, e.g. matrimony.

The third requirement is one of consistency -- the style or form of the human palatable identifier should be the same for all person entries. Our experience suggests that user and subject confusion will result if there is not a consistent form of human palatable identifier across all person entries.

The fourth/final requirement is that this identifier must be visible to all directory clients whether they are authenticated and authorized or not -- i.e. it must be "publicly visible".

The Problem Statement

Thus, the problem statement containing these essential requirements is... We term such an identifier simply a publicly-visible persistent identifier (PVPI), since it would be expressly intended to be used in any query from any requestor and would not itself be protected by an access control list. Note that the PVPI is essentially the same in concept as the "handle" utilized in the Whois protocol [Whois] and is typically used directly in user interfaces to Whois-based directory services [SUNetWhois].

Note that we're explicitly not using the term handle in this document in order to avoid confusion with the particular-to-Whois instantiation of this concept.

Detailed Technical Requirements

These are the detailed, nominal requirements that we feel define the problem space upon which the above expose' is based... E.g. no more than 2 changes per year per entry. E.g. fewer than 17 not-totally random characters containing no whitespace or control characters. The ancillary requirements below help to further refine the problem space. Some of them are not finalized; these are the ones with options indicated in []'s, with " | " separating the options. Finalizing these requirements is part of refining the solution options, since these requirements essentially present tradeoffs in system behavior. The identified ancillary requirements are...

Readily Apparent Solution Scenarios and their Issues

One potential approach is to copy what the Whois-based directory does. It has the concept of a unique "handle" which is entirely based on the equivalent of one's Registered Name. The handle is an algorithmically derived abbreviation of Registered Name and is guaranteed unique across all entries in the Whois-based directory at a given point in time. However, one's handle can change over time. Each time a new directory entry is created, collisions between the new entry's handle and other similarly-named entries' handles are resolved by adjusting the new entry's handle and possibly those of the similarly-named entries. Thus a subject's handle may change without the subject's knowledge. This clearly violates the requirement of having no arbitrary changes made to one's names or identifiers...
 
I1.  Whois-style "handles" may arbitrarily change over time without the subject's input or approval. 
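
Issue I1 can be made concrete with a toy sketch. The derivation rule used here (initial plus family name, falling back to a longer form for every entry that shares the short form) is purely hypothetical and is not the algorithm used by the Whois-based directory; it only shows how a name-derived handle can change for an existing entry when a similarly named entry is added.

```python
from collections import Counter


def assign_handles(entries):
    """Return {entry_id: handle} for a list of (entry_id, given, family) tuples.

    Toy rule (an assumption for illustration): use initial+family when that is
    unique, otherwise use given.family for every entry sharing the short form.
    Two entries with identical given and family names would still collide.
    """
    short = {eid: (given[0] + family).lower() for eid, given, family in entries}
    counts = Counter(short.values())
    handles = {}
    for eid, given, family in entries:
        if counts[short[eid]] == 1:
            handles[eid] = short[eid]
        else:
            handles[eid] = f"{given}.{family}".lower()
    return handles


directory = [(1, "Jane", "Smith")]
before = assign_handles(directory)

directory.append((2, "John", "Smith"))   # a similarly named person arrives
after = assign_handles(directory)

print(before[1], "->", after[1])         # prints: jsmith -> jane.smith
```

Entry 1 took no action, yet its handle changed, which is exactly the behavior the "no arbitrary changes" requirement rules out.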
 
The Registered Name-based identifiers supplied by the SUNet ID system would seem to be prime candidates to utilize as human palatable identifiers. However, not all entries in the directory have SUNet IDs, as discussed above. Thus only a subset of person entries will have a Registered Name-based identifier which might be presented in result sets. Currently, the only other identifiers in, or planned for inclusion in, all person entries are numerically-based and bear no relationship to a subject's natural name. Additionally, our default authorization posture is that we reveal current SUNet IDs only to previously authorized clients, e.g. authenticated users. This further limits our being able to utilize current SUNet ID-based identifiers as humanly palatable identifiers. Thus this scenario's issues are...
 
I2.  Not all subjects have SUNet ID-based identifiers, and some of the rest will not have publicly revealed any of theirs. 
 
A third potential solution would be to utilize whichever form of SUNet ID a subject has authorized to be publicly-visible, e.g. by authorizing it as an email alias [SEAS], or, if no publicly-visible SUNet ID form exists, utilize some other identifier that's based on Registered Name. This isn't consistent. Additionally, if one authorizes a SUNet ID as an email alias, where one wasn't authorized before, then it follows that one's human palatable identifier returned to clients should probably change. But this violates the no arbitrary changes requirement.
 
I3.   Using SUNet ID-supplied, Registered Name-based identifiers for some entries and something else for others is inconsistent, plus one's human palatable identifier would be subject to arbitrary change. 

An Alternative Solution Scenario and Two Subsequent Approaches

An alternative solution is to create yet another form of General ID for all person entries and use it as the PVPI. The form of this identifier arguably hinges on this single question...
Should PVPIs be "name-like"?
If they are name-like, then... Or, if they are not name-like, then.. Both approaches objectively satisfy the primary PVPI requirements as long as we haven't specified a maximum-length requirement for PVPIs (that's less than the longest possible name which might enter the system).

In summary, the two approaches are...

A1. The Guaranteed-unique Form of Registered Name Approach

A2. The Fixed-length Non-Registered-Name-based Alphanumeric Identifier Approach

The implications of each approach are discussed below.

Implications of Approach A1

The advantages of this approach are... The disadvantages are...

Implications of Approach A2

The advantages of this approach are... The disadvantages are...


The Bottom Line Decision...

To some degree the differences between the approaches center on aesthetics. However, the factor of constrained or relatively unconstrained identifier length could be a quantifiable tradeoff in some situations, and the Registered Name-based approach has a quantifiable impact on the SUNet ID system.

A2 is much easier to implement, but may not be as palatable.

Though, can we afford to do A1; is it worth the effort?

In the next section, we discuss and specify the approach we decided to implement.
 

Our Chosen Approach

We decided to go with A2: the Fixed-length Non-Registered-Name-based Alphanumeric Identifier approach. We feel its advantages outweigh its disadvantages and that approach A1's cost to system developer/maintainers was too high given what we feel are not terribly great advantages over A2 from the user's perspective.

Here's the format we chose for the PVPI...

DSnnnAnnn Where... The PVPI is an alphanumeric string constructed according to these rules... (Please note that an [ABNF] specification for the PVPI is presented in Appendix A, below.) For example... Our motivations for choosing this particular format are that it is...
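
As a rough illustration of the DSnnnAnnn pattern, the sketch below allocates identifiers that conform to the Appendix A grammar. The allocation strategy is an assumption made for the example: random selection with an in-memory set standing in for a uniqueness check against the registry. The names `ALPHA_CHARS` and `new_pvpi` are hypothetical. With the "DS" prefix fixed, the pattern yields 1000 x 24 x 1000 = 24,000,000 distinct values.

```python
import secrets

# Letters permitted by the Appendix A grammar: A-Z without I and O (24 letters).
ALPHA_CHARS = "ABCDEFGHJKLMNPQRSTUVWXYZ"


def new_pvpi(in_use):
    """Draw random DSnnnAnnn candidates until one is not already in `in_use`.

    `in_use` is a set standing in for whatever uniqueness check the registry
    would really perform (an assumption for this sketch).
    """
    while True:
        candidate = "DS{:03d}{}{:03d}".format(
            secrets.randbelow(1000),      # first three digits, zero-padded
            secrets.choice(ALPHA_CHARS),  # middle letter, never I or O
            secrets.randbelow(1000),      # last three digits, zero-padded
        )
        if candidate not in in_use:
            in_use.add(candidate)
            return candidate


allocated = set()
first = new_pvpi(allocated)
second = new_pvpi(allocated)
```

Because the value carries no name-derived content, a newly arriving entry never forces an existing entry's identifier to change.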

Conclusion

We did not yet have any operational experience with this approach at the time of the writing of the 1.0 version of this paper, and we both left Stanford shortly thereafter -- and we have not yet caught up with our colleagues there to see how it has worked in the intervening eight years. PVPIs were implemented in the SUNet Person Registry [Registry] and Enterprise Directory Service in Fall 1998 as a part of the rollout of the then new, LDAP-based enterprise identity, registry and directory services called "StanfordWho". PVPIs can be seen in action by, for example, using the following command (at a command prompt) on most any internet-connected *nix system: "whois -h whois.stanford.edu <some common family name>". Note that in the gratuitous whois help information returned, the PVPI is referred to as a "DS number".

Appendix A

This is the PVPI syntax expressed in [ABNF] form...
 
pvIdent    = alphaTag numericTag alphaChar numericTag
alphaTag   = %x44 %x53             ; "DS"
alphaTag   =/ alphaChar alphaChar  ; may alloc other values in
                                   ; the future from the alphaChar set
alphaChar  = %x41-48 / %x4A-4E / %x50-5A
                                   ; A-H, J-N, P-Z, I & O are excluded
numericTag = DIGIT DIGIT DIGIT     ; 0-9 0-9 0-9
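
For readers who want to check strings against this grammar, the following is a minimal sketch of a validator written as a regular expression. The function names are hypothetical. Note that the grammar as written admits any two alphaChar letters as the alphaTag, so the sketch offers a second check for the "DS" tag specifically.

```python
import re

# alphaChar is A-H, J-N, P-Z; DIGIT is ASCII 0-9, hence [0-9] rather than \d,
# which would also accept non-ASCII digits.
_PVPI_RE = re.compile(r"[A-HJ-NP-Z]{2}[0-9]{3}[A-HJ-NP-Z][0-9]{3}")


def is_valid_pvpi(text):
    """Return True if `text` matches the pvIdent rule exactly."""
    return _PVPI_RE.fullmatch(text) is not None


def has_ds_tag(text):
    """Return True if `text` is a valid PVPI whose alphaTag is "DS"."""
    return is_valid_pvpi(text) and text.startswith("DS")
```

For example, `is_valid_pvpi("DS123A456")` holds, while `"DS123I456"` is rejected because the letter I is excluded from alphaChar.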
 
 

References

[ABNF] Augmented BNF for Syntax Specifications: ABNF. D. Crocker, Ed., P. Overell. Internet Engineering Task Force, RFC 2234. November 1997. Available as: http://info.internet.isi.edu:80/in-notes/rfc/files/rfc2234.txt
 
[Directory] Directory Services Project. Stanford University. Jeff Hodges, Directory Services Lead Geek. Project established July 1994. In production as of May 1996 with on-going evolution. Info available at: http://www.stanford.edu/group/networking/directory/
 
 
[LDAP] An LDAP Roadmap and FAQ. Annotated bibliography of resources about the Lightweight Directory Access Protocol. Jeff Hodges. Periodically updated. Available at: http://www.stanford.edu/group/networking/directory/x500ldapfaq.html
 
 
[LDAPattributes] A Summary of the X.500(96) User Schema for use with LDAPv3. Mark Wahl. Internet Engineering Task Force, RFC 2256. Available as: http://info.internet.isi.edu:80/in-notes/rfc/files/rfc2256.txt
 
 
[Registry] Registries Project. Stanford University. The Registry Team. Project established ca. 1996. In production as of June 1998, with evolution on-going. Info available at: http://www.stanford.edu/group/itss-ccs/project/registry/
 
 
[ReqsKeywords] Key Words for use in RFCs to Indicate Requirement Levels. Scott Bradner. Internet Engineering Task Force, RFC 2119/Best Current Practice 14. Available as: http://info.internet.isi.edu:80/in-notes/rfc/files/rfc2119.txt
 
 
[SEAS] Stanford Email Alias Service. Stanford University. The SUNet ID Team. Project established ca. 1995. In production as of May 1996 with on-going evolution. Info available at: http://www.stanford.edu/group/itss/services/sunetid/alias-info.html
 
 
[SUNetID] Stanford University Network Identity System. Stanford University. The SUNet ID Team. Project established ca. 1995. In production as of May 1996 with on-going evolution.
 
 
[SUNetIDDesign] Stanford University Network Identity System: Design. RL "Bob" Morgan, Stanford University. 1-March-1996. Available as: http://www.stanford.edu/group/itss-ccs/project/sunetid/sunetid.design/sunetid.design
 
 
[SUNetIDReqs] Stanford University Network Identity System: Scope and Requirements. RL "Bob" Morgan, Stanford University. 1-March-1996. Available as: http://www.stanford.edu/group/itss-ccs/project/sunetid/sunetid.design/sunetid.requirements
 
 
[SUNetWhois] SUNet Whois-based Directory Service. Stanford University. Originally by the Systems and Technology team of Networking Systems. In production since near the beginning of time. Available at: http://www.stanford.edu/cgi-bin/whois2html/
 
 
[Whois] Nicname/Whois. K. Harrenstien, M.K. Stahl, E.J. Feinler. Internet Engineering Task Force, RFC 954. October 1985. Available as: http://info.internet.isi.edu:80/in-notes/rfc/files/rfc954.txt