
How to design ID numbers

 This is a pre-publication draft and is incomplete. Bugger off.

 When you first poke your head out of the warm comfort of your DBM's Identity type, weary of its sameness and lack of distinction, you'll soon realize that the world is full of unique codes and numbers like zip codes, telephone numbers, social security numbers, express mail tracking numbers and whatever it is being stomped-on by tall black lines packed closely together on every box of cornflakes or packet of johnnys you can buy in a supermarket. Each one was carefully designed to serve a specific purpose in the damaging real world of physical things. They were engineered to be unique in their own context and recognizable in others. They were designed to be folded, spindled and mutilated. They are holy numbers. Magic numbers. They identify.

 Anything is appropriate given the right context, which is why you always felt satisfied with Identity columns until now. They lived in the context of your database application and never had reason to stray outside of this cozy little world. But one day you find that your PO numbers, SKUs, customer IDs and warehouse location numbers are all integers, all have reached the same number of digits, and all of them are completely indistinguishable on paper. And even when you know what kind of number you're looking at, they still don't convey any more information than what point in an infinite sequence it was generated, so users have to keep running back to the computer to find out whether 115948 is a bottle of cologne or a massage roller.

 Having designed a few ID numbers for different applications, as well as studied classic numbering systems familiar to you in the real world, I've identified six characteristics that all numbers and codes have to some degree.

1. Uniqueness

 You don't want to generate the same number twice by accident, and there are three ways to achieve this:
  1. Create a Registration Authority to serialize the assignment of IDs
  2. Combine a timestamp with a token to control scope
  3. Generate them randomly with a sufficiently large keyspace

Registration Authorities

 A central authority that serializes the assignment of ID numbers can be as simple as a database manager that supports an Identity (or "Counter" or "Autoincrement") column type, or it could be as institutional as IANA. The benefit is guaranteed uniqueness, but the disadvantage is a new bottleneck in the design of your system. If you can tolerate that bottleneck then RAs are a no-brainer: they can give you short numbers with guaranteed uniqueness.

 RAs are undesirable in many situations besides the bottleneck, however:
  1. You need the overhead of establishing a connection to the RA, so they aren't good if you need to generate numbers on equipment with a poor or no network connection, or need to generate numbers faster than that overhead allows
  2. You can't assign numbers speculatively without wasting your number space
  3. The serialization of numbers limits your ability to generate them in concurrent tasks
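
 The simplest registration authority is a database handing out the next integer. Here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The table name, column names and the next_po_number helper are invented for illustration, not taken from any real schema.

```python
import sqlite3

# Hypothetical table and column names, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchase_orders ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, note TEXT)"
)

def next_po_number(note: str) -> int:
    """Ask the database (our registration authority) for the next ID."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO purchase_orders (note) VALUES (?)", (note,)
        )
    return cur.lastrowid

first = next_po_number("cologne")
second = next_po_number("massage roller")
print(first, second)
```

 Every caller has to go through the one connection, which is exactly the bottleneck described above.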

Scoped Timestamps

 Time itself is the ultimate registration authority. Once synchronized you need no overhead to generate new unique IDs, and those IDs are as plentiful as the precision of your clock. But since they are not unique on their own they have to be scoped and qualified before they're useful as IDs, then given a resolution that fits the application.
  1. Scope is how you distinguish a timestamp as an identifier from any other timestamp that exists. Concatenating the timestamp with a machine hostname, for example, or a symbol that represents the application are two ways of adding scope to a timestamp to make it serve better as an ID. 
  2. Qualification is about making the value clear for what it is, and for a timestamp the very minimum is to specify the timezone it applies to, either in the ID itself or as part of the specification for the ID.
  3. Resolution--or precision--tells you how many IDs you can generate per minute or second or day. If you're certain that a new number will only be needed once a day, then YYYYMMDD will suffice. But if they're needed a couple of times an hour then suddenly you need to go all the way down to the second because two requests could come in simultaneously. 
 In practice I've found that it's impractical to use scoped timestamps for ID values if you'd need a resolution finer than 1 minute, but they are perfect for the products of a scheduled task--where time is necessarily part of the artifact's identity anyway. Log files, emails, instant messages, EDI transmissions and such are ideal for this form of ID because their time of creation is part of their meaning.
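
 Scope, qualification and resolution can be put together in a few lines. This is only a sketch: the "SCOPE.timestampZ" layout, the RESOLUTIONS table and the function name are my own assumptions, and the timestamp is always converted to UTC so the trailing Z does the qualifying.

```python
from datetime import datetime, timezone

# strftime patterns for each resolution; the names are made up for this sketch.
RESOLUTIONS = {
    "day": "%Y%m%d",
    "minute": "%Y%m%dT%H%M",
    "second": "%Y%m%dT%H%M%S",
}

def scoped_timestamp_id(scope: str, when: datetime, resolution: str = "minute") -> str:
    """Build an ID of the form SCOPE.<timestamp>Z from a timezone-aware datetime."""
    if when.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    # Convert to UTC so every ID is qualified the same way.
    stamp = when.astimezone(timezone.utc).strftime(RESOLUTIONS[resolution])
    return f"{scope}.{stamp}Z"  # trailing 'Z' marks the value as UTC

print(scoped_timestamp_id("NIGHTLY-BACKUP", datetime.now(timezone.utc)))
```

 Two calls with the same scope inside the same minute return the same ID, which is the collision problem in miniature.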

 I've also found that synchronization is not a major issue in practice, either. There are now many ways to synchronize clocks either over the network or from something in the device's own hardware (cell phones and GPS hardware for mobile devices, NTP for anything with a network connection). For what scoped timestamps are good for, there are plenty of ways to keep the clock set. For other applications, though, timestamps have too many collision or overlap problems.

Really big random numbers

 The GUID is the God of Unique IDs. That's what it stands for, in fact. Heathens call it the Globally Unique ID, and to the non-Microsoft world it's the Universally Unique ID or UUID. At 128 bits of information it's not just long, it has the strongest immunity to collision of any standard ID number. You could program a computer to randomly generate a billion GUIDs every second and it would take the better part of a century before a collision became more likely than not (and this is a number that I didn't just pull out of my ass).
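
 The odds are easy to check for yourself with the birthday approximation. A rough sketch, assuming random (version-4) GUIDs, where 122 of the 128 bits are actually random. The function name and constants are mine.

```python
import math

RANDOM_BITS = 122  # assumption: version-4 UUID, 6 of the 128 bits are fixed
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def collision_probability(ids_per_second: float, years: float,
                          bits: int = RANDOM_BITS) -> float:
    """Birthday approximation: p ~= 1 - exp(-n^2 / (2 * 2^bits))."""
    n = ids_per_second * years * SECONDS_PER_YEAR
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for tiny x.
    return -math.expm1(-(n * n) / (2.0 * 2.0 ** bits))

for rate in (1e6, 1e9):
    p = collision_probability(rate, 100)
    print(f"{rate:.0e} IDs per second for 100 years: p = {p:.3g}")
```

 The probability grows with the square of the number of IDs generated, so a thousand times the rate means a million times the risk.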

 Uniqueness in our definition means uniqueness in time and space, or: if Dr. Who generated a GUID on Gallifrey in the year 12,483 on a Type 40 TARDIS computer, then it's almost certain not to conflict with a GUID generated on a ZX Spectrum 48K in Lancing, England circa 1988. In one poof you rid yourself of the overhead that comes with a registration authority, plus the synchronization and scoping problems of timestamps.

 But this uniqueness comes at a price: to represent a GUID you need 16 bytes, and since each byte can encode 256 values a human-readable GUID is usually printed in hexadecimal digits, two per byte, bringing its printed length to a whopping 32 characters. Printed GUIDs are also further divided into 5 blocks separated by hyphens, chucking another 4 characters onto the mess to give you a bloated 36-character number that looks like this:
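
 The exact digits differ on every run, so the quickest way to see one is to generate it. A minimal sketch using Python's standard uuid module:

```python
import uuid

guid = uuid.uuid4()      # a random (version 4) GUID
print(guid)              # canonical text form: hex digits in hyphen-separated blocks
print(guid.hex)          # the same value with the hyphens stripped
print(len(guid.bytes))   # size of the raw value in bytes
```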


 This is how God looks in the morning when he hasn't brushed his hair.

 The above is hexadecimal (base 16), which is a step more compact in text than decimal and a reminder that you can convey more information per character if you have more symbols to work with. The next step up in encoding is base 32, which combines the digits with most of the alphabet. We'll drop I, L and O to avoid confusion with 1 and 0, plus the letter U to avoid accidental cussing, and we'll name this particular alphabet after Douglas Crockford, who was the first to propose it.
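
 Here is a minimal sketch of that alphabet at work, assuming whole non-negative integers and skipping the optional check symbol in Crockford's proposal. The function names are made up for illustration.

```python
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"  # no I, L, O or U

def encode_base32(n: int) -> str:
    """Write a non-negative integer using the Crockford alphabet."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    if n == 0:
        return "0"
    digits = []
    while n:
        n, remainder = divmod(n, 32)
        digits.append(CROCKFORD[remainder])
    return "".join(reversed(digits))

def decode_base32(text: str) -> int:
    """Read the value back, forgiving the usual look-alike mistakes."""
    # Hyphens are ignored; O reads as 0, and I or L read as 1.
    text = text.upper().replace("-", "")
    text = text.replace("O", "0").replace("I", "1").replace("L", "1")
    n = 0
    for ch in text:
        n = n * 32 + CROCKFORD.index(ch)  # raises ValueError on U or symbols
    return n

print(encode_base32(115948))   # 3H7C
print(decode_base32("3h7c"))   # 115948
```

 The six-digit decimal 115948 shrinks to four characters, and a sloppy reader who types a lowercase letter or swaps O for 0 still gets the right value back.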

 This table will give you an idea of how many unique values you can have for each encoding.

                 Decimal                  Hexadecimal                  Base 32
  5 characters   100,000                  1,048,576                    33,554,432
 10 characters   10,000,000,000           1,099,511,627,776            1,125,899,906,842,624 (1.1 quadrillion)
 15 characters   1,000,000,000,000,000    1,152,921,504,606,846,976    Lots (about 3.8 x 10^22)
                                          (1.15 quintillion)

 And for comparison, the number of characters required to represent an unpunctuated GUID in each encoding:

         Decimal          Hexadecimal      Base 32
 GUID    39 characters    32 characters    26 characters

 If you threw in both upper and lowercase letters, plus a couple of symbols to get a Base-64 numbering set, you could boil a GUID down to a mere 22 characters, but this is generally a bad idea for reasons we'll discuss later.
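
 Both kinds of figure are easy to recompute. The sketch below uses exact integer arithmetic (no floating point, so no rounding in the long numbers). It assumes the GUID is written as one big 128-bit number rather than byte by byte, and the function names are mine.

```python
def capacity(base: int, length: int) -> int:
    """How many distinct values fit in `length` characters of `base`."""
    return base ** length

def chars_needed(bits: int, base: int) -> int:
    """Characters needed to write the largest `bits`-bit value in `base`."""
    largest = (1 << bits) - 1
    count = 0
    while largest:
        largest //= base
        count += 1
    return count

for length in (5, 10, 15):
    row = [f"{capacity(base, length):,}" for base in (10, 16, 32)]
    print(f"{length:2} characters:", " | ".join(row))

for name, base in [("decimal", 10), ("hexadecimal", 16),
                   ("base 32", 32), ("base 64", 64)]:
    print(f"GUID in {name}: {chars_needed(128, base)} characters")
```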

 Considering how huge the number space is for Base 32 at a mere 15 characters, a GUID's payload--being many orders of magnitude bigger--might help you appreciate why its designers considered it completely safe to generate them at random as often as you want and in as many places as you want without any realistic chance of a collision--provided that you have a good random number generator, that is.

2. Persistence

 The venerable MD5, although humbled by recent cryptanalysis, is a persistent bugger. If you run the algorithm on the same file it'll give you the same unique number, and that's the point. MD5 has since been supplanted in the security world by the SHA family of hashing algorithms, which will in turn be supplanted by ever stronger algorithms that strive to be more pre-image resistant than the last. But you don't need to worry about security to have a useful and persistent ID for many applications.

 This kind of persistence is useful for IDs that don't have a good place to be remembered. GUIDs and counters are generated independently from what they represent and can't be recovered by an analysis of the thing itself. But hashes can.
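
 A persistent ID of this kind takes one call to Python's standard hashlib module. In this sketch the choice of SHA-256 and the truncation to 16 hex characters are my own assumptions for illustration. A shorter ID is easier to handle but collides sooner.

```python
import hashlib

def content_id(data: bytes, length: int = 16) -> str:
    """Derive an ID from the content itself: same bytes in, same ID out."""
    # Truncating the digest is an illustrative choice, not a standard.
    return hashlib.sha256(data).hexdigest()[:length]

invoice = b"PO 115948: one bottle of cologne"
print(content_id(invoice))
print(content_id(invoice) == content_id(bytes(invoice)))  # recomputed, not remembered
```

 Lose the ID and you can recover it from the thing it identifies, which is something no counter or GUID can offer.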

3. Recognizability

 Phone numbers are one of the most recognizable ID numbers around, but they differ from country to country. This is important to remember because the shape or pattern of your ID number will not be unmistakable everywhere. After you've run it through the grist mill of durability (which I'll talk about last), the compromises you'll submit to will take a heavy toll on the ability of your ID number to be recognized out of its native context.  

4. Verifiability

 In the late 90s a half-dozen wide-eyed entrepreneurs attempted to create a form of digital money, or Digicash (that being the name of one). Digital coins were made of two numbers that had a mathematical relationship to each other, the purpose of which was authentication. If somebody tried to forge digital coins by making up random numbers then they were likely to fail if they didn't know the key that was used to generate the numbers and validate their authenticity.

 Anonymous digital cash hasn't survived the optimism of its dreamers, but good ol' plastic goes strong with a remarkably simple verification scheme called the Luhn code, named after Hans Peter Luhn who invented it while working at IBM. Every major payment card issuer uses the Luhn algorithm to create and verify credit and debit card numbers. Even before you query a database you can know if any 16-digit number is technically valid by running it through the "mod 10" algorithm designed by Luhn.
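
 The check itself fits in a dozen lines. This is a sketch of the Luhn "mod 10" test: the function name is mine, and tolerating spaces and hyphens is an added convenience rather than part of the algorithm.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn mod 10 check."""
    cleaned = number.replace(" ", "").replace("-", "")
    if not (cleaned.isascii() and cleaned.isdigit()):
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for position, ch in enumerate(reversed(cleaned)):
        digit = int(ch)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0

print(luhn_valid("4111 1111 1111 1111"))  # True (a widely used test number)
print(luhn_valid("4111 1111 1111 1112"))  # False
```

 Passing the check only means the number is well formed. It says nothing about whether the account exists.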

5. Capacity

 GUIDs pack a lot of bits, but who knows what any given GUID is for? In contrast, a Vehicle Identification Number, or VIN, can encode the make, model, country of manufacture, year of manufacture, engine type, factory-installed features, and hell, even the factory it was made in. It can do this in 17 characters because of the number of symbols it can contain and international standards for shorthand codes that represent the year and country of manufacture.
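
 To make that concrete, here is a sketch that cuts a VIN into its fields by position. The layout assumed is the North American one (check digit in position 9, model year in 10, plant in 11), and other regions use some of those positions differently. The class and function names are invented, and the sample VIN is a commonly quoted example, not a lookup of any real vehicle.

```python
from typing import NamedTuple

class VinFields(NamedTuple):
    manufacturer: str  # positions 1-3, World Manufacturer Identifier
    descriptor: str    # positions 4-8, model, body, engine and so on
    check_digit: str   # position 9 (North American usage)
    model_year: str    # position 10, a one-character year code
    plant: str         # position 11, assembly plant code
    serial: str        # positions 12-17, production sequence

def split_vin(vin: str) -> VinFields:
    """Split a 17-character VIN into fields without decoding them."""
    vin = vin.strip().upper()
    if len(vin) != 17 or not (vin.isascii() and vin.isalnum()):
        raise ValueError("a VIN is 17 letters and digits")
    if any(ch in "IOQ" for ch in vin):
        raise ValueError("VINs never contain I, O or Q")
    return VinFields(vin[0:3], vin[3:8], vin[8], vin[9], vin[10], vin[11:])

print(split_vin("1M8GDM9AXKP042788"))
```

 Turning each field into words (which manufacturer, which year) needs the published code tables, which this sketch leaves out.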

6. Durability

 A Universal Product Code, or UPC, has to survive trucks, trains, condensation, shopping carts and mothers fighting over the last cabbage patch doll. For this they have three defenses:
  1. They're short. 8 to 11 digits, 4 bits per digit
  2. They have a check digit
  3. They're short and they have a check digit
 Oh, was I being redundant? Well so are UPCs. They're nothing, however, compared to turbo codes, which leverage Claude E. Shannon like Hercules with a crowbar the size of a whale's penis.
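
 Back to the humble check digit for a moment. For the common 12-digit UPC-A form it is computed from the first 11 digits: digits in odd positions count three times, digits in even positions once, and the check digit tops the total up to a multiple of 10. A minimal sketch, with a function name of my own choosing:

```python
def upc_check_digit(first_eleven: str) -> int:
    """Compute the 12th (check) digit of a UPC-A code from the first 11."""
    if len(first_eleven) != 11 or not (first_eleven.isascii() and first_eleven.isdigit()):
        raise ValueError("expected exactly 11 digits")
    digits = [int(ch) for ch in first_eleven]
    odd_sum = sum(digits[0::2])   # positions 1, 3, 5, ... counted from the left
    even_sum = sum(digits[1::2])  # positions 2, 4, 6, ...
    return (10 - (3 * odd_sum + even_sum) % 10) % 10

print(upc_check_digit("03600029145"))  # 2
```

 A scanner that misreads any single digit gets a total that no longer lands on a multiple of 10, so the bad read is rejected instead of ringing up the wrong product.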

 [filler goes here]

 Our #1 concern way up above was uniqueness, a problem we looked at solving by using a larger symbol set to squeeze more bits into a shorter printable space. The Crockford Base-32 alphabet is a way to solve this, but while its omission of "O" and "U" reduces accidental profanity it doesn't survive the telephone test of durability. Take the example of the humble order ID given to rubes buying books, perfumes or pornography online. When they call customer service on the phone you can't expect them to know the full RAF phonetic alphabet by heart. They'll try to read it out loud the way they were taught in school, and over the phone "Em" and "En" sound the same, and so do "Ee", "Dee" and "Pee".