the djb way

cdb


cctlds: a cdb for country codes

If you've been poking around in your publicfile logs, doing reverse DNS lookups with djbdns utilities like dnsname, you may sometimes wonder where in the world hostnames like thishost.sc or thathost.tt are coming from.

And what about that cr.yp.to, after all? Where the heck is .to?

On an OpenBSD system we can find a few miscellaneous plain-text data files, among which is /usr/share/misc/countrycodes. This provides a list of country code top-level domains, as shown in the following snippet:


# $OpenBSD: countrycodes,v 1.3 2002/10/12 02:14:15 jsyn Exp $
#
# ISO 3166-1 country names and code elements
# http://www.din.de/gremien/nas/nabd/iso3166ma/codlstp1/en_listp1.html
#
# ccTLDs are derived from this list, though a few ccTLDs in popular use 
# are not on this list (i.e., 'UK')
#
# Country Code : Country 

AD:ANDORRA
AE:UNITED ARAB EMIRATES
AF:AFGHANISTAN
<snip>
ZA:SOUTH AFRICA
ZM:ZAMBIA
ZW:ZIMBABWE

[As the header comments note, this particular file is incomplete. A current listing in HTML format may be found at http://www.iana.org/cctld/cctld-whois.html.]

With this file in hand, we could easily grep(1) around in it to satisfy our curiosity:

$ grep -i "^sc" countrycodes
SC:SEYCHELLES

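Grepping through the entire file is fine for casual queries, but every lookup is a linear scan from the top. A constant database like cdb goes straight to the record through a hash of the key, so lookup time stays essentially flat no matter how large the data set grows.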

Converting countrycodes to a cdb is extremely simple. We observe that key/data pairs are colon (":") delimited, so we will adapt the cdbmake-12 script we saw in the overview section to cdbmake-cctlds.sh:


#!/bin/sh
# cdbmake-cctlds.sh
# convert countrycodes data file to cdb
# ===
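awk '
BEGIN {
  # key and data are separated by a colon
  FS = ":"
}

# skip comment and blank lines; process only "CC:COUNTRY" lines
/^[A-Z]/ {
  # emit one cdbmake record: +klen,dlen:key->data
  print "+" length($1) "," length($2) ":" tolower($1) "->" tolower($2)
}

END {
  # cdbmake expects a blank line to terminate its input
  print ""
}
' | /usr/local/bin/cdbmake "$@"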

### that's all, folks!

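In addition to defining the field separator as the ":" character in the BEGIN block, we process only lines beginning with A-Z (which skips the comment and blank lines), and then normalize the output data to lower case. The script passes its arguments straight through to cdbmake, which builds the database in the temporary file named by the second argument and then moves it into place under the first. Make the script executable (chmod 755 cdbmake-cctlds.sh), then generate a cdb file named cctlds.cdb with: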

$ ./cdbmake-cctlds.sh cctlds.cdb cctlds.tmp < countrycodes
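
Before building the real thing, it can help to see exactly what the awk filter hands to cdbmake. The following sketch wraps the same awk program in a shell function (the name cctlds_records is made up for this illustration) and feeds it a few sample lines, so it runs even without cdbmake installed:

```shell
#!/bin/sh
# cctlds_records: hypothetical helper for illustration only.
# Same awk filter as cdbmake-cctlds.sh, but the cdbmake input records
# go to stdout instead of being piped into cdbmake.
cctlds_records() {
  awk '
  BEGIN { FS = ":" }
  /^[A-Z]/ {
    print "+" length($1) "," length($2) ":" tolower($1) "->" tolower($2)
  }
  END { print "" }
  '
}

# comment and blank lines are skipped; only code lines become records
printf '%s\n' '# Country Code : Country' '' 'SC:SEYCHELLES' 'TO:TONGA' |
  cctlds_records
```

Each record comes out in the form +klen,dlen:key->data, for example +2,10:sc->seychelles, and the final blank line tells cdbmake that the input is complete.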

Howzit work? Use cdbget for lookups:

$ cdbget tt < cctlds.cdb && echo
trinidad and tobago
$ cdbget td < cctlds.cdb && echo
chad
$ cdbget iq < cctlds.cdb && echo
iraq
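
Tying this back to the log-reading scenario at the top of the page, the lookup can be wrapped so that it takes a whole hostname. This is only a sketch under a few assumptions: the function names tld_of and cctld_whereis are invented here, cctlds.cdb is assumed to sit in the current directory, and cdbget is assumed to be on the PATH.

```shell
#!/bin/sh
# tld_of HOST: print the last dot-separated label of HOST, lower-cased.
# (Hypothetical helper; a trailing dot on HOST is not handled.)
tld_of() {
  printf '%s\n' "${1##*.}" | tr '[:upper:]' '[:lower:]'
}

# cctld_whereis HOST: look up the top-level domain of HOST in cctlds.cdb.
# Assumes cdbget is installed and cctlds.cdb is in the current directory.
cctld_whereis() {
  cdbget "$(tld_of "$1")" < cctlds.cdb && echo
}
```

With the database built above, cctld_whereis thathost.tt should give the same answer as the cdbget tt lookup. Lower-casing the label matters because the keys were normalized to lower case when the cdb was made.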

The cdb package also includes some utilities for summarizing and validating cdb files. For example, to check that a cdb file is ok and internally consistent, use cdbtest:

$ cdbtest < cctlds.cdb
found: 239
different record: 0
bad length: 0
not found: 0
untested: 0

The "found" line reports the total number of records; here there are 239 in the cctlds database. The "different record" line reports the number of records whose key turned up a different record on lookup, which is what happens when keys are duplicated. For this database there are none. There are no built-in constraints against duplicate keys, though, so if you want to make sure there aren't any, use cdbtest to verify.

The next two items, "bad length" and "not found", should be 0. If not, something bad happened. The "untested" item refers to records with keys greater than 1024 bytes in length; cdbtest simply skips these.
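
If a cdb is rebuilt from a script, the same check can be automated. The sketch below is a deliberately strict, hypothetical helper (the name cdbtest_ok is not part of the cdb package): it reads cdbtest output on stdin and fails if any counter other than "found" is nonzero, which also flags duplicate keys and untested records even though cdb itself tolerates them.

```shell
#!/bin/sh
# cdbtest_ok: hypothetical helper; reads cdbtest output on stdin.
# Exits 0 only if every counter except "found" is zero.
cdbtest_ok() {
  awk -F': ' '
  # any nonzero counter other than "found" is reported and remembered
  $1 != "found" && $2 + 0 != 0 { bad = 1; print "nonzero: " $0 }
  END { exit (bad ? 1 : 0) }
  '
}

# sample run on output like that shown above
printf '%s\n' 'found: 239' 'different record: 0' 'bad length: 0' \
  'not found: 0' 'untested: 0' | cdbtest_ok && echo "looks clean"
```

Against a real database, the pipeline would be cdbtest < cctlds.cdb | cdbtest_ok.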

To get some "distance" statistics for a cdb, use cdbstats:

$ cdbstats < cctlds.cdb
records        239
d0             239
d1               0
d2               0
d3               0
d4               0
d5               0
d6               0
d7               0
d8               0
d9               0
>9               0

These numbers may be used to assess the efficiency of the hash. Sometimes different keys will hash to the same slot. When this happens, cdb does not chain records together; it probes linearly, stepping to the next slot in the table until it finds the record or reaches an empty slot. This report shows the number of records found at each distance from the slot their hash points to. In this particular case, all records are found at "d0", the slot computed directly from each key's hash. In other words, no lookup in the cctlds cdb ever has to step past a collision.
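
For the curious, the hash itself is small enough to sketch in shell. As I read the cdb format description, the hash starts at 5381 and for each byte of the key computes h = ((h << 5) + h) xor byte, truncated to 32 bits; the low 8 bits then select one of 256 hash tables. The function names below are invented for illustration, keys are assumed to be plain ASCII, and the shell is assumed to do 64-bit arithmetic.

```shell
#!/bin/sh
# cdb_hash KEY: print the 32-bit cdb hash of KEY (assumed plain ASCII).
# Plain POSIX sh has no local variables, so these names are global.
cdb_hash() {
  key=$1
  h=5381                         # starting value of the hash
  while [ -n "$key" ]; do
    rest=${key#?}                # key without its first character
    ch=${key%"$rest"}            # the first character itself
    c=$(printf '%d' "'$ch")      # numeric value of that character
    # h = h*33 xor c, kept to 32 bits
    h=$(( (((h << 5) + h) ^ c) & 4294967295 ))
    key=$rest
  done
  printf '%s\n' "$h"
}

# cdb_table KEY: print which of the 256 hash tables KEY falls into.
cdb_table() {
  echo $(( $(cdb_hash "$1") & 255 ))
}

for k in sc tt to; do
  printf '%s: hash %s, table %s\n' "$k" "$(cdb_hash "$k")" "$(cdb_table "$k")"
done
```

With only 239 records spread over 256 tables, most tables hold a record or two at most, which is consistent with everything landing at d0.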

Finally, you may sometimes want to spit all the records back out of a cdb. Use cdbdump:

$ cdbdump < cctlds.cdb
+2,7:ad->andorra
+2,20:ae->united arab emirates
+2,11:af->afghanistan
<snip>
+2,12:za->south africa
+2,6:zm->zambia
+2,8:zw->zimbabwe

Note that the output format of cdbdump is the same as the input format to cdbmake, including the empty line at the end of the data.
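
Because the record format carries explicit lengths, it is also easy to turn a dump back into something like the original colon-delimited file. Here is a rough sketch; the function name dump_to_colon is made up, and it assumes that no key or data value contains a newline (true for the cctlds data):

```shell
#!/bin/sh
# dump_to_colon: hypothetical helper; reads cdbdump output on stdin and
# prints "key:data" lines. Assumes one record per line.
dump_to_colon() {
  awk '
  /^\+/ {
    colon = index($0, ":")                       # end of "+klen,dlen"
    split(substr($0, 2, colon - 2), len, ",")    # len[1]=klen, len[2]=dlen
    key  = substr($0, colon + 1, len[1])
    data = substr($0, colon + 1 + len[1] + 2, len[2])   # skip the "->"
    print key ":" data
  }'
}

# sample run on two records and the terminating blank line
printf '%s\n' '+2,7:ad->andorra' '+2,20:ae->united arab emirates' '' |
  dump_to_colon
```

Using the lengths rather than splitting on "->" keeps the parse correct even if a data value itself happens to contain those characters. On a real database this would be run as cdbdump < cctlds.cdb | dump_to_colon.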

Oh yeah, where is cr.yp.to:

$ cdbget to < cctlds.cdb && echo
tonga

Tonga? Say, isn't that a long way from Chicago?...


Copyright © 2003, 2004, Wayne Marshall.
All rights reserved.

Last edit 2004.03.09, wcm.