the djb way


using heterogeneous records

If you are trained to think in terms of relational database systems, you may be wondering how you might use a set of cdb files to represent a number of "tables" as is typical in a relational model of data.

That is, the RDBMS way of thinking creates separate tables/indexes for different types of records, where all the records in each table are defined to be of the same specific type. You generally don't mix different types of records in the same table.

But all the integrity and simplicity of the cdb paradigm is based on representing application data within a single cdb file. If you start splitting data into multiple cdb's, in such a way as to require the concurrent update of more than one cdb at a time, you lose the "atomicity" and uninterrupted read access that the cdb model otherwise provides.

The problem is generally solved by using a single cdb file to represent a collection of records of different types: a cdb places no constraints on the use of heterogeneous record types within the same file. In practice, different types of records within a cdb are easily differentiated by prepending the key with a unique record type indicator.

It is a different way of thinking, and it pretty much blows the goals of normalized relations for tabular data all to hell. But for many applications of constant databases, the use of heterogenous records works pretty well.

Let's reconsider the cctlds database as an example. It permitted us only to do one-way lookups: from country code as key, to the name of the country represented by that code.

But what if we also wanted to support reverse lookups? That is, given some country name, find the top-level domain code for that country. In this case, we would need two types of keyed records in the same data file:

To differentiate among these two record types, we can simply prepend a dot (".") character to the key when the key identifies a country code. The country name key records will be marked by the absence of this leading dot.

The script in the previous example then becomes

# convert countrycodes data file to cdb
# ===
awk '
  FS = ":"

/^[A-Z]/ {
    code = tolower($1)
    code_len = length(code)
    country = tolower($2)
    country_len = length(country)

    # country code key:
    print "+" code_len+1  "," country_len ":" "." code  "->" country
    # country name key:
    print "+" country_len  "," code_len ":" country  "->" code

  print ""
' | /usr/local/bin/cdbmake "$@"

### that's all, folks!

As before, awk parses the input file, splitting each line into country code and country name fields. Then it writes two lines into the cdb for each record. The first for a country code key record, with the country code prepended with a "." dot. The second for a country name key record.

This script may then be used to generate a new cdb from countrycodes with the usual cdbmake semantics:

$ ./ cctlds-v2.cdb cctlds-v2.tmp < countrycodes

The resulting cctlds-v2.cdb now supports lookups by either country code or country name:

$ cdbget .to < cctlds-v2.cdb
$ cdbget tonga  < cctlds-v2.cdb

Out of curiousity, verify the intergrity of the cdb:

$ cdbtest < cctlds-v2.cdb
found: 478
different record: 0
bad length: 0
not found: 0
untested: 0

The cdb is okay, now with twice the number of records as the first version. The database is still indexed with unique keys, since the "different record" stat is still 0. What about the cdbstats:

$ cdbstats < cctlds-v2.cdb
records        478
d0             421
d1              42
d2              13
d3               2
d4               0
d5               0
d6               0
d7               0
d8               0
d9               0
>9               0

Now we see some hash collisions, but 88% of the records are still found at the location directly computed from the hash function, and almost 97% found within a distance of d1.

Although this example is very simple, it hopefully illustrates a methodology for using cdb to model any number of "views" required by an application. This is the technique tinydns uses to store NS records, MX records, and A records, etc., all in the same data file.

Copyright © 2003, 2004, Wayne Marshall.
All rights reserved.

Last edit 2004.03.09, wcm.