the djb way: cdb

After installing the cdb package, the following binaries will be found in /usr/local/bin (or as otherwise configured in conf-home):

cdbget
cdbmake
cdbdump
cdbstats
cdbtest
cdbmake-12
cdbmake-sv

Of these, cdbmake and cdbget provide the primary interface utilities to create and access cdb data files.

To create a cdb file, use cdbmake:

$ cdbmake cdb temp < data

The first argument is the filename of the cdb file to create. The second argument is the name of a temporary file cdbmake should use for interim processing. Finally, cdbmake reads its input from stdin; the command-line here shows data redirected from an input file by the shell.

In the cdb paradigm, a cdb data file is always built or rebuilt from scratch; there are no facilities for in-place or in-core modification of individual records. Although this may at first seem unconventional and inefficient, it is practical because cdbmake is on the order of 100 times faster than, say, adding records to an equivalent Berkeley DB or GNU DBM database. An entire 10,000-record database can therefore be rebuilt from scratch with cdb for roughly the same "cost" as adding a mere 100 records to a db or gdbm hash file.

For reliability, cdbmake first constructs the cdb database in a temporary file. Then, once the temporary file has been successfully written to disk, it is "atomically" moved into the target output file location. This approach protects the original cdb file from unintended corruption, such as from a power failure during data file creation. It also gives applications uninterrupted read access to the data, without the need for any special locking operations on the cdb file itself.
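
The write-then-rename pattern described above can be imitated in plain shell. The following is only a sketch of the general technique, not cdbmake's own code: atomic_write is a hypothetical helper name, and it assumes the temporary file and the target live on the same filesystem, so that mv amounts to a simple rename.

```shell
#!/bin/sh
# atomic_write TARGET TMP  (hypothetical helper, not part of the cdb package)
# Copy stdin to TMP, then rename TMP over TARGET in one step.
# Assumes TMP and TARGET are on the same filesystem, so mv is a rename.
atomic_write() {
    target=$1
    tmp=$2
    # Write everything to the temporary file first; on failure, clean up
    # and leave any existing TARGET untouched.
    if ! cat > "$tmp"; then
        rm -f "$tmp"
        return 1
    fi
    # Readers see either the old file or the complete new one, never a
    # partially written file.
    mv -f "$tmp" "$target"
}
```

Usage might look like this:

$ printf 'new contents\n' | atomic_write report.txt report.tmp

Note that this sketch does not force the data to disk before renaming, so it illustrates only the rename step.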

The input data consists of plain-text, one record per line, with each record described according to the following syntax:

+keylen,datalen:key->data

Both key and data may be arbitrary byte strings, including NUL bytes. The only limitation is that the cdb file may not exceed 4 gigabytes. The end of data is signaled by an empty line.

As an example, here is a record coded for input to an airport code database:

+3,14:EBB->Entebbe,Uganda

In this example, the key is the string "EBB" and the data is the string "Entebbe,Uganda".
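
To make the length prefixes concrete, here is a minimal sketch of a shell function that formats a single record. The name cdb_record is hypothetical (it is not part of the cdb package), and the sketch assumes that neither key nor data contains NUL bytes, which shell variables cannot hold.

```shell
#!/bin/sh
# cdb_record KEY DATA  (hypothetical helper, not part of the cdb package)
# Print one record in cdbmake input format: +keylen,datalen:key->data
# The body runs in a subshell so the locale change does not leak out.
cdb_record() (
    # Use the C locale so that ${#var} counts bytes rather than
    # multibyte characters in shells that are locale-aware.
    LC_ALL=C
    export LC_ALL
    printf '+%d,%d:%s->%s\n' "${#1}" "${#2}" "$1" "$2"
)
```

Called as cdb_record EBB Entebbe,Uganda, it prints the record shown above. A complete input stream for cdbmake would still need the terminating empty line.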

At first glance it may seem inconvenient to construct input data with precomputed key and data lengths. In practice, however, it is quite simple to preprocess input and munge it into the required format. Consider this cdbmake-12 script, which is included in the cdb installation:

#!/bin/sh
# WARNING: This file was auto-generated. Do not edit!
awk '
  /^[^#]/ {
    print "+" length($1) "," length($2) ":" $1 "->" $2
  }
  END {
    print ""
  }
' | /usr/local/bin/cdbmake "$@"

Here awk is used as a preprocessor, set up for an input file of whitespace-delimited key/data pairs, with comments allowed by ignoring lines beginning with "#". It computes the lengths of the key and data, prints the record in the proper format, and pipes the result into the cdbmake utility. The END block of the awk script adds the empty line required to signal the end of input.
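
Because cdbmake-12 reads only $1 and $2, data containing a space would be cut off at the first space. The same preprocessing idea extends easily; the following is a sketch of one possible variant, not a script shipped with cdb. The function name to_cdb_records is hypothetical. It treats the first field as the key and the rest of the line as the data, and it assumes single-byte (or C locale) input so that awk's length() matches the byte count cdbmake expects.

```shell
#!/bin/sh
# to_cdb_records  (hypothetical helper, not part of the cdb package)
# Read "key data..." lines on stdin and write cdbmake input on stdout.
# The first field is the key; everything after it is the data.
to_cdb_records() {
    LC_ALL=C awk '
      # Skip comment lines and lines without both a key and data.
      /^[^#]/ && NF >= 2 {
        key = $1
        data = $0
        # Strip the key and the whitespace around it, keeping the rest.
        sub(/^[ \t]*[^ \t]+[ \t]+/, "", data)
        print "+" length(key) "," length(data) ":" key "->" data
      }
      END {
        # Empty line that signals end of input to cdbmake.
        print ""
      }
    '
}
```

The output would then be piped into cdbmake in the same way as in cdbmake-12, for example:

$ to_cdb_records < airport.dat | cdbmake airport.cdb airport.tmp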

Use of such a preprocessor allows the input file for the airport code database to look something like this instead:

# airport.dat
# international airport codes
#key city,country
#== ==============
EBB Entebbe,Uganda
EBJ Esbjerg,Denmark
# etc.

Processed like so:

$ cdbmake-12 airport.cdb airport.tmp < airport.dat
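
Since a cdb is always rebuilt rather than modified, adding a record amounts to editing the text source and rerunning the build. A minimal sketch, assuming the file names from the example above and cdbmake-12 on the PATH; add_record is a hypothetical helper name:

```shell
#!/bin/sh
# add_record KEY DATA  (hypothetical helper, not part of the cdb package)
# Append one key/data pair to airport.dat, then rebuild airport.cdb from
# the whole source file. KEY and DATA must not contain whitespace, because
# cdbmake-12 reads only the first two fields of each line.
add_record() {
    printf '%s %s\n' "$1" "$2" >> airport.dat &&
        cdbmake-12 airport.cdb airport.tmp < airport.dat
}
```

The rebuild is skipped if the append fails, so a write error on airport.dat leaves the existing airport.cdb in place.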

Once the cdb datafile has been created, use the cdbget utility for key-based retrieval:

$ cdbget key [skip] < cdb

The first argument is the key to use for lookup in the cdb file. The second argument is an optional skip parameter, used when records may have duplicate keys. The cdb file itself is read from stdin. Note, however, that cdbget requires the cdb file to be seekable; reading from a pipe won't work here. The example shows how the redirect facilities of the shell may be used to read the cdb from a file.

Here is an example lookup from the airport database:

$ cdbget EBB < airport.cdb && echo
Entebbe,Uganda

If the key is found, cdbget prints the associated data and exits with status 0. Because the data is printed exactly as stored, with no trailing newline, the example uses the && echo idiom to add one after a successful lookup. If no record is found, cdbget exits with status 100.
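
A script can branch on these exit codes. The following is a sketch only: lookup is a hypothetical wrapper name, and it assumes cdbget is on the PATH. Any status other than 0 or 100 is simply reported as a failure.

```shell
#!/bin/sh
# lookup KEY CDBFILE  (hypothetical wrapper around cdbget)
# Print the data for KEY followed by a newline, or report on stderr
# why nothing was printed. Returns cdbget's exit status.
lookup() {
    rc=0
    cdbget "$1" < "$2" || rc=$?
    case $rc in
        0)   echo ;;                                        # found: add newline
        100) echo "no record for key: $1" >&2 ;;            # not found
        *)   echo "cdbget failed with status $rc" >&2 ;;    # anything else
    esac
    return $rc
}
```

Usage would be along the lines of:

$ lookup EBB airport.cdb
Entebbe,Uganda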

The example here also suggests that any number of "fields" may be represented within the data associated with a key. In this case the fields are comma-delimited, but any scheme can be used. Then awk or Perl might be used to further parse the results of a lookup into separate fields, such as city and country.
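
As a sketch of that kind of post-processing, the function below splits the comma-delimited data into two named fields with awk. The name show_airport and the output layout are illustrative assumptions, not part of the cdb package.

```shell
#!/bin/sh
# show_airport  (hypothetical helper)
# Read "city,country" on stdin, as printed by cdbget, and label each field.
show_airport() {
    awk -F, '{ printf "city=%s country=%s\n", $1, $2 }'
}
```

It would be used at the end of a lookup pipeline:

$ cdbget EBB < airport.cdb | show_airport
city=Entebbe country=Uganda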

There are no built-in constraints against entering records with duplicate keys when creating a cdb database. If a cdb has duplicate keys, the second argument to cdbget can be used to iterate over successive records. This may be demonstrated with a simple wrapper script for cdbget called cdbgetm.sh:

#!/bin/sh
# cdbgetm.sh
# cdbget for multiple records/key
# ===
skip=0
# Ask cdbget for the record after ${skip} earlier matches; stop at the first miss.
while cdbget "$1" "$skip" ; do
    echo ""
    skip=$((skip + 1))
done
### that's all, folks!

Usage on the command line would look like:

$ ./cdbgetm.sh EBB < airport.cdb
Entebbe,Uganda

If the airport cdb happened to have more than one record with a key matching "EBB", cdbget would print each as the "skip" parameter is incremented, until no more matching records are found. Records with duplicate keys will be found in the same order as entered with cdbmake.

That's a quick overview of the cdb package, designed to show that entire cdb applications can easily be built up from just a couple of simple command-line tools, cdbmake and cdbget, plus a little shell-script glue.

The next section describes a simple yet complete example, cctlds, a lookup database for country code top-level domains.