red-black revision bf668d00172e2308da0e01048284d7efa85495b1
0b062f4990db5cc6db2fe3398926f71b92a67407Brian Wellington Red-Black Tree Implementation Notes
4a14ce5ba00ab7bc55c99ffdcf59c7a4ab902721Automatic UpdaterBIND9's basic name storage mechanism is to use a modified form of
0b062f4990db5cc6db2fe3398926f71b92a67407Brian Wellingtonbalanced binary tree known as a red-black tree. Red-black trees
0b062f4990db5cc6db2fe3398926f71b92a67407Brian Wellingtonprovide for relatively efficient storage, retrieval and removal of
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austeindata while maintaining the lexical order of all stored keys, a
9016767f4e15191b7c763b8a4ad36a57dc2705a2Mark Andrewsnecessary function for DNS security.
9016767f4e15191b7c763b8a4ad36a57dc2705a2Mark AndrewsA red-black tree is a balanced binary tree named for the coloring that
9016767f4e15191b7c763b8a4ad36a57dc2705a2Mark Andrewsis done in the tree, identifying each node as either red or black.
9016767f4e15191b7c763b8a4ad36a57dc2705a2Mark AndrewsThere are two simple rules for maintaining the color of nodes:
0b062f4990db5cc6db2fe3398926f71b92a67407Brian Wellington (1) A red node has only black children.
f8e3e03cacd16ffb923a9603fca23a9e1a1fee07Automatic Updater (2) The path from the root to any leaf node always includes the
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austein same number of black nodes.
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob AusteinWhenever a key is added or removed, adjustments are made to adhere to
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austeinthose two rules. These adjustments are relatively cheap to make but
ca67ebfe9eef0b8f04179f7e511a19e0337a5422Automatic Updatermaintain the balance of the tree, thus making for efficient addition,
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austeinlookup and deletion operations, all of which are O(log N). The color
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austeinof a node is not relevant to external users of the tree; it is needed
5a4557e8de2951a2796676b5ec4b6a90caa5be14Mark Andrewsonly to maintain the balance of the tree.
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob AusteinFor more information on basic red-black trees, see _Introduction to
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob AusteinAlgorithms_, Cormen, Leiserson, and Rivest, MIT Press / McGraw Hill,
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austein1990, ISBN 0-262-03141-8, chapter 14.
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob AusteinIn BIND9, the red-black tree implementation uses DNS names as keys,
f8e3e03cacd16ffb923a9603fca23a9e1a1fee07Automatic Updaterand can store arbitrary data with each key value. "name" and "key"
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austeinare used interchangably in this document.
f8e3e03cacd16ffb923a9603fca23a9e1a1fee07Automatic UpdaterThe basic red-black tree algorithm is further adapted for use in BIND9
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austeinto incorporate the notion of hierarchy, creating a tree of red-black
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austeintrees. Where there is more than one name with a common suffix, all
d71e2e0c61df16ff37c9934c371a4a60c08974f7Mark Andrewsnames with that suffix are stored in their own red-black tree, with a
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic Updaterdown pointer from the suffix locating the subtree.
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob AusteinFor example, consider storing the following names:
ad671240d635376dd8681550eebee799d2e3d1fdAutomatic UpdaterNo matter which order the keys were added, this would result in a tree
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austeinthat can be visualized as:
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic UpdaterThis tree shows that when there is no key for a particular label, and
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic Updaterwhen there is only one known label for its immediate subordinate, then
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austeinmultiple labels can appear in a single node, such as at d.e.f and g.h.
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob AusteinIt also demonstrates that there can be more nodes in the tree of trees
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austeinthan there are actual keys (which degrades the O(log N) performance
731cc132f22dbc9e0ecd7035dce314a61076d31bAutomatic Updatermarginally); the nodes at d.e.f and w.y do not represent keys.
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob AusteinAs an aside, remember that when ordering DNS names, labels are
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austeinexamined from the right, therefore w.y sorts after x and before z.
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic UpdaterWhen bitfields are fully supported (they aren't yet) then a split can
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austeinoccur not only on a regular label boundary, but also between any two
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austeinbits. The common-suffix rules will be applied to keep as many bits
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austeintogether as possible.
b272d38cc5d24f64c0647a9afb340c21c4b9aaf7Evan HuntIn the current implementation of the tree of trees, a node is
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austeinconsidered to "formally" exist only if it has data associated with
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austeinit. So if the above tree then had the key d.e.f added to it, the
b272d38cc5d24f64c0647a9afb340c21c4b9aaf7Evan Huntoperation would succeed rather than getting an "already exists"
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob AusteinAlong the same lines, if a key is added with a name which is a proper
b272d38cc5d24f64c0647a9afb340c21c4b9aaf7Evan Huntsuperdomain of the name stored in an existing node, the operation will
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic Updatersucceed by splitting the existing node into one node that is the key
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic Updaterand another node that is the remaining parts of the name. Adding e.f
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic Updaterto the above tree results in the top level red-black tree having a
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic Updaternode named e.f where the current d.e.f is, and a down pointer from
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic Updaterd.e.f to a "tree" of a single node named d. The down pointer from d
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic Updaterwould be kept to the level which has x, w.y, and z.
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic UpdaterA similar split of d.e.f would occur if the name k.e.f were added.
b272d38cc5d24f64c0647a9afb340c21c4b9aaf7Evan HuntThe top level tree would have the node e.f with a down pointer to a
b272d38cc5d24f64c0647a9afb340c21c4b9aaf7Evan Huntlevel that had both d and k, and d would continue to have its down
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austeinpointer to the x, w.y and z level.
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob AusteinIt is guaranteed when splitting that external references to the node
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austeinthat is split will remain valid --- in the previous examples, anything
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austeinthat was pointing to the node that was d.e.f will still point to the
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austeinnode that is now just d.
2f8d63983c297c62630044d28a6f66676b4d339dMark AndrewsWhen deleting keys, nodes can be rejoined. If both of p.w.y.d.e.f and
2f8d63983c297c62630044d28a6f66676b4d339dMark Andrewsq.w.y.d.e.f were removed from the example tree, the node named w.y
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austeinwould become o.w.y. Unlike splitting, it is _not_ guaranteed that
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic Updaterexternal references remain consistent; sometimes they will, sometimes
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic Updaterthey won't. Also, note that deletion is not perfectly symmetric with
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic Updateraddition. If you "undo" the last addition with a deletion of the same
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic Updaterkey then the tree of trees is not guaranteed to have exactly the same
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic Updaterstructure as it had prior to the addition. Sometimes, but not always.
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic UpdaterRejoining does not happen if it would violate any of the rules that
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic Updatercause a split. o would not be rejoined with w.y if w.y had data
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic Updaterassociated with the key; o would remain as a single node on its own
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic Updaterlevel. This emphasizes the rule that a node is considered to formally
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic Updaterexist only if data is associated with it, because even if w.y.d.e.f
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic Updaterhad been explicitly added as a key but with no data, then o would
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic Updaterstill be merged with the w.y node when p and q were deleted.
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic UpdaterSearching for a node generally returns one of three possible results:
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic Updatereither the key is found, a superdomain (partial match) of the key is
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic Updaterfound, or no part of the key is found. The first and last are rather
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austeinobvious, and the second result basically means that a hierarchically
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austeinenclosing name is found; e.g, searching for bb.rc.vix.com turned up
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austeinrc.vix.com, but not the full name.
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob AusteinNo locking is done within the RBT library. @@@
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob AusteinEXTERNAL PROGRAMMATIC DETAILS
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob AusteinThis section details the functions used to interact with the BIND9
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austeinred-black tree library, or RBT for short.
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob AusteinA source file that will be using RBT will usually need to include
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic Updater<dns/rbt.h>. This header file automatically includes <isc/result.h),
f8e3e03cacd16ffb923a9603fca23a9e1a1fee07Automatic UpdaterThe rbt.h file has more complete descriptions of each of the functions
f8e3e03cacd16ffb923a9603fca23a9e1a1fee07Automatic Updaternamed here, including what is required for each argument, what each
f8e3e03cacd16ffb923a9603fca23a9e1a1fee07Automatic Updaterfunction ensures (and might not ensure) will occur, and the full range
f8e3e03cacd16ffb923a9603fca23a9e1a1fee07Automatic Updaterof possible results for each call. Note well: if a function returns a
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austeindns_result_t rather than void, it definitely means there is something
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austeinthat can go possibly wrong in the function and it should be checked by
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob AusteinA new tree of trees must be initialized using:
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austein dns_result_t dns_rbt_create(isc_mem_t *mctx, void (*deleter)(void *, void *),
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austein void *deleter_arg, dns_rbt_t **rbtp);
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob AusteinThe memory context, mctx, must be a non-null pointer that was
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austeininitialized with isc_mem_create(). The deleter argument, if non-null,
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austeinshould point to a function that is responsible for cleaning up any
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic Updatermemory associated with the data pointer of a node when the node is
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic Updaterdeleted. It is passed the deleted node's data pointer as its first
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic Updaterargument and deleter_arg as its second argument.
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob AusteinAfter initializing an RBT manager, to add keys to the tree, use:
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic Updater dns_result_t dns_rbt_addname(dns_rbt_t *rbt, dns_name_t *name, void *data);
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob AusteinThe name _must_ be an absolute name. It is not required that the data
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austeinpointer be non-null, but it is recommended that it point to something,
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austeineven just invalid memory, because of the various searching and
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austeindeletion issues described in the previous section. The RBT code will
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austeinnot attempt to dereference the pointer.
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob AusteinTo find a key in the tree, use:
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austein dns_result_t dns_rbt_findname(dns_rbt_t *rbt, dns_name_t *name, void **data);
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob AusteinThe data parameter must not be NULL, but *data must be NULL. The
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austeinresult will be either DNS_R_SUCCESS, DNS_R_PARTIALMATCH or
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob AusteinDNS_R_NOTFOUND. In the first case, an exact match was found for the
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austeinname and there was an associate data pointer, which is returned via
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austeinthe data parameter. A partial match results when the name has not
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austeinbeen found but a superdomain name, with data, does exist; then the
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austeindata for that name is returned in the data parameter. If no data is
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austeinfound for the name or a superdomain, *data will remain NULL.
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob AusteinINTERNAL PROGRAMMATIC DETAILS
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob AusteinThis section is mainly relevant to the RBT DB implementation. It is
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austeinhighly recommended that programmers using the RBT library stick to the
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic Updaterfunctions named in the previous section.
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic UpdaterThe dns_rbt_addname and dns_rbt_findname functions named in the
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic Updaterprevious section are wrappers around dns_rbt_addnode and
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic Updaterdns_rbt_findnode. The *node functions for the most part do not
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic Updaterparticularly care whether a node has an associated data pointer or
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic Updaternot, whereas the *name functions do. The one exception to this is
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic Updaterthat when a PARTIALMATCH is returned for a search, the indicated node
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic Updateris the deepest match that has data, rather than just the deepest
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic Updatermatch. This inconsistency could be addressed by adding another
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic Updaterparameter to dns_rbt_findnode.
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic UpdaterEach node in the tree of trees is represented by the following structure:
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic Updater typedef struct dns_rbtnode {
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austein struct dns_rbtnode *left;
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austein struct dns_rbtnode *right;
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austein struct dns_rbtnode *down;
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austein * The following bitfields add up to a total bitwidth of 32.
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austein * The range of values necessary for each item is indicated,
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austein * but in the case of "attributes" the field is wider to accomodate
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austein * possible future expansion. "offsetlen" could be one bit
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austein * narrower by always adjusting its value by 1 to find the real
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austein * offsetlen, but doing so does not gain anything (except perhaps
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austein * another bit for "attributes", which doesn't yet need any more).
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austein unsigned int color:1; /* range is 0..1 */
60e5e10f8d2e2b0c41e8abad38cacd867caa6ab2Rob Austein unsigned int attributes:6; /* range is 0..2 */
f8e3e03cacd16ffb923a9603fca23a9e1a1fee07Automatic Updater unsigned int namelen:8; /* range is 1..255 */
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic Updater unsigned int offsetlen:8; /* range is 1..128 */
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic Updater unsigned int padbytes:9; /* range is 0..380 */
2895f101b5585a19015ac2c2c1e1812ac467fa12Automatic Updater * These values are used in the RBT DB implementation. The
2895f101b5585a19015ac2c2c1e1812ac467fa12Automatic Updater * appropriate node lock must be held before accessing them.
2895f101b5585a19015ac2c2c1e1812ac467fa12Automatic Updater unsigned int dirty:1;
2895f101b5585a19015ac2c2c1e1812ac467fa12Automatic Updater unsigned int locknum:DNS_RBT_LOCKLENGTH;
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic Updater unsigned int references:DNS_RBT_REFLENGTH;
0a7ed88633a680bb881868b75ded4d09a7bbbc50Automatic Updater } dns_rbtnode_t;