red-black revision 499b34cea04a46823d003d4c0520c8b03e8513cb
10N/ACopyright (C) 1999-2001 Internet Software Consortium.
10N/ASee COPYRIGHT in the source root or http://isc.org/copyright.html for terms.
10N/A
10N/A$Id: red-black,v 1.8 2001/01/09 21:46:51 bwelling Exp $
10N/A
10N/A Red-Black Tree Implementation Notes
10N/A
10N/AOVERVIEW
10N/A
10N/ABIND9's basic name storage mechanism is to use a modified form of
10N/Abalanced binary tree known as a red-black tree. Red-black trees
10N/Aprovide for relatively efficient storage, retrieval and removal of
10N/Adata while maintaining the lexical order of all stored keys, a
10N/Anecessary function for DNS security.
10N/A
10N/ADESCRIPTION
10N/A
10N/AA red-black tree is a balanced binary tree named for the coloring that
10N/Ais done in the tree, identifying each node as either red or black.
10N/AThere are two simple rules for maintaining the color of nodes:
10N/A (1) A red node has only black children.
10N/A (2) The path from the root to any leaf node always includes the
10N/A same number of black nodes.
10N/A
10N/AWhenever a key is added or removed, adjustments are made to adhere to
10N/Athose two rules. These adjustments are relatively cheap to make but
10N/Amaintain the balance of the tree, thus making for efficient addition,
10N/Alookup and deletion operations, all of which are O(log N). The color
10N/Aof a node is not relevant to external users of the tree; it is needed
10N/Aonly to maintain the balance of the tree.
10N/A
10N/AFor more information on basic red-black trees, see _Introduction to
10N/AAlgorithms_, Cormen, Leiserson, and Rivest, MIT Press / McGraw Hill,
10N/A1990, ISBN 0-262-03141-8, chapter 14.
10N/A
10N/AIn BIND9, the red-black tree implementation uses DNS names as keys,
10N/Aand can store arbitrary data with each key value. "name" and "key"
10N/Aare used interchangably in this document.
10N/A
10N/AThe basic red-black tree algorithm is further adapted for use in BIND9
10N/Ato incorporate the notion of hierarchy, creating a tree of red-black
10N/Atrees. Where there is more than one name with a common suffix, all
10N/Anames with that suffix are stored in their own red-black tree, with a
10N/Adown pointer from the suffix locating the subtree.
10N/A
10N/AFor example, consider storing the following names:
10N/A a x.d.e.f o.w.y.d.e.f
10N/A b z.d.e.f p.w.y.d.e.f
10N/A c g.h q.w.y.d.e.f
10N/A
10N/ANo matter which order the keys were added, this would result in a tree
10N/Athat can be visualized as:
10N/A
10N/A b
10N/A / \
10N/A a d.e.f
10N/A /|\
10N/A c | g.h
10N/A |
10N/A w.y
10N/A /|\
10N/A x | z
10N/A |
10N/A p
10N/A / \
10N/A o q
10N/A
10N/AThis tree shows that when there is no key for a particular label, and
10N/Awhen there is only one known label for its immediate subordinate, then
10N/Amultiple labels can appear in a single node, such as at d.e.f and g.h.
10N/AIt also demonstrates that there can be more nodes in the tree of trees
10N/Athan there are actual keys (which degrades the O(log N) performance
10N/Amarginally); the nodes at d.e.f and w.y do not represent keys.
10N/A
10N/AAs an aside, remember that when ordering DNS names, labels are
10N/Aexamined from the right, therefore w.y sorts after x and before z.
10N/A
10N/AA split can occur not only on a regular label boundary, but also
10N/Abetween any two bits in an EDNS bitstring label. The common-suffix
10N/Arules will be applied to keep as many bits together as possible.
10N/A
10N/AIn the current implementation of the tree of trees, a node is
10N/Aconsidered to "formally" exist only if it has data associated with
10N/Ait. So if the above tree then had the key d.e.f added to it, the
10N/Aoperation would succeed rather than getting an "already exists"
10N/Aerror.
10N/A
10N/AAlong the same lines, if a key is added with a name which is a proper
10N/Asuperdomain of the name stored in an existing node, the operation will
10N/Asucceed by splitting the existing node into one node that is the key
10N/Aand another node that is the remaining parts of the name. Adding e.f
10N/Ato the above tree results in the top level red-black tree having a
10N/Anode named e.f where the current d.e.f is, and a down pointer from
10N/Ad.e.f to a "tree" of a single node named d. The down pointer from d
10N/Awould be kept to the level which has x, w.y, and z.
10N/A
10N/AA similar split of d.e.f would occur if the name k.e.f were added.
10N/AThe top level tree would have the node e.f with a down pointer to a
10N/Alevel that had both d and k, and d would continue to have its down
10N/Apointer to the x, w.y and z level.
10N/A
10N/AIt is guaranteed when splitting that external references to the node
10N/Athat is split will remain valid --- in the previous examples, anything
10N/Athat was pointing to the node that was d.e.f will still point to the
10N/Anode that is now just d.
10N/A
10N/AWhen deleting keys, nodes can be rejoined. If both of p.w.y.d.e.f and
10N/Aq.w.y.d.e.f were removed from the example tree, the node named w.y
10N/Awould become o.w.y. Unlike splitting, it is _not_ guaranteed that
10N/Aexternal references remain consistent; sometimes they will, sometimes
10N/Athey won't. Also, note that deletion is not perfectly symmetric with
10N/Aaddition. If you "undo" the last addition with a deletion of the same
10N/Akey then the tree of trees is not guaranteed to have exactly the same
10N/Astructure as it had prior to the addition. Sometimes, but not always.
10N/A
10N/ARejoining does not happen if it would violate any of the rules that
10N/Acause a split. o would not be rejoined with w.y if w.y had data
10N/Aassociated with the key; o would remain as a single node on its own
10N/Alevel. This emphasizes the rule that a node is considered to formally
10N/Aexist only if data is associated with it, because even if w.y.d.e.f
10N/Ahad been explicitly added as a key but with no data, then o would
10N/Astill be merged with the w.y node when p and q were deleted.
10N/A
10N/ASearching for a node generally returns one of three possible results:
10N/Aeither the key is found, a superdomain (partial match) of the key is
10N/Afound, or no part of the key is found. The first and last are rather
10N/Aobvious, and the second result basically means that a hierarchically
10N/Aenclosing name is found; e.g, searching for bb.rc.vix.com turned up
10N/Arc.vix.com, but not the full name.
10N/A
10N/ANo locking is done within the RBT library. @@@
10N/A
10N/ACHAINS
10N/A
10N/A@@@
10N/A
10N/AWhen a partial match is made, level_matches is set while the chain
10N/Apoints to the partial match node that was found. Then the chain is
10N/Aadjusted to point to the DNSSEC predecessor node, which might not even
10N/Abe under the same top level domain as the name that was searched for.
10N/AFor example, consider a database that had only the names vix.com and
10N/Aisc.org. A search for uu.net would leave the chain pointed to
10N/Avix.com, the DNSSEC predecessor. Though this might first appear to
10N/Acause level_matches to be bogus because the chain has been unwound and
10N/Asent down another path, note that the partial match node will always
10N/Abe in the chain of the predecessor, too --- and often the partial
10N/Amatch node will be the predecessor itself. In the vix.com/isc.org
10N/Aexample, the search for uu.net finds a partial match at ".", which is
10N/Aof course also in the path to the vix.com predecessor. A search for
10N/Awww.isc.org would find that isc.org is both the partial match and the
10N/Apredecessor.
EXTERNAL PROGRAMMATIC DETAILS
This section details the functions used to interact with the BIND9
red-black tree library, or RBT for short.
A source file that will be using RBT will usually need to include
<dns/rbt.h>. This header file automatically includes <isc/result.h),
<isc/mem.h>, <dns/types.h>, and <dns/name.h>.
The rbt.h file has more complete descriptions of each of the functions
named here, including what is required for each argument, what each
function ensures (and might not ensure) will occur, and the full range
of possible results for each call. Note well: if a function returns a
dns_result_t rather than void, it definitely means there is something
that can go possibly wrong in the function and it should be checked by
the caller.
A new tree of trees must be initialized using:
dns_result_t dns_rbt_create(isc_mem_t *mctx, void (*deleter)(void *, void *),
void *deleter_arg, dns_rbt_t **rbtp);
The memory context, mctx, must be a non-null pointer that was
initialized with isc_mem_create(). The deleter argument, if non-null,
should point to a function that is responsible for cleaning up any
memory associated with the data pointer of a node when the node is
deleted. It is passed the deleted node's data pointer as its first
argument and deleter_arg as its second argument.
After initializing an RBT manager, to add keys to the tree, use:
dns_result_t dns_rbt_addname(dns_rbt_t *rbt, dns_name_t *name, void *data);
The name _must_ be an absolute name. It is not required that the data
pointer be non-null, but it is recommended that it point to something,
even just invalid memory, because of the various searching and
deletion issues described in the previous section. The RBT code will
not attempt to dereference the pointer.
To find a key in the tree, use:
dns_result_t dns_rbt_findname(dns_rbt_t *rbt, dns_name_t *name, void **data);
The data parameter must not be NULL, but *data must be NULL. The
result will be either DNS_R_SUCCESS, DNS_R_PARTIALMATCH or
DNS_R_NOTFOUND. In the first case, an exact match was found for the
name and there was an associate data pointer, which is returned via
the data parameter. A partial match results when the name has not
been found but a superdomain name, with data, does exist; then the
data for that name is returned in the data parameter. If no data is
found for the name or a superdomain, *data will remain NULL.
INTERNAL PROGRAMMATIC DETAILS
This section is mainly relevant to the RBT DB implementation. It is
highly recommended that programmers using the RBT library stick to the
functions named in the previous section.
The dns_rbt_addname and dns_rbt_findname functions named in the
previous section are wrappers around dns_rbt_addnode and
dns_rbt_findnode. The *node functions for the most part do not
particularly care whether a node has an associated data pointer or
not, whereas the *name functions do. The one exception to this is
that when a PARTIALMATCH is returned for a search, the indicated node
is the deepest match that has data, rather than just the deepest
match. Even that behavior is selectable, however, using the boolean
empty_data_ok argument to dns_rbt_findnode.
Each node in the tree of trees is represented by the following structure:
typedef struct dns_rbtnode {
struct dns_rbtnode *left;
struct dns_rbtnode *right;
struct dns_rbtnode *down;
/*
* The following bitfields add up to a total bitwidth of 32.
* The range of values necessary for each item is indicated,
* but in the case of "attributes" the field is wider to accomodate
* possible future expansion. "offsetlen" could be one bit
* narrower by always adjusting its value by 1 to find the real
* offsetlen, but doing so does not gain anything (except perhaps
* another bit for "attributes", which doesn't yet need any more).
*/
unsigned int color:1; /* range is 0..1 */
unsigned int attributes:6; /* range is 0..2 */
unsigned int namelen:8; /* range is 1..255 */
unsigned int offsetlen:8; /* range is 1..128 */
unsigned int padbytes:9; /* range is 0..380 */
/*
* These values are used in the RBT DB implementation. The
* appropriate node lock must be held before accessing them.
*/
void *data;
unsigned int dirty:1;
unsigned int locknum:DNS_RBT_LOCKLENGTH;
unsigned int references:DNS_RBT_REFLENGTH;
} dns_rbtnode_t;
@@@