The BIND 9 DNS database allows named rdatasets to be stored and
retrieved. DNS databases are used to store two different categories
of data: authoritative zone data and non-authoritative cache data.
Unlike previous versions of BIND, which used a monolithic database,
BIND 9 has one database per zone or cache. Certain database
operations, for example updates, have differing requirements and
actions depending upon whether the database contains zone data or
cache data.

Database Semantics

A database instance has either zone semantics or cache semantics. The
semantics are chosen when the database is created and cannot be
changed. The differences between zone databases and cache databases
are discussed further below.

Reference Safety

It is a general principle of the BIND 9 project, and of the database
API, that all references returned to the caller remain valid until the
caller discards the reference.

The database interface also mandates that the rdata in a retrieved
rdataset shall remain unaltered while any reference to the rdataset is
held. Some other properties of the rdataset, e.g. its DNSSEC
validation status, may change.
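
The reference-safety rule can be illustrated with a minimal sketch.
This is not BIND 9 code; the Rdataset and Node classes and the "trust"
field are hypothetical names chosen for illustration. The idea shown
is that a replacement publishes a new object rather than mutating the
one a caller may still hold, while metadata such as validation status
is allowed to change.

```python
from dataclasses import dataclass


@dataclass
class Rdataset:
    """Hypothetical stand-in for a retrieved rdataset."""
    rdtype: str
    rdata: tuple            # immutable once published
    trust: str = "pending"  # mutable metadata, e.g. validation status


class Node:
    """Hypothetical database node holding one rdataset per type."""

    def __init__(self):
        self._rdatasets = {}

    def find(self, rdtype):
        # The caller's reference keeps the returned object alive.
        return self._rdatasets.get(rdtype)

    def replace(self, rdtype, rdata):
        # Publish a new object; the old one is never modified in place.
        self._rdatasets[rdtype] = Rdataset(rdtype, tuple(rdata))


node = Node()
node.replace("A", ["192.0.2.1"])
held = node.find("A")                            # caller holds a reference
node.replace("A", ["192.0.2.2", "192.0.2.3"])    # database moves on
held.trust = "secure"                            # metadata may still change
```

The held reference continues to see its original rdata even though
the node now returns different data to new lookups.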

Database Updates

A master zone is updated by a Dynamic Update message. A slave zone is
updated by IXFR or AXFR. AXFR provides the entire contents of the new
zone version, and replaces the entire contents of the database. IXFR
and Dynamic Update, although completely different protocols, have the
same basic database requirements. They are differential update
protocols, e.g. "add this record to the records at name 'foo'". The
updates are also atomic, i.e. they must either succeed completely or
fail completely. Changes must not become visible to clients until the
update has committed. In short, zone updates are transactional. This
transaction occurs at a database level; the entire database goes from
one version to another.
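
The all-or-nothing behaviour of a differential update can be sketched
as follows. This is an illustration only, not the BIND 9
implementation: the apply_update function, the tuple layout of a
change, and the rule that deleting a missing record is an error are
all assumptions made for the example.

```python
def apply_update(zone, diffs):
    """Apply ("add" | "delete", name, rdtype, rdata) changes atomically.

    Returns the new zone contents. If any change fails, an exception is
    raised and the caller's `zone` is left exactly as it was.
    """
    # Work on a private copy so partial changes are never visible.
    working = {key: set(values) for key, values in zone.items()}
    for op, name, rdtype, rdata in diffs:
        key = (name, rdtype)
        if op == "add":
            working.setdefault(key, set()).add(rdata)
        elif op == "delete":
            # Assumption for this sketch: a missing record is an error.
            if rdata not in working.get(key, set()):
                raise ValueError(f"no such record at {name}/{rdtype}")
            working[key].discard(rdata)
            if not working[key]:
                del working[key]
        else:
            raise ValueError(f"unknown operation: {op}")
    return working


zone = {("foo", "A"): {"192.0.2.1"}}

# A successful update yields a complete new version.
new_zone = apply_update(zone, [("add", "foo", "A", "192.0.2.2")])

# A failing update leaves no trace, even though its first change was valid.
try:
    apply_update(zone, [("add", "bar", "A", "192.0.2.9"),
                        ("delete", "foo", "A", "203.0.113.1")])
    failed = False
except ValueError:
    failed = True
```

Committing then amounts to replacing the old contents with the
returned value in a single step.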

Cache updates are done by the server in the ordinary course of
handling client requests. Unlike zone databases, there's no need (and
indeed, no ability) to ensure that data in the cache is consistent.
For example, the cache may hold rdatasets from different versions of a
given zone. A typical cache update involves looking at the existing
cache contents for the given name and type (if any), deciding if the
proposed replacement is better, and if so, doing the replacement.
Concurrent update attempts to the same node and rdataset type must
appear to have been executed in some order; there must be no merging
of data from multiple updates. Caches are not globally versioned like
zones are. There is no need to group changes to multiple rdatasets
into a cache transaction.
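
A cache update of the kind described above might look like the
following sketch. The CacheNode class, the cache_update function, and
the use of a numeric trust level to decide whether a replacement is
"better" are assumptions made for illustration; the text does not
specify the comparison. The per-node lock makes concurrent updates
to one node appear to run in some order, and the whole rdataset is
replaced rather than merged.

```python
import threading


class CacheNode:
    """Hypothetical cache node: one lock, one rdataset per type."""

    def __init__(self):
        self.lock = threading.Lock()
        self.rdatasets = {}  # rdtype -> (trust, rdata tuple)


def cache_update(node, rdtype, trust, rdata):
    """Store rdata unless the node already holds more trusted data.

    Returns True if the proposed rdataset was stored.
    """
    with node.lock:  # serializes updates to this node only
        existing = node.rdatasets.get(rdtype)
        if existing is not None and existing[0] > trust:
            return False  # existing data is better; keep it
        # Replace wholesale; data from two updates is never merged.
        node.rdatasets[rdtype] = (trust, tuple(rdata))
        return True


node = CacheNode()
first = cache_update(node, "A", 1, ["192.0.2.1"])
worse = cache_update(node, "A", 0, ["198.51.100.1"])
better = cache_update(node, "A", 2, ["192.0.2.2"])
```

Updates to different nodes take different locks and so can proceed
in parallel.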

Database Concurrency and Locking

A principal goal of the BIND 9 project is multiprocessor scalability.
The amount of concurrency in database accesses is an important factor
in achieving scalability. Consider a heavily used database, e.g. the
cache database serving some mail hubs, or ".com". If access to these
databases is not parallelized, then adding another CPU will not help
the server's performance for the portion of the runtime spent in
database lookup.

Support for multiple concurrent readers certainly helps both cache
databases and zone databases. Zones are typically read much more than
they are written, though less so than in prior years because dynamic
DNS support is now widely available. Caches are frequently read and
frequently written; a non-scientific survey of caching statistics on a
few busy caching nameservers showed the ratio of cache hits to misses
was about 2 to 1.

As mentioned above, zone updates must be serialized, but cache updates
can often proceed in parallel.

A simple approach to these concurrency goals would be to have a single
read-write lock on the database. This would allow for multiple
concurrent readers, and would provide the serialization of updates
that zone updates require. However, this approach has significant
limitations. Readers cannot run while an update is running. For a
short-lived transaction like a Dynamic Update, this may be acceptable,
but an IXFR can take a long time (even hours) to complete. Preventing
read access for such a long time is unacceptable. Another problem is
that it forces updates to be serialized, even for cache databases.

There are problems on the reader side of the lock too. If the entire
database is protected by one lock, then any data retrieved from the
database must either be used while the lock is held, or it must be
copied, because the data in the database can change when the lock
isn't held. Copying is expensive, and the server would like to be
able to hold a reference to database data for a long time. The most
significant long-running reader problem is outbound AXFR, which could
potentially block updates for a long time (hours).

A finer-grained locking scheme, e.g. one lock per node, helps
parallelize cache updates, but doesn't help with the long-lived reader
or long-lived writer problems. These problems are solved by zone
database versioning, described below.

The BIND 9 Database interface does not mandate any particular locking
scheme. Database implementations are strongly encouraged to provide
as much concurrency as possible without violating the database
interface's rules.

Database Versioning

Versioning is not available in cache databases.

A zone database has a "current version" which is the version most
recently committed. A database has a set of versions open for reading
(the "open versions"). This set is always non-empty, since the
current version is always open. The openversion method opens a
read-only handle to the current version. All retrievals using the
handle will see the database as it was at the time the version was
opened, regardless of subsequent changes to the database. It is not
possible to open a specific version; only the current version may be
opened. This helps limit the number of prior versions which must be
kept in the database.

Each zone update transaction is assigned a new version. Only one such
"future version" may be open at any time. It is the caller's
responsibility to serialize and handle the blocking and awakening of
multiple update requests. The future version may be committed or
rolled back by the caller. If the future version commits, its version
becomes the current version of the database.
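
The versioning rules above can be sketched in a few lines. This is a
simplified model, not the BIND 9 implementation: the VersionedZone
class and its method names are hypothetical, whole-zone copies stand
in for whatever storage a real database would use, and locking is
omitted. It shows read handles that see a fixed snapshot, a single
future version, and commit or rollback by the caller.

```python
class VersionedZone:
    """Hypothetical zone database with snapshot reads and one writer."""

    def __init__(self, records):
        self._current = 1
        self._versions = {1: dict(records)}  # version number -> contents
        self._open_counts = {}               # version number -> readers
        self._future = None                  # contents being built, if any

    def open_version(self):
        """Open a read-only handle; only the current version can be opened."""
        v = self._current
        self._open_counts[v] = self._open_counts.get(v, 0) + 1
        return v

    def close_version(self, v):
        self._open_counts[v] -= 1
        if self._open_counts[v] == 0:
            del self._open_counts[v]
            if v != self._current:
                del self._versions[v]  # no readers left; discard old version

    def find(self, v, name):
        return self._versions[v].get(name)

    def new_version(self):
        """Start the single future version; callers must serialize."""
        if self._future is not None:
            raise RuntimeError("a future version is already open")
        self._future = dict(self._versions[self._current])
        return self._future

    def commit(self):
        if self._future is None:
            raise RuntimeError("no future version is open")
        old = self._current
        self._current = old + 1
        self._versions[self._current] = self._future
        self._future = None
        if old not in self._open_counts:
            del self._versions[old]

    def rollback(self):
        self._future = None  # discard all uncommitted changes


zone = VersionedZone({"foo": "192.0.2.1"})
reader = zone.open_version()
future = zone.new_version()
future["foo"] = "192.0.2.2"
before_commit = zone.find(reader, "foo")  # change not yet visible
zone.commit()
after_commit = zone.find(reader, "foo")   # old handle keeps its snapshot
new_reader = zone.open_version()
latest = zone.find(new_reader, "foo")     # new handle sees the commit
zone.close_version(reader)                # old version can now be discarded
```

Because a long-running reader such as an outbound AXFR holds only its
own version, it does not block the writer, and the writer does not
block it.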