Copyright (C) 1999-2001, 2004, 2016 Internet Systems Consortium, Inc. ("ISC")

This Source Code Form is subject to the terms of the Mozilla Public
License, v. 2.0. If a copy of the MPL was not distributed with this
file, You can obtain one at http://mozilla.org/MPL/2.0/.

$Id: database,v 1.9 2004/03/05 05:04:46 marka Exp $

Databases

A BIND 9 DNS database allows rdatasets to be stored and retrieved by
name. DNS databases are used to store two different categories of data:
authoritative zone data and non-authoritative cache data. Unlike
previous versions of BIND, which used a monolithic database, BIND 9 has
one database per zone or cache. Certain database operations, for
example updates, have differing requirements and actions depending
upon whether the database contains zone data or cache data.


Database Semantics

A database instance has either zone semantics or cache semantics. The
semantics are chosen when the database is created and cannot be
changed. The differences between zone databases and cache databases
are discussed further below.


Reference Safety

It is a general principle of the BIND 9 project, and of the database
API, that all references returned to the caller remain valid until the
caller discards the reference.

The database interface also mandates that the rdata in a retrieved
rdataset shall remain unaltered while any reference to the rdataset is
held. Some other properties of the rdataset, e.g. its DNSSEC
validation status, may change.
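The two rules above can be modelled with simple reference counting.
The following is a minimal sketch in Python, not the BIND 9 API: the
names Rdataset, Node, attach, detach, replace and the "trust" field
are all hypothetical and exist only for illustration. The rdata is
held in an immutable tuple, and a replaced rdataset is kept alive
until its last holder lets go.

```python
from dataclasses import dataclass


@dataclass(eq=False)  # compare by identity, not by field values
class Rdataset:
    """Hypothetical rdataset: immutable rdata plus mutable metadata."""
    rdata: tuple            # immutable, so holders never see it change
    trust: str = "pending"  # stand-in for validation status; may change
    refs: int = 0


class Node:
    """Hypothetical database node holding one rdataset per type."""

    def __init__(self):
        self.current = {}   # rdtype -> Rdataset visible to new lookups
        self.retired = []   # replaced rdatasets still referenced by callers

    def attach(self, rdtype):
        rds = self.current[rdtype]
        rds.refs += 1
        return rds

    def detach(self, rds):
        rds.refs -= 1
        # Reclaim a replaced rdataset only once nobody refers to it.
        if rds.refs == 0 and rds in self.retired:
            self.retired.remove(rds)

    def replace(self, rdtype, rdata):
        old = self.current.get(rdtype)
        if old is not None and old.refs > 0:
            self.retired.append(old)   # keep it alive for existing holders
        self.current[rdtype] = Rdataset(tuple(rdata))
```

A caller that attached before a replacement keeps seeing the old
rdata; a caller that attaches afterwards sees the new rdata.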


Database Updates

A master zone is updated by a Dynamic Update message. A slave zone is
updated by IXFR or AXFR. AXFR provides the entire contents of the new
zone version, and replaces the entire contents of the database. IXFR
and Dynamic Update, although completely different protocols, have the
same basic database requirements. They are differential update
protocols, e.g. "add this record to the records at name 'foo'". The
updates are also atomic, i.e. each update must either succeed in full
or fail in full. Changes must not become visible to clients until the
update has committed. In short, zone updates are transactional. This
transaction occurs at the database level; the entire database goes from
one version to another.
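As an illustration of these requirements, the sketch below applies a
list of differential operations to a private working copy and makes
the result visible only if every operation succeeds. It is a toy
model in Python under stated assumptions, not BIND 9 code: the ZoneDB
class, the operation tuples, and the rule that deleting a missing
record fails the whole update are all invented for the example.

```python
import copy


class ZoneDB:
    """Toy zone database: {name: {rdtype: set of rdata}} plus a version."""

    def __init__(self):
        self.serial = 0
        self.names = {}

    def apply_update(self, ops):
        """Apply ("add" | "delete", name, rdtype, rdata) tuples atomically."""
        working = copy.deepcopy(self.names)      # invisible to readers
        for action, name, rdtype, rdata in ops:
            records = working.setdefault(name, {}).setdefault(rdtype, set())
            if action == "add":
                records.add(rdata)
            elif action == "delete":
                if rdata not in records:
                    # Assumed failure rule, chosen only for this example.
                    raise ValueError(f"{name} {rdtype} {rdata} does not exist")
                records.remove(rdata)
            else:
                raise ValueError(f"unknown action {action!r}")
        # Commit: the whole database moves from one version to the next.
        self.names = working
        self.serial += 1
```

If any operation raises, the working copy is discarded and readers
continue to see the previous version. Copying the whole database per
update is far too expensive for real use; it is done here only to keep
the all-or-nothing behaviour obvious.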

Cache updates are done by the server in the ordinary course of
handling client requests. Unlike zone databases, there is no need (and
indeed, no ability) to ensure that data in the cache is consistent.
For example, the cache may hold rdatasets from different versions of a
given zone. A typical cache update involves looking at the existing
cache contents for the given name and type (if any), deciding whether
the proposed replacement is better, and if so, doing the replacement.
Concurrent update attempts to the same node and rdataset type must
appear to have been executed in some order; there must be no merging
of data from multiple updates. Caches are not globally versioned as
zones are. There is no need to group changes to multiple rdatasets
into a cache transaction.
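The typical cache update described above might be sketched as follows.
This is an illustrative Python model, not the BIND 9 implementation.
The text does not say what makes a replacement "better", so the sketch
assumes a hypothetical integer trust rank in which a candidate wins
unless the cached data is more trusted. The Cache class and its update
method are likewise invented names.

```python
class Cache:
    """Toy cache: (name, rdtype) -> (trust, rdata). No versions, no transactions."""

    def __init__(self):
        self.entries = {}

    def update(self, name, rdtype, trust, rdata):
        """Install the candidate rdataset if it is better than what is cached.

        Returns True if the replacement happened.
        """
        key = (name.lower(), rdtype)
        existing = self.entries.get(key)
        if existing is not None and existing[0] > trust:
            return False                      # keep the more trusted data
        # Replace the whole rdataset; never merge old and new rdata.
        self.entries[key] = (trust, tuple(rdata))
        return True
```

Each call touches exactly one name and type, which is why no grouping
of changes into a transaction is needed.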


Database Concurrency and Locking

A principal goal of the BIND 9 project is multiprocessor scalability.
The amount of concurrency in database accesses is an important factor
in achieving scalability. Consider a heavily used database, e.g. the
cache database serving some mail hubs, or ".com". If access to these
databases is not parallelized, then adding another CPU will not help
the server's performance for the portion of the runtime spent in
database lookup.

Support for multiple concurrent readers certainly helps both cache
databases and zone databases. Zones are typically read much more than
they are written, though less so than in prior years because dynamic
DNS support is now widely available. Caches are frequently read and
frequently written; a non-scientific survey of caching statistics on a
few busy caching nameservers showed the ratio of cache hits to misses
was about 2 to 1.

As mentioned above, zone updates must be serialized, but cache updates
can often proceed in parallel.

A simple approach to these concurrency goals would be to have a single
read-write lock on the database. This would allow for multiple
concurrent readers, and would provide the serialization that zone
updates require. However, this approach has significant
limitations. Readers cannot run while an update is running. For a
short-lived transaction like a Dynamic Update, this may be acceptable,
but an IXFR can take a long time (even hours) to complete. Preventing
read access for such a long time is unacceptable. Another problem is
that it forces updates to be serialized, even for cache databases.
There are problems on the reader side of the lock too. If the entire
database is protected by one lock, then any data retrieved from the
database must either be used while the lock is held, or it must be
copied, because the data in the database can change when the lock
isn't held. Copying is expensive, and the server would like to be
able to hold a reference to database data for a long time. The most
significant long-running reader problem is outbound AXFR, which could
potentially block updates for a long time (hours).

A finer-grained locking scheme, e.g. one lock per node, helps
parallelize cache updates, but doesn't help with the long-lived reader
or long-lived writer problems. These problems are solved by zone
database versioning, described below.
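A one-lock-per-node scheme could look roughly like the sketch below.
It is a toy Python model with invented names (NodeLockedCache,
replace, lookup), not a description of how any BIND 9 database
implementation does its locking. Updates to different names take
different locks and do not block each other, while updates to the
same name are serialized.

```python
import threading


class NodeLockedCache:
    """Toy cache with one lock per node (name) instead of one per database."""

    def __init__(self):
        self._table_lock = threading.Lock()   # guards only the lock table
        self._node_locks = {}                 # name -> threading.Lock
        self._data = {}                       # (name, rdtype) -> rdata tuple

    def _lock_for(self, name):
        with self._table_lock:
            return self._node_locks.setdefault(name, threading.Lock())

    def replace(self, name, rdtype, rdata):
        # Held only for the duration of one small update.
        with self._lock_for(name):
            self._data[(name, rdtype)] = tuple(rdata)

    def lookup(self, name, rdtype):
        with self._lock_for(name):
            return self._data.get((name, rdtype))
```

The sketch also shows the limit of the approach: a lock is only
useful if it is held briefly, so a reader or writer that needs a
consistent view for hours gains nothing from it.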

The BIND 9 database interface does not mandate any particular locking
scheme. Database implementations are strongly encouraged to provide
as much concurrency as possible without violating the database
interface's rules.


Database Versioning

Versioning is not available in cache databases.

A zone database has a "current version", which is the version most
recently committed. A database has a set of versions open for reading
(the "open versions"). This set is always non-empty, since the
current version is always open. The openversion method opens a
read-only handle to the current version. All retrievals using the
handle will see the database as it was at the time the version was
opened, regardless of subsequent changes to the database. It is not
possible to open a specific version; only the current version may be
opened. This helps limit the number of prior versions which must be
kept in the database.
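The reader side of versioning can be sketched as a set of snapshots,
each kept only while some handle still has it open. The Python model
below is illustrative only. The method name openversion follows the
text above; closeversion, lookup, commit and open_versions are
hypothetical names, and commit is just a stand-in for whatever
installs a newly committed version.

```python
class VersionedZone:
    """Toy zone database that keeps one snapshot per open version."""

    def __init__(self, data):
        self._versions = {1: dict(data)}   # version number -> snapshot
        self._refs = {1: 0}                # version number -> open handles
        self.current = 1

    def openversion(self):
        """Open a read-only handle on the current version (no other choice)."""
        self._refs[self.current] += 1
        return self.current

    def lookup(self, handle, name):
        return self._versions[handle].get(name)

    def closeversion(self, handle):
        self._refs[handle] -= 1
        self._cleanup()

    def commit(self, new_data):
        """Stand-in for a committed update: install a new current version."""
        self.current += 1
        self._versions[self.current] = dict(new_data)
        self._refs[self.current] = 0
        self._cleanup()

    def _cleanup(self):
        # Prior versions survive only while a reader still has them open.
        stale = [v for v in self._versions
                 if v != self.current and self._refs[v] == 0]
        for v in stale:
            del self._versions[v], self._refs[v]

    def open_versions(self):
        return sorted(self._versions)
```

Because a handle can only ever be opened on the current version, a
prior version can be discarded as soon as its last handle is closed.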

Each zone update transaction is assigned a new version. Only one such
"future version" may be open at any time. It is the caller's
responsibility to serialize update requests, including blocking and
awakening any requests that must wait. The future version may be
committed or rolled back by the caller. If the future version commits,
it becomes the current version of the database.
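The writer side of this lifecycle can be sketched as follows. This is
a hedged Python model, not the BIND 9 interface: the ZoneVersions
class and the method names newversion and closeversion, along with the
commit flag, are assumptions made for the example. The database
refuses a second future version rather than queueing it, which leaves
serialization to the caller as described above.

```python
class ZoneVersions:
    """Toy zone database with at most one open future version."""

    def __init__(self, data):
        self.current_serial = 1
        self.current = dict(data)
        self._future = None          # the single writable version, if any

    def newversion(self):
        if self._future is not None:
            # The database does not queue writers; the caller must serialize.
            raise RuntimeError("a future version is already open")
        self._future = dict(self.current)
        return self._future

    def closeversion(self, commit):
        if self._future is None:
            raise RuntimeError("no future version is open")
        if commit:
            # The future version becomes the current version.
            self.current = self._future
            self.current_serial += 1
        self._future = None          # rollback simply discards the changes
```

Changes made to the future version are invisible through
zone.current until closeversion is called with commit set to true.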