dnode.c revision cf6106c8a0d6598b045811f9650d66e07eb332af
/*
 * Define DNODE_STATS to turn on statistic gathering.  By default, it is
 * only turned on when DEBUG is also defined.
 */

/*
 * Every dbuf has a reference, and dropping a tracked reference is
 * O(number of references), so don't track dn_holds.
 */

/*
 * dn_nblkptr is only one byte, so it's OK to read it in either
 * byte order.  We can't read dn_bonuslen.
 */

/*
 * OK to check dn_bonuslen for zero, because it won't matter if
 * we have the wrong byte order.  This is necessary because the
 * dnode dnode is smaller than a regular dnode.
 */

/*
 * Note that the bonus length calculated here may be
 * longer than the actual bonus buffer.  This is because
 * we always put the bonus buffer after the last block
 * pointer (instead of packing it against the end of the
 * dnode buffer).
 */

/* Swap SPILL block if we have one */

for (i = 0; i < size; i++) {
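The byte-order notes above can be illustrated with a small sketch (the helper names are hypothetical, not the real `dnode_byteswap()`): multi-byte dnode fields must be byte-swapped when the on-disk order differs from the host, while a one-byte field such as `dn_nblkptr` reads identically in either order.

```c
#include <stdint.h>

/* Swap a 64-bit value's byte order (illustrative, not the kernel macro). */
static uint64_t
sketch_bswap64(uint64_t x)
{
	return (((x & 0x00000000000000ffULL) << 56) |
	    ((x & 0x000000000000ff00ULL) << 40) |
	    ((x & 0x0000000000ff0000ULL) << 24) |
	    ((x & 0x00000000ff000000ULL) << 8) |
	    ((x & 0x000000ff00000000ULL) >> 8) |
	    ((x & 0x0000ff0000000000ULL) >> 24) |
	    ((x & 0x00ff000000000000ULL) >> 40) |
	    ((x & 0xff00000000000000ULL) >> 56));
}

/* A one-byte field needs no swap: both byte orders agree. */
static uint8_t
sketch_read_u8(uint8_t v)
{
	return (v);
}
```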
/*
 * Defer setting dn_objset until the dnode is ready to be a candidate
 * for the dnode_move() callback.
 */

/* Lost the allocation race. */

/*
 * Exclude special dnodes from os_dnodes so an empty os_dnodes
 * signifies that the special dnodes have no references from
 * their children (the entries in os_dnodes).  This allows
 * dnode_destroy() to easily determine if the last child has
 * been removed and then complete eviction of the objset.
 */

/*
 * Everything else must be valid before assigning dn_objset
 * makes the dnode eligible for dnode_move().
 */

/*
 * Caller must be holding the dnode handle, which is released upon return.
 */

/* the dnode can no longer move, so we can release the handle */

/* clean up any unreferenced dbufs */

/* change bonus size and type */

/* fix up the bonus db_size */

/*
 * Update back pointers.  Updating the handle fixes the back pointer of
 * every descendant dbuf as well as the bonus dbuf.
 */

/*
 * Invalidate the original dnode by clearing all of its back pointers.
 * Set the low bit of the objset pointer to ensure that dnode_move()
 * recognizes the dnode as invalid in any subsequent callback.
 */

/* Satisfy the destructor. */

/*
 * The dnode is on the objset's list of known dnodes if the objset
 * pointer is valid.  We set the low bit of the objset pointer when
 * freeing the dnode to invalidate it, and the memory patterns written
 * by kmem (baddcafe and deadbeef) set at least one of the two low bits.
 * A newly created dnode sets the objset pointer last of all to indicate
 * that the dnode is known and in a valid state to be moved by this
 * callback.
 */

/*
 * Ensure that the objset does not go away during the move.
 */

/*
 * If the dnode is still valid, then so is the objset.  We know that no
 * valid objset can be freed while we hold os_lock, so we can safely
 * ensure that the objset remains in use.
 */

/*
 * Recheck the objset pointer in case the dnode was removed just before
 * acquiring the lock.
 */

/*
 * At this point we know that as long as we hold os->os_lock, the dnode
 * cannot be freed and fields within the dnode can be safely accessed.
 */
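The low-bit invalidation trick described above can be sketched as follows (a hypothetical simplification, not the kernel code): an aligned objset pointer has zero low bits, so setting the low bit marks the dnode invalid, and the kmem debug patterns (0xbaddcafe / 0xdeadbeef) also set at least one of the two low bits, so a single alignment test rejects freed and poisoned dnodes alike.

```c
#include <stdint.h>

/* Mark a dnode's objset pointer invalid by setting its low bit. */
static uintptr_t
sketch_invalidate(uintptr_t os_ptr)
{
	return (os_ptr | 1);
}

/* A pointer is treated as valid only if non-NULL with both low bits clear. */
static int
sketch_objset_valid(uintptr_t os_ptr)
{
	return (os_ptr != 0 && (os_ptr & 3) == 0);
}
```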
/*
 * The objset listing this dnode cannot go away as long as this dnode is
 * on its list.
 */

/*
 * Lock the dnode handle to prevent the dnode from obtaining any new
 * holds.  This also prevents the descendant dbufs and the bonus dbuf
 * from accessing the dnode, so that we can discount their holds.  The
 * handle is safe to access because we know that while the dnode cannot
 * go away, neither can its handle.  Once we hold dnh_zrlock, we can
 * safely move any dnode referenced only by dbufs.
 */

/*
 * Ensure a consistent view of the dnode's holds and the dnode's dbufs.
 * We need to guarantee that there is a hold for every dbuf in order to
 * determine whether the dnode is actively referenced.  Falsely matching
 * a dbuf to an active hold would lead to an unsafe move.  It's possible
 * that a thread already having an active dnode hold is about to add a
 * dbuf, and we can't compare hold and dbuf counts while the add is in
 * progress.
 *
 * A dbuf may be removed (evicted) without an active dnode hold.  In that
 * case, the dbuf count is decremented under the handle lock before the
 * dbuf's hold is released.  This order ensures that if we count the hold
 * after the dbuf is removed but before its hold is released, we will
 * treat the unmatched hold as active and exit safely.  If we count the
 * hold before the dbuf is removed, the hold is discounted, and the
 * removal is blocked until the move completes.
 */

/* We can't have more dbufs than dnode holds. */

/*
 * At this point we know that anyone with a hold on the dnode is not
 * actively referencing it.  The dnode is known and in a valid state to
 * move.  We're holding the locks needed to execute the critical section.
 */

/* If the dnode was safe to move, the refcount cannot have changed. */

/*
 * Wait for final references to the dnode to clear.
 */
/*
 * This can only happen if the ARC is asynchronously evicting state that
 * has a hold on this dnode while we are trying to evict this dnode.
 */

/*
 * The dnode handle lock guards against the dnode moving to another
 * valid address, so there is no need here to guard against changes
 * to or from NULL.
 */

/*
 * If there are holds on this dnode, then there should be holds on the
 * dnode's containing dbuf as well; thus it wouldn't be eligible for
 * eviction and this function would not have been called.
 */

/* EINVAL - invalid object number. */

/* succeeds even for free dnodes. */

/*
 * If you are holding the spa config lock as writer, you shouldn't
 * be asking the DMU to do *anything* unless it's the root pool
 * which may require us to read from the root filesystem while
 * holding some (not all) of the locks as writer.
 */

for (i = 0; i < epb; i++) {
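The hold-versus-dbuf accounting in the dnode_move() discussion earlier can be sketched as follows (a hypothetical simplification, not the kernel's refcount code): with new holds and dbuf eviction blocked by the handle lock, every dbuf contributes exactly one hold, so the dnode is movable only when every hold is matched by a dbuf.

```c
#include <stdint.h>

/*
 * More dbufs than holds would be an accounting bug; more holds than
 * dbufs means some thread holds an active (non-dbuf) reference.
 */
static int
sketch_safe_to_move(uint64_t holds, uint64_t dbufs)
{
	if (dbufs > holds)
		return (-1);		/* impossible: accounting bug */
	return (holds == dbufs);	/* 1 = movable, 0 = actively held */
}
```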
for (i = 0; i < epb; i++) {
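The loops above walk the dnodes packed into one meta-dnode block. The mapping from an object number to a block and a slot can be sketched like this (the 32-dnodes-per-block figure and the helper names are assumptions for the example, not the on-disk constants):

```c
#include <stdint.h>

#define	SKETCH_DNODES_PER_BLOCK_SHIFT	5
#define	SKETCH_DNODES_PER_BLOCK	(1 << SKETCH_DNODES_PER_BLOCK_SHIFT)

/* Which L0 block of the meta-dnode holds this object. */
static uint64_t
sketch_object_to_blkid(uint64_t object)
{
	return (object >> SKETCH_DNODES_PER_BLOCK_SHIFT);
}

/* Which slot within that block holds this object. */
static int
sketch_object_to_slot(uint64_t object)
{
	return ((int)(object & (SKETCH_DNODES_PER_BLOCK - 1)));
}
```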
/* Now we can rely on the hold to prevent the dnode from moving. */

/*
 * Return held dnode if the object is allocated, NULL if not.
 */

/*
 * Can only add a reference if there is already at least one
 * reference on the dnode.  Returns FALSE if unable to add a
 * new reference.
 */

/* Get while the hold prevents the dnode from moving. */

/*
 * It's unsafe to release the last hold on a dnode by dnode_rele() or
 * indirectly by dbuf_rele() while relying on the dnode handle to
 * prevent the dnode from moving, since releasing the last hold could
 * result in the dnode's parent dbuf evicting its dnode handles.  For
 * that reason anyone calling dnode_rele() or dbuf_rele() without some
 * other direct or indirect hold on the dnode must first drop the dnode
 * handle.
 */

/* NOTE: the DNODE_DNODE does not have a dn_dbuf */

/*
 * Another thread could add a hold to the dnode handle in
 * dnode_hold_impl() while holding the parent dbuf.  Since the
 * hold on the parent dbuf prevents the handle from being
 * destroyed, the hold on the handle is OK.  We can't yet assert
 * that the handle has zero references, but that will be
 * asserted anyway when the handle gets destroyed.
 */

/* Determine old uid/gid when necessary */

/* If we are already marked dirty, we're done. */

/*
 * The dnode maintains a hold on its containing dbuf as
 * long as there are holds on it.  Each instantiated child
 * dbuf maintains a hold on the dnode.  When the last child
 * drops its hold, the dnode will drop its hold on the
 * containing dbuf.  We add a "dirty hold" here so that the
 * dnode will hang around after we finish processing its
 * children.
 */

/* we should be the only holder... hopefully */
/* ASSERT3U(refcount_count(&dn->dn_holds), ==, 1); */

/*
 * If the dnode is already dirty, it needs to be moved from
 * the dirty list to the free list.
 */

/*
 * Try to change the block size for the indicated dnode.
 */
/*
 * This can only succeed if there are no blocks allocated or dirty
 * beyond the first block.
 */

/* Check for any allocated blocks beyond the first */

/* resize the old block */

/* rele after we have fixed the blocksize in the dnode */

/* read-holding callers must not rely on the lock being continuously held */

/*
 * If we have a read-lock, check to see if we need to do any work
 * before upgrading to a write-lock.
 */

/*
 * Compute the number of levels necessary to support the new maxblkid.
 */

/* dirty the left indirects */

/* transfer the dirty records to the new indirect */

/*
 * First, block align the region to free:
 */

/*
 * Freeing the whole block; fast-track this request.
 * Note that we won't dirty any indirect blocks,
 * which is fine because we will be freeing the entire
 * file and thus all indirect blocks will be freed
 * by free_children().
 */

/* Freeing past end-of-data */

/* Freeing part of the block. */

/* zero out any partial block data at the start of the range */

/* don't dirty if it isn't on disk and isn't dirty */

/* If the range was less than one block, we're done */

/* If the remaining range is past end of file, we're done */

/* zero out any partial block data at the end of the range */

/* don't dirty if not on disk and not dirty */

/* If the range did not include a full block, we are done */

/*
 * Dirty all the indirect blocks in this range.  Note that only
 * the first and last indirect blocks can actually be written
 * (if they were partially freed) -- they must be dirtied, even if
 * they do not exist on disk yet.  The interior blocks will
 * be freed by free_children(), so they will not actually be written.
 * Even though these interior blocks will not be written, we
 * dirty them for two reasons:
 *
 *  - It ensures that the indirect blocks remain in memory until
 *    syncing context.
 */
/*
 *    (They have already been prefetched by dmu_tx_hold_free(), so we
 *    don't have to worry about reading them off disk.)
 *
 *  - The dirty space accounting will put pressure on the txg sync
 *    mechanism to begin syncing, and to delay transactions if there
 *    is a large amount of freeing.  Even though these indirect
 *    blocks will not be written, we could need to write the same
 *    amount of space if we copy the freed BPs into deadlists.
 */

/*
 * Set i to the blockid of the next non-hole level-1 indirect block
 * at or after i.  Note that dnode_next_offset() operates in terms of
 * level-0-equivalent bytes.
 */

/*
 * Normally we should not see an error, either from
 * dnode_next_offset() or dbuf_hold_level() (except for ESRCH from
 * dnode_next_offset).  If there is an i/o error, then when we read
 * this block in syncing context, it will use ZIO_FLAG_MUSTSUCCEED,
 * and thus hang/panic according to the "failmode" property.
 * dnode_next_offset() doesn't have a flag to indicate MUSTSUCCEED.
 */

/*
 * Add this range to the dnode range list.
 * We will finish up this free operation in the syncing phase.
 */

/* return TRUE if this blkid was freed in a recent txg, or FALSE if it wasn't */

/*
 * If we're in the process of opening the pool, dp will not be
 * set yet, but there shouldn't be anything dirty.
 */

/* call from syncing context when we actually write/free space for this dnode */

/*
 * Call when we think we're going to write/free space in open context to track
 * the amount of memory in use by the currently open txg.
 */

/*
 * Scans a block at the indicated "level" looking for a hole or data.
 *
 * If level > 0, then we are scanning an indirect block looking at its
 * pointers.  If level == 0, then we are looking at a block of dnodes.
 *
 * If we don't find what we are looking for in the block, we return ESRCH.
 * Otherwise, return with *offset pointing to the beginning (if searching
 * forwards) or end (if searching backwards) of the range covered by the
 * block pointer we matched on (or dnode).
 */
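The "block align the region to free" step described above amounts to some power-of-two arithmetic; a minimal sketch, with illustrative helper names:

```c
#include <stdint.h>

/*
 * With a power-of-two block size, count the partial-block bytes at the
 * head of a range (before the first block boundary).
 */
static uint64_t
sketch_head_partial(uint64_t off, uint64_t blksz)
{
	uint64_t phase = off & (blksz - 1);	/* offset within its block */

	return (phase == 0 ? 0 : blksz - phase); /* bytes up to the boundary */
}

/*
 * A request that starts on a block boundary and covers whole blocks can
 * be fast-tracked without zeroing any partial-block data.
 */
static int
sketch_whole_blocks_only(uint64_t off, uint64_t len, uint64_t blksz)
{
	return ((off & (blksz - 1)) == 0 && (len & (blksz - 1)) == 0);
}
```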
/*
 * The basic search algorithm used below by dnode_next_offset() is to
 * use this function to search up the block tree (widen the search) until
 * we find something (i.e., we don't return ESRCH) and then search back
 * down the tree (narrow the search) until we reach our original search
 * level.
 */

dprintf("probing object %llu offset %llx level %d of %u\n",
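The widen/narrow algorithm described above can be modeled with a toy two-level structure (this is an illustration, not the real `dnode_next_offset_level()`): L0 is one flag per offset, the search first narrows within the current group of 8, widens across the following groups, then narrows back down into the first group containing data.

```c
#include <stdint.h>

#define	TOY_SIZE	64
#define	TOY_GROUP	8

/* Find the next set flag at or after 'off'; -1 plays the role of ESRCH. */
static int
toy_next_data(const uint8_t data[TOY_SIZE], int off)
{
	int g, i;

	if (off < 0 || off >= TOY_SIZE)
		return (-1);
	/* narrow search: the remainder of the current group */
	for (i = off; i < (off / TOY_GROUP + 1) * TOY_GROUP; i++)
		if (data[i])
			return (i);
	/* widen: walk the following groups (the "L1" level) */
	for (g = off / TOY_GROUP + 1; g < TOY_SIZE / TOY_GROUP; g++)
		for (i = 0; i < TOY_GROUP; i++)
			if (data[g * TOY_GROUP + i])	/* narrow back down */
				return (g * TOY_GROUP + i);
	return (-1);	/* nothing found */
}
```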
/*
 * This can only happen when we are searching up the block tree for
 * data.  We don't really need to adjust the offset, as we will just
 * end up looking at the pointer to this block in its parent, and it's
 * going to be unallocated, so we will skip over it.
 */

/*
 * This can only happen when we are searching up the tree and these
 * conditions mean that we need to keep climbing.
 */

i >= 0 && i < epb; i += inc) {
/* traversing backwards; position offset at the end */

/*
 * Find the next hole, data, or sparse region at or after *offset.
 * The value 'blkfill' tells us how many items we expect to find
 * in an L0 data block; this value is 1 for normal objects,
 * DNODES_PER_BLOCK for the meta dnode, and some fraction of
 * DNODES_PER_BLOCK when searching for sparse regions thereof.
 *
 * Examples:
 *
 * dnode_next_offset(dn, flags, offset, 1, 1, 0);
 *	Used in dmu_offset_next().
 *
 * dnode_next_offset(mdn, flags, offset, 0, DNODES_PER_BLOCK, txg);
 *	Only finds objects that have new contents since txg (i.e.
 *	bonus buffer changes and content removal are ignored).
 *	Used in dmu_object_next().
 *
 * dnode_next_offset(mdn, DNODE_FIND_HOLE, offset, 2, DNODES_PER_BLOCK >> 2, 0);
 *	Finds the next L2 meta-dnode bp that's at most 1/4 full.
 *	Used in dmu_object_alloc().
 */

/*
 * There's always a "virtual hole" at the end of the object, even
 * if all BP's which physically exist are non-holes.
 */
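The "virtual hole" rule above can be sketched as follows (a toy stand-in where `blocks[]` and `nblocks` model the object's block pointers, not the real search code): a hole search always succeeds at the end of the object, even when every block that physically exists is allocated.

```c
#include <stdint.h>

/* Return the index of the next hole at or after 'off'. */
static int
sketch_next_hole(const uint8_t *blocks, int nblocks, int off)
{
	int i;

	for (i = off; i < nblocks; i++)
		if (!blocks[i])
			return (i);	/* a real hole */
	return (nblocks);	/* the virtual hole at end of object */
}
```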