dbuf.c revision e57a022b8f718889ffa92adbde47a8f08abcdb25
/*
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each
 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 */

/*
 * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
 * Copyright 2011 Nexenta Systems, Inc. All rights reserved.
 * Copyright (c) 2012, 2014 by Delphix. All rights reserved.
 * Copyright (c) 2013 by Saso Kiselkov. All rights reserved.
 * Copyright (c) 2013, Joyent, Inc. All rights reserved.
 * Copyright (c) 2014 Spectra Logic Corporation, All rights reserved.
 */

/*
 * Number of times that zfs_free_range() took the slow path while doing
 * a zfs receive.  A nonzero value indicates a potential performance problem.
 */
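The counter described in the comment above is simply a global that the free-range path bumps whenever it unexpectedly has to walk the dbuf list during a receive. A minimal standalone sketch; the variable and function names here are invented for illustration and are not the file's own identifiers:

#include <stdint.h>

/* Illustration only: a slow-path counter of the kind described above. */
static uint64_t free_range_recv_miss;	/* hypothetical name */

static void
note_free_range_slow_path(int is_receiving)
{
	/*
	 * During a receive every block should be written at most once and
	 * in offset order, so hitting the slow path is unexpected; a
	 * nonzero count is a hint to investigate performance.
	 */
	if (is_receiving)
		free_range_recv_miss++;
}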
/*
 * Global data structures and functions for the dbuf cache.
 */

/*
 * dbuf hash table routines
 */

/*
 * Insert an entry into the hash table.  If there is already an element
 * equal to elem in the hash table, then the already existing element
 * will be returned and the new element will not be inserted.
 * Otherwise returns NULL.
 */

/*
 * Remove an entry from the hash table.  It must be in the EVICTING state.
 * We mustn't hold db_mtx to maintain lock ordering:
 *     DBUF_HASH_MUTEX > db_mtx.
 */

/* Only data blocks support the attachment of user data. */

/* Clients must resolve a dbuf before attaching user data. */

/*
 * Immediate eviction occurs when holds == dirtycnt.
 * For normal eviction buffers, holds is zero on
 * eviction, except when dbuf_fix_old_data() calls
 * dbuf_clear_data().  However, the hold count can grow
 * during eviction even though db_mtx is held (see
 * dmu_bonus_hold() for an example), so we can only
 * test the generic invariant that holds >= dirtycnt.
 */

/*
 * Invoke the callback from a taskq to avoid lock order reversals.
 */

/*
 * The hash table is big enough to fill all of physical memory
 * with an average 4K block size.  The table will take up
 * totalmem * sizeof(void *) / 4K (i.e. 2MB/GB with 8-byte pointers).
 */

/* XXX - we should really return an error instead of assert */

/*
 * All entries are queued via taskq_dispatch_ent(), so min/maxalloc
 * configuration is not required.
 */

/*
 * We can't assert that db_size matches dn_datablksz because it
 * can be momentarily different when another thread is doing
 * dnode_set_blksz().
 */

/*
 * It should only be modified in syncing context, so
 * make sure we only have one copy of the data.
 */

/* verify db->db_blkptr */

/* db is pointed to by the dnode */
/* ASSERT3U(db->db_blkid, <, dn->dn_nblkptr); */

/* db is pointed to by an indirect block */

/*
 * dnode_grow_indblksz() can make this fail if we don't
 * have the struct_rwlock.  XXX indblksz no longer
 * grows.  safe to do this now?
 */

/*
 * If the blkptr isn't set but they have nonzero data,
 * it had better be dirty, otherwise we'll lose that
 * data when we evict this buffer.
 */

/*
 * Loan out an arc_buf for read.  Return the loaned arc_buf.
 */

/*
 * All reads are synchronous, so we must have a hold on the dbuf.
 */

/* we were freed in flight; disregard any error */

/* We need the struct_rwlock to prevent db_blkptr from changing. */

/*
 * Recheck BP_IS_HOLE() after dnode_block_freed() in case dnode_sync()
 * processes the delete record and clears the bp while we are waiting
 * for the dn_mtx (resulting in a "no" from block_freed).
 */

/*
 * We don't have to hold the mutex to check db_state because it
 * can't be freed while we have a hold on the buffer.
 */

/* dbuf_read_impl has dropped db_mtx for us */

/*
 * Another reader came in while the dbuf was in flight
 * between UNCACHED and CACHED.  Either a writer will finish
 * writing the buffer (sending the dbuf to CACHED) or the
 * first reader's request will reach the read_done callback
 * and send the dbuf to CACHED.  Otherwise, a failure
 * occurred and the dbuf went to UNCACHED.
 */

/* Skip the wait per the caller's request. */

/*
 * This is our just-in-time copy function.  It makes a copy of
 * buffers that have been modified in a previous transaction
 * group, before we modify them in the current active group.
 *
 * This function is used in two places: when we are dirtying a
 * buffer for the first time in a txg, and when we are freeing
 * a range in a dnode that includes this buffer.
 *
 * Note that when we are called from dbuf_free_range() we do
 * not put a hold on the buffer; we just traverse the active
 * dbuf list for the dnode.
 */
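As a rough illustration of the insert contract described above (return the existing element if an equal one is already present, otherwise insert and return NULL), here is a standalone chained-hash sketch. All of the ex_* names are invented for this example; the real table also keys on the objset, takes a per-bucket mutex, and sizes itself from physical memory as the comments above note:

#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the real dbuf and hash table types. */
typedef struct ex_dbuf {
	uint64_t	eb_object;
	uint8_t		eb_level;
	uint64_t	eb_blkid;
	struct ex_dbuf	*eb_hash_next;
} ex_dbuf_t;

typedef struct ex_hash_table {
	uint64_t	eh_mask;	/* table size - 1 (power of two) */
	ex_dbuf_t	**eh_table;
} ex_hash_table_t;

static uint64_t
ex_hash(uint64_t obj, uint8_t lvl, uint64_t blkid)
{
	/* Any reasonable mixing function will do for the illustration. */
	return ((obj * 2654435761ULL) ^ ((uint64_t)lvl << 57) ^
	    (blkid * 0x9e3779b97f4a7c15ULL));
}

/*
 * Insert db into the table.  If an equal entry is already present,
 * return it and do not insert the new one; otherwise insert db and
 * return NULL.
 */
static ex_dbuf_t *
ex_hash_insert(ex_hash_table_t *h, ex_dbuf_t *db)
{
	uint64_t idx = ex_hash(db->eb_object, db->eb_level, db->eb_blkid) &
	    h->eh_mask;

	for (ex_dbuf_t *cur = h->eh_table[idx]; cur != NULL;
	    cur = cur->eb_hash_next) {
		if (cur->eb_object == db->eb_object &&
		    cur->eb_level == db->eb_level &&
		    cur->eb_blkid == db->eb_blkid)
			return (cur);		/* already present */
	}
	db->eb_hash_next = h->eh_table[idx];
	h->eh_table[idx] = db;
	return (NULL);
}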
/*
 * If the last dirty record for this dbuf has not yet synced
 * and it's referencing the dbuf data, either:
 *     reset the reference to point to a new copy,
 * or (if there are no active holders)
 *     just null out the current db_data pointer.
 */

/* Note that the data bufs here are zio_bufs */

/*
 * Release the already-written buffer, so we leave it in
 * a consistent dirty state.  Note that all callers are
 * modifying the buffer, so they will immediately do
 * another (redundant) arc_release().  Therefore, leave
 * the buf thawed to save the effort of freezing and
 * immediately re-thawing it.
 */

/*
 * Evict (if it's unreferenced) or clear (if it's referenced) any level-0
 * data blocks in the free range, so that any future readers will find
 * empty blocks.
 *
 * This is a no-op if the dataset is in the middle of an incremental
 * receive; see comment below for details.
 */

/* There can't be any dbufs in this range; no need to search. */

/*
 * If we are receiving, we expect there to be no dbufs in
 * the range to be freed, because receive modifies each
 * block at most once, and in offset order.  If this is
 * not the case, it can lead to performance problems,
 * so note that we unexpectedly took the slow path.
 */

/* found a level 0 buffer in the range */

/* mutex has been dropped and dbuf destroyed */

/* will be handled in dbuf_read_done or dbuf_rele */

/* The dbuf is referenced */

/*
 * This buffer is "in-use"; re-adjust the file
 * size to reflect that this buffer may
 * contain new data when we sync.
 */

/*
 * This dbuf is not dirty in the open context.
 * Either uncache it (if it's not referenced in
 * the open context) or reset its contents to
 * empty.
 */

/* clear the contents if it's cached */

/*
 * We don't need any locking to protect db_blkptr:
 * If it's syncing, then db_last_dirty will be set,
 * so we'll ignore db_blkptr.
 *
 * This logic ensures that only block births for
 * filled blocks are considered.
 */

/*
 * If this block doesn't exist or is in a snapshot, it can't be freed.
 * Don't pass the bp to dsl_dataset_block_freeable() since we
 * are holding the db_mtx lock and might deadlock if we are
 * prefetching a dedup-ed block.
 */

/* XXX does *this* func really need the lock? */

/*
 * This call to dmu_buf_will_dirty() with the dn_struct_rwlock held
 * is OK, because there can be no other references to the db
 * when we are changing its size, so no concurrent DB_FILL can
 * be happening.
 */

/*
 * XXX we should be doing a dbuf_read, checking the return
 * value and returning that up to our callers.
 */

/* create the data buffer for the new block */

/* copy old block data to the new block */

/*
 * Shouldn't dirty a regular buffer in syncing context.  Private
 * objects may be dirtied in syncing context, but only if they
 * were already pre-dirtied in open context.
 */

/*
 * We make this assert for private objects as well, but after we
 * check if we're already dirty.  They are allowed to re-dirty
 * in syncing context.
 */

/*
 * XXX make this true for indirects too?  The problem is that
 * transactions created with dmu_tx_create_assigned() from
 * syncing context don't bother holding ahead.
 */

/*
 * Don't set dirtyctx to SYNC if we're just modifying this as we
 * initialize the objset.
 */

/*
 * If this buffer is already dirty, we're done.
 */

/*
 * If this buffer has already been written out,
 * we now need to reset its state.
 */

/* Only valid if not already dirty. */

/*
 * We should only be dirtying in syncing context if it's the
 * mos or we're initializing the os or it's a special object.
 * However, we are allowed to dirty in syncing context provided
 * we already dirtied it in open context.  Hence we must make
 * this assertion only if we're not already dirty.
 */

/*
 * Note: we delay "free accounting" until after we drop
 * the db_mtx.  This keeps us from grabbing other locks
 * (and possibly deadlocking) in bp_get_dsize() while
 * also holding the db_mtx.
 */
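A simplified model of the copy-on-write decision described earlier in this stretch of comments (the just-in-time copy made before re-dirtying a buffer whose last dirty record has not synced). The model_* names are made up for this sketch, and the real code works with ARC buffers and hold/dirty counts rather than malloc'd memory:

#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct model_dbuf {
	void	*db_data;	/* contents the open txg will modify */
	size_t	db_size;
	int	db_holds;	/* active holders beyond the dirty records */
	void	*dr_data;	/* data referenced by the last dirty record */
} model_dbuf_t;

/*
 * If the unsynced dirty record still points at db_data, either give it a
 * private copy (when there are active holders) or just null out db_data
 * (when there are none), so the older txg's data can no longer be
 * modified through this buffer.
 */
static void
model_fix_old_data(model_dbuf_t *db)
{
	if (db->dr_data != db->db_data)
		return;			/* dirty record already has its own copy */

	if (db->db_holds > 0) {
		void *copy = malloc(db->db_size);

		assert(copy != NULL);
		memcpy(copy, db->db_data, db->db_size);
		db->dr_data = copy;	/* the old txg keeps this snapshot */
	} else {
		db->db_data = NULL;	/* no holders: just detach the data */
	}
}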
/*
 * If this buffer is dirty in an old transaction group we need
 * to make a copy of it so that the changes we make in this
 * transaction group won't leak out when we sync the older txg.
 */

/*
 * Release the data buffer from the cache so
 * that we can modify it without impacting
 * possible other users of this cached data
 * block.  Note that indirect blocks and
 * private objects are not released until the
 * syncing state (since they are only modified
 * then).
 */

/*
 * We could have been freed_in_flight between the dbuf_noread
 * and dbuf_dirty.  We win, as though the dbuf_noread() had
 * happened after the free.
 */

/* This buffer is now part of this txg */

/*
 * This is only a guess -- if the dbuf is dirty
 * in a previous txg, we don't know how much
 * space it will use on disk yet.  We should
 * really have the struct_rwlock to access
 * db_blkptr, but since this is just a guess,
 * it's OK if we get an odd answer.
 */

/*
 * Since we've dropped the mutex, it's possible that
 * dbuf_undirty() might have changed this out from under us.
 */

/*
 * Undirty a buffer in the transaction group referenced by the given
 * transaction.  Return whether this evicted the dbuf.
 */

/*
 * If this buffer is not dirty, we're done.
 */

/*
 * Any space we accounted for in dp_dirty_* will be cleaned up by
 * dsl_pool_sync().  This is relatively rare so the discrepancy
 * is not a big deal.
 */

/*
 * Note that there are three places in dbuf_dirty()
 * where this dirty record may be put on a list.
 * Make sure to do a list_remove corresponding to
 * every one of those list_insert calls.
 */

/* we were freed while filling */

/*
 * Directly assign a provided arc buf to a given dbuf if it's not referenced
 * by anybody except our caller.  Otherwise copy arcbuf's contents to dbuf.
 */

/*
 * "Clear" the contents of this dbuf.  This will mark the dbuf
 * EVICTING and clear *most* of its references.  Unfortunately,
 * when we are not holding the dn_dbufs_mtx, we can't clear the
 * entry in the dn_dbufs list.  We have to wait until dbuf_destroy()
 * in this case.  For callers from the DMU we will usually see:
 *     dbuf_clear()->arc_clear_callback()->dbuf_do_evict()->dbuf_destroy()
 * For the arc callback, we will usually see:
 *     dbuf_do_evict()->dbuf_clear();dbuf_destroy()
 * Sometimes, though, we will get a mix of these two:
 *     DMU: dbuf_clear()->arc_clear_callback()
 *     ARC: dbuf_do_evict()->dbuf_destroy()
 *
 * This routine will dissociate the dbuf from the arc, by calling
 * arc_clear_callback(), but will not evict the data from the ARC.
 */

/*
 * Decrementing the dbuf count means that the hold corresponding
 * to the removed dbuf is no longer discounted in dnode_move(),
 * so the dnode cannot be moved until after we release the hold.
 * The membar_producer() ensures visibility of the decremented
 * value in dnode_move(), since DB_DNODE_EXIT doesn't actually
 * release any lock.
 */

/*
 * If this dbuf is referenced from an indirect dbuf,
 * decrement the ref count on the indirect dbuf.
 */

/* the buffer has no parent yet */

/* this block is referenced from an indirect block */

/* the block is referenced from the dnode */

/* the bonus dbuf is not placed in the hash table */

/*
 * Hold the dn_dbufs_mtx while we get the new dbuf
 * in the hash table *and* added to the dbufs list.
 * This prevents a possible deadlock with someone
 * trying to look up this dbuf before it's added to the
 * dn_dbufs list.
 */

/* someone else inserted it first */

/*
 * If this dbuf is still on the dn_dbufs list,
 * remove it from that list.
 */
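The assign-or-copy rule quoted above reduces to a hold-count check: with exactly one hold (the caller's) the loaned buffer can be adopted in place, otherwise its contents are copied into the dbuf's existing buffer. A hedged standalone sketch, with invented model_* types standing in for the real arc_buf and dbuf structures (the real code also has to deal with dirty records and freeing the replaced buffer):

#include <stddef.h>
#include <string.h>

typedef struct model_arcbuf {
	void	*b_data;
	size_t	b_size;
} model_arcbuf_t;

typedef struct model_dbuf_ref {
	model_arcbuf_t	*db_buf;	/* buffer currently backing the dbuf */
	long		db_holds;	/* includes the caller's own hold */
} model_dbuf_ref_t;

/*
 * Returns nonzero if the dbuf adopted the loaned buffer (zero-copy);
 * zero if the contents were copied and the caller keeps ownership of
 * the loaned buffer.
 */
static int
model_assign_buf(model_dbuf_ref_t *db, model_arcbuf_t *buf)
{
	if (db->db_holds == 1) {
		db->db_buf = buf;	/* nobody else can observe the swap */
		return (1);
	}
	memcpy(db->db_buf->b_data, buf->b_data, buf->b_size);
	return (0);
}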
/*
 * Decrementing the dbuf count means that the hold
 * corresponding to the removed dbuf is no longer
 * discounted in dnode_move(), so the dnode cannot be
 * moved until after we release the hold.
 */

/* dbuf_find() returns with db_mtx held */

/*
 * This dbuf is already in the cache.  We assume that
 * it is already CACHED, or else about to be either
 * read or filled.
 */

/*
 * Returns with db_holds incremented, and db_mtx not held.
 * Note: dn_struct_rwlock must be held.
 */

/* dbuf_find() returns with db_mtx held */

/*
 * If this buffer is currently syncing out, and we are
 * still referencing it from db_data, we need to make a copy
 * of it in case we decide we want to dirty it again in this txg.
 */

/* NOTE: we can't rele the parent until after we drop the db_mtx */

/*
 * If you call dbuf_rele() you had better not be referencing the dnode handle
 * unless you have some other direct or indirect hold on the dnode.  (An
 * indirect hold is a hold on one of the dnode's dbufs, including the bonus
 * buffer.)  Without that, the dbuf_rele() could lead to a dnode_rele()
 * followed by the dnode's parent dbuf evicting its dnode handles.
 */

/*
 * dbuf_rele() for an already-locked dbuf.  This is necessary to allow
 * db_dirtycnt and db_holds to be updated atomically.
 */

/*
 * Remove the reference to the dbuf before removing its hold on the
 * dnode so we can guarantee in dnode_move() that a referenced bonus
 * buffer has a corresponding dnode hold.
 */

/*
 * We can't freeze indirects if there is a possibility that they
 * may be modified in the current syncing context.
 */

/*
 * If the dnode moves here, we cannot cross this barrier
 * until the move completes.
 */

/*
 * The bonus buffer's dnode hold is no longer discounted
 * in dnode_move().  The dnode cannot move until after
 * we release the hold.
 */

/*
 * This is a special case: we never associated this
 * dbuf with any data allocated from the ARC.
 */

/* This dbuf has anonymous data associated with it. */

/*
 * A dbuf will be eligible for eviction if either the
 * 'primarycache' property is set or a duplicate
 * copy of this buffer is already cached in the arc.
 *
 * In the case of the 'primarycache' a buffer
 * is considered for eviction if it matches the
 * criteria set in the property.
 *
 * To decide if our buffer is considered a
 * duplicate, we must call into the arc to determine
 * if multiple buffers are referencing the same
 * block on-disk.  If so, then we simply evict
 * ourselves.
 */

/* ASSERT(dmu_tx_is_syncing(tx)) */

/*
 * This buffer was allocated at a time when there were
 * no available blkptrs from the dnode, or it was
 * inappropriate to hook it in (i.e., nlevels mis-match).
 */

/* Read the block if it hasn't been read yet. */

/* Indirect block size must match what the dnode thinks it is. */

/* Provide the pending dirty record to child dbufs */

/*
 * To be synced, we must be dirtied.  But we
 * might have been freed after the dirty.
 */

/* This buffer has been freed since it was dirtied */

/* This buffer was freed and is now being re-filled */

/*
 * If this is a bonus buffer, simply copy the bonus data into the
 * dnode.  It will be written out when the dnode is synced (and it
 * will be synced, since it must have been dirty for dbuf_sync to
 * be called).
 */

/*
 * This function may have dropped the db_mtx lock allowing a dmu_sync
 * operation to sneak in.  As a result, we need to ensure that we
 * don't check the dr_override_state until we have returned from
 * dbuf_check_blkptr().
 */

/*
 * If this buffer is in the middle of an immediate write,
 * wait for the synchronous IO to complete.
 */
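The eviction policy sketched above boils down to two questions at the final release: does the 'primarycache' setting exclude this buffer from the cache, and does the ARC already hold another copy of the same on-disk block? A simplified standalone model (the names are invented; the real check consults the dnode's cache properties and asks the ARC about duplicates):

typedef enum {
	MODEL_CACHE_NONE,
	MODEL_CACHE_METADATA,
	MODEL_CACHE_ALL
} model_cache_prop_t;

/*
 * Return nonzero if the buffer should be evicted when its last hold is
 * dropped: either the primarycache setting excludes it, or the ARC
 * already caches a duplicate of the same block.
 */
static int
model_evict_on_last_rele(model_cache_prop_t primarycache, int is_metadata,
    int arc_has_duplicate)
{
	if (primarycache == MODEL_CACHE_NONE)
		return (1);
	if (primarycache == MODEL_CACHE_METADATA && !is_metadata)
		return (1);
	return (arc_has_duplicate);
}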
/*
 * If this buffer is currently "in use" (i.e., there
 * are active holds and db_data still references it),
 * then make a copy before we start the write so that
 * any modifications from the open txg will not leak
 * into this write.
 *
 * NOTE: this copy does not need to be made for
 * objects only modified in the syncing context (e.g.
 * DNONE_DNODE blocks).
 */

/*
 * Although zio_nowait() does not "wait for an IO", it does
 * initiate the IO.  If this is an empty write it seems plausible
 * that the IO could actually be completed before the nowait
 * returns.  We need to DB_DNODE_EXIT() first in case
 * zio_nowait() invalidates the dbuf.
 */

/*
 * If we find an already initialized zio then we
 * are processing the meta-dnode, and we have finished.
 * The dbufs for all dnodes are put back on the list
 * during processing, so that we can zio_wait()
 * these IOs after initiating all child IOs.
 */

/*
 * The SPA will call this callback several times for each zio - once
 * for every physical child i/o (zio->io_phys_children times).  This
 * allows the DMU to monitor the progress of each logical i/o.  For example,
 * there may be 2 copies of an indirect block, or many fragments of a RAID-Z
 * block.  There may be a long delay before all copies/fragments are completed,
 * so this callback allows us to retire dirty space gradually, as the physical
 * i/os complete.
 */

/*
 * The callback will be called io_phys_children times.  Retire one
 * portion of our dirty space each time we are called.  Any rounding
 * error will be cleaned up by dsl_pool_sync()'s call to
 * dsl_pool_undirty_space().
 */

/*
 * For nopwrites and rewrites we ensure that the bp matches our
 * original and bypass all the accounting.
 */

/* Issue I/O to commit a dirty buffer to disk. */

/*
 * Private object buffers are released here rather
 * than in dbuf_dirty() since they are only modified
 * in the syncing context and we don't want the
 * overhead of making multiple copies of the data.
 */

/* Our parent is an indirect block. */

/* We have a dirty parent that has been scheduled for write. */

/* Our parent's buffer is one level closer to the dnode. */

/*
 * We're about to modify our parent's db_data by modifying
 * our block pointer, so the parent must be released.
 */

/* Our parent is the dnode itself. */

/*
 * The BP for this block has been provided by open context
 * (by dmu_sync() or dmu_buf_write_embedded()).
 */
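The "retire one portion per physical child i/o" accounting described earlier in this section is just integer division with the remainder deferred: each callback undirties dirty_size / phys_children bytes, and whatever is lost to rounding is swept up at the end of the txg. A small runnable sketch of that arithmetic; the values and names are illustrative only:

#include <stdint.h>
#include <stdio.h>

static uint64_t
retire_one_portion(uint64_t dirty_size, int phys_children)
{
	/* One equal share per physical child i/o completion. */
	return (dirty_size / (uint64_t)phys_children);
}

int
main(void)
{
	uint64_t dirty = 131072 + 3;	/* 128K charge plus a deliberate remainder */
	int children = 4;		/* e.g. two copies, each split in two */
	uint64_t retired = 0;

	for (int i = 0; i < children; i++)
		retired += retire_one_portion(dirty, children);

	/* The rounding error is left for the end-of-txg cleanup. */
	printf("retired %llu of %llu, remainder %llu\n",
	    (unsigned long long)retired, (unsigned long long)dirty,
	    (unsigned long long)(dirty - retired));
	return (0);
}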