dbuf.c revision 44cd46cadd9aab751dae6a4023c1cb5bf316d274
/*
 * dbuf hash table routines
 */

/*
 * Insert an entry into the hash table.  If there is already an element
 * equal to elem in the hash table, then the already existing element
 * will be returned and the new element will not be inserted.
 * Otherwise returns NULL.
 */

/*
 * Remove an entry from the hash table.  This operation will
 * fail if there are any existing holds on the db.
 * We mustn't hold db_mtx, to maintain lock ordering:
 * DBUF_HASH_MUTEX > db_mtx.
 */

/*
 * The hash table is big enough to fill all of physical memory
 * with an average 4K block size.  The table will take up
 * totalmem * sizeof (void *) / 4K (i.e. 2MB/GB with 8-byte pointers).
 */

/* XXX - we should really return an error instead of assert */

/* we can be momentarily larger in dnode_set_blksz() */

/*
 * it should only be modified in syncing
 * context, so make sure we only have
 * one copy of the data.
 */

/* verify db->db_blkptr */

/* db is pointed to by the dnode */
/* ASSERT3U(db->db_blkid, <, dn->dn_nblkptr); */

/* db is pointed to by an indirect block */

/*
 * dnode_grow_indblksz() can make this fail if we don't
 * have the struct_rwlock.  XXX indblksz no longer
 * grows; safe to do this now?
 */

/*
 * If the blkptr isn't set but they have nonzero data,
 * it had better be dirty; otherwise we'll lose that
 * data when we evict this buffer.
 */

/* All reads are synchronous, so we must have a hold on the dbuf */

/* we were freed in flight; disregard any error */

/* We need the struct_rwlock to prevent db_blkptr from changing. */

/* ZIO_FLAG_CANFAIL callers have to check the parent zio's error */

/*
 * We don't have to hold the mutex to check db_state because it
 * can't be freed while we have a hold on the buffer.
 */

/* dbuf_read_impl has dropped db_mtx for us */

/*
 * This is our just-in-time copy function.  It makes a copy of
 * buffers that have been modified in a previous transaction
 * group before we modify them in the current active group.
 */
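The just-in-time copy rule described above can be sketched as follows. This is a toy model, not the real dbuf code: `fake_dbuf_t`, `fix_old_data()`, and its fields are illustrative stand-ins for the dbuf's data pointer, the older txg's reference, and the hold count.

```c
#include <stdlib.h>
#include <string.h>

typedef struct fake_dbuf {
	void	*db_data;	/* current data, possibly shared with old txg */
	void	*db_old_data;	/* reference held by the older (quiescing/syncing) txg */
	size_t	db_size;
	int	db_holds;	/* active holders in open context */
} fake_dbuf_t;

void
fix_old_data(fake_dbuf_t *db)
{
	if (db->db_old_data != db->db_data)
		return;		/* old txg already has its own copy */

	if (db->db_holds == 0) {
		/* no active holders: just null out the old reference */
		db->db_old_data = NULL;
	} else {
		/* make a copy and repoint the old txg at the copy */
		void *copy = malloc(db->db_size);
		memcpy(copy, db->db_data, db->db_size);
		db->db_old_data = copy;
	}
}
```

Either way, subsequent writes through `db_data` in the open txg can no longer leak into the data the older txg will sync out.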
/*
 * This function is used in two places: when we are dirtying a
 * buffer for the first time in a txg, and when we are freeing
 * a range in a dnode that includes this buffer.
 *
 * Note that when we are called from dbuf_free_range() we do
 * not put a hold on the buffer; we just traverse the active
 * dbuf list for the dnode.
 */

/*
 * If this buffer is referenced from the current quiescing
 * transaction group: either make a copy and reset the reference
 * to point to the copy, or (if there are no active holders) just
 * null out the current db_data pointer.
 */

/*
 * If the quiescing txg is "dirty", then we'd better not
 * be referencing the same buffer from the syncing txg.
 */

/*
 * If this buffer is referenced from the current syncing
 * transaction group: either
 *	1 - make a copy and reset the reference, or
 *	2 - if there are no holders, just null the current db_data.
 */

/* we can't copy if we have already started a write */

/*
 * This is the "bonus buffer" version of the above routine
 */

/* XXX can get silent EIO here */

/* release the already-written buffer */

/* found a level 0 buffer in the range */

/* will be handled in dbuf_read_done or dbuf_rele */

/* The dbuf is CACHED and referenced */

/*
 * This dbuf is not currently dirty.  We will either
 * uncache it (if it's not referenced in the open
 * context) or reset its contents to empty.
 */

/* This dbuf is overridden.  Clear that state. */

/* fill in with appropriate data */

/* Don't count meta-objects */

/*
 * We don't need any locking to protect db_blkptr:
 * If it's syncing, then db_dirtied will be set so we'll
 */

/* If we have been dirtied since the last snapshot, it's not new */

/* XXX does *this* func really need the lock? */
/*
 * This call to dbuf_will_dirty() with the dn_struct_rwlock held
 * is OK, because there can be no other references to the db
 * when we are changing its size, so no concurrent DB_FILL can
 * be in progress.
 */

/*
 * XXX we should be doing a dbuf_read, checking the return
 * value and returning that up to our callers
 */

/* create the data buffer for the new block */

/* copy old block data to the new block */

/*
 * Shouldn't dirty a regular buffer in syncing context.  Private
 * objects may be dirtied in syncing context, but only if they
 * were already pre-dirtied in open context.
 * XXX We may want to prohibit dirtying in syncing context even
 * if they did pre-dirty.
 */

/*
 * We make this assert for private objects as well, but after we
 * check if we're already dirty.  They are allowed to re-dirty
 * in syncing context.
 */

/* XXX make this true for indirects too? */

/*
 * If this buffer is currently part of an "overridden" region,
 * we now need to remove it from that region.
 */

/*
 * Don't set dirtyctx to SYNC if we're just modifying this as we
 * initialize the objset.
 */

/* If this buffer is already dirty, we're done. */

/* Only valid if not already dirty. */

/*
 * We should only be dirtying in syncing context if it's the
 * mos, a spa os, or we're initializing the os.  However, we are
 * allowed to dirty in syncing context provided we already
 * dirtied it in open context.  Hence we must make this
 * assertion only if we're not already dirty.
 */

/*
 * If this buffer is dirty in an old transaction group we need
 * to make a copy of it so that the changes we make in this
 * transaction group won't leak out when we sync the older txg.
 */

/*
 * Release the data buffer from the cache so that we
 * can modify it without impacting possible other users
 * of this cached data block.  Note that indirect blocks
 * and private objects are not released until the syncing
 * state (since they are only modified then).
 */

/*
 * We could have been freed_in_flight between the dbuf_noread
 * and dbuf_dirty.  We win, as though the dbuf_noread() had
 * happened after the free.
 */
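The "already dirty in this txg?" check above relies on per-txg dirty state. A minimal sketch, assuming a small ring indexed by txg modulo the number of concurrently in-flight txgs (the real code uses TXG_SIZE/TXG_MASK; the value 4 here is an assumption, and `toy_dbuf_t`/`toy_dirty()` are illustrative names):

```c
#include <stdint.h>

#define	TOY_TXG_SIZE	4
#define	TOY_TXG_MASK	(TOY_TXG_SIZE - 1)

typedef struct toy_dbuf {
	int db_dirty[TOY_TXG_SIZE];	/* one dirty flag per in-flight txg */
} toy_dbuf_t;

/*
 * Returns 1 if this call newly dirtied the buffer in txg,
 * 0 if it was already dirty in that txg (nothing to do).
 */
int
toy_dirty(toy_dbuf_t *db, uint64_t txg)
{
	if (db->db_dirty[txg & TOY_TXG_MASK])
		return (0);	/* already dirty in this txg: we're done */
	db->db_dirty[txg & TOY_TXG_MASK] = 1;
	return (1);
}
```

A buffer dirty in an *older* slot of the ring is exactly the case where the just-in-time copy above must run before the open txg modifies the data.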
/*
 * This is only a guess -- if the dbuf is dirty
 * in a previous txg, we don't know how much
 * space it will use on disk yet.  We should
 * really have the struct_rwlock to access
 * db_blkptr, but since this is just a guess,
 * it's OK if we get an odd answer.
 */

/*
 * This buffer is now part of this txg
 */

/* If this buffer is not dirty, we're done. */

/*
 * If this buffer is currently held, we cannot undirty
 * it, since one of the current holders may be in the
 * middle of an update.  Note that users of dbuf_undirty()
 * should not place a hold on the dbuf before the call.
 */

/*
 * XXX - this check assumes we are being called from
 * dbuf_free_range(); perhaps we should move it there?
 */

/* XXX would be nice to fix up dn_towrite_space[] */

/* XXX undo db_dirtied? but how? */
/* db->db_dirtied = tx->tx_txg; */

/* we were freed while filling */

/*
 * "Clear" the contents of this dbuf.  This will mark the dbuf
 * EVICTING and clear *most* of its references.  Unfortunately,
 * when we are not holding the dn_dbufs_mtx, we can't clear the
 * entry in the dn_dbufs list.  We have to wait until dbuf_destroy()
 * in this case.  For callers from the DMU we will usually see:
 *	dbuf_clear()->arc_buf_evict()->dbuf_do_evict()->dbuf_destroy()
 * For the arc callback, we will usually see:
 *	dbuf_do_evict()->dbuf_clear();dbuf_destroy()
 * Sometimes, though, we will get a mix of these two:
 *	DMU: dbuf_clear()->arc_buf_evict()
 *	ARC: dbuf_do_evict()->dbuf_destroy()
 */

/*
 * If this dbuf is referenced from an indirect dbuf,
 * decrement the ref count on the indirect dbuf.
 */

/* the buffer has no parent yet */

/* this block is referenced from an indirect block */

/* the block is referenced from the dnode */

/* the bonus dbuf is not placed in the hash table */

/*
 * Hold the dn_dbufs_mtx while we get the new dbuf
 * in the hash table *and* added to the dbufs list.
 */
/*
 * This prevents a possible deadlock with someone
 * trying to look up this dbuf before it's added to the
 * dn_dbufs list.
 */

/* someone else inserted it first */

/*
 * If this dbuf is still on the dn_dbufs list,
 * remove it from that list.
 */

/* dbuf_find() returns with db_mtx held */

/*
 * This dbuf is already in the cache.  We assume that
 * it is already CACHED, or else about to be either
 * read or filled.
 */

/*
 * Returns with db_holds incremented, and db_mtx not held.
 * Note: dn_struct_rwlock must be held.
 */

/* dbuf_find() returns with db_mtx held */

/*
 * If this buffer is currently syncing out, and we are
 * still referencing it from db_data, we need to make
 * a copy of it in case we decide we want to dirty it
 * again in this txg.
 */

/* NOTE: we can't rele the parent until after we drop the db_mtx */

/*
 * This is a special case: we never associated this
 * dbuf with any data allocated from the ARC.
 */

/* This dbuf has anonymous data associated with it. */

/*
 * To be synced, we must be dirtied.  But we
 * might have been freed after the dirty.
 */

/* This buffer has been freed since it was dirtied */

/* This buffer was freed and is now being re-filled */

/*
 * Don't need a lock on db_dirty (dn_mtx), because it can't
 */

/*
 * Simply copy the bonus data into the dnode.  It will
 * be written out when the dnode is synced (and it will
 * be synced, since it must have been dirty for dbuf_sync
 * to be called).
 *
 * Use dn_phys->dn_bonuslen since db.db_size is the length
 * of the bonus buffer in the open transaction rather than
 * the syncing transaction.
 */

/*
 * If this buffer is currently "in use" (i.e., there are
 * active holds and db_data still references it), then make
 * a copy before we start the write so that any modifications
 * from the open txg will not leak into this write.
 *
 * NOTE: this copy does not need to be made for objects only
 * modified in the syncing context (e.g. DNODE_DNODE blocks)
 * or if there is no actual write involved (bonus blocks).
 */
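The bonus-buffer sync path above is simple enough to sketch: copy the bonus data into the dnode's on-disk image, using the syncing txg's length (`dn_bonuslen`) rather than the open-context `db_size`. The struct, field sizes, and `sync_bonus()` below are illustrative, not the real ZFS definitions.

```c
#include <stdint.h>
#include <string.h>

#define	TOY_BONUS_MAX	320

typedef struct toy_dnode_phys {
	uint16_t	dn_bonuslen;			/* syncing txg's bonus length */
	uint8_t		dn_bonus[TOY_BONUS_MAX];	/* bonus area in the dnode image */
} toy_dnode_phys_t;

void
sync_bonus(toy_dnode_phys_t *dnp, const void *db_data, size_t db_size)
{
	/* use the syncing txg's length, not the open txg's db_size */
	size_t len = dnp->dn_bonuslen;

	if (len > db_size)
		len = db_size;	/* defensive clamp, for this sketch only */
	memcpy(dnp->dn_bonus, db_data, len);
}
```

No block write is issued: the copied bonus data goes out when the dnode itself is synced.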
/*
 * Private object buffers are released here rather
 * than in dbuf_dirty(), since they are only modified
 * in the syncing context and we don't want the
 * overhead of making multiple copies of the data.
 */

/*
 * This can happen if we dirty and then free
 * the level-0 data blocks in the same txg.  So
 * this indirect remains unchanged.
 */

/*
 * This indirect buffer was marked dirty, but
 * never modified (if it had been modified, then
 * we would have released the buffer).  There is
 * no reason to write anything.
 */

/*
 * This buffer was allocated at a time when there were
 * no available blkptrs from the dnode, or it was
 * inappropriate to hook it in (i.e., nlevels mismatch).
 */

/*
 * Don't write indirect blocks past EOF.
 * We get these when we truncate a file *after* dirtying
 * blocks in the truncate range (we undirty the level 0
 * blocks in dbuf_free_range(), but not the indirects).
 */

/*
 * Verify that this indirect block is empty.
 */
blkptr_t *bp = db->db.db_data;
for (i = 0; i < (1 << epbs); i++) {
	if (!BP_IS_HOLE(&bp[i])) {
		panic("db=%p level=%d id=%llu i=%d\n",
		    db, db->db_level,
		    (u_longlong_t)db->db_blkid, i);
	}
}
/*
 * We may have read this indirect block after we dirtied it,
 * so never released it from the cache.
 */

/*
 * We don't need to dnode_setdirty(dn) because if we got
 * here then the parent is already dirty.
 */

/*
 * XXX -- we should design a compression algorithm
 * that specializes in arrays of bps.
 */

/*
 * Allow dnode settings to override objset settings,
 * except for metadata checksums.
 */

/*
 * We can't access db after arc_write, since it could finish
 * and be freed, and we have no locks on it.
 */

/* We must do this after we've set the bp's type and level */
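Returning to the hash-table contract stated at the top of these notes (insert returns the pre-existing equal element, or NULL after inserting), a minimal sketch with illustrative names (`hash_elem_t`, `hash_insert()`; the real code keys on objset/object/level/blkid and takes the bucket's DBUF_HASH_MUTEX, omitted here):

```c
#include <stddef.h>
#include <stdint.h>

#define	TOY_HASH_BUCKETS	256

typedef struct hash_elem {
	uint64_t		he_key;		/* stand-in for the dbuf's identity */
	struct hash_elem	*he_next;	/* bucket chain link */
} hash_elem_t;

static hash_elem_t *toy_hash_table[TOY_HASH_BUCKETS];

/*
 * Insert elem into the table.  If an element with an equal key is
 * already present, return it and leave the table unchanged;
 * otherwise insert elem and return NULL.
 */
hash_elem_t *
hash_insert(hash_elem_t *elem)
{
	size_t idx = (size_t)(elem->he_key % TOY_HASH_BUCKETS);
	hash_elem_t *he;

	for (he = toy_hash_table[idx]; he != NULL; he = he->he_next) {
		if (he->he_key == elem->he_key)
			return (he);	/* caller adopts the existing element */
	}
	elem->he_next = toy_hash_table[idx];
	toy_hash_table[idx] = elem;
	return (NULL);
}
```

This insert-or-return-existing shape is what lets two concurrent lookups race to create the same dbuf: the loser frees its new element and takes a hold on the winner's.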