/* dsl_scan.c, revision 38d61036746e2273cc18f6698392e1e29f87d1bf */
/*
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 *
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each file.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 */

/*
 * Copyright (c) 2008, 2010, Oracle and/or its affiliates. All rights reserved.
 * Copyright (c) 2011, 2015 by Delphix. All rights reserved.
 * Copyright 2016 Gary Mills
 */

/* max number of blocks to free in a single TXG */

/* the order has to match pool_scan_type */

/*
 * It's possible that we're resuming a scan after a reboot, so
 * make sure that the scan_async_destroying flag is initialized
 * appropriately.
 */

/*
 * There was an old-style scrub in progress.  Restart a
 * new-style scrub from the beginning.
 */
"restarting new-style scrub in txg %llu",
/*
 * Load the queue obj from the old location so that it
 * can be freed by dsl_scan_done().
 */

/*
 * A new-type scrub was in progress on an old
 * pool, and the pool was accessed by old
 * software.  Restart from the beginning, since
 * the old software may have changed the pool in
 * the meantime.
 */
"by old software; restarting in txg %llu",
/* rewrite all disk labels */

/*
 * If this is an incremental scrub, limit the DDT scrub phase
 * to just the auto-ditto class (for correctness); the rest
 * of the scrub should go faster using top-down pruning.
 */

/* back to the generic stuff */
"func=%u mintxg=%llu maxtxg=%llu",
0N/A "scrub_ddt_bookmark",
0N/A "scrub_ddt_class_max",
/* Remove any remnants of an old-style scrub. */

/*
 * If we were "restarted" from a stopped state, don't bother
 * with anything else.
 */

/*
 * If the scrub/resilver completed, update all DTLs to
 * reflect this.  Whether it succeeded or not, vacate
 * all temporary scrub DTLs.
 */

/*
 * We may have finished replacing a device.
 * Let the async thread assess this and handle the detach.
 */

/* We only know how to resume from level-0 blocks. */

/*
 * We pause if:
 *  - we have scanned for the maximum time: an entire txg
 *    timeout (default 5 sec)
 *  or
 *  - we have scanned for at least the minimum time (default 1 sec
 *    for scrub, 3 sec for resilver), and either we have sufficient
 *    dirty data that we are starting to write more quickly
 *    (default 30%), or someone is explicitly waiting for this txg
 *    to complete
 *  or
 *  - the spa is shutting down because this pool is being exported
 *    or the machine is rebooting.
 */
dprintf("pausing at bookmark %llx/%llx/%llx/%llx\n",
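The pause criteria above can be sketched as a single predicate. This is a simplified illustration, not the actual dsl_scan_check_pause() logic: the function name, the parameter list, and the literal 30% threshold are stand-ins for the tunables the comment describes.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sketch of the pause decision (illustrative names).
 * elapsed_ms: time spent scanning in this txg sync cycle
 * min_ms/max_ms: minimum/maximum scan time (e.g. 1000 ms / 5000 ms)
 * dirty_pct: percent of the dirty-data limit currently in use
 * txg_waited: someone is explicitly waiting for this txg to complete
 * exporting: the pool is being exported or the machine is rebooting
 */
static bool
scan_should_pause(uint64_t elapsed_ms, uint64_t min_ms, uint64_t max_ms,
    int dirty_pct, bool txg_waited, bool exporting)
{
	if (exporting)
		return (true);		/* spa is shutting down */
	if (elapsed_ms >= max_ms)
		return (true);		/* entire txg timeout used up */
	if (elapsed_ms >= min_ms && (dirty_pct >= 30 || txg_waited))
		return (true);		/* min time reached, plus pressure */
	return (false);
}
```

Note that the minimum-time branch fires only when there is additional pressure (dirty data accumulating, or a waiter on the txg); with no pressure, the scan keeps running until the full txg timeout.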
dprintf("pausing at DDT bookmark %llx/%llx/%llx/%llx\n",
/*
 * One block ("stubby") can be allocated a long time ago; we
 * want to visit that one because it has been allocated
 * (on-disk) even if it hasn't been claimed (even though for
 * scrub there's nothing to do to it).
 */

/*
 * birth can be < claim_txg if this record's txg is
 * already txg sync'ed (but this log block contains
 * other records that are not synced)
 */

/*
 * We only want to visit blocks that have been claimed but not yet
 * replayed (or, in read-only mode, blocks that *would* be claimed).
 */

/* We never skip over user/group accounting objects (obj<0). */

/*
 * If we already visited this bp & everything below (in
 * a prior txg sync), don't bother doing it again.
 */

/*
 * If we found the block we're trying to resume from, or
 * we went past it to a different object, zero it out to
 * indicate that it's OK to start checking for pausing
 * again.
 */
dprintf("resuming at %llx/%llx/%llx/%llx\n",
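The resume logic above can be sketched with a simplified bookmark type. The struct and helper below are hypothetical: the real zbookmark_phys_t comparison is level-aware and more subtle than a flat memcmp, so this only illustrates the "clear the saved bookmark once reached or passed" idea.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical flat bookmark; stands in for zbookmark_phys_t. */
typedef struct {
	uint64_t zb_objset;
	uint64_t zb_object;
	uint64_t zb_level;
	uint64_t zb_blkid;
} bookmark_t;

/*
 * If the block being visited is the saved resume point, or we have
 * moved past it to a later object, zero the saved bookmark so that
 * pause checking may start again (a sketch of the comment above).
 */
static void
maybe_clear_resume(bookmark_t *resume, const bookmark_t *cur)
{
	if (resume->zb_objset == 0 && resume->zb_object == 0 &&
	    resume->zb_level == 0 && resume->zb_blkid == 0)
		return;		/* nothing to resume from */
	if (memcmp(resume, cur, sizeof (*cur)) == 0 ||
	    cur->zb_object > resume->zb_object)
		memset(resume, 0, sizeof (*resume));
}
```

Until the saved bookmark is cleared, the traversal is fast-forwarding through already-visited blocks, so pausing there would lose no progress worth saving.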
/*
 * Return nonzero on i/o error.
 * Return new buf to write out in *bufp.
 */

/*
 * We also always visit user/group accounting
 * objects, and never skip them, even if we are
 * pausing.  This is necessary so that the space
 * deltas from this txg get integrated.
 */

/*
 * The arguments are in this order because mdb can only print the
 * first 5; we want them to be useful.
 */

/* ASSERT(pbuf == NULL || arc_released(pbuf)); */
"visiting ds=%p/%llu zb=%llx/%llx/%llx/%llx bp=%p",
/*
 * If dsl_scan_ddt() has already visited this block, it will have
 * already done any translations or scrubbing, so don't call the
 * callback again.
 */

/*
 * If this block is from the future (after cur_max_txg), then we
 * are doing this on behalf of a deleted snapshot, and we will
 * revisit the future block on the next pass of this dataset.
 * Don't scan it now unless we need to because something
 * under it was modified.
 */

/*
 * - scn_cur_{min,max}_txg stays the same.
 * - Setting the flag is not really necessary if
 *   scn_cur_max_txg == scn_max_txg, because there
 *   is nothing after this snapshot that we care
 *   about.  However, we set it anyway and then
 *   ignore it when we retraverse it in
 *   dsl_scan_visitds().
 */
zfs_dbgmsg("destroying ds %llu; currently traversing; "
    "reset zb_objset to %llu",
zfs_dbgmsg("destroying ds %llu; currently traversing; "
    "reset bookmark to -1,0,0,0",
/*
 * We keep the same mintxg; it could be >
 * ds_creation_txg if the previous snapshot was
 * deleted too.
 */
zfs_dbgmsg("destroying ds %llu; in queue; removing",
/*
 * dsl_scan_sync() should be called after this, and should sync
 * out our changed state, but just to be safe, do it here.
 */
zfs_dbgmsg("snapshotting ds %llu; currently traversing; "
    "reset zb_objset to %llu",
zfs_dbgmsg("clone_swap ds %llu; currently traversing; "
    "reset zb_objset to %llu",
zfs_dbgmsg("clone_swap ds %llu; currently traversing; "
    "reset zb_objset to %llu",
/* Both were there to begin with */

/*
 * This can happen if this snapshot was created after the
 * scan started, and we already completed a previous snapshot
 * that was created after the scan started.  This snapshot
 * only references blocks with:
 *
 *	birth < our ds_creation_txg
 *	cur_min_txg is no less than ds_creation_txg.
 *	We have already visited these blocks.
 * or
 *	birth > scn_max_txg
 *	The scan requested not to visit these blocks.
 *
 * Subsequent snapshots (and clones) can reference our
 * blocks, or blocks with even higher birth times.
 * Therefore we do not need to visit them either,
 * so we do not add them to the work queue.
 *
 * Note that checking for cur_min_txg >= cur_max_txg
 * is not sufficient, because in that case we may need to
 * visit subsequent snapshots.  This happens when min_txg > 0,
 * which raises cur_min_txg.  In this case we will visit
 * this dataset but skip all of its blocks, because the
 * rootbp's birth time is < cur_min_txg.  Then we will
 * add the next snapshots/clones to the work queue.
 */
zfs_dbgmsg("scanning dataset %llu (%s) is unnecessary because "
    "cur_min_txg (%llu) >= max_txg (%llu)",
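The test behind that message can be sketched as a small predicate. The helper name and parameters are hypothetical; this only illustrates the window check the debug message reports.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of the dataset-skip test described above: scanning this
 * dataset is unnecessary when the txg window it could contribute,
 * [cur_min_txg, max_txg), is empty -- i.e. every block it references
 * was either already visited (birth below cur_min_txg) or was not
 * requested by the scan (birth above max_txg).
 */
static bool
scan_ds_unnecessary(uint64_t cur_min_txg, uint64_t max_txg)
{
	return (cur_min_txg >= max_txg);
}
```

As the comment above notes, the dataset is still visited (so its descendants reach the work queue); only its blocks are skipped.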
/*
 * Only the ZIL in the head (non-snapshot) is valid.  Even though
 * snapshots can have ZIL block pointers (which may be the same
 * BP as in the head), they must be ignored.  So we traverse the
 * ZIL here, rather than in scan_recurse(), because the regular
 * snapshot block-sharing rules don't apply to it.
 */

/*
 * Iterate over the bps in this ds.
 */
zfs_dbgmsg("scanned dataset %llu (%s) with min=%llu max=%llu; "

/*
 * We've finished this pass over this dataset.
 */

/*
 * If we did not completely visit this dataset, do another pass.
 */

/*
 * Add descendent datasets to work queue.
 */

/*
 * A bug in a previous version of the code could
 * cause upgrade_clones_cb() to not set
 * ds_next_snap_obj when it should, leading to a
 * missing entry.  Therefore we can only use the
 * next_clones_obj when its count is correct.
 */

/*
 * If this is a clone, we don't need to worry about it for now.
 */

/*
 * If there are N references to a deduped block, we don't want to scrub it
 * N times -- ideally, we should scrub it exactly once.
 *
 * We leverage the fact that the dde's replication class (enum ddt_class)
 * is ordered from highest replication class (DDT_CLASS_DITTO) to lowest
 * (DDT_CLASS_UNIQUE) so that we may walk the DDT in that order.
 *
 * To prevent excess scrubbing, the scrub begins by walking the DDT
 * to find all blocks with refcnt > 1, and scrubs each of these once.
 * Since there are two replication classes which contain blocks with
 * refcnt > 1, we scrub the highest replication class (DDT_CLASS_DITTO) first.
 * Finally the top-down scrub begins, only visiting blocks with refcnt == 1.
 *
 * There would be nothing more to say if a block's refcnt couldn't change
 * during a scrub, but of course it can so we must account for changes
 * in a block's replication class.
 *
 * Here's an example of what can occur:
 * If a block has refcnt > 1 during the DDT scrub phase, but has refcnt == 1
 * when visited during the top-down scrub phase, it will be scrubbed twice.
 * This negates our scrub optimization, but is otherwise harmless.
 *
 * If a block has refcnt == 1 during the DDT scrub phase, but has refcnt > 1
 * on each visit during the top-down scrub phase, it will never be scrubbed.
 * To catch this, ddt_sync_entry() notifies the scrub code whenever a block's
 * reference class transitions to a higher level (i.e. DDT_CLASS_UNIQUE to
 * DDT_CLASS_DUPLICATE); if it transitions from refcnt == 1 to refcnt > 1
 * while a scrub is in progress, it scrubs the block right then.
 */
dprintf("visiting ddb=%llu/%llu/%llu/%llx\n",
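The class ordering that the DDT phase relies on can be shown with a small sketch. The enum mirrors the ordering of the real ddt_class_t (highest replication first), but the walk itself is a simplification with an illustrative function name; per the earlier comment, an incremental scrub limits the phase to just the auto-ditto class, i.e. a cutoff of DDT_CLASS_DITTO.

```c
/*
 * Replication classes, ordered from highest replication (ditto) to
 * lowest (unique), matching the ordering of the real ddt_class_t.
 */
enum ddt_class {
	DDT_CLASS_DITTO = 0,
	DDT_CLASS_DUPLICATE,
	DDT_CLASS_UNIQUE,
	DDT_CLASSES
};

/*
 * Sketch: how many replication classes the DDT phase walks for a
 * given class_max cutoff.  Classes beyond the cutoff (for a full
 * scrub, the refcnt == 1 territory) are left to the top-down phase.
 */
static int
ddt_phase_classes(enum ddt_class class_max)
{
	int n = 0;
	for (int c = DDT_CLASS_DITTO; c < DDT_CLASSES; c++) {
		if (c > (int)class_max)
			break;	/* beyond the cutoff: top-down phase */
		n++;
	}
	return (n);
}
```

With a cutoff of DDT_CLASS_DUPLICATE the phase covers both refcnt > 1 classes, which is the "scrub each deduped block exactly once" optimization the comment describes.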
/* There should be no pending changes to the dedup table */
zfs_dbgmsg("scanned %llu ddt entries with class_max = %u; pausing=%u",
/* First do the MOS & ORIGIN */

/*
 * If we were paused, continue from here.  Note if the
 * ds we were paused on was deleted, the zb_objset may
 * be -1, so we will skip this and find a new objset
 * below.
 */

/*
 * In case we were paused right at the end of the ds, zero the
 * bookmark so we don't think that we're still trying to resume.
 */

/* keep pulling things out of the zap-object-as-queue */

/*
 * Check for scn_restart_txg before checking spa_load_state, so
 * that we can restart an old-style scan while the pool is being
 * imported (see dsl_scan_init).
 */

/* Only process scans in sync pass 1. */

/*
 * If the spa is shutting down, then stop scanning.  This will
 * ensure that the scan does not dirty any new data during the
 * shutdown phase.
 */

/*
 * If the scan is inactive due to a stalled async destroy, try again.
 */

/*
 * First process the async destroys.  If we pause, don't do
 * any scrubbing or resilvering.  This ensures that there are no
 * async destroys while we are scanning, so the scan code doesn't
 * have to worry about traversing it.  It is also faster to free the
 * blocks than to scrub them.
 */
"traverse_dataset_destroyed()",
err);
/* finished; deactivate async destroy feature */

/*
 * If we didn't make progress, mark the async
 * destroy as stalled, so that we will not initiate
 * a spa_sync() on its behalf.  Note that we only
 * check this if we are not finished, because if the
 * bptree had no blocks for us to visit, we can
 * finish without "making progress".
 */

/*
 * Write out changes to the DDT that may be required as a
 * result of the blocks freed.  This ensures that the DDT
 * is always consistent with the on-disk data.
 */

/*
 * We have finished background destroying, but there is still
 * some space left in the dp_free_dir.  Transfer this leaked
 * space to the dp_leak_dir.
 */

/* finished; verify that space accounting went to zero */

/* finished with scan. */
"ddt bm=%llu/%llu/%llu/%llx",
zfs_dbgmsg("doing scan sync txg %llu; bm=%llu/%llu/%llu/%llu",
zfs_dbgmsg("txg %llu traversal complete, waiting till txg %llu",
/*
 * This will start a new scan, or restart an existing one.
 */

/*
 * If we resume after a reboot, zab will be NULL; don't record
 * incomplete stats in that case.
 */
for (i = 0; i < 4; i++) {
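The four-iteration loop above feeds the per-level/per-type block statistics. Below is a hedged sketch of the counting trick; the array name, bounds, and the catch-all indices are illustrative stand-ins for the real zab_type array, DN_MAX_LEVELS, and DMU_OT_TOTAL.

```c
#include <stdint.h>

#define MAX_LEVELS	8	/* hypothetical; stands in for DN_MAX_LEVELS */
#define TYPE_TOTAL	0	/* hypothetical catch-all, like DMU_OT_TOTAL */

/* counts[level][type]: the extra row and the 0 column hold totals */
static uint64_t counts[MAX_LEVELS + 1][16];

/*
 * Each block is counted once per (level, type) combination: specific
 * level and type, specific level with the total type, total level with
 * the specific type, and the grand total.  That lets the status report
 * show both per-type and overall figures from one pass.
 */
static void
count_block(int level, int type)
{
	for (int i = 0; i < 4; i++) {
		int l = (i < 2) ? level : MAX_LEVELS;
		int t = (i & 1) ? type : TYPE_TOTAL;
		counts[l][t]++;
	}
}
```

The `i < 2` / `i & 1` selectors enumerate all four row/column combinations without nested loops, which is the shape of the loop in the surrounding code.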
/* If it's an intent log block, failure is expected. */

/*
 * Keep track of how much data we've examined so that
 * zpool(1M) status can make useful progress reports.
 */

/* if it's a resilver, this may not be in the target range */

/*
 * Gang members may be spread across multiple
 * vdevs, so the best estimate we have is the
 * scrub range, which has already been checked.
 * XXX -- it would be better to change our
 * allocation policy to ensure that all
 * gang members reside on the same vdev.
 */

/*
 * If we're seeing recent (zfs_scan_idle) "important" I/Os
 * then throttle our workload to limit the impact of a scan.
 */

/* do not relocate this block */

/*
 * Purge all vdev caches and probe all devices.  We do this here
 * rather than in sync context because this requires a writer lock
 * on the spa_config lock, which we can't do from sync context.  The
 * spa_scrub_reopen flag indicates that vdev_open() should not
 * attempt to start another scrub.
 */
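The zfs_scan_idle throttle mentioned above can be sketched as a timestamp comparison. The function and parameter names are hypothetical; the real code compares the time of the last non-scrub I/O against the zfs_scan_idle window and delays the scrub I/O when the window is still "hot".

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch: delay this scrub I/O if any "important" (non-scrub) I/O
 * completed within the last scan_idle_ticks ticks.  Illustrative
 * names; tick values would come from the platform clock.
 */
static bool
scan_io_should_delay(int64_t now, int64_t last_important_io,
    int64_t scan_idle_ticks)
{
	return (now - last_important_io <= scan_idle_ticks);
}
```

When the predicate is true, the scan issues its I/O after a short delay rather than dropping it, so progress continues at a reduced rate while foreground I/O is active.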