spa.c revision fa94a07fd0519b8abfd871ad8fe60e6bebe1e2bb
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 * See the License for the specific language governing permissions
 * and limitations under the License.
 * When distributing Covered Code, include this CDDL HEADER in each
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 * Copyright 2007 Sun Microsystems, Inc. All rights reserved.
 * Use is subject to license terms.
#pragma ident "%Z%%M% %I% %E% SMI"
 * This file contains all the routines used when modifying on-disk SPA state.
 * This includes opening, importing, destroying, exporting a pool, and syncing a

 * ==========================================================================
 * SPA properties routines
 * ==========================================================================
 * Add a (source=src, propname=propval) list to an nvlist.
 * Get property values from the spa configuration.
 * readonly properties
 * settable properties that are not stored in the pool property object.
 * Get zpool property values.
 * Get properties from the spa config.
	/* If no pool property object, no more prop to get. */
 * Get properties from the MOS pool property object.
	/* integer property */
	/* string property */
 * Validate the given pool properties nvlist and modify the list
 * for the property values to be set.
 * A bootable filesystem can not be on a RAIDZ pool
 * nor a striped pool with more than 1 device.
 * This is a special case which only occurs when
 * the pool has completely failed. This allows
 * the user to change the in-core failmode property
 * without syncing it out to disk (I/Os might
 * currently be blocked). We do this by returning
 * EIO to the caller (spa_prop_set) to trick it
 * into thinking we encountered a property validation
 * If the bootfs property value is dsobj, clear it.

 * ==========================================================================
 * ==========================================================================
 * Utility function which retrieves copies of the current logs and
 * re-initializes them in the process.
 * Activate an uninitialized pool.
 * Opposite of spa_activate().
 * If this was part of an import or the open otherwise failed, we may
 * still have errors left in the queues.
 * Empty them just in case.
 * Verify a pool configuration, and construct the vdev tree appropriately. This
 * will create all the necessary vdevs in the appropriate layout, with each vdev
 * All vdev validation is done by the vdev_alloc() routine.
 * Opposite of spa_load().
 * Wait for any outstanding prefetch I/O to complete.
 * Drop and purge level 2 cache
 * Close the dsl pool.
 * Load (or re-load) the current list of vdevs describing the active spares for
 * this pool. When this is called, we have some form of basic information in
 * then re-generate a more complete list including status information.
 * First, close and free any existing spare vdevs.
	/* Undo the call to spa_activate() below */
 * Construct the array of vdevs, opening them to get status in the
 * process. For each spare, there are potentially two different vdev_t
 * structures associated with it: one in the list of spares (used only
 * for basic validation purposes) and one in the active vdev
 * configuration (if it's spared in). During this phase we open and
 * validate each vdev on the spare list. If the vdev also exists in the
 * active configuration, then we also mark this vdev as an active spare.
 * We only mark the spare active if we were successfully
 * able to load the vdev. Otherwise, importing a pool
 * with a bad active spare would result in strange
 * behavior, because multiple pools would think the spare
 * is actively in use.
 * There is a vulnerability here to an equally bizarre
 * circumstance, where a dead active spare is later
 * brought back to life (onlined or otherwise). Given
 * the rarity of this scenario, and the extra complexity
 * it adds, we ignore the possibility.
 * Recompute the stashed list of spares, with status information
 * Load (or re-load) the current list of vdevs describing the active l2cache for
 * this pool.
 * When this is called, we have some form of basic information in
 * then re-generate a more complete list including status information.
 * Devices which are already active have their details maintained, and are
 * Process new nvlist of vdevs.
 * Commit this vdev as an l2cache device,
 * even if it fails to open.
 * Purge vdevs that were dropped
 * Recompute the stashed list of l2cache devices, with status
 * information this time.
 * Checks to see if the given vdev could not be opened, in which case we post a
 * sysevent to notify the autoreplace code that the device has been removed.
 * Load an existing storage pool, using the pool's builtin spa_config as a
 * source of configuration information.
 * Versioning wasn't explicitly added to the label until later, so if
 * it's not present treat it as the initial version.
 * Parse the configuration into a vdev tree. We explicitly set the
 * value that will be returned by spa_version() since parsing the
 * configuration requires knowing the version number.
 * Try to open all vdevs, loading each label in the process.
 * Validate the labels for all leaf vdevs. We need to grab the config
 * lock because all label I/O is done with the ZIO_FLAG_CONFIG_HELD
 * Find the best uberblock.
 * If we weren't able to find a single valid uberblock, return failure.
 * If the pool is newer than the code, we can't open it.
 * If the vdev guid sum doesn't match the uberblock, we have an
 * incomplete configuration.
 * Initialize internal SPA structures.
	"loaded as it was last accessed by "
	"another system (host: %s hostid: 0x%lx). "
 * Load the bit that tells us to use the new accounting function
 * (raid-z deflation). If we have an older pool, this will not
 * Load the persistent error log. If we have an older pool, this will
 * Load the history object. If we have an older pool, this
 * will not be present.
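The version handling described in the spa_load() comments above (a label without a version is treated as the initial version, and a pool newer than the code cannot be opened) can be sketched roughly as follows. This is a minimal illustration, not the code from spa.c: the function name, both constants, and the choice of ENOTSUP as the error are assumptions made for the example.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical constants for illustration; not the values used by spa.c. */
#define	EX_VERSION_INITIAL	1ULL
#define	EX_VERSION_SUPPORTED	10ULL

/*
 * Resolve the version to use for a label.  A label with no version
 * entry is treated as the initial version; a version newer than the
 * running code is rejected.  Returns 0 on success, or ENOTSUP (an
 * assumed error code) if the pool is too new.
 */
static int
ex_resolve_version(int has_version, uint64_t label_version, uint64_t *out)
{
	uint64_t version = has_version ? label_version : EX_VERSION_INITIAL;

	if (version > EX_VERSION_SUPPORTED)
		return (ENOTSUP);

	*out = version;
	return (0);
}
```

A caller would resolve the version before parsing the configuration, since the comments note that parsing requires knowing the version number.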
 * Load any hot spares for this pool.
 * Load any level 2 ARC devices for this pool.
 * If the 'autoreplace' property is set, then post a resource notifying
 * the ZFS DE that it should not issue any faults for unopenable
 * devices. We also iterate over the vdevs, and post a sysevent for any
 * unopenable vdevs so that the normal autoreplace handler can take
 * Load the vdev state for all toplevel vdevs.
 * Propagate the leaf DTLs we just loaded all the way up the tree.
 * Check the state of the root vdev. If it can't be opened, it
 * indicates one or more toplevel vdevs are faulted.
 * Claim log blocks that haven't been committed yet.
 * This must all happen in a single txg.
 * Wait for all claims to sync.
 * If the config cache is stale, or we have uninitialized
 * metaslabs (see spa_vdev_add()), then update the config.
 * Update the config cache asynchronously in case we're the
 * root pool, in which case the config cache isn't writable yet.
 * The import case is identical to an open except that the configuration is sent
 * down from userland, instead of grabbed from the configuration cache. For the
 * case of an open, the pool configuration will exist in the
 * POOL_STATE_UNINITIALIZED state.
 * the same time open the pool, without having to keep around the spa_t in some
 * As disgusting as this is, we need to support recursive calls to this
 * function because dsl_dir_open() is called during spa_load(), and ends
 * up calling spa_open() again. The real fix is to figure out how to
 * avoid dsl_dir_open() calling this in the first place.
 * If vdev_validate() returns failure (indicated by
 * EBADF), it indicates that one of the vdevs indicates
 * that the pool has been exported or destroyed. If
 * this is the case, the config cache is out of sync and
 * we should remove the pool from the namespace.
 * We can't open the pool, but we still have useful
 * information: the state of each vdev after the
 * attempted vdev_open(). Return this to the user.
 * If we just loaded the pool, resilver anything that's out of date.
 * Lookup the given spa_t, incrementing the inject count in the process,
 * preventing it from being exported or destroyed.
 * Add spares device information to the nvlist.
 * Go through and find any spares which have since been
 * repurposed as an active spare. If this is the case, update
 * their status appropriately.
 * Add l2cache device information to the nvlist, including vdev stats.
 * Update level 2 cache device stats.
 * We want to get the alternate root even for faulted pools, so we cheat
 * and call spa_lookup() directly.
 * Validate that the auxiliary device array is well formed. We must have an
 * array of nvlists, each of which describes a valid leaf vdev. If this is an
 * import (mode is VDEV_ALLOC_SPARE), then we allow corrupted spares to be
 * specified, as long as they are well-formed.
 * It's acceptable to have no devs specified.
 * Make sure the pool is formatted with a version that supports this
 * Set the pending device list so we correctly handle device in-use
 * The L2ARC currently only supports disk devices.
 * Generate new dev list by concatenating with the
 * Generate a new dev list.
 * Stop and drop level 2 ARC devices
 * If this pool already exists, return failure.
 * Allocate a new spa_t structure.
 * Get the list of spares, if specified.
 * Get the list of level 2 cache devices, if specified.
 * Create the pool config object.
	/* Newly created pools with the right version are always deflated. */
 * Create the deferred-free bplist object. Turn off compression
 * because sync-to-convergence takes longer if the blocksize
 * Create the pool's history object.
 * We explicitly wait for the first transaction to complete so that our
 * bean counters are appropriately updated.
 * Import the given pool into the system. We set up the necessary spa_t and
 * then call spa_load() to do the dirty work.
 * If a pool with this name exists, return failure.
 * Create and initialize the spa structure.
 * Pass off the heavy lifting to spa_load().
 * Pass TRUE for mosconfig because the user-supplied config
 * is actually the one to trust when doing an import.
 * Toss any existing sparelist, as it doesn't have any validity anymore,
 * and conflicts with spa_has_spare().
 * Override any spares and level 2 cache devices as specified by
 * the user, as these may have correct device names/devids, etc.
 * Update the config cache to include the newly-imported pool.
 * Resilver anything that's out of date.
 * This (illegal) pool name is used when temporarily importing a spa_t in order
 * to get the vdev stats associated with the imported devices.
 * Create and initialize the spa structure.
 * Pass off the heavy lifting to spa_load().
 * Pass TRUE for mosconfig because the user-supplied config
 * is actually the one to trust when doing an import.
 * If 'tryconfig' was at least parsable, return the current config.
 * Add the list of hot spares and level 2 cache devices.
 * The act of destroying or exporting a pool is very simple. We make sure there
 * is no more pending I/O and any references to the pool are gone. Then, we
 * update the pool state and sync all the labels to disk, removing the
 * configuration from the cache afterwards.
 * Put a hold on the pool, drop the namespace lock, stop async tasks,
 * reacquire the namespace lock, and see if we can export.
 * The pool will be in core if it's openable,
 * in which case we can modify its state.
 * Objsets may be open only because they're dirty, so we
 * have to force it to sync before checking spa_refcnt.
 * A pool cannot be exported or destroyed if there are active
 * references.
 * If we are resetting a pool, allow references by
 * fault injection handlers.
 * We want this to be reflected on every label,
 * so mark them all dirty. spa_unload() will do the
 * final sync that pushes these changes out.
 * Destroy a storage pool.
 * Similar to spa_export(), this unloads the spa_t without actually removing it
 * from the namespace in any way.

 * ==========================================================================
 * ==========================================================================
 * Add a device to a storage pool.
 * We must validate the spares and l2cache devices after checking the
 * children. Otherwise, vdev_inuse() will blindly overwrite the spare.
 * Transfer each new top-level vdev from vd to rvd.
 * We have to be careful when adding new vdevs to an existing pool.
 * If other threads start allocating from these vdevs before we
 * sync the config cache, and we lose power, then upon reboot we may
 * fail to open the pool because there are DVAs that the config cache
 * can't translate. Therefore, we first add the vdevs without
 * initializing metaslabs; sync the config cache (via spa_vdev_exit());
 * and then let spa_config_update() initialize the new metaslabs.
 * spa_load() checks for added-but-not-initialized vdevs, so that
 * if we lose power at any point in this sequence, the remaining
 * steps will be completed the next time we load the pool.
 * Attach a device to a mirror. The arguments are the path to any device
 * in the mirror, and the nvroot for the new device. If the path specifies
 * a device that is not mirrored, we automatically insert the mirror vdev.
 * If 'replacing' is specified, the new device is intended to replace the
 * existing device; in this case the two devices are made into their own
 * mirror using the 'replacing' vdev, which is functionally identical to
 * the mirror vdev (it actually reuses all the same ops) but has a few
 * extra rules: you can't attach to it after it's been created, and upon
 * completion of resilvering, the first disk (the one being replaced)
 * is automatically detached.
 * Spares can't replace logs
 * For attach, the only allowable parent is a mirror or the root
 * Active hot spares can only be replaced by inactive hot
 * If the source is a hot spare, and the parent isn't already a
 * spare, then we want to create a new hot spare. Otherwise, we
 * want to create a replacing vdev. The user is not allowed to
 * attach to a spared vdev child unless the 'isspare' state is
 * the same (spare replaces spare, non-spare replaces
 * The new device cannot have a higher alignment requirement
 * than the top-level vdev.
 * If this is an in-place replacement, update oldvd's path and devid
 * to make it distinguishable from newvd, and unopenable from now on.
 * If the parent is not a mirror, or if we're replacing, insert the new
 * Extract the new device from its root and add it to pvd.
 * If newvd is smaller than oldvd, but larger than its rsize,
 * the addition of newvd may have decreased our parent's asize.
 * Set newvd's DTL to [TXG_INITIAL, open_txg]. It will propagate
 * upward when spa_vdev_exit() calls vdev_dtl_reassess().
 * Mark newvd's DTL dirty in this txg.
 * Kick off a resilver to update newvd. We need to grab the namespace
 * lock because spa_scrub() needs to post a sysevent with the pool name.
 * Detach a device from a mirror or replacing vdev.
 * If 'replace_done' is specified, only detach if the parent
 * If replace_done is specified, only remove this device if it's
 * the first child of a replacing vdev. For the 'spare' vdev, either
 * Only mirror, replacing, and spare vdevs support detach.
 * If there's only one replica, you can't detach it.
 * If all siblings have non-empty DTLs, this device may have the only
 * valid copy of the data, which means we cannot safely detach it.
 * XXX -- as in the vdev_offline() case, we really want a more
 * If we are a replacing or spare vdev, then we can always detach the
 * latter child, as that is how one cancels the operation.
 * If we are detaching the original disk from a spare, then it implies
 * that the spare should become a real disk, and be removed from the
 * active spare list for the pool.
 * Erase the disk labels so the disk can be used for other things.
 * This must be done after all other error cases are handled,
 * but before we disembowel vd (so we can still do I/O to it).
 * But if we can't do it, don't treat the error as fatal --
 * it may be that the unwritability of the disk is the reason
 * Remove vd from its parent and compact the parent's children.
 * Remember one of the remaining children so we can get tvd below.
 * If we need to remove the remaining child from the list of hot spares,
 * do it now, marking the vdev as no longer a spare in the process. We
 * must do this before vdev_remove_parent(), because that can change the
 * GUID if it creates a new toplevel GUID.
 * the parent is no longer needed. Remove it from the tree.
 * We don't set tvd until now because the parent we just removed
 * may have been the previous top-level vdev.
 * Reevaluate the parent vdev state.
 * If the device we just detached was smaller than the others, it may be
 * possible to add metaslabs (i.e. grow the pool). vdev_metaslab_init()
 * can't fail because the existing metaslabs are already in core, so
 * there's nothing to read from disk.
 * Mark vd's DTL as dirty in this txg. vdev_dtl_sync() will see that
 * vd->vdev_detached is set and free vd's DTL object in syncing context.
 * But first make sure we're not on any *other* txg's DTL list, to
 * prevent vd from being accessed after it's freed.
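The first two detach rules in the comments above (a lone replica cannot be detached, and a device cannot be detached when every sibling has a non-empty DTL) can be sketched as a small predicate. This is a simplified illustration under stated assumptions: ex_child_t and ex_can_detach() are hypothetical names, a DTL is reduced to a single "empty or not" flag, and the special case that lets a replacing or spare vdev always drop its latter child is deliberately left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified child record: tracks only whether its DTL is empty. */
typedef struct ex_child {
	bool dtl_empty;		/* true if this child is missing no data */
} ex_child_t;

/*
 * Return true if children[target] may be detached: there must be more
 * than one replica, and at least one sibling must have an empty DTL
 * (i.e. hold a complete copy of the data).
 */
static bool
ex_can_detach(const ex_child_t *children, size_t nchildren, size_t target)
{
	size_t i;

	if (nchildren <= 1 || target >= nchildren)
		return (false);

	for (i = 0; i < nchildren; i++) {
		if (i != target && children[i].dtl_empty)
			return (true);
	}
	return (false);
}
```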
 * If this was the removal of the original device in a hot spare vdev,
 * then we want to go through and remove the device from the hot spare
 * list of every other pool.
 * Remove a spares vdev from the nvlist config.
 * Only remove the hot spare if it's not currently in use in this pool.
	for (i = 0, j = 0; i < nspares; i++) {
 * Remove an l2cache vdev from the nvlist config.
 * Remove a device from the pool. Currently, this supports removing only hot
 * spares and level 2 ARC devices.
 * Find any device that's done replacing, or a vdev marked 'unspare' that's
 * currently spared, so we can detach it.
 * Check for a completed replacement.
 * Check for a completed resilver with the 'unspare' flag set.
 * If we have just finished replacing a hot spared device, then
 * we need to detach the parent's first child (the original hot
 * Update the stored path for this vdev. Dirty the vdev configuration, relying
 * Determine if this is a reference to a hot spare or l2cache
 * device. If it is, update the path as stored in their

 * ==========================================================================
 * ==========================================================================
 * Do not give too much work to vdev(s).
 * We can't scrub this block, but we can continue to scrub
 * the rest of the pool. Note the error and move along.
 * Keep track of how much data we've examined so that
 * zpool(1M) status can make useful progress reports.
 * Gang members may be spread across multiple
 * vdevs, so the best we can do is look at the
 * XXX -- it would be better to change our
 * allocation policy to ensure that this can't
 * wait for that to complete.
	dprintf("start %s mintxg=%llu maxtxg=%llu\n",
 * Note: we check spa_scrub_restart_txg under both spa_scrub_lock
 * AND the spa config lock to synchronize with any config changes
 * that revise the DTLs under spa_vdev_enter() / spa_vdev_exit().
 * Even if there were uncorrectable errors, we consider the scrub
 * completed. The downside is that if there is a transient error during
 * a resilver, we won't resilver the data properly to the target. But
 * if the damage is permanent (more likely) we will resilver forever,
 * which isn't really acceptable. Since there is enough information for
 * the user to know what has failed and why, this seems like a more
	dprintf("end %s to maxtxg=%llu %s, traverse=%d, %llu errors, stop=%u\n",
 * If the scrub/resilver completed, update all DTLs to reflect this.
 * Whether it succeeded or not, vacate all temporary scrub DTLs.
 * We may have finished replacing a device.
 * Let the async thread assess this and handle the detach.
 * If we were told to restart, our final act is to start a new scrub.
 * Something happened (e.g. snapshot create/delete) that means
 * we must restart any in-progress scrubs. The itinerary will
 * If there's a scrub or resilver already in progress, stop it.
 * Don't stop a resilver unless forced.
 * Terminate the previous traverse.
 * The pool-wide DTL is empty.
 * If this is a resilver, there's nothing to do except
 * check whether any in-progress replacements have completed.
 * The pool-wide DTL is non-empty.
 * If this is a normal scrub, upgrade to a resilver instead.
 * Determine the resilvering boundaries.
 * Note: (mintxg, maxtxg) is an open interval,
 * i.e. mintxg and maxtxg themselves are not included.
 * Note: for maxtxg, we MIN with spa_last_synced_txg(spa) + 1
 * so we don't claim to resilver a txg that's still changing.

 * ==========================================================================
 * SPA async task processing
 * ==========================================================================
 * See if the config needs to be updated.
 * See if any devices need to be marked REMOVED.
 * XXX - We avoid doing this when we are in
 * I/O failure state since spa_vdev_enter() grabs
 * the namespace lock and would not be able to obtain
 * the writer config lock.
 * If any devices are done replacing, detach them.
 * Kick off a scrub. When starting a RESILVER scrub (or an EVERYTHING
 * scrub which can become a resilver), we need to hold
 * spa_namespace_lock() because the sysevent we post via
 * spa_event_notify() needs to get the name of the pool.
 * Let the world know that we're done.
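The resilvering boundary rule noted in the scrub comments above (an open interval (mintxg, maxtxg), with maxtxg clamped to the last synced txg plus one) can be sketched as below. This is an illustration only: ex_dtl_range_t and ex_resilver_bounds() are hypothetical, and the sketch assumes the pool-wide DTL has already been summarised as a single half-open range of missing txgs with first_missing >= 1.

```c
#include <stdint.h>

/* Hypothetical DTL summary: txgs in [first_missing, end_missing) lack data. */
typedef struct ex_dtl_range {
	uint64_t first_missing;
	uint64_t end_missing;	/* exclusive */
} ex_dtl_range_t;

/*
 * Compute the open interval (mintxg, maxtxg) to resilver.  Neither
 * endpoint is itself included.  maxtxg is clamped to last_synced_txg + 1
 * so that a txg which is still changing is never claimed as resilvered.
 */
static void
ex_resilver_bounds(const ex_dtl_range_t *dtl, uint64_t last_synced_txg,
    uint64_t *mintxg, uint64_t *maxtxg)
{
	uint64_t limit = last_synced_txg + 1;

	*mintxg = dtl->first_missing - 1;
	*maxtxg = (dtl->end_missing < limit) ? dtl->end_missing : limit;
}
```

With a missing range of [10, 20) and a last synced txg of 15, the sketch yields (9, 16), which covers txgs 10 through 15.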

 * ==========================================================================
 * ==========================================================================
 * Pre-dirty the first block so we sync to convergence faster.
 * (Usually only the first block is needed.)
 * Update the MOS nvlist describing the list of available devices.
 * spa_validate_aux() will have already made sure this nvlist is
 * valid and the vdevs are labeled appropriately.
 * Only set version for non-zpool-creation cases
 * (set/import). spa_create() needs special care
 * 'altroot' is a non-persistent property. It should
 * have been set temporarily at creation or import time.
 * 'cachefile' is a non-persistent property, but note
 * an async request that the config cache needs to be
 * Set pool property values in the poolprops mos object.
	/* normalize the property name */
	/* log internal history if this is not a zpool create */
 * Sync the specified transaction group. New blocks may be dirtied as
 * part of the process, so we iterate until it converges.
 * Lock out configuration changes.
 * If we are upgrading to SPA_VERSION_RAIDZ_DEFLATE this txg,
 * set spa_deflate if we have no raid-z vdevs.
 * If anything has changed in this txg, push the deferred frees
 * from the previous txg. If not, leave them alone so that we
 * don't generate work on an otherwise idle system.
 * Iterate to convergence.
 * Rewrite the vdev configuration (which includes the uberblock)
 * to commit the transaction group.
 * If there are any dirty vdevs, sync the uberblock to all vdevs.
 * Otherwise, pick a random top-level vdev that's known to be
 * visible in the config cache (see spa_vdev_add() for details).
 * If the write fails, try the next vdev until we've tried them all.
 * Clear the dirty config list.
 * Now that the new config has synced transactionally,
 * let it become visible to the config cache.
 * Make a stable copy of the fully synced uberblock.
 * We use this as the root for pool traversals.
 * Clean up the ZIL records for the synced txg.
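The "iterate until it converges" idea from the sync comments above can be illustrated with a toy loop: each pass writes out what is dirty and reports how many new blocks it dirtied, and the loop repeats until a pass dirties nothing. Everything here is hypothetical (the callback type, both functions, and the max_passes safety bound are assumptions for the sketch and do not appear in the source).

```c
#include <stdint.h>

/* Hypothetical callback: run one sync pass, return blocks newly dirtied. */
typedef uint64_t (*ex_sync_pass_fn)(void *state);

/*
 * Run sync passes until one of them dirties no new blocks.  Returns the
 * number of passes taken, or -1 if max_passes (an assumed safety bound)
 * is reached without converging.
 */
static int
ex_sync_to_convergence(ex_sync_pass_fn pass, void *state, int max_passes)
{
	int passes = 0;
	uint64_t dirtied;

	do {
		dirtied = pass(state);
		passes++;
	} while (dirtied != 0 && passes < max_passes);

	return (dirtied == 0 ? passes : -1);
}

/* Toy pass: writing the pending blocks dirties half as many new ones. */
static uint64_t
ex_halving_pass(void *state)
{
	uint64_t *pending = state;

	*pending /= 2;
	return (*pending);
}
```

Starting from 8 pending blocks, the toy pass dirties 4, 2, 1 and then 0 blocks, so the loop converges on the fourth pass.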
 * Update usable space statistics.
 * It had better be the case that we didn't dirty anything
 * since vdev_config_sync().
 * If any async tasks have been requested, kick them off.
 * Sync all pools. We don't want to hold the namespace lock across these
 * operations, so we take a reference on the spa_t and drop the lock during the

 * ==========================================================================
 * ==========================================================================
 * Remove all pools in the system.
 * Remove all cached state. All pools should be closed now,
 * so every spa in the AVL tree should be unreferenced.
 * Stop async tasks. The async thread may need to detach
 * a device that's been replaced, which requires grabbing
 * spa_namespace_lock, so we must drop it here.
 * This should only be called for a non-faulted pool, and since a
 * future version would result in an unopenable pool, this shouldn't be
 * Post a sysevent corresponding to the given event. The 'name' must be one of
 * filled in from the spa and (optionally) the vdev. This doesn't do anything
 * in the userland libzpool, as we don't want consumers to misinterpret ztest
 * or zdb as real changes.