vdev.c revision 13506d1eefbbc37e2f12a0528831d9f6d4c361d7
/*
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 *
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each
 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
 *
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 */

/*
 * Copyright 2006 Sun Microsystems, Inc.  All rights reserved.
 * Use is subject to license terms.
 */
#pragma ident	"%Z%%M%	%I%	%E% SMI"

/*
 * Virtual device management.
 */

/*
 * Given a vdev type, return the appropriate ops vector.
 */

/*
 * Default asize function: return the MAX of psize with the asize of
 * all children.  This is what's used by anything other than RAID-Z.
 */

/*
 * Get the replaceable or attachable device size.
 * If the parent is a mirror or raidz, the replaceable size is the minimum
 * psize of all its children.  For the rest, just return our own psize.
 *
 * If our parent is NULL or the root, just return our own psize.
 */

/*
 * Walk up all ancestors to update guid sum.
 */

/*
 * Remove any holes in the child array (see the compaction sketch below).
 */

/*
 * Allocate and minimally initialize a vdev_t.
 */

/*
 * The root vdev's guid will also be the pool guid,
 * which must be unique among all pools.
 */

/*
 * Any other vdev's guid must be unique within the pool.
 */

/*
 * Free a vdev_t that has been removed from service.
 */

/*
 * Allocate a new vdev.  The 'alloctype' is used to control whether we are
 * creating a new vdev or loading an existing one -- the behavior is slightly
 * different for each case.
 */

/*
 * If this is a load, get the vdev guid from the nvlist.
 * Otherwise, vdev_alloc_common() will generate one for us.
 */

/*
 * The first allocated vdev must be of type 'root'.
 */

/*
 * Set the nparity property for RAID-Z vdevs.
 *
 * Currently, we can only support 2 parity devices, and older versions
 * can only support 1 parity device.  We require the parity to be
 * specified for SPAs that support multiple parity levels; otherwise,
 * we default to 1 parity device for RAID-Z.
 */

/*
 * Set the whole_disk property.  If it's not specified, leave the value
 * as -1.
 */

/*
 * Look for the 'not present' flag.  This will only be set if the device
 * was not present at the time of import.
 */

/*
 * Get the alignment requirement.
 */

/*
 * Look for the 'is_spare' flag.  If this is the case, then we are a
 * spare device.
 */

/*
 * If we're a top-level vdev, try to load the allocation parameters.
 */

/*
 * If we're a leaf vdev, try to load the DTL object and offline state.
 */

/*
 * Add ourselves to the parent's list of children.
 */

/*
 * vdev_free() implies closing the vdev first.  This is simpler than
 * trying to ensure complicated semantics for all callers.
 */

/*
 * Discard allocation state.
 */

/*
 * Remove this vdev from its parent's child list.
 */

/*
 * Transfer top-level vdev state from svd to tvd.
 */

/*
 * If we created a new toplevel vdev, then we need to change the child's
 * vdev GUID to match the old toplevel vdev.  Otherwise, we could have
 * detached an offline device, and when we go to import the pool we'll
 * think we have two toplevel vdevs, instead of a different version of
 * the same toplevel vdev.
 */

	for (m = 0; m < count; m++)
		...
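/*
 * [Editor's sketch, not from the original file.]  The "remove any holes"
 * comment above describes child-array compaction, whose body was lost in
 * extraction.  A minimal user-space rendering, assuming a parent whose
 * vdev_child array can contain NULL holes after removals; the helper name
 * and the use of malloc()/free() are illustrative stand-ins.
 */
#include <stdlib.h>

typedef struct vdev vdev_t;
struct vdev {
	int	vdev_id;	/* index within the parent's child array */
	vdev_t	**vdev_child;	/* may contain NULL holes after removals */
	int	vdev_children;	/* allocated length of vdev_child */
};

static void
compact_children(vdev_t *pvd)
{
	int oldc = pvd->vdev_children, newc, c;
	vdev_t **newchild;

	/* Count the surviving children. */
	for (c = newc = 0; c < oldc; c++)
		if (pvd->vdev_child[c] != NULL)
			newc++;

	newchild = malloc(newc * sizeof (vdev_t *));
	if (newchild == NULL)
		return;

	/* Copy survivors in order and renumber their ids. */
	for (c = newc = 0; c < oldc; c++) {
		if (pvd->vdev_child[c] != NULL) {
			newchild[newc] = pvd->vdev_child[c];
			newchild[newc]->vdev_id = newc;
			newc++;
		}
	}

	free(pvd->vdev_child);
	pvd->vdev_child = newchild;
	pvd->vdev_children = newc;
}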
/*
 * Prepare a virtual device for access.
 */

	dprintf("%s = %d, osize %llu, state = %d\n", ...);
/*
 * This is the first-ever open, so use the computed values.
 * For testing purposes, a higher ashift can be requested.
 */

/*
 * Make sure the alignment requirement hasn't increased.
 */

/*
 * Make sure the device hasn't shrunk.
 */

/*
 * If all children are healthy and the asize has increased,
 * then we've experienced dynamic LUN growth.
 */

/*
 * If this is a top-level vdev, compute the raidz-deflation
 * ratio.  Note, we hard-code in 128k (1<<17) because it is the
 * current "typical" blocksize.  Even if SPA_MAXBLOCKSIZE
 * changes, this algorithm must never change, or we will
 * inconsistently account for existing bp's.
 */

/*
 * This allows the ZFS DE to close cases appropriately.  If a device
 * goes away and later returns, we want to close the associated case.
 * But it's not enough to simply post this only when a device goes from
 * CANT_OPEN -> HEALTHY.  If we reboot the system and the device is
 * back, we also need to close the case (otherwise we will try to replay
 * it).  So we have to post this notifier every time.  Since this only
 * occurs during pool open or error recovery, this should not be an
 * issue.
 */

/*
 * Called once the vdevs are all opened, this routine validates the label
 * contents.  This needs to be done before vdev_load() so that we don't
 * inadvertently do repair I/Os to the wrong device, and so that vdev_reopen()
 * won't succeed if the device has been changed underneath.
 *
 * This function will only return failure if one of the vdevs indicates that it
 * has since been destroyed or exported.  This is only possible if
 * /etc/zfs/zpool.cache was readonly at the time.  Otherwise, the vdev state
 * will be updated but the function will return 0.
 */

/*
 * If the device has already failed, or was marked offline, don't do
 * any further validation.  Otherwise, label I/O will fail and we will
 * overwrite the previous state.
 */

/*
 * If we were able to open and validate a vdev that was previously
 * marked permanently unavailable, clear that state now.
 */

/*
 * Close a virtual device.
 */

/*
 * We record the previous state before we close it, so that if we are
 * doing a reopen(), we don't generate FMA ereports if we notice that
 * it's still faulted.
 */

/*
 * Reassess root vdev's health.
 */

/*
 * Normally, partial opens (e.g. of a mirror) are allowed.
 * For a create, however, we want to fail the request if
 * there are any components we can't open.
 */

/*
 * Recursively initialize all labels.
 */

/*
 * This is the latter half of vdev_create().  It is distinct because it
 * involves initiating transactions in order to do metaslab creation.
 * For creation, we want to try to create all vdevs at once and then undo it
 * if anything fails; this is much harder if we have pending transactions.
 */

/*
 * Aim for roughly 200 metaslabs per vdev (see the sizing sketch below).
 */

/*
 * Initialize the vdev's metaslabs.  This can't fail because
 * there's nothing to read when creating all new metaslabs.
 */

/*
 * Quick test without the lock -- covers the common case that
 * there are no dirty time segments.
 */

/*
 * Reassess DTLs after a config change or scrub completion.
 */

/*
 * We've successfully scrubbed everything up to scrub_txg.
 * Therefore, excise all old DTLs up to that point, then
 * fold in the DTLs for everything we couldn't scrub.
 */

/*
 * Make sure the DTLs are always correct under the scrub lock.
 */

	dprintf("%s in txg %llu pass %d\n", ...);
	dprintf("detach %s committed in txg %llu\n", ...);
/*
 * Recursively load all children.
 */

/*
 * If this is a top-level vdev, initialize its metaslabs.
 */

/*
 * If this is a leaf vdev, load its DTL.
 */

/*
 * This special case of vdev_spare() is used for hot spares.  Its sole
 * purpose is to set the vdev state for the associated vdev.  To do this,
 * we make sure that we can open the underlying device, then try to read
 * the label, and make sure that the label is sane and that it hasn't been
 * repurposed to another pool.
 *
 * We don't actually check the pool state here.  If it's in fact in
 * use by another pool, we update this fact on the fly when requested.
 */

/*
 * If the device isn't already offline, try to offline it.
 */

/*
 * If this device's top-level vdev has a non-empty DTL,
 * don't allow the device to be offlined.
 *
 * XXX -- make this more precise by allowing the offline
 * as long as the remaining devices don't have any DTL holes.
 */

/*
 * Offline this device and reopen its top-level vdev.
 * If this action results in the top-level vdev becoming
 * unusable, undo it and fail the request.
 */

/*
 * Clear the error counts associated with this vdev.  Unlike vdev_online()
 * and vdev_offline(), we assume the spa config is locked.  We also clear
 * all children.  If 'vd' is NULL, then the user wants to clear all vdevs.
 */

	dprintf("returning %d for type %d on %s state %d offset %llx\n", ...);
/*
 * Get statistics for the given vdev.
 */

/*
 * If we're getting stats on the root vdev, aggregate the I/O counts
 * over all top-level vdevs (i.e. the direct children of the root).
 */

/*
 * Update completion and end time.  Leave everything else alone
 * so we can report what happened during the previous scrub.
 */

/*
 * Update the in-core space usage stats for this vdev and the root vdev.
 */

/*
 * If this is a top-level vdev, apply the inverse of its psize-to-asize
 * (ie. RAID-Z) space-expansion factor.  We must calculate this here and
 * not at the root vdev because the root vdev's psize-to-asize is simply
 * the max of its children's, thus not accurate enough for us.
 */

/*
 * Various knobs to tune a vdev.
 */
"size of the read-ahead cache",
"log2 of cache blocksize",
"largest block size to cache",
"minimum pending I/Os to the disk",
"maximum pending I/Os to the disk",
"maximum size of aggregated I/Os",
"deadline = pri + (lbolt >> time_shift)",
"exponential I/O issue ramp-up rate",
/*
 * Mark a top-level vdev's config as dirty, placing it on the dirty list
 * so that it will be written out next time the vdev configuration is synced.
 * If the root vdev is specified (vdev_top == NULL), dirty all top-level vdevs.
 *
 * The dirty list is protected by the config lock.  The caller must
 * either hold the config lock as writer, or must be the sync thread
 * (which holds the lock as reader).  There's only one sync thread,
 * so this is sufficient to ensure mutual exclusion.
 */

/*
 * Root special: if there is a toplevel vdev that cannot be
 * opened due to corrupted metadata, then propagate the root
 * vdev's aux state as 'corrupt' rather than 'insufficient
 * replicas'.
 */

/*
 * Set a vdev's state.  If this is during an open, we don't update the parent
 * state, because we're in the process of opening children depth-first.
 * Otherwise, we propagate the change to the parent.
 *
 * If this routine places a device in a faulted state, an appropriate ereport
 * is generated.
 */

/*
 * If we fail to open a vdev during an import, we mark it as
 * "not available", which signifies that it was never there to
 * begin with.  Failure to open such a device is not considered
 * an error.
 */

/*
 * Post the appropriate ereport.  If the 'prevstate' field is
 * set to something other than VDEV_STATE_UNKNOWN, it indicates
 * that this is part of a vdev_reopen().  In this case, we don't
 * want to post the ereport if the device was already in the
 * CANT_OPEN state beforehand.
 */
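/*
 * [Editor's sketch, not from the original file.]  A sketch of the
 * open-time rule described above: while opening, the parent's state is
 * left alone because children are visited depth-first and siblings may
 * still be stale; outside of open, the change is propagated upward.  The
 * enum values echo states named in this file, but the layout and the
 * propagation helper are illustrative stand-ins.
 */
typedef enum vdev_state {
	VDEV_STATE_UNKNOWN = 0,
	VDEV_STATE_CANT_OPEN,
	VDEV_STATE_OFFLINE,
	VDEV_STATE_DEGRADED,
	VDEV_STATE_HEALTHY
} vdev_state_t;

typedef struct vdev vdev_t;
struct vdev {
	vdev_t		*vdev_parent;
	vdev_state_t	vdev_state;
	vdev_state_t	vdev_prevstate;
};

/* Stand-in: recompute pvd's state from its children's states. */
static void
propagate_state(vdev_t *pvd)
{
	(void) pvd;
}

static void
set_state(vdev_t *vd, int isopen, vdev_state_t state)
{
	vd->vdev_prevstate = vd->vdev_state;	/* for reopen ereport logic */
	vd->vdev_state = state;

	if (!isopen && vd->vdev_parent != NULL)
		propagate_state(vd->vdev_parent);
}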