vdev.c revision 681d9761e8516a7dc5ab6589e2dfe717777e1123
Virtual device management.

Given a vdev type, return the appropriate ops vector.

Default asize function: return the MAX of psize with the asize of all children. This is what's used by anything other than RAID-Z.

Get the minimum allocatable size. We define the allocatable size as the vdev's asize rounded to the nearest metaslab. This allows us to replace or attach devices which don't have the same physical size but can still satisfy the same number of allocations.

If our parent is NULL (inactive spare or cache) or is the root, just return our own asize.

The top-level vdev just returns the allocatable size rounded to the nearest metaslab.

The allocatable space for a raidz vdev is N * sizeof(smallest child), so each child must provide at least 1/Nth of its asize.

Walk up all ancestors to update guid sum.

Remove any holes in the child array.

Allocate and minimally initialize a vdev_t.

The root vdev's guid will also be the pool guid, which must be unique among all pools. Any other vdev's guid must be unique within the pool.

Allocate a new vdev. The 'alloctype' is used to control whether we are creating a new vdev or loading an existing one - the behavior is slightly different for each case.

If this is a load, get the vdev guid from the nvlist. Otherwise, vdev_alloc_common() will generate one for us.

The first allocated vdev must be of type 'root'.

Determine whether we're a log vdev.

Set the nparity property for RAID-Z vdevs.

Currently, we can only support 3 parity devices. Previous versions could only support 1 or 2 parity devices.

We require the parity to be specified for SPAs that support multiple parity levels. Otherwise, we default to 1 parity device for RAID-Z.

Set the whole_disk property. If it's not specified, leave the value as -1.

Look for the 'not present' flag. This will only be set if the device was not present at the time of import.

Get the alignment requirement.

If we're a top-level vdev, try to load the allocation parameters. If we're a leaf vdev, try to load the DTL object and other state.

When importing a pool, we want to ignore the persistent fault state, as the diagnosis made on another system may not be valid in the current context.

Add ourselves to the parent's list of children.

vdev_free() implies closing the vdev first. This is simpler than trying to ensure complicated semantics for all callers.

Discard allocation state. Remove this vdev from its parent's child list. Clean up vdev structure.

Transfer top-level vdev state from svd to tvd.

If cvd will replace mvd as a top-level vdev, preserve mvd's guid. Otherwise, we could have detached an offline device, and when we go to import the pool we'll think we have two top-level vdevs, instead of a different version of the same top-level vdev.

Compute the raidz-deflation ratio. Note, we hard-code in 128k (1 << 17) because it is the current "typical" blocksize. Even if SPA_MAXBLOCKSIZE changes, this algorithm must never change, or we will inconsistently account for existing bp's.
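The deflation factor described above reduces to a single expression; a minimal sketch, assuming the vdev_psize_to_asize() helper and the vdev_deflate_ratio field used elsewhere in this file (the guard restricting this to top-level vdevs is omitted):

	/*
	 * Ratio of a "typical" 128k logical block to the space it actually
	 * consumes on this vdev, expressed in SPA_MINBLOCKSIZE units.
	 */
	vd->vdev_deflate_ratio = (1 << 17) /
	    (vdev_psize_to_asize(vd, 1 << 17) >> SPA_MINBLOCKSHIFT);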
Determine whether this device is accessible by reading and writing to several known locations: the pad regions of each vdev label but the first (which we leave alone in case it contains a VTOC).

To prevent 'probe storms' when a device fails, we create just one probe i/o at a time. All zios that want to probe this vdev will become parents of the probe io.

vdev_cant_read and vdev_cant_write can only transition from TRUE to FALSE when we have the SCL_ZIO lock as writer; otherwise they can only transition from FALSE to TRUE. This ensures that any zio looking at these values can assume that failures persist for the life of the I/O. That's important because when a device has intermittent connectivity problems, we want to ensure that they're ascribed to the device (ENXIO) and not the zio (EIO).

Since we hold SCL_ZIO as writer here, clear both values so the probe can reevaluate from first principles.

In order to handle pools on top of zvols, do the opens in a single thread so that the same thread holds the spa_namespace_lock.

Prepare a virtual device for access.

Make sure the allocatable size hasn't shrunk.

This is the first-ever open, so use the computed values. For testing purposes, a higher ashift can be requested.

Make sure the alignment requirement hasn't increased.

If all children are healthy and the asize has increased, then we've experienced dynamic LUN growth. If automatic expansion is enabled then use the additional space.

Ensure we can issue some IO before declaring the vdev open for business.

If a leaf vdev has a DTL, and seems healthy, then kick off a resilver. But don't do this if we are doing a reopen for a scrub, since this would just restart the scrub we are already doing.

Called once the vdevs are all opened, this routine validates the label contents. This needs to be done before vdev_load() so that we don't inadvertently do repair I/Os to the wrong device.

This function will only return failure if one of the vdevs indicates that it has since been destroyed or exported. This is only possible if /etc/zfs/zpool.cache was readonly at the time. Otherwise, the vdev state will be updated but the function will return 0.

If the device has already failed, or was marked offline, don't do any further validation. Otherwise, label I/O will fail and we will overwrite the previous state.

If this vdev just became a top-level vdev because its sibling was detached, it will have adopted the parent's vdev guid -- but the label may or may not be on disk yet. Fortunately, either version of the label will have the same top guid, so if we're a top-level vdev, we can safely compare to that instead.

If spa->spa_load_verbatim is true, no need to check the state of the pool.

If we were able to open and validate a vdev that was previously marked permanently unavailable, clear that state now.

Close a virtual device.

We record the previous state before we close it, so that if we are doing a reopen(), we don't generate FMA ereports if we notice that it's still faulted.

Call vdev_validate() here to make sure we have the same device. Otherwise, a device with an invalid label could be successfully opened in response to vdev_reopen().

Reassess parent vdev's health.

Normally, partial opens (e.g. of a mirror) are allowed. For a create, however, we want to fail the request if there are any components we can't open.

Recursively initialize all labels.

Aim for roughly 200 metaslabs per vdev.
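That sizing rule is essentially two lines of code; a sketch, assuming the vdev_ms_shift/vdev_asize fields and the highbit() helper used elsewhere in this codebase (the exact clamping in the real function may differ):

	void
	vdev_metaslab_set_size(vdev_t *vd)
	{
		/*
		 * Aim for roughly 200 metaslabs per vdev, but never let the
		 * metaslab shift drop below SPA_MAXBLOCKSHIFT.
		 */
		vd->vdev_ms_shift = highbit(vd->vdev_asize / 200);
		vd->vdev_ms_shift = MAX(vd->vdev_ms_shift, SPA_MAXBLOCKSHIFT);
	}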
A vdev's DTL (dirty time log) is the set of transaction groups for which the vdev has less than perfect replication. There are four kinds of DTL:

DTL_MISSING: txgs for which the vdev has no valid copies of the data
DTL_PARTIAL: txgs for which data is available, but not fully replicated
DTL_SCRUB: the txgs that could not be repaired by the last scrub; upon scrub completion, DTL_SCRUB replaces DTL_MISSING in the range of txgs that was scrubbed.
DTL_OUTAGE: txgs which cannot currently be read, whether due to persistent errors or just some device being offline. Unlike the other three, the DTL_OUTAGE map is not generally maintained; it's only computed when needed, typically to determine whether a device can be detached.

For leaf vdevs, DTL_MISSING and DTL_PARTIAL are identical: the device either has the data or it doesn't.

For interior vdevs such as mirror and RAID-Z the picture is more complex. A vdev's DTL_PARTIAL is the union of its children's DTL_PARTIALs, because if any child is less than fully replicated, then so is its parent. A vdev's DTL_MISSING is a modified union of its children's DTL_MISSINGs, comprising only those txgs which appear in 'maxfaults + 1' or more children; those are the txgs we don't have enough replication to read. For example, double-parity RAID-Z can tolerate up to two missing devices (maxfaults == 2); thus, its DTL_MISSING consists of the set of txgs that appear in more than two child DTL_MISSING maps.

It should be clear from the above that to compute the DTLs and outage maps for all vdevs, it suffices to know just the leaf vdevs' DTL_MISSING maps. Therefore, that is all we keep on disk. When loading the pool, or after a configuration change, we generate all other DTLs from first principles.

Reassess DTLs after a config change or scrub completion. (XXX should check scrub_done?)

We completed a scrub up to scrub_txg. If we did it without rebooting, then the scrub dtl will be valid, so excise the old region and fold in the scrub dtl. Otherwise, leave the dtl as-is if there was an error.

There's a little trick here: to excise the beginning of the DTL_MISSING map, we put it into a reference tree and then add a segment with refcnt -1 that covers the range [0, scrub_txg). This means that each txg in that range has refcnt -1 or 0. We then add DTL_SCRUB with a refcnt of 2, so that entries in the range [0, scrub_txg) will have a positive refcnt -- either 1 or 2. We then convert the reference tree into the new DTL_MISSING map.
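A sketch of that excision, plus the interior-vdev pass that rebuilds each DTL as a refcount-thresholded union of the children's DTLs. The space_map_ref_*() reference-tree helpers are those of the contemporaneous space map code, and the exact calls and minref choices are illustrative rather than a copy of the source:

	avl_tree_t reftree;
	int minref;

	/*
	 * Leaf vdev after a completed scrub: excise [0, scrub_txg) from
	 * DTL_MISSING, then fold DTL_SCRUB back in.  Old missing txgs below
	 * scrub_txg end up with refcnt 0 (excised) unless DTL_SCRUB re-adds
	 * them; everything at or above scrub_txg keeps refcnt 1.
	 */
	space_map_ref_create(&reftree);
	space_map_ref_add_map(&reftree, &vd->vdev_dtl[DTL_MISSING], 1);
	space_map_ref_add_seg(&reftree, 0, scrub_txg, -1);
	space_map_ref_add_map(&reftree, &vd->vdev_dtl[DTL_SCRUB], 2);
	space_map_ref_generate_map(&reftree, &vd->vdev_dtl[DTL_MISSING], 1);
	space_map_ref_destroy(&reftree);

	/*
	 * Interior vdev: each DTL is a refcount-thresholded union of the
	 * children's DTLs.  DTL_PARTIAL needs any one child (minref 1);
	 * DTL_MISSING needs more children than the vdev can tolerate losing.
	 */
	for (int t = 0; t < DTL_TYPES; t++) {
		if (t == DTL_SCRUB)
			continue;			/* leaf vdevs only */
		if (t == DTL_PARTIAL)
			minref = 1;			/* i.e. non-zero */
		else if (vd->vdev_nparity != 0)
			minref = vd->vdev_nparity + 1;	/* RAID-Z */
		else
			minref = vd->vdev_children;	/* any kind of mirror */
		space_map_ref_create(&reftree);
		for (int c = 0; c < vd->vdev_children; c++)
			space_map_ref_add_map(&reftree,
			    &vd->vdev_child[c]->vdev_dtl[t], 1);
		space_map_ref_generate_map(&reftree, &vd->vdev_dtl[t], minref);
		space_map_ref_destroy(&reftree);
	}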
Temporarily mark the device as unreadable, and then determine whether this results in any DTL outages in the top-level vdev.

Determine if resilver is needed, and if so the txg range.

Recursively load all children. If this is a top-level vdev, initialize its metaslabs. If this is a leaf vdev, load its DTL.

The special vdev case is used for hot spares and l2cache devices. Its sole purpose is to set the vdev state for the associated vdev. To do this, we make sure that we can open the underlying device, then try to read the label, and make sure that the label is sane and that it hasn't been repurposed to another pool.

We don't actually check the pool state here. If it's in fact in use by another pool, we update this fact on the fly when requested.

Mark the given vdev faulted. A faulted vdev behaves as if the device could not be opened, and no I/O is attempted.

Faulted state takes precedence over degraded.

If marking the vdev as faulted causes the top-level vdev to become unavailable, then back off and simply mark the vdev as degraded instead.

If we reopen the device and it's not dead, only then do we mark it degraded.

Mark the given vdev degraded. A degraded vdev is purely an indication to the user that something is wrong. The vdev continues to operate as normal as far as I/O is concerned.

If the vdev is already faulted, then don't do anything.

Online the given vdev. If 'unspare' is set, it implies two things. First, any attached spare device should be detached when the device finishes resilvering. Second, the online should be treated like a 'test' online case, so no FMA events are generated if the device fails to open.

XXX - L2ARC 1.0 does not support expansion.

If the device isn't already offline, try to offline it.

If this device has the only valid copy of some data, don't allow it to be offlined. Log devices are always expendable.

Offline this device and reopen its top-level vdev. If the top-level vdev is a log device then just offline it. Otherwise, if this action results in the top-level vdev becoming unusable, undo it and fail the request.

If we successfully offlined the log device then we need to sync out the current txg so that the "stubby" block can be removed by zil_sync().

Clear the error counts associated with this vdev. Unlike vdev_online() and vdev_offline(), we assume the spa config is locked. We also clear all children. If 'vd' is NULL, then the user wants to clear all vdevs.

If we're in the FAULTED state or have experienced failed I/O, then clear the persistent state and attempt to reopen the device. We also mark the vdev config dirty, so that the new faulted state is written out to disk.

We currently allow allocations from vdevs which may be in the process of reopening (i.e. VDEV_STATE_CLOSED). If the device fails to reopen then we'll catch it later when we're holding the proper locks. Note that we have to get the vdev state in a local variable because although it changes atomically, we're asking two separate questions about it.

Get statistics for the given vdev.

If we're getting stats on the root vdev, aggregate the I/O counts over all top-level vdevs (i.e. the direct children of the root).
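A sketch of that aggregation (the vdev_stat_t counters vs_ops/vs_bytes and the per-vdev stat lock are as used in this file; timestamp, state, and error-counter handling are elided):

	void
	vdev_get_stats(vdev_t *vd, vdev_stat_t *vs)
	{
		vdev_t *rvd = vd->vdev_spa->spa_root_vdev;

		mutex_enter(&vd->vdev_stat_lock);
		bcopy(&vd->vdev_stat, vs, sizeof (*vs));

		/*
		 * Root-level I/O is not counted per-zio (see the discussion
		 * below), so fold in the counters of the top-level children.
		 */
		if (vd == rvd) {
			for (int c = 0; c < rvd->vdev_children; c++) {
				vdev_stat_t *cvs = &rvd->vdev_child[c]->vdev_stat;

				for (int t = 0; t < ZIO_TYPES; t++) {
					vs->vs_ops[t] += cvs->vs_ops[t];
					vs->vs_bytes[t] += cvs->vs_bytes[t];
				}
			}
		}
		mutex_exit(&vd->vdev_stat_lock);
	}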
If this i/o is a gang leader, it didn't do any actual work.

If this is a root i/o, don't count it -- we've already counted the top-level vdevs, and vdev_get_stats() will aggregate them when asked. This reduces contention on the root vdev_stat_lock and implicitly handles blocks that compress away to holes, for which there is no i/o. (Holes never create vdev children, so all the counters remain zero, which is what we want.)

Note: this only applies to successful i/o (io_error == 0) because unlike i/o counts, errors are not additive. When reading a ditto block, for example, failure of one top-level vdev does not imply a root-level error.

If this is an I/O error that is going to be retried, then ignore the error. Otherwise, the user may interpret B_FAILFAST I/O errors as hard errors, when in reality they can happen for any number of innocuous reasons (bus resets, MPxIO link failure, etc).

This is either a normal write (not a repair), or it's a repair induced by the scrub thread. In the normal case, we commit the DTL change in the same txg as the block was born. In the scrub-induced repair case, we know that scrubs run in first-pass syncing context, so we commit the DTL change in spa->spa_syncing_txg.

We currently do not make DTL entries for failed spontaneous self-healing writes triggered by normal (non-scrubbing) reads, because we have no transactional context in which to do so -- and it's not clear that it'd be desirable anyway.

Update completion and end time. Leave everything else alone so we can report what happened during the previous scrub.

Update the in-core space usage stats for this vdev and the root vdev.

Apply the inverse of the psize-to-asize (ie. RAID-Z) space-expansion factor. We must calculate this here and not at the root vdev because the root vdev's psize-to-asize is simply the max of its childrens', thus not accurate enough for us.

Don't count non-normal (e.g. intent log) space as part of the pool's capacity.

Mark a top-level vdev's config as dirty, placing it on the dirty list so that it will be written out next time the vdev configuration is synced. If the root vdev is specified (vdev_top == NULL), dirty all top-level vdevs.

If this is an aux vdev (as with l2cache and spare devices), then we update the vdev config manually and set the sync flag.

We're being removed. There's nothing more to do.

Setting the nvlist in the middle of the array is a little sketchy, but it will work.

The dirty list is protected by the SCL_CONFIG lock. The caller must either hold SCL_CONFIG as writer, or must be the sync thread (which holds SCL_CONFIG as reader). There's only one sync thread, so this is sufficient to ensure mutual exclusion.

Mark a top-level vdev's state as dirty, so that the next pass of spa_sync() can convert this into vdev_config_dirty(). We distinguish the state changes from larger config changes because they require much less locking, and are often needed for administrative actions.

The state list is protected by the SCL_STATE lock. The caller must either hold SCL_STATE as writer, or must be the sync thread (which holds SCL_STATE as reader). There's only one sync thread, so this is sufficient to ensure mutual exclusion.
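Both dirty lists follow the same pattern; a sketch of the state-list variant is below (the list field names and lock helpers are those used by the spa/vdev code of this era, so treat it as illustrative):

	void
	vdev_state_dirty(vdev_t *vd)
	{
		spa_t *spa = vd->vdev_spa;

		ASSERT(vd == vd->vdev_top);

		/*
		 * Either the caller holds SCL_STATE as writer, or the caller
		 * is the sync thread, which holds SCL_STATE as reader.
		 */
		ASSERT(spa_config_held(spa, SCL_STATE, RW_WRITER) ||
		    (dsl_pool_sync_context(spa_get_dsl(spa)) &&
		    spa_config_held(spa, SCL_STATE, RW_READER)));

		if (!list_link_active(&vd->vdev_state_dirty_node))
			list_insert_head(&spa->spa_state_dirty_list, vd);
	}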
Propagate vdev state up from children to parent.

Root special: if there is a top-level log device, treat the root vdev as if it were degraded.

Root special: if there is a top-level vdev that cannot be opened due to corrupted metadata, then propagate the root vdev's aux state as 'corrupt' rather than 'insufficient replicas'.

Set a vdev's state. If this is during an open, we don't update the parent state, because we're in the process of opening children depth-first. Otherwise, we propagate the change to the parent. If this routine places a device in a faulted state, an appropriate ereport is generated.

If we are setting the vdev state to anything but an open state, then always close the underlying device. Otherwise, we keep accessible but invalid devices open forever. We don't call vdev_close() itself, because that implies some extra checks (offline, etc) that we don't want here. This is limited to leaf devices, because otherwise closing the device will affect other children.

If the previous state is set to VDEV_STATE_REMOVED, then this device was previously marked removed and someone attempted to reopen it. If this failed due to a nonexistent device, then keep the device in the REMOVED state. We also let this be if it is one of our special test online cases, which is only attempting to online the device and shouldn't generate an FMA fault.

Indicate to the ZFS DE that this device has been removed, and any recent errors should be ignored.

If we fail to open a vdev during an import, we mark it as "not available", which signifies that it was never there to begin with. Failure to open such a device is not considered an error.

Post the appropriate ereport. If the 'prevstate' field is set to something other than VDEV_STATE_UNKNOWN, it indicates that this is part of a vdev_reopen(). In this case, we don't want to post the ereport if the device was already in the CANT_OPEN state beforehand.

If the 'checkremove' flag is set, then this is an attempt to online the device in response to an insertion event. If we hit this case, then we have detected an insertion event for a faulted or offline device that wasn't in the removed state. In this scenario, we don't post an ereport because we are about to replace the device, or attempt an online with vdev_forcefault, which will generate the fault for us.

Erase any notion of persistent removed state.

Check the vdev configuration to ensure that it's capable of supporting a root pool. Currently, we do not support RAID-Z or partial configuration. In addition, only a single top-level vdev is allowed and none of the leaves can be wholedisks.

It would be nice to call vdev_offline() directly but the pool isn't fully loaded and the txg threads have not been started yet.

Expand a vdev if possible.
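A sketch of the expansion path referenced by that last comment, assuming the vdev_ms_count/vdev_ms_shift fields and vdev_metaslab_init() described earlier; the SCL_ALL assertion is illustrative:

	void
	vdev_expand(vdev_t *vd, uint64_t txg)
	{
		ASSERT(vd->vdev_top == vd);
		ASSERT(spa_config_held(vd->vdev_spa, SCL_ALL, RW_WRITER) == SCL_ALL);

		/*
		 * If the grown asize now covers more metaslabs than have been
		 * initialized, grow the metaslab array and dirty the config
		 * so the new size is written out.
		 */
		if ((vd->vdev_asize >> vd->vdev_ms_shift) > vd->vdev_ms_count) {
			VERIFY(vdev_metaslab_init(vd, txg) == 0);
			vdev_config_dirty(vd);
		}
	}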