sp.c revision b30678564674f0af0118547884def6cf721f1360
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 * See the License for the specific language governing permissions
 * and limitations under the License.
 * When distributing Covered Code, include this CDDL HEADER in each
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]

 * Copyright 2009 Sun Microsystems, Inc.  All rights reserved.
 * Use is subject to license terms.

 * Soft partitioning metadevice driver (md_sp).
 *
 * This file contains the primary operations of the soft partitioning
 * metadevice driver.  This includes all routines for normal operation
 * metadevice operations vector (md_ops_t).  This driver is loosely
 * based on the stripe driver (md_stripe).
 *
 * All metadevice administration is done through the use of ioctl's.
 *
 * Soft partitions are represented both in-core and in the metadb with a
 * unit structure.  The soft partition-specific information in the unit
 * structure includes the following information:
 *   - Device information (md_dev64_t & md key) about the device on which
 *     the soft partition is built.
 *   - Soft partition status information.
 *   - The size of the soft partition and number of extents used to
 *     mappings and lengths for each extent.
 *
 * Typical soft partition operation proceeds as follows:
 *   - The unit structure is fetched from the metadb and placed into
 *     an in-core array (as with other metadevices).
 *     This operation is performed via sp_build_incore() and takes
 *     place during "snarfing" (when all metadevices are brought
 *     in-core at once) and when a new soft partition is created.
 *   - A soft partition is opened via sp_open().  At open time the
 *     soft partition unit structure is verified with the soft
 *     partition on-disk structures.  Additionally, the soft partition
 *     status is checked (only soft partitions in the OK state may be
 *     opened).
 *   - Soft partition I/O is performed via sp_strategy() which relies on
 *     a support routine, sp_mapbuf(), to do most of the work.
 *     sp_mapbuf() maps a buffer to a particular extent via a binary
 *     search of the extent array in the soft partition unit structure.
 *     Once a translation has been performed, the I/O is passed down
 *     to the next layer, which may be another metadevice or a physical
 *     disk.  Since a soft partition may contain multiple, non-contiguous
 *     extents, a single I/O may have to be fragmented.
 *   - Soft partitions are closed using sp_close.

 * FUNCTION:    sp_parent_constructor()
 * OUTPUT:      ps - parent save structure initialized.
 * RETURNS:     void * - ptr to initialized parent save structure.
 * PURPOSE:     initialize parent save structure.

 * FUNCTION:    sp_child_constructor()
 * OUTPUT:      cs - child save structure initialized.
 * RETURNS:     void * - ptr to initialized child save structure.
 * PURPOSE:     initialize child save structure.

 * PURPOSE:     run the md_daemon to clean up memory pool.

 * FUNCTION:    sp_build_incore()
 * INPUT:       p - ptr to unit structure.
 *              snarfing - flag to tell us we are snarfing.
 * RETURNS:     int - 0 (always).
 * PURPOSE:     place unit structure into in-core unit array (keyed from
 *              if we are snarfing, we get the device information
 *              from the metadb record (using the metadb key for

/* place various information in the in-core data structures */

 *              removing - flag to tell us if we are removing
 *                         permanently or just resetting in-core
 * PURPOSE:     used to either simply reset in-core structures or to
 *              permanently remove metadevices from the metadb.

/* clean up in-core structures */

 * Attempt release of minor node

/* we are removing the soft partition from the metadb */

 * Save off device information so we can get to
 * it after we do the mddb_deleterec().

 * Remove self from the namespace

/* Remove the unit structure */

 * remove the underlying device name from the metadb.  if other
 * soft partitions are built on this device, this will simply
 * decrease the reference count for this device.  otherwise the
 * name record for this device will be removed from the metadb.

 * FUNCTION:    sp_send_stat_msg
 * INPUT:       un - unit reference
 *              status - status to be sent to master node
 *                       MD_SP_OK - soft-partition is now OK
 * PURPOSE:     send a soft-partition status change to the master node. If the
 *              message succeeds we simply return. If it fails we panic as the
 *              cluster-wide view of the metadevices is now inconsistent.
 *              Blockable. No locks can be held.

/* If we're shutting down already, pause things here. */

 * commd is available again. Retry the message once.
 * If it fails we panic as the system is in an

 * Panic as we are now in an inconsistent state.

 * FUNCTION:    sp_finish_error
 * INPUT:       ps - parent save structure for error-ed I/O.
 *              lock_held - set if the unit readerlock is held
 * PURPOSE:     report a driver error

 * INPUT:       dq - daemon queue referencing failing ps structure
 * PURPOSE:     send a message to the master node in a multi-owner diskset to
 *              update all attached nodes' view of the soft-part to be MD_SP_OK.
 *              Blockable. No unit lock held.

/* Send a MD_MN_MSG_SP_SETSTAT to the master */

 * Successfully transmitted error state to all nodes, now release this

 * INPUT:       dq - daemon queue referencing failing ps structure
 * PURPOSE:     send a message to the master node in a multi-owner diskset to
 *              update all attached nodes' view of the soft-part to be MD_SP_ERR.
 *              Blockable. No unit lock held.

/* Send a MD_MN_MSG_SP_SETSTAT to the master */

 * Successfully transmitted error state to all nodes, now release this

 * INPUT:       ps - parent save structure for error-ed I/O.
 * PURPOSE:     report a driver error.
 *              Interrupt - non-blockable

 * Drop the mutex associated with this request before (potentially)
 * enqueuing the free onto a separate thread. We have to release the
 * mutex before destroying the parent structure.

 * this should only ever happen if we are panicking,
 * since DONTFREE is only set on the parent if panicstr

 * For a multi-owner set we need to send a message to the master so that
 * all nodes get the errored status when we first encounter it. To avoid
 * deadlocking when multiple soft-partitions encounter an error on one
 * physical unit we drop the unit readerlock before enqueueing the
 * request. That way we can service any messages that require a
 * writerlock to be held.
 * Additionally, to avoid deadlocking when at
 * the bottom of a metadevice stack and a higher level mirror has
 * multiple requests outstanding on this soft-part, we clone the ps
 * that failed and pass the error back up the stack to release the
 * reference that this i/o may have in the higher-level metadevice.
 * The other nodes in the cluster just have to modify the soft-part
 * status and we do not need to block the i/o completion for this.

 * INPUT:       un - unit structure for soft partition we are doing
 *              voff - virtual offset in soft partition to map.
 *              bcount - # of blocks in the I/O.
 * OUTPUT:      bp - translated buffer to be passed down to next layer.
 * RETURNS:     1 - request must be fragmented, more work to do,
 *              0 - request satisfied, no more work to do
 * PURPOSE:     Map the virtual offset in the soft partition (passed
 *              in via voff) to the "physical" offset on whatever the soft
 *              partition is built on top of.  We do this by doing a binary
 *              search of the extent array in the soft partition unit
 *              structure.  Once the current extent is found, we do the
 *              translation, determine if the I/O will cross extent
 *              boundaries (if so, we have to fragment the I/O), then
 *              fill in the buf structure to be passed down to the next layer.

 * do a binary search to find the extent that contains the
 * starting offset.  after this loop, mid contains the index

/* is the starting offset contained within the mid-ext? */
else /* voff > un->un_ext[mid].un_voff + un->un_ext[mid].len */
/* determine if we need to break the I/O into fragments */
/* only break up the I/O if we're not built on another metadevice */

 * INPUT:       un - unit structure to be validated.
 * RETURNS:     0 - soft partition ok.
 * PURPOSE:     called on open to sanity check the soft partition.
 *              In order to open a soft partition:
 *              - it must have at least one extent
 *              - the extent info in core and on disk must match
 *              - it may not be in an intermediate state (which would
 *                imply that a two-phase commit was interrupted)
 *
 *              If the extent checking fails (B_ERROR returned from the read
 *              strategy call) _and_ we're a multi-owner diskset, we send a
 *              message to the master so that all nodes inherit the same view
 *
 *              If we are checking a soft-part that is marked as in error, and
 *              we can actually read and validate the watermarks we send a
 *              message to clear the error to the master node.

/* sanity check unit structure components ?? */
/* tally extent lengths to check total size */
/* allocate buffer for watermark */

 * make the call non-blocking so that it is not affected

"read watermark at block %llu for extent %u, "

 * If we're a multi-owner diskset we send a message
 * indicating that this soft-part has an invalid
 * extent to the master node. This ensures a consistent
 * view of the soft-part across the cluster.

/* make sure the checksum is correct first */
"at block %llu for extent %u does not have a "
"at block %llu for extent %u does not have a "
"valid watermark magic number, expected 0x%x, "

/* make sure sequence number matches the current extent */
"at block %llu for extent %u has invalid "

/* make sure watermark length matches unit structure */
"at block %llu for extent %u has inconsistent "
"length, expected %llu, found %llu.",
 * make sure the type is a valid soft partition and not
 * a free extent or the end.

"at block %llu for extent %u is not marked "

 * If we're a multi-owner set _and_ reset_error is set, we should clear
 * the error condition on all nodes in the set. Use SP_SETSTAT2 with

 * INPUT:       child_buf - buffer attached to child save structure.
 *                          this is the buffer on which I/O has just
 * PURPOSE:     called on I/O completion.

/* find the child save structure to which this buffer belongs */
/* now get the parent save structure */
/* pass any errors back up to the parent */

 * if this parent has more children, we just free the

/* there are no more children */

 * this should only ever happen if we are panicking,
 * since DONTFREE is only set on the parent if panicstr

 * FUNCTION:    md_sp_strategy()
 * INPUT:       parent_buf - parent buffer
 * PURPOSE:     Soft partitioning I/O strategy.  Performs the main work
 *              needed to do I/O to a soft partition.  The basic
 *              - Allocate a child save structure to keep track
 *                of the I/O we are going to pass down.
 *              - Map the I/O to the correct extent in the soft
 *                partition (see sp_mapbuf()).
 *              - bioclone() the buffer and pass it down the
 *                stack using md_call_strategy.
 *              - If the I/O needs to split across extents,
 *                repeat the above steps until all fragments

 * When doing IO to a multi owner meta device, check if set is halted.
 * We do this check without the needed lock held, for performance
 * If an IO just slips through while the set is locked via an
 * MD_MN_SUSPEND_SET, we don't care about it.
 * Only check for suspension if we are a top-level i/o request
 * (MD_STR_NOTTOP is cleared in 'flag');

/* Here we loop until the set is no longer halted */

 * Save essential information from the original buffhdr

 * if we are at the top and we are panicking,
 * we don't free in order to save state.

 * Mark this i/o as MD_STR_ABR if we've had ABR enabled on this

 * this loop does the main work of an I/O.  we allocate
 * a child save for each buf, do the logical to physical
 * mapping, decide if we need to frag the I/O, clone the
 * new I/O to pass down the stack.  repeat until we've
 * taken care of the entire buf that was passed to us.

/* calculate new offset, counts, etc... */

 * FUNCTION:    sp_directed_read()
 * INPUT:       mnum - minor number
 *              vdr - vol_directed_rd_t from user
 *              mode - access mode for copying data out.
 *              Exxxxx - failure error-code
 * PURPOSE:     Construct the necessary sub-device i/o requests to perform the
 *              directed read as requested by the user. This is essentially the
 *              same as md_sp_strategy() with the exception being that the
 *              underlying 'md_call_strategy' is replaced with an ioctl call.

 * Construct a parent_buf header which reflects the user-supplied

 * Save essential information from the original buffhdr

 * this loop does the main work of an I/O.  we allocate
 * a child save for each buf, do the logical to physical
 * mapping, decide if we need to frag the I/O, clone the
 * new I/O to pass down the stack.  repeat until we've
 * taken care of the entire buf that was passed to us.

/* Work out where we are in the allocated buffer */
/* calculate new offset, counts, etc... */

 * Free the child structure as we've finished with it.
 * Normally this would be done by sp_done() but we're just
 * using md_bioclone() to segment the transfer and we never
 * issue a strategy request so the iodone will not be called.

/* copyout the returned data to vdr_data + offset */

 * Update the user-supplied vol_directed_rd_t structure with the
 * contents of the last issued child request.

 * RETURNS:     1 - soft partitions were snarfed.
 *              0 - no soft partitions were snarfed.
 * PURPOSE:     Snarf soft partition metadb records into their in-core
 *              structures.  This routine is called at "snarf time" when
 *              md loads and gets all metadevice records into memory.
 *              The basic algorithm is simply to walk the soft partition
 *              records in the metadb and call the soft partitioning
 *              build_incore routine to set up the in-core structures.

 * walk soft partition records in the metadb and call
 * sp_build_incore to build in-core structures.

/* if we've already gotten this record, go to the next one */

 * This means, we have an old and small record.
 * And this record hasn't already been converted
 * :-o before we create an incore metadevice
 * from this we have to convert it to a big

/* Record has already been converted */

 * Create minor node for snarfed entry.

/* unit is already in-core */

 * PURPOSE:     Perform driver halt operations.  As with stripe, we
 *              support MD_HALT_CHECK and MD_HALT_DOIT.  The first
 *              does a check to see if halting can be done safely
 *              (no open soft partitions), the second cleans up and

 * FUNCTION:    sp_open_dev()
 * INPUT:       un - unit structure.
 * PURPOSE:     open underlying device via md_layered_open.

 * Do the open by device id if underlying is regular

 * INPUT:       dev - device to open.
 *              flag - pass-through flag.
 *              otyp - pass-through open type.
 *              md_oflags - open flags.
 * PURPOSE:     open a soft partition.
 * When doing an open of a multi owner metadevice, check to see if this
 * node is a starting node and if a reconfig cycle is underway.
 * If so, the system isn't sufficiently set up to handle the
 * open (which involves I/O during sp_validate), so fail with ENXIO.

/* grab necessary locks */
/* open underlying device, if necessary */
/* For probe, don't incur the overhead of validate */

 * Don't call sp_validate while
 * unit_openclose lock is held.  So, actually
 * open the device, drop openclose lock,
 * call sp_validate, reacquire openclose lock,
 * and close the device.  If sp_validate
 * succeeds, then device will be re-opened.

 * Should be in the same state as before

/* close the device opened above */

 * As we're a multi-owner metadevice we need to ensure
 * that all nodes have the same idea of the status.
 * sp_validate() will mark the device as errored (if
 * it cannot read the watermark) or ok (if it was
 * previously errored but the watermark is now valid).
 * This code-path is only entered on the non-probe open
 * so we will maintain the errored state during a probe
 * call. This means the sys-admin must metarecover -m
 * to reset the soft-partition error.

/* For probe, don't incur the overhead of validate */
/* close the device opened above */

 * we succeeded in validating the on disk
 * format versus the in core, so reset the
 * status if it's in error

 * INPUT:       dev - device to close.
 *              flag - pass-through flag.
 *              otyp - pass-through type.
 *              md_cflags - close flags.
 * PURPOSE:     close a soft partition.

/* grab necessary locks */
/* close devices, if necessary */

 * If a MN set and transient capabilities (eg ABR/DMR) are set,
 * clear these capabilities if this is the last close in

/* unlock, return success */
/* used in sp_dump routine */

 * INPUT:       dev - device to dump to.
 *              addr - address to dump.
 *              blkno - blkno on device.
 *              nblk - number of blocks to dump.
 * RETURNS:     result from bdev_dump.
 * PURPOSE:     This routine dumps memory to the disk.
 *              It assumes that the memory has already been mapped into
 *              mainbus space.
 *              It is called at disk interrupt priority when the system
 *              NOTE: this function is defined using 32-bit arguments,
 *              but soft partitioning is internally 64-bit.  Arguments
 *              are cast where appropriate.

 * Don't need to grab the unit lock,
 * because nothing else is supposed to be happening.
 * Also dump is not supposed to sleep.

 * If this is a top level and a friendly name metadevice,
 * update its minor in the namespace.

 * Update unit with the imported setno

/* define the module linkage */
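
The comments for the offset-mapping routine and the strategy loop describe a binary search over the extent array, followed by splitting an I/O that crosses an extent boundary. The block below is a minimal, stand-alone sketch of that idea and is not the driver's code. The type `sp_ext_sketch_t`, its field names, and the three `sketch_*` functions are all hypothetical names introduced for illustration. It assumes the extents are sorted by virtual offset and do not overlap, and it ignores the rule noted above that the I/O is only broken up when the soft partition is not built on another metadevice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent descriptor; the field names are illustrative only. */
typedef struct sp_ext_sketch {
	uint64_t voff;	/* first virtual block covered by this extent */
	uint64_t poff;	/* matching block on the underlying device */
	uint64_t len;	/* extent length in blocks */
} sp_ext_sketch_t;

/*
 * Binary search for the extent containing voff.  Assumes exts[] is
 * sorted by voff and non-overlapping.  Returns the index, or -1 if
 * voff falls outside every extent.
 */
static int
sketch_find_extent(const sp_ext_sketch_t *exts, int next, uint64_t voff)
{
	int lo = 0, hi = next - 1;

	while (lo <= hi) {
		int mid = lo + (hi - lo) / 2;

		if (voff < exts[mid].voff)
			hi = mid - 1;
		else if (voff >= exts[mid].voff + exts[mid].len)
			lo = mid + 1;
		else
			return (mid);
	}
	return (-1);
}

/*
 * Translate the start of a request and clip it to the current extent.
 * Returns 1 if more fragments remain, 0 if the request is satisfied,
 * -1 if voff is not mapped by any extent.
 */
static int
sketch_map_fragment(const sp_ext_sketch_t *exts, int next, uint64_t voff,
    uint64_t bcount, uint64_t *poff, uint64_t *fraglen)
{
	int i = sketch_find_extent(exts, next, voff);
	uint64_t left;

	assert(bcount > 0);
	if (i < 0)
		return (-1);

	left = exts[i].voff + exts[i].len - voff; /* blocks left in extent */
	*poff = exts[i].poff + (voff - exts[i].voff);
	if (bcount > left) {
		*fraglen = left;	/* crosses the extent boundary */
		return (1);
	}
	*fraglen = bcount;
	return (0);
}

/*
 * Walk a whole request the way the strategy loop is described: map,
 * advance, repeat.  Returns the number of fragments, or -1 on a hole.
 * The caller must pass bcount > 0.
 */
static int
sketch_count_fragments(const sp_ext_sketch_t *exts, int next, uint64_t voff,
    uint64_t bcount)
{
	int nfrag = 0, more;

	do {
		uint64_t poff, len;

		more = sketch_map_fragment(exts, next, voff, bcount,
		    &poff, &len);
		if (more < 0)
			return (-1);
		nfrag++;
		voff += len;
		bcount -= len;
	} while (more);
	return (nfrag);
}
```

With two extents covering virtual blocks 0-9 and 10-29, a 10-block request starting at virtual block 5 splits into two 5-block fragments in this sketch, one at the tail of the first extent and one at the head of the second.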