/* md_mddb.c revision 7c478bd95313f5f23a4c958a745db2134aa03244 */
/*
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License, Version 1.0 only
 * (the "License").  You may not use this file except in compliance
 * with the License.
 *
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each file.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 */
/*
 * Copyright 2005 Sun Microsystems, Inc.  All rights reserved.
 * Use is subject to license terms.
 */

#pragma ident	"%Z%%M%	%I%	%E% SMI"

	{
		6000,
		6000,
		30000
	}
/*
 * If this is set, more detailed messages about DB init will be given, instead
 * of just the MDE_DB_NODB.
 */

/*
 * This lock is used to single-thread load/unload of all sets.
 */

/*
 * You really do NOT want to change this boolean.
 * It can be VERY dangerous to do so.  Loss of
 * data may occur.  USE AT YOUR OWN RISK!!!!
 */

/*
 * For mirrored root allow reboot with only half the replicas available.
 * Flag inserted for Santa Fe project.
 */

#define	ISWHITE(c)	(((c) == ' ') || ((c) == '\t') || \
			    ((c) == '\r') || ((c) == '\n'))
#define	ISNUM(c)	(((c) >= '0') && ((c) <= '9'))
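As a minimal sketch of how character-class macros like these are typically used (for example, when scanning a `name_to_major`-style text file line by line during upgrade), consider the following. The helper `scan_number` is hypothetical and not part of this driver; only the two macros mirror definitions from this file.

```c
#include <assert.h>

/* Mirrors the ISWHITE/ISNUM definitions in this file */
#define	ISWHITE(c)	(((c) == ' ') || ((c) == '\t') || \
			    ((c) == '\r') || ((c) == '\n'))
#define	ISNUM(c)	(((c) >= '0') && ((c) <= '9'))

/*
 * Hypothetical helper: skip leading whitespace and parse one decimal
 * number starting at *pp, advancing the cursor past the digits.
 * Returns -1 if no digits are found.
 */
static long
scan_number(const char **pp)
{
	const char *p = *pp;
	long val = 0;
	int seen = 0;

	while (ISWHITE(*p))
		p++;
	while (ISNUM(*p)) {
		val = val * 10 + (*p - '0');
		seen = 1;
		p++;
	}
	*pp = p;
	return (seen ? val : -1);
}
```

The cursor-by-reference style lets a caller pull several fields off one line in sequence.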
/*
 * Defines for crc calculation for records:
 *	rec_crcgen generates a crc checksum for a record block
 *	rec_crcchk checks the crc checksum for a record block
 */

/*
 * During upgrade, SVM basically runs with the devt from the target
 * being upgraded.  Translations are made from the target devt to the
 * miniroot devt when writing data out to the disk.  This is done by
 * the following routines:
 *
 * The following routines are used by the routines listed above and
 * expect a translated (aka miniroot) devt:
 *
 * Also, when calling any system routines, such as ddi_lyr_get_devid,
 * the translated (aka miniroot) devt must be used.
 *
 * By the same token, the major number and major name conversion operations
 * need to use the name_to_major file from the target system instead
 * of the name_to_major file on the miniroot.  So, calls to
 * ddi_name_to_major must be replaced with calls to md_targ_name_to_major
 * when running on an upgrade.  Same is true with calls to
 */
	int	flag,		/* B_ASYNC or 0 passed in here */

/*
 * Returns a list of fields to be skipped in the stripe record structure.
 * These fields are ms_timestamp in the component structure.
 * Used to skip these fields when calculating the checksum.
 */
	/* walk through all rows to find the total number */
	/* Now walk through the components */
	/* walk through all rows to find the total number */
	/* Now walk through the components */
	/* Return the start of the list of fields to skip */

/*
 * Returns a list of fields to be skipped in the mirror record structure.
 * This includes un_last_read and sm_timestamp for each submirror.
 * Used to skip these fields when calculating the checksum.
 */
	/* Return the start of the list of fields to skip */

/*
 * Returns a list of the timestamp fields in the hotspare record structure.
 * Used to skip these fields when calculating the checksum.
 */

/*
 * Calculate or check the checksum for a record:
 * calculate the crc if check == 0, check the crc if check == 1.
 *
 * A record block may be written by different nodes in a multi-owner diskset
 * (in case of master change), so the function rec_crcchk excludes timestamp
 * fields in the crc computation of record data.
 * Otherwise, timestamp fields would cause each node to have a different
 * checksum for the same record block, causing the exclusive-or of all record
 * block checksums and data block record sums to be non-zero after the new
 * master writes at least one record block.
 */
	/*
	 * Generate a list of the areas to be skipped when calculating
	 * the checksum.
	 * First skip rb_checksum, rb_private and rb_userdata.
	 */
	/* For a MN set, skip rb_timestamp */
	/* Now add a list of timestamps to be skipped */

		    "md: mddb: set %u sleeping for buffer", s->s_setno);
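The skip-list mechanism above can be sketched as follows: a checksum routine that folds a buffer while excluding a caller-supplied list of (offset, length) ranges. The names and the XOR-fold sum are illustrative only; they are not the driver's actual `rec_crcgen`/`rec_crcchk` implementation, but they show why excluding the timestamp fields makes every node compute the same value for the same record data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative skip-range descriptor, analogous to the driver's field list */
typedef struct skip_range {
	size_t	sr_off;		/* byte offset of field to exclude */
	size_t	sr_len;		/* length of field to exclude */
} skip_range_t;

/* Is byte offset off inside any of the listed skip ranges? */
static int
in_skip(size_t off, const skip_range_t *skips, int nskips)
{
	int i;

	for (i = 0; i < nskips; i++) {
		if (off >= skips[i].sr_off &&
		    off < skips[i].sr_off + skips[i].sr_len)
			return (1);
	}
	return (0);
}

/* XOR-fold checksum over buf, skipping the listed ranges */
static uint32_t
cksum_skip(const uint8_t *buf, size_t len,
    const skip_range_t *skips, int nskips)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i < len; i++) {
		if (!in_skip(i, skips, nskips))
			sum ^= (uint32_t)buf[i] << ((i % 4) * 8);
	}
	return (sum);
}
```

Two buffers that differ only inside a skipped range produce identical checksums, which is exactly the property the multi-owner diskset code relies on.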
/*
 * Determine the max number of blocks:
 * go through and find the highest logical block.
 */
		if (freeblks == 0)
			continue;	/* this happens when there is no */
					/* master blk */

	/* set up reasonable freespace if no */
	/* locator block sectors */
	/* locator name sectors */
	/* locator block device id information */
	/* disk blocks containing actual device ids */
	/* Only use data tags if not a MN set */
	/* Found a bad tag, do NOT mark the data tag blks busy here */

/*
 * Add free space to the device id incore free list.  This is called:
 * - During startup, when all devid blocks are temporarily placed on the
 *   free list.
 * - After a devid has been deleted via the metadb command.
 * - When mddb_devid_free_get adds unused space from a disk block.
 */

/*
 * Remove specific free space from the device id incore free list.
 * Called at startup (after all devid blocks have been placed on the
 * free list) in order to remove the free space from the list that
 * contains actual devids.
 *
 * Returns 0 if area successfully removed.
 * Returns 1 if no matching area is found - so nothing removed.
 */
	/* find free block for this devid */
	/*
	 * Look through free list of <block, offset, length> to
	 * find our entry in the free list.  Our entry should
	 * exist since the entire devid block was placed into
	 * this free list at startup.  This code is just removing
	 * the non-free (in-use) portions of the devid block so
	 * that the remaining linked list does indeed just
	 * describe free space.
	 *
	 * Our entry has been found if
	 * - the offset (starting address) in the free list is
	 *   less than the offset of our entry and
	 * - the length+offset (ending address) in the free list is
	 *   greater than the length+offset of our entry.
	 */
	/* Have found our entry - remove from list */
	/* did_freep1 - pts to next free block */
	/*
	 * did_freep_before points to the area in the block before the
	 * entry; did_freep_after points to the area in the block after.
	 * Add before and after areas to the free list.
	 * If the area before or after <offset, length> has a length
	 * of 0, that entry is not added.
	 */

/*
 * Find free space of devid length and remove that free space from the list.
 * Return a pointer to the previously free area.
 */
/*
 * If there's not enough free space on the free list, get an empty
 * disk block, put the empty disk block on the did_ic_dbp linked list,
 * and add the disk block space not used for the devid to the free list.
 * Return pointer to address (inside disk block) of free area for devid.
 */
	/* found a free area - remove from free list */
	/* find disk block pointer that contains free area */
	/*
	 * If a disk block pointer can't be found - something
	 * is wrong, so don't use this free space.
	 */
	/* Update free list information */
	/* Didn't find a free spot */
	/* get free logical disk blk in replica */
	/* Add disk block to disk block linked list */
	/* Update return values */
	/* Add unused part of block to free list */
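The remove-from-free-list logic above (unlink the containing extent, then re-add non-empty "before" and "after" pieces) can be sketched as follows. This is a simplified model, not the driver's `mddb_did_free` code: the real list tracks `<block, offset, length>` triples, while this sketch keeps only `<offset, length>` within one block, and the names are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Illustrative free-extent node (driver keeps <block, offset, length>) */
typedef struct free_ent {
	struct free_ent	*fe_next;
	size_t		fe_off;
	size_t		fe_len;
} free_ent_t;

/*
 * Remove [off, off+len) from the free list, splitting the containing
 * extent into "before" and "after" pieces; pieces of length 0 are not
 * re-added.  Returns 0 on success, 1 if no containing extent exists.
 */
static int
free_list_remove(free_ent_t **listp, size_t off, size_t len)
{
	free_ent_t **pp, *fe, *after;

	for (pp = listp; (fe = *pp) != NULL; pp = &fe->fe_next) {
		if (fe->fe_off <= off && off + len <= fe->fe_off + fe->fe_len)
			break;
	}
	if (fe == NULL)
		return (1);		/* no matching area found */

	*pp = fe->fe_next;		/* unlink containing extent */

	if (off + len < fe->fe_off + fe->fe_len) {
		after = malloc(sizeof (*after));
		after->fe_off = off + len;
		after->fe_len = fe->fe_off + fe->fe_len - (off + len);
		after->fe_next = *listp;
		*listp = after;		/* add "after" piece */
	}
	if (off > fe->fe_off) {
		fe->fe_len = off - fe->fe_off;	/* reuse node as "before" */
		fe->fe_next = *listp;
		*listp = fe;
	} else {
		free(fe);
	}
	return (0);
}
```

After removing an in-use region the list describes exactly the remaining free space, which is the invariant the startup pass establishes.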
/*
 * Add device id information for a locator index to the device id area in set.
 * Get a free area to store the device id from the free list.  Update checksum.
 *
 * This routine does not write any data out to disk.
 * After this routine has been called, the routine writelocall should
 * be called to write both the locator block and device id area out
 * to disk.
 */
	/* Check if device id has already been added */
	/* Copy devid into devid free area */
	/* Update mddb_did_info area for new device id */
	/* Add device id pointer to did_ic_devid array */

/*
 * Delete device id information for a locator index from the device id area
 * in set.  Add device id space to the free area.
 *
 * This routine does not write any data out to disk.
 * After this routine has been called, the routine writelocall should
 * be called to write both the locator block and device id area out
 * to disk.
 */
	/* Get device id information from mddb_did_blk */
	/*
	 * Ensure that the underlying device supports device ids
	 * before arbitrarily removing them.
	 */
	/* Remove device id information from mddb_did_blk */
	/* Remove device id from incore area */
	/* Add new free space in disk block to free list */

/*
 * Check if there is a device id for a locator index.
 * Caller of this routine should not free devid or minor_name since
 * these will point to internal data structures that should not be freed.
 */

/*
 * Check if device id is valid on current system.
 * Needs devid, previously known dev_t and current minor_name.
 *
 * Returns 0 if a valid device id is found, and updates
 * dev_t if the dev_t associated with the device id has changed.
 * Returns 1 if device id is not valid on current system.
 */
	/*
	 * See if devid is valid in the current system.
	 * If so, set dev to match the devid.
	 */
	/* devid is valid to use */
	/* does dev_t in list match dev */
	/*
	 * If a different dev_t, then setup
	 * new dev and new major name.
	 */

/*
 * Free the devid incore data areas.
 */
		return ((daddr_t)-1);	/* no such block */

		return ((daddr_t)-1);	/* no such block */
/*
 * when a buf header is passed in, the new buffer must be
 * put on the front of the chain.  writerec counts on it.
 */
	int	cnt,		/* number of blocks to be written */
				/* and put buf address here */
	/*
	 * if a header for a buf chain is passed in, this is async io.
	 * currently only done for optimize records.
	 */

/*
 * wrtblklst - takes an array of logical block numbers
 * and writes the buffer to those blocks (scatter).
 *
 * If called during upgrade, this routine expects a
 * non-translated (aka target) dev.
 */
	const int	li,	/* locator index */
				/* and put buf address here */
	/*
	 * If a MN diskset and only the master can write,
	 * then a non-master node will just return success.
	 */
	/* return successfully if we aren't the master */
	/*
	 * If an MN diskset and any_node_can_write,
	 * then this request is coming from writeoptrecord
	 * and the l_flags field should not be updated.
	 * l_flags will be updated as a result of sending
	 * a class1 message to the master.  Setting l_flags
	 * here would cause the slave to be out of sync with
	 * the master.
	 *
	 * Otherwise, set the error in l_flags
	 * (this occurs if this is not a MN diskset or
	 * only_master_can_write is set).
	 */

/*
 * writeblks - takes a logical block number/block count pair
 * and writes the buffer to those contiguous logical blocks.
 *
 * If called during upgrade, this routine expects a non-translated
 * (aka target) dev.
 */
	int		cnt,	/* number of log blocks to be written */
	const int	li,	/* locator index */
	/*
	 * If a MN diskset and only the master can write,
	 * then a non-master node will just return success.
	 */
	/* return successfully if we aren't the master */
	for (i = 0; i < cnt; i++)

	int		cnt,	/* number of log blocks to be written */

/*
 * writelocall - write the locator block and device id information (if
 * present).
 * Increments the locator block's commitcnt.  Updates the device id area's
 * commitcnt if the replica is in device id format.  Regenerates the
 * checksums after updating the commitcnt(s).
 */
	/* write out blocks containing actual device ids */
	/* write out device id area block */
	/* write out locator block */
	/*
	 * If a MN diskset and this is the master, set the PARSE_LOCBLK flag
	 * in the mddb_set structure to show that the locator block has
	 * changed.
	 */

/*
 * If called during upgrade, this routine expects a translated
 * (aka miniroot) dev.
 */
	int	cnt		/* number of blocks to read */

/*
 * readblklst - takes an array of logical block numbers
 * and reads those blocks (gather) into the buffer.
 *
 * If called during upgrade, this routine expects a non-translated
 * (aka target) dev.
 */
	int	li		/* locator index */

/*
 * readblks - takes a logical block number/block count pair
 * and reads those contiguous logical blocks into the buffer.
 *
 * If called during upgrade, this routine expects a non-translated
 * (aka target) dev.
 */
	int	cnt,		/* number of logical blocks to be read */
	int	li		/* locator index */
	for (i = 0; i < cnt; i++)
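The scatter/gather idea behind `wrtblklst`/`readblklst` can be sketched with a flat array standing in for the replica device. These helper names are illustrative, not the driver's; the point is only that chunk *i* of one contiguous buffer maps to logical block `blklist[i]`, which need not be contiguous on disk.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define	BLKSIZE	512

/* Scatter write: buffer chunk i goes to logical block blklist[i] */
static void
scatter_write(uint8_t *disk, const uint8_t *buf,
    const uint32_t *blklist, int cnt)
{
	int i;

	for (i = 0; i < cnt; i++) {
		memcpy(disk + (size_t)blklist[i] * BLKSIZE,
		    buf + (size_t)i * BLKSIZE, BLKSIZE);
	}
}

/* Gather read is the inverse: collect scattered blocks into the buffer */
static void
gather_read(const uint8_t *disk, uint8_t *buf,
    const uint32_t *blklist, int cnt)
{
	int i;

	for (i = 0; i < cnt; i++) {
		memcpy(buf + (size_t)i * BLKSIZE,
		    disk + (size_t)blklist[i] * BLKSIZE, BLKSIZE);
	}
}
```

`writeblks`/`readblks` are the degenerate case where the block list is a contiguous run starting at one block number.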
/*
 * If called during upgrade, this routine expects a translated
 * (aka miniroot) dev.
 *
 * If master blocks are found, set the mn_set parameter to 1 if the
 * master block revision number is MDDB_REV_MNMB; otherwise, set it to 0.
 * If master blocks are not found, do not change the mnset parameter.
 */
	/* Check for MDDB_REV_MNMB and lower */
	/*
	 * Check the md_devid_destroy and md_keep_repl_state flags
	 * to see if we need to regen the devid or not.
	 *
	 * Don't care about the devid in the local set since it is not used,
	 * and this should not be part of set importing.
	 */
	/*
	 * Now check the destroy flag.  We also need to handle
	 * the case where the destroy flag is reset after the
	 * regen.
	 */
	/* Try to regenerate it if the 'keep' flag is not set */
	/* Set mn_set parameter to 1 if a MN set */
	/* Check crc for this record */

/*
 * Code to read in the locator name information.
 */
	/* read in the locator name blocks */
	/* if an error occurred in the locator name blocks, free them */

/*
 * code to read in a copy of the database.
 */
	/* read in all the directory blocks */
	/* first go through and fix up all de_next pointers */
	/* go through and make all of the pointers to record blocks */
	/* if an error occurred in the directory blocks, free them */
	/* No mddb_rb32_t structures yet */
	/* Don't include CHANGELOG in big XOR */
	/*
	 * scan through and see if data bases have to vary by only device
	 */
	/*
	 * Check if this is a removable device.  If it is, we
	 * assume it is something like a USB flash disk, a zip disk
	 * or even a floppy that is being used to help maintain
	 * mddb quorum.  We don't want to put any optimized resync
	 * records on these kinds of disks since they are usually
	 * slower or don't have the same read/write lifetimes as
	 * a fixed disk.
	 */
	/*
	 * If it's a driver record, and an old style record, and not a DRL
	 * record, we must convert it because it was incore as a 64 bit
	 * structure but its on disk layout has only 32 bits for block sizes.
	 */
	for (i = 0; i < 2; i++) {
	/* Check the crc for this record */
	/* Generate the crc for this record */

/*
 * writeoptrecord writes out an optimized record.
 */
	for (i = 0; i < 2; i++) {
		/*
		 * The only possible error is xlate.  This can
		 * occur if a replica was off line and came
		 * back.  In the mean time the database grew
		 * larger than the now on line replica can store.
		 */
	/*
	 * In a MN diskset, any node can write optimized record(s).
	 * For a MN diskset, set the error in the optinfo structure so
	 * that mddb_commitrec knows which replica failed.
	 */
	/*
	 * If an MN diskset, don't set the replica
	 * in error since this hasn't been set in the master.
	 * Setting the replica in error before the master could
	 * leave the nodes with different views of the
	 * world, since a class 1 configuration change
	 * could occur in mddb_commitrec as soon as
	 * all locks are dropped.  Must keep this
	 * node the same as the master and can't afford a
	 * failure from the class 1 config change.
	 */
	/*
	 * Find which de_optinfo (which replica)
	 * had a failure and set the failure in
	 * the optinfo structure.
	 */

/*
 * Fix up the optimized resync record.  Used in the traditional and local
 * disksets to move an optimized record from a failed or deleted mddb.
 *
 * In a MN diskset, the fixing of the optimized record is split between
 * the master and slave nodes.  If the master node moves the optimized
 * resync record, then the master node will send a MDDB_PARSE_OPTRECS
 * message to the slave nodes causing the slave nodes to reget the
 * directory entry containing the location of the optimized resync record.
 * After the record is reread from disk, writeoptrecord is called
 * if the location of the optimized resync record or flags have changed.
 * When writeoptrecord is called, the node that is the owner of this record
 * will write the optimized record to the location specified in the directory
 * entry.  Since the master node uses the highest class message (PARSE),
 * the record owner node is guaranteed to already have an updated
 * directory entry incore.
 *
 * The result is that the directory entry can be written to disk before the
 * optimized record in a MN diskset if the record is owned by a slave node.
 * So, the users of an optimized record must handle the failure case when no
 * data is available from an optimized record, since the master node could
 * have failed during the relocation of the optimized record to another mddb.
 */
	int	rec_owner;	/* Is node owner of record? */
	for (i = 0; i < 2; i++) {
		/*
		 * If the optimized record has seen a replica failure,
		 * assign a new replica to the record and re-write the data.
		 */
		/* Set flag for slaves to reread dep and write rec */
		/*
		 * If just an error in the data was seen, set
		 * the optimized record's replica flag to active (ok).
		 */
		/*
		 * If a MN diskset, then check the owner of the optimized
		 * record.  If the master node owns the record, or if there
		 * is no owner of the record, then the master can write the
		 * optimized record to disk.
		 * The master node can write the optimized record now, but
		 * slave nodes write their records during handling of
		 * the MDDB_PARSE_OPTRECS message.
		 *
		 * In the traditional diskset and local set, this node
		 * is always the record owner and always the master.
		 */
		/* If this node is the record owner, write out the record */

/*
 * In a MN diskset, the master node is the only node that runs
 * fixoptrecords.  If the master node changes anything, then the
 * master node sends a PARSE message to the slave nodes.  The slave
 * nodes will then re-read in the locator block or re-read in the
 * directory blocks and re-write the optimized resync records.
 */

/*
 * Checks the incore version of the mddb data against the mddb data ondisk.
 * Returns:
 *	- 0 if the data was successfully read and is good.
 *	- MDDB_F_EREAD if a read error occurred.
 *	- 1 if the data read is bad (checksum failed, etc.)
 */
	/*
	 * first go through and make sure all directory stuff
	 * is consistent:
	 * check if all directory entries are identical.
	 */
	for (i = 0; i < 2; i++) {
		/*
		 * If here, all directories are functionally identical.
		 * check to make sure all records are identical.
		 * the reason the records are not just bcmped is that the
		 * lock flag does not want to be compared.
		 */
		/* Check the crc for this record */

/*
 * Get dev associated with device id and minor name.
 * Setup correct driver name if dev is now different.
 * Don't change driver name if during upgrade.
 *
 * Can just use ri_dev for comparison since it would have also been
 * generated from a device id and minor name, if a valid devid exists.
 */
		return (0);
	/* already entered - return success */
	/*
	 * This replica is not represented in the current rip list,
	 * so add it.
	 */
	/*
	 * Devid is present, but not valid.  This could
	 * happen if the device has been powered off or if
	 * the device has been removed.  Mark the device in
	 * error.  Don't allow any writes to this device
	 * based on the dev_t, since another device could
	 * have been placed in its spot and be responding to
	 * I/O.
	 */
	/*
	 * If the rip list is empty then this entry
	 * starts the list.
	 * Add this entry to the end of the rip list.
	 */

/*
 * writecopy writes the incore data blocks out to all of the replicas.
 * This is called from writestart
 *	- when a diskset is started or
 *	- when an error has been encountered during the write to a mddb,
 * and from newdev when a new mddb is being added.
 *
 * Flags:
 * MDDB_WRITECOPY_ALL - write all records to all mddbs.  This is
 *	always used for traditional and local disksets.
 *	All nodes can call writecopy, but only the
 *	master node actually writes data to the disk,
 *	except for optimized resync records.
 *	An optimized resync record can only be written to
 *	by its owner node.
 * MDDB_WRITECOPY_SYNC - special case for MN diskset.  When a new
 *	master has been chosen, the new master may need to
 *	write its incore mddb to disk (this is the case where the
 *	old master had executed a message but hadn't relayed it
 *	to this slave yet).  New master should not write the
 *	change log records since the new master would be overwriting
 *	valuable data.  Only used during a reconfig cycle.
 */
	/*
	 * In a multinode diskset, when a new master is
	 * chosen the new master may need to write its
	 * incore copy of the mddb to disk.  In this case,
	 * we don't want to overwrite the change log records,
	 * so the new master sets the flag to MDDB_WRITECOPY_SYNC.
	 */
	/*
	 * In a multinode diskset, don't write out optimized
	 * resync records since only the mirror owner node
	 * will have the correct data.  If writecopy is
	 * being called from writestart as a result of
	 * an mddb failure, then writestart will handle
	 * the optimized records when it calls fixoptrecords.
	 */
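The per-record write policy described above can be condensed into a small decision function. This is a sketch under stated assumptions, not the driver's code: the enum names and the `should_write` helper are hypothetical, and the real logic is spread across writecopy, writestart and the MN messaging paths.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag and record-kind types; the driver's types differ */
typedef enum { WRITECOPY_ALL, WRITECOPY_SYNC } wc_flag_t;
typedef enum { REC_DATA, REC_CHANGELOG, REC_OPT_RESYNC } rec_kind_t;

/* Should this node write this record out during writecopy? */
static bool
should_write(rec_kind_t kind, wc_flag_t flag, bool mn_set, bool is_master)
{
	if (mn_set && !is_master)
		return (false);		/* only the master writes */
	if (flag == WRITECOPY_SYNC && kind == REC_CHANGELOG)
		return (false);		/* new master must not clobber log */
	if (mn_set && kind == REC_OPT_RESYNC)
		return (false);		/* only the mirror owner has good data */
	return (true);
}
```

The WRITECOPY_SYNC branch captures the new-master reconfig case; the optimized-resync branch captures the owner-only rule.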
	/* Generate the crc for this record */

	/* If no mediator hosts, nothing to do */
	/*
	 * If this is a MN set and we are not the master, then don't
	 * update mediator hosts or mark the mediator as golden, since
	 * only the master node should do that.
	 */
	/* count accessible mediators */
	/* count accessible and existing replicas */
	/*
	 * Mediator update quorum is >= 50%: check for less than
	 * "mediator update" quorum.
	 */
	/* panic if <= 50% of all replicas are accessible */
	if ((lc > 0) && ((alc * 2) <= lc)) {
		    "md: Update of 50%% of the mediator hosts failed");
		    "md: Update of 50%% of the mediator hosts failed");
	/*
	 * If we have mediator update quorum and exactly 50% of the replicas
	 * are accessible, then mark the mediator as golden.
	 */
	/* push the change to all the replicas */

	/* Should not call for MN diskset since data tags are not supported */
	/* Should not call for MN diskset since data tags are not supported */
	/* Run to the end of the list */
	/* Update the dtag portion of the list */
	/* Fix up the id value */
	/*
	 * Even though data tags are not supported in MN disksets, dt_cntl may
	 * be called for a MN diskset, since this routine is called even
	 * before the kind of diskset being read in from disk is known.
	 * For a MN diskset, s_dtlp is 0, so a count of 0 is returned.
	 */
	/*
	 * Even though data tags are not supported in MN disksets, dt_cntl may
	 * be called for a MN diskset, since this routine is called even
	 * before the kind of diskset being read in from disk is known.
	 * For a MN diskset, s_dtlp is 0, so a 0 is returned.
	 */
	/* Should not call for MN diskset since data tags are not supported */
	/*
	 * Even though data tags are not supported in MN disksets, dt_setup
	 * will be called for a MN diskset, since this routine is called even
	 * before the kind of diskset being read in from disk is known.
	 * Once this set is known as a MN diskset, the dtp area will be freed.
	 */
	/* Initialize the setno */
	/* Clear the id and flags, this is only used in user land */
	/* Should not call for MN diskset since data tags are not supported */
	/* Data tags not used in a MN set - so no failure returned */
		    "No tag record allocated, unable to tag data");
	/* Clear the stack variable */
	/* Get the HW serial number for this host */
	/* Get the nodename that this host goes by */
	/* Get a time stamp for NOW */
	/* Setup the data tag record */
	/* Free any list of tags if they exist */
	/* Put the new tag onto the tag list */

/*
 * If called during upgrade, this routine expects a non-translated
 * (aka target) dev.
 *
 * Should not call for MN diskset since data tags are not supported.
 */
	/* If have not allocated a data tag record, there is nothing to do */
	/* error reading the tag */
	/* Mark the locator as having tagged data */

	/* Should not call for MN diskset since data tags are not supported */
	/* Nowhere to write to */
	/* See if the tag is empty. */
	/* Write the tag to the locators and reset appropriate flags. */
	/* If the tags were written, check to see if any tags remain. */
	/* If there are no tags, then clear CLRTAG and TAGDATA */

	/* Should not call for MN diskset since data tags are not supported */
	/*
	 * If the data tag record is allocated (blkcnt != 0) and a bad tag was
	 * not detected, there is nothing to do.
	 */
	/* Bitmap not setup, checks can't be done */
	/* While reading the tag(s) an invalid tag data record was seen */
	/* See if the invalid tag needs to be moved */
	/* Need to move or allocate the tag data record */
		    "Unable to allocate data tag record");
	/* Mark the locators so that they get written to disk. */
	/*
	 * Make sure the blocks are owned, since the calculation in
	 * computefreeblks() is bypassed when MD_SET_BADTAG is set.
	 */

/*
 * Writestart writes the incore mddb out to all of the replicas.
 * This is called when a diskset is started and when an error has
 * been encountered during the write to a mddb.
 *
 * Flags:
 * MDDB_WRITECOPY_ALL - write all records to all mddbs.  This is
 *	always used for traditional and local disksets.
 *	This is the normal path for MN disksets since the slave
 *	nodes aren't actually allowed to write to disk.
 * MDDB_WRITECOPY_SYNC - special case for MN diskset.  When a new
 *	master has been chosen, the new master may need to
 *	write its incore mddb to disk (this is the case where the
 *	old master had executed a message but hadn't relayed it
 *	to this slave yet).  New master should not write the
 *	change log records since the new master would be overwriting
 *	valuable data.  Only used during a reconfig cycle.
 */
	/*
	 * Call fixoptrecord even during a reconfig cycle, since a replica
	 * failure may force the master to re-assign the optimized
	 * resync record to another replica.
	 */
	/* See if any (ACTIVE and not OLDACT) or (not ACTIVE and OLDACT) */
	/*
	 * If we found (ACTIVE and not OLDACT) or (not ACTIVE and OLDACT),
	 * the lbp identifier and the set identifier don't match.
	 */
	/* Only call for traditional and local sets */
	(void) upd_med(s, "writestart(0)");

	(void) upd_med(s, "writestart(1)");
	/*
	 * If a MN diskset and this is the master, set the PARSE_LOCNM
	 * flag in the mddb_set structure to show that the locator
	 * names have changed.
	 * Don't set parseflags as a result of a new master sync
	 * during a reconfig cycle, since slave nodes are already
	 * in-sync with the new master.
	 */

/*
 * selectreplicas selects the working replicas and may write the incore
 * version of the mddb out to the replicas ondisk.
 *
 * Flags:
 * MDDB_RETRYSCAN - quick scan to see if there is an error.
 *	If no new error, returns without writing mddb
 *	to disks.  If a new error is seen, writes out
 *	the mddb.
 * MDDB_SCANALL - lengthy scan to check out mddbs and always writes
 *	out mddb to the replica ondisk.  Calls writecopy
 *	with MDDB_WRITECOPY_ALL flag, which writes out
 *	all records to the replicas ondisk.
 * MDDB_SCANALLSYNC - called during reconfig cycle to sync up incore
 *	and ondisk mddbs by writing incore values to disk.
 *	Calls writecopy with MDDB_WRITECOPY_SYNC flag so
 *	that change log records are not written out.
 *	Only used by MN disksets.
 *
 * Returns:
 *	1 - Unable to write incore mddb data to disk since < 50% replicas.
 */
	/*
	 * can never transition from stale to not stale
	 */
	/*
	 * if there are no errors, this error has already
	 * been processed; return current state
	 */
		if (alc < ((lc + 1) / 2)) {
	/* Set wc_flag based on flag passed in. */
	} while (alc >= ((lc + 1) / 2));
	/* Run to the end of the list */

	*rip = *trip;	/* structure assignment */
	/* Clear the stuff that is not needed for hints */

/*
 * this routine selects the correct replica to use.
 * the rules are as follows:
 *	1. if all replicas have the same init time, select the highest
 *	   commit count.
 *	2. if some but not all replicas are from another hostid, discard them.
 *	3. find which init time is present in the most replicas.
 *	4. discard all replicas which do not match the most init times.
 *	5. select the replica with the highest commit count.
 */
	/* Clear the ri_transplant flag on all the rip entries. */
	/* Set ri_commitcnt to locator's commitcnt - if available */
	/* If any locators have MN bit set, set flag */
	/*
	 * A data tag is being used, so use it to limit the selection first.
	 * Data tags are not used in a MN diskset.
	 */
	/* now toss any locators that have a different data tag */
	/* If same tag, keep it */
	/* Tag used, clear the bit */
	/*
	 * Get rid of the list of tags.
	 * Re-create the list with the tag used.
	 */
	/*
	 * scan to see if all replicas have the same time.
	 * if r == NULL then they were all the same; choose the highest
	 * commit count.
	 */
	/*
	 * If here, a bogus replica is present and at least 1 lb_inittime
	 * differs.
	 * look and see if any, but not all, are from a different id.
	 */
	/* now go through and throw out different if there are some */
	/*
	 * go through and pick the highest.  Use n squared because it is
	 * simple and 40 some is the max possible.
	 */
	/* now go through and toss any that are of a different time stamp */
	/*
	 * Find the locator with the highest commit count, and make it the
	 * chosen one.
	 */
	/* Toss all locator blocks, except the "chosen" one. */
	/* Get rid of all dtp's */
	/* Get rid of extra locator devid block info */
	/* Get rid of extra locators */

	/* copy device id from mddb to cfg_loc structure */
	for (i = 0; i < sz; i++) {
	/*
	 * Even if a devid exists, use the dev, drvnm and mnum in the locators
	 * and sidelocators.  During startup, the dev, drvnm and mnum in
	 * these structures may not match the devid (the locators and
	 * sidelocators will be updated to match the devid by the routine
	 * load_old_replicas).  Using out-of-sync values won't cause any
	 * problems, since ridev will re-derive these from the devid and mnum.
	 * After startup, the dev, drvnm and mnum in these structures have
	 * been updated and can be used.
	 */

/*
 * Find the index into the mnsidelocator where the entry will go.
 * The index can then be fed into both splitname2locatorblocks and
 * cfgloc2locator so that those entries can be kept in sync.
 *
 * Returns:
 *	-1, if failed to find an unused slot or if a traditional diskset
 *	index, if successful (0 <= index <= MD_MNMAXSIDES)
 */
	/*
	 * Checking side locator structure.  First, check if
	 * there is already an entry for this side.  If so,
	 * then use that entry.  Otherwise, find an entry
	 * that has a sideno of 0.
	 */
	/* Found a match - stop looking */
	/* Set first empty slot, but keep looking */
	/* Didn't find empty slot or previously used slot */

/*
 * Takes locator information (driver name, minor number, sideno) and
 * stores it in the locator block.
 * For a traditional diskset, the sideno is the index into the sidelocator
 * array in the locator block.
 * For the MN diskset, the sideno is the nodeid, which can be any number,
 * so the index passed in is the index into the mnsidelocator array.
 */
	int	index	/* Only useful in MNsets when > 1 */
	/*
	 * Index will be the slot that has the given sideno or
	 * the first empty slot if no match is found.
	 * This was pre-checked out in check locator.
	 */
	/*
	 * Look for the driver name.
	 * Didn't find one, add a new one.
	 */
	/* Fill in the drvnm index */
	/*
	 * This device id could already be associated with this index
	 * if this is not the first side added to the set.
	 * If device id is 0, there is no device id for this device.
	 */

/*
 * See if there are mediator hosts and try to use the data.
 */
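The replica selection rules listed earlier (keep the init time held by the most replicas, then take the highest commit count among the survivors) can be sketched as below. The `rep_t` structure and `select_replica` helper are hypothetical stand-ins for the driver's richer `mddb_ri_t` handling; like the source, the sketch uses an n-squared scan because the replica count is small.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative replica descriptor (stand-in for mddb_ri_t fields) */
typedef struct rep {
	uint32_t	r_inittime;	/* lb_inittime stand-in */
	uint32_t	r_commitcnt;	/* locator commit count */
} rep_t;

/*
 * Keep only replicas whose init time occurs most often, then pick the
 * highest commit count among them.  Returns the chosen index, or -1
 * if n == 0.
 */
static int
select_replica(const rep_t *r, int n)
{
	int i, j, best_cnt = 0, chosen = -1;
	uint32_t majority_time = 0;

	for (i = 0; i < n; i++) {	/* find most common init time */
		int cnt = 0;

		for (j = 0; j < n; j++) {
			if (r[j].r_inittime == r[i].r_inittime)
				cnt++;
		}
		if (cnt > best_cnt) {
			best_cnt = cnt;
			majority_time = r[i].r_inittime;
		}
	}
	for (i = 0; i < n; i++) {	/* highest commit count wins */
		if (r[i].r_inittime != majority_time)
			continue;	/* tossed: different time stamp */
		if (chosen == -1 || r[i].r_commitcnt > r[chosen].r_commitcnt)
			chosen = i;
	}
	return (chosen);
}
```

The data-tag and foreign-hostid filters that the source applies first are omitted here for brevity.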
	/* Do not have a mediator, then the state is stale */
	/* Contact the mediator hosts for the data */
	/* No mediator data, stale */
	/* Mark all the mediator data that is not for this set as errored */
	/* Count the number of mediators contacted */
	/* Get the max commitcount */
	/* Now mark the records that don't have the highest cc as errored */
	/* Now mark the records that don't match the lb commitcnt as errored */
	/* Is there a "golden" copy, and how many valid mediators? */
	/* No survivors, stale */
	/* No mediator quorum and no golden copies, stale */
	/* Skip odd numbers, no exact 50% */
	/* Have 50%, allow an accept */
	/* We either have a quorum or a golden copy, or both */

/*
 * 1. read masterblks and locator blocks for all known database locations
 *	a. keep track of which have good master blks
 *	b. keep track of which have good locators
 */
	/* May be cast to mddb_mnlb_t */
	/* if accessing sidenames in */
	/*
	 * read in master blocks and locator block for all known locators.
	 * lb_blkcnt will be set correctly for a MN set later, once getmasters
	 * has determined that the set is a MN set.
	 */
	/*
	 * Translated dev is only used in calls to getmasters and
	 * getblks, which expect a translated (aka miniroot) dev.
	 */
	/* Set error flag that getmasters would have set */
	/* if getmasters had been allowed to fail */
	/*
	 * Invalid device id on system (due to failed or
	 * removed device) or invalid devt during upgrade
	 * (due to powered off device) will cause this
	 * replica to be marked in error and not used.
	 */
	/* get all master blocks, does mddb_devopen() */
	/* if invalid master block - try next replica */
	/*
	 * If lbp alloc'd to wrong size - reset it.
	 * If a MN set, lb_blkcnt must be MDDB_MNLBCNT.
	 * If a traditional set, lb_blkcnt must NOT be MDDB_MNLBCNT.
	 */
	/* If a MN set, set lb_blkcnt for MN loc blk size */
	/*
	 * Read in all the sectors for the locator block.
	 * NOTE: Need to use getblks, rather than readblklst,
	 * because it is too early and things are
	 * NOT set up yet for read*()'s.
	 */
	/* Verify the locator block */
	/* If a MN set, check for MNLB revision in lb. */
	/* If not a MN set, check for LB revision in lb. */
	/*
	 * With the addition of MultiNode Disksets, we must make sure
	 * to verify that this is the correct set.  A node could
	 * have been out of the config for awhile and this disk could
	 * have been moved to a different diskset, and we don't want
	 * to accidentally start the wrong set.
	 *
	 * We don't do this check if we're in the middle of
	 * an import.
	 */
	/*
	 * a commit count of zero means this locator has been deleted
	 */
	/*
	 * If replica is in the device ID style and the md_devid_destroy
	 * flag is set, turn off device id style.  This is only to be
	 * used in a catastrophic failure case.  Examples would be
	 * where the device id of all drives in the system
	 * (especially the mirror'd root drives) had been changed
	 * by a firmware upgrade or by a patch to an existing disk
	 * driver.  Another example would be the case of non-unique
	 * device ids due to a bug.  The device id would be valid on
	 * the system, but would return the wrong dev_t.
	 */
	/*
	 * If replica is in device ID style, read in the device ID
	 * block and verify the device ID block information.
	 */
	/* Read in device ID block */
	/* Reuse did_icp, but clear out data */
	/* Can't reuse blkp since size could be different */
	/* Verify the Device ID block */
	/*
	 * Check if the device ID block is out of sync with the
	 * Locator Block by checking if the locator block
	 * commitcnt does not match the device id block
	 * commitcnt.  If an 'out of sync' condition
	 * exists, discard this replica since it has
	 * inconsistent data and can't be used in
	 * determining the best replica.
	 */
* An 'out of sync' condition could happen if old * SDS code was running with new devid style replicas * or if a failure occurred between the writing of * the locator block's commitcnt and the device * If old SDS code had been running, the upgrade * process should detect this situation and * have removed all of the device id information * via the md_devid_destroy flag in md.conf. * If replica is still in device ID style, read in all * of the device IDs, verify the checksum of the device IDs. * Reset valid bit in device id info block flags. This * flag is stored on disk, but the valid bit is reset * when reading in the replica. If the corresponding * device id is valid (aka meaning that the system * knows about this device id), the valid bit will * be set at a later time. The valid bit for this * replica's device ID will be set in this routine. * The valid bits for the rest of the device id's * will be set after the 'best' replica has * been selected in routine load_old_replicas. * Reset updated bit in device id info block flags. * This flag is also stored on disk, reset when read * in and set when the locators and side locators * have been updated to match this valid device /* Check if block has already been read in */ /* if block not found, read it in */ * Block read in - alloc Disk Block area /* Add to front of dbp list */ /* Check validity of devid in block */ /* Block now pointed to by did_dbp */ * All blocks containing devids are now in core. * If we're doing a replicated import (also known as * remote copy import), the device id in the locator * block is incorrect and we need to fix it up here * alongwith the l_dev otherwise we run into lots of * If there is a valid devid, verify that this locator * block has information about itself by checking the * device ID, minor_name and block * number from this replica's incore data structure * against the locator block information that has just * been read in from disk. 
* If not a valid devid, verify that this locator block * has information about itself by checking the minor * number, block number and driver name from this * replica's incore data structure against the locator * block information that has just been read in from disk. * This locator block MUST have locator (replica) * information about itself. Check against devid, * slice part of minor number, and block number. * This locator block MUST have locator (replica) * information about itself. * Check all possible locators locking for * match to the currently read-in locator, * - side locator for this node's side * - side locator minor number * - side locator driver name /* Looking at sidelocs - cast lbp -> mnlbp */ /* No matching side found */ * Didn't find ourself in this locator block it means * the locator block is a stale transplant. Probably from * Keep track of the number of accessed and valid * Read the tag in, skips invalid or blank tags. * Only valid tags allocate storage * Data tags are not used in MN disksets. * Keep track of the number of tagged /* Keep a list of unique tags. */ * go through locator block and add any other * locations of the data base. * For the replicated import case, this was done earlier * and we really don't need or want to do so again /* No locator blocks were ok */ /* No tagged data was found - will be 0 for MN diskset */ /* Find the highest non-deleted replica count */ /* Count the number of unique tags */ /* Should have at least one tag at this point */ * If the number of tagged locators is not the same as the number of * OK locators OR more than one tag exists, then make sure the * selected tag will be written out later. /* Only a single tag, take the tagged data */ /* Multiple tags, not selecting a tag, tag mode is on */ * 2. check if enough locators now have current copies * 3. read in database from one of latest * 4. if known to have latest make all database the same * 5. 
if configuration has changed rewrite locators * s - pointer to mddb_set structure * flag - used in MN disksets to tell if this node is being joined to * a diskset that is in the STALE state. If the flag is * MDDB_MN_STALE, then this node should be marked in the STALE * state even if > 50% mddbs are available. (The diskset can * only change from STALE->OK if all nodes withdraw from the * MN diskset and then rejoin). /* The only error path out of get_mbs_n_lbs() is MDDB_E_TAGDATA */ /* If a multi-node set, then set md_set.s_status flag */ * If data tag area had been allocated before set type was * If the replica is in devid format, setup the devid incore ptr. * If no devid incore info found - something has gone * Add all blocks containing devids to free list. * Then remove addresses that actually contain devids. /* unable to find disk block */ * create mddb_mbaray, count all locators and active locators. /* Count non-deleted replicas */ * If rip not found, then mark error in master block * so that no writes are later attempted to this * replica. rip may not be setup if ridev * failed due to un-found driver name. /* Save on a divide - calculate 50% + 1 up front */ if (
alc > tlc) {
	/* alc > tlc - OK */
} else if (alc < tlc) {
	/* alc < tlc - stale */
} else if (lc & 1) {
	/* alc == tlc && odd - OK */
} else {
	/* alc == tlc && even - ? */
	/* Can do an accept, and are */
}
else {
/* possibly has a mediator */ * Rootdev is 0 only when booting. So, if * we have come this far and rootdev is 0, * 1. We are being called on bootup. * 2. We have mirrored root. * To handle the quorum issue for a 2 disk case, * we add another vote for rootdev. We are not * removing the requirement for majority quorum * but rather delegating the vote to user level * daemons. Since the daemon has checked the valid * bootpath, rootdev is given an additional vote. /* Allow half mode - CAREFUL! */ * - if 50% mddbs are unavailable and this * has been marked STALE above * - master node isn't in the STALE state * - this node isn't the master node (this node * isn't the first node to join the set) * then clear the STALE state and set TOOFEW. * If this node is the master node and set was marked STALE, * then the set stays STALE. * If this node is not the master and this node's state is * STALE and the master node is not marked STALE, * then master node must be in the TOOFEW state or the * master is panic'ing. A MN diskset can only be placed into * the STALE state by having the first node join the set * with <= 50% mddbs. There's no way for a MN diskset to * transition between STALE and not-STALE states unless all * nodes are withdrawn from the diskset or all nodes in the * diskset are rebooted at the same time. * So, mark this node's state as TOOFEW instead of STALE. * If a MN set is marked STALE on the other nodes, * mark it stale here. Override all other considerations * such as a mediator or > 50% mddbs available. * read a good copy of the locator names * if an error occurs reading what is suppose * to be a good copy continue looking for another /* Find rip entry for this locator if one exists */ * Now have a copy of the database that is equivalent * to the chosen locator block with respect to * inittime, identifier and commitcnt. Trying the * equivalent databases in the order that they were * written will provide the most up to date data. 
* read a good copy of the data base * if an error occurs reading what is supposed * to be a good copy continue looking for another /* Find rip entry for this locator if one exists */ * Now have a copy of the database that is equivalent * to the chosen locator block with respect to * inittime, identifier and commitcnt. Trying the * equivalent databases in the order that they were * written will provide the most up to date data. * go through and find largest record; * Also fix up the user data areas /* If we can clear the tag data record, do it now. */ /* Data tags not supported on MN sets */ /* This will return non-zero if STALE or TOOFEW */ /* This will write out chosen replica image to all replicas */ * If the replica is in device id style - validate the device id's, * if present, in the locator block devid area. /* Validate device id on current system */ * If a device doesn't have a device id, * check if there is now a device ID * associated with device. If one exists, * add it to the locator block devid area. * If there's not enough space to add it, * Don't do this during upgrade. * If a device has a valid device id and if the dev_t * associated with the device id has changed, update the * driver name, minor num and dev_t in the local and side * locators to match the dev_t that the system currently * associates with the device id. * Don't do this during upgrade. /* No match found; take empty */ /* Driver name has changed */ /* Look for the driver name */ /* Didn't find one, add it */ "Unable to update driver" /* Fill in the drvnm index */ * If locator block has been changed by get_mbs_n_lbs, * by addition of new device id, by updated minor name or * by updated driver name - write out locator block. * If the tag was moved, allocated, or a BADTAG was seen for some other * reason, then make sure tags are written to all the replicas. * Data tags not supported on MN sets. /* Free extraneous rip components. 
*/ /* Get rid of lbp's and dtp's */ * if lbp, those out of lb_loccnt bounds * Turn off MDDB_F_EMASTER flag in a diskset since diskset * code always ends up calling ridev for all replicas * before calling load_old_replicas. ridev will reset * MDDB_F_EMASTER flag if flag was due to unresolved devid. * Given the devt from the md.conf info, get the devid for the device. * grab driver name, minor, block and devid out of * strings like "driver:minor:block:devid" while ((*
str != ':') && (*str != '\0') && (p < e))
* If the md_devid_destroy flag is set, ignore the device ids. * This is only to used in a catastrophic failure case. Examples * would be where the device id of all drives in the system * (especially the mirror'd root drives) had been changed * by firmware upgrade or by a patch to an existing disk * driver. Another example would be in the case of non-unique * device ids due to a bug. The device id would be valid on * the system, but would return the wrong dev_t. /* If no device id associated with device, just return */ * No devid in md.conf; we're in recovery mode so * lookup the devid for the device as specified by * grab driver name, minor, and block out of * strings like "driver:minor:block:devid driver:minor:block:devid ..." for (p =
str; (*p != '\0'); ) {
	for (; ((*p != '\0') && (ISWHITE(*p))); ++p)
	for (e = p; ((*e != '\0') && (!ISWHITE(*e))); ++e)
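The scanning loops above split md.conf bootlist strings: entries of the form "driver:minor:block:devid" are separated by whitespace, and fields within an entry by ':'. A minimal user-level sketch of the field-splitting step, assuming only what the comments describe (next_field and its buffer handling are illustrative, not the kernel's parse_db_loc code):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the next ':'-terminated field from *strp into buf (size len);
 * advance *strp past the field and any trailing ':'.  Returns the
 * field length, or -1 when the entry is exhausted.
 */
static int
next_field(const char **strp, char *buf, size_t len)
{
	const char *str = *strp;
	size_t n = 0;

	if (*str == '\0')
		return (-1);
	/* same scan shape as the kernel loop: stop at ':' or NUL */
	while ((*str != ':') && (*str != '\0') && (n < len - 1))
		buf[n++] = *str++;
	buf[n] = '\0';
	if (*str == ':')
		str++;		/* skip the separator */
	*strp = str;
	return ((int)n);
}
```

Called repeatedly, this walks an entry like "sd:7:16" yielding "sd", "7", "16" and then -1, which mirrors how the driver grabs driver name, minor and block in turn.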
* Only give parse_db_loc 1 entry, so stuff a null into * the string if we're not at the end. We need to save this * char and restore it after call. * grab database locations supplied by md.conf as properties * size of _bootlist_name should match uses of line and entry in * Step through the bootlist properties one at a time by forming the * correct name, fetching the property, parsing the property and * then freeing the memory. If a property does not exist or returns * some form of error just ignore it. There is no guarantee that * the properties will always exist in sequence, for example * mddb_bootlist1 may exist and mddb_bootlist2 may not exist with * mddb_bootlist3 existing. * init is already underway, block. Return success. /* grab database locations patched by /etc/system */ * KEEPTAG can never be set for a MN diskset since no tags are * allowed to be stored in a MN diskset. No way to check * if this is a MN diskset or not at this point since the mddb * hasn't been read in from disk yet. (flag will only have * MUTLINODE bit set if a new set is being created.) /* If 0 return value - success */ * If here, then the load_old_replicas() failed /* If the database was supposed to exist. */ /* Want a bit more detailed error messages */ * MDDB_NOOLDOK set - Creating a new database, so do /* lb starts on block 0 */ /* locator names starts after locator block */ /* Creating a multinode diskset */ /* Data portion of mddb located after locator names */ /* the btodb that follows is converting the directory block size */ /* Data tag part of mddb located after first block of mddb data */ /* Data tags are not used in MN diskset - so set count to 0 */ * Set up Device ID portion of Locator Block. * Do not set locator to device id style if * md_devid_destroy is 1 and md_keep_repl_state is 1 * (destroy all device id data and keep replica in * This is logically equivalent to set locator to * device id style if md_devid_destroy is 0 or * md_keep_repl_state is 0. 
* In SunCluster environment, device id mode is disabled * which means diskset will be run in non-devid mode. For * localset, the behavior will remain intact and run in * In multinode diskset devids are turned off. * if we weren't devid style before and md_keep_repl_state=1 * we need to stay non-devid /* Allocate s_un and s_ui arrays if not already present. */ * Release the set mutex - it will be acquired and released in * initit after acquiring the mddb_lock. This is done to assure * that mutexes are always acquired in the same order to prevent * Release the set lock for a given set. * In a MN diskset, this routine may send messages to the rpc.mdcommd * in order to have the slave nodes re-parse parts of the mddb. * Messages are only sent if the global ioctl lock is not held. * With the introduction of multi-threaded ioctls, there is no way * to determine which thread(s) are holding the ioctl lock. So, if * the ioctl lock is held (by process X) process X will send the * messages to the slave nodes when process X releases the ioctl lock. * a MN diskset but this node isn't master, * then release the mutex. * If global ioctl lock is held, then send no messages, * just release mutex and return. * This thread is not holding the ioctl lock, so drop the set * lock, send messages to slave nodes to reparse portions * of the mddb and return. * If the block parse flag is set, do not send parse messages. * This flag is set when master is adding a new mddb that would * cause parse messages to be sent to the slaves, but the slaves * don't have knowledge of the new mddb yet since the mddb add * operation hasn't been run on the slave nodes yet. When the * master unblocks the parse flag, the parse messages will be * If s_mn_parseflags_sending is non-zero, then another thread * is already currently sending a parse message, so just release * the mutex and return. 
If an mddb change occurred that results * in a parse message to be generated, the thread that is currently * sending a parse message would generate the additional parse message. * If s_mn_parseflags_sending is zero and parsing is not blocked, * then loop until s_mn_parseflags is 0 (until there are no more * While s_mn_parseflags is non-zero, * put snapshot of parse_flags in s_mn_parseflags_sending * set s_mn_parseflags to zero * set s_mn_parseflags_sending to zero /* Grab snapshot of parse flags */ * Send the message to the slaves to re-parse * the indicated portions of the mddb. Send the status * of the 50 mddbs in this set so that slaves know which * mddbs that the master node thinks are 'good'. * Otherwise, slave may reparse, but from wrong replica. "mddb update message to other nodes in " * Re-grab mutex to clear sending field and to * see if another parse message needs to be generated. /* Need disk block(s) to hold mddb_did_blk_t */ * Alloc mddb_did_blk_t disk block and fill in header area. * Don't fill in did magic number until end of routine so * if machine panics in the middle of conversion, the * device id information will be thrown away at the * next snarfing of this set. * Need to set DEVID_STYLE so that mddb_devid_add will /* Fill in information in mddb_did_info_t array */ * No translation available for replica. * Could fail conversion to device id replica, * but instead will just continue with next * Just count each devid as at least 1 block. This * is conservative since several device id's may fit * into 1 disk block, but it's better to overestimate * the number of blocks needed than to underestimate. "Not enough space in metadb" /* have a config struct, copy mediator information */ /* Data tags not supported on MN sets. 
*/ /* Take care of things setup in the md_set array */ * returns 0 if name can be put into locator block * returns 1 if locator block prefixes are all used * Takes splitname (suffix, prefix, sideno) and * stores it in the locator name structure. * For traditional diskset, the sideno is the index into the suffixes * array in the locator name structure. * For the MN diskset, the sideno is the nodeid which can be any number, * so the index passed in is the index into the mnsuffixes array * in the locator structure. This index was computed by the * routine checklocator which basically checked the locator block * mnside locator structure. /* If a MN diskset, use index */ * Find the locator name for the given sideno and convert the locator name * information into a splitname structure. * go through and count active entries for (i = 0; i <
loccnt; i++) {
* add the ability to accept a locator block index * which is not relative to previously deleted replicas. This * is for support of MD_DEBUG=STAT in metastat since it asks for * replica information specifically for each of the mirror resync * records. MDDB_CONFIG_SUBCMD uses one of the pad spares in * the mddb_config_t type. for (
li = 0, j = 0; /* void */; li++) {
"Deletion of replica not allowed during upgrade.\n");
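The li/j loop above maps a caller-visible replica index onto a locator-block slot while skipping deleted entries (a commit count of zero marks a deleted locator). A standalone sketch of that mapping, with an illustrative struct in place of the real mddb_lb_t locator layout:

```c
#include <assert.h>

/* illustrative stand-in for a locator-block entry */
struct loc {
	unsigned l_commitcnt;	/* 0 => slot has been deleted */
};

/*
 * Return the locator-block slot holding the j-th live (non-deleted)
 * locator, or -1 if fewer than j+1 live slots exist.
 */
static int
live_slot(const struct loc *lp, int loccnt, int j)
{
	int li, found = 0;

	for (li = 0; li < loccnt; li++) {
		if (lp[li].l_commitcnt == 0)
			continue;	/* skip deleted slot */
		if (found++ == j)
			return (li);
	}
	return (-1);
}
```

This is why metastat's MDDB_CONFIG_SUBCMD variant matters: an index "not relative to previously deleted replicas" bypasses exactly this skip-the-holes translation.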
* If here, replica delete in progress. * Don't need to write out device id area, since locator * block on this replica is being deleted by setting the /* Only support data tags for traditional and local sets */ /* Write data tags to all accessible devices */ /* Only support data tags for traditional and local sets */ /* Delete device id of deleted replica */ /* write new locator to all devices */ * update_valid_replica - updates the locator block namespace (prefix * and/or suffix) with new pathname and devname. * Future note: Need to do something here for the MN diskset case * when device ids are supported in disksets. * Can't add until merging devids_in_diskset code into code base * Currently only called with side of 0. * Check if prefix (Ex: /dev/dsk) needs to be changed. * If new prefix is the same as the previous prefix - no change. * If new prefix is not the same, check if new prefix * matches an existing one. If so, use that one. * If new prefix doesn't exist, add a new prefix. If not enough /* Check if new prefix is the same as the old prefix. */ /* Check if new prefix is an already known prefix. */ /* If no match found for new prefix - add the new prefix */ /* No space to add new prefix - return failure */ /* Now, update the suffix (Ex: c0t0d0s0) if needed */ * md_update_locator_namespace - If in devid style and active and the devid's * exist and are valid update the locator namespace pathname /* must be DEVID_STYLE */ /* replica also must be active */ /* only update if did exists and is valid */ * If a MN diskset and this is the master, set the PARSE_LOCNM * flag in the mddb_set structure to show that the locator * update_locatorblock - for active entries in the locator block, check * the devt to see if it matches the given devt. If so, and * there is an associated device id which is not the same * as the passed in devid, delete old devid and add a new one. 
/* find replicas that haven't been deleted */ * check to see if locator devt matches given dev * and if there is a device ID associated with it continue;
/* cont to nxt active entry */ * There is a case where a disk may not have an mddb, * and only has a dummy mddb which contains * a valid devid we would like to update; in this * case, the rip_lbp will be NULL but we still * would like to update the devid embedded in the * Done if it is a non-replicated set * Replace the mb_devid with the new/valid one * Zero out what we have previously * - regain the s_dbmx lock * Need to update this if we want to handle * mb_next != NULL, which is unlikely to happen * We only update what is asked "Addition and deletion of sides not allowed" * If a MN diskset, need to find the index where the new * locator information is to be stored in the mnsidelocator * field of the locator block so that the locator name can * be stored at the same array index in the mnsuffixes * field of the locator names structure. * Store the locator name before the sidelocator information * in case a panic occurs between these 2 steps. Must have * the locator name information in order to print reasonable /* write new locator names to all devices */ * If a MN diskset and this is the master, set the PARSE_LOCNM * flag in the mddb_set structure to show that the locator /* write new locator to all devices */ /* Currently don't allow addition of new replica during upgrade */ "Addition of new replica not allowed during upgrade.\n");
/* Determine the flag settings for multinode sets */ * Really is a new replica, go get the master blocks * Compute free blocks in replica. * Check if this is large enough /* Look for a deleted slot */ /* If no deleted slots, add a new one */ /* Already have the max replicas, bail */ /* Initialize the new or deleted slot */ * If a MN diskset, need to find the index where the new * locator information is to be stored in the mnsidelocator * field of the locator block so that the locator name can * be stored at the same array index in the mnsuffixes * field of the locator names structure. * Store the locator name before the sidelocator information * in case a panic occurs between these 2 steps. Must have * the locator name information in order to print reasonable * Compute free blocks in replica before calling cfgloc2locator * since cfgloc2locator may attempt to alloc an unused block * to store the device id. * mbiarray needs to be setup before calling computefreeblks. /* write db copy to new device */ /* write new locator names to all devices */ * If a MN diskset and this is the master, set the PARSE_LOCNM * flag in the mddb_set structure to show that the locator /* Data tags not supported on MN sets */ /* Write data tags to all accessible devices */ /* Data tags not supported on MN sets */ /* write new locator to all devices */ * Note: must allow USEDEV ioctl during upgrade to support * Also during the set import if the md_devid_destroy * flag is set then error out /* LINTED variable unused - used for sizeof calculations */ * everyone is supposed to sepcify if it's a * 32 bit or a 64 bit record * and new directory block so to avoid sleeping * after starting single_thread * if this is the largest record allocate new buffer for * this test is incase when to sleep during kmem_alloc * and some other task bumped max record size * see if a directory block exists which will hold this entry * need to add directory block * Optimized records have an owner node associated with 
them in * a MN diskset. The owner is only set on a node that is actively * writing to that record. The other nodes will show that record * as having an invalid owner. The owner for an optimized record * is used during fixoptrecord to determine which node should * write out the record when the replicas associated with that * optimized record have been changed. * try to get all blocks consecutive. If not possible * just get them one at a time /* Do we have to create an old style (32 bit) record? */ /* set de_rb_userdata for non optimization records */ /* Generate the crc for this record */ * the following code writes new records to all instances of * the data base. Writing one block at a time to each instance * is safe because they are not yet in a directory entry which * has been written to the data base for (i = 0; i <
blkcnt; i++) {
* If a MN diskset then only master writes out newly * created optimized record. /* Don't include opt resync and change log records in global XOR */ * staledelete is used to mark deletes which failed. * its only use is to not panic when the user retries * the delete once the database is active again /* LINTED variable unused - used for sizeof calculations */ "nonoptimized records can be resized\n");
* Commit given record to disk. * If committing an optimized record, do not call * with md ioctl lock held. * following code allows multiple processes to be doing * optimization commits in parallel. * NOTE: if lots of optimization commits then the lock * will not get released until it winds down /* Generate the crc for this record */ /* If last thread out, release single_thread_start */ * If this thread had a writeoptrecords failure, then * need to send message to master. * But, multiple threads could all be running on the * same single_thread_start, so serialize the threads * by making each thread grab single_thread_start. * After return from sending the message to the master, * replicas associated with optimized record will have * been changed (via a callback from the master to all * nodes), so retry call to writeoptrecord. * This code is replacing the call to writeretry that * occurs for the local and traditional disksets. * If > 50% of replicas are alive then continue * to send message to master until writeoptrecord * succeeds. For now, assume that minor name, * major number on this node is the same as on * the master node. Once devids are turned on * for MN disksets, can send devid. for (i = 0; i <
2; i++) {
* Send message to master about optimized * record failure. After return, master * should have marked failed replicas * and sent parse message to slaves causing * slaves to have fixed up the optimized * On return from ksend_message, retry * the write since this node should have fixed * the optimized resync records it owns. "Unable to send optimized " "message to other nodes in " "MD_MN_MSG_MDDB_OPTRECERR");
/* Start over in case mddb changed */ /* Generate the crc for this record */ * If writeoptrecord succeeds, then /* Resync record should be fixed - if possible */ /* All errors have been handled */ /* If set is a traditional or local set */ * scan through and make sure ids are from the same set * scan through and make sure ids all exist * scan through records fix commit counts and * zero fiddles and update time stamp and rechecksum record /* Don't do fiddles for CHANGE LOG records */ /* Generate the crc for this record */ /* Don't do fiddles for CHANGE LOG records */ * If this is a MN set but we are not the master, then we are not * supposed to update the mddb on disk. So we finish at this point. * This should be the only thing that prevents LOCAL sets from having * mediators, at least in the kernel, userland needs to have some code (
void) upd_med(s, "updmed_ioctl()");
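The writeoptrecord recovery described above tries at most twice: on the first failure the slave sends MD_MN_MSG_MDDB_OPTRECERR to the master, the master fixes up the replicas holding that optimized record, and the write is retried. A schematic of just that control flow; the function-pointer plumbing and the fake_* stubs are illustrative, and all locking/callback machinery is elided:

```c
#include <assert.h>
#include <stddef.h>

static int fake_fixed;		/* set once the "master" has fixed replicas */

static int
fake_write(void *rec)
{
	(void)rec;
	return (fake_fixed ? 0 : -1);	/* fails until record relocated */
}

static int
fake_notify(void *rec)
{
	(void)rec;
	fake_fixed = 1;		/* master relocates the optimized record */
	return (0);
}

/*
 * Attempt the optimized-record write; on first failure notify the
 * master and retry once.  Returns 0 on success, non-zero otherwise.
 */
static int
write_with_retry(int (*writeopt)(void *), int (*notify_master)(void *),
    void *rec)
{
	int err = -1;
	int i;

	for (i = 0; i < 2; i++) {
		err = writeopt(rec);
		if (err == 0)
			break;		/* write succeeded */
		/* first failure: ask master to fix replicas, then retry */
		if (i == 1 || notify_master(rec) != 0)
			break;		/* give up */
	}
	return (err);
}
```

The two-iteration bound matches the `for (i = 0; i < 2; i++)` loop in the driver: one write attempt before the master's fix-up, one after.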
* In the case that the snarf failed, the diskset is * left with s_db set, but s_lbp not set. The node is not * an owner of the set and won't be allowed to release the * diskset in order to clean up. With s_db set, any call to the * GETDEV or ENDDEV ioctl (done by libmeta routine metareplicalist) * will cause the diskset to be loaded. So, clean up the diskset so * that an inadvertent start of the diskset doesn't happen later. * Attempt to mark set as HOLD. If it is marked as HOLD, this means * that the mirror code is currently searching all mirrors for an * errored component that needs a hotspare. While this search is in * progress, we cannot release the set and therefore we return EBUSY. * Once we have set HOLD, the mirror function (check_4_hotspares) will * block before the search until the set is released. * Data tags not supported on MN sets so return invalid operation. * This ioctl could be called before the mddb has been read in so * the set status may not yet be set to MNSET, so code following * this check must handle a MN diskset properly. /* s_dtlp is NULL for MN diskset */ /* Walked the whole list and id not found, return error */ * Data tags not supported on MN sets so return invalid operation. * This ioctl could be called before the mddb has been read in so * the set status may not yet be set to MNSET, so code following * this check must handle a MN diskset properly. /* Validate and find the id requested - nothing found if MN diskset */ /* Usetag is only valid when more than one tag exists */ /* Put the selected tag in place */ /* Save the hint information */ /* Let unload know not to free the tag */ /* Re-init set using the saved mddb_config_t structure */ /* use the saved rip structure */ /* Let the take code know a tag is being used */ * Data tags not supported on MN sets so return invalid operation. * mddb is guaranteed to be incore at this point, so this * check will catch all MN disksets. 
/* If we had a BADTAG, it will be re-written, so clear the bit. */ /* Re-init set using the saved mddb_config_t structure */ /* Free the allocated rip structure */ /* use the saved rip structure */ /* Let the set init code know an accept is in progress */ * mddb_getinvlb_devid - cycles through the locator block and determines * if the device id's for any of the replica disks are invalid. * If so, it returns the diskname in the ctdptr. * cnt number of invalid device id's /* check for lb being devid style */ /* Only if devid exists and isn't valid */ * if we count more invalid did's than * was passed in there's an error somewhere * Future note: Need to do something here * for the MN diskset case when device ids * are supported in disksets. * Can't add until merging devids_in_diskset * check to make sure length of device name is * not greater than computed first time through /* strip off slice part */ /* look to see if diskname is already in list */ for (i = 0; i < (
cnt - 1); i++) {
/* already there, don't add */ /* point to next diskname in list */ /* add diskname to list */ /* null terminate the list */ * need to save the new pointer so that calling routine can continue * to add information onto the end. * mddb_validate_lb - count the number of lb's with invalid device id's. Keep * track of length of longest devicename. * cnt number of lb's with invalid devid's /* lb must be in devid style */ /* Here we know, did exists but isn't valid */ * Future note: Need to do something here * for the MN diskset case when device ids * are supported in disksets. * Can't add until merging devids_in_diskset /* there is nothing here..so we can unload */ * Update the in-core optimized resync record contents by re-reading the * record from the on-disk metadb. * The contents of the resync record will be overwritten by calling this * routine. This means that callers that require the previous contents to * be preserved must save the data before calling this routine. for (i = 0; i <
2; i++) {
/* Check the crc for this record */ /* Generate the crc for this record */ * Re-read the resync record from the on-disk copy. This is required for * multi-node support so that a new mirror-owner can determine if a resync * operation is required to guarantee data integrity. * -1 invalid set (not multi-node or non-existant) * >0 metadb state invalid * Set owner associated with MN optimized resync record. * Optimized records have an owner node associated with them in * a MN diskset. The owner is only set on a node that is actively * writing to that record. The other nodes will show that record * as having an invalid owner. The owner for an optimized record * is used during fixoptrecord to determine which node should * write out the record when the replicas associated with that * optimized record have been changed. * Called directly from mirror driver and not from an ioctl. * MDDB_E_NORECORD if record not found. * mddb_parse re-reads portions of the mddb from disk given a list * of good replicas to read from and flags describing * which portion of the mddb to read in. * Used in a MN diskset when the master has made a change to some part * of the mddb and wants to relay this information to the slaves. * Master node initiated this request, so there's no work for /* Walk through master's active list */ /* Assumes master blocks are already setup */ * a commit count of zero means this locator has /* Found a good locator - keep it */ * If found a good copy of the mddb, then read it into * this node's locator block. Fix up the set's s_mbiarray * pointer (master block incore array pointer) to be * in sync with the newly read in locator block. If a * new mddb was added, read in the master blocks associated * with the new mddb. If an mddb was deleted, free the * master blocks associated with deleted mddb. /* Compare old and new view of mddb locator blocks */ /* If old and new views match, continue */ * If new mddb has been added - delete * old mbiarray and get new one. 
/*
 * When devids are supported, will
 * need to get dev from devid.
 */
/*
 * If getmasters fails, getmasters
 * will set appropriate error flags.
 */
/* If old one has been deleted - */
	/* Free this node's old view of mddb locator blocks */
	s->s_lnp = NULL;
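The added/deleted mddb handling described above can be sketched as a standalone reconciliation loop over the old and new locator views. Everything here is hypothetical (the slot arrays and mbi_valid flags are stand-ins for the set's s_mbiarray bookkeeping); it only restates the compare-and-refresh logic, not the driver's code.

```c
#include <assert.h>

/*
 * A locator slot is nonzero when an mddb lives there.  For every slot
 * that differs between the two views, the stale incore master blocks
 * are dropped and, if the new view still has an mddb there, re-read.
 */
#define	SK_NLOC	4

/* Returns the number of slots whose master blocks had to be refreshed. */
int
reconcile_locators(const int oldv[SK_NLOC], const int newv[SK_NLOC],
    int mbi_valid[SK_NLOC])
{
	int i, changed = 0;

	for (i = 0; i < SK_NLOC; i++) {
		/* If old and new views match, continue */
		if (oldv[i] == newv[i])
			continue;
		/* Old one deleted or replaced: free stale master blocks */
		mbi_valid[i] = 0;
		/* New mddb added: (re)read its master blocks */
		if (newv[i] != 0)
			mbi_valid[i] = 1;
		changed++;
	}
	return (changed);
}
```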
	/* readlocnames does this anyway */
	/* Successfully read the locator names */
	/* Did not successfully read locnames; restore lnp */
	/* readlocnames successful, free old struct */

/*
 * Walk through directory block and directory entry incore
 * linked list looking for optimized resync records.
 * For each opt record found, re-read in directory block.
 * The directory block consists of a number of directory
 * entries.  The directory entry for this opt record will
 * describe which 2 mddbs actually contain the resync record
 * since it could have been relocated by the master node
 * due to mddb failure or mddb deletion.  If this node
 * is the record owner for this opt record, then write out
 * the record to the 2 mddbs listed in the directory entry
 * if the mddb locations are different than previously known.
 */
	/* Found an opt record */
	/* If no opt records found, go to next dbp */

/*
 * Reread directory block from disk since
 * master could have rewritten it during fixoptrecord.
 */
	/* Reverify db; go to next mddb if bad */

/*
 * If all mddbs are unavailable then panic since
 * this slave cannot be allowed to continue out-of-sync
 * with the master node.  Since the optimized resync
 * records are written by all nodes, all nodes must
 * stay in sync with the master.
 *
 * This also handles the case when all storage
 * connectivity to a slave node has failed.  The
 * slave node will send an MDDB_OPTRECERR message to
 * the master node when the slave node has been unable
 * to write an optimized resync record to both
 * designated mddbs.  After the master has fixed the
 * optimized records to be on available mddbs, the
 * MDDB_PARSE message (with the flag MDDB_PARSE_OPTRECS)
 * is sent to all slave nodes.  If a slave node is
 * unable to access any mddb in order to read in the
 * relocated optimized resync record, then the slave
 * node must panic.
 */
	    "access any SVM state database "
	    "replicas for diskset %s\n",
/*
 * Setup temp copy of linked list of de's.
 * Already have an incore copy, but need to walk
 * the directory entry list contained in the
 * new directory block that was just read in above.
 * After finding the directory entry of an opt record
 * by walking the incore list, find the corresponding
 * entry in the temporary list and then update
 * the incore directory entry record with
 * the (possibly changed) mddb location stored
 * for the optimized resync records.
 */
	/* Now, walk the incore directory entry list */

/*
 * Found an opt record in the incore copy.
 * Find the corresponding entry in the temp
 * list.  If anything has changed in the
 * opt record info between the incore copy
 * and the temp copy, update the incore copy
 * and set a flag to writeout the opt record
 * to the new mddb locations.
 */
	/* Check first mddb location */
	/* Check second mddb location */
	/* Record owner should rewrite it */

/*
 * Update the incore checksum information for this
 * directory block to match the newly read in checksum.
 * This should have only changed if the incore and
 * temp directory entries differed, but it takes
 * more code to do the check than to just update
 * the information every time.
 */
	/* Now free everything */

/*
 * If the new_master flag is set for this setno we are in the middle
 * of a reconfig cycle, and blocking or unblocking is not needed.
 * Hence we can return success immediately.
 */

/*
 * mddb_optrecfix marks up to 2 mddbs as failed and calls fixoptrecords
 * to relocate any optimized resync records to available mddbs.
 * This routine is only called on the master node.
 *
 * Used in a MN diskset when a slave node has failed to write an optimized
 * resync record.  The failed mddb information is sent to the master node
 * so the master can relocate the optimized records, if possible.  If the
 * failed mddb information has a mddb marked as failed that was previously
 * marked active on the master, the master sets its incore mddb state to
 * EWRITE and sets the PARSE_LOCBLK flag.
 */
/*
 * The master node then attempts
 * to relocate any optimized records on the newly failed mddbs by calling
 * fixoptrecords.  (fixoptrecords will set the PARSE_OPTRECS flag if any
 * optimized records are relocated.)
 *
 * When mddb_optrecfix is finished, the ioctl exit code will notice the PARSE
 * flags and will send a PARSE message to the slave nodes.  The PARSE_LOCBLK
 * flag causes the slave node to re-read in the locator block from disk.
 * The PARSE_OPTRECS flag causes the slave node to re-read in the directory
 * blocks and write out any optimized resync records that have been
 * relocated to a different mddb.
 */

/*
 * If slave node has seen an mddb failure, but the master node
 * hasn't encountered this failure, mark the mddb as failed on
 * the master node and set the something_changed flag to 1.
 */
	for (i = 0; i < 2; i++) {
		/* Do quick check using li */

/*
 * Passed in li from slave does not match
 * the replica in the master's structures.
 * This could have occurred if a delete
 * mddb command was running when the
 * optimized resync record had a failure.
 * Search all replicas for this entry.
 * If no match, just ignore.
 * If a match, set replica in error.
 */

/*
 * If this message changed nothing, then we're done since this
 * failure has already been handled.
 */

/*
 * If some mddb state has been changed, send a parse message to
 * the slave nodes so that the slaves will re-read the locator
 * block.
 */

/*
 * Scan replicas setting MD_SET_TOOFEW if
 * 50% or more of the mddbs have seen errors.
 *
 * Note: Don't call selectreplicas or writeretry
 * since these routines may end up setting the ACTIVE flag
 * on a failed mddb if the master is able to access the mddb
 * but the slave node couldn't.  Need to have the ACTIVE flag
 * turned off in order to relocate the optimized records to
 * mddbs that are (hopefully) available on all nodes.
 */

/*
 * If more than 50% mddbs have failed, then don't relocate opt recs.
 * The node sending the mddb failure information will detect TOOFEW
 * and will panic when it attempts to re-write the optimized record.
 */
	if (alc < ((lc + 1) / 2)) {
	/* Attempt to relocate optimized records that are on failed mddbs */
	/* Push changed locator block out to disk */
	/* Recheck for TOOFEW after writing out locator blocks */

	/* If more than 50% mddbs have failed, then don't relocate opt recs */
	if (alc < ((lc + 1) / 2)) {
/*
 * Check if incore mddb on master node matches ondisk mddb.
 * If not, master writes out incore view to all mddbs.
 * Have previously verified that master is an owner of the
 * diskset (master has snarfed diskset) and that diskset is
 * not stale.
 *
 * Meant to be called during reconfig cycle during change of master.
 * Previous master in diskset may have changed the mddb and
 * panic'd before relaying information to slave nodes.  New
 * master node just writes out its incore view of the mddb and
 * the replay of the change log will resync all the nodes.
 *
 * Only supported for MN disksets.
 */
	/* Verify that setno is in valid range */
	/* Calling diskset must be a MN diskset */
	/* Re-verify that set is not stale */

/*
 * Previous master could have died during the write of data to
 * the mddbs so that the ondisk mddbs may not be consistent.
 * So, need to check the contents of the first and last active mddb
 * to see if the mddbs need to be rewritten.
 */
	/* Find replica that is active */
	/* Check locator block */
	/* read in on-disk locator block */
	/* If err, try next mddb */

/*
 * We resnarf all changelog entries for this set.
 * They may have been altered by the previous master.
 */
	/* This has been alloc'ed while joining the set */

/*
 * When we see an error while reading the
 * changelog entries, we move on to the next mddb.
 */
		break;		/* out of inner for-loop */
	break;			/* out of outer for-loop */
	/* If err, try next mddb */
	/* Is incore locator block same as ondisk? */
	/* If lb ok, check locator names */
	/* read in on-disk locator names */
	/* If err, try next mddb */
	/* Are incore locator names same as ondisk? */

/*
 * If a read error is encountered, set the error flag and
 * continue to the next mddb.  Otherwise, if incore data is
 * different from ondisk, then set the flag to write out
 * the mddb and break out.
 */

/*
 * Have found first active mddb and the data is the same as
 * incore - break out of loop.
 */

/*
 * Skip checking for last active mddb if:
 *	- already found a mismatch in the first active mddb
 *	  (write_out_mddb is 1) OR
 *	- didn't find a readable mddb when looking for first
 *	  active mddb (there are mddbs present but all failed
 *	  when read was attempted).
 * In either case, go to write_out_mddb label in order to attempt
 * to write out the data.  If < 50% mddbs are available, panic.
 */

/*
 * Save which index was checked for the first active mddb.  If only 1
 * active mddb, don't want to recheck the same mddb when looking for
 * the last active mddb.
 */

/*
 * Now, checking for last active mddb.  If found same index as before
 * (only 1 active mddb), then skip.
 */
	/* Find replica that is active */
	/* If already checked mddb, bail out */
	/* Check locator block */
	/* read in on-disk locator block */
	/* If err, try next mddb */
	/* Is incore locator block same as ondisk? */
	/* If lb ok, check locator names */
	/* read in on-disk locator names */
	/* If err, try next mddb */
	/* Are incore locator names same as ondisk? */

/*
 * If a read error is encountered, set the error flag and
 * continue to the next mddb.  Otherwise, if incore data is
 * different from ondisk, then set the flag to write out
 * the mddb and break out.
 */

/*
 * Have found last active mddb and the data is the same as
 * incore - break out of loop.
 */

/*
 * If ondisk and incore versions of the mddb don't match, then
 * write out this node's incore version to disk.
 */
/*
 * Or, if unable to read a copy of the mddb, attempt to write
 * it out anyway.
 */
	/* Recompute free blocks based on incore information */

/*
 * Write directory entries and record blocks.
 * Use flag MDDB_WRITECOPY_SYNC so that writecopy
 * routine won't write out change log records.
 */
	/* Don't write to inactive or deleted mddbs */
	/* If encounter a write error, save it for later */

/*
 * Write out locator blocks to all replicas.
 * push_lb will set MDDB_F_EWRITE on replicas that fail.
 */
	/* Write out locator names to all replicas */
	/* writeall sets MDDB_F_EWRITE if a write fails to a replica */

/*
 * The writes to the replicas above would have set
 * the MDDB_F_EWRITE flags if any write error was
 * encountered.
 *
 * If < 50% of the mddbs are available, panic.  If a replica:
 *	- is not active (previously had an error),
 *	- had an error reading the master blocks, or
 *	- had an error in writing to the mddb,
 * then don't count this mddb in the active count.
 */
	if (alc < ((lc + 1) / 2)) {
"md: Panic due to lack of DiskSuite state\n" " database replicas. Fewer than 50%% of " "the total were available,\n so panic to " "ensure data integrity.");
/*
 * If encountered an error during checking or writing of
 * mddbs, call selectreplicas so that replica error can
 * be properly handled.  This will involve another attempt
 * to write the mddb out to any mddb marked MDDB_F_EWRITE.
 * If mddb still fails, it will have the MDDB_F_ACTIVE bit
 * turned off.  Set the MDDB_SCANALLSYNC flag so that
 * selectreplicas doesn't overwrite the change log entries.
 */

/*
 * Set the PARSE_LOCBLK flag in the mddb_set structure to show
 * that the locator block has been changed.
 */

/*
 * Used during reconfig cycle.
 * Only supported for MN disksets.
 */
	/* Verify that setno is in valid range */

/*
 * When setting the flags, the set may not
 * be snarfed yet.  So, don't check for SNARFED or MNset
 * and don't call mddb_setenter.
 */

/*
 * In order to discourage bad ioctl calls,
 * verify that magic field in structure is set correctly.
 */

	/* Load the devid name space if it exists */
	/* Unload the devid namespace */

/*
 * Find the entry, update its n_minor if metadevice
 */

/*
 * It is a non-replicated set
 * and there is no need to update
 */

	/* We have it, go ahead and update the namespace. */
	/* Update setname embedded in the namespace */
	/* Create and fill in set record */
	/* Create and fill in drive records */

	/* Add entry and create the record */

/*
 * We need to check to see if the drive on
 * the rip has a replica.  If it doesn't have
 * a replica, then we need to set the dr_dbcnt
 * and dr_dbsize to 0 to reflect that.
 */
	/* Add on the linked list */

	/* Alloc and setup recids which include set record */

	/* namespace is loaded before this is called */

/*
 * The purpose of this function is to update the device ids in the entire
 * namespace using the data in the ri structure.  Compare the devid found in
 * the namespace with ri_old_devid and if they are the same, update with the
 * new one.
 */

/*
 * It is okay if we don't have any configuration.
 */
	/* check out every entry in the namespace */
	/* find this devid in the incore replica */
	/* found the corresponding entry */
	/* first remove old devid info */
	/* add in new devid info */

	/* Set the bit first otherwise load_old_replicas can fail */

/*
 * Upon completion of load_old_replicas, the old setno is
 * restored from the disk so we need to reset it.
 */

	/* Fixup the NM records before loading namespace */

/*
 * Load the devid name space if it exists
 * and ask each module to fixup unit records.
 */

	/* (2) locator name block if necessary */
	/* calls appropriate writes to push changes out */
	/* Create set in MD_LOCAL_SET */
	/* update the namespace device ids if necessary (i.e. block copy disk) */
	/* Unload the namespace for the imported set */
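The devid-update walk described above (compare each namespace entry's devid against ri_old_devid, replace on match) can be sketched as follows. String devids and the nm_entry struct are stand-ins for the real opaque, variable-length devid blobs and namespace records; all names here are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define	NM_ENTRIES	3
#define	DEVID_LEN	16

/* Hypothetical namespace entry holding a string devid */
struct nm_entry {
	char ne_devid[DEVID_LEN];
};

/* Returns the number of entries rewritten with the new devid. */
int
update_namespace_devids(struct nm_entry ents[NM_ENTRIES],
    const char *old_devid, const char *new_devid)
{
	int i, updated = 0;

	/* check out every entry in the namespace */
	for (i = 0; i < NM_ENTRIES; i++) {
		if (strcmp(ents[i].ne_devid, old_devid) != 0)
			continue;
		/* first remove old devid info, then add in new devid info */
		(void) strncpy(ents[i].ne_devid, new_devid, DEVID_LEN - 1);
		ents[i].ne_devid[DEVID_LEN - 1] = '\0';
		updated++;
	}
	return (updated);
}
```

Entries whose devid does not match ri_old_devid are deliberately left untouched, matching the comment above.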