/* ztest.c revision 503ad85c168c7992ccc310af845a581cff3c72b5 */
/*
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 *
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each file.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 */

/*
 * Copyright 2009 Sun Microsystems, Inc.  All rights reserved.
 * Use is subject to license terms.
 */

/*
 * The objective of this program is to provide a DMU/ZAP/SPA stress test
 * that runs entirely in userland, is easy to use, and easy to extend.
 *
 * The overall design of the ztest program is as follows:
 *
 * (1) For each major functional area (e.g. adding vdevs to a pool,
 *     creating and destroying datasets, reading and writing objects, etc)
 *     we have a simple routine to test that functionality.  These
 *     individual routines do not have to do anything "stressful".
 *
 * (2) We turn these simple functionality tests into a stress test by
 *     running them all in parallel, with as many threads as desired,
 *     and spread across as many datasets, objects, and vdevs as desired.
 *
 * (3) While all this is happening, we inject faults into the pool to
 *     verify that self-healing data really works.
 *
 * (4) Every time we open a dataset, we change its checksum and compression
 *     functions.  Thus even individual objects vary from block to block
 *     in which checksum they use and whether they're compressed.
 *
 * (5) To verify that we never lose on-disk consistency after a crash,
 *     we run the entire test in a child of the main process.
 *     At random times, the child self-immolates with a SIGKILL.
 *     This is the software equivalent of pulling the power cord.
 *     The parent then runs the test again, using the existing
 *     storage pool, as many times as desired.
 *
 * (6) To verify that we don't have future leaks or temporal incursions,
 *     many of the functional tests record the transaction group number
 *     as part of their data.  When reading old data, they verify that
 *     the transaction group number is less than the current, open txg.
 *     If you add a new test, please do this if applicable.
 *
 * When run with no arguments, ztest runs for about five minutes and
 * produces no output if successful.  To get a little bit of information,
 * specify -V.  To get more information, specify -VV, and so on.
 *
 * To turn this into an overnight stress test, use -T to specify run time.
 *
 * You can ask for more vdevs [-v], datasets [-d], or threads [-t]
 * to increase the pool capacity, fanout, and overall stress level.
 *
 * The -N(okill) option will suppress kills, so each child runs to completion.
 * This can be useful when you're trying to distinguish temporal incursions
 * from plain old race conditions.
 */

/*
 * Thread-local variables can go here to aid debugging.
 * Note: these aren't static because we want dladdr() to work.
 */

/*
 * Stuff we need to share writably between parent and child.
 */

/*
 * These libumem hooks provide a reasonable set of defaults for the
 * allocator's debugging facilities.
 */
const char *
_umem_debug_init()
{
	return ("default,verbose");	/* $UMEM_DEBUG setting */
}

const char *
_umem_logging_init(void)
{
	return ("fail,contents");	/* $UMEM_LOGGING setting */
}

	/* fragment: size-suffix table and parser from the argument handling */
	const char *ends = "BKMGTPEZ";
	/* ... */
	} else if (end[0] == '.') {
	/* fragment: usage() option summary */
	"\t[-v vdevs (default: %llu)]\n"
	"\t[-s size_of_each_vdev (default: %s)]\n"
	"\t[-a alignment_shift (default: %d) (use 0 for random)]\n"
	"\t[-m mirror_copies (default: %d)]\n"
	"\t[-r raidz_disks (default: %d)]\n"
	"\t[-R raidz_parity (default: %d)]\n"
	"\t[-d datasets (default: %d)]\n"
	"\t[-t threads (default: %d)]\n"
	"\t[-g gang_block_threshold (default: %s)]\n"
	"\t[-i initialize pool i times (default: %d)]\n"
	"\t[-k kill percentage (default: %llu%%)]\n"
	"\t[-p pool_name (default: %s)]\n"
	"\t[-f file directory for vdev files (default: %s)]\n"
	"\t[-V(erbose)] (use multiple times for ever more blather)\n"
	"\t[-E(xisting)] (use existing pool instead of creating new one)\n"
	"\t[-T time] total run time (default: %llu sec)\n"
	"\t[-P passtime] time per pass (default: %llu sec)\n"

	/* By default, test gang blocks for blocks 32K and greater */

	while ((opt = getopt(argc, argv,
	    "v:s:a:m:r:R:d:t:g:i:k:p:f:VET:P:h")) != EOF) {
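/*
 * Size-taking options such as -s and -g accept the suffixes in "BKMGTPEZ",
 * each one step of 2^10.  A minimal sketch of that suffix mapping -- the
 * helper name suffix_to_shift is illustrative, not ztest's actual function:
 */

```c
#include <string.h>

/*
 * Map a size suffix to its power-of-two shift:
 * B=0, K=10, M=20, G=30, T=40, P=50, E=60, Z=70.
 */
static int
suffix_to_shift(char c)
{
	const char *ends = "BKMGTPEZ";
	const char *p = strchr(ends, c);

	return (p == NULL ? -1 : (int)(p - ends) * 10);
}
```

A caller would then compute the value as, e.g., `n << suffix_to_shift('G')`.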
	/* fragment: vdev-tree builder parameters */
	/* ... */ int log, int r, int m, int t)
	{
		/* ... */
		for (c = 0; c < t; c++) {
			/* ... */
		}
	}

	/* ... */
	fatal(0, "dmu_object_set_blocksize('%s', %llu, %d, %d) = %d",
	    /* ... */);

	/* ... */
	(void) printf("replay create of %s object %llu" /* ... */);

	/* fragment: ZIL replay vector */
	NULL,
			/* 0 no such transaction type */

/*
 * Verify that we can't destroy an active pool, create an existing pool,
 * or create a pool with a bad vdev spec.
 */
	/*
	 * Attempt to create using a bad file.
	 */
	/*
	 * Attempt to create using a bad mirror.
	 */
	/*
	 * Attempt to create an existing pool.  It shouldn't matter
	 * what's in the nvroot; we should fail with EEXIST.
	 */

/*
 * Verify that vdev_add() works as expected.
 */
	/*
	 * Make 1/4 of the devices be log devices.
	 */

/*
 * Verify that adding/removing aux devices (l2arc, hot spare) works as expected.
 */
	/*
	 * Pick a random device to remove.
	 */
	/*
	 * Find an unused device we can add.
	 */
	/*
	 * Remove an existing device.  Sometimes, dirty its
	 * vdev state first to make sure we handle removal
	 * of devices that have pending state changes.
	 */

/*
 * Verify that we can attach and detach devices.
 */
	/*
	 * Decide whether to do an attach or a replace.
	 */
	/*
	 * Pick a random top-level vdev.
	 */
	/*
	 * Pick a random leaf within it.
	 */
	/*
	 * If we're already doing an attach or replace, oldvd may be a
	 * mirror vdev -- in which case, pick a random child.
	 */
	/*
	 * If oldvd has siblings, then half of the time, detach it.
	 */
	/*
	 * For the new vdev, choose with equal probability between the two
	 * standard paths (ending in either 'a' or 'b') or a random hot spare.
	 */
	/*
	 * Make newsize a little bigger or smaller than oldsize.
	 * If it's smaller, the attach should fail.
	 * If it's larger, and we're doing a replace,
	 * we should get dynamic LUN growth when we're done.
	 */
	/*
	 * If pvd is not a mirror or root, the attach should fail with ENOTSUP,
	 * unless it's a replace; in that case any non-replacing parent is OK.
	 *
	 * If newvd is already part of the pool, it should fail with EBUSY.
	 *
	 * If newvd is too small, it should fail with EOVERFLOW.
	 */
	/*
	 * Build the nvlist describing newpath.
	 */
	/*
	 * If our parent was the replacing vdev, but the replace completed,
	 * then instead of failing with ENOTSUP we may either succeed,
	 * fail with ENODEV, or fail with EOVERFLOW.
	 *
	 * If someone grew the LUN, the replacement may be too small.
	 */
		/* XXX workaround 6690467 */
		fatal(0, "attach (%s %llu, %s %llu, %d) "
		    "returned %d, expected %d", /* ... */);
/*
 * Callback function which expands the physical size of the vdev.
 */
	(void) printf("%s grew from %lu to %lu bytes\n", /* ... */);

/*
 * Callback function which expands a given vdev by calling vdev_online().
 */
	/* Calling vdev_online will initialize the new metaslabs */
	/*
	 * Since we dropped the lock we need to ensure that we're
	 * still talking to the original vdev.  It's possible this
	 * [...]
	 */
	(void) printf("vdev %p has disappeared, was " /* ... */);

/*
 * Traverse the vdev tree calling the supplied function.
 * We continue to walk the tree until we either have walked all
 * children or we receive a non-NULL return from the callback.
 * If a NULL callback is passed, then we just return back the first
 * leaf vdev we encounter.
 */

/*
 * Verify that dynamic LUN growth works as expected.
 */
	/*
	 * Determine the size of the first leaf vdev associated with
	 * [...]
	 */
	/*
	 * We only try to expand the vdev if it's less than 4x its
	 * original size and it has a valid psize.
	 */
	(void) printf("Expanding vdev %s from %lu to %lu\n", /* ... */);
	/*
	 * Growing the vdev is a two step process:
	 *	1). expand the physical size (i.e. relabel)
	 *	2). online the vdev to create the new metaslabs
	 */
	(void) printf("Could not expand LUN because "
	    "some vdevs were not healthy\n");
	/*
	 * Expanding the LUN will update the config asynchronously,
	 * thus we must wait for the async thread to complete any
	 * pending tasks before proceeding.
	 */
	/*
	 * Make sure we were able to grow the pool.
	 */
	(void) printf("Top-level vdev metaslab count: "
	    "before %llu, after %llu\n", /* ... */);
	fatal(0, "LUN expansion failed: before %llu, " /* ... */);
	(void) printf("%s grew from %s to %s\n", /* ... */);
/*
 * Create the directory object.
 */
/*
 * Verify that the dataset contains a directory object.
 */
	/* We could have crashed in the middle of destroying it */

/*
 * Verify that dmu_objset_{create,destroy,open,close} work as expected.
 */
	/*
	 * If this dataset exists from a previous run, process its replay log
	 * half of the time.  If we don't replay it, then dmu_objset_destroy()
	 * (invoked from ztest_destroy_cb() below) should just throw it away.
	 */
	/*
	 * There may be an old instance of the dataset we're about to
	 * create lying around from a previous run.  If so, destroy it
	 * and all of its snapshots.
	 */
	/*
	 * Verify that the destroyed dataset is no longer in the namespace.
	 */
	fatal(1, "dmu_objset_open(%s) found destroyed dataset %p",
	    /* ... */);
	/*
	 * Verify that we can create a new dataset.
	 */
	/*
	 * Open the intent log for it.
	 */
	/*
	 * Put a random number of objects in there.
	 */
	/*
	 * Verify that we cannot create an existing dataset.
	 */
	fatal(0, "created existing dataset, error = %d", error);
	/*
	 * Verify that we can hold an objset that is also owned.
	 */
	/*
	 * Verify that we can not own an objset that is already owned.
	 */
	fatal(0, "dmu_objset_open('%s') = %d, expected EBUSY", /* ... */);

/*
 * Verify that dmu_snapshot_{create,destroy,open,close} work as expected.
 */
/*
 * Cleanup non-standard snapshots and clones.
 */
/*
 * Verify dsl_dataset_promote handles EBUSY.
 */

/*
 * Verify that dmu_object_{alloc,free} work as expected.
 */
	/*
	 * Create a batch object if necessary, and record it in the directory.
	 */
	/*
	 * Destroy the previous batch of objects.
	 */
	/*
	 * Read and validate contents.
	 * We expect the nth byte of the bonus buffer to be n.
	 */
	fatal(0, "bad bonus: %s, obj %llu, off %d: %u != %u", /* ... */);
	/*
	 * We expect the word at endoff to be our object number.
	 */
	fatal(0, "bad data in %s, got %llu, expected %llu", /* ... */);
	/*
	 * Destroy old object and clear batch entry.
	 */
	fatal(0, "dmu_object_free('%s', %llu) = %d", /* ... */);
	/*
	 * Before creating the new batch of objects, generate a bunch of churn.
	 */
	fatal(0, "dmu_object_free('%s', %llu) = %d", /* ... */);
	/*
	 * Create a new batch of objects with randomly chosen
	 * blocksizes and record them in the batch directory.
	 * Write to both the bonus buffer and the regular data.
	 * See comments above regarding the contents of
	 * the bonus buffer and the word at endoff.
	 */
	/*
	 * Write to a large offset to increase indirection.
	 */

/*
 * Verify that dmu_{read,write} work as expected.
 *
 * This test uses two objects, packobj and bigobj, that are always
 * updated together (i.e. in the same tx) so that their contents are
 * in sync and can be compared.  Their contents relate to each other
 * in a simple way: packobj is a dense array of 'bufwad' structures,
 * while bigobj is a sparse array of the same bufwads.  Specifically,
 * for any index n, there are three bufwads that should be identical:
 *
 *	packobj, at offset n * sizeof (bufwad_t)
 *	bigobj, at the head of the nth chunk
 *	bigobj, at the tail of the nth chunk
 *
 * The chunk size is arbitrary.  It doesn't have to be a power of two,
 * and it doesn't have any relation to the object blocksize.
 * The only requirement is that it can hold at least two bufwads.
 *
 * Normally, we write the bufwad to each of these locations.
 * However, free_percent of the time we instead write zeroes to
 * packobj and perform a dmu_free_range() on bigobj.  By comparing
 * bigobj to packobj, we can verify that the DMU is correctly
 * tracking which parts of an object are allocated and free,
 * and that the contents of the allocated blocks are correct.
 */
	/*
	 * Read the directory info.  If it's the first time, set things up.
	 */
	/*
	 * Prefetch a random chunk of the big object.
	 * Our aim here is to get some async reads in flight
	 * for blocks that we may free below; the DMU should
	 * handle this race correctly.
	 */
	/*
	 * Pick a random index and compute the offsets into packobj and bigobj.
	 */
	/*
	 * free_percent of the time, free a range of bigobj rather than
	 * overwriting it.
	 */
	/*
	 * Read the current contents of our objects.
	 */
	/*
	 * Get a tx for the mods to both packobj and bigobj.
	 */
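/*
 * The packobj/bigobj relationship described above can be modeled in a few
 * lines: for index n, one bufwad lives at packobj offset n * sizeof
 * (bufwad_t), and identical copies sit at the head and tail of bigobj's
 * nth chunk.  A toy model over in-memory byte arrays -- the bufwad layout
 * and chunk size here are stand-ins, not ztest's real structures:
 */

```c
#include <stdint.h>
#include <string.h>

typedef struct bufwad {
	uint64_t bw_index;
	uint64_t bw_txg;
	uint64_t bw_data;
} bufwad_t;

enum { CHUNKSIZE = 3 * sizeof (bufwad_t) };	/* must hold >= 2 bufwads */

/* Write the same bufwad to all three locations for index n. */
static void
write_index(uint8_t *packbuf, uint8_t *bigbuf, uint64_t n, const bufwad_t *w)
{
	memcpy(packbuf + n * sizeof (bufwad_t), w, sizeof (*w));
	memcpy(bigbuf + n * CHUNKSIZE, w, sizeof (*w));		/* head */
	memcpy(bigbuf + (n + 1) * CHUNKSIZE - sizeof (*w), w,
	    sizeof (*w));					/* tail */
}

/* Verify that all three copies for index n are identical. */
static int
check_index(const uint8_t *packbuf, const uint8_t *bigbuf, uint64_t n)
{
	const uint8_t *pack = packbuf + n * sizeof (bufwad_t);
	const uint8_t *head = bigbuf + n * CHUNKSIZE;
	const uint8_t *tail = bigbuf + (n + 1) * CHUNKSIZE - sizeof (bufwad_t);

	return (memcmp(pack, head, sizeof (bufwad_t)) == 0 &&
	    memcmp(pack, tail, sizeof (bufwad_t)) == 0);
}

/* Tiny demo: write index 2, then verify the invariant holds. */
static int
demo(void)
{
	uint8_t packbuf[4 * sizeof (bufwad_t)];
	uint8_t bigbuf[4 * CHUNKSIZE];
	bufwad_t w = { 2, 7, 42 };

	memset(packbuf, 0, sizeof (packbuf));
	memset(bigbuf, 0, sizeof (bigbuf));
	write_index(packbuf, bigbuf, 2, &w);
	return (check_index(packbuf, bigbuf, 2));
}
```

In ztest the same three-way comparison is what lets a dense packobj read
stand in for a full scan of the sparse bigobj.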
	/*
	 * For each index from n to n + s, verify that the existing bufwad
	 * in packobj matches the bufwads at the head and tail of the
	 * corresponding chunk in bigobj.  Then update all three bufwads
	 * with the new values we want to write out.
	 */
	for (i = 0; i < s; i++) {
		/* ... */
		fatal(0, "future leak: got %llx, open txg is %llx",
		    /* ... */);
		fatal(0, "wrong index: got %llx, wanted %llx+%llx",
		    /* ... */);
	}

	/*
	 * We've verified all the old bufwads, and made new ones.
	 * [...]
	 */
	(void) printf("freeing offset %llx size %llx" /* ... */);
	(void) printf("writing offset %llx size %llx" /* ... */);

	/*
	 * Sanity check the stuff we just wrote.
	 */
	/*
	 * For each index from n to n + s, verify that the existing bufwad
	 * in packobj matches the bufwads at the head and tail of the
	 * corresponding chunk in bigobj.  Then update all three bufwads
	 * with the new values we want to write out.
	 */
	for (i = 0; i < s; i++) {
		/* ... */
		fatal(0, "future leak: got %llx, open txg is %llx",
		    /* ... */);
		fatal(0, "wrong index: got %llx, wanted %llx+%llx",
		    /* ... */);
	}

/*
 * This test uses two objects, packobj and bigobj, that are always
 * updated together (i.e. in the same tx) so that their contents are
 * in sync and can be compared.  Their contents relate to each other
 * in a simple way: packobj is a dense array of 'bufwad' structures,
 * while bigobj is a sparse array of the same bufwads.  Specifically,
 * for any index n, there are three bufwads that should be identical:
 *
 *	packobj, at offset n * sizeof (bufwad_t)
 *	bigobj, at the head of the nth chunk
 *	bigobj, at the tail of the nth chunk
 *
 * The chunk size is set equal to bigobj block size so that
 * dmu_assign_arcbuf() can be tested for object updates.
 */
	/*
	 * Read the directory info.  If it's the first time, set things up.
	 */
	/*
	 * Pick a random index and compute the offsets into packobj and bigobj.
	 */
	/*
	 * Iteration 0 test zcopy for DB_UNCACHED dbufs.
	 * Iteration 1 test zcopy to already referenced dbufs.
	 * Iteration 2 test zcopy to dirty dbuf in the same txg.
	 * Iteration 3 test zcopy to dbuf dirty in previous txg.
	 * Iteration 4 test zcopy when dbuf is no longer dirty.
	 * Iteration 5 test zcopy when it can't be done.
	 * Iteration 6 one more zcopy write.
	 */
	for (i = 0; i < 7; i++) {
		/*
		 * In iteration 5 (i == 5) use arcbufs
		 * that don't match bigobj blksz to test
		 * dmu_assign_arcbuf() when it can't directly
		 * assign an arcbuf to a dbuf.
		 */
		for (j = 0; j < s; j++) {
			/* ... */
		}
		/*
		 * Get a tx for the mods to both packobj and bigobj.
		 */
		for (j = 0; j < s; j++) {
			/* ... */
		}
		/*
		 * 50% of the time don't read objects in the 1st iteration to
		 * test dmu_assign_arcbuf() for the case when there're no
		 * existing dbufs for the specified offsets.
		 */
		/*
		 * We've verified all the old bufwads, and made new ones.
		 */
		(void) printf("writing offset %llx size %llx" /* ... */);
		/*
		 * Sanity check the stuff we just wrote.
		 */
	}

/*
 * Make sure that, if there is a write record in the bonus buffer
 * of the ZTEST_DIROBJ, the txg for this record is <= the
 * last synced txg of the pool.
 */

/*
 * Have multiple threads write to large offsets in ZTEST_DIROBJ
 * to verify that having multiple threads writing to the same object
 * in parallel doesn't cause any trouble.
 */
	/*
	 * Do the bonus buffer instead of a regular block.
	 */
	/*
	 * We need a lock to serialize resize vs. others,
	 * so we hash on the objset ID.
	 */
	/*
	 * Occasionally, write an all-zero block to test the behavior
	 * of blocks that compress into holes.
	 */
	(void) poll(NULL, 0, 1);
	/* open dn_notxholds window */

	/*
	 * dmu_sync() the block we just wrote.
	 */
	/*
	 * Read the block that dmu_sync() returned to make sure its contents
	 * match what we wrote.  We do this while still txg_suspend()ed
	 * to ensure that the block can't be reused before we read it.
	 *
	 * The semantic of dmu_sync() is that we always push the most recent
	 * version of the data, so in the face of concurrent updates we may
	 * see a newer version of the block.  That's OK.
	 */

/*
 * Verify that zap_{create,destroy,add,remove,update} work as expected.
 */
	/*
	 * Create a new object if necessary, and record it in the directory.
	 */
	fatal(0, "zap_create('%s', %llu) = %d", /* ... */);
	/*
	 * Generate a known hash collision, and verify that
	 * we can lookup and remove both entries.
	 */
	for (i = 0; i < 2; i++) {
		/* ... */
	}
	for (i = 0; i < 2; i++) {
		/* ... */
	}
	for (i = 0; i < 2; i++) {
		/* ... */
	}
	/*
	 * If these zap entries already exist, validate their contents.
	 */
	for (i = 0; i < ints; i++) {
		/* ... */
	}
	/*
	 * Atomically update two entries in our zap object.
	 * The first is named txg_%llu, and contains the txg
	 * in which the property was last updated.  The second
	 * is named prop_%llu, and the nth element of its value
	 * should be txg + object + n.
	 */
	for (i = 0; i < ints; i++)
		/* ... */;
	fatal(0, "zap_update('%s', %llu, '%s') = %d", /* ... */);
	fatal(0, "zap_update('%s', %llu, '%s') = %d", /* ... */);
	/*
	 * Remove a random pair of entries.
	 */
	fatal(0, "zap_remove('%s', %llu, '%s') = %d", /* ... */);
	fatal(0, "zap_remove('%s', %llu, '%s') = %d", /* ... */);
	/*
	 * Once in a while, destroy the object.
	 */
	fatal(0, "zap_destroy('%s', %llu) = %d", /* ... */);
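/*
 * The atomic two-entry update above keeps a txg_%llu entry and a prop_%llu
 * entry whose nth element should equal txg + object + n.  The consistency
 * check reduces to the loop below; check_prop is an illustrative helper,
 * not ztest's actual code.
 */

```c
#include <stdint.h>

/*
 * Verify the invariant for a prop_%llu value array: element n must equal
 * txg + object + n, where txg is the value stored in the matching
 * txg_%llu entry.
 */
static int
check_prop(const uint64_t *value, int ints, uint64_t txg, uint64_t object)
{
	int i;

	for (i = 0; i < ints; i++)
		if (value[i] != txg + object + (uint64_t)i)
			return (0);
	return (1);
}
```

Because both entries are updated in the same tx, a reader that sees the new
txg_%llu value must also see prop_%llu values satisfying this check.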
	/*
	 * Generate a random name of the form 'xxx.....' where each
	 * x is a random printable character and the dots are dots.
	 * There are 94 such characters, and the name length goes from
	 * 6 to 20, so there are 94^3 * 15 = 12,458,760 possible names.
	 */
	/*
	 * Select an operation: length, lookup, add, update, remove.
	 */
	fatal(0, "name '%s' != val '%s' len %d", /* ... */);
	for (i = 0; i < 2; i++) {
		/* ... */
	}
	(void) printf("%s %s = %s for '%s'\n", /* ... */);
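/*
 * The comment above pins the name space down exactly: three random
 * printable characters followed by 3 to 17 dots, giving lengths 6 to 20
 * and 94^3 * 15 = 12,458,760 possible names.  A sketch of such a
 * generator -- rand() here stands in for ztest's own random source:
 */

```c
#include <stdlib.h>
#include <string.h>

/*
 * Generate a name of the form 'xxx.....': three random printable
 * characters ('!' through '~', 94 choices) followed by 3 to 17 dots.
 * buf must hold at least 21 bytes (max length 20 plus the NUL).
 */
static void
random_name(char buf[21])
{
	int dots = 3 + rand() % 15;	/* 3..17 dots */
	int i;

	for (i = 0; i < 3; i++)
		buf[i] = '!' + rand() % 94;
	memset(buf + 3, '.', dots);
	buf[3 + dots] = '\0';
}
```

Keeping the space this small guarantees frequent collisions, which is
exactly what a ZAP lookup/add/update/remove stress test wants.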
/*
 * Inject random faults into the on-disk data.
 */
	/*
	 * We need SCL_STATE here because we're going to look at vd0->vdev_tsd.
	 */
	/*
	 * Inject errors on a normal data device.
	 */
	/*
	 * Generate paths to the first leaf in this top-level vdev,
	 * and to the random leaf we selected.  We'll induce transient
	 * [...] and we'll write random garbage to the randomly chosen leaf.
	 */
	/*
	 * Make vd0 explicitly claim to be unreadable,
	 * or unwriteable, or reach behind its back
	 * and close the underlying fd.  We can do this if
	 * maxfaults == 0 because we'll fail and reexecute,
	 * and we can do it if maxfaults >= 2 because we'll
	 * have enough redundancy.  If maxfaults == 1, the
	 * combination of this with injection of random data
	 * corruption below exceeds the pool's fault tolerance.
	 */
	/*
	 * Inject errors on an l2cache device.
	 */
	/*
	 * If we can tolerate two or more faults, randomly online/offline vd0.
	 */
	/*
	 * We have at least single-fault tolerance, so inject data corruption.
	 */
	if (fd == -1)	/* we hit a gap in the device namespace */
		/* ... */;
	(void) printf("injecting bad word into %s," /* ... */);
	fatal(1, "can't inject bad word at 0x%llx in %s", /* ... */);
	(void) poll(NULL, 0, 1000);	/* wait a second, then force a restart */

/*
 * Rename the pool to a different name and then rename it back.
 */
	/*
	 * Try to open it under the old name, which shouldn't exist.
	 */
	/*
	 * Open it under the new name and make sure it's still the same spa_t.
	 */
	/*
	 * Rename it back to the original.
	 */
	/*
	 * Make sure it can still be opened.
	 */

/*
 * Completely obliterate one disk.
 */
	/*
	 * Rename the old device to dev_name.old (useful for debugging).
	 */
	/*
	 * Build the nvlist describing dev_name.
	 */
	fatal(0, "spa_vdev_attach(in-place) = %d", error);
/*
 * Clean up from previous runs.
 */
/*
 * Get the pool's configuration and guid.
 */
/*
 * Import it under the new name.
 */
/*
 * Try to import it again -- should fail with EEXIST.
 */
/*
 * Try to import it under a different name -- should fail with EEXIST.
 */
/*
 * Verify that the pool is no longer visible under the old name.
 */
/*
 * Verify that we can open and close the pool using the new name.
 */

	/*
	 * See if it's time to force a crash.
	 */
	/*
	 * Pick a random function.
	 */
	/*
	 * Decide whether to call it, based on the requested frequency.
	 */
	(void) printf("%6.2f sec in %s\n", /* ... */);
	/*
	 * If we're getting ENOSPC with some regularity, stop.
	 */

/*
 * Kick off threads to run tests on all datasets in parallel.
 */
	/*
	 * Destroy one disk before we even start.
	 * It's mirrored, so everything should work just fine.
	 * This makes us exercise fault handling very early in spa_load().
	 */
	/*
	 * Verify that the sum of the sizes of all blocks in the pool
	 * equals the SPA's allocated space total.
	 */
	/*
	 * Kick off a replacement of the disk we just obliterated.
	 */
	/*
	 * Verify that we can export the pool and reimport it under a
	 * [...] name.
	 */
	/*
	 * Verify that we can loop over all pools.
	 */
	/*
	 * We don't expect the pool to suspend unless maxfaults == 0,
	 * in which case ztest_fault_inject() temporarily takes away
	 * the only valid replica.
	 */
	/*
	 * Create a thread to periodically resume suspended I/O.
	 */
	/*
	 * Verify that we can safely inquire about any object,
	 * whether it's allocated or not.  To make it interesting,
	 * we probe a 5-wide window around each power of two.
	 * This hits all edge cases, including zero and the max.
	 */
	for (t = 0; t < 64; t++) {
		for (d = -5; d <= 5; d++) {
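/*
 * The probe pattern above -- a 5-wide window around each power of two --
 * can be enumerated as below.  Allowing 64-bit wraparound at the edges is
 * one plausible reading of "including zero and the max": (1 << 0) - 1 is 0,
 * and (1 << 0) - 2 wraps to UINT64_MAX.  This is a sketch, not ztest's
 * actual loop body.
 */

```c
#include <stdint.h>

/*
 * Fill 'objs' (which must hold 64 * 11 = 704 entries) with the object
 * numbers probed by a 5-wide window around each power of two:
 * (1 << t) + d for t in [0, 64) and d in [-5, 5].  Returns the count.
 * Unsigned wraparound at the edges is deliberate.
 */
static int
probe_window(uint64_t *objs)
{
	int t, d, n = 0;

	for (t = 0; t < 64; t++)
		for (d = -5; d <= 5; d++)
			objs[n++] = ((uint64_t)1 << t) + (uint64_t)d;
	return (n);
}
```

Enumerating the set first makes it easy to see that both 0 and the 64-bit
maximum land in the probe list.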
	/*
	 * Now kick off all the tests that run in parallel.
	 */
	(void) printf("starting main threads...\n");
	fatal(0, "dmu_objset_create(%s) = %d", /* ... */);
	fatal(0, "dmu_objset_open('%s') = %d", /* ... */);
	/*
	 * If we had out-of-space errors, destroy a random objset.
	 */
	(void) printf("Destroying %s to free up space\n", name);

	/* Cleanup any non-standard clones and snapshots */
	/* Kill the resume thread */

	/*
	 * Right before closing the pool, kick off a bunch of async I/O;
	 * spa_close() should wait for it to complete.
	 */

	/* fragment: elapsed-time formatting */
	"%llud%02lluh%02llum%02llus", d, h, m, s);
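/*
 * The "%llud%02lluh%02llum%02llus" format above renders a duration as
 * days/hours/minutes/seconds.  The decomposition it relies on, as a small
 * sketch -- the helper name nice_time is illustrative, not ztest's:
 */

```c
#include <stdio.h>
#include <string.h>

/* Render a second count as d/h/m/s, e.g. 90061 -> "1d01h01m01s". */
static void
nice_time(unsigned long long t, char *buf, size_t buflen)
{
	unsigned long long s = t % 60;
	unsigned long long m = t / 60 % 60;
	unsigned long long h = t / 3600 % 24;
	unsigned long long d = t / 86400;

	(void) snprintf(buf, buflen, "%llud%02lluh%02llum%02llus",
	    d, h, m, s);
}
```

Only the day count is unpadded, so short runs print compactly while long
overnight runs still line up.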
/*
 * Create a storage pool with the given name and initial vdev size.
 * Then create the specified number of datasets in the pool.
 */
	/*
	 * Create the storage pool.
	 */
	/* Override location of zpool.cache */
	/*
	 * Blow away any existing copy of zpool.cache.
	 */
	(void) printf("%llu vdevs, %d datasets, %d threads," /* ... */);
	/*
	 * Create and initialize our storage pool.
	 */
	(void) printf("ztest_init(), pass %d\n", i);

	/*
	 * Initialize the call targets for each function.
	 */
	/*
	 * Run the tests in a loop.  These tests include fault injection
	 * to verify that self-healing data works, and forced crashes
	 * to verify that we never lose on-disk consistency.
	 */
	/*
	 * Initialize the workload counters for each function.
	 */
	/* Set the allocation switch size */
	if (pid == 0) {
		/* child */
		/* ... */
	}
	/* ... */ "child exited with code %d\n", /* ... */
	/* ... */ "child died with signal %d\n", /* ... */
	(void) printf("Pass %3d, %8s, %3llu ENOSPC, "
	    "%4.1f%% of %5s used, %3.0f%% done, %8s to go\n",
	    /* ... */);

	(void) printf("\nWorkload summary:\n\n");
	(void) printf(/* ... */ "Calls", "Time", "Function");
	(void) printf(/* ... */ "-----", "----", "--------");
	(void) printf("%7llu %9s %s\n", /* ... */);

	/*
	 * It's possible that we killed a child during a rename test, in
	 * which case we'll have a 'ztest_tmp' pool lying around instead
	 * of 'ztest'.  Do a blind rename in case this happened.
	 */
	(void) printf("%d killed, %d completed, %.0f%% kill rate\n",
	    /* ... */);