zfs_vnops.c revision dd6ef5383c0b29543894f993c2ab3ab8ab6e6f20
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 * See the License for the specific language governing permissions
 * and limitations under the License.
 * When distributing Covered Code, include this CDDL HEADER in each
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]

 * Copyright 2006 Sun Microsystems, Inc.  All rights reserved.
 * Use is subject to license terms.

#pragma ident "%Z%%M% %I% %E% SMI"

 * Each vnode op performs some logical unit of work.  To do this, the ZPL must
 * properly lock its in-core state, create a DMU transaction, do the work,
 * record this work in the intent log (ZIL), commit the DMU transaction,
 * and wait for the intent log to commit if it is a synchronous operation.
 * Moreover, the vnode ops must work in both normal and log replay context.
 * The ordering of events is important to avoid deadlocks and references
 * to freed memory.  The example below illustrates the following Big Rules:
 *
 *  (1) A check must be made in each zfs thread for a mounted file system.
 *      This is done in a race-free way using ZFS_ENTER(zfsvfs).
 *      A ZFS_EXIT(zfsvfs) is needed before all returns.
 *
 *  (2) VN_RELE() should always be the last thing except for zil_commit()
 *      and ZFS_EXIT().  This is for 3 reasons:
 *      First, if it's the last reference, the vnode/znode
 *      can be freed, so the zp may point to freed memory.  Second, the last
 *      reference will call zfs_zinactive(), which may induce a lot of work --
 *      pushing cached pages (which acquires range locks) and syncing out
 *      cached atime changes.  Third, zfs_zinactive() may require a new tx,
 *      which could deadlock the system if you were already holding one.
 *
 *  (3) All range locks must be grabbed before calling dmu_tx_assign(),
 *      as they can span dmu_tx_assign() calls.
 *
 *  (4) Always pass zfsvfs->z_assign as the second argument to dmu_tx_assign().
 *      In normal operation, this will be TXG_NOWAIT.  During ZIL replay,
 *      it will be a specific txg.  Either way, dmu_tx_assign() never blocks.
 *      This is critical because we don't want to block while holding locks.
 *      Note, in particular, that if a lock is sometimes acquired before
 *      the tx assigns, and sometimes after (e.g. z_lock), then failing to
 *      use a non-blocking assign can deadlock the system.  The scenario:
 *
 *      Thread A has grabbed a lock before calling dmu_tx_assign().
 *      Thread B is in an already-assigned tx, and blocks for this lock.
 *      Thread A calls dmu_tx_assign(TXG_WAIT) and blocks in txg_wait_open()
 *      forever, because the previous txg can't quiesce until B's tx commits.
 *
 *      If dmu_tx_assign() returns ERESTART and zfsvfs->z_assign is TXG_NOWAIT,
 *      then drop all locks, call txg_wait_open(), and try again.
 *
 *  (5) If the operation succeeded, generate the intent log entry for it
 *      before dropping locks.  This ensures that the ordering of events
 *      in the intent log matches the order in which they actually occurred.
 *
 *  (6) At the end of each vnode op, the DMU tx must always commit,
 *      regardless of whether there were any errors.
 *
 *  (7) After dropping all locks, invoke zil_commit(zilog, seq, ioflag)
 *      to ensure that synchronous semantics are provided when necessary.
 *
 * In general, this is how things should be ordered in each vnode op:
 *
 *      ZFS_ENTER(zfsvfs);              // exit if unmounted
 * top:
 *      zfs_dirent_lock(&dl, ...)       // lock directory entry (may VN_HOLD())
 *      rw_enter(...);                  // grab any other locks you need
 *      tx = dmu_tx_create(...);        // get DMU tx
 *      dmu_tx_hold_*();                // hold each object you might modify
 *      error = dmu_tx_assign(tx, zfsvfs->z_assign);    // try to assign
 *      if (error) {
 *              dmu_tx_abort(tx);               // abort DMU tx
 *              rw_exit(...);                   // drop locks
 *              zfs_dirent_unlock(dl);          // unlock directory entry
 *              VN_RELE(...);                   // release held vnodes
 *              if (error == ERESTART && zfsvfs->z_assign == TXG_NOWAIT) {
 *                      txg_wait_open(dmu_objset_pool(os), 0);
 *                      goto top;
 *              }
 *              ZFS_EXIT(zfsvfs);               // finished in zfs
 *              return (error);                 // really out of space
 *      }
 *      error = do_real_work();         // do whatever this VOP does
 *      if (error == 0)
 *              seq = zfs_log_*(...);   // on success, make ZIL entry
 *      dmu_tx_commit(tx);              // commit DMU tx -- error or not
 *      rw_exit(...);                   // drop locks
 *      zfs_dirent_unlock(dl);          // unlock directory entry
 *      VN_RELE(...);                   // release held vnodes
 *      zil_commit(zilog, seq, ioflag); // synchronous when necessary
 *      ZFS_EXIT(zfsvfs);               // finished in zfs
 *      return (error);                 // done, report error

 * Clean up any locks held by this process
on the vp. * Lseek support for finding holes (cmd == _FIO_SEEK_HOLE) and * data (cmd == _FIO_SEEK_DATA). "off" is an in/out parameter. * Handle the virtual hole at the end of file. * The following two ioctls are used by bfu. Faking out, * necessary to avoid bfu errors. /* offset parameter is in/out */ * When a file is memory mapped, we must keep the IO data synchronized * between the DMU cache and the memory mapped pages. What this means: * On Write: If we find a memory mapped page, we write to *both* * the page and the dmu buffer. * NOTE: We will always "break up" the IO into PAGESIZE uiomoves when * the file is memory mapped. * We don't want a new page to "appear" in the middle of * the file update (because it may not get the write * update data), so we grab a lock to block * When a file is memory mapped, we must keep the IO data synchronized * between the DMU cache and the memory mapped pages. What this means: * On Read: We "read" preferentially from memory mapped pages, * else we default from the dmu buffer. * NOTE: We will always "break up" the IO into PAGESIZE uiomoves when * the file is memory mapped. /* XXX use dmu_read here? */ * Read bytes from specified file into supplied buffer. * IN: vp - vnode of file to be read from. * uio - structure supplying read location, range info, * ioflag - SYNC flags; used to provide FRSYNC semantics. * cr - credentials of caller. * OUT: uio - updated offset and range, buffer filled. * vp - atime updated if byte count > 0 * Check for mandatory locks * If we're in FRSYNC mode, sync out this znode before reading it. * Lock the range against changes. * If we are reading past end-of-file we can skip * to the end; but we might still need to set atime. * Compute the adjustment to align the dmu buffers * XXX -- this is correct, but may be suboptimal. * If the pages are all clean, we don't need to * go through mappedread(). Maybe the VMODSORT * stuff can help us here. 
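The notes above say that I/O on a memory-mapped file is always broken up into PAGESIZE uiomoves. A minimal user-space sketch of that splitting follows. It is not the kernel code: `SKETCH_PAGESIZE`, `sketch_chunk_len()` and `sketch_count_chunks()` are hypothetical names, and a 4096-byte page is an assumption made only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: a stand-in for the kernel's PAGESIZE; must be a power of two. */
#define	SKETCH_PAGESIZE	4096u

/*
 * Hypothetical helper: number of bytes to move in the next step so that
 * no single step crosses a page boundary.
 */
static size_t
sketch_chunk_len(uint64_t off, size_t remaining)
{
	size_t in_page = SKETCH_PAGESIZE -
	    (size_t)(off & (SKETCH_PAGESIZE - 1));

	return (remaining < in_page ? remaining : in_page);
}

/*
 * Hypothetical driver: walk [off, off + len) one page-bounded chunk at a
 * time and report how many steps were needed.  A real implementation
 * would copy data at the marked point.
 */
static unsigned
sketch_count_chunks(uint64_t off, size_t len)
{
	unsigned steps = 0;

	while (len > 0) {
		size_t bytes = sketch_chunk_len(off, len);

		/* a page-sized uiomove would happen here */
		off += bytes;
		len -= bytes;
		steps++;
	}
	return (steps);
}
```

With these assumptions, a 10000-byte transfer starting at offset 100 takes three steps: the rest of the first page, one full page, and the remainder.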
 * Fault in the pages of the first n bytes specified by the uio structure.
 * 1 byte in each page is touched and the uio struct is unmodified.
 * Any error will exit this routine as this is only a best
 * attempt to get the pages resident.  This is a copy of ufs_trans_touch().

 * touch each page in this segment.

 * touch the last byte in case it straddles a page.

 * Write the bytes to a file.
 *
 *      IN:     vp      - vnode of file to be written to.
 *              uio     - structure supplying write location, range info,
 *              ioflag  - FAPPEND flag set if in append mode.
 *              cr      - credentials of caller.
 *
 *      OUT:    uio     - updated offset and range.
 *
 *      vp - ctime|mtime updated if byte count > 0

 * Pre-fault the initial pages to ensure slow (e.g. NFS) pages

 * If in append mode, set the io offset pointer to eof.

 * Range lock for a file append:
 * The value for the start of range will be determined by
 * zfs_range_lock() (to guarantee append semantics).
 * If this write will cause the block size to increase,
 * zfs_range_lock() will lock the entire file, so we must
 * later reduce the range after we grow the block size.

/* overlocked, zp_size can't change */

 * If we need to grow the block size then zfs_range_lock()
 * will lock a wider range than we request here.
 * Later after growing the block size we reduce the range.

 * Check for mandatory locks

 * If zfs_range_lock() over-locked we grow the blocksize
 * and then reduce the lock range.

 * The file data does not fit in the znode "cache", so we
 * will be writing to the file block data buffers.
 * Each buffer will be written in a separate transaction;
 * this keeps the intent log records small and allows us
 * to do more fine-grained space accounting.

 * XXX - should we really limit each write to z_max_blksz?
 * Perhaps we should use SPA_MAXBLOCKSIZE chunks?

/* XXX - do we need to "clean up" the dmu buffer? */

 * privileged and at least one of the execute bits is set.
 * It would be nice to do this after all writes have
 * been done, but that would still expose the ISUID/ISGID
 * to another app after the partial write is committed.

 * We have more work ahead of us, so wrap up this transaction
 * and start another.  Exact same logic as tx_done below.

/* Pre-fault the next set of pages */

 * Start another transaction.

 * Update the file size if it has changed; account
 * for possible concurrent updates.

 * If we're in replay mode, or we made no progress, return error.
 * Otherwise, it's at least a partial write, so it's successful.

 * Get data to generate a TX_WRITE intent log record.

 * Nothing to do if the file has been removed

 * Write records come in two flavors: immediate and indirect.
 * For small writes it's cheaper to store the data with the
 * log record (immediate); for large writes it's cheaper to
 * sync the data and get a pointer to it (indirect) so that
 * we don't have to write the data twice.

if (buf != NULL) {
        /* immediate write */
        /* test for truncation needs to be done while range locked */
} else {
        /* indirect write */
         * Have to lock the whole block to ensure when it's
         * written out and its checksum is being calculated
         * that no one can change the data.  We need to re-check
         * blocksize after we get the lock in case it's changed!
        /* test for truncation needs to be done while range locked */
}

 * Lookup an entry in a directory, or an extended attribute directory.
 * If it exists, return a held vnode reference for it.
 *
 *      IN:     dvp     - vnode of directory to search.
 *              nm      - name of entry to lookup.
 *              pnp     - full pathname to lookup [UNUSED].
 *              flags   - LOOKUP_XATTR set if looking for an attribute.
 *              rdir    - root directory vnode [UNUSED].
 *              cr      - credentials of caller.
 *
 *      OUT:    vpp     - vnode of located entry, NULL if not found.

 * We don't allow recursive attributes.

 * Do we have permission to get into attribute directory?

 * Check accessibility of directory.

 * Convert device special files

 * Attempt to create a new entry in a directory.  If the entry
 * already exists, truncate the file if permissible, else return
 * an error.  Return the vp of the created or trunc'd file.
 *
 *      IN:     dvp     - vnode of directory to put new file entry in.
 *              name    - name of new file entry.
 *              vap     - attributes of new file.
 *              excl    - flag indicating exclusive or non-exclusive mode.
 *              mode    - mode to open file with.
 *              cr      - credentials of caller.
 *              flag    - large file flag [UNUSED].
 *
 *      OUT:    vpp     - vnode of created or trunc'd entry.
 *
 *      dvp - ctime|mtime updated if new entry created
 *      vp - ctime|mtime always, atime if new

 * Null component name refers to the directory itself.

/* possible VN_HOLD(zp) */

 * Create a new file object and update the directory

 * We only support the creation of regular files in
 * extended attribute directories.

 * A directory entry already exists for this name.

 * Can't truncate an existing file if in exclusive mode.

 * Can't open a directory for writing.

 * Verify requested access to file.

 * Truncate regular files if requested.

 * Need to update dzp->z_seq?
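The create comments above list a sequence of checks applied when the directory entry already exists: exclusive mode, directory opened for writing, access verification, then optional truncation. The sketch below only illustrates that ordering. Every name in it is hypothetical, and the specific errno values (`EEXIST`, `EISDIR`, `EACCES`) are assumptions chosen as the conventional ones, since the comments do not name them.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical summary of the entry that was found in the directory. */
struct sketch_existing {
	bool	is_dir;		/* entry is a directory */
	bool	is_regular;	/* entry is a regular file */
};

/*
 * Hypothetical decision function for "entry already exists".
 * Returns 0 or an errno value, and sets *do_truncate when the caller
 * should go on to truncate the file.
 */
static int
sketch_create_existing(const struct sketch_existing *e, bool exclusive,
    bool want_write, bool access_ok, bool truncate_requested,
    bool *do_truncate)
{
	*do_truncate = false;

	/* Can't truncate an existing file if in exclusive mode. */
	if (exclusive)
		return (EEXIST);

	/* Can't open a directory for writing. */
	if (e->is_dir && want_write)
		return (EISDIR);

	/* Verify requested access to file (result supplied by caller). */
	if (!access_ok)
		return (EACCES);

	/* Truncate regular files if requested. */
	if (e->is_regular && truncate_requested)
		*do_truncate = true;

	return (0);
}
```

The order matters in this sketch: the exclusive-mode check comes first, so an exclusive create of an existing name fails before any access check or truncation is considered.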
/* Lock the whole range of the file */ * If vnode is for a device return a specfs vnode instead. * Remove an entry from a directory. * IN: dvp - vnode of directory to remove entry from. * name - name of entry to remove. * cr - credentials of caller. * vp - ctime (if nlink > 0) * Attempt to lock directory; fail if entry doesn't exist. * Need to use rmdir for removing directories. * We may delete the znode now, or we may put it on the delete queue; * it depends on whether we're the last link, and on whether there are * other holds on the vnode. So we dmu_tx_hold() the right things to /* are there any extended attributes? */ * XXX - There is a possibility that the delete * of the parent file could succeed, but then we get * an ENOSPC when we try to delete the xattrs... * so we would need to re-try the deletes periodically /* XXX - do we need this if we are deleting? */ /* are there any additional acls */ /* charge as an update -- would be nice not to charge at all */ * Remove the directory entry. /* this rele delayed to prevent nesting transactions */ * Create a new directory and insert it into dvp using the name * provided. Return a pointer to the inserted directory. * IN: dvp - vnode of directory to add subdir to. * dirname - name of new directory. * vap - attributes of new directory. * cr - credentials of caller. * OUT: vpp - vnode of created directory. * dvp - ctime|mtime updated * vp - ctime|mtime|atime updated * First make sure the new directory doesn't exist. * Add a new entry to the directory. * Now put new name in parent dir. * Remove a directory subdir entry. If the current working * directory is the same as the subdir to be removed, the * IN: dvp - vnode of directory to remove from. * name - name of directory to be removed. * cwd - vnode of current working directory. * cr - credentials of caller. * dvp - ctime|mtime updated * Attempt to lock directory; fail if entry doesn't exist. 
 * Grab a lock on the parent pointer to make sure we play well
 * with the treewalk and directory rename code.

 * Read as many directory entries as will fit into the provided
 * buffer from the given directory cursor position (specified in
 *
 *      IN:     vp      - vnode of directory to read.
 *              uio     - structure supplying read location, range info,
 *              cr      - credentials of caller.
 *
 *      OUT:    uio     - updated offset and range, buffer filled.
 *              eofp    - set to true if end-of-file detected.
 *
 * Note that the low 4 bits of the cookie returned by zap are always zero.
 * This allows us to use the low range for "special" directory entries:
 * We use 0 for '.', and 1 for '..'.  If this is the root of the filesystem,
 * we use the offset 2 for the '.zfs' directory.

 * If we are not given an eof variable,

 * Check for valid iov_len.

 * Quit if directory has been removed (posix)

 * Initialize the iterator cursor.

 * Start iteration from the beginning of the directory.

 * The offset is a serialized cursor.

 * Get space to change directory entries into fs independent format.

 * Transform to file-system independent format

 * Special case `.', `..', and `.zfs'.

"entry, obj = %lld, offset = %lld\n",
* Will this entry fit in the buffer? * Did we manage to fit anything in the buffer? /* NOTE: d_off is the offset for the *next* entry */ * Move to the next entry, fill in the previous offset. * Regardless of whether this is required for standards conformance, * this is the logical behavior when fsync() is called on a file with * dirty pages. We use B_ASYNC since the ZIL transactions are already * going to be pushed out as part of the zil_commit(). * Get the requested file attributes and place them in the provided * IN: vp - vnode of file. * vap - va_mask identifies requested attributes. * cr - credentials of caller. * OUT: vap - attribute values. * RETURN: 0 (always succeeds) * Return all attributes. It's cheaper to provide the answer * than to determine whether we were asked the question. * If ACL is trivial don't bother looking for ACE_READ_ATTRIBUTES. * Also, if we are the owner don't bother, since owner should * always be allowed to read basic attributes of file. * Block size hasn't been set; suggest maximal I/O transfers. * Set the file attributes to the values contained in the * IN: vp - vnode of file to be modified. * vap - new attribute values. * flags - ATTR_UTIME set if non-default time values provided. * cr - credentials of caller. * vp - ctime updated, mtime updated if size changed. * First validate permissions * NOTE: even if a new mode is being set, * Take ownership or chgrp to group we are a member of * If both AT_UID and AT_GID are set then take_owner and * take_group must both be set in order to allow taking * Otherwise, send the check through secpolicy_vnode_setattr() * If trim_mask is set then take ownership * has been granted. In that case remove * UID|GID from mask so that * secpolicy_vnode_setattr() doesn't revoke it. * secpolicy_vnode_setattr, or take ownership may have * Range lock the entire file, to ensure the truncate /* we will rewrite this block if we grow */ * Set each attribute requested. 
* We group settings according to the locks they need to acquire. * Note: you cannot set ctime directly, although it will be * updated as a side-effect of calling this function. * XXX - Note, we are not providing any open * mode flags here (like FNDELAY), so we may * block if there are locks present... this * should be addressed in openat(). * Search back through the directory tree, using the ".." entries. * Lock each directory in the chain to prevent concurrent renames. * Fail any attempt to move a directory into one of its own descendants. * XXX - z_parent_lock can overlap with map or grow locks * First pass write-locks szp and compares to zp->z_id. * Later passes read-lock zp and compare to zp->z_parent. if (*
oidp == szp->z_id)
/* We're a descendant of szp */ * Drop locks and release vnodes that were held by zfs_rename_lock(). * Move an entry from the provided source directory to the target * directory. Change the entry name as indicated. * IN: sdvp - Source directory containing the "old entry". * tdvp - Target directory to contain the "new entry". * cr - credentials of caller. * sdvp,tdvp - ctime|mtime updated * Make sure we have the real vp for the target directory. * This is to prevent the creation of links into attribute space * by renaming a linked file into/outof an attribute directory. * See the comment in zfs_link() for why this is considered bad. * Lock source and target directory entries. To prevent deadlock, * a lock ordering must be defined. We lock the directory with * the smallest object id first, or if it's a tie, the one with * the lexically first name. * POSIX: "If the old argument and the new argument * both refer to links to the same existing file, * the rename() function shall return successfully * and perform no other action." * Source entry invalid or not there. * Must have write access at the source to remove the old entry * and write access at the target to create the new entry. * Note that if target and source are the same, this can be * done in a single check. * Check to make sure rename is valid. * Can't do a move like this: /usr/a/b to /usr/a/b/c/d * Source and target must be the same type. * POSIX dictates that when the source and target * entries refer to the same file object, rename * must do nothing and exit without error. if (
tzp)
/* Attempt to remove the existing target */

 * Insert the indicated symbolic reference entry into the directory.
 *
 *      IN:     dvp     - Directory to contain new symbolic link.
 *              link    - Name for new symlink entry.
 *              vap     - Attributes of new entry.
 *              target  - Target path of new symlink.
 *              cr      - credentials of caller.
 *
 *      dvp - ctime|mtime updated

 * Attempt to lock directory; fail if entry already exists.

 * Create a new object for the symlink.
 * Put the link content into bonus buffer if it will fit;
 * otherwise, store it just like any other file data.

 * Nothing can access the znode yet so no locking needed
 * for growing the znode's blocksize.

 * Insert the new object into the directory.

 * Return, in the buffer contained in the provided uio structure,
 * the symbolic path referred to by vp.
 *
 *      IN:     vp      - vnode of symbolic link.
 *              uio     - structure to contain the link path.
 *              cr      - credentials of caller.
 *
 *      OUT:    uio     - structure to contain the link path.

 * Insert a new entry into directory tdvp referencing svp.
 *
 *      IN:     tdvp    - Directory to contain new entry.
 *              svp     - vnode of new entry.
 *              name    - name of new entry.
 *              cr      - credentials of caller.
 *
 *      tdvp - ctime|mtime updated

 * We do not support links between attributes and non-attributes
 * because of the potential security risk of creating links
 * into "normal" file space in order to circumvent restrictions
 * imposed in attribute space.

 * POSIX dictates that we return EPERM here.
 * Better choices include ENOTSUP or EISDIR.

 * Attempt to lock directory; fail if entry already exists.

 * zfs_null_putapage() is used when the file system has been force
 * unmounted.  It just drops the pages.

 * Can't push pages past end-of-file.

 * Copy the portion of the file indicated from pages into the file.
 * The pages are stored in a page list attached to the file's vnode.
 *
 *      IN:     vp      - vnode of file to push page data to.
 *              off     - position in file to put data.
 *              len     - amount of data to write.
 *              flags   - flags to control the operation.
 *              cr      - credentials of caller.
* vp - ctime|mtime updated * Search the entire vp list for pages >= off. * Found a dirty page to push * Attempt to push any data in the page cache. If this fails * we will get kicked out later in zfs_zinactive(). * Bounds-check the seek operation. * IN: vp - vnode seeking within * noffp - pointer to new file offset * EINVAL if new offset invalid * Pre-filter the generic locking function to trap attempts to place * a mandatory lock on a memory mapped file. * We are following the UFS semantics with respect to mapcnt * here: If we see that the file is mapped already, then we will * return an error, but we don't worry about races between this * function and zfs_map(). * If we can't find a page in the cache, we will create a new page * and fill it with file data. For efficiency, we may try to fill * multiple pages at once (klustering). * If we are only asking for a single page don't bother klustering. * Try to fill a kluster of pages (a blocks worth). /* Only one block in the file. */ * Some other thread entered the page before us. * Return to zfs_getpage to retry the lookup. * Fill the pages in the kluster. /* On error, toss the entire kluster */ * Fill in the page list array from the kluster. If * there are too many pages in the kluster, return * as many pages as possible starting from the desired * NOTE: the page list will always be null terminated. * Return pointers to the pages for the file region [off, off + len] * in the pl array. If plsz is greater than len, this function may * also return page pointers from before or after the specified * region (i.e. some region [off', off' + plsz]). These additional * pages are only returned if they are already in the cache, or were * created as part of a klustered read. * IN: vp - vnode of file to get data from. * off - position in file to get data from. * len - amount of data to retrieve. * plsz - length of provided page list. * seg - segment to obtain pages for. * addr - virtual address of fault. 
 *              rw      - mode of created pages.
 *              cr      - credentials of caller.
 *
 *      OUT:    protp   - protection mode of created pages.
 *              pl      - list of pages created.

/* no faultahead (for now) */

 * Make sure nobody restructures the file in the middle of the getpage.

/* can't fault past EOF */

 * If we already own the lock, then we must be page faulting
 * in the middle of a write to this file (i.e., we are writing
 * to this file using data from a mapped region of the file).

 * Loop through the requested range [off, off + len] looking
 * for pages.  If we don't find a page, we will need to create
 * a new page and fill it with data from the file.

 * klustering may have changed our region

 * Release any pages we have locked.

 * Fill out the page array with any pages already in the cache.

 * Request a memory map for a section of a file.  This code interacts
 * with common code and the VM system as follows:
 *
 *      common code calls mmap(), which ends up in smmap_common()
 *
 *      this calls VOP_MAP(), which takes you into (say) zfs
 *
 *      zfs_map() calls as_map(), passing segvn_create() as the callback
 *
 *      segvn_create() creates the new segment and calls VOP_ADDMAP()
 *
 *      zfs_addmap() updates z_mapcnt

 * If file is locked, disallow mapping.

 * User specified address - blow away any previous mappings

 * The reason we push dirty pages as part of zfs_delmap() is so that we get a
 * more accurate mtime for the associated file.  Since we don't have a way of
 * detecting when the data was actually modified, we have to resort to
 * heuristics.  If an explicit msync() is done, then we mark the mtime when the
 * last page is pushed.  The problem occurs when the msync() call is omitted,
 * which is by far the most common case:
 *
 * If we wait until fsflush to come along, we can have a modification time that
 * is some arbitrary point in the future.  In order to prevent this in the
 * common case, we flush pages whenever a (MAP_SHARED, PROT_WRITE) mapping is

 * Free or allocate space in a file.  Currently, this function only
 * supports the `F_FREESP' command.
However, this command is somewhat * misnamed, as its functionality includes the ability to allocate as * IN: vp - vnode of file to free data in. * cmd - action to take (only F_FREESP supported). * flag - current file open mode flags. * offset - current file offset. * cr - credentials of caller [UNUSED]. * vp - ctime|mtime updated len =
bfp->l_len;
/* 0 means from off to end of file */ * If we will change zp_size (in zfs_freesp) then lock the whole file, * otherwise just lock the range being freed. /* recheck, in case zp_size changed */ /* lost race: file size changed, lock whole file */ * We are increasing the length of the file, * and this may mean a block size increase. * If len == 0, we are truncating the file. /* Must have a non-zero generation number to distinguish from .zfs */ /* XXX - this should be the generation number for the objset */ * If there aren't extended attributes, it's the * same as having zero of them. * Predeclare these here so that the compiler assumes that * this is an "old style" function declaration that does * not include arguments => we won't get type mismatch errors * in the initializations that follow. * Directory vnode operations template * Regular file vnode operations template * Symbolic link vnode operations template * Extended attribute directory vnode operations template * This template is identical to the directory vnodes * operation template except for restricted operations: * Note that there are other restrictions embedded in: * zfs_create() - restrict type to VREG * zfs_link() - no links into/out of attribute space * zfs_rename() - no moves into/out of attribute space * Error vnode operations template
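The ordering rules in the header comment of this file say that a failed non-blocking assign must drop every lock before waiting and retrying, and that the tx always commits once it is assigned. The stub-based sketch below exercises that control flow in user space. Nothing here is the DMU API: `SKETCH_ERESTART`, `SKETCH_TXG_NOWAIT`, `struct sketch_state`, `sketch_tx_assign()` and `sketch_vnode_op()` are all hypothetical stand-ins, and the counters exist only so the behaviour can be checked.

```c
/* Assumptions: illustrative constants, not the kernel's values. */
enum { SKETCH_TXG_NOWAIT = 0, SKETCH_REPLAY_TXG = 7 };
enum { SKETCH_ERESTART = 1 };

/* Hypothetical bookkeeping used in place of real locks and transactions. */
struct sketch_state {
	int	assign_failures_left;	/* times the stub assign fails */
	int	locks_held;		/* locks currently held */
	int	waits;			/* stand-in for txg_wait_open() calls */
	int	commits;		/* stand-in for dmu_tx_commit() calls */
};

/* Stub assign: never blocks, fails a configurable number of times. */
static int
sketch_tx_assign(struct sketch_state *s)
{
	if (s->assign_failures_left > 0) {
		s->assign_failures_left--;
		return (SKETCH_ERESTART);
	}
	return (0);
}

/* One vnode op, following the lock / assign / retry ordering. */
static int
sketch_vnode_op(struct sketch_state *s, int z_assign)
{
	int error;

top:
	s->locks_held++;		/* take locks before assigning */
	error = sketch_tx_assign(s);
	if (error) {
		s->locks_held--;	/* abort path: drop all locks first */
		if (error == SKETCH_ERESTART &&
		    z_assign == SKETCH_TXG_NOWAIT) {
			s->waits++;	/* wait with no locks held */
			goto top;
		}
		return (error);
	}

	/* the real work and the intent log entry would go here */
	s->commits++;			/* commit, error or not */
	s->locks_held--;		/* then drop locks */
	return (0);
}
```

In this sketch the wait only ever happens with `locks_held` back at zero, which is the property rule (4) is protecting. When `z_assign` is not the no-wait value (the replay case), the error is returned to the caller instead of retrying.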