ufs_vnops.c revision 31ceb98b622e1a310256f4c4a1472beb92046db3
/*
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 *
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each file.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 */

/*
 * Copyright 2007 Sun Microsystems, Inc.  All rights reserved.
 * Use is subject to license terms.
 */

/*	Copyright (c) 1983, 1984, 1985, 1986, 1987, 1988, 1989 AT&T	*/
/*	  All Rights Reserved	*/

/*
 * Portions of this source code were derived from Berkeley 4.3 BSD
 * under license from the Regents of the University of California.
 */

#pragma ident	"%Z%%M%	%I%	%E% SMI"

/*
 * For lockfs: ulockfs begin/end is now inlined in the ufs_xxx functions.
 *
 * XXX - ULOCKFS in fs_pathconf and ufs_ioctl is not inlined yet.
 */

/* NOTE: "not blkd" below means that the operation isn't blocked by lockfs */

/*
 * Created by ufs_dumpctl() to store a file's disk block info into memory.
 * Used by ufs_dump() to dump data to disk directly.
 */
	struct inode	*ip;		/* the file we contain */

/*
 * Previously there was no special action required for ordinary files.
 * (Devices are handled through the device file system.)
 * Now we support Large Files and the Large File API places
 * requirements on open.
 */

/*
 * We could take care to prevent data corruption by doing an atomic
 * check of size and truncate if the file is opened with the FTRUNC
 * flag set, but traditionally this is done by the vfs/vnode layers.
 * So taking care of truncation here would be a change in the existing
 * semantics of VOP_OPEN and therefore we chose not to implement
 * anything here.  The check for the size of the file > 2GB is done at
 * the vfs layer in routine vn_open().
 */

/*
 * Push partially filled cluster at last close.
 * ``last close'' is approximated because the dnlc
 * may have a hold on the vnode.
 * Checking for VBAD here will also act as a forced umount check.
 */

/*
 * Mandatory locking needs to be done before ufs_lockfs_begin()
 * and TRANS_BEGIN_SYNC() calls since mandatory locks can sleep.
 * ufs_getattr ends up being called by chklock.
 */

/*
 * In the case that a directory is opened for reading as a file
 * (eg "cat .") with the O_RSYNC, O_SYNC and O_DSYNC flags set,
 * the locking order had to be changed to avoid a deadlock with
 * an update taking place on that directory at the same time.
 */

/*
 * Only transact reads to files opened for sync-read and
 * sync-write on a file system that is not write locked.
 *
 * The ``not write locked'' check prevents problems where, e.g.,
 * logging exists at the beginning of the read but does not
 * at the end.
 */

extern int	ufs_HW;		/* high water mark */
extern int	ufs_LW;		/* low water mark */

/*
 * If the FDSYNC flag is set then ignore the global
 * ufs_allow_shared_writes in this case.
 */

/*
 * Filter to determine if this request is suitable as a
 * concurrent rewrite.  This write must not allocate blocks
 * by extending the file or filling in holes.  No use trying
 * through FSYNC descriptors as the inode will be synchronously
 * updated after the write.  The uio structure has not yet been
 * checked for sanity, so assume nothing.
 */

/*
 * Mandatory locking needs to be done before ufs_lockfs_begin()
 * and TRANS_BEGIN_[A]SYNC() calls since mandatory locks can sleep.
 * Check for forced unmounts normally done in ufs_lockfs_begin().
 * ufs_getattr ends up being called by chklock.
 */

	/* i_rwlock can change in chklock */

/*
 * Check for fast-path special case of directio re-writes.
 */

/*
 * Special treatment of access times for re-writes.
 * If IMOD is not already set, then convert it
 * to IMODACC for this operation.  This defers
 * entering a delta into the log until the inode
 * is flushed.  This mimics what is done for read
 * operations and inode access time.
 */

/*
 * Mandatory locking could have been enabled
 * after dropping the i_rwlock.
 */

/*
 * Amount of log space needed for this write
 */

/*
 * If the write is a rewrite there is no need to open a transaction
 * if the FDSYNC flag is set and not the FSYNC.  In this case just
 * set the IMODACC flag to do the update at a later time, thus
 * avoiding the overhead of the logging transaction.
 */

/*
 * In append mode start at end of file.
 */

/*
 * Mild optimisation, don't call ufs_trans_write() unless we have to.
 * Also, suppress file system full messages if we will retry.
 */

/*
 * Any blocks tied up in pending deletes?
 */

/*
 * Don't cache write blocks to files with the sticky bit set.
 * Used to keep swap files from blowing the page cache on a server.
 */

/*
 * Free behind hacks.  The pager is busted.
 * XXX - need to pass the information down to writedone() in a flag like B_SEQ
 * or B_FREE_IF_TIGHT_ON_MEMORY.
 */
/*
 * While we should, in most cases, cache the pages for write, we
 * may also want to cache the pages for read as long as they are ...
 *
 * If cache_read_ahead = 1, the pages for read will go to the tail
 * of the cache list when they are released, otherwise go to the head.
 */

/*
 * Freebehind exists so that as we read large files sequentially we
 * don't consume most of memory with pages from a few files.  It takes
 * longer to re-read from disk multiple small files than it does reading
 * one large one sequentially.  As system memory grows customers need
 * to retain bigger chunks of files in memory.  The advent of the
 * cachelist opens up the possibility of freeing pages to the head or
 * tail of the cachelist.
 *
 * Not freeing a page is a bet that the page will be read again before
 * its segmap slot is needed for something else.  If we lose the bet,
 * it means some other thread is burdened with the page free we did
 * not do.  If we win we save a free and reclaim.
 *
 * Freeing it at the tail vs the head of cachelist is a bet that the
 * page will survive until the next read.  It's also saying that this
 * page is more likely to be re-used than a page freed some time ago.
 *
 * Freebehind maintains a range of file offsets [smallfile1; smallfile2]:
 *
 *	0 < offset < smallfile1		: pages are not freed.
 *	smallfile1 < offset < smallfile2 : pages freed to tail of cachelist.
 *	smallfile2 < offset		: pages freed to head of cachelist.
 *
 * The range is computed at most once per second and depends on
 * freemem and ncpus_online.  Both parameters are bounded to be
 * >= smallfile && >= smallfile64.
 *
 *	smallfile1 = (free memory / ncpu) / 1000
 *	smallfile2 = (free memory / ncpu) / 10
 *
 * Free Mem (in Bytes)	[smallfile1; smallfile2]  [smallfile1; smallfile2]
 *			ncpus_online = 4	  ncpus_online = 64
 * ------------------	-----------------------	  -----------------------
 * 1G			[256K;  25M]		  [32K;   1.5M]
 * 10G			[2.5M;  250M]		  [156K;  15M]
 * 100G			[25M;   2.5G]		  [1.5M;  150M]
 */

/*
 * wrip does the real work of write requests for ufs.
 */
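The smallfile range computation described above can be sketched in user-space C. The clamp floors and function names below are illustrative assumptions (the comment only says the parameters are bounded by `smallfile`/`smallfile64`), not the kernel's identifiers:

```c
#include <assert.h>
#include <stdint.h>

#define SMALLFILE1_FLOOR	(32 * 1024)		/* assumed lower bound */
#define SMALLFILE2_FLOOR	(1024 * 1024)		/* assumed lower bound */

/*
 * smallfile1 = (free memory / ncpu) / 1000
 * smallfile2 = (free memory / ncpu) / 10
 * both clamped from below, as in the comment above.
 */
static void
compute_smallfile_range(uint64_t freemem_bytes, unsigned ncpus,
    uint64_t *smallfile1, uint64_t *smallfile2)
{
	uint64_t per_cpu = freemem_bytes / ncpus;

	*smallfile1 = per_cpu / 1000;
	*smallfile2 = per_cpu / 10;
	if (*smallfile1 < SMALLFILE1_FLOOR)
		*smallfile1 = SMALLFILE1_FLOOR;
	if (*smallfile2 < SMALLFILE2_FLOOR)
		*smallfile2 = SMALLFILE2_FLOOR;
}

/* Classify an offset per the three freebehind ranges above. */
enum fb_policy { FB_KEEP, FB_FREE_TAIL, FB_FREE_HEAD };

static enum fb_policy
freebehind_policy(uint64_t off, uint64_t smallfile1, uint64_t smallfile2)
{
	if (off < smallfile1)
		return (FB_KEEP);		/* pages are not freed */
	if (off < smallfile2)
		return (FB_FREE_TAIL);		/* freed to tail of cachelist */
	return (FB_FREE_HEAD);			/* freed to head of cachelist */
}
```

With 1 GB free and 4 CPUs this yields roughly the [256K; 25M] row of the table above (268435 and 26843545 bytes with integer division).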
/*
 * ip->i_size is incremented before the uiomove
 * is done on a write.  If the move fails (bad user
 * address) reset ip->i_size.
 * The better way would be to increment ip->i_size
 * only if the uiomove succeeds.
 */

/*
 * check for forced unmount - should not happen as
 * the request passed the lockfs checks.
 */

	/* check for valid filetype */

/*
 * the actual limit of UFS file size:
 * if largefiles are disallowed, the limit is
 * the pre-largefiles value of 2GB
 */

/*
 * If ufs_directio wrote to the file or set the flags,
 * we need to update i_seq, but it may be deferred.
 */

/*
 * o shadow inodes: vfs_dqrwlock is not held at all
 * o quota updates: vfs_dqrwlock is read or write held
 * o other updates: vfs_dqrwlock is read held
 *
 * The first case is the only one where we do not hold
 * vfs_dqrwlock at all while entering wrip().
 * We must not drop vfs_dqrwlock if we
 * have it as writer, i.e. if we are updating the quota inode.
 * There is no potential deadlock scenario in this case as
 * ufs_getpage() takes care of this and avoids reacquiring
 * vfs_dqrwlock in that case.
 *
 * This check is done here since the above conditions do not change
 * and we possibly loop below, so save a few cycles.
 */

/*
 * Large Files: We cast MAXBMASK to offset_t
 * in order to mask out the higher bits.  Since offset_t
 * is a signed value, the high order bit set in MAXBMASK
 * value makes it do the right thing by having all bits 1
 * in the higher word.  May be removed for _SOLARIS64_.
 */

/*
 * since uoff + n >= limit,
 * therefore n >= limit - uoff, and n is an int
 * so it is safe to cast it to an int
 */

/*
 * We are extending the length of the file.
 * bmap is used so that we are sure that
 * if we need to allocate new blocks, that it
 * is done here before we up the file size.
 */

/*
 * bmap_write never drops i_contents so if
 * the flags are set it changed the file.
 */

/*
 * There is a window of vulnerability here.
 * The sequence of operations: allocate file
 * system blocks, uiomove the data into pages,
 * and then update the size of the file in the
 * inode, must happen atomically.  However, due
 * to current locking constraints, this can not
 * always be done.
 */

/*
 * If we are writing from the beginning of
 * the mapping, we can just create the
 * pages without having to read them.
 */

/*
 * Going to do a whole mapping's worth,
 * so we can just create the pages w/o
 * having to read them in.  But before
 * we do that, we need to make sure any
 * needed blocks are allocated first.
 */

/*
 * bmap_write never drops i_contents so if
 * the flags are set it changed the file.
 */

/*
 * check if the newly created page needed the
 * allocation of new disk blocks.
 */

/*
 * In sync mode flush the indirect blocks which
 * may have been allocated and not written on
 * disk.  In the above cases bmap_write will allocate
 * them.
 */

/*
 * At this point we can enter ufs_getpage() in one
 * of two ways:
 * 1) segmap_getmapflt() calls ufs_getpage() when the
 *    forcefault parameter is true (pagecreate == 0)
 * 2) uiomove() causes a page fault.
 *
 * We have to drop the contents lock to prevent the VM
 * system from trying to reacquire it in ufs_getpage()
 * should the uiomove cause a pagefault.
 *
 * We have to drop the reader vfs_dqrwlock here as well.
 */

/*
 * Copy data.  If new pages are created, part of
 * the page that is not written will be initialized
 * later.
 */

/*
 * segmap_pagecreate() returns 1 if it calls
 * page_create_va() to allocate any pages.
 */

/*
 * If "newpage" is set, then a new page was created and it
 * does not contain valid data, so it needs to be initialized.
 * Otherwise the page contains old data, which was overwritten
 * partially or as a whole in uiomove.
 *
 * If there is only one iovec structure within uio, then
 * on error uiomove will not be able to update uio->uio_loffset
 * and we would zero the whole page here!
 *
 * If uiomove fails because of an error, the old valid data
 * is kept instead of filling the rest of the page with zero's.
 */

/*
 * We created pages w/o initializing them completely,
 * thus we need to zero the part that wasn't set up.
 * This happens on most EOF write cases and if
 * we had some sort of error during the uiomove.
 */
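The optimistic size-update pattern described at the top of wrip (bump `i_size` before the uiomove, reset it if the copy faults) can be sketched in user-space C. The struct and helpers below are toy stand-ins, not the kernel's types:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-in for the inode; names are illustrative. */
struct toy_inode {
	uint64_t i_size;
	char data[64];
};

/* Fails with EFAULT when src is NULL, mimicking a bad user address. */
static int
toy_uiomove(struct toy_inode *ip, uint64_t off, const char *src, size_t n)
{
	if (src == NULL)
		return (EFAULT);
	memcpy(ip->data + off, src, n);
	return (0);
}

/*
 * Write n bytes at off: grow i_size first (as wrip does), then copy,
 * and reset i_size to the saved value if the copy failed.
 */
static int
toy_wrip(struct toy_inode *ip, uint64_t off, const char *src, size_t n)
{
	uint64_t old_size = ip->i_size;
	int error;

	if (off + n > ip->i_size)
		ip->i_size = off + n;	/* incremented before the move */
	error = toy_uiomove(ip, off, src, n);
	if (error != 0)
		ip->i_size = old_size;	/* bad user address: reset i_size */
	return (error);
}
```

As the original comment notes, the better design would grow the size only after a successful copy; the sketch just shows the rollback the existing ordering requires.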
/*
 * Unlock the pages allocated by page_create_va()
 */

/*
 * If the size of the file changed, then update the
 * size field in the inode now.  This can't be done
 * before the call to segmap_pageunlock or there is
 * a potential deadlock with callers to ufs_putpage().
 * They will be holding i_contents and trying to lock
 * a page, while this thread is holding a page locked
 * and trying to acquire i_contents.
 */

/*
 * file has grown larger than 2GB.  Set flag
 * in superblock to indicate this, if it
 * is not already set.
 */

/*
 * If we failed on a write, we may have already
 * allocated file blocks as well as pages.  It's
 * hard to undo the block allocation, but we must
 * be sure to invalidate any pages that may have
 * been allocated.
 *
 * If the page was created without initialization
 * then we must check if it should be possible
 * to destroy the new page and to keep the old data
 * on the disk.
 *
 * It is possible to destroy the page without
 * having to write back its contents only when
 * - the size of the file keeps unchanged
 * - bmap_write() did not allocate new disk blocks
 *   it is possible to create big files using "seek" and
 *   write to the end of the file.  A "write" to a
 *   position before the end of the file would not
 *   change the size of the file but it would allocate
 *   new disk blocks.
 * - uiomove intended to overwrite the whole page.
 * - a new page was created (newpage == 1).
 */

	/* unwind what uiomove eventually last did */

/*
 * destroy the page, do not write ambiguous
 * data to the disk.
 */

/*
 * write the page back to the disk, if dirty,
 * and remove the page from the cache.
 */

/*
 * Force write back for synchronous write cases.
 */

/*
 * If the sticky bit is set but the
 * execute bit is not set, we do a
 * synchronous write back and free
 * the page when done.  We set up swap
 * files to be handled this way to
 * prevent servers from keeping around
 * the client's swap pages too long.
 * XXX - there ought to be a better way.
 */

/*
 * Have written a whole block.
 * Start an asynchronous write and
 * mark the buffer to indicate that
 * it won't be needed again soon.
 */
/*
 * If the operation failed and is synchronous,
 * then we need to unwind what uiomove() last
 * did so we can potentially return an error to
 * the caller.  If this write operation was
 * done in two pieces and the first succeeded,
 * then we won't return an error for the second
 * piece that failed.  However, we only want to
 * return a resid value that reflects what was
 * really done.
 *
 * Failures for non-synchronous operations can
 * be ignored since the page subsystem will
 * retry the operation until it succeeds or the
 * file system is unmounted.
 */

/*
 * Re-acquire contents lock.
 * If it was dropped, reacquire reader vfs_dqrwlock as well.
 */

/*
 * If the uiomove() failed or if a synchronous
 * page push failed, fix up i_size.
 */

/*
 * The uiomove failed, and we
 * allocated blocks, so get rid
 * of them.
 */

/*
 * XXX - Can this be out of the loop?
 */

/*
 * Only do one increase of i_seq for multiple
 * pieces.  Because we drop locks, record
 * the fact that we changed the timestamp and
 * are deferring the increase in case another thread
 * pushes our timestamp update.
 */

/*
 * Clear Set-UID & Set-GID bits on
 * successful write if not privileged
 * and at least one of the execute bits
 * is set.  If we always cleared Set-GID,
 * mandatory file and record locking
 * would be lost.
 */

/*
 * In the case the FDSYNC flag is set and this is a
 * "rewrite" we won't log a delta.
 * The FSYNC flag overrides all cases.
 */

/*
 * Make sure i_seq is increased at least once per write
 */

/*
 * Inode is updated according to this table -
 *
 *	--------------------------
 *	always@		IATTCHG|IBDWRITE
 *
 * @ -	If we are doing synchronous write the only time we should
 *	not be sync'ing the ip here is if we have the stickyhack
 *	activated, the file is marked with the sticky bit and
 *	no exec bit, the file length has not been changed and
 *	no new blocks have been allocated during this write.
 *	(we have eliminated nosync)
 */

/*
 * If we've already done a partial-write, terminate
 * the write but return no error unless the error is ENOSPC
 * because the caller can detect this and free resources and
 * try again.
 */

/*
 * rdip does the real work of read requests for ufs.
 */

	/* check for valid filetype */

/*
 * We update smallfile2 and smallfile1 at most every second.
 */

/*
 * At this point we can enter ufs_getpage() in one of two
 * ways:
 * 1) segmap_getmapflt() calls ufs_getpage() when the
 *    forcefault parameter is true (value of 1 is passed)
 * 2) uiomove() causes a page fault.
 *
 * We cannot hold onto an i_contents reader lock without
 * risking deadlock in ufs_getpage() so drop a reader lock.
 * The ufs_getpage() dolock logic already allows for a
 * thread holding i_contents as writer to work properly
 * so we keep a writer lock.
 */

/*
 * If reading sequentially we won't need this
 * buffer again soon.  For offsets in range
 * [smallfile1, smallfile2] release the pages
 * at the tail of the cache list, larger
 * offsets are released at the head.
 */

/*
 * In POSIX SYNC (FSYNC and FDSYNC) read mode,
 * we want to make sure that the page which has
 * been read, is written on disk if it is dirty.
 * And corresponding indirect blocks should also
 * be flushed.
 */

/*
 * Inode is updated according to this table if FRSYNC is set.
 *
 *	--------------------------
 *	always		IATTCHG|IBDWRITE
 */

/*
 * The inode is not updated if we're logging and the inode is a
 * directory with FRSYNC, FSYNC and FDSYNC flags set.
 */

/*
 * If we've already done a partial read, terminate
 * the read but return no error.
 */

#ifdef _SYSCALL32_IMPL
	/* Translate ILP32 lockfs to LP64 lockfs */
#endif /* _SYSCALL32_IMPL */

#ifdef _SYSCALL32_IMPL
	/* Translate LP64 to ILP32 lockfs */
#endif /* _SYSCALL32_IMPL */

/*
 * get file system locking status
 */

#ifdef _SYSCALL32_IMPL
	/* Translate ILP32 lockfs to LP64 lockfs */
#endif /* _SYSCALL32_IMPL */

#ifdef _SYSCALL32_IMPL
	/* Translate LP64 to ILP32 lockfs */
#endif /* _SYSCALL32_IMPL */

/*
 * if mounted w/o atime, return quietly.
 * I briefly thought about returning ENOSYS, but
 * figured that most apps would consider this fatal
 * but the idea is to make this as seamless as possible.
 */

/*
 * Contract-private interface for Legato.
 * Purge this vnode from the DNLC and decide
 * if this vnode is busy (*arg == 1) or not
 * (*arg == 0).
 */

/*
 * Tune the file system (aka setting fs attributes)
 */

/*
 * The following 3 ioctls are for TSufs support,
 * although they could potentially be used elsewhere.
 */

	/* Copy structure if statistics are being kept */

	/* offset parameter is in/out */

/*
 * for performance, if only the size is requested don't bother
 * with anything else.
 */

/*
 * Return all the attributes.  This should be refined so
 * that it only returns what's asked for.
 */

/*
 * If there is an ACL and there is a mask entry, then do the
 * extra work that completes the equivalent of an acltomode(3)
 * call.  According to POSIX P1003.1e, the acl mask should be
 * returned in the group permissions field.
 *
 * - start with the original permission and mode bits (from above)
 * - clear the group owner bits
 * - add in the mask bits.
 */

/*
 * Cannot set these attributes.
 */

/*
 * check for forced unmount
 */

/*
 * Acquire i_rwlock before TRANS_BEGIN_CSYNC() if this is a file.
 * This follows the protocol for read()/write().
 */

/*
 * ufs_tryirwlock uses rw_tryenter and checks for SLOCK to
 * avoid i_rwlock, ufs_lockfs_begin deadlock.  If deadlock
 * possible, retries the operation.
 */

/*
 * Truncate file.  Must have write permission and not be a directory.
 */

/*
 * Acquire i_rwlock after TRANS_BEGIN_CSYNC() if this is a directory.
 * This follows the protocol established by ...
 */

/*
 * Grab quota lock if we are changing the file's owner.
 */

/*
 * ufs_iaccess is "close enough"; that's because it doesn't ...
 */

/*
 * Change file access modes.
 */

/*
 * Don't change ownership of the quota inode.
 */

/*
 * No real ownership change.
 */

/*
 * Remove the blocks and the file, from the old user's
 * quota.
 */

/*
 * There is a real ownership change.
 * Add the blocks and the file to the new
 * owner's quota.
 */

/*
 * Change file access or modified times.
 */
	/* Check that the time value is within ufs range */

/*
 * if the "noaccess" mount option is set and only atime
 * update is requested, do nothing.  No error is returned.
 */

/*
 * In 2038, ctime sticks forever..
 */

/*
 * The presence of a shadow inode may indicate an ACL, but does
 * not imply an ACL.  Future FSD types should be handled here too
 * and check for the presence of the attribute-specific data.
 */

/*
 * XXX if ufs_iupdat is changed to sandbagged write fix
 * ufs_acl_setattr to push ip to keep acls consistent
 */

/*
 * Suppress out of inodes messages if we will retry.
 */

/*
 * Setattr always increases the sequence number
 */

/*
 * if nfsd and not logging; push synchronously
 */

/*
 * If out of inodes or blocks, see if we can free something
 * up from the delete queue.
 */

/*
 * The ufs_iaccess function wants to be called with
 * mode bits expressed as "ufs specific" bits.
 * I.e., VWRITE|VREAD|VEXEC do not make sense to
 * ufs_iaccess() but IWRITE|IREAD|IEXEC do.
 * But since they're the same we just pass the vnode mode
 * bits and verify that assumption at compile time.
 */
#if IWRITE != VWRITE || IREAD != VREAD || IEXEC != VEXEC
#error "ufs_access needs to map Vmodes to Imodes"
#endif

/*
 * If the symbolic link is empty there is nothing to read.
 * Fast-track these empty symbolic links.
 */

/*
 * The ip->i_rwlock protects the data blocks used for FASTSYMLINK
 */

	struct uio tuio;		/* temp uio struct */
	int tflag = 0;			/* flag to indicate temp vars used */

	/* can this be a fast symlink and is it a user buffer? */

/*
 * set up a kernel buffer to read the link into.  this
 * is to fix a race condition where the user buffer
 * got corrupted before copying it into the inode.
 */

	/* error, clear garbage left behind */

	/* now, copy it into the user buffer */

/*
 * First push out any data pages
 */

/*
 * Delta any delayed inode times updates.
 * All other inode deltas will have already been delta'd
 * and will be pushed during the commit.
 */

/*
 * Commit the Moby transaction
 *
 * Deltas have already been made so we just need to
 * commit them with a synchronous transaction.
 */

/*
 * TRANS_BEGIN_SYNC() will return an error
 * if there are no deltas to commit, for an
 * optimization.
 */
				error = 0;	/* commit wasn't needed */
	} else {		/* not logging */

	/* Just update the inode only */
	/* Do data-synchronous writes */
	/* Do synchronous writes */

/*
 * Unix file system operations having to do with directory manipulation.
 */

/*
 * Check flags for type of lookup (regular file or attribute file)
 */

/*
 * We don't allow recursive attributes...
 */

/*
 * Check accessibility of directory.
 */

/*
 * Check for a null component, which we should treat as
 * looking at dvp from within its parent, so we don't
 * need a call to ufs_iaccess(), as it has already been
 * done.
 */

/*
 * Check for "." ie itself.  this is a quick check and
 * avoids adding "." into the dnlc (which has been seen
 * to occupy >10% of the cache).
 */
	if ((nm[0] == '.') && (nm[1] == 0)) {

/*
 * Don't return without checking accessibility
 * of the directory.  We only need the lock if
 * we are going to return it.
 */

/*
 * Fast path: Check the directory name lookup cache.
 */

/*
 * Check accessibility of directory.
 */

/*
 * Keep the idle queue from getting too long by
 * idling two inodes before attempting to allocate another.
 * This operation must be performed before entering
 * lockfs or a transaction.
 */

/*
 * If vnode is a device return special vnode instead.
 */

/*
 * Null component name refers to the directory itself.
 */

/*
 * Even though this is an error case, we need to grab the
 * quota lock since the error handling code below is common.
 */

/*
 * ufs_tryirwlock_trans uses rw_tryenter and checks for SLOCK
 * to avoid i_rwlock, ufs_lockfs_begin deadlock.  If deadlock
 * possible, retries the operation.
 */

/*
 * Suppress file system full message if we will retry
 */

/*
 * If the file already exists and this is a non-exclusive create,
 * check permissions and allow access for non-directories.
 * Read-only create of an existing directory is also allowed.
 * We fail an exclusive create of anything which already exists.
 */

/*
 * If the error EEXIST was set, then i_seq can not
 * have been updated.  The sequence number interface
 * is defined such that a non-error VOP_CREATE must
 * increase the dir va_seq by at least one.  If we
 * have cleared the error, increase i_seq.  Note that
 * we are increasing the dir i_seq and in rare cases
 * ip may actually be from the dvp, so we already have
 * the locks and it will not be subject to truncation.
 */

/*
 * In case we have to update i_seq of the parent
 * directory dip, we have to defer it till we have
 * released our locks on ip due to lock ordering requirements.
 */

/*
 * Truncate regular files, if requested by caller.
 * Grab i_rwlock to make sure no one else is
 * currently writing to the file (we promised
 * bmap we would do this).
 * Must get the locks in the correct order.
 */

/*
 * Large Files: Why this check here?
 * Though we do it in vn_create() we really
 * want to guarantee that we do not destroy
 * Large file data by atomically checking
 * the size while holding the contents
 * lock.
 */

/*
 * If vnode is a device return special vnode instead.
 */

/*
 * Do the deferred update of the parent directory's sequence
 * number (i_seq).
 */

/*
 * If we haven't had a more interesting failure
 * already, then anything that might've happened
 * here should be reported.
 */

/*
 * If no inodes available, try to free one up out of the
 * delete queue.
 */

/*
 * don't let the delete queue get too long
 */

/*
 * ufs_tryirwlock_trans uses rw_tryenter and checks for SLOCK
 * to avoid i_rwlock, ufs_lockfs_begin deadlock.  If deadlock
 * possible, retries the operation.
 */

/*
 * This must be called after the remove transaction is closed.
 */

	/* Only send the event if there were no errors */

/*
 * Link a file or a directory.  Only privileged processes are allowed to
 * make links to directories.
 */

/*
 * Make sure link for extended attributes is valid.
 * We only support hard linking of attr in ATTRDIR to ATTRDIR
 */

/*
 * Make certain we don't attempt to look at a device node as
 * a ufs inode.
 */

/*
 * ufs_tryirwlock_trans uses rw_tryenter and checks for SLOCK
 * to avoid i_rwlock, ufs_lockfs_begin deadlock.  If deadlock
 * possible, retries the operation.
 */

/*
 * Rename a file or directory.
 * We are given the vnode and entry string of the source and the
 * vnode and entry string of the place we want to move the source
 * to (the target).  The essential operation is:
 *	unlink(tdvp, tnm);
 *	link(sdvp, snm, tdvp, tnm);
 *	unlink(sdvp, snm);
 * but "atomically".  Can't do full commit without saving state in
 * the inode on disk, which isn't feasible at this time.  Best we
 * can do is always guarantee that the TARGET exists.
 */
	struct vnode *sdvp,		/* old (source) parent vnode */
	char *snm,			/* old (source) entry name */
	struct vnode *tdvp,		/* new (target) parent vnode */
	char *tnm,			/* new (target) entry name */

	struct inode *sdp;		/* old (source) parent inode */
	struct inode *tdp;		/* new (target) parent inode */

/*
 * We only allow renaming of attributes from ATTRDIR to ATTRDIR.
 */

/*
 * Look up inode of file we're supposed to rename.
 */

/*
 * Lock both the source and target directories (they may be
 * the same) to provide the atomicity semantics that was
 * previously provided by the per file system vfs_rename_lock.
 *
 * With vfs_rename_lock removed to allow simultaneous renames
 * within a file system, ufs_dircheckpath can deadlock while
 * traversing back to ensure that source is not a parent directory
 * of target parent directory.  This is because we get into
 * ufs_dircheckpath with the sdp and tdp locks held as RW_WRITER.
 * If the tdp and sdp of the simultaneous renames happen to be
 * in the path of each other, it can lead to a deadlock.  This
 * can be avoided by getting the locks as RW_READER here and then
 * upgrading to RW_WRITER after completing the ufs_dircheckpath.
 */

/*
 * We hold the target directory's i_rwlock after calling
 * ufs_lockfs_begin but in many other operations (like ufs_readdir)
 * VOP_RWLOCK is explicitly called by the filesystem independent code
 * before calling the file system operation.  In these cases the order
 * is reversed (i.e. i_rwlock is taken first and then ufs_lockfs_begin
 * is called).  This is fine as long as ufs_lockfs_begin acts as a VOP
 * counter but with ufs_quiesce setting the SLOCK bit this becomes a
 * synchronizing object which might lead to a deadlock.  So we use
 * rw_tryenter instead of rw_enter.  If we fail to get this lock and
 * find that SLOCK bit is set, we call ufs_lockfs_end and restart the
 * operation.
 */

/*
 * We didn't get the lock.  Check if the SLOCK is set in the
 * ufsvfs.  If yes, we might be in a deadlock.  Safer to give up
 * and wait for SLOCK to be cleared.
 */

/*
 * SLOCK isn't set so this is a genuine synchronization
 * case.  Let's try again after giving them a breather.
 */

/*
 * Need to check if the tdp and sdp are the same!
 */

/*
 * We didn't get the lock.  Check if the SLOCK is set in the
 * ufsvfs.  If yes, we might be in a deadlock.  Safer to give up
 * and wait for SLOCK to be cleared.
 */

/*
 * So we couldn't get the second level peer lock *and*
 * the SLOCK bit isn't set.  Too bad we can be
 * contending with someone wanting these locks the other
 * way round.  Reverse the locks in case there is a heavy
 * contention for the second level lock.
 */

/*
 * Make sure we can delete the source entry.  This requires
 * write permission on the containing directory.
 * Check for sticky directories.
 */

/*
 * If this is a rename of a directory and the parent is
 * different (".." must be changed), then the source
 * directory must not be in the directory hierarchy
 * above the target, as this would orphan everything
 * below the source directory.  Also the user must have
 * write permission in the source so as to be able to
 * change "..".
 */

/*
 * If we got EAGAIN ufs_dircheckpath detected a
 * potential deadlock and backed out.  We need
 * to retry the operation since sdp and tdp have
 * to be released to avoid the deadlock.
 */

/*
 * Check for renaming '.' or '..' or alias of '.'
 */

/*
 * Simultaneous renames can deadlock in ufs_dircheckpath since it
 * tries to traverse back the file tree with both tdp and sdp held
 * as RW_WRITER.  To avoid that we have to hold the tdp and sdp locks
 * as RW_READERS till ufs_dircheckpath is done.
 * Now that ufs_dircheckpath is done with, we can upgrade the locks
 * to RW_WRITER.
 */

/*
 * The upgrade failed.  We have to give away the lock
 * so as to avoid deadlocking with someone else who is
 * waiting for writer lock.  With the lock gone, we
 * cannot be sure the checks done above will hold
 * good when we eventually get them back as writer.
 * So if we can't upgrade we drop the locks and retry
 * everything again.
 */

/*
 * The upgrade failed.  We have to give away the lock
 * so as to avoid deadlocking with someone else who is
 * waiting for writer lock.  With the lock gone, we
 * cannot be sure the checks done above will hold
 * good when we eventually get them back as writer.
 * So if we can't upgrade we drop the locks and retry
 * everything again.
 */

/*
 * Now that all the locks are held check to make sure another thread
 * didn't slip in and take out the sip.
 */

/*
 * If the inode was found need to drop the v_count
 * so as not to keep the filesystem from being
 * unmounted at a later time.
 */

/*
 * Release the slot.fbp that has the page mapped and
 * locked SE_SHARED, and could be used in
 * ufs_direnter_lr() which needs to get the SE_EXCL lock
 */

/*
 * Link source to the target.  If a target exists, return its
 * vnode pointer in tvp.  We'll release it after sending the
 * event.
 */

/*
 * ESAME isn't really an error; it indicates that the
 * operation should not be done because the source and target
 * are the same file, but that no error should be reported.
 */

/*
 * Remove the source entry.  ufs_dirremove() checks that the entry
 * still reflects sip, and returns an error if it doesn't.
 * If the entry has changed just forget about it.  Release ...
 */

/*
 * If no errors, send the appropriate events on the source
 * and destination (a.k.a, target) vnodes, if they exist.
 * This has to be done after the rename transaction has closed.
 */

/*
 * Notify the target directory of the rename event
 * if source and target directories are not same.
 * Note that if ufs_direnter_lr() returned ESAME then
 * this event will still be sent.  This isn't expected
 * to be a problem for anticipated usage by consumers.
 */

/*
 * Can't make directory in attr hidden dir
 */

/*
 * ufs_tryirwlock_trans uses rw_tryenter and checks for SLOCK
 * to avoid i_rwlock, ufs_lockfs_begin deadlock.  If deadlock
 * possible, retries the operation.
 */

/*
 * don't let the delete queue get too long
 */

/*
 * ufs_tryirwlock_trans uses rw_tryenter and checks for SLOCK
 * to avoid i_rwlock, ufs_lockfs_begin deadlock.  If deadlock
 * possible, retries the operation.
 */

/*
 * This must be done AFTER the rmdir transaction has closed.
 */
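The trylock protocol described in the rename path above (try the peer lock; on failure back out the held lock, and reverse the acquisition order under contention) can be sketched in user-space with toy try-locks. The structure is illustrative and single-threaded; it is not the kernel's `ufs_tryirwlock`:

```c
#include <assert.h>

/* Toy try-lock: 0 = free, 1 = held.  Returns 1 on success. */
static int try_lock(int *l) { if (*l) return (0); *l = 1; return (1); }
static void unlock(int *l) { *l = 0; }

/*
 * Take two peer locks with trylocks only: if the second cannot be
 * taken, back out the first and retry with the order reversed,
 * mimicking the "reverse the locks" fallback described above.
 * Bounded retries stand in for the kernel's restart-the-operation
 * path (ufs_lockfs_end and retry).
 */
static int
lock_pair(int *a, int *b, int max_tries)
{
	int *first = a, *second = b, *tmp;

	while (max_tries-- > 0) {
		if (!try_lock(first))
			return (-1);	/* caller restarts the operation */
		if (try_lock(second))
			return (0);	/* both held */
		unlock(first);		/* back out to avoid deadlock */
		tmp = first;		/* reverse order: heavy contention */
		first = second;		/* may be on the second level lock */
		second = tmp;
	}
	return (-1);
}
```

The key property, as in the rename code, is that a thread never sleeps holding one lock while waiting for the other, so two threads wanting the pair in opposite orders cannot deadlock.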
/* Only send the event if there were no errors */ * Check if we have been called with a valid iov_len * and bail out if not, otherwise we may potentially loop * Large Files: When we come here we are guaranteed that * uio_offset can be used safely. The high word is zero. /* Large Files: directory files should not be "large" */ /* Force offset to be valid (to guard against bogus lseek() values) */ /* Quit if at end of file or link count of zero (posix) */ * Get space to change directory entries into fs independent format. * Do fast alloc for the most commonly used-request size (filesystem /* Truncate request to file size */ /* Comply with MAXBSIZE boundary restrictions of fbread() */ * Read in the next chunk. * We are still holding the i_rwlock. /* Transform to file-system independent format */ * If the current directory entry is mangled, then skip * to the next block. It would be nice to set the FSBAD * flag in the super-block so that a fsck is forced on * next reboot, but locking is a problem. /* Skip to requested offset and skip empty entries */ /* Buffer too small for any entries */ /* If would overrun the buffer, quit */ /* use strncpy(9f) to zero out uninitialized bytes */ /* Read whole block, but got no entries, read another if not eof */ * Large Files: casting i_size to int here is not a problem * because directory sizes are always less than MAXOFF32_T. /* Copy out the entry data */ struct vnode *
dvp,
/* ptr to parent dir vnode */ char *
linkname,
/* name of symbolic link */ char *
target,
/* target path */ struct cred *
cr)
/* user credentials */ * No symlinks in attrdirs at this time * We must create the inode before the directory entry, to avoid * racing with readlink(). ufs_dirmakeinode requires that we * hold the quota lock as reader, and directory locks as writer. * Suppress any out of inodes messages if we will retry on * OK. The inode has been created. Write out the data of the * symbolic link. Since symbolic links are metadata, and should * remain consistent across a system crash, we need to force the * data out synchronously. * (This is a change from the semantics in earlier releases, which * only created symbolic links synchronously if the semi-documented * 'syncdir' option was set, or if we were being invoked by the NFS * server, which requires symbolic links to be created synchronously.) * We need to pass in a pointer for the residual length; otherwise * ufs_rdwri() will always return EIO if it can't write the data, * even if the error was really ENOSPC or EDQUOT. * Suppress file system full messages if we will retry * If the link's data is small enough, we can cache it in the inode. * This is a "fast symbolic link". We don't use the first direct * block because that's actually used to point at the symbolic link's * contents on disk; but we know that none of the other direct or * indirect blocks can be used because symbolic links are restricted * to be smaller than a file system block. /* error, clear garbage left behind */ * OK. We've successfully created the symbolic link. All that * remains is to insert it into the appropriate directory. * Fall through into remove-on-error code. We're either done, or we * need to remove the inode (if we couldn't insert it). * We may have failed due to lack of an inode or of a block to * store the target in. Try flushing the delete queue to free * logically-available things up and try again. * Ufs specific routine used to do ufs io. * Caller has requested a writer lock, but that inhibits any * concurrency in the VOPs that follow. 
Acquire the lock shared * and defer exclusive access until it is known to be needed in * other VOP handlers. Some cases can be determined here. * If directio is not set, there is no chance of concurrency, * so just acquire the lock exclusive. Beware of a forced * unmount before looking at the mount option. * Mandatory locking forces acquiring i_rwlock exclusive. * Acquire the lock shared in case a concurrent write follows. * Mandatory locking could have become enabled before the lock * was acquired. Re-check and upgrade if needed. * If file is being mapped, disallow frlock. * XXX I am not holding tlock while checking i_mapcnt because the * current locking strategy drops all locks before calling fs_frlock. * So, mapcnt could change before we enter fs_frlock, making it * meaningless to have held tlock in the first place. return (
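The shared-then-upgrade i_rwlock policy described above reduces to a small decision table. The following is a hedged sketch (enum and helper name hypothetical, not the actual ufs_rwlock() code): without directio there is no VOP concurrency worth preserving, and mandatory locking always forces the exclusive lock; the shared case must still re-check after acquisition and upgrade (drop, then retake exclusive) if mandatory locking became enabled in the window.

```c
#include <assert.h>

enum lockmode { LOCK_SHARED, LOCK_EXCLUSIVE };

/*
 * Illustrative decision table for the i_rwlock acquisition policy.
 * The real code must also beware of a forced unmount before it looks
 * at the directio mount option.
 */
static enum lockmode
rwlock_mode(int directio_enabled, int mandatory_locking)
{
	if (!directio_enabled || mandatory_locking)
		return (LOCK_EXCLUSIVE);
	/*
	 * Shared for now; the caller re-checks mandatory locking after
	 * acquiring and upgrades to exclusive if it became enabled.
	 */
	return (LOCK_SHARED);
}
```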
EINVAL);
/* Command not handled here */ * Used to determine if read ahead should be done. Also used * to determine when write back occurs. * A faster version of ufs_getpage. * We optimize by inlining the pvn_getpages iterator, eliminating * calls to bmap_read if the file doesn't have UFS holes, and avoiding * the overhead of page_exists(). * When a file has UFS holes and ufs_getpage is called with S_READ, * we set *protp to PROT_READ to avoid calling bmap_read. This approach * victimizes performance when a file with UFS holes is faulted * first in the S_READ mode, and then in the S_WRITE mode. We will get * two MMU faults in this case. * XXX - the inode fields which control the sequential mode are not * protected by any mutex. The read ahead will act wild if * multiple processes access the file concurrently and * some of them in sequential mode. One particularly bad case * is if another thread changes the value of i_nextrio between * the time this thread tests the i_nextrio value and then reads it * again to use it as the offset for the read ahead. * Obey the lockfs protocol * Try to start a transaction, will return if blocking is * expected to occur and the address space is not the * Use EDEADLK here because the VM code * can normally never see this error. * If this thread owns the lock, i.e., this thread grabbed it * as writer somewhere above, then we don't need to grab the * lock as reader in this routine. * Grab the quota lock if we need to call * bmap_write() below (with i_contents as writer). * We may be getting called as a side effect of a bmap using * fbread() when the blocks might be being allocated and the * size has not yet been up'ed. In this case we want to be * able to return zero pages if we get back UFS_HOLE from * calling bmap for a non-write case here. We also might have * to read some frags from the disk into a page if we are * extending the number of frags for a given lbn in bmap().
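The i_nextrio bookkeeping described above implies a simple sequentiality test: a fault is treated as sequential when it lands within one cluster of the next read-ahead mark. The helper below is a hedged sketch of that test only, with an illustrative name; as the XXX comment warns, the real inode fields are not mutex-protected, so concurrent updates can race.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of the sequential-access test implied by i_nextrio:
 * nextrio is the offset where the next read-ahead cluster would
 * start, clustsz the cluster size. Sequential access keeps the
 * faulting offset inside a one-cluster window around that mark.
 */
static int
is_sequential(uint64_t off, uint64_t nextrio, uint64_t clustsz)
{
	return (off + clustsz >= nextrio && off < nextrio + clustsz);
}
```

On a hit, the caller would initiate read ahead of the cluster at nextrio and advance the mark by one cluster; on a miss, it would reset the mark past the current request.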
* Large Files: The read of i_size here is atomic because * i_contents is held here. If dolock is zero, the lock * is held in bmap routines. * Must hold i_contents lock throughout the call to pvn_getpages * since locked pages are returned from each call to ufs_getapage. * Must *not* return locked pages and then try for contents lock * due to lock ordering requirements (inode > page) * We must acquire the RW_WRITER lock in order to * Grab the quota lock before * upgrading i_contents, but if we can't grab it * don't wait here due to lock order: * vfs_dqrwlock > i_contents. * May be allocating disk blocks for holes here as * a result of mmap faults. write(2) does the bmap_write * in rdip/wrip, not here. We are not dealing with frags * Large Files: We cast fs_bmask field to offset_t * just as we do for MAXBMASK because uoff is a 64-bit * data type. fs_bmask will still be a 32-bit type * as we cannot change any ondisk data structures. * Can be a reader from now on. * We can release vfs_dqrwlock early so do it, but make * sure we don't try to release it again at the bottom. * We remove PROT_WRITE in cases when the file has UFS holes * because we don't want to call bmap_read() to check each * page if it is backed with a disk block. * The loop looks up pages in the range [off, off + len). * For each page, we first check if we should initiate an asynchronous * read ahead before we call page_lookup (we may sleep in page_lookup * for a previously initiated disk read). /* Handle async getpage (faultahead) */ * Check if we should initiate read ahead of next cluster. * We call page_exists only when we need to confirm that * we have the current page before we initiate the read ahead. * We always read ahead the next cluster of data * starting from i_nextrio. If the page (vp,nextrio) * is actually in core at this point, the routine * ufs_getpage_ra() will stop pre-fetching data * until we read that page in a synchronized manner * through ufs_getpage_miss(). 
So, we should increase * i_nextrio if the page (vp, nextrio) exists. * We found the page in the page cache. * We have to create the page, or read it from disk. * Return pages up to plsz if they are in the page cache. * We cannot return pages if there is a chance that they are * backed with a UFS hole and rw is S_WRITE or S_CREATE. *
pl = NULL;
/* Terminate page list */ * Release any pages we have locked. * If the inode is not already marked for IACC (in rdip() for read) * and the inode is not marked for no access time update (in wrip() * for write) then update the inode access time and mod time now. * ufs_getpage_miss is called when ufs_getpage missed the page in the page * cache. The page is either read from the disk, or it's created. * A page is created (without disk read) if rw == S_CREATE, or if * the page is not backed with a real disk block (UFS hole). * Figure out whether the page can be created, or must be * read from the disk. * If it's also a fallocated block that hasn't been written to * yet, we will treat it just like a UFS_HOLE and create "ufs_getpage_miss: page_create == NULL"));
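The decision ufs_getpage_miss is described as making above — create a zero-filled page, or read it from disk — can be sketched as follows. The enum and helper are hypothetical; "unwritten fallocated" models the fallocate'd-but-never-written block that the comment says is treated just like a UFS_HOLE.

```c
#include <assert.h>

enum pg_source { PG_CREATE_ZEROED, PG_READ_DISK };

/*
 * Illustrative sketch: a page is created without a disk read when
 * rw == S_CREATE, or when the block is a UFS hole or an unwritten
 * fallocated block; otherwise its contents come from disk.
 */
static enum pg_source
page_source(int rw_is_create, int is_ufs_hole, int is_unwritten_falloc)
{
	if (rw_is_create || is_ufs_hole || is_unwritten_falloc)
		return (PG_CREATE_ZEROED);	/* no disk read needed */
	return (PG_READ_DISK);
}
```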
* If access is not in sequential order, we read from disk * We limit the size of the transfer to bsize if we are reading * from the beginning of the file. Note in this situation we * will hedge our bets and initiate an async read ahead of * Some other thread has entered the page. * ufs_getpage will retry page_lookup. * Zero part of the page which we are not * going to read from the disk. * If the file access is sequential, initiate read ahead * Read ahead a cluster from the disk. Returns the length in bytes. * If the directio advisory is in effect on this file, * then do not do buffered read ahead. Read ahead makes * it more difficult on threads using directio as they * will be forced to flush the pages from this vnode. * If it's a UFS_HOLE or a fallocated block, do not perform * any read aheads since there probably is nothing to read ahead * Limit the transfer size to bsize if this is the 2nd block. * Zero part of page which we are not going to read from disk * Flags are composed of {B_INVAL, B_FREE, B_DONTNEED, B_FORCE, B_ASYNC} * LMXXX - the inode really ought to contain a pointer to one of these * async args. Stuff gunk in there and just hand the whole mess off. * This would replace i_delaylen, i_delayoff. return (
ufs_fault(vp, "ufs_putpage: bad v_count == 0"));
* XXX - Why should this check be made here? * If nobody stalled, start a new cluster. * If we have a full cluster or they are not contig, * then push last cluster and start over. /* LMXXX - flags are new val, not old */ * There is something there, it's not full, and * Must have weird flags or we are not clustering. * If len == 0, do from off to EOF. * The normal cases should be len == 0 & off == 0 (entire vp list), * len == MAXBSIZE (from segmap_release actions), and len == PAGESIZE * any pages in this inode. * The inode lock is held during i/o. * Must synchronize this thread and any possible thread * operating in the window of vulnerability in wrip(). * It is dangerous to allow both a thread doing a putpage * and a thread writing, so serialize them. The exception * is when the thread in wrip() does something which causes * a putpage operation. Then, the thread must be allowed * to continue. It may encounter a bmap_read problem in * ufs_putapage, but that is handled in ufs_putapage. * Allow async writers to proceed, we don't want to block * If there is no thread in the critical * section of wrip(), then proceed. * Otherwise, wait until there isn't one. * Bounce async writers when we have a writer * working on this file so we don't deadlock * Search the entire vp list for pages >= off. * Loop over all offsets in the range looking for * If we are not invalidating, synchronously * freeing or writing pages, use the routine * page_lookup_nowait() to prevent reclaiming * them from the free list. * "io_off" and "io_len" are returned as * the range of pages we actually wrote. * This allows us to skip ahead more quickly * since several pages may've been dealt * with by this iteration of the loop. * We have just sync'ed back all the pages on * the inode, turn off the IMODTIME flag. * Write out a single page, possibly klustering adjacent * dirty pages. The inode lock must be held. * LMXXX - bsize < pagesize not done. 
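The delayed-write clustering policy described above — start a cluster if nobody stalled, extend it while the new range is contiguous and there is room, otherwise push the last cluster and start over — can be sketched in terms of the i_delayoff/i_delaylen bookkeeping. The enum, helper, and CLUSTSZ constant are illustrative assumptions, not the actual ufs_putpage code.

```c
#include <assert.h>
#include <stdint.h>

#define CLUSTSZ	65536	/* illustrative maximum cluster size */

enum cluster_action { START_NEW, EXTEND, PUSH_AND_RESTART };

/*
 * Hypothetical sketch of the clustering decision: (delayoff,
 * delaylen) describe the pending delayed-write cluster, (off, len)
 * the newly dirtied range.
 */
static enum cluster_action
cluster_decide(uint64_t delayoff, uint64_t delaylen,
    uint64_t off, uint64_t len)
{
	if (delaylen == 0)		/* nobody stalled: start one */
		return (START_NEW);
	if (delayoff + delaylen == off && delaylen + len <= CLUSTSZ)
		return (EXTEND);	/* contiguous and still room */
	/* full cluster, or not contig: push it and start over */
	return (PUSH_AND_RESTART);
}
```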
* If the modified time on the inode has not already been * set elsewhere (e.g. for write/setattr) we set the time now. * This gives us approximate modified times for mmap'ed files * which are modified via stores in the user address space. * Align the request to a block boundary (for old file systems), * and go ask bmap() how contiguous things are for this file. if (
bn == UFS_HOLE) {
/* putpage never allocates */ * logging device is in error mode; simply return EIO * Oops, the thread in the window in wrip() did some * sort of operation which caused a putpage in the bad * range. In this case, just return an error which will * cause the software modified bit on the page to set * and the page will get written out again later. * If the pager is trying to push a page in the bad range * just tell him to try again later when things are better. * If it is a fallocate'd block, reverse the negativity since * we are now writing to it * Take the length (of contiguous bytes) passed back from bmap() * and _try_ to get a set of pages covering that extent. * May have run out of memory and not clustered backwards. * We told bmap off, so we have to adjust the bn accordingly. * bmap was careful to tell us the right size so use that. * There might be unallocated frags at the end. * LMXXX - bzero the end of the page? We must be writing after EOF. * Handle the case where we are writing the last page after EOF. * XXX - just a patch for i-mt3. * If file is being locked, disallow mapping. * User specified address - blow away any previous mappings * We didn't get the lock. Check if the SLOCK is set in the * ufsvfs. If yes, we might be in a deadlock. Safer to give up * and wait for SLOCK to be cleared. * SLOCK isn't set so this is a genuine synchronization * case. Let's try again after giving them a breather. * Return the answer requested to poll() for non-device files * Have to handle _PC_NAME_MAX here, because the normal way * [fs_pathconf() -> VOP_STATVFS() -> ufs_statvfs()] * results in a lock ordering reversal between * ufs_lockfs_{begin,end}() and * ufs_thread_{suspend,continue}(). * Keep in sync with ufs_statvfs(). * We need a better check. Ideally, we would use another * vnodeops so that hlocked and forcibly unmounted file * systems would return EIO where appropriate and w/o the * For vmpss (pp can be NULL) case respect the quiesce protocol.
* ul_lock must be taken before locking pages so we can't use it here * if pp is non NULL because segvn already locked pages * SE_EXCL. Instead we rely on the fact that a forced umount or * applying a filesystem lock via ufs_fiolfs() will block in the * implicit call to ufs_flush() until we unlock the pages after the * return to segvn. Other ufs_quiesce() callers keep ufs_quiesce_pend * above 0 until they are done. We have to be careful not to increment * ul_vnops_cnt here after forceful unmount hlocks the file system. * If pp is NULL use ul_lock to make sure we don't increment * ul_vnops_cnt after forceful unmount hlocks the file system. * segvn may call VOP_PAGEIO() instead of VOP_GETPAGE() to * handle a fault against a segment that maps vnode pages with * large mappings. Segvn creates pages and holds them locked * SE_EXCL during VOP_PAGEIO() call. In this case we have to * use rw_tryenter() to avoid a potential deadlock since in * lock order i_contents needs to be taken first. * Segvn will retry via VOP_GETPAGE() if VOP_PAGEIO() fails. * Return an error to segvn because the pagefault request is beyond * Break the io request into chunks, one for each contiguous * stretch of disk blocks in the target file. * Zero out a page beyond EOF, when the last block of * a file is a UFS fragment so that ufs_pageio() can be used * instead of ufs_getpage() to handle faults against * segvn segments that use large pages. * If the request is not B_ASYNC, wait for i/o to complete * and re-assemble the page list to return to the caller. * If it is B_ASYNC we leave the page list in pieces and * cleanup() will dispose of them. /* Cleanup unprocessed parts of list */ /* Re-assemble list and let caller clean up */ * Called when the kernel is in a frozen state to dump data * directly to the device. It uses a private dump data structure, * set up by dump_ctl, to locate the correct disk block to which to dump. 
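Breaking an i/o request into chunks, one per contiguous stretch of disk blocks, as ufs_pageio() is described to do above, amounts to scanning the block map for discontinuities. The sketch below is illustrative only: the array plays the role of repeated bmap() lookups, and the helper name is made up.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical sketch: blkmap[] holds the disk block address of each
 * logical block in the request; a new chunk begins wherever the disk
 * address does not follow on from its predecessor.
 */
static int
count_contig_chunks(const long *blkmap, size_t nblk)
{
	size_t i;
	int chunks;

	if (nblk == 0)
		return (0);
	chunks = 1;
	for (i = 1; i < nblk; i++) {
		/* a new chunk starts where the disk address jumps */
		if (blkmap[i] != blkmap[i - 1] + 1)
			chunks++;
	}
	return (chunks);
}
```

Each counted chunk would become one buf submitted to the device; for non-B_ASYNC requests the pieces are re-assembled into a page list afterwards, as the comment describes.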
* Validate the inode that it has not been modified since * the dump structure is allocated. * See that the file has room for this write * Find the physical disk block numbers from the dump * private data structure directly and write out the data * in contiguous block lumps * Prepare the file system before and after the dump operation. * Preparation before dump, allocate dump private data structure * to hold all the direct and indirect block info for dump. * Clean up after dump, deallocate the dump private data structure. * Scan dump_info for *blkp DEV_BSIZE blocks of contig fs space; * if found, the starting file-relative DEV_BSIZE lbn is written * to *bklp; that lbn is intended for use with VOP_DUMP() * check for forced unmount * alloc and record dump_info * calculate and allocate space needed according to i_size /* Start saving the info */ for (i = 0; i <
NIADDR; i++) {
/* and time stamp the information */ * scan dblk[] entries; contig fs space is found when: * ((current blkno + frags per block) == next blkno) * index is where size bytes of contig space begins; * conversion from index to the file's DEV_BSIZE lbn * is equivalent to: (index * fs_bsize) / DEV_BSIZE * Recursive helper function for ufs_dumpctl(). It follows the indirect file * system blocks until it reaches the disk block addresses, which are * then stored into the given buffer, storeblk. * Only grab locks if needed - they're not needed to check vsa_mask * or if the mask contains no acl flags. /* Abort now if the request is either empty or invalid. */ * Following convention, if this is a directory then we acquire the * inode's i_rwlock after starting a UFS logging transaction; * otherwise, we acquire it beforehand. Since we were called (and * must therefore return) with the lock held, we will have to drop it, * and later reacquire it, if operating on a directory. /* Upgrade the lock if required. */ * Check that the file system supports this operation. Note that * ufs_lockfs_begin() will have checked that the file system had * not been forcibly unmounted. /* Do the actual work. */ * Suppress out of inodes messages if we will retry. * top_end_async() can eventually call * top_end_sync(), which can block. We must * therefore observe the lock-ordering protocol * If no inodes available, try scaring a logically- * free one out of the delete queue to someplace * If we need to reacquire the lock then it is safe to do so * as a reader. This is because ufs_rwunlock(), which will be * called by our caller after we return, does not differentiate * between shared and exclusive locks.
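The dump_info contiguity scan described above can be sketched directly from the two rules it states: blocks are contiguous when (current blkno + frags per block) == next blkno, and the found index converts to a DEV_BSIZE lbn as (index * fs_bsize) / DEV_BSIZE. The helper below is an illustrative assumption, not the actual ufs_dumpctl() scan; it expects a run of at least two blocks.

```c
#include <assert.h>
#include <stdint.h>

#define DEV_BSIZE	512

/*
 * Hypothetical sketch: dblk[] holds one frag address per file system
 * block. Return the file-relative DEV_BSIZE lbn where a run of
 * need_blks contiguous blocks begins, or -1 if none exists.
 */
static long
find_contig_lbn(const uint32_t *dblk, int nblks, int need_blks,
    uint32_t fs_bsize, uint32_t fs_frag)
{
	int i, run = 1, start = 0;

	for (i = 1; i < nblks; i++) {
		/* contig when current blkno + frags/block == next */
		if (dblk[i - 1] + fs_frag == dblk[i]) {
			if (++run >= need_blks)
				return ((long)start * fs_bsize / DEV_BSIZE);
		} else {
			run = 1;
			start = i;	/* candidate run restarts here */
		}
	}
	return (-1);	/* no contiguous run long enough */
}
```

With an 8K block size each index advances the lbn by 16 DEV_BSIZE sectors, matching the (index * fs_bsize) / DEV_BSIZE conversion in the comment.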