devpoll.c revision a5eb7107f06a6e23e8e77e8d3a84c1ff90a73ac6
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file. If applicable, add the following below this CDDL HEADER, with
* the fields enclosed by brackets "[]" replaced with your own
* identifying information: Portions Copyright [yyyy] [name of
* copyright owner]
*
* Copyright 2008 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*
* Copyright (c) 2012 by Delphix. All rights reserved.
* Copyright (c) 2015, Joyent, Inc. All rights reserved.

/* device local functions */

* The /dev/poll driver shares most of its code with the poll(2) system
* call, whose structure is per-lwp. An implicit assumption is made
* there that some portion of the pollcache will never be touched by
* other lwps. E.g., in the poll(2) design, no lwp will ever need to
* grow the bitmap of another lwp. This assumption is not true for
* /dev/poll; hence the need for extra locking.
*
* To allow more parallelism, each /dev/poll file descriptor (indexed
* by minor number) has its own lock. Since read (dpioctl) is a much
* more frequent operation than write, we want to allow multiple reads
* on the same /dev/poll fd. However, we prevent writes from being
* starved by giving priority to write operations. Theoretically writes
* can starve reads as well, but in a practical sense this is not
* important because (1) writes happen less often than reads, and (2) a
* write operation defines the content of the poll fd's cached set. If
* writes happen so often that they can starve reads, the cached set is
* very unstable, and it may not make sense to read an unstable cached
* set anyway. Therefore, the writers-starving-readers case is not
* handled in this design.
*
* dp_pcache_poll has similar logic to pcache_poll() in poll.c.
* The major differences from pcache_poll() are:
* (1) /dev/poll requires scanning the bitmap starting at where it was
* stopped last time, instead of always starting from 0;
* (2) since the user may not have cleaned up the cached fds when they
* were closed, some polldats in the cache may refer to closed or
* reused fds. We need to check for those cases.
*
* NOTE: Upon closing an fd, automatic poll cache cleanup is done for
* poll(2) caches but NOT for /dev/poll caches. So expect some stale
* entries.

* No need to search because no poll fd has been cached.

* We started from the very beginning; no need to wrap around.

* Examine the bitmap in a circular fashion to avoid starvation. Always
* resume from the last stop and scan to the end of the map, then wrap
* around to the beginning.

* The fd is POLLREMOVEd. This fd is logically no longer cached, so
* move on to the next one.

* The fd has been closed, but the user has not done a POLLREMOVE on it
* yet. Instead of cleaning it up here implicitly, we return POLLNVAL.
* This is consistent with poll(2) polling a closed fd. Hopefully this
* will remind the user to do a POLLREMOVE.

* In the epoll compatibility case, we actually perform the implicit
* removal to remain closer to the epoll semantics.

* The user is polling on a cached fd which was closed and then reused.
* Unfortunately there is no good way to inform the user. If the file
* struct is also reused, we may not be able to detect the fd reuse at
* all. As long as this does not cause system failure and/or memory
* leaks, we will play along. The man page states that if the user does
* not clean up closed fds, polling results will be indeterministic.
*
* XXX - perhaps log the detection of fd reuse?

* XXX - pollrelock() logic needs to know which pollcache lock to grab.
* It'd be a cleaner solution if we could pass pcp as an argument in
* the VOP_POLL interface instead of implicitly passing it via the
* thread_t struct. On the other hand, changing the VOP_POLL interface
* would require every driver/file system poll routine to change. May
* want to revisit the issue later.

* Layered devices (e.g. the console driver) may change the vnode and
* thus the pollhead pointer out from underneath us.
* The bit should still be set.

* If any of the event bits are set for which poll and epoll
* representations differ, swizzle in the native epoll values.

* We define POLLWRNORM to be POLLOUT, but epoll has separate
* definitions for them; if POLLOUT is set and the user has asked for
* EPOLLWRNORM, set it as well.

* If POLLET is set, clear the bit in the bitmap -- which effectively
* latches the edge on a pollwakeup() from the driver.

* If POLLONESHOT is set, perform the implicit POLLREMOVE.

* We clear a bit or cache a poll fd if the driver returns a poll head
* ptr, which is expected in the case of 0 revents. Some buggy drivers
* may return a NULL php pointer with 0 revents. In this case, we just
* treat the driver as "noncachable" and do not clear the bit.

* An event of interest may have arrived between the VOP_POLL() and the
* pollhead_insert(); check again.

* No bit set in the range. Check for wrap around.

* Used up every entry in the existing devpoll table. Grow the table by
* DEVPOLLSIZE.

* Allocate a pollcache skeleton here. Delay allocating bitmap
* structures until dpwrite() time, since we don't know the optimal
* size yet. We also delay setting the pid until either dpwrite() or an
* attempt to poll on the instance, allowing parents to create
* instances of /dev/poll for their children. (In the epoll
* compatibility case, this check isn't performed to maintain semantic
* compatibility.)

* Add a new fd to the cached set, or change poll events for a watched
* fd.

* Copy in the pollfd array. Walk through the array and add each polled
* fd to the cached set.

* Although /dev/poll uses the write(2) interface to cache fds, it's
* not supposed to function as a seekable device. To prevent the offset
* from growing and eventually exceeding the maximum, reset the offset
* back to zero.

* We are about to enter the core portion of dpwrite(). Make sure this
* write has exclusive access in this portion of the code, i.e., no
* other writers in this code and no other readers in dpioctl.
* We need to do a bit of a dance here: we need to drop our dpe_lock
* and grab the pc_lock to broadcast the pc_cv and kick any sleepers.

* epoll semantics demand that we return EBADF if our specified fd is
* invalid.

* If we're in epoll compatibility mode, check that the fd is valid
* before allocating anything for it; epoll semantics demand that we
* return EBADF if our specified fd is invalid.

* epoll semantics demand that we error out if a file descriptor is
* added twice, which we check (imperfectly) by testing both whether we
* have the file descriptor cached and whether the file pointer that
* corresponds to the file descriptor matches our cached value. If
* there is a pointer mismatch, the file descriptor was closed without
* being removed. The converse is clearly not true, however, so to
* narrow the window by which a spurious EEXIST may be returned, we
* also check if this fp has been added to an epoll control descriptor
* in the past; if it hasn't, we know that this is due to fp reuse --
* it's not a true EEXIST case. (By performing this additional check,
* we limit the window of spurious EEXIST to situations where a single
* file descriptor is being used across two or more epoll control
* descriptors -- and even then, the file descriptor must be closed and
* reused in a relatively tight time span.)

* We have decided that the cached information was stale: it either
* didn't match, or the fp had never actually been epoll()'d on before.
* We now need to clear our pd_events to ensure that we don't
* mistakenly operate on a cached event disposition.

* The fd is not valid. Since we can't pass this error back in the
* write() call, set the bit in the bitmap to force the DP_POLL ioctl
* to examine it.

* To (greatly) reduce EEXIST false positives, we denote that this fp
* has been epoll()'d. We do this regardless of epoll compatibility
* mode, as the flag is harmless if not in epoll compatibility mode.
* Don't do VOP_POLL for an already cached fd if the events are already
* cached.

* Do VOP_POLL and cache this poll fd.
*
* XXX - pollrelock() logic needs to know which pollcache lock to grab.
* It'd be a cleaner solution if we could pass pcp as an argument in
* the VOP_POLL interface instead of implicitly passing it via the
* thread_t struct. On the other hand, changing the VOP_POLL interface
* would require every driver/file system poll routine to change. May
* want to revisit the issue later.

* We always set the bit when this fd is cached; this forces the first
* DP_POLL to poll this fd. The real performance gain comes from
* subsequent DP_POLLs. We also attempt a pollhead_insert(); if it's
* not possible, we'll do it in dpioctl().

* As with the add case (above), epoll semantics demand that we error
* out.

/* do this now, before we sleep on DP_WRITER_PRESENT */

* We can't turn on epoll compatibility while there are outstanding
* operations.

* epoll compatibility is a one-way street: there's no way to turn it
* off for a particular open.

* DP_PPOLL, which otherwise uses the same structure as DP_POLL.

/* Kernel-internal ioctl call */

* Convert the deadline from relative milliseconds to absolute
* nanoseconds. They must wait for at least a tick.

* Like ppoll() with a non-NULL sigset, we'll call cv_reltimedwait_sig()
* just to check for signals. This call will return immediately with
* either 0 (signalled) or -1 (no signal). There are some conditions
* whereby we can get 0 from cv_reltimedwait_sig() without a true
* signal (e.g., a directed stop), so we restore our signal mask in the
* unlikely event that lwp_cursig is 0.

* We are just using DP_POLL to sleep, so we don't need any of the
* devpoll apparatus. Do not check for signals if we have a zero
* timeout.

* XXX - It would be nice not to have to alloc each time, but it would
* require another per-thread structure hook. This can be implemented
* later if data suggests that it's necessary.

* If nfds is larger than twice the current maximum open file count,
* we'll silently clamp it.
* The clamp only limits our exposure to allocating an inordinate
* amount of kernel memory; it doesn't otherwise affect the semantics.
* (We have this check at twice the maximum instead of merely the
* maximum because some applications pass an nfds that is only slightly
* larger than their limit.)

* A pollwake has happened since we last polled the cache.

* Sleep until we are notified, signaled, or timed out.

/* immediate timeout; do not check signals */

* We've been kicked off of our cv because a writer wants in. We're
* going to drop our reference count and then wait until the writer is
* gone -- at which point we'll reacquire the pc_lock and call into
* dp_pcache_poll() to get the updated state.

* If we were awakened by a signal or timeout, then break the loop;
* else poll again.

* No need to search because no poll fd has been cached.

* Polling on a /dev/poll fd is not fully supported yet.

/* no error in epoll compat. mode */

* devpoll close should do enough cleanup before the pollcache is
* deleted, i.e., it should ensure that no one still references the
* pollcache afterward. There is no "permission" check in here; any
* process holding the last reference to this /dev/poll fd can close
* it.

* At this point, no other lwp can access this pollcache via the
* /dev/poll fd. This pollcache is going away, so do the cleanup
* without the pc_lock.

* pollwakeup() may still interact with this pollcache. Wait until it
* is done.