fss.c revision d4204c85a44d2589b9afff2c81db7044e97f2d1d
 * allocation within a zone proportional to fssproj->fssp_shares
 * (project.cpu-shares); at a higher level zones compete with each other,
 * receiving allocation in a pset proportional to fsszone->fssz_shares
 * (zone.cpu-shares).  See fss_decay_usage() for the precise formula.
 */

/*
 * Module linkage information for the kernel.
 */

/*
 * The fssproc_t structures are kept in an array of circular doubly linked
 * lists.  A hash on the thread pointer is used to determine which list each
 * thread should be placed in.  Each list has a dummy "head" which is never
 * removed, so the list is never empty.  fss_update traverses these lists to
 * update the priorities of threads that have been waiting on the run queue.
 */
#define	FSS_LISTS	16	/* number of lists, must be power of 2 */
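The hash on the thread pointer described above can be sketched as follows. The helper name and shift amount are assumptions for illustration only; the point is that masking with (FSS_LISTS - 1) is a cheap modulo, which is why FSS_LISTS must be a power of 2.

```c
#include <assert.h>
#include <stdint.h>

#define	FSS_LISTS	16	/* number of lists, must be power of 2 */

/*
 * Hypothetical sketch of hashing a thread pointer to a list index.
 * Shifting first discards low-order bits that are identical for all
 * thread pointers due to allocation alignment; the mask then selects
 * one of FSS_LISTS buckets without an integer divide.
 */
static int
fss_list_hash_sketch(const void *tp)
{
	return ((int)(((uintptr_t)tp >> 9) & (FSS_LISTS - 1)));
}
```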
#define	FSS_TICK_COST	1000	/* tick cost for threads with nice level = 0 */

/*
 * Decay rate percentages are based on n/128 rather than n/100 so that
 * calculations can avoid having to do an integer divide by 100 (divide
 * by FSS_DECAY_BASE == 128 optimizes to an arithmetic shift).
 *
 *	FSS_DECAY_MIN	=  83/128 ~= 65%
 *	FSS_DECAY_MAX	= 108/128 ~= 85%
 *	FSS_DECAY_USG	=  96/128 ~= 75%
 */
#define	FSS_DECAY_MIN	83	/* fsspri decay pct for threads w/ nice -20 */
#define	FSS_DECAY_MAX	108	/* fsspri decay pct for threads w/ nice +19 */
#define	FSS_DECAY_USG	96	/* usage decay pct for projects */
#define	FSS_DECAY_BASE	128	/* base percentage for decay calculations */

	for (i = 0; i < cnt; i++)
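A minimal sketch of why the n/128 base matters: with FSS_DECAY_BASE == 128, the divide compiles to an arithmetic shift right by 7. Only the constants come from the defines above; the helper name is an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define	FSS_DECAY_BASE	128
#define	FSS_DECAY_USG	96	/* 96/128 ~= 75% */

/*
 * Decay a usage value to roughly 75% of its previous value.  Because
 * FSS_DECAY_BASE is a power of 2, "/ FSS_DECAY_BASE" optimizes to a
 * shift rather than an integer divide.  Function name is hypothetical.
 */
static uint32_t
decay_usage_sketch(uint32_t usage)
{
	return (usage * FSS_DECAY_USG / FSS_DECAY_BASE);
}
```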
/*
 * Search for the cpupart pointer in the array of fsspsets.
 * If we didn't find anything, then use the first
 * available slot in the fsspsets array.
 */

/*
 * The following routine returns a pointer to the fsszone structure which
 * belongs to zone "zone" and cpu partition fsspset, if such structure exists.
 */
	/*
	 * There are zones active on this cpu partition
	 * already.  Try to find our zone among them.
	 */

/*
 * The following routine links new fsszone structure into doubly linked list of
 * zones active on the specified cpu partition.
 */
	/*
	 * This will be the first fsszone for this fsspset
	 */
	/*
	 * Insert this fsszone to the doubly linked list.
	 */

/*
 * The following routine removes a single fsszone structure from the doubly
 * linked list of zones active on the specified cpu partition.  Note that
 * global fsspsets_lock must be held in case this fsszone structure is the last
 * on the above mentioned list.  Also note that the fsszone structure is not
 * freed here, it is the responsibility of the caller to call kmem_free for it.
 */
	/*
	 * This is not the last zone in the list.
	 */
	/*
	 * This was the last zone active in this cpu partition.
	 */

/*
 * The following routine returns a pointer to the fssproj structure
 * which belongs to project kpj and cpu partition fsspset, if such structure
 * exists.
 */
	/*
	 * There are projects running on this cpu partition already.
	 * Try to find our project among them.
	 */

/*
 * The following routine links new fssproj structure into doubly linked list
 * of projects running on the specified cpu partition.
 */
	/*
	 * This will be the first fssproj for this fsspset
	 */
	/*
	 * Insert this fssproj to the doubly linked list.
	 */

/*
 * The following routine removes a single fssproj structure from the doubly
 * linked list of projects running on the specified cpu partition.  Note that
 * global fsspsets_lock must be held in case this fssproj structure is the
 * last on the above mentioned list.  Also note that the fssproj structure is
 * not freed here, it is the responsibility of the caller to call kmem_free
 * for it.
 */
	/*
	 * This is not the last part in the list.
	 */
	/*
	 * This was the last project part running on this cpu partition.
	 */

	if (fssproj == NULL)	/* if this thread already exited */

	if (fssproj == NULL)	/* if this thread already exited */

/*
 * Fair share scheduler initialization.  Called by dispinit() at boot time.
 * We can ignore clparmsz argument since we know that the smallest possible
 * parameter buffer is big enough for us.
 */
	/*
	 * Initialize the fssproc hash table.
	 */
	/*
	 * Fill in fss_nice_tick and fss_nice_decay arrays:
	 * The cost of a tick is lower at positive nice values (so that it
	 * will not increase its project's usage as much as normal) with 50%
	 * drop at the maximum level and 50% increase at the minimum level.
	 * The fsspri decay is slower at positive nice values.  fsspri values
	 * of processes with negative nice levels must decay faster to receive
	 * time slices more frequently than normal.
	 */

/*
 * Calculate the new cpupri based on the usage, the number of shares and
 * the number of active threads.  Reset the tick counter for this thread.
 */
	/*
	 * No need to change priority of exited threads.
	 */
	/*
	 * Special case: threads with no shares.
	 */
	/*
	 * fsspri += shusage * nrunnable * ticks
	 */
	/*
	 * The general priority formula:
	 *
	 *			(fsspri * umdprirange)
	 *	pri = maxumdpri - ------------------------
	 *				maxfsspri
	 *
	 * If this thread's fsspri is greater than the previous largest
	 * fsspri, then record it as the new high and priority for this
	 * thread will be one (the lowest priority assigned to a thread
	 * that has non-zero shares).
	 * Note that this formula cannot produce out of bounds priority
	 * values; if it is changed, additional checks may need to be
	 * added.
	 */

/*
 * Decays usages of all running projects and resets their tick counters.
 * Called once per second from fss_update() after updating priorities.
 */
	/*
	 * Go through all active processor sets and decay usages of projects
	 * running on them.
	 */
		/*
		 * Decay maxfsspri for this cpu partition with the
		 * fastest possible decay rate.
		 */
		/*
		 * Decay usage for each project running on
		 * this cpu partition.
		 */
		/*
		 * Readjust the project's number of shares if it has
		 * changed since we checked it last time.
		 */
		/*
		 * Readjust the zone's number of shares if it
		 * has changed since we checked it last time.
		 */
		/*
		 * Calculate fssp_shusage value to be used
		 * for fsspri increments for the next second.
		 */
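The general priority formula above can be sketched as plain integer arithmetic. All names here (fsspri_t, umdprirange, and the function itself) are assumptions mirroring the comment, not the file's actual code.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t fsspri_t;	/* assumed width for this sketch */

/*
 * Sketch of the priority formula:
 *
 *			    (fsspri * umdprirange)
 *	pri = maxumdpri - ------------------------
 *				maxfsspri
 *
 * A thread whose fsspri reaches maxfsspri gets priority 1, the lowest
 * priority assigned to a thread with non-zero shares; a thread with
 * zero fsspri gets maxumdpri.
 */
static int
fss_newpri_sketch(fsspri_t fsspri, fsspri_t maxfsspri,
    int maxumdpri, int umdprirange)
{
	if (fsspri >= maxfsspri)
		return (1);	/* new high; lowest non-zero-share priority */
	return (maxumdpri - (int)((fsspri * umdprirange) / maxfsspri));
}
```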
		/*
		 * Project 0 in the global zone has 50%
		 * of its zone.
		 */
		/*
		 * Thread's priority is based on its project's
		 * normalized usage (shusage) value which gets
		 * calculated this way:
		 *
		 *	   pset_shares^2    zone_int_shares^2
		 * usage * ------------- * ------------------
		 *	   kpj_shares^2     zone_ext_shares^2
		 *
		 * Where zone_int_shares is the sum of shares
		 * of all active projects within the zone (and
		 * the pset), and zone_ext_shares is the number
		 * of zone shares (ie, zone.cpu-shares).
		 *
		 * If there is only one zone active on the pset
		 * the above reduces to:
		 *
		 *			zone_int_shares^2
		 * shusage = usage * ---------------------
		 *			 kpj_shares^2
		 *
		 * If there's only one project active in the
		 * zone this formula reduces to:
		 *
		 *			 pset_shares^2
		 * shusage = usage * ----------------------
		 *			zone_ext_shares^2
		 */

	/*
	 * curthread is always onproc
	 */
	/*
	 * When the priority of a thread is changed, it may be
	 * necessary to adjust its position on a sleep queue or
	 * dispatch queue.  The function thread_change_pri accomplishes
	 * this.
	 */
	/*
	 * The thread was on a run queue.
	 */

/*
 * Update priorities of all fair-sharing threads that are currently runnable
 * at a user mode priority based on the number of shares and current usage.
 * Called once per second via timeout which we reset here.
 *
 * There are several lists of fair-sharing threads broken up by a hash on the
 * thread pointer.  Each list has its own lock.  This avoids blocking all
 * fss_enterclass, fss_fork, and fss_exitclass operations while fss_update runs.
 * fss_update traverses each list in turn.
 */
	/*
	 * Decay and update usages for all projects.
	 */
	/*
	 * Start with the fss_update_marker list, then do the rest.
	 */
	/*
	 * Go around all threads, set new priorities and decay
	 * their per-thread CPU usages.
	 */
		/*
		 * If this is the first list after the current marker to have
		 * threads with priorities updated, advance the marker to this
		 * list for the next time fss_update runs.
		 */
	/*
	 * Advance marker for the next fss_update call
	 */

/*
 * Updates priority for a list of threads.  Returns 1 if the priority of one
 * of the threads was actually updated, 0 if none were for various reasons
 * (thread is no longer in the FSS class, is not runnable, has the preemption
 * control no-preempt bit set, etc.)
 */
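The normalization described above can be sketched as follows. The names are assumptions taken from the comment; the real computation also scales intermediate values to avoid overflow, which this sketch omits.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of the shusage normalization:
 *
 *	          pset_shares^2     zone_int_shares^2
 *	usage  *  -------------  *  ------------------
 *	          kpj_shares^2      zone_ext_shares^2
 *
 * Overflow-prone for large inputs; illustrative only.
 */
static uint64_t
fss_shusage_sketch(uint64_t usage, uint64_t kpj_shares,
    uint64_t pset_shares, uint64_t zone_int_shares, uint64_t zone_ext_shares)
{
	/* project's usage scaled against its zone's other projects */
	usage = usage * pset_shares * pset_shares /
	    (kpj_shares * kpj_shares);
	/* then scaled against the zone's external share allocation */
	usage = usage * zone_int_shares * zone_int_shares /
	    (zone_ext_shares * zone_ext_shares);
	return (usage);
}
```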
	/*
	 * Lock the thread and verify the state.
	 */
	/*
	 * Skip the thread if it is no longer in the FSS class or
	 * is running with kernel mode priority.
	 */
	/*
	 * Only dequeue the thread if it needs to be moved; otherwise
	 * it should just round-robin here.
	 */

/*
 * Check validity of parameters.
 */
	/*
	 * FSS_NOCHANGE (-32768) is outside of the range of values for
	 * fss_uprilim and fss_upri.  If the structure fssparms_t is changed,
	 * FSS_NOCHANGE should be replaced by a flag word.
	 */

/*
 * Get the varargs parameter and check validity of parameters.
 */
	/*
	 * Use default parameters.
	 */

/*
 * Copy all selected fair-sharing class parameters to the user.  The parameters
 * are specified by a key.
 */

/*
 * Return the user mode scheduling priority range.
 */

	/*
	 * Only root can move threads to FSS class.
	 */
	/*
	 * Initialize the fssproc structure.
	 */
	/*
	 * Set the user priority to the requested value or
	 * the upri limit, whichever is lower.
	 */
	/*
	 * Put a lock on our fsspset structure.
	 */
	/*
	 * Reset priority.  Process goes to a "user mode" priority here
	 * regardless of whether or not it has slept since entering the kernel.
	 */
	/*
	 * Link new structure into fssproc list.
	 */
	/*
	 * If this is the first fair-sharing thread to occur since boot,
	 * we set up the initial call to fss_update() here.  Use an atomic
	 * compare-and-swap since that's easier and faster than a mutex
	 * (but check with an ordinary load first since most of the time
	 * this will already be done).
	 */

	/*
	 * Remove fssproc_t from the list.
	 */
	/*
	 * We should be either getting this thread off the deathrow or
	 * this thread has already moved to another scheduling class and
	 * we're being called with its old cldata buffer pointer.  In both
	 * cases, the content of this buffer cannot be changed while we're
	 * using it.
	 */
	/*
	 * We're being called as a result of the priocntl() system
	 * call -- someone is trying to move our thread to another
	 * scheduling class.  We can't call fss_inactive() here
	 * because our thread's t_cldata pointer already points
	 * to another scheduling class specific data.
	 */
	/*
	 * We're being called from thread_free() when our thread
	 * is removed from the deathrow.  There is nothing we need
	 * to do here since everything should've been done earlier
	 * in fss_exit().
	 */

/*
 * A thread is allowed to exit FSS only if we have sufficient
 * privileges.
 */

/*
 * Initialize fair-share class specific proc structure for a child.
 */
	/*
	 * Initialize child's fssproc structure.
	 */
	/*
	 * Link new structure into fssproc hash table.
	 */

/*
 * Child is placed at back of dispatcher queue and parent gives up processor
 * so that the child runs first after the fork.  This allows the child
 * immediately execing to break the multiple use of copy on write pages with no
 * disk home.  The parent will get to steal them back rather than uselessly
 * copying them.
 */
	/*
	 * Grab the child's p_lock before dropping pidlock to ensure the
	 * process does not disappear before we set it running.
	 */
	/*
	 * We don't want to call fss_setrun(t) here because it may call
	 * fss_active, which we don't need.
	 */

/*
 * Get the fair-sharing parameters of the thread pointed to by fssprocp into
 * the buffer pointed by fssparmsp.
 */

	/*
	 * Make sure the user priority doesn't exceed the upri limit.
	 */
	/*
	 * Basic permissions enforced by generic kernel code for all classes
	 * require that a thread attempting to change the scheduling parameters
	 * of a target thread be privileged or have a real or effective UID
	 * matching that of the target thread.  We are not called unless these
	 * basic permission checks have already passed.  The fair-sharing class
	 * requires in addition that the calling thread be privileged if it
	 * is attempting to raise the upri limit above its current value.
	 * This may have been checked previously but if our caller passed us
	 * a non-NULL credential pointer we assume it hasn't and we check it
	 * here.
	 */
	/*
	 * Set fss_nice to the nice value corresponding to the user priority we
	 * are setting.  Note that setting the nice field of the parameter
	 * struct won't affect upri or nice.
	 */

/*
 * The thread is being stopped.
 */
/*
 * The current thread is exiting, do necessary adjustments to its project
 * and pset structures.
 */
	/*
	 * Thread t here is either a current thread (in which case we hold
	 * its process' p_lock), or a thread being destroyed by forklwp_fail(),
	 * in which case we hold pidlock and thread is no longer on the
	 * thread list.
	 */
	/*
	 * A thread could be exiting in between clock ticks, so we need to
	 * calculate how much CPU time it used since it was charged last time.
	 *
	 * CPU caps are not enforced on exiting processes - it is usually
	 * desirable to exit as soon as possible to free resources.
	 */

/*
 * fss_swapin() returns -1 if the thread is loaded or is not eligible to be
 * swapped in.  Otherwise, it returns the thread's effective priority based
 * on swapout time and size of process (0 <= epri <= SHRT_MAX).
 */
	/*
	 * Threads which have been out for a long time,
	 * have high user mode priority and are associated
	 * with a small address space are more deserving.
	 */
	/*
	 * Scale epri so that SHRT_MAX / 2 represents zero priority.
	 */

/*
 * fss_swapout() returns -1 if the thread isn't loaded or is not eligible to
 * be swapped out.  Otherwise, it returns the thread's effective priority
 * based on if the swapper is in softswap or hardswap mode.
 */
	/*
	 * Scale epri so that SHRT_MAX / 2 represents zero priority.
	 */

/*
 * If thread is currently at a kernel mode priority (has slept) and is
 * returning to the userland we assign it the appropriate user mode priority
 * and time quantum here.  If we're lowering the thread's priority below that
 * of other runnable threads then we will set runrun via cpu_surrender() to
 * cause preemption.
 */
	/*
	 * If thread has blocked in the kernel
	 */
	/*
	 * Swapout lwp if the swapper is waiting for this thread to reach
	 * a safe point.
	 */

/*
 * Arrange for thread to be placed in appropriate location on dispatcher queue.
 * This is called with the current thread in TS_ONPROC and locked.
 */
	/*
	 * If preempted in the kernel, make sure the thread has a kernel
	 * priority if needed.
	 */
		t->t_trapret = 1;	/* so that fss_trapret will run */
	/*
	 * This thread may be placed on wait queue by CPU Caps.  In this case we
	 * do not need to do anything until it is removed from the wait queue.
	 * Do not enforce CPU caps on threads running at a kernel priority.
	 */
	/*
	 * If preempted in user-land mark the thread as swappable because it
	 * cannot be holding any kernel locks.
	 */
	/*
	 * Check to see if we're doing "preemption control" here.  If
	 * we are, and if the user has requested that this thread not
	 * be preempted, and if preemptions haven't been put off for
	 * too long, let the preemption happen here but try to make
	 * sure the thread is rescheduled as soon as possible.  We do
	 * this by putting it on the front of the highest priority run
	 * queue in the FSS class.  If the preemption has been put off
	 * for too long, clear the "nopreempt" bit and let the thread
	 * be preempted.
	 */
		/*
		 * If not already remembered, remember current
		 * priority for restoration in fss_yield().
		 */
		/*
		 * Fall through and be preempted below.
		 */

/*
 * Called when a thread is waking up and is to be placed on the run queue.
 */
	/*
	 * If previously were running at the kernel priority then keep that
	 * priority and the fss_timeleft doesn't matter.
	 */

/*
 * Prepare thread for sleep.  We reset the thread priority so it will run at the
 * kernel priority level when it wakes up.
 */
	/*
	 * Account for time spent on CPU before going to sleep.
	 */
	/*
	 * Assign a system priority to the thread and arrange for it to be
	 * retained when the thread is next placed on the run queue (i.e.,
	 * when it wakes up) instead of being given a new pri.  Also arrange
	 * for trapret processing as the thread leaves the system call so it
	 * will drop back to normal priority range.
	 */
		t->t_trapret = 1;	/* so that fss_trapret will run */
	/*
	 * The thread has done a THREAD_KPRI_REQUEST(), slept, then
	 * done THREAD_KPRI_RELEASE() (so t_kpri_req is 0 again),
	 * then slept again all without finishing the current system
	 * call so trapret won't have cleared FSSKPRI
	 */

/*
 * A tick interrupt has occurred on a running thread.  Check to see if our
 * time slice has expired.  We must also clear the TS_DONT_SWAP flag in
 * t_schedflag if the thread is eligible to be swapped out.
 */
	/*
	 * It's safe to access fsspset and fssproj structures because we're
	 * holding our p_lock here.
	 */
	/*
	 * Keep track of thread's project CPU usage.  Note that projects
	 * get charged even when threads are running in the kernel.
	 * Do not surrender CPU if running in the SYS class.
	 */
	/*
	 * A thread's execution time for threads running in the SYS class
	 */
	/*
	 * If thread is not in kernel mode, decrement its fss_timeleft
	 */
	/*
	 * If we're doing preemption control and trying to
	 * avoid preempting this thread, just note that the
	 * thread should yield soon and let it keep running
	 * (unless it's been a while).
	 */
	/*
	 * When the priority of a thread is changed, it may
	 * be necessary to adjust its position on a sleep queue
	 * or dispatch queue.  The function thread_change_pri
	 * accomplishes this.
	 */
	/*
	 * If there is a higher-priority thread which is
	 * waiting for a processor, then thread surrenders
	 * the processor.
	 */
	/*
	 * The thread used more than half of its quantum, so assume that
	 * it used the whole quantum.
	 *
	 * Update thread's priority just before putting it on the wait
	 * queue so that it gets charged for the CPU time from its
	 * quantum even before that quantum expires.
	 */
	/*
	 * We need to call cpu_surrender for this thread due to cpucaps
	 * enforcement, but fss_change_priority may have already done
	 * so.  In this case FSSBACKQ is set and there is no need to call
	 * cpu_surrender again.
	 */

/*
 * Processes waking up go to the back of their queue.  We don't need to assign
 * a time quantum here because thread is still at a kernel mode priority and
 * the time slicing is not done for threads running in the kernel after
 * sleeping.  The proper time quantum will be assigned by fss_trapret before the
 * thread returns to user mode.
 */
	/*
	 * If we already have a kernel priority assigned, then we
	 * just use it.
	 */
	/*
	 * Give thread a priority boost if we were asked.
	 */
		t->t_trapret = 1;	/* so that fss_trapret will run */
	/*
	 * Otherwise, we recalculate the priority.
	 */

/*
 * fss_donice() is called when a nice(1) command is issued on the thread to
 * alter the priority.  The nice(1) command exists in Solaris for compatibility.
 * Thread priority adjustments should be done via priocntl(1).
 */
	/*
	 * If there is no change to priority, just return current setting.
	 */
	/*
	 * Specifying a nice increment greater than the upper limit of
	 * FSS_NICE_MAX (== 2 * NZERO - 1) will result in the thread's nice
	 * value being set to the upper limit.  We check for this before
	 * computing the new value because otherwise we could get overflow
	 * if a privileged user specified some ridiculous increment.
	 */
	/*
	 * Reset the uprilim and upri values of the thread.
	 */
	/*
	 * Although fss_parmsset already reset fss_nice it may not have been
	 * set to precisely the value calculated above because fss_parmsset
	 * determines the nice value from the user priority and we may have
	 * truncated during the integer conversion from nice value to user
	 * priority and back.  We reset fss_nice to the value we calculated
	 * above.
	 */

/*
 * Increment the priority of the specified thread by incr and
 * return the new value in *retvalp.
 */
	/*
	 * If there is no change to priority, just return current setting.
	 */
	/*
	 * Reset the uprilim and upri values of the thread.
	 */

/*
 * Return the global scheduling priority that would be assigned to a thread
 * entering the fair-sharing class with the fss_upri.
 */

/*
 * Called from the yield(2) system call when a thread is yielding (surrendering)
 * the processor.  The kernel thread is placed at the back of a dispatch queue.
 */
	/*
	 * Collect CPU usage spent before yielding
	 */
	/*
	 * Clear the preemption control "yield" bit since the user is
	 * doing a yield.
	 */
	/*
	 * If fss_preempt() artificially increased the thread's priority
	 * to avoid preemption, restore the original priority now.
	 */
	/*
	 * Time slice was artificially extended to avoid preemption,
	 * so pretend we're preempting it now.
	 */
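The overflow guard described for fss_donice() can be sketched like this. NZERO is assumed to be 20 (the traditional Solaris value) and the helper is hypothetical; the idea is to clamp the increment before adding so a ridiculous privileged value cannot overflow.

```c
#include <assert.h>

#define	NZERO		20		/* assumed default nice offset */
#define	FSS_NICE_MIN	0
#define	FSS_NICE_MAX	(2 * NZERO - 1)	/* 39 */

/*
 * Clamp the increment first, then clamp the resulting nice value
 * into [FSS_NICE_MIN, FSS_NICE_MAX].
 */
static int
fss_nice_clamp_sketch(int nice, int incr)
{
	int newnice;

	if (incr > FSS_NICE_MAX)
		incr = FSS_NICE_MAX;	/* prevents overflow on the add */
	newnice = nice + incr;
	if (newnice > FSS_NICE_MAX)
		newnice = FSS_NICE_MAX;
	else if (newnice < FSS_NICE_MIN)
		newnice = FSS_NICE_MIN;
	return (newnice);
}
```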
		/*
		 * If the zone for the new project is not currently active on
		 * the cpu partition we're on, get one of the pre-allocated
		 * buffers and link it in our per-pset zone list.  Such buffers
		 * should already exist.
		 */
		/*
		 * If our new project is not currently running
		 * on the cpu partition we're on, get one of the
		 * pre-allocated buffers and link it in our new cpu
		 * partition doubly linked list.  Such buffers should already
		 * exist.
		 */