intrd.pl revision 7ff178cd8db129d385d3177eb20744d3b6efc59b
# The contents of this file are subject to the terms of the
# Common Development and Distribution License (the "License").
# You may not use this file except in compliance with the License.
# See the License for the specific language governing permissions
# and limitations under the License.
# When distributing Covered Code, include this CDDL HEADER in each
# If applicable, add the following below this CDDL HEADER, with the
# fields enclosed by brackets "[]" replaced with your own identifying
# information: Portions Copyright [yyyy] [name of copyright owner]
# Copyright (c) 2010, Oracle and/or its affiliates. All rights reserved.

my $statslen = 60;	# time period (in secs) to keep in @deltas

# Parse arguments. intrd does not accept any public arguments; the two
# arguments below are meant for testing purposes. -D generates a significant
# amount of syslog output. -S <filename> loads the filename as a perl
# script. That file is expected to implement a kstat "simulator" which
# can be used to feed information to intrd and verify intrd's responses.
	if ($_ eq "-S" && $#ARGV != -1) {
	}
	elsif ($_ eq "-D") {
	my $bad = (shift() == 0);	# $_[0] == 0 means assert failed

# What follows are the basic data structure routines of intrd.
# getstat() is responsible for reading the kstats and generating a "stat" hash.
# generate_delta() is responsible for taking two "stat" hashes and creating
# a new "delta" hash that represents what has changed over time.
# compress_deltas() is responsible for taking a list of deltas and generating
# a single delta hash that encompasses all the time periods described by the

# getstat() is handed a reference to a kstat and generates a hash, returned
# by reference, containing all the fields from the kstats which we need.
# If it returns the scalar 0, it failed to gather the kstats, and the caller
# should react accordingly.
# getstat() is also responsible for maintaining a reasonable $sleeptime.
# {"snaptime"}		kstat's snaptime
# {<cpuid>}		one hash reference per online cpu
#  ->{"tot"}		== cpu:<cpuid>:sys:cpu_nsec_{user + kernel + idle}
#  ->{"crtime"}		== cpu:<cpuid>:sys:crtime
#     ->{<cookie#>}	iterates over pci_intrs::<nexus>:cookie
#        ->{"time"}	== pci_intrs:<ivec#>:<nexus>:time (in nsec)
#        ->{"pil"}	== pci_intrs:<ivec#>:<nexus>:pil
#        ->{"crtime"}	== pci_intrs:<ivec#>:<nexus>:crtime
#        ->{"ino"}	== pci_intrs:<ivec#>:<nexus>:ino
#        ->{"num_ino"}	== num inos of single device instance sharing this entry
#			   Will be > 1 on pcplusmp X86 systems for devices
#			   with multiple MSI interrupts.
#        ->{"buspath"}	== pci_intrs:<ivec#>:<nexus>:buspath
#        ->{"name"}	== pci_intrs:<ivec#>:<nexus>:name
#        ->{"ihs"}	== pci_intrs:<ivec#>:<nexus>:ihs

	# Hash of hash which matches (MSI device, ino) combos to kstats.

	# kstats are not generated atomically. Each kstat hierarchy will
	# have been generated within the kernel at a different time. On a
	# thrashing system, we may not run quickly enough in order to get
	# coherent kstat timing information across all the kstats. To
	# determine if this is occurring, $minsnap/$maxsnap are used to
	# find the breadth between the first and last snaptime of all the
	# kstats we access. $maxsnap - $minsnap roughly represents the
	# total time taken up in getstat(). If this time approaches the
	# time between snapshots, our results may not be useful.
	$minsnap = -1;		# snaptime is always a positive number

	# Iterate over the cpus in cpu:<cpuid>::. Check
	# cpu_info:<cpuid>:cpu_info<cpuid>:state to make sure the
	# processor is "on-line". If not, it isn't accepting interrupts
	# and doesn't concern us.
	# Record cpu:<cpuid>:sys:snaptime, and check $minsnap/$maxsnap.
		#"state" fld of kstat w/
		# modname inst name-"cpuinfo0"
		return (0);	# nothing to do with 1 CPU

	# Iterate over the ivecs. If the cpu is not on-line, ignore the
	# ivecs mapped to it, if any.
	# Record pci_intrs:{inum}:<nexus>:time, snaptime, crtime, pil,
	# ino, name, and buspath. Check $minsnap/$maxsnap.
		# Perl looks beyond NULL chars in pattern matching.
		# Truncate name field at the first NULL
		my $cookie = "$intrcfg->{buspath} $intrcfg->{ino}";
		# If this new interrupt sharing $cookie represents a
		# change from an earlier getstat, make sure that
		# generate_delta will see the change by setting
		# crtime to the most recent crtime of its components.

	# All MSI interrupts of a device instance share a single MSI address.
	# On X86 systems with an APIC, this MSI address is interpreted as CPU
	# routing info by the APIC. For this reason, on these platforms, all
	# interrupts for MSI devices must be moved to the same CPU at the same
	# Since all interrupts will be on the same CPU on these platforms, all
	# interrupts can be consolidated into one ivec entry. For such devices,
	# num_ino will be > 1 to denote that a group move is needed.

	# Loop thru all MSI devices on X86 pcplusmp systems.
	# Nop on other systems.
		# Loop thru inos of the device, sorted by lowest value first
		# For each cookie found for a device, incr num_ino for the
		# lowest cookie and remove other cookies.
			# Assumes PIL is the same for first and current cookies
			# Invalidate this cookie, less complicated and
			# more efficient than deleting it.

	# We define the timerange as the amount of time spent gathering the
	# various kstats, divided by our sleeptime. If we take a lot of time
	# to access the kstats, and then we create a delta comparing these
	# kstats with a prior set of kstats, that delta will cover a
	# substantially different amount of time depending upon which
	# interrupt or CPU is being examined.
	# By checking the timerange here, we guarantee that any deltas
	# created from these kstats will contain self-consistent data,
	# in that all CPUs and interrupts cover a similar span of time.
	# $timerange_toohi is the upper bound. Any timerange above
	# this is thrown out as garbage. If the stat is safely within this
	# bound, we treat the stat as representing an instant in time, rather
	# than the time range it actually spans. We arbitrarily choose minsnap
	# as the snaptime of the stat.

# dumpdelta takes a reference to our "delta" structure:
# {"missing"}		"1" if the delta's component stats had inconsistencies
# {"minsnap"}		time of the first kstat snaptime used in this delta
# {"maxsnap"}		time of the last kstat snaptime used in this delta
# {"goodness"}		cost function applied to this delta
# {"avgintrload"}	avg of interrupt load across cpus, as a percentage
# {"avgintrnsec"}	avg number of nsec spent in interrupts, per cpu
# {<cpuid>}		iterates over on-line cpus
#  ->{"intrs"}		cpu's movable intr time (sum of "time" for each ivec)
#  ->{"tot"}		CPU load from all sources in nsec
#  ->{"bigintr"}	largest value of {ivecs}{<ivec#>}{time} from below
#  ->{"intrload"}	intrs / tot
#     ->{<ivec#>}	iterates over ivecs for this cpu
#        ->{"time"}	time used by this interrupt (in nsec)
#        ->{"pil"}	pil level of this interrupt
#        ->{"ino"}	interrupt number (or base vector if MSI group)
#        ->{"buspath"}	filename of the directory of the device's bus
#        ->{"name"}	device name
#        ->{"ihs"}	number of different handlers sharing this ino
#        ->{"num_ino"}	number of interrupt vectors in MSI group
# It prints out the delta structure in a nice, human readable display.
	syslog('debug', " avgintrload: %5.2f%% avgintrnsec: %d",
		next if !ref($cpst);	# skip non-cpuid entries
		syslog('debug', " cpu %3d intr %7.3f%% (bigintr %7.3f%%)",
		# iterate over ivecs on this cpu
			    ($ivst->{ihs} > 1 ? "$ivst->{name}($ivst->{ihs})" :
# generate_delta($stat, $newstat) takes two stat references, returned from
# getstat(), and creates a %delta. %delta (not surprisingly) contains the
# same basic info as stat and newstat, but with the timestamps as deltas
# instead of absolute times. We return a reference to the delta.
	# Take the worstcase timerange
	    "generate_delta: stats aren't ascending")) {
	# if there are a different number of cpus in the stats, set missing
	    "generate_delta: number of CPUs changed")) {
	# scan through every cpu in %newstat and compare against %stat
		# If %stat is missing a cpu from %newstat, then it was just
		# onlined. Mark missing.
		    "generate_delta: cpu $cpu changed")) {
		    "generate_delta: deltas are not ascending?")) {
		# Avoid remote chance of division by zero
		# if the number of ivecs differs, set missing
		    "generate_delta: cpu $cpu has more/less".
			# Unused cookie, corresponding to an MSI vector which
			# is part of a group. The whole group is accounted for
			# If this ivec doesn't exist in $stat, or if $stat
			# shows a different crtime, set missing.
			    "generate_delta: cpu $cpu inum $inum".
			# Create $delta{$cpu}{ivecs}{$inum}.
			# calculate time used by this interrupt
			    "generate_delta: ivec went backwards?")) {
			# Transfer over basic info about the kstat. We
			# don't have to worry about discrepancies between
			# ivec and newivec because we verified that both
			# Ewww! Hopefully just a rounding error.

# compress_delta takes a list of deltas, and returns a single new delta
# which represents the combined information from all the deltas. The deltas
# provided are assumed to be sequential in time. The resulting compressed
# delta looks just like any other delta. This new delta is also more accurate
# since its statistics are averaged over a longer period than any of the
	    "compress_deltas: list of delta is empty?")) {
		    "compressing bad deltas?")) {
			next if !ref($cpu);	# ignore non-cpu fields

# What follows are the core functions responsible for examining the deltas
# generated above and deciding what to do about them.
# goodness() and its helper goodness_cpu() return a heuristic which describes
# how good (or bad) the current interrupt balance is. The value returned will
# be between 0 and 1, with 0 representing maximum goodness, and 1 representing
# imbalanced() compares a current and historical value of goodness, and
# determines if there has been enough change to warrant evaluating a
# reconfiguration of the interrupts
# do_reconfig(), and its helpers, do_reconfig_cpu(), do_reconfig_cpu2cpu(),
# find_goal(), do_find_goal(), and move_intr(), are responsible for examining
# a delta and determining the best possible assignment of interrupts to CPUs.
# It is important that do_reconfig() be in alignment with goodness(). If
# do_reconfig were to generate a new interrupt distribution that worsened
# goodness, we could get into a pathological loop with intrd fighting itself,
# constantly deciding that things are imbalanced, and then changing things
# only to make them worse.

# any goodness over $goodness_unsafe_load is considered really bad
# goodness must drop by at least $goodness_mindelta for a reconfig

# goodness(%delta) examines a delta and returns its "goodness". goodness will
# be between 0 (best) and 1 (major bad). goodness is determined by evaluating
# the goodness of each individual cpu, and returning the worst case. This
# helps on systems with many CPUs, where a single pathological CPU
# might otherwise be ignored because the average was OK.
# To calculate the goodness of an individual CPU, we start by looking at its
# load due to interrupts. If the load is above a certain high threshold and
# there is more than one interrupt assigned to this CPU, we set goodness
# to worst-case. If the load is below the average interrupt load of all CPUs,
# then we return best-case, since what's to complain about?
# Otherwise we look at how much the load is above the average, and return
# that as the goodness, with one caveat: we never return more than the CPU's
# interrupt load ignoring its largest single interrupt source. This is
# because a CPU with one high-load interrupt, and no other interrupts, is
# perfectly balanced. Nothing can be done to improve the situation, and thus
# it is perfectly balanced even if the interrupt's load is 100%.
		next if !ref($cpu);	# skip non-cpuid fields
		    "goodness: cpu goodness out of range?")) {
			return (1);
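# Illustration only, not part of intrd: a minimal sketch of the per-CPU
# goodness rule described in the comments above goodness(). The names
# sketch_goodness_cpu and sketch_goodness, the argument layout, and the use
# of fractional loads (0..1) are assumptions made for this example; the real
# goodness_cpu() may differ in detail.

```perl
use strict;
use warnings;

# $unsafe_load is a hypothetical stand-in for $goodness_unsafe_load.
sub sketch_goodness_cpu {
	my ($intrload, $bigintrload, $nintrs, $avgintrload, $unsafe_load) = @_;

	# Saturated CPU with more than one interrupt source: worst case.
	return 1 if $intrload > $unsafe_load && $nintrs > 1;

	# At or below the average load there is nothing to complain about.
	my $goodness = $intrload - $avgintrload;
	return 0 if $goodness <= 0;

	# Never report more than the load that remains once the single
	# biggest interrupt is ignored; that is all that could be moved.
	my $load_no_bigintr = $intrload - $bigintrload;
	$goodness = $load_no_bigintr if $goodness > $load_no_bigintr;
	return $goodness;
}

# Overall goodness is the worst (largest) per-cpu value.
sub sketch_goodness {
	my ($avg, $unsafe, @cpus) = @_;
	my $worst = 0;
	foreach my $c (@cpus) {
		my $g = sketch_goodness_cpu($c->{load}, $c->{big}, $c->{n},
		    $avg, $unsafe);
		$worst = $g if $g > $worst;
	}
	return $worst;
}

my @cpus = (
	{ load => 1.0, big => 1.0, n => 1 },	# one hot interrupt, nothing to move
	{ load => 0.5, big => 0.3, n => 2 },	# two interrupts, above average
);
my $g = sketch_goodness(0.25, 0.9, @cpus);
printf("goodness %.2f\n", $g);
```

# In this sketch the first CPU scores 0 despite its 100% load, because its
# only interrupt cannot usefully be moved; the second CPU decides the result.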
			# worst case, no need to continue

	# Calculate $load_no_bigintr, which represents the load
	# due to interrupts, excluding the one biggest interrupt.
	# This is the most gain we can get on this CPU from
	# A major imbalance is indicated if a CPU is saturated
	# with interrupt handling, and it has more than one
	# source of interrupts. Those other interrupts could be
	# starved if they have a lower pil. Return a goodness of 1,
	# which is the worst possible return value,
	# which will effectively contaminate this entire delta.

# imbalanced() is used by the main routine to determine if the goodness
# has shifted far enough from our last baseline to warrant a reassignment
# of interrupts. A very high goodness indicates that a CPU is way out of
# whack. If the goodness has varied too much since the baseline, then
# perhaps a reconfiguration is worth considering.
	# Return 1 if we are pathological, or creeping away from the baseline

# do_reconfig(), do_reconfig_cpu(), and do_reconfig_cpu2cpu(), are the
# decision-making functions responsible for generating a new interrupt
# distribution. They are designed with the definition of goodness() in
# mind, i.e. they use the same definition of "good distribution" as does
# do_reconfig() is responsible for deciding whether a redistribution is
# actually warranted. If the goodness is already pretty good, it doesn't
# waste the CPU time to generate a new distribution. If it
# calculates a new distribution and finds that it is not sufficiently
# improved from the prior distribution, it will not do the redistribution,
# mainly to avoid the disruption to system performance caused by
# Its main loop works by going through a list of cpus sorted from
# highest to lowest interrupt load. It removes the highest-load cpus
# one at a time and hands them off to do_reconfig_cpu(). This function
# then re-sorts the remaining CPUs from lowest to highest interrupt load,
# and one at a time attempts to rejuggle interrupts between the original
# high-load CPU and the low-load CPU. Rejuggling on a high-load CPU is
# considered finished as soon as its interrupt load is within
# $goodness_mindelta of the average interrupt load. Such a CPU will have
# a goodness below the $goodness_mindelta threshold.

# move_intr(\%delta, $inum, $oldcpu, $newcpu)
# used by reconfiguration code to move an interrupt between cpus within
# a delta. This manipulates data structures, and does not actually move
# the interrupt on the running system.
	# Remove ivec from old cpu
	    "move_intr: intr's time > bigintr?");
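# Illustration only, not part of intrd: a minimal sketch of the bookkeeping
# that move_intr() is described as doing, i.e. moving one ivec between two
# CPUs inside a delta without touching the running system. The helper name
# sketch_move_intr and the simplified delta layout used here ({ivecs},
# {intrs}, {tot}, {intrload}, {bigintr}) are assumptions for this example;
# the real routine may update these fields differently.

```perl
use strict;
use warnings;

sub sketch_move_intr {
	my ($delta, $inum, $oldcpu, $newcpu) = @_;

	# Remove the ivec from the old cpu and attach it to the new one.
	my $ivec = delete $delta->{$oldcpu}{ivecs}{$inum};
	return 0 if !defined($ivec);	# nothing to move
	$delta->{$newcpu}{ivecs}{$inum} = $ivec;

	# Recompute the per-cpu summary fields on both sides.
	foreach my $cpu ($oldcpu, $newcpu) {
		my $cpst = $delta->{$cpu};
		my ($sum, $big) = (0, 0);
		foreach my $iv (values %{$cpst->{ivecs}}) {
			$sum += $iv->{time};
			$big = $iv->{time} if $iv->{time} > $big;
		}
		$cpst->{intrs} = $sum;
		$cpst->{bigintr} = $big;
		# Guard against the remote chance of division by zero.
		$cpst->{intrload} = $cpst->{tot} ? $sum / $cpst->{tot} : 0;
	}
	return 1;
}

my %delta = (
	0 => { tot => 1000,
	    ivecs => { 10 => { time => 400 }, 11 => { time => 200 } } },
	1 => { tot => 1000,
	    ivecs => { 20 => { time => 100 } } },
);
sketch_move_intr(\%delta, 11, 0, 1);
printf("cpu0 intrs %d, cpu1 intrs %d\n", $delta{0}{intrs}, $delta{1}{intrs});
```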
"Moved interrupts left 100+%% load on src cpu");
"Moved interrupts left 100+%% load on tgt cpu");
			$str = "$str $ivec->{inum}";

	# We can't improve goodness to better than 0. We should stop here
	# if, even if we achieve a goodness of 0, the improvement is still
	# too small to merit the action.
		syslog('debug', "goodness good enough, don't reconfig");
	syslog('notice', "Optimizing interrupt assignments");
	    "have a delta with missing")) {

	# Make a list of all cpuids, and also add some extra information
	# to the ivec structures.
		next if !ref($cpu);	# skip non-cpu entries

	# Sort the list of CPUs from highest to lowest interrupt load.
	# Remove the top CPU from that list and attempt to redistribute
	# its interrupts. If the CPU has a goodness below a threshold,
	# just ignore the CPU and move to the next one. If the CPU's
	# load falls below the average load plus that same threshold,
	# then there are no CPUs left worth reconfiguring, and we're done.
		# Re-sort cpusortlist each time, since do_reconfig_cpu can
		# move interrupts around.
			syslog('debug', "finished reconfig: cpu $cpu load ".
			    "$delta->{$cpu}{intrload} avgload ".
			    "$delta->{avgintrload}");
	# How good a job did we do? If the improvement was minimal, and
	# our goodness wasn't pathological (and thus needing any help it
	# can get), then don't bother moving the interrupts.
	    "reconfig: result has worse goodness?");
		syslog('debug', "goodness already near optimum, ".

	# Time to move those interrupts!
		syslog('warning', "Unable to move interrupts")
			syslog('debug', "Unable to move buspath ".
			    "$ivec->{buspath} ino $ivec->{ino} to ".
	syslog('notice', "Interrupt assignments optimized");
	# We have been asked to rejuggle interrupts between $oldcpuid and
	# other CPUs found on $cpusortlist so as to improve the load on
	# $oldcpuid. We reverse $cpusortlist to get our own copy of the
	# list, sorted from lowest to highest interrupt load. One at a
	# time, shift a CPU off of this list of CPUs, and attempt to
	# rejuggle interrupts between the two CPUs. Don't do this if the
	# other CPU has a higher load than oldcpuid. We're done rejuggling
	# once $oldcpuid's goodness falls below a threshold.
	syslog('debug', "reconfiguring $oldcpuid");
	while ($#cputargetlist != -1) {

	# We've been asked to consider interrupt juggling between srccpuid
	# (with a high interrupt load) and tgtcpuid (with a lower interrupt
	# load). First, make a single list with all of the ivecs from both
	# CPUs, and sort the list from highest to lowest load.
	syslog('debug', "exchanging intrs between $srccpuid and $tgtcpuid");
	# Gather together all the ivecs and sort by load
	@ivecs = sort({$b->{time} <=> $a->{time}} @ivecs);

	# Our "goal" load for srccpuid is the average load across all CPUs.
	# find_goal() will determine the optimum selection of the
	# available interrupts which comes closest to this goal without
	# falling below the goal.
	# We know that the interrupt load on tgtcpuid is less than that on
	# srccpuid, but its load could still be above avgintrnsec. Don't
	# choose a goal which would bring srccpuid below the load on tgtcpuid.
	# If the largest of the interrupts is on srccpuid, leave it there.
	# This can help minimize the disruption caused by moving interrupts.
		syslog('debug', "Keeping $ivecs[0]->{inum} on $srccpuid");
	syslog('debug', "GOAL: inums should total $goal");
	# find_goal() returned its results to us by setting $ivec->{goal} if
	# the ivec should be on srccpuid, or clearing it for tgtcpuid.
	# Call move_intr() to update our $delta with the new results.
		syslog('debug', "ivec $ivec->{inum} goal $ivec->{goal}");
		    "interrupt not currently on src or tgt cpu");
	    "cpu2cpu: new load didn't end up in expected range");
# find_goal() and its helper do_find_goal() are used to find the best
# combination of interrupts in order to generate a load that is as close
# as possible to a goal load without falling below that goal. Before returning
# to its caller, find_goal() sets a new value in the hash of each interrupt,
# {goal}, which if set signifies that this interrupt is one of the interrupts
# identified as part of the set of interrupts which best meet the goal.
# The arguments to find_goal are a list of ivecs (hash references), sorted
# by descending {time}, and the goal load. The goal is relative to {time}.
# The best fit is determined by performing a depth-first search. do_find_goal
# is the recursive subroutine which carries out the search.
# It is passed an index as an argument, originally 0. On a given invocation,
# it is only to consider interrupts in the ivecs array starting at that index.
# It then considers two possibilities:
#   1) What is the best goal-fit if I include ivecs[index]?
#   2) What is the best goal-fit if I exclude ivecs[index]?
# To determine case 1, it subtracts the load of ivecs[index] from the goal,
# and calls itself recursively with that new goal and index++.
# To determine case 2, it calls itself recursively with the same goal and
# It then compares the two results, decides which one best meets the goals,
# and returns the result. The return value is the best-fit's interrupt load,
# followed by a list of all the interrupts which make up that best-fit.
# As an optimization, a second array loads[] is created which mirrors ivecs[].
# loads[i] will equal the total loads of all ivecs[i..$#ivecs]. This is used
# by do_find_goal to avoid recursing all the way to the end of the ivecs
# array if including all remaining interrupts will still leave the best-fit
# at below goal load. If so, it then includes all remaining interrupts on
# the goal list and returns.
		@goals = ();
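# Illustration only, not part of intrd: a minimal sketch of the depth-first
# best-fit search described in the comments above find_goal(). The names
# sketch_find and sketch_find_goal are hypothetical, tie-breaking (prefer
# the "with" option on equal loads) is an assumption, and the real
# do_find_goal() may differ in detail.

```perl
use strict;
use warnings;

# Returns (load, @chosen): the subset of @$ivecs[$idx..] whose total {time}
# best fits $goal, preferring totals at or above the goal.
sub sketch_find {
	my ($ivecs, $loads, $goal, $idx) = @_;

	return (0) if $idx > $#{$ivecs};	# nothing left to choose from

	# If everything that remains still does not exceed the goal, take it all.
	if ($loads->[$idx] <= $goal) {
		return ($loads->[$idx], @{$ivecs}[$idx .. $#{$ivecs}]);
	}

	# "with": include $ivecs->[$idx]; recurse only while still below goal.
	my $t = $ivecs->[$idx]{time};
	my ($with, @with) = ($t, $ivecs->[$idx]);
	if ($t < $goal) {
		my ($sub, @rest) = sketch_find($ivecs, $loads, $goal - $t, $idx + 1);
		$with += $sub;
		push(@with, @rest);
	}

	# "without": exclude $ivecs->[$idx].
	my ($without, @without) = sketch_find($ivecs, $loads, $goal, $idx + 1);

	# Both at or above goal: take the smaller. Otherwise take the greater.
	my $use_without = ($with >= $goal && $without >= $goal) ?
	    ($without < $with) : ($without > $with);
	return $use_without ? ($without, @without) : ($with, @with);
}

sub sketch_find_goal {
	my ($ivecs, $goal) = @_;

	# loads[i] is the total {time} of ivecs[i..$#ivecs] (suffix sums).
	my (@loads, $sum);
	$sum = 0;
	for (my $i = $#{$ivecs}; $i >= 0; $i--) {
		$sum += $ivecs->[$i]{time};
		$loads[$i] = $sum;
	}

	my ($load, @goals) = sketch_find($ivecs, \@loads, $goal, 0);
	my %chosen = map { $_->{inum} => 1 } @goals;
	foreach my $ivec (@{$ivecs}) {
		$ivec->{goal} = $chosen{$ivec->{inum}} ? 1 : 0;
	}
	return $load;
}

my @ivecs = (		# must already be sorted by descending {time}
	{ inum => 1, time => 500 },
	{ inum => 2, time => 300 },
	{ inum => 3, time => 200 },
	{ inum => 4, time => 100 },
);
my $load = sketch_find_goal(\@ivecs, 600);
print "best-fit load $load:",
    join("", map { " $_->{inum}" } grep { $_->{goal} } @ivecs), "\n";
```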
		# the empty set will best meet the goal
		syslog('debug', "finding goal from intrs %s",

	# Set or clear $ivec->{goal} for each ivec, based on returned @goals
		if ($#goals > -1 && $ivec == $goals[0]) {
			syslog('debug', "inum $ivec->{inum} on source cpu");
			syslog('debug', "inum $ivec->{inum} on target cpu");

	syslog('debug', "$idx: finding goal $goal inum $ivecs->[$idx]{inum}");
	# If we include all remaining items and we're still below goal,
	# stop here. We can just return a result that includes $idx and all
	# subsequent ivecs. Since this will still be below goal, there's
	# nothing better to be done.
		    "$idx: including all remaining intrs %s with load %d",

	# Evaluate the "with" option, i.e. the best matching goal which
	# includes $ivecs->[$idx]. If idx's load is more than our goal load,
	# stop here. Once we're above the goal, there is no need to consider
	# further interrupts since they'll only take us further from the goal.
	syslog('debug', "$idx: with-load $with intrs %s",

	# Evaluate the "without" option, i.e. the best matching goal which
	# excludes $ivecs->[$idx].
	syslog('debug', "$idx: without-load $without intrs %s",

	# We now have our "with" and "without" options, and we choose which
	# best fits the goal. If one is greater than goal and the other is
	# below goal, we choose the one that is greater. If they are both
	# below goal, then we choose the one that is greater. If they are
	# both above goal, then we choose the smaller.
	my $which;		# 0 == with, 1 == without

	# Return the load of our best case scenario, followed by all the ivecs
	# which compose that goal.
	if ($which == 1) {	# without
		syslog('debug', "$idx: going without");
		syslog('debug', "$idx: going with");
syslog('debug', "intrd is starting".($debug ? " (debug)" : ""));
$SIG{INT} = sub { $gotsig = 1; };	# don't die in the middle of retargeting

# If no pci_intrs kstats were found, we need to exit, but we can't because
# SMF will restart us and/or report an error to the administrator. But
# there's nothing an administrator can do. So print out a message for SMF
# logs and silently pause forever.
	print STDERR "$cmdname: no interrupts were found; ".
	    "your PCI bus may not yet be supported\n";
# See if this is a system with a pcplusmp APIC.
# Such systems will get special handling.
# Assume that if one bus has a pcplusmp APIC that they all do.
# Get a list of pci_intrs kstats.
# Use its buspath to query the system. It is assumed that either all or none
# of the busses on a system are hosted by the pcplusmp APIC or APIX.
$stat = 0;	# prevent next gen_delta() from setting {missing}

	# 1. Sleep, update the kstats, and save the new stats in $newstat.
	exit 0 if $gotsig;		# if we got ^C / SIGTERM, exit
	exit 0 if $gotsig;		# if we got ^C / SIGTERM, exit

	# $stat or $newstat could be zero if they're uninitialized, or if
	# getstat() failed. If $stat is zero, move $newstat to $stat, sleep
	# and try again. If $newstat is zero, then we also sleep and try
	# again, hoping the problem will clear up.

	# 2. Compare $newstat with the prior set of values, result in %$delta.
	$stat = $newstat;	# The new stats now become the old stats.

	# 3. If $delta->{missing}, then there has been a reconfiguration of
	# either cpus or interrupts (probably both). We need to toss out our
	# old set of statistics and start from scratch.
	# Also, if the delta covers a very long range of time, then we've
	# been experiencing a system overload that has resulted in intrd
	# not being allowed to run effectively for a while now. As above,
	# toss our old statistics and start from scratch.
		syslog('debug', "evaluating interrupt assignments");
	# 4. Incorporate new delta into the list of deltas, and associated
	# statistics. If we've just now received $statslen deltas, then it's
	# time to evaluate a reconfiguration.

	# 5. Remove old deltas if total time is more than $statslen. We use
	# @deltas as a moving average of the last $statslen seconds. Shift
	# off the older deltas, but only if that doesn't cause us to fall
	# below $statslen seconds.

	# 6. The brains of the operation are here. First, check if we're
	# imbalanced, and if so set $do_reconfig. If $do_reconfig is set,
	# either because of imbalance or above in step 4, we evaluate a
		# First, take @deltas and generate a single "compressed" delta
		# which summarizes them all. Pass that to do_reconfig and see
		# $ret == 0 : current config is optimal (or close enough)
		# $ret == 1 : reconfiguration has occurred
		# If $ret is -1 or 1, dump all our deltas and start from scratch.
		# Step 4 above will set do_reconfig soon thereafter.
		# If $ret is 0, then nothing has happened because we're already
		# good enough. Set baseline_goodness to current goodness.
		syslog('debug', "do_reconfig FAILED!") if $ret == -1;
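# Illustration only, not part of intrd: a minimal sketch of step 5, the
# moving window over @deltas. The helper name sketch_trim_deltas is
# hypothetical, and each delta's span is assumed to be maxsnap - minsnap
# in seconds; the real main loop may track the running total differently.

```perl
use strict;
use warnings;

# Shift the oldest deltas off the front of @$deltas for as long as the
# remaining deltas still cover at least $statslen seconds.
sub sketch_trim_deltas {
	my ($deltas, $statslen) = @_;

	my $total = 0;
	$total += $_->{maxsnap} - $_->{minsnap} foreach @{$deltas};

	while (@{$deltas} > 1) {
		my $oldest = $deltas->[0]{maxsnap} - $deltas->[0]{minsnap};
		last if $total - $oldest < $statslen;	# would fall below window
		$total -= $oldest;
		shift(@{$deltas});
	}
	return $total;
}

# Eight consecutive 10-second deltas, trimmed to a 60-second window.
my @deltas = map { +{ minsnap => $_ * 10, maxsnap => $_ * 10 + 10 } } 0 .. 7;
my $covered = sketch_trim_deltas(\@deltas, 60);
printf("%d deltas kept, covering %d seconds\n", scalar(@deltas), $covered);
```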
		syslog('debug', "setting new baseline of $goodness");
	syslog('debug', "---------------------------------------");