mac_datapath_setup.c revision 5adf34bd96da7c794241c78b641902875edb690a
* account mac_soft_ring_enable, mac_rx_soft_ring_count and * mac_rx_soft_ring_10gig_count to determine the soft ring count for a link. * If a bandwidth is specified, the determination of the number of soft * rings is based on specified bandwidth, CPU speed and number of CPUs in * Every Tx and Rx mac_soft_ring_set_t (mac_srs) created gets added * to mac_srs_g_list and mac_srs_g_lock protects mac_srs_g_list. The * list is used to walk the list of all MAC threads when a CPU is * coming online or going offline. * Whether the SRS threads should be bound, or not. * Whether Rx/Tx interrupts should be re-targeted. Disabled by default. * dladm command would override this. * If cpu bindings are specified by user, then Tx SRS and its soft * rings should also be bound to the CPUs specified by user. The * CPUs for Tx bindings are at the end of the cpu list provided by * the user. If enough CPUs are not available (for Tx and Rx * SRSes), then the CPUs are shared by both Tx and Rx SRSes. * Re-targeting is allowed only for exclusive group or for primary. /* INIT and FINI ROUTINES */ * DR. The callbacks from DR are called with cpu_lock held, and hence * can't wait to grab the mac perimeter. The soft ring list is hence * protected for read access by srs_lock. Changing the soft ring list * needs the mac perimeter and the srs_lock. /* POLLING SETUP AND TEAR DOWN ROUTINES */ * mac_srs_client_poll_quiesce and mac_srs_client_poll_restart * These routines are used to call back into the upper layer * (primarily TCP squeue) to stop polling the soft rings or * Register the given SRS and associated soft rings with the consumer and * enable the polling interface used by the consumer.(i.e IP) over this * SRS and associated soft rings. * A SRS is capable of acting as a soft ring for cases * where no fanout is needed. This is the case for userland * TCP and UDP support DLS bypass. Squeue polling * support implies DLS bypass since the squeue poll * path does not have DLS processing. 
* Non-TCP protocols don't support squeues. Hence we don't * make any ring addition callbacks for non-TCP rings * Unregister the given SRS and associated soft rings with the consumer and * disable the polling interface used by the consumer.(i.e IP) over this * SRS and associated soft rings. * A SRS is capable of acting as a soft ring for cases * where no protocol fanout is needed. This is the case * for userland flows. Nothing to do here. * DLS bypass is now disabled in the case of both TCP and UDP. * Reset the soft ring callbacks to the standard 'mac_rx_deliver' * callback. In addition, in the case of TCP, invoke IP's callback * Enable or disable poll capability of the SRS on the underlying Rx ring. * There is a need to enable or disable the poll capability of an SRS over an * Rx ring depending on the number of mac clients sharing the ring and also * whether user flows are configured on it. However the poll state is actively * manipulated by the SRS worker and poll threads and uncoordinated changes by * yet another thread to the underlying capability can surprise them leading * to assert failures. Instead we quiesce the SRS, make the changes and then /* CPU RECONFIGURATION AND FANOUT COMPUTATION ROUTINES */ * Return the next CPU to be used to bind a MAC kernel thread. * If a cpupart is specified, the cpu chosen must be from that * mac_compute_soft_ring_count(): * This routine computes the number of soft rings needed to handle incoming * load given a flow_entry. * The routine does the following: * 1) soft rings will be created if mac_soft_ring_enable is set. * 2) If the underlying link is a 10Gbps link, then soft rings will be * created even if mac_soft_ring_enable is not set. The number of soft * rings, so created, will equal mac_rx_soft_ring_10gig_count. 
* 3) On a sun4v platform (i.e., mac_soft_ring_enable is set), 2 times the * mac_rx_soft_ring_10gig_count number of soft rings will be created for a * If a bandwidth limit is specified, the number that gets computed is * dependent upon CPU speed, the number of Rx rings configured, and * If more Rx rings are available, less number of soft rings is needed. * mac_use_bw_heuristic is another "hidden" variable that can be used to * override the default use of soft ring count computation. Depending upon * the usefulness of it, mac_use_bw_heuristic can later be made into a * data-link property or removed altogether. * TODO: Cleanup and tighten some of the assumptions. /* No bandwidth enabled */ /* Is this a 10Gig link? */ /* This is a 10Gig link */ * Use 2 times mac_rx_soft_ring_10gig_count for * Soft ring computation using CPU speed and specified /* Assumption: all CPUs have the same frequency */ /* cpu_speed is in MHz; make bw in units of Mbps. */ * bw is greater than or equal to 1Gbps. * The number of soft rings required is a function * of bandwidth and CPU speed. To keep this simple, * let's use this rule: 1GHz CPU can handle 1Gbps. * If bw is less than 1 Gbps, then there is no need * for soft rings. Assumption is that CPU speeds * (on modern systems) are at least 1GHz. * Give at least 2 soft rings * If the flent has multiple Rx SRSs, then each SRS need not * have that many soft rings on top of it. The number of * soft rings for each Rx SRS is found by dividing srings by * Fanning out to 1 soft ring is not very useful. * Set it as well to 0 and mac_srs_fanout_init() * will take care of creating a single soft ring /* Do some more massaging */ * set up CPUs for Tx interrupt re-targeting and Tx worker * Bind interrupt to the next CPU available * and leave the worker unbound. /* Tx mac_ring_handle_t is stored in st_arg2 */ * Assignment of user specified CPUs to a link. 
 * Minimum CPUs required to get an optimal assignment:
 * For each Rx SRS, at least two CPUs are needed if mac_latency_optimize
 * flag is set -- one for polling, one for fanout soft ring.
 * If mac_latency_optimize is not set, then 3 CPUs are needed -- one
 * for polling, one for SRS worker thread and one for fanout soft ring.
 * The number of CPUs needed for the Tx side is equal to the number of Tx rings
 * mac_flow_user_cpu_init() categorizes the CPU assignment depending
 * upon the number of CPUs in 3 different buckets.
 * In the first bucket, the most optimal case is handled. The user has
 * passed enough CPUs and every thread gets its own CPU.
 * The second and third are the sub-optimal cases. Enough CPUs are not
 * The second bucket handles the case where at least one distinct CPU
 * is available for each of the Rx rings (Rx SRSes) and Tx rings (Tx
 * In the third case (worst case scenario), specified CPU count is less
 * than the Rx rings configured for the link. In this case, we round
 * robin the CPUs among the Rx SRSes and Tx SRS/soft rings.
 * The check for nbc_ncpus to be within limits for
 * the user specified case was done earlier and if
 * not within limits, an error would have been
 * interrupt has been re-targeted. Poll
 * thread needs to be bound to interrupt
 * Find where in the list is the intr
 * CPU and swap it with the first one.
 * We will be using the first CPU in the
 * The number of CPUs that each Rx ring needs is dependent
 * upon mac_latency_optimize flag.
 * 1) If set, at least 2 CPUs are needed -- one for
 * polling, one for fanout soft ring.
 * 2) If not set, then at least 3 CPUs are needed -- one
 * for polling, one for srs worker thread, and one for
/* How many CPUs are needed for Tx side? */
/* CPUs needed for Rx SRSes poll and worker threads */
/* Has the user provided enough CPUs? */
 * Best case scenario. There are enough CPUs. All
 * Rx rings will get their own set of CPUs plus
 * Tx soft rings will get their own.
* fanout_cpu_cnt is the number of CPUs available * for Rx side fanout soft rings. * Divide fanout_cpu_cnt by rx_srs_cnt to find * out how many fanout soft rings each Rx SRS /* fanout_cnt_per_srs should not be > MAX_SR_FANOUT */ /* Do the assignment for the default Rx ring */ /* Retarget the interrupt to the same CPU as the poll */ /* Do the assignment for h/w Rx SRSes */ /* The first CPU in the list is the intr CPU */ * We have the following information: * no_of_cpus - no. of cpus that user passed. * rx_srs_cnt - no. of rx rings. * reqd_rx_cpu_cnt = mac_latency_optimize?rx_srs_cnt*2:rx_srs_cnt*3 * reqd_tx_cpu_cnt - no. of cpus reqd. for Tx side. * reqd_poll_worker_cnt = mac_latency_optimize?rx_srs_cnt:rx_srs_cnt*2 * If we bind the Rx fanout soft rings to the same CPUs * If mac_latency_optimize is not set, are there * enough CPUs to assign a CPU for worker also? * Zero'th Rx SRS is the default Rx ring. It is not * associated with h/w Rx ring. /* Retarget the interrupt to the same CPU as the poll */ /* Do CPU bindings for SRSes having h/w Rx rings */ * Real sub-optimal case. Not enough CPUs for poll and * Tx soft rings. Do a round robin assignment where * each Rx SRS will get the same CPU for poll, worker /* Retarget the interrupt to the same CPU as the poll */ * Copy the user specified CPUs to the effective CPUs * Each SRS has a mac_cpu_t structure, srs_cpu. This routine fills in * the CPU binding information in srs_cpu for all Rx SRSes associated * The maximum number of CPUs available can either be * the number of CPUs in the pool or the number of CPUs * Compute the number of soft rings needed on top for each Rx * SRS. "rx_srs_cnt-1" indicates the number of Rx SRS * associated with h/w Rx rings. Soft ring count needed for * each h/w Rx SRS is computed and the same is applied to * software classified Rx SRS. The first Rx SRS in fe_rx_srs[] * is the software classified Rx SRS. 
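The three-bucket categorization described for mac_flow_user_cpu_init() can be sketched as a small standalone function. This is only an illustration under stated assumptions: the enum, the function names and the parameter list are hypothetical, and the threshold for the second bucket (one distinct CPU per Rx SRS plus one per Tx ring) is a reading of the comment, not something confirmed by the code. Anything that fits neither of the first two buckets is assumed to fall through to round robin.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the three cases named in the comment. */
typedef enum {
	CPU_BUCKET_OPTIMAL,	/* every thread gets its own CPU */
	CPU_BUCKET_SHARED,	/* one distinct CPU per Rx SRS and Tx ring */
	CPU_BUCKET_ROUND_ROBIN	/* CPUs are handed out round robin */
} cpu_bucket_t;

/*
 * CPUs one Rx SRS needs: poll + fanout soft ring when latency
 * optimization is on, plus one for the SRS worker when it is off.
 */
unsigned
rx_cpus_per_srs(bool latency_optimize)
{
	return (latency_optimize ? 2 : 3);
}

/*
 * Pick a bucket from the user-supplied CPU count. reqd_tx_cpu_cnt is
 * the number of Tx rings, as the comment states.
 */
cpu_bucket_t
classify_cpu_bucket(unsigned no_of_cpus, unsigned rx_srs_cnt,
    unsigned reqd_tx_cpu_cnt, bool latency_optimize)
{
	unsigned reqd_rx_cpu_cnt;

	assert(rx_srs_cnt > 0);
	reqd_rx_cpu_cnt = rx_srs_cnt * rx_cpus_per_srs(latency_optimize);

	if (no_of_cpus >= reqd_rx_cpu_cnt + reqd_tx_cpu_cnt)
		return (CPU_BUCKET_OPTIMAL);

	/* Assumed threshold: one CPU per Rx SRS and one per Tx ring. */
	if (no_of_cpus >= rx_srs_cnt + reqd_tx_cpu_cnt)
		return (CPU_BUCKET_SHARED);

	return (CPU_BUCKET_ROUND_ROBIN);
}
```

With 2 Rx SRSes, 2 Tx rings and mac_latency_optimize unset, 8 CPUs would land in the optimal bucket under this sketch, 6 in the shared bucket, and a single CPU in the round-robin bucket.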
* Even when soft_ring_cnt is 0, we still need * to create a soft ring for TCP, UDP and /* increment ncpus to account for polling cpu */ * Copy cpu list to fe_effective_props for (i = 0; i <
nscpus; i++) {
for (j = 0; j < k; j++) {
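For the bandwidth-limited case handled by mac_compute_soft_ring_count(), the rule of thumb stated in the comments (a 1 GHz CPU handles about 1 Gbps, give at least 2 soft rings, split the total across the Rx SRSes, never fan out to a single ring) can be sketched as follows. This is an illustrative reading, not the routine itself: the function name and parameters are hypothetical, rounding up is an assumption, and the divisor is assumed to be the number of Rx SRSes because the comment is cut off at that point. The 10Gig and mac_soft_ring_enable paths are not modeled.

```c
#include <assert.h>

/*
 * Hypothetical helper: soft rings per Rx SRS when a bandwidth limit is
 * set. bw_mbps is the limit in Mbps, cpu_mhz the CPU speed in MHz
 * (all CPUs are assumed to run at the same frequency).
 */
unsigned
soft_rings_for_bw(unsigned bw_mbps, unsigned cpu_mhz, unsigned rx_srs_cnt)
{
	unsigned srings;

	assert(cpu_mhz > 0 && rx_srs_cnt > 0);

	/* Below 1 Gbps there is no need for soft rings. */
	if (bw_mbps < 1000)
		return (0);

	/* 1 GHz handles 1 Gbps: divide and round up (assumed rounding). */
	srings = (bw_mbps + cpu_mhz - 1) / cpu_mhz;

	/* Give at least 2 soft rings. */
	if (srings < 2)
		srings = 2;

	/* Assumed divisor: spread the total over the Rx SRSes. */
	srings /= rx_srs_cnt;

	/* Fanning out to 1 soft ring is not useful; report 0 instead. */
	if (srings == 1)
		srings = 0;

	return (srings);
}
```

Under these assumptions a 10 Gbps limit on 2 GHz CPUs with one Rx SRS gives 5 soft rings, while the same limit spread over 4 Rx SRSes gives 0, leaving the single protocol soft ring to mac_srs_fanout_init().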
* DATAPATH SETUP ROUTINES * (setup SRS and set/update FANOUT, B/W and PRIORITY) * mac_srs_fanout_list_alloc: * The underlying device can expose upto MAX_RINGS_PER_GROUP worth of * rings to a client. In such a case, MAX_RINGS_PER_GROUP worth of * array space is needed to store Tx soft rings. Thus we allocate so * much array space for srs_tx_soft_rings. * And when it is an aggr, again we allocate MAX_RINGS_PER_GROUP worth * of space to st_soft_rings. This array is used for quick access to * soft ring associated with a pseudo Tx ring based on the pseudo * ring's index (mr_index). * Re-target interrupt to the passed CPU. If re-target is successful, * set mc_rx_intr_cpu to the re-targeted CPU. Otherwise set it to -1. * Don't re-target the interrupt for these cases: * 2) the interrupt is shared (mi_ddi_shared) * 3) ddi_handle is NULL and !primary * 4) primary, ddi_handle is NULL but fe_rx_srs_cnt > 2 * Case 3 & 4 are because of mac_client_intr_cpu() routine. * This routine will re-target fixed interrupt for primary * mac client if the client has only one ring. In that * case, mc_rx_intr_cpu will already have the correct value. /* Drop the cpu_lock as ddi_intr_set_affinity() holds it */ * Re-target Tx interrupts * Drop the cpu_lock as ddi_intr_set_affinity() * When a CPU comes back online, bind the MAC kernel threads which * were previously bound to that CPU, and had to be unbound because * the CPU was going away. * These functions are called with cpu_lock held and hence we can't * cv_wait to grab the mac perimeter. Since these functions walk the soft * ring list of an SRS without being in the perimeter, the list itself * is protected by the SRS lock. /* Next tackle the soft rings associated with the srs */ * Change the priority of the SRS's poll and worker thread. Additionally, * update the priority of the worker threads for the SRS's soft rings. * Need to modify any associated squeue threads. * Change the receive bandwidth limit. 
/* Reset bandwidth limit */
 * Give twice the queuing capability before
/* Change the transmit bandwidth limit */
 * srs->srs_tx->st_func does not hold srs->srs_lock while accessing
 * st_mode and related fields, which are modified by the code below.
/* Reset bandwidth limit */
 * Give twice the queuing capability before
 * The uber function that deals with any update to bandwidth limits.
 * When the first sub-flow is added to a link, we disable polling on the
 * link and also modify the entry point to mac_rx_srs_subflow_process.
 * (polling is disabled because with the subflow added, accounting
 * for polling needs additional logic, it is assumed that when a subflow is
 * added, we can take some hit as a result of disabling polling rather than
 * adding more complexity - if this becomes a perf. issue we need to
 * re-evaluate this logic). When the last subflow is removed, we turn
 * polling back on and also reset the entry point to mac_rx_srs_process.
 * In the future if there are multiple SRS, we can simply
 * take one and give it to the flow rather than disabling polling and
 * resetting the entry point.
/* Tell mac_srs_poll_state_change to disable polling if necessary */
 * If receive function has already been configured correctly for
 * current subflow configuration, do nothing.
 * Change the S/W classifier so that we can land in the
 * correct processing function with correct argument.
 * If all subflows have been removed we can revert to
 * mac_rx_srs_process, else we need mac_rx_srs_subflow_process.
 * TCP and UDP support DLS bypass. In addition TCP
 * squeues can also poll their corresponding soft rings.
 * Make a call in IP to get a TCP squeue assigned to
 * this softring to maintain full CPU locality through
 * the stack and allow the squeue to be able to poll
 * the softring so the flow control can be pushed
 * Non-TCP protocols don't support squeues. Hence we
 * don't make any ring addition callbacks for non-TCP
 * rings.
Now create the UDP softring and allow it to /* Create the Oth softrings which has to go through the DLS */ * This routine associates a CPU or a set of CPU to process incoming * traffic from a mac client. If multiple CPUs are specified, then * so many soft rings are created with each soft ring worker thread * bound to a CPU in the set. Each soft ring in turn will be * associated with an squeue and the squeue will be moved to the * same CPU as that of the soft ring's. /* fanout state is REINIT. Set it back to INIT */ /* how many are present right now */ /* soft rings increased */ * Create the protocol softrings and set the * DLS bypass where possible. /* soft rings decreased */ /* Get rid of extra soft rings */ * Bind Tx srs and soft ring threads too. Let's bind tx * srs to the last cpu in mrp list. * Bind SRS threads and soft rings to CPUs/create fanout list. * Remove the no soft ring flag and we will adjust it * appropriately further down. * Ring count can be 0 if no fanout is required and no cpu * were specified. Leave the SRS worker and poll thread /* Step 1: bind cpu contains cpu list where threads need to bind */ /* Create the protocol softrings */ * Bind Tx srs and soft ring threads too. * Let's bind tx srs to the last cpu in * For a subflow, mrp_workerid and mrp_pollid /* Create the protocol softrings */ * This is the case when there is no fanout which is * Calls mac_srs_fanout_init() or modify() depending upon whether * the SRS is getting initialized or re-initialized. * This is an aggregation port. Fanout will be setup * over the aggregation itself. * Set up the fanout on the tx side only once, with the * first rx SRS. The CPU binding, fanout, and bandwidth * criteria are common to both RX and TX, so * initializing them along side avoids redundant code. /* No fanout for subflows */ * Set up fanout for both SW (0th SRS) and HW classified * SRS (the rest of Rx SRSs in flent). * Create a mac_soft_ring_set_t (SRS). 
 * If soft_ring_fanout_type is
 * SRST_TX, an SRS for Tx side is created. Otherwise an SRS for Rx side
 * Create a SRS and also add the necessary soft rings for TCP and
 * non-TCP based on fanout type and count specified.
 * mac_soft_ring_fanout, mac_srs_fanout_modify (?),
 * mac_soft_ring_stop_workers, mac_soft_ring_set_destroy, etc need
 * to be heavily modified.
 * mi_soft_ring_list_size, mi_soft_ring_size, etc need to disappear.
 * Get the bandwidth control structure from the flent. Get
 * rid of any residual values in the control structure for
 * the tx bw struct and also for the rx, if the rx srs is
 * the 1st one being brought up (the rx bw ctl struct may
 * be shared by multiple SRSs)
 * The bw counter (stored in the flent) is shared
 * by SRSs within an rx group.
/* First rx SRS, clear the bw structure */
 * It is better to panic here rather than just assert because
 * on a non-debug kernel we might end up corrupting memory
 * and making it difficult to debug.
panic(
"Array Overrun detected due to MAC client %p " " having more rings than %d", (
void *)
mcip,
 * For a flow we use the underlying MAC client's priority range with
 * the priority value to find an absolute priority value. For a MAC
 * client we use the MAC client's maximum priority as the value.
 * We need to insert the SRS in the global list before
 * binding the SRS and SR threads. Otherwise there
 * is a small window where the cpu reconfig callbacks
 * may miss the SRS in the list walk and DR could fail
 * as there are bound threads.
/* Initialize bw limit */
 * Give twice the queuing capability before
 * We use the following policy to control Receive
 * 1) We switch to poll mode anytime the processing thread causes
 * a backlog to build up in SRS and its associated Soft Rings
 * 2) As long as the backlog stays under the low water mark
 * (sr_lowat), we poll the H/W for more packets.
 * 3) If the backlog (sr_poll_pkt_cnt) exceeds low water mark, we
 * stay in poll mode but don't poll the H/W for more packets.
 * 4) Anytime in polling mode, if we poll the H/W for packets and
 * find nothing plus we have an existing backlog
 * (sr_poll_pkt_cnt > 0), we stay in polling mode but don't poll
 * the H/W for packets anymore (let the polling thread go to sleep).
 * 5) Once the backlog is relieved (packets are processed) we reenable
 * polling (by signalling the poll thread) only when the backlog
 * dips below sr_poll_thres.
 * 6) sr_hiwat is used exclusively when we are not polling capable
 * and is used to decide when to drop packets so the SRS queue
 * length doesn't grow infinitely.
/* Low water mark needs to be less than high water mark */
/* Poll threshold needs to be half of low water mark or less */
/* Handle everything about Tx SRS and return */
/* Is the mac_srs created over the RX default group? */
 * Some drivers require serialization and don't send
 * packet chains in interrupt context. For such
 * drivers, we should always queue in soft ring
 * so that we get a chance to switch into a polling
 * Figure out the number of soft rings required.
 * It depends on whether
 * protocol fanout is required (for LINKs), global settings
 * require us to do fanout for performance (based on mac_soft_ring_enable),
 * or user has specifically requested fanout.
/* no fanout for subflows */
/* A primary NIC/link is being plumbed */
/* A VNIC is being created */
 * Change a group from h/w to s/w classification.
 * Remove the SRS associated with the HW ring.
 * As a result, polling will be disabled.
 * We need to perform SW classification
 * for packets landing in these rings
 * Create the Rx SRS for S/W classifier and for each ring in the
 * group (if exclusive group). Also create the Tx SRS.
 * Set up the RX SRSs. If the S/W SRS is not set, set it up. If there
 * is a group associated with this MAC client, set up SRSs for individual
/* Create the SRS for S/W classification if none exists */
 * fanout for default SRS is done when default SRS are created
 * above. As each ring is added to the group, we set up the
 * Since the group is exclusively ours create
 * an SRS for this ring to allow the
 * individual SRS to dynamically poll the
 * ring. Do this only if the client is not
 * a VLAN MAC client, since for VLAN we do
 * s/w classification for the VID check, and
 * if it has a unicast address.
"trying to add UNKNOWN ring = %p\n",
* Set all rings of this group to software classified. * If the group is current RESERVED, the existing mac * client (the only client on this group) is using * this group exclusively. In that case we need to * disable polling on the rings of the group (if it * was enabled), and free the SRS associated with the * If we are opened exclusively (like aggr does for aggr_ports), * don't set up Tx SRS and Tx soft rings as they won't be used. * The same thing has to be done for Rx side also. See bug: * If we have rings, start them here. * Remove all the RX SRSs. If we want to remove only the SRSs associated * with h/w rings, leave the S/W SRS alone. This is used when we want to * move the MAC client from one group to another, so we need to teardown for (i = 0; i <
count; i++) {
 * For flows, we need to work with passed
 * flent to find the Rx/Tx SRS.
 * The ring itself will be stopped when
 * we release the group or in the
 * mac_datapath_teardown (for the default
 * This is the group state machine.
 * The state of an Rx group is given by
 * the following table. The default group and its rings are started in
 * mac_start itself and the default group stays in SHARED state until
 * mac_stop at which time the group and rings are stopped and it
 * reverts to the Registered state.
 * Typically this function is called on a group after adding or removing a
 * client from it, to find out what should be the new state of the group.
 * If the new state is RESERVED, then the client that owns this group
 * exclusively is also returned. Note that adding or removing a client from
 * a group could also impact the default group and the caller needs to
 * evaluate the effect on the default group.
 *
 * Group type     # of clients   mi_nactiveclients   Group State
 * Non-default    0              N.A.                REGISTERED
 * Non-default    1              N.A.                RESERVED
 * Default        > 1            N.A.                SHARED
 *
 * For a TX group, the following is the state table.
 *
 * Group type     # of clients   Group State
 * Non-default    0              REGISTERED
 *
 * OVERVIEW NOTES FOR DATAPATH
 * ===========================
 * Create an SRS and set up the corresponding flow function and args.
 * Add a classification rule for the flow specified by 'flent' and program
 * the hardware classifier when applicable.
 *
 * Rx ring assignment, SRS, polling and B/W enforcement
 * ----------------------------------------------------
 * We try to use H/W classification on NIC and assign traffic for a
 * MAC address to a particular Rx ring. There is a 1-1 mapping
 * between a SRS and a Rx ring. The SRS (short for soft ring set)
 * dynamically switches the underlying Rx ring between interrupt
 * and polling mode and enforces any specified B/W control.
 * There is always a SRS created and tied to each H/W and S/W rule.
 * Whenever we create a H/W rule, we always add the same rule to
 * the S/W classifier and tie a SRS to it.
 * In case a B/W control is specified, it is broken into bytes
 * per tick and as soon as the quota for a tick is exhausted,
 * the underlying Rx ring is forced into poll mode for the remainder
 * of the tick. The SRS poll thread only polls for bytes that are
 * allowed to come in the SRS. We typically let 4x the configured
 * B/W worth of packets come in the SRS (to prevent unnecessary
 * drops due to bursts) but only process the specified amount.
 * A Link (primary NIC, VNIC, VLAN or aggr) can have 1 or more
 * Rx rings (and corresponding SRSs) assigned to it. The SRS
 * in turn can have softrings to do protocol level fanout or
 * softrings to do S/W based fanout or both. In case the NIC
 * has no Rx rings, we do S/W classification to respective SRS.
 * The S/W classification rule is always set up and ready. This
 * allows the MAC layer to reassign Rx rings whenever needed
 * while packets still continue to flow via the default path and
 * get S/W classified to the correct SRS.
 * In other cases where a NIC or VNIC is plumbed, our goal is to use
 * the H/W classifier and get two Rx rings assigned for the Link. One
 * for TCP and one for UDP|SCTP. The respective SRS still do the
 * polling on the Rx ring. For a Link that is plumbed for IP, there
 * is a TCP squeue which also does polling and can control
 * the Rx ring directly (where SRS is just pass through). For
 * the following cases, the SRS does the polling underneath.
 * 1) non IP based Links (Links which are not plumbed via ifconfig)
 * and paths which have no IP squeues (UDP & SCTP)
 * 2) If B/W control is specified on the Link
 * 3) If S/W fanout is specified
 * Note1: As of current implementation, we try to assign only 1 Rx
 * ring per Link and more than 1 Rx ring for primary Link for
 * H/W based fanout.
 * We always create the following softrings per SRS:
 * 1) TCP softring which is polled by TCP squeue where possible
 * (and also bypasses DLS)
 * 2) UDP/SCTP based which bypasses DLS
 * 3) OTH softring which goes via DLS (currently deal with IPv6
 * It is necessary to create 3 softrings since SRS has to poll
 * the single Rx ring underneath and enforce any link level B/W
 * control (we can't switch the Rx ring in poll mode just based
 * on TCP squeue if the same Rx ring is sharing UDP and other
 * traffic as well). Once polling is done and any Link level B/W
 * control is specified, the packets are assigned to respective
 * softring based on protocol. Since TCP has IP based squeue
 * which benefits by polling, we separate TCP packets into
 * its own softring which can be polled by IP squeue. We need
 * to separate out UDP/SCTP to UDP softring since it can bypass
 * the DLS layer which has heavy performance advantages and we
 * need a softring (OTH) for the rest.
 * ToDo: The 3 softrings for protocol are needed only till we can
 * get rid of DLS from datapath, make IPv4 and IPv6 paths
 * symmetric (deal with mac_header_info for v6 and polling for
 * IPv4 TCP - ip_accept_tcp is IPv4 specific although squeues
 * are generic), and bring SAP based classification to MAC layer
 *
 * H/W and S/W based fanout and multiple Rx rings per Link
 * -------------------------------------------------------
 * In case fanout is requested (or determined automatically based
 * on Link speed and processor speed), we try to assign multiple
 * Rx rings per Link with their respective SRS. In this case
 * the NIC should be capable of fanning out incoming packets between
 * the assigned Rx rings (H/W based fanout). All the SRS
 * individually switch their Rx ring between interrupt and polling
 * mode but share a common B/W control counter in case Link
 * level B/W is specified.
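The per-protocol split into the three softrings listed above (TCP, UDP/SCTP, OTH) can be illustrated with a small selector. This is a simplified sketch, not the MAC layer's classifier: the enum and function names are hypothetical, it looks only at the IP protocol number (the IANA values 6, 17 and 132), and it ignores the IP version handling that the OTH description alludes to.

```c
#include <assert.h>

/* IANA-assigned IP protocol numbers. */
#define	PROTO_TCP	6
#define	PROTO_UDP	17
#define	PROTO_SCTP	132

/* Hypothetical labels for the three per-SRS protocol softrings. */
typedef enum {
	RING_TCP,	/* polled by the TCP squeue, bypasses DLS */
	RING_UDP,	/* UDP and SCTP, bypasses DLS */
	RING_OTH	/* everything else, goes via DLS */
} ring_kind_t;

/* Choose the softring for a packet from its IP protocol number. */
ring_kind_t
pick_protocol_softring(unsigned ip_proto)
{
	assert(ip_proto <= 255);	/* the protocol field is 8 bits */

	switch (ip_proto) {
	case PROTO_TCP:
		return (RING_TCP);
	case PROTO_UDP:
	case PROTO_SCTP:
		return (RING_UDP);
	default:
		return (RING_OTH);
	}
}
```

Keeping TCP on its own ring is what lets the squeue poll that ring without affecting UDP or other traffic that shares the same Rx ring.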
 * If S/W based fanout is specified in lieu of H/W based fanout,
 * the Link SRS creates the specified number of softrings for
 * each protocol (TCP, UDP, OTH). Incoming packets are fanned
 * out to the correct softring based on their protocol and
 * a protocol specific hash function.
 *
 * Primary and non primary MAC clients
 * -----------------------------------
 * The NICs, VNICs, Vlans, and Aggrs are typically termed as Links
 * and are a Layer 2 construct.
 * The Link that owns the primary MAC address and typically
 * is used as the data NIC in non virtualized cases. As such
 * H/W resources are preferentially given to primary NIC. As
 * far as code is concerned, there is no difference in the
 * primary NIC vs VNICs. They are all treated as Links.
 * At the very first call to mac_unicast_add() we program the S/W
 * classifier for the primary MAC address, get a soft ring set
 * (and soft rings based on 'ip_soft_ring_cnt')
 * and a Rx ring assigned for polling to get enabled.
 * When IP gets plumbed and negotiates polling, we can
 * let squeue do the polling on TCP softring.
 * Same as any other Link. As long as the H/W resource assignments
 * are equal, the data path and setup for all Links is the same.
 * Can be configured on Links. They have their own SRS and the
 * S/W classifier is programmed appropriately based on the flow.
 * The flows typically deal with layer 3 and above and
 * create a soft ring set specific to the flow. The receive
 * side function is switched from mac_rx_srs_process to
 * mac_rx_srs_subflow_process which first tries to assign the
 * packet to the appropriate flow SRS and, failing that, assigns it
 * to the link SRS. This allows us to avoid the layered approach
 * By the time mac_datapath_setup() completes, we already have the
 * soft ring sets, Rx rings, soft rings, etc figured out and both H/W
 * and S/W classifiers programmed. IP is not plumbed yet (and might
 * never be for Virtual Machines guest OS path).
 * When IP is plumbed
 * (for both NIC and VNIC), we do a capability negotiation for polling
 * and upcall functions etc.
 *
 * Rx ring Assignment NOTES
 * ------------------------
 * For NICs which have only 1 Rx ring (we treat NICs with no Rx rings
 * as NIC with a single default ring), we assign the only ring to
 * primary Link. The primary Link SRS can do polling on it as long as
 * it is the only link in use and we compare the MAC address for unicast
 * packets before accepting an incoming packet (there is no need for S/W
 * classification in this case). We disable polling on the only ring the
 * moment 2nd link gets created (the polling remains enabled even though
 * there are broadcast and multicast flows created).
 * If the NIC has more than 1 Rx ring, we assign the default ring (the
 * 1st ring) to deal with broadcast, multicast and traffic for other
 * NICs which need S/W classification. We assign the primary mac
 * addresses to another ring by specifying a classification rule for
 * primary unicast MAC address to the selected ring. The primary Link
 * (and its SRS) can continue to poll the assigned Rx ring at all times
 * Note: In future, if no fanout is specified, we try to assign 2 Rx
 * rings for the primary Link with the primary MAC address + TCP going
 * to one ring and primary MAC address + UDP|SCTP going to other ring.
 * Any remaining traffic for primary MAC address can go to the default
 * Rx ring and get S/W classified. This way the respective SRSs don't
 * need to do proto fanout and don't need to have softrings at all and
 * can poll their respective Rx rings.
 * As an optimization, when a new NIC or VNIC is created, we can get
 * only one Rx ring and make it a TCP specific Rx ring and use the
 * H/W default Rx ring for the rest (this Rx ring is never polled).
 * For clients that don't have a MAC address, but want to receive and
 * transmit packets (e.g., bpf, gvrp etc.), we need to set up the datapath.
* For such clients (identified by the MCIS_NO_UNICAST_ADDR flag) we * always give the default group and use software classification (i.e. * even if this is the only client in the default group, we will * leave group as shared). * By default we have given the primary all the rings * i.e. the default group. Let's see if the primary * needs to be relocated so that the addition of this * client doesn't impact the primary's performance, * i.e. if the primary is in the default group and * we add this client, the primary will lose polling. * We do this only for NICs supporting dynamic ring * grouping and only when this is the first client * after the primary (i.e. nactiveclients is 2) * Check to see if we can get an exclusive group for * this mac address or if there already exists a * group that has this mac address (case of VLANs). * If no groups are available, use the default group. * Check to see if we can get an exclusive group for * this mac client. If no groups are available, use * Some NICs don't support any Rx rings, so there may not * even be a default group. * Add the client to the group. This could cause * either this group to move to the shared state or * cause the default group to move to the shared state. * The actions on this group are done here, while the * actions on the default group are postponed to * the end of this function. * Setup the Rx and Tx SRSes. If we got a pristine group * exclusively above, mac_srs_group_setup would simply create * the required SRSes. If we ended up sharing a previously * reserved group, mac_srs_group_setup would also dismantle the * SRSes of the previously exclusive group /* We are setting up minimal datapath only */ /* Program the S/W Classifer */ /* Program the H/W Classifier */ /* Initialize the v6 local addr used by link protection */ * All broadcast and multicast traffic is received only on the default * group. 
If we have set up the datapath for a non-default group above, * then move the default group to shared state to allow distribution of * incoming broadcast traffic to the other groups and dismantle the * SRSes over the default group. * If we get an exclusive group for a VLAN MAC client we * need to take the s/w path to make the additional check for * the vid. Disable polling and set it to s/w classification. * Similarly for clients that don't have a unicast address. /* Switch the primary back to default group */ /* Stop sending packets */ /* Stop the packets coming from the H/W */ " address because of error 0x%x",
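The earlier note on VLAN clients and clients without a unicast address amounts to a predicate on when a client may poll its hardware rings. Here is a minimal sketch of that reading; `sketch_client_t` and `sketch_can_poll_hw()` are hypothetical and do not correspond to structures or functions in this file.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of the properties the note cares about. */
typedef struct {
	bool	has_exclusive_group;	/* client owns its Rx group */
	bool	is_vlan;		/* needs the extra vid check in s/w */
	bool	has_unicast_addr;	/* false for bpf/gvrp-style clients */
} sketch_client_t;

/*
 * Per the note: even with an exclusive group, a VLAN client or a client
 * without a unicast address must take the s/w classification path, so
 * polling is disabled for it.
 */
static bool
sketch_can_poll_hw(const sketch_client_t *c)
{
	assert(c != NULL);
	return (c->has_exclusive_group && !c->is_vlan &&
	    c->has_unicast_addr);
}
```

Keeping the three conditions in one predicate makes it easy to see that an exclusive group is necessary but not sufficient for polling.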
/* Stop the packets coming from the S/W classifier */ /* Now quiesce and destroy all SRS and soft rings */ * Release our hold on the group as well. We need * to check if the shared group has only one client * left who can use it exclusively. Also, if we * were the last client, release the group. * Only one client left on this RX group. * The only remaining client has exclusive * access on the group. Allow it to * dynamically poll the H/W rings etc. * This is a non-default group being freed up. * We need to reevaluate the default group * to see if the primary client can get * exclusive access to the default group. * Remove the client from the TX group. Additionally, if * this is a non-default group, then we also need to release * If the default group is reserved, * then we need to set the effective * rings as we would have given * back some rings when the group * Stop all the rings except the * The mac client using the default group gets exclusive access to the * default group if and only if it is the sole client on the entire * mip. If so, set the group state to reserved, and set up the SRSes * over the default group. * If the primary is the only one left and the MAC supports * dynamic grouping, we need to see if the primary needs to * be moved to the default group so that it can use all the * If the primary has an explicit property set, leave it * Switch the primary to the default group. /* DATAPATH TEAR DOWN ROUTINES (SRS and FANOUT teardown) */ * An RX SRS is attached to at most one mac_ring. * Broadcast flows don't have a client impl association, but they * Physical unlink and free of the data structures happen below. This is * driven from mac_flow_destroy(), on the last refrele of a flow. * Assumes Rx srs is 1-1 mapped with a ring. * The block comment above mac_rx_classify_flow_state_change explains the * background. At this point upcalls from the driver (both hardware classified * and software classified) have been cut off.
We now need to quiesce the * SRS worker, poll, and softring threads. The SRS worker thread serves as * the master controller. The steps involved are described below in the function * In the case of Rx SRS, wait till the poll thread is done. * Turn off polling as part of the quiesce operation. * Then signal the soft ring worker threads to quiesce or quit * as needed and then wait till that happens. * Signal an SRS to start a temporary quiesce, or permanent removal, or restart * a quiesced SRS by setting the appropriate flags and signaling the SRS worker * or poll thread. This function is internal to the quiescing logic and is * called internally from the SRS quiesce or flow quiesce or client quiesce * higher level functions. * The SRS is going away. We need to unbind the SRS and SR * threads before removing from the global SRS list. Otherwise * there is a small window where the cpu reconfig callbacks * may miss the SRS in the list walk and DR could fail since * there are still bound threads. * Wake up the SRS worker and poll threads. * On the Rx side, the quiescing is done bottom up. After the Rx upcalls * from the driver are done, then the Rx SRS is quiesced and only then can * we signal the soft rings. Thus this function can't be called arbitrarily * without satisfying the prerequisites. On the Tx side, the threads from * top need to be quiesced, then the Tx SRS and only then can we signal the * The block comment above mac_rx_classify_flow_state_change explains the * background. At this point the SRS is quiesced and we need to restart the * SRS worker, poll, and softring threads. The SRS worker thread serves as * the master controller. The steps involved are described below in the function * Signal any quiesced soft ring workers to restart and wait for the * soft ring down count to come down to zero. * Signal the poll thread and ask it to restart.
Wait till it * actually restarts and the SRS_POLL_THR_QUIESCED flag gets /* Wake up any waiter waiting for the restart to complete */ * When a CPU is going away, unbind all MAC threads which are bound * to that CPU. The affinity of the thread to the CPU is saved to allow * the thread to be rebound to the CPU if it comes back online. /* Next tackle the soft rings associated with the srs */ /* TX SETUP and TEARDOWN ROUTINES */ * XXXHIO need to make sure the two mac_tx_srs_{add,del}_ring() * handle the case where the number of rings is one. I.e. there is * a ring pointed to by mac_srs->srs_tx_arg2. * put this soft ring in quiesce mode too so when we restart * all soft rings in the srs are in the same state. * In the case of aggr, the soft ring associated with a Tx ring * is also stored in st_soft_rings[] array. That entry should * Used to set up Tx rings. If no free Tx ring is available, then default * An attempt is made to reserve 'tx_ring_count' number * of Tx rings. If tx_ring_count is 0, the default Tx ring * is used. If it is 1, an attempt is made to reserve one * Tx ring. In both cases, the ring information is * stored in Tx SRS. If multiple Tx rings are specified, * then each Tx ring will have a Tx-side soft ring. All * these soft rings will hang off the Tx SRS. "trying to add UNKNOWN ring = %p\n",
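The Tx ring rule described above (0 means the default ring, 1 means a single reserved ring, more than 1 means one Tx-side soft ring per ring) can be sketched as follows. This is an illustrative reading only: the enum, `sketch_tx_mode()` and `sketch_tx_soft_ring_count()` are hypothetical names, and the assumption that the 0 and 1 cases need no soft rings is inferred from the statement that the ring information is stored directly in the Tx SRS.

```c
#include <assert.h>

/* Hypothetical labels for the three cases in the comment. */
typedef enum {
	SKETCH_TX_DEFAULT_RING,	/* tx_ring_count == 0 */
	SKETCH_TX_SINGLE_RING,	/* tx_ring_count == 1 */
	SKETCH_TX_FANOUT	/* tx_ring_count > 1 */
} sketch_tx_mode_t;

static sketch_tx_mode_t
sketch_tx_mode(int tx_ring_count)
{
	assert(tx_ring_count >= 0);
	if (tx_ring_count == 0)
		return (SKETCH_TX_DEFAULT_RING);
	if (tx_ring_count == 1)
		return (SKETCH_TX_SINGLE_RING);
	return (SKETCH_TX_FANOUT);
}

/*
 * Number of Tx-side soft rings hanging off the Tx SRS: one per ring when
 * multiple rings are requested, otherwise none (assumption, see lead-in).
 */
static int
sketch_tx_soft_ring_count(int tx_ring_count)
{
	return (sketch_tx_mode(tx_ring_count) == SKETCH_TX_FANOUT ?
	    tx_ring_count : 0);
}
```

Separating the mode from the count keeps the single-ring special case, which the XXXHIO note flags as needing care, visible in one place.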
* Update the fanout of a client if its recorded link speed doesn't match * its current link speed. * Before calling mac_fanout_setup(), check to see if * the SRSes already have the right number of soft * rings. mac_fanout_setup() is a heavy duty operation * where new cpu bindings are done for SRS and soft * ring threads and interrupts re-targeted. * If soft_ring_count returned by * mac_compute_soft_ring_count() is 0, bump it * up by 1 because we always have at least one * TCP, UDP, and OTH soft ring associated with * Walk through the list of mac clients for the MAC. * For each active mac client, recompute the number of soft rings * associated with every client, only if current speed is different * from the speed that was previously used for soft ring computation. * If the cable is disconnected while the NIC is started, we would get * a notification with speed set to 0. We do not recompute in that case. * Given a MAC, change the polling state for all its MAC clients. 'enable' is * B_TRUE to enable polling or B_FALSE to disable. Polling is enabled by