tcp_input.c revision 9cd928fe5e3ea4e05f64cfb380beb54b2623e7dc
* Flags returned from tcp_parse_options. * PAWS needs a timer for 24 days. This is the number of ticks in 24 days * Since tcp_listener is not cleared atomically with tcp_detached * being cleared we need this extra bit to tell a detached connection * apart from one that is in the process of being accepted. * Steps to do when a tcp_t moves to TIME-WAIT state. * This connection is done, we don't need to account for it. Decrement * the listener connection counter if needed. * Decrement the connection counter of the stack. Note that this counter * is per CPU. So the total number of connections in a stack is the sum of all * of them. Since there is no lock for handling all of them exclusively, the * resulting sum is only an approximation. * Unconditionally clear the exclusive binding bit so this TIME-WAIT * connection won't interfere with new ones. * Start the TIME-WAIT timer. If upper layer has not closed the connection, * the timer is handled within the context of this tcp_t. When the timer * fires, tcp_clean_death() is called. If upper layer closes the connection * during this period, tcp_time_wait_append() will be called to add this * tcp_t to the global TIME-WAIT list. Note that this means that the * actual wait time in TIME-WAIT state will be longer than the * tcps_time_wait_interval since the period before upper layer closes the * connection is not accounted for when tcp_time_wait_append() is called. * If upper layer has closed the connection, call tcp_time_wait_append() * If tcp_drop_ack_unsent_cnt is greater than 0, when TCP receives more * than tcp_drop_ack_unsent_cnt number of ACKs which acknowledge unsent * data, TCP will not respond with an ACK. RFC 793 requires that * TCP responds with an ACK for such a bogus ACK. By not following * the RFC, we prevent TCP from getting into an ACK storm if somehow * an attacker successfully spoofs an acceptable segment to our * peer; or when our peer is "confused."
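The tcp_drop_ack_unsent_cnt rule described above can be illustrated with a small sketch. This is not the code from tcp_input.c: the type `ack_guard_t`, the function `ack_guard_should_reply()` and the field names are hypothetical, and only the threshold semantics (stop replying once more than the configured number of such ACKs has arrived, with 0 disabling the defense) are taken from the comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-connection state; not the real tcp_t layout. */
typedef struct ack_guard {
	uint32_t drop_ack_unsent_cnt;	/* threshold; 0 disables the defense */
	uint32_t ack_unsent_seen;	/* ACKs for unsent data seen so far */
} ack_guard_t;

/*
 * Called when an incoming ACK acknowledges data we have not sent yet.
 * Returns true if we should answer with an ACK (the RFC 793 behaviour),
 * false once more than drop_ack_unsent_cnt such ACKs have been seen.
 */
bool
ack_guard_should_reply(ack_guard_t *g)
{
	assert(g != NULL);

	if (g->drop_ack_unsent_cnt == 0)
		return (true);		/* defense off: always follow the RFC */

	if (g->ack_unsent_seen < UINT32_MAX)
		g->ack_unsent_seen++;	/* saturate instead of wrapping */

	return (g->ack_unsent_seen <= g->drop_ack_unsent_cnt);
}
```

With a threshold of 2, the first two bogus ACKs are still answered and the third is silently dropped, which is what breaks a potential ACK storm.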
* The shift factor applied to tcp_mss to decide if the peer sends us a * valid initial receive window. By default, if the peer receive window * is smaller than 1 MSS (shift factor is 0), it is considered invalid. /* Process ICMP source quench message or not. */ * Set the MSS associated with a particular tcp based on its current value, * and a new one passed in. Observe minimums and maximums, and reset other * state variables that we want to view as multiples of MSS. * The value of MSS could be either increased or decreased. * Unless naglim has been set by our client to * a non-mss value, force naglim to track mss. * This can help to aggregate small writes. * TCP should be able to buffer at least 4 MSS data for obvious * Set the send lowater to at least twice the MSS. * Update tcp_cwnd according to the new value of MSS. Keep the * previous ratio to preserve the transmit rate. * Extract option values from a tcp header. We put any found values into the * tcpopt struct and return a bitmask saying which options were found. /* Caller must handle tcp_mss_min and tcp_mss_max_* */ /* If TCP is not interested in SACK blks... */ * If the list is empty, allocate one and assume * Make sure tcp_notsack_list is not NULL. * This happens when kmem_alloc(KM_NOSLEEP) * Bounds checking. Make sure the SACK * info is within tcp_suna and tcp_snxt. * If this SACK blk is out of bounds, ignore * it but continue to parse the following * Process all TCP options in a SYN segment. Note that this function should * be called after tcp_set_destination() is called so that the necessary info * from IRE is already set in the tcp structure. * This function sets up the correct tcp_mss value according to the * MSS option value and our header size. It also sets up the window scale * and timestamp values, and initializes SACK info blocks. But it does not * change receive window size after setting the tcp_mss value. The caller * should do the appropriate change. * Process MSS option. 
Note that MSS option value does not account * for IP or TCP options. This means that it is equal to MTU - minimum * IP+TCP header size, which is 40 bytes for IPv4 and 60 bytes for /* Process Window Scale option. */ /* Process Timestamp option. */ /* Fill in our template header with basic timestamp option. */ * Process SACK options. If SACK is enabled for this connection, * then allocate the SACK info structure. Note the following ways * when tcp_snd_sack_ok is set to true. * For active connection: in tcp_set_destination() called in * For passive connection: in tcp_set_destination() called in * That's the reason why the extra TCP_IS_DETACHED() check is there. * That check makes sure that if we did not send a SACK OK option, * we will not enable SACK for this connection even though the other * side sends us SACK OK option. For active connection, the SACK * info structure has already been allocated. So we need to free * it if SACK is disabled. * Resetting tcp_snd_sack_ok to B_FALSE so that * no SACK info will be used for this * connection. This assumes that SACK usage * permission is negotiated. This may need * to be changed once this is clarified. * Now we know the exact TCP/IP header length, subtract * that from tcp_mss to get our side's MSS. * Here we assume that the other side's header size will be equal to * our header size. We calculate the real MSS accordingly. Need to * take into additional stuffs IPsec puts in. * Real MSS = Opt.MSS - (our TCP/IP header - min TCP/IP header) * Set MSS to the smaller one of both ends of the connection. * We should not have called tcp_mss_set() before, but our * side of the MSS should have been set to a proper value * by tcp_set_destination(). tcp_mss_set() will also set up the * STREAM head parameters properly. * If we have a larger-than-16-bit window but the other side * didn't want to do window scale, tcp_rwnd_set() will take * Initialize tcp_cwnd value. 
After tcp_mss_set(), tcp_mss has been * Add a new piece to the tcp reassembly queue. If the gap at the beginning * is filled, return as much as we can. The message passed in may be * multi-part, chained using b_cont. "start" is the starting sequence /* Walk through all the new pieces. */ /* New stuff completely beyond tail? */ /* New stuff at the front? */ /* Yes... Check for overlap. */ * The new piece fits somewhere between the head and tail. * We find our slot, where mp1 precedes us and mp2 trails. /* Trim overlap with following mblk(s) first */ /* Trim overlap with preceding mblk */ /* Anything ready to go? */ /* Eat what we can off the queue */ /* Eliminate any overlap that mp may have over later mblks */ * This function does PAWS protection check. Returns B_TRUE if the * segment passes the PAWS test, else returns B_FALSE. * If timestamp option is aligned nicely, get values inline, * otherwise call general routine to parse. Only do that * if timestamp is the only option. * Do PAWS per RFC 1323 section 4.2. Accept RST * regardless of the timestamp, page 18 RFC 1323.bis. /* This segment is not acceptable. */ * Connection has been idle for * too long. Reset the timestamp * and assume the segment is valid. * If we don't get a timestamp on every packet, we * figure we can't really trust 'em, so we stop sending * Adjust the tcp_mss and tcp_cwnd accordingly. We avoid * doing a slow start here so as to not to lose on the * transfer rate built up so far. * Defense for the SYN attack - * 1. When q0 is full, drop from the tail (tcp_eager_prev_drop_q0) the oldest * one from the list of droppable eagers. This list is a subset of q0. * see comments before the definition of MAKE_DROPPABLE(). * 2. Don't drop a SYN request before its first timeout. This gives every * request at least til the first timeout to complete its 3-way handshake. * 3. Maintain tcp_syn_rcvd_timeout as an accurate count of how many * requests currently on the queue that has timed out. 
This will be used * as an indicator of whether an attack is under way, so that appropriate * actions can be taken. (It's incremented in tcp_timer() and decremented * either when eager goes into ESTABLISHED, or gets freed up.) * 4. The current threshold is - # of timeout > q0len/4 => SYN alert on * # of timeout drops back to <= q0len/32 => SYN alert off /* Pick oldest eager from the list of droppable eagers */ /* If list is empty, return B_FALSE */ /* If allocated, the mp will be freed in tcp_clean_death_wrapper() */ * Take this eager out from the list of droppable eagers since we are "tcp_drop_q0: listen half-open queue (max=%d) overflow" /* Put a reference on the conn as we are enqueueing it in the squeue */ * Handle a SYN on an AF_INET6 socket; can be either IPv4 or IPv6 /* Pass up the scope_id of remote addr */ /* Handle a SYN on an AF_INET socket */ * Called via squeue to get on to eager's perimeter. It sends a * TH_RST if eager is in the fanout table. The listener wants the * eager to disappear either by means of tcp_eager_blowoff() or * tcp_eager_cleanup() being called. tcp_eager_kill() can also be * called (via squeue) if the eager cannot be inserted in the * fanout table in tcp_input_listener(). * We could be called because listener is closing. Since * the eager was using listener's queues, we avoid * using the listener's queues from now on. * An eager's conn_fanout will be NULL if it's a duplicate * for an existing 4-tuple in the conn fanout table. * We don't want to send an RST out in such case. /* We are here because listener wants this eager gone */ * The eager has sent a conn_ind up to the * listener but listener decides to close * instead. We need to drop the extra ref * placed on eager in tcp_input_data() before * sending the conn_ind to listener. * Reset any eager connection hanging off this listener marked * with 'seqnum' and then reclaim its resources. * Reset any eager connection hanging off this listener * and then reclaim its resources. 
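The SYN-alert thresholds in item 4 above form a hysteresis: the alert turns on above q0len/4 timed-out requests and only turns off again at or below q0len/32. A minimal sketch of that behaviour follows. The names `syn_guard_t`, `syn_guard_timeout()` and `syn_guard_done()` are hypothetical, "q0len" is assumed here to mean the configured maximum length of the half-open queue, and the real bookkeeping in tcp_timer() and the eager teardown paths is more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical listener-side state for the SYN-attack heuristic. */
typedef struct syn_guard {
	uint32_t q0_max;	/* assumed meaning of "q0len" */
	uint32_t timed_out;	/* half-open requests that have timed out */
	bool	 syn_alert;	/* true while the SYN alert is on */
} syn_guard_t;

/* Re-evaluate the alert after the timed-out count has changed. */
static void
syn_guard_update(syn_guard_t *g)
{
	if (!g->syn_alert) {
		if (g->timed_out > g->q0_max / 4)
			g->syn_alert = true;	/* > q0len/4: alert on */
	} else if (g->timed_out <= g->q0_max / 32) {
		g->syn_alert = false;		/* <= q0len/32: alert off */
	}
}

/* A half-open request hit its first retransmission timeout. */
void
syn_guard_timeout(syn_guard_t *g)
{
	assert(g != NULL);
	g->timed_out++;
	syn_guard_update(g);
}

/* A previously timed-out request became ESTABLISHED or was freed. */
void
syn_guard_done(syn_guard_t *g)
{
	assert(g != NULL);
	if (g->timed_out > 0)
		g->timed_out--;
	syn_guard_update(g);
}
```

The wide gap between the two thresholds keeps the alert from flapping when the number of timed-out requests hovers around a single cutoff.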
* If we are an eager connection hanging off a listener that hasn't * formally accepted the connection yet, get off his list and blow off * any data that we have accumulated. /* Remove the eager tcp from q0 */ * Take the eager out, if it is in the list of droppable /* we have timed out before */ * If we are unlinking the last * element on the list, adjust * tail pointer. Set tail pointer * to nil when list is empty. * We won't get here if there * is only one eager in the * The sockfs ACCEPT path: * ======================= * The eager is now established in its own perimeter as soon as SYN is * received in tcp_input_listener(). When sockfs receives conn_ind, it * completes the accept processing on the acceptor STREAM. The sending * of conn_ind part is common for both sockfs listener and a TLI/XTI * listener but a TLI/XTI listener completes the accept processing * on the listener perimeter. * Common control flow for 3 way handshake: * ---------------------------------------- * incoming SYN (listener perimeter) -> tcp_input_listener() * incoming SYN-ACK-ACK (eager perim) -> tcp_input_data() * send T_CONN_IND (listener perim) -> tcp_send_conn_ind() * open acceptor stream (tcp_open allocates tcp_tli_accept() * soaccept() sends T_CONN_RES on the acceptor STREAM to tcp_tli_accept() * tcp_tli_accept() extracts the eager and makes the q->q_ptr <-> eager * association (we are not behind eager's squeue but sockfs is protecting us * and no one knows about this stream yet. The STREAMS entry point q->q_info * is changed to point at tcp_wput(). * tcp_accept_common() sends any deferred eagers via tcp_send_pending() to * listener (done on listener's perimeter). * tcp_tli_accept() calls tcp_accept_finish() on eagers perimeter to finish * --------------------------- * soaccept() sends T_CONN_RES on the listener STREAM. * tcp_tli_accept() -> tcp_accept_swap() complete the processing and send * a M_SETOPS mblk to eager perimeter to finish accept (tcp_accept_finish()). 
* listener->tcp_eager_lock protects the listeners->tcp_eager_next_q0 and * listeners->tcp_eager_next_q. * 1) We start out in tcp_input_listener by eager placing a ref on * listener and listener adding eager to listeners->tcp_eager_next_q0. * 2) When a SYN-ACK-ACK arrives, we send the conn_ind to listener. Before * doing so we place a ref on the eager. This ref is finally dropped at the * end of tcp_accept_finish() while unwinding from the squeue, i.e. the * reference is dropped by the squeue framework. * 3) The ref on listener placed in 1 above is dropped in tcp_accept_finish * The reference must be released by the same entity that added the reference. * In the above scheme, the eager is the entity that adds and releases the * references. Note that tcp_accept_finish executes in the squeue of the eager * (albeit after it is attached to the acceptor stream). Though 1. executes * in the listener's squeue, the eager is nascent at this point and the * reference can be considered to have been added on behalf of the eager. * Eager getting a Reset or listener closing: * ========================================== * Once the listener and eager are linked, the listener never does the unlink. * If the listener needs to close, tcp_eager_cleanup() is called which queues * a message on all eager perimeters. The eager then does the unlink, clears * any pointers to the listener's queue and drops the reference to the * listener. The listener waits in tcp_close outside the squeue until its * refcount has dropped to 1. This ensures that the listener has waited for * all eagers to clear their association with the listener. * Similarly, if eager decides to go away, it can unlink itself and close. * When the T_CONN_RES comes down, we check if eager has closed. 
Note that * the reference to eager is still valid because of the extra ref we put * Listener can always locate the eager under the protection * of the listener->tcp_eager_lock, and then do a refhold * on the eager during the accept processing. * The acceptor stream accesses the eager in the accept processing * based on the ref placed on eager before sending T_conn_ind. * The only entity that can negate this refhold is a listener close * which is mutually exclusive with an active acceptor stream. * Eager's reference on the listener * =================================== * If the accept happens (even on a closed eager) the eager drops its * reference on the listener at the start of tcp_accept_finish. If the * eager is killed due to an incoming RST before the T_conn_ind is sent up, * the reference is dropped in tcp_closei_local. If the listener closes, * the reference is dropped in tcp_eager_kill. In all cases the reference * is dropped while executing in the eager's context (squeue). /* Process the SYN packet, mp, directed at the listener 'tcp' */ * THIS FUNCTION IS DIRECTLY CALLED BY IP VIA SQUEUE FOR SYN. * tcp_input_data will not see any packets for listeners since the listener * has conn_recv set to tcp_input_listener. /* Note this executes in listener's squeue */ * The system is under memory pressure, so we need to do our part * to relieve the pressure. So we only accept new request if there * is nothing waiting to be accepted or waiting to complete the 3-way * handshake. This means that busy listener will not get too many * new requests which they cannot handle in time while non-busy * listener is still functioning properly. "tcp_input_listener: listen backlog (max=%d) " "overflow (%d pending) on %s",
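The memory-pressure rule stated just above (accept a new request only if nothing is waiting to be accepted and nothing is waiting to complete the 3-way handshake) can be written as a small predicate. This is a sketch under stated assumptions: `listener_load_t` and `listener_admits_syn()` are hypothetical names, and the ordinary backlog and half-open queue limits that the surrounding code also enforces are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of a listener's queues; not the real tcp_t. */
typedef struct listener_load {
	uint32_t pending_accept;	/* handshake done, waiting for accept */
	uint32_t pending_handshake;	/* half-open, handshake in progress */
} listener_load_t;

/*
 * Decide whether a new SYN may be admitted as far as the
 * memory-pressure rule is concerned.
 */
bool
listener_admits_syn(const listener_load_t *l, bool mem_pressure)
{
	assert(l != NULL);

	if (!mem_pressure)
		return (true);	/* rule only applies under memory pressure */

	/* Busy listeners take no new work; idle ones keep functioning. */
	return (l->pending_accept == 0 && l->pending_handshake == 0);
}
```

The effect is that a busy listener sheds load first, while a listener with empty queues can still accept a connection even when the system is short on memory.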
* Q0 is full. Drop a pending half-open req from the queue * to make room for the new SYN req. Also mark the time we * A more aggressive defense against SYN attack will * be to set the "tcp_syn_defense" flag now. "tcp_input_listener: listen half-open " "queue (max=%d) full (%d pending) on %s",
* Enforce the limit set on the number of connections per listener. * Note that tlc_cnt starts with 1. So need to add 1 to tlc_max "Listener (port %d) connection max (%u) " "reached: %u attempts dropped total\n",
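The off-by-one noted above (tlc_cnt starts at 1, so the comparison is made against tlc_max + 1) is easy to get wrong, so here is a minimal single-threaded sketch. The type `conn_limit_t` and its functions are hypothetical, and the sketch ignores the atomic updates a real multi-CPU implementation would need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-listener connection counter. */
typedef struct conn_limit {
	uint32_t cnt;	/* starts at 1, as the comment describes */
	uint32_t max;	/* configured maximum number of connections */
	uint32_t drops;	/* attempts rejected so far */
} conn_limit_t;

void
conn_limit_init(conn_limit_t *cl, uint32_t max)
{
	assert(cl != NULL);
	cl->cnt = 1;
	cl->max = max;
	cl->drops = 0;
}

/* Returns true and counts the connection if the limit allows it. */
bool
conn_limit_admit(conn_limit_t *cl)
{
	assert(cl != NULL);

	/* cnt is biased by 1, so compare against max + 1 (64-bit: no wrap). */
	if ((uint64_t)cl->cnt + 1 > (uint64_t)cl->max + 1) {
		cl->drops++;
		return (false);
	}
	cl->cnt++;
	return (true);
}
```

With `max` set to 2, two connections are admitted and the third attempt is counted as a drop, matching the "attempts dropped total" figure in the log message.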
* IP sets ira_sqp to either the senders conn_sqp (for loopback) * or based on the ring (for packets from GLD). Otherwise it is * set based on lbolt i.e., a somewhat random number. /* We already know the laddr of the new connection is ours */ /* Prepare for diffing against previous packets */ /* Source routing option copyover (reverse it) */ * If the SYN came with a credential, it's a loopback packet or a * labeled packet; attach the credential to the TPI message. /* Inherit the listener's SSL protection state */ /* Inherit the listener's non-STREAMS flag */ * Pre-allocate the T_ordrel_ind mblk for TPI socket so that * at close time, we will always have that to send up. * Otherwise, we need to do special handling in case the * allocation fails at that time. * Now that the IP addresses and ports are setup in econnp we * can do the IPsec policy work. * Inherit the policy from the listener; use /* Inherit various TCP parameters from the listener */ * tcp_set_destination() may set tcp_rwnd according to the route * metrics. If it does not, the eager's receive window will be set * to the listener's receive window later in this function. * Inherit listener's tcp_init_cwnd. Need to do this before * calling tcp_process_options() which set the initial cwnd. /* Discard any old label */ * If this is an MLP connection or a MAC-Exempt * connection with an unlabeled node, packets are to be * exchanged using the security label of the received * SYN packet instead of the server application's label. * tsol_check_dest called from ip_set_destination * might later update TSF_UNLABELED by replacing * ixa_tsl with a new label. * conn_connect() called from tcp_set_destination will verify * the destination is allowed to receive packets at the * security label of the SYN-ACK we are generating. As part of * that, tsol_check_dest() may create a new effective label for * Finally conn_connect() will call conn_update_label. 
* All that remains for TCP to do is to call * conn_build_hdr_template which is done as part of * Since we will clear tcp_listener before we clear tcp_detached * in the accept code we need tcp_hard_binding aka tcp_accept_inprogress * so we can tell a TCP_DETACHED_NONEAGER apart. * Adapt our mss, ttl, ... based on the remote address. /* Undo the bind_hash_insert */ /* Process all TCP options. */ /* Is the other end ECN capable? */ * The listener's conn_rcvbuf should be the default window size or a * window size changed via SO_RCVBUF option. First round up the * eager's tcp_rwnd to the nearest MSS. Then find out the window * scale option value if needed. Call tcp_rwnd_set() to finish the * Note if there is a rpipe metric associated with the remote host, * we should not inherit receive window size from listener. * Note that this is the only place tcp_rwnd_set() is called for * accepting a connection. We need to call it here instead of * after the 3-way handshake because we need to tell the other * side our rwnd in the SYN-ACK segment. /* Put a ref on the listener for the eager. */ /* Set tcp_listener before adding it to tcp_conn_fanout */ * Set tcp_listen_cnt so that when the connection is done, the counter * Tag this detached tcp vector for later retrieval * by our listener client in tcp_accept(). * -1 is "special" and defined in TPI as something * that should never be used in T_CONN_IND /* Don't drop the SYN that comes from a good IP source */ * We need to insert the eager in its own perimeter but as soon * as we do that, we expose the eager to the classifier and * should not touch any field outside the eager's perimeter. * So do all the work necessary before inserting the eager * in its own perimeter. Be optimistic that conn_connect() * will succeed but undo everything if it fails. * Increment the ref count as we are going to * enqueueing an mp in squeue * We need to start the rto timer. 
In normal case, we start * the timer after sending the packet on the wire (or at * least believing that packet was sent by waiting for * conn_ip_output() to return). Since this is the first packet * being sent on the wire for the eager, our initial tcp_rto * is at least tcp_rexmit_interval_min which is a fairly * large value to allow the algorithm to adjust slowly to large * fluctuations of RTT during first few transmissions. * Starting the timer first and then sending the packet in this * case shouldn't make much difference since tcp_rexmit_interval_min * is of the order of several 100ms and starting the timer * first and then sending the packet will result in difference * Without this optimization, we are forced to hold the fanout * lock across the ipcl_bind_insert() and sending the packet * so that we don't race against an incoming packet (maybe RST) * It is necessary to acquire an extra reference on the eager * at this point and hold it until after tcp_send_data() to * ensure against an eager close race. * Insert the eager in its own perimeter now. We are ready to deal * with any packets on eager. * Send the SYN-ACK. Use the right squeue so that conn_ixa is * only used by one thread at a time. * If a connection already exists, send the mp to that connection so * that it can be appropriately dealt with. * Something bad happened. ipcl_conn_insert() * failed because a connection already existed * in connected hash but we can't find it * anymore (someone blew it away). Just * free this message and hopefully remote * will retransmit at which time the SYN can be * treated as a new connection or dealt with * a TH_RST if a connection already exists. /* Nobody wants this packet */ * In an ideal case of vertical partition in NUMA architecture, it's * beneficial to have the listener and all the incoming connections * tied to the same squeue. 
The other constraint is that incoming * connections should be tied to the squeue attached to interrupted * CPU for obvious locality reason so this leaves the listener to * be tied to the same squeue. Our only problem is that when listener * is binding, the CPU that will get interrupted by the NIC whose * IP address the listener is binding to is not even known. So * the code below allows us to change that binding at the time the * CPU is interrupted by virtue of incoming connection's squeue. * This is useful only in case of a listener bound to a specific IP * address. For other kinds of listeners, they get bound the * very first time and there is no attempt to rebind them. * IP sets ira_sqp to either the senders conn_sqp (for loopback) * or based on the ring (for packets from GLD). Otherwise it is * set based on lbolt i.e., a somewhat random number. * No one from read or write side can access us now * except for already queued packets on this squeue. * But since we haven't changed the squeue yet, they * can't execute. If they are processed after we have * changed the squeue, they are sent back to the * correct squeue down below. * But a listener close can race with processing of * incoming SYN. If incoming SYN processing changes * the squeue then the listener close which is waiting * to enter the squeue would operate on the wrong * squeue. Hence we don't change the squeue here unless * the refcount is exactly the minimum refcount. The * minimum refcount of 4 is counted as - 1 each for * TCP and IP, 1 for being in the classifier hash, and * 1 for the mblk being processed. /* No special MT issues for outbound ixa_sqp hint */ * Assume we have picked a good squeue for the listener. Make * subsequent SYNs not try to change the squeue. * Send up all messages queued on tcp_rcv_list. /* Can't drain on an eager connection */ /* Can't be a non-STREAMS connection */ /* No need for the push timer now. 
*/ * Handle two cases here: we are currently fused or we were * previously fused and have some urgent data to be delivered * upstream. The latter happens because we either ran out of * memory or were detached and therefore sending the SIGURG was * deferred until this point. In either case we pass control * over to tcp_fuse_rcv_drain() since it may need to complete /* Does this need SSL processing first? */ * Queue data on tcp_rcv_list which is a b_next chain. * Each element of the chain is a b_cont chain. * M_DATA messages are added to the current element. * Other messages are added as new (b_next) elements. * Provide for protocols above TCP such as RPC. NOPID leaves * The cred could have already been set. /* Generate an ACK-only (no data) segment for a TCP endpoint */ * There are a few cases to be considered while setting the sequence no. * Essentially, we can come here while processing an unacceptable pkt * in the TCPS_SYN_RCVD state, in which case we set the sequence number * to snxt (per RFC 793), note the swnd wouldn't have been set yet. * If we are here for a zero window probe, stick with suna. In all * other cases, we check if suna + swnd encompasses snxt and set * the sequence number to snxt, if so. If snxt falls outside the * window (the receiver probably shrunk its window), we will go with * suna + swnd, otherwise the sequence no will be unacceptable to the * For the complex case where we have to send some * controls (FIN or SYN), let tcp_xmit_mp do it. /* Generate a simple ACK */ * Allocate space for TCP + IP headers /* Update the latest receive window size in TCP header. */ /* copy in prototype TCP + IP header */ /* Set the TCP sequence number. */ /* Set up the TCP flag field. */ /* fill in timestamp option if in use */ /* Fill in SACK options */ * Prime pump for checksum calculation in IP. Include the * adjustment for a source route if any. * Handle M_DATA messages from IP. It's called directly from IP via * squeue for received IP packets. 
* The first argument is always the connp/tcp to which the mp belongs. * There are no exceptions to this rule. The caller has already put * a reference on this connp/tcp and once tcp_input_data() returns, * the squeue will do the refrele. * The TH_SYN for the listener directly go to tcp_input_listener via * squeue. ICMP errors go directly to tcp_icmp_input(). * sqp: NULL = recursive, sqp != NULL means called from squeue * RST from fused tcp loopback peer should trigger an unfuse. * Record packet information in the ip_pkt_t * IPv6 packets can only be received by applications * that are prepared to receive IPv6 addresses. * The IP fanout must ensure this. /* Could have caused a pullup? */ * This is the correct place to update tcp_last_recv_time. Note * that it is also updated for tcp structure that belongs to * global and listener queues which do not really need updating. * But that should not cause any harm. And it is updated for * all kinds of incoming segments, not only for data segments. * TCP can't handle urgent pointers that arrive before * the connection has been accept()ed since it can't * buffer OOB data. Discard segment if this happens. * We can't just rely on a non-null tcp_listener to indicate * that the accept() has completed since unlinking of the * eager and completion of the accept are not atomic. * tcp_detached, when it is not set (B_FALSE) indicates * that the accept() has completed. * Nor can it reassemble urgent pointers, so discard * if it's not the next segment expected. * Otherwise, collapse chain into one mblk (discard if * that fails). This makes sure the headers, retransmitted * data, and new data all are in the same mblk. /* Update pointers into message */ * Since we can't handle any data with this urgent * pointer that is out of sequence, we expunge * the data. 
This allows us to still register * the urgent mark and generate the M_PCSIG, * Note that our stack cannot send data before a * connection is established, therefore the * following check is valid. Otherwise, it has /* Process all TCP options. */ * The following changes our rwnd to be a multiple of the * MIN(peer MSS, our MSS) for performance reason. /* Is the other end ECN capable? */ * Clear ECN flags because it may interfere with later /* Allocate room for SACK options if needed. */ * If we can't get the confirmation upstream, pretend * we didn't even see this one. * XXX: how can we pretend we didn't see it if we * have updated rnxt et. al. * For loopback we defer sending up the T_CONN_CON * until after some checks below. * tcp_sendmsg() checks tcp_state without entering * the squeue so tcp_state should be updated before * sending up connection confirmation. Probe the * state change below when we are sure the connection * confirmation has been sent. /* SYN was acked - making progress */ * If SYN was retransmitted, need to reset all * retransmission info. This is because this * segment will be treated as a dup ACK. * Set tcp_cwnd back to 1 MSS, per * Increasing TCP's Initial Window. * Always send the three-way handshake ack immediately * in order to make the connection complete as soon as * possible on the accepting host. * Trace connect-established here. /* Trace change from SYN_SENT -> ESTABLISHED here */ * Special case for loopback. At this point we have * received SYN-ACK from the remote endpoint. In * order to ensure that both endpoints reach the * fused state prior to any data exchange, the final * ACK needs to be sent before we indicate T_CONN_CON * to the module upstream. * For loopback, we always get a pure SYN-ACK * and only need to send back the final ACK * with no data (this is because the other * tcp is ours and we don't do T/TCP). This * final ACK triggers the passive side to * perform fusion in ESTABLISHED state. 
* Forget fusion; we need to handle more * complex cases below. Send the deferred * T_CONN_CON message upstream and proceed * as usual. Mark this tcp as not capable * Check to see if there is data to be sent. If * yes, set the transmit flag. Then check to see * if received data processing needs to be done. * If not, go straight to xmit_check. This short * cut is OK as we don't support T/TCP. * In this state, a SYN|ACK packet is either bogus * because the other side must be ACKing our SYN which * indicates it has seen the ACK for their SYN and * shouldn't retransmit it or we're crossing SYNs * NOTE: RFC 793 pg. 72 says this should be * tcp->tcp_suna <= seg_ack <= tcp->tcp_snxt * but that would mean we have an ack that ignored * No sane TCP stack will send such a small window * without receiving any data. Just drop this invalid * ACK. We also shorten the abort timeout in case * Only a TLI listener can come through this path when a * acceptor is going back to be a listener and a packet * for the acceptor hits the classifier. For a socket * listener, this can never happen because a listener * can never accept connection on itself and hence a * socket acceptor can not go back to being a listener. * Don't accept any input on a closed tcp as this TCP logically * does not exist on the system. Don't proceed further with * this TCP. For instance, this packet could trigger another * close of this tcp which would be disastrous for tcp_refcnt. * tcp_close_detached / tcp_clean_death / tcp_closei_local must * be called at most once on a TCP. In this case we need to * refeed the packet into the classifier and figure out where /* Drops ref on new_connp */ /* We failed to classify. For now just drop the packet */ * Handle the case where the tcp_clean_death() has happened * on a connection (application hasn't closed yet) but a packet * was already queued on squeue before tcp_clean_death() * was processed. 
Calling tcp_clean_death() twice on same * connection can result in weird behaviour. * If this is a detached connection and not an eager * connection hanging off a listener then new data * (past the FIN) will cause a reset. * We do a special check here where it * is out of the main line, rather than check * if we are detached every time we see new * This could be an SSL closure alert. We're detached so just * acknowledge it this last time. * This segment is not acceptable. * Drop it and send back an ACK. * SACK info is already updated in tcp_parse_options. Ignore * all other TCP options... * gap is the amount of sequence space between what we expect to see * and what we got for seg_seq. A positive value for gap means * something got lost. A negative value means we got some old stuff. /* Old stuff present. Is the SYN in there? */ /* Recompute the gaps after noting the SYN. */ /* Remove the old stuff from seg_len. */ * Make sure to check for unack'd FIN when rest of data * has been previously ack'd. * Resets are only valid if they lie within our offered * window. If the RST bit is set, we just ignore this * The arrival of dup data packets indicates that we * may have postponed an ack for too long, or the other * side's RTT estimate is out of shape. Start acking * This segment is "unacceptable". None of its * sequence space lies within our advertised window. * Adjust seg_len to the original value for tracing. "tcp_rput: unacceptable, gap %d, rgap %d, " "flags 0x%x, seg_seq %u, seg_ack %u, " "seg_len %d, rnxt %u, snxt %u, %s",
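The gap value described above, and the rgap value that appears in the trace message, can be illustrated with wrap-safe 32-bit arithmetic. This is a sketch, not the file's code: `seg_fit_t` and `seg_fit()` are hypothetical names, and the rgap formula (receive window minus gap minus segment length, negative when data falls beyond the window) is an assumption chosen to match the sign conventions the comments give.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: how a segment lines up with the window. */
typedef struct seg_fit {
	int32_t gap;	/* > 0: data missing before seg_seq; < 0: old data */
	int32_t rgap;	/* < 0: that many bytes lie beyond the window */
} seg_fit_t;

/*
 * rnxt is the next sequence number we expect, rwnd the receive window.
 * Unsigned subtraction wraps modulo 2^32, so the result stays correct
 * across sequence-number wraparound; the casts to int32_t assume the
 * usual two's-complement conversion.
 */
seg_fit_t
seg_fit(uint32_t seg_seq, uint32_t seg_len, uint32_t rnxt, uint32_t rwnd)
{
	seg_fit_t f;

	assert(seg_len <= (uint32_t)INT32_MAX);

	f.gap = (int32_t)(seg_seq - rnxt);
	f.rgap = (int32_t)(rwnd - ((uint32_t)f.gap + seg_len));
	return (f);
}
```

For example, with rnxt = 1000 and a 500-byte window, a 400-byte segment starting at 1200 leaves a 200-byte hole (gap = 200) and overshoots the window by 100 bytes (rgap = -100).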
* Arrange to send an ACK in response to the * unacceptable segment per RFC 793 page 69. There * is only one small difference between ours and the * acceptability test in the RFC - we accept ACK-only * packet with SEG.SEQ = RCV.NXT+RCV.WND and no ACK * Note that we have to ACK an ACK-only packet at least * for stacks that send 0-length keep-alives with * SEG.SEQ = SND.NXT-1 as recommended by RFC1122, * section 4.2.3.6. As long as we don't ever generate * an unacceptable packet in response to an incoming * packet that is unacceptable, it should not cause * Continue processing this segment in order to use the * ACK information it contains, but skip all other * sequence-number processing. Processing the ACK * information is necessary in order to * re-synchronize connections that may have lost * We clear seg_len and flag fields related to * sequence number processing as they are not * to be trusted for an unacceptable segment. /* Fix seg_seq, and chew the gap off the front. */ * If the urgent data has already been acknowledged, we * should ignore TH_URG below * rgap is the amount of stuff received out of window. A negative * value is the amount out of window. * seg_len does not include the FIN, so if more than * just the FIN is out of window, we act like we don't * see it. (If just the FIN is out of window, rgap * will be zero and we will go ahead and acknowledge /* Fix seg_len and make sure there is something left. */ * Resets are only valid if they lie within our offered * window. If the RST bit is set, we just ignore this /* Per RFC 793, we need to send back an ACK. */ * Send SIGURG as soon as possible i.e. even * if the TH_URG was delivered in a window probe * packet (which will be unacceptable). * We generate a signal if none has been generated * for this connection or if this is a new urgent * byte. Also send a zero-length "unmarked" message * to inform SIOCATMARK that this is not the mark. * tcp_urp_last_valid is cleared when the T_exdata_ind * is sent up. 
This plus the check for old data * (gap >= 0) handles the wraparound of the sequence * number space without having to always track the * correct MAX(tcp_urp_last, tcp_rnxt). (BSD tracks * this max in its rcv_up variable). * This prevents duplicate SIGURGS due to a "late" * zero-window probe when the T_EXDATA_IND has already /* Try again on the rexmit. */ * If the next byte would be the mark * then mark with MARKNEXT else mark * If this is a zero window probe, continue to * process the ACK part. But we need to set seg_len * to 0 to avoid data processing. Otherwise just * drop the segment and send back an ACK. /* Pitch out of window stuff off the end. */ * TCP should check ECN info for segments inside the window only. * Therefore the check should be done here. * Note that both ECN_CE and CWR can be set in the * same segment. In this case, we once again turn * Check whether we can update tcp_ts_recent. This test is * NOT the one in RFC 1323 3.4. It is from Braden, 1993, "TCP * Extensions for High Performance: An Update", Internet Draft. * FIN in an out of order segment. We record this in * tcp_valid_bits and the seq num of FIN in tcp_ofo_fin_seq. * Clear the FIN so that any check on FIN flag will fail. * Remember that FIN also counts in the sequence number * space. So we need to ack out of order FIN only segments. /* Fill in the SACK blk list. */ * Attempt reassembly and see if we have something /* Always ack out of order packets */ * A gap is filled and the seq num and len * of the gap match that of a previously * received FIN, put the FIN flag back in. * Restart the timer if there is still * data in the reassembly queue. * Keep going even with NULL mp. * There may be a useful ACK or something else * But TCP should not perform fast retransmit * because of the ack number. TCP uses * seg_len == 0 to determine if it is a pure * ACK. And this is not a pure ACK. 
* If an out of order FIN was received before, and the seq * num and len of the new segment match that of the FIN, * put the FIN flag back in. * The seq number must be in the window as it should * be "fixed" above. If it is outside window, it should * be already rejected. Note that we allow seg_seq to be * rnxt + rwnd because we want to accept 0 window probe. * If the ACK flag is not set, just use our snxt as the * seq number of the RST segment. * urp could be -1 when the urp field in the packet is 0 * and TCP_OLD_URP_INTERPRETATION is set. This implies that the urgent * byte was at seg_seq - 1, in which case we ignore the urgent flag. * Non-STREAMS sockets handle the urgent data a little * differently from STREAMS based sockets. There is no * need to mark any mblks with the MSG{NOT,}MARKNEXT * flags to keep SIOCATMARK happy. Instead a * su_signal_oob upcall is made to update the mark. * Neither is a T_EXDATA_IND mblk needed to be * prepended to the urgent data. The urgent data is * delivered using the su_recv upcall, where we set * the MSG_OOB flag to indicate that it is urg data. * Neither TH_SEND_URP_MARK nor TH_MARKNEXT_NEEDED * are used by non-STREAMS sockets. * If we haven't generated the signal yet for * this urgent pointer value, do it now. Also, * send up a zero-length M_DATA indicating * whether or not this is the mark. The latter * is not needed when a T_EXDATA_IND is sent up. * However, if there are allocation failures * this code relies on the sender retransmitting * and the socket code for determining the mark * should not block waiting for the peer to * transmit. Thus, for simplicity we always * send up the mark indication. /* Try again on the rexmit. */ * Mark with NOTMARKNEXT for now. * The code below will change this to MARKNEXT * If there are allocation failures (e.g. 
in * dupmsg below) the next time tcp_input_data * sees the urgent segment it will send up the "tcp_rput: sent M_PCSIG 2 seq %x urp %x " * An allocation failure prevented the previous * tcp_input_data from sending up the allocated * MSG*MARKNEXT message - send it up this time * If the urgent byte is in this segment, make sure that it is * all by itself. This makes it much easier to deal with the * possibility of an allocation failure on the T_exdata_ind. * Note that seg_len is the number of bytes in the segment, and * urp is the offset into the segment of the urgent byte. * urp < seg_len means that the urgent byte is in this segment. * Break it up and feed it back in. * Re-attach the IP header. * There is stuff before the urgent * Trim from urgent byte on. * The rest will come back. /* Feed this piece back in. */ * If the data passed back in was not * processed (ie: bad ACK) sending * the remainder back in will cause a * loop. In this case, drop the * packet and let the sender try * There is stuff after the urgent * Trim everything beyond the * urgent byte. The rest will * If the data passed back in was not * processed (ie: bad ACK) sending * the remainder back in will cause a * loop. In this case, drop the * packet and let the sender try * This segment contains only the urgent byte. We * have to allocate the T_exdata_ind, if we can. * We should never be in middle of a * fallback, the squeue guarantees that. * Generate any MSG*MARK message now. "tcp_rput: allocated exdata_ind %s",
* There is no need to send a separate MSG*MARK * message since the T_EXDATA_IND will be sent * Now we are all set. On the next putnext upstream, * tcp_urp_mp will be non-NULL and will get prepended * to what has to be this piece containing the urgent * byte. If for any reason we abort this segment below, * if it comes back, we will have this ready, or it * will get blown off in close. * The urgent byte is the next byte after this sequence * number. If this endpoint is non-STREAMS, then there * is nothing to do here since the socket has already * been notified about the urg pointer by the * su_signal_oob call above. * In case of STREAMS, some more work might be needed. * If there is data it is marked with MSGMARKNEXT and * any tcp_urp_mark_mp is discarded since it is not * needed. Otherwise, if the code above just allocated * a zero-length tcp_urp_mark_mp message, that message * is tagged with MSGMARKNEXT. Sending up these * MSGMARKNEXT messages makes SIOCATMARK work correctly * even though the T_EXDATA_IND will not be sent up * until the urgent byte arrives. "tcp_rput: AT MARK, len %d, flags 0x%x, %s",
/* Data left until we hit mark */ "tcp_rput: URP %d bytes left, %s",
/* 3-way handshake complete - pass up the T_CONN_IND */ * We are here means eager is fine but it can * get a TH_RST at any point between now and till * accept completes and disappear. We need to * ensure that reference to eager is valid after * we get out of eager's perimeter. So we do * The listener also exists because of the refhold * done in tcp_input_listener. It's possible that it * might have closed. We will check that once we * get inside listeners context. * We optimize by not calling an SQUEUE_ENTER * on the listener since we know that the * listener and eager squeues are the same. * We are able to make this check safely only * because neither the eager nor the listener * can change its squeue. Only an active connect * We are seeing the final ack in the three way * handshake of an active open'ed connection * so we must send up a T_CONN_CON * tcp_sendmsg() checks tcp_state without entering * the squeue so tcp_state should be updated before * sending up connection confirmation. Probe the state * change below when we are sure sending of the confirmation * Don't fuse the loopback endpoints for * simultaneous active opens. * For simultaneous active open, trace receipt of final * ACK as tcp:::connect-established. * For passive open, trace receipt of final ACK as * tcp:::accept-established. /* SYN was acked - making progress */ * If SYN was retransmitted, need to reset all * retransmission info as this segment will be * We set the send window to zero here. * This is needed if there is data to be * processed already on the queue. * Later (at swnd_update label), the * "new_swnd > tcp_swnd" condition is satisfied * the XMIT_NEEDED flag is set in the current * (SYN_RCVD) state. This ensures tcp_wput_data() is * called if there is already data on queue in /* Trace change from SYN_RCVD -> ESTABLISHED here */ /* Fuse when both sides are in ESTABLISHED state */ /* This code follows 4.4BSD-Lite2 mostly. 
*/ * If TCP is ECN capable and the congestion experience bit is * set, reduce tcp_cwnd and tcp_ssthresh. But this should only be * done once per window (or more loosely, per RTT). * If the cwnd is 0, use the timer to clock out * new segments. This is required by the ECN spec. * This makes sure that when the ACK comes * back, we will increase tcp_cwnd by 1 MSS. * This marks the end of the current window of in * flight data. That is why we don't use * tcp_suna + tcp_swnd. Only data in flight can * Fast retransmit. When we have seen exactly three * identical ACKs while we have unacked data * outstanding we take it as a hint that our peer * If TCP is retransmitting, don't do fast retransmit. /* Do Limited Transmit */ * What we need to do is temporarily * increase tcp_cwnd so that new * data can be sent if it is allowed * by the receive window (tcp_rwnd). * tcp_wput_data() will take care of * If the connection is SACK capable, * only do limited xmit when there * Note how tcp_cwnd is incremented. * The first dup ACK will increase * it by 1 MSS. The second dup ACK * will increase it by 2 MSS. This * means that only 1 new segment will * be sent for each dup ACK. * If we have reduced tcp_ssthresh * because of ECN, do not reduce it again * unless it is already one window of data * away. After one window of data, tcp_cwr * should then be cleared. Note that * for non ECN capable connection, tcp_cwr * should always be false. * Adjust cwnd since the duplicate * ack indicates that a packet was * dropped (due to congestion.) * We do Hoe's algorithm. Refer to her * paper "Improving the Start-up Behavior * of a Congestion Control Scheme for TCP," * appeared in SIGCOMM'96. * Save highest seq no we have sent so far. * Be careful about the invisible FIN byte. * Do not allow bursty traffic during. * fast recovery. Refer to Fall and Floyd's * paper "Simulation-based Comparisons of * Tahoe, Reno and SACK TCP" (in CCR?) * This is a best current practise. 
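The Limited Transmit note above (the first dup ACK raises tcp_cwnd by 1 MSS, the second by 2 MSS, so each dup ACK releases only one new segment) can be checked with a small model. This is a sketch under assumptions: the helper names are hypothetical, and expressing the temporary increase as a shift is an illustration of the described behaviour, not a quote of the source.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: temporary cwnd increase for Limited Transmit.
 * 1 MSS on the first dup ACK, 2 MSS on the second, none otherwise
 * (the third dup ACK is handled by fast retransmit instead).
 */
static uint32_t
limited_xmit_bonus(uint32_t mss, unsigned int dupack_cnt)
{
	assert(mss > 0);
	if (dupack_cnt == 0 || dupack_cnt > 2)
		return (0);
	return (mss << (dupack_cnt - 1));
}

/*
 * Hypothetical helper: bytes that may be sent right now, given the
 * temporarily inflated window and the bytes already in flight.
 */
static uint32_t
limited_xmit_usable(uint32_t cwnd, uint32_t in_flight, uint32_t mss,
    unsigned int dupack_cnt)
{
	uint32_t inflated = cwnd + limited_xmit_bonus(mss, dupack_cnt);

	return (inflated > in_flight ? inflated - in_flight : 0);
}
```

With a full window in flight, the first dup ACK frees exactly one MSS. Once that segment is sent and counted as in flight, the second dup ACK's 2 MSS increase again frees exactly one MSS.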
* Calculate tcp_pipe, which is the * estimated number of bytes in * tcp_fack is the highest sack'ed seq num * tcp_pipe is explained in the above quoted * Fall and Floyd's paper. tcp_fack is * explained in Mathis and Mahdavi's * "Forward Acknowledgment: Refining TCP * Congestion Control" in SIGCOMM '96. * Always initialize tcp_pipe * even though we don't have * any SACK info. If later * tcp_pipe is not initialized, * funny things will happen. * Here we perform congestion * avoidance, but NOT slow start. * This is known as the Fast * We know that one more packet has * left the pipe thus we can update * If the window has opened, need to arrange * to send additional data. /* tcp_suna != tcp_snxt */ /* Packet contains a window update */ * Transmit starting with tcp_suna since * the one byte probe is not ack'ed. * If TCP has sent more than one identical * probe, tcp_rexmit will be set. That means * tcp_ss_rexmit() will send out the one * byte along with new data. Otherwise, * fake the retransmission. * Check for "acceptability" of ACK value per RFC 793, pages 72 - 73. * If the ACK value acks something that we have not yet sent, it might * be an old duplicate segment. Send an ACK to re-synchronize the * Note: reset in response to unacceptable ACK in SYN_RECEIVE * state is handled above, so we can always just drop the segment and * In the case where the peer shrinks the window, we see the new window * update, but all the data sent previously is queued up by the peer. * To account for this, in tcp_process_shrunk_swnd(), the sequence * number, which was already sent, and within window, is recorded. * tcp_snxt is then updated. * If the window has previously shrunk, and an ACK for data not yet * sent, according to tcp_snxt is received, it may still be valid. If * the ACK is for data within the window at the time the window was * shrunk, then the ACK is acceptable. In this case tcp_snxt is set to * the sequence number ACK'ed. 
* If the ACK covers all the data sent at the time the window was * shrunk, we can now set tcp_is_wnd_shrnk to B_FALSE. * Should we send ACKs in response to ACK only segments? /* drop the received segment */ * Send back an ACK. If tcp_drop_ack_unsent_cnt is * greater than 0, check if the number of such * bogus ACKs is greater than that count. If yes, * don't send back any ACK. This prevents TCP from * getting into an ACK storm if somehow an attacker * successfully spoofs an acceptable segment to our * peer. If this continues (count > 2 X threshold), * we should abort this connection. * TCP gets a new ACK, update the notsack'ed list to delete those * blocks that are covered by this ACK. * If we got an ACK after fast retransmit, check to see * if it is a partial ACK. If it is not and the congestion * window was inflated to account for the other side's * cached packets, retract it. If it is, do Hoe's algorithm. * Restore the orig tcp_cwnd_ssthresh after * Remove all notsack info to avoid confusion with * Retransmit the unack'ed segment and * restart fast recovery. Note that we * need to scale back tcp_cwnd to the * original value when we started fast * recovery. This is to prevent overly * aggressive behaviour in sending new * TCP is retransmitting. If the ACK ack's all * outstanding data, update tcp_rexmit_max and * tcp_rexmit_nxt. Otherwise, update tcp_rexmit_nxt * Note that SEQ_LEQ() is used. This is to avoid * unnecessary fast retransmit caused by dup ACKs * received when TCP does slow start retransmission * after a time out. During this phase, TCP may * send out segments which are already received. * This causes dup ACKs to be sent back. * If tcp_xmit_head is NULL, then it must be the FIN being ack'ed. * Note that it cannot be the SYN being ack'ed. The code flow * Update the congestion window. 
* If TCP is not ECN capable or TCP is ECN capable but the * congestion experience bit is not set, increase the tcp_cwnd as * This is to prevent an increase of less than 1 MSS of * tcp_cwnd. With partial increase, tcp_wput_data() * may send out tinygrams in order to preserve mblk * By initializing tcp_cwnd_cnt to new tcp_cwnd and * decrementing it by 1 MSS for every ACK, tcp_cwnd is * increased by 1 MSS for every RTT. /* See if the latest urgent data has been acknowledged */ /* Can we update the RTT estimates? */ /* Ignore zero timestamp echo-reply. */ /* If needed, restart the timer. */ * Update tcp_csuna in case the other side stops sending * An ACK sequence we haven't seen before, so get the RTT * and update the RTO. But first check if the timestamp is /* Remember the last sequence to be ACKed */ /* Eat acknowledged bytes off the xmit queue. */ * Set a new timestamp if all the bytes timed by the * old timestamp have been ack'ed. * This notification is required for some zero-copy * clients to maintain a copy semantic. After the data * is ack'ed, client is safe to modify or reuse the buffer. /* Everything is ack'ed, clear the tail. */ * Cancel the timer unless we are still * waiting for an ACK for the FIN packet. * More was acked but there is nothing more * outstanding. This means that the FIN was * just acked or that we're talking to a clown. /* FIN was acked - making progress */ * We should never get here because * we have already checked that the * number of bytes ack'ed should be * smaller than or equal to what we * have sent so far (it is the * acceptability check of the ACK). * We can only get here if the send * Terminate the connection and * panic the system. It is better * for us to panic instead of * continuing to avoid other disaster. panic(
"Memory corruption " "detected for connection %s.",
* The following check is different from most other implementations. * For bi-directional transfer, when segments are dropped, the * "normal" check will not accept a window update in those * retransmitted segments. Failing to do that, TCP may send out * segments which are outside receiver's window. As TCP accepts * the ack in those retransmitted segments, if the window update in * the same segment is not accepted, TCP will incorrectly calculate * that it can send more segments. This can create a deadlock * with the receiver if its window becomes zero. * The criteria for update are: * 1. the segment acknowledges some data. Or * 2. the segment is new, i.e. it has a higher seq num. Or * 3. the segment is not old and the advertised window is * larger than the previous advertised window. * We implement the non-standard BSD/SunOS * FIN_WAIT_2 flushing algorithm. * If there is no user attached to this * TCP endpoint, then this TCP struct * could hang around forever in FIN_WAIT_2 * state if the peer forgets to send us * a FIN. To prevent this, we wait only * 2*MSL (a convenient time value) for * the FIN to arrive. If it doesn't show up, * we flush the TCP endpoint. This algorithm, * though a violation of RFC-793, has worked * for over 10 years in BSD systems. * Note: SunOS 4.x waits 675 seconds before * flushing the FIN_WAIT_2 connection. break;
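The three send-window update criteria listed in the comment above can be written as a single predicate. This is a minimal sketch, not the check from this file: the function and parameter names are hypothetical, and last_wnd_seq stands for the sequence number of the segment that last updated the window (SND.WL1 in RFC 793 terms).

```c
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-safe "a is before b" for 32-bit sequence numbers. */
static bool
seq_lt(uint32_t a, uint32_t b)
{
	return ((int32_t)(a - b) < 0);
}

/*
 * Hypothetical predicate for accepting a window update:
 *   1. the segment acknowledges some data, or
 *   2. the segment is new (higher sequence number), or
 *   3. the segment is not old and advertises a larger window.
 */
static bool
should_update_swnd(uint32_t bytes_acked, uint32_t seg_seq,
    uint32_t last_wnd_seq, uint32_t new_swnd, uint32_t cur_swnd)
{
	if (bytes_acked != 0)
		return (true);
	if (seq_lt(last_wnd_seq, seg_seq))
		return (true);
	if (!seq_lt(seg_seq, last_wnd_seq) && new_swnd > cur_swnd)
		return (true);
	return (false);
}
```

The first clause is what lets a retransmitted segment that acknowledges data also carry a window update, which is the case the comment says the "normal" check misses.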
/* Shutdown hook? */ /* Make sure we ack the fin */ * Generate the ordrel_ind at the end unless we * In the eager case tcp_rsrv will do this when run * after tcp_accept is done. * implies data piggybacked on FIN. * The header has been consumed, so we remove the * We have more unacked data than we should - send /* We don't have an ACK timer for detached TCP. */ * If we get a segment that is less than an mss, and we * already have unacknowledged data, and the amount * unacknowledged is not a multiple of mss, then we * better generate an ACK now. Otherwise, this may be * the tail piece of a transaction, and we would rather /* Start delayed ack timer */ /* Ready for a new signal. */ "tcp_rput: sending exdata_ind %s",
* Check for ancillary data changes compared to last segment. * Side queue inbound data until the accept happens. * M_DATA is queued on b_cont. Otherwise (T_OPTDATA_IND or * T_EXDATA_IND) it is queued on b_next. * XXX Make urgent data use this. Requires: * Removing tcp_listener check for TH_URG * Making M_PCPROTO and MARK messages skip the eager case * Note that no KSSL processing is done here, because * KSSL is not supported for non-STREAMS sockets. * We should never be in middle of a * fallback, the squeue guarantees that. /* PUSH bit set and sockfs is not flow controlled */ "tcp_rput: sending MSGMARKNEXT %s",
/* Does this need SSL processing first? */ /* Does this need SSL processing first? */ * Enqueue the new segment first and then * call tcp_rcv_drain() to send all data * up. The other way to do this is to * send all queued data up and then call * putnext() to send the new segment up. * This way can remove the else part later * We don't do this to avoid one more call to * canputnext() as tcp_rcv_drain() needs to * Enqueue all packets when processing an mblk * from the co queue and also enqueue normal packets. * Make sure the timer is running if we have data waiting * for a push bit. This provides resiliency against * implementations that do not correctly generate push bits. * The connection may be closed at this point, so don't * do anything for a detached tcp. /* Is there anything left to do? */ /* Any transmit work to do and a non-zero window? */ * For TH_LIMIT_XMIT, tcp_wput_data() is called to send * out new segment. Note that tcp_rexmit should not be * set, otherwise TH_LIMIT_XMIT should not be set. * Adjust tcp_cwnd back to normal value after sending * This will restart the timer. Restarting the * timer is used to avoid a timeout before the * limited transmitted segment's ACK gets back. /* Anything more to do? */ * Send up any queued data and then send the mark message "tcp_rput: sending zero-length %s %s",
* Time to send an ack for some reason. * Arrange for deferred ACK or push wait timeout. * Start timer if it is not already running. * Send up the ordrel_ind unless we are an eager guy. * In the eager case tcp_rsrv will do this when run * after tcp_accept is done. * Push any mblk(s) enqueued from co processing. * Attach ancillary data to a received TCP segments for the * ancillary pieces requested by the application that are * different than they were in the previous data segment. * Save the "current" values once memory allocation is ok so that * when memory allocation fails we can just wait for the next data segment. /* If app asked for pktinfo and the index has changed ... */ /* If app asked for hoplimit and it has changed ... */ /* If app asked for tclass and it has changed ... */ * If app asked for hopbyhop headers and it has changed ... * For security labels, note that (1) security labels can't change on * a connected socket at all, (2) we're connected to at most one peer, * (3) if anything changes, then it must be some other extra option. /* If app asked for dst headers before routing headers ... */ /* If app asked for routing headers and it has changed ... */ /* If app asked for dest headers and it has changed ... */ * Defer sending ancillary data until the next TCP segment * If app asked for pktinfo and the index has changed ... * Note that the local address never changes for the connection. /* Save as "last" value */ /* If app asked for hoplimit and it has changed ... */ /* Save as "last" value */ /* If app asked for tclass and it has changed ... */ /* Save as "last" value */ /* The minimum of smoothed mean deviation in RTO calculation. */ * Set RTO for this connection. The formula is from Jacobson and Karels' * "Congestion Avoidance and Control" in SIGCOMM '88. The variable names * are the same as those in Appendix A.2 of that paper. * sa = smoothed RTT average (8 * average estimates). * sv = smoothed mean deviation (mdev) of RTT (4 * deviation estimates). 
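The scaled estimator introduced above (sa holds 8 times the average, sv holds 4 times the mean deviation) can be sketched in the style of Appendix A.2 of the Jacobson and Karels paper. This is an illustration under assumptions: the struct and function names are hypothetical, time is in arbitrary integer units, and the TCP_SD_MIN floor and the extra interval terms that the comments in this function mention are omitted.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical state: sa = 8 * smoothed RTT, sv = 4 * mean deviation. */
struct rtt_est {
	int32_t sa;
	int32_t sv;
};

/* Feed one RTT sample m into the estimator and return the new RTO. */
static int32_t
rtt_update(struct rtt_est *e, int32_t m)
{
	assert(m >= 0);
	if (e->sa != 0) {
		m -= e->sa >> 3;	/* m is now the error in the estimate */
		e->sa += m;		/* avg = 7/8 avg + 1/8 sample */
		if (e->sa <= 0)
			e->sa = 1;	/* 0 is reserved for "reinitialize" */
		if (m < 0)
			m = -m;
		m -= e->sv >> 2;
		e->sv += m;		/* mdev += 1/4 (|error| - mdev) */
	} else {
		/* First sample: the resulting RTO is 3 * m. */
		e->sa = m << 3;
		e->sv = m << 1;
	}
	return ((e->sa >> 3) + e->sv);
}
```

Keeping sa and sv pre-scaled lets every update run on shifts and adds, and sv already equals four deviations, so the RTO is simply the sum of the average and sv.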
/* tcp_rtt_sa is not 0 means this is a new sample. */ * Update average estimator: * new rtt = 7/8 old rtt + 1/8 Error /* m is now Error in estimate. */ * Don't allow the smoothed average to be negative. * We use 0 to denote reinitialization of the * Update deviation estimator: * new mdev = 3/4 old mdev + 1/4 (abs(Error) - old mdev) * This follows BSD's implementation. So the reinitialized * RTO is 3 * m. We cannot go less than 2 because if the * link is bandwidth dominated, doubling the window size * during slow start means doubling the RTT. We want to be * more conservative when we reinitialize our estimates. 3 * is just a convenient number. * We do not know if sa captures the delay ACK * effect as in a long train of segments, a receiver * does not delay its ACKs. So set the minimum of sv * to be TCP_SD_MIN, which defaults to 400 ms, twice * of BSD DATO. That means the minimum of mean * RTO = average estimates (sa / 8) + 4 * deviation estimates (sv) * Add tcp_rexmit_interval extra in case of extreme environment * where the algorithm fails to work. The default value of * tcp_rexmit_interval_extra should be 0. * As we use a finer grained clock than BSD and update * RTO for every ACK, add in another .25 of RTT to the * deviation of RTO to accommodate burstiness of 1/4 of /* Now, we can reset tcp_timer_backoff to use the new RTO... */ * On a labeled system we have some protocols above TCP, such as RPC, which * appear to assume that every mblk in a chain has a db_credp. /* Learn the latest rwnd information that we sent to the other side. */ /* This is peer's calculated send window (our receive window). */ * Increase the receive window to max. But we need to do receiver * SWS avoidance. This means that we need to check the increase * of receive window is at least 1 MSS. * If the window that the other side knows is less than max * deferred acks segments, send an update immediately. * Handle a packet that has been reclassified by TCP. 
* This function drops the ref on connp that the caller had. /* Note that mp is NULL */ * do not drain, certain use cases can blow /* Not TCP; must be SOCK_RAW, IPPROTO_TCP */ /* Not flow-controlled, open rwnd */ * Send back a window update immediately if TCP is above * ESTABLISHED state and the increase of the rcv window * that the other side knows is at least 1 MSS after flow * The read side service routine is called mostly when we get back-enabled as a * result of flow control relief. Since we don't actually queue anything in * TCP, we have no data to send out of here. What we do is clear the receive * window, and send out a window update. /* No code does a putq on the read side */ * If tcp->tcp_rsrv_mp == NULL, it means that tcp_rsrv() has already * been run. So just return. /* At minimum we need 8 bytes in the TCP header for the lookup */ * tcp_icmp_input is called as conn_recvicmp to process ICMP error messages * passed up by IP. The message is always received on the correct tcp_t. * Assumes that IP has pulled up everything up to and including the ICMP header. /* Assume IP provides aligned packets */ * Verify IP version. Anything other than IPv4 or IPv6 packet is sent * upstream. ICMPv6 is handled in tcp_icmp_error_ipv6. /* Skip past the outer IP and ICMP headers */ * If we don't have the correct outer IP header length * or if we don't have a complete inner IP header /* Skip past the inner IP and find the ULP header */ * If we don't have the correct inner IP header length or if the ULP * is not IPPROTO_TCP or if we don't have at least ICMP_MIN_TCP_HDR * bytes of TCP header, drop it. * Update Path MTU, then try to send something out. * ICMP can snipe away incipient * TCP connections as long as * seq number is same as initial /* Record the error in case we finally time out. */ * Ditch the half-open connection if we * suspect a SYN attack is under way. * use a global boolean to control * whether TCP should respond to ICMP_SOURCE_QUENCH. 
* Reduce the sending rate as if we got a * tcp_icmp_error_ipv6 is called from tcp_icmp_input to process ICMPv6 * error messages passed up by IP. * Assumes that IP has pulled up all the extension headers as well * Verify that we have a complete IP header. * Verify if we have a complete ICMP and inner IP header. * Validate inner header. If the ULP is not IPPROTO_TCP or if we don't * have at least ICMP_MIN_TCP_HDR bytes of TCP header drop the * Update Path MTU, then try to send something out. /* Record the error in case we finally time out. */ * Ditch the half-open connection if we * suspect a SYN attack is under way. /* If this corresponds to an ICMP_PROTOCOL_UNREACHABLE */ * CALLED OUTSIDE OF SQUEUE! It can not follow any pointers that tcp might * change. But it can refer to fields like tcp_suna and tcp_snxt. * Function tcp_verifyicmp is called as conn_verifyicmp to verify the ICMP * error messages received by IP. The message is always received on the correct * TCP sequence number contained in payload of the ICMP error message * should be within the range SND.UNA <= SEG.SEQ < SND.NXT. Otherwise, * the message is either a stale ICMP error, or an attack from the * network. Fail the verification. /* For "too big" we also check the ignore flag */
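The range check described above for tcp_verifyicmp (SND.UNA <= SEG.SEQ < SND.NXT) can be sketched as follows. This is not the function from this file: the helper names are hypothetical, and only the sequence-number test is shown, without the "too big" ignore flag.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-safe "a is before b" for 32-bit sequence numbers. */
static bool
seq_lt(uint32_t a, uint32_t b)
{
	return ((int32_t)(a - b) < 0);
}

/*
 * Hypothetical check: the sequence number quoted in the ICMP payload
 * must refer to data that is sent but not yet acknowledged, i.e.
 * suna <= seg_seq < snxt. Anything else is treated as stale or forged.
 */
static bool
icmp_seq_in_flight(uint32_t seg_seq, uint32_t suna, uint32_t snxt)
{
	return (!seq_lt(seg_seq, suna) && seq_lt(seg_seq, snxt));
}
```

Because the comparison works on signed differences, the check still holds when the unacknowledged range straddles the 2^32 wrap point.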