dapl_hermon_hw.c

/*
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 *
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each file.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 */

/*
 * Copyright 2010 Sun Microsystems, Inc.  All rights reserved.
 * Use is subject to license terms.
 */

/* handy macro, useful because of cq_resize dynamics */

/*
 * Note: The 64 bit doorbells need to be written atomically.
 * In 32 bit libraries we need to use the special assembly routine
 * because compiler generated code splits the store into 2 word writes.
 */

/*
 * dapli_hermon_cq_doorbell()
 * Takes the specified cq cmd and cq number and rings the cq doorbell
 */

	/* Build the doorbell from the parameters */

	/* Write the doorbell to UAR */

	/*
	 * For 32 bit intel we assign the doorbell in the order
	 * prescribed by the Tavor PRM, lower to upper addresses
	 */

/*
 * dapli_hermon_qp_send_doorbell()
 * Takes the specified qp number and rings the send doorbell.
 */

	/* Write the doorbell to UAR */

	/*
	 * For 32 bit intel we assign the doorbell in the order
	 * prescribed by the Tavor PRM, lower to upper addresses
	 */

/*
 * dapli_hermon_wqe_send_build()
 * Constructs a WQE for a given ibt_send_wr_t
 */

	/*
	 * RC is the only supported transport in UDAPL.
	 * For RC requests, we allow "Send", "RDMA Read", "RDMA Write".
	 */

	/*
	 * If this is a Send request, then all we need is
	 * the Data Segment processing below.
	 * Initialize the information for the Data Segments.
	 */

	/*
	 * If this is an RDMA Read or RDMA Write request, then fill
	 * in the "Remote Address" header fields.
	 */

	/*
	 * Build the Remote Address Segment for the WQE, using
	 * the information from the RC work request.
	 */
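The doorbell atomicity note above can be sketched as follows. This is an illustrative stand-in, not the driver's actual assembly routine or the documented Hermon UAR layout; the `doorbell_write64` helper and the most-significant-word-at-lower-address ordering are assumptions of the sketch.

```c
#include <stdint.h>

/*
 * Illustrative sketch only: on 32-bit builds the compiler would split a
 * 64-bit store into two independent word writes, so the real driver uses
 * a special assembly routine to keep the doorbell atomic.  Here we mimic
 * the "lower to upper addresses" ordering by writing the two 32-bit
 * halves explicitly.  Placing the most-significant word at the lower
 * address is an assumption for this sketch.
 */
static void
doorbell_write64(volatile uint32_t *uar, uint64_t doorbell)
{
	uar[0] = (uint32_t)(doorbell >> 32);		/* lower address first */
	uar[1] = (uint32_t)(doorbell & 0xffffffffULL);	/* then upper address */
}
```

The split-and-recombine behavior is easy to verify against an in-memory "register" pair.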
	/* Update "ds" for filling in Data Segments (below) */

	/*
	 * Increment the upper "unconstrained" bits and keep
	 * the lower "constrained" bits the same.
	 */

	/* XXX - need equiv of "hermon_wr_bind_check(state, wr);" */

	/* XXX - uses hermon_mr_keycalc - what about Sinai vs. Arbel??? */

	/*
	 * Build the Bind Memory Window Segments for the WQE,
	 * using the information from the RC Bind memory
	 * window work request.
	 */

	/*
	 * Update the "ds" pointer.  Even though the "bind"
	 * operation requires no SGLs, this is necessary to
	 * facilitate the correct descriptor size calculations.
	 */

		    "dapli_hermon_wqe_send_build: invalid wr_opcode=%d\n",
	/*
	 * Now fill in the Data Segments (SGL) for the Send WQE based on
	 * the values setup above (i.e. "sgl", "nds", and the "ds" pointer).
	 * Start by checking for a valid number of SGL entries.
	 *
	 * For each SGL in the Send Work Request, fill in the Send WQE's data
	 * segments.  Note: We skip any SGL with zero size because Tavor
	 * hardware cannot handle a zero for "byte_cnt" in the WQE.  Actually,
	 * the encoding for zero means a 2GB transfer.  Because of this special
	 * encoding in the hardware, we mask the requested length with
	 * TAVOR_WQE_SGL_BYTE_CNT_MASK (so that 2GB will end up encoded as
	 * zero).
	 */
	for (i = 0; i < nds; i++) {
		/* need to reduce the length by dword "len" fields */

		/* if this sgl overflows the inline segment */
		} else {
			/* this sgl fully fits */
	}

	/* Return the size of the descriptor (in 16-byte chunks) */

	for (i = 0; i < nds; i++) {
		/*
		 * Fill in the Data Segment(s) for the current WQE,
		 * using the information contained in the
		 * scatter-gather list of the work request.
		 */
	}

	/* Return the size of the descriptor (in 16-byte chunks) */

	*size = 0;
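The byte_cnt rule described above can be shown with a small helper. The helper name and the mask value are assumptions for the sketch, standing in for the driver's `TAVOR_WQE_SGL_BYTE_CNT_MASK` constant.

```c
#include <stdint.h>

/*
 * Sketch of the SGL byte_cnt handling described above.  The hardware
 * interprets a byte_cnt of zero as a 2GB transfer, so zero-length SGLs
 * are skipped entirely, and requested lengths are masked to 31 bits so
 * that a 2GB request ends up encoded as zero.  The mask value below is
 * an assumption mirroring a TAVOR_WQE_SGL_BYTE_CNT_MASK-style constant,
 * not the driver's actual definition.
 */
#define WQE_SGL_BYTE_CNT_MASK	0x7fffffffU

static int
sgl_byte_cnt(uint64_t len, uint32_t *byte_cnt)
{
	if (len == 0)
		return (0);	/* skip: zero is not representable */
	*byte_cnt = (uint32_t)len & WQE_SGL_BYTE_CNT_MASK;
	return (1);		/* segment should be filled in */
}
```

A 2GB (0x80000000) request masks down to zero, which is exactly the special encoding the comment warns about.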
	/* do not use Hermon Blueflame */

/*
 * dapli_hermon_wqe_recv_build()
 * Builds the recv WQE for a given ibt_recv_wr_t
 */

	/* Fill in the Data Segments (SGL) for the Recv WQE */

	/* Check for valid number of SGL entries */

	/*
	 * For each SGL in the Recv Work Request, fill in the Recv WQE's data
	 * segments.  Note: We skip any SGL with zero size because Tavor
	 * hardware cannot handle a zero for "byte_cnt" in the WQE.  Actually,
	 * the encoding for zero means a 2GB transfer.  Because of this special
	 * encoding in the hardware, we mask the requested length with
	 * TAVOR_WQE_SGL_BYTE_CNT_MASK (so that 2GB will end up encoded as
	 * zero).
	 */

	/*
	 * Fill in the Data Segment(s) for the receive WQE, using the
	 * information contained in the scatter-gather list of the
	 * work request.
	 */

	/* Return the size of the descriptor (in 16-byte chunks) */

/*
 * dapli_hermon_wqe_srq_build()
 * Builds the recv WQE for a given ibt_recv_wr_t
 */

	/* Fill in the Data Segments (SGL) for the Recv WQE */

	/* Check for valid number of SGL entries */

	/*
	 * For each SGL in the Recv Work Request, fill in the Recv WQE's data
	 * segments.  Note: We skip any SGL with zero size because Tavor
	 * hardware cannot handle a zero for "byte_cnt" in the WQE.  Actually,
	 * the encoding for zero means a 2GB transfer.  Because of this special
	 * encoding in the hardware, we mask the requested length with
	 * TAVOR_WQE_SGL_BYTE_CNT_MASK (so that 2GB will end up encoded as
	 * zero).
	 */

	/*
	 * Fill in the Data Segment(s) for the receive WQE, using the
	 * information contained in the scatter-gather list of the
	 * work request.
	 */

	/*
	 * For SRQ, if the number of data segments is less than the maximum
	 * specified at alloc, then we have to fill in a special "key" entry in
	 * the sgl entry after the last valid one in this post request.
	 */

/*
 * Peeks into a given CQ to check if there are any events that can be
 * polled.  It returns the number of CQEs that can be polled.
 */
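The peek described above can be sketched as a walk over the CQ ring. The single `sw_owned` flag per CQE is a simplification of this sketch: real Hermon CQEs encode ownership in a bit whose expected polarity flips on each pass around the ring.

```c
#include <stdint.h>

/*
 * Sketch of the CQ peek: starting at the consumer index, count entries
 * until we find one still owned by the hardware.  The fake_cqe type and
 * its sw_owned flag are assumptions standing in for the real CQE layout.
 */
struct fake_cqe {
	int	sw_owned;	/* 1 if software may consume this entry */
};

static uint32_t
cq_peek(const struct fake_cqe *cq, uint32_t cons_indx, uint32_t size_mask)
{
	uint32_t polled = 0;

	while (cq[cons_indx & size_mask].sw_owned) {
		polled++;
		cons_indx++;		/* increment the consumer index */
		if (polled > size_mask)
			break;		/* walked the whole ring */
	}
	return (polled);
}
```

The masked index (`cons_indx & size_mask`) lets the consumer index run free while the ring stays a power-of-two array.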
	/* Get the consumer index */

	/* Calculate the pointer to the first CQ entry */

	/*
	 * Count entries in the CQ until we find an entry owned by
	 * the hardware.
	 */

		/* Error CQEs map to multiple work completions */

		/* Increment the consumer index */

		/* Update the pointer to the next CQ entry */

/*
 * dapli_hermon_cq_resize_helper()
 * This routine switches from the pre-cq_resize buffer to the new buffer.
 */

		    "munmap(%p:0x%llx) failed(%d)\n", cq->cq_addr,

	return (0);	/* SUCCESS */

/*
 * This routine polls CQEs out of a CQ and puts them into the ibt_wc_t
 * array that is passed in.
 */

	/* Get the consumer index */

	/* Calculate the pointer to the first CQ entry */

	/*
	 * Keep pulling entries from the CQ until we find an entry owned by
	 * the hardware.  As long as there are CQEs owned by SW, process
	 * each entry by calling dapli_hermon_cq_cqe_consume() and updating the
	 * CQ consumer index.  Note: We only update the consumer index if
	 * dapli_hermon_cq_cqe_consume() returns TAVOR_CQ_SYNC_AND_DB.
	 * Otherwise, it indicates that we are going to "recycle" the CQE
	 * (probably because it is an error CQE and corresponds to more than
	 * one work completion).
	 */

		/* Reset to hardware ownership is implicit in Hermon */

		/* Increment the consumer index */

		/* Update the pointer to the next CQ entry */

		/*
		 * If we have run out of space to store work completions,
		 * then stop and return the ones we have pulled off the CQ.
		 */

	/*
	 * Now we only ring the doorbell (to update the consumer index) if
	 * we've actually consumed a CQ entry.  If we have, for example,
	 * pulled from a CQE that we are still in the process of "recycling"
	 * for error purposes, then we would not update the consumer index.
	 */

	/*
	 * Update the consumer index in both the CQ handle and the
	 * doorbell record.
	 */

	/*
	 * If the CQ is empty, we can try to free up some of the WRID
	 * resources.
	 */

/*
 * dapli_hermon_cq_poll_one()
 * This routine polls one CQE out of a CQ and puts it into the ibt_wc_t
 * that is passed in.
 */

	/* Get the consumer index */

	/* Calculate the pointer to the first CQ entry */

	/*
	 * Keep pulling entries from the CQ until we find an entry owned by
	 * the hardware.  As long as there are CQEs owned by SW, process
	 * each entry by calling dapli_hermon_cq_cqe_consume() and updating the
	 * CQ consumer index.  Note: We only update the consumer index if
	 * dapli_hermon_cq_cqe_consume() returns TAVOR_CQ_SYNC_AND_DB.
	 * Otherwise, it indicates that we are going to "recycle" the CQE
	 * (probably because it is an error CQE and corresponds to more than
	 * one work completion).
	 */

		/* Reset to hardware ownership is implicit in Hermon */

		/* Increment the consumer index */

/*
 * dapli_hermon_cq_cqe_consume()
 * Converts a given CQE into an ibt_wc_t object
 */

	/*
	 * Determine if this is an "error" CQE by examining "opcode".  If it
	 * is an error CQE, then call dapli_hermon_cq_errcqe_consume() and
	 * return whatever status it returns.  Otherwise, this is a successful
	 * completion.
	 */

	/* Fetch the Work Request ID using the information in the CQE. */

	/*
	 * Parse the CQE opcode to determine completion type.  This will set
	 * not only the type of the completion, but also any flags that might
	 * be associated with it (e.g. whether immediate data is present).
	 */

	/*
	 * The following opcodes will not be generated in uDAPL:
	 * case TAVOR_CQE_SND_RDMAWR_IMM:
	 * case TAVOR_CQE_SND_SEND_IMM:
	 * case TAVOR_CQE_SND_ATOMIC_CS:
	 * case TAVOR_CQE_SND_ATOMIC_FA:
	 */

	/*
	 * The following opcodes will not be generated in uDAPL:
	 * case TAVOR_CQE_RCV_RECV_IMM:
	 * case TAVOR_CQE_RCV_RECV_IMM2:
	 * case TAVOR_CQE_RCV_RDMAWR_IMM:
	 * case TAVOR_CQE_RCV_RDMAWR_IMM2:
	 */

	/* If we got here, completion status must be success */

/*
 * dapli_hermon_cq_errcqe_consume()
 */

	/* Fetch the Work Request ID using the information in the CQE. */

	/*
	 * Parse the CQE opcode to determine completion type.  We know that
	 * the CQE is an error completion, so we extract only the completion
	 * status.
	 */

	/*
	 * The following error codes are not supported in the Tavor driver
	 * as they relate only to Reliable Datagram completion statuses:
	 * case TAVOR_CQE_LOCAL_RDD_VIO_ERR:
	 * case TAVOR_CQE_REM_INV_RD_REQ_ERR:
	 * case TAVOR_CQE_EEC_REM_ABORTED_ERR:
	 * case TAVOR_CQE_INV_EEC_NUM_ERR:
	 * case TAVOR_CQE_INV_EEC_STATE_ERR:
	 * case TAVOR_CQE_LOC_EEC_ERR:
	 */

	/*
	 * Return status to indicate that doorbell and sync may be
	 * necessary.
	 */

/*
 * dapli_hermon_cq_notify()
 * This function is used for arming the CQ by ringing the CQ doorbell.
 *
 * Note: there is something very subtle here.  This code assumes a very
 * specific behavior of the kernel driver.  The cmd_sn field of the
 * arm_dbr is updated by the kernel driver whenever a notification
 * event for the cq is received.  This code extracts the cmd_sn field
 * from the arm_dbr to know the right value to use.  The arm_dbr is
 * always updated atomically so that neither the kernel driver nor this
 * code will get confused about what the other is doing.
 *
 * Note: param is not used here.  It is necessary for arming a CQ for
 * N completions (param is N), but no uDAPL API supports this for now.
 * Thus, we declare ARGSUSED to make lint happy.
 */

	/*
	 * Determine if we are trying to get the next completion or the next
	 * "solicited" completion.  Then hit the appropriate doorbell.
	 */
	}
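The arming flow described above can be sketched as a pure doorbell-composition helper. The bit positions and command values below are invented for illustration; the real cmd_sn/cmd/cqn layout is defined by the Hermon PRM and the kernel driver. What the sketch preserves is the flow: read cmd_sn from the (atomically updated) arm doorbell record, then fold it into the doorbell together with the arm command and the CQ number.

```c
#include <stdint.h>

/* Hypothetical field layout, for illustration only */
#define ARM_DBR_CMD_SN_SHIFT	28
#define ARM_DBR_CMD_SN_MASK	0x3U
#define ARM_CMD_NEXT		0x2U	/* assumed: arm for any completion */
#define ARM_CMD_SOLICITED	0x1U	/* assumed: solicited completions only */

static uint32_t
cq_arm_doorbell(uint32_t arm_dbr, uint32_t cqn, int solicited)
{
	/* carry over the kernel-maintained command sequence number */
	uint32_t cmd_sn = (arm_dbr >> ARM_DBR_CMD_SN_SHIFT) &
	    ARM_DBR_CMD_SN_MASK;
	uint32_t cmd = solicited ? ARM_CMD_SOLICITED : ARM_CMD_NEXT;

	return ((cmd_sn << ARM_DBR_CMD_SN_SHIFT) | (cmd << 24) |
	    (cqn & 0xffffffU));
}
```

If the kernel bumps cmd_sn between arms, the next doorbell automatically carries the new value, which is the subtlety the comment calls out.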
	/* else it's already armed */
	}	/* else it's already armed */

	/*
	 * Since uDAPL posts 1 wqe per request, we
	 * only need to do stores for the last one.
	 */

/*
 * dapli_hermon_post_send()
 */

		    "post_send: invalid qp_state %d\n", ep->qp_state);
	/* Grab the lock for the WRID list */

	/* Save away some initial QP state */

	/*
	 * Check for "queue full" condition.  If the queue is already full,
	 * then no more WQEs can be posted; return an error.
	 */

	/*
	 * Increment the "tail index" and check for "queue full" condition.
	 * If we detect that the current work request is going to fill the
	 * work queue, then we mark this condition and continue.
	 */

	/*
	 * Get the user virtual address of the location where the next
	 * Send WQE should be built.
	 */

	/*
	 * Call tavor_wqe_send_build() to build the WQE at the given address.
	 * This routine uses the information in the ibt_send_wr_t and
	 * returns the size of the WQE when it returns.
	 */

	/*
	 * Get the descriptor (io address) corresponding to the location
	 * of the WQE.
	 */

	/*
	 * Add a WRID entry to the WRID list.  Need to calculate the
	 * "wqeaddr" to pass to dapli_tavor_wrid_add_entry().
	 * signaled_dbd is still calculated, but ignored.
	 */

	/*
	 * Now if the WRID tail entry is non-NULL, then this
	 * represents the entry to which we are chaining the
	 * new entries.  Since we are going to ring the
	 * doorbell for this WQE, we want to set its "dbd" bit.
	 * On the other hand, if the tail is NULL, even though
	 * we will have rung the doorbell for the previous WQE
	 * (for the hardware's sake) it is irrelevant to our
	 * purposes (for tracking WRIDs) because we know the
	 * request must have already completed.
	 */

	/* Update some of the state in the QP */

	for (i = 0; i < desc_sz * 2; i += 8) {
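The "queue full" check described above can be sketched with free-running head/tail counters over a power-of-two ring. The function names and the counter scheme (indices masked at access time rather than stored pre-masked) are assumptions of this sketch, not the driver's actual bookkeeping.

```c
#include <stdint.h>

/*
 * Sketch of the tail-index / queue-full logic: head is the next WQE to
 * complete, tail the next slot to post into, and the queue holds
 * size_mask + 1 slots.  Both counters run free; unsigned subtraction
 * gives the number of outstanding WQEs even across wraparound.
 */
static int
wq_full(uint32_t head, uint32_t tail, uint32_t size_mask)
{
	return ((uint32_t)(tail - head) > size_mask);
}

static int
wq_post_one(uint32_t head, uint32_t *tail, uint32_t size_mask)
{
	if (wq_full(head, *tail, size_mask))
		return (-1);	/* queue already full; post rejected */
	(*tail)++;		/* increment the "tail index" */
	return (0);
}
```

Note the post that lands in the last free slot still succeeds; it is the next post that gets rejected, matching the "mark this condition and continue" behavior.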
/*
 * dapli_hermon_post_recv()
 */

		    "post_recv: invalid qp_state %d\n", ep->qp_state);

	/* Grab the lock for the WRID list */

	/* Save away some initial QP state */

	/*
	 * For the ibt_recv_wr_t passed in, parse the request and build a
	 * Recv WQE.  Link the WQE with the previous WQE and ring the
	 * doorbell.
	 */

	/*
	 * Check for "queue full" condition.  If the queue is already full,
	 * then no more WQEs can be posted.  So return an error.
	 */

	/*
	 * Increment the "tail index" and check for "queue
	 * full" condition.  If we detect that the current
	 * work request is going to fill the work queue, then
	 * we mark this condition and continue.
	 */

	/* The user virtual address of the WQE to be built */

	/*
	 * Call tavor_wqe_recv_build() to build the WQE at the given
	 * address.  This routine uses the information in the
	 * ibt_recv_wr_t and returns the size of the WQE.
	 */

	/*
	 * Add a WRID entry to the WRID list.  Need to calculate the
	 * "wqeaddr" and "signaled_dbd" values to pass to
	 * dapli_tavor_wrid_add_entry().
	 * Note: all Recv WQEs are essentially "signaled".
	 */

	/*
	 * Now if the WRID tail entry is non-NULL, then this
	 * represents the entry to which we are chaining the
	 * new entries.  Since we are going to ring the
	 * doorbell for this WQE, we want to set its "dbd" bit.
	 * On the other hand, if the tail is NULL, even though
	 * we will have rung the doorbell for the previous WQE
	 * (for the hardware's sake) it is irrelevant to our
	 * purposes (for tracking WRIDs) because we know the
	 * request must have already completed.
	 */

	/* Update some of the state in the QP */

	/* Update the doorbell record */

/*
 * dapli_hermon_post_srq()
 */

	/* Grab the lock for the WRID list */

	/*
	 * For the ibt_recv_wr_t passed in, parse the request and build a
	 * Recv WQE.  Link the WQE with the previous WQE and ring the
	 * doorbell.
	 */

	/*
	 * Check for "queue full" condition.  If the queue is already full,
	 * ie. there are no free entries, then no more WQEs can be posted.
	 */

	/* Save away some initial SRQ state */

	/* Get the descriptor (IO Address) of the WQE to be built */

	/* The user virtual address of the WQE to be built */

	/*
	 * Call dapli_hermon_wqe_srq_build() to build the WQE at the given
	 * address.  This routine uses the information in the
	 * ibt_recv_wr_t and returns the size of the WQE.
	 */

	/* Add a WRID entry to the WRID list. */

	/*
	 * Now link the chain to the old chain (if there was one)
	 * and update the wqe_counter in the doorbell record.
	 */

	/* Update some of the state in the SRQ */

	/* Update the doorbell record */

/*
 * dapli_hermon_cq_srq_entries_flush()
 */

	/* ASSERT(MUTEX_HELD(&qp->qp_rq_cqhdl->cq_lock)); */

	/* Get the consumer index */

	/* Calculate the pointer to the first CQ entry */

	/*
	 * Loop through the CQ looking for entries owned by software.  If an
	 * entry is owned by software then we increment an 'outstanding_cqes'
	 * count to know how many entries total we have on our CQ.  We use this
	 * value further down to know how many entries to loop through looking
	 * for our same QP number.
	 */

		/* increment total cqes count */

		/* increment the consumer index */

		/* update the pointer to the next cq entry */

	/*
	 * Using the 'tail_cons_indx' that was just set, we now know how many
	 * total CQEs possible there are.  Set the 'check_indx' and the
	 * 'new_indx' to the last entry identified by 'tail_cons_indx'.
	 */

		/* Grab QP number from CQE */

		/*
		 * If the QP number is the same in the CQE as the QP that we
		 * have on this SRQ, then we must free up the entry off the
		 * SRQ.  We also make sure that the completion type is of the
		 * 'TAVOR_COMPLETION_RECV' type.  So any send completions on
		 * this CQ will be left as-is.  The handling of returning
		 * entries back to HW ownership happens further down.
		 */

			/* Add back to SRQ free list */

			/* Copy the CQE into the "next_cqe" */

		/* Move index to next CQE to check */

	/* Initialize removed cqes count */

	/* If an entry was removed */

	/*
	 * Set current pointer back to the beginning consumer index.
	 * At this point, all unclaimed entries have been copied to the
	 * index specified by 'new_indx'.  This 'new_indx' will be used
	 * as the new consumer index after we mark all freed entries as
	 * having HW ownership.  We do that here.
	 */

	/* Loop through all entries until we reach our new pointer */

		/* Reset entry to hardware ownership */

	/*
	 * Update consumer index to be the 'new_indx'.  This moves it past all
	 * removed entries.  Because 'new_indx' is pointing to the last
	 * previously valid SW owned entry, we add 1 to point the cons_indx to
	 * the first HW owned entry.
	 */

	/*
	 * Now we only ring the doorbell (to update the consumer index) if
	 * we've actually consumed a CQ entry.  If we found no QP number
	 * matches above, then we would not have removed anything.  So only if
	 * something was removed do we ring the doorbell.
	 */

	/*
	 * Update the consumer index in both the CQ handle and the
	 * doorbell record.
	 */

	p[1] = nds;	/* nds is 0 for SRQ */

	for (i = 0; i < numwqe; i++) {
		for (j = 0; j < wqesz; j += 64, wqe += 8)
	/* cq_resize -- needs testing */

	/* pre-link the whole shared receive queue */
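The "pre-link the whole shared receive queue" step can be sketched as building a free list through the WQEs themselves, so posting to and flushing from the SRQ become O(1) index pushes and pops. The `fake_srq_wqe` type, its `next_free` link word, and the queue depth are assumptions of this sketch, not the real WQE layout.

```c
#include <stdint.h>

#define SRQ_NUMWQE	8	/* assumed queue depth for the sketch */

/*
 * Sketch of pre-linking a shared receive queue: each free WQE stores
 * the index of the next free WQE, forming a circular free list over
 * the whole queue at initialization time.
 */
struct fake_srq_wqe {
	uint16_t	next_free;	/* index of the next free WQE */
};

static void
srq_prelink(struct fake_srq_wqe *wq, uint16_t numwqe)
{
	uint16_t i;

	for (i = 0; i < numwqe; i++)
		wq[i].next_free = (uint16_t)((i + 1) % numwqe);
}
```

After pre-linking, taking a WQE is just following `next_free` from the current head, and returning one relinks it, which is the "Add back to SRQ free list" step in the flush path above.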