amd64.esc revision 7aec1d6e253b21f9e9b7ef68b4d81ab9859b51fe
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License, Version 1.0 only
* (the "License"). You may not use this file except in compliance
* with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright 2006 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
/* #MEM#
* GET_ADDR relies on the fact that variables have global scope across an FME.
 * Thus, for each FME the assignment occurs only on the first invocation,
 * while the comparison happens on each. So if the new address matches the
* address of an existing open FME, then we return true running in the context
* of that FME. If the new address doesn't match the address of any existing
* open FME, then we return true in the context of a newly opened FME.
*/
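/*
 * A minimal sketch of what GET_ADDR could look like under the description
 * above, assuming eversholt's defined() and payloadprop() built-ins, an
 * FME-scoped $addr variable and an "addr" payload member; the real
 * definition may differ. (GET_OFFSET, referenced later, is assumed to be
 * defined alongside it.)
 */
#define	GET_ADDR (defined($addr) ? ($addr == payloadprop("addr")) : \
	($addr = payloadprop("addr")))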
/*
* SET_ADDR is used to set a payload value in the fault that we diagnose
* for page faults, to record the physical address of the faulting page.
*/
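/*
 * A sketch of SET_ADDR assuming eversholt's setpayloadprop() built-in;
 * the "asru-physaddr" member name is illustrative only.
 */
#define	SET_ADDR (setpayloadprop("asru-physaddr", payloadprop("addr")))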
/*
* RESOURCE_EXISTS is true if a pair with name "resource" exists in the
* payload - regardless of type (e.g., nvlist or nvlist array) or value.
*/
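/*
 * RESOURCE_EXISTS maps naturally onto eversholt's payloadprop_defined(),
 * which tests only for the presence of a payload member; a plausible
 * definition:
 */
#define	RESOURCE_EXISTS	(payloadprop_defined("resource"))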
/*
 * CONTAINS_DIMM is true if the "resource" nvlist array (as used in memory
 * ereports) exists and one of its members matches the path for the
 * dimm node. Our memory propagations are of the form "foo@dimm -> blah@cpu"
 * since cpus detect memory errors; in eversholt such a propagation, where
 * the lhs path and rhs path do not match, expands to the cross-product of
 * all dimms and cpus in the system. We use CONTAINS_DIMM to constrain
 * the propagation such that it only happens if the payload resource
 * matches the dimm.
 */
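/*
 * A sketch of CONTAINS_DIMM assuming eversholt's payloadprop_contains()
 * and asru() built-ins; it limits a "foo@dimm -> blah@cpu" propagation to
 * the dimm named in the ereport's "resource" payload member.
 */
#define	CONTAINS_DIMM	(payloadprop_contains("resource", asru(dimm)))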
/*
 * The following will tell us whether a syndrome that is known to be
 * correctable (from a mem_ecc1) is single-bit or multi-bit. For a
 * correctable ChipKill syndrome the number of bits set in the lowest
 * nibble indicates how many bits were in error.
 */
/*
 * A minimal single-bit check of the form implied above (the macro name and
 * exact expression are assumed): true when at most one bit of the low
 * syndrome nibble is set.
 */
#define	CKSINGLE(synd) ((synd) == 0 || \
	((((synd) & 0xf) & (((synd) & 0xf) - 1)) == 0))
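/*
 * One way the check above might be consumed; the "syndrome-type" and
 * "syndrome" payload member names, and the "E"/"C" type values, are
 * assumptions about the nb ereport payload rather than taken from this
 * file.
 */
#define	SINGLE_BIT_CE \
	(payloadprop("syndrome-type") == "E" || \
	(payloadprop("syndrome-type") == "C" && CKSINGLE(payloadprop("syndrome"))))
#define	MULTI_BIT_CE \
	(payloadprop("syndrome-type") == "C" && !CKSINGLE(payloadprop("syndrome")))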
/*
* A single bit fault in a memory dimm can cause:
*
* - mem_ce : reported by nb for an access from a remote cpu
*
* Single-bit errors are fed into a per-DIMM SERD engine; if a SERD engine
* trips we diagnose a fault.memory.page so that the response agent can
* retire the page that caused the trip. If the total number of pages
* faulted in this way on a single DIMM exceeds a threshold we will
* diagnose a fault.memory.dimm_sb against the DIMM.
*
 * Multi-bit ChipKill-correctable errors produce an immediate page fault
 * and corresponding fault.memory.dimm_ck. This is achieved with SERD
 * engines using N=0, so the machinery is in place should we ever want to
 * be a little more tolerant of these errors.
*
* Uncorrectable errors produce an immediate page fault and corresponding
* fault.memory.dimm_ue.
*
 * Page faults are essentially internal - action is only required when
 * they are accompanied by a dimm fault. As such we include message=0
 * on page faults.
*/
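/*
 * A sketch of the per-DIMM SERD arrangement described above; the engine
 * name and the N/T values are illustrative, not taken from this file.
 */
engine serd.memory.page_sb@dimm, N=3, T=72h;
event upset.memory.page_sb@dimm, engine=serd.memory.page_sb@dimm;
prop upset.memory.page_sb@dimm (1)->
    ereport.cpu.amd.nb.mem_ce@cpu { CONTAINS_DIMM && SINGLE_BIT_CE && GET_ADDR };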
/*
* If the address is not valid then no resource member will be included
* in a nb.mem_ce or nb.mem_ue ereport. These cases should be rare.
 * We will discard such ereports. An alternative may be to SERD them
 * on a per-MC basis and trip if we see too many such events.
*/
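/*
 * A possible null diagnosis for the address-less ereports mentioned above
 * (the upset name is illustrative):
 */
event upset.memory.discard@cpu;
prop upset.memory.discard@cpu (1)->
    ereport.cpu.amd.nb.mem_ce@cpu { !RESOURCE_EXISTS },
    ereport.cpu.amd.nb.mem_ue@cpu { !RESOURCE_EXISTS };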
/* #PAGE#
* Page faults of all types diagnose to a single fault class and are
* counted with a stat.
*
* Single-bit errors are diagnosed as upsets and feed into per-DIMM
* SERD engines which diagnose fault.memory.page if they trip.
*/
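/*
 * A sketch of the single page-fault class and its per-DIMM stat engine;
 * the FITrate, ASRU and engine name are assumptions.
 */
engine stat.page_flts@dimm;
event fault.memory.page@dimm, FITrate=1000, message=0, ASRU=dimm,
    count=stat.page_flts@dimm;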
/* #DIMM_SB#
* Single-bit DIMM faults are diagnosed when the number of page faults
* (of all types since they all are counted in a single per-DIMM stat engine)
* reaches a threshold. Since our tolerance of ChipKill and UE faults
* is much lower than that for single-bit errors the threshold will only be
* reached for repeated single-bit page faults. We do not stop diagnosing
* further single-bit page faults once we have declared a single-bit DIMM
* fault - we continue diagnosing them and response agents can continue to
* retire those pages up to the system-imposed retirement limit.
*
* We maintain a parallel SERD engine to the page_sb engine which trips
* in unison, but on trip it generates a distinct ereport which we
* diagnose to a dimm_sb fault if the threshold has been reached, or
* to a throwaway upset if not.
*/
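/*
 * A sketch of the trip handling described above, assuming the parallel
 * engine publishes an ereport.memory.page_sb_trip when it fires; the trip
 * class, threshold value and count() usage are assumptions.
 */
#define	SB_PGFLT_THRESH	64
event fault.memory.dimm_sb@dimm, FITrate=2000, FRU=dimm;
event upset.memory.dimm_sb_discard@dimm;
prop fault.memory.dimm_sb@dimm (1)->
    ereport.memory.page_sb_trip@dimm
    { count(stat.page_flts@dimm) >= SB_PGFLT_THRESH };
prop upset.memory.dimm_sb_discard@dimm (1)->
    ereport.memory.page_sb_trip@dimm
    { count(stat.page_flts@dimm) < SB_PGFLT_THRESH };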
/* #DIMM_CK#
 * ChipKill-correctable multi-bit faults indicate a likely failing SDRAM
 * device on the DIMM; as noted above, these produce an immediate page
 * fault and a corresponding fault.memory.dimm_ck.
 */
/*
 * A hedged reconstruction of the statement this trailing constraint likely
 * belongs to; the event names and the rest of the constraint are assumed.
 */
prop fault.memory.page_ck@dimm (1)-> ereport.cpu.amd.nb.mem_ce@cpu
    { CONTAINS_DIMM && MULTI_BIT_CE && GET_ADDR && GET_OFFSET };
/* #DIMM_UE#
* A multi-bit fault in a memory dimm can cause:
*
* - ue : reported by nb for an access from a remote cpu
*
 * Note that we use a SERD engine here simply as a way of ensuring that we
 * get both dimm and page faults reported.
*/
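/*
 * A sketch of the N=0 SERD usage implied above (names and T illustrative):
 * the engine trips on the first UE, so both page and dimm faults get
 * reported through the same machinery as the single-bit case.
 */
engine serd.memory.dimm_ue@dimm, N=0, T=1h;
event upset.memory.dimm_ue@dimm, engine=serd.memory.dimm_ue@dimm;
prop upset.memory.dimm_ue@dimm (1)->
    ereport.cpu.amd.nb.mem_ue@cpu { CONTAINS_DIMM && GET_ADDR };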
/* #L2D#
* l2 cache data errors.
*/
/* #L2D_SINGLE#
* A single bit data array fault in an l2 cache can cause:
*
* - inf_l2_ecc1 : reported by ic on this cpu
* - inf_l2_ecc1 : reported by dc on this cpu
* - l2d_ecc1 : reported by bu on copyback or on snoop from another cpu
*
* Single-bit errors are diagnosed to cache upsets. SERD engines are used
* to count upsets resulting from CEs.
*/
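/*
 * A sketch of the SERD-backed upsets for the single-bit case; the engine
 * parameters and the exact ereport class names
 * (ereport.cpu.amd.<detector>.<error>) are assumptions.
 */
engine serd.cpu.amd.l2d_sb@cpu, N=3, T=12h;
event upset.cpu.amd.l2d_sb@cpu, engine=serd.cpu.amd.l2d_sb@cpu;
prop upset.cpu.amd.l2d_sb@cpu (1)->
    ereport.cpu.amd.ic.inf_l2_ecc1@cpu,
    ereport.cpu.amd.dc.inf_l2_ecc1@cpu,
    ereport.cpu.amd.bu.l2d_ecc1@cpu;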
/* #L2D_MULTI#
* A multi-bit data array fault in an l2 cache can cause:
*
* - inf_l2_eccm : reported by ic on this cpu
* - inf_l2_eccm : reported by dc on this cpu
* - l2d_eccm : reported by bu on copyback or on snoop from another cpu
*/
/* #L2T#
* l2 cache main tag errors
*/
/* #L2T_SINGLE#
* A single bit tag array fault in an l2 cache can cause:
*
* - l2t_ecc1 : reported by bu on this cpu when detected during snoop
* - l2t_par : reported by bu on this cpu when detected other than during snoop
*
 * Note that the bu.l2t_par ereport could be due to a single-bit or a
 * multi-bit event. If the l2t_sb_trip has already triggered, it will be
 * treated as another ce; otherwise it will be treated as a ue event.
*/
/* #L2T_MULTI#
* A multi-bit tag array fault in an l2 cache can cause:
*
* - l2t_eccm : reported by bu on this cpu when detected during snoop
* - l2t_par : reported by bu on this cpu when detected other than during snoop
*/
/* #ICD_PAR#
* A data array parity fault in an I cache can cause:
*
* - data_par : reported by ic on this cpu
*/
/* #ICT_PAR#
* A tag array parity fault in an I cache can cause:
*
* - tag_par : reported by ic on this cpu
*/
/* #ICT_SNOOP#
* A snoop tag array parity fault in an I cache can cause:
*
* - stag_par : reported by ic on this cpu
*/
/* #ICTLB_1#
* An l1tlb parity fault in an I cache can cause:
*
* - l1tlb_par : reported by ic on this cpu
*/
/* #ICTLB_2#
* An l2tlb parity fault in an I cache can cause:
*
* - l2tlb_par : reported by ic on this cpu
*/
/* #DCD#
* dcache data errors
*/
/* #DCD_SINGLE#
 * A single bit data array fault in a D cache can cause:
*
* - data_ecc1 : reported by dc on this cpu by scrubber
* - data_ecc1_uc : reported by dc on this cpu other than by scrubber
*
 * Make data_ecc1_uc fault immediately, as it may have caused a panic.
*/
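/*
 * A sketch of the split described above (names and parameters assumed):
 * the scrubber-detected CE feeds a SERD engine while data_ecc1_uc is
 * diagnosed to a fault immediately.
 */
engine serd.cpu.amd.dc_sb@cpu, N=2, T=12h;
event upset.cpu.amd.dc_sb@cpu, engine=serd.cpu.amd.dc_sb@cpu;
prop upset.cpu.amd.dc_sb@cpu (1)-> ereport.cpu.amd.dc.data_ecc1@cpu;

event fault.cpu.amd.dcachedata@cpu, FITrate=1000, ASRU=cpu;
prop fault.cpu.amd.dcachedata@cpu (1)->
    ereport.cpu.amd.dc.data_ecc1_uc@cpu;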
/* #DCD_MULTI#
 * A multi-bit data array fault in a D cache can cause:
*
* - data_eccm : reported by dc on this cpu
*/
/* #DCT_PAR#
 * A tag array parity fault in a D cache can cause:
*
* - tag_par : reported by dc on this cpu
*/
/* #DCT_SNOOP#
 * A snoop tag array parity fault in a D cache can cause:
*
* - stag_par : reported by dc on this cpu
*/
/* #DCTLB_1#
 * An l1tlb parity fault in a D cache can cause:
*
* - l1tlb_par : reported by dc on this cpu
*/
/* #DCTLB_2#
 * An l2tlb parity fault in a D cache can cause:
*
* - l2tlb_par : reported by dc on this cpu
*/
/* #DPATH_SB#
 * A single bit fault in the datapath between the NB and requesting core
 * can cause:
*
* - inf_sys_ecc1 : reported by ic on access from a local cpu
* - inf_sys_ecc1 : reported by dc on access from a local cpu
* - s_ecc1 : reported by bu on access from a local cpu (hw prefetch etc)
*/
/* #DPATH_MB#
* A multi-bit fault in the datapath between the NB and requesting core
* can cause:
*
* - inf_sys_eccm : reported by ic on access from a local cpu
* - inf_sys_eccm : reported by dc on access from a local cpu
* - s_eccm : reported by bu on access from a local cpu (hw prefetch etc)
*/
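/*
 * A sketch of an immediate diagnosis for the multi-bit datapath case; the
 * fault class and its properties are assumptions.
 */
event fault.cpu.amd.datapath@cpu, FITrate=1000, ASRU=cpu;
prop fault.cpu.amd.datapath@cpu (1)->
    ereport.cpu.amd.ic.inf_sys_eccm@cpu,
    ereport.cpu.amd.dc.inf_sys_eccm@cpu,
    ereport.cpu.amd.bu.s_eccm@cpu;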
/*
* Ereports that should not normally happen and which we will discard
* without diagnosis if they do. These fall into a few categories:
*
 * - the corresponding detector is not enabled, typically because
 * handling of the event takes place elsewhere
 * (nb.ma, nb.ta, ls.rde, ic.rdde, bu.s_rde, nb.gart_walk)
 * - the event is associated with a sync flood, so even if the detector is
 * enabled we will never handle the event and generate an ereport, *and*
 * even if the ereport did arrive we could perform no useful diagnosis;
 * e.g., the NB can be configured for sync flood on nb.mem_eccm,
 * but we don't choose to discard that ereport here since we could have
 * made a useful diagnosis from it had it been delivered
 * (nb.ht_sync, nb.ht_crc)
* - events that will be accompanied by an immediate panic and
* delivery of the ereport during subsequent reboot but from
* which no useful diagnosis can be made. (nb.rmw, nb.wdog)
*
* Ereports for all of these can be generated by error simulation and
 * injection. We will perform a null diagnosis of all these ereports in order
* to avoid "no subscription" complaints during test harness runs.
*/
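/*
 * A sketch of the null diagnosis for the classes listed above (the upset
 * name is illustrative); subscribing a throwaway upset to each ereport
 * avoids "no subscription" complaints without producing a fault.
 */
event upset.cpu.amd.discard@cpu;
prop upset.cpu.amd.discard@cpu (1)->
    ereport.cpu.amd.nb.ma@cpu,
    ereport.cpu.amd.nb.ta@cpu,
    ereport.cpu.amd.ls.rde@cpu,
    ereport.cpu.amd.ic.rdde@cpu,
    ereport.cpu.amd.bu.s_rde@cpu,
    ereport.cpu.amd.nb.gart_walk@cpu,
    ereport.cpu.amd.nb.ht_sync@cpu,
    ereport.cpu.amd.nb.ht_crc@cpu,
    ereport.cpu.amd.nb.rmw@cpu,
    ereport.cpu.amd.nb.wdog@cpu;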