mcelog.triggers (5) - Linux Manuals

NAME

mcelog.triggers - mcelog trigger scripts reference

SYNOPSIS

/etc/mcelog/bus-error-trigger
/etc/mcelog/cache-error-trigger
/etc/mcelog/dimm-error-trigger
/etc/mcelog/iomca-error-trigger
/etc/mcelog/page-error-trigger
/etc/mcelog/socket-memory-error-trigger
/etc/mcelog/unknown-error-trigger

DESCRIPTION

mcelog(8) maintains error thresholds using a leaky-bucket algorithm. When the number of errors in a specific time window exceeds a pre-configured threshold, a trigger is executed. Triggers are usually shell scripts in the /etc/mcelog directory, but can also be other internal actions. Thresholds and triggers are configured in mcelog.conf(5).
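The threshold accounting described above can be illustrated with a small shell sketch. It is a generic, coarse leaky bucket and not mcelog's actual implementation; the names CAPACITY, AGETIME and bucket_account, and the values used, are assumptions made for illustration.

```shell
#!/bin/sh
# Generic leaky-bucket sketch (illustrative; not mcelog's code).
# The threshold is "more than CAPACITY events per AGETIME seconds".
CAPACITY=10        # hypothetical threshold
AGETIME=86400      # hypothetical window: 24 hours in seconds
bucket_count=0     # events currently held in the bucket
bucket_tstamp=0    # time of the last drain, in seconds

# bucket_account NOW: record one event that happened at time NOW.
# Returns 0 (true) when the threshold is exceeded, 1 otherwise.
bucket_account() {
    now=$1
    elapsed=$((now - bucket_tstamp))
    if [ "$elapsed" -ge "$AGETIME" ]; then
        # Drain in proportion to the elapsed time, but only once at
        # least one full AGETIME has passed since the last drain.
        leak=$((elapsed * CAPACITY / AGETIME))
        if [ "$leak" -ge "$bucket_count" ]; then
            bucket_count=0
        else
            bucket_count=$((bucket_count - leak))
        fi
        bucket_tstamp=$now
    fi
    bucket_count=$((bucket_count + 1))
    if [ "$bucket_count" -gt "$CAPACITY" ]; then
        bucket_count=0    # start over after firing
        return 0
    fi
    return 1
}
```

A caller would run something like `bucket_account "$(date +%s)" && run_trigger`, where run_trigger is a placeholder for whatever action is configured.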

Triggers run as the user configured for mcelog in mcelog.conf, root by default. The default trigger action can be overridden by specifying a different trigger script in the configuration file. Additional actions (such as notifying an administrator) can be put into the respective /etc/mcelog/*.local script, which is executed after the default action. This allows the default scripts to be updated without overwriting local actions. All trigger actions are also logged to syslog.
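The ordering described above (default action first, then the matching .local script) could be sketched as follows. This is an illustration only: run_local is a hypothetical helper, not taken from the shipped trigger scripts, and running the .local file as a child process rather than sourcing it is an assumption.

```shell
#!/bin/sh
# Illustrative sketch of the "default action, then .local" ordering.

# run_local TRIGGER_PATH: execute TRIGGER_PATH.local if it exists and
# is executable. Exported environment variables are inherited by it.
run_local() {
    local_script="$1.local"
    if [ -x "$local_script" ]; then
        "$local_script"
    fi
}

# Example use at the end of a trigger script (path shown for illustration):
#   ... default action ...
#   run_local /etc/mcelog/dimm-error-trigger
```

Because the local script is a separate file, a package update can replace the default trigger without touching it.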

The DIMM and socket memory error triggers

The /etc/mcelog/dimm-error-trigger and /etc/mcelog/socket-memory-error-trigger scripts are executed when a DIMM or a CPU socket exceeds a configured corrected or uncorrected memory error threshold. The thresholds are configured in the mcelog.conf [dimm] and [socket] sections. The default triggers log a warning message in the system log. The triggers are only executed when mcelog runs as a daemon.

Arguments are passed as environment variables:

THRESHOLD        Human readable threshold status
MESSAGE          Human readable consolidated error message
TOTALCOUNT       Total corrected or uncorrected count of errors for current DIMM, depending on what triggered the event
LOCATION         Consolidated location as a single string
DMI_LOCATION     DIMM location from DMI/SMBIOS if available
DMI_NAME         DIMM identifier from DMI/SMBIOS if available
DIMM             DIMM number reported by hardware
CHANNEL          Channel number reported by hardware
SOCKETID         Socket ID of CPU that includes the memory controller with the DIMM
CECOUNT          Total corrected error count for DIMM
UCCOUNT          Total uncorrected error count for DIMM
LASTEVENT        Time stamp of event that triggered threshold (in time_t format, seconds)
THRESHOLD_COUNT  Total number of events of the specific type in the current threshold time period

After the default action, local actions in /etc/mcelog/dimm-error-trigger.local or /etc/mcelog/socket-memory-error-trigger.local, respectively, are executed.
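As an illustration, a local DIMM action might condense the environment variables listed above into a single syslog line. The following is a minimal sketch, not a shipped script: the function name dimm_summary, the syslog tag mcelog-local and the fallback strings are assumptions.

```shell
#!/bin/sh
# Hypothetical /etc/mcelog/dimm-error-trigger.local (illustrative sketch).
# mcelog passes the DIMM details in the environment.

# Build a one-line summary; unset variables fall back to placeholders.
dimm_summary() {
    printf 'DIMM %s (socket %s, channel %s): %s; CE=%s UC=%s\n' \
        "${DMI_NAME:-${DIMM:-unknown}}" \
        "${SOCKETID:-unknown}" \
        "${CHANNEL:-unknown}" \
        "${THRESHOLD:-no threshold status}" \
        "${CECOUNT:-0}" \
        "${UCCOUNT:-0}"
}

# Act only when the trigger environment is present, so running the
# script by hand without any variables set does nothing.
if [ -n "${THRESHOLD:-}" ]; then
    dimm_summary | logger -t mcelog-local -p daemon.warning
fi
```

The same approach applies to socket-memory-error-trigger.local. The logger call could be replaced with a mail or monitoring command, depending on how administrators are notified locally.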

The page error trigger

The /etc/mcelog/page-error-trigger script is executed by mcelog in daemon mode when a page in memory exceeds a pre-configured corrected or uncorrected error threshold. mcelog also implements offlining the page through the kernel internally. This is configured through the [page] section of mcelog.conf(5).

The environment arguments are the same as for the dimm-error-trigger script.

After the default action, local actions in /etc/mcelog/page-error-trigger.local are executed.

The cache error trigger

The /etc/mcelog/cache-error-trigger shell script is called for cache error handling in daemon mode when a CPU reports excessive corrected cache errors. This can be an indication of future uncorrected errors.

This trigger is configured through the [cache] section in the mcelog.conf(5) configuration file. The threshold is defined by the CPU. The default trigger offlines the affected CPU cores, unless it is the last core running.

Arguments are passed as environment variables:

MESSAGE        Human readable error message
CPU            Linux CPU number that triggered the error
LEVEL          Cache level affected by error
TYPE           Cache type affected by error (Data, Instruction, Generic)
AFFECTED_CPUS  List of CPUs sharing the affected cache
SOCKETID       Socket ID of affected CPU

After the default action, local actions in /etc/mcelog/cache-error-trigger.local are executed.
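Since the default trigger may take cores offline, a local action could record which of the affected CPUs are still online afterwards. The sketch below assumes that AFFECTED_CPUS is a whitespace-separated list of Linux CPU numbers and reads the standard sysfs "online" files. The names cpu_state, cache_report and CPU_SYSFS, and the syslog tag, are hypothetical.

```shell
#!/bin/sh
# Hypothetical /etc/mcelog/cache-error-trigger.local (illustrative sketch).
# Assumption: AFFECTED_CPUS is a whitespace-separated list of CPU numbers.

# cpu_state N: print "online", "offline" or "unknown" for CPU N.
# CPU_SYSFS can be overridden for testing; it defaults to sysfs.
cpu_state() {
    f="${CPU_SYSFS:-/sys/devices/system/cpu}/cpu$1/online"
    if [ -r "$f" ]; then
        if [ "$(cat "$f")" = 1 ]; then
            echo online
        else
            echo offline
        fi
    else
        # Some CPUs (often cpu0) cannot be offlined and have no such file.
        echo unknown
    fi
}

# Summarize the affected cache and the state of each CPU sharing it.
cache_report() {
    report="cache level ${LEVEL:-?} (${TYPE:-unknown}):"
    for cpu in ${AFFECTED_CPUS:-}; do
        report="$report cpu$cpu=$(cpu_state "$cpu")"
    done
    printf '%s\n' "$report"
}

# Act only when the trigger environment is present.
if [ -n "${AFFECTED_CPUS:-}" ]; then
    cache_report | logger -t mcelog-local -p daemon.warning
fi
```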

The bus-uc-threshold-trigger

The bus-uc-threshold-trigger runs on uncorrected errors on an I/O bus. It is configured through the bus-uc-threshold-trigger and bus-uc-threshold-trigger-threshold options in mcelog.conf(5). By default it logs a message with the error location to the system log. After the default action, local actions in /etc/mcelog/bus-uc-error-trigger.local are executed.

Arguments are passed as environment variables:

MESSAGE        Human readable consolidated error message
LOCATION       Consolidated location as a single string
SOCKETID       Socket ID of CPU that includes the memory controller with the DIMM
LEVEL          Interconnect level
PARTICIPATION  Processor participation (Originator, Responder or Observer)
REQUEST        Request type (read, write, prefetch, etc.)
ORIGIN         Memory or IO
TIMEOUT        Whether the request timed out or not

The iomca-error-trigger

The iomca-error-trigger runs when a socket receives bus or interconnect errors. It is configured through the iomca-error-trigger and iomca-error-trigger-threshold options in /etc/mcelog.conf. By default it logs a message with the error location to the system log. After the default action, local actions in /etc/mcelog/iomca-error-trigger.local are executed.

Arguments are passed as environment variables:

MESSAGE   Human readable consolidated error message
LOCATION  Consolidated location as a single string
SOCKETID  Socket ID of CPU that includes the memory controller with the DIMM
CPU       Linux CPU number that triggered the error
SET       PCI segment number
BUS       PCI bus number
DEVICE    PCI device number
FUNCTION  PCI function number

The unknown-error-trigger

The unknown-error-trigger runs on any errors not otherwise categorized. It is configured through the unknown-error-trigger and unknown-error-trigger-threshold options in /etc/mcelog.conf. By default it logs a message to the system log. After the default action, local actions in /etc/mcelog/unknown-error-trigger.local are executed.

Arguments are passed as environment variables:

MESSAGE    Human readable consolidated error message
LOCATION   Consolidated location as a single string
SOCKETID   Socket ID of CPU that includes the memory controller with the DIMM
CPU        Linux CPU number that triggered the error
STATUS     IA32_MCi_STATUS register value
ADDR       IA32_MCi_ADDR register value
MISC       IA32_MCi_MISC register value
MCGSTATUS  IA32_MCG_STATUS register value
MCGCAP     IA32_MCG_CAP register value