Thursday, May 14, 2015

MCELOG - hardware error log monitoring tool on x86 systems


mcelog logs hardware-related errors on Linux-based x86 systems. It is mostly used on physical servers, where it starts at boot time and runs as a daemon (it used to be run from cron). It can detect hardware errors such as system bus errors, CPU errors (for example, processor cache errors) and, most importantly, memory errors (reported through Error Correction Code, ECC). Once errors exceed a threshold, it can predictively offline memory pages and CPUs. If you check the log regularly, you can find the problem before the server panics and crashes.


Install mcelog
# yum install mcelog

Verify the daemon is running
# mcelog --client
# /etc/init.d/mcelog status
# service mcelogd status

Dependencies
- Make sure /dev/mcelog exists. If it does not, create it with the mknod command:
# mknod /dev/mcelog c 10 227
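
The check above can be wrapped in a small function so it can be reused in scripts. This is only a sketch: the function name check_mce_dev is made up for illustration, and it only reports a missing device instead of creating it, since mknod needs root.

```shell
#!/bin/sh
# Hypothetical helper: verify that the mcelog device node exists.
# The default path and the 10/227 major/minor pair come from the
# mknod command shown above.
check_mce_dev() {
    dev=${1:-/dev/mcelog}
    if [ -c "$dev" ]; then
        # -c is true only for character special files
        echo "ok: $dev is a character device"
        return 0
    fi
    echo "missing: $dev (as root: mknod $dev c 10 227)" >&2
    return 1
}
```

Calling check_mce_dev with no argument tests /dev/mcelog; a non-zero return status means the node has to be created first.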



How to find the error?
- Log in to the console and run the mcelog command, which reads messages from the kernel. Make sure to send the output to a file, because you can't re-run it to see the same errors again.
# /usr/sbin/mcelog >/var/tmp/mymce.log
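
Because the output cannot be reproduced, it may be safer not to reuse the same file name on every run. The following is a minimal sketch of a wrapper that writes each capture to a new, timestamped file. The function name capture_once and the mce- file prefix are assumptions for illustration, not part of mcelog.

```shell
#!/bin/sh
# Hypothetical helper: run a command once and save its stdout to a
# fresh, uniquely named file so earlier captures are never overwritten.
# Usage: capture_once DIR COMMAND [ARGS...]
capture_once() {
    dir=$1
    shift
    # mktemp creates the file and guarantees a unique name
    out=$(mktemp "$dir/mce-$(date +%Y%m%d-%H%M%S)-XXXXXX") || return 1
    "$@" > "$out" || {
        rc=$?
        echo "command failed, partial output in $out" >&2
        return "$rc"
    }
    # Print the path so the caller knows where the capture went
    echo "$out"
}
```

For example, capture_once /var/tmp /usr/sbin/mcelog would save the decoded errors and print the name of the file it created.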

Check the log
# more /var/log/mcelog
# grep -i "hardware error" /var/log/mcelog
# tail -200 /var/log/mcelog

Put it in cron:
[ "$(grep -ci "hardware error" /var/log/mcelog)" -gt 0 ] && echo "Hardware Error on $(hostname)" | mailx -s "Error on $(hostname)" sam@domain.com
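
The cron check above mails on every run once the log contains a single match. A possible refinement, sketched here under assumptions, is to remember the previous count in a state file and report only new matches. The function names count_hw_errors and new_hw_errors and the idea of a state file are illustrative, not something mcelog provides.

```shell
#!/bin/sh
# Hypothetical helpers: report only the "hardware error" lines that
# appeared since the previous run.

# Print the number of matching lines in the given log (case-insensitive).
count_hw_errors() {
    [ -r "$1" ] || { echo 0; return 0; }
    # grep -c prints 0 but exits 1 when nothing matches, so ignore the status
    grep -ci "hardware error" "$1" || true
}

# Print how many matches are new compared with the count stored in STATE.
# Usage: new_hw_errors LOG STATE
new_hw_errors() {
    log=$1
    state=$2
    current=$(count_hw_errors "$log")
    previous=0
    [ -r "$state" ] && previous=$(cat "$state")
    echo "$current" > "$state"
    # A smaller count suggests the log was rotated: treat all matches as new
    if [ "$current" -lt "$previous" ]; then
        previous=0
    fi
    echo $((current - previous))
}
```

A cron job could then call new_hw_errors with /var/log/mcelog and a state file of your choice, and send the mail only when the printed number is greater than 0.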




By default, most systems write the log to /var/log/mcelog.

Some commands
mcelog
mcelog --k8
mcelog --k8 --ascii
mcelog --k8 /dev/mcelog
mcelog --ascii /dev/mcelog
mcelog --ascii > changelog.txt
dmesg | grep ADMA
dmesg | grep ata5
vmstat -d


Note: If mcelog is running as a daemon, you get the /dev/mcelog output when the MCE actually happens.

More on http://www.mcelog.org/

By analyzing the log, it appears that the server suffered a crash and a dump file was generated, which in fact caused /var/crash to fill up. It appears that the crash dump also caused the memory errors on the system.
