Linux Hardware Monitoring

1. Introduction

- Introduction

So you have your new Linux-based system up and running. Now you need to monitor its status, so that you are prepared for possible failures and can track down the causes of hardware instability. Modern PCs expose a good deal of this information, and you may be surprised to learn that a few simple applications let you probe a number of operating parameters, such as hard drive temperature, CPU fan speed and GPU clock. The list of tools shown here is by no means exhaustive, but it should serve as a reasonable starting point for most desktop systems. More sophisticated tools exist for server use and for highly specific needs.

- SMART status

Modern hard drives constantly monitor their own operating parameters, including temperature, power-on hours, reallocated sector count and hardware ECC recovered data, using a technology known as SMART. S.M.A.R.T. is an acronym for “Self-Monitoring, Analysis and Reporting Technology”. Recent research has shown that some SMART attributes may be useful for predicting hard drive failures. As a general rule, hard drives frequently fail abruptly, without any previous indication of malfunction, but the presence of a SMART error greatly increases the probability of failure. Specifically, a drive that reports scan errors, sector reallocations or probational (pending) sectors is far more likely to fail within 60 days than a drive that reports none.

Fortunately, a simple set of tools called “smartmontools” allows you to view this information, store it in a log file, run automatic hard drive tests periodically and even receive automatically generated email warnings when hard drive errors occur. Smartmontools can be obtained in source form from SourceForge, but it is probably already packaged by your favorite distribution.
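
The periodic tests and email warnings mentioned above are handled by the smartd daemon, which reads its settings from a smartd.conf file. As a rough sketch only, the helper below prints one configuration line for a drive so that you can review it before adding it to the file. The function name smartd_entry is hypothetical, and the schedule and mail recipient are assumptions you should adapt; the location of smartd.conf also varies between distributions.

```shell
#!/bin/sh
# Hypothetical helper: print a smartd.conf entry for one drive.
#   $1 = device to monitor (for example /dev/sda)
#   $2 = address that should receive warning emails
# Directives used in the generated line:
#   -a  monitor the default set of SMART properties
#   -s  self-test schedule: short test daily between 2 and 3 am,
#       long test every Saturday between 3 and 4 am
#   -m  send warning emails to this address
smartd_entry() {
    printf '%s -a -s (S/../.././02|L/../../6/03) -m %s\n' "$1" "$2"
}

# Example (prints the line; nothing is written to disk):
#   smartd_entry /dev/sda root
```

Printing the line instead of editing the file directly keeps the sketch harmless: you decide where the entry goes and restart smartd yourself.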

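After installation, you can probe a hard drive for information with the following simple command (here the drive is /dev/sda, -d ata tells smartctl to treat it as an ATA device, and -a prints all available SMART information):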

root@hagakure:~# smartctl -d ata -a /dev/sda
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model: SAMSUNG SP1614C
Serial Number: 0696J1FX906990
Firmware Version: SW100-25
User Capacity: 160,041,885,696 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: ATA/ATAPI-7 T13 1532D revision 0
Local Time is: Fri Apr 6 18:20:46 2007 EEST
==> WARNING: May need -F samsung2 disabled; see manual for details.
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
[............ REMOVED TEXT.................]
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 051 Pre-fail Always - 9
3 Spin_Up_Time 0x0007 068 057 000 Pre-fail Always - 5632
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 458
5 Reallocated_Sector_Ct 0x0033 253 253 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 253 253 051 Pre-fail Always - 0
8 Seek_Time_Performance 0x0024 253 253 000 Old_age Offline - 0
9 Power_On_Half_Minutes 0x0032 098 098 000 Old_age Always - 11821h+44m
10 Spin_Retry_Count 0x0013 253 253 049 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 458
194 Temperature_Celsius 0x0022 193 112 000 Old_age Always - 15
195 Hardware_ECC_Recovered 0x000a 100 100 000 Old_age Always - 157734777
196 Reallocated_Event_Count 0x0012 253 253 000 Old_age Always - 0
197 Current_Pending_Sector 0x0033 253 253 010 Pre-fail Always - 0
198 Offline_Uncorrectable 0x0031 253 253 010 Pre-fail Offline - 0
199 UDMA_CRC_Error_Count 0x000b 100 100 051 Pre-fail Always - 0
200 Multi_Zone_Error_Rate 0x000b 100 100 051 Pre-fail Always - 0
201 Soft_Read_Error_Rate 0x000b 100 100 051 Pre-fail Always - 0
[..............REMOVED TEXT...............]
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 11622 -
# 2 Extended offline Completed without error 00% 11466 -
# 3 Extended offline Completed without error 00% 11310 -
# 4 Extended offline Completed without error 00% 11153 -
# 5 Extended offline Completed without error 00% 10996 -
# 6 Extended offline Completed without error 00% 10357 -
# 7 Short offline Completed without error 00% 10355 -
# 8 Short offline Completed without error 00% 8167 -
# 9 Extended offline Completed without error 00% 7736 -
#10 Extended offline Completed without error 00% 7621 -
#11 Extended offline Completed without error 00% 6575 -
#12 Short offline Completed without error 00% 6573 -
Device does not support Selective Self Tests/Logging
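
Reading the whole attribute table by eye gets tedious, so here is a minimal sketch of how the attributes discussed earlier could be checked automatically. It assumes the ten-column table layout shown in the listing above, and it assumes that sector reallocations, probational sectors and scan errors correspond to attribute IDs 5, 197 and 198; the function name check_smart_attrs is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read a smartctl attribute table on standard input
# and warn about failure-related attributes whose raw value is non-zero.
# Columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
check_smart_attrs() {
    awk 'NF >= 10 && ($1 == 5 || $1 == 197 || $1 == 198) && $10 != 0 {
             printf "WARNING: %s raw value is %s\n", $2, $10
             bad = 1
         }
         # Exit status 1 if any warning was printed, 0 otherwise.
         END { exit bad }'
}

# Usage (as root), assuming the drive is /dev/sda as in the listing above;
# -A prints only the attribute table:
#   smartctl -d ata -A /dev/sda | check_smart_attrs && echo "no warnings"
```

For the drive in the listing above all three raw values are 0, so the function would print nothing and return success.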