
Monitoring/Alerting of SW RAID on API GW Hardware Appliance

Question asked by StuartSmith75811464 on Jan 30, 2019

Hi,

One of our customers has 8 physical HW devices (Oracle X4-2).

We had a failed HD in the RAID1 array on one of the servers (not a HW failure, just mdadm taking the disk out of the array because it was corrupt).

They wish to know if there is a way of monitoring and/or alerting on the health of the RAID array in future, to avoid this situation?

The only reason this was noticed is that I saw it on the console while connected during some physical work in the data centre (I then confirmed it as below):


> cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[1]
1048512 blocks super 1.0 [2/2] [UU]
bitmap: 0/1 pages [0KB], 65536KB chunk

md1 : active raid1 sdb2[1](F) sda2[0]
291787584 blocks super 1.1 [2/1] [U_]
bitmap: 2/3 pages [8KB], 65536KB chunk

and:

> mdadm --detail /dev/md1
/dev/md1:
Version : 1.1
Creation Time : Fri Dec 19 19:22:52 2014
Raid Level : raid1
Array Size : 291787584 (278.27 GiB 298.79 GB)
Used Dev Size : 291787584 (278.27 GiB 298.79 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent

Intent Bitmap : Internal

Update Time : Mon Jan 28 14:46:31 2019
State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 1
Spare Devices : 0

Name : localhost.localdomain:1
UUID : 9b7d309a:24840105:66650b93:cbb35cc9
Events : 43398733

Number Major Minor RaidDevice State
0 8 2 0 active sync /dev/sda2
2 0 0 2 removed

1 8 18 - faulty /dev/sdb2
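
The check above could be scripted rather than done by eye. A rough sketch only, not tested on the appliance: it assumes awk is available and that /proc/mdstat keeps the layout shown above, and the function name degraded_arrays is just my own label.

```shell
#!/bin/sh
# Print the name of every md array whose member status string
# (e.g. [UU] or [U_]) contains "_", i.e. has a missing member.
# Reads mdstat-formatted text from the file in $1 (default: /proc/mdstat).
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }          # remember which array we are in
        match($0, /\[[U_]+\]/) {             # status string such as [UU] or [U_]
            if (index(substr($0, RSTART, RLENGTH), "_")) print name
        }
    ' "${1:-/proc/mdstat}"
}
```

Run from cron, any output at all would mean a degraded array (md1 in the case above). If the SNMP agent on the box turns out to be Net-SNMP, its extend directive in snmpd.conf might be one way to expose the result, but that is an assumption I have not verified on these appliances.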

 

They have SNMP enabled on the servers, but as far as I can see none of the installed MIBs expose any counters for RAID health.

 

I have asked CA Support, but they can offer no solutions.

 

Looking at the documentation, mdadm itself COULD be used to monitor it:


e.g. mdadm --monitor --daemonise --mail=root@localhost --delay=1800 /dev/md0 /dev/md1

and configure the servers to relay mail to a local email server,

or even use --program to call wget against an API on the GW for alerting, an SNMP trap, etc.
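
To make the --program idea concrete, here is a minimal sketch of a handler. Per the mdadm man page, the program is run with the event name, the md device and possibly a component device as arguments. Everything else is my assumption: the path /usr/local/sbin/md-alert.sh, the function names, the choice of events to alert on, and logging via logger in place of a real alerting call.

```shell
#!/bin/sh
# Hypothetical handler, started with something like:
#   mdadm --monitor --daemonise --delay=1800 \
#         --program=/usr/local/sbin/md-alert.sh /dev/md0 /dev/md1
# mdadm passes: $1 = event name, $2 = md device, $3 = component device (optional).

# Succeed only for the events that indicate a problem worth alerting on.
is_alert_event() {
    case "$1" in
        Fail|FailSpare|DegradedArray|DeviceDisappeared|SparesMissing) return 0 ;;
        *) return 1 ;;
    esac
}

# Build a one-line alert message from the arguments mdadm supplies.
format_alert() {
    printf 'RAID alert: event=%s array=%s component=%s\n' "$1" "$2" "${3:-n/a}"
}

# Only act when called with mdadm-style arguments.
if [ "$#" -ge 2 ] && is_alert_event "$1"; then
    # Write to syslog; a wget call to an alerting endpoint on the GW could
    # replace this line (no such endpoint is assumed to exist yet).
    logger -p daemon.err -t md-alert "$(format_alert "$@")"
fi
```

Sending the message to syslog first keeps the handler free of network dependencies; anything that already forwards syslog could then pick the alert up.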

 

Perhaps HW appliances are simply rare enough out there that CA Support can offer no help with this.

I'd be interested to hear what solutions, if any, others have used before I go further down the mdadm monitoring route described above.

stu
