Hello all. I'm looking for some suggestions on how to best handle this situation.
In particular, I'm referring to Cisco bug ID CSCuv18572, as it relates to the Cisco WS-C3850-x switches running IOS-XE revisions prior to 3.6(E). This bug is so pervasive in our environment that we are missing some very important alarms because of the noise it generates. For reasons beyond our control, we are unable to simply upgrade to the code that contains a fix.
So I'm stuck dealing with a massive amount of syslog traps that look something like this:
%NGWC_PLATFORM_FEP-1-FRU_PS_ACCESS: Switch 3: power supply B is not responding
This creates event 0x210c0e with the following varbinds:
- The syslog facility: in this case, NGWC_PLATFORM_FEP
- The syslog severity: in this case, 1 (alert)
- The syslog mnemonic: in this case, FRU_PS_ACCESS
- The syslog message text: in this case, "Switch <switch-number>: power supply <A-or-B> is not responding"
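To make the varbind breakdown concrete, here is a minimal Python sketch that splits a message of this shape into the same four fields. It is only an illustration of the message layout, not how the event is actually processed; the function names and regular expressions are my own assumptions.

```python
import re

# Assumed layout: %FACILITY-SEVERITY-MNEMONIC: message text
SYSLOG_RE = re.compile(
    r"^%(?P<facility>[A-Z0-9_]+)-(?P<severity>[0-7])-(?P<mnemonic>[A-Z0-9_]+):\s*(?P<text>.*)$"
)

# Assumed layout of the message text for this particular mnemonic
PSU_RE = re.compile(r"Switch (?P<switch>\d+): power supply (?P<psu>[AB]) is not responding")


def parse_cisco_syslog(line):
    """Return a dict with facility, severity, mnemonic, and text, or None if no match."""
    match = SYSLOG_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["severity"] = int(fields["severity"])
    return fields


def parse_psu_location(text):
    """Return (switch_number, psu_letter) from the message text, or None if no match."""
    match = PSU_RE.search(text)
    if match is None:
        return None
    return int(match.group("switch")), match.group("psu")


sample = "%NGWC_PLATFORM_FEP-1-FRU_PS_ACCESS: Switch 3: power supply B is not responding"
parsed = parse_cisco_syslog(sample)
location = parse_psu_location(parsed["text"])
```

Here `parsed["severity"]` corresponds to the second varbind, which is the value the alarm condition tests.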
This causes 0x0021001c to fire and alarm accordingly, since {v 2} == 1 (the second varbind, the severity, is 1). In my environment, this means we get a major alarm.
False or Real Alarm?
Chances are that this is a false alarm. Using the above example, if I were to look at Switch 3, power supply B, I may find that it is working just fine.
Essentially, I can have a device that is affected by this bug in one of three conditions regarding power supply status:
- OK: All switches in the stack have power supplies that are functional, and no alarms are generated.
- Failed PSU: Power supply B in Switch 2 has failed, and the alarm as described above has been generated.
- Bugged: All switches in the stack have power supplies that are functional, but the alarm as described above has been generated.
To discern via SNMP whether or not a "bugged" condition exists, I can use ciscoEnvMonSupplyState and ciscoEnvMonSupplySource. The states are as follows. This example assumes I have three switches in the stack and that the index numbers follow the pattern shown. I will assume the alarm is reporting Switch 2, PSU B. I am throwing in the values of ciscoEnvMonSupplyStatusDescr for reference.
- OK:
  - ciscoEnvMonSupplyStatusDescr.114 = "Switch 1 - Power Supply A, Normal"
  - ciscoEnvMonSupplyState.114 = normal(1)
  - ciscoEnvMonSupplySource.114 = ac(2)
  - ciscoEnvMonSupplyStatusDescr.115 = "Switch 1 - Power Supply B, Normal"
  - ciscoEnvMonSupplyState.115 = normal(1)
  - ciscoEnvMonSupplySource.115 = ac(2)
  - ciscoEnvMonSupplyStatusDescr.214 = "Switch 2 - Power Supply A, Normal"
- Failed PSU in switch 2, power supply B:
  - <snip for brevity>
  - ciscoEnvMonSupplyStatusDescr.214 = "Switch 2 - Power Supply A, Normal"
  - ciscoEnvMonSupplyState.214 = normal(1)
  - ciscoEnvMonSupplySource.214 = ac(2)
  - ciscoEnvMonSupplyStatusDescr.215 = "Switch 2 - Power Supply B, Unknown"
  - ciscoEnvMonSupplyState.215 = shutdown(4)
  - ciscoEnvMonSupplySource.215 = unknown(1) * INDICATOR OF LEGITIMATE ALARM *
  - <snip for brevity>
- Bugged device. Everything is OK, but will alarm due to this bug:
  - <snip for brevity>
  - ciscoEnvMonSupplyStatusDescr.214 = "Switch 2 - Power Supply A, Normal"
  - ciscoEnvMonSupplyState.214 = normal(1)
  - ciscoEnvMonSupplySource.214 = ac(2)
  - ciscoEnvMonSupplyStatusDescr.215 = "Switch 2 - Power Supply B, Unknown"
  - ciscoEnvMonSupplyState.215 = shutdown(4)
  - ciscoEnvMonSupplySource.215 = ac(2) * INDICATOR OF FALSE ALARM *
  - <snip for brevity>
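The three conditions above boil down to a small decision on the (state, source) pair for each PSU. Here is a rough Python sketch of that decision, purely to spell out the logic. The function name and the "unrecognized" fallback are my own assumptions; I have only observed the combinations listed above.

```python
# Enum values as they appear in the SNMP output above
STATE_NORMAL = 1      # ciscoEnvMonSupplyState normal(1)
SOURCE_UNKNOWN = 1    # ciscoEnvMonSupplySource unknown(1)
SOURCE_AC = 2         # ciscoEnvMonSupplySource ac(2)


def classify_psu(state, source):
    """Classify one PSU row as 'ok', 'failed', 'bugged', or 'unrecognized'."""
    if state == STATE_NORMAL:
        return "ok"
    if source == SOURCE_UNKNOWN:
        # Abnormal state and the source is no longer known: legitimate alarm
        return "failed"
    if source == SOURCE_AC:
        # Abnormal state but the source still reads ac(2): false alarm
        return "bugged"
    # Any other combination was not observed, so do not guess
    return "unrecognized"


# Index 215 (Switch 2, PSU B) in the failed and bugged examples
failed_row = classify_psu(state=4, source=1)
bugged_row = classify_psu(state=4, source=2)
```

Note that the state value alone cannot separate the failed case from the bugged one, since both report shutdown(4). Only the source differs.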
Logic to Detect?
I tried to set a watch for this, but because this is a list attribute, I cannot have a value for a device overall. For example, I set a watch called IsPsuOk with a boolean data type that was simply a matter of:
ciscoEnvMonSupplySource.# == 2
Of course, this does not give me a single IsPsuOk value for the device. Instead, I have six values in a table for the device, one for each PSU (two per switch). So if I'm in a truly failed state, I would see "True, True, True, False, True, True". The fourth value in that table would be false because the source is unknown(1), and 1 != 2. If I'm in the bugged state, though, I see "True, True, True, True, True, True".
I guess the first question is this: Is it possible for me to have an overall AreAllPsusOk that evaluates to true if all of the values in the list are true, and evaluates to false if at least one value is false? The logic is simple enough, I simply don't know enough about watches.
If I can do that, then I can simply add an event condition that says to scrap the event if AreAllPsusOk == True for the device.
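To be clear about the logic I'm after, here it is as a short Python sketch. This is not watch or event-condition syntax, which is the part I don't know; the function names are hypothetical and the input is simply the list of ciscoEnvMonSupplySource values for the device.

```python
SOURCE_AC = 2  # ciscoEnvMonSupplySource ac(2)


def are_all_psus_ok(sources):
    """True only if there is at least one PSU and every source equals ac(2)."""
    # all() on an empty list is True, so guard against a device with no
    # rows returned; otherwise a failed poll would look like "all OK".
    return bool(sources) and all(source == SOURCE_AC for source in sources)


def should_discard_event(sources):
    """Discard the PSU event when every PSU still reports a known AC source."""
    return are_all_psus_ok(sources)


# Six PSUs (two per switch in a three-switch stack)
failed_state = [2, 2, 2, 1, 2, 2]   # Switch 2, PSU B has source unknown(1)
bugged_state = [2, 2, 2, 2, 2, 2]   # every PSU still reports ac(2)

discard_failed = should_discard_event(failed_state)
discard_bugged = should_discard_event(bugged_state)
```

With these inputs the event is kept in the truly failed state and discarded in the bugged state, which is the behavior I want from AreAllPsusOk.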
Or is there a better way?
I hope this makes sense. I'm tired.