DX NetOps

  • 1.  Best way to handle false alarm due to known device bug.

    Posted Feb 22, 2016 05:35 PM

    Hello all.  I'm looking for some suggestions on how to best handle this situation.

     

    In particular, I'm referring to Cisco bug ID CSCuv18572, as it relates to the Cisco WS-C3850-x switches running IOS-XE revisions prior to 3.6(E).  This bug is so pervasive in our environment that we are missing some very important alarms due to the noise these cause.  For reasons beyond our control, we are unable to simply upgrade to the code that contains a fix.

     

    So I'm stuck dealing with a massive amount of syslog traps that look something like this:

    %NGWC_PLATFORM_FEP-1-FRU_PS_ACCESS: Switch 3: power supply B is not responding

     

    This creates event 0x210c0e with the following varbinds:

    1. The syslog facility.  In this case, NGWC_PLATFORM_FEP
    2. The syslog severity.  In this case, 1 (alert)
    3. The syslog mnemonic:  In this case, FRU_PS_ACCESS
    4. The syslog message text:  In this case, "Switch <switch-number>: power supply <A-or-B> is not responding"

     

    This causes 0x0021001c to fire and alarm accordingly, since {v 2} == 1.  In my environment, this means we get a major alarm.
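    To make the varbind mapping concrete, here is a minimal Python sketch that splits a Cisco-style syslog line into the same four fields. The regular expression and the function name `parse_cisco_syslog` are assumptions for illustration only; this is not how Spectrum parses the trap internally.

```python
import re

# Assumed layout of a Cisco syslog line: %FACILITY-SEVERITY-MNEMONIC: message text
SYSLOG_RE = re.compile(
    r"^%(?P<facility>[A-Z0-9_]+)-(?P<severity>[0-7])-(?P<mnemonic>[A-Z0-9_]+):\s*(?P<text>.*)$"
)

def parse_cisco_syslog(line):
    """Return a dict keyed by varbind number (1-4), or None if the line does not match."""
    m = SYSLOG_RE.match(line.strip())
    if m is None:
        return None
    return {
        1: m["facility"],       # syslog facility
        2: int(m["severity"]),  # syslog severity, 1 == alert
        3: m["mnemonic"],       # syslog mnemonic
        4: m["text"],           # syslog message text
    }

varbinds = parse_cisco_syslog(
    "%NGWC_PLATFORM_FEP-1-FRU_PS_ACCESS: Switch 3: power supply B is not responding"
)
```

    With the sample message above, `varbinds[2]` is 1, which is the condition that raises the alarm.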

     

    False or Real Alarm?

    Chances are that this is a false alarm.  Using the above example, if I were to look at Switch 3, power supply B, I may find that it is working just fine.

     

    Essentially, I can have a device that is affected by this bug in one of three conditions regarding power supply status:

    1. OK:  All switches in the stack have power supplies that are functional, no alarms are generated.
    2. Failed PSU:  Power supply B in Switch 2 has failed, and the alarm as described above has been generated.
    3. Bugged:  All switches in the stack have power supplies that are functional, but the alarm as described above has been generated.

     

    To discern via SNMP whether or not a "bugged" condition exists, I can use ciscoEnvMonSupplyState and ciscoEnvMonSupplySource.  The states are as follows.  This example assumes I have three switches in the stack and makes similar assumptions about index numbers.  I will assume it is reporting Switch 2, PSU B.  I am throwing in the values of ciscoEnvMonSupplyStatusDescr for reference.

     

    1. OK:
      • ciscoEnvMonSupplyStatusDescr.114 = "Switch 1 - Power Supply A, Normal"
        • ciscoEnvMonSupplyState.114 = normal(1)
        • ciscoEnvMonSupplySource.114 = ac(2)
      • ciscoEnvMonSupplyStatusDescr.115 = "Switch 1 - Power Supply B, Normal"
        • ciscoEnvMonSupplyState.115 = normal(1)
        • ciscoEnvMonSupplySource.115 = ac(2)
      • ciscoEnvMonSupplyStatusDescr.214 = "Switch 2 - Power Supply A, Normal"

        • <snip for brevity>
    2. Failed PSU in switch 2, power supply B:
      • <snip for brevity>
      • ciscoEnvMonSupplyStatusDescr.214 = "Switch 2 - Power Supply A, Normal"
        • ciscoEnvMonSupplyState.214 = normal(1)
        • ciscoEnvMonSupplySource.214 = ac(2)
      • ciscoEnvMonSupplyStatusDescr.215 = "Switch 2 - Power Supply B, Unknown"
        • ciscoEnvMonSupplyState.215 = shutdown(4)
        • ciscoEnvMonSupplySource.215 = unknown(1)   * INDICATOR OF LEGITIMATE ALARM *
      • <snip for brevity>
    3. Bugged device.  Everything is OK, but will alarm due to this bug:
      • <snip for brevity>
      • ciscoEnvMonSupplyStatusDescr.214 = "Switch 2 - Power Supply A, Normal"
        • ciscoEnvMonSupplyState.214 = normal(1)
        • ciscoEnvMonSupplySource.214 = ac(2)
      • ciscoEnvMonSupplyStatusDescr.215 = "Switch 2 - Power Supply B, Unknown"
        • ciscoEnvMonSupplyState.215 = shutdown(4)
        • ciscoEnvMonSupplySource.215 = ac(2)   * INDICATOR OF FALSE ALARM *
      • <snip for brevity>
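
    The three conditions listed above can be sketched as a small Python classifier. It only encodes the value combinations shown in the listing; the function name `classify_psu` and the "undetermined" fallback for combinations not shown are assumptions.

```python
# Integer values as they appear in the listing above
STATE_NORMAL = 1     # ciscoEnvMonSupplyState normal(1)
SOURCE_UNKNOWN = 1   # ciscoEnvMonSupplySource unknown(1)
SOURCE_AC = 2        # ciscoEnvMonSupplySource ac(2)

def classify_psu(state, source):
    """Classify one PSU instance from its state and source values."""
    if state == STATE_NORMAL:
        return "ok"
    if source == SOURCE_UNKNOWN:
        return "failed"        # legitimate alarm
    if source == SOURCE_AC:
        return "bugged"        # false alarm caused by the bug
    return "undetermined"      # combination not covered by the listing

# Index 215 in the "failed" and "bugged" listings respectively
failed_case = classify_psu(4, 1)
bugged_case = classify_psu(4, 2)
```

    Note that ciscoEnvMonSupplyState alone cannot separate the two cases, since it reads shutdown(4) in both; only ciscoEnvMonSupplySource differs.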

     

     

    Logic to Detect?

    I tried to set a watch for this, but because this is a list attribute, I cannot have a value for a device overall.  For example, I set a watch called IsPsuOk with a boolean data type that was simply a matter of:

    ciscoEnvMonSupplySource.# == 2

    Of course, I do not have a single value of IsPsuOk for the device.  Instead, I have 6 values in a table for the device, one for each PSU (2 per switch).  So if I'm in a truly failed state, I would see "True, True, True, False, True, True".  The fourth value in that table would be false because 1 != 2.  If I'm in the bugged state, though, I see "True, True, True, True, True, True".

     

    I guess the first question is this:  Is it possible for me to have an overall AreAllPsusOk that evaluates to true if all of the values in the list are true, and evaluates to false if at least one value is false?  The logic is simple enough, I simply don't know enough about watches.
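
    Outside of Spectrum, the aggregation being asked for is a plain "all" reduction over the list. A minimal Python sketch, assuming the six ciscoEnvMonSupplySource values have already been polled into a list (the function name `are_all_psus_ok` is hypothetical, and this does not show how a watch would compute it):

```python
def are_all_psus_ok(sources, ok_value=2):
    """True only if every polled ciscoEnvMonSupplySource value equals ac(2).

    An empty list returns False, since nothing could be confirmed as OK.
    """
    values = list(sources)
    return bool(values) and all(v == ok_value for v in values)

truly_failed = are_all_psus_ok([2, 2, 2, 1, 2, 2])  # fourth PSU reports unknown(1)
bugged = are_all_psus_ok([2, 2, 2, 2, 2, 2])        # every PSU reports ac(2)
```

    Here `truly_failed` is False and `bugged` is True, matching the two tables described above.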

     

    If I can do that, then I can simply add an event condition that says to scrap the event if AreAllPsusOk == True for the device.

     

    Or is there a better way?

     

    I hope this makes sense.  I'm tired.



  • 2.  Re: Best way to handle false alarm due to known device bug.

    Posted Feb 22, 2016 06:05 PM

    Your watch solution just needs to change from "== 2" to "== 1", make sure Instance is set to All, and Threshold set to "!= 1".

     

    This will essentially say "check each list instance, set to True if the value equals 1, if the value is NOT True, generate an alarm". This should satisfy "if any are False, generate an alarm".



  • 3.  Re: Best way to handle false alarm due to known device bug.

    Posted Feb 23, 2016 09:03 AM

    Thank you, I appreciate the fact that you actually read all of that.

     

    I considered this as an option as well.  My bigger concern, however, is to PREVENT alarms.  Right now, I'm tasked, primarily, with getting rid of noise so that the real alarms will stand out.  We're aware that we have devices that are bitten by this bug.

     

    I actually WANT to alarm when the "%NGWC_PLATFORM_FEP-1-FRU_PS_ACCESS: Switch 3: power supply B is not responding" syslog trap is received, as long as it's actually a real alarm and not a false alarm.  We've missed a few of these because of the noise.

     

    At a high level, what I'm wanting to do is something like this....

     

    1. Trap received and identified as this particular message.

    2. Is it a legit message?

          No:  Do nothing.

          Yes:  Carry on as usual.

     

    My thoughts were that if I could look at a device-level attribute, say 0xffff1111 for example, then I could do something like:

      1. Trap received and identified as this message.

      2. pseudocode: If (attr(0xffff1111) == True) then exit

      3. I'm only here if attr 0xffff1111 is false, so alarm as per usual.

     

     

    Alternatively, if there is a way to look at all elements of a list attribute, the watch I set up would work:

       1. Trap received and identified as this message.

       2. pseudocode: if (every IsPsuOk in attr(0xffff1111[]) == True) then exit

       3. I'm only here if any element of IsPsuOk[] evaluates false, so alarm as per usual.
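
    The flow in the steps above can be sketched in Python as a single decision function. This is only an illustration of the logic, not Spectrum EventCondition or EventProcedure syntax; the function name `should_alarm` and the idea of passing the IsPsuOk values in as a list are assumptions.

```python
def should_alarm(mnemonic, psu_ok_flags):
    """Decide whether a received syslog trap should go on to raise an alarm.

    mnemonic:      varbind 3 of the event, e.g. "FRU_PS_ACCESS"
    psu_ok_flags:  the per-PSU IsPsuOk booleans for the device
    """
    # Step 1: only this particular message is subject to the extra check.
    if mnemonic != "FRU_PS_ACCESS":
        return True
    # Step 2: if every PSU reports OK, the message is the known false alarm.
    flags = list(psu_ok_flags)
    if flags and all(flags):
        return False
    # Step 3: at least one PSU is not OK (or nothing was read), so alarm as usual.
    return True

real_failure = should_alarm("FRU_PS_ACCESS", [True, True, True, False, True, True])
false_alarm = should_alarm("FRU_PS_ACCESS", [True, True, True, True, True, True])
```

    An empty list is treated as "could not verify", so the alarm is kept rather than silently dropped.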

     

    I hope this makes sense?



  • 4.  Re: Best way to handle false alarm due to known device bug.

    Posted Feb 24, 2016 09:06 AM

    Easier to implement: Have your AlarmNotifier email script, before sending an email, make a REST API call back into OneClick to check/do something with your list attribute. Set your SANM filter to have a 1-minute delay so the initial trap is not emailed.

     

    Most optimized, long QA process/learning curve: Use EventProcedures ($SPECROOT/SS/CsVendor/CA/Procedures/) to read the list attribute and create a new event that creates an alarm accordingly. You'll have to get familiar with If, ForEach, ReadAttributeInstance, CreateEventWithVariables.

     

    An EventCondition can read a model attribute (e.g. Model Name (0x1006e)), but I don't think it works with List Attributes; I've never tested it.



  • 5.  Re: Best way to handle false alarm due to known device bug.

    Posted Feb 24, 2016 09:39 AM

    Yes, exactly that!  I've been focused on EventConditions instead of EventProcedures.  And, as noted, I cannot for the life of me find a way to iterate through the instances of a list attribute in an EventCondition.

     

    I think this was the tip I was looking for.  Let me poke around a bit and hopefully I will, sooner rather than later, have a reply to this thread with a working example.  Thanks again!