Vague rsp probe alarm

Idea created by KathyMaguire Employee on Aug 10, 2018

    When the RSP probe is having sudo execution problems, the alarm asserted is “Connection to <device> timed out.”
    A few simple connectivity checks (ping, port availability, etc.) eventually show that the issue is not connectivity per se, but rather something beyond logical or physical connectivity.


    The next step is to validate that the account used for testing can log in. This step seems unwarranted because no login failure alarm was asserted, though one cannot assume login works merely because that alarm is absent. So you are left with physical and logical connectivity working and login working, and, short of diving into obscure network security causes, the next thing left to check is sudo functionality.


    A look at the rsp log shows errors performing sudo-based commands. An engineer for the AIX server would not readily have known this until much later in the troubleshooting process, and not by looking at the rsp log.


    This leads me to wonder: if the rsp log can indicate errors at the sudo step of the process, why can this not be indicated with an appropriate alarm message? Doing so would allow the alarm to be processed and handled efficiently, resulting in a reduced MTTR and less disruption to engineering staff who may not be related to the issue at hand.


    My guess is this was a simple oversight in the design process in which a number of exceptions were lumped into one error handling function. I’d like to ask that this design defect be corrected so that the exceptions are broken out, with an additional alarm message for sudo issues.
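
    To illustrate the request, here is a minimal sketch of what breaking the exceptions out could look like. It is not the probe's actual code, which I have not seen. The names Stage, ProbeError and alarm_message are hypothetical, and so is the wording of the login and sudo messages. Only the timeout text comes from the alarm quoted above.

```python
from enum import Enum


class Stage(Enum):
    """Hypothetical stages of a remote check; not taken from the probe."""
    CONNECT = "connect"
    LOGIN = "login"
    SUDO = "sudo"


class ProbeError(Exception):
    """Hypothetical error that records the stage at which the check failed."""

    def __init__(self, stage, detail=""):
        super().__init__(f"{stage.value}: {detail}")
        self.stage = stage
        self.detail = detail


# One alarm text per stage instead of a single catch-all message.
# Only the CONNECT text is quoted from the existing alarm; the other
# two are assumed wording for illustration.
ALARM_TEMPLATES = {
    Stage.CONNECT: "Connection to {device} timed out.",
    Stage.LOGIN: "Login to {device} failed: {detail}",
    Stage.SUDO: "sudo command failed on {device}: {detail}",
}


def alarm_message(device, error):
    """Build the alarm text that matches the stage where the error occurred."""
    template = ALARM_TEMPLATES[error.stage]
    # str.format ignores keyword arguments the template does not use.
    return template.format(device=device, detail=error.detail)


if __name__ == "__main__":
    print(alarm_message("aix01", ProbeError(Stage.SUDO, "permission denied")))
```

    With something along these lines, a sudo failure would raise its own alarm, and the engineer receiving it would know where to look without first ruling out connectivity and login.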