Prevent unfinished modeloutages in the SRMDB

Idea created by mgrusso on Aug 12, 2016
    New
    Score: 13

    There are cases, under various conditions, that can lead to an outage in the modeloutage table with no end time (end_time set to NULL). Spectrum Report Manager therefore reports those models as unavailable, with an outage still ongoing. The only way to finish the ongoing "outage" is to set an end time manually or to exempt the outage in the outage editor.

     

    There is also a Tech Doc for this issue:

    http://www.ca.com/us/support/ca-support-online/product-content/knowledgebase-articles/tec581209.aspx?intcmp=searchresultclick&resultnum=1

     

    I can understand this happening when events have been purged from the DDMdb and so on, and that there is no way to prevent it in those cases. But we are also experiencing this with short-term outages. Here, all events needed to end the outage exist in the event table:

     

    mysql> select time, model_key, event_msg from event where model_key=272950 and event_msg not like '%onfiguration%' LIMIT 10;
    +---------------------+-----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | time                | model_key | event_msg                                                                                                                                                   |
    +---------------------+-----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | 2016-06-23 20:16:38 |    272950 | Model XXXXX-***-XX-****** of type SwCiscoIOS has been contacted.                                                                                            |
    | 2016-07-15 16:23:10 |    272950 | The condition causing the loss of contact on the device model has cleared ( name - XXXXX-***-XX-******, type - SwCiscoIOS ).                                |
    | 2016-07-15 16:23:10 |    272950 | Device XXXXX-***-XX-****** of type SwCiscoIOS has stopped responding to polls and/or external requests.  An alarm will be generated.                        |
    | 2016-07-15 16:23:10 |    272950 | Alarm number 40575551 with probable cause id 0x10009 generated for device XXXXX-***-XX-****** of type SwCiscoIOS. The severity of this alarm is CRITICAL.   |
    | 2016-07-15 16:23:10 |    272950 | Contact has been lost with model XXXXX-***-XX-****** of type SwCiscoIOS.                                                                                    |

     

     

    mysql> select * from modeloutage where datediff(now(), start_time)>14 and end_time is null and start_time between "2016-07-15 16:00:00" and "2016-07-15 16:30:00" and model_key=272950;
    +-----------------+-----------+-------------+---------------------+----------+-------------+--------------------+-----------------+---------------+------------------+----------------------+
    | model_outage_ID | model_key | landscape_h | start_time          | end_time | outage_type | zero_length_outage | start_event_key | end_event_key | legacy_outage_id | legacy_outage_source |
    +-----------------+-----------+-------------+---------------------+----------+-------------+--------------------+-----------------+---------------+------------------+----------------------+
    |       120140530 |    272950 |   222298112 | 2016-07-15 16:23:10 | NULL     |           1 |                  0 |       718091834 |          NULL |             NULL | NULL                 |
    +-----------------+-----------+-------------+---------------------+----------+-------------+--------------------+-----------------+---------------+------------------+----------------------+
    1 row in set (0.11 sec)

     

    I don't know exactly what triggers an INSERT / UPDATE on the modeloutage table, but from my understanding, all events needed to end the outage are there. I opened a case for this issue, because it happened to a large number of devices (and therefore models) at exactly that time, and I was pointed to the Tech Doc.

     

    Long story short, my idea is: please prevent these cases. If all events needed to end an outage can be found in the event table, the outage's end_time should be set in the modeloutage table.
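
    To illustrate the kind of check I have in mind, here is a rough, read-only sketch that lists open outages for which a matching "cleared" event already exists. It uses only the tables and columns shown in the output above. The message pattern is my assumption about which event should end an outage; I have not confirmed how Report Manager actually decides this. The comparison uses >= because in the example above the "cleared" event carries the same timestamp as the outage start.

```sql
-- Sketch only, not the product's logic: find open outages that could be closed.
-- Assumption: an event whose message reports that the loss of contact
-- "has cleared" is the event that should end the outage.
SELECT o.model_outage_ID,
       o.model_key,
       o.start_time,
       MIN(e.time) AS candidate_end_time   -- earliest matching event
FROM modeloutage AS o
JOIN event AS e
  ON e.model_key = o.model_key
 AND e.time >= o.start_time                -- same second is possible (see above)
WHERE o.end_time IS NULL
  AND e.event_msg LIKE '%loss of contact%has cleared%'
GROUP BY o.model_outage_ID, o.model_key, o.start_time;
```

    The query changes nothing; candidate_end_time is only a suggestion for what the end_time could be set to.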

     

     

    Regards

     

    Marco