One of our four production collectors stopped reporting, and we did not find out it had failed until an end user tried to search data in its SmartStor.
We are still trying to determine what happened, but the EM process on the SUSE server was present and using CPU, so at the OS level it appeared to be working fine. Within Investigator, the relevant metric is:
- *SuperDomain*|Custom Metric Host (Virtual)|Custom Metric Process (Virtual)|Custom Metric Agent (Virtual)|Enterprise Manager|MOM|Collectors|<server>.aessuccess.org@5001:Connected
This metric did not report any data, but since our alerts are based on the Infrastructure Overview Management Module, no alerts were generated.
Does anyone have suggestions for which metric(s) will identify that a collector has failed and can drive an alert?
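To make the ask concrete, what I am after is something like a staleness check: alert when a collector has gone quiet, rather than when a value crosses a threshold. Here is a minimal sketch of the idea in plain Python. It is not an APM API; the collector names, timestamps, and the five-minute window are all hypothetical.

```python
from datetime import datetime, timedelta

def stale_collectors(last_seen, now, max_silence=timedelta(minutes=5)):
    """Return names of collectors whose newest data point is older than max_silence."""
    return sorted(name for name, ts in last_seen.items() if now - ts > max_silence)

# Hypothetical sample data: when each collector last reported a data point.
now = datetime(2024, 1, 1, 12, 0, 0)
last_seen = {
    "collector1": now - timedelta(seconds=15),
    "collector2": now - timedelta(seconds=30),
    "collector3": now - timedelta(minutes=42),  # the silent one
    "collector4": now - timedelta(seconds=15),
}

print(stale_collectors(last_seen, now))
```

The point is that a process that is alive at the OS level but sends nothing would still be flagged, because the check looks at the age of the last data point and not at the process itself.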
The nice thing is that the other three collectors picked up the load from the failed collector and behaved pretty well. So the takeaway is: do not run your collectors at 100% capacity. We are running at about 60 to 75% per collector.
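For a rough feel of the headroom involved, here is the back-of-the-envelope arithmetic as a small Python sketch. It assumes the failed collector's load is spread evenly across the survivors, which is a simplification; the function name is just for illustration.

```python
def utilization_after_failure(per_collector_util, total_collectors, failed=1):
    """Per-collector utilization after `failed` collectors drop out,
    assuming the total load is redistributed evenly across the survivors."""
    survivors = total_collectors - failed
    if survivors <= 0:
        raise ValueError("no collectors left to absorb the load")
    return per_collector_util * total_collectors / survivors

# Our range of 60 to 75% per collector, with four collectors and one failure.
for util in (0.60, 0.75):
    after = utilization_after_failure(util, total_collectors=4)
    print(f"{util:.0%} before -> {after:.0%} after losing one of four")
```

Under that even-spread assumption, 60% becomes 80% and 75% becomes 100%, so the top of our range leaves no spare room for a second failure.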