I'm trying to get to grips with RCA. It has been sold to a customer and I have been assigned to get it working on the customer's site.
I hope to find people here who already have some experience with these probes.
In my lab we have a server running a copy of the relevant customer probes. For this purpose I created a separate hub, NMS-Hub-RCA.
Apart from the discovery_server probe, all probes run on the dedicated hub.
Discovery of the network works fine. The topology matches what we have in the lab. The devices were set to managed, and the net_connect and cisco_monitor probes were configured according to the monitoring profile.
RCA, however, is problematic. My test is to tell a router to reload in 10 minutes and then shut down all of its interfaces, leaving the one I use to reach the device until last. To the other devices it then looks as if the router is gone.
RCA quite quickly paints the router grey, along with the other devices behind it. It does not manage to see that the router I 'disconnected' is the root cause. I've tried this on two routers with a similar result.
See attachment: state no root cause.jpg
The fault_correlation_engine generates additional alarms to indicate that none of the unreachable devices are the root cause.
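To make clear what result I expected from the test above, here is a minimal sketch in Python. It is only an illustration of the expected outcome, not how the fault_correlation_engine actually works internally, which I don't know. The device names, the "nms-server" start node and the find_root_causes function are all made up for the example. The idea: walk the topology outwards from the monitoring server through reachable devices only; the first unreachable device on each path is a root cause, and every unreachable device hidden behind it is just a symptom.

```python
from collections import deque


def find_root_causes(topology, start, unreachable):
    """Split unreachable devices into root causes and symptoms.

    topology:    dict mapping a device name to its directly connected neighbours
    start:       the node the monitoring is done from (hypothetical name)
    unreachable: set of devices that currently fail the ping check
    """
    seen = {start}
    queue = deque([start])
    root_causes = set()
    while queue:
        node = queue.popleft()
        for neighbour in topology.get(node, ()):
            if neighbour in unreachable:
                # First unreachable device on a path from the monitoring node:
                # a root-cause candidate. Do not walk past it.
                root_causes.add(neighbour)
            elif neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    # Everything else that is unreachable sits behind a root cause.
    symptoms = set(unreachable) - root_causes
    return root_causes, symptoms


# Hypothetical lab topology: router-b is the router I 'disconnected'.
topology = {
    "nms-server": ["router-a"],
    "router-a": ["nms-server", "router-b"],
    "router-b": ["router-a", "switch-c", "switch-d"],
    "switch-c": ["router-b"],
    "switch-d": ["router-b"],
}
unreachable = {"router-b", "switch-c", "switch-d"}

root_causes, symptoms = find_root_causes(topology, "nms-server", unreachable)
print(sorted(root_causes))  # ['router-b']
print(sorted(symptoms))     # ['switch-c', 'switch-d']
```

In other words, I expected the 'disconnected' router to be flagged as the root cause and only the devices behind it to be greyed out.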
Later, when the router has rebooted and the net_connect ping failure alarms are gone, all grey devices turn red because the cisco_monitor SNMP agent events are still present. The topology then remains in that state for about half an hour after the alarms have cleared from the alarm view. This also seems strange.
See attachment: state all bad.jpg
Does anybody else see this behavior?
Does anyone have an idea what is going wrong?