I have a situation where a pair of servers (actually about 600 pairs) forms a kind of mini cluster: a process has to be running on exactly one of the two servers, not both and not neither.
My legacy approach was to have cron (these are Linux servers) run a script on each server that used ps and grep to test whether the process was running locally, then ran that same check on the other server via ssh and counted the results. A count of 0 or 2 is bad; a count of 1 is good.
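For reference, the legacy check boiled down to something like the following sketch (the process name, peer hostname, and exact commands are placeholders I've assumed, not details from my actual script):

```shell
#!/bin/sh
# Sketch of the legacy pair check. "PROC" and "PEER" are invented
# placeholder names for illustration only.

# Given the local and remote instance counts, judge the pair:
# exactly 1 is healthy, 0 or 2 is an alert.
check_pair() {
  total=$(( $1 + $2 ))
  if [ "$total" -eq 1 ]; then
    echo "OK"
  else
    echo "ALERT: $total instances across the pair"
  fi
}

# On the real servers the two counts came from ps/grep locally and the
# same command pushed over ssh, roughly:
#   local_count=$(pgrep -cx "$PROC")
#   remote_count=$(ssh "$PEER" "pgrep -cx '$PROC'")
check_pair 1 0   # healthy
check_pair 0 0   # neither server is running it
check_pair 1 1   # both servers are running it
```

Once ssh goes away, the second count is exactly the piece that can no longer be collected this way.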
My problem is that ssh access is being taken away, so this script is no longer a viable option.
I can use either the processes probe or logmon to determine whether the process is running on a given server. The problem is that, looking at any one server in isolation, either "running" or "not running" might be the correct state. So I can't simply count "not running" alarms, because not running may be the desired state; similarly, I can't count "running" alarms, because those are generally mapped to Clear.
What eludes me is how to get at this information without polluting my alarms. My initial thought is to make both "running" and "not running" a non-clear alarm and encode the state in the message. Then an AO profile would generate a new alarm whenever a pair of servers ever has two "running" or two "not running" alarms, and otherwise create a clear alarm.
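The per-pair decision table in that scheme is small. A rough sketch (in shell purely for illustration; the real logic would live in the AO profile, and the state strings are ones I've made up):

```shell
#!/bin/sh
# Hypothetical decision table for the proposed scheme: each server in a
# pair reports "running" or "not-running" in its alarm message, and the
# pair-level result is derived from the combination.
evaluate_pair() {
  case "$1,$2" in
    running,not-running|not-running,running)
      echo "clear" ;;                    # exactly one instance: healthy
    running,running)
      echo "alarm: both running" ;;      # duplicate instance
    not-running,not-running)
      echo "alarm: neither running" ;;   # service down on the pair
  esac
}

evaluate_pair running not-running
evaluate_pair running running
evaluate_pair not-running not-running
```

The trouble is not the logic, which is trivial, but that feeding it requires raising a non-clear alarm for every process on every server just to carry the state.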
The problem is that this would mean thousands of alarms (multiple processes per server pair), which would wreck my already poor alarm database performance.