I had always mistakenly assumed that the SYS_HOST_ALIVE Script function was similar to a UNIX ping and reflected the Agent’s current connection status and ability to perform work. It appears, though, that this is only accurate when the Agent has been stopped/terminated by UC4 functions. If the network connection has been lost or the server is genuinely dead, then the results are suspect for a period of time.
I believe that this delay is controlled by the KEEP_ALIVE parameter in the UC_HOSTCHAR_DEFAULT System Variable object in Client 0, which is currently set to 1800 seconds for all of our Agents. That means, as far as I can tell, UC4 checks whether the Agent is “alive” only every 30 minutes. Even at its best, with this value set to 60 seconds, it appears that the status returned by SYS_HOST_ALIVE could be up to two minutes out of date.
Many of our applications have multiple server configurations, with a primary host and one or more alternate hosts on which to execute their tasks. The way the host is selected depends almost entirely on the result of the SYS_HOST_ALIVE Script function. We have used this technique for years and it has never been reported as an issue. That means that either something has changed in the UC4 definitions (I don’t think so), nobody ever noticed, or this is the first time that we specifically tested this particular condition (the server was physically turned off while it was active). It’s my belief that it is the last of these that has finally presented itself.
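To make the problem concrete, here is a minimal sketch in Python (not UC4 Script, and not our actual job definitions) of a primary/alternate selection that trusts a cached "alive" flag. The CachedStatus type, the pick_host function, the timestamps and the max_age cutoff are all hypothetical names made up for illustration; I don't know that UC4 exposes anything like a last-checked time.

```python
from dataclasses import dataclass


@dataclass
class CachedStatus:
    """Last status recorded for an agent (hypothetical model, not a UC4 API)."""
    alive: bool        # what a SYS_HOST_ALIVE-style lookup would report
    checked_at: float  # time of the last keep-alive check, in seconds


def pick_host(hosts, status, now, max_age):
    """Return the first host whose cached status is 'alive' and fresh enough.

    Hosts are listed in order of preference (primary first). A status older
    than max_age seconds is treated as unknown and skipped. Returns None if
    no host qualifies.
    """
    for host in hosts:
        entry = status.get(host)
        if entry and entry.alive and now - entry.checked_at <= max_age:
            return host
    return None


# The primary was last seen alive 1500 s ago (it may have died since);
# the alternate was checked 30 s ago.
status = {
    "PRIMARY": CachedStatus(alive=True, checked_at=0.0),
    "ALTERNATE": CachedStatus(alive=True, checked_at=1470.0),
}
now = 1500.0

# Trusting the cache for a full 1800 s keep-alive interval picks the primary.
naive = pick_host(["PRIMARY", "ALTERNATE"], status, now, max_age=1800)

# Requiring a status no older than 120 s falls through to the alternate.
strict = pick_host(["PRIMARY", "ALTERNATE"], status, now, max_age=120)
```

With the long interval the selection still returns the primary even though nothing has confirmed it is up for 25 minutes, which is essentially what we ran into when the server was powered off.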
So, does anyone have a suggestion of what can be done to have the state of an Agent be more immediately reflective of its true status or am I missing something obvious?
Thanks, Mark
P.S. We are on Operations Manager Version 8.