AutoSys Workload Automation


Ping Failure between Manager and Agent

  • 1.  Ping Failure between Manager and Agent

    Posted Sep 07, 2018 02:27 PM

    I think I may need some basic education on this. My understanding is that a 'ping' occurs between the DE manager and the agent every x minutes, based on the heartbeat frequency (in minutes) parm of the agent in the topology (we have the global setting turned off, so there is no single setting for all agents).

     

    We have agent WCLTROCHP01, which is set to 5 minutes. When a ping fails, the agent is marked as inactive in the topology (correct me if that is not right). What I seem to be seeing is that the ping does not occur again, so you essentially need to make the agent 'active' manually, either by making a simple change to the agent in the topology to 'redrive' the communication, or by letting a job try to run on the agent, which sets it back to active as long as the agent is pingable.

     

    We've seen situations where this agent, as well as others, is marked as inactive. When a job then tries to run on the agent, it creates an AGENTDOWN alert because the agent was inactive, but the job immediately runs because communication is 'redriven'; the agent is up and functioning fine and was only inactive in the topology. Is there a way around this, or is there something I am missing here? This gives a false representation of the jobs. We are on 11.3 SP3.



  • 2.  Re: Ping Failure between Manager and Agent

    Broadcom Employee
    Posted Sep 07, 2018 03:07 PM

    The default behavior is to send the AGENTDOWN notification on the first failed attempt to communicate with the agent, but this is configurable. There are fixes after 11.3 SP3 related to agents showing an inactive status. What is the 11.3 SP3 build? Also, are these alias agents?



  • 3.  Re: Ping Failure between Manager and Agent

    Posted Sep 07, 2018 03:11 PM

    Build is 1434 and these are not aliases.

     

    If this is configurable, then maybe it would work to have the agent marked as inactive only if, for example, 2 or 3 pings fail. Does the 11.3 documentation explain how to change this setting?



  • 4.  Re: Ping Failure between Manager and Agent

    Posted Sep 07, 2018 03:24 PM

    I believe there is a future enhancement to do a few retries after a failed ping before the agent goes inactive.

     

    We have noticed the same behavior with agents going inactive, and currently we have to modify the agent or run a job on it to make it active again. We are at R12.0 SP1.

     

    I am not sure if that enhancement is in R12.0 SP2 or R12.1.

     

    Sharon



  • 5.  Re: Ping Failure between Manager and Agent

    Posted Sep 07, 2018 03:52 PM

    I might be wrong about retries for agents going inactive. There is an old post about it, but it says the idea is under review.

     

    Retries for file monitor events getting 'Not active. "Scan Failed"' are currently planned.

     

    We do have issues with agents going inactive and file monitors getting 'Not active. "Scan Failed"'.

     

    sharon



  • 6.  Re: Ping Failure between Manager and Agent

    Posted Sep 10, 2018 08:04 AM

    Thanks Sharon for the info. We are on 12.0 SP2 in our test environment. Due to issues with the upgrade, we are on different versions in prod and test. I will check test to see if it acts the same way.


    Segun, are you able to confirm whether an enhancement will be in later releases? 



  • 7.  Re: Ping Failure between Manager and Agent

    Broadcom Employee
    Posted Sep 10, 2018 10:04 AM

    Travis, the Agent inactive timeout / heartbeat enhancement is not in r12.0 SP2 or r12.1 but will be planned for a future release.

    You can try to alleviate the AGENTDOWN notifications caused by intermittent network failures by enabling these parameters in the <DE_installdir>/conf/server.properties file and restarting the DE Server:

    agentdown.notification.threshold.attempts=5

    agentCommunicationFailed.queue.reprocessing.interval=30000

    With this configuration, the DE Server makes 5 attempts at 30-second intervals after an agent communication failure before sending the AGENTDOWN notification (the default is 1 attempt).
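
    As a rough illustration of the timing this implies, here is a minimal Python sketch. It assumes the interval property is in milliseconds (30000 = 30 seconds, as described above). The post does not say whether the first failed attempt counts immediately or only after one interval, so the sketch returns both bounds. The function name is hypothetical and is not part of DE.

```python
def agentdown_delay_bounds(attempts: int, interval_ms: int) -> tuple[float, float]:
    """Return (lower, upper) bounds, in seconds, on the delay before AGENTDOWN.

    Lower bound: the first failed attempt counts at time zero, so only
    attempts - 1 intervals elapse. Upper bound: every attempt waits a
    full interval. Which one applies is an assumption, not documented here.
    """
    if attempts < 1 or interval_ms < 0:
        raise ValueError("attempts must be >= 1 and interval_ms >= 0")
    interval_s = interval_ms / 1000
    return ((attempts - 1) * interval_s, attempts * interval_s)


# Values taken from the server.properties lines above.
low, high = agentdown_delay_bounds(attempts=5, interval_ms=30000)
print(f"AGENTDOWN would be sent roughly {low:.0f} to {high:.0f} seconds after the first failure")
```

    With the default of 1 attempt the lower bound is zero, which matches the behavior of an alert being raised on the first failure.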



  • 8.  Re: Ping Failure between Manager and Agent

    Posted Sep 10, 2018 01:41 PM

    Thanks Segun, this parm is exactly what I was looking for, although it would be even better if it could be set on a per-agent basis.



  • 9.  Re: Ping Failure between Manager and Agent

    Posted Sep 10, 2018 01:52 PM

    hi travis,

    I think you can set it per agent, but the global config would have to be set to 0.

     

    Heartbeat Frequency (in minutes)

    Specifies the frequency with which you want the server to send the heartbeat signal in minutes.

    Default: 5

    Limits: 0 and above

    Note: If you want individual agents or advanced integrations to have their own heartbeat frequencies, you can set the shared configuration parameter Global Agent heartbeat frequency to 0 (zero).

    Heartbeat attempts before sending an SNMP notification

    Specifies the number of heartbeat signals the server attempts before it sends an SNMP message indicating agent or advanced integration inactivity.

    Default: 1

    Limits: 1 and above
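
    The interaction between the global and per-agent settings quoted above can be sketched as follows. This is only an illustration: the class and function names are hypothetical, and it assumes that a non-zero global value applies to every agent, which is how the note reads but is not stated outright.

```python
from dataclasses import dataclass


@dataclass
class AgentHeartbeat:
    """Per-agent topology settings (hypothetical model, documented defaults)."""
    frequency_minutes: int = 5       # Heartbeat Frequency (in minutes), default 5
    attempts_before_snmp: int = 1    # Heartbeat attempts before SNMP, default 1


def effective_frequency(global_minutes: int, agent: AgentHeartbeat) -> int:
    """Pick the heartbeat frequency that would apply to an agent.

    Assumption: a global value of 0 defers to the per-agent setting,
    and any other global value overrides it.
    """
    if global_minutes < 0:
        raise ValueError("global_minutes must be 0 or above")
    return agent.frequency_minutes if global_minutes == 0 else global_minutes


# Global heartbeat set to 0, so the agent's own 2-minute setting is used.
print(effective_frequency(0, AgentHeartbeat(frequency_minutes=2)))
```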



  • 10.  Re: Ping Failure between Manager and Agent

    Posted Sep 10, 2018 02:54 PM

    Thanks Sharon. I thought that could be used for something but we do have the 'Global agent heartbeat interval in minute' set for 0 within the Server Shared Parameters. Then I looked at an agent that has the 'Heartbeat attempts before sending an SNMP notification' parm set for 5.

     

    When looking in the tracelogs I see the agent was marked as inactive at 2AM today and in the receiver log of that agent the pings were successful every 5 minutes all the way up to 2AM when it failed. So, it seems even though it was setup for 5, it took only 1 failed ping for the DE manager to mark it as inactive.



  • 11.  Re: Ping Failure between Manager and Agent

    Posted Sep 11, 2018 10:18 AM

    OK, here is what I have found from looking at the tracelogs for about 6 different agents, which have different settings for their 'Heartbeat frequency (in minutes)' and 'Heartbeat attempts before sending an SNMP notification' parms.

     

    In our shop at least, when an agent is marked as inactive, DE attempts another ping 50 minutes later and, if successful, puts the agent back to active. However, if a job tries to run on the agent before the 50 minutes are up and communication with the agent succeeds, the agent is put back to active as well. I'm still looking to see if the 50 minutes can be changed to a much lower number. It's good that the manager auto pings at a 50-minute interval, but 50 minutes seems a bit high.



  • 12.  Re: Ping Failure between Manager and Agent
    Best Answer

    Broadcom Employee
    Posted Sep 11, 2018 11:58 AM

    That is correct. If the agent ping fails, the agent is marked 'inactive' and the next ping is delayed by a multiplication factor of 10. That is, if the heartbeat frequency is 5 minutes, the next ping is set for 50 minutes later. However, any job submitted to the agent before the next ping will mark the agent 'active' if there is no communication issue. The idea of making the delay factor configurable is under review and will be planned for a future release.
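
    The delay rule described above can be written out as a small Python sketch. The names are hypothetical and only illustrate the arithmetic; the factor of 10 is the one stated in this reply and is not a DE property name.

```python
INACTIVE_DELAY_FACTOR = 10  # factor stated above; not configurable in these releases


def minutes_until_next_ping(heartbeat_minutes: int, agent_active: bool) -> int:
    """Return the wait before the manager's next ping to an agent.

    Active agents are pinged at the heartbeat frequency. After a failed
    ping the agent is inactive and the wait is multiplied by the factor.
    """
    if heartbeat_minutes <= 0:
        raise ValueError("heartbeat_minutes must be positive")
    if agent_active:
        return heartbeat_minutes
    return heartbeat_minutes * INACTIVE_DELAY_FACTOR


# A 5-minute heartbeat gives a 50-minute wait once the agent is inactive.
print(minutes_until_next_ping(5, agent_active=False))
```

    A job submitted to the agent in the meantime can still bring it back to active earlier, which this sketch does not model.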



  • 13.  Re: Ping Failure between Manager and Agent

    Posted Sep 11, 2018 12:16 PM

    Thank you Segun, I have voted up on this idea. This explains now what I thought was the case. It isn't a global setting of 50 mins at our shop, I just happened to pick agents that had a heartbeat frequency of 5 (that is our default). When I picked one that had 2 I saw that it was 20 mins from when it was put as inactive to the time an auto ping was sent out.



  • 14.  Re: Ping Failure between Manager and Agent

    Posted Sep 20, 2018 09:34 PM

    Hi Segun. We are wanting to make a change to the below parm. Basically we are thinking of increasing to say 3 or 5 so that agents don't get marked as inactive from 1 ping failure. However, I just want to confirm something first although going off of what has already been stated, I believe I know the answer but just want that reassurance before making any change.

     

    This parm below does not at all impact how the DE manager interacts with the secondary server in a high availability setup correct? I need to make sure that it only impacts how the DE manager interacts with the agents themselves and not the other DE manager.

     

    agentdown.notification.threshold.attempts



  • 15.  Re: Ping Failure between Manager and Agent

    Broadcom Employee
    Posted Sep 21, 2018 08:51 AM

    That is correct. It does not impact the other DE manager in HA. You can refer to the docops for more information on the parameter.



  • 16.  Re: Ping Failure between Manager and Agent

    Posted Sep 21, 2018 09:06 AM

    Thank you Segun, just one other question that I forgot to add in the last post. Does CA recommend to have it set as 1 failed ping? I think in our environment with all the traffic and different factors, we occasionally will have our manager not able to reach out to an agent but usually can ping the next time so increasing this parm to 2 or 3 seems to make more sense.

     

    Just wanted to get the 'CA recommendation' if there is one. I imagine it is 1 because that is what it comes with out of the box I believe.



  • 17.  Re: Ping Failure between Manager and Agent

    Broadcom Employee
    Posted Sep 21, 2018 10:53 AM

    The default is 1 ping failure so it depends on the situation and network. If the network is unstable, you can increase the number of attempts as needed.



  • 18.  Re: Ping Failure between Manager and Agent

    Posted Sep 25, 2018 01:57 PM

    I guess I have one final question that I didn't really think of before. If the DE manager fails to ping the agent, does it immediately attempt to ping again (assuming you have the setting at more than 1)? Or does it wait X minutes based on the heartbeat frequency in the agent definition?