Automic Workload Automation

  • 1.  Crashing and/or hanging Windows agents: Anyone else having this, too?

    Posted Apr 03, 2017 09:55 AM
    Dear Community,

    is anyone besides us having severe problems with crashing or hanging Windows agents, with or without Sophos Anti-Virus?

    Some background:

    Our environment has about 750 active agents (10.0.8), a large number of UC4 users due to no centralized "Job Administration", and some high-profile jobs.

    Unfortunately, we are suffering quite badly from crashing and hanging UC4 agents, mostly on Windows, and at least since we switched to Automic V10 some time around 2014/2015.

    Crashes sometimes happen multiple times per day, almost exclusively on busy servers (usually production servers). Sometimes a server has several crashes per day and then suddenly stops crashing for weeks on end, without any discernible reason. We saw all kinds of different crash behaviours: with or without minidumps (or full dumps, once configured), Windows Event log entries or Automic crash logs.

    Funnily enough, some agents even stopped crashing once we enabled the trace flags, and happily resumed crashing when we disabled them - Heisenberg sends his regards. I kinda refuse, however, to run perpetual level 9 traces to /dev/null on 750 servers in order to improve agent stability :)

    Ultimately and despite serious effort (including old file transfer protocols and a wealth of other attempts), we were unable to replicate the crashes or find any useful pattern.

    We opened several tickets for this over the last year. Automic attributed some crashes to other factors (hence we upgraded the agents several times), but attributed most of them to Sophos Anti-Virus. The presence of Sophos, however, is a strict requirement the company places on every production server, and until last month it was not negotiable at all.

    It took quite some time and effort, but we eventually convinced the powers-that-be to uninstall Sophos on one production machine for a week, and this cured all crashes of that particular agent for the week. As soon as Sophos was re-enabled, the crashes returned on that machine. This is at least a strong indication that the combination of Automic Agent and Sophos is troublesome, but Sophos is still mandated in our company without alternative.

    We're pretty much resigned at this point to restarting the agents whenever they crash (by means of a small UNIX script I made which queries the Service Manager across the agent machines every minute and restarts dead agents).

    In addition, we have also seen an increase in hanging agents recently. I suspect, however, that this has always been the case and is only becoming more visible now that we have significantly improved the monitoring of agents. For some time now, we have made the agents execute jobs periodically, and Nagios monitors whether the agent is actually performing tasks, not just telling the UC4 server it's alive. Since then, we have encountered about one case per week on average of an agent which is not performing jobs, but usually still answers keepalive requests. Those agents are "running" in Service Manager Dialog, but one can't stop them. One has to log into the Windows server, kill the process, and start the agent anew to fix the problem. Because it happens less often, this problem is even more elusive than the crashes, and we have no idea whether interaction with Sophos may be the culprit here, too.
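
    For anyone wanting to set up something similar, here is a minimal sketch of such a "is the agent really working" check. It is not our actual Nagios setup, only an illustration under assumptions: the periodic UC4 job is assumed to touch a heartbeat file that is readable from wherever the check runs, and the function name check_heartbeat and its arguments are made up for this example. Only the exit codes (0 = OK, 2 = CRITICAL, 3 = UNKNOWN) are the standard Nagios plugin convention. It needs GNU stat.

```shell
#!/usr/bin/env bash
# Hypothetical Nagios-style check: is the agent's heartbeat file fresh?
# Assumption: a periodic UC4 job on the agent touches the heartbeat file.

# check_heartbeat FILE MAX_AGE_SECONDS
# Prints one status line and returns a Nagios plugin exit code:
# 0 = OK, 2 = CRITICAL, 3 = UNKNOWN.
check_heartbeat() {
  local file=$1 max_age=$2 now mtime age

  if [ ! -e "$file" ]; then
    echo "UNKNOWN - heartbeat file $file not found"
    return 3
  fi

  now=$(date +%s)
  mtime=$(stat -c %Y "$file")   # GNU stat: modification time in epoch seconds
  age=$(( now - mtime ))

  if [ "$age" -gt "$max_age" ]; then
    echo "CRITICAL - last heartbeat ${age}s ago (limit ${max_age}s)"
    return 2
  fi

  echo "OK - last heartbeat ${age}s ago"
  return 0
}
```

    A call like check_heartbeat /path/to/heartbeat 900 would then go critical once the job has not run for 15 minutes, even if the agent process itself still looks alive.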

    So, anyone else?

    Cheers,
    Carsten


  • 2.  Crashing and/or hanging Windows agents: Anyone else having this, too?

    Posted Apr 03, 2017 10:11 AM
    Regarding Sophos, we have a KE article regarding Windows agent crashing when Sophos is installed:


    Regarding the agent "hanging", I would suggest uninstalling Sophos and observing whether it still keeps hanging. Check the Windows Agent log as well for any clue. You can log a ticket with Automic Support regarding the agent "hanging" (regarding the crashes with Sophos, you will be advised to either uninstall it or use different anti-virus software), but I think you will need to uninstall Sophos first to eliminate it as a cause, then attach the logs to the ticket when it happens again.

    Regards,
    Christine


  • 3.  Crashing and/or hanging Windows agents: Anyone else having this, too?

    Posted Apr 03, 2017 10:57 AM
    Hi,

    thanks. I know that KB article, but uninstalling Sophos (or replacing it with something else) on all UC4 machines is prohibited by our company's security division. I'd be the first to say "get rid of it then", but for some reason they really like their Sophos AV ...

    Alas, I'm afraid we'll have to find some way to fix the root cause on this one.

    edit: same problem for the hangs. Unfortunately I cannot have them uninstall Sophos on 700+ machines, hence I'm looking for evidence of whether other clients who have Sophos see the hangs, or the crashes, or both ...

    Regards,
    Carsten


  • 4.  Crashing and/or hanging Windows agents: Anyone else having this, too?

    Posted Apr 03, 2017 01:34 PM
    What about excluding the agent's BIN directory and/or the agent binary from the AV?

    Could this be an option?


  • 5.  Crashing and/or hanging Windows agents: Anyone else having this, too?

    Posted Apr 04, 2017 04:37 AM
    > and/or the agent binary from the AV?

    Nope, good idea in principle, but none of that works. We tried:

    - disabling "Web Intelligence" and other modules
    - excluding Automic binaries from Real Time scans
    - removing the kernel injection of Sophos in-process DLLs by removing the registry keys (this MIGHT even work, were it not for the fact that Sophos re-enables AND re-loads its DLLs at arbitrary times - i.e. not just when Sophos updates or a reboot happens, but at seemingly random times).

    There was one more option for disabling parts of Sophos, also by changing a registry key, which might work and which Sophos support suggested. But that would disable the entire Real Time Scanner and a bunch of other things, so our security people ultimately denied that.

    It currently appears that Automic support is probably right in claiming that only a complete uninstall of Sophos works. Now, uninstalling Sophos is mighty broken in itself and twice left a server with a borked network stack (hint: netsh and a winsock reset are needed after that to get the server back to working condition - splendid ...).



  • 6.  Crashing and/or hanging Windows agents: Anyone else having this, too?

    Posted Apr 04, 2017 04:54 AM
    Btw, on the subject of hanging agents: I had one hanging agent yesterday which, according to the agent log, had its TCP connection terminated. Could have been a network problem, but to me this looks like something actively terminated the connection by sending an unexpected TCP RST.

    Bottom line, the agent tried to reconnect to the CP, that partially failed, and then the agent never tried to reconnect again and just stuck around for two hours without doing anything (but the local Service Manager considered it "running" because all it seemingly does is monitor the presence of the agent process).

    I have opened a case with Automic for this. The agent immediately managed to reconnect after I killed it and restarted it, so the network was fine at that time, the hanging agent just didn't even try.

    But I also looked at old logs of other hanging agents for this specifically, and those didn't show these messages. So just as with the crashes, there is probably more than one reason why agents hang :(


  • 7.  Crashing and/or hanging Windows agents: Anyone else having this, too?

    Posted Apr 05, 2017 07:45 AM
    We too have recently seen an increase in the number of agents that are "mysteriously" disconnecting. I would be interested in the code you use to scan and restart (the agent restart setup in the marketplace does not work in version 10). For now, we also just suck it up and restart the agents via the Service Manager Dialog. It just stinks getting called several times in the middle of the night to have to do it :)


  • 8.  Crashing and/or hanging Windows agents: Anyone else having this, too?

    Posted Apr 05, 2017 08:17 AM
    Hi Clint,

    Our restart script is a Bash shell script running on a Linux box. It gets started from cron (of all things ;) every two minutes, and loops over all lines from a semicolon-separated text file containing agent name, DNS name and UC4 environment for the server (and filters away comments in the text file).
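
    To make the list format concrete, here is a small sketch of such a file and of one way to read it. The agent names, host names and environment names are invented for the example, and parse_serverlist is a hypothetical helper, not a function from my real script (which uses cut, as in the core loop I break down in this post):

```shell
#!/usr/bin/env bash
# Example server list (all names are made up):
#
#   # agent name;DNS name;UC4 environment
#   WINAGENT01;winsrv01.example.com;UC4PROD
#   WINAGENT02;winsrv02.example.com;UC4TEST

# parse_serverlist FILE
# Prints "AGENTNAME SERVER ENVNAME" for every entry,
# skipping comment lines and blank lines.
parse_serverlist() {
  local agent server env
  # "|| [ -n "$agent" ]" also handles a last line without trailing newline
  while IFS=';' read -r agent server env || [ -n "$agent" ]; do
    case "$agent" in
      ''|'#'*) continue ;;   # blank line or comment
    esac
    printf '%s %s %s\n' "$agent" "$server" "$env"
  done < "$1"
}
```

    Letting read split on the semicolons avoids three echo/cut calls per line, which is purely a matter of taste at this list size.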

    For each iteration, I use a Linux service manager on that box to connect to the target server to send the "GET_PROCESS_LIST" command, filter out some output junk, and log the output to a file for later processing.

    I check the return code to find out if there was a connection error to the remote (i.e. Windows) service manager itself (in which case, tough luck, I just alert someone from the script about there being a bigger problem by sending an email).

    If the remote service manager is reachable (which it usually still is), I parse the output from it looking for the status, which is "S" for a stopped (usually meaning: crashed) agent. If I find an Agent with state "S", I use the local service manager to send the command to the remote service manager, which makes it restart the agent.

    There's a lot more to that because I like logging and stuff, so the script writes extensive logs along with counts of agents processed/ok/faulty which I can later analyze, and I track the status of agents so I can report failures, like, once a day without spamming people. And also I have multiple lists of servers, so I can run it at night with more servers than at daytime, where I might want to analyze hanging agents myself first.
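
    The "report once a day without spamming" part could look roughly like the sketch below. This is a simplified illustration and not my production code: the function names note_failure and daily_report, the directory layout and the file naming are all assumptions made for the example.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for a once-a-day failure digest.

# note_failure DIR AGENT
# Remember that AGENT had to be restarted today; each agent is
# recorded at most once per day, however often it crashes.
note_failure() {
  local file="$1/failed-$(date +%Y%m%d).list"
  mkdir -p "$1"
  grep -qxF -- "$2" "$file" 2>/dev/null || echo "$2" >> "$file"
}

# daily_report DIR [YYYYMMDD]
# Print a summary for the given day (default: today).
# Returns 1 and prints nothing if there is nothing to report.
daily_report() {
  local day=${2:-$(date +%Y%m%d)}
  local file="$1/failed-$day.list"
  [ -s "$file" ] || return 1
  echo "Agents restarted on $day: $(grep -c '' "$file")"
  sort "$file"
}
```

    The restart loop would call note_failure for every agent it finds in state "S", and a separate daily cron entry would pipe the output of daily_report into a mail, skipping the mail when it returns 1.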

    But I'll quickly try to break down the actual core of it where the "magic" happens:

    --- snip ---

    IFS=$'\n'

    # read the server list, skipping comments and empty lines
    for LINE in $(grep -v -e '^#' -e '^$' "$SERVERLIST.TXT") ; do

      # break line apart into components
      AGENTNAME=$(echo "$LINE" | cut -d ";" -f 1)
      SERVER=$(echo "$LINE" | cut -d ";" -f 2)
      ENVNAME=$(echo "$LINE" | cut -d ";" -f 3)

      # query status from remote service manager
      # (pipefail makes $? reflect a failing ucybsmcl, not just the last sed)
      set -o pipefail
      ./ucybsmcl -c GET_PROCESS_LIST -h "$SERVER:8871" -n "$ENVNAME" 2>"${LOGFILE}.err" | sed 's/^"//g' | sed 's/"$//g' | sed 's/" "/;/g' > /tmp/agentcheck.out
      CONNECTION_ERROR=$?
      set +o pipefail

      # check if we were able to connect to remote service manager at all
      if [ "$CONNECTION_ERROR" -ne 0 ] ; then

         # do whatever is warranted here for you as a response, I send an email to myself for example
         :   # no-op placeholder, bash does not allow an empty "then" branch

      fi

      # output gets written to temp file, can be multi-line. Then, loop over those lines.
      for TEMPFILE_LINE in $(cat /tmp/agentcheck.out) ; do
        AGENTTYPE=$(echo "$TEMPFILE_LINE" | cut -d ";" -f 1)
        AGENTSTATUS=$(echo "$TEMPFILE_LINE" | cut -d ";" -f 2)

        # if a stopped agent is found, start it
        if [ "$AGENTSTATUS" == "S" ] ; then

          ./ucybsmcl -c START_PROCESS -h "$SERVER:8871" -n "$ENVNAME" -s "$AGENTTYPE"
          RETVAL=$?
          # you should evaluate $RETVAL for any errors and react to them, just in case

        # this here came as a later bugfix, when we realized an agent can in rare cases be neither "S" nor "R"
        # this happens very rarely when something goes very, very wrong in Service Manager, but it should be
        # caught nonetheless
        elif [ "$AGENTSTATUS" != "S" ] && [ "$AGENTSTATUS" != "R" ] ; then

          # do whatever is appropriate for you
          :   # no-op placeholder

        else

          # do here whatever you like to do for agents that are "ok", for example
          # I count them for reporting purposes
          :   # no-op placeholder

        fi
      done
    done
    --- snip ---

    Hope this helps. Any questions, let me know.

    Cheers,
    Carsten


  • 9.  Crashing and/or hanging Windows agents: Anyone else having this, too?