Automic Workload Automation

Crashing and/or hanging Windows agents: Anyone else having this, too?

1. Crashing and/or hanging Windows agents: Anyone else having this, too?

Carsten Schmitz
Posted Apr 03, 2017 09:55 AM

Dear Community,

Is anyone besides us having severe problems with crashing or hanging Windows agents, with or without Sophos Anti-Virus?

Some background:

Our environment has about 750 active agents (10.0.8), a large number of UC4 users due to no centralized "Job Administration", and some high-profile jobs.

Unfortunately, we have been suffering quite badly from crashing and hanging UC4 agents, mostly on Windows, at least since we switched to Automic V10 some time around 2014/2015.

Crashes sometimes happen multiple times per day, almost exclusively on busy servers (usually production servers). Sometimes a server has several crashes per day and then suddenly stops crashing for weeks on end, without any discernible reason. We saw all kinds of different crash behaviours - with or without minidumps (or full dumps, once configured), Windows Event log entries or Automic crash logs.

Funnily enough, some agents even stopped crashing once we enabled the trace flags, and happily resumed crashing when we disabled them - Heisenberg sends his regards. I kinda refuse, however, to run perpetual level 9 traces to /dev/null on 750 servers in order to improve agent stability :)

Ultimately and despite serious effort (including old file transfer protocols and a wealth of other attempts), we were unable to replicate the crashes or find any useful pattern.

We opened several tickets for this over the last year. Automic attributed some crashes to other factors (hence we upgraded the agents several times), but attributed most of them to Sophos Anti-Virus. The presence of Sophos, however, is a strict demand the company places on every production server, and until last month it was not negotiable at all.

It took quite some time and effort, but we eventually convinced the powers-that-be to uninstall Sophos on one production machine for a week, and this cured all crashes of that particular agent for the week. Once Sophos was re-enabled, the crashes returned on that machine. This is at least a strong indication that the combination of Automic agent and Sophos is troublesome, but Sophos is still mandated in our company without alternative.

We're pretty much resigned at this point to restarting the agents whenever they crash (by means of a small UNIX script I made which queries the Service Manager across the agent machines every minute and restarts dead agents).

In addition, we have recently seen an increase in hanging agents. I suspect, however, that this has always been the case and is only becoming more visible now because we have significantly improved the monitoring of agents. For some time now, we have been making the agents execute jobs periodically, and Nagios monitors whether the agent is actually performing tasks, not just telling the UC4 server it's alive. Since then, we encounter about one case per week on average of agents which are not performing jobs, but usually still answer keepalive requests. Those agents are "running" in Service Manager Dialog, but one can't stop them. One has to log into the Windows server, kill the process, and start the agent anew to fix the problem. Because it happens less often, this problem is even more elusive than the crashes, and we have no idea whether interaction with Sophos may be the culprit here, too.
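
A minimal sketch of what such an "is the agent really working" check could look like on the Nagios side. It assumes that the periodic job run by the agent updates a heartbeat file somewhere the check can read it; the function name, the file, and the thresholds are hypothetical and not taken from the setup described above. Only the return codes (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN) follow the usual Nagios plugin convention.

```bash
#!/usr/bin/env bash
# Sketch of a Nagios-style freshness check. Assumption: a periodic job run by
# the agent updates a heartbeat file; the path and thresholds are hypothetical.

# check_heartbeat FILE WARN_SECONDS CRIT_SECONDS
# Prints one status line and returns the Nagios plugin code:
# 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
check_heartbeat() {
    local file=$1 warn=$2 crit=$3 now mtime age

    if [ ! -e "$file" ]; then
        echo "UNKNOWN - heartbeat file $file not found"
        return 3
    fi

    now=$(date +%s)
    # GNU stat uses -c %Y, BSD stat uses -f %m; try both.
    mtime=$(stat -c %Y "$file" 2>/dev/null || stat -f %m "$file")
    age=$(( now - mtime ))

    if [ "$age" -ge "$crit" ]; then
        echo "CRITICAL - last job ran ${age}s ago"
        return 2
    elif [ "$age" -ge "$warn" ]; then
        echo "WARNING - last job ran ${age}s ago"
        return 1
    fi
    echo "OK - last job ran ${age}s ago"
    return 0
}
```

A plugin wrapper would call something like `check_heartbeat /path/to/heartbeat 600 1800` and exit with the function's return code, so an agent that still answers keepalives but has stopped running jobs shows up as WARNING or CRITICAL.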

So, anyone else?

Cheers,
Carsten
2. Crashing and/or hanging Windows agents: Anyone else having this, too?

Legacy User
Posted Apr 03, 2017 10:11 AM

Regarding Sophos, we have a KE article regarding Windows agent crashing when Sophos is installed:

https://automic.force.com/support/apex/CommunityArticleDetail?id=ka4b00000004MLu

Regarding the agent "hanging", I would suggest uninstalling Sophos and observing whether it still keeps hanging. Check the Windows Agent log as well for any clue. You can log a ticket with Automic Support regarding the agent "hanging" (for crashes related to Sophos, you will be advised to either uninstall it or use different anti-virus software), but I think you will need to uninstall Sophos first to eliminate it as a cause, then attach the logs to the ticket when it happens again.

Regards,
Christine
3. Crashing and/or hanging Windows agents: Anyone else having this, too?

Carsten Schmitz
Posted Apr 03, 2017 10:57 AM

Hi,

Thanks. I know that KB article, but uninstalling Sophos (or replacing it with something else) on all UC4 machines is prohibited by our company's security division. I'd be the first to say "get rid of it then", but for some reason they really like their Sophos AV ...

Alas, I'm afraid we'll have to find some way to fix the root cause on this one.

Edit: same problem for the hangs. Unfortunately, I cannot have them uninstall Sophos on 700+ machines, hence I'm looking for evidence of whether other clients who have Sophos see the hangs, or the crashes, or both ...

Regards,
Carsten
4. Crashing and/or hanging Windows agents: Anyone else having this, too?

Anon Anon
Posted Apr 03, 2017 01:34 PM

What about excluding the agent's BIN directory and/or the agent binary from the AV?

Can this be an option?
5. Crashing and/or hanging Windows agents: Anyone else having this, too?

Carsten Schmitz
Posted Apr 04, 2017 04:37 AM

> and/or the agent binary from the AV?

Nope, good idea in principle, but none of that works. We tried:

- disabling "Web Intelligence" and other modules
- excluding Automic binaries from Real Time scans
- removing the kernel injection of Sophos in-process DLLs by removing the registry keys (this MIGHT even work, were it not for the fact that Sophos re-enables AND re-loads its DLLs at arbitrary times - i.e. not just when Sophos updates or a reboot happens, but at seemingly random times).

There was one more option for disabling parts of Sophos, also by changing a registry key, which Sophos support suggested and which might work. But that would disable the entire Real Time Scanner and a bunch of other things, so our security people ultimately denied it.

It currently appears that Automic support is probably right in claiming that only a complete uninstall of Sophos works. Now, uninstalling Sophos is mighty broken in itself and twice left a server with a borked network stack (hint: netsh and a winsock reset are needed after that to get the server back to working condition - splendid ...).
6. Crashing and/or hanging Windows agents: Anyone else having this, too?

Carsten Schmitz
Posted Apr 04, 2017 04:54 AM

Btw, on the subject of hanging agents: I had one hanging agent yesterday which, according to the agent log, had its TCP connection terminated. Could have been a network problem, but to me this looks like something actively terminated the connection by sending an unexpected TCP RST.

Bottom line, the agent tried to reconnect to the CP, that partially failed, and then the agent never tried to reconnect again and just stuck around for two hours without doing anything (but the local Service Manager considered it "running" because all it seemingly does is monitor the presence of the agent process).

I have opened a case with Automic for this. The agent immediately managed to reconnect after I killed it and restarted it, so the network was fine at that time, the hanging agent just didn't even try.

But I also looked at old logs of other hanging agents for this specifically, and those didn't show these messages. So just as with the crashes, there is probably more than one reason why agents hang :(
7. Crashing and/or hanging Windows agents: Anyone else having this, too?

Clint_Knight_843
Posted Apr 05, 2017 07:45 AM

We too have recently seen an increase in the number of agents that are "mysteriously" disconnecting. I would be interested in the code you use to scan and restart (the agent restart setup in the marketplace does not work in version 10). For now, we have also just sucked it up and restart the agents via the Service Manager Dialog. It just stinks getting called several times in the middle of the night to have to do it :)
8. Crashing and/or hanging Windows agents: Anyone else having this, too?

Carsten Schmitz
Posted Apr 05, 2017 08:17 AM

Hi Clint,

Our restart script is a Bash shell script running on a Linux box. It gets started from cron (of all things ;) every two minutes, and loops over all lines from a semicolon-separated text file containing agent name, DNS name and UC4 environment for the server (and filters away comments in the text file).
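
To make the list format concrete, here is a small sketch of reading such a file, skipping comments and blank lines. The sample entries and the function name are invented for illustration; only the three semicolon-separated fields (agent name, DNS name, UC4 environment) come from the description above.

```bash
#!/usr/bin/env bash
# Sketch: read a semicolon-separated server list, skipping comments and blank
# lines. The sample entries below are invented for illustration.

# read_serverlist FILE
# Prints "agent=<name> server=<dns> env=<environment>" for each usable entry.
read_serverlist() {
    local agent server env
    while IFS=';' read -r agent server env; do
        case $agent in
            ''|'#'*) continue ;;   # blank line or comment
        esac
        printf 'agent=%s server=%s env=%s\n' "$agent" "$server" "$env"
    done < "$1"
}

# Example list file (hypothetical names)
list=$(mktemp)
cat > "$list" <<'EOF'
# agent;dns name;environment
WINAGENT01;winsrv01.example.com;PROD

WINAGENT02;winsrv02.example.com;TEST
EOF

read_serverlist "$list"
rm -f "$list"
```

Reading with `IFS=';' read -r` splits the three fields in one step, so no separate `cut` calls are needed per field.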

For each iteration, I use a Linux service manager on that box to connect to the target server to send the "GET_PROCESS_LIST" command, filter out some output junk, and log the output to a file for later processing.

I check the return code to find out if there was a connection error to the remote (i.e. Windows) service manager itself (in which case, tough luck, I just alert someone from the script about there being a bigger problem by sending an email).

If the remote service manager is reachable (which it usually still is), I parse the output from it looking for the status, which is "S" for a stopped (usually meaning: crashed) agent. If I find an Agent with state "S", I use the local service manager to send the command to the remote service manager, which makes it restart the agent.

There's a lot more to that because I like logging and stuff, so the script writes extensive logs along with counts of agents processed/ok/faulty which I can later analyze, and I track the status of agents so I can report failures, like, once a day without spamming people. And also I have multiple lists of servers, so I can run it at night with more servers than at daytime, where I might want to analyze hanging agents myself first.
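
The "report once a day without spamming" part can be sketched with one timestamp file per agent. This is an illustration under assumptions, not the bookkeeping of the actual script: `STATE_DIR`, `ALERT_INTERVAL`, and `should_alert` are hypothetical names.

```bash
#!/usr/bin/env bash
# Sketch: rate-limit failure reports per agent using a timestamp file.
# STATE_DIR and the 24-hour interval are assumptions, not the original setup.

STATE_DIR=${STATE_DIR:-/tmp/agentcheck-state}
ALERT_INTERVAL=${ALERT_INTERVAL:-86400}   # seconds between reports per agent

# should_alert AGENTNAME
# Returns 0 (and records the time) if no report was sent for this agent within
# ALERT_INTERVAL seconds; returns 1 otherwise.
should_alert() {
    local agent=$1 stamp now last
    mkdir -p "$STATE_DIR" || return 0     # cannot track state: rather alert
    stamp="$STATE_DIR/$agent.last_alert"
    now=$(date +%s)
    last=0
    if [ -r "$stamp" ]; then
        last=$(cat "$stamp")
    fi
    if [ $(( now - last )) -ge "$ALERT_INTERVAL" ]; then
        echo "$now" > "$stamp"
        return 0
    fi
    return 1
}
```

In the restart loop, the mail would then be wrapped as `if should_alert "$AGENTNAME"; then ... fi`, so a repeatedly crashing agent produces at most one report per interval.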

But I'll quickly try to break down the actual core of it where the "magic" happens:

--- snip ---

IFS=$'\n'

for LINE in $(grep -v -e "^#" -e "^$" "$SERVERLIST.TXT") ; do

    # break line apart into components: agent name; DNS name; UC4 environment
    AGENTNAME=$(echo "$LINE" | cut -d ";" -f 1)
    SERVER=$(echo "$LINE" | cut -d ";" -f 2)
    ENVNAME=$(echo "$LINE" | cut -d ";" -f 3)

    # query status from remote service manager; pipefail makes $? reflect a
    # failure of ucybsmcl itself instead of the exit status of the last sed
    set -o pipefail
    ./ucybsmcl -c GET_PROCESS_LIST -h "$SERVER:8871" -n "$ENVNAME" 2>"${LOGFILE}.err" | sed -e 's/^"//' -e 's/"$//' -e 's/" "/;/g' > /tmp/agentcheck.out
    CONNECTION_ERROR=$?
    set +o pipefail

    # check if we were able to connect to remote service manager at all
    if [ "$CONNECTION_ERROR" -ne 0 ] ; then

        # do whatever is warranted here for you as a response, I send an email to myself for example
        :   # no-op placeholder; bash does not allow an empty "then" branch

    fi

    # output gets written to temp file, can be multi-line. Then, loop over those lines.
    for TEMPFILE_LINE in $(cat /tmp/agentcheck.out) ; do
        AGENTTYPE=$(echo "$TEMPFILE_LINE" | cut -d ";" -f 1)
        AGENTSTATUS=$(echo "$TEMPFILE_LINE" | cut -d ";" -f 2)

        # if a stopped agent is found, start it
        if [ "$AGENTSTATUS" == "S" ] ; then

            ./ucybsmcl -c START_PROCESS -h "$SERVER:8871" -n "$ENVNAME" -s "$AGENTTYPE"
            RETVAL=$?
            # you should evaluate $RETVAL for any errors and react to them, just in case

        # this here came as a later bugfix, when we realized an agent can in rare cases be neither "S" nor "R"
        # this happens very rarely when something goes very, very wrong in Service Manager, but it should be
        # caught nonetheless
        elif [ "$AGENTSTATUS" != "R" ] ; then

            # do whatever is appropriate for you
            :   # no-op placeholder

        else

            # do here whatever you like to do for agents that are "ok", for example
            # I count them for reporting purposes
            :   # no-op placeholder

        fi
    done
done
--- snip ---
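
Two details of the query step may be worth seeing in isolation: why the pipeline is wrapped in `set -o pipefail`, and what the `sed` expressions do to a line. The sketch below uses a stand-in function instead of a real `ucybsmcl` call, and the sample input line is a guess at the quoted, space-separated shape that the `sed` expressions imply; the real Service Manager output may contain other fields.

```bash
#!/usr/bin/env bash
# Sketch: effect of "set -o pipefail" and of the sed normalization.
# failing_query stands in for a ucybsmcl call that cannot connect; the sample
# line fed to normalize is hypothetical.

failing_query() { return 1; }

# Same three substitutions as in the script: strip the leading and trailing
# quote, then turn each '" "' field separator into ';'.
normalize() { sed -e 's/^"//' -e 's/"$//' -e 's/" "/;/g'; }

run_pipeline() { failing_query | normalize > /dev/null; }

# Without pipefail, the pipeline status is that of the last command (sed).
set +o pipefail
WITHOUT=0
run_pipeline || WITHOUT=$?

# With pipefail, a failure anywhere in the pipeline is reported.
set -o pipefail
WITH=0
run_pipeline || WITH=$?
set +o pipefail

echo "without pipefail: $WITHOUT, with pipefail: $WITH"
printf '"WIN-AGENT" "S" "0"\n' | normalize
```

This prints `without pipefail: 0, with pipefail: 1` followed by `WIN-AGENT;S;0`. Without pipefail, the connection check in the script would never fire, because the final `sed` succeeds even when the query fails.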

Hope this helps. Any questions, let me know.

Cheers,
Carsten
9. Crashing and/or hanging Windows agents: Anyone else having this, too?

Carsten Schmitz
Posted Jun 08, 2017 04:42 AM

Solved.

See https://community.automic.com/discussion/9840/solving-agent-crashes-by-selective-de-sophosimization-psa
