Carsten_Schmitz

Crashing and/or hanging Windows agents: Anyone else having this, too?

Discussion created by Carsten_Schmitz on Apr 3, 2017
Latest reply on Jun 8, 2017 by Carsten_Schmitz
Dear Community,

is anyone besides us having severe problems with crashing or hanging Windows agents, with or without Sophos Anti-Virus?

Some background:

Our environment has about 750 active agents (10.0.8), a large number of UC4 users due to no centralized "Job Administration", and some high-profile jobs.

Unfortunately, we are suffering quite badly from crashing and hanging UC4 agents, mostly on Windows, and at least since we switched to Automic V10 some time around 2014/2015.

Crashes sometimes happen multiple times per day, almost exclusively on busy servers (usually production servers). Sometimes a server has several crashes per day and then suddenly stops crashing for weeks on end, without any discernible reason. We have seen all kinds of crash behaviour: with or without minidumps (or full dumps, once configured), Windows Event Log entries or Automic crash logs.

Funnily enough, some agents even stopped crashing once we enabled the trace flags, and happily resumed crashing when we disabled them. Heisenberg sends his regards. I kinda refuse, however, to run perpetual level 9 traces to /dev/null on 750 servers in order to improve agent stability :)

Ultimately and despite serious effort (including old file transfer protocols and a wealth of other attempts), we were unable to replicate the crashes or find any useful pattern.

We opened several tickets for this over the last year. Automic attributed some crashes to other factors (hence we upgraded the agents several times), but attributed most of them to Sophos Anti-Virus. The presence of Sophos, however, is a strict requirement the company places on every production server, and until last month it was not negotiable at all.

It took quite some time and effort, but we eventually convinced the powers-that-be to uninstall Sophos on one production machine for a week, and this cured all crashes of that particular agent for the week. Once Sophos was re-enabled, the crashes returned on that machine. This is at least a strong indication that the combination of Automic agent and Sophos is troublesome, but Sophos is still mandated in our company without alternative.

We're pretty much resigned at this point to restarting the agents whenever they crash (by means of a small UNIX script I wrote, which queries the Service Manager on each agent machine every minute and restarts dead agents).
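For anyone who wants to try the same workaround, here is a minimal Python sketch of that kind of watchdog loop. It is not our actual script. The status string `"running"` and the two callables `query_status` and `start_agent` are hypothetical placeholders for whatever Service Manager client you have at hand; their names and return shapes are my assumptions, not an Automic API.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical status string; real Service Manager output may look different.
RUNNING = "running"


def find_dead_agents(statuses: Dict[str, str]) -> List[str]:
    """Return the names of agents whose reported status is not RUNNING."""
    return sorted(name for name, status in statuses.items() if status != RUNNING)


def restart_dead_agents(
    hosts: Iterable[str],
    query_status: Callable[[str], Dict[str, str]],
    start_agent: Callable[[str, str], None],
) -> List[Tuple[str, str]]:
    """Make one pass over all hosts and start every agent that is not running.

    query_status(host) is assumed to return {agent_name: status} for one host.
    start_agent(host, agent_name) is assumed to ask that host to start the agent.
    Both are placeholders to be wired to a real Service Manager client.
    """
    restarted: List[Tuple[str, str]] = []
    for host in hosts:
        try:
            statuses = query_status(host)
        except OSError:
            # Host or Service Manager unreachable: skip it and try again next pass.
            continue
        for name in find_dead_agents(statuses):
            start_agent(host, name)
            restarted.append((host, name))
    return restarted
```

Scheduled once a minute (from cron, for example), one call to `restart_dead_agents` corresponds to one pass of the script described above. Note that this only catches agents the Service Manager reports as dead, not agents that hang while still showing as "running".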

In addition, we have recently seen an increase in hanging agents. I suspect, however, that this has always been the case and is only becoming more visible now that we have significantly improved our agent monitoring. For some time now, we have been making the agents execute jobs periodically, and Nagios monitors whether each agent is actually performing tasks, not just telling the UC4 server it's alive. Since then, we have encountered about one case per week on average of an agent that is not performing jobs but usually still answers keepalive requests. Those agents show as "running" in the Service Manager Dialog, but one can't stop them from there. To fix the problem, one has to log into the Windows server, kill the process, and start the agent anew. Because it happens less often, this problem is even more elusive than the crashes, and we have no idea whether interaction with Sophos may be the culprit here, too.
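To illustrate the idea behind that check, here is a minimal Python sketch, not our actual Nagios plugin. It assumes the periodic job's last completion time is available as a Unix timestamp; how you obtain it is up to you, and the thresholds are made-up defaults. The only fixed convention used is the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
from typing import Optional, Tuple

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_heartbeat(
    last_job_end: Optional[float],
    now: float,
    warn_after: float = 600.0,   # assumed threshold in seconds
    crit_after: float = 1200.0,  # assumed threshold in seconds
) -> Tuple[int, str]:
    """Classify an agent by the age of its last completed heartbeat job.

    last_job_end is the Unix time at which the agent last finished the
    periodic job, or None if no result was ever recorded.
    Returns (exit_code, status_line) in Nagios plugin style.
    """
    if last_job_end is None:
        return UNKNOWN, "UNKNOWN - no heartbeat job result recorded"
    age = now - last_job_end
    if age >= crit_after:
        return CRITICAL, f"CRITICAL - last heartbeat job finished {age:.0f}s ago"
    if age >= warn_after:
        return WARNING, f"WARNING - last heartbeat job finished {age:.0f}s ago"
    return OK, f"OK - last heartbeat job finished {age:.0f}s ago"
```

A wrapper would print the status line and exit with the code, for example `code, msg = check_heartbeat(ts, time.time())`. The point of checking job completion rather than the keepalive is that a hanging agent keeps answering keepalives while the heartbeat age keeps growing.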

So, anyone else?

Cheers,
Carsten
