Solving Agent Crashes by selective De-Sophosimization (PSA)

Discussion created by Carsten_Schmitz on Jun 7, 2017
Latest reply on Jul 3, 2017 by Carsten_Schmitz
Public Service Announcement, the Second.

Well, this certainly took a while:

For the best part of two years we had major issues with some of our UC4 v10 Windows agents constantly crashing. But we were unable to replicate the crashes, and they came with quite elusive symptoms: Different log patterns. Some with traces. Some with mini dumps. Few alike. But always exclusively on production servers.

Automic did point to Sophos Anti-Virus early on, and advocated nothing but a full removal of all things Sophos. But with us being somewhat on the behemoth side of things (as is wont for major corporations), and with a security policy demanding that this particular AV solution be active at all times, a corporate-wide replacement of Sophos was out of the question.

Side note: When we finally got approval to uninstall it for very short periods of time, we found uninstalling Sophos to be like ripping off a band-aid and having your arm come off along with it: Uninstalling Sophos repeatedly left our servers in a borked, inoperable state (pro tip: use "netsh winsock reset" from an elevated prompt, followed by a reboot, to fix that - usually).
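
For anyone who wants to script the Winsock reset from the side note above: netsh expects the context first and the command second, i.e. "netsh winsock reset". Below is a minimal Python sketch, not something we ran in production. The function name reset_winsock and its dry_run switch are made up for illustration; by default it only returns the command line so you can review it before running anything.

```python
import platform
import subprocess

# netsh syntax: context first ("winsock"), then the command ("reset").
WINSOCK_RESET = ["netsh", "winsock", "reset"]


def reset_winsock(dry_run: bool = True) -> str:
    """Return the command line; execute it only when dry_run is False.

    Hypothetical helper for illustration. Running it for real needs an
    elevated (administrator) prompt, and Windows asks for a reboot afterwards.
    """
    command_line = " ".join(WINSOCK_RESET)
    if dry_run:
        # Default: do nothing to the machine, just show what would run.
        return command_line
    if platform.system() != "Windows":
        raise RuntimeError("netsh is only available on Windows")
    # check=True raises CalledProcessError if netsh reports a failure.
    subprocess.run(WINSOCK_RESET, check=True)
    return command_line


if __name__ == "__main__":
    print(reset_winsock())
```

Call reset_winsock(dry_run=False) only on a box you are prepared to reboot right away.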

So, with uninstalling Sophos entirely out of the question, we eventually ended up with some good old-fashioned debugging and a lot of trial and error. And lo and behold the fruitful outcome (and the point of this post):

Thou shalt disable Sophos WebIntelligence on thy servers!

Since we disabled WebIntelligence across all our UC4 servers, we have not seen a single further crash for a few weeks now, compared to dozens of crashes per day(!) before.

I don't claim to entirely know what WebIntelligence does, but it appears to be some sort of web browsing/script protection nested deep in the TCP stack. So this module should not be of much use on most server machines anyway.

Full disclaimer: We do still occasionally see other agent issues, first and foremost agents not properly reconnecting to the engine after network failures. But we see those at a rate of about one per month, which is nothing compared to the Sophos-related crashing. I was assured by a developer at AutomicWorld that V12 agents did get some love and thus have partly rewritten network code (due to unifying the socket library, related to the upcoming support of IPv6). So we hope to see further stability improvements with V12.

@Automic: Can we somehow get someone to update the KB article with this knowledge, and change its recommendation from a full removal to disabling WebIntelligence?