[...]
I hope this explanation helps in understanding the way the Agent works.
KR, Josef
[...]
Sort of, but not really, no.
First off, the fact remains that the problem can be replicated and demonstrably exists, so it would be up to Automic to analyze and solve it - were it not for the fact that, at present, it can be dismissed on other grounds, namely that it has (thus far) been demonstrated exclusively with V10.
Once we have updated to V12, and should the problem persist, I would still expect Automic to analyze and fix it.
I can assure Automic this is not a virus scanner issue (it's a Linux machine - outsourced, but I'm absolutely confident it does not run an on-access virus scanner of any sort. The whole virus scanner problem is really more of a Windows issue, that funky OS where file handles opened for reading are, unlike on Linux, sometimes blocking further access).
For me, that should be the end of the story, but with way too much time already invested, here are some more remarks.
I looked further into this under the premise that the OS write cache might be to blame. Sadly, I can't easily prove this to be true, since I can only replicate the problem on that one machine (I tried on other machines today) - and on that one machine, I am not allowed to be root. Otherwise, mounting my file system with the "sync" flag might already be enough to prove or disprove that theory.
But calling close() is by far not the end of the story. close() does NOT ensure your data is synced to disk (it did, ca. 1990, but not anymore). Ted Ts'o claims that it's the application programmer's responsibility (see footnote #1) to ensure an explicit call to fsync() for important data (as which UC4 transfers certainly qualify). In that, he quotes the close() man page, which strongly advises the same (see #2). Your developer unfortunately didn't elaborate on whether the agent makes an explicit call to fsync(), but I just ran an strace on the Linux agent and its child processes during a JOBF. There were 21 calls to close() but no call to sync() or fsync().
As almost all Linux distributions are moving to ext4, the delay between close() and the data actually being written to disk can easily be 30 seconds. So regardless of whether this is the root cause of this particular problem, I strongly advise your developers to take that advice and start explicitly calling fsync() after a JOBF from the Linux agent. Ted Ts'o, by the way, is probably a bit of an authority on these things, since he wrote large parts of the Linux kernel dealing with that stuff :)
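To make that suggestion concrete, here is a minimal sketch in C of what "write, fsync(), then close()" could look like. The function name write_durably is made up for illustration; I have no idea how the agent's file handling is actually structured, so treat this as an assumption-laden example rather than a patch.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not taken from the agent): write len bytes to path
 * and ask the kernel to flush them to the storage device before returning.
 * Returns 0 on success, -1 on error with errno set. */
int write_durably(const char *path, const void *buf, size_t len)
{
    assert(path != NULL && (buf != NULL || len == 0));

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* write() may write fewer bytes than requested, so loop until done. */
    const char *p = buf;
    size_t left = len;
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted by a signal: retry */
            int saved = errno;
            close(fd);
            errno = saved;
            return -1;
        }
        p += n;
        left -= (size_t)n;
    }

    /* close() alone leaves the data in the page cache. fsync() blocks until
     * the kernel reports the file's data and metadata as transferred. */
    if (fsync(fd) < 0) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }

    return close(fd);
}
```

A caller would use it as write_durably("/some/target/file", data, data_len) and check for a return value of 0. Note that, according to the fsync() man page, this does not guarantee that the directory entry of a newly created file has reached the disk as well; that would need an additional fsync() on a file descriptor for the containing directory.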
I'm no kernel programmer, but I believe the kernel should nevertheless keep track of dirty cache pages, and other processes should never see stale cache data regardless. So even by just calling close(), you should be safe from race conditions (albeit only from race conditions, not from the other reasons for which calling fsync() is strongly suggested), unless there is an additional kernel issue (it happens - see #3) or the agent does something rather exotic. But I guess that's something to be looked into further if this keeps happening with V12, too.
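For what it's worth, the "no stale data after close()" expectation is easy to check on a local file system with a small C sketch like the following. The function name visible_after_close is hypothetical, and the check deliberately skips fsync(); it tests visibility to a second reader only, not durability, and it proves nothing about a machine where the kernel or the storage layer misbehaves.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical check: write msg to path WITHOUT fsync(), close the file,
 * then reopen it and read it back, the way a second process would.
 * Returns 1 if the data read matches, 0 if it differs, -1 on error. */
int visible_after_close(const char *path, const char *msg)
{
    assert(path != NULL && msg != NULL);

    char buf[256];
    size_t len = strlen(msg);
    if (len > sizeof buf)
        return -1;                      /* keep the sketch simple */

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    ssize_t written = write(fd, msg, len);
    int rc = close(fd);                 /* no fsync(): data may still be dirty */
    if (written != (ssize_t)len || rc < 0)
        return -1;

    /* A fresh open() + read() is served from the same page cache, so the
     * reader is expected to see the new content even before writeback. */
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t got = read(fd, buf, sizeof buf);
    close(fd);
    if (got < 0)
        return -1;

    return (size_t)got == len && memcmp(buf, msg, len) == 0;
}
```

On a healthy local file system I would expect this to return 1 every time. If it ever returned 0 on the affected machine, that would point towards the kernel or storage layer rather than the agent.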
Best regards,
Carsten
(1) https://thunk.org/tytso/blog/2009/03/15/dont-fear-the-fsync/
(2) https://linux.die.net/man/2/fsync
(3) https://www.redhat.com/archives/linux-lvm/2016-December/msg00000.html