
Race Condition with two subsequent JOBF

  • 1.  Race Condition with two subsequent JOBF

    Posted Jul 27, 2017 06:42 AM
    Hi.

    We noticed there appears to be a race condition with subsequent JOBF, especially if high-latency network connections are involved. I can quite reliably (about 50% of the time) replicate the problem. The setup is simply two subsequent JOBF transferring a test file of about a megabyte in size, like this:

    Host #1 ---> JOBF over high-latency network link ---> Host #2 ---> JOBF ---> Host #3

    What happens then is this: The second JOBF starts even though the file has not yet been fully transferred to Host #2 by the first JOBF, causing the second JOBF (and the entire JOBP) to hang indefinitely.

    (edit): If one inserts a sleep statement of five seconds between the two JOBF, the problem disappears entirely, proving to me that this is indeed a timing issue and, at its core, one of lacking serialization in the execution of objects.

    I opened a support case for this, but since we're on version 10 and this is not deemed critical enough to warrant handling as a critical maintenance issue, a problem report will not be created at this time.

    Thus, I will revisit this once we upgrade to version 12, but out of curiosity: Has anyone else seen this behaviour, and if so, with which AE version?

    Cheers,
    Carsten


  • 2.  Race Condition with two subsequent JOBF

    Posted Jul 28, 2017 02:59 AM

    Hi Carsten_Schmitz_7883

    Well, in Automic File Transfer there is the “ft_temp_file” feature, which can be switched on via

    ft_temp_file=yes in the Agent's INI file:

    https://docs.automic.com/documentation/webhelp/english/AWA/12.0/DOCU/12.0/AWA%20Guides/help.htm#AE_AWA_Source/Administration/ucaand.htm?Highlight=ft_temp_file

     

    I think the situation you described is a wonderful use case for the feature. Hopefully it works.  :)

    KR, Josef



  • 3.  Race Condition with two subsequent JOBF

    Posted Jul 28, 2017 05:05 AM
    Hi Josef_Scharl_103,

    cheers for that, but "yes" is already the default value for ft_temp_file as per the documentation you kindly linked to. So in practice, even without that setting in the INI file, we're already using the temporary file method (at least if the documentation can be trusted - for lack of root access and inotify userland tools on the machine in question, I can't verify that with absolute certainty).
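    If the tools were available, this could be checked from userland. A sketch, assuming inotify-tools is installed and /data/incoming stands in for the actual transfer target directory:

        # Watch the target directory for file events during a JOBF.
        # With the temporary file method one would expect a CREATE on a
        # temporary name, then a MOVED_TO (rename) to the final file name
        # once the transfer completes.
        inotifywait -m -e create -e moved_to -e close_write /data/incoming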

    My feeling is, the agent possibly does not call the sync() system call on Linux, thus not flushing the caches to disk before announcing the JOBF complete. By inserting a JOBS between the two JOBF which calls /bin/sync (which executes the sync() syscall), the problem also reliably goes away. Thus, I guess I'll be revisiting this nevertheless once my engine and that particular host get V12.
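    For anyone wanting to reproduce the workaround, the interposed JOBS is nothing more than a call to /bin/sync; a minimal sketch:

        #!/bin/sh
        # Interposed between the two JOBF: flush all dirty pages to disk
        # so the second transfer only starts once the data is persisted.
        /bin/sync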

    Best,
    Carsten


  • 4.  Race Condition with two subsequent JOBF

    Posted Aug 01, 2017 05:14 AM

    I’ll try to get some background information; hopefully this will be successful.

    BTW: Can you tell me the ticket number please? I’m not sure which ticket you mentioned above.

    KR, Josef




  • 5.  Race Condition with two subsequent JOBF

    Posted Aug 01, 2017 06:32 AM
    Josef_Scharl_103 much appreciated.

    Ticket # is INC00135056.

    Best,
    Carsten


  • 6.  Race Condition with two subsequent JOBF

    Posted Aug 01, 2017 03:11 PM
    Hi Josef_Scharl_103 ,

    I found one customer that is receiving the status "Connecting" in the JOBF for a long time, or the JOBF abends with status "ended abnormally":

    2017-08-01 09:55:43 - U0011124 Selection started with filter '/sftp/....xxxxx530011.zip*' ...

    2017-08-01 09:55:43 - U0011125   '/sftp/...xxxx9530011.zip'

    2017-08-01 09:55:43 - U0011126 Files selected: '1'.

    2017-08-01 09:55:43 - U0063073 FT'432622437' Connection to Agent '***' terminated.

    2017-08-01 09:55:43 - U0011409 FT '432622437': FileTransfer ended abnormally.

    The agents that do the file transfer are versions '10.0.8+build.498' and '10.0.8+hf.5.build.1413'.

    Could this be the same problem?

    thanks,
    Karina






  • 7.  Race Condition with two subsequent JOBF

    Posted Aug 02, 2017 04:30 AM
    It does sound like a bug. If this incident results in a problem ticket, we would be interested to learn about it.

    You might consider inserting two checks before file transfers:
    • Is the file closed?
    • Is the file stable (not changing)?
    This would eliminate problems caused by trying to open a file for reading while another process is still writing to it; a small wrapper script along the lines of the sketch below could implement both checks.
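    A minimal sketch of such a wrapper, assuming GNU stat and fuser are available; the file argument and the five-second settle interval are illustrative:

        #!/bin/sh
        # Pre-transfer check: refuse to proceed while the file is still
        # open by some process or still changing in size.
        FILE="$1"

        # 1. Closed? fuser exits 0 if any process still has the file open.
        if fuser "$FILE" >/dev/null 2>&1; then
            echo "file is still open by another process" >&2
            exit 1
        fi

        # 2. Stable? Compare the size across a short interval.
        size1=$(stat -c %s "$FILE")
        sleep 5
        size2=$(stat -c %s "$FILE")
        if [ "$size1" -ne "$size2" ]; then
            echo "file is still changing" >&2
            exit 1
        fi

        exit 0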


  • 8.  Race Condition with two subsequent JOBF

    Posted Aug 02, 2017 04:38 AM
     If this incident results in a problem ticket, we would be interested to learn about it.

    Already has: INC00135056. But as indicated, suspected version 10 bugs don't get much love unless demonstrated with versions 11/12 too. I didn't get around to testing this with v12 so far, and probably won't for some time, because I can't upgrade the one production host that currently demonstrates this problem. (edit:) But possibly Josef can achieve something more here, or if anyone else has the resources to try to replicate this with either version, that might also help.

    As for the file being closed and stable: Well, it's the agent's job to ensure the file descriptor it's using gets closed after the agent is done transferring the file. And it's definitely stable, because no other process is using it. My proof of concept is basically "dd if=/dev/urandom of=/tmp/testfile.rnd bs=1042 count=1042", and then I copy that file around. So there is no chance of any other process interfering. Besides, this is on Linux, so even if there were other file descriptors open for reading, that should still not interfere :)
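    For completeness, the reproduction boils down to this (hosts and paths as in my first post; a checksum comparison is one way to verify whether a transfer actually completed):

        # Create the ~1 MB test file on host #1:
        dd if=/dev/urandom of=/tmp/testfile.rnd bs=1042 count=1042

        # Record the checksum, then compare it on hosts #2 and #3 after
        # the two JOBF have run:
        md5sum /tmp/testfile.rnd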

    Cheers,
    Carsten


  • 9.  Race Condition with two subsequent JOBF

    Posted Aug 02, 2017 08:43 AM

    Hello,
    I got a very good explanation from a developer:

    The Agent finishes each file with a close(); after the last file – in the case of a wildcard transfer – has been finished and closed, the status “finished” is sent to the Automation Engine.
    All these API calls work synchronously, meaning the file was really closed before the Agent continued working, since close() had returned.
    So the relation to the high-latency network connections is not comprehensible.
    The close() call after writing the last byte into the destination file is done by the destination Agent (the receiver). This Agent then contacts the source Agent, confirming that the file was completely written and closed. Next, the two Agents exchange further messages about the end of the transfer (in case there are no more files to transfer). Finally, the source Agent sends the transfer end status to the AE.
    I think there is something else in the background. Maybe anti-virus software or something similar locks the file for a few seconds after the transfer.

    I hope this explanation helps in understanding the way the Agent works.

    KR, Josef



  • 10.  Race Condition with two subsequent JOBF

    Posted Aug 02, 2017 08:48 AM
    Hi Karina_Mankauskas_553

    I think validating whether it is the same or not would require a detailed analysis - nothing I can do here in the community.
    Opening a ticket for v10 also does not seem to be a good choice.  :/

    KR, Josef



  • 11.  Race Condition with two subsequent JOBF

    Posted Aug 02, 2017 11:00 AM

    [...]
    I hope this explanation helps in understanding the way the Agent works.

    KR, Josef

    [...]

    Sort of, but not really, no.

    First off, the overall fact remains that the problem can be replicated and demonstrably exists, so it would be for Automic to analyze and solve it - were it not for the fact that at present it can be dismissed on other grounds, namely for being demonstrated (thus far) exclusively with V10.

    Once we update to V12, and should the problem persist, I'd still expect Automic to analyze and fix it nevertheless.

    I can assure Automic this is not a virus scanner issue (it's a Linux machine - outsourced, but I'm absolutely confident it does not run an on-access virus scanner of any sort). The whole virus scanner problem is really more of a Windows issue, that funky OS where file handles opened for reading can, unlike on Linux, sometimes block further access.

    For me, that should be the end of the story, but with way too much time already invested, here are some more remarks.

    I was looking further into this under the premise that the OS write cache might be to blame. Sadly, I can't easily prove this, since I can only replicate the problem on that one machine (I tried on other machines today) - and on the one machine where I can replicate it, I am not allowed to be root. Otherwise, mounting my file system with the "sync" flag might already be enough to prove or disprove that theory.
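    For the record, the test I have in mind would be (root required; /data stands in for the mount point holding the transfer target):

        # Remount with synchronous writes, then re-run the two JOBF.
        # If the race disappears, the write cache theory is confirmed.
        mount -o remount,sync /data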

    But calling close() is by far not the end of the story. close() does NOT ensure your data is synced to disk (it did, ca. 1990, but not anymore). Ted Tso claims that it is the application programmer's responsibility (see footnote #1) to ensure an explicit call to fsync() for important data (as which UC4 transfers certainly qualify). There, he quotes the close() man page, which strongly advises the same (n.b. #2). Your developer unfortunately didn't elaborate on whether the agent makes an explicit call to fsync(), but I just ran strace on the Linux agent and its child processes during a JOBF. There were 21 calls to close(), but none to sync() or fsync().
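    In case anyone wants to repeat that trace, the invocation was along these lines (the agent PID is a placeholder):

        # Attach to the running agent, follow its children, and log only
        # the calls relevant to flushing data to disk:
        strace -f -e trace=close,fsync,fdatasync,sync -o /tmp/jobf.strace -p <agent_pid>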

    As almost all Linux distributions have moved to ext4, the delay between close() and the data actually being written to disk can easily be 30 seconds. So regardless of whether this is the root cause of this particular problem, I strongly advise your development to take that advice and have the Linux agent explicitly call fsync() after a JOBF. Ted Tso, by the way, is probably a bit of an authority on these things, since he wrote large parts of the Linux kernel code dealing with exactly that stuff :)
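    The delay is easy to visualise on any Linux box, because close() returns while the data may still sit in the page cache as dirty pages:

        # Write ~100 MB; dd closes the file when it is done.
        dd if=/dev/zero of=/tmp/demo.bin bs=1M count=100
        grep Dirty /proc/meminfo    # typically a large dirty figure right after close()

        sync                        # the flush that close() alone does not perform
        grep Dirty /proc/meminfo    # now back to (near) zero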

    I'm no kernel programmer, but I believe the kernel should nevertheless keep track of dirty cache pages, and other processes should never see stale cache data regardless. So even by just calling close(), you should be safe from race conditions (albeit only from race conditions, not from the other issues for which calling fsync() is strongly suggested) - unless there is a kernel issue on top (it happens - see #3), or the agent does something rather exotic. But I guess that's something to be looked into further if this keeps happening with V12, too.

    Best regards,
    Carsten

    (1) https://thunk.org/tytso/blog/2009/03/15/dont-fear-the-fsync/
    (2) https://linux.die.net/man/2/fsync
    (3) https://www.redhat.com/archives/linux-lvm/2016-December/msg00000.html