Automic Workload Automation

Expand all | Collapse all

On Automic and the passing of time (and also, possibly some inspiration on monitoring technique)

  • 1.  On Automic and the passing of time (and also, possibly some inspiration on monitoring technique)

    Posted Feb 26, 2018 06:48 AM
    I'm admittedly sometimes rather critical of some things the Automation Engine does. But this one, man, this convinces me that Automic is not only great but it manipulates the very fabric of time and space.

    I have a monitoring job to check whether our agents are not only alive, but do process actual jobs. It runs every 15 minutes for our most vital 74 agents. Nagios picks up on any (repeat) failures.

    Additionally, I have a cron job that, once a week, collects the performance data of any successful runs of this job from the Oracle database. I am specifically interrested in the time it takes between activation and start of the monitoring job - we found that this is a good indicator of actual agent performance:

    select AH_OH_IDNR as OHID, AH_IDNR as Runid, AH_TIMESTAMP1 as Activation, AH_STATUS as Status, AH_HOSTDST as Destination, AH_TIMESTAMP2 as Launch, round((AH_TIMESTAMP2 - AH_TIMESTAMP1) * 24 * 60 * 60,0) as seconds_diff from AH where AH_OH_IDNR in (select OH_IDNR from GFD_IS_UC4_O_01.OH where OH_NAME like '%MY_MONITORING_JOBS_NAME%' AND OH_NAME NOT LIKE '%OLD.%' and OH_NAME not like 'JOBP%') AND AH_STATUS = 1900;

    Occasionally, I break this down by server, take out the huge numbers (e.g. over the last eight months, there was one job that ran for 30 hours due to a file transfer bug), calculate the average "startup lag", i.e. the average lag between activation and start of this job, and make one of those newfangled, fancy-coloured Excel charts from it.

    And behold, Ladies and Gents, in 1.8 million unique datasets for the last eight months, I found one job that has ...

    (wait for it ...)

    Negative time!

    2070694| 455149157|05-09-17 00:09:30|      1900|AGENT_NAME_REMOVED|05-09-17 00:09:09|          -21

    Yes. This job, as the only one among 1.8m jobs, started, according to the UC4 database, 21 seconds before it was activated.

    Stephen Hawking, you know nothing!

    :smiley:


  • 2.  On Automic and the passing of time (and also, possibly some inspiration on monitoring technique)

    Posted Feb 26, 2018 07:18 AM
    Hi Carsten

    thats an out-of-the-box feature of AE, looking at the explanation in object (Client 0): JOBS.UNIX.AE_INTERNAL_FLUX_CAPACITOR

    official documentation here:
    http://backtothefuture.wikia.com/wiki/Flux_capacitor

    cheers, Wolfgang


  • 3.  On Automic and the passing of time (and also, possibly some inspiration on monitoring technique)

    Posted Feb 26, 2018 05:07 PM
    Hi Carsten_Schmitz_7883

    I don't have Oracle DB to check this, but with our system on MS SQL Server we have a lot more jobs that have AH_TimeStamp2 < AH_TimeStamp1.

    About 400 out of 1.2m jobs with more then 100 minutes difference.

    SELECT count(*)   FROM AH   where AH_TimeStamp2  is not NULL and AH_Status = 1900   and DATEDIFF(SS , AH_TimeStamp1 , AH_TimeStamp2) < -6000

    We also run the system with ARA, so I am not certain if that adds more magic into Activation Time and Start Time timing.

    update 20 minutes later....

    We also have some records that show Start Time about 18 month earlier then Activation Time :-)

    I am suspecting that some Table Key (AH_Idnr?) values possibly is reused and later record overrides Activation Time and End Time but does not overwrite the Start Time.

    Just a guess on my part.


  • 4.  On Automic and the passing of time (and also, possibly some inspiration on monitoring technique)

    Posted Feb 27, 2018 03:48 AM
    Vlad_Navazhylau_6186 cheers for that :)

    Thus far I treated this as a one-off anomaly, something curious to laugh at, but it appears from your description this is actually more wide-spread. Did you just discover that in your system now, or did you guys notice it earlier? Has it ever impacted the business for you? Now I'm wondering if this should actually be incidentified ...

    On the other hand, it's mostly frustrating to get incidents through with Automic that ask the devs to look at the source without being able to present a test case, so ... hm, not sure.

    Best regards,
    Carsten



  • 5.  On Automic and the passing of time (and also, possibly some inspiration on monitoring technique)

    Posted Feb 27, 2018 05:53 AM

    Just my two cents:

    The activation time is the timestamp taken from the Automation Engine, whilst the start time is a timestamp taken form the Agent. So it’s most likely two different computer / server.

    Those computers might have different time zone settings or have a timing gap even they have the same time zone.

    However, the Automation Engine takes care about that and converts all this timestamps to UTC and stores them at the static record of the task.

    In addition keep in mind that time service might correct the timing on a computer at any time.

     

    So it’s not that trivial as it seems to be … and there are different possible causes for the observed discrepancy.

     

    In case there is a way to reproduce the discrepancy or at least debug and trace it, a ticket at support is more than welcome.

    KR, Josef



  • 6.  On Automic and the passing of time (and also, possibly some inspiration on monitoring technique)

    Posted Feb 27, 2018 05:56 AM
    Insightful, thanks Josef_Scharl_103

    That also confirms why this is such a good meassure of agent performance: We did in fact observe that servers under high load (swapping and/or trashing heavily) take extra time to start up jobs, which makes sense, and is then reflected in the timestamp1 vs. timestamp2 difference.

    Cheers,
    Carsten


  • 7.  On Automic and the passing of time (and also, possibly some inspiration on monitoring technique)

    Posted Feb 28, 2018 11:04 AM
    Carsten_Schmitz_7883,

    I don't think we paid much attention to Activation Times.
    Start time and End time for the jobs is more important for us, and I don't think we noticed much issues there.

    We don't have much of Batch type of jobs, since most of the Workflows are related to ARA type of processes for Application Deployments Automation.

    There was a question I posted back on October 2017 about "RunID Turnonver" article.
    I did not get clear response from Automic on it.
    I wonder if this is related to what we see with timestamps on AH table.

    https://community.automic.com/discussion/10792/runid-turnover#latest

    https://automic.force.com/support/apex/CommunityArticleDetail?id=kA3b0000000kBF8CAM


  • 8.  On Automic and the passing of time (and also, possibly some inspiration on monitoring technique)

    Posted Feb 28, 2018 11:18 AM
    Vlad_Navazhylau_6186

    Reading the "turnover" description, I don't think that's it, because it sounds like an issue where it simply overflows with the RunIDs (possibly reaching the limit for the database field) and restarts them at "1".

    Based on what Josef said, the explaination is likely in different machine times.

    I can only weigh in for Linux (Windows will not be all that much different) but keeping time among agents and the server is an interesting issue. Even if you have an NTP server, each machine has a hardware clock, possibly a different BIOS clock, and a running clock that is kept by the kernel. The last one is the only remotely useful one (the HW clocks are horribly unstable with time keeping), so if your Automic agents starts before the machine has synced with the NTP server after boot, that'd expain a time difference. With UNIX, you even have two competing ntp daemons (OpenNTPd and Berkley ntpd) that work different, and those correct the clock in different ways, e.g. small clock differences are "skewed" by making the local clock run slightly slower or faster for a while, while bigger differences are compensated in other ways.

    Your 18 hour difference, for instance, might have been (just a theory) a server with a dying CMOS battery that fails at keeping time when it's shut off, gets booted, and executes a job before NTP kicks in. The OS has no choice but to use a legacy hardware clock before it gets a proper time source.

    Take all that together, and that probably explains the time differences. The missing puzzle piece may have been provided by Josef_Scharl_103 in that the timestamps are taken off of different machines.

    But on a closing note, I don't think it's the greatest design decision by Automic either to use time from different machines in the DB statistics. I'd consider it much cleaner design for the Agent to tell the server that it has begun executing the job, and the server then logging the time from it's own time source *)

    Cheers,
    Carsten


    *) edit: or even log both times. That'd be ideal for debugging, but possibly promote DB bloat.


  • 9.  On Automic and the passing of time (and also, possibly some inspiration on monitoring technique)

    Posted Feb 28, 2018 05:19 PM
    Carsten_Schmitz_7883,

    Just want to note that the difference for handful of records was 18+ month (we can rounded to 2 years) and not 18 hours.

    This difference is harder to comprehend.

    All the machines are VMs where the Agents are, I believe, and 98% on RedHat.





  • 10.  On Automic and the passing of time (and also, possibly some inspiration on monitoring technique)

    Posted Mar 01, 2018 03:53 AM
    Vlad_Navazhylau_6186

    Oh, sorry: I must have missread/missremembered that. Yes, 18 months is harder to explain.

    I did have a VMWare VM once which would sometimes switch the date to a day in nineteen-seventy-something (curiously not the start date of the UNIX epoch), and then ntpdate would soon correct it. I never found out why this kept happening. This surely would have thrown off UC4 :)

    Cheers,
    Carsten


  • 11.  On Automic and the passing of time (and also, possibly some inspiration on monitoring technique)

    Posted Mar 07, 2018 06:56 AM
    Hm, yet I must still be onto something somehow, because the Automic "Newsletter" I just got is entitled:

    "Jetzt mit Zuversicht ins neue Jahr starten", meaning: "Start the new year with confidence"!

    It is March. :D