AutoSys Workload Automation

Job stuck in RUNNING status for a week but job has actually completed

  • 1.  Job stuck in RUNNING status for a week but job has actually completed

    Posted Apr 01, 2018 08:01 AM

    I have a job that had been running successfully, but it has been stuck in RUNNING status since last week.

    The job actually completed, but that status was never reflected back to WCC. How do I troubleshoot this, and where can I look to see what happened to this particular run? Has this happened to anyone else here? Thanks!



  • 2.  Re: Job stuck in RUNNING status for a week but job has actually completed

    Broadcom Employee
    Posted Apr 02, 2018 11:36 AM

    Hi Lizzzie,

    Does the autorep output for the job show the same output as WCC?

    If so, you can change the status with the sendevent command:

    sendevent -E CHANGE_STATUS -s SUCCESS -J <jobname>
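As a minimal sketch of how that command could be wrapped so it is reviewed before it is run: the helper names `change_status_argv` and `force_status` and the `dry_run` flag are hypothetical, and only the `sendevent` flags themselves come from the post above. By default it just returns the command line instead of executing it.

```python
from __future__ import annotations

import shlex
import subprocess


def change_status_argv(job_name: str, status: str = "SUCCESS") -> list[str]:
    """Build the argv for the sendevent call quoted in the post above."""
    # Hypothetical safety check: refuse empty names or names containing whitespace.
    if not job_name or any(ch.isspace() for ch in job_name):
        raise ValueError("job name must be a non-empty token without whitespace")
    return ["sendevent", "-E", "CHANGE_STATUS", "-s", status, "-J", job_name]


def force_status(job_name: str, status: str = "SUCCESS", dry_run: bool = True) -> str:
    """Return the command line; execute it only when dry_run is False."""
    argv = change_status_argv(job_name, status)
    if not dry_run:
        # Assumes sendevent is on PATH and the AutoSys environment is sourced.
        subprocess.run(argv, check=True)
    return shlex.join(argv)


if __name__ == "__main__":
    # Dry run: prints the command without touching the scheduler.
    print(force_status("my_job"))
```

Keeping the dry run as the default means the status is only overwritten after someone has confirmed, for example with autorep, that the job really did finish.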



  • 3.  Re: Job stuck in RUNNING status for a week but job has actually completed

    Posted Apr 02, 2018 04:15 PM

    The agent log will tell you the pid. I usually grab that pid and check whether something is holding it open. This happens to us a lot with SAP: even though SAP says it's done, the pid is still open and doing *something*. So AutoSys waits... and waits... and waits.
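The pid check described above could look roughly like this on a POSIX agent host. This is only a sketch: `pid_is_alive` is a hypothetical helper, the pid is assumed to have been read from the agent log by hand, and the signal-0 probe must not be used on Windows, where `os.kill` behaves differently.

```python
import os


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this pid still exists (POSIX only)."""
    if os.name != "posix":
        raise NotImplementedError("signal-0 probing is only safe on POSIX systems")
    if pid <= 0:
        raise ValueError("pid must be a positive integer")
    try:
        # Signal 0 delivers nothing; it only checks existence and permissions.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process: the job's pid is gone
    except PermissionError:
        return True   # the process exists but belongs to another user
    return True


if __name__ == "__main__":
    # Example with our own pid, which is necessarily alive.
    print(pid_is_alive(os.getpid()))
```

If the pid is still alive long after the application reports completion, the scheduler is most likely waiting on that process rather than losing a status update.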



  • 4.  Re: Job stuck in RUNNING status for a week but job has actually completed

    Posted Apr 03, 2018 07:21 AM

    Linda, is this for process chains? There's an alleged fix in the latest pack for SAP.

    We are testing it now.

     

    Steve C.



  • 5.  Re: Job stuck in RUNNING status for a week but job has actually completed

    Posted Apr 03, 2018 03:03 AM

    It used to happen at my site too, for BW process chains, before we implemented SAP note 2123865. I have also slowed things down a bit by adding the parameter sap.job.handler.sleep.interval and increasing the timeouts and polling intervals. Are there enough batch processes on your SAP system?

    Good luck!



  • 6.  Re: Job stuck in RUNNING status for a week but job has actually completed

    Broadcom Employee
    Posted Apr 03, 2018 03:13 AM

    If you have multiple jobs stuck in RUNNING, you can run the chase command to verify whether the agent has a running process for the job in question.



  • 7.  Re: Job stuck in RUNNING status for a week but job has actually completed

    Posted Apr 03, 2018 07:20 AM

    Just remember that chase has a tendency to be wrong if you are using NetScaler/load-balancer DNS routing, at least until CA is able to reflect the real machine the job actually used during the run.

    For this reason, one should not use -E with chase.

     

    Steve C.



  • 8.  Re: Job stuck in RUNNING status for a week but job has actually completed

    Posted Apr 04, 2018 04:25 AM

    Hi

    To narrow down the problem:

    1) Run 'autorep -J <jobname> -d' to see the status of the job.

    2) If the job is in SUCCESS or FAILURE in the autorep output but RUNNING in WCC, there is a problem with the WCC collector.

    3) If the autorep output shows the job still running, look for any error in the transmitter.log file on the System Agent.

    4) If there is nothing in the transmitter.log file regarding this job, look in runner_os_component.log on the System Agent to make sure it received a termination from the script.

    5) If the problem is with the WCC collector, the WCC logs have to be analyzed to see whether this is a performance issue with WCC, or whether there were errors in the WCC collector or in the AutoSys SDK, which communicates with the AutoSys application server.

    6) If the problem is with the job flow in the WCC cache, you can also clear it with: wcc_monitor -u user_name -p password -d server1

           -d, --delete ALL, server1[,server2]
            Specifies the names of the servers for which to clear the cache. You can use either the short name (-d) or the long name (--delete).
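The steps above amount to a small decision tree, which can be sketched as follows. Everything here is illustrative: the function `next_check`, its return codes, and the assumption that both statuses have already been normalized to words such as "RUNNING" or "SUCCESS" are hypothetical, and the code does not call autorep or WCC itself.

```python
from __future__ import annotations

# Statuses in which the scheduler considers the run finished (steps 1 and 2).
FINISHED = {"SUCCESS", "FAILURE"}

# Hypothetical result codes mapped to the log or component to inspect next.
ADVICE = {
    "wcc_collector": "Analyze the WCC collector logs; clear the WCC cache if the job flow is stale.",
    "transmitter_log": "Follow up on the errors found in transmitter.log on the System Agent.",
    "runner_log": "Check runner_os_component.log for the termination from the script.",
    "no_mismatch": "Scheduler and WCC agree; this checklist does not apply.",
}


def next_check(autorep_status: str, wcc_status: str,
               transmitter_log_has_errors: bool = False) -> str:
    """Map the observed statuses to the next place to look."""
    scheduler = autorep_status.strip().upper()
    wcc = wcc_status.strip().upper()
    if scheduler in FINISHED and wcc == "RUNNING":
        return "wcc_collector"        # steps 2, 5 and 6
    if scheduler == "RUNNING":
        if transmitter_log_has_errors:
            return "transmitter_log"  # step 3
        return "runner_log"           # step 4
    return "no_mismatch"


if __name__ == "__main__":
    code = next_check("SUCCESS", "RUNNING")
    print(code, "->", ADVICE[code])
```

The split mirrors the thread: a finished job that WCC still shows as running points at WCC, while a job the scheduler itself still shows as running points at the agent.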

     

    As you can see, there are several options in a case like this (including what Linda and Steve C. reported about SAP job types).

    Regards

    Jean Paul



  • 9.  Re: Job stuck in RUNNING status for a week but job has actually completed

    Posted Apr 04, 2018 10:37 AM

    Jean Paul,

     

    How can you distinguish between step 5 and step 6? Is there a quick way to determine whether it is a collector issue (maybe by using Quick View?) or a job flow issue? Reloading the cache and reloading the job flow both worry me a little when we have lots of users logged in at any given time.

     

    Thanks,

     

    Lisa



  • 10.  Re: Job stuck in RUNNING status for a week but job has actually completed

    Posted Apr 04, 2018 10:54 AM

    Hi Lisa

     

    You're right: if the status is incorrect in the Monitoring view but correct in QV, the problem is usually with the WCC collector.

    Another solution, if it happens to only a few jobs, is to change the status of the job again in AutoSys; during its next cycle, the collector should pick it up again.

     

    Regards

    Jean Paul



  • 11.  Re: Job stuck in RUNNING status for a week but job has actually completed

    Posted Apr 04, 2018 11:03 AM

    Jean Paul,

     

    That does nothing. The collector can be up to 15 minutes behind, or not collecting at all.

     

    Lisa,

     

    Check the logs and see if the collector crapped out on all the servers.

    If all 8 or 9 collectors abend you are fried.

    Hope that helps

     

     

    Steve C.



  • 12.  Re: Job stuck in RUNNING status for a week but job has actually completed

    Posted Apr 04, 2018 11:13 AM

    Steve, exactly and we’ve been there before, although not recently, which is why I asked for clarification from Jean Paul.

     

    We’ve seen the collector get behind when there is a massive JIL import, or more job status changes than usual (mostly during large migrations from one instance to another). We’ve had to jump through hoops to get it back up to speed. Below is the quickest recovery method we have found. Thankfully we have only had to use it 3 or 4 times over the last year.

     

    1. Disable the load balancer completely for all WCC servers.

    2. Change the WCC configuration to an invalid monitor user account.

    3. Export the monitoring views.

    4. Delete the monitoring views.

    5. Flush the cache.

    6. Set the WCC configuration back to the valid monitor account.

    7. Wait for the cache to load.

    8. Import the views.

    9. Enable the load balancer.

     

    It works, and believe it or not, in our environment, where there are multiple instances and thousands of jobs and monitoring views, it is hours and hours faster than just trying to delete the cache with everything left active.

     

    I’ve heard there are performance improvements with WCC 11.4 SP6, so we will soon find out ☺

     

    Lisa



  • 13.  Re: Job stuck in RUNNING status for a week but job has actually completed

    Posted Apr 04, 2018 11:27 AM

    Lisa

     

    Yikes!

    To do all that just to get back in line? Hmm. Have you tried just stopping the load balancer and/or restarting WCC after clearing the cache?

    Clearing those views just to reload them. Ugh.

    We have yet to run into that quagmire... LUCKILY

     

    Let’s hope the R12 WCC poses less stress, but I am still worried about some of this view sharing. I am not cool with that.

     

     

     

     

    Steve C.



  • 14.  Re: Job stuck in RUNNING status for a week but job has actually completed

    Posted Apr 04, 2018 11:39 AM

    Steve, I’ve restarted WCC services on all the nodes more times than you can shake a stick at. Trust me, if there were an easier way I would be ALL OVER IT ☺. We have the magic bullet, but like I said, we don’t have to use it very often.

     

    Lisa



  • 15.  Re: Job stuck in RUNNING status for a week but job has actually completed

    Posted Apr 04, 2018 11:42 AM

    Understood! I am happy that it's few and far between as well.

     

     

     

    Steve C.



  • 16.  Re: Job stuck in RUNNING status for a week but job has actually completed

    Posted Apr 04, 2018 11:10 AM

    Hi Steve

     

    I agree with you, but sometimes this workaround is enough if it happens to only a few jobs.

     

    Regards

    Jean Paul



  • 17.  Re: Job stuck in RUNNING status for a week but job has actually completed

    Posted Apr 04, 2018 11:40 AM

    Jean Paul

     

    I just resurrected this old poll of mine..

    WCC collector efficiency and speed required , PLEASE 

     

    I understand R12 is looking to do things, BUT what we all REALLY want is near-real-time collection and views.

     

     

    Steve C.



  • 18.  Re: Job stuck in RUNNING status for a week but job has actually completed

    Posted Apr 04, 2018 11:43 AM

    "If it happens to a few jobs" - with all due respect, this is not always a selectable option, and then the method that Lisa has documented [thank you Lisa!] is the only way out, unless CA comes up with a more efficient Collector.

    IMHO, this is relevant to what Dan S. has stated in another post: is WCC becoming too "centric" and targeted at the older model of IT operations, in which everything happens out of one "center", versus a more distributed usage?

    WCC [despite the improvement in the underlying Tomcat architecture] becomes unwieldy, and too much is being demanded from the Collectors and the AutoSys app servers.

    My 1 cent

    Chris  <CJ>



  • 19.  Re: Job stuck in RUNNING status for a week but job has actually completed

    Posted Apr 04, 2018 11:51 AM

    Regardless of the route they take, a forced collection by the administrator should be available.

    Another product has such a mechanism. There you can also freely schedule the collection, something that seems to happen arbitrarily in WCC, and knowing which WCC server is actually collecting is so much fun .. ;-)



  • 20.  Re: Job stuck in RUNNING status for a week but job has actually completed

    Posted Apr 04, 2018 12:10 PM

    Hi Steve

     

    Dan replied to the enhancement request at WCC collector efficiency and speed required , PLEASE 

     

    Regards

    Jean Paul



  • 21.  Re: Job stuck in RUNNING status for a week but job has actually completed
    Best Answer

    Broadcom Employee
    Posted Apr 06, 2018 05:00 AM

    We are looking into WCC performance improvements in AE r12 and will keep you updated (posts, sprint reviews, validate site, etc.).

     

    Regards

    Nitin Pandey