Clarity

Process Engine Trouble

  • 1.  Process Engine Trouble

    Posted Nov 27, 2015 06:00 AM


    Yesterday one of our process engines experienced some difficulty and stopped responding. Stopping and starting the bg service has it back up and running now, but we would like to understand what caused the problem if possible. In the bg-ca.log file I can see these errors:

     

     

    ERROR 2015-11-26 08:18:38,775 [ProcessEngineThreadMonitor (tenant=clarity)] niku.union (none:none:none:none) ProcessEngineThreadMonitor: Thread, Process Loader (tenant=clarity) is not Alive. The Service running the ProcessEngine will need to be restarted.

     

    ERROR 2015-11-26 08:18:39,519 [Process Monitor (tenant=clarity)] bpm.engine (clarity:process_admin:200185659__F54454D6-9A7D-4E69-B1CB-B23708C99F3A:none) ----> Engine: bg-SERVER heart beat is old. Heartbeat: Thu Nov 26 06:15:13 EST 2015 Limit: Thu Nov 26 06:18:39 EST 2015. Engine will be shutdown.. If this is not normal, please fix any problems and restart the process engine. Trying to shudown process enigne.

    java.lang.Exception

    at com.niku.bpm.engine.ProcessMonitor.execute(ProcessMonitor.java:188)

    at com.niku.bpm.engine.ProcessMonitor.run(ProcessMonitor.java:118)

     

     

    ERROR 2015-11-26 08:18:39,522 [Process Monitor (tenant=clarity)] engine.state (clarity:process_admin:200185659__F54454D6-9A7D-4E69-B1CB-B23708C99F3A:none) <?xml version="1.0" encoding="UTF-8"?>
    <processEngine id="5874001" instanceName="bg-PAERSCBBLA1063">
      <controller heartBeat="2015-11-26T08:18:39"/>
      <loader heartBeat="2015-11-26T06:15:13" queueLength="410"/>
      <conditionWaitList queueLength="0"/>
      <retryWaitList queueLength="0"/>
      <actionWaitList queueLength="0"/>
      <PreConditionPipelineManager load="0.0" noOfPipelines="2" queueLength="0" recentLoad="0.0">
        <pipeline heartBeat="2015-11-26T06:13:38" index="1" load="0.0" name="Pre Condition Pipeline 0" recentLoad="0.0" runTime="0" running="false" startTime="2015-11-26T06:13:38"/>
        <pipeline heartBeat="2015-11-26T06:13:38" index="1" load="0.0" name="Pre Condition Pipeline 1" recentLoad="0.0" runTime="0" running="false" startTime="2015-11-26T06:13:38"/>
      </PreConditionPipelineManager>
      <PostConditionTransitionPipelineManager load="0.0" noOfPipelines="3" queueLength="0" recentLoad="0.0">
        <pipeline heartBeat="2015-11-26T06:13:38" index="1" load="0.0" name="Post Condition Transition Pipeline 0" recentLoad="0.0" runTime="0"
          running="false" startTime="2015-11-26T06:13:38"/>
        <pipeline heartBeat="2015-11-26T06:13:38" index="1" load="0.0" name="Post Condition Transition Pipeline 1" recentLoad="0.0" runTime="0"
          running="false" startTime="2015-11-26T06:13:38"/>
        <pipeline heartBeat="2015-11-26T06:13:38" index="1" load="0.0" name="Post Condition Transition Pipeline 2" recentLoad="0.0" runTime="0"
          running="false" startTime="2015-11-26T06:13:38"/>
      </PostConditionTransitionPipelineManager>
      <ActionExecutionPipelineManager load="0.0" noOfPipelines="3" queueLength="0" recentLoad="0.0">
        <pipeline heartBeat="2015-11-26T06:13:38" index="1" load="0.0" name="Action Execution Pipeline 0" recentLoad="0.0" runTime="0" running="false" startTime="2015-11-26T06:13:38"/>
        <pipeline heartBeat="2015-11-26T06:13:38" index="1" load="0.0" name="Action Execution Pipeline 1" recentLoad="0.0" runTime="0" running="false" startTime="2015-11-26T06:13:38"/>
        <pipeline heartBeat="2015-11-26T06:13:38" index="1" load="0.0" name="Action Execution Pipeline 2" recentLoad="0.0" runTime="0" running="false" startTime="2015-11-26T06:13:38"/>
      </ActionExecutionPipelineManager>
    </processEngine>

    SYS   2015-11-26 08:18:39,525 [Process Monitor (tenant=clarity)] bpm.engine (clarity:process_admin:200185659__F54454D6-9A7D-4E69-B1CB-B23708C99F3A:none) Process Controller for engine bg-SERVER stopping...

     

     

    Is there a good place to start investigating what may have caused the process engine to stop?
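
    One starting point is the engine.state dump itself. The sketch below is a minimal illustration, not a Clarity tool: it parses an abbreviated copy of that XML and reports how far each heartBeat attribute lags behind the controller's. The function name heartbeat_lags and the trimmed STATE_XML sample are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Abbreviated copy of the engine.state dump from the log above
# (only one pipeline kept, unused attributes dropped).
STATE_XML = """<processEngine id="5874001" instanceName="bg-PAERSCBBLA1063">
  <controller heartBeat="2015-11-26T08:18:39"/>
  <loader heartBeat="2015-11-26T06:15:13" queueLength="410"/>
  <PreConditionPipelineManager load="0.0" noOfPipelines="2" queueLength="0" recentLoad="0.0">
    <pipeline heartBeat="2015-11-26T06:13:38" name="Pre Condition Pipeline 0" running="false"/>
  </PreConditionPipelineManager>
</processEngine>"""

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"


def heartbeat_lags(state_xml):
    """Return {component: seconds behind the controller heartbeat}."""
    root = ET.fromstring(state_xml)
    reference = datetime.strptime(root.find("controller").get("heartBeat"), TIME_FORMAT)
    lags = {}
    for elem in root.iter():
        beat = elem.get("heartBeat")
        if beat is None or elem.tag == "controller":
            continue
        # Pipelines carry a name attribute; the loader is labelled by its tag.
        label = elem.get("name", elem.tag)
        lags[label] = int((reference - datetime.strptime(beat, TIME_FORMAT)).total_seconds())
    return lags


lags = heartbeat_lags(STATE_XML)
for name, seconds in lags.items():
    print(f"{name}: {seconds // 60} min behind controller")
```

    For this dump the loader comes out a little over two hours behind the controller, which matches the "heart beat is old" error. The pipelines show a similar lag, but they also report running="false", so that may simply reflect idle pipelines rather than a second fault.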



  • 2.  Re: Process Engine Trouble

    Broadcom Employee
    Posted Nov 27, 2015 07:08 AM

    Hi Colin,

     

    See these errors:

     

    ERROR 2015-11-26 08:18:38,775 [ProcessEngineThreadMonitor (tenant=clarity)] niku.union (none:none:none:none) ProcessEngineThreadMonitor: Thread, Process Loader (tenant=clarity) is not Alive. The Service running the ProcessEngine will need to be restarted.

     

    ERROR 2015-11-26 08:18:39,519 [Process Monitor (tenant=clarity)] bpm.engine (clarity:process_admin:200185659__F54454D6-9A7D-4E69-B1CB-B23708C99F3A:none) ----> Engine: bg-SERVER heart beat is old.

     

    The error is "heart beat is old", so it looks like the bg heartbeat went out of sync, which could be due to a multicast issue.

     

    What is your Process Engine configuration, for example how many pipelines do you have?

     

    Regards

    Suman Pramanik



  • 3.  Re: Process Engine Trouble

    Posted Nov 27, 2015 07:59 AM

    Hi Suman,

     

    I think you could be right about a problem with multicasting.

     

    status.png

     

    It looks like 1 server is handling a lot more processes than the other?

     

    Server 1:

     

    Server 1.PNG

     

    Server 2.PNG



  • 4.  Re: Process Engine Trouble

    Broadcom Employee
    Posted Nov 27, 2015 08:04 AM

    Hi Colin,

     

    Yes, that's a clear imbalance. We would not expect an exact 50/50 distribution, but 7k and 0 is not right. Try doing a multicast test and see if the non-working bg is communicating.

     

    Regards

    Suman Pramanik



  • 5.  Re: Process Engine Trouble

    Posted Nov 27, 2015 08:50 AM

    1. In the NSA, I checked that both servers are using the same multicast address and port (239.0.1.52 & 9090)

    2. The NSA password being used is the same on both.

    3. The "Distributed" option is checked for both servers in the cluster.

    4. The bind address is listed as comma separated values with no space between the IP addresses.

    5. On the NSA, I ran the commands admin tower, refresh and list clients; both servers in the cluster appeared as expected.

    6. The beacon service is running on both servers in the cluster.



  • 6.  Re: Process Engine Trouble

    Posted Nov 27, 2015 08:52 AM

    Also, "Distributed" is ticked on both servers in the NSA.



  • 7.  Re: Process Engine Trouble

    Broadcom Employee
    Posted Nov 27, 2015 10:02 AM

    Have you run the tower test to check multicast?



  • 8.  Re: Process Engine Trouble

    Posted Nov 27, 2015 10:15 AM

    Yes, I went on to the NSA server and ran the following:

     

    admin tower

    refresh

    list clients

     

    Both servers displayed:

     

    admin_tower.PNG



  • 9.  Re: Process Engine Trouble

    Broadcom Employee
    Posted Nov 27, 2015 10:27 AM

    Hi Colin,

     

    The results suggest the first multicast test is OK. Can you uncheck "Distributed" and see how the distribution goes? Checking that option adds overhead, as it stores the sessions in the database.

     

    http://www.ca.com/us/support/ca-support-online/product-content/knowledgebase-articles/tec480143.aspx



  • 10.  Re: Process Engine Trouble

    Posted Nov 27, 2015 11:16 AM

    Thank you Suman.

     

    I have unchecked "Distributed" on both application servers in the NSA.

     

    I will monitor and see if things improve.



  • 11.  Re: Process Engine Trouble

    Broadcom Employee
    Posted Nov 27, 2015 11:26 AM

    You are welcome Colin, have a great weekend



  • 12.  Re: Process Engine Trouble

    Posted Nov 30, 2015 04:56 AM

    Hi Suman,

     

    So over the weekend we have seen little change.

     

    On server 1 we have 7746 active processes and on server 2 there are none!

     

    Also, we see 1329 process errors on server 1 but none on server 2.

     

    Do you think I should open a case?



  • 13.  Re: Process Engine Trouble

    Posted Nov 30, 2015 04:57 AM

    I should add that all of our scheduled jobs still appear to be completing successfully. But obviously the load isn't being shared between the 2 PEs.



  • 14.  Re: Process Engine Trouble

    Posted Nov 30, 2015 03:04 PM

    The Distributed settings will not affect the bg or process engine or multicast in any way.

     

    Those are settings only for the capturing of transient UI data (e.g. things you type into text filter fields for example) for your session whilst logged into the app services using a web browser, when your apps are clustered.

     

    The process engine acts greedily towards any processes it needs to operate on. This is especially visible when services are restarted: the process loader in each engine starts up and tries to scoop up all the processes to work with.

     

    For processes that appear to be on a process engine that isn't functioning, there will be a grace period before another process engine will inherit their pending actions and take it over.

     

    Once all these things are addressed, new incoming process instances will be grabbed by whichever process engine wants them first.  This may still appear unbalanced, either because of network timings and coincidences, or because a process engine is simply not burdened enough to stop it being the first responder to a new request.

     

    Do all your settings for the process engines in your CSA for these servers look the same as this?

     

     

    If so, then there must have been some kind of problem that resulted in the heartbeat not being registered within the allowed time window.  Unless the logs prior to this show further evidence of problems with the multicast threads and libraries, or mention memory issues after the services were last restarted but before the problem began, you would probably want to open a support issue if it continues to recur.

     

    It may be prudent to keep an eye on the heartbeat times (periodically check the heart_beat column of the bpm_run_process_engines table for rows where end_date is null) so that you can try to anticipate the next occurrence before the system flags the heartbeat as too old. That is a 2 hour window, so perhaps check every 30 minutes.  This would give you a window of opportunity to collect statistics and data, such as thread or heap dumps, to further analyse the problem if needed.
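
    The check described above could be sketched roughly as follows. This is an illustration under assumptions, not a supported script: the query uses only the table and column names mentioned in this post, so verify them against your own schema. The WARN_AFTER threshold and the classify_heartbeat function are hypothetical names chosen for the example. Running the query against your database is left out, since the driver depends on your database vendor.

```python
from datetime import datetime, timedelta

# Query built from the table and columns named in the post above;
# verify them against your own schema before use.
HEARTBEAT_SQL = (
    "SELECT heart_beat FROM bpm_run_process_engines WHERE end_date IS NULL"
)

STALE_AFTER = timedelta(hours=2)     # window quoted in the post
WARN_AFTER = timedelta(minutes=30)   # hypothetical early-warning threshold


def classify_heartbeat(heart_beat, now):
    """Return 'ok', 'warn' or 'stale' for one engine's heartbeat."""
    age = now - heart_beat
    if age >= STALE_AFTER:
        return "stale"
    if age >= WARN_AFTER:
        return "warn"
    return "ok"


# Example using the loader heartbeat from the log in this thread,
# checked at a hypothetical 07:00 monitoring run.
last_beat = datetime(2015, 11, 26, 6, 15, 13)
print(classify_heartbeat(last_beat, datetime(2015, 11, 26, 7, 0, 0)))
```

    A "warn" result is the cue to collect thread or heap dumps while the engine is still up, before the heartbeat crosses the 2 hour limit.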



  • 15.  Re: Process Engine Trouble

    Posted Dec 01, 2015 04:31 AM

    Hi Nick, thank you once again for your invaluable insight.

     

    Our bg settings in the NSA are the same as the ones you illustrated in the screen shot in your post.

     

    From the basic multicast tests we've conducted, it does appear to be working correctly.

    As we saw in these errors:

     

    ERROR 2015-11-26 08:18:38,775 [ProcessEngineThreadMonitor (tenant=clarity)] niku.union (none:none:none:none) ProcessEngineThreadMonitor: Thread, Process Loader (tenant=clarity) is not Alive. The Service running the ProcessEngine will need to be restarted.

     

    ERROR 2015-11-26 08:18:39,519 [Process Monitor (tenant=clarity)] bpm.engine (clarity:process_admin:200185659__F54454D6-9A7D-4E69-B1CB-B23708C99F3A:none) ----> Engine: bg-SERVER heart beat is old. Heartbeat: Thu Nov 26 06:15:13 EST 2015 Limit: Thu Nov 26 06:18:39 EST 2015. Engine will be shutdown.. If this is not normal, please fix any problems and restart the process engine. Trying to shudown process enigne.

     

    The heartbeat didn't update in its allocated time and so a bg restart was necessary. I'll go back in the logs to see if there is anything which might explain the reason why the heartbeat didn't update.

     

    Funnily enough, since we restarted both bg services the workload does appear to be shared more equally than before.

     

    Currently Server 1 has 3300 active processes and Server 2 has 4446 active processes.