Clarity

Process Engine Trouble

1. Process Engine Trouble

CMCN1982
Posted Nov 27, 2015 06:00 AM

Yesterday one of our process engines ran into difficulty and stopped responding. Stopping and starting the bg service has it back up and running, but we would like to understand what caused the problem if possible. In the bg-ca.log file I can see these errors:

ERROR 2015-11-26 08:18:38,775 [ProcessEngineThreadMonitor (tenant=clarity)] niku.union (none:none:none:none) ProcessEngineThreadMonitor: Thread, Process Loader (tenant=clarity) is not Alive. The Service running the ProcessEngine will need to be restarted.

ERROR 2015-11-26 08:18:39,519 [Process Monitor (tenant=clarity)] bpm.engine (clarity:process_admin:200185659__F54454D6-9A7D-4E69-B1CB-B23708C99F3A:none) ----> Engine: bg-SERVER heart beat is old. Heartbeat: Thu Nov 26 06:15:13 EST 2015 Limit: Thu Nov 26 06:18:39 EST 2015. Engine will be shutdown.. If this is not normal, please fix any problems and restart the process engine. Trying to shudown process enigne.
java.lang.Exception
at com.niku.bpm.engine.ProcessMonitor.execute(ProcessMonitor.java:188)
at com.niku.bpm.engine.ProcessMonitor.run(ProcessMonitor.java:118)

ERROR 2015-11-26 08:18:39,522 [Process Monitor (tenant=clarity)] engine.state (clarity:process_admin:200185659__F54454D6-9A7D-4E69-B1CB-B23708C99F3A:none) <?xml version="1.0" encoding="UTF-8"?>
<processEngine id="5874001" instanceName="bg-PAERSCBBLA1063">
<controller heartBeat="2015-11-26T08:18:39"/>
<loader heartBeat="2015-11-26T06:15:13" queueLength="410"/>
<conditionWaitList queueLength="0"/>
<retryWaitList queueLength="0"/>
<actionWaitList queueLength="0"/>
<PreConditionPipelineManager load="0.0" noOfPipelines="2" queueLength="0" recentLoad="0.0">
    <pipeline heartBeat="2015-11-26T06:13:38" index="1" load="0.0" name="Pre Condition Pipeline 0" recentLoad="0.0" runTime="0" running="false" startTime="2015-11-26T06:13:38"/>
    <pipeline heartBeat="2015-11-26T06:13:38" index="1" load="0.0" name="Pre Condition Pipeline 1" recentLoad="0.0" runTime="0" running="false" startTime="2015-11-26T06:13:38"/>
</PreConditionPipelineManager>
<PostConditionTransitionPipelineManager load="0.0" noOfPipelines="3" queueLength="0" recentLoad="0.0">
    <pipeline heartBeat="2015-11-26T06:13:38" index="1" load="0.0" name="Post Condition Transition Pipeline 0" recentLoad="0.0" runTime="0"
      running="false" startTime="2015-11-26T06:13:38"/>
    <pipeline heartBeat="2015-11-26T06:13:38" index="1" load="0.0" name="Post Condition Transition Pipeline 1" recentLoad="0.0" runTime="0"
      running="false" startTime="2015-11-26T06:13:38"/>
    <pipeline heartBeat="2015-11-26T06:13:38" index="1" load="0.0" name="Post Condition Transition Pipeline 2" recentLoad="0.0" runTime="0"
      running="false" startTime="2015-11-26T06:13:38"/>
</PostConditionTransitionPipelineManager>
<ActionExecutionPipelineManager load="0.0" noOfPipelines="3" queueLength="0" recentLoad="0.0">
    <pipeline heartBeat="2015-11-26T06:13:38" index="1" load="0.0" name="Action Execution Pipeline 0" recentLoad="0.0" runTime="0" running="false" startTime="2015-11-26T06:13:38"/>
    <pipeline heartBeat="2015-11-26T06:13:38" index="1" load="0.0" name="Action Execution Pipeline 1" recentLoad="0.0" runTime="0" running="false" startTime="2015-11-26T06:13:38"/>
    <pipeline heartBeat="2015-11-26T06:13:38" index="1" load="0.0" name="Action Execution Pipeline 2" recentLoad="0.0" runTime="0" running="false" startTime="2015-11-26T06:13:38"/>
</ActionExecutionPipelineManager>
</processEngine>
SYS   2015-11-26 08:18:39,525 [Process Monitor (tenant=clarity)] bpm.engine (clarity:process_admin:200185659__F54454D6-9A7D-4E69-B1CB-B23708C99F3A:none) Process Controller for engine bg-SERVER stopping...

Is there any good place to start investigating the issue which may have caused the process engine to stop?
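
One possible starting point is the engine.state dump itself, since it records a heartbeat for each component. The following is a minimal sketch, assuming a trimmed copy of the XML from the log above; the function name `heartbeat_lags` is hypothetical and not part of Clarity. It reports how far each component's heartbeat trails the controller's.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta

# Trimmed copy of the engine.state dump quoted in the log above.
ENGINE_STATE = """<processEngine id="5874001" instanceName="bg-PAERSCBBLA1063">
<controller heartBeat="2015-11-26T08:18:39"/>
<loader heartBeat="2015-11-26T06:15:13" queueLength="410"/>
</processEngine>"""

FMT = "%Y-%m-%dT%H:%M:%S"


def heartbeat_lags(state_xml):
    """Return {component: time its heartbeat trails the controller heartbeat}."""
    root = ET.fromstring(state_xml)
    controller = datetime.strptime(root.find("controller").get("heartBeat"), FMT)
    lags = {}
    for elem in root.iter():
        beat = elem.get("heartBeat")
        if beat is not None and elem.tag != "controller":
            # Pipelines carry a name attribute; other elements are keyed by tag.
            key = elem.get("name", elem.tag)
            lags[key] = controller - datetime.strptime(beat, FMT)
    return lags


lags = heartbeat_lags(ENGINE_STATE)
for name, lag in lags.items():
    print(name, lag)
```

For the dump above, the loader heartbeat trails the controller by just over two hours, which matches the "Process Loader ... is not Alive" message.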
2. Re: Process Engine Trouble

Broadcom Employee

Suman Pramanik
Posted Nov 27, 2015 07:08 AM

Hi Colin,

See the error

ERROR 2015-11-26 08:18:38,775 [ProcessEngineThreadMonitor (tenant=clarity)] niku.union (none:none:none:none) ProcessEngineThreadMonitor: Thread, Process Loader (tenant=clarity) is not Alive. The Service running the ProcessEngine will need to be restarted.

ERROR 2015-11-26 08:18:39,519 [Process Monitor (tenant=clarity)] bpm.engine (clarity:process_admin:200185659__F54454D6-9A7D-4E69-B1CB-B23708C99F3A:none) ----> Engine: bg-SERVER heart beat is old.

The error says the heartbeat is old, so it looks like the bg heartbeat went out of sync, which could be due to a multicast issue.

What is your Process Engine configuration, for example how many pipelines do you have?

Regards
Suman Pramanik
3. Re: Process Engine Trouble

CMCN1982
Posted Nov 27, 2015 07:59 AM

Hi Suman,

I think you could be right about a problem with the multicasting.

It looks like 1 server is handling a lot more processes than the other?

Server 1:
4. Re: Process Engine Trouble

Broadcom Employee

Suman Pramanik
Posted Nov 27, 2015 08:04 AM

Hi Colin,

Yes, that's a clear imbalance. We would not expect an exact 50/50 distribution, but 7k and 0 is not right. Try doing a multicast test and see whether the non-working bg is communicating.

Regards
Suman Pramanik
5. Re: Process Engine Trouble

CMCN1982
Posted Nov 27, 2015 08:50 AM

1. In the NSA, I checked that both servers are using the same multicast address and port (239.0.1.52 & 9090)
2. The NSA password being used is the same on both.
3. The "Distributed" option is checked for both servers in the cluster.
4. The bind address is listed as comma separated values with no space between the IP addresses.
5. On the NSA, I ran the commands admin tower, then refresh, then list clients. Both servers in the cluster appeared as expected.
6. The beacon service is running on both servers in the cluster.
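
Independently of the Clarity tooling, raw multicast reachability between the two servers can be probed with a short script. The following is a minimal sketch, assuming the group address and port quoted in the checklist above; the function names are hypothetical, and this is a generic UDP multicast probe, not the product's own test. Run `receive_probe()` on one server and `send_probe()` on the other.

```python
import socket

# Group address and port as quoted in the checklist above.
GROUP, PORT = "239.0.1.52", 9090


def membership_request(group, interface="0.0.0.0"):
    """Build the 8-byte ip_mreq value: group address + local interface address."""
    return socket.inet_aton(group) + socket.inet_aton(interface)


def send_probe(message=b"multicast-probe", ttl=1):
    """Send one datagram to the group. ttl=1 keeps it on the local subnet."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
        sock.sendto(message, (GROUP, PORT))
    finally:
        sock.close()


def receive_probe(timeout=10.0):
    """Join the group and wait for one datagram; return (data, sender_ip) or None."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", PORT))
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                        membership_request(GROUP))
        sock.settimeout(timeout)
        data, sender = sock.recvfrom(1024)
        return data, sender[0]
    except socket.timeout:
        return None
    finally:
        sock.close()
```

If the servers sit on different subnets, a TTL above 1 would be needed. With the Clarity services running, the receiver may also pick up their own traffic on this group and port, so check the payload and sender address before drawing conclusions.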
6. Re: Process Engine Trouble

CMCN1982
Posted Nov 27, 2015 08:52 AM

Also, "Distributed" is ticked on both servers in the NSA.
7. Re: Process Engine Trouble

Broadcom Employee

Suman Pramanik
Posted Nov 27, 2015 10:02 AM

Have you run the test to check multicast using tower test?
8. Re: Process Engine Trouble

CMCN1982
Posted Nov 27, 2015 10:15 AM

Yes, I went on to the NSA server and ran the following:

admin tower
refresh
list clients

Both servers displayed:
9. Re: Process Engine Trouble

Broadcom Employee

Suman Pramanik
Posted Nov 27, 2015 10:27 AM

Hi Colin,

The results suggest the first multicast test is OK. Can you uncheck "Distributed" and see how the distribution goes? Checking that option adds overhead, as it stores the sessions in the database.

http://www.ca.com/us/support/ca-support-online/product-content/knowledgebase-articles/tec480143.aspx
10. Re: Process Engine Trouble

CMCN1982
Posted Nov 27, 2015 11:16 AM

Thank you Suman.

I have unchecked "Distributed" on both application servers in the NSA.

I will monitor and see if things improve.
11. Re: Process Engine Trouble

Broadcom Employee

Suman Pramanik
Posted Nov 27, 2015 11:26 AM

You are welcome Colin, have a great weekend.
12. Re: Process Engine Trouble

CMCN1982
Posted Nov 30, 2015 04:56 AM

Hi Suman,

Over the weekend we have seen little change.

On server 1 we have 7746 active processes and on server 2 there are none!

Also, we see 1329 process errors on server 1 but none on server 2.

Do you think I should open a case?
13. Re: Process Engine Trouble

CMCN1982
Posted Nov 30, 2015 04:57 AM

I should add that all of our scheduled jobs still appear to be completing successfully. But obviously the load isn't being shared between the 2 PEs.
14. Re: Process Engine Trouble

Nick Darlington
Posted Nov 30, 2015 03:04 PM

The Distributed settings will not affect the bg or process engine or multicast in any way.

Those settings only control the capture of transient UI data for your session (for example, things you type into text filter fields) while you are logged into the app services through a web browser, when your apps are clustered.

The process engine acts greedily towards any processes it needs to operate on. This is especially visible when services are restarted: the process loaders start up in each engine and try to scoop up all the processes to work with.

For processes that appear to be on a process engine that isn't functioning, there will be a grace period before another process engine inherits their pending actions and takes them over.

Once all these things are addressed, new incoming process instances will be grabbed by whichever process engine wants them first. This may still appear unbalanced, either because of network timings and coincidences, or because the process engine is simply not burdened enough and still feels capable of being the first responder to a new request.

Do all your settings for the process engines in your CSA for these servers look the same as this?

If so, then there must have been some kind of problem that resulted in the heartbeat not being registered within the allowed time window. Unless the logs show earlier evidence of problems with the multicast threads and libraries, or a mention of memory issues after the services were last restarted but before the problem began, you would probably want to open a support issue if it keeps recurring.

It may be prudent to keep an eye on the heartbeat times by periodically checking the heart_beat column of the bpm_run_process_engines table where end_date is null. That way you can try to anticipate the next occurrence before the system flags the heartbeat as too old (a 2 hour window, so perhaps check every 30 minutes). This would give you a window of opportunity to collect statistics and data such as thread or heap dumps to analyse the problem further.
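
The periodic check described above could be sketched as follows. This is a minimal sketch, assuming only the table and column names mentioned in the post (bpm_run_process_engines, heart_beat, end_date); the function name, the 90 minute warning threshold, and the in-memory SQLite stand-in are all assumptions for illustration, and a real check would run against the Clarity database with its own driver.

```python
import sqlite3
from datetime import datetime, timedelta

# Table and column names come from the post above; nothing else about the
# schema is assumed.
QUERY = "SELECT heart_beat FROM bpm_run_process_engines WHERE end_date IS NULL"


def stale_heartbeats(cursor, now, warn_after=timedelta(minutes=90)):
    """Return heartbeats of active engines that are older than warn_after."""
    cursor.execute(QUERY)
    stale = []
    for (beat,) in cursor.fetchall():
        if isinstance(beat, str):  # the SQLite stand-in stores text
            beat = datetime.fromisoformat(beat)
        if now - beat > warn_after:
            stale.append(beat)
    return stale


# Stand-in database so the sketch runs anywhere (hypothetical rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bpm_run_process_engines (heart_beat TEXT, end_date TEXT)")
conn.executemany(
    "INSERT INTO bpm_run_process_engines VALUES (?, ?)",
    [
        ("2015-11-26 06:15:13", None),                   # active, lagging
        ("2015-11-26 07:55:00", None),                   # active, healthy
        ("2015-11-25 01:00:00", "2015-11-25 02:00:00"),  # ended, ignored
    ],
)

now = datetime(2015, 11, 26, 8, 0, 0)
stale = stale_heartbeats(conn.cursor(), now)
print(stale)
```

Warning at 90 minutes and checking every 30 minutes leaves room to react before the 2 hour limit is reached.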
15. Re: Process Engine Trouble

CMCN1982
Posted Dec 01, 2015 04:31 AM

Hi Nick, thank you once again for your invaluable insight.

Our bg settings in the NSA are the same as the ones you illustrated in the screen shot in your post.

From the basic multi-cast tests we've conducted it does appear that it is working correctly.
As we saw in these errors:

ERROR 2015-11-26 08:18:38,775 [ProcessEngineThreadMonitor (tenant=clarity)] niku.union (none:none:none:none) ProcessEngineThreadMonitor: Thread, Process Loader (tenant=clarity) is not Alive. The Service running the ProcessEngine will need to be restarted.

ERROR 2015-11-26 08:18:39,519 [Process Monitor (tenant=clarity)] bpm.engine (clarity:process_admin:200185659__F54454D6-9A7D-4E69-B1CB-B23708C99F3A:none) ----> Engine: bg-SERVER heart beat is old. Heartbeat: Thu Nov 26 06:15:13 EST 2015 Limit: Thu Nov 26 06:18:39 EST 2015. Engine will be shutdown.. If this is not normal, please fix any problems and restart the process engine. Trying to shudown process enigne.

The heartbeat didn't update in its allocated time and so a bg restart was necessary. I'll go back in the logs to see if there is anything which might explain the reason why the heartbeat didn't update.

Funnily enough, since we restarted both bg services the workload does appear to be shared more equally than before.

Currently Server 1 has 3300 active processes and Server 2 has 4446 active processes.
