
System Action In The Finish Step - Potential For Endless Process Chains

Blog Post created by nick_darlington Employee on Oct 2, 2015

We recently reported this as a potential defect, and it is worth considering, particularly if you have been placing actions in the Finish steps of your process definitions.  The reference is CLRT-78999: Having System Action in the Finish step of the process causing same process to run many times

 

 

Here I would like to add more substance around the problem: why it can happen, and what can be done to avoid or prevent it.


Problem description:

 

A process may need to meet the following conditions:

 

  • Starts automatically on update of an object instance
  • Does not run more than once concurrently
  • May need to make further changes to attributes on the same object instance
  • Will probably need to run multiple times for the same instance, for example if another update is made manually to the object after the first process completes

 

It seems simple enough, but the last two conditions can introduce a race condition that results in a (potentially endless) chain of process instances being created.


Diagnosis:

 

When a change is made to an object instance that a) is event enabled and b) has active processes that can auto-start, an event is raised in the system.  Events are sent via multicast as soon as the changes are made.  Received events are evaluated and passed to the appropriate handlers to deal with them.

 

The handling of those events is asynchronous, meaning it happens outside of, and on a separate timeline from, the thread or action that raised them.

 

It is this asynchronous handling that allows a process design flaw (oversight) to leave processes stuck in a chain.


Example:

 

Note: It is not advised to carry this out as a test on any environment that cannot be easily stopped and wiped / reset.  Running this test could literally result in several thousand process instances being launched in rapid succession - maybe even millions.

 

A simple process is created to auto-start on update of an object.  The 'singleton' option ('do not start if already running') is enabled.  Leave the conditions simple or unrelated to the attributes in the remaining steps.

 

In the Finish step of the process, add a System action to update an attribute (one that is not part of the initial condition, for simplicity's sake) and give it a value.  Validate and activate the process, and give your user Start rights to it.

 

Now when starting this process by making a change to an object instance record, the process will run through until the Finish step.  Once it gets there, it will update the attribute, and then the process will finish (it is no longer running).

 

As soon as the attribute is updated, even before the process is marked as finished, an event is raised.

 

This event is received and passed to a handler to be evaluated and actioned, but receiving and handling the event takes time, and that happens concurrently with the original process finishing.

 

At the time the evaluation takes place, the original process has finished.  The evaluation concludes that an update has occurred and eligible processes exist for starting (based on status and permissions).  The conditions are checked and are also met, and due to the singleton setting, a check is made to see if the process is already running on this instance - and now it is not.

 

A new process instance now begins, and a loop has been created.  This loop continues until, by chance, an event is handled and evaluated before the process that raised it has marked itself as complete/finished.
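
The loop described above can be modelled with a small deterministic sketch.  This is not how the process engine is implemented: the tick-based timing, the function name simulate_chain and its parameters are all assumptions made purely for illustration.  It only shows how the relative timing of "event evaluated" versus "process marked as finished" decides whether the singleton check blocks the next instance.

```python
from collections import deque


def simulate_chain(handler_delay, finish_delay, max_instances=10):
    """Count process instances started by one manual update (toy model).

    handler_delay: ticks between an event being raised and being evaluated.
    finish_delay:  ticks between the Finish-step attribute update and the
                   process being marked as finished.
    max_instances: safety cap so the model itself cannot loop forever.
    """
    started = 0
    finished_at = None  # tick at which the latest instance is marked finished
    # The manual update is raised at tick 0 and evaluated handler_delay later.
    pending = deque([handler_delay])

    while pending and started < max_instances:
        now = pending.popleft()
        still_running = finished_at is not None and now < finished_at
        if still_running:
            # Singleton check: an instance is still running, so nothing starts.
            continue
        started += 1
        update_tick = now + 1  # the instance reaches Finish and updates the attribute
        finished_at = update_tick + finish_delay
        # That update raises a new event, evaluated handler_delay ticks later.
        pending.append(update_tick + handler_delay)

    return started


# Handler slower than the finish: the chain runs until the safety cap.
print(simulate_chain(handler_delay=5, finish_delay=1))
# Handler faster than the finish: the singleton check stops the chain.
print(simulate_chain(handler_delay=1, finish_delay=5))
```

In this model the first call starts 10 instances (the cap) and the second starts only 1.  A larger finish_delay is a rough stand-in for having more work left between the attribute update and the end of the process.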


Remediation:

 

It has long been at least an advisory note, if not a best practice, caution or requirement, that actions should not be placed in the Start or Finish steps of a process.  Following this advice helps mitigate the issue, and indeed, moving actions back just one step before the Finish step can massively reduce these loops.  This is partly because the original process can no longer simply finish: after updating the object it must first transition to the Finish step, which involves further events being raised and handled.  That gives the object update event handler time to see that the process is still running and not start a new one.

 

However, it is not guaranteed.  For example, a brief network interruption could mean that the event for the object update is not received via multicast.  Alternatively, at peak times the existing threads may be so busy handling other traffic that the event is briefly queued.

 

A safeguard against lost events exists: events are also recorded in the database for persistence, and every 5 minutes (by default) a thread in the process engine wakes up to check for any that were missed.  That can mean up to 5 minutes may pass after the process finishes before the event is handled and a new process instance is started.  This occurrence ought to be relatively infrequent though, and it may explain the occasional (otherwise unexplainable) duplicate process instance rather than an unending chain of them.
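
As a rough illustration of that safeguard, here is a minimal sketch of a periodic sweep over persisted events.  Everything in it is hypothetical: the PersistedEvent record, the sweep and delay_until_next_sweep functions, and the assumption that sweeps fall on fixed multiples of the interval are stand-ins for illustration, not the product's actual schema or scheduler.

```python
from dataclasses import dataclass

SWEEP_INTERVAL_SECONDS = 300  # the 5-minute default mentioned in the post


@dataclass
class PersistedEvent:
    """Hypothetical stand-in for an event row recorded in the database."""
    object_id: int
    raised_at: float
    handled: bool = False


def sweep(events, now):
    """Return events still unhandled at sweep time and mark them as handled."""
    recovered = [e for e in events if not e.handled and e.raised_at <= now]
    for event in recovered:
        event.handled = True
    return recovered


def delay_until_next_sweep(raised_at, interval=SWEEP_INTERVAL_SECONDS):
    """Seconds a missed event waits, assuming sweeps at multiples of interval."""
    next_sweep = (int(raised_at // interval) + 1) * interval
    return next_sweep - raised_at


events = [PersistedEvent(object_id=1, raised_at=10.0),
          PersistedEvent(object_id=2, raised_at=20.0, handled=True)]
recovered = sweep(events, now=300.0)
print([e.object_id for e in recovered], delay_until_next_sweep(10.0))
```

By the time a missed event is picked up this way, the process that raised it has almost certainly finished, so the singleton check does not block the new instance.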

 

So, as well as moving any object attribute update actions as far away from the Finish step as possible, it would also be best to consider flagging the update in some way that lets another potential process instance know where the update originated, so that it can either not start or take an early exit before it perpetuates the problem any further.
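
One way to picture the flagging idea is the sketch below.  It assumes a hypothetical marker attribute (here called updated_by_process) that the process sets alongside its own update and that a manual update clears; in a real process definition this would more likely be expressed as a start condition or an early branch, and none of the names below come from the product.

```python
MARKER = "updated_by_process"  # hypothetical flag attribute on the object


def process_update(record, attribute, value):
    """Update made by the process itself: set the value and flag the origin."""
    record[attribute] = value
    record[MARKER] = True


def manual_update(record, attribute, value):
    """Update made by a user: set the value and clear the flag."""
    record[attribute] = value
    record[MARKER] = False


def should_start(record, already_running):
    """Start condition: singleton check plus 'ignore our own updates'."""
    if already_running:
        return False
    return not record.get(MARKER, False)


record = {}
manual_update(record, "status", "open")
print(should_start(record, already_running=False))   # user change: eligible
process_update(record, "status", "done")
print(should_start(record, already_running=False))   # process change: skipped
```

The design choice here is that the guard no longer depends on timing: even if the event is evaluated after the process has finished, the flag tells the evaluation that the update came from the process and should not trigger another instance.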


Also, the following query only looks at the limited case where system actions that update attributes are in the Finish step, and it does not consider whether the conditions on the process definition are already sufficient to prevent or avoid loops.  Even so, it may at least help to narrow down which process definitions are worth double-checking before you run into this issue:

 

-- Looks for auto-start processes that have actions in their Finish step which set
-- attribute values on the same object as the primary object of the process definition.
-- These processes risk a race condition: the process may finish and start again
-- (tens of thousands of times or more, possibly endlessly) unless there is more room,
-- such as another step to transition to, before another process is considered for starting.

select distinct d.id as PROC_DEF_ID, v.id as PROC_VER_ID, ld.name,
  d.process_code, v.internal_status_code, v.user_status_code
from bpm_def_processes d
join bpm_def_process_versions v
  on d.id = v.process_id
  and v.start_option_code = 'AUTO_START'
  and v.start_event_code = 'update'
join bpm_def_stages g
  on g.process_version_id = v.id
join bpm_def_steps s
  on s.stage_id = g.id
  and s.step_code = 'Finish'
join bpm_def_step_actions a
  on a.step_id = s.id
  and a.type_code = 'BPM_SAT_SYSTEM'
  and a.action_code like 'set%'
join bpm_def_objects o
  on v.id = o.pk_id
  and o.table_name = 'BPM_DEF_PROCESS_VERSIONS'
  and o.type_code = 'BPM_POT_PRIMARY'
join cmn_captions_nls ld
  on d.id = ld.pk_id
  and ld.language_code = 'en'
  and ld.table_name = 'BPM_DEF_PROCESSES'
where o.object_name = a.object_name
order by ld.name
/

 

Note: The best result is for the query to return 0 records.  That is also the result to aim for when you run the query again after correcting any process definitions it listed.
