Identifying hung AE server processes

Question asked by Michael_Lowry on Jul 4, 2018
Latest reply on Jul 9, 2018 by Carsten_Schmitz

We have recently encountered a problem that can cause Automation Engine server processes — specifically work processes — to hang. In this case, the WPs are not blocking DB sessions, and they are not consuming CPU time. They’re just doing nothing. They do not respond to requests to quit from the Service Manager. They must be killed with the KILL signal (-9) and restarted. We would like to find a way to identify such hung AE server processes programmatically, so that we can kill and restart them automatically.

 

In the Java User Interface, we can identify hung processes by the fact that they appear grayed-out in the System Overview.

 

Initially I thought it might be possible to list the same information using the ServerList Java API class. For instance, in the System Overview, none of the hung WPs has a PID, and each has a B.60 of 0. Might these two criteria be used to uniquely identify hung processes?

 

Unfortunately, the answer is no. When we iterate through the server list, ServerListItem.getName() returns no data for hung processes. This means that although we can use this approach to count the hung WPs, we cannot use it to identify which WPs are hung. (The other methods of ServerListItem also return empty results for hung WPs.)

 

I noticed that when WPs hang, they stop writing to their log files.
$ ls -ltr /var/uc4/server/?Psrv_DEV_log_???_00.txt
-rw-r----- 1 aedev1 mycompany  3151469 Jun 23 19:14 /var/uc4/server/WPsrv_DEV_log_014_00.txt
-rw-r----- 1 aedev1 mycompany  2595001 Jun 23 19:22 /var/uc4/server/WPsrv_DEV_log_016_00.txt
-rw-r----- 1 aedev1 mycompany    44771 Jul  3 11:32 /var/uc4/server/WPsrv_DEV_log_056_00.txt
-rw-r----- 1 aedev1 mycompany    45492 Jul  3 11:32 /var/uc4/server/WPsrv_DEV_log_010_00.txt
-rw-r----- 1 aedev1 mycompany    44759 Jul  3 11:32 /var/uc4/server/WPsrv_DEV_log_008_00.txt
-rw-r----- 1 aedev1 mycompany    45012 Jul  3 11:32 /var/uc4/server/WPsrv_DEV_log_006_00.txt
-rw-r----- 1 aedev1 mycompany   431613 Jul  4 13:03 /var/uc4/server/WPsrv_DEV_log_038_00.txt
-rw-r----- 1 aedev1 mycompany   371402 Jul  4 13:03 /var/uc4/server/WPsrv_DEV_log_034_00.txt
-rw-r----- 1 aedev1 mycompany   386269 Jul  4 13:03 /var/uc4/server/WPsrv_DEV_log_032_00.txt
-rw-r----- 1 aedev1 mycompany   381381 Jul  4 13:03 /var/uc4/server/WPsrv_DEV_log_030_00.txt
...

 

So another possibility would be to set up a log file monitor that periodically looks for files that match the pattern ?Psrv_DEV_log_???_00.txt and have not been modified recently (the script below uses a threshold measured in days). E.g.,

 

check_for_hung_WPs.sh

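#!/bin/bash
# Print the process numbers of AE server processes whose log files are stale.
LOG_DIR="/var/uc4/server/"
ENV="DEV"
# Age threshold in days. find ignores fractional days, so -mtime +1 matches
# files last modified at least two days ago.
AGE="1"
# The process number is the fourth "_"-separated field of the path, which
# assumes LOG_DIR itself contains no underscores.
find "${LOG_DIR}" -name "?Psrv_${ENV}_log_???_00.txt" -mtime +"${AGE}" |
        awk -F'_' '{print $4}'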

 

This appears to work well, and it correctly identifies the hung WPs.

$ ./check_for_hung_WPs.sh
016
014
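
The script above measures age in whole days. Where a threshold of a few hours is wanted, a variant based on find's -mmin test (minutes) might look like the sketch below. It assumes a find that supports -mmin (GNU and BSD find do; it is not in POSIX), and the function name list_stale_server_logs and its three arguments are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the process numbers of server processes whose
# log file has not been modified for more than a given number of minutes.
# Arguments: log directory, environment name (e.g. DEV), max age in minutes.
list_stale_server_logs() {
    local log_dir="$1" env_name="$2" max_age_min="$3"
    # -mmin +N matches files last modified more than N minutes ago
    # (assumes GNU or BSD find).
    find "${log_dir}" -name "?Psrv_${env_name}_log_???_00.txt" \
         -mmin +"${max_age_min}" |
        # Split only the file name (last path component) on "_", so that
        # underscores in the directory path cannot shift the field position.
        awk -F/ '{ split($NF, parts, "_"); print parts[4] }' |
        sort
}
```

It could be called as list_stale_server_logs /var/uc4/server DEV 180 to list processes whose logs have been silent for more than three hours. A threshold this short may also flag WPs that are merely idle, so the value would need tuning for the system in question.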

 

We still need to find the process IDs though, and the only easy way I know of is to use the Service Manager UI.

 

Is there a way to look up the PIDs programmatically?
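
A possible starting point is to ask the operating system which process still holds a stale log file open. The sketch below is Linux-only and rests on two unconfirmed assumptions: that a hung server process keeps its log file open, and that the script runs as the same user as the server processes (or as root) so that it may read their /proc entries. The function name pids_holding_file is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the PIDs of processes that have the given file
# open, by scanning the file-descriptor symlinks under /proc (Linux only).
pids_holding_file() {
    local target fd pid
    # Resolve to a canonical path, since /proc shows canonical paths.
    target=$(readlink -f "$1") || return 1
    for fd in /proc/[0-9]*/fd/*; do
        # Each entry is a symlink to the open file; unreadable or vanished
        # entries make readlink fail and are simply skipped.
        if [ "$(readlink "$fd" 2>/dev/null)" = "$target" ]; then
            pid=${fd#/proc/}      # strip the leading "/proc/"
            echo "${pid%%/*}"     # keep only the PID component
        fi
    done | sort -un
}
```

For example, pids_holding_file /var/uc4/server/WPsrv_DEV_log_014_00.txt would print a PID only if some process still has that file open. Where they are installed, lsof or fuser report the same information.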