We have recently encountered a problem that can cause Automation Engine server processes, specifically work processes (WPs), to hang. In this state, the WPs are not blocking DB sessions and are not consuming CPU time; they are simply doing nothing. They do not respond to quit requests from the Service Manager, so they must be killed with the KILL signal (-9) and restarted. We would like to find a way to identify such hung AE server processes programmatically, so that we can kill and restart them automatically.
In the Java User Interface, we can identify hung processes by the fact that they appear grayed-out in the System Overview.
Initially I thought it might be possible to list the same information using the ServerList Java API class. For instance, in the System Overview, none of the hung WPs has a PID, and each has a B.60 value of 0. Might these two criteria be used to uniquely identify hung processes?
Unfortunately, the answer is no. When we iterate through the server list, ServerListItem.getName() returns no data for hung processes. This means that although we can use this approach to count the hung WPs, we cannot use it to identify which WPs are hung. (The other methods of ServerListItem also return empty results for hung WPs.)
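I noticed that when WPs hang, they stop writing to their log files. In the listing below, the logs of WPs 014 and 016 were last modified on Jun 23, well before the others: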
$ ls -ltr /var/uc4/server/?Psrv_DEV_log_???_00.txt
-rw-r----- 1 aedev1 mycompany 3151469 Jun 23 19:14 /var/uc4/server/WPsrv_DEV_log_014_00.txt
-rw-r----- 1 aedev1 mycompany 2595001 Jun 23 19:22 /var/uc4/server/WPsrv_DEV_log_016_00.txt
-rw-r----- 1 aedev1 mycompany 44771 Jul 3 11:32 /var/uc4/server/WPsrv_DEV_log_056_00.txt
-rw-r----- 1 aedev1 mycompany 45492 Jul 3 11:32 /var/uc4/server/WPsrv_DEV_log_010_00.txt
-rw-r----- 1 aedev1 mycompany 44759 Jul 3 11:32 /var/uc4/server/WPsrv_DEV_log_008_00.txt
-rw-r----- 1 aedev1 mycompany 45012 Jul 3 11:32 /var/uc4/server/WPsrv_DEV_log_006_00.txt
-rw-r----- 1 aedev1 mycompany 431613 Jul 4 13:03 /var/uc4/server/WPsrv_DEV_log_038_00.txt
-rw-r----- 1 aedev1 mycompany 371402 Jul 4 13:03 /var/uc4/server/WPsrv_DEV_log_034_00.txt
-rw-r----- 1 aedev1 mycompany 386269 Jul 4 13:03 /var/uc4/server/WPsrv_DEV_log_032_00.txt
-rw-r----- 1 aedev1 mycompany 381381 Jul 4 13:03 /var/uc4/server/WPsrv_DEV_log_030_00.txt
...
So another possibility would be to set up a log file monitor that periodically looks for files that match the pattern ?Psrv_DEV_log_???_00.txt and have not been modified recently. The example below uses find's -mtime test, so its threshold is measured in days rather than hours. E.g.,
check_for_hung_WPs.sh
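#!/bin/bash
# Print the process numbers of server processes whose log files have gone stale.
LOG_DIR="/var/uc4/server/"
ENV="DEV"
# Age in days. find counts whole 24-hour periods, so +1 matches files
# last modified at least 2 days ago.
AGE="1"
find "${LOG_DIR}" -name "?Psrv_${ENV}_log_???_00.txt" -mtime +"${AGE}" |
while IFS= read -r file; do
    # Strip the directory first so underscores in LOG_DIR cannot shift the fields;
    # in WPsrv_DEV_log_014_00.txt the fourth field is the process number.
    basename "${file}" | awk -F'_' '{print $4}'
done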
This appears to work well, and it correctly identifies the hung WPs.
$ ./check_for_hung_WPs.sh
016
014
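If the threshold should be a few hours rather than whole days, a variant based on find's -mmin test might look like the sketch below. The function name stale_process_numbers and its three arguments are made up for illustration, and -mmin is a GNU/BSD find extension rather than POSIX, so it may not be available on every platform.

```shell
#!/bin/bash
# Sketch: list process numbers whose log files are older than a threshold in minutes.
# stale_process_numbers is a hypothetical helper, not part of any Automation Engine tool.
# Assumption: find supports -mmin (GNU and BSD find do; POSIX does not require it).
stale_process_numbers() {
    local log_dir=$1 env=$2 max_age_min=$3
    find "$log_dir" -name "?Psrv_${env}_log_???_00.txt" -mmin +"$max_age_min" |
    while IFS= read -r file; do
        # WPsrv_DEV_log_014_00.txt -> 014 (fourth underscore-separated field)
        basename "$file" | awk -F'_' '{print $4}'
    done | sort
}
```

For example, stale_process_numbers /var/uc4/server DEV 180 would print the numbers of all processes whose logs have not been written for more than three hours. The threshold needs to be longer than the longest quiet period of a healthy process, otherwise idle but healthy WPs would be reported as hung.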
We still need to find the process IDs though, and the only easy way I know of is to use the Service Manager UI.
Is there a way to look up the PIDs programmatically?
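One avenue that might be worth testing is to ask the operating system which process still has the stale log file open. The sketch below is Linux-only (it reads /proc) and rests on an unverified assumption: that a hung WP keeps its log file open. The function name pids_holding_file is hypothetical, and the script can only see processes owned by the same user unless it runs as root.

```shell
#!/bin/bash
# Sketch: print the PIDs of processes that hold a given file open (Linux /proc only).
# Assumption (not verified): a hung WP still has its log file open.
pids_holding_file() {
    local target fd pid
    # Resolve symlinks so the path matches what /proc reports for open descriptors.
    target=$(readlink -f -- "$1") || return 1
    for fd in /proc/[0-9]*/fd/*; do
        # Unreadable or vanished entries make readlink fail and are skipped.
        if [ "$(readlink -- "$fd" 2>/dev/null)" = "$target" ]; then
            pid=${fd#/proc/}      # e.g. 1234/fd/3
            echo "${pid%%/*}"     # e.g. 1234
        fi
    done | sort -un
}
```

For example, pids_holding_file /var/uc4/server/WPsrv_DEV_log_014_00.txt would print one PID per line, or nothing if no readable process has the file open. Where they are installed, fuser or lsof -t perform the same lookup.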