MWNiebuhr

Important information for clients running CA Process Automation on VMWare

Discussion created by MWNiebuhr Employee on Aug 21, 2013
TEC597604 Important information for clients running CA Process Automation on VMWare when using the E1000 Network Interface Card.

Description:

This document informs clients of a potential problem when running CA Process Automation on a VMWare server that uses the E1000 Network Interface Card.

Problem:

The root cause of this problem is rare, sporadic socket I/O failures, which may leave the calling software waiting indefinitely for a read to complete.
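To illustrate what "waiting indefinitely for a read" means at the socket level, here is a minimal, self-contained Java sketch. It is not CA Process Automation code: the class name ReadTimeoutSketch and the method readTimesOut are hypothetical, and the silent local server only stands in for a peer whose data never arrives. The sketch shows that a blocking read returns control only when a read timeout has been set on the socket; without one, the calling thread would wait forever.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

class ReadTimeoutSketch {

    /**
     * Connects to a local server that accepts the connection but never writes,
     * then attempts a blocking read with the given timeout.
     * Returns true if the read was abandoned with a SocketTimeoutException.
     */
    static boolean readTimesOut(int timeoutMillis) {
        // Port 0 asks the operating system for any free port on the loopback interface.
        try (ServerSocket silentServer = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(silentServer.getInetAddress(), silentServer.getLocalPort());
             Socket accepted = silentServer.accept()) {

            // A timeout of 0 (the default) means "block forever", which is the
            // situation described above when the expected data never arrives.
            client.setSoTimeout(timeoutMillis);

            try {
                client.getInputStream().read(); // the peer never sends anything
                return false;                   // data or end-of-stream arrived
            } catch (SocketTimeoutException e) {
                return true;                    // the read gave up after the timeout
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOut(500));
    }
}
```

Code that does not set such a timeout, or that uses a driver which offers no way to set one, has no means of recovering from a read that never completes. This is consistent with the hung processes described in this document.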

From the user's perspective, the most typical symptom is the unexpected hanging of processes that normally complete without issue. These processes resume and complete as expected following a restart of the CA Process Automation Orchestrator.

This can impact a small subset of processes or all running processes. It has no correlation with Orchestrator uptime, and may manifest shortly after a restart or after days, weeks, or months of otherwise flawless Orchestrator functionality.

This problem has only been seen in environments running high volumes of CA Process Automation (PAM) processes. In most environments where the E1000 NIC is installed, the problem has never occurred, or has occurred so infrequently that it has not been detected.

Troubleshooting:

This problem is very difficult to confirm. When it occurs, the PAM thread is often stuck on a socket read and no relevant errors are written to the log files. Confirmation requires reviewing a series of Java thread dumps taken during an occurrence to verify that the Operator is stuck on a socket read.
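The comparison across a series of thread dumps can be sketched in Java as shown below. This is an illustration of the logic only, not the procedure CA Support uses: the class SocketReadSampler and its methods are hypothetical, the sketch samples the JVM it runs in rather than an external Orchestrator, and it assumes the blocked frame is java.net.SocketInputStream.socketRead0, as reported by Java 8 and earlier (newer JVMs report different frames).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashSet;
import java.util.Set;

class SocketReadSampler {

    /** Assumption: a blocked read shows java.net.SocketInputStream.socketRead0 as its top frame. */
    static boolean isSocketRead(StackTraceElement frame) {
        return "java.net.SocketInputStream".equals(frame.getClassName())
                && frame.getMethodName().startsWith("socketRead");
    }

    /** IDs of threads in this JVM whose top stack frame is currently a socket read. */
    static Set<Long> threadsInSocketRead(ThreadMXBean mx) {
        Set<Long> ids = new HashSet<>();
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info == null) {
                continue; // defensive: skip entries for threads that have ended
            }
            StackTraceElement[] stack = info.getStackTrace();
            if (stack.length > 0 && isSocketRead(stack[0])) {
                ids.add(info.getThreadId());
            }
        }
        return ids;
    }

    /** IDs of threads that were in a socket read in every one of the requested samples. */
    static Set<Long> persistentlyReading(int samples, long intervalMillis) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Set<Long> suspects = threadsInSocketRead(mx);
        for (int i = 1; i < samples && !suspects.isEmpty(); i++) {
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop sampling
                break;
            }
            // Keep only the threads that are still in a socket read in this sample.
            suspects.retainAll(threadsInSocketRead(mx));
        }
        return suspects;
    }

    public static void main(String[] args) {
        System.out.println("Threads in a socket read in every sample: "
                + persistentlyReading(5, 1000));
    }
}
```

A thread that appears in every sample is only a candidate. Idle listeners and pooled connections also sit in socket reads for long periods, so the result still has to be matched against the Operator that is known to be hung.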

When errors are observed in relation to this problem, they tend to be generic connection errors that could have other legitimate, unrelated causes. The following is one such example:

2013-07-24 18:55:23,219 WARN [org.hibernate.jdbc.AbstractBatcher] [nPool Worker-23] exception clearing maxRows/queryTimeout
com.microsoft.sqlserver.jdbc.SQLServerException: The connection is closed.
    at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDriverError(Unknown Source)
    at com.microsoft.sqlserver.jdbc.SQLServerConnection.checkClosed(Unknown Source)
    at com.microsoft.sqlserver.jdbc.SQLServerStatement.checkClosed(Unknown Source)
    at com.microsoft.sqlserver.jdbc.SQLServerStatement.getMaxRows(Unknown Source)
    at org.jboss.resource.adapter.jdbc.CachedPreparedStatement.getMaxRows(CachedPreparedStatement.java:367)
    at org.jboss.resource.adapter.jdbc.WrappedStatement.getMaxRows(WrappedStatement.java:378)
    at org.hibernate.jdbc.AbstractBatcher.closeQueryStatement(AbstractBatcher.java:272)
    at org.hibernate.jdbc.AbstractBatcher.closeQueryStatement(AbstractBatcher.java:209)
    . . . more.


In these cases, identification of the problem is tentative, and other causes of communication failure must be excluded.

Frequent process failures, or a repeatable failure of an individual Operator or Operators, likely indicate other, unrelated problems within the Process design or Orchestrator functionality.

This is a known issue and has been documented in the Troubleshooting appendix of the 4.0 and 4.1 CA Process Automation Installation Guide. That documentation is specific to MS SQL, because the MS SQL JDBC Driver lacks a feature to configure a timeout. This notice is being published because the problem affects all versions of PAM and can impact internal operations and Operators other than JDBC Operators.

Solution:

At sites where this problem has been confirmed, reconfiguring the VMWare server from an E1000 Network Interface Card driver to a VMXnet-3 NIC driver has proven to be a very effective mitigation.

CA is hesitant to declare this a complete resolution because incidents are very rare, and the time between occurrences, even with the E1000 NIC, can be quite long.

If verification of the problem is required, please contact the Support Organization for assistance setting up the logging and Java thread dumps required to troubleshoot and verify this particular issue.
