Oracle DCD (Dead Connection Detection) and Workload Automation AE (Autosys) / Workload Control Center

Document created by hebsh01 (Employee) on Aug 21, 2014. Last modified by Lenn Thompson on Apr 20, 2016.
Version 2

Problem:  Frequent disconnects between WAAE (Autosys) or the WCC and Oracle database

 

CAUAJM_E_18416 Event Server: <PFWLAR>  Failed Query: <BEGIN :RetVal := ujo_get_jobstart_depends_pkg.ujo_get_jobstart_depends (:I_joid, :I_cond_job_name, :I_autoserv, :I_mode, :B_dep_joid, :B_joid, :B_priority, :B_job_ver, :B_over_num, :Done); END; <<3635,'','',4>>>

CAUAJM_E_18402 ORA-03135: connection lost contact

CAUAJM_W_10900 The database monitoring system has detected a potential problem with the database.

CAUAJM_I_10901 The database monitoring system is beginning validation of database connections.

CAUAJM_E_18402 ORA-03135: connection lost contact

CAUAJM_E_18400 An error has occurred while interfacing with ORACLE.

 

Event Server / Oracle DB configuration:

DB server and Autosys server are on two different subnets and there is a firewall in the middle.

There could be a period of time when there is no WAAE activity (no jobs being processed), during which the firewall drops the database connection between Oracle and WAAE. WAAE then attempts to start a job, finds that the database connection is dead, and has to make a new connection.

 

 

Note: WAAE (Autosys) opens persistent connections to the DB at startup.  These connections are never killed by Autosys.

What you are experiencing is outside of Autosys and something external is killing these connections.

 

This could be a firewall that is killing INACTIVE sqlnet connections after a set amount of time.

 

Resolution:

1. Stop WAAE services

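2. Set SQLNET.EXPIRE_TIME = 30 in the sqlnet.ora file on the database server. The value is in minutes and should be lower than the firewall's idle timeout.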

3. Bounce the listener.

4. Start WAAE services
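
To confirm that step 2 took effect, the value can be read back from the file. The following is a minimal Python sketch, not a CA or Oracle tool. The function name read_expire_time and the sample file content are hypothetical, and the sketch assumes the simple NAME = value form of sqlnet.ora entries with # comments.

```python
import re
from typing import Optional

def read_expire_time(sqlnet_text: str) -> Optional[int]:
    """Return SQLNET.EXPIRE_TIME in minutes, or None if it is not set."""
    pattern = re.compile(r"^\s*SQLNET\.EXPIRE_TIME\s*=\s*(\d+)\s*$", re.IGNORECASE)
    value = None
    for line in sqlnet_text.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments
        match = pattern.match(line)
        if match:
            # Assumption of this sketch: if the parameter appears more
            # than once, the last occurrence is reported.
            value = int(match.group(1))
    return value

# Hypothetical sqlnet.ora content, for illustration only.
sample = """
# sample sqlnet.ora
NAMES.DIRECTORY_PATH = (TNSNAMES, EZCONNECT)
SQLNET.EXPIRE_TIME = 30
"""
print(read_expire_time(sample))
```

In practice the text would come from the sqlnet.ora file on the database server rather than from an inline string.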

 

Here is an Oracle document that explains this issue and a couple of ways to resolve it.

 

The 3rd party explanation below is published here.

Resolving Problems with Connection Idle Timeout With Firewall

 

An Overview

 

Firewalls (FW) have become common in today's networks to protect the network environment. A firewall recognizes the TCP protocol and records the client and server socket endpoints. It also recognizes TCP connection closure and then releases the resources allocated for tracking the open connection. For every endpoint pair, the firewall must allocate some resources (possibly small).

 

When the client or server closes the communication, it sends a TCP FIN packet; this is a normal socket closure. However, it is not uncommon for client/server communication to end abruptly without the endpoints being closed properly with a FIN packet, for example when the client or server crashes, is powered down, or a network error prevents the closure packet from reaching the other end. In those cases, the firewall does not know that the endpoints will no longer use the open channel. As a passive intermediary, it has no way to determine whether the endpoints are still active. It is not possible to maintain resources forever, and keeping a port open for an indefinite time is also a security threat. So the firewall imposes a BLACKOUT on connections that stay idle for a predefined amount of time.
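
The idle blackout described above can be illustrated with a toy model. This is only a sketch of the idea, not how any particular firewall is implemented; the class FirewallTable, its method names, the connection label, and the 60-minute timeout are all assumptions made for illustration.

```python
class FirewallTable:
    """Toy model of a stateful firewall's connection table (times in minutes)."""

    def __init__(self, idle_timeout: float) -> None:
        self.idle_timeout = idle_timeout
        self.last_seen = {}  # connection id -> time of the last packet seen

    def _expire(self, now: float) -> None:
        # Silently forget connections that have been idle for longer than
        # the timeout; neither endpoint is told about it.
        stale = [c for c, t in self.last_seen.items()
                 if now - t > self.idle_timeout]
        for conn in stale:
            del self.last_seen[conn]

    def track(self, conn: str, now: float) -> None:
        """Record a newly opened connection."""
        self.last_seen[conn] = now

    def packet(self, conn: str, now: float) -> bool:
        """Return True if the packet is forwarded, False if it is dropped."""
        self._expire(now)
        if conn not in self.last_seen:
            return False  # blacked out
        self.last_seen[conn] = now  # any forwarded packet resets the idle clock
        return True

fw = FirewallTable(idle_timeout=60)
fw.track("waae->oracle", now=0)
print(fw.packet("waae->oracle", now=30))   # still within the idle limit
print(fw.packet("waae->oracle", now=120))  # 90 minutes idle since the last packet
```

The second packet is dropped even though both endpoints are alive, which is the situation WAAE runs into after a quiet period.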

 

Initially, FWs were designed to protect application servers and the network, and then to protect client/server connections. With these in mind, a time-out in terms of hours (1 hour is the default for most FWs) is reasonable. With the advent of more complex security schemes, FWs sit not only between client and server, but also between different application servers (intranet, demilitarized zone (DMZ), and such) and database servers. So an idle-time horizon of 1 hour may not be appropriate for communication between servers.

 

Idle connections can be expected from an application server. One case is J2EE using pooled JDBC connections. The pool usually returns the first available connection to the requester, so the first connections in the pool list are the most likely to be active. The last ones, at the end of the list, are used only at peak loads and are inactive most of the time.

 

Other cases are the connections established from an HTTP Server, either SQL connections from mod_plsql or AJP connections from mod_oc4j.

 

Blackout

 

One inconvenience of these blackouts is that they are passive. Neither endpoint is notified that the communication has been banned. Only when the client or server tries to contact its peer does it learn that the connection to the peer is no longer active and the communication has already been broken.

 

The worst of all scenarios involves the so-called passive listeners, which will never know. Passive listeners are processes at an endpoint that simply wait for commands to arrive from the other end. A typical example is the backend database server processes, which read from the socket looking for new SQL statements to execute and, after the request is answered, return to their passive state. When a blackout occurs, they will stay in this reading state forever, unless one of the following techniques is applied.

 

Resolving problems with connection idle time-out

 

TCP KeepAlive

 

You can enable the TCP KeepAlive option at the operating system (OS) level. Once TCP KeepAlive is enabled and configured, a small probe packet is sent to the other end after every predefined interval of inactivity, and an ACK is expected from the other end. An ACK is returned only when the other end is alive and reachable. If no ACK is returned, then after some retries the OS closes the endpoint and releases the resources allocated for it. The application listening on that socket receives the error, so it can take the necessary action upon receiving the error signal from the OS.

 

When a communication is blacked out by the firewall, the probe will not reach the other end, so the OS will close the socket endpoint and the application will be notified of the exception.

 

The steps to configure TCP KeepAlive depend on the specific operating system; refer to the appropriate OS documentation.

It is common to enable the TCP KeepAlive option at the server end. Because the server is the one that holds many resources for a communication, if any communication is broken, those resources at the server are released rather than held indefinitely. By default, TCP KeepAlive is not enabled at the OS.

 

TCP KeepAlive settings at the OS level apply to all network applications running on that particular operating system.
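
For applications that manage their own sockets, keepalive is typically switched on per socket with the SO_KEEPALIVE option. The following is a minimal Python sketch of that idea. The helper name enable_keepalive and the timer values (600 s idle, 60 s between probes, 5 probes) are illustrative assumptions, not recommendations from this document, and the per-socket timer options are platform-specific.

```python
import socket

def enable_keepalive(sock: socket.socket, idle_s: int = 600,
                     interval_s: int = 60, probes: int = 5) -> None:
    """Turn on TCP keepalive for one socket; tune timers where the OS allows."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket timer options below are platform-specific (they are
    # present on Linux); where one is missing, the OS-wide default applies.
    for name, value in (("TCP_KEEPIDLE", idle_s),
                        ("TCP_KEEPINTVL", interval_s),
                        ("TCP_KEEPCNT", probes)):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
# A non-zero value means keepalive is enabled on this socket.
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
sock.close()
```

For the idle interval to help against a firewall blackout, it would need to be shorter than the firewall's idle timeout.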

 

DCD for Database Servers

 

For database connections, one of the endpoints is a passive listener, either a dedicated server process or a dispatcher process. If the connection becomes blacked out, this backend will never know that the client cannot send any more requests, and it will keep holding important resources such as database sessions, locks, and, at the least, a file descriptor used for maintaining the socket.

 

A solution is to make this backend less passive by using DCD (Dead Connection Detection) to determine whether communication is still possible.

 

Simply set SQLNET.EXPIRE_TIME=10 (10 minutes, for example) in $ORACLE_HOME/network/admin/sqlnet.ora on the server side. With this parameter in place, after 10 minutes of inactivity the server sends a small 10-byte probe packet to the client. If this packet is not acknowledged, the connection is closed and the associated resources are released.

 

DCD provides two benefits:

 

1. If SQLNET.EXPIRE_TIME is less than the FW connection idle time-out, the firewall will consider this packet as activity, and the idle time-out (firewall blackout) will never happen as long as both the client and the server processes are alive.

 

2. If SQLNET.EXPIRE_TIME is higher (let's say a little higher) than the FW idle limit, then shortly after the blackout happens, when the next probe goes unanswered, the RDBMS will find out and close the connection.

 

The first case is recommended when the connection comes from another application server, and the second makes sense for client applications.
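
The two cases above come down to comparing two numbers. The following minimal Python sketch shows that comparison; the function name dcd_effect and the returned labels are hypothetical, both values are in minutes, and treating an equal value as the second case is an assumption of the sketch.

```python
def dcd_effect(expire_time_min: int, fw_idle_timeout_min: int) -> str:
    """Describe which of the two DCD benefits applies (values in minutes)."""
    if expire_time_min <= 0:
        # SQLNET.EXPIRE_TIME = 0 means DCD probes are not sent.
        return "DCD disabled"
    if expire_time_min < fw_idle_timeout_min:
        # Case 1: probes arrive before the firewall's idle limit is reached.
        return "keeps connection alive"
    # Case 2: the blackout happens first; a later probe exposes the dead link.
    return "cleans up after blackout"

print(dcd_effect(30, 60))
print(dcd_effect(70, 60))
```

For WAAE, which connects as an application server and holds persistent connections, the first case is the one the Resolution section aims for.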

 

DCD works at the application level, on top of the TCP/IP protocol. If you have set SQLNET.EXPIRE_TIME=10, do not expect connections to be closed exactly 10 minutes after the blackout or network outage; the TCP timeout and TCP retransmission values also add to this time.

Please note that some recent firewalls may not see DCD packets as valid traffic, in which case DCD may not be useful. In that case, the firewall timeout should be increased, or users should not leave the application idle for longer than the idle time-out configured on the firewall.
