CA Clarity Tuesday Tip by Shawn Moore, Sr. Principal Support Engineer for 4/19/2011
Time Slicing Stability Testing:
A few days ago, a question came up about Time Slicing: how stable is it when it's abruptly stopped or aborted by a major database failure? We've always felt that it had a strong recovery mechanism, but I wanted to put that to the test. So I conducted a series of tests to see how time slicing would recover from various failures. (The goal was to try to break the execution of the job so that it simply wouldn't run on startup.)
1) First, I ran some basic stop-and-start tests: 3 iterations of stopping the bg service during the actual creation of slices.
INFO 2011-04-18 16:51:13,500 [Dispatch Thread-4 : bg@server] niku.blobcrack (none:none:none) Processing 18 new requests.
INFO 2011-04-18 16:51:13,500 [Dispatch Thread-4 : bg@server] niku.blobcrack (none:none:none) ### Processing blobcrack.modifyTeam_set
INFO 2011-04-18 16:51:13,860 [Dispatch Thread-4 : bg@server] niku.blobcrack (none:none:none) ### Curve set size is 1000
INFO 2011-04-18 16:51:24,048 [Dispatch Thread-4 : bg@server] niku.blobcrack (none:none:none) ### Curve set size is 1000
RESULT: Upon startup, time slicing resumed as it should have.
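A quick way to confirm that slicing resumed after a restart is to scan the bg log for blobcrack activity. Here's a minimal Python sketch, assuming the log format shown in the excerpts above (the exact message text can vary by version, so treat the pattern as an assumption to adjust):

```python
import re

# Pattern based on the bg log excerpts above: blobcrack lines that
# indicate active slice processing (assumed format; adjust as needed).
ACTIVITY = re.compile(
    r"niku\.blobcrack .*?(Processing \d+ new requests|Curve set size is \d+)"
)

def slicing_activity(log_lines):
    """Return the blobcrack activity messages found in the given log lines."""
    hits = []
    for line in log_lines:
        m = ACTIVITY.search(line)
        if m:
            hits.append(m.group(1))
    return hits

# Sample lines taken from the excerpt above.
sample = [
    "INFO 2011-04-18 16:51:13,500 [Dispatch Thread-4 : bg@server] niku.blobcrack (none:none:none) Processing 18 new requests.",
    "INFO 2011-04-18 16:51:13,860 [Dispatch Thread-4 : bg@server] niku.blobcrack (none:none:none) ### Curve set size is 1000",
]
print(slicing_activity(sample))
# -> ['Processing 18 new requests', 'Curve set size is 1000']
```

If the list comes back non-empty shortly after startup, slicing has picked back up.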
2) Next I ran several iterations of stopping bg right after startup, prior to actual slice processing.
RESULT: Again, upon startup, time slicing resumed as it should have.
The logs noted the following message, which was expected.
Caused by: java.sql.SQLException: [CA Clarity][Oracle JDBC Driver]Object has been closed.
at com.ca.clarity.jdbc.base.BaseExceptions.createException(Unknown Source)
at com.ca.clarity.jdbc.base.BaseExceptions.getException(Unknown Source)
at com.ca.clarity.jdbc.base.BaseResultSet.getMetaData(Unknown Source)
at com.niku.union.persistence.PersistenceController.extractResultSet(PersistenceController.java:1586)
3) I decided to be a bit more drastic and start killing db sessions. After allowing time slicing to start processing, I killed several db sessions, which included the job scheduler.
The logs noted the following error.
ERROR 2011-04-19 15:44:46,927 [Dispatch Thread-97 : bg@server] niku.njs (none:none:none) Database error for job 5009012
com.niku.union.persistence.PersistenceException:
SQL error code: 28
Error message: [CA Clarity][Oracle JDBC Driver][Oracle]ORA-00028: your session has been killed
Executed:
update cmn_sch_jobs
set schedule_date = ?,
status_code = ?,
last_updated_date = ?,
last_updated_by = ?
where id = ?
and status_code != ?
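The two errors captured so far suggest a simple triage: the "Object has been closed" exception was expected noise from stopping bg mid-processing (test 2), while ORA-00028 (test 3) meant the Job Scheduler needed a manual restart. This is not a product API, just a hedged sketch of that triage based on the messages logged above:

```python
import re

ORA_CODE = re.compile(r"ORA-(\d{5})")

def triage(message):
    """Map a bg error message to a suggested next step (sketch, not product code)."""
    # Seen when bg was stopped right after startup (test 2); expected.
    if "Object has been closed" in message:
        return "expected when bg is stopped mid-startup; safe to ignore"
    m = ORA_CODE.search(message)
    if m is None:
        return "no ORA- code found; check the full stack trace"
    if m.group(1) == "00028":
        # Session killed (test 3): the Job Scheduler will not restart
        # on its own and needs a manual stop/start.
        return "session killed; manually restart the bg Job Scheduler"
    return "ORA-" + m.group(1) + ": investigate before restarting"

print(triage("[CA Clarity][Oracle JDBC Driver][Oracle]ORA-00028: your session has been killed"))
# -> session killed; manually restart the bg Job Scheduler
```

Any other ORA- code falls through to "investigate first", which matches the rule of thumb at the end of this tip.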
4) The Job Scheduler did not automatically restart. (This is known behavior; after a db failure, the bg service needs to be restarted.)
5) I then manually stopped and started the Job Scheduler to bring it back online.
6) I observed that slice processing had continued.
7) I decided to do one more test and cancel the job after it had failed. I first allowed time slicing to start processing (after resetting slicing), then killed several db sessions. Again, I ended up killing the job scheduler.
.
.
INFO 2011-04-19 15:54:50,153 [Dispatch Thread-8 : bg@server] niku.blobcrack (none:none:none) ### Curve set size is 1000
INFO 2011-04-19 15:55:39,826 [Dispatch Thread-8 : bg@server] niku.blobcrack (none:none:none) ### Curve set size is 1000
ERROR 2011-04-19 15:56:17,217 [Dispatch Thread-8 : bg@server] niku.blobcrack (none:none:none) Exception during blobcrack process
java.sql.SQLException: [CA Clarity][Oracle JDBC Driver]No more data available to read.
at com.ca.clarity.jdbc.base.BaseExceptions.createException(Unknown Source)
at com.ca.clarity.jdbc.base.BaseExceptions.getException(Unknown Source)
at com.ca.clarity.jdbc.base.BaseExceptions.getException(Unknown Source)
at com.ca.clarity.jdbc.oracle.net8.OracleNet8NSPTDAPacket.sendRequest(Unknown Source)
at com.ca.clarity.jdbc.oracle.OracleImplConnection.rollbackTransaction(Unknown Source)
8) Observed that the Job Scheduler did not restart.
9) Fired up the app service (I didn't want to accidentally kill the process).
10) Observed the job from the Clarity UI.
11) Cancelled the job. (Technically I shouldn't have to do this; it should just start up again as in step 6. But I wanted to introduce this as a factor, because some users will cancel a job after a failure.)
12) Finally, I created a new immediate-mode, single-run Time Slicing job.
13) Within a minute, the job started up and began processing.
The lesson to be learned from this exercise: the Time Slicing job recovers very well and will continue running unless there is a major db failure. In that case, you'll need to restart the bg service to get it working again.
Rule of thumb: give the job some time to start processing again, unless you know you had a database failure. Chances are the job will recover nicely.
Shawn Moore
CA Technologies
P.S. There is one way I know of to get the job into a stuck state: perform a hot backup during a period of significant db activity, then restore that backup. What may happen is that the job processing tables end up out of sync. This can almost always be fixed by simply canceling and deleting the offending job.