Peter WIRFS

How to protect AWA from difficult recoveries?

Discussion created by Peter WIRFS on May 23, 2017
Latest reply on Jun 1, 2017 by AlainMoisy
Last week we had a datacenter power outage for 6 hours (long story). It was a perfect storm: the outage started at 6:30pm and lasted past midnight. Our production batch window activates at 7pm and runs until midnight, and our schedule objects recycle at midnight. Recovery wasn't pretty. I pulled an auto-forecast report, and several of us sat there manually starting objects until 4am. (There were probably no more than 50-75 that needed starting. We are a small shop.) Many, but not all, of our date parameters had to be adjusted backward 1 day. And the following night some of our external dependencies were satisfied too early, because the manually recovered objects had been activated that morning in the same logical day.

So we are now discussing multiple ideas to make recoveries like this one easier. (Rolls eyes. When will this sort of thing ever occur again?) Here are some of the ideas we are working on, and I'm curious whether anyone can think of an idea we haven't already considered.

(1) We are re-programming all of our external dependencies that use "same logical day" so they will no longer be satisfied early on the following night's run. (There are fewer than 20 to be re-programmed.)

(2) I plan to automate the generation of an auto-forecast report and email it to the staff every day.  (I really wish it could indicate predecessor relationships!)

(3) We are considering setting up our own "business day" static date variable that would be re-calculated at the beginning of a new business day.  This would replace many of our uses of current system date.

(4) We have discussed moving our schedule recycle time forward from midnight, but there seem to be too many cons to doing so.  And it wouldn't eliminate the recycle time issue; it would just be moved to a different time of day.

(5) Have batch activate earlier than 7pm, but not actually start running until 7pm.  (One drawback is estimated wall times would be artificially inflated.)

(6) Start running batch before 7pm to give us more buffer until midnight. (Not a popular idea because some of our batch systems cannot co-exist with our online systems.)

(7) We considered creating and maintaining recovery versions of our multiple schedule objects.  But there would be too much maintenance effort.
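
To make ideas (1) and (3) a bit more concrete, here is a rough sketch of the "business day" logic in Python. It is not AWA script and does not use any AWA API. The 7pm cutover comes from our batch window, but the rule itself is an assumption: anything that runs before 7pm is treated as belonging to the batch window that opened the previous evening. The names `BATCH_START`, `business_date` and `dependency_satisfied` are made up for illustration, and the dates in the usage lines are arbitrary.

```python
from datetime import date, datetime, time, timedelta

# Hypothetical cutover: the batch window described above opens at 7pm.
BATCH_START = time(19, 0)


def business_date(now: datetime, batch_start: time = BATCH_START) -> date:
    """Return the business day that 'now' belongs to.

    Assumed rule: a run before batch_start still belongs to the batch
    window that opened the previous evening, so an object recovered at
    3am keeps yesterday's date instead of the current system date.
    """
    if now.time() < batch_start:
        return now.date() - timedelta(days=1)
    return now.date()


def dependency_satisfied(predecessor_end: datetime, now: datetime) -> bool:
    """True only if the predecessor ended in the same business day as 'now'.

    This replaces a plain calendar-day comparison, which is what let the
    manually recovered objects satisfy the next night's dependencies.
    """
    return business_date(predecessor_end) == business_date(now)


# An object recovered at 3am should not satisfy that evening's 8pm run.
recovered = datetime(2017, 5, 17, 3, 0)
next_night = datetime(2017, 5, 17, 20, 0)
print(business_date(recovered))                      # business day of the recovered run
print(dependency_satisfied(recovered, next_night))   # dependency check for the next night
```

With a rule like this, date parameters taken from the business day would not have needed the manual one-day adjustment during recovery, and a predecessor that finished after midnight would still count for a successor in the same batch window.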




Thank you for your voluntary time spent thinking about this... Pete
