PSA: Experience with ZDU Update Part II, or: Tale of a confused system (with call to Automic)

Discussion created by Carsten_Schmitz on Feb 1, 2018
Latest reply on Sep 13, 2018 by RobertLenz609838
So, as you know, we did the small ZDU update from 12.1.0 to 12.1.1 yesterday, and it didn't go all that great.

Well, this morning, my colleague was greeted by reports about the environment being unstable. Most agents had disconnected. When I tried to log in, there was a flurry of log messages about forced traces (U00015006 System forced memory trace dump, dozens of times in a row). The engine was writing trace files almost constantly, and there was a multitude of Java exceptions from various frameworks in the Tomcat log (not sure if related). There were also the familiar messages "ORA-02289: sequence does not exist" and "U00003316 Zero Downtime information: MixedMode='N', base MQSet='2', active MQSet='2', own MQSet='2', MQSet PWP='2'".
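For anyone facing a similar flood of messages, a small script can show which message numbers dominate a log. This is only a sketch: the function name is made up for illustration, and it assumes that Automic message numbers appear as "U" followed by eight digits and Oracle errors as "ORA-" followed by five digits, which matches the messages quoted above but is not taken from any official format description.

```python
import re
from collections import Counter

# Assumed ID shapes, based only on the messages quoted in this post:
# Automic numbers like U00015006, Oracle errors like ORA-02289.
MESSAGE_ID = re.compile(r"\b(U\d{8}|ORA-\d{5})\b")


def count_message_ids(lines):
    """Count how often each message ID occurs in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        counts.update(MESSAGE_ID.findall(line))
    return counts


if __name__ == "__main__":
    # Sample lines modelled on the messages quoted in the post.
    sample = [
        "U00015006 System forced memory trace dump",
        "U00015006 System forced memory trace dump",
        "ORA-02289: sequence does not exist",
        "U00003316 Zero Downtime information: MixedMode='N', base MQSet='2'",
    ]
    # Print the most frequent message IDs first.
    for msg_id, n in count_message_ids(sample).most_common():
        print(f"{n:5d}  {msg_id}")
```

To run it against a real log, pass an open file object instead of the sample list; the file name and location depend on your installation.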


When I decided to restart the entire system, the workers on the second server would die after a while. I found that once again, just as had happened with the update to 12.1.0, an Oracle sequence was missing: SQ_MQ2CP006.

I created the sequence (create sequence SQ_MQ2CP006;), restarted all processes, and the problems and flurry of error messages in the UC4 server logs went away.
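If you want to check for this before restarting anything, the comparison can be sketched as below. The helper names are hypothetical, and the list of expected sequences is something you would have to supply yourself; the only name I know for certain is SQ_MQ2CP006. The existing names would come from a query such as "select sequence_name from user_sequences" run as the schema owner. The generated statements mirror the bare "create sequence" I used; whether default sequence options are what the product actually expects is an assumption.

```python
def missing_sequences(expected, existing):
    """Return the expected sequence names that are absent from existing.

    The comparison ignores case, and the order of `expected` is preserved.
    """
    have = {name.upper() for name in existing}
    return [name for name in expected if name.upper() not in have]


def create_statements(names):
    """Build one bare 'create sequence' statement per name (default options)."""
    return [f"create sequence {name};" for name in names]


if __name__ == "__main__":
    # Only SQ_MQ2CP006 comes from this post; extend the list for your system.
    expected = ["SQ_MQ2CP006"]
    # Hypothetical result of: select sequence_name from user_sequences
    existing = ["SQ_EXAMPLE_ONLY"]
    for statement in create_statements(missing_sequences(expected, existing)):
        print(statement)
```

Review the printed statements before running them; this only automates the check I did by hand.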

I suspect that the SQL to generate sequence SQ_MQ2CP006 failed, just as the same kind of SQL failed with the update to 12.1.0. This probably caused the ZDU update to lose track, which is possibly also why my colleague who performed it saw a ZDU Wizard that had lost track of its current "step" and was all greyed out. Since I manually created the sequence and restarted all processes, everything has been fine.

The message "U00003316 Zero Downtime information: MixedMode='N', base MQSet='2', active MQSet='2'" has not appeared since that exact moment, leading me to believe that the engine was still stuck in "ZDU mode" because of its own previous failure to create the sequence, and that now that the sequence has been created manually, it is back on track.

Maybe this problem is also related to PRB00138707:
I now humbly ask someone from Automic who feels responsible for the product whether they can take this information and, based on this forum post, pass it on to someone who will react to it properly. Please let me know in a reply if you can do this.

If I don't hear anything here, I might go ahead and file an incident later this week, but I might not, because 1) I am not very inclined to file formal incidents trying to explain complex, probably not readily reproducible issues given the current state of incident handling, and 2) I already have way too many open incidents for issues with V12.1 to be able to deal with the open and ongoing problems (or other types of user reports) we currently see. And 3), there are way too many things that should have been caught by the QA of a company, especially one that prides itself on Release Management. Frankly, I'm growing tired of analyzing problems with a product that should be more stable than it is.

But then, I suspect that if left alone, this will sooner or later explode in other customers' faces, too.