Greetings.
Some of you probably know that we have had a multitude of issues with ZDU over the last weeks. I have written about some of them here. Let's just say that at this time I am rather disappointed in the makeshift nature of the process, the vagueness of the documentation and the multitude of potential pitfalls - so disappointed that if I had my way, this company would not do ZDUs at all, only regular downtime upgrades.
Here's another thing we recently discovered. It's a comparatively minor one, but still. During a ZDU (which may be a week-long process), everything may look like it's working in "compatibility mode" - until you install a new agent. It may not work, and here's why:
Let's say you have three CP processes on ports 2217, 2219 and 2221 respectively. The UC4 agent, unfortunately, has a single point of failure in that it works like this: the agent connects to one (and only one) address and port first, the one given as "cp=" in the agent's ini file. I am going to call this the "primary CP port". It connects to this port, and only via this connection does it learn about the other CPs. You can see this when you telnet to any CP: it will send a list of the other CPs' ports as part of the binary response. The agent then "caches" these alternative connections in the [CP_LIST] section of its ini file. While people do it, I consider it bad form to short-circuit this mechanism by "pre-seeding" the [CP_LIST] with entries when installing a new agent.
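To see what a given agent actually has to fall back on, you can read both values out of its ini file. The following is only a rough sketch: the function name read_cp_settings is made up, and it assumes that "cp=" has the form host:port and that [CP_LIST] entries have the form port=host. Check both against your own agent ini files before relying on it.

```python
def read_cp_settings(ini_text):
    """Return (primary, cached) from the text of an agent ini file.

    Assumed layout (verify against your own ini files):
      cp=host:port          -> the "primary CP port"
      [CP_LIST] port=host   -> cached alternative CPs
    """
    primary = None
    cached = []
    section = None
    for raw in ini_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].upper()
            continue
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key.lower() == "cp" and primary is None:
            host, _, port = value.rpartition(":")
            if host and port.isdigit():
                primary = (host, int(port))
        elif section == "CP_LIST" and key.isdigit():
            cached.append((value, int(key)))
    return primary, cached


if __name__ == "__main__":
    sample = "[TCP/IP]\ncp=uc4host:2217\n[CP_LIST]\n2219=uc4host\n2221=uc4host\n"
    print(read_cp_settings(sample))
```

A freshly installed agent will come back with an empty cached list, which is exactly the case where everything hinges on the primary CP port.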
When you do a ZDU, Automic says (even though Automic support may claim differently at times) that you can either duplicate your processes (so you'd temporarily have six CPs), or you can split them, e.g. keep one on the old version and put two on the new version.
Let's start with the "split" scenario. We put two out of three onto the new version during the ZDU. At some point, the ZDU will stop the CP for the old version. Whichever way you do this, you will end up at some point with no CP serving the "primary CP port". At that point, all your existing agents may work (because they have cached entries in [CP_LIST]), but new agents may not - because there is simply no CP serving the one central address they are supposed to go to.
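Before blaming the firewall, it is worth checking whether anything is listening on the primary CP port at all. A minimal sketch using a plain TCP connect; the function name, the host name and the port in the usage line are placeholders from the example above, not anything Automic ships. Note that it only tells you that something accepts connections on that port, not that it is a CP of the right version.

```python
import socket


def port_is_served(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


if __name__ == "__main__":
    # Placeholder host and port; use the value from "cp=" in the agent's ini file.
    print(port_is_served("uc4host", 2217))
```

Run it from the machine the new agent is installed on, so that it takes the same network path the agent would.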
You should not simply restart the old CP that was serving this primary port after it's been terminated by the ZDU, because then that CP would also start accepting AWI connections, and at that point in the ZDU, that results in an error for those trying to log in to the AWI. So here's the take-away: you need to restart one of your new version's CPs until it settles on the "primary port" (if you wait for the port to be free, which can take some minutes due to TCP TIME_WAIT, CLOSE_WAIT, this should take just one restart, since the ports are tried in sequence). But this restart will kick AWI connections out once more.
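Why waiting for the port matters can be shown with a toy model of "the ports are tried in sequence". This is not Automic's code, only my reading of the behaviour described above; settle_port and its arguments are invented for illustration.

```python
def settle_port(configured_ports, occupied):
    """Return the first configured port that is not occupied, or None.

    Toy model of a restarting CP trying its ports in sequence.
    """
    for port in configured_ports:
        if port not in occupied:
            return port
    return None


if __name__ == "__main__":
    ports = [2217, 2219, 2221]
    # The other new CP holds 2221 and 2217 is still blocked (e.g. TIME_WAIT):
    print(settle_port(ports, {2217, 2221}))  # the restarted CP ends up on 2219
    # The same restart once 2217 has been released:
    print(settle_port(ports, {2221}))        # the restarted CP ends up on 2217
```

In the first case you would have kicked the AWI users out for nothing and would need yet another restart.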
Alternatively, you can probably reconfigure the remaining old CP to be a new CP and start it earlier than you normally would.
Duplicating your processes will also not solve the issue more elegantly. You'd duplicate your ports as well (to six ports in this example), but whichever way you turn it, the "primary CP port" will remain a single point of failure that will have to be switched between versions at some point, thus bringing with it one more disconnect of any connected user.
But yeah, at this point I think we've accepted anyway that the "Zero" in "Zero Downtime" only ever counts for the server, NOT for the AWI users, so this really is not that big of a deal once you know about it (and don't spend a while looking in firewalls for the reason why certain agents won't connect, like I did ...).
Hth.