
The day after 11.3.6

Blog Post created by webmaster_jim on Oct 16, 2014

I started writing this blog on August 15, 2014, going into our production upgrade weekend.  As it's now October, you might be asking, "Did something go wrong?"  The answer is no, not really.

 

It was a bit of a risk to go with version .0 of something, but that's what we did.  I joked that the first patch would appear around the same time as our upgrade.  As it turned out, it was a month later.  And we're back in the patch saddle again.

 

Previously, I posted lessons learned during the sandbox and development upgrades:

 

 

 

With the CA community site move to Jive, this one is a blog.  Without further ado, the lessons are:

 

  1. Derby Sux. This freeware database platform should never have been allowed to propagate to our production environment.  Within a week of go-live, we were dead in the water due to the WCC repository choking.  Though we opened a Severity 1 issue, we have never been given a satisfactory explanation of what went awry.  What I know now I should have known during the first sandbox installation.  Bad testing on our part.
  2. UNIX -> Windows Timezones.  We moved the EP from UNIX to Windows as part of cost-cutting measures.  It runs fine, but because there are subtle differences in the way these two operating systems manage timezones, we hit several problems.  I've opened this idea as a result: Better behavior with invalid timezone.  Auckland-Z. China-8. China-Z. Israel.
  3. Add More Think Time.  When you talk to a PM type, add several minutes to each time estimate to cover communications, hazardous duty latency, and biological necessities.  Just because the database patch takes 20 minutes does not mean you can start the next phase at minute 21.
  4. Progress Bar For Database Migration.  There is not one.  There should be.
  5. Do Machine Offline Parallel. To control the number of jobs that could start immediately after the upgrade was technically complete, we put all machines in an OFFLINE state before the upgrade started.  After the basic functional test was done and we wanted to bring all machines back online, we ran a script with a series of sendevent commands.  I should have run this using the (poorly) documented "-F" flag instead, so we could get to the subsequent debugging sooner.
  6. Set Up SNMP Before Starting Event Processing.  My mistake.  The plan had this step between the checkout and the "open floodgates" phases.  It should have been done immediately.  I don't think we missed any critical error messages, but since they weren't sent to the message processing system, I can't be sure.
  7. Friends? Chase is your friend. Chk_auto_up is your friend.  Autoping, I'm not sure about. cybAgent is not your friend.
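
For lesson 5, here is a minimal sketch of sending the machine-online events concurrently from a wrapper script rather than one after another.  It is not the "-F" approach mentioned above, just ordinary parallelism around the command.  The helper names (build_command, run_command, bring_online) and the worker count are assumptions made for illustration, and the sendevent invocation is assumed to take the form "sendevent -E MACH_ONLINE -N <machine>"; check it against your own installation before relying on it.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_command(machine, event="MACH_ONLINE"):
    # Assumed invocation form: sendevent -E <event> -N <machine>
    return ["sendevent", "-E", event, "-N", machine]


def run_command(cmd):
    # Run one command and return its exit code.
    # A missing binary is reported as 127 instead of raising.
    try:
        return subprocess.run(cmd, capture_output=True, text=True).returncode
    except FileNotFoundError:
        return 127


def bring_online(machines, runner=run_command, max_workers=8):
    """Send one event per machine concurrently; return {machine: exit_code}."""
    machines = list(machines)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so codes line up with machines
        codes = list(pool.map(lambda m: runner(build_command(m)), machines))
    return dict(zip(machines, codes))
```

Calling bring_online(["agent01", "agent02"]) with your real machine names (these two are made up) returns a dictionary of exit codes, so any machine with a non-zero code can be retried or investigated right away.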

 

What are your takeaways?
