Tip: Snapshots Rule, Okay?

1. Tip: Snapshots Rule, Okay?

Kyle_R
Posted May 02, 2014 12:41 AM

Hello Everyone,

A quick story today of a case where things went right.

We've all had this general scenario. We're making a change on the system, a familiar change, one that has been performed successfully dozens, if not hundreds of times before without incident . . . until this time. Now you're looking at the smouldering ruins of the system in question, and wondering what went wrong.

This happened to a customer this week.
A table and some columns were added to Service Desk Manager via Web Screen Painter. This was put through on a test system before trying it in production, even though it was a familiar change. Just like we all do.
A run sheet was used to make the same change on production. At this point Web Screen Painter hung when using the "test". The reason is unknown, but it was killed with Task Manager, and the usual pdm_publish then ran without errors or incident. All good, right?

Service Desk Manager did not start up fully and was producing masses of new log entries.

They contacted CA Support, where we quickly identified that something had gone wrong during the WSP changes. What? Don't know. But all of the errors clearly pointed to there being a problem with consistency of the customisations and other parts of the system.

At this point, with a production-down system, there are two main choices:

Restore to a backup.

Continue on and try to understand and undo whatever damage has been done.

Now so far, this site has done many things right:

Trial changes on a test system.

Document what the changes are and how to implement them.

Go through a change control process and schedule an outage for one known change.

Get further help when needed, without digging the hole deeper.

But the main thing that they did right? Even though it was a simple change, they:

Took a virtual machine snapshot before implementing the change.

This gave the option of a quick, clean restore of the SDM system. The SQL database needed to have the newly added table and columns removed manually, but fortunately this was a trivial task.

You don't have to restore from a backup, but if one isn't available, you don't even have the option.

I'll spare you the war stories where the backup is not available, or is not usable.

The long and the short of it is: be sure that you can reverse changes on a production system.
If you don't have snapshots, then take conventional backups of the file system, the database and anything else likely to change, such as operating system variables.

Or, if they are one-way changes, have a plan to minimise and deal with any consequences.
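
For anyone who likes to put this in their run sheet, here is a rough sketch of a pre-change gate in Python. It is only an illustration of the advice above, not anything that ships with Service Desk Manager: the RollbackPlan fields and the missing_rollback_items function are hypothetical names, and the rule that a database needs either a verified backup or a documented manual undo is my own assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RollbackPlan:
    """Hypothetical checklist of rollback options confirmed before a change."""
    vm_snapshot_taken: bool = False
    filesystem_backup_verified: bool = False
    os_variables_recorded: bool = False
    database_backup_verified: bool = False
    database_undo_documented: bool = False  # e.g. manual steps to drop new table/columns


def missing_rollback_items(plan: RollbackPlan) -> List[str]:
    """Return what still blocks the change; an empty list means it can proceed."""
    missing = []

    # Assumption: the server is reversible via a snapshot, or via
    # conventional backups plus a record of operating system variables.
    server_covered = plan.vm_snapshot_taken or (
        plan.filesystem_backup_verified and plan.os_variables_recorded
    )
    if not server_covered:
        missing.append("server: no snapshot, and no file system backup with OS variables")

    # Assumption: the database is reversible via a verified backup,
    # or via a documented manual undo of the schema change.
    if not (plan.database_backup_verified or plan.database_undo_documented):
        missing.append("database: no verified backup and no documented undo")

    return missing


if __name__ == "__main__":
    plan = RollbackPlan(vm_snapshot_taken=True, database_undo_documented=True)
    problems = missing_rollback_items(plan)
    print("OK to proceed" if not problems else "Stop: " + "; ".join(problems))
```

The point is simply that the check runs before the change starts, so a missing rollback option is found while it can still be fixed.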

Prepare for the best and plan for the worst.

Thanks, Kyle_R.
2. RE: Tip: Snapshots Rule, Okay?

System
Posted May 06, 2014 04:22 PM

Thanks for sharing this story with the community Kyle!

CA Service Management