CA Infrastructure Management Blog, by Steve Harvey

One of my challenges on nearly every project is the almost inevitable conversation about resilience. You see, the duet between high availability (HA) and disaster recovery (DR) is like modern jazz: complex, elusive and hard to define.

 

All customers understand the importance of their platform: It must be available very close to 100% of the time, and if a disaster occurs, it needs to be revived very quickly. What organisations may not have considered in any real detail is precisely how critical the platform is, what kinds of disasters are likely to strike, and how to recover from disaster.

 

Let’s get something straight from the start: The logistics of HA and DR have little to do with technology. Rather, they are driven more by business needs and operational requirements. If organisations start with this in mind and leave technology until later, everything will work out better in the end.

 

Let me lead you through the process.

 

Your first task when aiming for resilience is to examine your business needs and understand the impact of platform unavailability for various lengths of time: ten minutes, an hour, a day, a week or more. What will you lose operationally, contractually and monetarily in each scenario? What are other potential business impacts? Only when you understand these issues can you formulate a plan to ensure that your platform meets your needs.

 

High Availability

What should emerge from this analysis are definitions of tolerable outage, data loss and recovery time. For instance, HA is typically expressed as a percentage, often 99% or higher, usually computed with a downtime calculator, which you can find online. (By the way, the data used to compute this percentage is probably more useful to the architect than the percentage—which is really just a convenient number to plug into presentations.)

 

Why is the data more useful than the percentage? Well, outages rear their ugly heads in many forms, and downtime duration varies. So if you have a 99.9% availability target, that translates to an outage of 8 hours, 45 minutes and 36 seconds each year, or 1 minute, 26 seconds per day. But those figures are quite useless, unless you can predict (and you cannot!) that you will have only one outage that won’t exceed 8:45:36.
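To make that arithmetic concrete, here is a minimal sketch of the kind of downtime calculator those figures come from (the function name and layout are my own, not from any particular online tool):

```python
SECONDS_PER_DAY = 24 * 3600
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

def downtime_allowance(availability_pct: float) -> dict:
    """Convert an availability target into the downtime it permits."""
    down_fraction = 1 - availability_pct / 100
    return {
        "per_year_s": down_fraction * SECONDS_PER_YEAR,
        "per_day_s": down_fraction * SECONDS_PER_DAY,
    }

allowance = downtime_allowance(99.9)
# 99.9% availability permits 31,536 s/year (8 h 45 m 36 s)
# and roughly 86.4 s/day (about 1 m 26 s)
```

The output is just a budget of seconds; as the post argues, how that budget is consumed (one long outage or many short ones) is what actually matters.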

 

That’s where scenario-building comes in very handy. Perhaps you anticipate a couple of complete outages and a few partial outages. For a partial outage, can you divide outage duration by the overall effect? In other words, does a one-hour outage affecting 10% of users translate to 6 minutes (10% of an hour)? Whatever data is available, allow for several scenarios. Previous experience of the solution or platform should help you do that; even for new systems, vendors like CA have data about reliability and potential threats to availability. What you should aim to end up with is a matrix of possible failures, with downtime attributed to each, and a recovery plan for each failure.
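As a sketch of what such a failure matrix might look like in code (every scenario figure below is an illustrative assumption, not data from the post), applying the "one-hour outage hitting 10% of users counts as 6 minutes" weighting:

```python
# Each scenario: (name, expected occurrences per year,
# duration in minutes, fraction of users affected).
scenarios = [
    ("full outage",      2, 30, 1.0),
    ("partial outage",   4, 60, 0.1),
    ("degraded service", 6, 15, 0.25),
]

def effective_downtime_minutes(scenarios):
    # Weight each outage's duration by the share of users it affects.
    return sum(count * minutes * impact
               for _, count, minutes, impact in scenarios)

total = effective_downtime_minutes(scenarios)          # minutes per year
availability = 100 * (1 - total / (365 * 24 * 60))     # implied percentage
```

With these assumed figures the scenarios total 106.5 effective minutes per year, implying roughly 99.98% availability; swapping in your own failure matrix turns the percentage back into something an architect can defend.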

 

At that point, you have something you can apply to the technology and begin to build resilience.

 

So is that it? Actually, no. You may find that the technology doesn’t support what you need to achieve or that the cost far outweighs the benefit. Now it’s time to rework the scenarios until you arrive at an availability percentage that is technically and financially viable but still meets business needs.

 

Disaster Recovery

DR is a different beast altogether: The possibilities are endless and the cost/benefit curve can get very steep very quickly. My first question to customers is, “What disasters do you anticipate?” Predictably, a typical answer is “A plane hitting the building”—an unfortunately understandable, but not very useful, response. Nevertheless, the question provokes meaningful discussion.

 

DR is so much more than replicating data in a second data centre—a frequently stated goal that is often not the best or most effective answer. As with HA, look at the business case and identify the recovery time objective (RTO) and recovery point objective (RPO), two essential parameters of a disaster recovery plan (DRP).

 

Don’t immediately accept the first RTO and RPO handed to you: Make stakeholders justify them. Only then can you start to create a DRP. You may find that a second data centre is not necessary—it may be more effective to build from backups in a recovered data centre or a cloud environment. Having collateral sitting around doing nothing while waiting for a disaster can be expensive and is often unnecessary.
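As a minimal sketch of how RTO and RPO constrain a DRP once stakeholders have justified them (the four-hour RTO, one-hour RPO and other figures here are hypothetical):

```python
from datetime import timedelta

# Hypothetical, stakeholder-justified objectives.
rto = timedelta(hours=4)   # max tolerable time to restore service
rpo = timedelta(hours=1)   # max tolerable data loss

# Characteristics of a candidate plan: rebuild from backups
# in a recovered data centre or cloud, no warm second site.
backup_interval = timedelta(minutes=30)   # worst-case data loss
estimated_restore = timedelta(hours=3)    # time to rebuild from backups

meets_rpo = backup_interval <= rpo
meets_rto = estimated_restore <= rto
viable = meets_rpo and meets_rto
```

If a backup-based rebuild fits inside both objectives, as it does with these assumed numbers, the expensive standby data centre may indeed be unnecessary.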

 

So, I hope the message is clear. Put away thoughts of technology and think about your business objectives first. That will likely give you an easier and cheaper route to resilience—and that would be music to your ears.

 

Inquiring minds want to know: What business objectives are presenting challenges to you now? Let us know!

I confess: I’m a poacher turned gamekeeper. For those of you not familiar with this rather English idiom, it means that I was once the perp, but now I’m the cop.

 

In my formative years, when I was keen to impress any way I could, solutions to problems would just spring from any coding language I thought appropriate at the time. I’ve probably left thousands of lines of Shell, Java, Perl and PHP script in my wake, with nary a care for future support or development.

 

Often, when we coders are in a tight corner or perhaps not familiar with current trends or available tools, the temptation to write our own solution is compelling. We say, “I have knowledge of the problem” and “I have the skills.” What we may not have is a view to where our solution may go, the pressures that will be brought to bear on us and the solution in the future, and the fate of others who have to sail in our wake.

 

For me, the epiphany was an integration I wrote to solve a problem between two systems that would not normally be connected. Neither vendor had considered integrating with the other, and there was no out-of-the-box solution. That was great for me: eight weeks of playing with code was a joy, and I completed the job to specification.

 

Returning six months later to resolve a problem, I noticed significant changes to the code base. What had happened in the meantime was twofold. First, unanticipated new features had been added; second, these new features had been added by other coders. The result was something of a mess: My once elegant solution had become not only an organically grown mess but also a support nightmare, hard to extend and creaking at the seams.

 

In a more controlled development environment, these support and extensibility factors would have been addressed from the start and delivered as an agile code stream that could be useful for years to come. In a less controlled environment, that usually doesn’t happen.

 

These days, I’m inclined to block this kind of quick coding customisation activity unless I’m convinced there’s no other solution and that the customisation will not require huge future development or excessive support. Another criterion for success is that the customisation must be very well documented and understood by future parties—and that usually means it has to be simple.

 

I recognise that it’s not always possible to meet these criteria—sometimes you just need to automate or glue things together, and that often requires some sophisticated code. Products don’t always go in the direction you need them to, and sometimes specific local requirements need a solution outside of the vendor product’s capability.

 

So what’s the answer? Well, CA Automic (and CA IT PAM before it) enables programmers’ flair for creativity, letting us solve problems ourselves whilst providing a controlled and supportable environment with tools that accelerate development, add security and enable logging. These tools offer much more than just point integration or simple orchestration solutions. They should form the backbone of an environment and be the first port of call when we seek to automate or integrate.

 

My advice is this: To let your creativity be burden free and build supportable solutions for the future with confidence, set yourself up for success by using the right tools.

 

Happy coding!

I recently came across a project requirement calling for an encrypted version of a VOIP protocol; I couldn’t fulfill the request, because the software didn’t support encryption of the protocol. However, I was able to resolve the issue by asking a simple question: “Is encryption an actual security requirement or is the possibility of encryption driving the request?” The latter was the case, and the issue was resolved.

 

That scenario got me thinking about a similar experience I had a few years ago, before I joined CA, that also illustrates how we can get so enticed by the lure of technical possibilities that we abandon practicality and common sense. I often tell the following anecdote because it shows how we can fall victim to techno-think that impedes—rather than facilitates—success.

 

I was asked to join a project that had been running for quite some time and had hit the buffers because a key requirement had not been met. The customer had requested a GUI refresh rate of five seconds for a monitoring tool, whereas some 99% of customers are content with the default refresh rate of 60 seconds.

 

Five seconds simply wasn’t practical—or even possible, what with screen flashing/flickering and resource issues lurking in every corner. Nevertheless, the team had valiantly tried everything they could to make it work, to no avail.

 

My boss at the time asked me to see what I could do. Knowing that the chances of fixing this issue were very low, I had to think of a different approach. At my initial meeting with the team, the PM explained that the specs insisted on the five-second refresh rate. And there it was in black and white: a standard templated document with many boxes to fill, one of which said “Screen refresh rate: every 5 seconds.”

 

I said I would like to contact the architect so that I could understand the reason for the request. I was told he was a radar specialist and had returned to another unit of the organisation; they would call him for me.

 

That bit of information about the architect led me to put 2 and 2 together: At the time, typical CRT radar screens had a refresh rate of 5 seconds. I surmised that this was why the architect set this requirement, and that the requirement did not reflect the monitoring team’s operational needs. The architect confirmed my suspicion.

 

I was then shown to the monitoring suite and introduced to the two guys who monitored the systems on a 24-hour rotation. This got me thinking, “What happens when they take a break?” As you might expect, they stagger their breaks. I cheekily asked, “What happens when one of you is on a break and the other needs a bio break?” Looking a little nonplussed, one guy said, “Well, we just go.” The nearest rest room was at least a 3-minute walk away, so I estimated around 7 minutes for a round trip. If the 5-second refresh rate requirement were in effect, that would equate to 84 screen refreshes. When I pointed that out, they came to the realization that a 60-second refresh rate would be fine.

 

 My point—and I do have one—is that technical problems don’t always call for technical solutions; often, taking a wider view can solve an otherwise thorny problem. That’s something we should all remember in our tech-centric lives.

As SaaS-based solutions become more prevalent, the traditional solution architect’s role—articulating and delivering solution designs—is becoming obsolete. Many ancillary tasks that go with this core skill, such as platform sizing and effort estimation, will also fade away as SaaS takes hold.

 

So do these developments mean that architects no longer add value to software solutions?

As an architect who loves his job and wants to stay employed, I’m glad to report that the answer is a resounding “No!”

Just as so many other jobs have evolved and adapted to the application economy, solution architects need to bring their other skills to the forefront. From my perspective, it’s a simple case of doing more of the activities we had to sacrifice when we were so busy designing solutions. But here’s the real revelation (and the good news for architects and clients alike): These other activities are just as essential to solution success as good design, and we should find time for them even in non-SaaS implementations.

I would hazard a guess that everyone reading this post has lived through a situation in which a carefully designed solution loses value over time due to poor adoption. Who better than an architect, who knows a given solution inside and out, to add value to the team using the solution through training, troubleshooting and management? Architects have seen some organizations fail and others succeed, and they can share best practices with your organization to make sure it falls squarely in the latter group.

SaaS solutions are not dogged by implementation complexity and long delivery times; they usually are evaluated and pressed into production almost immediately. As a result, customers may not have run through their budget. Wise customers don’t funnel all of the remaining funds to another initiative; instead, they apportion part of it to keep an architect on to assist with rollout and adoption. This is where the solution architect can really add value at little cost.

You may be saying, “This still requires the architect to work side by side with the customer on configuration, people and process as we have always done.” But consider this: Most customer user groups are distributed, so the architect needn’t be on site, which of course saves travel costs.

Working remotely, architects can live with customers’ solutions on a daily basis, using the plethora of communications tools available to us. We can be part of their team—when they need us—and use our wealth of experience to progress them on their solution journey. Just as a SaaS solution offers a subscription service for software, the architect can offer a subscription service for his/her experience. This is a much more fluid, effective model for our journey with the customer than ad hoc, infrequent visits with them when they are stuck on an issue. We can be with the customer every step of the way to ensure that their success is maximised.

As SaaS becomes the norm and in-house knowledge declines as resources are redirected away from on-premises solutions, the architect is the front line in ensuring that customers have continued success with SaaS solutions.

With many technologies, keeping current and supportable is becoming more and more difficult. If your organization hasn’t moved, or can’t move, to SaaS-based solutions, it’s up to your IT team to ensure that your organization is getting the best value from your software—and that the cost of supporting out-of-date software doesn’t negatively impact your budget.

Whilst you are focusing on your core business and rightly directing resources to projects at the forefront of your IT strategy, niche areas like enterprise management are often left to wither. Ultimately, they become a ball and chain as backlogs increase and technical debt mounts.

Compounding the problem is the fact that internal resources lose skill sets and the knowledge required to maintain and grow certain software solutions. This often results in capital projects in which a new vendor’s solutions are deployed to fix ailing environments. This creates a sine wave of value as you cycle from vendor to vendor—when what you really want is to steadily climb the maturity/value ladder.

That’s where CA’s Application Management Services (AMS) TechOps and Adoption Services come in. AMS helps organizations avoid this unfortunate situation, guiding them with the experience, resources and best practices that allow them to ride the value curve, gain many of the benefits of a SaaS solution, and stay at the top.

From its inception, AMS was, for want of a better phrase, a ‘lights-on service’ for your environment; we call it TechOps. Near-shore teams take the day-to-day tasks away from your own over-committed resources; they maintain and monitor your system to ensure optimum performance and reduce risk. But AMS is much more than that. I’ve also observed an increase in cadence from customer delivery teams who, thanks to AMS, can get on with delivering services to the business, confident in the knowledge that everything they produce is well maintained, current and, above all, delivering what the business needs.

Companies like CA that offer application management services must, of course, make money, and that means that the solutions they manage must be in the best shape possible—otherwise, everyone loses. By virtue of effective management, your organization’s spot on the maturity curve is secure and stable—but this is only half the story. Without a supplement to TechOps, your organization would eventually stall; this is where the Adoption Services component of AMS kicks in.

The use of Adoption Services signals an organization’s intent to climb the maturity curve. CA collaborates with your team to map your strategy to a book of work with fixed outcomes that will deliver the business value you need from your software. This outcome-driven approach is adaptable to change, but the CA Services team focuses on achieving your goals by translating business requirements into operational and technical output. This process enables you to clear backlogs, clear up technical debt and move forward with new challenges. Many AMS customers have seen this approach, which meshes well with agile delivery, increase the tempo of project delivery.

I recently worked with several customers who reaped the benefit of AMS’ TechOps and Adoption Services. Having spent many years delivering software projects that never seem to move beyond phase one of maturity, I’m happy to say that working with customers who embrace the AMS philosophy has allowed me to use my knowledge and experience to deliver excellent value from their investment in CA software. In my book, AMS is a winner!

For many customers, the process of building meaningful services models in CA SOI can seem daunting. They encounter numerous challenges, such as issues with the underlying domain management layer and/or deploying multiple services with generic data sources, which leads to poor results—and even poorer adoption.

 

After witnessing this scenario a few times myself, some of my UK-based colleagues and I got together to find a way to make our customers’ journey along the service-driven enterprise management path easier—and more rewarding. We developed what we call the Spotlight approach, a series of four workshops that starts with a CA Services team gaining an understanding of the business significance of various services and ends with a fully configured CA Service Operations Insight (SOI) model.

 

For this approach to be effective (that is, enable customers to realize their vision of service-driven enterprise management), each series of workshops shines a spotlight on a single critical business service selected by the customer. You may conduct workshops for a few services in parallel, or you can do them sequentially.

 

The CA Services team leads four one-hour Spotlight workshops that together give a full picture of the service, from the business down to individual IT components. (As a point of information, Spotlight workshops are not designed to address technical debt such as component upgrades, unless it is essential to improving the service model, and any new monitoring requirements must be critical to the service model we are building.) Here are more details about each workshop:

 

  • Workshop 1—Business: We establish the nature of the selected service, who uses the service, its purpose and why it’s critical to the business. Attendees include a CA Services architect, the customer’s service product and/or business owners and the customer’s service catalogue owner.
  • Workshop 2—Incident/Problem Management: We discuss the current health and technical makeup of the selected service. This leads to better understanding of how supporting applications fit together, what the critical access points are, and general characteristics and behavior. Attendees include a CA Services architect and the customer’s incident/problem manager and operations manager/lead.
  • Workshop 3—Operations: We examine how the service and its applications are managed day to day. By understanding current methods and challenges, we can determine the best way to represent the service in CA SOI. Attendees include a CA Services architect and the customer’s operations manager/lead and lead operations technician.
  • Workshop 4—Technical: A deep technical dive into the as-is state of monitoring. We use the knowledge gained here to develop technical recommendations and prerequisites required to on-board the selected service to CA SOI. We also use it to produce a project plan and estimate deployment deliverables. Attendees include a CA Services architect, CA senior consultant, and the customer’s systems engineer for monitoring tools.

 

Once the plan is complete, we draw up an estimate of effort and prerequisites. The next step is to execute the plan and promote the model into production.

 

In my experience, the Spotlight approach illuminates the customer’s path to success by maturing the customer’s monitoring capability and delivering real value quickly and effectively.

 

Inquiring minds want to know: What have your challenges/successes been in building meaningful services models in CA SOI? Please share your experience below.