JMertin

How to configure MoM fail-over

Blog Post created by JMertin Employee on Feb 7, 2017

Courtesy of Henrik Nissen Ravn, who provided us with this document.

 


 

Ever since I gave one of my favorite engineering architects (yes, solutions architects do indeed have those) an insufficient and incomplete answer on this, I have had the urge to provide a fuller answer on this topic. Here it is.

Our documentation (https://docops.ca.com/ca-apm/10-3/en/administrating/configure-enterprise-manager/configure-enterprise-managers-and-clusters/use-enterprise-manager-failover) states that you can configure DNS so that one hostname resolves to the IPs of the active MoM in a primary/primary or primary/secondary MoM failover setup, as per this extract[1]:

Enable Enterprise Manager Failover

By default, Enterprise Manager failover is not enabled.

Follow these steps:

  1. Navigate to the <EM_Home>\config directory and open the IntroscopeEnterpriseManager.properties file.
  2. Locate the Hot Failover section.
  3. Set the failover properties.
    When agents, TIMs, and Workstations try to connect to an Enterprise Manager, they try all the IP addresses for a host name. If you have defined a logical host in DNS with the IP addresses of the primary and secondary Enterprise Managers, then the agents, TIMs, and Workstations can use this for the Enterprise Manager host name and connect to whichever Enterprise Manager is running.
  4. Be sure that the secondary Enterprise Manager computer users have write permission to the primary Enterprise Manager SmartStor data directory.
    In a failover situation, this permission allows the secondary Enterprise Manager to write data to the primary Enterprise Manager SmartStor database.

The critical step to read carefully is obviously step 3. This step was what got me started: how exactly does this work?

 

Agent failover vs. MoM failover.

The “standard” configuration that I see (and that is known to work) does not follow this documentation: agents point directly to the primary/primary or primary/secondary MoMs (the example below is primary/primary) using IP addresses or hostnames (FQDNs; a few words on best practices at the end):

 

Primary MOM: mom1.company.com or 10.1.2.3
Secondary MOM: mom2.company.com or 10.1.2.4
DNS: mom.company.com -> 10.1.2.3, 10.1.2.4

 

Both MOMs would have:

introscope.enterprisemanager.failover.enable=true
introscope.enterprisemanager.failover.primary=10.1.2.3, 10.1.2.4

 

Agents would have:

agentManager.url.1=mom1.company.com:5001 or 10.1.2.3:5001
agentManager.url.2=mom2.company.com:5001 or 10.1.2.4:5001

Since the agent profiles directly specify two EMs, we call this agent failover (vs. MoM failover, to follow).

So, in the agent failover configuration, DNS for the virtual host (mom.company.com), as mentioned in the documentation, is not utilized by agents.

But mind you, changes to this setup might require changes to agent profiles, of which you may have many. Just a friendly word of warning, so that you avoid ending up having to change hundreds (or even thousands) of agent profiles. This is one reason why best practices are a good thing.

Obviously, when specifying FQDNs for hostnames, DNS will be used to resolve the names into IP addresses. For now, I assume one IP address is configured for each of mom1 and mom2.

 

MoM failover.

This configuration utilizes DNS for a virtual host[2] (mom.company.com:5001) in the agent configuration, as stated in the documentation:

Primary MOM: mom1.company.com or 10.1.2.3
Secondary MOM: mom2.company.com or 10.1.2.4
DNS: mom.company.com -> 10.1.2.3, 10.1.2.4
DNS: mom1.company.com -> 10.1.2.3
DNS: mom2.company.com -> 10.1.2.4

 

Both MOMs would have:

introscope.enterprisemanager.failover.enable=true
introscope.enterprisemanager.failover.primary=10.1.2.3, 10.1.2.4

 

Or, both MOMs would have:

introscope.enterprisemanager.failover.enable=true
introscope.enterprisemanager.failover.primary=10.1.2.3
introscope.enterprisemanager.failover.secondary=10.1.2.4

 

Agents would have:

agentManager.url.1=mom.company.com:5001

DNS is in principle “dumb” in the sense that it simply returns one or more configured IP addresses in response to a client request for hostname resolution, regardless of whether they are responsive or not. The agent itself will then try the returned IP addresses one by one, and eventually figure out which MoM is active. This is what the documentation is explaining.

More precisely, for Java, the agent does an InetAddress.getAllByName() and then tries a new Socket(<IPAddress>) for the returned IP addresses one by one. If no IP address can be found, a new Socket(<hostname>) is done. The first successful connection is used (if none is successful, the last exception thrown in this iterative process is re-thrown).
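The Java logic just described can be sketched in Python (a hypothetical illustration of the mechanism, not the actual agent code; the function name and timeout are my own):

```python
import socket

def connect_to_em(hostname, port, timeout=5.0):
    """Sketch of the agent's connection logic: resolve all IPs for the
    hostname and try them one by one, using the first successful
    connection; re-raise the last exception if none succeeds."""
    try:
        # analogue of InetAddress.getAllByName()
        infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # no IP address could be found: fall back to connecting by
        # hostname, as the agent falls back to new Socket(<hostname>)
        return socket.create_connection((hostname, port), timeout=timeout)
    last_error = None
    for *_, sockaddr in infos:
        try:
            # analogue of new Socket(<IPAddress>)
            return socket.create_connection(sockaddr[:2], timeout=timeout)
        except OSError as exc:
            last_error = exc  # remember it, then try the next IP
    raise last_error
```

With a virtual host carrying both MoM IPs, the first address that accepts the connection wins, which is exactly why only the active MoM ends up serving agents.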

Hence, utilizing DNS with a virtual host simplifies and stabilizes the agent configuration, as agents specify just the FQDN of the MoM.

However, this iterative process is obviously not the most efficient. You can get rid of the agents trying MoMs one by one by deploying intelligent and/or active components: intelligent DNS, routing, or a load balancer, or a combination (or even all-in-one).

 

The port trick: the port check

First, you need to think about MoM start-up: during start-up, a MoM requests an exclusive lock on the secondary_em.lck file. If it obtains it, it then requests an exclusive lock on the primary_em.lck file[3]. If it obtains that too, it releases its first lock[4], completes its startup, and starts listening on port 5001[5]. The secondary MoM will then only obtain its first lock, hence not be able to complete its start-up, and in consequence not start listening on port 5001. Hence, only the active MoM is listening on port 5001[6].
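The lock dance above can be sketched as follows (a hypothetical, Linux-only illustration using flock; the real EM uses Java file locks on the same two files in EM_HOME/config/internal/server, and the standby MoM blocks waiting rather than returning):

```python
import fcntl
import os

def mom_startup_locks(lock_dir):
    """Sketch of the MoM start-up lock sequence. Returns ("active", lock)
    if this MoM may complete startup and listen on port 5001, otherwise
    ("standby", lock) where it holds only its first lock."""
    secondary = open(os.path.join(lock_dir, "secondary_em.lck"), "w")
    fcntl.flock(secondary, fcntl.LOCK_EX)  # step 1: lock secondary_em.lck
    primary = open(os.path.join(lock_dir, "primary_em.lck"), "w")
    try:
        # step 2: try to lock primary_em.lck (non-blocking here for the
        # sketch; the real standby MoM simply waits at this point)
        fcntl.flock(primary, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        return "standby", secondary  # another MoM is active
    fcntl.flock(secondary, fcntl.LOCK_UN)  # step 3: release the first lock
    secondary.close()
    return "active", primary  # may now start listening on port 5001
```

Running it twice against the same directory shows the behavior: the first caller becomes active, the second is stuck in standby holding only the secondary lock.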

Hence, there is a simple, active way of telling whether a MoM is active: do a port check on port 5001 and look for a listener, as only the active MoM will be listening on that port.
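Such a port check is a plain TCP connect attempt, as in this minimal sketch (function name and timeout are my own; 5001 is, as throughout, a placeholder for your configured port):

```python
import socket

def mom_is_active(host, port=5001, timeout=2.0):
    """Port check: return True only if something is listening on the
    EM port, i.e. only for the currently active MoM."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

This is the same kind of check an intelligent DNS or load balancer performs out-of-band to decide where to send agent connections.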

Then you will ask: what about during a MoM failure, when the active MoM has failed and before the other MoM becomes active? During that period, will a port check fail for both MoMs? Yes, it will. Hence, in this (hopefully small) period agents will not be able to connect to any MoM, as none is active yet. Minimizing this period (as well as network traffic) is probably one reason you would opt for intelligent DNS, routing, or a load balancer in the first place (another being network flexibility). Know that agents will keep trying to connect.

 

Utilizing a Load Balancer

A load balancer (NGINX, for example) could utilize this port check towards the actual primary/secondary MoMs (hostname or IP) and directly forward requests to the active MoM. The configuration would be the same as the second one above, with the important difference that the hostname mom.company.com:5001 would now represent a real host, namely the load balancer.

NGINX is actually able to do passive failover or active failover.

With passive failover, NGINX simply forwards the request to the next MoM; if the connection fails, NGINX excludes that MoM from forwarding for a (configurable) quarantine period and tries the next MoM. The quarantine period is important but has conflicting priorities: you want a long quarantine period to avoid forwarding requests in vain, and a short one to enable retries during an actual failure (when the active MoM fails and the second MoM is about to take over) and during failback (in a primary/secondary setup where you want the primary MoM to become active again when it restarts).

With the more efficient active failover, NGINX actively tests that a MoM is in fact active by means of the port check[7], and only forwards requests to an active MoM. As the testing is out-of-band with request forwarding, it enables more efficient forwarding with no prioritization conflict as to how often to test (but rather a resource consideration not to test too often).
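As a sketch of the passive variant (hypothetical timings, assuming the open-source NGINX stream module; note that out-of-band active health checks for TCP upstreams are an NGINX Plus feature), the load balancer configuration could look like this:

```nginx
stream {
    upstream mom_cluster {
        # fail_timeout acts as the quarantine period discussed above
        server mom1.company.com:5001 max_fails=1 fail_timeout=10s;
        server mom2.company.com:5001 backup;  # primary/secondary: used only on failover
    }

    server {
        listen 5001;               # agents connect to the load balancer here
        proxy_pass mom_cluster;
        proxy_connect_timeout 2s;  # fail over quickly to the other MoM
    }
}
```

For a primary/primary setup you would drop the backup keyword; agents would point agentManager.url.1 at the load balancer's hostname.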

Hence, utilizing a load balancer simplifies and stabilizes the agent configuration, as agents specify just the FQDN of the MoM.

 

Utilizing intelligent and routing DNS.

By deploying an intelligent, routing DNS, failover can be handled even more efficiently.

With an intelligent, routing DNS (BIG-IP DNS, for example) you configure the virtual host, mom.company.com:5001, with its IP address being a virtual IP address (VIP), and the routing for this VIP pointing to the actual MoMs.

In order to allow for this the DNS must advertise its ability to route the VIP and hence appear on the network as a router.

Further, BIG-IP DNS is able to utilize the port check to route MoM requests directly to the active MoM. This I would call active routing.

Configuration would be this:

Primary MOM: mom1.company.com or 10.1.2.3
Secondary MOM: mom2.company.com or 10.1.2.4
DNS: mom.company.com -> 10.1.2.5 (Virtual IP)
Routing: 10.1.2.5 -> 10.1.2.3 / 10.1.2.4
DNS: mom1.company.com -> 10.1.2.3
DNS: mom2.company.com -> 10.1.2.4

 

Both MOMs would have:

introscope.enterprisemanager.failover.enable=true
introscope.enterprisemanager.failover.primary=10.1.2.3, 10.1.2.4

 

Or, both MOMs would have:

introscope.enterprisemanager.failover.enable=true
introscope.enterprisemanager.failover.primary=10.1.2.3
introscope.enterprisemanager.failover.secondary=10.1.2.4

 

Agents would have:

agentManager.url.1=mom.company.com:5001

 

Hence, an intelligent, routing DNS defining a virtual host and utilizing a virtual IP and active routing simplifies and stabilizes the agent profile, and obtains the most efficient agent-to-MoM connection. It even increases flexibility, as a VIP address can provide nearly unlimited component mobility: VIP addresses can be advertised with full netmasks (un-subnetted), so a MoM can be moved anywhere on the reachable network without changing addresses.

 

Mixing

In the above examples I have specified hostname or IP address. For all MoM-specifying properties you may actually specify either; it will work, as the same internal mechanism for locating MoMs is consistently used. This opens the door to convoluted mixes of agent failover and MoM failover specifying IP addresses. You can. It will work if configured correctly. You may even mix agent failover and MoM failover specifying FQDNs.

However, do neither, as there is a fair chance you may confuse yourself and/or your network peers, to no real avail. Instead, read on.

 

Best Practice

Just a final word: it is best practice to always and only have host names in agent profiles, as this simplifies, stabilizes, and separates duties. As explained above, it even allows for the most efficient agent-to-MoM connection[8].

 

TIMs

The documentation wrongly states that the TIM will get IP addresses through DNS. This is not correct: the TESS provides the IPs during the “bind” process as it sees them. This is a security precaution, giving explicit, internal control over where a TIM sends potentially sensitive data. It’s a typo: simply exchange “TIMs” for “WebView” (or browser, if you like), and add “Team Center”.

 

[1] The documentation is actually not entirely correct regarding TIMs. I will get to that later.

[2] Virtual host: meaning that the host exists only in DNS with its FQDN entry containing the actual IP addresses of the MoMs.

[3] Both in EM_HOME/config/internal/server.

[4] You can actually see acquire and release messages for this in the MoM logs.

[5] Throughout I use port 5001 as a placeholder only, as you may configure any other port.

[6] If the active MoM fails, its lock will be relinquished, and the inactive MoM will be able to continue and complete its startup to become active and listening on port 5001.

[7] If you’re interested in how this is done you may check the free, open source network scanner and mapper NMap.

[8] I have talked to one of my favorite engineers (yes, architects do indeed have that), she said that there might be circumstances in some special environments where hostname doesn’t work. Unfortunately, not being a network guy, it is beyond my limited network knowledge to explain this further.
