DX Unified Infrastructure Management

  • 1.  Probe to move profiles from one robot to another: generic_cluster probe

    Posted Oct 12, 2010 03:23 PM

    Hi,

    Here's a probe I developed, initially for a customer who wanted something similar to our cluster probe but needed it to run on AIX.

     

    The probe is written in Java, so it should run on any platform with a Java runtime. I've tested it on Windows, Linux and AIX.

     

    It basically allows you to define a set of probe profiles (or configuration-file sections) that you want to switch from one robot to another, and to specify a condition that determines which robot should run the profiles.

     

    Please make sure you test this probe thoroughly until you are confident that you know how to configure it properly, as a misconfiguration can remove probe profiles.

     

    As this is an unofficial probe, it is not officially supported.

    If you have any enhancement requests or questions, please write me an e-mail or post them in this thread.

     

    regards,

    chris



  • 2.  How to make your NAS truly highly available

    Posted Oct 28, 2010 05:51 PM

    One problem you will encounter when you run the NAS in a high-availability setup using the HA probe is the handling of auto-operator profiles and pre-processing rules.

    These are not replicated between the two systems. When the primary hub goes down, the secondary hub takes over, but the ao-profiles and pre-processing rules have to be transferred and activated manually.

     

    Here's a how-to for enabling automatic failover of those as well.

     

    Prerequisites:

    - you have two fully working NAS instances

    - you have NAS replication set up between the two NAS instances and have activated the replication of scripts (!!!)

    - you have set up the HA-probe to take care of enabling/disabling all your queues, etc.

     

    What you have to do:

    In my scenario, there are two hubs: the primary one is /win08vm/primaryhub/win08-01, the secondary is /win08vm/deb01/deb01. When win08-01 goes down, deb01 should take over.

    1) Open Raw Configure for the "secondary" NAS and delete all sub-sections of "auto_operator" and "filters". With those sub-sections absent, the NAS treats the ao-profiles as deactivated.

    2) Now deploy the generic_cluster probe to your primary hub and your secondary hub. Keep both probes deactivated.

    3) Now open Raw Configure or manually edit generic_cluster.cfg on your primary hub and paste in this config (make sure to adapt your robot names, hub names and domain name!!!!):

     

    <setup>
       loglevel = 5
       heartbeat_interval = 10
       configsync_interval = 33
    </setup>
    <clusters>
       <nasreplication>
          masterConditionType = callback
          masterConditionTarget = /win08vm/primaryhub/win08-01/hub.get_info
          masterConditionRegExp = [[success]]
          announceOnBus = false
          announcementSubject = generic_cluster
          <nodes>
             nimclu = /win08vm/primaryhub/win08-01
             deb01 = /win08vm/deb01/deb01
          </nodes>
          <master_sections>
             <nas>
                1 = /auto_operator/*
                2 = /filters/*
             </nas>
          </master_sections>
       </nasreplication>
    </clusters>

     

     

    In this config, the probe tries to invoke the callback "get_info" on the hub probe at /win08vm/primaryhub/win08-01/hub (hence the target /win08vm/primaryhub/win08-01/hub.get_info). If the call succeeds ([[success]]), this robot is the master. This is the normal case.
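Since the robot, hub and domain names have to be adapted by hand, it can help to read the config back and list the node addresses before activating the probe. Here is a minimal Python sketch of that check. The parser only handles the two constructs visible in the config above (`<section>` / `</section>` and `key = value`), which is an assumption about the file format, and the helper name `parse_cfg` is made up for this example.

```python
def parse_cfg(text):
    """Parse '<section>' blocks and 'key = value' lines into nested dicts.

    Only covers the syntax visible in the generic_cluster.cfg shown in
    this post; anything else the real format allows is not handled.
    """
    root = {}
    stack = [root]
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("</") and line.endswith(">"):
            if len(stack) == 1:
                raise ValueError("unexpected closing tag: " + line)
            stack.pop()
        elif line.startswith("<") and line.endswith(">"):
            child = {}
            stack[-1][line[1:-1]] = child
            stack.append(child)
        elif "=" in line:
            key, _, value = line.partition("=")
            stack[-1][key.strip()] = value.strip()
    if len(stack) != 1:
        raise ValueError("unclosed section")
    return root


SAMPLE = """
<clusters>
   <nasreplication>
      masterConditionTarget = /win08vm/primaryhub/win08-01/hub.get_info
      masterConditionRegExp = [[success]]
      <nodes>
         nimclu = /win08vm/primaryhub/win08-01
         deb01 = /win08vm/deb01/deb01
      </nodes>
   </nasreplication>
</clusters>
"""

cfg = parse_cfg(SAMPLE)
cluster = cfg["clusters"]["nasreplication"]
print("target:", cluster["masterConditionTarget"])
for name, address in cluster["nodes"].items():
    print(name, "->", address)
```

If any printed address still shows the example names from this post, the config has not been fully adapted yet.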

     

    4) Now edit the configuration of the generic_cluster probe on your secondary hub and paste the following text (again, modify robot names, etc.):

     

    <setup>
       loglevel = 5
       heartbeat_interval = 10
       configsync_interval = 33
    </setup>
    <clusters>
       <nasreplication>
          masterConditionType = callback
          masterConditionTarget = /win08vm/primaryhub/win08-01/hub.get_info
          masterConditionRegExp = [[fail]]
          announceOnBus = false
          announcementSubject = generic_cluster
          <nodes>
             nimclu = /win08vm/primaryhub/win08-01
             deb01 = /win08vm/deb01/deb01
          </nodes>
          <master_sections>
             <nas>
                1 = /auto_operator/*
                2 = /filters/*
             </nas>
          </master_sections>
       </nasreplication>
    </clusters>

    This profile tells the probe to call the same callback on the primary hub, but here this node becomes the master if the callback fails (meaning the primary hub is down).

    In both profiles, you can see that all sub-sections of /auto_operator/ and /filters/ in nas.cfg will be replicated.
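The two profiles differ only in masterConditionRegExp, and the point of that difference is that exactly one node considers itself master whether the primary hub answers or not. A small Python sketch of that decision follows. It is an illustration of the logic described here, not the probe's Java code, and it handles only the two tokens `[[success]]` and `[[fail]]` used in this post; the function name `is_master` is made up.

```python
def is_master(callback_succeeded, condition):
    """Decide whether a node is master for one heartbeat.

    callback_succeeded: True if hub.get_info on the primary hub answered.
    condition: the node's masterConditionRegExp value. Only the two
    tokens used in this post are handled (assumption for this sketch).
    """
    if condition == "[[success]]":
        return callback_succeeded
    if condition == "[[fail]]":
        return not callback_succeeded
    raise ValueError("condition not covered by this sketch: " + condition)


# Walk through both states of the primary hub.
for hub_up in (True, False):
    primary = is_master(hub_up, "[[success]]")   # profile on win08-01
    secondary = is_master(hub_up, "[[fail]]")    # profile on deb01
    print("primary hub up:", hub_up,
          "| primary is master:", primary,
          "| secondary is master:", secondary)
```

In both states the two results are opposites, which is what keeps the ao-profiles and filters active on one NAS only.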

     

     

    5) Now start both probes and monitor their log files. Wait at least 5 minutes so that the probes can reliably synchronize with each other. In the probe directory on the secondary hub, a directory called "repository/nasreplication/" will appear, and in it you should find a nas.cfg. Once that file is there, the two nodes are synchronized.
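Instead of watching the directory by hand, the check for the synchronized nas.cfg can be scripted. A minimal Python sketch, to be run on the secondary hub: the probe directory used below is taken from the log paths of my Linux test system further down and will differ on other installations, and `wait_for_file` is just a helper name for this example.

```python
import time
from pathlib import Path

# Probe directory as it appears in the log excerpt of this post
# (Linux secondary hub). Adjust for your installation.
NAS_CFG = Path("/opt/nimsoft/probes/application/generic_cluster"
               "/repository/nasreplication/nas.cfg")


def wait_for_file(path, timeout_s=300.0, poll_s=5.0):
    """Return True once 'path' exists as a file, False after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while True:
        if Path(path).is_file():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


# A single immediate check; use timeout_s=300 to wait up to five minutes.
if wait_for_file(NAS_CFG, timeout_s=0):
    print("nas.cfg is in the repository, nodes are synchronized")
else:
    print("nas.cfg not there yet")
```

The file appearing only tells you that the first config sync has happened; it is still worth reading the probe log before running the failover test.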

     

    6) Now you can perform your failover test. When your primary hub goes down, you should see messages similar to this on your secondary system:

     

    Oct 28 16:18:11:962 [main, generic_cluster] performing heartbeat for all 1 configured clusters

    Oct 28 16:18:11:963 [main, generic_cluster] checking cluster nasreplication

    Oct 28 16:18:11:966 [main, generic_cluster] invoking callback 'get_info' on /win08vm/primaryhub/win08-01/hub

    Oct 28 16:18:26:706 [main, generic_cluster] encountered problem invoking callback 'get_info' on /win08vm/primaryhub/win08-01/hub: Unable to open a client session for 192.168.119.128:48002

    Oct 28 16:18:26:707 [main, generic_cluster] node /win08vm/deb01/deb01 is MASTER for cluster nasreplication

    Oct 28 16:18:26:707 [main, generic_cluster] NEW MASTER for cluster nasreplication. New master is: /win08vm/deb01/deb01

    Oct 28 16:18:26:708 [main, generic_cluster] activateClusterSections(nasreplication) invoked

    Oct 28 16:18:26:708 [main, generic_cluster] activating sections for probe nas

    Oct 28 16:18:26:708 [main, generic_cluster] syncSectionsToProbe(nasreplication,/win08vm/deb01/deb01,nas,Hashtable) invoked

    Oct 28 16:18:26:708 [main, generic_cluster] fetching current config from nas at /win08vm/deb01/deb01

    Oct 28 16:18:26:718 [main, generic_cluster] temporarily stored remote config as /opt/nimsoft/probes/application/generic_cluster/temp/nasreplication/nas.cfg

    Oct 28 16:18:26:718 [main, generic_cluster] using archived configuration from: repository/nasreplication/nas.cfg

    Oct 28 16:18:26:721 [main, generic_cluster] found wildcard for section definition. determining subsections dynamically

    Oct 28 16:18:26:722 [main, generic_cluster] found 0 subsections for the wildcard. copying the subsections

    Oct 28 16:18:26:722 [main, generic_cluster] found wildcard for section definition. determining subsections dynamically

    Oct 28 16:18:26:722 [main, generic_cluster] found 2 subsections for the wildcard. copying the subsections

    Oct 28 16:18:26:722 [main, generic_cluster] copying section /filters/invisible

    Oct 28 16:18:26:724 [main, generic_cluster] copyCfgSection returned 0

    Oct 28 16:18:26:724 [main, generic_cluster] copying section /filters/custom

    Oct 28 16:18:26:725 [main, generic_cluster] copyCfgSection returned 0

    Oct 28 16:18:26:725 [main, generic_cluster] deploying modified config back to nas on /win08vm/deb01/deb01

    Oct 28 16:18:26:734 [main, generic_cluster] deployment returned: 0
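The log shows the wildcard handling: each `/section/*` entry is expanded into the sub-sections found in the archived nas.cfg, and those are copied into the config fetched from the local NAS. The following Python sketch mimics that step on nested dicts. It is my illustration of the behaviour seen in the log, not the probe's Java implementation; `expand` and `sync_sections` are made-up names, and the filter contents are placeholder data.

```python
import copy


def expand(config, pattern):
    """Expand '/filters/*' into the sub-section paths present in config."""
    parts = pattern.strip("/").split("/")
    node = config
    for part in parts[:-1]:
        node = node.get(part)
        if not isinstance(node, dict):
            return []          # parent section missing
    prefix = "".join("/" + part for part in parts[:-1])
    if parts[-1] == "*":
        return [prefix + "/" + name
                for name, value in node.items() if isinstance(value, dict)]
    return [prefix + "/" + parts[-1]] if isinstance(node.get(parts[-1]), dict) else []


def sync_sections(archived, live, patterns):
    """Return a copy of 'live' with the matching archived sections copied in."""
    result = copy.deepcopy(live)
    for pattern in patterns:
        for path in expand(archived, pattern):
            keys = path.strip("/").split("/")
            src, dst = archived, result
            for key in keys[:-1]:
                src = src[key]
                dst = dst.setdefault(key, {})
            dst[keys[-1]] = copy.deepcopy(src[keys[-1]])
    return result


# Placeholder data shaped like the situation in the log above:
# no ao-profiles archived, two filters archived.
archived = {"auto_operator": {},
            "filters": {"invisible": {"active": "yes"},
                        "custom": {"active": "yes"}}}
live = {"setup": {"loglevel": "0"}, "auto_operator": {}, "filters": {}}

for pattern in ("/auto_operator/*", "/filters/*"):
    print(pattern, "->", expand(archived, pattern))
merged = sync_sections(archived, live, ["/auto_operator/*", "/filters/*"])
print(sorted(merged["filters"]))
```

With this data the first wildcard matches nothing and the second matches two sub-sections, the same counts as in the log excerpt.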

     

     

     

    PLEASE NOTE:

    - As this is not an official probe, it is not covered by our support.

    - As you delete sections from your nas.cfg, make sure you keep a backup copy of the unmodified nas.cfg. If you run into problems with your NAS, deactivate the probes, roll back all your changes and verify whether the problem persists. Only then should you call support.

    - This probe and the NAS failover have been tested and have been working fine for quite a while now. Still, there is no guarantee. Nimsoft operators should be advised to manually verify full functionality after a failover to make sure the probe has not broken anything.

    - One thing you might run into problems with is a setup where one hub runs on Windows and the other on Linux. In such cases, avoid mixed uppercase/lowercase names for your Lua scripts: the Windows NAS stores all names in lowercase, and the Linux NAS, being case-sensitive, will then not find the script.
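For the mixed Windows/Linux case, a quick way to spot risky script names is to list every name that changes when lowercased. A small Python sketch: the script names in the example are invented, and where your NAS keeps its scripts on disk is something you need to fill in yourself, so the directory argument of `scan_scripts` is left to the caller.

```python
from pathlib import Path


def mixed_case_names(names):
    """Return the names that contain uppercase characters."""
    return [name for name in names if name != name.lower()]


def scan_scripts(directory):
    """Check all file names below 'directory' (the location is up to you)."""
    files = sorted(p.name for p in Path(directory).rglob("*") if p.is_file())
    return mixed_case_names(files)


# Invented example names, not taken from a real NAS.
print(mixed_case_names(["cleanup", "EscalateToOps", "notify_oncall"]))
```

Any name this reports should be renamed to all lowercase on both hubs, and the ao-profiles referencing it updated, before relying on the failover.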

     

    that's it. 

    -chris