DX Unified Infrastructure Management


NAS configurations for scale?

  • 1.  NAS configurations for scale?

    Posted Jan 18, 2012 12:26 AM

I've read the whitepaper and received little more than a link to it when I posed this question to support.  I was wondering what combination of NAS configurations is recommended, and what other large deployments are using to create highly scalable environments and to isolate the potential impact of alert storms.

     

It seems there are a number of ways to configure multi-NAS environments with combinations of many different types of replication.  These are described in some detail in the white paper and other docs, but there are no recommendations on how to piece it together - just the usual description of functionality without an architecture recommendation.  Hopefully someone can share some direct experience at that higher level for me.

     

I think the vision of how to scale this, which is lacking in the docs and whitepaper, would look something like this.

     

    top/master-nas  <-bidirectional/relay-forwarded-alarm-events-> top/dr-nas

     

    top/master-nas  >-event-responder-> spoke-nas1...

    spoke-nas1... >-one-direction-> top/master-nas

     

    top/dr-nas  >-event-responder-> spoke-nas2...

    spoke-nas2... >-one-direction-> top/dr-nas

     

And then tie the alert/ticket system gateway in on the top/master-nas.
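
 

If it helps make the topology concrete, I imagine the forwarding setup ending up as something like the following in nas.cfg. This is purely illustrative - the section and key names here are my guesses, and the Forwarding/Replication tab in the nas GUI writes the real ones - but it captures the intent. On the master, a bidirectional entry for the DR nas:

 

<forward>
   <dr-nas>
      active = yes
      address = /Domain/DRHub/drhubrobot/nas
      type = bi-directional
   </dr-nas>
</forward>

 

And on each spoke, a matching one-direction entry pointing up at the master:

 

<forward>
   <master-nas>
      active = yes
      address = /Domain/MasterHub/masterrobot/nas
      type = uni-directional
   </master-nas>
</forward>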

     

I also wonder if anyone is running their master nas and data_engine on separate servers, essentially breaking the nimroot functionality into two parts for scale and isolation?

     

What are others doing?  What's working and not working?  Thanks for the discussion.

     

    -ray



  • 2.  Re: NAS configurations for scale?

    Posted Jan 18, 2012 01:16 AM

    Ray,

     

    I don't have any insights on NAS replication, but I was wondering if you looked at the storm protection options that are available in the NAS. I have not looked closely at how that works, but I noticed that the options in the GUI allow you to set a threshold, various properties of the response, and the scope of the protection. Even if it does what you want, it might not help if you are more worried about the effect of alert storms on queues than on the NAS itself.

     

For what it's worth, I think your description of how to set up replication between a master and spoke NAS makes sense. Unfortunately, I have not spent enough time with anything more than bidirectional replication to be confident that the other types of replication would work exactly the way you need. I have always found it a bit tricky to really see what is going on with replication, whereas the queues between hubs are really easy to track because they have good statistics available in the GUI. There are some replication callbacks in the NAS, so those might help. I have not seen that information in the GUI anywhere, but maybe I have not looked in the right place.
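
 

If you want those queue numbers outside the GUI, the pu command-line utility can ask the hub directly. Something like this should work (I believe the callback is getqueues, but verify against the callback list in your hub's probe utility):

 

pu -u administrator -p <password> /Domain/PrimaryHub/hubrobot/hub getqueues

 

That returns the queue table with the backlog counts, so it would be easy to wrap in a periodic check.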

     

    Hopefully others have used these other types of replication and can describe their experiences...

     

I can see no reason that you would be unable to split the data_engine and NAS onto separate servers, but I have not done this myself either. I am not sure if those probes have to run directly on hubs or not, but adding hubs incurs little to no penalty.

     

    -Keith



  • 3.  Re: NAS configurations for scale?

    Posted Jan 18, 2012 01:33 AM

    Is there storm code in the NAS too?  I've not been able to find that.

     

    I do see some storm protection in the spooler which looks pretty advanced, but not in the nas.  I am definitely looking at those spooler options.

     

At this point I think our next capacity constraint is the nas, though. A heavy enough storm can cause a significant processing delay.  I think a spoked, multi-nas architecture, leveraging spooler storm protection, and monitoring queue sizes at the hub is going to be the combination that cracks this open.
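
 

For the queue monitoring piece, I'm picturing a small scheduled nas Lua script along these lines. A totally untested sketch - I'm assuming the hub exposes a getqueues callback and guessing at the field names in its response, so both would need verifying:

 

-- Sketch: alarm when a hub queue backlog crosses a threshold.
-- Assumptions: the hub "getqueues" callback exists, and each entry
-- in its response has "name" and "queued" fields.
local hub_addr  = "/Domain/PrimaryHub/hubrobot/hub"
local threshold = 10000

local queues, rc = nimbus.request(hub_addr, "getqueues")
if rc == NIME_OK and queues ~= nil then
   for _, q in pairs(queues) do
      if tonumber(q.queued or 0) > threshold then
         nimbus.alarm(3, string.format("Queue %s backlog is %d messages",
            tostring(q.name), tonumber(q.queued)))
      end
   end
end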



  • 4.  Re: NAS configurations for scale?

    Posted Jan 18, 2012 06:32 PM

    In the NAS GUI, I see the storm protection options underneath the Setup tab and the General sub-tab.

     

    (NAS 3.71 in our case.)



  • 5.  Re: NAS configurations for scale?

    Posted Jan 18, 2012 06:52 PM

    Hi!

     

I had some serious trouble with multiple NASes and their replication, even after some sessions with support.

The fact was that ack'ed messages weren't ack'ed on all NASes, so we had thousands of messages running around the bus.

     

The idea was simple: two main hubs (one active, the other standby) with bidirectional nas replication. What I mean is, one has a data_engine running and the other does not, but both have NASes, archives, and so on.

All other hubs replicate unidirectionally to both main hubs...

     

It took weeks to get the problem through support before they realized there was a problem. Finally, I reduced everything to just the main hubs and two other NASes, with all the others running with queues that send their alarms to the main hub NASes.

     

But we got an explanation of why this was happening: if you set up nas replication, increase the timeout value, because there seem to be some speed problems within the alarm handling :-)

     

    cheers

    Matthias

     



  • 6.  Re: NAS configurations for scale?

    Posted Jan 18, 2012 07:03 PM

    Interesting. I ran into a similar problem with the NAS running on just two hubs--a primary and a secondary root hub. The secondary NAS would have old alarms out there that had been ack'ed on the primary. It would not surprise me if this issue were caused by the same problem as yours. The support case was still open when I moved on to another job, so I do not know how it ended up.



  • 7.  Re: NAS configurations for scale?

    Posted Jan 18, 2012 07:10 PM

    Hi!

     

Sounds the same. It was not easy to get support onto my track :-), but I finally fought it out ;-) heheh, one of my "famous" calls.

     

    cheers

    Matthias

     

P.S. I think that was even the one where I made a lot of MPEG films to show them the problem in "real time" :->



  • 8.  Re: NAS configurations for scale?

    Posted Jan 18, 2012 07:45 PM

I recall reading that stopping and starting replication could cause inconsistencies in replication.  There is also a great thread in here between Keith and another fellow that has some nice Lua code to detect inconsistencies.

     

Specifically, if you stop replication to a target and then start it again, it will not send updates, including close/clear, for alarms already in the replicated NAS.  The workaround is to clear any alarms on the target nas that were replicated from the source nas before re-enabling replication.
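
 

For that workaround, a bulk clear from a nas Lua script looks like the natural tool. A rough sketch (untested - I'm going from my reading that alarm.list() and alarm.ack() are part of the nas Lua API, and assuming replicated alarms keep the source hub as their origin):

 

-- Sketch: close (ack) all alarms that came from the source nas's hub
-- before re-enabling replication from it. "SourceHub" is a placeholder.
local source_origin = "SourceHub"

local alarms = alarm.list()
if alarms ~= nil then
   for _, a in pairs(alarms) do
      if a.origin == source_origin then
         alarm.ack(a.nimid)   -- in Nimsoft, ack'ing closes the alarm
      end
   end
end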

     

Thanks for the tip on nas storm protection, Keith.  As of 3.54 it isn't there, I didn't see that tidbit in the change logs, and it looks like the online docs haven't been updated either.  It seems worth the effort to upgrade, but using that feature will largely be discovery until the docs come up to speed.

     

I did notice another interesting feature, the "Publish alarm updates every X messages" option in Setup -> General.  It seems this would make the GUI tools and possibly ticket gateways work more efficiently by not publishing every alert every 5 minutes.  Apparently it affects the stream of data to alarm consoles and likely the ticket gateways.  Have you used this feature?  My only concern would be if it considers a severity change just another duplicate eligible for suppression, but then I don't think it could, or you would miss clears in the alarm consoles.



  • 9.  Re: NAS configurations for scale?

    Posted Jan 18, 2012 08:06 PM

    Ray,

     

The publishing of alarm updates defaults to every 100 messages. Because that is the default, I have some experience with it. I believe your system is set up to publish every update because of the ticket gateway. When that feature was first added to the NAS, I had to change the setting from the default to mimic the original behavior of the NAS. The gateway had the ability to hold off on opening tickets for certain alarms until the alarm repeated X times, so it needed to see the updates when that threshold was crossed. The gateway could be set up to pick up alarms and updates in other ways, but the update messages worked really well to get the information to the gateway as quickly as possible.

     

    It has always been a bit unclear to me how much of a difference this setting makes in the real world, but it probably makes a bigger difference in the event of an alarm storm. And I am not sure if an alarm storm is more likely to cause trouble for the NAS or for the recipients of the update messages. If the alarm storm ends up making a lot of unique alarms, the point is moot anyway.

     

As someone who tends to spend a lot of time in the Alarm SubConsole (standalone or in Infrastructure Manager), I really do not like this feature. I prefer to see the latest and greatest information in the console, which this feature intentionally prevents. I also still tend to get confused on occasion when I think something should have repeated but does not show a count. Then I realize the reason and refresh my alarm list manually. I recently tried changing the publish setting from 100 to 10 in the hope that it makes things a little easier in the console without having to get every update.

     

    I am fairly confident that severity changes always get published because they are not considered duplicate in this context. I think changes to the alarm message text are also published even if the severity remains unchanged, but I am not completely sure about that one. That would be simple enough to test if you really need to know.

     

    -Keith



  • 10.  Re: NAS configurations for scale?

    Posted May 03, 2015 11:38 AM

Not to resurrect an old thread, but doing so saves me from retyping it. It has been three years, and I'd say that the status of the documentation about how to handle a multiple-nas environment hasn't changed - there's a lot of PowerPoint about the theory of how it could work but nothing about how it should be configured.

     

    So what I'd like is to restart this conversation and maybe the community can piece together the process.

     

    My specific need surrounds separating preprocessing script execution from AO profiles.

     

    I have 32 tunnel proxies (because of the number of connected hubs) and I have to run every alarm through a preprocessing script because we need to modify most of the message subjects to include a problem id and some identifying text for downstream Salesforce integration.
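
 

For context, each alarm hits a pre-processing rule whose Lua script boils down to something like the sketch below. (Simplified - the real script derives the problem id from a lookup, and here it's hardcoded. As I understand pre-processing scripts, the incoming alarm is available as the event table, and you return it to pass it on or nil to drop it.)

 

-- Sketch: tag the alarm message for the downstream Salesforce integration.
-- "PROBLEM-4711" and the prefix format are made-up placeholders.
if event ~= nil and event.message ~= nil then
   local problem_id = "PROBLEM-4711"
   event.message = string.format("[%s] %s", problem_id, event.message)
end
return event   -- returning nil would discard the alarm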

     

We are constantly having the situation where the current single nas hangs because we are approaching some sort of internal concurrency limit on the number of scripts that can be run.

     

The thought is to put a nas on each of the primary tunnel proxies (we run them in HA pairs), with a secondary nas on the other member of the HA pair. This nas would only run the preprocessing scripts and any AO profiles that prevent alarms from propagating (close-on-arrival sorts of things).

     

    Then use nas replication to move these alerts to the central hub where the primary nas would reside. This nas would then be responsible for running all remaining processing scripts.

     

    So is there anyone doing something like this already and if so could you share your experiences? If not, is this as simple as it sounds to do? It seems like it should be trivial to set up but whenever something seems trivial in Nimsoft it usually means that there's some looming unexpected reason why it shouldn't be done.

     

    -Garin



  • 11.  Re: NAS configurations for scale?

    Posted May 04, 2015 01:56 AM

    I've only recently started using multiple nasses, so my experience is also somewhat limited.

     

The way nas replication works, it doesn't replicate AO or PP profiles. I guess this might be a consideration in your environment if all the nasses are to have the same configuration. It definitely was for me, even with just one HA pair. Besides that, I think the other obvious consideration is how to handle imported alarms in AO or PP profiles on your primary nas. What I mean is, making sure that alarms are preprocessed only in the nas on the proxy hubs.

     

About the synchronization issues.. I was just investigating one last week, so there definitely still seem to be some issues these days.

     

    -jon



  • 12.  Re: NAS configurations for scale?

    Posted May 04, 2015 01:27 PM

    My need here is more along the line of needing an instance of nas just to run preprocess scripts.

     

Sometimes when I restart my central nas I'll get several hundred error messages in the nas.log indicating that the preprocessing script was interrupted. That leads me to believe that I'm asking for too much processing out of the single nas instance.

     

    So I tried installing nas on another hub and ran into issues.

     

I deployed nas 4.67 and it failed to start - no log generated or any other indication of failure. Eventually it went red in IM with no indication why.

     

I executed nas from the command line, and that created a log containing messages that the nas was unable to connect to its queue. And yes, on the local hub there was no nas attach queue. I created that queue and the nas probe "starts". Now the log file is rolling with a once-a-second message that it's waiting for MPSE.
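
 

For reference, the queue I created ends up in hub.cfg looking roughly like this (I built it through the hub GUI's queue tab rather than hand-editing; treat the subject value as an assumption, since the subject list the nas expects may differ by version):

 

<queues>
   <nas>
      active = yes
      type = attach
      name = nas
      subject = alarm
   </nas>
</queues>

 

That got the queue complaint out of the log; the MPSE wait was the next hurdle.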

     

So that implies MPSE is a prerequisite of nas, though it's not listed as one in the nas package.  Fixing that is presumably as simple as dropping the MPSE package on the hub in question.

     

    Unfortunately, MPSE will not start because it requires data_engine to be installed. Again, a prerequisite not in the package but showing up at runtime.

     

So I tried installing data_engine, which works, but the problem is that data_engine requires the ability to connect to a database, of which I have none available on this particular hub because it's Linux and so has no access to the MSSQL instance that the central hub uses.

     

That leads me to the question: does anyone have a process to run nas without all the prerequisites?

     

    -Garin



  • 13.  Re: NAS configurations for scale?

    Posted May 04, 2015 02:14 PM

Hmm, that sounds odd. In my new env, I've installed nas on all hubs due to all the silly dependencies, and it has worked without installing anything else. Sure, it installs alarm_enrichment and by default requires that, but everything else worked out of the box. I'm also talking about nas 4.67.

     

    -jon



  • 14.  Re: NAS configurations for scale?

    Posted May 04, 2015 02:30 PM

    About the only predictable thing about Nimsoft is its unpredictability.

     

But yes, I was surprised at all the prerequisites too. So I deleted everything and installed an older nas version (4.63 in this case). That installed without any kind of complaint except for not finding maintenance mode.

     

    May  4 13:21:06:816 [140080443713280] nas: maint: Maintenance Probe Find Failed

    May  4 13:21:06:819 [140080411707136] nas: maint:  Unable to obtain nimNamedSession for registration to: maintenance_mode

     

Then I right-clicked and chose Update in IM, and that install also ran to completion without complaint.

     

    So, if at first you don't succeed, delete and re-install.

     

    -Garin




  • 15.  Re: NAS configurations for scale?

    Posted May 04, 2015 03:03 PM

Hmh yeah.. not surprised. About maintenance mode, I believe the 8.2 release notes said that secondary nasses should now be able to detect the maintenance_mode probe on the primary hub. I've seen this error on all my nasses and haven't found a key to point the nas to a specific maintenance_mode robot. Oh well..

     

    -jon



  • 16.  Re: NAS configurations for scale?

    Posted May 12, 2015 05:57 PM

    Jon,

     

We noticed the same error log on our secondary NAS(es). Maintenance also no longer works - did you find a solution?

    Regards

     

    Rob



  • 17.  Re: NAS configurations for scale?

    Broadcom Employee
    Posted May 15, 2015 08:22 AM

Unfortunately this doesn't appear to be documented - I'll work to correct that today - but for remote NASes you can put a configuration key in the <setup> section of nas.cfg:

     

    maintenance_mode_address = /Domain/PrimaryHub/hubrobot/maintenance_mode

     

This will allow the remote NASes to communicate with the maintenance_mode probe and properly utilize maintenance.

     

    (NAS 4.67+ only.)
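
 

In context, the key just sits alongside the other setup keys, for example:

 

<setup>
   loglevel = 0
   maintenance_mode_address = /Domain/PrimaryHub/hubrobot/maintenance_mode
</setup>

 

Adjust the path to your own domain, hub, and robot names, then restart the nas probe so it re-reads nas.cfg.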



  • 18.  Re: NAS configurations for scale?

    Posted May 25, 2015 04:46 AM

    Hi Jonas,

     

Thanks for your response. This fixed the issue. We applied the config, and maintenance is now working as expected.

     

    Regards Rob



  • 19.  Re: NAS configurations for scale?

    Posted May 25, 2015 12:16 PM

    Nice catch Jason!