DX Unified Infrastructure Management


Are your hubs "randomly" restarting?

  • 1.  Are your hubs "randomly" restarting?

    Posted Apr 27, 2015 04:15 PM

    We are seeing this phenomenon right after a hubsec_update fails:

     

    Logs look something like this:

     

    Apr 26 01:15:56:749 [32060] hub: Queue sender thread for 't_6' stopping

    Apr 26 01:15:56:791 [49256] hub: Queue sender thread for 't_5' stopping

    Apr 26 01:15:56:848 [58980] hub: Queue management thread for 't_6' stopping

    Apr 26 01:15:56:890 [24688] hub: Queue management thread for 't_5' stopping

    Apr 26 01:16:55:057 [45560] hub: nimSessionWaitMsg: got error on client session: 10054

    Apr 26 01:16:55:057 [45560] hub: ihubrequest: nimSessionRequest 'hubsec_update' failed: communication error (192.168.xx.***:48002)-/mydomain/h-nimhub-pos-1/h-nimhub-pos-1/hub (2)

    Apr 26 01:16:55:057 [45560] hub: hub_send_event_down - /mydomain/h-nimhub-pos-1

    Apr 26 01:16:55:057 [45560] hub: hub_send_event_down - child /mydomain/otherhub is also down

    Apr 26 01:16:55:057 [45560] hub: HUB SEC: update h-nimhub-pos-1 failed (communication error)

    Apr 26 01:17:33:169 [30188] hub: passive robot thread - finished

     

    Before I go into too much detail, I'm wondering if others are seeing this issue. We have been seeing this with Hub 7.63/7.63/7.71 and controller 7.70.
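    For anyone who wants to check their own hub logs for the same signature, here is a minimal sketch in Python. The line pattern is only inferred from the excerpt above (timestamp, thread id in brackets, "hub:", message), and the function name `find_hubsec_failures` is made up for illustration, so treat it as a starting point rather than a description of the actual log format.

```python
import re

# Assumed layout, taken from the excerpt above:
# "Mon DD HH:MM:SS:mmm [thread] hub: message"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}:\d{3}) "
    r"\[(?P<thread>\d+)\] hub: (?P<msg>.*)$"
)


def find_hubsec_failures(lines):
    """Return (timestamp, thread_id, message) for each failed hubsec_update request."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and "'hubsec_update' failed" in m.group("msg"):
            hits.append((m.group("ts"), int(m.group("thread")), m.group("msg")))
    return hits


# Three lines copied from the excerpt; only the second is a hubsec_update request failure.
sample = [
    "Apr 26 01:15:56:890 [24688] hub: Queue management thread for 't_5' stopping",
    "Apr 26 01:16:55:057 [45560] hub: ihubrequest: nimSessionRequest 'hubsec_update' failed: "
    "communication error (192.168.xx.***:48002)-/mydomain/h-nimhub-pos-1/h-nimhub-pos-1/hub (2)",
    "Apr 26 01:16:55:057 [45560] hub: HUB SEC: update h-nimhub-pos-1 failed (communication error)",
]

for ts, thread, msg in find_hubsec_failures(sample):
    print(ts, thread, msg)
```

    Comparing the timestamps of these hits with the hub restart times should show whether the two really line up.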



  • 2.  Re: Are your hubs "randomly" restarting?

    Posted Apr 28, 2015 01:15 AM

    Hi,

     

    I haven't seen this. I tested making a change in the Users and kept an eye on the log. I got a weird looking hubsec error, but nothing besides that happened. This is robot 7.70 and hub 7.71. Are you doing some particular operation when this happens, so I could test it too?

     

    -jon



  • 3.  Re: Are your hubs "randomly" restarting?

    Posted Apr 28, 2015 09:57 AM

    Are you sure this isn't part of the "self healing" reset? That maybe the security message isn't a cause but just a symptom? The hub has code in it to restart when it thinks things are going badly - when it runs out of file handles for instance or leaks too many resources.

     

    -Garin



  • 4.  Re: Are your hubs "randomly" restarting?

    Posted Apr 28, 2015 11:50 AM

    Thank you both for the replies.

     

    "I haven't seen this. I tested making a change in the Users and kept an eye on the log. I got a weird looking hubsec error, but nothing besides that happened. This is robot 7.70 and hub 7.71. Are you doing some particular operation when this happens, so I could test it too?" - So we have ~300 hubs in our environment and have often seen issues with the security.cfg converging, i.e., we always seem to be in a state of flux, where the version can be slightly different across our environment. We have always attributed this to what we considered a lack of effective routing in the hub, in particular over tunnels. Our ACL is pretty static, yet for some reason we always have the security.cfg updating. Other than updating the ACL, I don't think any other actions trigger a security.cfg update.


    @Garin


    Yeah, about the "self-healing" (or, as it seems to me, band-aid) approach: is there a message that can tell me when this occurs? I have seen the internal tunnel restart messages for when that occurs, but this seems different. I have not pursued the file handles or resource leaks; I will look into that and try to turn up logging to capture more detail. I recently learned there is level 6 logging nowadays.



  • 5.  Re: Are your hubs "randomly" restarting?

    Posted Apr 28, 2015 12:11 PM

    There is supposed to be a message in the log about the self healing restart but I've not been able to identify it.

     

    And regarding the comments about the security.cfg file constantly updating, we have the same issue. My version number increments between 50 and 100 times a day. Usually it happens as the result of the introduction of a formerly unknown hub or the upgrade of the hub probe version. Unfortunately sometimes it happens for no identifiable reason. Best I can tell this is the result of there being multiple paths that the hub-up event can take. If you have three hubs that can reach each other, what seems to happen is that Hub C starts up and gets an updated security file and broadcasts to A and B. What is supposed to happen is that both A and B see the same version number and do nothing. Unfortunately what seems to happen is that A and maybe B will increment the version number and then broadcast that new file to the other systems. At that point, it's like an argument breaks out and the hubs start trying to distribute their newer security file to the other systems.


    There appears to be a critical point in the hub where it seems that the hub can receive security updates from multiple locations at the same time - Hub C gets security file updates from both A and B at the same time. This results in the security file getting truncated on C and then forwarded back to A and B. The file that gets forwarded will almost always be missing the ACL section.
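    To make the "supposed to happen" case concrete, here is a toy model in Python. It is not the hub's actual logic (that isn't public as far as this thread knows); the function names and the full-mesh flooding are assumptions made purely for illustration. The rule modeled is: adopt and forward a strictly newer file, ignore anything with the same or an older version.

```python
from collections import deque


def on_security_update(local_version, incoming_version):
    """Toy rule: return (new_local_version, should_rebroadcast)."""
    if incoming_version > local_version:
        # Strictly newer file: adopt it unchanged and pass it on.
        return incoming_version, True
    # Same or older version: keep the local file and stay quiet.
    return local_version, False


def simulate(versions, start):
    """Flood an update from `start` through a full mesh; count messages until quiet."""
    versions = dict(versions)
    queue = deque((start, peer, versions[start]) for peer in versions if peer != start)
    messages = 0
    while queue:
        sender, receiver, version = queue.popleft()
        messages += 1
        versions[receiver], rebroadcast = on_security_update(versions[receiver], version)
        if rebroadcast:
            queue.extend((receiver, peer, version) for peer in versions if peer != receiver)
    return versions, messages


# Hub C starts up with a newer file (version 5) while A and B still hold version 4.
final, sent = simulate({"A": 4, "B": 4, "C": 5}, start="C")
print(final, sent)
```

    Under this rule the flood dies out after a handful of messages because nobody ever creates a new version number on receipt. If a hub instead bumped the version when it received a file, every delivery would look "newer" to its peers and the exchange would have no natural stopping point, which matches the argument-like behavior described above.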


    -Garin



  • 6.  Re: Are your hubs "randomly" restarting?

    Posted Apr 28, 2015 12:40 PM

    @Garin - You have just 100% described our issue with the security.cfg file! This has been going on for years for us. At least once every six months we have to use secedit to restore a corrupted security.cfg file. Normally this happens when we bring a new hub online. We have reported exactly what you called out and have been told this is supposed to be included/addressed in a forthcoming release. Our "use case" was submitted like this:

     

    "In a large UIM deployment with many remote hubs connected through redundant tunnels to redundant hubs in redundant data centers, I want hub routing to converge predictably without the possibility of routing loops."


    I suspect you have a large environment as well. I would encourage you to push this issue with support as well; maybe we can turn "forthcoming" into "next release".





  • 7.  Re: Are your hubs "randomly" restarting?

    Posted Apr 28, 2015 12:49 PM

    Been pushing hard on it. For 2.5 years now. I have 715 hubs right now and this broadcast storm that happens when the security file propagates makes my central hub useless for the 2 to 20 minutes it takes for the noise to subside. Thankfully only a handful of my customers have asked about the spikes in traffic that this causes to their environments.

     

    I've been told to expect something in hub 7.80 or 8.00 but it's been a moving target.

     

    The whole routing and name resolution thing is a complete mess too. The fact that every hub has to know about every other hub and that every hub tells everything that could possibly be listening that it's there is an awful security flaw. At any point in time I can see a bunch of hubs that my Nimsoft implementation knows about that are part of other Nimsoft implementations but happen to share the same physical network. And I imagine that my customers' Nimsoft implementations see my servers too. And anyone that would want to could at least get the same list just by having a hub listening to the broadcasts. Messy.

     

    -Garin



  • 8.  Re: Are your hubs "randomly" restarting?

    Posted Jun 05, 2015 05:01 PM

    Garin,

     

    I would like to message you about this topic and the response(s) you have gotten back regarding this issue. If you are open to that, please "follow" me so that I can message you directly via the communities.

     

    Thanks

    Brandon



  • 9.  Re: Are your hubs "randomly" restarting?

    Posted May 04, 2015 04:36 PM

    Actually, the routing loop issue is a different bug. If you have redundant paths, sometimes a removed hub will not go away from the hublist. It will keep getting re-advertised through circular paths and never converge. Additionally, sometimes messages will get routed down to an edge hub, which routes them back up to another distribution-level hub that reroutes them elsewhere. The only way you'll know the latter is happening is if you specifically deny the up-and-back-down traffic from edge hubs with tunnel ACLs, then log it, then alert on it via the logmon probe. We have another years-old request to just be able to generate an alarm via the tunnel ACL matching directly.



  • 10.  Re: Are your hubs "randomly" restarting?

    Posted Jun 06, 2015 08:06 AM

    Another thing that I've been seeing with UIM 8.2 is that the socket/file descriptor usage on the hub 7.70 is very variable. The busiest hubs will hover around 600 sockets active and then at fairly random times (every 15 minutes to a day or two), it will increase over the span of 20 to 30 seconds to breach the 1000ish limit and trip the too many open files self healing restart. My gut tells me this is discovery_server's doing but then I'd like to blame all my woes on discovery.
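    If you already collect periodic socket or file descriptor counts, a rapid climb like this can be flagged automatically. The sketch below is only an illustration: the function name `find_spikes`, the 300-descriptor rise, and the 30-second window are assumptions loosely based on the numbers above (hovering around 600, then breaching roughly 1000 within 20 to 30 seconds), not anything the hub itself uses.

```python
def find_spikes(samples, rise=300, window=30):
    """Find rapid climbs in a time-ordered list of (seconds, count) samples.

    Returns (start_time, end_time, increase) for each sample whose count is at
    least `rise` higher than some earlier sample taken within `window` seconds.
    """
    spikes = []
    for i, (t_end, c_end) in enumerate(samples):
        for t_start, c_start in samples[:i]:
            if t_end - t_start <= window and c_end - c_start >= rise:
                spikes.append((t_start, t_end, c_end - c_start))
                break  # report each end point once, using the earliest match
    return spikes


# Hypothetical samples taken every 10 seconds: steady near 600, then a climb past 1000.
samples = [(0, 600), (10, 610), (20, 750), (30, 980), (40, 1010)]
for start, end, increase in find_spikes(samples):
    print(f"rose by {increase} between t={start}s and t={end}s")
```

    Logging the time of each flagged spike makes it easier to line the climbs up with whatever else was running at that moment, discovery_server included.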

     

    -Garin



  • 11.  Re: Are your hubs "randomly" restarting?

    Posted Jun 12, 2015 07:45 AM

    Hi!

     

    That sounds familiar from my 8.2 environment: relatively fast respawning ports and increasing temp queues from discovery_server behind all our tunneled hubs. It came on suddenly, without any warning, and lasted for some days. We rolled back to 7.63 hubs and robots and it still happened, so it doesn't seem to be a hub or robot problem so far. The trouble kept on and then, suddenly, silence: it disappeared. Nothing worthwhile or noticeable was done other than updating one old Infrastructure Manager 4.03, and now it is quiet. Even worse, I cannot reproduce it :-(



  • 12.  Re: Are your hubs "randomly" restarting?

    Posted Jun 12, 2015 08:36 AM

    I wish it was that easy.

     

    One thing I can comment on is that in plotting the file handle use over time, there is a distinct pattern to the usage.

     

    I run

        while test 1; do ls -al /proc/$(ps -ef  |grep "[n]imbus(hub)"|awk '{print $2}')/fd | wc -l; sleep 10; done

     

    on my Linux hubs to track this and can fairly decisively state that a hub will experience consistent failures if the resulting number ever exceeds 700.
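    The same measurement can be sketched in Python for anyone who would rather feed the number into a script than watch a terminal. The function names and the `proc_root` parameter are made up for illustration; the 700 limit is simply the figure quoted above. Note that `ls -al | wc -l` also counts the "total" line plus the `.` and `..` entries, so the shell loop reads 3 higher than this count.

```python
import os


def count_open_fds(pid, proc_root="/proc"):
    """Count the entries in <proc_root>/<pid>/fd.

    Linux only, and the caller needs permission to read that directory.
    `proc_root` exists only so the function can be pointed at a test directory.
    """
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def exceeds_limit(count, limit=700):
    """True when the descriptor count is above the empirical limit from this thread."""
    return count > limit
```

    Call `count_open_fds(pid)` with the hub's process id (found the same way as in the loop above, via `ps`), sample it every 10 seconds, and alert when `exceeds_limit` returns True.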

     

    Also, at intervals that are random but periodic (every 10, 15, or 20 minutes, for instance, but not 22 minutes), there is a five-minute period where the file descriptor use will jump by 200 to 300.

     

    I'm thinking that this might be related to tunnel keepalives because that also has a 5 minute period. Each line in the plot shows the open files in the hub for one of my 52 tunnel proxies. That squarish pair across the bottom is an HA pair that connects the same set of hubs. The only difference is that the one with the slightly higher line also runs the get queues.

     

    OpenFiles.png

    The thing that's bothersome is that at about 10% of the way across the chart (upper left) there's a grayish line that was at 600 and then for some reason across a 2-4 minute span jumps up to close to 900. Then it comes back down.

     

    And I'm not quite able to explain the pinkish line at 300 because it has one of the largest sets of tunnels and get queues. Only obvious difference is that I was playing with the postroute timeout and it is set to 600 instead of the default (either 30 or 60 - I don't recall at the moment).

     

    I also attached the logmon profile that I use to capture this data (Linux hubs only, though something analogous should be possible on Windows via the performance counters, I'd think).

     

    -Garin

    Attachment(s): logmon profile.txt.zip (879 B, 1 version)


  • 13.  Re: Are your hubs "randomly" restarting?

    Posted Jun 12, 2015 02:17 PM

    I was provided a beta version of the probe, v7.72, that is supposed to address the issue I reported above. I am currently trying to get change management approval to implement it. Reference 00161913 if you want in on the fun.



  • 14.  Re: Are your hubs "randomly" restarting?

    Posted Jun 15, 2015 11:15 AM

    And yet one more thing that's hard to explain to support is behavior like this:

     

    pic2.png

    Looks like at around 23:00 on 6/14 things start getting "noisy" and then at 00:30 or thereabouts, I have hubs failing left and right. I'm sure that I have a level 5 log from the original offender that points out the issue clearly. But alas, no. One of these hubs at level 5 generates between 4 and 10 MB/sec of log.

     

    -Garin