
Dissecting net_connect

Discussion created by anders.synstad on Jul 31, 2014
Latest reply on Apr 20, 2016 by SandroAlves

We've been seeing quite a bit of random FAILs in the net_connect logfiles, which has been puzzling us, and we have tried to investigate what could be causing them. The investigation has led us down a somewhat sad and disturbing path when it comes to the nature of net_connect. So here are our findings. Note that we do not have access to the net_connect source code, and we have not attempted to decompile net_connect; everything is based on level 3 logs, tcpdump and running strace against the probe.

 

# ps auxwwwf | grep net_connect
root     28901  0.2  0.2 3543012 4840 ?        Sl   13:28   0:01      \_ nimbus(net_connect)

# strace -p 28901 -s128 -ff -o /tmp/andersns/net_connect
Process 28901 attached with 306 threads - interrupt to quit

 

I'm attaching net_connect_debug.tar.gz containing 4 different straces: 2 with 1 profile and 2 with 2 profiles (1 interval and 2 intervals), should anyone wish to look into the data as well. Everything is based on net_connect version 3.04, which is the latest version at this time.

 

According to the documentation (http://docs.nimsoft.com/prodhelp/en_US/Probes/Catalog/net_connect/3.0/1943202.html) there is a setting called "Concurrently Execute Profiles", which claims: "Allows you to select the number of profiles simultaneously executed." This setting corresponds to max_threads in the configuration file, and in our tests it has been set to 100.

 

This raises a question. As you can see from the strace command above (and the content of the tar.gz file), the probe is actually running 306 threads. On a different system of ours, with 813 profiles, it is running 318 threads, still with max_threads set to 100:

 

# grep 'hostname = ' net_connect.cfg | wc -l
813

# strace -p 31041 -ff -o /tmp/andersns/foo
Process 31041 attached with 318 threads - interrupt to quit

# grep max_threads net_connect.cfg
max_threads = 100

 

All threads are spawned when net_connect starts and the main process is the parent of all of them.

The thread id is logged as the number in the bracketed column (in this case 140397606377216). That way, you should be able to trace some of the logic and see what is being done where:
Jul 30 13:36:35:091 [140397606377216] nc: Profile: andersns01.lab.basefarm.net IP resolved to : 172.24.128.6
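
To quickly see everything a given thread has done, you can simply grep for its id in the log, something along these lines (using the thread id above):

# grep -F '[140397606377216]' net_connect.log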

 

In the production example with 813 profiles, running for 1 hour, only 26 different threads (out of 318) are used in total. In one of the lab tests with a single profile, only 8 out of 306 are used.
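
One way to count how many distinct threads actually show up in the log is to group on the bracketed thread id, e.g.:

# grep -oE '\[[0-9]+\]' net_connect.log | sort -u | wc -l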

 

298 of the 306 files we got from strace in our 1-profile test are 44 bytes in size and contain nothing but "futex(0x1893034, FUTEX_WAIT_PRIVATE, 1, NULL". In other words, those threads are just sleeping, waiting for the controller thread to put them to use, something that never happens, so they are not very interesting. We have identified what the remaining 8 do, and given them names based on their function:

 

# ls -lhSr | tail -n20
-rw-r--r--. 1 root root   44 Jul 30 13:34 net_connect.28914
-rw-r--r--. 1 root root   44 Jul 30 13:34 net_connect.28913
-rw-r--r--. 1 root root   44 Jul 30 13:34 net_connect.28912
-rw-r--r--. 1 root root   44 Jul 30 13:34 net_connect.28911
-rw-r--r--. 1 root root   44 Jul 30 13:34 net_connect.28910
-rw-r--r--. 1 root root   44 Jul 30 13:34 net_connect.28909
-rw-r--r--. 1 root root   44 Jul 30 13:34 net_connect.28908
-rw-r--r--. 1 root root   44 Jul 30 13:34 net_connect.28907
-rw-r--r--. 1 root root   44 Jul 30 13:34 net_connect.28906
-rw-r--r--. 1 root root   44 Jul 30 13:34 net_connect.28905
-rw-r--r--. 1 root root   44 Jul 30 13:34 net_connect.28904
-rw-r--r--. 1 root root   44 Jul 30 13:34 net_connect.28903
-rw-r--r--. 1 root root 3.0K Jul 30 13:36 net_connect.29002    connectToServiceSrv
-rw-r--r--. 1 root root 4.7K Jul 30 13:36 net_connect.29102    ConnectForPingQos
-rw-r--r--. 1 root root 6.6K Jul 30 13:37 net_connect.28901    MAIN SELECT LOOP
-rw-r--r--. 1 root root 7.4K Jul 30 13:37 net_connect.29210    MAIN SELECT LOOP QOS
-rw-r--r--. 1 root root  18K Jul 30 13:37 net_connect.29208    RESOLVER
-rw-r--r--. 1 root root  45K Jul 30 13:37 net_connect.29206    MAIN SEND
-rw-r--r--. 1 root root  80K Jul 30 13:37 net_connect.29207    MAIN RECV
-rw-r--r--. 1 root root 305K Jul 30 13:37 net_connect.29209    SELECT LOOP
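
Counting the sleeping threads is easy enough, since their trace files never grow past the single futex line; something along these lines should give the 298 (assuming they all stay at exactly 44 bytes, as they did in our dumps):

# find . -maxdepth 1 -name 'net_connect.*' -size 44c | wc -l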

 

Thread descriptions

 

  • connectToServiceSrv

    In this case, only one thread is used. It is somewhat random how many threads are used for the tcp checks. In the 2-profile example with 2 intervals, it uses 2 threads: of the 4 tcp tests done, 1 is done in thread1 and 3 in thread2. Why the work is distributed this unevenly is a bit of a puzzle.

  • RESOLVER

    There is a thread whose only purpose is resolving hostnames. It is only triggered from connectToServiceSrv; however, the connectToServiceSrv threads do their own resolving as well, so at this point each hostname is resolved twice. No other threads seem to be using the resolver thread.

  • * SELECT LOOP

    There are a number of threads containing just select loops. These are things like the main process handling robot communication and threads handling some scheduling. Not really interesting. We have identified 3 of these, and it always seems to be 3.

  • MAIN SEND

    This is the thread handling the ping send requests. The icmp requests for all profiles are issued by this single thread. For each profile, the thread resolves the hostname (again, so we are now resolving the hostname 3 times), grabs a raw socket, and sends as many icmp requests as the burst setting is configured to. The number of profiles to test in one bulk (still within this single thread) is defined by the max_sockets setting.

  • MAIN RECV

    This is the thread handling the ping receives. It loops and gets a list of sockets it should read from, then reads the value(s) for all the profiles, does the calculations to see if it needs to retry, etc. The curious thing is that this thread receives a "gTimeout:2000" on each loop, so it seems to be applying some kind of global timeout, not a per-profile one. Once that timeout is reached, it logs "Starting doReceiveTimeoutAnalysis" and triggers a FAIL for each host it has not read a value from yet. In other words, the timeout is directly related to the time the receive thread spends chewing through data, NOT the time it has taken to ping a profile (see the log sketch after this list).

  • ConnectForPingQos

    A thread doing ANOTHER icmp ping, used for qos. So, for each profile (with both alarms and qos), the host is pinged 2 * burst times. This thread also does its own resolving, so that is the 4th time the same hostname gets resolved in one interval. It uses more than one thread for this; in the case of our 813-profile example, it has used 8 threads over the course of 1 hour, distributed like this:
    $ grep ConnectForPingQos net_connect.log | awk '{print $4}' | sort | uniq -c | sort -rn
      40380 [47954828532032]
      18621 [47954818042176]
       1167 [47954807552320]
        366 [47954797062464]
        294 [47954776082752]
        150 [47954786572608]
        147 [47954765592896]
          9 [47954755103040]

    The ConnectForPingQos check is also done on interval for each profile, but might be offset from when the ping is performed in the "MAIN SEND" thread. The timing of the QoS pings does not appear to be tied to the alarm pings (they likely drift apart over time, since the alarm pings retry on failure while the qos pings do not). Both run on the interval defined by the profile interval (in our case 2 minutes).

  • QoS Dispatcher

    While not shown in my dumps, there is also a thread used for shipping qos data to the bus.
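
As mentioned under MAIN RECV, the global timeout behaviour is visible directly in the level 3 log. A rough way to line it up is to pull out the timeout and FAIL lines together and compare the timestamps (the message texts here are taken from our dumps and may of course vary between versions):

# grep -E 'gTimeout|doReceiveTimeoutAnalysis|FAIL' net_connect.log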

 

Conclusions


So, there are a number of conclusions we can draw from all this. Let's try to sum it all up:

 

  1. Max threads setting: This setting does not seem to have anything to do with threads at all. Perhaps it is a remnant of how the probe used to work? In other words, setting it to the maximum (100 in the UI) might not do all that much.

  2. Resolving: If both alarms and qos are enabled for a profile, the same hostname appears to be resolved 4 times on each interval. The resolver thread appears useless, as it is only triggered from the ConnectForPingQos thread, and the ConnectForPingQos threads resolve the hostname themselves anyway.

  3. Timeout: It does seem to be possible to get a timeout even if the ping replied in milliseconds, simply because the recv thread is single-threaded and might have too much to do with all the profiles. In fact, in production we see that exactly two seconds (the timeout value) after the receive thread started analysing packets, it stops doing so and instead starts running doReceiveTimeoutAnalysis, timing out any profile it has not had time to analyse - regardless of the fact that a tcpdump shows the ping did in fact respond.

  4. QoS interval: According to the documentation, this setting "Specifies the interval for when the probe delivers Quality of Service." The default is 5 min. We are running the profiles with 2 min intervals, which means ConnectForPingQos runs every 2 min for each profile. In the database, however, we only see a single metric every 5 min. This raises the question: which value is sent? Average? Best? Worst? First? Last? Random?

  5. Double pings: The "MAIN SEND" thread appears to handle the pings related to alarms, while the ConnectForPingQos threads collect pings for qos. In other words, it appears to be using different values for alarms and qos. Since these collections might also be offset from each other, you might get a ping alarm but a valid metric in the graph, and vice versa. This goes against the philosophy of collecting a metric once and using that single collection for both alarm and qos, and it also doubles the load on the probe for no apparent reason. We have used tcpdump to check that it really does double up the pings, and it does (see the tcpdump sketch after this list).

  6. Unused threads: There are hundreds of threads being started that are never used. In the example with 813 profiles (all with at least one service check), the probe started 318 threads over the course of one hour but only ever used 26 of them.
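
For point 5, the double pinging is easy to confirm on the wire. A sketch of the kind of capture we ran (the interface name is just an example, and 172.24.128.6 is the lab profile from earlier):

# tcpdump -ni eth0 icmp and host 172.24.128.6

With both alarms and qos enabled on the profile, you should see two separate bursts of echo requests per interval instead of one.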


All in all, this is not the design we thought we would see. What it looks like to us is that the probe was initially designed to spawn N threads, and that these threads would be used per profile being checked. That is roughly what we would expect. Then it seems like the codebase was perhaps handed over to someone not familiar with the original logic and design, and they made changes they did not fully understand. A wild guess is that this happened when the probe was outsourced from the original developers (before HCL).

 

It would seem like a good idea to spawn threads and then use them for both ping send and receive (possibly with separate worker threads to handle TCP), and then have parent threads use the information from these worker threads to push both alarms and QoS. That way, the timing values would stay intact no matter how much time the main loop uses, and the data would be reused for both alarms and QoS. In fact, this is what we expected, and it also looks like it was the initial plan, judging by all the unused worker threads. To our surprise, it does not work like this at all.

 

The probe appears to be doing things in a horribly inefficient manner, and our guess is that net_connect performance could be improved significantly by addressing and resolving at least some of these issues.

 

Again, keep in mind that we do not have access to the actual source code; this is all based on our interpretation of the strace dumps and our testing. We could be wrong in our analysis, and we are of course willing to accept that.
