JMertin

The importance of clean traffic for APM CEM

Blog Post created by JMertin Employee on Jul 17, 2014

Most problems that show up when using CEM (Customer Experience) in the APM suite occur when the traffic forwarded to the TIM probe is not clean.

To understand the impact, one must understand that, in contrast to many monitoring tools that only look at the TCP/IP packets, the TIM probe looks at the communication flows that occur between the client browser and the web server. In other words, all packets are reassembled by the TIM probe so it can analyze what has been sent back and forth.

These reassembled flows are what the probe bases its evaluation of a valid transaction (for statistics collection), its defect identification, and its data collection for bad traffic on.

 

By default, the TIM probe detects a transaction by matching it against a previously registered and configured transaction definition. In other words, it only looks for things it knows. This also means that it will not see what has not been defined or does not exist.

The detection/identification of a transaction is by default linked to the request header of an HTTP conversation.

- If TIM receives broken traffic and if TIM cannot see the request header, it will ignore the entire conversation.

- If it sees the request-header, but only part or none of the response header, TIM will generate a defect.

TIM, however, bases all of this on the traffic it is given to analyze. If the traffic is not complete, it can base detection and monitoring only on what it sees. If the traffic it gets is not the same as the real traffic, TIM will monitor something other than the real traffic!
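The two rules above can be sketched as a small decision function. This is only an illustration of the behaviour described in this post, not TIM's actual implementation: the names `Outcome`, `ObservedConversation` and `classify` are hypothetical, and the matching against configured transaction definitions is left out.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    IGNORED = "ignored"      # request header not seen: conversation is skipped
    DEFECT = "defect"        # request seen, response header partial or missing
    MONITORED = "monitored"  # both headers seen: normal evaluation can proceed


@dataclass
class ObservedConversation:
    """What the probe actually saw on its capture port (hypothetical model)."""
    request_header_complete: bool
    response_header_complete: bool


def classify(conv: ObservedConversation) -> Outcome:
    # Without the request header there is nothing to identify the transaction by.
    if not conv.request_header_complete:
        return Outcome.IGNORED
    # Request was identified, but the response header is partial or absent.
    if not conv.response_header_complete:
        return Outcome.DEFECT
    return Outcome.MONITORED


# A dropped response packet on the capture link looks like a server-side defect.
print(classify(ObservedConversation(True, False)))
```

The point of the sketch is that the function only ever sees what arrived at the capture port. A packet lost between the data collector and the probe produces the same `DEFECT` outcome as a response that was really broken.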

 

Expect to see the following if the traffic the TIM probe gets is not clean:

  • Inability to record transactions
  • Inability to monitor transactions
  • HTTPS traffic cannot be decrypted, or only partially, and SSL errors show up in the TIM logs
  • Partial/Missing Component defects. If this happens at peak times, the SPAN equipment is probably overloaded and dropping traffic.
  • The OOQB (Out of Order Queued Bytes) queue getting high. This will cost CPU cycles.
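To make the last symptom more concrete, here is a minimal sketch of how a reassembler might queue out-of-order bytes. It is a toy model under stated assumptions (one direction only, byte offsets starting at 0, no overlap handling) and the class `StreamReassembler` is hypothetical; it is not how TIM is implemented.

```python
class StreamReassembler:
    """Toy one-direction reassembler that tracks out-of-order queued bytes."""

    def __init__(self, initial_seq: int = 0) -> None:
        self.next_seq = initial_seq  # next byte offset expected in order
        self.pending = {}            # seq -> payload that arrived too early
        self.delivered = bytearray() # bytes handed on for analysis, in order

    @property
    def ooqb(self) -> int:
        """Bytes currently held back because an earlier segment is missing."""
        return sum(len(payload) for payload in self.pending.values())

    def receive(self, seq: int, payload: bytes) -> None:
        if seq < self.next_seq or not payload:
            return  # already delivered or empty; ignored in this toy model
        self.pending[seq] = payload
        # Deliver everything that is now contiguous with what we already have.
        while self.next_seq in self.pending:
            chunk = self.pending.pop(self.next_seq)
            self.delivered += chunk
            self.next_seq += len(chunk)


# Reordered traffic: the queue fills briefly, then drains.
r = StreamReassembler()
r.receive(0, b"GET ")
r.receive(9, b"P/1.1")   # arrives early, gets queued
r.receive(4, b"/ HTT")   # fills the gap, queue drains
print(bytes(r.delivered), r.ooqb)
```

If the segment at offset 4 had been dropped by the capture device instead of merely delayed, everything after it would stay queued and `ooqb` would keep growing, which is the pattern to watch for.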

 

Under these conditions, it is impossible for the TIM to report real monitoring facts. It will still report something, but that report is based on the unclean or broken data it was given.

 

Because of the nature of a data collection point, which forwards traffic to the TIM probe, the following needs to be taken into account:

  • The physical data port/link usually carries only one communication channel, namely from the data collector to the TIM probe.
  • This data collection link is unidirectional. If data is lost here, the TIM probe cannot tell the data collection device that it has not received everything, so the probe ends up working with corrupt data.
  • Any data loss induced by the data collection device leads the TIM probe to identify defects that are not real, but are introduced by the poor quality of the traffic provided to its capture port.
  • SPAN ports on switches and routers were designed for troubleshooting purposes, not for monitoring!
  • SPAN capture devices have to take a full-duplex data feed (for example, 1Gbps in Tx and 1Gbps in Rx), modify the data flow, and send it out as a single data flow (from the SPAN device to the SPAN receiver, here the TIM capture port). Potentially 2 times 1Gbps has to fit into a 1Gbps data feed (one direction only), and that is when only one communication flow is spanned.
    The total amount of SPAN traffic increases with each bidirectional physical port added to the SPAN configuration. For 20 physical 1Gbps ports, the potential worst case is 40Gbps that has to be forwarded to the output SPAN port. If that port is 1Gbps, one can imagine how quickly its capacity is reached.
    At that point we can no longer speak of a copy of the traffic, only of a very similar look-alike version of it, and only if not too many packets are lost.
  • When the SPAN device has to re-arrange more than one data flow, micro bursts and high-priority traffic can misalign the packets within a communication flow. The TIM probe then has to put them back into the right order before it can analyze the data. Because buffers and CPU resources are limited, the higher the OOQB (Out of Order Queued Bytes) value is, the more likely it is that there is an issue on the SPAN provider side.
  • The SPAN device will discard packets. Packets with hardware and media errors are discarded, and if utilization exceeds the SPAN link capacity, packets are dropped.
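The worst-case SPAN arithmetic from the list above can be written down as a short calculation. It assumes every mirrored port runs at full line rate in both directions at the same time, which is a theoretical upper bound rather than a typical load. The function names are made up for this example.

```python
def worst_case_span_load_gbps(mirrored_ports: int, port_speed_gbps: float) -> float:
    """Worst-case traffic offered to the SPAN output: Tx plus Rx of every port."""
    return mirrored_ports * 2 * port_speed_gbps


def oversubscription_ratio(mirrored_ports: int,
                           port_speed_gbps: float,
                           span_output_gbps: float) -> float:
    """How many times the SPAN output link is oversubscribed in the worst case."""
    return worst_case_span_load_gbps(mirrored_ports, port_speed_gbps) / span_output_gbps


# The example from the text: 20 mirrored 1Gbps ports into one 1Gbps SPAN output.
print(worst_case_span_load_gbps(20, 1.0))    # 40.0 Gbps
print(oversubscription_ratio(20, 1.0, 1.0))  # 40.0
# Even a single mirrored port can already be 2:1 oversubscribed.
print(oversubscription_ratio(1, 1.0, 1.0))   # 2.0
```

Any ratio above 1.0 means the SPAN output can be forced to drop packets during bursts, even if average utilization looks harmless.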

 

To make a long story short: if one wants to provide a copy of the real traffic to a TIM probe, a network TAP installed in inline mode and forwarding 2 data streams to the TIM (one for the Tx traffic, one for the Rx traffic) is required. That is the only way to ensure the "correctness" of the monitoring results provided by APM CEM!
