DX Application Performance Management

  • 1.  Stall with no response time over 30s?

    Posted Apr 12, 2016 03:39 PM

    Hello guys, I need help understanding this.

     

    We are seeing stall counts on a method we instrumented in our custom PBD, but the response times for that method (we checked both max and min) are never greater than 15,000 ms.

    Here is what our metrics look like:

    [Screenshot: overview.jpg]

    We have a maximum of 8 threads that can invoke this class/method, so we believe the stall count topping out at 7 is not a coincidence. What we do not understand is why the stall shows up on the method being invoked and not on the caller.
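    A minimal sketch of the bound described above, assuming a plain fixed-size pool (the class and method names are hypothetical): if only 8 threads can ever call the method, at most 8 calls to it can be in flight at once, so a stall count derived from in-flight calls cannot go above 8.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class PoolBoundDemo {
    private static final AtomicInteger inFlight = new AtomicInteger();
    private static final AtomicInteger peak = new AtomicInteger();

    // Stand-in for the instrumented class/method (hypothetical name).
    static void instrumentedMethod() throws InterruptedException {
        int now = inFlight.incrementAndGet();
        peak.accumulateAndGet(now, Math::max); // remember the highest concurrency seen
        try {
            Thread.sleep(20); // simulated work
        } finally {
            inFlight.decrementAndGet();
        }
    }

    // Runs `tasks` calls on a fixed pool and returns the peak number of
    // invocations that were in flight at the same time.
    static int peakConcurrency(int poolSize, int tasks) {
        inFlight.set(0);
        peak.set(0);
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        for (int i = 0; i < tasks; i++) {
            pool.submit(() -> {
                instrumentedMethod();
                return null; // Callable, so the checked InterruptedException is allowed
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println("peak concurrent invocations: " + peakConcurrency(8, 50));
    }
}
```

    With 8 threads and 50 queued calls, the printed peak should be at most 8; the exact value depends on thread scheduling.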

    We are also not seeing stall errors (on the Errors tab) for this method/class. Is there any way to force them to show up there?

     

    Any ideas?



  • 2.  Re: Stall with no response time over 30s?

    Posted Apr 13, 2016 01:53 PM

    Are you sure that the call actually ends? Average Response Time is recorded when the call ends, but Stall Count is reported while transactions are still active. So if something kills the stalled transaction without letting it end correctly, the agent might lose track of the call, and you would not see a response time for it.

     

    It's a guess; I don't know whether it fits your situation.
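    A toy model of that idea (a hypothetical class, not the agent's actual implementation; the 30 s threshold is taken from the thread title): a response time is only recorded on a normal exit, while the stall count is sampled from calls that are still active. A call that never returns normally therefore shows up as a stall and then disappears without ever contributing to the response-time metrics.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class InvocationTracker {
    private final long stallThresholdMs;
    private final Map<Integer, Long> active = new HashMap<>(); // callId -> start time (ms)
    private final List<Long> responseTimesMs = new ArrayList<>();

    InvocationTracker(long stallThresholdMs) {
        this.stallThresholdMs = stallThresholdMs;
    }

    void enter(int callId, long nowMs) {
        active.put(callId, nowMs);
    }

    // Normal exit: the only place a response time is recorded.
    void exit(int callId, long nowMs) {
        Long start = active.remove(callId);
        if (start != null) {
            responseTimesMs.add(nowMs - start);
        }
    }

    // Abnormal end (e.g. the transaction is killed): the call disappears
    // without ever producing a response-time sample.
    void lose(int callId) {
        active.remove(callId);
    }

    // Periodic sample: calls active for longer than the threshold count as stalls.
    int stallCount(long nowMs) {
        int stalls = 0;
        for (long start : active.values()) {
            if (nowMs - start > stallThresholdMs) {
                stalls++;
            }
        }
        return stalls;
    }

    long maxResponseTimeMs() {
        long max = 0;
        for (long rt : responseTimesMs) {
            max = Math.max(max, rt);
        }
        return max;
    }

    public static void main(String[] args) {
        InvocationTracker t = new InvocationTracker(30_000);
        t.enter(1, 0);
        t.exit(1, 15_000);                         // completes in 15 s
        t.enter(2, 0);                             // never returns normally
        System.out.println(t.stallCount(45_000));  // stalled at this sample
        t.lose(2);
        System.out.println(t.stallCount(60_000));  // gone, with no response time
        System.out.println(t.maxResponseTimeMs()); // only the completed call counts
    }
}
```

    Here main prints 1, 0 and 15000: one stall at the 45 s sample, none after the call is lost, and a maximum response time that only reflects the call that completed.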

     

    Regards,

    Roger



  • 3.  Re: Stall with no response time over 30s?

    Broadcom Employee
    Posted Apr 13, 2016 01:56 PM

    You must create an entry in your PBD using the tracer type "ExceptionErrorReporter" in order to increment "Errors Per Interval" (EPI). Otherwise, EPI will always report zero.



  • 4.  Re: Stall with no response time over 30s?

    Posted Apr 13, 2016 02:41 PM

    Howdy Andfer,

     

    Does the target application use AJAX or reverse AJAX (Comet)?

     

    In either case, the AJAX call returns a response to the request, which lets APM register the Average Response Time, but the thread is still open and could be producing additional responses/data or browser callbacks along the AJAX path. That would keep the thread alive, which would trigger the stall counts.
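    A minimal sketch of the scenario described above, using plain JDK classes with hypothetical names: the worker hands back a response early, so from the caller's side the request is complete, but the same thread keeps running afterwards, which is what a stall check on that thread would keep seeing.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class EarlyResponseDemo {
    static final class Result {
        final String response;
        final boolean threadAliveAfterResponse;

        Result(String response, boolean threadAliveAfterResponse) {
            this.response = response;
            this.threadAliveAfterResponse = threadAliveAfterResponse;
        }
    }

    static Result run() {
        CompletableFuture<String> response = new CompletableFuture<>();
        CountDownLatch moreWork = new CountDownLatch(1);

        Thread worker = new Thread(() -> {
            response.complete("initial payload"); // the caller gets its answer here
            try {
                moreWork.await(); // ...but the thread keeps going (simulated follow-up pushes)
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();

        try {
            // From the caller's side the request is finished once this returns.
            String body = response.get(5, TimeUnit.SECONDS);
            boolean alive = worker.isAlive(); // true: the worker is still parked on the latch
            return new Result(body, alive);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException | TimeoutException e) {
            throw new IllegalStateException(e);
        } finally {
            moreWork.countDown(); // let the background work finish
        }
    }

    public static void main(String[] args) {
        Result r = run();
        System.out.println("response=" + r.response + ", thread still alive=" + r.threadAliveAfterResponse);
    }
}
```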

     

    Hope this helps,

    Billy



  • 5.  Re: Stall with no response time over 30s?

    Posted Apr 13, 2016 07:09 PM

    Thanks for the insights,

     

    This is basically how the app is structured:

    There is a pool of threads that stay up for as long as the application is running (sleeping or running as needed), and each of these threads might call the class/method we are instrumenting.

    The threads are also being monitored, and they do show the endless stalls and huge response times. But I still don't understand why the method they call on another class behaves like this.

     

    rogelio.dipasquale

    There should not be anything killing the transaction (at least, nothing that we know of so far).

     

    Hiko_Davis

    We are already using the ExceptionErrorReporter. We do see stall errors from the threads we are monitoring, but none from this class/method.

     

    bwcole

    We do not have AJAX/Comet running, but we do have some threads that we keep alive. The weird thing in the graphs is that the stalls come and go, so I expected the response time to be computed at some point. The threads we are monitoring, on the other hand, show a constant number of stalls (as expected).
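    A toy timeline contrasting the two observations above, with made-up times in seconds and an assumed 30 s stall threshold (the names and data are hypothetical): the thread's outer loop never returns, so once it passes the threshold it is stalled at every sample, while the inner method only stalls while one particular call is running long.

```java
import java.util.List;

class StallTimeline {
    static final long THRESHOLD_S = 30; // assumed 30 s stall threshold, taken from the thread title

    // One invocation, times in seconds; end < 0 means it has not returned yet.
    static final class Call {
        final long start;
        final long end;

        Call(long start, long end) {
            this.start = start;
            this.end = end;
        }

        boolean activeAt(long t) {
            return start <= t && (end < 0 || t < end);
        }
    }

    // Stall count at sample time t: calls still active and older than the threshold.
    static int stallsAt(List<Call> calls, long t) {
        int n = 0;
        for (Call c : calls) {
            if (c.activeAt(t) && t - c.start > THRESHOLD_S) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        List<Call> workerLoop = List.of(new Call(0, -1)); // the thread's run() never returns
        List<Call> method = List.of(new Call(0, 10), new Call(10, 25), new Call(25, 70));
        for (long t = 15; t <= 90; t += 15) {
            System.out.println(t + "s: loop stalls=" + stallsAt(workerLoop, t)
                    + ", method stalls=" + stallsAt(method, t));
        }
    }
}
```

    Across the samples at 15 to 90 s, the loop column reads 0, 0, 1, 1, 1, 1 and the method column reads 0, 0, 0, 1, 0, 0. In this model the method's stall clears at the 75 s sample only because the 25 to 70 s call returned, and that return would also record a 45 s response time; a stall that clears without any response time above 30 s would suggest the call ended some other way, as Roger suggested.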



  • 6.  Re: Stall with no response time over 30s?

    Posted Apr 14, 2016 07:15 AM

    Hi andfer

     

    I'm on shaky ground here, but from what I know, the average response time comes from the Java agent/Autoprobe listening for the request or method call and for the response or return of that call.

    The Stall metric is based on the active thread ID and monitors threads within the JVM's thread pool. Autoprobe/the Java agent captures the thread IDs within the thread pool and compares the active thread IDs to the previously captured ones.

    This would be like a J2EE data source or EJB pool.

     

    I'll try to give an example of what I think I know... again, shaky ground ahead.

     

    1. Request is sent from client to server

    2. Server uses a process thread: web container thread 1234 from the web container pool

    3. Autoprobe - stall captures that thread 1234 is within the current metric measurement cycle

    4. Autoprobe - average response time clocks request on 1234

    5. Web container thread processes the request which includes a call to your process thread pool

    6. The process thread pool currently has no threads available, so it creates thread 5678

    7. The process thread pool assigns thread 5678 to the thread 1234 transactional context

    8. Autoprobe - average response time clocks request on 5678

    9. Autoprobe - stall captures that thread 5678 is within the current metric measurement cycle

    10. Thread 5678 processes the request and returns the data response

    11. Autoprobe - clocks response on 5678, determines average response time on thread 5678

    12. Process thread pool returns response to thread 1234

    13. At this point, thread 5678 is returned to the process thread pool and is still listed among the JVM's threads. Since your custom thread pool isn't one of the standard JVM thread pools, Autoprobe would not recognize that the thread belongs to a pool and would monitor it like any other JVM thread.

    14. Web Container thread 1234, returns response to original client thread

    15. Autoprobe - clocks response on 1234, determines average response time on thread 1234

    16. Autoprobe enters next measurement cycle

    17. There are no client threads, but the thread enlisted at step 11 is still active. Stall captures that thread 5678 is still active.

    18. Now, if that thread ages out of the pool or gets used by another request, then the stall capture (I'm guessing) would recognize that the thread is processing a different request and reset the stall marker. If not, each cycle that the thread stays in the pool becomes another stall tick.

    19. If the thread is still marked as present within the JVM thread pool for one more cycle, it is reported as a stall.
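    Billy's guess about the last few steps (a thread that stays active across measurement cycles keeps adding stall ticks, and the count resets if the thread is reused for a different request) can be sketched as follows. This models his hypothesis only, not documented agent behavior; the class name, the thread/request IDs, and the one-cycle grace period are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

class CycleStallGuess {
    private final int cyclesBeforeStall;
    private final Map<Long, Long> lastRequest = new HashMap<>();   // threadId -> requestId
    private final Map<Long, Integer> cyclesSeen = new HashMap<>(); // threadId -> consecutive cycles on that request

    CycleStallGuess(int cyclesBeforeStall) {
        this.cyclesBeforeStall = cyclesBeforeStall;
    }

    // Called once per measurement cycle with every active thread and the request it is serving.
    int observeCycle(Map<Long, Long> activeThreadToRequest) {
        // Threads that are no longer active lose their history.
        lastRequest.keySet().retainAll(activeThreadToRequest.keySet());
        cyclesSeen.keySet().retainAll(activeThreadToRequest.keySet());

        int stalls = 0;
        for (Map.Entry<Long, Long> e : activeThreadToRequest.entrySet()) {
            long thread = e.getKey();
            long request = e.getValue();
            Long previous = lastRequest.put(thread, request);
            // Same request as last cycle: one more tick. New or reassigned thread: start over.
            int count = (previous != null && previous == request) ? cyclesSeen.get(thread) + 1 : 1;
            cyclesSeen.put(thread, count);
            if (count > cyclesBeforeStall) {
                stalls++;
            }
        }
        return stalls;
    }

    public static void main(String[] args) {
        CycleStallGuess guess = new CycleStallGuess(1);
        System.out.println(guess.observeCycle(Map.of(5678L, 1L))); // first sighting
        System.out.println(guess.observeCycle(Map.of(5678L, 1L))); // still on request 1
        System.out.println(guess.observeCycle(Map.of(5678L, 2L))); // reused for request 2
    }
}
```

    With a one-cycle grace period, the three calls in main print 0, 1 and 0.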

     

    Hoping Hiko_Davis will correct me so I can understand how the ART and stall stuff works.

     

    Billy



  • 7.  Re: Stall with no response time over 30s?

    Broadcom Employee
    Posted Apr 14, 2016 04:07 PM

    Stalls come from BlamePointTracer.

    Incrementing EPI comes from ExceptionErrorReporter.

     

    Make sure you have both in your PBD.
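    A toy Java model of the two roles described above (the class and flags are hypothetical stand-ins, not the agent's API; the real tracers are configured in the PBD): the timing role sees every call, but an exception only counts toward Errors Per Interval when the error-reporting role is also applied, which matches the point that EPI stays at zero without an ExceptionErrorReporter entry.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

class TracerRoles {
    private final boolean timing;         // plays the BlamePointTracer role
    private final boolean errorReporting; // plays the ExceptionErrorReporter role
    final AtomicInteger timedCalls = new AtomicInteger();
    final AtomicLong totalNanos = new AtomicLong();
    final AtomicInteger errorsPerInterval = new AtomicInteger();

    TracerRoles(boolean timing, boolean errorReporting) {
        this.timing = timing;
        this.errorReporting = errorReporting;
    }

    <T> T invoke(Supplier<T> body) {
        long start = System.nanoTime();
        try {
            return body.get();
        } catch (RuntimeException e) {
            if (errorReporting) {
                errorsPerInterval.incrementAndGet(); // only counted when this role is configured
            }
            throw e; // the caller still sees the exception
        } finally {
            if (timing) {
                timedCalls.incrementAndGet();
                totalNanos.addAndGet(System.nanoTime() - start);
            }
        }
    }

    public static void main(String[] args) {
        for (boolean withReporter : new boolean[] {false, true}) {
            TracerRoles tracer = new TracerRoles(true, withReporter);
            try {
                tracer.invoke(() -> { throw new IllegalStateException("simulated failure"); });
            } catch (IllegalStateException expected) {
                // the exception propagates to the caller either way
            }
            System.out.println("reporter=" + withReporter + " timedCalls=" + tracer.timedCalls.get()
                    + " EPI=" + tracer.errorsPerInterval.get());
        }
    }
}
```

    Run as-is, main reports timedCalls=1 in both cases, with EPI=0 without the reporter role and EPI=1 with it.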