DX Application Performance Management

  • 1.  Stability issue of CA APM 10.5.2

    Posted Dec 21, 2017 10:24 AM

    Hi Team,
    We are currently monitoring a lot of OpenShift applications through CA APM 10.5.2, with 8 collectors and 1 MoM in the test environment. But this environment is horribly unstable. Every now and then we are unable to connect to the EM, so we stop and start the collectors and the MoM to bring the environment back up.

    I did a two-step analysis on one month of data:

    1. To find out if any collector is overloaded.
    Outcome:
    a) All the collectors are equally loaded, with around 400 agents reporting to each collector.
    b) More than 300,000 metrics are being collected by each collector, and the older collectors naturally have more historical metrics than the newer ones.
    2. To find out which agents are creating a lot of transaction tracing events per interval.
    Outcome:
    a) The OpenShift agents are creating a lot of transaction tracing events per interval.
    When I say a lot, it is in comparison to the other agents, but I am not sure what the maximum value should be. I would be grateful if you could let me know the maximum number of transaction tracing events per interval for each agent. I believe that if there is a lot of activity in the application then there will naturally be a lot of transaction tracing events per interval; please correct me if I am wrong.

    Please let me know what changes I need to make on the MoM and agent side to make this environment stable, or whether I need to revisit the capacity planning for this environment.
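
    For reference, these are the agent-side transaction tracer sampling properties I have been looking at in the IntroscopeAgent.profile. The values below are what I understand the shipped defaults to be, so please correct me if 10.5.2 differs:

        # Transaction tracer sampling (values are the defaults as I understand them)
        # Number of transactions traced per sampling interval
        introscope.agent.transactiontracer.sampling.perinterval.count=1
        # Length of the sampling interval, in seconds
        introscope.agent.transactiontracer.sampling.interval.seconds=120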

    Below are the EM logs from one collector that we restarted yesterday.

    12/20/17 04:21:27.554 PM EST [INFO] [PO Route Down Executor] [Manager] Lost connection at: Node=Agent_1796098, Address=oddcoc.dynamic.us.ups.com/153.2.212.115:47487, Type=socket
    12/20/17 04:21:27.823 PM EST [INFO] [PO Route Down Executor] [Manager] Lost connection at: Node=Agent_1796804, Address=KYLOUSVRPW225.us.ups.com/10.120.12.117:59254, Type=socket
    12/20/17 04:21:29.250 PM EST [INFO] [PO Route Down Executor] [Manager] Lost connection at: Node=Agent_1796723, Address=gaalplpapp00222.linux.us.ups.com/10.251.164.89:53840, Type=socket
    12/20/17 04:21:29.596 PM EST [INFO] [PO Route Down Executor] [Manager] Lost connection at: Node=Agent_1796728, Address=gaalplpapp0020f.linux.us.ups.com/10.251.164.77:42478, Type=socket
    12/20/17 04:21:29.876 PM EST [INFO] [PO Route Down Executor] [Manager] Lost connection at: Node=Agent_1796729, Address=gaalplpapp00211.linux.us.ups.com/10.251.164.79:53592, Type=socket

    Also from the log files on the same system:

    12/05/17 12:03:46.911 PM EST [INFO] [PO:WatchedAgentPO Mailman 4] [Manager] TransactionTrace arrival buffer full, discarding trace(s)
    12/05/17 12:03:47.055 PM EST [INFO] [PO:WatchedAgentPO Mailman 4] [Manager] TransactionTrace arrival buffer full, discarding trace(s)
    12/05/17 12:03:48.064 PM EST [INFO] [PO:WatchedAgentPO Mailman 4] [Manager] TransactionTrace arrival buffer full, discarding trace(s)
    12/05/17 12:03:50.431 PM EST [INFO] [PO:WatchedAgentPO Mailman 4] [Manager] TransactionTrace arrival buffer full, discarding trace(s)
    12/05/17 12:03:50.537 PM EST [INFO] [PO:WatchedAgentPO Mailman 4] [Manager] TransactionTrace arrival buffer full, discarding trace(s)
    12/05/17 12:03:52.376 PM EST [INFO] [PO:WatchedAgentPO Mailman 4] [Manager] TransactionTrace arrival buffer full, discarding trace(s)
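
    My assumption (please correct me if the property name is wrong) is that this arrival buffer is sized by a property in the IntroscopeEnterpriseManager.properties along the following lines, which we have never changed from its default:

        # EM-side transaction trace arrival buffer capacity
        # (property name and value are my assumption; please confirm)
        introscope.enterprisemanager.transactiontrace.arrivalbuffer.capacity=5000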

     
    Below is the trace storage clamp, which is set to a 100 GB limit (102400 MB).

            <clamp id="introscope.enterprisemanager.transactionevents.storage.max.disk.usage">
                <description>
                    The maximum desired disk usage in MB for the trace storage. If this maximum is exceeded,
                    then the daily historical trace stores will be deleted starting with the oldest first
                    until the total historical trace storage size is below this value.
                                    The current days trace store actively storing traces will not be deleted even if
                                    its size exceeds this property value. The size of the trace index is not considered
                                    when determining what historical trace stores to delete.
                </description>
                <threshold value="102400"/>
            </clamp>
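
    Since each collector is already collecting 300,000+ metrics, I have also been looking at the per-agent metric clamp in the same clamps file (as far as I can tell). The snippet below is my reading of it; the description is paraphrased by me and the threshold is what I believe the shipped default to be, so please correct me if not:

            <clamp id="introscope.enterprisemanager.agent.metrics.limit">
                <description>
                    Clamps the number of metrics the Enterprise Manager accepts from a single agent
                    (description paraphrased; the threshold below is my understanding of the default).
                </description>
                <threshold value="50000"/>
            </clamp>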


    I am seriously concerned about the health of this environment after onboarding only 20+ applications in one CA APM cluster, given that we are planning to onboard 300+ applications across multiple clusters.

    Feel free to ask for anything you need from my environment to diagnose this issue.

    Any suggestions, feedback, configuration changes, or hints are highly appreciated.

     

    Cheers

    Jay

    Cell No: 551-263-9681

    E-mail: jmishra@ups.com



  • 2.  Re: Stability issue of CA APM 10.5.2

    Broadcom Employee
    Posted Dec 21, 2017 01:18 PM

    This would be a good place to start.

     

    CA APM Slow Performance 

     

    Also see the sizing and performance guide.

     

    CA APM Sizing and Performance - CA Application Performance Management - 10.5 - CA Technologies Documentation 

     

     

    As for transaction traces (TT), the default limit is 5000. This is configurable in the IntroscopeAgent.profile.
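
    If that is the per-trace component clamp, the corresponding entry in the IntroscopeAgent.profile looks like the following (confirm against your own profile for 10.5.2):

        # Maximum number of components recorded in a single transaction trace
        introscope.agent.transactiontrace.componentCountClamp=5000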



  • 3.  Re: Stability issue of CA APM 10.5.2
    Best Answer

    Broadcom Employee
    Posted Dec 21, 2017 01:55 PM

    Jay,

    I have reached out to your account team. They will be contacting you shortly since I know APM SWAT was involved in this setup.