
Stability issue of CA APM 10.5.2

Question asked by Jaykrishna on Dec 21, 2017
Latest reply on Dec 21, 2017 by Hiko_Davis

Hi Team,
Currently we are monitoring a lot of OpenShift applications through CA APM 10.5.2, with 8 collectors and 1 MoM in the test environment. But this environment is horribly unstable. Every now and then we are unable to connect to the EM, so we stop and start the collectors and the MoM to bring the environment back up.

I did a two-step analysis on one month of data:

1. To find out whether any collector is overloaded.
Outcome:
a) All the collectors are equally loaded, with around 400 agents reporting to each collector.
b) More than 300,000 metrics are being collected by each collector, and naturally the older collectors have more historical metrics than the newer ones.
2. To find out which agents are creating a lot of transaction tracing events per interval.
Outcome:
a) The listed OpenShift agents are creating a lot of transaction tracing events per interval.
When I say a lot, it is in comparison to the other agents, but I am not sure what the maximum value should be. I would be grateful if you could tell me the maximum number of transaction tracing events per interval allowed for each agent. I do believe that if there is a lot of activity in an application, there will naturally be a lot of transaction tracing events per interval; please correct me if I am wrong.
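For context, my understanding from the docs is that the per-agent trace volume is capped by an EM-side clamp in apm-events-thresholds-config.xml. I am quoting the clamp name and default from memory, so please correct me if this is the wrong one:

```xml
<!-- apm-events-thresholds-config.xml on the EM. Name and default value are my
     assumption from the 10.x docs, not copied from our install. As I understand
     it, this caps how many traces the EM accepts from a single agent per
     interval; traces beyond the clamp are dropped. -->
<clamp id="introscope.enterprisemanager.agent.trace.limit">
    <description>
        Maximum number of transaction traces accepted from one agent per interval.
    </description>
    <threshold value="1000"/>
</clamp>
```

If that is the right clamp, it would tell me the effective "max value" I am asking about.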

Please let me know what changes I need to make at the MoM and agent end to make this environment stable, or whether I need to revisit the capacity planning for this environment.
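On the agent end, the knobs I am considering tuning are the transaction tracer sampling properties in IntroscopeAgent.profile. The values below are the defaults as I understand them, shown only as an example, not our current settings; please confirm whether lowering the sampling rate here is the right lever:

```properties
# IntroscopeAgent.profile (agent side) -- sampling properties I believe control
# how many transactions the agent traces per sampling interval. Values shown
# are my assumption of the defaults, for illustration only.
introscope.agent.transactiontracer.sampling.perinterval.count=1
introscope.agent.transactiontracer.sampling.interval.seconds=120
```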

Below are the EM logs from one collector, which we restarted yesterday.

12/20/17 04:21:27.554 PM EST [INFO] [PO Route Down Executor] [Manager] Lost connection at: Node=Agent_1796098, Address=oddcoc.dynamic.us.ups.com/153.2.212.115:47487, Type=socket
12/20/17 04:21:27.823 PM EST [INFO] [PO Route Down Executor] [Manager] Lost connection at: Node=Agent_1796804, Address=KYLOUSVRPW225.us.ups.com/10.120.12.117:59254, Type=socket
12/20/17 04:21:29.250 PM EST [INFO] [PO Route Down Executor] [Manager] Lost connection at: Node=Agent_1796723, Address=gaalplpapp00222.linux.us.ups.com/10.251.164.89:53840, Type=socket
12/20/17 04:21:29.596 PM EST [INFO] [PO Route Down Executor] [Manager] Lost connection at: Node=Agent_1796728, Address=gaalplpapp0020f.linux.us.ups.com/10.251.164.77:42478, Type=socket
12/20/17 04:21:29.876 PM EST [INFO] [PO Route Down Executor] [Manager] Lost connection at: Node=Agent_1796729, Address=gaalplpapp00211.linux.us.ups.com/10.251.164.79:53592, Type=socket

From the log files on the system:

12/05/17 12:03:46.911 PM EST [INFO] [PO:WatchedAgentPO Mailman 4] [Manager] TransactionTrace arrival buffer full, discarding trace(s)
12/05/17 12:03:47.055 PM EST [INFO] [PO:WatchedAgentPO Mailman 4] [Manager] TransactionTrace arrival buffer full, discarding trace(s)
12/05/17 12:03:48.064 PM EST [INFO] [PO:WatchedAgentPO Mailman 4] [Manager] TransactionTrace arrival buffer full, discarding trace(s)
12/05/17 12:03:50.431 PM EST [INFO] [PO:WatchedAgentPO Mailman 4] [Manager] TransactionTrace arrival buffer full, discarding trace(s)
12/05/17 12:03:50.537 PM EST [INFO] [PO:WatchedAgentPO Mailman 4] [Manager] TransactionTrace arrival buffer full, discarding trace(s)
12/05/17 12:03:52.376 PM EST [INFO] [PO:WatchedAgentPO Mailman 4] [Manager] TransactionTrace arrival buffer full, discarding trace(s)

Below is the trace storage disk-usage clamp, which limits trace storage to 100 GB (102400 MB):

        <clamp id="introscope.enterprisemanager.transactionevents.storage.max.disk.usage">
            <description>
                The maximum desired disk usage in MB for the trace storage. If this maximum is exceeded,
                then the daily historical trace stores will be deleted starting with the oldest first
                until the total historical trace storage size is below this value.
                The current days trace store actively storing traces will not be deleted even if
                its size exceeds this property value. The size of the trace index is not considered
                when determining what historical trace stores to delete.
            </description>
            <threshold value="102400"/>
        </clamp>
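The "TransactionTrace arrival buffer full, discarding trace(s)" messages above also seem to point at a separate clamp in the same apm-events-thresholds-config.xml file. Again, the name and default here are my assumption from the docs, quoted from memory, so please verify against your install:

```xml
<!-- Also in apm-events-thresholds-config.xml: the clamp that, as I understand
     it, sizes the buffer behind the "arrival buffer full" messages. Name and
     default value are assumptions, for illustration only. -->
<clamp id="introscope.enterprisemanager.transactiontrace.arrivalbuffer.capacity">
    <description>
        Capacity of the incoming transaction trace buffer. Traces arriving
        while the buffer is full are discarded.
    </description>
    <threshold value="5000"/>
</clamp>
```

Should I be raising this capacity, or is the discarding a symptom that the agents are simply sending too many traces?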


I am seriously concerned about the health of this environment after onboarding only 20+ applications in one CA APM cluster, and we are planning to onboard 300+ applications across multiple clusters.

Feel free to ask for anything you need from my environment to diagnose this issue.

Any suggestion/feedback/configuration change/hint is highly appreciated.

 

Cheers

Jay

Cell No:551-263-9681

E-mail:jmishra@ups.com
