Hi Team,
We are currently monitoring a lot of OpenShift applications through CA APM 10.5.2, with 8 collectors and 1 MoM in the test environment. But this environment is horribly unstable. Every now and then we are unable to connect to the EM, so we stop and start the collectors and the MoM to bring the environment back up.
I did a two-step analysis of one month of data.
1. To find out whether any collector is overloaded.
Outcome:
a) All the collectors are equally loaded and around 400 agents are reporting to each collector.
b) More than 300 thousand metrics are collected by each collector, and obviously the older collectors have more historical metrics than the newer ones.
2. To find out which agents are creating a lot of transaction tracing events per interval.
Outcome:
a) The OpenShift agents are creating a lot of transaction tracing events per interval.
When I say "a lot", I mean in comparison to other agents, but I am not sure what the maximum value should be. I would be grateful if you could tell me the maximum number of transaction tracing events per interval for each agent. I believe that if there is a lot of activity in the application, there will definitely be a lot of transaction tracing events per interval; please correct me if I am wrong.
Please let me know what changes I need to make on the MoM and agent side to make this environment stable, or whether I need to revisit the capacity planning of this environment.
Below are the EM logs from one collector, which we restarted yesterday.
12/20/17 04:21:27.554 PM EST [INFO] [PO Route Down Executor] [Manager] Lost connection at: Node=Agent_1796098, Address=oddcoc.dynamic.us.ups.com/153.2.212.115:47487, Type=socket
12/20/17 04:21:27.823 PM EST [INFO] [PO Route Down Executor] [Manager] Lost connection at: Node=Agent_1796804, Address=KYLOUSVRPW225.us.ups.com/10.120.12.117:59254, Type=socket
12/20/17 04:21:29.250 PM EST [INFO] [PO Route Down Executor] [Manager] Lost connection at: Node=Agent_1796723, Address=gaalplpapp00222.linux.us.ups.com/10.251.164.89:53840, Type=socket
12/20/17 04:21:29.596 PM EST [INFO] [PO Route Down Executor] [Manager] Lost connection at: Node=Agent_1796728, Address=gaalplpapp0020f.linux.us.ups.com/10.251.164.77:42478, Type=socket
12/20/17 04:21:29.876 PM EST [INFO] [PO Route Down Executor] [Manager] Lost connection at: Node=Agent_1796729, Address=gaalplpapp00211.linux.us.ups.com/10.251.164.79:53592, Type=socket
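To show how bursty these disconnects are, the entries can be tallied per minute. This is only a minimal sketch, assuming every line follows exactly the format of the entries above; the function names are my own and are not part of CA APM.

```python
import re
from collections import Counter
from datetime import datetime

# Matches the "Lost connection" entries in the format shown above.
LOST_RE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\.\d{3} (?P<ampm>[AP]M) "
    r".*?Lost connection at: Node=(?P<node>[^,]+), "
    r"Address=(?P<host>[^/]+)/(?P<ip>[\d.]+):(?P<port>\d+)"
)


def parse_lost_connections(lines):
    """Return (timestamp, node, host, ip) for every 'Lost connection' line."""
    events = []
    for line in lines:
        m = LOST_RE.match(line.strip())
        if m:
            when = datetime.strptime(
                m["ts"] + " " + m["ampm"], "%m/%d/%y %I:%M:%S %p"
            )
            events.append((when, m["node"], m["host"], m["ip"]))
    return events


def disconnects_per_minute(events):
    """Count disconnect events per minute (seconds truncated)."""
    return Counter(when.replace(second=0) for when, *_ in events)
```

Feeding the EM log file line by line into `parse_lost_connections` and then into `disconnects_per_minute` shows whether many agents drop within the same minute.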
We also see the following in the log files on the system:
12/05/17 12:03:46.911 PM EST [INFO] [PO:WatchedAgentPO Mailman 4] [Manager] TransactionTrace arrival buffer full, discarding trace(s)
12/05/17 12:03:47.055 PM EST [INFO] [PO:WatchedAgentPO Mailman 4] [Manager] TransactionTrace arrival buffer full, discarding trace(s)
12/05/17 12:03:48.064 PM EST [INFO] [PO:WatchedAgentPO Mailman 4] [Manager] TransactionTrace arrival buffer full, discarding trace(s)
12/05/17 12:03:50.431 PM EST [INFO] [PO:WatchedAgentPO Mailman 4] [Manager] TransactionTrace arrival buffer full, discarding trace(s)
12/05/17 12:03:50.537 PM EST [INFO] [PO:WatchedAgentPO Mailman 4] [Manager] TransactionTrace arrival buffer full, discarding trace(s)
12/05/17 12:03:52.376 PM EST [INFO] [PO:WatchedAgentPO Mailman 4] [Manager] TransactionTrace arrival buffer full, discarding trace(s)
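To quantify how often traces are being discarded, the "arrival buffer full" messages can be counted per minute. Again, this is just a rough sketch that assumes the log format shown above; `discards_per_minute` is a hypothetical helper name, not a CA APM tool.

```python
import re
from collections import Counter

# Captures the date, hour and minute plus AM/PM of each discard message.
DISCARD_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}):\d{2}\.\d{3} ([AP]M) .*"
    r"TransactionTrace arrival buffer full, discarding trace\(s\)"
)


def discards_per_minute(lines):
    """Count 'arrival buffer full' messages per minute."""
    counts = Counter()
    for line in lines:
        m = DISCARD_RE.match(line.strip())
        if m:
            counts[f"{m.group(1)} {m.group(2)}"] += 1
    return counts
```

Each key of the result is a minute such as `12/05/17 12:03 PM`, and the value is the number of discard messages logged in that minute.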
Below is the trace storage limit, which is set to 100 GB:
<clamp id="introscope.enterprisemanager.transactionevents.storage.max.disk.usage">
<description>
The maximum desired disk usage in MB for the trace storage. If this maximum is exceeded,
then the daily historical trace stores will be deleted starting with the oldest first
until the total historical trace storage size is below this value.
The current days trace store actively storing traces will not be deleted even if
its size exceeds this property value. The size of the trace index is not considered
when determining what historical trace stores to delete.
</description>
<threshold value="102400"/>
</clamp>
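As a quick check of the unit conversion, the threshold is given in MB, so 102400 MB corresponds to 100 GB. A small sketch that reads the value from the clamp element above (the description element is omitted here for brevity, and `clamp_threshold_gb` is a hypothetical helper name):

```python
import xml.etree.ElementTree as ET

# The clamp from above, with the <description> element left out.
CLAMP_XML = """<clamp id="introscope.enterprisemanager.transactionevents.storage.max.disk.usage">
<threshold value="102400"/>
</clamp>"""


def clamp_threshold_gb(xml_text):
    """Return (clamp id, threshold in GB) for a single <clamp> element."""
    clamp = ET.fromstring(xml_text)
    threshold_mb = int(clamp.find("threshold").get("value"))
    return clamp.get("id"), threshold_mb / 1024  # 1 GB = 1024 MB
```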
I am seriously concerned about the health of this environment after onboarding only 20+ applications in one CA APM cluster, since we are planning to onboard 300+ applications across multiple clusters.
Feel free to ask for anything you need from my environment to diagnose this issue.
Any suggestions, feedback, configuration changes, or hints are highly appreciated.
Cheers
Jay
Cell No:551-263-9681
E-mail:jmishra@ups.com