Collector(s) High Harvest Duration

Document created by eyes on Feb 5, 2015Last modified by Melanie_Giuliani on Feb 12, 2015
Version 1Show Document
  • View in full screen mode

I am running version 9.1.4. One MoM, 4 Collectors. I'm seeing high harvest duration times, up to 15 seconds. They occur every hour. During the rest of the hour Duration times are less than a second.

So every hour on each collector, I get this for a couple minutes:

JVM is not pegged. However, GC increases at the time (Not overloaded)
CPU reaches 96% for about a minute during the harvest.
Havest capacity spikes to 86 percent
Number of Metric Data Queries per interval spikes (No one's using it)
Under "Query" I'm getting a spike in Data Points Returned Per Interval
Also, "Metrics Read from Disk Per Interval" Spikes
Total Index Inseration Duration Per Interval(ms) Spikes
Harvest Duration reaches 15 seconds.

It looks like the EM is doing a query of the SmartStor Database each hour. SmartStor Duration looks fine; however, Harvest Duration goes to 15 seconds.
I'm curious about the Total Index Inseration Duration. Is Introscope querying the Smartstor database and rebuidling an index on it each hour?

Anyone else seeing something similiar?

JW
Hi JW,

Please take a look at the info below and let us know if that may help you find the source of your problems.

Regs,

Florian.
  
____________________________________________

This is a one page mini app-node, explaining the Harvest Cycle (as of 9.1), and the various metrics and log outputs that monitor it. I also include diagnosis characteristics to triage for at the bottom. The entire process is as follows:


THE HARVEST CYCLE

1.
The overall cycle is controlled by ClockEntity, which does the following steps (2 through 6 or 7/8) every 15 seconds. The actual time it takes to complete this entire process is recorded in the Metric “Tasks:Harvest Duration (ms)”. Please note that many of these processes overlap in different threads, so the times do not always add up.
2.
The first stage is the agent harvest. This:
a.
Waits for any previous asynchronous “query tracker” TRS (Timeslice Result Set) processing to complete (see item #4 below)
b.
Creates the “query tracker” TRS, kicks off a multithreaded process to copy data from the “chunk tree”s (where the Agent trackers store data in raw form) into the “persister” TRS; the system waits until all threads have completed this task.
c.
Data is copied in a multithreaded manner out of the Agent (where Agent metrics are placed) and put in a “Result” Timesliced Result Set.
The aggregate time for a, b, & c is calculated, but not displayed as a metric. However, by setting log4j.logger.Clock=DEBUG, you will see the following message appear: “Time to harvest agents: {}ms”, in the IntroscopeEnterpriseManager logs, where {} is the elapsed time.
3.
The calculators are then run in a multithreaded fashion. The manager runs each calculator’s query, sends the query results to the calculator, gets the calculated Metrics back, and then sends these results to the client. The aggregate time it takes to run all calculators is recorded in the Metric “Enterprise Manager|Internal:Calculator Harvest Time (ms)”. If you set log4j.logger.Clock=DEBUG, you will also see a nearly identical “Time to Harvest listener {RawDataStash}: {}ms” number in the logs. (It’s not really {RawDataStash}, but you can id it because it’s the first of two such “listener” entries.)
4.
After the calculators are run, a flag is flipped telling the RawDataStash to run all other (non-calculator) queries on the “Result” TRS, delivering their results up to the various clients that registered them. This is run on a different thread asynchronously with the other processes further down. However, the next timeslice will not begin processing before this is completed. It waits at stage 2a, listed above.
5.
The alert manager then runs, gathering all handled alerts, updating the state of all persistent alerts, and sending notices out to all of attached (workstation) clients that are not blacked out; this includes all AppMap Alerts. If you set log4j.logger.Clock=DEBUG, you will see a “Time to Harvest listener {AlertStateNotificationManager}: {}ms”, in the logs with the time set, indicating how long it took this to run. (It’s not really {AlertStateNotificationManager}, but you can id it as the second of two such “listener” entries.)
6.
At this point, a check is performed to determine if spooling needs to take place. Spooling only is performed once every {} seconds, where {} is the value of Introscope.enterprisemanager.smartstor.tier1.frequency; the default is 15 seconds. If spooling needs to be performed, but spooling is still in process, then the following error is produced: “EM load exceeds hardware capacity. Timeslice data is being aggregated into longer periods”. Then the data in the existing timeslice is entered into the gatherer (in a singlethreaded fashion), and steps 7 and 8 are skipped.
7.
If spooling is not currently in process, the following tasks are performed in order:
a.
All gatherers are called to compress the various TRS MetricData values into a single value
b.
Data from these gatherers are placed into independent “integer” and “long” lists
c.
A spooling task is spawned to perform additional work asynchronously
The aggregate time for a, b, & c is calculated, but not displayed as a metric. However, by setting log4j.logger.Clock=DEBUG, you will see the following message appear: “Time to harvest in SmartStor: {}ms”, in the IntroscopeEnterpriseManager logs, where {} is the elapsed time. Please note that this message IS IN ERROR, as the elapsed time only measures the time to launch SmartStor spooling, not actually complete it.
8.
SmartStor spooling kicked off by 7c above, does the following:
a.
Examines all the Metrics, extracting the Agent names for them, and checks them against its internal database, assigning each a new persistent storage tracking id if they haven’t been seen before.
b.
Places any unknown agents in the “when they were alive” map in metadata, if there have been any changes (new agents, or old ones timing out).
c.
Examines each metric in turn to see if it is known (i.e. in the Metadata), and assigning it a persistent storage tracking id if it’s not.
d.
Updates the map of which metrics are being reported by which agents, if there have been any changes.
e.
If there have been any changes to Metrics (usually additions), moves the metadata file to a backup, and writes a new metadata file. Before it does so, it waits to lock access to the metric metadata. The lock wait and write (but not the file moves beforehand) are all rolled up into the metric: “Data Store|SmartStor|MetaData:Write Duration (ms)”. This is only updated when metadata has been added.
f.
If com.wily.introscope.smartstor.disablespool is set TRUE, then skips steps g, h, i, and j.
g.
Registers MetricDataAgents as live (note: it’s likely a longstanding bug that “f” may skip this when true, which it almost never is).
h.
Performs a multithreaded sort of the integer Metric Data of this timeslice, followed by a Multithreaded sort of the Long data. The time taken to do this is tracked independently as “Data Store|SmartStor|MetaData:Sort Duration (ms)”
i.
Switches spool files, if necessary. Then writes out the metric data. The time taken to do this is not independently tracked.
j.
Launches any background tiering conversion tasks
The aggregate time for a through j is calculated, and is displayed as “Tasks:Smartstor Duration (ms)”.

TRIAGE AND DIAGNOSIS
The following kinds of conditions can occur, and this is what their symptoms would be:

Slow Disk – A classic over-resourced disk is going to be discernable by outside tools (google: “determine real disk speed” for a start. It is characterized by a large Tasks:Smartstor Duration (ms), with small MetaData:Sort Durations, and potentially long – BUT INTERMITTENT - MetaData:Write Durations.
Metric Explosions – The number of metrics that an EM tracks is well known, so you know if you have a leak or an explosion. To see if this is causing a problem in the harvest cycle though, look at the MetaData:Write duration, and see if that is pushing your Smartstor Duration too high.
Bad Calculators – The new “Internal:Calculator Harvest Time (ms)” will tell you if too much time is being used to run calculators. If this is out of whack compared to the number of calculators being run, it’s likely that some of those calculators are doing bad things, like issuing a “.*” query or some such.
Bad Queries – If the Harvest Duration is inexplicably big, you can turn on log4j.logger.Clock=DEBUG, to see if the “Time to harvest agents” seems too big. It is almost certain that if it is, the time to harvest agents isn’t the problem – rather reporting some overly large data in Step 4 is making Step 2a sleep for completion. And that is what too many persistent “.*” queries will look like.
Bad Alerts – Alerts can also be determined to be bad by turning on the same logging mechanism, and it usually means, again, an overly broad associated query. Either that, or the client’s AppMap may have gotten bigger than their network can handle.
Synchronicity - Because the SmartStor save happens asynchronously, “Tasks:Smartstor Duration (ms)” is NOT directly added into “Tasks:Harvest Duration (ms)”. In fact, while most people think that SmartStor is what causes the Harvest to go bad, it’s often the reverse. If SmartStor is delayed (due to a slow disk), that doesn’t affect anything else in the system. But if the Harvest Cycle is delayed, that slows down SmartStor like everything else. SmartStor duration also has locking issues – a huge query can lock down the metadata, delaying the SmartStor metadata save, even if the disk is fine otherwise (this issue will be fixed in 9.2).
Are you running SOA Performance Management on the MOM and Collectors?
Try to disable the SOA deviation calculators in your Enterprise Managers and check if the problems persist.
This can be done by setting the following parameter in <EM-Home>/config/IntroscopeEnterpriseManager.properties

com.wily.introscope.soa.deviation.enable=false
The CA Support Ticket I opened gave me the resoultion to the problem. It is a bug with the 9.1.x agent. It looks like every hour Introscope calculates the SOA Deviation, which causes higher than normal spikes in Harvest Times. We implemented the resolution and Harvest Times are now less than a second. The resolution from CA is below.

There is a known issue where a Harvest Duration spike is seen at hourly
intervals. This is fixed in the next major release (9.5.0.0) but for now,
in 9.1.4.0 & 9.1.5.0, please make the following changes, restart the EM's (MOM and
all collectors). The spike should be gone.

Solution:

Disable the SOA Deviation Calculation, which causes hourly harvest duration
on collectors.

The properties below need to be disabled in ALL collectors and the MOM, go
to <EM_HOME>/config/IntroscopeEnterpriseManager.properties and change the 5
entries listed below

to = false

com.wily.introscope.soa.deviation.enable=false
com.wily.introscope.soa.deviation.art.enable=false
com.wily.introscope.soa.deviation.dependencymetric.enable=false
com.wily.introscope.soa.deviation.errors.enable=false
com.wily.introscope.soa.deviation.usage.enable=false

For this change to take effect, a restart of EM is required.

JW

Este documento foi gerado a partir da seguinte discussão:Collector(s) High Harvest Duration

Attachments

    Outcomes