TEC610521: A Guide to Solving Common Stats and Defects Aggregation Problems

Document created by Hallett_German Employee on May 25, 2016Last modified by J.J. Lovett on May 25, 2016
Version 2Show Document
  • View in full screen mode

Document ID:  TEC610521
Last Modified Date:  11/30/2015

Author: Hallett_German

 

  • Products
    • CA Application Performance Management
  • Releases
    • CA Application Performance Management:Release:9.6
  • Components
    • WILY CEM
    • CUSTOMER EXPERIENCE MANAGER

 

Introduction

This tech note is an expansion of Stats Aggregation Issue after Upgrading from 9.0.x to 9.1.0.0, by Michal Lehotsky. The original document discusses ways to detect if statistical

aggregation is not working and how to correct it. This update adds the following new topics:

 

  • Definition of key terms
  • Common stats aggregation issues (bad rows, daily aggregation failure, others)
  • Common defect aggregation issues
  • Oracle update SQL statements
  • SQL statements for specific dates
  • Useful logs
  • References
  • Other things to check

 

Instructions:

 

Guiding Principles  

·       Unless otherwise stated, the goal is always to preserve unprocessed defect and statistical records. It should be mentioned that using this approach may take more time than starting with a fresh, empty database.  

·       Defect and statistical aggregation techniques are always changing. New releases may require new strategies. Updates to this document will be released as needed.

Glossary

o   Base Tables – These are the two tables (ts_st_ts_us_int, ts_st_tu_us_int) which include the raw stats data from stats files. Stats Aggregation is basically the process where data from these base tables are distributed into other stats tables (aggregated tables).

·       Base Table Partitions Partitions of base tables. The name includes the base table name and date (for example TS_ST_TS_US_INT_20140522)

·       LAR – LAR is Last Aggregated Row. There is one column in each aggregated stats table showing when the last successful aggregation was done. The next aggregation will start going forward from this date.

·       Aggregated tables - Aggregated tables are all stats tables except the base tables

·       Why do stats tables have so many partitions in APM 9.1 and in later releases? - Stats tables were partitioned for manageability and performance reasons. For example, querying a day’s partition table is much more flexible and faster than an entire table with one year of data.

·       Parent table – Parent tables are split into partitions. Partitions share table structure from parent table and have date in its name. (e.g. TS_ST_TS_US_INT_20140522). Parent tables do not have date in their name  (e.g. TS_ST_TS_US_INT)

·       How long does Stats and Defect Aggregation typically take? – Stats Aggregation in 9.1.x and later should be about 5-10 times faster than was in 9.0. This is due to partitioning stats tables and stats processing improvements. Hourly stats aggregation should take less than 30 minutes. Daily Stats aggregation should take less than eight hours. Defect aggregation should take less than ten minutes.

 

Out of scope   This document does not cover the following:

  • A comprehensive tutorial on defect and statistical aggregation.
  • A detailed explanation of Oracle table partitioning.
  • Overall defect/statistical aggregation workflow.
  • A complete list of the database tables involved in aggregation.
  • Determining database health. (See the Database maintenance tech note for more on that topic.)
  • Database maintenance and optimization.
  • Details of table partitioning.
  • Details of database table and partition naming standards.
  • A comprehensive list of all known statistical and defect aggregation issues. Only the most typical scenarios are covered.
  • Data retention settings and the impact of these settings on reports.

 

Common Symptoms of Defect/Statistical Aggregation Failures

The following symptoms may be indications of defect or statistical aggregation failure:

·       Not seeing any defects in Analysis Graphs.

·       Seeing APM CE Reports for today but not for yesterday.

·       Seeing APM CE Reports for a custom hour but not for a custom day.

·       Unable to view older APM CE Reports.

·       Unable to see selected APM CE Reports. Note that the following symptoms may also be observed:

·       Defects are not current

·       Not seeing incidents

Neither of these last two problems is related to aggregation. Troubleshoot by checking the TIM and TIM collector for configuration or performance issues.

 

Initial Steps for Defect and Statistical Aggregation Diagnosis

Quickly determine the scope of the issue.  Is it:

·       Defects aggregation only?

·       Statistical aggregation only?

·       Defects and statistical aggregation only?

·       All of the above plus other problems?

 

Logs and configuration files for aggregation issues

The following files are useful when looking at stats and defects aggregation:

FileLocationWhy Useful
Configuration
CEMHibernate.properties<EM_HOME>/config/c3p0 Settings for Data Source
CEMHibernate-dailyAggregation.properties<EM_HOME>/config/c3p0 Settings for Stats Aggregation
CEMHibernate-defectAggregation.properties<EM_HOME>/config/c3p0 Settings for Defects Aggregation
tess-db-cfg.xml<EM_HOME>/config/Database settings especially useful if the database is inaccessible (impacting aggregation writes and reads.)
tess-default.propertiestess-customer-properties<EM_HOME>/config/If running aggregation in a separate JVM, check JVM Settings. Defects and stats aggregation settings are also configured here.
Logging
IntroscopeEnterpriseManager.log<EM_HOME>/logs/Shows Stats and Defect Aggregation processing and the associated services starting/stopping.

 

Other things to review before looking at defects or statistical aggregation

Note that other factors may result in missing defects or reports. These include but are not limited to the following questions:

·       Is there any network traffic at all? The TIM connection to the switch (such as a span/tap or aggregator) may not be seeing traffic at all, traffic from specific servers, or only traffic from one direction.

·       Is the TIM collector not doing its job?The TIM is seeing traffic but defects and statistics are backing up on the TIM. SSH into the TIM and go to /etc/wily/cem/tim/data/out/*** where *** is the stats or defects directory. Typically, a restart of the TIM collector or disabling and re-enabling the TIM object from the Monitors tab in the TESS-MOM UI will resolve the issue.

·       Is there a security issue? The UI or authorization mechanism (such as local, EEM, or LDAP) may affect the ability to display the report screen.

·       Are there version-specific issues that are resolved by a later update?  If you are not at the current version, check the Release Notes for all later versions.  The issue may already be fixed.

 

Common stats aggregation issues

  Before Getting Started:

One important step before running any SQL update query is to determine which stat tables exist and whether they are populated. This can be done by one or two ways:

·       Visually checking through a database Administration tool such as pgadmin for Postgres and SQL Developer for Oracle.

·       Running a SQL query such as the following:

-        Postgres -- select tablename from pg_tables order by tablename;"

-        Oracle -- SELECT PARTITION_NAME FROM USER_TAB_PARTITIONS ORDER BY PARTITION_NAME;  

Based on the results, you can determine the extent of the issue and possible values for updating LAR (Last Aggregation Row).

 

Issue #1 hourly aggregation is not working  

Symptoms:

The following message appears in the TIM collector log:

[WARN] [DailyAggregation.Thread1] [Manager.com.timestock.tess.services.processors.StatsAggregator] Last hourly aggregation did not complete after 14400 seconds. Daily aggregation will retry tomorrow

Solution:

This issue is resolved in 9.5.3 and 9.6.x. Follow the steps in Issue #2 to update the

Last Aggregated Row for hourly (int or interval) tables.

 

Issue #2 daily aggregation is not working

Symptoms:

User-visible behavior: Report data for today is available, but not yesterday or any other recent day. Log-visible behavior:              The stats aggregation collector has this error message during stats aggregation:

org.postgresql.util.PSQLException: ERROR: relation "ts_st_ts_us_int_xxxxxxxx" does not exist OR The last aggregation row in the ts_*_*_* tables (e.g.  ts_st_ts_us_dly) is stuck on a specific date and does not update later dates.  

Troubleshooting:

To determine the date of the last aggregated row for daily aggregation, run this SQL query from the Postgres PGAdmin utility or the Oracle Administration tool:

select max(ts_last_aggregated_row) from ts_st_ts_all_dly;

select max(ts_last_aggregated_row) from ts_st_ts_all_int;

select max(ts_last_aggregated_row) from ts_st_ts_all_mly;

select max(ts_last_aggregated_row) from ts_st_ts_all_wly;

select max(ts_last_aggregated_row) from ts_st_ts_us_dly;

select max(ts_last_aggregated_row) from ts_st_ts_us_mly;

select max(ts_last_aggregated_row) from ts_st_ts_us_wly;

select max(ts_last_aggregated_row) from ts_st_ts_usgrp_dly;

select max(ts_last_aggregated_row) from ts_st_ts_usgrp_int;

select max(ts_last_aggregated_row) from ts_st_ts_usgrp_mly;

select max(ts_last_aggregated_row) from ts_st_ts_usgrp_wly;

select max(ts_last_aggregated_row) from ts_st_tsgrp_all_dly;

select max(ts_last_aggregated_row) from ts_st_tsgrp_all_int;

select max(ts_last_aggregated_row) from ts_st_tsgrp_all_mly;

select max(ts_last_aggregated_row) from ts_st_tsgrp_all_wly;

select max(ts_last_aggregated_row) from ts_st_tsgrp_us_dly;

select max(ts_last_aggregated_row) from ts_st_tsgrp_us_int;

select max(ts_last_aggregated_row) from ts_st_tsgrp_us_mly;

select max(ts_last_aggregated_row) from ts_st_tsgrp_us_wly;

select max(ts_last_aggregated_row) from ts_st_tsgrp_usgrp_dly;

select max(ts_last_aggregated_row) from ts_st_tsgrp_usgrp_int;

select max(ts_last_aggregated_row) from ts_st_tsgrp_usgrp_mly;

select max(ts_last_aggregated_row) from ts_st_tsgrp_usgrp_wly;

select max(ts_last_aggregated_row) from ts_st_tu_all_dly;

select max(ts_last_aggregated_row) from ts_st_tu_all_int;

select max(ts_last_aggregated_row) from ts_st_tu_all_mly;

select max(ts_last_aggregated_row) from ts_st_tu_all_wly;

select max(ts_last_aggregated_row) from ts_st_tu_us_dly;

select max(ts_last_aggregated_row) from ts_st_tu_us_mly;

select max(ts_last_aggregated_row) from ts_st_tu_us_wly;

select max(ts_last_aggregated_row) from ts_st_tu_usgrp_dly;

select max(ts_last_aggregated_row) from ts_st_tu_usgrp_int;

select max(ts_last_aggregated_row) from ts_st_tu_usgrp_mly;

select max(ts_last_aggregated_row) from ts_st_tu_usgrp_wly;

This will either return a date or no results. If running the select max statement on the daily (_dly) table returns nothing, it means that either all the data is still in the base tables or daily aggregation has never successfully completed. Looking at the oldest partition of the base table(s) will allow you to determine which date to use for the update command. By doing this, the queries are checking if all LARs in the stats tables are the same or afterthe date of oldest base table partition. For example, if oldest base table partition is from May 15th (TS_ST_TS_US_INT_20140515), then verify that all LARs are set to May 15th or later. For those tables where this is not the case, proceed to update the LARs            

 

There are 3 rules to follow while updating:

1)               Update LARs to date of oldest base table partition (to save as much data as possible.)

2)               Update LARs ONLY in tables where LARs are older than oldest base table (to avoid duplicate data.)

3)               Move LARs forward ONLY (to avoid duplicate data and introducing other issues.)

 

Causes:

After an upgrade or stats aggregation, the ts_st_ts_us_int_* tables are not created or do not contain any statistical data.  This can be indicative of a pre-existing condition which occurred prior to the upgrade.  It may happen in two situations:

1)     A database being upgraded was restored from a backup more than 1 day old. For example, if the database backup was taken on the 4th and the restore and upgrade were done on the 7th, there will be no statistics for the 5th and 6th.  Tables for those days will not be created and will cause the problem.

2)     No stats data is saved for one or more days prior to upgrade. For example, if the last statistics were recorded on the 4th and no stats data were inserted into the database for the 5th and 6th. If the database is later upgraded on the 7th, there will be no stats data for the 5th and 6th and the issue may occur because no tables were created for those days.

 

Solution:

Step 1: Determine the proper Last Aggregated Row (LAR) date.

Do this by running the query above under Troubleshooting.  Check the date of the oldest partitions of the base table by using analysis graphs to determine the last date (i.e. last day/hour) containing data. As listed below, this step must be done correctly else there will be duplicate or overlapping data. If you are not comfortable doing these steps, then open a support case for assistance.  

Additional Guidance:

1.     The date selected must be later than the date shown in the error message. This will allow the aggregation to skip dates with missing tables. For example, if last aggregated row is stuck on the 5th and tables exist only from the 7th forward, then set the last aggregated row to the 7th or later. The safest choice to get a system up fast is setting it to 11:00pm of the previous day using the queries below. However, this may not always be the most appropriate choice as discussed below.

2.     If retaining the most data is the first concern, then be conservative on the date selection. A more recent date will not backfill any missing gaps from the past.

3.     Important note: Do keep in mind that data will not be aggregated between the oldproperties (when aggregation is stuck) and the new date aggregation is set to. The wrong date selected could result in lost data or double sets of data.

4.     The logs may have “false positive” error messages when it tries to catch up because it legitimately cannot find any data for a particular date.

Specify dates in the formats shown in the table below:

UPDATE ts_st_ts_all_dly SET   ts_last_aggregated_row =<place date format here>

Updated Date to setDate FormatDatabase Type Comments
11P.M. the night before.DATE_TRUNC('day', now()) - INTERVAL '1 hour'Postgres
11P.M. the night before.ROUND(current_timestamp-1/2, 'DD') - 1/24Oracle
Other Cases
11 P.M. on 5-16-2013 (Specific Date Example)DATE_TRUNC('day', timestamp '2013-05-16 23:00:00')Postgres
11 P.M. on 5-16-2013 (Specific Date Example)'2013-05-16 23:00:00’OracleUse this query to see Oracle default date format: select * from nls_session_parameters;
5-16-2013 (Specific Date Example) '24-DEC-13'Oracle

 

Step 2: Using PGAdmin, Oracle Admin Tool, or an equivalent, run the UPDATE command using the date selected in step 1.

Two examples are provided to show you the different date update format

The first case is using 11 P.M. the day before, so  the Postgres and Oracle commands would be:

Postgres UPDATE ts_st_ts_all_dly SET ts_last_aggregated_row = DATE_TRUNC('day', now()) - INTERVAL '1 hour';

UPDATE ts_st_ts_all_int SET ts_last_aggregated_row = DATE_TRUNC('day', now()) - INTERVAL '1 hour';

UPDATE ts_st_ts_all_mly SET ts_last_aggregated_row = DATE_TRUNC('day', now()) - INTERVAL '1 hour';

UPDATE ts_st_ts_all_wly SET ts_last_aggregated_row = DATE_TRUNC('day', now()) - INTERVAL '1 hour';

UPDATE ts_st_ts_us_dly SET ts_last_aggregated_row = DATE_TRUNC('day', now()) - INTERVAL '1 hour';

UPDATE ts_st_ts_us_mly SET ts_last_aggregated_row = DATE_TRUNC('day', now()) - INTERVAL '1 hour';

UPDATE ts_st_ts_us_wly SET ts_last_aggregated_row = DATE_TRUNC('day', now()) - INTERVAL '1 hour';

UPDATE ts_st_ts_usgrp_dly SET ts_last_aggregated_row = DATE_TRUNC('day', now()) - INTERVAL '1 hour';

UPDATE ts_st_ts_usgrp_int SET ts_last_aggregated_row = DATE_TRUNC('day', now()) - INTERVAL '1 hour';

UPDATE ts_st_ts_usgrp_mly SET ts_last_aggregated_row = DATE_TRUNC('day', now()) - INTERVAL '1 hour';

UPDATE ts_st_ts_usgrp_wly SET ts_last_aggregated_row = DATE_TRUNC('day', now()) - INTERVAL '1 hour';

UPDATE ts_st_tsgrp_all_dly SET ts_last_aggregated_row = DATE_TRUNC('day', now()) - INTERVAL '1 hour';

UPDATE ts_st_tsgrp_all_int SET ts_last_aggregated_row = DATE_TRUNC('day', now()) - INTERVAL '1 hour';

UPDATE ts_st_tsgrp_all_mly SET ts_last_aggregated_row = DATE_TRUNC('day', now()) - INTERVAL '1 hour';

UPDATE ts_st_tsgrp_all_wly SET ts_last_aggregated_row = DATE_TRUNC('day', now()) - INTERVAL '1 hour';

UPDATE ts_st_tsgrp_us_dly SET ts_last_aggregated_row = DATE_TRUNC('day', now()) - INTERVAL '1 hour';

UPDATE ts_st_tsgrp_us_int SET ts_last_aggregated_row = DATE_TRUNC('day', now()) - INTERVAL '1 hour';

UPDATE ts_st_tsgrp_us_mly SET ts_last_aggregated_row = DATE_TRUNC('day', now()) - INTERVAL '1 hour';

UPDATE ts_st_tsgrp_us_wly SET ts_last_aggregated_row = DATE_TRUNC('day', now()) - INTERVAL '1 hour';

UPDATE ts_st_tsgrp_usgrp_dly SET ts_last_aggregated_row = DATE_TRUNC('day', now()) - INTERVAL '1 hour';

UPDATE ts_st_tsgrp_usgrp_int SET ts_last_aggregated_row = DATE_TRUNC('day', now()) - INTERVAL '1 hour';

UPDATE ts_st_tsgrp_usgrp_mly SET ts_last_aggregated_row = DATE_TRUNC('day', now()) - INTERVAL '1 hour';

UPDATE ts_st_tsgrp_usgrp_wly SET ts_last_aggregated_row = DATE_TRUNC('day', now()) - INTERVAL '1 hour';

UPDATE ts_st_tu_all_dly SET ts_last_aggregated_row = DATE_TRUNC('day', now()) - INTERVAL '1 hour';

UPDATE ts_st_tu_all_int SET ts_last_aggregated_row = DATE_TRUNC('day', now()) - INTERVAL '1 hour';

UPDATE ts_st_tu_all_mly SET ts_last_aggregated_row = DATE_TRUNC('day', now()) - INTERVAL '1 hour';

UPDATE ts_st_tu_all_wly SET ts_last_aggregated_row = DATE_TRUNC('day', now()) - INTERVAL '1 hour';

UPDATE ts_st_tu_us_dly SET ts_last_aggregated_row = DATE_TRUNC('day', now()) - INTERVAL '1 hour';

UPDATE ts_st_tu_us_mly SET ts_last_aggregated_row = DATE_TRUNC('day', now()) - INTERVAL '1 hour';

UPDATE ts_st_tu_us_wly SET ts_last_aggregated_row = DATE_TRUNC('day', now()) - INTERVAL '1 hour';

UPDATE ts_st_tu_usgrp_dly SET ts_last_aggregated_row = DATE_TRUNC('day', now()) - INTERVAL '1 hour';

UPDATE ts_st_tu_usgrp_int SET ts_last_aggregated_row = DATE_TRUNC('day', now()) - INTERVAL '1 hour';

UPDATE ts_st_tu_usgrp_mly SET ts_last_aggregated_row = DATE_TRUNC('day', now()) - INTERVAL '1 hour';

UPDATE ts_st_tu_usgrp_wly SET ts_last_aggregated_row = DATE_TRUNC('day', now()) - INTERVAL '1 hour';  

Note: - INTERVAL 'X hour'” is time from last midnight (in this case above it is 11pm on previous day)

Oracle

UPDATE ts_st_ts_all_dly SET ts_last_aggregated_row = ROUND(current_timestamp-1/2, 'DD') - 1/24;

UPDATE ts_st_ts_all_int SET ts_last_aggregated_row = ROUND(current_timestamp-1/2, 'DD') - 1/24;

UPDATE ts_st_ts_all_mly SET ts_last_aggregated_row = ROUND(current_timestamp-1/2, 'DD') - 1/24;

UPDATE ts_st_ts_all_wly SET ts_last_aggregated_row = ROUND(current_timestamp-1/2, 'DD') - 1/24;

UPDATE ts_st_ts_us_dly SET ts_last_aggregated_row = ROUND(current_timestamp-1/2, 'DD') - 1/24;

UPDATE ts_st_ts_us_mly SET ts_last_aggregated_row = ROUND(current_timestamp-1/2, 'DD') - 1/24;

UPDATE ts_st_ts_us_wly SET ts_last_aggregated_row = ROUND(current_timestamp-1/2, 'DD') - 1/24;

UPDATE ts_st_ts_usgrp_dly SET ts_last_aggregated_row = ROUND(current_timestamp-1/2, 'DD') - 1/24;

UPDATE ts_st_ts_usgrp_int SET ts_last_aggregated_row = ROUND(current_timestamp-1/2, 'DD') - 1/24;

UPDATE ts_st_ts_usgrp_mly SET ts_last_aggregated_row = ROUND(current_timestamp-1/2, 'DD') - 1/24;

UPDATE ts_st_ts_usgrp_wly SET ts_last_aggregated_row = ROUND(current_timestamp-1/2, 'DD') - 1/24;

UPDATE ts_st_tsgrp_all_dly SET ts_last_aggregated_row = ROUND(current_timestamp-1/2, 'DD') - 1/24;

UPDATE ts_st_tsgrp_all_int SET ts_last_aggregated_row = ROUND(current_timestamp-1/2, 'DD') - 1/24;

UPDATE ts_st_tsgrp_all_mly SET ts_last_aggregated_row = ROUND(current_timestamp-1/2, 'DD') - 1/24;

UPDATE ts_st_tsgrp_all_wly SET ts_last_aggregated_row = ROUND(current_timestamp-1/2, 'DD') - 1/24;

UPDATE ts_st_tsgrp_us_dly SET ts_last_aggregated_row = ROUND(current_timestamp-1/2, 'DD') - 1/24;

UPDATE ts_st_tsgrp_us_int SET ts_last_aggregated_row = ROUND(current_timestamp-1/2, 'DD') - 1/24;

UPDATE ts_st_tsgrp_us_mly SET ts_last_aggregated_row = ROUND(current_timestamp-1/2, 'DD') - 1/24;

UPDATE ts_st_tsgrp_us_wly SET ts_last_aggregated_row = ROUND(current_timestamp-1/2, 'DD') - 1/24;

UPDATE ts_st_tsgrp_usgrp_dly SET ts_last_aggregated_row = ROUND(current_timestamp-1/2, 'DD') - 1/24;

UPDATE ts_st_tsgrp_usgrp_int SET ts_last_aggregated_row = ROUND(current_timestamp-1/2, 'DD') - 1/24;

UPDATE ts_st_tsgrp_usgrp_mly SET ts_last_aggregated_row = ROUND(current_timestamp-1/2, 'DD') - 1/24;

UPDATE ts_st_tsgrp_usgrp_wly SET ts_last_aggregated_row = ROUND(current_timestamp-1/2, 'DD') - 1/24;

UPDATE ts_st_tu_all_dly SET ts_last_aggregated_row = ROUND(current_timestamp-1/2, 'DD') - 1/24;

UPDATE ts_st_tu_all_int SET ts_last_aggregated_row = ROUND(current_timestamp-1/2, 'DD') - 1/24;

UPDATE ts_st_tu_all_mly SET ts_last_aggregated_row = ROUND(current_timestamp-1/2, 'DD') - 1/24;

UPDATE ts_st_tu_all_wly SET ts_last_aggregated_row = ROUND(current_timestamp-1/2, 'DD') - 1/24;

UPDATE ts_st_tu_us_dly SET ts_last_aggregated_row = ROUND(current_timestamp-1/2, 'DD') - 1/24;

UPDATE ts_st_tu_us_mly SET ts_last_aggregated_row = ROUND(current_timestamp-1/2, 'DD') - 1/24;

UPDATE ts_st_tu_us_wly SET ts_last_aggregated_row = ROUND(current_timestamp-1/2, 'DD') - 1/24;

UPDATE ts_st_tu_usgrp_dly SET ts_last_aggregated_row = ROUND(current_timestamp-1/2, 'DD') - 1/24;

UPDATE ts_st_tu_usgrp_int SET ts_last_aggregated_row = ROUND(current_timestamp-1/2, 'DD') - 1/24;

UPDATE ts_st_tu_usgrp_mly SET ts_last_aggregated_row = ROUND(current_timestamp-1/2, 'DD') - 1/24;

UPDATE ts_st_tu_usgrp_wly SET ts_last_aggregated_row = ROUND(current_timestamp-1/2, 'DD') - 1/24;

 

Note: (current_timestamp-1/2, 'DD') to round up to last midnight - 1/24 specifies minus one hour.             In this second case, a specific date is set.

 

Postgres

UPDATE ts_st_ts_all_dly SET ts_last_aggregated_row = '2013-12-06';

UPDATE ts_st_ts_all_int SET ts_last_aggregated_row = '2013-12-06';

UPDATE ts_st_ts_all_mly SET ts_last_aggregated_row = '2013-12-06';

UPDATE ts_st_ts_all_wly SET ts_last_aggregated_row = '2013-12-06';

UPDATE ts_st_ts_us_dly SET ts_last_aggregated_row = '2013-12-06';

UPDATE ts_st_ts_us_mly SET ts_last_aggregated_row = '2013-12-06';

UPDATE ts_st_ts_us_wly SET ts_last_aggregated_row = '2013-12-06';

UPDATE ts_st_ts_usgrp_dly SET ts_last_aggregated_row = '2013-12-06';

UPDATE ts_st_ts_usgrp_int SET ts_last_aggregated_row = '2013-12-06';

UPDATE ts_st_ts_usgrp_mly SET ts_last_aggregated_row = '2013-12-06';

UPDATE ts_st_ts_usgrp_wly SET ts_last_aggregated_row = '2013-12-06';

UPDATE ts_st_tsgrp_all_dly SET ts_last_aggregated_row = '2013-12-06';

UPDATE ts_st_tsgrp_all_int SET ts_last_aggregated_row = '2013-12-06';

UPDATE ts_st_tsgrp_all_mly SET ts_last_aggregated_row = '2013-12-06';

UPDATE ts_st_tsgrp_all_wly SET ts_last_aggregated_row = '2013-12-06';

UPDATE ts_st_tsgrp_us_dly SET ts_last_aggregated_row = '2013-12-06';

UPDATE ts_st_tsgrp_us_int SET ts_last_aggregated_row = '2013-12-06';

UPDATE ts_st_tsgrp_us_mly SET ts_last_aggregated_row = '2013-12-06';

UPDATE ts_st_tsgrp_us_wly SET ts_last_aggregated_row = '2013-12-06';

UPDATE ts_st_tsgrp_usgrp_dly SET ts_last_aggregated_row = '2013-12-06';

UPDATE ts_st_tsgrp_usgrp_int SET ts_last_aggregated_row = '2013-12-06';

UPDATE ts_st_tsgrp_usgrp_mly SET ts_last_aggregated_row = '2013-12-06';

UPDATE ts_st_tsgrp_usgrp_wly SET ts_last_aggregated_row = '2013-12-06';

UPDATE ts_st_tu_all_dly SET ts_last_aggregated_row = '2013-12-06';

UPDATE ts_st_tu_all_int SET ts_last_aggregated_row = '2013-12-06';

UPDATE ts_st_tu_all_mly SET ts_last_aggregated_row = '2013-12-06';

UPDATE ts_st_tu_all_wly SET ts_last_aggregated_row = '2013-12-06';

UPDATE ts_st_tu_us_dly SET ts_last_aggregated_row = '2013-12-06';

UPDATE ts_st_tu_us_mly SET ts_last_aggregated_row = '2013-12-06';

UPDATE ts_st_tu_us_wly SET ts_last_aggregated_row = '2013-12-06';

UPDATE ts_st_tu_usgrp_dly SET ts_last_aggregated_row = '2013-12-06';

UPDATE ts_st_tu_usgrp_int SET ts_last_aggregated_row = '2013-12-06';

UPDATE ts_st_tu_usgrp_mly SET ts_last_aggregated_row = '2013-12-06';

UPDATE ts_st_tu_usgrp_wly SET ts_last_aggregated_row = '2013-12-06';

 

Oracle

UPDATE ts_st_ts_all_dly SET ts_last_aggregated_row = '24-DEC-13';

UPDATE ts_st_ts_all_int SET ts_last_aggregated_row = '24-DEC-13';

UPDATE ts_st_ts_all_mly SET ts_last_aggregated_row = '24-DEC-13';

UPDATE ts_st_ts_all_wly SET ts_last_aggregated_row = '24-DEC-13';

UPDATE ts_st_ts_us_dly SET ts_last_aggregated_row = '24-DEC-13';

UPDATE ts_st_ts_us_mly SET ts_last_aggregated_row = '24-DEC-13';

UPDATE ts_st_ts_us_wly SET ts_last_aggregated_row = '24-DEC-13';

UPDATE ts_st_ts_usgrp_dly SET ts_last_aggregated_row = '24-DEC-13';

UPDATE ts_st_ts_usgrp_int SET ts_last_aggregated_row = '24-DEC-13';

UPDATE ts_st_ts_usgrp_mly SET ts_last_aggregated_row = '24-DEC-13';

UPDATE ts_st_ts_usgrp_wly SET ts_last_aggregated_row = '24-DEC-13';

UPDATE ts_st_tsgrp_all_dly SET ts_last_aggregated_row = '24-DEC-13';

UPDATE ts_st_tsgrp_all_int SET ts_last_aggregated_row = '24-DEC-13';

UPDATE ts_st_tsgrp_all_mly SET ts_last_aggregated_row = '24-DEC-13';

UPDATE ts_st_tsgrp_all_wly SET ts_last_aggregated_row = '24-DEC-13';

UPDATE ts_st_tsgrp_us_dly SET ts_last_aggregated_row = '24-DEC-13';

UPDATE ts_st_tsgrp_us_int SET ts_last_aggregated_row = '24-DEC-13';

UPDATE ts_st_tsgrp_us_mly SET ts_last_aggregated_row = '24-DEC-13';

UPDATE ts_st_tsgrp_us_wly SET ts_last_aggregated_row = '24-DEC-13';

UPDATE ts_st_tsgrp_usgrp_dly SET ts_last_aggregated_row = '24-DEC-13';

UPDATE ts_st_tsgrp_usgrp_int SET ts_last_aggregated_row = '24-DEC-13';

UPDATE ts_st_tsgrp_usgrp_mly SET ts_last_aggregated_row = '24-DEC-13';

UPDATE ts_st_tsgrp_usgrp_wly SET ts_last_aggregated_row = '24-DEC-13';

UPDATE ts_st_tu_all_dly SET ts_last_aggregated_row = '24-DEC-13';

UPDATE ts_st_tu_all_int SET ts_last_aggregated_row = '24-DEC-13';

UPDATE ts_st_tu_all_mly SET ts_last_aggregated_row = '24-DEC-13';

UPDATE ts_st_tu_all_wly SET ts_last_aggregated_row = '24-DEC-13';

UPDATE ts_st_tu_us_dly SET ts_last_aggregated_row = '24-DEC-13';

UPDATE ts_st_tu_us_mly SET ts_last_aggregated_row = '24-DEC-13';

UPDATE ts_st_tu_us_wly SET ts_last_aggregated_row = '24-DEC-13';

UPDATE ts_st_tu_usgrp_dly SET ts_last_aggregated_row = '24-DEC-13';

UPDATE ts_st_tu_usgrp_int SET ts_last_aggregated_row = '24-DEC-13';

UPDATE ts_st_tu_usgrp_mly SET ts_last_aggregated_row = '24-DEC-13';

UPDATE ts_st_tu_usgrp_wly SET ts_last_aggregated_row = '24-DEC-13';

 

Issue #3 Aggregation Running Slowly

 

Due to Cluster Performance

Reactive measures to minimize performance issues which are causing aggregations to run slowly:

·       Add additional TIMs.

·       Add more hardware (disk, CPU) to collectors.

·       Offload other services on separate servers (such as moving the APM Database or EEM running on the MOM).

·       Split processes across more collectors (such as database cleanup, statistical aggregation, RTTM, or Auto discovery).

·       Change JVM arguments for statistical and defects aggregation in tess-default.properties or tess-customer.properties (if the latter is used).

·       Reduce the number of transactions and defect types being monitored.

·       Reduce TIM trace logging.

·       Disable user identification on TESS-MOM for e-commerce applications.

·       Reduce traffic from the TIM monitoring port by tuning web server filters, MTP hardware filters, or basic network connection configurations.

·       Remove plug-ins, regex used in definitions, evidence collection, HTTP plugins, and custom processes running on the TESS-MOM server.

·       Verify that EM doesn’t keep too many TS_USER_SESSION_MAP partitions

Proactive measures include:

·       Defect/Stats aggregation proactive monitoring

·       Quarterly architecture review

·       Monthly review of TIM/TESS health through logs, errors, and Introscope EM health metrics

·       Implement a horizontal/vertical capacity scaling process. For example:

o   If I monitor x number of transactions, how many TIMs will I need to add?

o   If I add x number of identified users, how many TIMs will I need to add?

 

Issue #4  Bad Rows error for defects or stats daily aggregation Symptoms:

In releases 9.1.4 - 9.1.6, and 9.5, the following error may occur in the logs when defect or daily stats aggregation is about to start:

WARN] [DailyAggregation.Thread1] [Manager.com.timestock.tess.services.processors.StatsAggregator] Bad rows processed:

 

Solution: Determining the ThreadStack Size For most JVMs, setting the specified thread stack size to 192 (i.e. -XX:ThreadStackSize=192) for defects/stats aggregation is sufficient. However, there are JVMs which may need more than 256. These scenarios are discussed below.

Determining a ThreadStackSize for HP-UX:

An example of a JVM needing a larger stack size is HP-UX. It requires a minimum stack size for 32-bit of 400 (i.e. XX:ThreadStackSize=400). See the following for further details: https://h20392.www2.hp.com/portal/swdepot/displayProductInfo.do?productNumber=HPUXJAVAHOME So, when using custom JVMs (under memory limitation scenarios), it is strongly recommended to use the standalone simple java program. This can check what is the minimal/default stack size needed. The alternative is specifying a random value for defects/stats aggregation ThreadStackSize. This is part of the defects/stats separate JVM VM options. It may take time to determine the minimum value. This will require multiple unneeded EM restarts until finding the minimum value for the defects/stats JVM command separately by looking at logs.

Determining the default thread stack size in 32-bit:

Get a copy of TestThreadStackSize.java such as from https://gist.github.com/rednaxelafx/1066792 This is a simple java program that sleeps 5 minutes, so one can get then the default/query the JVM default/custom values using jinfo jdk utility.

# /opt/java7/jre/bin/java TestThreadStackSizeRunning/sleeping… Get the PID of TestThreadStackSize :# /opt/java7/bin/jps -lvm
21524 TestThreadStackSize
21525 sun.tools.jps.Jps -lvm -Dapplication.home=/opt/java7 -Xms8m
Verify the default thread stack size value from the running JVM:

# /opt/java7/bin/jinfo -flag ThreadStackSize 21524
-XX:ThreadStackSize=512

Determining the minimum required thread stack size in 32-bit:
# /opt/java7/jre/bin/java -XX:ThreadStackSize=192 TestThreadStackSize The stack size specified is too small, Specify at least 200kError: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
Run the program with the above recommended setting:

# /opt/java7/jre/bin/java -Xss200k TestThreadStackSize

Running/Sleeping...

# /opt/java7/bin/jps -lvm21637 sun.tools.jps.Jps -lvm -Dapplication.home=/opt/java7 -Xms8m
21631 TestThreadStackSize -Xss200k

# /opt/java7/bin/jinfo -flag ThreadStackSize 21631

-XX:ThreadStackSize=400

 

To determine the default thread stack size in 64-bit: (For HP-UX JVM, add "–d64" to VM args for 64-bit case):

# /opt/java7/jre/bin/java -d64 TestThreadStackSize

Running/sleeping….. Get the PID of TestThreadStackSize :

# /opt/java7/bin/jps -lvm21652 sun.tools.jps.Jps -lvm -Dapplication.home=/opt/java7 -Xms

Setting the ThreadStackSize

After determining the ThreadStackSize using the above approach, change its value in the dailystats or defects JVM argument found in the tess-default.properties or tess.customer.properties file:

# do daily stats aggregation in a separate JVM. Turn this off if: 1) using a 64 bit JVM with a very large heap

# or 2) processing low to moderate numbers of stats

dailystats.aggregateInSeparateJvm=true

dailystats.jvmArgs= -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC -XX:+UseBiasedLocking -XX:SurvivorRatio=8 -XX:TargetSurvivorRatio=90 -Dlog4j.configuration=log4j.properties -Xms256M -XX:ThreadStackSize=192

# do defect aggregation in a separate JVM. Turn this off if: 1) using a 64 bit JVM with a very large heap

# or 2) processing low to moderate numbers of stats

defects.aggregateInSeparateJvm=true

defects.jvmArgs= -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC -XX:+UseBiasedLocking -XX:SurvivorRatio=8 -XX:TargetSurvivorRatio=90 -Dlog4j.configuration=log4j.properties -Xms256M -XX:ThreadStackSize=192 for 32-bit JVMs: ·        Stop the EM ·        Edit the file <em-root-dir>/config/tess-default.properties or tess-customer.properties and modify dailystats.jvmArgs and/or defects.jvmArgs to increase the "-XX:ThreadStackSize" to 192. If this stack size is still not sufficient, increase it to 256. The default value of "-XX:ThreadStackSize" has been increased to 192 in release 9.1.7 and all future releases beginning with 9.5.1.

 

  • Start the EM

 

64-bit JVMs:

·        Stop the EM

·        Limiting the thread stack is not necessary because the heap on a 64-bit JVM can be many times what it can be on a 32-bit JVM.

Therefore, edit the file <em-root-dir>/config/tess-default.properties or tess-customer.properties and modify dailystats.jvmArgs and defects.jvmArgs to remove the "-XX:ThreadStackSize" parameter.

·        Start the EM

Note that there is a more specific error message in releases 9.5.1, 9.1.7 and later when defects and/or stats aggregation processes fail to start in a separate JVM. See TT 85188 for more details.

If the issue persists, set the EM log to DEBUG mode to see the specific reason for the failure.

 

Issue #5  TS_Dummy Error

Symptoms:

The statistics aggregation log contains the following error and the process fails to run:

[WARN] [DailyAggregation.Thread1] [Manager.com.timestock.tess.services.processors.StatsAggregator] can't
parse stats partition 'TS_ST_TS_ALL_DLY_DUMMY'
java.text.ParseException: Unparseable date: "LY_DUMMY"

Solution:

The TS_DUMMY table is a temporary table used only during upgrade. After upgrade has completed, TS_DUMMY tables should be dropped. In upgrades from APM 9.0.8.0 to 9.1.4.0, it was observed that the TS_DUMMY tables were not dropped as expected. This problem may also occur with other combinations of version upgrades.

See the install.log from the upgrade for more information.

To correct this problem, follow these steps:

1.     Look for all stats aggregation partitions ending with -DUMMY using a query like the following (which shows all daily partitions): SELECT * FROM user_tab_partitions WHERE PARTITION_NAME like 'TS_ST_TS_ALL_DLY_%';

2.     Issue a command like the following to drop DUMMY partitions: ALTER TABLE TS_ST_TS_ALL_DLY DROP PARTITION TS_ST_TS_ALL_DLY_DUMMY UPDATE GLOBAL INDEXES;

After taking these steps, statistics aggregation should run correctly if there are no other outstanding issues.

 

Issue #6  ORA-14074 Error: partition bound must collate higher than that of the last partition.

This error happens for Oracle APM Databases and may occur with some of the other hourly/daily aggregation errors listed above.

Symptoms:

The statistics aggregation log contains the following error and the process fails to run:

10/14/13 11:28:22.537 PM EST [ERROR] [Thread-1318][Manager.com.timestock.tess.services.processors.StatsAggregator] [main]
[com.timestock.tess.util.TessLogger] Execute failed for SQL DDL 'ALTER
TABLE ts_st_ts_all_dly ADD PARTITION ts_st_ts_all_dly_20121024 VALUES LESS
THAN (TIMESTAMP '2012-10-25 00:00:00 -2:00')'
java.sql.SQLException: ORA-14074: partition bound must collate higher than
that of the last partition

 

This generic error typically means the following:

-                  A new partition is attempted to be added or created. This can be

-                  between two partitions.

-                  a new partition typically at the beginning or middle of the table.

-                  a new partition value is less than a previously-created highest partition          value. (Which is the MAXVALUE.)

Run the query listed above to get more detail on this error. It lists the current APM database partitions: SELECT PARTITION_NAME FROM USER_TAB_PARTITIONS ORDER BY PARTITION_NAME;  

Solution: There are a variety of causes.

1.     This could be due to the Daylight Savings issue discussed in the March 2014 Advisory where there are overlapping session map partitions due to the time change. The Advisory covers next steps on how to clean the overlap.

2.     Out of tablespace. This is fixed in 9.16/9.5 and later. If the APM database runs out of tablespace, then not all partitions will be created which will cause the error. The solution is to increase Oracle tablespace for the APM database. (i.e Using ALTER TABLESPACE ADD DATAFILE command.)

 

  Common defects aggregation issues   Typically there are two common defect aggregation issues:

  1. Issues occur during defects aggregation.
  2. Issues occur during defect re-aggregation

  Issues occurring during defects aggregation. (Table Does Not Exist, Bad Rows, TCS not Running)  

 

Table does not exist in the TIM collector logs:

[ERROR] [DefectAggregation.Thread1] [Manager.com.timestock.tess.util.DbUtils] Select failed for ts_defects org.postgresql.util.PSQLException: ERROR: relation "ts_defects_20121105" does not exist

If you are getting this error, take the following steps

1.     Check to see if the ts_defecttable is completely empty. If so, it may indicate that no defects have been generated. Possible reasons for this include:

a.      The defect threshold is set too high

b.     the TIM is having performance or network issues  

2.     Check the latest value in the ts_occur_date field in the ts_defects_intervaltable using this query:  select max(ts_occur_date) from ts_defects_interval;

3.     If returning a null value, then no aggregation has occurred to date. To aggregate all defects in the ts_defectstable for all intervals, change the last record in the ts_defects_intervaltable to the desired ts_occur_date to specify performing aggregation beginning from this interval.

4.     Restart the Tim collector.

5.     Note: If, after doing this, some new defects are added to the ts_defectstable, it will now contain a mixture of defects which have already been aggregated and new ones which are yet to be aggregated. If so, re-check the ts_defects_intervaltable to verify the lastts_occur_date and start processing defects after this interval from the ts_defectstable. If needed, update the last aggregated row to the appropriate date.

6.     Restart the Tim collector

 

The TIM Collection Service (TCS) is not running: Symptoms:

Error in the TIM collector log

[ERROR] [ServicesCheck.Thread1] [Manager.com.timestock.tess.services.service.CheckServices] Could not start service 'TIM.Collection.Svc' on entity 161: the service did not stop

Solution:

·       Check TS_ENTITY table to make sure it includes all EMs

·       Check and update TS_LOCKED_BY_ENTITY column in TS_ENTITY_SERVICES table if needed

·       Restart the collector running the Tim Collection Service and monitor the TS_ENTITY_SERVICES table for correct entries. It may take a couple minutes for this value to update.

·       If the issue still exists, update the TS_LOCKED_BY_ENTITY column again and reassign the Tim Collection Service to a different collector without EM restart

A common error for releases APM 9.1.4 - 9.17 is the same Bad Rows error discussed in the previous section on stats aggregation. Follow the same steps as described above to resolve this issue  

Issues occur during defects re-aggregation

  • There will be an error message about a duplicate row during re-aggregation. If this occurs, simply delete the affected row from the ts_defects_reaggr_interval table.

 

  NOTE: Why have defect re-aggregations? If new stats/defects come to TESS-EM for already aggregated interval (such as after restoring the connection to TIM), APM needs to re-  aggregate that interval impacted by newly incoming stats/defects.

 

 

Related TTs/RTC Defects

 

Team Tracks/Defects associated with these issues:

 

  • ·       Defects Aggregation: 271182

 

  • ·       Daily Aggregation: 74487, 74739, 75792, 75684, 85188 (Bad Rows)

 

  • ·       Collation Error: 83073, 263639,229741

 

  Credits & Acknowledgements

 

  • Thanks to Michal Lehotsky for creating the original invaluable document and providing extensive comments.

 

  • Thanks to Raju Kanumuri for insightful comments and ThreadStackSize additions.

 

  • Thanks to Flora Parekh for feedback.

 

  • Special thanks to Mary Clark for editing and formatting this document. (Note: the last edit was mine.)

 

 

   Additional Information:

 

  Some of the above information was directly pulled from the following sources:

 

  • ·       APM Support, Separate JVM running Daily Stats Aggregation failed to start due to insufficient Stack Size (Bad rows processed

 

  • ·       APM Support, Why am I getting an error about a TS_DUMMY partition?

 

  • ·       Bashetti, Sandeep, Defects Aggregation and Incident Management

 

  • ·       Lehotsky, Michal, Stats aggregation issue after upgrading from 9.0.x to 9.1.0.0

 

  • ·       Lehotsky, Michal, Statistics in APM 9.1

 

  • ·       Tepper, Steve and Muhammad Arif, CEM Architecture Overview presentation, August, 2008

 

  • ·       Umansky, Vlad, TESS Overview: Tess and Database Schema Overview

 

  A good overview of partitioning is the following:

 

  • ·       Oracle® Database VLDB and Partitioning Guide11g2Partitioning Concepts

 

Search the Entire CA APM Knowledge Base

 

search-kb.jpg

Attachments

    Outcomes