Automic Workload Automation

Collecting AE performance metrics

1. Collecting AE performance metrics

Michael A. Lowry
Posted Jul 10, 2018 11:02 AM

We are interested in setting up a process to collect, at regular intervals, data relevant to the performance of the Automation Engine. Our hope is that historical performance data will help us identify patterns and trends in UC4 performance.

Here is what I have in mind.

Collect the following information every 5-10 minutes:
Server process utilization (B.01, B.10, and B.60 numbers)
WP/CP CPU & memory usage as reported by ps command
List of hung AE server processes, if any
Number & type of users/clients connected
Number of rows in the ITL table
Number of rows in the MQ tables

In addition, log the time, duration, and size of each UC4 DB unload/load and each XML export/import.
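
A minimal sketch of the ps-based part of the list above, written as a shell function that could be called from cron. The binary names ucsrvwp and ucsrvcp, the function name ae_ps_to_csv, and the CSV path are assumptions made for illustration; adjust them to the installation at hand.

```shell
#!/bin/sh
# Sketch: turn `ps` output into timestamped CSV rows for AE server processes.
# Assumptions (not from the thread): the WP/CP binaries are named ucsrvwp and
# ucsrvcp, and the output file path in the usage line is purely illustrative.

# $1: timestamp label
# stdin: lines of "pid pcpu rss comm"
# stdout: "timestamp,pid,comm,pcpu,rss_kb" for matching processes only
ae_ps_to_csv() {
  awk -v ts="$1" '$4 ~ /^ucsrv(wp|cp)$/ {
    printf "%s,%s,%s,%s,%s\n", ts, $1, $4, $2, $3
  }'
}

# Possible cron usage every 5-10 minutes (uncomment to run for real):
# ps -eo pid=,pcpu=,rss=,comm= | ae_ps_to_csv "$(date +%Y-%m-%dT%H:%M:%S)" >> /var/tmp/ae_ps_metrics.csv
```

Appending one row per process per run keeps the file trivially plottable later; the other items in the list (ITL/MQ row counts, connected users) would need SQL or AE script functions and are not covered here.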

Does anyone do this sort of thing? Do you have any tips or insights?
2. Re: Collecting AE performance metrics

FrankMuffke
Posted Jul 11, 2018 06:40 AM

Hi

We distinguish between monitoring and health checks.

Monitoring
is done continuously via external tools and staff, in combination with CALL API and some OS jobs.
We externally monitor all AE processes, the DB, agent processes, and Tomcat (for ECC),
plus a login check against some target systems, the DB, etc.

The check that all AE processes are writing to their log files is done by an OS job triggered externally via CALL API.
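
A rough sketch of what such a log-freshness OS job might look like. The function name stale_logs, the log directory, and the file name pattern are assumptions for illustration, not details from this post; `find -mmin` is available in GNU and BSD find but is not strictly POSIX.

```shell
#!/bin/sh
# Sketch: list AE log files that have NOT been written to recently.
# Assumptions: log location and naming pattern are hypothetical and
# must be adapted to the actual AE server installation.

# $1: log directory
# $2: maximum allowed age in minutes
# $3: file name pattern (shell glob, quoted by the caller)
# stdout: one stale file per line; empty output means all logs are fresh
stale_logs() {
  find "$1" -type f -name "$3" -mmin +"$2" | sort
}

# Possible usage inside the OS job (hypothetical path and pattern):
# stale=$(stale_logs /opt/automic/server/temp 10 '*srv_log_*_00.txt')
# [ -z "$stale" ] || { echo "Stale AE logs: $stale"; exit 1; }
```

A non-zero exit code makes the OS job end abnormally, which the external caller can then pick up as an alert.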

Health check
A quick check of the system (is my UC4 system fine at the moment?), run automatically twice a day (morning and evening)
and manually as needed.

It covers: all AE processes are up and running; all agents are running and connected; all SMGR processes (on the AE server) are running;
all AE processes have written to their logs within the last XY minutes; all PROD clients are in state "GO"; a latency check; filesystem usage of
the AE servers; users connected; number of entries in MQPWP and MQWP; number of activities per client; and average workload of the PWP process.

The latency check (Idea stolen from Carsten_Schmitz) measures the time between activation and start of a couple of dummy jobs.
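
The arithmetic behind such a latency check could be sketched like this. It assumes that something else (a SQL query, the job report, or similar) has already produced one line per dummy job with its activation and start time as epoch seconds; the function name latency_report and the input format are hypothetical.

```shell
#!/bin/sh
# Sketch: report start latency for heartbeat/dummy jobs.
# Assumption: input lines look like "<job name> <activation epoch> <start epoch>";
# how those timestamps are obtained is outside the scope of this sketch.

# $1: latency limit in seconds
# stdout: "<job> <latency in seconds> OK|SLOW" per input line
latency_report() {
  awk -v limit="$1" '
    {
      latency = $3 - $2
      status = (latency > limit) ? "SLOW" : "OK"
      printf "%s %d %s\n", $1, latency, status
    }'
}
```

For example, `printf 'JOBS.HB.1 1000 1002\n' | latency_report 30` prints `JOBS.HB.1 2 OK`.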

cheers, Wolfgang

3. Re: Collecting AE performance metrics

Carsten Schmitz

Posted Jul 11, 2018 07:41 AM

Hi.

We have several things in this area. Off the top of my head:

process monitoring with Nagios (NOT for hung processes, just checking the way "ps" would)
"heartbeat" dummy JOBF or JOBS that get executed, with the result monitored by Nagios, thus verifying the actual function of the engine and each individual agent (as FrankMuffke aka Wolfgang Brückler already mentioned)
I have a bunch of cron jobs that read from the DB via SQL and monitor MQ (message queue) levels *), the count of active objects by department (to warn me of spikes in usage by individual organisational units), and license counts **), and that watch for changes in the host table (to warn me when someone adds agents **)). If you need some of these SQL queries, shoot me a PM.
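
The "warn me when someone adds agents" part can be approximated by comparing two snapshots of the agent list. This is only a sketch: it assumes a daily job has already exported the agent names from the host table into plain text files, one name per line, and the function name new_agents is made up for illustration.

```shell
#!/bin/sh
# Sketch: detect agents that appear in today's list but not in yesterday's.
# Assumption: both input files contain one agent name per line, produced by
# some export of the host table that is not shown here.

# $1: old snapshot file, $2: new snapshot file
# stdout: agent names present only in the new snapshot
new_agents() {
  old_sorted=$(mktemp)
  new_sorted=$(mktemp)
  sort -u "$1" > "$old_sorted"
  sort -u "$2" > "$new_sorted"
  # comm -13 suppresses lines unique to file 1 and lines common to both
  comm -13 "$old_sorted" "$new_sorted"
  rm -f "$old_sorted" "$new_sorted"
}
```

Empty output means no new agents; anything else can be mailed or fed to the monitoring system.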

Some of these jobs feed into Zabbix, which is the company's choice for plotting diagrams.

Footnotes:

*) Queue levels are unfortunately pretty much a binary indicator of system health, as far as Automic is concerned. Queue levels are either "oh, don't worry, an MQMEM of a few thousand items is still totally okay" or "oh, the system stopped". I don't know of any useful "official" thresholds beyond which a queue can be considered too full, but I made my own thresholds over time by looking at the "usual" queue levels, and when those get exceeded, I get an alert and can take a more in-depth look at the system.
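
One possible way to turn "usual" queue levels into a number is mean plus a few standard deviations over historical samples. To be clear, that formula is an assumption swapped in for illustration; the thresholds described above were chosen by eye, and the function name queue_threshold is hypothetical.

```shell
#!/bin/sh
# Sketch: derive an alert threshold from historical queue-level samples.
# Assumption: "usual level" is modelled as mean + K population standard
# deviations; this is an illustrative choice, not an official Automic value.

# $1: K (number of standard deviations)
# stdin: one numeric sample per line
# stdout: the threshold, rounded to a whole number
queue_threshold() {
  awk -v k="$1" '
    { n++; sum += $1; sumsq += $1 * $1 }
    END {
      if (n == 0) exit 1
      mean = sum / n
      var = sumsq / n - mean * mean
      if (var < 0) var = 0          # guard against rounding noise
      printf "%.0f\n", mean + k * sqrt(var)
    }'
}
```

Feeding it the collected counter values for one queue (for example a week of samples) gives a starting threshold that can then be tuned by hand.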

**) licenses have to be monitored because other people in the organisation can install agents without telling me, and agents immediately consume licenses without server admin intervention, which is bad design. So I run the Automic license analyzer query once per day, and have a script that tries to emulate the same data processing that Automic (by my guesses) does with the data, thus giving me a reasonable guesstimate on the number of used licenses that Automic would calculate at the annual audit, to make sure I never exceed that number on any given day.

Inside Automic, I have a "Server Monitor" script that monitors queues and load levels using the Automic functions, warns via email about any violations, and also feeds into text files, which then get farmed by Zabbix for drawing diagrams of queue levels. It also keeps a running total of dead Automic processes, and can notify people if e.g. more processes keep dying. Here's my code (company internals removed). It probably sucks, I suck at AutomicScript:

!Server Monitor with Zabbix Feed: This version is a clone off of the Server
!Monitor script which runs as a JOBS, and writes counters to files on the
!central UC4 utility server. The text files with counters can then be processed
!in Zabbix or any other diagram drawing app.
!questions? -> cschmitz

!configure the name of the variable object used for keeping state here.
!object must exist
:SET &STATEVAR_NAME# = "VARA.DC1.SERVER_MONITOR_STATE"
:SET &CONFIGVAR_NAME# = "VARA.DC1.SERVER_MONITOR_CONFIG"

:SET &MAIL_RCPT# = GET_VAR(&CONFIGVAR_NAME#, "email_recipient")

!fallback if empty
:IF &MAIL_RCPT# = ""
:  SET &MAIL_RCPT# = "go_ahead_send_me_spam@example.com"
:ENDIF

:SET &VER_INFO# = SYS_INFO(SERVER, VERSION,ALL)
:SET &SERVER_OPTS# = GET_UC_SETTING(SERVER_OPTIONS)

:SET &LOAD_01# = SYS_BUSY_01()
:SET &LOAD_10# = SYS_BUSY_10()
:SET &LOAD_60# = SYS_BUSY_60()

!wow, this is silly
:SET &LOAD_01_FMT# = FORMAT(&LOAD_01#)
:SET &LOAD_10_FMT# = FORMAT(&LOAD_10#)
:SET &LOAD_60_FMT# = FORMAT(&LOAD_60#)

!PWP is the only one that has load averages for its MQ
:SET &MQ_PWP_BUSY_01# = SYS_INFO(MQPWP, BUSY, 1)
:SET &MQ_PWP_BUSY_01_FMT# = FORMAT(&MQ_PWP_BUSY_01#)

:SET &MQ_PWP_BUSY_10# = SYS_INFO(MQPWP, BUSY, 10)
:SET &MQ_PWP_BUSY_10_FMT# = FORMAT(&MQ_PWP_BUSY_10#)

:SET &MQ_PWP_BUSY_60# = SYS_INFO(MQPWP, BUSY, 60)
:SET &MQ_PWP_BUSY_60_FMT# = FORMAT(&MQ_PWP_BUSY_60#)


!Message queues outstanding messages count
:SET &MQ_PWP_COUNT# = SYS_INFO(MQPWP, COUNT)
:SET &MQ_WP_COUNT# = SYS_INFO(MQWP, COUNT)
:SET &MQ_DWP_COUNT# = SYS_INFO(MQDWP, COUNT)
:SET &MQ_OWP_COUNT# = SYS_INFO(MQOWP, COUNT)
:SET &MQ_RWP_COUNT# = SYS_INFO(MQRWP, COUNT)

!Message queue average processing times
!
!beware: if you get something like "20877", that's actually a
!return code printed in place of the result, letting you know
!that e.g. the period parameter is not accepted!
!
:SET &MQ_PWP_AVGTIME# = SYS_INFO(MQPWP, LENGTH, 1)
:SET &MQ_WP_AVGTIME# = SYS_INFO(MQWP, LENGTH, 1)
:SET &MQ_DWP_AVGTIME# = SYS_INFO(MQDWP, LENGTH, 1)
:SET &MQ_OWP_AVGTIME# = SYS_INFO(MQOWP, LENGTH, 1)
:SET &MQ_RWP_AVGTIME# = SYS_INFO(MQRWP, LENGTH, 1)

!send email if various values are out of boundaries

:IF &MQ_PWP_COUNT# >= 30
:  SET &RETVAL# = SEND_MAIL(&MAIL_RCPT#,,"PWP MQ Count exceeded limit on &$SYSTEM#","Warning: Message queue count of PWP on &$SYSTEM# is currently &MQ_PWP_COUNT#")
:  IF &RETVAL# <> 0
:    PRINT "ERROR: Sending Email failed with code &RETVAL#!"
:  ENDIF
:ENDIF

!possibly ten minutes would be the most interesting, with one being too twitchy
!and the hour too laggy - IF the averages were to work as advertised, which
!they don't seem to do

:IF &LOAD_10_FMT# >= 90
:  SET &RETVAL# = SEND_MAIL(&MAIL_RCPT#,,"UC4 load on &$SYSTEM# exceeded limit over last 10 minute average","Warning: load average over 10 minutes on &$SYSTEM# is currently &LOAD_10_FMT# (which may be benign, since the function to report the averages seems to be buggy ...")
:  IF &RETVAL# <> 0
:    PRINT "ERROR: Sending Email failed with code &RETVAL#!"
:  ENDIF
:ENDIF

:PRINT "System Monitor for &$SYSTEM#"
:PRINT "Version: &VER_INFO#"
:PRINT "Options: &SERVER_OPTS#"
:PRINT ""
:PRINT "Load (system) pct. (1m, 10m, 60m): &LOAD_01_FMT#, &LOAD_10_FMT#, &LOAD_60_FMT#"
:PRINT "Load (PWP MQ) pct. (1m, 10m, 60m): &MQ_PWP_BUSY_01_FMT#, &MQ_PWP_BUSY_10_FMT#, &MQ_PWP_BUSY_60_FMT#"
:PRINT ""
:PRINT "Message queues:"
:PRINT "Type |   Msg. Queued      |   Avg. Proc. Time    |  Descr.    "
:PRINT "-----------------------------------------------------------------------------------------"
:PRINT "PWP  |   &MQ_PWP_COUNT# |   &MQ_PWP_AVGTIME#   |  Primary WP"
:PRINT "WP   |   &MQ_WP_COUNT# |   &MQ_WP_AVGTIME#   |  Worker Process"
:PRINT "DWP  |   &MQ_DWP_COUNT# |   &MQ_DWP_AVGTIME#   |  Dialog WP"
:PRINT "OWP  |   &MQ_OWP_COUNT# |   &MQ_OWP_AVGTIME#   |  Output"
:PRINT "RWP  |   &MQ_RWP_COUNT# |   &MQ_RWP_AVGTIME#   |  Resources"
:PRINT ""

! check all server processes, report if one (or more) are considered dead
! should be changed to parsing an array later

:SET &SYS_SERVERS_DEAD#=""
:IF SYS_SERVER_ALIVE("UC4P#CP001") <> "Y"
:  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#CP001"
:  PRINT "Error in &SYS_SERVERS_DEAD#"
:ENDIF

:IF SYS_SERVER_ALIVE("UC4P#CP002") <> "Y"
:  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#CP002"
:  PRINT "Error in &SYS_SERVERS_DEAD#"
:ENDIF

:IF SYS_SERVER_ALIVE("UC4P#CP003") <> "Y"
:  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#CP003"
:  PRINT "Error in &SYS_SERVERS_DEAD#"
:ENDIF

:IF SYS_SERVER_ALIVE("UC4P#CP004") <> "Y"
:  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#CP004"
:  PRINT "Error in &SYS_SERVERS_DEAD#"
:ENDIF

:IF SYS_SERVER_ALIVE("UC4P#WP001") <> "Y"
:  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#WP001"
:  PRINT "Error in &SYS_SERVERS_DEAD#"
:ENDIF

:IF SYS_SERVER_ALIVE("UC4P#WP002") <> "Y"
:  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#WP002"
:  PRINT "Error in &SYS_SERVERS_DEAD#"
:ENDIF

:IF SYS_SERVER_ALIVE("UC4P#WP003") <> "Y"
:  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#WP003"
:  PRINT "Error in &SYS_SERVERS_DEAD#"
:ENDIF

:IF SYS_SERVER_ALIVE("UC4P#WP004") <> "Y"
:  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#WP004"
:  PRINT "Error in &SYS_SERVERS_DEAD#"
:ENDIF

:IF SYS_SERVER_ALIVE("UC4P#WP005") <> "Y"
:  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#WP005"
:  PRINT "Error in &SYS_SERVERS_DEAD#"
:ENDIF

:IF SYS_SERVER_ALIVE("UC4P#WP006") <> "Y"
:  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#WP006"
:  PRINT "Error in &SYS_SERVERS_DEAD#"
:ENDIF

:IF SYS_SERVER_ALIVE("UC4P#WP007") <> "Y"
:  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#WP007"
:  PRINT "Error in &SYS_SERVERS_DEAD#"
:ENDIF

:IF SYS_SERVER_ALIVE("UC4P#WP008") <> "Y"
:  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#WP008"
:  PRINT "Error in &SYS_SERVERS_DEAD#"
:ENDIF

:IF SYS_SERVER_ALIVE("UC4P#WP009") <> "Y"
:  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#WP009"
:  PRINT "Error in &SYS_SERVERS_DEAD#"
:ENDIF

:IF SYS_SERVER_ALIVE("UC4P#WP010") <> "Y"
:  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#WP010"
:  PRINT "Error in &SYS_SERVERS_DEAD#"
:ENDIF

:IF SYS_SERVER_ALIVE("UC4P#WP011") <> "Y"
:  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#WP011"
:  PRINT "Error in &SYS_SERVERS_DEAD#"
:ENDIF

:IF SYS_SERVER_ALIVE("UC4P#WP012") <> "Y"
:  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#WP012"
:  PRINT "Error in &SYS_SERVERS_DEAD#"
:ENDIF

:IF SYS_SERVER_ALIVE("UC4P#WP013") <> "Y"
:  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#WP013"
:  PRINT "Error in &SYS_SERVERS_DEAD#"
:ENDIF

:IF SYS_SERVER_ALIVE("UC4P#WP014") <> "Y"
:  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#WP014"
:  PRINT "Error in &SYS_SERVERS_DEAD#"
:ENDIF

:IF SYS_SERVER_ALIVE("UC4P#WP015") <> "Y"
:  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#WP015"
:  PRINT "Error in &SYS_SERVERS_DEAD#"
:ENDIF

:IF SYS_SERVER_ALIVE("UC4P#WP016") <> "Y"
:  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#WP016"
:  PRINT "Error in &SYS_SERVERS_DEAD#"
:ENDIF

:IF SYS_SERVER_ALIVE("UC4P#WP017") <> "Y"
:  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#WP017"
:  PRINT "Error in &SYS_SERVERS_DEAD#"
:ENDIF

:IF SYS_SERVER_ALIVE("UC4P#WP018") <> "Y"
:  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#WP018"
:  PRINT "Error in &SYS_SERVERS_DEAD#"
:ENDIF

:IF SYS_SERVER_ALIVE("UC4P#WP019") <> "Y"
:  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#WP019"
:  PRINT "Error in &SYS_SERVERS_DEAD#"
:ENDIF

:IF SYS_SERVER_ALIVE("UC4P#WP020") <> "Y"
:  SET &SYS_SERVERS_DEAD#="&SYS_SERVERS_DEAD# UC4P#WP020"
:  PRINT "Error in &SYS_SERVERS_DEAD#"
:ENDIF

:IF &SYS_SERVERS_DEAD# <> ""
!  only send an email if something is a) wrong and b) has changed since last email
:  SET &LAST_DEADPROCESS_LIST# = GET_VAR(&STATEVAR_NAME#, "LAST_DEADPROCESS_LIST")
:  IF &SYS_SERVERS_DEAD# <> &LAST_DEADPROCESS_LIST#
:    PRINT "The following processes are reportedly DEAD: &SYS_SERVERS_DEAD#"
:    SET &RETVAL# = SEND_MAIL(&MAIL_RCPT#,,"Dead Process on &$SYSTEM#","The following server processes are reported DEAD by the server monitoring script: &SYS_SERVERS_DEAD#. You will NOT receive further emails unless the list of dead processes changes!")
:    IF &RETVAL# <> 0
:      PRINT "ERROR: Sending Email failed with code &RETVAL#!"
:    ELSE
!      only set "acknowledgement" in variable if email sending worked. Otherwise, keep attempting to send mails in case it's a temporary problem.
:      PUT_VAR &STATEVAR_NAME#,"LAST_DEADPROCESS_LIST", &SYS_SERVERS_DEAD#
:    ENDIF
:  ENDIF
:ELSE
:  PRINT "All server processes reportedly alive."
:ENDIF

! ----------- no shell script above this line

set -e

COLLATE_DIR=/tmp/uc4_servmon_counters

mkdir -p $COLLATE_DIR

# put data into files so Zabbix can farm it later

echo "&LOAD_01_FMT#" | bc        > $COLLATE_DIR/uc4_sys_load_1min.txt
echo "&MQ_PWP_BUSY_01_FMT#" | bc > $COLLATE_DIR/uc4_mq_pwp_busy_1min.txt
echo "&MQ_PWP_COUNT#"   | bc     > $COLLATE_DIR/uc4_mq_pwp_count_cur.txt
echo "&MQ_WP_COUNT#"    | bc     > $COLLATE_DIR/uc4_mq_wp_count_cur.txt
echo "&MQ_DWP_COUNT#"   | bc     > $COLLATE_DIR/uc4_mq_dwp_count_cur.txt
echo "&MQ_OWP_COUNT#"   | bc     > $COLLATE_DIR/uc4_mq_owp_count_cur.txt
echo "&MQ_RWP_COUNT#"   | bc     > $COLLATE_DIR/uc4_mq_rwp_count_cur.txt
echo "&MQ_PWP_AVGTIME#" | bc     > $COLLATE_DIR/uc4_mq_pwp_avgproctime.txt
echo "&MQ_WP_AVGTIME#"  | bc     > $COLLATE_DIR/uc4_mq_wp_avgproctime.txt
echo "&MQ_DWP_AVGTIME#" | bc     > $COLLATE_DIR/uc4_mq_dwp_avgproctime.txt
echo "&MQ_OWP_AVGTIME#" | bc     > $COLLATE_DIR/uc4_mq_owp_avgproctime.txt
echo "&MQ_RWP_AVGTIME#" | bc     > $COLLATE_DIR/uc4_mq_rwp_avgproctime.txt

SERVER_PROCESSES_NOT_OK="&SYS_SERVERS_DEAD#"

if [ ! -z "$SERVER_PROCESSES_NOT_OK" ] ; then
  NUM_SERVER_PROCESSES_NOT_OK=$( echo "$SERVER_PROCESSES_NOT_OK" | wc -w )
  echo $NUM_SERVER_PROCESSES_NOT_OK > $COLLATE_DIR/uc4_sys_server_procs_not_ok.txt
else
  echo "0" > $COLLATE_DIR/uc4_sys_server_procs_not_ok.txt
fi

This also prints a mildly human-readable report like this:

It's mildly entertaining that the queue numbers from inside Automic never seem to match those gotten from the actual DB tables, but it's usually not enough difference to truly matter.

Here's an example of how the data looks in Zabbix after it's been processed into a pile of widgets:

4. Re: Collecting AE performance metrics

Keld Mollnitz
Posted Aug 20, 2018 03:41 AM

There is also the option of measuring the average transaction times in the AE. The way to do this is to activate minimal traces (set trace option 16=1) and leave them running for 10-15 minutes. Then upload the AE server logs/trace files to Support and have them send you the result.
The average transaction time is a good indicator of how the AE is performing, and it would be really good if this number could be fetched continuously, say once every hour, without involving Support (sending trace files to Support every hour is of course not an option...).

The minimal trace should not impact the overall performance so it should be fairly harmless...(so I have been told by Support).

/Keld.
5. Re: Collecting AE performance metrics

Mihail Cheie
Posted Oct 16, 2018 09:03 AM

Thank you @Carsten Schmitz for sharing the script with everyone. We implemented these checks as well but instead of creating an ARRAY for SYS_SERVER_ALIVE, we used this:

:SET &WP# = PREP_PROCESS_VAR(VARA.SQL.SERVER_PROCESSES)
:PROCESS &WP#
:  SET &WP_NAME# = GET_PROCESS_LINE(&WP#,1)
:  IF SYS_SERVER_ALIVE(&WP_NAME#) <> "Y"
!    append to the list so that earlier dead processes are not overwritten
:    SET &SYS_SERVERS_DEAD# = "&SYS_SERVERS_DEAD# &WP_NAME#"
:    PRINT "Error in &SYS_SERVERS_DEAD#"
:  ENDIF
:ENDPROCESS

We created VARA.SQL.SERVER_PROCESSES with the following SQL:
select oh_name from uc4db.oh where oh_otype IN ('SERV')

With this in place, the VARA picks up all 53 of our WPs/DWPs/CPs and the JWP, and the rest of the script checks them line by line to see whether they are up.

Hope this helps!
6. RE: Collecting AE performance metrics

Tony Ferraro
Posted Sep 06, 2019 05:48 PM

Would be great if some of these were built into the GUI.
Troubleshooting performance issues is tricky with workload.

Server process utilization (B.01, B.10, and B.60 numbers). - This one is in the GUI but only 24 hours worth, not enough for trend analysis.

Number of rows in the ITL table

Number of rows in the MQ tables

Stats - how many unique jobs do I have? How many jobs were executed per day/month?

Not sure if the Analytics component tracks any of this long term?

7. RE: Collecting AE performance metrics

Carsten Schmitz
Posted Sep 11, 2019 04:50 AM
Edited by Carsten Schmitz Sep 11, 2019 04:51 AM

> Server process utilization (B.01, B.10, and B.60 numbers). - This one is in the GUI but only 24 hours worth, not enough for trend analysis.
> Number of rows in the ITL table
> Number of rows in the MQ tables
> Stats - how many unique jobs do i have? How many jobs were executed per day/month?

Great idea, I totally second all of that. The question of how many objects/jobs/executions one has per day/week/year comes up so very often that it should be a running total counter, discoverable in the GUI.

Additionally, it would also be neat to have similar stats for the agents. E.g. the time to start jobs can be severely impacted on agents with high loads. Currently we monitor those execution timings with SQL queries, would be cool to have that on the admin tab/agent panel, and a counter for number of jobs. Could display the ratio of "overhead" vs. "execution" (similar to "user" vs. "system" times in UNIX). Or also have an "uptime %" counter for agents.

Unless it's in Analytics (sorry, I wouldn't know ...) this should go into ideation.
