
PURPOSE:

This project provides a deployable .jar file that changes the format of the httpclientX.log file in Access Gateway/Secure Proxy Server so that it prints the thread id, along with some other useful format changes.

 

OVERVIEW: 

The current httpclient.log format is:

Oct 04, 2017 5:21:49 PM org.apache.http.impl.conn.Wire wire
FINE: >> "GET /unprotected/dump.aspx HTTP/1.1[\r][\n]"

 

The .jar changes this to a format like:

[2017-10-04][17:03:16.759][FINE][1620][39][org.apache.http.impl.conn.Wire][wire][Wire.java:63][>> "GET /unprotected/dump.aspx HTTP/1.1[\r] [\n]"]

 

The single-line format, containing the thread id and a millisecond timestamp, is much more useful for tracing and debugging individual transactions, and for identifying the response to an individual request.
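For reference, below is a minimal sketch of what a java.util.logging Formatter producing this kind of single-line output could look like. It is an illustrative example only: the class name SingleLineThreadFormatter is hypothetical, and the shipped com.ca.ssoutils.log4jformat.MultiThreadFormatter may well differ (for instance, printing the source file and line number requires extra work that is not shown here).

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.logging.Formatter;
import java.util.logging.LogRecord;

// Hypothetical example only - not the shipped MultiThreadFormatter source.
public class SingleLineThreadFormatter extends Formatter {

    // Best-effort process id, taken from the JVM runtime name ("pid@host").
    private static final String PID =
        java.lang.management.ManagementFactory.getRuntimeMXBean().getName().split("@")[0];

    @Override
    public synchronized String format(LogRecord r) {
        SimpleDateFormat stamp = new SimpleDateFormat("[yyyy-MM-dd][HH:mm:ss.SSS]");
        StringBuilder sb = new StringBuilder(stamp.format(new Date(r.getMillis())));
        sb.append('[').append(r.getLevel()).append(']')
          .append('[').append(PID).append(']')
          .append('[').append(r.getThreadID()).append(']')        // java.util.logging thread id
          .append('[').append(r.getSourceClassName()).append(']')
          .append('[').append(r.getSourceMethodName()).append(']')
          .append('[').append(formatMessage(r)).append(']')
          .append(System.lineSeparator());
        return sb.toString();
    }
}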

 

BACKGROUND:

Tracing requests through the Access Gateway/Secure Proxy Server to determine the cause of delays requires us to have a good handle on each request as we trace it from the client browser:

 

A request from the client browser:

[Image: request.png]

Goes through the Ag server to the backend. 

 

The trace of the data sent and received between the SPS and the backend server is logged in httpclient.log, and the default formatting looks like this:

[Image: pre-formated.png]

That format makes it difficult (virtually impossible) to identify which response from the backend server is tied to a specific request.

 

The new format :  

[Image: newlog.png]

Allows us to tie each request (">>") to the response data ("<<") received from the backend, via the PID and TID [1620][39].

 

The advantages of this log4j format are: 

a) It shows PID & threadId for each log line [1620][39]

  Currently it is not possible in the httpclient log to identify the response to an individual request.
  With the thread id it is possible to tie the request sent to the backend (the ">>" tag) to the response
  from the backend (the "<<" tag), as they will have the same thread id.

  

b) It shows the source file and line of code, e.g. Wire.java:63

  Not as useful in this case, but in other circumstances it can identify the line of code responsible for
  writing the log file entry.

 

c) Millisecond timestamps

       Most transactions occur in the millisecond timeframe, so this is a better default.

  

d) Data is all on one line

       Having the data all on one line makes it simpler to use tools like grep and find, since
       they then have the complete log record, not just half of it.

 

Disadvantages:

   Doing extra work takes time. Adding the thread id is not a lot of work, and although the logging framework
   does not guarantee the logged thread id is the thread that handled the request, in all my test cases it has been
   (if you have some remote logger, for example, the thread id may be different).

   SrcFile does add a bit of overhead (which is why formatters don't normally include it), but it can be useful. If you
   don't need it and speed is important, you can comment it out in the code.

I didn't work really hard on making the formatting efficient, so that could probably be improved if speed is needed. In my case this is used for those odd occasions where there is some delay in the backend and we need to identify which transactions have the delay.

 

INSTALL INSTRUCTIONS: 

The deployment .zip file is attached.  It contains a README.txt, the deployment .jar file, and the source code in case it needs to be recompiled for your Java runtime.

 

log4j-MultiThreadFormat.zip 

 

DEPLOYMENT:

 

1) Copy : 

   log4jMultiThreadFormat.jar

 to : 

    C:\Program Files (x86)\CA\secure-proxy\Tomcat\endorsed

 

2) Edit: httpclientlogging.properties

Change :

#Specifies the name of a Formatter class to use.
java.util.logging.FileHandler.formatter = java.util.logging.SimpleFormatter
to : 
#Specifies the name of a Formatter class to use.
#java.util.logging.FileHandler.formatter = java.util.logging.SimpleFormatter
java.util.logging.FileHandler.formatter = com.ca.ssoutils.log4jformat.MultiThreadFormatter

[Image: httpclientproperties.png]

 

3) Restart SPS proxy-engine Service 

 

    Use the Services panel (Windows) or the sps-ctl script (UNIX).

 

 

RECOMPILE:

    1) Move the package to the Access Gateway/Secure Proxy Server machine.

Generally it is easiest to build on the Ag/SPS machine, since it has the JDK that you need, and you can be sure the .jar libraries are going to be compatible.

 

    2) Edit build.bat 

Change JAVA_HOME and SPSPATH to the locations of your JDK and secure-proxy deployment.

set JAVA_HOME=C:\Program Files (x86)\Java\jdk1.7.0_51

set SPSPATH=C:\Program Files (x86)\CA\secure-proxy

 

    3) Run build.bat
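The build script essentially just compiles the source and packages the classes into the .jar. A rough, hypothetical outline of the kind of commands involved is shown below; the shipped build.bat is the authoritative version and the source layout may differ:

set JAVA_HOME=C:\Program Files (x86)\Java\jdk1.7.0_51
set SPSPATH=C:\Program Files (x86)\CA\secure-proxy
mkdir classes
"%JAVA_HOME%\bin\javac" -d classes src\com\ca\ssoutils\log4jformat\*.java
"%JAVA_HOME%\bin\jar" cvf log4jMultiThreadFormat.jar -C classes .
copy log4jMultiThreadFormat.jar "%SPSPATH%\Tomcat\endorsed"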

 

 

Introduction:

The CA SSO Secure Proxy Server releases R12.51, R12.52 and R12.52 SP1 are supported for connecting to some older backend policy server versions.

 

For SPS releases R12.51, R12.52 and R12.52 SP1, the recommended install is still that the Policy Server release version should be equal to or greater than the SPS (or web agent) version.

But there were some considerations with those SPS releases, mainly that they let you select an earlier release version of the policy server during the install process; the net result is that, although not the recommended arrangement, it is a supported configuration.

 

This exception does not apply to SPS releases after R12.52 SP1 (SPS is now called Access Gateway), so for Ag releases R12.6 and R12.7 the supported configuration requires the policy server release to be greater than or equal to the agent version.

 

A similar situation also exists for the CA SharePoint 2010 Agent.

 

Background:

Generally SSO agents have a strict rule that the policy server release must be greater than or equal to the web agent release version. From the R12.52 PSM (Platform Support Matrix):

"CA SiteMinder12.52 Policy Server supports Web Agents at a higher CR (cumulative release) number than the Policy Server
provided both are the same level Service Pack."

 

 So a supported configuration is :

WebAgent R12.51  -> PolicyServer R12.51 (or R12.52, R12.52Sp1 or R12.6 or R12.7) 

And example unsupported configurations are :

WebAgent R12.52 -> PolicyServer R12.51

WebAgent R12.52Sp1 -> PolicyServer R12.52

 

But within the same Service Pack release the CR level does not matter, and the WebAgent can have a later CR release, so:

WebAgent R12.52 Cr5 -> PolicyServer R12.52

WebAgent R12.52Sp1 Cr5 -> PolicyServer R12.52Sp1 Cr1

are also supported configurations. 

 

 

Instructions:

After install, when running the SPS configuration, it gives a choice of which policy server version it is being connected to.  The exact choices depend on the release:

    SPS R12.51 – gives choice of PolicyServer : R6, R12, R12.X

    SPS R12.52 – gives choice of PolicyServer : R6, R12.X, R12.5X

    SPS R12.52Sp1 – gives choice of PolicyServer : R12.X, R12.5X

The choice here determines which affwebservices package is installed into the SPS Tomcat deployment.

 

Here was the clarification note from Engineering :

Here are the SPS compatibility with Policy Server guidelines, across versions:

1) SPS 12.5 release was geared towards bringing SPS at parity with web-agents ; and to encourage customer-adoption (without forcing Policy Server upgrades), backward compatibility with older Policy Server versions was provided. 

o) This was mentioned in SPS PSM / Release-Notes & explicitly prompted during SPS configuration-screen too.

 

2) Later versions of SPS, 12.51 & 12.52 continue the configuration screen and support backward compatibility (PS 12.0.x & 12.5 x); though it is not explicitly highlighted in PSM. 

o) However, this compatibility is only for existing/old features & not the new ones. For ex, Session Assurance, support for Active Profile was brought in during 12.52 release and requires both SPS & Policy Server to be 12.52  version. These features are not supported with older PS versions.

 

Hope this clarifies the compatibility questions.  

Also, going forward, with next version of SPS, plan is not to carry forward this backward compatibility, and only support traditional compatibility (old agents support newer servers, for existing features).

 

That "going forward" ended up covering R12.6 and R12.7, which now have a fixed deployment of affwebservices (the same version as the release) and do not allow you to select older versions.

 

For those R12.5X releases, the screen you are presented with when running the SPS configuration (ca-sps-config.exe) is:

For SPS R12.51  & R12.52 

 

 

And for R12.52 Sp1 :

 

 

 

https://support.ca.com/us/knowledge-base-articles.TEC1026157.html

 

Documented Process for enabling httpclient logging in Access Gateway  

 

With Access Gateway R12.7, the documented way to enable httpclient logging does not work.

The documented method is to change the server.conf entry: 

    httpclientlog ="yes"

 

But an extra setting is required in Ag R12.7

 

 

For Access Gateway R12.7 it needs : 

 

In server.conf : 

   httpclientlog ="yes"

This should be the only setting required.

But httpclient logging uses a commons logging module, which can use SimpleLog, log4j or java.util.logging. The commons logging module does some checking at startup and determines which logging mechanism to use. Prior to R12.7 the default was java.util.logging, but some change in R12.7 has changed the default.

You can set the default for commons logging back to java.util.logging via a JVM startup property:

    -Dorg.apache.commons.logging.Log=org.apache.commons.logging.impl.Jdk14Logger 

 

For Access Gateway R12.7 this needs to be set in : 

 

For Windows: smSpsProxyEngine.properties 

 

 

Edit the smSpsProxyEngine.properties file and add the parameter : 

[Screenshot: 2017-08-29 12_26_04-SPS R12.52 Sp1Cr02 - VMware Workstation.png]

 

For Unix : proxyserver.sh 

 

Edit the proxyserver.sh file and add the parameter :  

[Screenshot: 2017-08-29 12_35_19-CentOS 6.5 64-bit-SPS - VMware Workstation.png]

 

This has also been raised as an internal Sustaining Engineering ticket, and we expect it will be fixed in a later CR version.

 

 

Raised KD: 

Enable httpclient logging in Agent Gateway 12.7 

The speed with which the SMPolicyTraceAnalysis tool runs through trace logs has been improved.  The (long) time it takes to process large logs has been on my mind for quite a while, and a cluster of recent cases with fairly large logs tipped me into looking at it.

 

New version 670 : now downloadable from: 

Siteminder Policy Trace Analysis 

 

The main speed advantage comes from taking advantage of the larger memory available to the 64-bit JVM and holding all the counters and open transactions in memory.  Previously, for the 32-bit JVM, I had to keep both counters and open transactions in temporary disk files.

 

There are now separate threads for reading the trace data and processing it, as sketched below.  I tried a pool of threads to process the counters and parse input lines, but that was slightly slower than just two threads (I suspect there is some bottleneck and more threads could work, but for now the improvement was enough).
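For illustration, here is a simplified sketch of that reader/processor split (not the tool's actual code; the file name and queue size are just placeholders):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class TwoThreadTraceReader {

    private static final String EOF = "<<EOF>>";   // sentinel marking end of input

    public static void main(String[] args) throws Exception {
        final BlockingQueue<String> queue = new ArrayBlockingQueue<String>(10000);

        // Reader thread: pulls raw lines off disk and queues them.
        Thread reader = new Thread(new Runnable() {
            public void run() {
                try (BufferedReader in = new BufferedReader(new FileReader("smtracedefault.log"))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        queue.put(line);              // blocks if the processor falls behind
                    }
                    queue.put(EOF);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });

        // Processor thread: parses the lines and updates the in-memory counters.
        Thread processor = new Thread(new Runnable() {
            public void run() {
                try {
                    String line;
                    while (!(line = queue.take()).equals(EOF)) {
                        // parse the trace line and update counters / open-transaction maps here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });

        reader.start();
        processor.start();
        reader.join();
        processor.join();
    }
}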

 

I've just used options available in Java 7. I did try some of the Java 8 pool features, but they did not add any speed, so those are currently commented out.

 

I did want to store the results in a database, to be able to incrementally add more trace logs, to make it easier to re-generate reports with, say, different graphs, and to be able to store processed traces from different (say, policy) servers to allow plotting data from multiple machines. But again this was slower, so that is on hold for now.

 

 

Here were the improvements I had for my test systems: 

 

Running on Windows: 

5.4 GB of logs took 3+ hours with the older version and now takes less than 20 min.

(Windows 7 on a Dell Precision M4800 laptop with 32 GB of memory and a 256 GB SSD.)

 

Running on Linux, 

2 GB of logs took 44 min with the older version and now takes 4 min.

(Mint Linux 18 on a Dell Precision T7500 with 47 GB of memory and a fairly normal 7200 rpm HDD.)

 

 

 

The JVM Options: 

The run.bat/run.sh now has the following options: 

java -XX:+UseLargePages -XX:+UseFastAccessorMethods -XX:+UseParallelGC -jar SMPolicyTraceAnalysis.jar 

 

Occasionally some users have had trouble with -XX:+UseLargePages, which may require specific permissions on Windows 7 machines; removing the option will slow things down a small amount, but you will still get most of the benefit.

 

Memory Usage:

The above runs show the JVM sitting at around 5 GB of memory; the way the JVM works under load, however, is that it will use memory up to the point where the Garbage Collector is triggered. The exact memory required depends on how long your transactions are, how many are open at once, and how long the counters need to count, but the application will generally run quite easily in 3-4 GB of memory.
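If you want to bound the heap explicitly, you can add a standard -Xmx flag to the run.bat/run.sh command line, for example (a suggestion, not part of the shipped scripts):

java -Xmx4g -XX:+UseLargePages -XX:+UseFastAccessorMethods -XX:+UseParallelGC -jar SMPolicyTraceAnalysis.jar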

 

More JVM Options:

There are more JVM optimisations that could be applied, particularly for Java 8, if you want to experiment:

http://docs.oracle.com/javase/7/docs/technotes/tools/windows/java.html

http://docs.oracle.com/javase/8/docs/technotes/tools/windows/java.html

 

 

New version 670 : now downloadable from: 

Siteminder Policy Trace Analysis 

 

Cheers - Mark

----
Mark O'Donohue
Snr Principal Support Engineer - Global Customer Success

Introduction 

The purpose of this blog entry is to show how to enable all the different types of trace logs that are available in CA SSO Access Gateway (formerly known as Secure Proxy Server). I will also be referring to the Access Gateway product as "Ag" in this article; however, some of the slides predate the name change, so they will show up as SPS.

 

Ag can be used in a few different ways, and what you are using it for will determine which logs you want to enable.   I've split this up into different themes:

  • Ag Logging when used as Reverse Proxy Server
  • Ag Logging when used as Federation Gateway
  • Ag Logging for ProxyUI 
  • Ag Logging for WebServices

 

This blog covers the logging used when Ag is used as a Reverse Proxy Server; the other logging profiles will be added as separate documents at a later date.

 

 

 

Ag Logging when used as a Reverse Proxy Server: 

 

The following gives an overview of the major components of Ag, and also shows the names of (all) the logs that can be enabled and where they get their data from:

 

 

 

When used as a reverse proxy server, requests come in from the client to Apache httpd, get passed to Apache/Tomcat and then get forwarded to a backend server for processing.  The backend then completes the request and the data is passed back to Tomcat, to httpd and back to the client.    Note: I won't be discussing the use of Fiddler and Wireshark, but the diagram indicates where they would be used.

 

In summary we have: 

  • Apache Logs
  • Mod_Jk logs
  • Proxy Engine Logs
  • Web Agent Logs
  • Httpclient Logs

Each of which is covered in the sections below. 

 

 

Apache Logs 

The two major logs for Apache httpd are access_log and error_log; these log the interaction between the user and the httpd process. The httpd.conf entries are:

 

- Access_log  - settings in httpd.conf :  

The formats are defined here : 

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined

LogFormat "%h %l %u %t \"%r\" %>s %b" common

 

And the rotating logs are set here : 

CustomLog "|'C:/Program Files/CA/secure-proxy/httpd/bin/rotatelogs.exe' 'C:/Program Files/CA/secure-proxy/httpd/logs/access_log' 10M" common

 

Occasionally it is good to supplement what is in the access_log to get some extra insight into a problem.  The example above shows the %{User-Agent}i header, but you can also use the format to capture cookies, for example \"%{SMSESSION}C\".   The option %T is also useful, since it logs the total time (in seconds) it took Ag to process the request and return the response to the user.
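For example, a combined-style format extended with the response time and the SMSESSION cookie might look like the following (the nickname "combinedtimed" is just an arbitrary name for this example; add a matching CustomLog line that uses it):

LogFormat "%h %l %u %t \"%r\" %>s %b %T \"%{User-Agent}i\" \"%{SMSESSION}C\"" combinedtimed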

 

A complete list of LogFormat parameters is available here: 

Tech Note : Enable httpclient logging in Access Gateway 12.7 

 

Note: one final point to remember is that the access_log entry is written at the END of processing the request.  So if Apache httpd crashes, the requests that are in flight when the crash happens are NOT logged.

 

- Error_log  - settings in httpd.conf :  

The formats are defined here : 

# LogLevel: values include: debug, info, notice, warn, error, crit, alert, emerg.
LogLevel warn

 

And the rotating logs are set here : 

ErrorLog "|'C:/Program Files/CA/secure-proxy/httpd/bin/rotatelogs.exe' 'C:/Program Files/CA/secure-proxy/httpd/logs/error_log' 10M"


For debugging, you can raise the LogLevel to debug.   Apache 2.4 also has extra levels trace1 ... trace8; these are needed when you want to trace the raw data packets and SSL handshaking problems between the front-end client and the httpd process. So for debugging we often recommend:

LogLevel trace8

 

The Apache error_log is also a good place to find the exact httpd and mod_jk version numbers:

 

 

Mod_Jk Logs

Mod_jk is the Apache httpd module that forwards requests on to Tomcat.  The log settings for it are in httpd.conf:

 

JkLogFile "|'C:/Program Files/CA/secure-proxy/httpd/bin/rotatelogs.exe' 'C:/Program Files/CA/secure-proxy/httpd/logs/mod_jk.log' 10M"

JkLogLevel error

The JkLogLevel parameter describes what detail of logging should occur.   Possible values are: debug, info, error.

 

For debug level logging, it is best to also set the JkRequestLogFormat, to display more detail of the transaction:

JkLogLevel debug

JkRequestLogFormat "%w %V %T %m %H %p %U %s"

That will show most of the raw byte data of what is sent from httpd to Tomcat and what is returned.  The settings are explained here:  https://tomcat.apache.org/connectors-doc/reference/apache.html

 

Sample mod_jk.log : 

 

 

Proxy Engine Logs

The proxy engine has two main logs :  

    server.log

    nohup*.out  

These are in the secure-proxy/proxy-engine/logs directory by default.   The server.log is the log4j logging for the proxy-engine, and the nohup*.out log is the redirect of the stdout and stderr output.

 

server.log

The logging level for server.log is set in Tomcat/properties/logger.properties:

log4j.rootCategory=INFO,SvrFileAppender

log4j.rootCategory.ResourceBundle=root

The log level can be changed to OFF, FATAL, ERROR, WARN, INFO, DEBUG or ALL.
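For example, to turn on debug logging for server.log (remember to set it back to INFO afterwards, as DEBUG is verbose):

log4j.rootCategory=DEBUG,SvrFileAppender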

 

nohup_<date>_<time>.out log : 

We generally don't change the logging in this one, as it captures the stdout/stderr output from the proxy-engine.  One useful tip is adding "-verbose" to the java startup; you then get the exact .jar file that each class is loaded from in this log.   A new timestamped log is started each time the proxy-engine is started.  The nohup log is good at capturing the stacktrace when exceptions are thrown in the proxy-engine, e.g.:

 

 

 

Web Agent Logs

 

Ag comes with the standard WebAgent logs.  These are enabled via the ACO settings, as per a normal agent, e.g.:

They must be enabled and set up via the normal agent ACO parameters:

 

WebAgent.log 

LogAppend="NO"
LogFile="YES"
LogFileName="c:\ca\proxy-engine\logs\WebAgent.log"
LogFileSize="100"

 

 

WebAgentTrace.log 

TraceAppend="NO"
TraceConfigFile="c:\ca\proxy-engine\conf\defaultagent\SecureProxyTrace.conf"
TraceFile="YES"
TraceFileName="c:\ca\proxy-engine\logs\WebAgentTrace.log"
TraceFileSize="100"

 

 

WebAgentTrace.log SecureProxyTrace.conf settings :

The SecureProxyTrace.conf is slightly different from the WebAgentTrace.conf: it has ProxyAgent as a default component.

I also tend to add Agent_Con_Manager, and AgentFunc as components. 

And add the data items PreciseTime, Function and SrcFile, as shown below:

 

WebAgentTrace.log with proxy-rule messages : 

Additionally, for the WebAgentTrace log to record the proxy rule evaluation, you need to add debug="yes" to proxy-rules.xml to get the additional error messages specific to SPS:
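The debug attribute goes on the root proxyrules element of proxy-rules.xml, something like the fragment below (keep your existing namespace declaration and forwarding rules; this only shows where the attribute sits):

<nete:proxyrules xmlns:nete="..." debug="yes">
    <!-- existing forwarding rules unchanged -->
</nete:proxyrules>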

 

 

WebAgentTrace.log examples: 

After setting the above we end up with a normal trace log like:

 

 

And with Ag specific messages for proxy-rules such as:

 

 

HttpClient Logs

HttpClient logs the raw data (GET/POST) that is sent to the backend and the reply that is received, so it is good for debugging the interaction with the backend server.

 

To enable httpclient logging, in server.conf set:
           httpclientlog = "yes"

and restart the proxy-engine service. 

 

Note: For Ag R12.7 there is an extra setting needed to enable httpclient logging:

https://communities.ca.com/community/ca-security/ca-single-sign-on/blog/2017/09/01/tech-note-enable-httpclient-logging-in-agent-gateway-127

 

HttpClient / Java SSL Logging

Java has the ability to log the SSL handshake and the transfer of data.  This is done by adding -Djavax.net.debug=all
to the java runtime startup.  The file this needs to be applied to differs per platform:

For Windows - proxy-engine/conf/SpsProxyEngine.properties
For Unix - proxy-engine/proxyserver.sh

As shown below:

Enable SSL tracing for java:

 

SSL Tracing in the nohup and server.log files : 

These logs then show the SSL handshake, and the decrypt/hash of each packet sent and received when the proxy-engine communicates with the SSL backend:

 

Have a nice time enjoying your logging. 

 

Cheers - Mark

----
Mark O'Donohue
Snr Principal Support Engineer - Global Customer Success

 

This document is part of a series on Logging in SSO components: 

Tech Tip:How to enable trace logging in SSO (aka Siteminder) Webagent 

Tech Tip : Policy Server Loggings 

Tech Tip : Howto enable Tracing in Access Gateway (fka: Secure Proxy Server) 

This tip describes testing the configuration for an Agent Gateway/SPS setup that avoids the situation where one "bad" back-end server stops all access requests via Agent Gateway, even those to other working "good" back-end servers.

 

Testing the configuration from:

TechTip - Configure Agent Gateway/SPS to avoid one bad back-end taking down all AG/SPS traffic. 

 

In the videos here Agent Gateway version R12.6 was used, but the same results were obtained when testing the earlier version, SPS R12.52 SP1.

 

 

04-Test Setup : 

This video walks through the VMware machine test setup.  We have: a front-end server with JMeter installed and test scripts for a "good" and a "bad" backend server; an Agent Gateway/SPS machine forwarding the requests; and a backend web server with a small test proxy program to simulate two backend servers, with the ability to add a delay to the "bad" one.

 

 

05-Test Failure:

Here we show the test scripts.  We have two instances of JMeter running, one against the "good" site and one against the "bad" site.   In this test we show that, with the default configuration, when the "bad" backend server goes down, the Agent Gateway is blocked and we also can't access the other "good" backend server.

 

06-Test Success:

Here we show the configuration changes, and then run the two JMeter test scripts.  We show that, with the changed configuration, when the "bad" backend server goes down, the Agent Gateway is not completely blocked and access to the "good" backend server still works correctly.

 

 

 

07 - Tools used : 

 

  • JMeter : http://jmeter.apache.org/
  • JMeter Plugins : JMeter Plugins :: JMeter-Plugins.org
  • Simulating a down backend : Any method of simulating a down server (or even a real server) can be used to create the test scenario. The Java test tool I have used is attached below.
  • JMeter scripts : see the jmeter-test-scripts.zip attachment at the end of this post.

 

 

 

Cheers - Mark

The short version (executive summary): 

Agent Gateway is the CA product formerly known as Secure Proxy Server; for the purposes of this article, I'll refer to it as Agent Gateway, or Ag for short.


When one back-end server for Agent Gateway goes down, all of the connection pool entries and worker threads get clogged up with transactions for the one down (or very slow) "bad" back-end server.  Requests destined for the other "good", working back-ends are also held up and do not get processed.  The effect is that the one "bad" back-end server tends to drive the whole Agent Gateway offline, leaving it unable to process any requests.

 

The solution is to give the back-end connection pool a fixed maximum size, make that maximum less than the number of available worker threads, and give a quick timeout to any request trying to get a back-end connection beyond that pool size.   Then when one back-end server fails, it does hog worker threads, but only up to the limit of the connection pool size, and importantly it leaves the remaining threads free to handle requests to non-hung backend servers.

 

The settings to change server.conf are from: 

# http_connection_pool_max_size="420"

# http_connection_pool_wait_timeout="0"

to:

http_connection_pool_max_size="100"  # (max pool of 100)

http_connection_pool_wait_timeout="200" # (timeout if pool size already over 100 of 200ms)

 

The longer version :

1. Background.

 

Because Agent Gateway acts as a proxy, it holds onto the connection from the front end (client) while it sends the request to the back-end server to be processed.  For a working back-end system the response time is generally fairly quick, and the number of connections/worker threads needed to maintain throughput is minimal.

 

So for example, as in the diagram below:

 

If we have a requirement for 100 requests/sec, and the back-end response time is 200ms, then the bandwidth in connections/threads we need is enough to handle about 20 requests in parallel.   So we will need: 20 httpd worker threads in Apache; 20 connections via mod_jk from Apache to Tomcat; 20 worker threads in Tomcat; and 20 connections to the back-end server.  Note that the SPS is mostly not doing any work; rather, of the elapsed 200ms, most of the time the threads are inactive, waiting for the response from the back-end server.

 

However, if the back-end server performance slows down, so that rather than 200ms each transaction takes 2 sec, then the number of in-progress transactions that Ag has open at one time will increase.  That would be 100 trans/sec x 2 sec = 200 open transactions.  So now we need a bandwidth of 200 connection pool entries, a 200 thread pool size, etc., for each component.  Obviously if the back-end server goes even slower then the pool/thread size requirements continue to increase.
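This sizing is just Little's Law applied to the proxy: the number of open transactions equals the request rate multiplied by the back-end response time.

    open transactions = request rate x response time
    100 req/sec x 0.2 sec = 20          100 req/sec x 2 sec = 200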

 

 

 

Ultimately, if we get to a stage where the back-end server is down, then for the original 100 requests/sec load (with the default settings of a 60 sec timeout and a 3x retry) we are faced with each transaction taking 180 sec before it sends its failure response back to the client.

 

Under those conditions the connection/thread pool size that we would need is 18,000.  So we would need 18,000 httpd threads; 18,000 connections from clients to httpd; the same from Apache via mod_jk to Tomcat; 18,000 Tomcat threads; and 18,000 connections from Tomcat to the back-end.

 

Obviously before that we will probably have hit some limit, perhaps a 150 thread pool size in Apache, or 600 or 1000, depending on what you have it set to.  But the important thing is that when the back-end server is down, the SPS is flooded with waiting requests, and we can't realistically (or meaningfully) hold open all of those requests for all of those retries.

 

But this is not the problem we are solving - this is the background to the problem. 

 

2. The Problem - one bad back-end can stop all activity

 

 

 

Now, generally an Agent Gateway server has multiple back-end servers.   And if one back-end server goes down then, as we've seen, that leaves a heavy footprint on the internal Ag infrastructure, blocking up all the pipes and stopping access to all back-end servers, not just the non-working one.

 

3. Connection Pool Size  

 

Here is the pattern of connection/thread pool sizes that is best for throughput if the Agent Gateway has only one back-end server.  The design is that at each stage the next pipe's bandwidth is slightly larger than the previous one.  With that model all incoming requests will be forwarded on to the next stage, and ultimately on to the back-end server; there will be no internal bottleneck within Ag.

 

The reality for Ag is that in normal operating conditions the connection pools stay low, often only 5, 10 or 20 active requests depending on the type of transaction, and the pool sizes & thread counts only go up to values of 100 or more when there are delays or problems with the back-end servers.

 

4. The Solution 

 

 

If we have multiple back-end servers, and we know :

a) That normally the pool size is about 20 connections; and

b) That if the back-end connections go above 100 we know there is a problem with that back-end server.  

 

Then we can set up the back-end connection pool as above, where there are fewer connections to the back-end than there are httpd & tomcat worker threads.

 

Normally, if we have a connection pool of 100 and all 100 get used, then when the 101st request comes in for the same back-end, the new request/thread would wait (forever) for a connection to become available.

 

But for our purposes we know that, since we already have 100 requests in progress, we are in trouble. So what we do is add a quick timeout: the new request will time out very quickly and return an error to the client, and importantly it will return the httpd & tomcat worker threads back into the available pools.

 

Here are the settings we change in server.conf : 

http_connection_pool_max_size="420"

http_connection_pool_wait_timeout="0"

 

 

And we change them to :  

http_connection_pool_max_size="100"  (a maximum pool of 100)

http_connection_pool_wait_timeout="200" (a 200ms timeout once all 100 pool connections are in use)

 

So when we have one "bad" back-end server, it will have 100 threads waiting on a response or connection to the backend, but importantly only 100 worker threads.  The remaining ~150 httpd worker threads and tomcat worker threads are still free to handle requests for the "good" back-end servers, and so those requests are still processed as normal.  The 100 or so waiting threads are not using a lot of resources, as they are mostly idle waiting for a response from the "bad" back-end server, so throughput to the good servers does not seem to be affected much.

 

 

 

5. The Video - Testing the Solution 

 

We also ran some sample JMeter test scripts using this technique; the results, the JMeter scripts and the recorded videos are in a second post, at the following link:

 

TechTip: Testing Solution to Agent Gateway/SPS with one bad back-end 

 

Effectively, the results are that with the original, and most common, configuration, then as expected when one backend server goes down, the traffic to the other backend server is affected as well.  Below, blue shows successful transactions and pink shows errors.  We can see that both servers are affected:

 

However, with the settings we suggest above, when one backend server goes down, traffic to the other backend server is essentially unaffected:

 

6. Conclusion

Traffic and load are always different for different installations.  The best way to ensure your Agent Gateway is configured to handle both your normal traffic load and the load under exceptional circumstances is by testing.  With load testing it is best to also test to breaking point, to understand where and how the system breaks, and also how the system performs when components - such as the various back-end servers - fail.

 

The above offers a way to handle the common situation where one back-end server goes down. We recommend that you test the settings in your own QA environment, both under normal conditions and with failed back-end servers, before applying them to production.

The SMPolicyTraceAnalysis tool can be used to review trace logs from the policy server, the agents and a number of other components.  The link to the download is here:

Siteminder Policy Trace Analysis 

 

 

This techtip explains the purpose of two new(ish) graphs that are now generated by the tool.

 

1. Viewing the Data 

The best way to view the graphs (it is a bit confusing, since there are too many of them; fixing that is on the, long, todo list) is to use the Bookmarks or Index side tab. In the screenshot below you can see it is the blue selected item, and it gives you a list of the chapter headings.   You can also see the two new graph items as sections:

       1.12. AccumOpenTransTime_ProcessRequest 

       1.13. AvTransTimeOverTime_ProcessRequest

 

 

 

2. Purpose of the Graphs  

The generated pdf report prints a summary at the top with the average transaction time, and a distribution of transaction times for all of the transactions in the reviewed trace files.

 

But often it would be good to see what is happening to the transaction time over the course of the trace file, not just a summary at the end of the whole trace.   That way, if there is a holdup at a specific point in the trace, perhaps a failover of the LDAP user store or a spike in load, we can see what effect that had on the transaction times during the timesteps of the failover and how that compares to the background transaction time at other times.

 

So these graphs are an attempt to see what is happening along the timeline of the trace logs, rather than just a summary of the whole trace log. 

 

 

3. Section 1.13 AvTransTimeOverTime_ProcessRequest 

I've started with the last of the new graphs, as that order makes it easier to explain the Section 1.12 one, even though they appear in a different order in the pdf report.

 

The AvTransTimeOverTime graph shows the transaction time over the last 5 recorded steps.  The default is usually 1 sec graph increments, so this graph shows the average transaction time (blue) and max transaction time (red) for all transactions that finished in the last 5 sec.

 

It gives a reasonable running view of performance over the trace log.  But, interestingly, we need a transaction to be completed before we can plot it.  So if there is an immediate bad effect on performance, the affected transactions will take longer, but the visible effect on the graph will be delayed until those transactions complete, rather than showing as soon as the effect is felt.  (In this way it is a bit similar to the Unix uptime command, which shows running averages; here we have the average over the last 5 sec, but we could also be graphing averages over the last 10 sec and the last 30 sec as well.)

 

Another effect on performance is a change in load. If we suddenly have a spike, and now have 30 active worker threads rather than 15, then that is also not quite as visible in this graph (you can see it in the Concurrent Threads graph).  It should have some effect, but if all 30 threads are active and each takes just as long as before, then there won't be any clue in this graph that we have doubled the load.

 

 

 

Nevertheless, graphing the running av/max transaction time over the last 5 sec is still useful, and it is also a fairly easy value to explain.  In the graph below, for example, we have an average response of about 5 sec, generally with a max response of 30 sec, and we can see a peak of 50 sec max response and a peak of ~15 sec average response.

 

 

 

 

4. Section 1.12 AccumOpenTransTime_ProcessRequest

As explained in the previous section, there are limitations to what you can see when graphing a running average, because we need completed transactions. This graph is an attempt to get a bit better insight into the workings of those incomplete transactions.

 

What we are graphing here is the total accumulated time for all in-progress transactions. That way we can see if the policy server starts going slow, as that will result in both the existing open transactions taking longer as they are held open, and possibly also an increase in in-progress transactions as new worker threads are started to absorb the continued requests coming into the queue.

 

So in the graph below, we can see we are holding about 450 sec worth of open transactions. Since I peeked at the Parallel/Concurrent Requests graph, I know there are 30 active threads, so each thread has been running for 450/30 = 15 sec on average.   We can see that at 8:30am the accumulated load dropped to 50 sec; again, since I looked at the threads running at the time, I know it dropped to 5 active threads, so on average each thread had been open for 50/5 = ~10 sec.   It's not perfect, but it does give a bit of insight into how the in-progress transactions are going, and should give a more immediate indication when something adverse happens to the policy server.
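In other words, the rough rule of thumb used above is:

    average age of an open transaction ≈ accumulated open transaction time / active threads
    e.g. 450 sec / 30 threads ≈ 15 sec, and 50 sec / 5 threads ≈ 10 sec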

 

 

 

Cheers - Mark

This is fairly widely known, but I thought I should document it, since we often get trace log files with default settings, and that isn't enough to assess performance or throughput.

 

The default SSO logging (generally) does not include [PreciseTime].   If you want better graphs, or better resolution in the SSO products' trace logs, then you should add [PreciseTime], [SrcFile] and [Function] to your SSO trace log configuration.


Graphs & Stats with default [Time] (1sec) resolution 

The default for most SSO products is still to use [Date] and [Time], so often with a first cut at trace analysis we get a graph that looks like this : 

 

Generally most transactions take less than a second, so the graph, and also any estimate of average transaction time, is pretty useless.   Even the transactions that took "2 sec" probably just happened to span a timeframe such as 02:24.994 to 02:25.001; but all we see is 2:24 to 2:25, so we can only (badly) guess that the transaction must have taken between 1 and 2 sec.

 

Graphs & Stats with [PreciseTime] (millisec) resolution 

If we add [PreciseTime] to the trace log then we get a better estimate of the average transaction time, and a better understanding of the distribution of request transaction times, for example:

We can see two spikes of transaction times, and a small number that took longer.   Not shown, but we also include a summary of min, max and average transaction times, which then also have much more accurate values.

 

Additional Graphs for [PreciseTime] (millisec) resolution 

Also, if we add [PreciseTime], [SrcFile] and [Function] to the trace log, then we can tell from the time gaps between each trace line "where" the thread spent its time in processing.

 

So we can produce a pdf report with a graph like this :

This identifies which trace log steps (effectively thread state transitions) are taking the most time. That is a first step for identifying bottlenecks.

 

Additionally, given that [SrcFile] (roughly) identifies the state of processing of the request, we can also show how many worker threads are in the same state at the same time.  For example:

This helps identify delays and bottlenecks in processing.  If we want to improve throughput, we review this graph, pick the major bottleneck and (hopefully) find a configuration change to allow it to have greater throughput.
(A common example of a policy server bottleneck and its resolution: if we have a single LDAPBank, we often find in this graph that all worker threads are queued up in a state waiting to get access to the one LDAPBank.  Adding more LDAPBanks allows more threads to proceed in parallel, which decreases the number of threads in this [SrcFile] "wait" state and increases the overall throughput of the server.)

 

How to add PreciseTime, SrcFile and Function to the trace logging. 

The data items for SSO log file configuration are fairly simple.  There will be a configuration file, e.g.:

Server : Trace config file location
Policy Server : /opt/ca/siteminder/config/smtracedefault.txt
Web Agent : C:\Program Files\CA\webagent\win32\config\WebAgentTrace.conf
Access Gateway (formerly Secure Proxy Server) : /opt/ca/secure-proxy/proxy-engine/conf/SecureProxyTrace.conf

 

 

The last two lines of the .conf file are the important ones; they will look something like:
components: AgentFramework, HTTPAgent, WebAgent, AgentFunc, Agent_Con_Manager
data: Date, Time, PreciseTime, Pid, Tid, SrcFile, Function, TransactionID, IPAddr, IPPort, AgentName, Resource, User, Message

(note that the data: line will probably auto-wrap on your screen - it will be one line in the file)

We want to make sure that the data: line contains PreciseTime, SrcFile and Function, e.g.:

data: Date, Time, PreciseTime, Pid, Tid, SrcFile, Function, TransactionID, IPAddr, IPPort, AgentName, Resource, User, Message

 

 

And that is all you need to do; the next time you start trace logging, it will have the new values.

 

SMTraceAnalysisTool  - where do you get it?  

The policy trace analysis tool is available from the CA communities site here: 

 

Siteminder Policy Trace Analysis 

 

It has settings to parse the following log types : 

  • Policy Server Trace Log
  • Policy Server smps (stats) Log
  • SM Agent trace log
  • SPS Agent trace log
  • FWS Agent trace log
  • SM SDK trace log
  • ASA WebSphere agent trace log
  • ASA WebLogic agent trace log
  • WWSI trace log
  • CA Directory Query trace log

 

Hope that helps with your transaction analysis. 

 

Cheers - Mark

PS: I believe there is also an enhancement request to add these fields to the default trace logs; I will hunt up the link for that as well.

Just adding this as a blog post, since the original doc was missing from some users' feeds.

 

After some internal review of the SMPolicyReader, it was suggested that a good enhancement would be the ability to create and edit an xcart selection, which would make a nice visual method for creating the xcart items from a -xb backup or other export.

 

I've put together an article on this :

 

How to generate an xcart with SMPolicyReader is here:

Using SMPolicyReader to generate xcart selection. 

 

And the latest SMPolicyReader can be downloaded from :

Siteminder Policy Reader 

 

Cheers - Mark O'Donohue