Symantec Access Management

Does anyone know how dynamic load balancing works?

  • 1.  Does anyone know how dynamic load balancing works?

    Posted Aug 23, 2017 02:13 PM

    We have 2 data centers a few hundred miles apart, each with 2 policy servers (12.52 SP1) with similar hardware configurations (2 CPUs/8 cores/24 GB of RAM...):

     

    Data Center A : PS1A and PS2A


    Data Center B : PS1B and PS2B

     

     

    For this example, assume applications are currently only running in Data Center A and say there are 40 CA SSO  agents installed on servers in that data center.

     

     

    What I have noticed is that if we set up a traditional HCO with all 4 Policy Servers in it (failover disabled), 90+% of the traffic (determined by the number of records in the smaccess log files) invariably goes to one of the policy servers. This became apparent when application teams started reporting performance problems with CA SSO. When I looked into it, I noticed that the lion's share of the traffic was going to, say, PS1A, and that server was having threading issues trying to handle the load.

     


    I will say that the small amount of traffic that went to the other policy servers was fairly evenly balanced, so the Agents seem to think they need to send 90+% of their requests to a "chosen" policy server and balance out the remaining requests, which doesn't make much sense to me.

     

     

    I had thought that rebooting PS1A would balance the traffic out again, but then the vast majority of the traffic went to PS1B (it also surprised me that PS2A wasn't the "chosen" one).

     

     

    It should also be noted that when configuring a cluster HCO with one cluster containing PS1A and PS2A and the other containing PS1B and PS2B, whichever cluster was "active", again only one policy server was handling 90+% of the traffic.

     

     

    What exactly do Agents (12.52) look at when deciding which Policy Server to send requests to, and how often do they reassess that choice?

     

     

    Thanks.



  • 2.  Re: Does anyone know how dynamic load balancing works?

    Posted Aug 23, 2017 02:42 PM

    Hi Alan,


    Please refer to this :

    https://communities.ca.com/docs/DOC-231163972


    It looks to me like the agents are not properly calibrating the best policy server in your case.

    What is the exact version of the web agent, including the CR?

    I can check if there are any known issues.


    Regards,

    Ujwol



  • 3.  Re: Does anyone know how dynamic load balancing works?

    Posted Aug 23, 2017 07:17 PM

    Hi Ujwol,

     

    Thanks for responding!

     

    From looking at the Agent Instances screen, I see mostly 12.52.0000.142 with a few 12.52.0100.499 and 12.52.0104.2032. Will that work for you regarding the versions?

     

    You mentioned Agents “calibrating”… is there any documentation that describes what algorithm they are using?

     

    Regarding the link you provided (and thanks for that)… I did see that and saw “…distributes the load dynamically depending on the response time of the policy servers”. I was trying to understand what sort of response times the algorithm uses (ping times, request response times, etc.).

     

    Thanks,

     

    Alan



  • 4.  Re: Does anyone know how dynamic load balancing works?

    Posted Aug 23, 2017 08:02 PM

    For the best server calculation, it uses average request response times.

     

    Here is what I could find related to this best server algorithm (from a very old technote). There may have been a few variations in recent agent versions.

     

    Best Server Algorithm

    The algorithm maintains a distribution table with an entry for each policy server. Each entry contains 2 variables: capacity and throughput. The web agent (WA) compares the current capacity of all the policy servers (PS), and the one with the highest value is considered the best. Once the best server has been chosen to serve the current request from the WA, the throughput of that PS is subtracted from its current capacity. In other words, after serving the request, the capacity of the PS is its previous capacity minus its current throughput.

     

    Before we look at the algorithm, let us first understand the recalibration logic, which the best server algorithm relies on.

     

    Recalibration Logic

    “The recalibration logic first resets the distribution variables of the policy servers to 0. Then it calculates the throughput for all the active policy servers. The distribution variables for inactive policy servers remain at 0.

    The throughput is calculated as inversely proportional to the response time of the policy server. The web agent also calculates the total throughput, i.e. the sum of the throughputs of all the policy servers. The final throughput of each policy server is computed as the ratio between the total throughput and that server's throughput (calculated above).

    Next, the web agent sets the capacity of all the active policy servers to 100. (The capacity of inactive servers remain at 0.)”
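
    The recalibration step quoted above can be sketched roughly as follows. This is only one reading of the technote, not the agent's actual code: the function name, the dictionary layout, and the use of None for an inactive server are assumptions made for illustration, and "ratio between the total throughput and the server throughput" is interpreted here as total divided by the server's own value.

```python
def recalibrate(response_times_ms):
    """Build a fresh distribution table from average response times.

    response_times_ms maps a server name to its average response time in
    milliseconds, or to None when the server is inactive (assumed convention).
    """
    # Step 1: reset every distribution variable to 0.
    table = {name: {"capacity": 0.0, "throughput": 0.0}
             for name in response_times_ms}

    # Step 2: raw throughput is the inverse of the response time (active only).
    raw = {name: 1.0 / rt for name, rt in response_times_ms.items() if rt}
    total = sum(raw.values())

    # Step 3: final throughput = total / server throughput; capacity = 100.
    for name, value in raw.items():
        table[name]["throughput"] = total / value
        table[name]["capacity"] = 100.0
    return table


if __name__ == "__main__":
    for name, row in recalibrate({"PS1": 10.0, "PS2": 20.0, "PS3": None}).items():
        print(name, row)
```

    Under this reading, a faster server ends up with a smaller throughput value, so its capacity is used up more slowly and it is selected more often before the next recalibration.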

     

    Now, here is how the best server algorithm works:

    Initially, the distribution variables of the policy servers are reset to 0. The request processing code (called when the web agent receives a request) invokes the best server algorithm to select the server that will serve the current request.

    In the best server algorithm, the web agent iterates through the list of all the policy servers (in a cluster). If the capacity of an inactive policy server is found to be greater than 0, the recalibration logic is invoked. This resets the capacity of the inactive PS to 0 and sets the capacity of the other, active PS to 100.

    The PS with the highest capacity is considered the best policy server. The web agent initially sets the first policy server as the best server; if a subsequent PS in the list has a greater capacity, it becomes the best, and so on until the end of the list.

    Finally, if the capacity of the best PS is <= 0, the recalibration logic is invoked. If the capacity of the best server is still <= 0 after that, an error is returned. Otherwise, the current request is served by this best server and its distribution values are adjusted: the throughput is subtracted from the capacity, as mentioned above.

     

    The above algorithm offers the following advantages:

    •       Since response time is taken into account for load balancing, at any point in time a request is served by the best available policy server.
    •       Since the throughput of a PS is subtracted from its capacity after it serves a request, requests are distributed in a round-robin fashion (the PS with the highest capacity is considered the best).

     

    Apart from the above algorithm, an independent management thread invokes the recalibration logic whenever a policy server becomes active (from an inactive state). This ensures that the capacity of the newly active server is set to 100 and that it also starts serving requests.

     

    Illustration

    Suppose there are 3 policy servers in an active state: PS1, PS2 and PS3.

    As per the above algorithm, requests are distributed between these 3 policy servers depending on their capacity. If PS1 goes down, the recalibration logic sets its capacity to 0 and subsequent requests are not forwarded to it. Those requests are distributed between the remaining 2 PS, PS2 and PS3 (again depending on their capacity). If PS1 is started again, the management thread calls the recalibration logic to set its capacity to 100 so that this newly active server also starts receiving requests.
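
    As a rough, self-contained simulation of the selection loop and the three-server illustration above: everything here is an assumption based on the technote text, not the agent's real implementation. In particular, the "rt" field (average response time, None when the server is down), the function names, and the request counts are hypothetical.

```python
from collections import Counter

FULL_CAPACITY = 100.0


def recalibrate(servers):
    """Reset the table: active servers get capacity 100, inactive ones 0."""
    raw = {n: 1.0 / s["rt"] for n, s in servers.items() if s["rt"]}
    total = sum(raw.values())
    for name, s in servers.items():
        active = name in raw
        s["throughput"] = total / raw[name] if active else 0.0
        s["capacity"] = FULL_CAPACITY if active else 0.0


def highest_capacity(servers):
    # max() keeps the first server on ties, which mirrors "start with the
    # first PS and replace it only if a later one has greater capacity".
    return max(servers, key=lambda n: servers[n]["capacity"])


def get_best_server(servers):
    """Pick the server with the highest capacity and charge it one request."""
    # An inactive server that still holds capacity forces a recalibration.
    if any(s["rt"] is None and s["capacity"] > 0 for s in servers.values()):
        recalibrate(servers)
    best = highest_capacity(servers)
    if servers[best]["capacity"] <= 0:
        recalibrate(servers)
        best = highest_capacity(servers)
        if servers[best]["capacity"] <= 0:
            raise RuntimeError("no active policy server")
    servers[best]["capacity"] -= servers[best]["throughput"]
    return best


def run(servers, requests):
    """Send a number of requests and count how many each server received."""
    return Counter(get_best_server(servers) for _ in range(requests))


def simulate():
    # All distribution variables start at 0, as in the description.
    servers = {n: {"rt": 10.0, "capacity": 0.0, "throughput": 0.0}
               for n in ("PS1", "PS2", "PS3")}
    all_up = run(servers, 300)
    servers["PS1"]["rt"] = None   # PS1 goes down
    ps1_down = run(servers, 200)
    servers["PS1"]["rt"] = 10.0   # PS1 comes back
    recalibrate(servers)          # stands in for the management thread
    ps1_back = run(servers, 300)
    return all_up, ps1_down, ps1_back


if __name__ == "__main__":
    for label, counts in zip(("all up", "PS1 down", "PS1 back"), simulate()):
        print(label, dict(counts))
```

    With equal response times, this sketch spreads requests evenly across the active servers in each phase, and PS1 receives nothing while it is marked down. A heavily skewed distribution like the one described at the top of the thread would, in this model, require very different measured response times or a recalibration that is not happening.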



  • 5.  Re: Does anyone know how dynamic load balancing works?

    Posted Aug 23, 2017 09:05 PM

    So if I understand correctly, there are two potential avenues which could cause irregular agent traffic balance to policy servers :

     

     

      1.  If the “capacity” value is not being updated correctly.

      2.  If recalibration is not being invoked.

     

    Does that make sense?

     

    I’ll see if I can get some application teams to add Agent_Con_Manager to their Agent tracing, but which parts of the Agent_Con_Manager output would help show either of the above? Moreover, if we do find an Agent issue, how could something like that be fixed (or would that require a new Agent release)?

     

    Thanks for your time!



  • 6.  Re: Does anyone know how dynamic load balancing works?

    Posted Aug 24, 2017 12:09 AM

    Hi Alan,

     

    I looked at our database of known issues in 12.52, and I don't see any issue related to load balancing.

    If there is an issue identified in our code, the fix could be delivered as a dev fix until it is available in the next CR.

     

    Yes, with Agent_Con_Manager enabled, you can see the capacity and throughput for each server as determined by the agent.

    See the sample below:

     

    [08/15/2017][06:41:20.911][31404][473929472][][][][Server XX.XX.***: Current total capacity:  91, current throughput:   1

    [08/15/2017][06:41:21.768][31404][623560448][][][][Selected server  XX.XX.***:: Current total capacity:  88, current throughput:   1][SmClient.cpp:2977][GetServer][][][][][][][][][][][][][][][][]

     

    Dumping DistrTable][SmClient.cpp:2923][GetBestServerIndex][][][][][][][][][][][][][][][][]
    [08/15/2017][06:40:45.913][31403][634050304][][][][Server XX.XX.***: Current total capacity: 13, current throughput: 1
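
    To compare capacity values across many trace lines, a small parser along these lines may help. It is a sketch that assumes the trace lines look exactly like the sample above (only that sample was used to build the pattern, so real traces may need adjustments); the function names are hypothetical.

```python
import re
from collections import Counter

# Pattern derived only from the sample lines in this thread.
TRACE_RE = re.compile(
    r"\[(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\].*?"
    r"\[(?P<kind>Selected server|Server)\s+(?P<server>\S+?):+\s*"
    r"Current total capacity:\s*(?P<capacity>-?\d+),\s*"
    r"current throughput:\s*(?P<throughput>-?\d+)"
)


def parse_trace(lines):
    """Extract server, capacity and throughput from matching trace lines."""
    entries = []
    for line in lines:
        m = TRACE_RE.search(line)
        if m:
            entries.append({
                "time": m["time"],
                "selected": m["kind"] == "Selected server",
                "server": m["server"],
                "capacity": int(m["capacity"]),
                "throughput": int(m["throughput"]),
            })
    return entries


def selections_per_server(entries):
    """Count how often each server appears on a 'Selected server' line."""
    return Counter(e["server"] for e in entries if e["selected"])


# The three sample lines quoted in this post.
SAMPLE = [
    "[08/15/2017][06:41:20.911][31404][473929472][][][][Server XX.XX.***: "
    "Current total capacity:  91, current throughput:   1",
    "[08/15/2017][06:41:21.768][31404][623560448][][][][Selected server  XX.XX.***:: "
    "Current total capacity:  88, current throughput:   1][SmClient.cpp:2977][GetServer]",
    "[08/15/2017][06:40:45.913][31403][634050304][][][][Server XX.XX.***: "
    "Current total capacity: 13, current throughput: 1",
]

if __name__ == "__main__":
    for entry in parse_trace(SAMPLE):
        print(entry)
    print(selections_per_server(parse_trace(SAMPLE)))
```

    Running the selection count over a full trace log would show whether one policy server is being picked far more often than the others, and the per-line capacity values would show whether capacity is actually decreasing after each selection.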



  • 7.  Re: Does anyone know how dynamic load balancing works?

    Posted Aug 24, 2017 11:32 AM

    Hi Ujwol,

     

    Thanks for the example! That helps. Let me see if I understand what I’m reading… I marked the lines below :

     

      1.  --> [06:41:20.911][31404][473929472][][][][Server XX.XX.***: Current total capacity:  91, current throughput:   1

      2.  --> [06:41:21.768][31404][623560448][][][][Selected server  XX.XX.***:: Current total capacity:  88, current throughput:   1][SmClient.cpp:2977][GetServer][][][][][][][][][][][][][][][][]

     

    Dumping DistrTable][SmClient.cpp:2923][GetBestServerIndex][][][][][][][][][][][][][][][][]

    (3) --> [06:40:45.913][31403][634050304][][][][Server XX.XX.***: Current total capacity: 13, current throughput: 1

     

    Let’s assume line (1) and (2) represent 2 different policy servers and the agent needs to determine which PS to connect to. PS(2) is the current policy server that the agent is connected to (I’m assuming that’s what “Selected server” means, is that correct?). Given the above capacity numbers for PS(1) and PS(2), the agent should select PS(1) and update PS(1)’s Total Capacity from 91 to 90. Is that correct?

     

    For line (3)… Is the Dumping DistrTable/GetBestServerIndex function supposed to list the capacity/throughput of all the policy servers, or just the policy server that is considered the best one to connect to (presumably the one the agent is currently connected to)?

     

    Appreciate all the explanations!



  • 8.  Re: Does anyone know how dynamic load balancing works?

    Posted Aug 27, 2017 09:57 PM

    Please find my answers below :

     

    • Given the above capacity numbers for PS(1) and PS(2), the agent should select PS(1) and update PS(1)’s Total Capacity from 91 to 90. Is that correct?

    Ujwol => Yes, it will select the PS with the highest capacity, and the resulting capacity of the selected PS is its previous capacity minus its current throughput (91 - 1 = 90 as per the previous stats).

    • Is the Dumping DistrTable/GetBestServerIndex function supposed to list the capacity/throughput of all the policy servers, or just the policy server that is considered the best one to connect to (presumably the one the agent is currently connected to)?

    Ujwol => It is supposed to list the stats of all the policy servers in the cluster.



  • 9.  Re: Does anyone know how dynamic load balancing works?

    Posted Aug 28, 2017 12:44 PM

    That is helpful. Thanks. I will wait for some of our application teams to collect a few days of agent logs and then check them. Thanks again.



  • 10.  Re: Does anyone know how dynamic load balancing works?

    Posted Aug 28, 2017 12:51 PM

    No worries. If you can, I would also suggest collecting TLI logs from the Policy Server:


    Enable Transport Layer Interface (TLI) Logging

    When you want to examine the connections between the agent and the Policy Server, enable transport layer interface logging.

    To enable TLI logging

    1. Add the following environment variable to your web server.

      SM_TLI_LOG_FILE
    2. Specify a directory and log file name for the value of the variable, as shown in the following example:

      directory_name/log_file_name.log
    3. Verify that your agent is enabled.
    4. Restart your web server.
      TLI logging is enabled.


  • 11.  Re: Does anyone know how dynamic load balancing works?

    Posted Aug 29, 2017 07:12 PM

    Hello,

     

    So TLI logging is also available on the policy server and not just on the agent server? If so, are the instructions the same as the ones below?

     

    Thanks,

     

    Alan



  • 12.  Re: Does anyone know how dynamic load balancing works?

    Posted Aug 30, 2017 01:52 AM

    TLI log is only on the web server/web agent.



  • 13.  Re: Does anyone know how dynamic load balancing works?

    Posted Aug 23, 2017 08:06 PM

    If you add the Agent_Con_Manager component to WebAgentTrace.conf, some of the Policy Server statistics used for calculating the best server are printed in the agent trace log. That may give you some idea as to why the other servers are not getting printed.