Symantec Access Management

Socket error 104

  • 1.  Socket error 104

    Posted Feb 01, 2013 03:09 PM
    Applications are failing intermittently while the Policy Server (PS) logs the following:
    [3098/150862736][Fri Feb 01 2013 13:15:32][CServer.cpp:1916][ERROR] Failed to send response on session # 1148 : agentname/::ffff:10.152.144.241:59387. Socket error 104
    [3098/2778049424][Fri Feb 01 2013 13:16:30][CServer.cpp:1916][ERROR] Failed to send response on session # 614 : agentname/::ffff:10.152.10.137:55669. Socket error 104
    [3098/2778049424][Fri Feb 01 2013 13:16:30][CServer.cpp:1916][ERROR] Failed to send response on session # 1253 : agentname/::ffff:10.152.10.137:51723. Socket error 104
    [3098/3086298000][Fri Feb 01 2013 13:16:32][CServer.cpp:1480][INFO] Closing Idle connection for session # 934
    [3098/3086298000][Fri Feb 01 2013 13:16:32][CServer.cpp:1480][INFO] Closing Idle connection for session # 931
    [3098/3086298000][Fri Feb 01 2013 13:16:32][CServer.cpp:1480][INFO] Closing Idle connection for session # 928
    [3098/3086298000][Fri Feb 01 2013 13:16:32][CServer.cpp:1480][INFO] Closing Idle connection for session # 920
    [3098/3086298000][Fri Feb 01 2013 13:16:32][CServer.cpp:1480][INFO] Closing Idle connection for session # 919
    [3098/3086298000][Fri Feb 01 2013 13:16:32][CServer.cpp:1480][INFO] Closing Idle connection for session # 825
    [3098/2725600144][Fri Feb 01 2013 13:16:32][CServer.cpp:1916][ERROR] Failed to send response on session # 1256 : agentname/::ffff:10.152.10.137:51724. Socket error 104
    [3098/150862736][Fri Feb 01 2013 13:16:32][CServer.cpp:1916][ERROR] Failed to send response on session # 1259 : agentname/::ffff:10.152.10.137:51725. Socket error 104
    [3098/150862736][Fri Feb 01 2013 13:16:32][CServer.cpp:1916][ERROR] Failed to send response on session # 1260 : agentname/::ffff:10.152.10.137:51729. Socket error 104
    [3098/2715110288][Fri Feb 01 2013 13:16:33][CServer.cpp:1916][ERROR] Failed to send response on session # 1261 : agentname/::ffff:10.152.10.137:51731. Socket error 104
    [3098/2778049424][Fri Feb 01 2013 13:16:33][CServer.cpp:1916][ERROR] Failed to send response on session # 12 : agentname/::ffff:10.152.10.137:47095. Socket error 104
    [3098/2725600144][Fri Feb 01 2013 13:16:33][CServer.cpp:1916][ERROR] Failed to send response on session # 1265 : agentname/::ffff:10.152.10.137:51733. Socket error 104
    [3098/2683640720][Fri Feb 01 2013 13:16:34][CServer.cpp:1916][ERROR] Failed to send response on session # 1267 : agentname/::ffff:10.152.10.137:51734. Socket error 104

    Agent.log says:

    [3881/153339648][Fri Feb 01 2013 13:10:55][CSmLowLevelAgent.cpp:1086][ERROR] LLA: SiteMinder Agent Api function failed - 'Sm_AgentApi_LoginEx' returned '-2'.
    [3881/153339648][Fri Feb 01 2013 13:10:55][CSmAuthenticationManager.cpp:181][ERROR] HLA: Component reported fatal error: 'Low Level Agent'.
    [3881/153339648][Fri Feb 01 2013 13:10:55][CSmHighLevelAgent.cpp:540][ERROR] HLA: Component reported fatal error: 'Authentication Manager'.
    [4058/4204726016][Fri Feb 01 2013 13:10:56][CSmHttpPlugin.cpp:329][ERROR] Unable to resolve server host name. Exiting with HTTP 500 server error '10-0004'.
    [4058/4204726016][Fri Feb 01 2013 13:10:56][CSmResourceManager.cpp:155][WARNING] HLA: Missing resource data.
    [3881/4204726016][Fri Feb 01 2013 13:10:57][CSmLowLevelAgent.cpp:492][ERROR] LLA: SiteMinder Agent Api function failed - 'Sm_AgentApi_IsProtectedEx' returned '-1'.
    [3881/4204726016][Fri Feb 01 2013 13:10:57][CSmProtectionManager.cpp:192][ERROR] HLA: Component reported fatal error: 'Low Level Agent'.
    [3881/4204726016][Fri Feb 01 2013 13:10:57][CSmHighLevelAgent.cpp:376][ERROR] HLA: Component reported fatal error: 'Protection Manager'.
    [3881/67106560][Fri Feb 01 2013 13:10:57][CSmLowLevelAgent.cpp:492][ERROR] LLA: SiteMinder Agent Api function failed - 'Sm_AgentApi_IsProtectedEx' returned '-1'.
    [3881/67106560][Fri Feb 01 2013 13:10:57][CSmProtectionManager.cpp:192][ERROR] HLA: Component reported fatal error: 'Low Level Agent'.
    [3881/67106560][Fri Feb 01 2013 13:10:57][CSmHighLevelAgent.cpp:376][ERROR] HLA: Component reported fatal error: 'Protection Manager'.


  • 2.  RE: Socket error 104

    Broadcom Employee
    Posted Feb 04, 2013 10:22 AM
    Good morning SamWalker,

    Socket error 104 means the following:
    Technically, socket error 104 on Linux is the errno value ECONNRESET, which is "Connection reset by peer".
    The smtest tool or some network device in the middle sent a RST packet in the same TCP stream before the handshake was complete.
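
    To see whether the resets are concentrated on one agent host or spread across many, the smps.log lines quoted earlier in this thread can be tallied per client address. The following is a minimal sketch in Python, not a SiteMinder tool; the function name and the regular expression are assumptions derived only from the log lines shown above.

```python
import re
from collections import Counter

# Pattern assumed from the smps.log lines quoted in this thread, for example:
# ... Failed to send response on session # 1148 :
#     agentname/::ffff:10.152.144.241:59387. Socket error 104
SEND_FAILURE = re.compile(
    r"Failed to send response on session # (?P<session>\d+) : "
    r"(?P<agent>[^/]+)/(?P<addr>\S+):(?P<port>\d+)\. Socket error (?P<code>\d+)"
)


def count_send_failures(lines):
    """Return a Counter keyed by (client address, socket error code)."""
    counts = Counter()
    for line in lines:
        match = SEND_FAILURE.search(line)
        if match:
            counts[(match.group("addr"), int(match.group("code")))] += 1
    return counts
```

    Calling count_send_failures(open("smps.log")) (the file name is only an example) returns the counts; a single dominant address would point at one host or network path, while an even spread would point closer to the Policy Server side.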

    The -1 and -2 errors in the web agent log are also communication errors between the policy server and the web agent.
    Sm_AgentApi_IsProtectedEx is an Agent API function call. This function
    can return a -1 or -2 error code:

    -1 means that it has failed to communicate with the server
    -2 means timeout
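
    As a rough aid for reading agent.log, the sketch below counts Agent API failures by function and return code. It is an illustration only: the regular expression is an assumption based on the agent.log lines quoted in this thread, and the meanings attached to -1 and -2 are the ones given in this post, not taken from the SDK documentation.

```python
import re
from collections import Counter

# Meanings as described in this post (assumption: they apply to every
# Sm_AgentApi_* function that appears in agent.log).
RETURN_CODE_MEANING = {
    -1: "failed to communicate with the Policy Server",
    -2: "timeout",
}

# Matches e.g.: 'Sm_AgentApi_LoginEx' returned '-2'.
API_FAILURE = re.compile(r"'(?P<func>Sm_AgentApi_\w+)' returned '(?P<code>-?\d+)'")


def summarize_api_failures(lines):
    """Count Agent API failures as (function name, return code) pairs."""
    counts = Counter()
    for line in lines:
        match = API_FAILURE.search(line)
        if match:
            counts[(match.group("func"), int(match.group("code")))] += 1
    return counts


def describe(code):
    """Translate a return code into the meaning given in this post."""
    return RETURN_CODE_MEANING.get(code, "unknown")
```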

    Agents connect to the Policy Server using a three-way handshake. This error
    will occur after the agent sends its initial hello message to the policy
    server.
    The policy server does not respond back with its hello message because the
    agent closed the connection.

    This condition could be due to a policy server side issue (starvation)
    where policy server lacks the resources to respond fast enough to the web
    agent.

    Please try increasing the AgentWaitTime parameter in WebAgent.conf:
    AgentWaitTime
    Specifies the number of seconds that the Web Agent waits for the Low-level
    Agent Worker process (LLAWP) to become available. When the interval expires
    the Web Agent tries to connect to the Policy Server.



    In order to help look into this further we will need additional information:
    1) Web agent version and CR
    2) Web server Type and version
    3) Web server OS type and version
    4) Policy server version and CR
    5) Policy server OS type and version
    6) Is there any type of router, firewall, or load balancer between the web agents and policy servers?

    Once we have some more details we might be able to point you in the right direction.

    Hope this helps.

    Gene


  • 3.  RE: Socket error 104

    Posted Feb 05, 2013 03:16 PM
    Hi Gene, Thanks for the response.

    Policy Server:

    more ca-ps-version.info
    ProductName=CA SiteMinder Policy Server r12.0 SP2
    FullVersion=12.0.200.186
    Location=/opt/netegrity/siteminder

    RedHat Version

    Red Hat Enterprise Linux Server release 5.7 (Tikanga)

    I am not listing the web agents or web servers because I have several agents (R12 SP1, SP2, SP3) with different CMRs, on Linux and AIX, running on Apache 2.2 and IBM HTTP Server. I got this problem with ASAs running under WebSphere as well.

    There is a firewall between some agents and the Policy Server; other agents are on the same subnet as the Policy Server, so there is no firewall. The problem still occurred on all agents, although intermittently.
    No load balancer between agents and Policy Servers.

    Load is pretty minimal, and this started happening after hours, so I'm not sure it is actually starvation.

    Memory and CPU utilization are under 10%.

    I didn't check the stats when the error happened.

    I don't remember the max connections ever going over 250. Even if it goes over 800, which is over our config, I should not see the socket connection errors.

    Stats as of now:(pretty busy time of the day)
    [1904/2851638160][Tue Feb 05 2013 13:37:02][CServer.cpp:6912][INFO] Server 'Stats' command received
    [1904/2851638160][Tue Feb 05 2013 13:37:02][CServer.cpp:4184][INFO] ===================================================================================
    [1904/2851638160][Tue Feb 05 2013 13:37:02][CServer.cpp:4185][INFO] System Statistics
    [1904/2851638160][Tue Feb 05 2013 13:37:02][CServer.cpp:4191][INFO] Available file descriptors: 1024
    [1904/2851638160][Tue Feb 05 2013 13:37:02][CServer.cpp:4202][INFO] Thread pool limit: 8
    [1904/2851638160][Tue Feb 05 2013 13:37:02][CServer.cpp:4222][INFO] Thread pool: Msgs=1321239 Waits=1278223 Misses=10403 Max HP Msg= 17 Max NP Msg= 20 Current Depth= 0 Max Depth= 20 Current High Depth= 0 Current Norm Depth= 0 Current Threads= 8 Max Threads= 8
    [1904/2851638160][Tue Feb 05 2013 13:37:02][CServer.cpp:4230][INFO] Connections: Current=97 Max=162 Limit=800 Exceeded limit= 0


    Can you elaborate more on how file descriptors are used? What is recommended, and do I need to increase them?
    Can I increase the number of threads to 50? How is 'threads' tied to system resources?
    What are things to look for in the Policy Server when you think the system is under load?
    Also, how do I know my web agent version?

    For example:
    My agent.log says:
    [8013/-2108266752][Wed Jan 30 2013 16:29:09] SiteMinder APACHE 2.2 WebAgent, Version 12.0 QMR03, Update HF-05, Label 427.
    [8013/-2108266752][Wed Jan 30 2013 16:29:09] FileVersion: 12.0.0305.427.

    How do I know that this translates to R12 SP(?)QMR(?)CR(?)HF(?), without creating a ticket.


  • 4.  RE: Socket error 104

    Broadcom Employee
    Posted Feb 06, 2013 02:55 PM
    Good afternoon SamWalker,

    There are a lot of questions in your last update. I will do my best to address them all.
    Easy ones first!

    Question 1) How do I know my web agent version?
    Answer: The best way is to check the top of the web agent logs.

    Question 2) How do I know that this translates to R12 SP(?)QMR(?)CR(?)HF(?), without creating a ticket.

    Example:
    [8013/-2108266752][Wed Jan 30 2013 16:29:09] SiteMinder APACHE 2.2 Webagent, Version 12.0 QMR03, Update HF-05, Label 427.
    [8013/-2108266752][Wed Jan 30 2013 16:29:09] File Version: 12.0.0305.427.


    Answer: Version is the R, so in this case Version 12.0 is the same as R12.0
    QMR03 is the same as SP03
    Update HF-05 is the same as CR05
    Label 427 is the same as Build 427

    So in this case "Version 12.0 QMR03, Update HF-05, Label 427" is the same as R12SP3CR5 Build 427, or in your terms R12QMR3CR5 Build 427.
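
    The same translation can be sketched in a few lines of Python. This is only an illustration of the mapping described in this answer (Version to R, QMR to SP, HF to CR, Label to Build); the regular expression is an assumption based on the single banner line quoted above, and the handling of a non-zero minor version is a guess.

```python
import re

# Banner format assumed from the agent.log line quoted in this thread:
# "SiteMinder APACHE 2.2 WebAgent, Version 12.0 QMR03, Update HF-05, Label 427."
BANNER = re.compile(
    r"Version (?P<major>\d+)\.(?P<minor>\d+) QMR(?P<qmr>\d+), "
    r"Update HF-(?P<hf>\d+), Label (?P<label>\d+)"
)


def translate_banner(line):
    """Return e.g. 'R12SP3CR5 Build 427', or None if the line is not a banner."""
    match = BANNER.search(line)
    if match is None:
        return None
    major = int(match.group("major"))
    minor = int(match.group("minor"))
    # Assumption: a zero minor version is dropped (12.0 -> R12).
    release = "R{}".format(major) if minor == 0 else "R{}.{}".format(major, minor)
    return "{}SP{}CR{} Build {}".format(
        release,
        int(match.group("qmr")),
        int(match.group("hf")),
        int(match.group("label")),
    )
```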

    Question 3) Can you elaborate more on how file descriptors are used?
    Answer:
    For information on how the policy server uses File descriptors please see the following:
    https://support.ca.com/cadocs/0/CA%20SiteMinder%20r12%20SP3-ENU/Bookshelf_Files/HTML/idocs/index.htm?toc.htm?ps-install.html

    Installation and Upgrade Guides › Policy Server Installation Guide › Installing the Policy Server on UNIX Systems › How to Prepare for the Policy Server Installation › Modify the UNIX System Parameters

    Question 4) What is recommended, and do I need to increase it?
    Answer: You would need to work with your system administrator to tune these settings based on your needs, or engage CA Services to help with performance testing and tuning.

    Question 5) Can I increase the number of threads to 50?
    Answer:
    Please see the following Policy server guide for thread information
    Policy Server Guides › Policy Server Configuration Guide › Agents and Agent Groups › Trusted Hosts for Web Agents › Trusted Host Configuration Settings › TCP/IP Connections


    Question 6) How is 'threads' tied to system resources?
    Answer:
    Please see the following Policy server guide for How 'threads' are tied to system resources
    Policy Server Guides › Policy Server Configuration Guide › Agents and Agent Groups › Trusted Hosts for Web Agents › Trusted Host Configuration Settings › TCP/IP Connections

    Question 7) What are things to look for in the Policy Server when you think the system is under load?
    Answer:
    1) CPU load
    2) Memory usage
    3) If running the stats command every 5 minutes: Current High Depth, Current Norm Depth, Connections: Current, and Connections: Exceeded limit
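
    If the stats output is collected regularly, the counters named above can be pulled out of smps.log with a small script. The following Python sketch is an assumption-laden illustration: the section names and the key=value layout are taken only from the stats output posted earlier in this thread, and the function names are made up for the example.

```python
import re

# One "key= value" pair; keys may contain spaces (e.g. "Current High Depth").
PAIR = re.compile(r"([A-Za-z][A-Za-z ]*?)=\s*(\d+)")
# The two stats lines of interest, as they appear in the posted output.
SECTION = re.compile(r"\[INFO\] (Thread pool|Connections): (.*)$")


def parse_stats(lines):
    """Return {'Thread pool': {...}, 'Connections': {...}} with integer values."""
    stats = {}
    for line in lines:
        match = SECTION.search(line)
        if match:
            section, body = match.groups()
            stats[section] = {key: int(value) for key, value in PAIR.findall(body)}
    return stats


def load_indicators(stats):
    """Pick out the four counters listed in the answer above."""
    pool = stats.get("Thread pool", {})
    conns = stats.get("Connections", {})
    return {
        "Current High Depth": pool.get("Current High Depth"),
        "Current Norm Depth": pool.get("Current Norm Depth"),
        "Connections Current": conns.get("Current"),
        "Connections Exceeded limit": conns.get("Exceeded limit"),
    }
```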

    Looking at the stats command output you posted I do not see anything out of the ordinary.
    I would monitor this behavior and have your network team ready to do a network packet capture the next time it happens to see if you can narrow down where the communication problem is happening.

    Hope this helps. Sorry for the delay in the response.

    Gene


  • 5.  RE: Socket error 104

    Posted Feb 06, 2013 03:55 PM
    Thanks a lot Gene. Really appreciate your patience answering all the questions.

    We captured tcpdumps when the actual problem happened (they showed TCP RST) and uploaded them to support for further diagnosis. However, they came back saying they need TLI logs, not TCP dumps. Some of my agents run SP3 while my policy server runs SP2, which gives them an opportunity to say I am running an unsupported environment. However, there has been no response on why the SP2 agents also have the same problem.

    No appetite to fight anymore:

    I enabled TLI logs and all I got was the following:

    [Wed Feb 06 2013 10:14:42] [../../../SmAgentAPI.cpp:804] [7994--677411072]
    Starting initialization
    [Wed Feb 06 2013 10:21:49] [../../../SmAgentAPI.cpp:1569] [7994--351668256]
    Finished uninitializating

    I have a case open with support on this, but I am doing my own research in parallel.
    I am planning for a PS upgrade this weekend, hope Support is right about TLI being sufficient.


  • 6.  RE: Socket error 104

    Broadcom Employee
    Posted Feb 06, 2013 04:55 PM
    Hi SamWalker,

    Sorry to hear you are running into frustrating problems with this.

    I am not sure whether the TLI logs will help, based on your snippet and without all the details from the support case.

    The reason I suggested that you engage your network team is that the web agents are designed to make a connection to the policy server and leave it open.
    The fact that you are seeing a TCP RST means something is closing the connection for some reason.
    Usually with a full network trace, such as with Wireshark or something similar, you can see exactly what device and port the TCP RST is coming from, and you can then work from there.

    One thing to check is to see if this is happening every 6 hours or so.
    In R12 a new feature was added, different from R6, in which the policy server resets all agent connections every 6 hours.
    I am not sure exactly which version introduced this, but I do know that a registry setting was added in R12SP3CR11 that allows you to disable this policy server reset every 6 hours.

    You may want to go back through your logs and see whether there is a pattern to this.

    If you cannot find a pattern of this nature, then I would focus more on the network side and see what you can come up with from there.
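
    One way to look for such a pattern is to group the error timestamps from smps.log into bursts and compare the gaps between bursts with the 6-hour interval. The Python sketch below is only an illustration: the timestamp format is assumed from the smps.log lines quoted in this thread, and the 30-minute quiet gap and 10-minute tolerance are arbitrary values chosen for the example.

```python
import re
from datetime import datetime, timedelta

# Timestamp as it appears in the quoted logs: [Fri Feb 01 2013 13:15:32]
STAMP = re.compile(r"\[(\w{3} \w{3} \d{2} \d{4} \d{2}:\d{2}:\d{2})\]")


def error_times(lines, needle="Socket error 104"):
    """Collect sorted timestamps of all lines that contain the needle."""
    times = []
    for line in lines:
        if needle in line:
            match = STAMP.search(line)
            if match:
                times.append(datetime.strptime(match.group(1), "%a %b %d %Y %H:%M:%S"))
    return sorted(times)


def burst_starts(times, quiet=timedelta(minutes=30)):
    """Return the first timestamp of each burst; a burst ends after a quiet gap."""
    starts, previous = [], None
    for stamp in times:
        if previous is None or stamp - previous > quiet:
            starts.append(stamp)
        previous = stamp
    return starts


def looks_periodic(starts, period=timedelta(hours=6), tolerance=timedelta(minutes=10)):
    """True if every gap between consecutive bursts is close to the period."""
    gaps = [later - earlier for earlier, later in zip(starts, starts[1:])]
    return bool(gaps) and all(abs(gap - period) <= tolerance for gap in gaps)
```

    A False result does not rule out the 6-hour reset (a single burst gives no gaps to compare), so it is a quick filter rather than proof either way.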

    Hope this helps

    Gene


  • 7.  RE: Socket error 104

    Posted Feb 07, 2013 06:25 PM
    My network team is in the loop, but they need to know what they have to monitor.
    I have a full packet trace from Wireshark from when the error happened; we gathered it at support's request. But the case got rolled over to another person, who doesn't want that. They want the TLI log, and they also have strong reservations about running an unsupported combination, with the PS on SP2 and a few agents on SP3. I am upgrading my PS to SP3 tomorrow evening to get onto a supported version.
    As for the problem, it hasn't happened since last week, and that was the first (and only) time it has happened since 2010, when we migrated to R12. It does not happen every six hours.

    I believe the PS is the problem, because it dropped connections from multiple agents (ASA and web) on different subnets at the same time.

    I will ask you a few more questions; I hope you have the time to answer. As I mentioned earlier, I really appreciate your time and help.

    I have 3 PIDs showing in the agent trace. One is LLAWP, and the 2 others run under httpd doing the 'actual' work. The only thing LLAWP (5150) does is:

    [02/06/2013][14:33:17][5150][1140848384][][ProcessMessage][Open message received from client '5166.71276288']
    [Date][Time][Pid][Tid][TransactionID][Function][Message]
    [====][====][===][===][=============][========][=======]
    [02/06/2013][14:33:17][5166][71276288][][ConfigureMP][Tracing initialized.]
    [Date][Time][Pid][Tid][TransactionID][Function][Message]
    [====][====][===][===][=============][========][=======]
    [02/06/2013][14:33:17][5150][1140848384][][ProcessMessage][Delivering response to Manage query received from client '5166.71276288']

    What is this message from pid 5166 above? What is it asking from LLAWP?

    Do I have to have 3 persistent connections (sockets) open from this web agent to the Policy Server, one for each of these PIDs? Or do the 2 working threads depend on the PS communication of the LLAWP PID?

    I have a lot of idle connections closed in my smps.logs. For example:

    [11798/3023854480][Thu Feb 07 2013 01:11:09][CServer.cpp:1521][INFO] Closing Idle connection for session # 9009
    [11798/3023854480][Thu Feb 07 2013 01:11:09][CServer.cpp:1521][INFO] Closing Idle connection for session # 9007

    What are these? I thought they were from the WA, with the connection closed because of the idle timeout. Now that you have said the WA is designed not to close the connection, I am not sure what these connections are.

    Thanks for your help.


  • 8.  RE: Socket error 104

    Broadcom Employee
    Posted Feb 08, 2013 10:06 AM
    Good morning SamWalker,

    Again there are a lot of questions and thoughts in your post. I will try to address them all. Please let me know if I miss something by mistake.

    First a question for you: when this happened, was there an end user impact?
    The reason I am asking is that you seem to be spending a lot of cycles on this for something that per this thread happened one time since 2010.

    Everything you have stated would imply there was a problem at the network level that caused a communication issue for a short period of time and then resolved itself.

    If there was heavy user impact then by all means I can understand you wanting to determine the root cause. But if it only happened once, I am not sure you will be able to get that based on the information you have from that time.

    Support can usually help look at Wireshark traces if they are small enough. What you would be looking for is a trace between the policy server and only one or two of the agents that are having the problem.
    In that trace, you would be looking for the device that is issuing the TCP reset.

    Below is what the LLAWP process does and is responsible for:
    Low Level Agent Worker Process (LLAWP)

    The LLA Worker Process solves many multi-process issues for the Agent Framework. This independent process is created by the Low Level Agent during the initialization phase and serves the following tasks:


    Inter-process cache management.

    Centralized DoManagement

    Logging

    Health Monitoring & Diagnostics

    This architecture is required to serve web servers like IIS 6.0 that support multiple independent worker processes each loading the Web Agent separately without a common parent. Without a common administration service for handling logging and the other issues listed above, each agent instance would be forced to open its own log, report health monitoring and generally act in all ways as an independent web agent. This would add a lot of confusion within the SiteMinder system.

    By choosing to use a worker process to service these Low Level Agent facilities, we obscure the multi-process details from the Agent Framework and simplify the overall model. For example, a future enhancement might be to disable this process for pure single-process multi-threaded servers. No other code above the Low Level Agent would have to change for this enhancement.


    In order to give you details on the trace log snippet you provided, I would have to know the web agent version and CR, the type and version of the web server you are using (including bit level), and the OS type and version.
    But to your question: What is this message from pid 5166 above? What is it asking from LLAWP?
    [02/06/2013][14:33:17][5166][71276288][][ConfigureMP][Tracing initialized.]
    [Date][Time][Pid][Tid][TransactionID][Function][Message]

    This is normal when a new agent process starts up. Usually you see this more in Apache instances that have child processes.

    The number of connections to a Policy Server is controlled by the minsocketsperport and maxsocketsperport settings.

    Web Agents excluding Apache

    The minsocketsperport and maxsocketsperport settings determine the number of sockets that at least and at most are open from the Web Agent to the Policy Server. When the web server starts (with the Web Agent enabled), it will open a minsocketsperport number of sockets. If there are multiple agent identities configured in that web server’s agent configuration (in the webagent.conf or Web Agent Management Console), then a minsocketsperport number of sockets will open for each agent identity for 4.x Web Agents, or for each trusted host for 5.x Web Agents (see Calculations for more information).

    As load increases, the number of sockets will increase as well, up to maxsocketsperport for each agent identity / trusted host. If you have more load than can be handled by maxsocketsperport, then a certain number of overflow requests will be placed in a queue (of finite size depending on the version of SiteMinder – 4.5.1 SP1 and later has a limit of 300).

    Each request uses a socket, but not all requests open new sockets. If all sockets from the connection pool are in use, then additional sockets will be opened as needed, in steps of newsocketstep, until maxsocketsperport is reached. If maxsocketsperport is set to 20, this means that a maximum of 20 simultaneous requests per service can take place (e.g. 20 authentications). Sockets are not multiplexed, meaning that a socket is utilized until the response comes back from the Policy Server. Once a request is completed, the socket is placed into a connection pool so that it can be used again.
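
    The growth of the pool described above can be illustrated with a short Python sketch. It is a simplified model, not the agent's actual implementation: it assumes the pool starts at minsocketsperport, grows by newsocketstep whenever demand exceeds the open sockets, and stops at maxsocketsperport, with any remaining requests counted as overflow.

```python
def sockets_needed(concurrent, min_sockets, max_sockets, step):
    """Model pool growth; returns (open sockets, requests left waiting)."""
    if step <= 0:
        raise ValueError("step must be positive")
    open_sockets = min_sockets
    # Grow in steps of newsocketstep until demand is met or the cap is hit.
    while open_sockets < concurrent and open_sockets < max_sockets:
        open_sockets = min(open_sockets + step, max_sockets)
    overflow = max(0, concurrent - open_sockets)
    return open_sockets, overflow
```

    For example, with hypothetical settings of minsocketsperport 2 and newsocketstep 2, and the maxsocketsperport of 20 used in the text, 7 simultaneous requests give (8, 0), while 25 simultaneous requests give (20, 5), meaning 5 requests would wait in the overflow queue.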

    Connections should be considered persistent for all intents and purposes, meaning that once a socket is opened, it will not be closed. Exceptions include communication errors between the agent and Policy Server and the idling out of connections by the Policy Server. Socket(s) will be closed by the Policy Server if they are unused for the length of time specified by the TCP Idle Session Timeout for that service (specified in the Registry – see the Firewall Timeout Technote for more information).


    Apache Web Agents


    Unlike the other agents, Apache Web Agents do not use connection pooling. Apache is multi-processed and has a drastically different architecture from IIS and Netscape, which are multi-threaded. Apache spawns child processes to handle requests, and uses a configuration setting called MaxClients to determine the maximum number of child processes that it will fork to handle load. The number of child processes is managed by Apache settings in the httpd.conf file. Each child process has its own independent socket connection(s) to the Policy Server. When the Apache parent process forks a child, an initial connection is opened to each policy server for the default agent. For 4.x Web Agents, because each socket is specific to a particular agent identity, each additional agent identity defined requires its own socket per child. A socket is opened the first time the child handles a request for an agent identity. This means that each child can have as many open sockets as there are agent identities (including the default agent identity). As load is evenly distributed among the children by the Apache parent process, the total number of sockets opened from an Apache server at maximum will equal MaxClients times the number of agent identities. For 5.x Web Agents, the total number of sockets opened from an Apache server at maximum will equal MaxClients times the number of trusted hosts.

    This connection model may have major implications for the Web Agent to Policy Server ratio (depending on the version of the Policy server being used), as the limiting factor often becomes connections between the agent and Policy Server rather than the number of transactions per second. Before deploying SiteMinder on Apache, it is very important to ensure that the Policy Server is able to handle the maximum number of connections that may be opened by all Web Agents that connect to that Policy Server. Please see the Calculations section below for additional information.
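
    The sizing check described above comes down to simple arithmetic, sketched here in Python. The formula (MaxClients times the number of agent identities or trusted hosts) is the one stated in the text; the server list and MaxClients values in the usage note are hypothetical, and the limit of 800 is the connection limit shown in the stats output earlier in this thread.

```python
def apache_max_sockets(max_clients, identities):
    """Worst-case sockets from one Apache server to a Policy Server.

    identities is the number of agent identities (4.x agents) or
    trusted hosts (5.x agents) configured on that web server.
    """
    return max_clients * identities


def total_against_limit(servers, connection_limit):
    """Sum the worst case over all Apache servers and compare to the limit.

    servers is a list of (MaxClients, identities) tuples.
    Returns (total sockets, True if the total fits within the limit).
    """
    total = sum(apache_max_sockets(mc, ids) for mc, ids in servers)
    return total, total <= connection_limit
```

    As a hypothetical example, six Apache servers with MaxClients 150 and one trusted host each could open up to 900 connections, so total_against_limit([(150, 1)] * 6, 800) returns (900, False).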



    As for your idle connections closing:

    These are connections on which the policy server has not received any communication from a web agent within the idle timeout period.

    The Policy Server closes idle connections once they exceed the idle timeout setting. The check runs at intervals of half the idle timeout. If you set the idle timeout to 10 minutes, the PS will check for and close idle connections every 5 minutes, so you will see these log entries only every 5 minutes.

    This value is set in the smconsole on the Settings tab and defaults to 10 minutes. It is unusual for a web agent to go more than 10 minutes without checking with the policy server.



    I hope this helps explain things and that I answered all of your questions.

    gene


  • 9.  RE: Socket error 104

    Posted Feb 11, 2013 11:25 AM
    Hi Gene, Thanks again for your time and appreciate your detailed clarification.

    We had an outage when this happened; the apps were intermittently inaccessible for 2 days. Again, I understand it is a network issue, but I need to know what in the network is making SiteMinder so sensitive.

    From here there is not much we can do until I patch my policy servers to SP3 and then wait for the error to happen again.

    Again, I appreciate your time and help, I will keep posting my updates to this thread.


  • 10.  RE: Socket error 104

    Posted Mar 04, 2013 03:07 PM
    Hi Gene, I have some more information about this case, if you are interested. Feel free to take ownership of my case :). I am very confident that the support technician is either going in the wrong direction or isn't taking an interest, as this is not happening in Production right now.

    Here is how the case update reads:

    "I was able to reproduce the socket error 104 in my dev env. I ran
    XPSExplorer, kept the program running for 15 mins, and one web agent started seeing
    communication errors. TLI logs don't give me much. Agenttrace gives me a
    CommunicationFailure error with the PS.
    smps.log shows socket error 104. The issue went away by itself after 10
    mins. Not sure if the connection got recycled and a new connection was
    established.

    This problem started at 12:01 and went away at 12:10.
    Snippet from agent.log:

    [30168/4271834880][Tue Feb 26 2013 12:09:06][CSmLowLevelAgent.cpp:532][ERROR] LLA: SiteMinder Agent Api function failed - 'Sm_AgentApi_IsProtectedEx' returned '-1'.
    [30168/4271834880][Tue Feb 26 2013 12:09:06][CSmProtectionManager.cpp:192][ERROR] HLA: Component reported fatal error: 'Low Level Agent'.
    [30168/4271834880][Tue Feb 26 2013 12:09:06][CSmHighLevelAgent.cpp:405][ERROR] HLA: Component reported fatal error: 'Protection Manager'.
    [5165/18827008][Tue Feb 26 2013 12:09:14][CSmLowLevelAgent.cpp:532][ERROR] LLA: SiteMinder Agent Api function failed - 'Sm_AgentApi_IsProtectedEx' returned '-1'.
    [5165/18827008][Tue Feb 26 2013 12:09:14][CSmProtectionManager.cpp:192][ERROR] HLA: Component reported fatal error: 'Low Level Agent'.
    [5165/18827008][Tue Feb 26 2013 12:09:14][CSmHighLevelAgent.cpp:405][ERROR] HLA: Component reported fatal error: 'Protection Manager'.
    [5166/123725568][Tue Feb 26 2013 12:09:15][CSmLowLevelAgent.cpp:532][ERROR] LLA: SiteMinder Agent Api function failed - 'Sm_AgentApi_IsProtectedEx' returned '-1'.
    [5166/123725568][Tue Feb 26 2013 12:09:15][CSmProtectionManager.cpp:192][ERROR] HLA: Component reported fatal error: 'Low Level Agent'.
    [5166/123725568][Tue Feb 26 2013 12:09:15][CSmHighLevelAgent.cpp:405][ERROR] HLA: Component reported fatal error: 'Protection


    However, these threads did process requests again.
    [30168/4271834880][Tue Feb 26 2013 15:23:02][CSmHttpPlugin.cpp:1553][WARNING] Unable to process SMSESSION cookie.
    [5165/18827008][Tue Feb 26 2013 14:02:06][CSmHttpPlugin.cpp:1553][WARNING] Unable to process SMSESSION cookie.
    [5166/123725568][Tue Feb 26 2013 15:23:04][CSmHttpPlugin.cpp:1553][WARNING] Unable to process SMSESSION cookie.

    On a side note, can the same thread be owned by 2 different processes? Or is it just a coincidence that processes '5166' and '30168' created threads with the same thread ID?
    [30168/4271834880][Tue Feb 26 2013 15:23:02][CSmHttpPlugin.cpp:1553][WARNING] Unable to process SMSESSION cookie.
    [5166/4271834880][Tue Feb 26 2013 14:30:10][CSmHttpPlugin.cpp:1553][WARNING] Unable to process SMSESSION cookie.


    The LLAWP process did not have Agent API errors; I can tell because the LLAWP PIDs did not show up with these errors in
    agent.log. Only the PIDs running under httpd (5139) lost connections.
    Not sure if that makes any sense/difference.


    TimeFrame: XPS ran at 11.43AM.

    I provided the following logs to support.
    Agent.log - Api errors
    Agenttrace.log - CommunicationFailure messages
    TLI logs
    smps.log
    smtrace.log
    agentconn.log - netstat run every minute for 15 mins, while the errors were occurring.
    XPSExplorer log
    strace while the error was happening.

    But support wants to see the TLI log updated with an error message, and as far as I understand the TLI log doesn't care about this error. I pressed on what exactly support is looking for in TLI. Unfortunately, even support doesn't know what to look for. All they know is that process 5165 only had one line and no errors. So I was asked to rerun my test until the TLI log is updated with the correct information.

    The TLI log currently has the following:

    [Wed Feb 06 2013 14:33:09] [../../../SmAgentAPI.cpp:804] [5165-50296576] Starting initialization


  • 11.  RE: Socket error 104

    Broadcom Employee
    Posted Mar 04, 2013 05:23 PM
    Hi SamWalker,

    Sorry I cannot take over your case. I have requested SiteMinder Support management to review the case for you and let them know that you are not happy with the direction the case is going.
    I will try to get some time tomorrow to review the case in its entirety and see if I can provide any insight into it.

    Unfortunately there is just not enough information here in the forum to provide you with any real help on this at this point.

    After I have a chance to review if I find something I will update the case and this post as to what I think the next steps should be.


    Sorry I could not be more help

    Gene


  • 12.  RE: Socket error 104

    Posted Mar 04, 2013 06:38 PM
    Hi Gene, I understand. Thank you for your help all along.


  • 13.  RE: Socket error 104

    Broadcom Employee
    Posted Mar 05, 2013 11:25 AM
    Hi SamWalker,

    I had a chance this morning to review your case and it is very confusing as I am sure you are aware.

    Below is a summary of what I understand the problem to be (as this was a long case to review, I may have missed something):
    1) Intermittently your 64-bit web agents are getting errors -1 and -2 in the web agent logs that are impacting PROD. (Explanation for this provided earlier)
    2) The web agents are on a higher version than the Policy server which is not a supported configuration.
    3) There is a firewall between the Policy servers and the web agents
    4) The wire shark traces provided seem to point to the Firewall resetting the connection between the web agents and firewall.
    5) The issue is resolving itself without doing anything to the SiteMinder infrastructure.
    6) At the same time the -1 and -2 errors are being seen you are seeing socket error 104 on the policy server side. (Explanation for this provided earlier)

    Support has asked you to do the following:
    1) Get on a supported Policy Server and web agent combination.
    2) Enable TLI tracing on the web agent to try to capture some more low-level network information.

    Per the case notes it would appear you’re working on getting this done.

    At this point I would have to agree with support that the problem points to a network infrastructure problem, not a SiteMinder problem, based on the following:
    1) The TCP/IP reset seems to be coming from the firewall. Nothing so far points to it coming from the policy server or web agent.
    2) The problem started when no changes had been made to the SiteMinder infrastructure.
    3) The problem resolved on its own with no changes to the SiteMinder infrastructure.

    All of the above would point to a network level problem that is the root cause.

    SiteMinder has very little ability to get down to the network level and tell you what is happening. We hand off networking requests to the OS, and once that is done we wait for the OS to handle all of the network processing and return the appropriate information. So once that handoff is done, there is little insight the SiteMinder product can provide.
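
    To illustrate that handoff, here is a minimal Python sketch (not SiteMinder code) of what any application sees when the peer, or a device in between, resets a TCP connection: only an error code handed back by the OS. On Linux that code is ECONNRESET, whose numeric value is 104. The function name `provoke_reset` is made up for this example, and it assumes a loopback interface is available.

```python
import errno
import socket
import struct


def provoke_reset():
    """Return the errno seen by the accepting side after its peer sends a TCP RST."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(listener.getsockname())
    server_side, _ = listener.accept()

    # SO_LINGER enabled with a zero timeout makes close() abort the
    # connection with RST instead of the normal FIN handshake.
    client.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    client.close()

    try:
        server_side.settimeout(5)
        server_side.recv(1)
        return None  # no error surfaced
    except OSError as exc:
        # The application learns nothing about *who* sent the reset,
        # only that the OS reported one.
        return exc.errno
    finally:
        server_side.close()
        listener.close()


if __name__ == "__main__":
    code = provoke_reset()
    print(code, errno.errorcode.get(code))
```

    The error looks the same to the application whether the reset came from the remote process or from something on the path, which is why packet captures or firewall logs are needed to tell the two apart.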

    I think the support team is doing their best to try and provide you with more detailed information on this.

    Personally, I would proceed on two fronts, support (item 1) and your network team (items 2 to 5):
    1) Continue doing what CA support has asked you to do and provide them with the requested upgrades and new logging.
    2) Engage your networking team to monitor the firewall and capture logs, if available, from the firewall during one of these outages.
    3) If you can, disable the session timeout and/or idle timeout between the policy servers and web agents on the firewall.
    4) Check whether the firewall has had any updates done to it recently.
    5) Possibly see if traffic from one web agent having the problem can be rerouted through another firewall, and see if the problem still persists for that agent.
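
    For item 2, it may help to hand the networking team a per-minute, per-agent-IP count of the send failures so they know which windows to look at in the firewall logs. Below is a rough Python sketch that assumes the Policy Server log lines look like the excerpts at the top of this thread; the names `SEND_FAILURE` and `tally_send_failures` are made up for this example.

```python
import re
from collections import Counter

# Matches lines such as:
# [3098/150862736][Fri Feb 01 2013 13:15:32][CServer.cpp:1916][ERROR] Failed to send
# response on session # 1148 : agentname/::ffff:10.152.144.241:59387. Socket error 104
SEND_FAILURE = re.compile(
    r"\[(?P<ts>[A-Z][a-z]{2} [A-Z][a-z]{2} \d{2} \d{4} \d{2}:\d{2}:\d{2})\]"
    r"\[[^\]]*\]\[ERROR\] Failed to send response on session # \d+ : "
    r"(?P<agent>[^/]+)/(?P<addr>\S+):(?P<port>\d+)\. Socket error (?P<code>\d+)"
)


def tally_send_failures(lines, code="104"):
    """Count 'Failed to send response' errors per (minute, agent IP)."""
    counts = Counter()
    for line in lines:
        match = SEND_FAILURE.search(line)
        if not match or match.group("code") != code:
            continue
        addr = match.group("addr")
        if addr.startswith("::ffff:"):  # IPv4-mapped IPv6 prefix
            addr = addr[len("::ffff:"):]
        minute = match.group("ts")[:-3]  # drop the ":SS" seconds part
        counts[(minute, addr)] += 1
    return counts


if __name__ == "__main__":
    import sys

    with open(sys.argv[1]) as log:  # path to a Policy Server log file
        for (minute, ip), n in sorted(tally_send_failures(log).items()):
            print(minute, ip, n)
```

    Minutes where one agent IP accounts for a burst of errors are the ones worth lining up against the firewall's own session and drop logs.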

    When dealing with network-level problems of this nature that are intermittent at best, it is always very difficult to pinpoint the problem and get the right teams involved.

    I think you are on your way to resolving this, even if it might not be in the time frame you want. But that is the nature of troubleshooting intermittent network problems.

    I hope this helps in some way.

    Gene


  • 14.  RE: Socket error 104

    Posted Mar 05, 2013 01:07 PM
    Hi Gene, thanks for your time again. A couple of weeks ago, we upgraded the Policy Servers, so they are now on a higher version than the web agents. So we are in a supported configuration.

    Have you talked to the support technician, or is the following understanding based on reading the case notes?

    '4) The Wireshark traces provided seem to point to the Firewall resetting the connection between the web agents and firewall.'
    At this point I would have to agree with support that the problem points to a network infrastructure problem not a SiteMinder problem based on the following:
    1) TCP/IP reset seems to be coming from firewall. Nothing so far points to it coming from the policy server or web agent.

    Neither of these statements is accurate. If the support technician thinks so, I may have to correct him. Honestly, the current support tech did not even look at the Wireshark traces. He just thinks TLI logs should capture such events, which they clearly do not.

    I have always maintained that web agents in the same network as the Policy Server had a similar issue as well. There were no connection drops on the firewall. After initial troubleshooting, Support asked me to set a REG key even though that key setting does not apply to our INF. I did that, but the problem did not stop.

    I totally agree that this is a network issue, but I need to be able to prove that something broke the SM setup and caused a business application outage. That is what I am trying to do: get to the bottom of it, as deep as I can go.

    When I reproduced similar behavior in our dev environment, the PS and WA were on the same VMware host. I just ran XPSExplorer and the PS started dropping connections, with no firewall and no INF changes.

    This may be a totally different scenario from the Production issue that happened a month ago.

    However, it tells me that the Policy Server throws socket errors when it is doing some internal work (unable to respond?), like executing XPSExplorer.


  • 15.  RE: Socket error 104

    Posted May 29, 2013 08:23 AM
    Hi SamWalker,
    Were you able to identify the root cause or a resolution for this issue? I am currently experiencing similar errors on both agents and policy servers when we perform a load test in our QA environment.
    Which User Directory are you using?


    -- Regards


  • 16.  RE: Socket error 104

    Posted May 29, 2013 10:08 AM
    Hi Krish, We are using AD, ADAM and Domino as our user stores.

    We could not conclude why the error was happening in this particular instance, but from all the troubleshooting it appears that the following conditions can prompt socket errors:

    Policy Server under load
    Connection timeouts between agent and policy server
    Network issues (these can be troubleshot via TLI logging, but it did not give me any of the information I needed).

    In your case, I have a feeling that it is because of load.