CA Service Management

  • 1.  CA Tuesday Tip - Working with CA Support on a Performance/Crash/Hang Issue

    Posted Aug 27, 2012 01:36 PM
    Hello All, I was out of the office last week on vacation and apologize for not posting. Thank you to my colleague Paul C. for putting the post out there last week about the Additional BOXI Documentation to keep us rolling. This week I wanted to touch on a subject that we have all dealt with from time to time - those complex issues where a Service Desk process is crashing or hanging, thus causing performance problems with Service Desk. Many of you have opened issues with CA Support for such problems and I know it always seems that you are asked a TON of questions by the Support Engineer working on the issue with you. I first want to assure you that we are NOT doing that for fun, or to stall, but rather we need to break the issue down as best we can to narrow down the possible root causes - and to do that we need to ask questions and gather the appropriate information needed to help us drive the issue in the right direction. With that, many times the Support Engineer will come back to you advising you to use a utility or tool to generate a dump file manually or to monitor a process for a crash and generate a dump file at that point, and then you are asked to send us a whole slew of files and pieces of information along with that dump file for the next occurrence. This week I wanted to go over the best practices for working with CA Support on this type of issue.

    So without further delay - here is everything you need to know about working with CA Support on a "Crashing" or "Hanging" process issue:
    **(I have also attached this to the post as a Word document - it may be easier to read that way)

    Working with CA Support & Troubleshooting a “Crashing” or “Hanging” Process

    First Determine if a Process is “Crashing” or “Hanging”
    Many times when a Service Desk process seems to be failing, you may be asked by CA Support if it's “crashing” or “hanging” - and it is sometimes difficult to tell the difference. This document seeks to clarify the difference, and provide you with the knowledge needed to determine whether, in your case, a process has “crashed” or is “hanging.” This document is specific to environments where the system is running a Microsoft Windows based operating system.

    A “Crashing” Process Defined
    A Crashing or Crashed process is one that fails in such a way that it either stops running completely, or recycles itself.

    A “Hanging” or “Hung” Process Defined
    A Hanging or Hung process is one that appears not to be responding, but at the same time, still appears to be in a running state.

    To determine if the process has crashed, confirm or answer the following:
    1. After the “crash” does the process still show as running when you run a pdm_status?
    2. After the “crash” does the process still show in Task Manager in the process list?
    3. In the Service Desk STDLogs, at the time of the “crash” (could also be before, during, or after the occurrence is reported to you), search for the word “died” - and look for any messages with something similar to “****** process died: restarting” (where ****** is a process name such as domsrvr.exe, or webengine.exe).
    4. In the Service Desk STDLogs, at the time of the “crash” (could also be before, during, or after the occurrence is reported to you), search for the word “FATAL” - and look for any FATAL type message including an “EXIT”, “SIGSEGV”, or “CANNOT ALLOCATE xxxxx BYTES”

    If you can answer “No” to #1 and #2, and confirm at least one of the messages in the logs on #3 or #4, then most likely you are experiencing a “crashing” process.

    If you answer “Yes” to #1 and #2, and are not able to confirm any of the messages in the logs on #3 or #4, then you are most likely experiencing a “hanging” process.
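
    The rule of thumb above can be written down as a small decision function. This is only a sketch in Python, not a CA-supplied tool: the function names are made up for illustration, the sample log line is hypothetical, and the exact layout of real stdlog lines is an assumption. Only the search terms ("died", "FATAL", "EXIT", "SIGSEGV", "CANNOT ALLOCATE ... BYTES") come from the checks listed above.

```python
import re

# Detail keywords from check #4; the surrounding log line layout is an assumption.
FATAL_DETAIL = re.compile(r"EXIT|SIGSEGV|CANNOT ALLOCATE \S+ BYTES")


def log_shows_crash(log_lines):
    """Return True if any line matches check #3 or check #4."""
    for line in log_lines:
        if "process died" in line:  # check #3
            return True
        if "FATAL" in line and FATAL_DETAIL.search(line):  # check #4
            return True
    return False


def classify(in_pdm_status, in_task_manager, log_lines):
    """Combine the answers to #1 and #2 with the log evidence from #3 and #4."""
    crash_logged = log_shows_crash(log_lines)
    if not in_pdm_status and not in_task_manager and crash_logged:
        return "crashing"
    if in_pdm_status and in_task_manager and not crash_logged:
        return "hanging"
    return "undetermined"


# Hypothetical log line, shaped like the message quoted in check #3.
sample = ["08/27 13:36:01 domsrvr.exe process died: restarting"]
print(classify(False, False, sample))
print(classify(True, True, []))
```

    Any mixed set of answers comes back as "undetermined", which is the point at which it makes sense to ask your CA Support Engineer for help.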

    If a process appears to be in a “hung” state and does not appear to be responding, please confirm this by performing the following steps:

    First, run the following command to see if the process responds to a request via the command line: “pdm_diag -a {slump name of process}”

    **to get the slump name of the process, you can run the slstat command and pipe it out to a file by running the following command: “slstat > slstat.txt”

    Example: If it was a webengine hanging, and you found that the slump name for the failing webengine as per the slstat output is “web:local” you would run the command as follows to see if that webengine process is responding: “pdm_diag -a web:local”

    If you receive information back from the process, then the process IS actually responding. If you do not receive information back from the process, and it appears the command is hanging, then the process is most likely in a “hung state” and will not respond with any information.
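
    If you want to script this check rather than watch the command prompt, one option is to run the command with a time limit. The sketch below is an illustration only: the function name "probe" and the 30 second default are assumptions, not anything defined by Service Desk. It treats "no output within the time limit" the same way the paragraph above treats a command that appears to hang.

```python
import subprocess


def probe(command, timeout_seconds=30):
    """Run a diagnostic command and report whether it returned any output in time."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired:
        # The command never came back, which matches the "hung state" description.
        return "no response"
    return "responding" if result.stdout.strip() else "no response"
```

    On a Service Desk server with pdm_diag on the PATH, the webengine example above would be checked with probe(["pdm_diag", "-a", "web:local"]).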

    Then run the following two commands to turn on advanced tracing and logging for the hung process and let it run for about 30 seconds:

    “pdm_logstat -n {slump name of process} TRACE”
    “bop_logging {slump name of process} -f $NX_ROOT\log\{processname}.out -n 10 -m 20000000 ON”

    NOTE: In most cases - it is a good practice to turn bop logging on for all domsrvrs, webengines, and spelsrvrs, even the ones that are not hanging or crashing - this will allow CA Support and Sustaining Engineering to see how other processes are being affected by the hanging or crashing process.

    Then turn the logging off by running the following commands:

    “pdm_logstat -n {slump name of process}”
    “bop_logging {slump name of process} OFF”

    Example: Using the same example above for a hanging webengine process, the syntax would be as follows:
    “pdm_logstat -n web:local TRACE”
    “bop_logging web:local -f $NX_ROOT\log\weblocal.out ON”

    **the output files for this logging will be included in the Service Desk log directory, so they will be uploaded along with the log directory to the support issue once all required files, output, and info has been gathered.
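
    When several processes need tracing turned on and off, it can help to generate the command lines instead of typing them. The following Python sketch only builds the command strings shown above (with a plain hyphen in "-n") and does not run anything; the function name and its parameters are hypothetical helpers for illustration.

```python
def tracing_commands(slump_name, out_file, on=True):
    """Build the pdm_logstat / bop_logging command lines quoted in this post."""
    if on:
        return [
            f"pdm_logstat -n {slump_name} TRACE",
            f"bop_logging {slump_name} -f {out_file} -n 10 -m 20000000 ON",
        ]
    # Turning tracing off again uses the same two utilities.
    return [
        f"pdm_logstat -n {slump_name}",
        f"bop_logging {slump_name} OFF",
    ]


for cmd in tracing_commands("web:local", r"$NX_ROOT\log\weblocal.out"):
    print(cmd)
for cmd in tracing_commands("web:local", r"$NX_ROOT\log\weblocal.out", on=False):
    print(cmd)
```

    Printing the "on" and "off" pairs together makes it harder to forget to switch the tracing back off after the 30 seconds are up.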

    Steps to take once you have confirmed that you have a “Crashing” or “Hanging” process:
    It is always best to have a crash dump file generated for a “crashing” or “hanging” process. Once a crash dump file is generated, your CA Support Engineer will work with the Sustaining Engineering Team to try and pinpoint the probable cause of the crash or hang.

    Crash dump files can be generated in multiple ways - depending on your environment, and whether the process had been determined to be “crashing” or “hanging.”

    Use the chart below to help you decide which option is most applicable for you:

    Operating System            Dr. Watson   MS Dumper Utility   ADPlus
    Windows Server 2003         YES          YES                 YES
    Windows Server 2003 64Bit   NO           NO                  YES
    Windows Server 2008         NO           NO                  YES
    **With the release of ADPlus from Microsoft, as their official utility for generating dump files on a Windows platform, we will recommend using ADPlus as the preferred method whenever possible.

    Because ADPlus is the recommended utility for generating a crash dump on a crashing or hung process, we will go through the steps to use ADPlus to capture a crash dump in both situations.


    Using ADPlus to generate a crash dump on a “Crashing” process:

    **IMPORTANT NOTE: ADPlus requires .NET 4 to be installed prior to installing the ADPlus application

    Part-1 – Download and install the ADPlus Application

    1. Download ADPlus from:
       http://msdn.microsoft.com/en-us/windows/hardware/hh852363

    2. Once downloaded, run the installer and select only the Windows debugging tools to install (you don’t need the rest to use ADPlus)


    Part-2 – Using ADPlus to generate a dump file for a crashing process

    1. Find the PID(s) (Process Identifiers) of the process(es) for which you need to generate dump files, using Task Manager.

    **Note that you may need to add the “PID” column to task manager to be able to see the PID number for all running processes.

    Here are some Examples of process names relevant to Service Desk:

    Tomcat = javaw.exe

    Domsrvr = domsrvr.exe

    Spelsrvr = spelsrvr.exe

    Webengine = webengine.exe

    Screenshot of Task Manager with PID column showing:

    Here you see the javaw.exe process has a PID of 4608

    2. Open a command prompt window, and navigate to the directory where ADPlus was installed (in most cases this would be C:\Program Files\Debugging Tools for Windows (x64), but it could differ depending on the version and location you installed it to).

    3. Run the following command to monitor a crashing process:

    Adplus -crash -p xxxx -o c:\dumps -FullOnFirst
    **“xxxx” should be replaced with the PID number of the process

    **Key to the Attributes/Flags used in the command:

    -p = PID number of the process
    -o = output directory for dump file to be created
    -FullOnFirst = to specify a FULL dump (mini-dumps don’t show info)

    **IMPORTANT NOTE: If you have more than one of the same process running - then you should run the command multiple times, once for each PID for each of the processes.


    Using ADPlus to generate a crash dump on a “Hanging” process:

    **IMPORTANT NOTE: ADPlus requires .NET 4 to be installed prior to installing the ADPlus application

    Part-1 – Download and install the ADPlus Application

    1. Download ADPlus from:
       http://msdn.microsoft.com/en-us/windows/hardware/hh852363

    2. Once downloaded, run the installer and select only the Windows debugging tools to install (you don’t need the rest to use ADPlus)

    Part-2 – Using ADPlus to generate a dump file for a hanging process

    1. Find the PID(s) (Process Identifiers) of the process(es) for which you need to generate dump files, using Task Manager.

    **Note that you may need to add the “PID” column to task manager to be able to see the PID number for all running processes.

    Here are some Examples of process names relevant to Service Desk:

    Tomcat = javaw.exe

    Domsrvr = domsrvr.exe

    Spelsrvr = spelsrvr.exe

    Webengine = webengine.exe

    Screenshot of Task Manager with PID column showing:

    Here you see the javaw.exe process has a PID of 4608

    2. Open a command prompt window, and navigate to the directory where ADPlus was installed (in most cases this would be C:\Program Files\Debugging Tools for Windows (x64), but it could differ depending on the version and location you installed it to).

    3. Run the following command to force a dump on a hung process:

    Adplus -hang -p xxxx -o c:\dumps -FullOnFirst
    **“xxxx” should be replaced with the PID number of the process

    **Key to the Attributes/Flags used in the command:

    -p = PID number of the process
    -o = output directory for dump file to be created
    -FullOnFirst = to specify a FULL dump (mini-dumps don’t show info)

    **IMPORTANT NOTE: If you have more than one of the same process running - then you should run the command multiple times, once for each PID for each of the processes.
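
    Since the command has to be repeated once per PID, a short script can generate the full set of command lines for you to paste into the command prompt. This Python sketch is illustrative only: the function name is hypothetical, the second PID is made up, and it does nothing more than format the two ADPlus commands shown in this post.

```python
def adplus_commands(pids, mode="crash", out_dir=r"c:\dumps"):
    """Build one ADPlus command line per PID, using the flags shown in this post."""
    if mode not in ("crash", "hang"):
        raise ValueError("mode must be 'crash' or 'hang'")
    return [f"Adplus -{mode} -p {pid} -o {out_dir} -FullOnFirst" for pid in pids]


# 4608 is the javaw.exe PID from the screenshot description; 5120 is a made-up second PID.
for cmd in adplus_commands([4608, 5120], mode="hang"):
    print(cmd)
```

    Keeping the mode as a parameter means the same helper covers both the crash-monitoring case and the forced dump of a hung process.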


    What to do after the dump file has been generated:
    Once you or ADPlus has generated a dump file for a crashing or hanging process, please fill out a “Crash Dump Template” as supplied to you by CA Support. This will serve as a checklist for you to gather all the required files, information, and data needed by CA Support to analyze the dump file(s) and help pinpoint the source of the crash or hang. The following is a copy of the Windows Crash Dump Template document - which should be supplied to you by CA Support (separately from this document):

    Windows Crash Dump Template


    Please fill this out as best you can after you capture a dump file for a dying, crashing or hanging process.

    Simply insert your answers/information to these items in-line below each item.

    You may cut and paste this template into the issue via support.ca.com, or you may save it and upload it to the issue as an attachment.

    If you are unsure about a specific item - please ask your CA Support Engineer for clarification.


    1. Please review the stdlog file that captures the timeframe of when the dump occurred and supply us with the following information:

    Was the process ended by a SIGSEGV message, a SIGBUS message or any other “FATAL” type message?

    What is seen in the stdlog file right before, during, and after the time the process crashed?

    What errors, if any, were reported in the logs right before, during and after the time when the process crashed?

    2. If the dump file was generated by Dr. Watson - please attach the ‘User.dmp’ and ‘drwatson.log’ log files to the issue. If the dump file was NOT generated using Dr. Watson - but was generated using ADPlus or the Microsoft Process Dumper Utility, then simply upload the .DMP file that was generated by the utility used to generate the dump, and specify the filename of the dump file (or zip file that contains the dump file) here.

    3. If the dump was generated by Dr. Watson, please supply us with the process name mentioned in the drwatson.log file. Again, if the dump file was not generated by Dr. Watson, simply specify N/A below, and continue to the next.

    4. Where was the ‘User.dmp’ file first found?

    5. Please specify the date/time the dump file was generated.

    6. How many times has the failing process crashed since first reported?

    7. Are there any possible reproducible steps noted prior to when this crash/hang occurs?

    8. Supply a “Directory Listing” output of the Service Desk root directory (NX_ROOT) by using a command line window, navigating to the directory where Service Desk is installed, and running the command “dir/s > dir_listing.out” - this will generate a file called dir_listing.out. Please upload this file and specify the name of the file (or zip file that contains the dir_listing.out file) here.

    9. Please run the command “winmsd” - this should pop up a window with system information. Click on the file menu and select save to save the output to a file. Please upload that file and specify the name of the file (or zip file that contains the output file) here. NOTE - on some environments, for security reasons, winmsd may not run. In this case, please specify the specs of the hardware, and whether or not it is VMware based, for the system where the dump file was generated, here.

    10. Navigate to the Service Desk\bin directory and run “pdm_ident {process name} > pdm_ident.out”, where {process name} is the name of the Service Desk process for which the dump file was generated. If the failing process is javaw.exe - you will need to run a pdm_ident on the sda65.dll file as the javaw process does not contain pdm_ident information. Please upload the pdm_ident.out file, and specify the name of the file (or zip file that contains the output file) here.

    11. Please attach your patch history file ($NX_ROOT/<machine name>.his) to the issue and specify the name of the file (or zip file that contains the history file) here.

    12. Please zip up the entire Service Desk\log directory and attach it to the issue, and specify the name of the zip file here.

    13. Please zip up the Service Desk\site\mods directory and attach it to the issue, and specify the name of the zip file here.

    14. Please upload the Windows event log files, and specify the name of the files (or zip file containing the event logs) here.

    ***end of crash dump template***

    Once you have generated the crash dump file, and have gathered all required information, files, and data as per the Crash Dump Template document, please upload everything to your CA Support issue. Please be sure to appropriately label the filenames of all uploaded files so that it is easily visible to CA Support as to which file is which. We have found that the best way to do this is to gather all the files and output first, set appropriate file names, then, under each respective item on the Crash Dump Template Document, simply write the name of the file that corresponds with that item if applicable.

    Once all the required files and information have been uploaded to the support issue, your CA Support Engineer will review the information supplied - and will then engage the Sustaining Engineering Team to assist in analysis of the dump files.


    What should I do if additional dump files are produced for additional occurrences of the same exact problem on the same server?
    Sometimes multiple occurrences will produce multiple dump files if the dump files are being automatically generated by ADPlus or another tool.

    To avoid any confusion and “clouding” of your open support issue with CA Support, do NOT upload the additional dump files and logs without talking to your CA Support Engineer first. There is no need to upload multiple dump files for the same problem unless specifically requested by your CA Support Engineer. The CA Support Team may already have found the problem and may be working on possible resolutions or possible code changes to fix it, and adding these files and additional logs, and updates, may only cloud the issue and make it more difficult to review by others.


    What should I do if a similar, but not exactly the same problem occurs on the same server?
    If you experience a problem that is similar to the previous occurrence, but not exactly the same - say for example the original problem was a hanging webengine process, and now you are experiencing a hanging spelsrvr process, this problem should be treated as a different problem, and a separate new issue should be opened. The same steps that were followed for the original problem should be followed for this new, slightly different occurrence, including filling out the Crash Dump Template Document, and uploading the files and information specific to the new problem, in the new issue.


    What should I do if the same problem (as the original issue) occurs on a different server?
    If you experience a problem where the same process crashes or hangs, but on a different server, you should follow all the same steps you did to generate the crash dump, and fill out the Crash Dump Template Document with regards to the different server where the new crash or hang has occurred. You may upload the new crash dump, along with the filled out Crash Dump Template Document, and all required information and files to the original issue - however, you MUST make sure that the files are ALL appropriately labeled so it is easily visible that they are from a different server than the one the original issue occurred on. The best way to do this is to zip up ALL of the files for this new occurrence on a different server into one zip file specifically labeled with the second server name and the date of the occurrence.
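
    As an illustration of that labeling advice, here is a small Python sketch that zips a set of files into one archive named after the server and the date of the occurrence. The function name, the "server_YYYY-MM-DD.zip" naming pattern, and the example server name are assumptions made for this example; CA Support does not prescribe a specific format beyond including the server name and date.

```python
import zipfile
from datetime import date
from pathlib import Path


def bundle_occurrence(files, server_name, occurred_on, dest_dir):
    """Zip the given files into <dest_dir>/<server_name>_<YYYY-MM-DD>.zip."""
    zip_path = Path(dest_dir) / f"{server_name}_{occurred_on:%Y-%m-%d}.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for file_path in files:
            # Store each file under its own name, without the source directory.
            archive.write(file_path, arcname=Path(file_path).name)
    return zip_path
```

    For example, bundle_occurrence(["dir_listing.out", "pdm_ident.out"], "sdm-server2", date(2012, 8, 27), ".") would produce sdm-server2_2012-08-27.zip in the current directory, assuming those two files exist there.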


    *******************
    And that's about all there is to know :-)
    Remember to check back next week for more tips and tricks!

    Have a great week everyone,
    Jon Israel
    Principal Support Engineer
    CA Technologies


  • 2.  RE: CA Tuesday Tip - Working with CA Support on a Performance/Crash/Hang Is

     
    Posted Aug 28, 2012 02:22 PM
    Thanks for all the great info Jon and welcome back! Hopefully this info will help get resolutions to users' issues even faster :grin:


  • 3.  RE: CA Tuesday Tip - Working with CA Support on a Performance/Crash/Hang Is

    Posted Aug 29, 2012 09:22 AM
    Thanks Jon for sharing this information with us. Does CA Support have any specific / additional set of instructions for Linux systems...?


  • 4.  RE: [Tuesday's Tips] RE: CA Tuesday Tip - Working with CA Support on a Perf

    Posted Aug 29, 2012 09:28 AM
    Hi Ajays1710,

    First and foremost – I apologize for not including Linux/Unix info in my post ☹ That was poorly planned on my part!

    The instructions for interval logging for Linux are included in the interval logging document – as for the crash dump part, here is a piece of a document regarding Dump Files on Linux/Unix:

    Collecting Core Dumps on Solaris OS

    With the Solaris Operating System, unhandled signals such as a segmentation violation, illegal instruction, and so forth, result in a core dump. By default, the core dump is created in the current working directory of the process and the name of the core dump file is core. The user can configure the location and name of the core dump using the core file administration utility, coreadm. This procedure is fully described in the man page for the coreadm utility.

    The ulimit utility is used to get or set the limitations on the system resources available to the current shell and its descendants. Use the ulimit -c command to check or set the core file size limit. Make sure that the limit is set to unlimited; otherwise the core file could be truncated. Note that ulimit is a Bash shell built-in command; on a C shell, use the limit command.

    The gcore utility can be used to get a core image of a running process. This utility accepts the process id (pid) of the process for which you want to force a core dump.

    To get the pid of the process for which you want to force the core dump, run the following on the machine:

    ps -ef | grep <process name>


    Collecting Core Dumps on Linux

    On the Linux operating system, unhandled signals such as segmentation violation, illegal instruction, and so forth, result in a core dump. By default, the core dump is created in the current working directory of the process and the name of the core dump file is core.pid, where pid is the process id of the crashed process.

    The ulimit utility is used to get or set the limitations on the system resources available to the current shell and its descendants. Use the ulimit -c command to check or set the core file size limit. Make sure that the limit is set to unlimited; otherwise the core file could be truncated. Note that ulimit is a Bash shell built-in command; on a C shell, use the limit command.

    You can use the gcore command in the gdb (GNU Debugger) interface to get a core image of a running process. This utility accepts the pid of the process for which you want to force the core dump.

    To get the pid of the process for which you want to force the core dump, run the following on the machine:

    ps -ef | grep <process name>



    Reasons for Not Getting a Core File

    The following list explains the major reasons that a core file might not be generated. This list pertains to both Solaris OS and Linux, unless specified otherwise.

    · The current user does not have permission to write in the current working directory of the process.

    · The current user has write permission on the current working directory, but there is already a file named core that has read-only permission.

    · The current directory does not have enough space or there is no space left.

    · The current directory has a subdirectory named core.

    · The current working directory is remote. It might be mapped by NFS (Network File System), and NFS failed just at the time the core dump was about to be created.

    · Solaris OS only: The coreadm tool has been used to configure the directory and name of the core file, but any of the above reasons apply for the configured directory or filename.

    · The core file size limit is too low. Check your core file limit using the ulimit -c command (Bash shell) or the limit coredumpsize command (C shell). If the output from this command is not unlimited, the core dump file size might not be large enough. If this is the case, you will get truncated core dumps or no core dump at all.

    · The process is running a setuid program and therefore the operating system will not dump core unless it is configured explicitly.

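
    The core file size limit from the list above can also be checked from a script. The sketch below uses Python's standard resource module (Unix only) and reports the limit for the process running the script, so it should be run as the same user and from the same kind of shell that starts Service Desk. The function name and the dictionary keys are illustrative choices, and the values are in bytes, which may differ from the units ulimit -c prints.

```python
import resource


def core_limit_status():
    """Report the soft and hard core file size limits for the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)

    def describe(value):
        return "unlimited" if value == resource.RLIM_INFINITY else str(value)

    return {
        "soft": describe(soft),
        "hard": describe(hard),
        # Anything short of unlimited can truncate or suppress the core file.
        "may_truncate": soft != resource.RLIM_INFINITY,
    }


print(core_limit_status())
```

    If "may_truncate" comes back True, raise the limit to unlimited before the next occurrence so the core file is written in full.
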
    I hope this helps ☺

    Thanks again,

    Jon Israel
    Principal Support Engineer
    CA Technologies



  • 5.  RE: CA Tuesday Tip - Working with CA Support on a Performance/Crash/Hang Is

    Posted Sep 11, 2012 10:02 AM
    Thanks for the reply Jon. This will certainly help.


  • 6.  Re: CA Tuesday Tip - Working with CA Support on a Performance/Crash/Hang Issue

    Posted Jul 16, 2014 01:37 AM

    Hello Service Desk Manager Community,

     

    I'm flagging this post by Jon_Israel on Working with CA Support on a Performance/Crash/Hang Issue as Featured Content on our site this week.

     

    Its detailed content that addresses issues experienced by sites makes it essential reading for those who need to get to the nuts and bolts of working with CA Service Desk Manager.

     

    Instead of MS ADPlus, any operating system process dumper, such as Microsoft Process Dumper on the later versions of MS Windows, may also be used.

     

    If you're an SDM Admin, in particular on a large/complex architecture site, I recommend this as part of your "kit."

     

    Thanks, Kyle_R.

    ADMIN.



  • 7.  Re: CA Tuesday Tip - Working with CA Support on a Performance/Crash/Hang Issue

    Posted Jul 16, 2014 09:58 AM

    Thanks Kyle!

     

    One quick change to your comments - the best tool to use at this point is the Microsoft Debug Diag utility. It is compatible with all Microsoft OSes from Win2003 and up. The Process Dumper tool is only valid for Windows 2003. ADPlus also works on all OSes, but it requires .NET and other things to be installed, and also requires the user to stay logged into Windows or the application will not only close out, but will also take down the process that it's set to monitor - makes for a not so good situation. SO, we now recommend the Debug Diag utility from Microsoft in its place - which covers us from all sides and is a lot more user friendly. The tool can be found here: http://www.microsoft.com/en-us/download/details.aspx?id=40336

     

    Thanks,

    Jon Israel

    Principal Support Engineer

    CA Technologies



  • 8.  Re: CA Tuesday Tip - Working with CA Support on a Performance/Crash/Hang Issue

    Posted Jul 17, 2014 12:07 AM

    Good to know!

     

    Thanks, Kyle_R.