Machine Offline or Missing

Document created by liuze01 Employee on Sep 4, 2014

A machine can be placed offline automatically or manually.

A machine is placed offline automatically when the agent running on it is shut down, or when the scheduler tries to start a job on the machine and gets no response from the agent. A machine is placed offline manually by a MACH_OFFLINE event, usually issued with the sendevent command.

When a machine is placed offline automatically, autorep shows its status as Missing. In this situation, autoping to the machine fails, and jobs that are due to start on the machine at that time are placed in Pending status. FORCE_STARTJOB will not work. The scheduler brings the machine back online automatically once it gets a response from the agent again.

 

Example 1: A machine is placed offline and its status is set to Missing when the agent is shut down:

[09/02/2014 11:01:36.7973] 13803 3943279504 CAUAJM_I_40245 EVENT: MACH_OFFLINE     MACHINE: liuze01-U109361
[09/02/2014 11:01:36.7983] 13803 3943279504 <Detected the shutdown of agent on machine <liuze01-U109361>. Automatically placing the machine offline.>

[09/02/2014 11:01:36.8032] 13803 3943279504 L:128 MStatHdlr.cpp 134 DBQ db2_send_sql update ujo_machine set mach_status = 'a' where mach_name = 'liuze01-U109361' and ((type != 'w' and type != 'v') or ((type = 'v' or type = 'w') and parent_name != ' ')) and mach_status != 'm'

Note that, based on the SQL statement in the trace above, the update is restricted by the condition mach_status != 'm', so a machine that is already in Offline status (mach_status = 'm') cannot be changed to Missing status.
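The WHERE clause of the traced UPDATE can be restated as a small predicate, which makes it easier to see which rows the automatic status change touches. This is only a sketch of the logic visible in the trace, not the scheduler's actual code. The function name is hypothetical, the mach_name match is left out, and the type code 'r' and status code 'o' in the usage lines are placeholder values chosen for illustration, not confirmed database codes.

```python
def auto_offline_update_applies(mach_type: str, parent_name: str, mach_status: str) -> bool:
    """Mirror the WHERE clause of the traced UPDATE (without the mach_name match)."""
    is_v_or_w = mach_type in ("v", "w")
    # (type != 'w' and type != 'v') or ((type = 'v' or type = 'w') and parent_name != ' ')
    type_clause = (not is_v_or_w) or (is_v_or_w and parent_name != " ")
    # ... and mach_status != 'm'  (rows already set to 'm' are never updated)
    return type_clause and mach_status != "m"


# Placeholder codes: 'r' (type) and 'o' (status) are assumptions for illustration.
print(auto_offline_update_applies("r", " ", "o"))  # row would be updated
print(auto_offline_update_applies("r", " ", "m"))  # row is skipped: already 'm'
```

The second call shows the point made in the note: a row whose mach_status is already 'm' never satisfies the condition.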

 

Example 2: A machine in online status is taken offline when a job cannot be started on it. In this case the agent cannot respond to the server because of a firewall setting.

[09/03/2014 11:57:00]      ----------------------------------------
[09/03/2014 11:57:15]      CAUAJM_I_40245 EVENT: STARTJOB         JOB: test_liuze01-U105187
[09/03/2014 11:57:15]      CAUAJM_I_40245 EVENT: CHANGE_STATUS    STATUS: STARTING        JOB: test_liuze01-U105187 MACHINE: liuze01-U105187
[09/03/2014 11:57:16]      CAUAJM_E_10229 Communication attempt with the CA WAAE Agent has failed! [liuze01-U105187:7520]
[09/03/2014 11:57:16]      CAUAJM_W_40290 Machine <liuze01-U105187> is in question. Placing machine in the unqualified state.
[09/03/2014 11:57:16]      CAUAJM_E_40157 System Restart - Job [test_liuze01-U105187] was unable to start.
[09/03/2014 11:57:16]      CAUAJM_I_40245 EVENT: ALARM            ALARM: STARTJOBFAIL     JOB: test_liuze01-U105187 MACHINE: liuze01-U105187
[09/03/2014 11:57:16]      <COMM_ERR_5 Communication attempt with Agent on machine [liuze01-U105187] has failed.>
[09/03/2014 11:57:26]      CAUAJM_I_40245 EVENT: CHANGE_STATUS    STATUS: RESTART         JOB: test_liuze01-U105187 MACHINE: liuze01-U105187
[09/03/2014 11:57:26]      <System Restart - Job [test_liuze01-U105187] was unable to start.>
[09/03/2014 11:57:26]      CAUAJM_I_40109 Scheduled [test_liuze01-U105187 144.9008.1] due to RESTART event.
[09/03/2014 11:57:41]      CAUAJM_I_40245 EVENT: STARTJOB         JOB: test_liuze01-U105187
[09/03/2014 11:57:41]      <Scheduled due to RESTART event.>
[09/03/2014 11:57:41]      CAUAJM_I_40188 Pending job <test_liuze01-U105187> due to offline machine(s).
[09/03/2014 11:57:49]      CAUAJM_W_40253 Machine <liuze01-U105187> is not responding.  Taking offline.
[09/03/2014 11:57:49]      CAUAJM_I_40245 EVENT: ALARM            ALARM: MACHINE_UNAVAILABLE MACHINE: liuze01-U105187
[09/03/2014 11:57:49]      <Machine <liuze01-U105187> did not respond.>
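When a scheduler log is long, the messages shown in the examples can be picked out with a short script. The sketch below is not a CA-supplied tool. It assumes that each log entry sits on a single line and that the message texts match the excerpts in this document exactly; the function name scan_scheduler_log and the regular expressions are written for illustration only.

```python
import re

# Patterns copied from the message text in the excerpts in this document.
NOT_RESPONDING = re.compile(r"CAUAJM_W_40253 Machine <([^>]+)> is not responding")
MACH_OFFLINE = re.compile(r"CAUAJM_I_40245 EVENT: MACH_OFFLINE\s+MACHINE:\s*(\S+)")
PENDING = re.compile(r"CAUAJM_I_40188 Pending job <([^>]+)> due to offline")


def scan_scheduler_log(lines):
    """Return (machines taken offline, jobs left pending), in order of first appearance."""
    offline_machines, pending_jobs = [], []
    checks = (
        (NOT_RESPONDING, offline_machines),
        (MACH_OFFLINE, offline_machines),
        (PENDING, pending_jobs),
    )
    for line in lines:
        for pattern, bucket in checks:
            match = pattern.search(line)
            if match and match.group(1) not in bucket:
                bucket.append(match.group(1))
    return offline_machines, pending_jobs


sample = [
    "[09/03/2014 11:57:41]      CAUAJM_I_40188 Pending job <test_liuze01-U105187> due to offline machine(s).",
    "[09/03/2014 11:57:49]      CAUAJM_W_40253 Machine <liuze01-U105187> is not responding.  Taking offline.",
]
print(scan_scheduler_log(sample))
# (['liuze01-U105187'], ['test_liuze01-U105187'])
```

The same function also catches the MACH_OFFLINE event line from Example 1, so it reports machines taken offline by either route.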


When a machine is placed offline manually, autorep shows its status as Offline. Jobs that are due to start on that machine, based on start_times or other job conditions, are placed in Pending status. However, autoping to the machine succeeds if the agent can respond to the scheduler and application server. Jobs can only be started by a FORCE_STARTJOB event. The scheduler does not bring the machine online automatically while it is in Offline status.

Example 3: A machine is placed offline manually.

[09/03/2014 17:56:11.8486] 13803 3945769872 CAUAJM_I_40245 EVENT: MACH_OFFLINE     MACHINE: liuze01-U109361
.......

[09/03/2014 17:56:11.8515] 13803 3945769872 L:128 MStatHdlr.cpp 134 DBQ db2_send_sql update ujo_machine set mach_status = 'm' where mach_name = 'liuze01-U109361' and ((type != 'w' and type != 'v') or ((type = 'v' or type = 'w') and parent_name != ' '))
  
When jobs that should have started stay in Pending status, the first thing I check is the status of the machine specified in the job definition. If the machine is in Offline status, "sendevent -E MACH_ONLINE -n <machine_name>" is needed. If the machine is "Unqualified" or "Missing", check whether the agent is active and whether the connection between the agent and the server is working.
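The check described above can be written down as a small lookup, which may be handy in a runbook or monitoring script. This is a minimal sketch: the function name is hypothetical, the status is assumed to be passed in as the word shown by autorep, and the fallback message for any other status is an assumption that goes beyond what this document covers.

```python
def pending_job_next_step(machine_status: str, machine_name: str) -> str:
    """Suggest the next step for a Pending job, based on its machine's status."""
    status = machine_status.strip().lower()
    if status == "offline":
        # Manually offlined machines are not brought back automatically.
        return f"sendevent -E MACH_ONLINE -n {machine_name}"
    if status in ("missing", "unqualified"):
        # The scheduler recovers these on its own once the agent answers again.
        return "check that the agent is active and that agent/server communication works"
    # Assumption: any other status is outside the cases described in this document.
    return "machine status does not explain the Pending job"


print(pending_job_next_step("Offline", "liuze01-U109361"))
# sendevent -E MACH_ONLINE -n liuze01-U109361
```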
