A machine can be placed offline automatically or manually.
A machine is placed offline automatically when the agent running on the machine is shut down,
or when the scheduler tries to start a job on that machine but gets no response from the
agent running on it. A machine is placed offline manually by a MACH_OFFLINE event, usually
sent with the sendevent command.
When a machine is placed offline automatically, its status in autorep output is shown as
Missing. In this situation, autoping to the machine will fail and jobs that are supposed to
start on the machine at that time are placed in Pending status. FORCE_STARTJOB will not work.
The machine will be brought back online automatically when the scheduler is able to get a
response from the agent again.
Example 1: A machine is placed offline and its status is set to Missing when the agent is
shut down:
[09/02/2014 11:01:36.7973] 13803 3943279504 CAUAJM_I_40245 EVENT: MACH_OFFLINE MACHINE:
liuze01-U109361
[09/02/2014 11:01:36.7983] 13803 3943279504 <Detected the shutdown of agent on machine
<liuze01-U109361>. Automatically placing the machine offline.>
[09/02/2014 11:01:36.8032] 13803 3943279504 L:128 MStatHdlr.cpp 134 DBQ db2_send_sql update
ujo_machine set mach_status = 'a' where mach_name = 'liuze01-U109361' and ((type != 'w' and
type != 'v') or ((type = 'v' or type = 'w') and parent_name != ' ')) and mach_status != 'm'
Note that, based on the SQL statement in the trace above, the machine cannot be changed to
Missing status if it is already in Offline status (mach_status = 'm').
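The effect of that last condition in the WHERE clause can be reproduced in isolation. The following is a minimal sketch using Python's built-in sqlite3 module and a cut-down stand-in for the ujo_machine table. The column set, the machine names, the type value 'a', and the 'o' code for an online machine are assumptions made for illustration and are not taken from the product schema. Only the shape of the UPDATE statement comes from the trace.

```python
import sqlite3

# Cut-down, hypothetical stand-in for ujo_machine (not the real schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ujo_machine ("
    "mach_name TEXT, type TEXT, parent_name TEXT, mach_status TEXT)"
)
conn.executemany(
    "INSERT INTO ujo_machine VALUES (?, ?, ?, ?)",
    [
        ("host_online", "a", " ", "o"),  # 'o' for online is an assumed code
        ("host_manual", "a", " ", "m"),  # 'm' = placed offline manually
    ],
)

# Same shape as the statement in the trace, with the machine name bound
# as a parameter.
AUTO_OFFLINE_SQL = (
    "UPDATE ujo_machine SET mach_status = 'a' "
    "WHERE mach_name = ? "
    "AND ((type != 'w' AND type != 'v') "
    "OR ((type = 'v' OR type = 'w') AND parent_name != ' ')) "
    "AND mach_status != 'm'"
)


def auto_offline(name):
    """Apply the automatic-offline update; return the number of rows changed."""
    return conn.execute(AUTO_OFFLINE_SQL, (name,)).rowcount


def status_of(name):
    """Return the current mach_status for one machine."""
    row = conn.execute(
        "SELECT mach_status FROM ujo_machine WHERE mach_name = ?", (name,)
    ).fetchone()
    return row[0]


changed_online = auto_offline("host_online")  # row matches and is updated
changed_manual = auto_offline("host_manual")  # filtered out by mach_status != 'm'
```

In this sketch the update touches the online machine but leaves the manually offlined one unchanged, which matches the behavior described in the note above.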
Example 2: A machine that is in online status is taken offline when a job cannot be started
on it. In this case the agent cannot respond to the server because of a firewall setting.
[09/03/2014 11:57:00] ----------------------------------------
[09/03/2014 11:57:15] CAUAJM_I_40245 EVENT: STARTJOB JOB: test_liuze01-U105187
[09/03/2014 11:57:15] CAUAJM_I_40245 EVENT: CHANGE_STATUS STATUS: STARTING JOB:
test_liuze01-U105187 MACHINE: liuze01-U105187
[09/03/2014 11:57:16] CAUAJM_E_10229 Communication attempt with the CA WAAE Agent has
failed! [liuze01-U105187:7520]
[09/03/2014 11:57:16] CAUAJM_W_40290 Machine <liuze01-U105187> is in question. Placing
machine in the unqualified state.
[09/03/2014 11:57:16] CAUAJM_E_40157 System Restart - Job [test_liuze01-U105187] was
unable to start.
[09/03/2014 11:57:16] CAUAJM_I_40245 EVENT: ALARM ALARM: STARTJOBFAIL JOB:
test_liuze01-U105187 MACHINE: liuze01-U105187
[09/03/2014 11:57:16] <COMM_ERR_5 Communication attempt with Agent on machine [liuze01-
U105187] has failed.>
[09/03/2014 11:57:26] CAUAJM_I_40245 EVENT: CHANGE_STATUS STATUS: RESTART JOB:
test_liuze01-U105187 MACHINE: liuze01-U105187
[09/03/2014 11:57:26] <System Restart - Job [test_liuze01-U105187] was unable to start.>
[09/03/2014 11:57:26] CAUAJM_I_40109 Scheduled [test_liuze01-U105187 144.9008.1] due to
RESTART event.
[09/03/2014 11:57:41] CAUAJM_I_40245 EVENT: STARTJOB JOB: test_liuze01-U105187
[09/03/2014 11:57:41] <Scheduled due to RESTART event.>
[09/03/2014 11:57:41] CAUAJM_I_40188 Pending job <test_liuze01-U105187> due to offline
machine(s).
[09/03/2014 11:57:49] CAUAJM_W_40253 Machine <liuze01-U105187> is not responding. Taking
offline.
[09/03/2014 11:57:49] CAUAJM_I_40245 EVENT: ALARM ALARM: MACHINE_UNAVAILABLE
MACHINE: liuze01-U105187
[09/03/2014 11:57:49] <Machine <liuze01-U105187> did not respond.>
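When scanning a long scheduler log for this sequence, it can help to filter on the message codes that mark each step. The following is a rough sketch, not a supported tool: the function name, the regular expression, and the short labels are my own, and the labels are only my reading of the message text in the example above.

```python
import re

# Codes seen in Example 2, with short labels paraphrased from the messages.
MACHINE_STATE_CODES = {
    "CAUAJM_W_40290": "unqualified",
    "CAUAJM_I_40188": "job pending",
    "CAUAJM_W_40253": "offline",
}

# Timestamp in brackets, optional process/thread ids (as in Example 1),
# then the message code.
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?:\d+\s+\d+\s+)?(?P<code>CAUAJM_[IWE]_\d+)\b"
)


def machine_events(lines):
    """Return (timestamp, code, label) for lines carrying a tracked code."""
    events = []
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group("code") in MACHINE_STATE_CODES:
            code = match.group("code")
            events.append((match.group("ts"), code, MACHINE_STATE_CODES[code]))
    return events


sample = [
    "[09/03/2014 11:57:15] CAUAJM_I_40245 EVENT: STARTJOB JOB: test_liuze01-U105187",
    "[09/03/2014 11:57:16] CAUAJM_W_40290 Machine <liuze01-U105187> is in question. Placing",
    "[09/03/2014 11:57:41] CAUAJM_I_40188 Pending job <test_liuze01-U105187> due to offline",
    "[09/03/2014 11:57:49] CAUAJM_W_40253 Machine <liuze01-U105187> is not responding. Taking",
]
events = machine_events(sample)
```

Only the first physical line of each wrapped log entry carries the timestamp and code, so continuation lines are skipped.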
When a machine is placed offline manually, its status is shown as Offline in the output of
autorep for that machine. Jobs that are supposed to start on that machine based on
start_times or other job conditions will be placed in Pending status. However, autoping to
the machine will be successful if the agent can respond to the scheduler and application
server. Jobs can only be started by a FORCE_STARTJOB event. The scheduler will not bring the
machine online automatically if the machine is in Offline status.
Example 3: A machine is placed offline manually.
[09/03/2014 17:56:11.8486] 13803 3945769872 CAUAJM_I_40245 EVENT: MACH_OFFLINE MACHINE:
liuze01-U109361
.......
[09/03/2014 17:56:11.8515] 13803 3945769872 L:128 MStatHdlr.cpp 134 DBQ db2_send_sql update
ujo_machine set mach_status = 'm' where mach_name = 'liuze01-U109361' and ((type != 'w' and
type != 'v') or ((type = 'v' or type = 'w') and parent_name != ' '))
When jobs that are supposed to start running stay in Pending status, the first thing I
check is the status of the machine specified in the job definition. If the machine is in
Offline status, "sendevent -E MACH_ONLINE -n <machine_name>" is needed. If the machine is
"Unqualified" or "Missing", check whether the agent is active and whether communication
between the agent and the server is working.
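The check described above can be written down as a small decision helper. This is only a sketch of my own troubleshooting order: the function name is hypothetical, and the status strings are assumed to be passed in as they appear in autorep output.

```python
def pending_job_next_step(machine_status, machine_name):
    """Suggest the next step for a Pending job, given the machine status."""
    status = machine_status.strip().lower()
    if status == "offline":
        # Manually offlined machines are never brought back automatically.
        return f"sendevent -E MACH_ONLINE -n {machine_name}"
    if status in ("missing", "unqualified"):
        # The scheduler brings the machine back once the agent responds.
        return "check that the agent is active and can communicate with the server"
    # Assumption: any other status means the machine is not the cause.
    return "machine status does not explain the Pending job; check other job conditions"
```

For example, pending_job_next_step("Offline", "liuze01-U109361") returns the sendevent command to run, while "Missing" or "Unqualified" points to the agent and the network path instead.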