restarter.c revision dfe5735016dd804901e28c9585549f5aa15bb63b
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 *
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 *
 * Copyright 2007 Sun Microsystems, Inc. All rights reserved.
 * Use is subject to license terms.

#pragma ident "%Z%%M% %I% %E% SMI"

 * This component manages services whose restarter is svc.startd, the standard
 * restarter. It translates restarter protocol events from the graph engine
 * into actions on processes, as a delegated restarter would do.
 *
 * The master restarter manages a number of always-running threads:
 * - restarter event thread: events from the graph engine
 * - timeout thread: thread to fire queued timeouts
 * - contract thread: thread to handle contract events
 * - wait thread: thread to handle wait-based services
 *
 * The other threads are created as-needed:
 * - per-instance method threads
 * - per-instance event processing threads
 *
 * The interaction of all threads must result in the following conditions
 * being satisfied (on a per-instance basis):
 * - restarter events must be processed in order
 * - method execution must be serialized
 * - instance delete must be held until outstanding methods are complete
 * - contract events shouldn't be processed while a method is running
 * - timeouts should fire even when a method is running
 *
 * Service instances are represented by restarter_inst_t's and are kept in the
 * instance_list list.
 *
 * The current state of a service instance is kept in
 * restarter_inst_t->ri_i.i_state. If transition to a new state could take
 * some time, then before we effect the transition we set
 * restarter_inst_t->ri_i.i_next_state to the target state, and afterwards we
 * rotate i_next_state to i_state and set i_next_state to
 * RESTARTER_STATE_NONE. So usually i_next_state is _NONE when ri_lock is not
 * held. The exception is when we launch methods, which are done with
 * a separate thread.
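The two-phase state update described in the comment above (set i_next_state, perform the transition, then rotate i_next_state into i_state) can be illustrated with a minimal sketch. This is not the svc.startd implementation: the enum values, the `inst_states_t` type, and all three function names are hypothetical stand-ins, and the locking around ri_lock is left out.

```c
#include <assert.h>

/* Hypothetical stand-ins for the RESTARTER_STATE_* constants. */
typedef enum {
	STATE_NONE = 0,
	STATE_OFFLINE,
	STATE_ONLINE
} sketch_state_t;

/* Hypothetical, trimmed-down stand-in for the ri_i state fields. */
typedef struct {
	sketch_state_t i_state;		/* current state */
	sketch_state_t i_next_state;	/* target state, NONE when idle */
} inst_states_t;

/* Record the target state before a potentially slow transition starts. */
static void
begin_transition(inst_states_t *s, sketch_state_t target)
{
	assert(s->i_next_state == STATE_NONE);
	s->i_next_state = target;
}

/* Rotate i_next_state into i_state and reset i_next_state to NONE. */
static void
finish_transition(inst_states_t *s)
{
	s->i_state = s->i_next_state;
	s->i_next_state = STATE_NONE;
}

/* Returns 1 if a transition is pending, 0 if not. */
static int
in_transition(const inst_states_t *s)
{
	return (s->i_next_state != STATE_NONE);
}
```

In this sketch a caller would invoke `begin_transition()` before doing the slow work and `finish_transition()` afterwards; any observer that sees a non-NONE `i_next_state` knows the instance is mid-transition.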
 * To keep any other threads from grabbing ri_lock before
 * method_thread() does, we set ri_method_thread to the thread id of the
 * method thread, and when it is nonzero any thread with a different thread id
 * waits on ri_method_cv.
 *
 * Method execution is serialized by blocking on ri_method_cv in
 * inst_lookup_by_id() and waiting for a 0 value of ri_method_thread. This
 * also prevents the instance structure from being deleted until all
 * outstanding operations such as method_thread() have finished.
 *
 * dgraph_lock [can be held when taking:]
 * dictionary->dict_lock
 * ru->restarter_update_lock
 * restarter_queue->rpeq_lock
 * instance_list.ril_lock
 * st->st_configd_live_lock
 * instance_list.ril_lock
 * graph_queue->gpeq_lock
 * st->st_configd_live_lock
 * dictionary->dict_lock
 * graph_queue->gpeq_lock
 * inst->ri_queue_lock
 * single_user_thread_lock
 * logbuf_mutex nests inside pretty much everything.

 * Fails with ECONNABORTED or ECANCELED.
uu_die(
"Insufficient privilege.\n");
uu_die(
"Repository backend access denied.\n");
 * int restarter_insert_inst(scf_handle_t *, char *)
 * If the inst is already in the restarter list, return its id. If the inst
 * is not in the restarter list, initialize a restarter_inst_t, initialize its
 * states, insert it into the list, and return 0.
 * ENOENT - name is not in the repository

 * We don't use inst_lookup_by_name() here because we want the lookup

/* Allocate an instance */

 * id shouldn't be -1 since we use the same dictionary as graph.c, but

 * If there's no running snapshot, then we execute using the editing
 * snapshot. Pending snapshots will be taken later.

 * If the restarter group is missing, use uninit/none. Otherwise,
 * we're probably being restarted & don't want to mess up the states

 * This shouldn't happen since the graph engine should
 * there was no restarter pg. In case somebody

 * Force next_state to _NONE since we
 * don't look for method processes.

 * Inform the restarter of our state without
 * changing the STIME in the repository.

 * This is odd, because the graph engine should have required
 * the general property group. So we'll just use default
 * flags in anticipation of the graph engine sending us
 * REMOVE_INSTANCE when it finds out that the general property
 * group has been deleted.

/* No more failures we live through, so add it to the list. */

 * Implication: if we can't reregister the
 * instance, we will start another one. Two
 * instances may or may not result in a resource
"%s: couldn't reregister %ld for wait\n",
 * Leading PID has exited.

 * Must drop the instance lock so we can pick up the instance_list
 * lock & remove the instance.

 * We can lock the instance without holding the instance_list lock
 * since we removed the instance from the list.

 * instance_is_wait_style()
 * Returns 1 if the given instance is a "wait-style" service instance.

 * instance_is_transient_style()
 * Returns 1 if the given instance is a transient service instance.

 * instance_in_transition()
 * Returns 1 if instance is in transition, 0 if not

 * returns 1 if instance is already started, 0 if not

 * ECONNRESET - success, but h was rebound

/* Like startd_alloc(). */
uu_die(
"Insufficient memory.\n");
 * This is where we'd check inst->ri_method_thread and if it
 * were nonzero we'd wait in anticipation of another thread
 * executing a method for inst. Doing so with the instance_list
 * locked, though, leads to deadlock. Since taking a snapshot
 * during that window won't hurt anything, we'll just continue.

 * Stop the instance identified by the instance given as the second argument,
 * -1 - inst is in transition
cp =
"all processes in service exited";
cp =
"process dumped core";
cp =
"process received fatal signal from outside the service";
cp =
"process killed due to uncorrectable hardware error";
cp =
"dependency activity requires stop";
cp =
"service restarting";
/* Services in the disabled and maintenance state are ignored */
/* Already stopped instances are left alone */
/* requeue event by returning -1 */
"Restarter: Not stopping %s, in transition.\n",
 * No need to stop instance, as child has exited; remove
 * contract and move the instance to the offline state.

 * ENOENT - fmri is not in instance_list
 * ECONNRESET - success, though handle was rebound
 * -1 - instance is in transition
"Ignoring maintenance off command because %s is not in the "
cp =
"disable requested";
 * If we did ADMIN_MAINT_ON_IMMEDIATE, then there might still be

/* Must have been deleted. */
uu_die(
"Insufficient memory.\n");
"Could not remove contract id %lu for %s (%s).\n",
ctid,
 * Set inst->ri_i.i_enabled. Expects 'e' to be _ENABLE, _DISABLE, or
 * _ADMIN_DISABLE. If the event is _ENABLE and inst is uninitialized or
 * disabled, move it to offline. If the event is _DISABLE or
 * _ADMIN_DISABLE, make sure inst will move to disabled.
 * ECONNRESET - h was rebound

 * B_FALSE: Don't log an error if the log_instance()
 * fails because it will fail on the miniroot before
 * install-discovery runs.
"Not changing state of %s for enable command.\n",
/* B_FALSE: See log_instance(..., "Enabled."); above */

 * We only want to pull the instance out of maintenance
 * if the disable is on administrative request. The
 * graph engine sends _DISABLE events whenever a
 * service isn't in the disabled state, and we don't
 * want to pull the service out of maintenance if,
 * for example, it is there due to a dependency cycle.

/* Services in the disabled and maintenance state are ignored */
/* Already started instances are left alone */
"%s: start_instance -> is already started\n",
/* Services in the maintenance state are ignored */
"%s: maintain_instance -> is already in maintenance\n",
/* Must have been deleted. */
/* Succeed in anticipation of REMOVE_INSTANCE. */
bad_error(
"libscf_get_startd_properties", r);
/* Refresh does not change the state. */

const char *event_names[] = {
	"INVALID",
	"ADD_INSTANCE",
	"REMOVE_INSTANCE",
	"ENABLE",
	"DISABLE",
	"ADMIN_DEGRADED",
	"ADMIN_REFRESH",
	"ADMIN_RESTART",
	"ADMIN_MAINT_OFF",
	"ADMIN_MAINT_ON",
	"ADMIN_MAINT_ON_IMMEDIATE",
	"STOP",
	"START",
	"DEPENDENCY_CYCLE",
	"INVALID_DEPENDENCY",
	"ADMIN_DISABLE"
};

 * void *restarter_process_events()
 * Called in a separate thread to process the events on an instance's
 * queue. Empties the queue completely, and tries to keep the thread
 * around for a little while after the queue is empty to save on

/* grab the queue lock */
/* drop the queue lock */

 * Grab the inst lock -- this waits until any outstanding
 * method finishes running.

/* Getting deleted in the middle isn't an error. */
"%s command (for %s) unimplemented.\n",
"Not restarting %s; not running.\n",
 * Stop the instance. If it can be restarted,
 * the graph engine will send a new event.
uu_warn(
"%s:%d: Bad restarter event %d. "

/* grab the queue lock */

 * Try to preserve the thread for a little while for future use.

 * void *restarter_event_thread()
 * Handle incoming graph events by placing them on a per-instance
 * queue. We can't lock the main part of the instance structure, so
 * just modify the separately locked event queue portion.

 * This is a new thread, and thus, gets its own handle

 * ADD_INSTANCE is special: there's likely no
 * instance structure yet, so we need to handle the
 * addition synchronously.

 * Lookup the instance, locking only the event queue.
 * Can't grab ri_lock here because it might be held
 * by a long-running method.
"Ignoring %s command for unknown service "

/* Keep ADMIN events from filling up the queue. */
"queue overflow. Dropping administrative "
"queue overflow. Dropping administrative "

/* Now add the event to the instance queue. */

 * Start a thread if one isn't already

 * Signal the existing thread that there's

 * Unreachable for now -- there's currently no graceful cleanup

 * Since ri_lock isn't held by the contract id lookup, this
 * instance may have been restarted and now be in a new
 * contract, making the old contract no longer valid for this

 * Take action on contract events.

 * If startd has stopped this contract, there is no need to

 * There shouldn't be other events, since that's not how we set
 * the terms. Thus, just log an error and drive on.
"%s: contract %ld received unexpected critical event "

 * We ignore all events; if they impact the
 * process we're monitoring, then the
 * wait_thread will stop the instance.
"%s: ignoring contract event on wait-style service\n",
 * A CT_PR_EV_EMPTY event is an RSTOP_EXIT request.

 * void *restarter_contract_event_thread(void *)
 * Listens to the process contract bundle for critical events, taking action
 * on events from contracts we know we are responsible for.

 * Await graph load completion. That is, stop here, until we've scanned
 * the repository for contract - instance associations.

 * This is a new thread, and thus, gets its own handle
uu_die(
"Unable to bind a new repository handle: %s\n",
uu_die(
"process bundle open failed");
 * Make sure we get all events (including those generated by configd
 * before this thread was started).
"Error reading next contract event: %s",
 * svc.configd(1M) restart handling performed by the
 * fork_configd_thread. We don't acknowledge, as that thread

 * This can happen for two reasons:
 * - method_run() has not yet stored
 * the contract into the internal hash table.
 * - we receive an EMPTY event for an abandoned

 * If there is any contract in the process of
 * being stored into the hash table then re-read
"Reset event %d for unknown "

 * Do not call contract_to_inst() again if first

 * This can happen if we receive an EMPTY
 * event for an abandoned contract.
"Received event %d for unknown contract id "
"Received event %d for contract id "

 * Timeout queue, processed by restarter_timeouts_event_thread().
"svc.startd. Using infinite timeout");
 * If we overflow LLONG_MAX, we're never timing out anyways, so
"treating as infinite.");
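The overflow guard mentioned here, together with the seconds-to-nanoseconds conversion that the timeout code performs, can be sketched as follows. This is only an illustration under stated assumptions: `timeout_deadline()` and `SKETCH_NANOSEC` are hypothetical names, `now_ns` stands in for a non-negative hrtime-style timestamp, and LLONG_MAX is used as the "infinite" marker.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical constant: nanoseconds per second. */
#define	SKETCH_NANOSEC	1000000000LL

/*
 * Convert a timeout in seconds to an absolute deadline in nanoseconds.
 * now_ns is assumed to be non-negative. If either the multiplication or
 * the addition would overflow LLONG_MAX, the timeout is treated as
 * infinite and LLONG_MAX is returned.
 */
static long long
timeout_deadline(long long now_ns, unsigned long long timeout_sec)
{
	long long delta;

	/* Would timeout_sec * SKETCH_NANOSEC overflow? */
	if (timeout_sec > (unsigned long long)(LLONG_MAX / SKETCH_NANOSEC))
		return (LLONG_MAX);

	delta = (long long)timeout_sec * SKETCH_NANOSEC;

	/* Would now_ns + delta overflow? */
	if (now_ns > LLONG_MAX - delta)
		return (LLONG_MAX);

	return (now_ns + delta);
}
```

Checking the bounds before doing the arithmetic avoids relying on signed overflow, which is undefined behavior in C.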
/* hrtime is in nanoseconds. Convert timeout_sec. */
/* Insert the calculated timeout time onto the queue. */

 * Walk through the (sorted) timeouts list. While the timeout
 * at the head of the list is <= the current time, kill the
"Method or service exit timed out. Killing contract %ld",
 * void *restarter_timeouts_event_thread(void *)
 * Responsible for monitoring the method timeouts. This thread must
 * be started before any methods are called.
 * Timeouts are entered on a priority queue, which is processed by
 * this thread. As timeouts are specified in seconds, we'll do
 * the necessary processing every second, as long as the queue

 * As long as the timeout list isn't empty, process it

/* The list is empty, wait until we have more timeouts. */
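The comments above describe a sorted timeout list whose head entries are fired while their deadline is at or before the current time. A minimal single-threaded sketch of that idea follows. It is not the svc.startd data structure: the `timeout_entry_t` type, its `te_*` fields, and both function names are hypothetical, the expiry callback stands in for killing the contract, and the queue locking and once-per-second wakeup are omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical queue entry: an absolute deadline and a contract id. */
typedef struct timeout_entry {
	long long		te_deadline;	/* absolute time, ns */
	long			te_ctid;	/* contract to act on */
	struct timeout_entry	*te_next;
} timeout_entry_t;

/* Insert e so that the list stays sorted by ascending deadline. */
static void
timeout_insert_sorted(timeout_entry_t **head, timeout_entry_t *e)
{
	timeout_entry_t **pp = head;

	while (*pp != NULL && (*pp)->te_deadline <= e->te_deadline)
		pp = &(*pp)->te_next;
	e->te_next = *pp;
	*pp = e;
}

/*
 * While the entry at the head of the list has a deadline <= now,
 * unlink it and invoke on_expire (if non-NULL) with its contract id.
 * Returns the number of entries fired.
 */
static int
timeout_fire_expired(timeout_entry_t **head, long long now,
    void (*on_expire)(long ctid))
{
	int fired = 0;

	while (*head != NULL && (*head)->te_deadline <= now) {
		timeout_entry_t *e = *head;

		*head = e->te_next;
		e->te_next = NULL;
		if (on_expire != NULL)
			on_expire(e->te_ctid);
		fired++;
	}
	return (fired);
}
```

Because the list is kept sorted on insert, the walk can stop at the first entry whose deadline is still in the future.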