exec.c revision dc32d872cbeb56532bcea030255db9cd79bac7da
#define MAC_FLAGS 0x10 /* need to adjust MAC flags */

/*
 * exece() - system call wrapper around exec_common()
 */

long execsz; /* temporary count of exec size */

/*
 * exec() is not supported for the /proc agent lwp.
 */

/*
 * Brand actions are not supported for processes that are not
 * running in a branded zone.
 */

/* Only branded processes can be unbranded */

/* Only unbranded processes can be branded */

/*
 * If this is a native zone, or if the process is already
 * branded, then we don't need to do anything. If this is
 * a native process in a branded zone, we need to brand the
 * process as it exec()s the new binary.
 */

/*
 * Inform /proc that an exec() has started.
 * Hold signals that are ignored by default so that we will
 * not be interrupted by a signal that will be ignored after
 * successful completion of gexec().
 */

/*
 * Look up path name and remember last component for later.
 * To help coreadm expand its %d token, we attempt to save
 * the directory containing the executable in p_execdir. The
 * first call to lookuppn() may fail and return EINVAL because
 * dirvpp is non-NULL. In that case, we make a second call to
 * lookuppn() with dirvpp set to NULL; p_execdir will be NULL,
 * but coreadm is allowed to expand %d to the empty string and
 * there are other cases in which that failure may occur.
 */

/*
 * We do not allow executing files in attribute directories.
 * We test this by determining whether the resolved path
 * contains a "/" when we're in an attribute directory;
 * only if the pathname does not contain a "/" does the resolved
 * path point to a file in the current working (attribute) directory.
 */

/* don't free resolvepn until we are done with args */

/*
 * If we're running in a profile shell, then call pfexecd.
 */

/* Returning errno in case we're not allowed to execute. */

/* Don't change the credentials when using old ptrace. */

/*
 * Specific exec handlers, or policies determined via
 * /etc/system may override the historical default.
 */

/* If necessary, brand this process before we start the exec. */

/*
 * Free floating point registers (sun4u only)
 */

/*
 * Free thread and process context ops.
 */
/*
 * Remember file name for accounting; clear any cached DTrace predicate.
 */

/*
 * Clear contract template state
 */

/*
 * Save the directory in which we found the executable for expanding
 * the %d token used in core file patterns.
 */

/*
 * Reset stack state to the user stack, clear set of signals
 * caught on the signal stack, and reset list of signals that
 * restart system calls; the new program's environment should
 * not be affected by detritus from the old program. Any
 * pending held signals remain held, so don't clear t_hold.
 */

/*
 * Make saved resource limit == current resource limit.
 */

/*
 * If the action was to catch the signal, then the action
 * must be reset to SIG_DFL.
 */

/*
 * Close all close-on-exec files.
 */

/* Unbrand ourself if necessary. */

/* Mark this as an executable vnode */

/*
 * Allocate a new lwp directory and lwpid hash table if necessary.
 */

/*
 * Reset lwp id to the default value of 1.
 * This is a single-threaded process now
 * and lwp #1 is lwp_wait()able by default.
 * The t_unpark flag should not be inherited.
 */

/*
 * Install the newly-allocated lwp directory and lwpid hash table
 * and insert the current thread into the new hash table.
 */

/*
 * Restore the saved signal mask and
 * inform /proc that the exec() has finished.
 */

/*
 * Perform generic exec duties and switchout to object-file specific
 */

/*
 * If the SNOCD or SUGID flag is set, turn it off and remember the
 * previous setting so we can restore it if we encounter an error.
 */

/* need to open vnode for stateful file systems */

/*
 * Note: to support binary compatibility with SunOS a.out
 * executables, we read in the first four bytes, as the
 * magic number is in bytes 2-3.
 */

/* Pfcred is a credential with a ref count of 1 */

/* If we can, drop the PA bit */

/*
 * Implement the privilege updates:
 * But if running under ptrace, we cap I and F with P.
 */
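The privilege-update comment above names the sets I, F, and P but the listing does not show the update rule itself. The following is only a sketch under a stated assumption: that the rule is I' = I & L and E' = P' = (I' | F) & A, with I and F first capped by P when the process is under ptrace. The bitmask type `privset_t`, the struct `privs_t`, and the function `exec_priv_update()` are hypothetical names for illustration and are not the kernel's interfaces.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: one bit per privilege. */
typedef uint64_t privset_t;

/* Hypothetical credential holding the E, I, P and L sets. */
typedef struct privs {
	privset_t e;	/* effective */
	privset_t i;	/* inheritable */
	privset_t p;	/* permitted */
	privset_t l;	/* limit */
} privs_t;

/*
 * Assumed rule (not shown in the listing):
 *	I' = I & L
 *	E' = P' = (I' | F) & A
 * If ptraced, I and F are first capped with the old P.
 */
static void
exec_priv_update(privs_t *c, privset_t f, privset_t a, int ptraced)
{
	if (ptraced) {
		c->i &= c->p;	/* cap I with P */
		f &= c->p;	/* cap F with P */
	}
	c->i &= c->l;			/* restrict with L */
	c->e = c->p = (c->i | f) & a;	/* new E and P */
}
```

Applying the cap before P is overwritten matters in this sketch, since the cap is meant to use the permitted set the process had before the exec.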
/* pfcred is not forced to adhere to these settings */

/* fallback to mountpoint if a path can't be found */

"!uid %d: setuid execution not allowed, "
"!uid %d: setuid execution not allowed, "

/* zone_rootpath always has trailing / */

"setuid execution not allowed, file=%s%s",
"setuid execution not allowed, fs=%s, "

/*
 * execsetid() told us whether or not we had to change the
 * credentials of the process. In privflags, it told us
 * whether we gained any privileges or executed a set-uid executable.
 */

/*
 * Use /etc/system variable to determine if the stack
 * should be marked as executable by default.
 */

/*
 * Traditionally, the setid flags told the sub processes whether
 * the file just executed was set-uid or set-gid; this caused
 * some confusion as the 'setid' flag did not match the SUGID
 * process flag which is only set when the uids/gids do not match.
 * /dev/fd/X but an executable would happily trust LD_LIBRARY_PATH.
 * Now we flag those cases where the calling process cannot
 * be trusted to influence the newly exec'ed process, either
 * because it runs with more privileges or when the uids/gids
 * This also makes the runtime linker agree with the on exec
 * values of SNOCD and SUGID.
 */

/*
 * If this process's p_exec has been set to the vp of
 * the executable by exec_func, we will return without
 * calling VOP_CLOSE because proc_exit will close it
 */

/*
 * Close the previous executable only if we are
 */

/*
 * Free the old credentials, and set the new ones.
 * Do this for both the process and the (single) thread.
 */

/*
 * DTrace accesses t_cred in probe context. t_cred
 * must always be either NULL, or point to a valid,
 * allocated cred structure.
 */

"privilege removed from E/I", fn, pid);
/*
 * On emerging from a successful exec(), the saved
 * uid and gid equal the effective uid and gid.
 */

/*
 * If the real and effective ids do not match, this
 * is a setuid process that should not dump core.
 * The group comparison is tricky; we prevent the code
 * from flagging SNOCD when executing with an effective gid
 * which is a supplementary group.
 */

/* Note that the process remains in the same zone. */

/*
 * If process is traced via /proc, arrange to
 * invalidate the associated /proc vnode.
 */

/*
 * Set the magic number last so that we
 * don't need to hold the execsw_lock in
 */

/*
 * Find the exec switch table entry with the corresponding magic string.
 */

/*
 * Find the execsw[] index for the given exec header string by looking for the
 * magic string at a specified offset and length for each kind of executable
 * file format until one matches. If no execsw[] entry is found, try to
 * autoload a module for this magic string.
 */

return (NULL); /* couldn't find the type */

/*
 * Find the execsw[] index for the given magic string. If no execsw[] entry
 * is found, try to autoload a module for this magic string.
 */

return (NULL);
/* couldn't find the type */

/* Will try to reset the PRIV_AWARE bit later. */

/*
 * If it's a set-uid root program we perform the
 * forced privilege look-aside. This has three possible
 * outcomes:
 *	no look aside information -> treat as before
 *	look aside in Limit set -> apply forced privs
 *	look aside not in Limit set -> ignore set-uid root
 * Ordinary set-uid root execution only allowed if the limit
 * set holds all unsafe privileges.
 */

/*
 * Do we need to change our credential anyway?
 * This is the case when E != I or P != I, as
 * we need to do the assignments (with F empty and A full)
 * Or when I is not a subset of L; in that case we need to
 */

/* Child has more privileges than parent */

/* If MAC-aware flag(s) are on, need to update cred to remove. */

/*
 * Set setuid/setgid protections if no ptrace() compatibility.
 * the presence of ptrace() compatibility.
 */

/*
 * If VPROC, ask /proc if the file is an object file.
 */

/*
 * If process is under ptrace(2) compatibility,
 */

/*
 * Process is traced via /proc.
 * Arrange to invalidate the /proc vnode.
 */

/*
 * Map a section of an executable file into the user's
 */

/*
 * If the segment can fit, then we prefault
 * the entire segment in. This is based on the
 * model that says the best working set of a
 * small program is all of its pages.
 */

/*
 * If we aren't prefaulting the segment,
 * increment "deficit", if necessary to ensure
 * that pages will become available when this
 * process starts executing.
 */

"execmap preread:freemem %d size %lu",
/*
 * Read in the segment in one big chunk.
 */

/*
 * Before we go to zero the remaining space on the last
 * page, make sure we have write permission.
 * Normal illumos binaries don't even hit the case
 * where we have to change permission on the last page
 * since their protection is typically either
 *	PROT_USER | PROT_WRITE | PROT_READ
 *	PROT_ZFOD (same as PROT_ALL).
 * We need to be careful how we zero-fill the last page
 * if the segment protection does not include
 * PROT_WRITE. Using as_setprot() can cause the VM
 * segment code to call segvn_vpage(), which must
 * allocate a page struct for each page in the segment.
 * If we have a very large segment, this may fail, so
 * we have to check for that, even though we ignore
 * other return values from as_setprot.
 */

/*
 * ASSERT alignment because the mapelfexec()
 * caller for the szc > 0 case extended zfod
 * so its end is pgsz aligned.
 */

*fdp = -1; /* just in case falloc changed value */

*vpp = vp;
/* vnode should not have changed */

/*
 * Support routines for building a user stack.
 *
 * execve(path, argv, envp) must construct a new stack with the specified
 * arguments and environment variables (see exec_args() for a description
 * of the user stack layout). To do this, we copy the arguments and
 * environment variables from the old user address space into the kernel,
 * free the old as, create the new as, and copy our buffered information
 * to the new stack. Our kernel buffer has the following structure:
 *
 *	+-----------------------+ <--- stk_base + stk_size
 *	| string offsets        |
 *	+-----------------------+ <--- stk_offp
 *	| STK_AVAIL() space     |
 *	+-----------------------+ <--- stk_strp
 *	| strings               |
 *	+-----------------------+ <--- stk_base
 *
 * When we add a string, we store the string's contents (including the null
 * terminator) at stk_strp, and we store the offset of the string relative to
 * stk_base at --stk_offp. As strings are added, stk_strp increases and
 * stk_offp decreases. The amount of space remaining, STK_AVAIL(), is just
 * the difference between these pointers. If we run out of space, we return
 * an error and exec_args() starts all over again with a buffer twice as large.
 */
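The two-ended buffer described above can be sketched in user-space C. This is an illustration only, not the kernel code: the struct `stkbuf_t`, the helpers `stk_init()`, `stk_add()` and `stk_build()`, the use of `malloc()`, and the starting size of 16 bytes are all assumptions made for the example. Only the field names `stk_base`, `stk_size`, `stk_strp`, `stk_offp` and the `STK_AVAIL()` idea come from the comment.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's argument buffer. */
typedef struct stkbuf {
	char	*stk_base;	/* start of the buffer */
	size_t	stk_size;	/* total size in bytes */
	char	*stk_strp;	/* next free string byte (grows up) */
	int	*stk_offp;	/* lowest stored offset (grows down) */
} stkbuf_t;

/* Free space is the gap between the two moving pointers. */
#define	STK_AVAIL(b)	((size_t)((char *)(b)->stk_offp - (b)->stk_strp))

static int
stk_init(stkbuf_t *b, size_t size)
{
	size &= ~(sizeof (int) - 1);	/* keep the offset area int-aligned */
	if ((b->stk_base = malloc(size)) == NULL)
		return (ENOMEM);
	b->stk_size = size;
	b->stk_strp = b->stk_base;
	b->stk_offp = (int *)(b->stk_base + size);
	return (0);
}

/* Store the string at stk_strp and its offset at --stk_offp. */
static int
stk_add(stkbuf_t *b, const char *s)
{
	size_t len = strlen(s) + 1;	/* include the null terminator */

	if (len + sizeof (int) > STK_AVAIL(b))
		return (E2BIG);
	*--b->stk_offp = (int)(b->stk_strp - b->stk_base);
	memcpy(b->stk_strp, s, len);
	b->stk_strp += len;
	assert(b->stk_strp <= (char *)b->stk_offp);
	return (0);
}

/* On overflow, start over with a buffer twice as large. */
static int
stk_build(stkbuf_t *b, const char *const *strs, int n)
{
	size_t size = 16;	/* arbitrary small starting size */

	for (;;) {
		int i, err;

		if ((err = stk_init(b, size)) != 0)
			return (err);
		for (i = 0; i < n; i++) {
			if ((err = stk_add(b, strs[i])) != 0)
				break;
		}
		if (err == 0)
			return (0);
		free(b->stk_base);
		size *= 2;
	}
}
```

Because offsets are pushed downward, the offset of the first string added ends up at the highest address, so after n additions `stk_offp[n - 1]` refers to the first string and `stk_offp[0]` to the last.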
/*
 * When we're all done, the kernel buffer looks like this:
 *
 *	+-----------------------+ <--- stk_base + stk_size
 *	| argv[0] offset        |
 *	+-----------------------+
 *	| ...                   |
 *	+-----------------------+
 *	| argv[argc-1] offset   |
 *	+-----------------------+
 *	| envp[0] offset        |
 *	+-----------------------+
 *	| ...                   |
 *	+-----------------------+
 *	| envp[envc-1] offset   |
 *	+-----------------------+
 *	| AT_SUN_PLATFORM offset|
 *	+-----------------------+
 *	| AT_SUN_EXECNAME offset|
 *	+-----------------------+ <--- stk_offp
 *	| STK_AVAIL() space     |
 *	+-----------------------+ <--- stk_strp
 *	| AT_SUN_EXECNAME string|
 *	+-----------------------+
 *	| AT_SUN_PLATFORM string|
 *	+-----------------------+
 *	| envp[envc-1] string   |
 *	+-----------------------+
 *	| ...                   |
 *	+-----------------------+
 *	| envp[0] string        |
 *	+-----------------------+
 *	| argv[argc-1] string   |
 *	+-----------------------+
 *	| ...                   |
 *	+-----------------------+
 *	| argv[0] string        |
 *	+-----------------------+ <--- stk_base
 */

/*
 * Add a string to the stack.
 */

/*
 * Copy interpreter's name and argument to argv[0] and argv[1].
 */

/*
 * Check for an empty argv[].
 */

/*
 * Add argv[] strings to the stack.
 */

/*
 * Add environ[] strings to the stack.
 */

/* Undo the copied string */

/*
 * Add AT_SUN_PLATFORM, AT_SUN_EXECNAME, AT_SUN_BRANDNAME, and
 * AT_SUN_EMULATOR strings to the stack.
 */

/*
 * Compute the size of the stack. This includes all the pointers,
 * the space reserved for the aux vector, and all the strings.
 * The total number of pointers is args->na (which is argc + envc)
 * plus 4 more: (1) a pointer's worth of space for argc; (2) the NULL
 * after the last argument (i.e. argv[argc]); (3) the NULL after the
 * last environment variable (i.e. envp[envc]); and (4) the NULL after
 * all the strings, at the very top of the stack.
 */

/*
 * Pad the string section with zeroes to align the stack size.
 */

/*
 * Put argc on the stack. Note that even though it's an int,
 * it always consumes ptrsize bytes (for alignment).
 */

/*
 * Add argc space (ptrsize) to usp and record argv for /proc.
 */

/*
 * Put the argv[] pointers on the stack.
 */

/*
 * Copy arguments to u_psargs.
 */

for (i = 0; i < pslen; i++)
/*
 * Add space for argv[]'s NULL terminator (ptrsize) to usp and
 */

/*
 * Put the envp[] pointers on the stack.
 */

/*
 * Add space for envp[]'s NULL terminator (ptrsize) to usp and
 * remember where the stack ends, which is also where auxv begins.
 */

/*
 * Put all the argv[], envp[], and auxv strings on the stack.
 */

/*
 * Fill in the aux vector now that we know the user stack addresses
 * for the AT_SUN_PLATFORM, AT_SUN_EXECNAME, AT_SUN_BRANDNAME and
 * AT_SUN_EMULATOR strings.
 */

/*
 * Initialize a new user stack with the specified arguments and environment.
 * The initial user stack layout is as follows:
 *
 *	+---------------+ <--- curproc->p_usrstack
 *	| strings       |
 *	+---------------+ <--- ustrp
 *	| aux vector    |
 *	+---------------+ <--- auxv
 *	| envp pointers |
 *	+---------------+ <--- envp[]
 *	| argv pointers |
 *	+---------------+ <--- argv[]
 *	| argc          |
 *	+---------------+ <--- stack base
 */

/*
 * Make sure user register windows are empty before
 * attempting to make a new stack.
 */

/*
 * Leave only the current lwp and force the other lwps to exit.
 * If another lwp beat us to the punch by calling exit(), bail out.
 */

/*
 * Revoke any doors created by the process.
 */

/*
 * Release schedctl data structures.
 */

/*
 * Clean up any DTrace helpers for the process.
 */

/*
 * Cleanup the DTrace provider associated with this process.
 */

/*
 * discard the lwpchan cache.
 */

/*
 * Delete the POSIX timers.
 */

/*
 * Delete the ITIMER_REALPROF interval timer.
 * The other ITIMER_* interval timers are specified
 * to be inherited across exec().
 */

/*
 * Ensure that we don't change resource associations while we
 */

/*
 * Destroy the old address space and create a new one.
 * From here on, any errors are fatal to the exec()ing process.
 * On error we return -1, which means the caller must SIGKILL
 */

/*
 * Reset resource controls such that all controls are again active as
 * well as appropriate to the potentially new address model for the
 */

/* Too early to call map_pgsz for the heap */

/*
 * Some platforms may choose to randomize real stack start by adding a
 * small slew (not more than a few hundred bytes) to the top of the
 * stack. This helps avoid cache thrashing when identical processes
 * simultaneously share caches that don't provide enough associativity
 * (e.g. sun4v systems). In this case stack slewing makes the same hot
 * stack variables in different processes live in different cache
 * sets, increasing effective associativity.
 */

/*
 * Finally, write out the contents of the new stack.
 */
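The stack-size accounting described in the comments above (args->na pointers plus four more, the space reserved for the aux vector, and the strings, padded for alignment) can be sketched as plain arithmetic. The function names `round_up()` and `stack_size()` and their parameters are hypothetical, chosen for this example; the alignment is assumed to be a power of two.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to a multiple of align (assumed to be a power of two). */
static size_t
round_up(size_t n, size_t align)
{
	return ((n + align - 1) & ~(align - 1));
}

/*
 * Hypothetical sketch of the size computation:
 *	na      - argc + envc
 *	ptrsize - size of a pointer in the new address model
 *	auxsize - bytes reserved for the aux vector
 *	strsize - total bytes of string data
 *	align   - required stack alignment
 */
static size_t
stack_size(size_t na, size_t ptrsize, size_t auxsize, size_t strsize,
    size_t align)
{
	/*
	 * Four extra pointer-sized slots: argc, the NULL after argv[],
	 * the NULL after envp[], and the NULL after all the strings.
	 */
	size_t size = (na + 4) * ptrsize + auxsize + strsize;

	/* The padding is what gets zero-filled after the strings. */
	return (round_up(size, align));
}
```

For example, with 5 argument and environment pointers, 8-byte pointers, 64 bytes of aux vector, and 37 bytes of strings, the unpadded size is 173 bytes and a 16-byte alignment pads it to 176.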