/* seg_vn.c - revision ea7b14af2e7e53326787ef8be8bbeb8a53e4e143 */
/*
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 *
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each file.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 */

/*
 * Copyright 2008 Sun Microsystems, Inc.  All rights reserved.
 * Use is subject to license terms.
 */

/*	Copyright (c) 1984, 1986, 1987, 1988, 1989 AT&T	*/
/*	  All Rights Reserved	*/

/*
 * University Copyright- Copyright (c) 1982, 1986, 1988
 * The Regents of the University of California
 * All Rights Reserved
 *
 * University Acknowledgment- Portions of this document are derived from
 * software developed by the University of California, Berkeley, and its
 * contributors.
 */
#pragma ident	"%Z%%M% %I% %E% SMI"

/*
 * Private seg op routines.
 */

/*
 * Common zfod structures, provided as a shorthand for others to use.
 */
#define	vpgtob(n)	((n) * sizeof (struct vpage))	/* For brevity */

#define	SDR_RANGE	1		/* demote entire range */
#define	SDR_END		2		/* demote non aligned ends only */

/*
 * Patching this variable to non-zero allows the system to run with
 * stacks marked as "not executable".  It's a bit of a kludge, but is
 * provided as a tweakable for platforms that export those ABIs
 * (e.g. sparc V8) that have executable stacks enabled by default.
 * There are also some restrictions for platforms that don't actually
 * implement 'noexec' protections.
 *
 * Once enabled, the system is (therefore) unable to provide a fully
 * ABI-compliant execution environment, though practically speaking,
 * most everything works.  The exceptions are generally some interpreters
 * and debuggers that create executable code on the stack and jump
 * into it (without explicitly mprotecting the address range to include
 * PROT_EXEC).
 *
 * One important class of applications that are disabled are those
 * that have been transformed into malicious agents using one of the
 * numerous "buffer overflow" attacks.  See 4007890.
 */

/*
 * Segvn supports text replication optimization for NUMA platforms. Text
 * replicas are represented by anon maps (amp). There's one amp per text file
 * region per lgroup. A process chooses the amp for each of its text mappings
 * based on the lgroup assignment of its main thread (t_tid = 1). All
 * processes that want a replica on a particular lgroup for the same text file
 * mapping share the same amp. amp's are looked up in the svntr_hashtab hash
 * table with vp,off,size,szc used as a key. Text replication segments are
 * read only MAP_PRIVATE|MAP_TEXT segments that map a vnode. Replication is
 * achieved by forcing COW faults from vnode to amp and mapping amp pages
 * instead of vnode pages. A replication amp is assigned to a segment when it
 * gets its first pagefault. To handle main thread lgroup rehoming,
 * segvn_trasync_thread periodically rechecks whether the process still maps
 * an amp local to the main thread. If not, the async thread forces the
 * process to remap to an amp in the new home lgroup of the main thread. The
 * current text replication implementation only benefits workloads that do
 * most of their work in the main thread of a process, or whose threads all
 * run in the same lgroup. To extend the text replication benefit to other
 * types of multithreaded workloads, further work would be needed in the hat
 * layer to allow the same virtual address in the same hat to simultaneously
 * map different physical addresses (i.e. page table replication would be
 * needed for x86).
 *
 * amp pages are used instead of vnode pages as long as the segment has a
 * very simple life cycle: it's created via segvn_create(), handles S_EXEC
 * (S_READ) pagefaults and is fully unmapped. If anything more complicated
 * happens, such as protection being changed, a real COW fault, a pagesize
 * change, an MC_LOCK request, or a partial unmap, we turn off text
 * replication by converting the segment back to a vnode only segment
 * (unmap the segment's address range and set svd->amp to NULL).
 *
 * The original file can be changed after an amp is inserted into
 * svntr_hashtab. Processes that are launched after the file is already
 * changed can't use the replicas created prior to the file change. To
 * implement this functionality hash entries are timestamped. Replicas can
 * only be used if the current file modification time is the same as the
 * timestamp saved when the hash entry was created. However timestamps alone
 * are not sufficient to detect file modification via mmap(MAP_SHARED)
 * mappings, so we deal with file changes via MAP_SHARED mappings differently.
 * When writable MAP_SHARED mappings are created to vnodes marked as
 * executable, we mark all existing replicas for this vnode as not usable for
 * future text mappings. And we don't create new replicas for files that
 * currently have potentially writable MAP_SHARED mappings (i.e.
 * vn_is_mapped(V_WRITE) is true).
 */

/*
 * Initialize segvn data structures.
 */
		panic("segvn_init: bad szc 0");
	/*
	 * For now shared regions and text replication segvn support
	 * are mutually exclusive. This is acceptable because
	 * currently significant benefit from text replication was
	 * only observed on AMD64 NUMA platforms (due to relatively
	 * small L2$ size) and currently we don't support shared
	 * regions on x86.
	 */

	/*
	 * set v_mpssdata just once per vnode life
	 * so that it never changes.
	 */

		panic("segvn_create type");
	/*
	 * Check arguments.  If a shared anon structure is given then
	 * it is illegal to also specify a vp.
	 */
		panic("segvn_create anon_map");
/* MAP_NORESERVE on a MAP_SHARED segment is meaningless. */ * hat_page_demote() is not supported * If segment may need private pages, reserve them now. * Reserve any mapping structures that may be required. * Don't do it for segments that may use regions. It's currently a * noop in the hat implementations anyway. /* Inform the vnode of the new mapping */ * svntr_hashtab will be NULL if we support shared regions. * If more than one segment in the address space, and they're adjacent * virtually, try to concatenate them. Don't concatenate if an * explicit anon_map structure was supplied (e.g., SystemV shared * memory) or if we'll use text replication for this segment. * Memory policy flags (lgrp_mem_policy_flags) is valid when * Get policy when not extending it from another segment * First, try to concatenate the previous and new segments * Get memory allocation policy from previous segment. * When extension is specified (e.g. for heap) apply * this policy to the new segment regardless of the * outcome of segment concatenation. Extension occurs * for non-default policy otherwise default policy is * used and is based on extended segment size. * success! now try to concatenate * Failed, so try to concatenate with following seg * Get memory allocation policy from next segment. * When extension is specified (e.g. for stack) apply * this policy to the new segment regardless of the * outcome of segment concatenation. Extension occurs * for non-default policy otherwise default policy is * used and is based on extended segment size. * Anonymous mappings have no backing file so the offset is meaningless. * Shared mappings to a vp need no other setup. * If we have a shared mapping to an anon_map object * which hasn't been allocated yet, allocate the * struct now so that it will be properly shared * by remembering the swap reservation there. * Private mapping (with or without a vp). * Allocate anon_map when needed. * Mapping to an existing anon_map structure without a vp. * For now we will insure that the segment size isn't larger * than the size - offset gives us. Later on we may wish to * have the anon array dynamically allocated itself so that * we don't always have to allocate all the anon pointer slots. * This of course involves adding extra code to check that we * aren't trying to use an anon pointer slot beyond the end * of the currently allocated anon array. panic(
"segvn_create anon_map size");
* SHARED mapping to a given anon_map. * PRIVATE mapping to a given anon_map. * Make sure that all the needed anon * structures are created (so that we will * share the underlying pages if nothing * is written by this mapping) and then * duplicate the anon array as is done * when a privately mapped segment is dup'ed. * Prevent 2 threads from allocating anon * Allocate the anon struct now. * Might as well load up translation * to the page while we're at it... panic(
"segvn_create anon_zero");
* Re-acquire the anon_map lock and * initialize the anon array entry. * Set default memory allocation policy for segment * Always set policy for private memory at least for initialization * even if this is a shared memory segment * Concatenate two existing segments, if possible. * Return 0 on success, -1 if two segments are not compatible * or -2 on memory allocation failure. * If amp_cat == 1 then try and concat segments with anon maps /* both segments exist, try to merge them */ * vp == NULL implies zfod, offset doesn't matter * Don't concatenate if either segment uses text replication. * Fail early if we're not supposed to concatenate * segments with non NULL amp. * If either seg has vpages, create a new merged vpage array. * If either segment has private pages, create a new merged anon * array. If mergeing shared anon segments just decrement anon map's * XXX anon rwlock is not really needed because * this is a private segment and we are writers. * Now free the old vpage structures. /* all looks ok, merge segments */ svd2->
swresv = 0;
/* so seg_free doesn't release swap space */ * Extend the previous segment (seg1) to include the * new segment (seg2 + a), if possible. * We don't need any segment level locks for "segvn" data * since the address space is "write" locked. /* second segment is new, try to extend first */ /* XXX - should also check cred */ /* vp == NULL implies zfod, offset doesn't matter */ * Segment has private pages, can data structures * Acquire the anon_map lock to prevent it from changing, * if it is shared. This ensures that the anon_map * will not change while a thread which has a read/write * lock on an address space references it. * XXX - Don't need the anon_map lock at all if "refcnt" * Can't grow a MAP_SHARED segment with an anonmap because * there may be existing anon slots where we want to extend * the segment and we wouldn't know what to do with them * (e.g., for tmpfs right thing is to just leave them there, * for /dev/zero they should be cleared out). * Extend the next segment (seg2) to include the * new segment (seg1 + a), if possible. * We don't need any segment level locks for "segvn" data * since the address space is "write" locked. /* first segment is new, try to extend second */ /* XXX - should also check cred */ /* vp == NULL implies zfod, offset doesn't matter */ * Segment has private pages, can data structures * Acquire the anon_map lock to prevent it from changing, * if it is shared. This ensures that the anon_map * will not change while a thread which has a read/write * lock on an address space references it. * XXX - Don't need the anon_map lock at all if "refcnt" /* Not merging segments so adjust anon_index back */ * If segment has anon reserved, reserve more for the new seg. * For a MAP_NORESERVE segment swresv will be a count of all the * allocated anon slots; thus we reserve for the child as many slots * as the parent has allocated. This semantic prevents the child or * parent from dieing during a copy-on-write fault caused by trying * to write a shared pre-existing anon page. * Not attaching to a shared anon object. /* regions for now are only used on pure vnode segments */ * Allocate and initialize new anon_map structure. * We don't have to acquire the anon_map lock * for the new segment (since it belongs to an * address space that is still not associated * with any process), or the segment in the old * address space (since all threads in it * are stopped while duplicating the address space). * The goal of the following code is to make sure that * softlocked pages do not end up as copy on write * pages. This would cause problems where one * thread writes to a page that is COW and a different * thread in the same process has softlocked it. The * softlock lock would move away from this process * because the write would cause this process to get * a copy (without the softlock). * The strategy here is to just break the * sharing on pages that could possibly be * The softlock count might be non zero * because some pages are still stuck in the * cache for lazy reclaim. Flush the cache * now. This should drop the count to zero. * [or there is really I/O going on to these * pages]. Note, we have the writers lock so * nothing gets inserted during the flush. * XXX break cow sharing using PAGESIZE * pages. They will be relocated into larger * prot need not be computed * below 'cause anon_private is * going to ignore it anyway * as child doesn't inherit }
else {
/* common case */ * If at least one of anon slots of a * large page exists then make sure * all anon slots of a large page * exist to avoid partial cow sharing * of a large page in the future. * If necessary, create a vpage structure for the new segment. * Do not copy any page lock indications. for (i = 0; i <
npages; i++) {
/* Inform the vnode of the new mapping */ * callback function to invoke free_vp_pages() for only those pages actually * processed by the HAT when a shared region is destroyed. * callback function used by segvn_unmap to invoke free_vp_pages() for only * those pages actually processed by the HAT * This function determines the number of bytes of swap reserved by * a segment for which per-page accounting is present. It is used to * calculate the correct value of a segvn_data's swresv. * We don't need any segment level locks for "segvn" data * since the address space is "write" locked. * Fail the unmap if pages are SOFTLOCKed through this mapping. * softlockcnt is protected from change by the as write lock. * since we do have the writers lock nobody can fill * the cache during the purge. The flush either succeeds * or we still have pending I/Os. * could pass a flag to segvn_demote_range() * below to tell it not to do any unloads but * this case is rare enough to not bother for /* Inform the vnode of the unmapping. */ * Remove any page locks set through this mapping. * If text replication is not off no page locks could have been * established via this mapping. * Unload any hardware translations in the range to be taken * out. Use a callback to invoke free_vp_pages() effectively. * Check for entire segment * Check for beginning of segment * Free up now unused parts of anon_map array. * Unreserve swap space for the * unmapped chunk of this segment in * Check for end of segment * Free up now unused parts of anon_map array. * Unreserve swap space for the * unmapped chunk of this segment in "anon proc:%p %lu %u",
		    seg, len, 0);
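/*
 * Illustrative sketch (not part of the original seg_vn.c): the unmap cases
 * discussed in the surrounding segvn_unmap() comments reduce to simple
 * address arithmetic.  The range [addr, addr + len) either covers the entire
 * segment, trims its beginning, trims its end, or punches a hole in the
 * middle (which forces a split into two segments).  Types and names here are
 * simplified stand-ins.
 */
#include <stddef.h>
#include <stdint.h>

typedef enum {
	UNMAP_ALL,	/* entire segment goes away */
	UNMAP_HEAD,	/* beginning of segment is removed */
	UNMAP_TAIL,	/* end of segment is removed */
	UNMAP_MIDDLE	/* middle removed: segment must be split in two */
} unmap_case_t;

static unmap_case_t
classify_unmap(uintptr_t seg_base, size_t seg_size, uintptr_t addr, size_t len)
{
	uintptr_t seg_end = seg_base + seg_size;
	uintptr_t end = addr + len;

	/* caller guarantees the range lies within the segment */
	if (addr == seg_base && end == seg_end)
		return (UNMAP_ALL);
	if (addr == seg_base)
		return (UNMAP_HEAD);
	if (end == seg_end)
		return (UNMAP_TAIL);
	return (UNMAP_MIDDLE);
}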
* The section to go is in the middle of the segment, * have to make it into two segments. nseg is made for * the high end while seg is cut down at the low end. panic(
"segvn_unmap seg_alloc");
/* need to split vpage into two arrays */ * Need to create a new anon map for the new segment. * We'll also allocate a new smaller array for the old * smaller segment to save space. * Free up now unused parts of anon_map array. * Unreserve swap space for the * unmapped chunk of this segment in panic(
"segvn_unmap: cannot split " return (0);
/* I'm glad that's all over with! */ * We don't need any segment level locks for "segvn" data * since the address space is "write" locked. * Be sure to unlock pages. XXX Why do things get free'ed instead * Deallocate the vpage and anon pointers if necessary and possible. * If there are no more references to this anon_map * structure, then deallocate the structure after freeing * up all the anon slot pointers that we can. * Private - we only need to anon_free * the part that this segment refers to. * Shared - anon_free the entire * anon_map's worth of stuff and * release any swap reservation. "anon proc:%p %lu %u",
		    seg, len, 0);
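/*
 * Illustrative sketch (not part of the original seg_vn.c): when per-page
 * swap accounting is in use, the swap reserved by a segment (its swresv, as
 * described in the segvn_unmap()/segvn_free() comments above) is simply the
 * number of pages whose vpage entry carries a "swap reserved" flag times the
 * page size.  The vpage_model type and flag are hypothetical stand-ins for
 * the kernel's struct vpage bookkeeping.
 */
#include <stddef.h>
#include <stdint.h>

#define	MODEL_PAGESIZE		4096
#define	VPP_MODEL_SWAPRES	0x01	/* swap reserved for this page */

typedef struct vpage_model {
	uint8_t	vpp_flags;
} vpage_model_t;

static size_t
count_swap_bytes(const vpage_model_t *vpp, size_t npages)
{
	size_t nswappages = 0;
	size_t i;

	for (i = 0; i < npages; i++) {
		if (vpp[i].vpp_flags & VPP_MODEL_SWAPRES)
			nswappages++;
	}
	return (nswappages * MODEL_PAGESIZE);
}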
* We had a private mapping which still has * a held anon_map so just free up all the * anon slot pointers that we were using. * Release swap reservation. * Release claim on vnode, credentials, and finally free the * Support routines used by segvn_pagelock() and softlock faults for anonymous * pages to implement availrmem accounting in a way that makes sure the * This prevents a bug when availrmem is quickly incorrectly exhausted from * several pagelocks to different parts of the same large page since each * pagelock has to decrement availrmem by the size of the entire large * we don't need to use cow style accounting here. We also need to make sure * the entire large page is accounted even if softlock range is less than the * entire large page because large anon pages can't be demoted when any of * constituent pages is locked. The caller calls this routine for every page_t * it locks. The very first page in the range may not be the root page of a * large page. For all other pages it's guaranteed we are going to visit the * root of a particular large page before any other constituent page as we are * locking sequential pages belonging to the same anon map. So we do all the * locking when the root is encountered except for the very first page. Since * softlocking is not supported (except S_READ_NOCOW special case) for vmpss * segments and since vnode pages can be demoted without locking all * constituent pages vnode pages don't come here. Unlocking relies on the * fact that pagesize can't change whenever any of constituent large pages is * locked at least SE_SHARED. This allows unlocking code to find the right * root and decrement availrmem by the same amount it was incremented when the * pagesize won't change as long as any constituent page is locked. * We haven't locked this large page yet. * pagesize won't change as long as any constituent page is locked. * Do a F_SOFTUNLOCK call over the range requested. The range must have * already been F_SOFTLOCK'ed. * Caller must always match addr and len of a softunlock with a previous * softlock with exactly the same addr and len. * Use page_find() instead of page_lookup() to * find the page since we know that it is locked. "segvn_softunlock: addr %p, ap %p, vp %p, off %llx",
"segvn_fault:pp %p vp %p offset %llx",
pp,
vp,
offset);
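/*
 * Illustrative sketch (not part of the original seg_vn.c): the softlock
 * accounting described above charges availrmem for a whole large page only
 * once, when the root constituent page is reached; the very first page of
 * the range is charged unconditionally because its root may lie before the
 * range and will never be visited.  The page_model type is a simplified
 * stand-in, not the kernel's page_t.
 */
#include <stddef.h>

typedef struct page_model {
	unsigned	p_szc;		/* page size code: 0 == small page */
	size_t		p_pgcnt;	/* constituent pages if large */
	int		p_is_root;	/* nonzero if root of its large page */
} page_model_t;

/*
 * Returns the number of small pages to charge against availrmem for this
 * constituent page, following the "charge at the root" rule.
 */
static size_t
slock_charge_for_page(const page_model_t *pp, int first)
{
	if (pp->p_szc == 0)
		return (1);		/* small page: charge one page */
	if (first || pp->p_is_root)
		return (pp->p_pgcnt);	/* charge the whole large page once */
	return (0);			/* already charged at the root */
}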
* All SOFTLOCKS are gone. Wakeup any waiting * unmappers so they can try again to unmap. * Check for waiters first without the mutex * held so we don't always grab the mutex on * Release all the pages in the NULL terminated ppp list * which haven't already been converted to PAGE_HANDLED. * Workaround for viking chip bug. See bug id 1220902. * To fix this down in pagefault() would require importing so * much as and segvn code as to be unmaintainable. * Handles all the dirty work of getting the right * anonymous pages and loading up the translations. * This routine is called only from segvn_fault() * when looping over the range of addresses requested. * The basic algorithm here is: * If this is an anon_zero case * Call anon_zero to allocate page * If this is an anon page * Use anon_getpage to get the page * Find page in pl[] list passed in * Load up the translation to the page * Call anon_private to handle cow * Load up (writable) translation to new page struct hat *
    hat,				/* the hat to use for mapping */
	struct seg *seg,		/* seg_vn of interest */
	struct vpage *vpage,		/* pointer to vpage for vp, off */
	page_t *pl[],			/* object source page pointer */
	enum seg_rw rw,			/* type of access at fault */
	int brkcow,			/* we may need to break cow */
	int first)
/* first page for this fault if 1 */ * Initialize protection value for this page. * If we have per page protection values check it now. return (
FC_PROT);
/* illegal access type */ * Always acquire the anon array lock to prevent 2 threads from * allocating separate anon slots for the same "addr". * Allocate a (normally) writable anonymous page of * zeroes. If no advance reservations, reserve now. goto out;
/* out of swap space */ * Re-acquire the anon_map lock and * initialize the anon array entry. * Handle pages that have been marked for migration * If AS_PAGLCK is set in a_flags (via memcntl(2) * with MC_LOCKAS, MCL_FUTURE) and this is a * MAP_NORESERVE segment, we may need to * permanently lock the page as it is being faulted * for the first time. The following text applies * only to MAP_NORESERVE segments: * As per memcntl(2), if this segment was created * after MCL_FUTURE was applied (a "future" * segment), its pages must be locked. If this * segment existed at MCL_FUTURE application (a * "past" segment), the interface is unclear. * We decide to lock only if vpage is present: * - "future" segments will have a vpage array (see * as_map), and so will be locked as required * - "past" segments may not have a vpage array, * depending on whether events (such as * mprotect) have occurred. Locking if vpage * exists will preserve legacy behavior. Not * locking if vpage is absent, will not break * the interface or legacy behavior. Note that * allocating vpage here if it's absent requires * upgrading the segvn reader lock, the cost of * which does not seem worthwhile. * Usually testing and setting VPP_ISPPLOCK and * VPP_SETPPLOCK requires holding the segvn lock as * writer, but in this case all readers are * serializing on the anon array lock. * Obtain the page structure via anon_getpage() if it is * a private copy of an object (the result of a previous * If this is a shared mapping to an * anon_map, then ignore the write * permissions returned by anon_getpage(). * They apply to the private mappings * Search the pl[] list passed in if it is from the * original object (i.e., not a private copy). * Find original page. We must be bringing it in panic(
"segvn_faultpage not found");
"segvn_fault:pp %p vp %p offset %llx",
opp,
NULL, 0);
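/*
 * Illustrative sketch (not part of the original seg_vn.c): the decision flow
 * spelled out in the "basic algorithm" comment above segvn_faultpage() can
 * be summarized as below.  All functions and types are hypothetical
 * stand-ins; the real code also deals with locking, availrmem accounting and
 * error handling.
 */
typedef enum {
	FP_ZFOD,	/* no backing object, no anon slot yet: zero-fill */
	FP_ANON,	/* anon slot exists: use the anon page */
	FP_OBJECT	/* use the page from the pl[] list (object page) */
} fault_src_t;

typedef struct fault_model {
	int	has_vp;		/* segment has a backing vnode */
	int	has_anon_slot;	/* an anon slot already exists for addr */
	int	is_private;	/* MAP_PRIVATE segment */
	int	is_write;	/* fault is for write access */
	int	page_writable;	/* object page currently allows write */
} fault_model_t;

static fault_src_t
fault_source(const fault_model_t *f)
{
	if (f->has_anon_slot)
		return (FP_ANON);	/* anon_getpage() path */
	if (!f->has_vp)
		return (FP_ZFOD);	/* anon_zero() a fresh page */
	return (FP_OBJECT);		/* look the page up in pl[] */
}

/*
 * A write to a write-protected page of a private segment is the
 * copy-on-write case: a private copy is made (anon_private()) and a
 * writable translation is loaded to the new page.
 */
static int
fault_is_cow(const fault_model_t *f)
{
	return (f->is_private && f->is_write && !f->page_writable);
}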
* The fault is treated as a copy-on-write fault if a * write occurs on a private segment and the object * page (i.e., mapping) is write protected. We assume * that fatal protection checks have already been made. * If we are doing text replication COW on first touch. * If not a copy-on-write case load the translation * Handle pages that have been marked for migration * Steal the page only if it isn't a private page * since stealing a private page is not worth the effort. * Steal the original page if the following conditions are true: * We are low on memory, the page is not private, page is not large, * not shared, not modified, not `locked' or if we have it `locked' * (i.e., p_cowcnt == 1 and p_lckcnt == 0, which also implies * that the page is not shared) and if it doesn't have any * translations. page_struct_lock isn't needed to look at p_cowcnt * and p_lckcnt because we first get exclusive lock on page. * Check if this page has other translations * after unloading our translation. * hat_unload() might sync back someone else's recent * modification, so check again. * If we have a vpage pointer, see if it indicates that we have * ``locked'' the page we map -- if so, tell anon_private to * transfer the locking resource to the new page. * See Statement at the beginning of segvn_lockop regarding * Allocate a private page and perform the copy. * For MAP_NORESERVE reserve swap space now, unless this * is a cow fault on an existing anon page in which case * MAP_NORESERVE will have made advance reservations. * If we copied away from an anonymous page, then * we are one step closer to freeing up an anon slot. * NOTE: The original anon slot must be released while * holding the "anon_map" lock. This is necessary to prevent * other threads from obtaining a pointer to the anon slot * which may be freed if its "refcnt" is 1. * Handle pages that have been marked for migration * relocate a bunch of smaller targ pages into one large repl page. all targ * pages must be complete pages smaller than replacement pages. * it's assumed that no page's szc can change since they are all PAGESIZE or * complete large pages locked SHARED. panic(
"segvn_relocate_pages: " "page_relocate failed err=%d curnpgs=%ld " * Check if all pages in ppa array are complete smaller than szc pages and * their roots will still be aligned relative to their current size if the * entire ppa array is relocated into one szc page. If these conditions are * If all pages are properly aligned attempt to upgrade their locks * to exclusive mode. If it fails set *upgrdfail to 1 and return 0. * upgrdfail was set to 0 by caller. * Return 1 if all pages are aligned and locked exclusively. * If all pages in ppa array happen to be physically contiguous to make one * szc page and all exclusive locks are successfully obtained promote the page * size to szc and set *pszc to szc. Return 1 with pages locked shared. * p_szc changed means we don't have all pages * locked. return failure. panic(
"segvn_full_szcpages: " "large page not physically contiguous");
for (j = 0; j < i; j++) {
* When a page is put a free cachelist its szc is set to 0. if file * system reclaimed pages from cachelist targ pages will be physically * contiguous with 0 p_szc. in this case just upgrade szc of targ * pages without any relocations. * To avoid any hat issues with previous small mappings * hat_pageunload() the target pages first. * Create physically contiguous pages for [vp, off] - [vp, off + * page_size(szc)) range and for private segment return them in ppa array. * Pages are created either via IO or relocations. * Return 1 on success and 0 on failure. * If physically contiguous pages already exist for this range return 1 without * filling ppa array. Caller initializes ppa[0] as NULL to detect that ppa * array wasn't filled. In this case caller fills ppa array via VOP_GETPAGE(). * downsize will be set to 1 only if we fail to lock pages. this will * allow subsequent faults to try to relocate the page again. If we * fail due to misalignment don't downsize and let the caller map the * whole region with small mappings to avoid more faults into the area * where we can't get large pages anyway. * we pass NULL for nrelocp to page_lookup_create() * so that it doesn't relocate. We relocate here * later only after we make sure we can lock all * pages in the range we handle and they are all * sizing down to pszc won't help. * sizing down to pszc won't help. * Some file systems like NFS don't check EOF * conditions in VOP_PAGEIO(). Check it here * now that pages are locked SE_EXCL. Any file * truncation will wait until the pages are * unlocked so no need to worry that file will * be truncated after we check its size here. * XXX fix NFS to remove this check. * page szc chould have changed before the entire group was * locked. reread page szc. /* link just the roots */ * we're now bound to succeed or panic. * remove pages from done_pplist. it's not needed anymore. panic(
"segvn_fill_vp_pages: " for (i = 0; i <
pages; i++) {
* the caller will still call VOP_GETPAGE() for shared segments * to check FS write permissions. For private segments we map * file read only anyway. so no VOP_GETPAGE is needed. for (i = 0; i <
pages; i++) {
* Do the cleanup. Unlock target pages we didn't relocate. They are * linked on targ_pplist by root pages. reassemble unused replacement * and io pages back to pplist. /* relink replacement page */ * at this point all pages are either on done_pplist or * pplist. They can't be all on done_pplist otherwise * don't downsize on io error. * see if vop_getpage succeeds. * pplist may still be used in this case for (i = 0; i < (
pages); i++) { \
for (i = 0; i < (
pages); i++) { \
/* caller has already done segment level protection check. */ for (i = 0; i <
pages; i++) {
/* can't reduce map area */ for (i = 0; i <
pages; i++) {
* For private segments SOFTLOCK * either always breaks cow (any rw * type except S_READ_NOCOW) or * address space is locked as writer * (S_READ_NOCOW case) and anon slots * can't show up on second check. * Therefore if we are here for * SOFTLOCK case it must be a cow * break but cow break never reduces * szc. text replication (tron) in * this case works as cow break. * p_szc can't be changed for locked for (i = 0; i <
pages; i++) {
* hat_page_demote() needs an SE_EXCL lock on one of * constituent page_t's and it decreases root's p_szc * last. This means if root's p_szc is equal szc and * all its constituent pages are locked * hat_page_demote() that could have changed p_szc to * szc is already done and no new have page_demote() * can start for this large page. * we need to make sure same mapping size is used for * the same address range if there's a possibility the * adddress is already mapped because hat layer panics * when translation is loaded for the range already * mapped with a different page size. We achieve it * by always using largest page size possible subject * to the constraints of page size, segment page size * and page alignment. Since mappings are invalidated * when those constraints change and make it * impossible to use previously used mapping size no * mapping size conflicts should happen. for (i = 0; i <
pages; i++) {
* All pages are of szc we need and they are * all locked so they can't change szc. load * if page got promoted since last check * avoid large xhat mappings to FS * pages so that hat_page_demote() * doesn't need to check for xhat * Don't use regions with xhats. for (i = 0; i <
pages; i++) {
for (i = 0; i <
pages; i++) {
* See if upsize is possible. for (i = 0; i <
pages; i++) {
* check if we should use smallest mapping size. * segvn_full_szcpages failed to lock * all pages EXCL. Size down. for (i = 0; i <
pages; i++) {
for (i = 0; i <
pages; i++) {
for (i = 0; i <
pages; i++) {
* segvn_full_szcpages() upgraded pages szc. * p_szc of ppa[0] can change since we haven't * locked all constituent pages. Call * page_lock_szc() to prevent szc changes. * This should be a rare case that happens when * multiple segments use a different page size * to map the same file offsets. * page got promoted since last check. * we don't need preaalocated large for (i = 0; i <
pages; i++) {
* if page got demoted since last check * we could have not allocated larger page. for (i = 0; i <
pages; i++) {
for (i = 0; i <
pages; i++) {
for (i = 0; i <
pages; i++) {
* ierr == -1 means we failed to map with a large page. * misalignment with other mappings to this file. * ierr == -2 means some other thread allocated a large page * after we gave up tp map with a large page. retry with * other process created pszc large page. * but we still have to drop to 0 szc. * Size up case. Note lpgaddr may only be needed for * softlock case so we don't adjust it here. * Size down case. Note lpgaddr may only be needed for * softlock case so we don't adjust it here. * The beginning of the large page region can * be pulled to the right to make a smaller * region. We haven't yet faulted a single * Large page end is mapped beyond the end of file and it's a cow * fault (can be a text replication induced cow) or softlock so we can't * reduce the map area. For now just demote the segment. This should * really only happen if the end of the file changed after the mapping * was established since when large page segments are created we make * sure they don't extend beyond the end of the file. /* segvn_fault will do its job as if szc had been zero to begin with */ * This routine will attempt to fault in one large page. * it will use smaller pages if that fails. * It should only be called for pure anonymous segments. /* caller has already done segment level protection check. */ * Handle pages that have been marked for migration * If all pages in ppa array belong to the same * large page call segvn_slock_anonpages() for (i = 0; i <
pages; i++) {
for (j = 0; j < i; j++) {
for (j = i; j <
pages; j++) {
for (i = 0; i <
pages; i++)
* ierr == -1 means we failed to allocate a large page. * so do a size down operation. * ierr == -2 means some other process that privately shares * pages with this process has allocated a larger page and we * need to retry with larger pages. So do a size up * operation. This relies on the fact that large pages are * never partially shared i.e. if we share any constituent * page of a large page with another process we must share the * entire large page. Note this cannot happen for SOFTLOCK * case, unless current address (a) is at the beginning of the * next page size boundary because the other process couldn't * have relocated locked pages. * For the very first relocation failure try to purge this * segment's cache so that the relocator can obtain an * exclusive lock on pages we want to relocate. * For non COW faults and segvn_anypgsz == 0 * we need to be careful not to loop forever * if existing page is found with szc other * than 0 or seg->s_szc. This could be due * to page relocations on behalf of DR or * more likely large page creation. For this * case simply re-size to existing page's szc * if returned by anon_map_getpages(). * For softlocks we cannot reduce the fault area * (calculated based on the largest page size for this * segment) for size down and a is already next * page size aligned as assertted above for size * ups. Therefore just continue in case of softlock. continue;
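/*
 * Illustrative sketch (not part of the original seg_vn.c): the size up /
 * size down retry policy described in the surrounding comments.  A return
 * of -1 from the allocation step means "couldn't get this page size, try a
 * smaller one"; -2 means "another process already shares a larger page
 * here, retry with a larger size".  The helper below is hypothetical; the
 * real loop also re-derives the faulting range for the new size and honors
 * the softlock restrictions mentioned above.
 */
static unsigned
next_szc_after_error(int ierr, unsigned szc, int anypgsz)
{
	if (ierr == -2)
		return (szc + 1);	/* size up: a larger shared page exists */

	/* ierr == -1: failed to get this size, size down (caller ensures szc > 0) */
	if (!anypgsz)
		return (0);		/* only the base page size is allowed */
	return (szc - 1);		/* try the next smaller size */
}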
	/* keep lint happy */

	/*
	 * Size up case. Note lpgaddr may only be needed for
	 * softlock case so we don't adjust it here.
	 */

	/*
	 * Size down case. Note lpgaddr may only be needed for
	 * softlock case so we don't adjust it here.
	 *
	 * The beginning of the large page region can
	 * be pulled to the right to make a smaller
	 * region. We haven't yet faulted a single page.
	 */

int fltadvice = 1;
/* set to free behind pages for sequential access */ * This routine is called via a machine specific fault handling routine. * It is also called by software routines wishing to lock or unlock * Here is the basic algorithm: * Checking and set up work * If we will need some non-anonymous pages * Call VOP_GETPAGE over the range of non-anonymous pages * Loop over all addresses requested * Call segvn_faultpage passing in page list * to load up translations and handle anonymous pages * Load up translation to any additional pages in page list not * already handled that fit into this segment * First handle the easy stuff * If we have the same protections for the entire segment, * insure that the access being attempted is legitimate. return (
FC_PROT);
/* illegal access type */ /* this must be SOFTLOCK S_READ fault */ * this must be the first ever non S_READ_NOCOW * softlock for this segment. * We can't allow the long term use of softlocks for vmpss segments, * because in some file truncation cases we should be able to demote * the segment, which requires that there are no softlocks. The * only case where it's ok to allow a SOFTLOCK fault against a vmpss * segment is S_READ_NOCOW, where the caller holds the address space * locked as writer and calls softunlock before dropping the as lock. * S_READ_NOCOW is used by /proc to read memory from another user. * Another deadlock between SOFTLOCK and file truncation can happen * because segvn_fault_vnodepages() calls the FS one pagesize at * a time. A second VOP_GETPAGE() call by segvn_fault_vnodepages() * can cause a deadlock because the first set of page_t's remain * locked SE_SHARED. To avoid this, we demote segments on a first * SOFTLOCK if they have a length greater than the segment's * So for now, we only avoid demoting a segment on a SOFTLOCK when * the access type is S_READ_NOCOW and the fault length is less than * or equal to the segment's page size. While this is quite restrictive, * it should be the most common case of SOFTLOCK against a vmpss * For S_READ_NOCOW, it's safe not to do a copy on write because the * caller makes sure no COW will be caused by another thread for a * Check to see if we need to allocate an anon_map structure. * Drop the "read" lock on the segment and acquire * the "write" version since we have to allocate the * Start all over again since segment protections * may have changed after we dropped the "read" lock. * S_READ_NOCOW vs S_READ distinction was * only needed for the code above. After * that we treat it as S_READ. * MADV_SEQUENTIAL work is ignored for large page segments. * The fast path could apply to S_WRITE also, except * that the protection fault could be caused by lazy * tlb flush when ro->rw. In this case, the pte is * RW already. But RO in the other cpu's tlb causes * the fault. Since hat_chgprot won't do anything if * pte doesn't change, we may end up faulting * indefinitely until the RO tlb entry gets replaced. * If MADV_SEQUENTIAL has been set for the particular page we * are faulting on, free behind all pages in the segment and put * If this is an anon page, we must find the * correct <vp, offset> for it * Skip pages that are free or have an * We don't need the page_struct_lock to test * as this is only advisory; even if we * acquire it someone might race in and lock * the page after we unlock and before the * PUTPAGE, then VOP_PUTPAGE will do nothing. * Hold the vnode before releasing * the page lock to prevent it from * being freed and re-used by some * We should build a page list * to kluster putpages XXX * XXX - Should the loop terminate if * See if we need to call VOP_GETPAGE for * *any* of the range being faulted on. * We can skip all of this work if there * Only acquire reader lock to prevent amp->ahp * from being changed. It's ok to miss pages, * hence we don't do anon_array_enter * Page list won't fit in local array, * allocate one of the needed size. * Ask VOP_GETPAGE to return the exact number * (a) this is a COW fault, or * (b) this is a software fault, or * (c) next page is already mapped. * Ask VOP_GETPAGE to return adjacent pages * Need to get some non-anonymous pages. * We need to make only one call to GETPAGE to do * this to prevent certain deadlocking conditions * when we are doing locking. 
In this case * non_anon() should have picked up the smallest * range which includes all the non-anonymous * pages in the requested range. We have to * be careful regarding which rw flag to pass in * because on a private mapping, the underlying * object is never allowed to be written. "segvn_getpage:seg %p addr %p vp %p",
* N.B. at this time the plp array has all the needed non-anon * pages in addition to (possibly) having some adjacent pages. * Always acquire the anon_array_lock to prevent * 2 threads from allocating separate anon slots for * If this is a copy-on-write fault and we don't already * have the anon_array_lock, acquire it to prevent the * fault routine from handling multiple copy-on-write faults * on the same "addr" in the same address space. * Only one thread should deal with the fault since after * it is handled, the other threads can acquire a translation * to the newly created private page. This prevents two or * more threads from creating different private pages for the * We grab "serialization" lock here if this is a MAP_PRIVATE segment * to prevent deadlock between this thread and another thread * which has soft-locked this page and wants to acquire serial_lock. * The fix for bug 4026339 becomes unnecessary when using the * locking scheme with per amp rwlock and a global set of hash * lock, anon_array_lock. If we steal a vnode page when low * on memory and upgrad the page lock through page_rename, * then the page is PAGE_HANDLED, nothing needs to be done * for this page after returning from segvn_faultpage. * But really, the page lock should be downgraded after * the stolen page is page_rename'd. * Ok, now loop over the address range and handle faults /* Didn't get pages from the underlying fs so we're done */ * Now handle any other pages in the list returned. * If the page can be used, load up the translations now. * Note that the for loop will only be entered if "plp" * is pointing to a non-NULL page pointer which means that * VOP_GETPAGE() was called and vpprot has been initialized. * Large Files: diff should be unsigned value because we started * supporting > 2GB segment sizes from 2.5.1 and when a * large file of size > 2GB gets mapped to address space * the diff value can be > 2GB. * Large Files: Following is the assertion * validating the above cast. * Prevent other threads in the address space from * creating private pages (i.e., allocating anon slots) * while we are in the process of loading translations * to additional pages returned by the underlying * Skip mapping read ahead pages marked * for migration, so they will get migrated * This routine is used to start I/O on pages asynchronously. XXX it will * only create PAGESIZE pages. At fault time they will be relocated into * Reader lock to prevent amp->ahp from being changed. * This is advisory, it's ok to miss a page, so * we don't do anon_array_enter lock. return (0);
	/* zfod page - do nothing now */

	    "segvn_getpage:seg %p addr %p vp %p",
	    seg, addr, vp);

		return (EACCES);
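/*
 * Illustrative sketch (not part of the original seg_vn.c): the
 * MADV_SEQUENTIAL "free behind" policy mentioned above.  When the advice on
 * the faulting page is sequential, pages at lower addresses than the fault
 * address are no longer needed and can be pushed out.  The callback and
 * types are hypothetical stand-ins for the VOP_PUTPAGE work the driver does.
 */
#include <stddef.h>
#include <stdint.h>

#define	FB_MODEL_PAGESIZE	4096

typedef void (putpage_model_t)(uintptr_t pgaddr, void *arg);

static void
free_behind(uintptr_t seg_base, uintptr_t fault_addr, int advice_sequential,
    putpage_model_t *putpage, void *arg)
{
	uintptr_t a;

	if (!advice_sequential)
		return;
	/* every page before the faulting page is a free-behind candidate */
	for (a = seg_base; a < fault_addr; a += FB_MODEL_PAGESIZE)
		putpage(a, arg);
}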
/* violated maxprot */ /* return if prot is the same */ * Since we change protections we first have to flush the cache. * This makes sure all the pagelock calls have to recheck * Since we do have the segvn writers lock nobody can fill * the cache with entries belonging to this seg during * the purge. The flush either succeeds or we still have * If we are holding the as lock as a reader then * we need to return IE_RETRY and let the as * layer drop and re-acquire the lock as a writer. * If it's a private mapping and we're making it writable then we * may have to reserve the additional swap space now. If we are * making writable only a part of the segment then we use its vpage * array to keep a record of the pages for which we have reserved * swap. In this case we set the pageswap field in the segment's * segvn structure to record this. * If it's a private mapping to a file (i.e., vp != NULL) and we're * removing write permission on the entire segment and we haven't * modified any pages, we can release the swap space. * Start by determining how much swap * Make sure that the vpage array * exists, and make a note of the * range of elements corresponding * This is the first time we've * asked for a part of this * reserve everything we've * We have to count the number /* Try to reserve the necessary swap. */ * Make a note of how much swap space * Swap space is released only if this segment * does not map anonymous memory, since read faults * on such segments still need an anon slot to read "anon proc:%p %lu %u",
		    seg, 0, 0);

		return (0);
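/*
 * Illustrative sketch (not part of the original seg_vn.c): when write
 * permission is added to part of a private segment, only the pages that do
 * not already have swap reserved need a new reservation, and each such page
 * is then marked in the vpage array so the reservation is not repeated (and
 * can be given back later), as described above.  The flag and types are
 * simplified stand-ins.
 */
#include <stddef.h>
#include <stdint.h>

#define	SP_MODEL_PAGESIZE	4096
#define	VPP_MODEL_SWAPRES2	0x01

typedef struct vpage_model2 {
	uint8_t	vpp_flags;
} vpage_model2_t;

/* Returns the number of bytes of swap that must be newly reserved. */
static size_t
swap_needed_for_write(const vpage_model2_t *vpp, size_t first, size_t npages)
{
	size_t i, need = 0;

	for (i = first; i < first + npages; i++) {
		if (!(vpp[i].vpp_flags & VPP_MODEL_SWAPRES2))
			need++;
	}
	return (need * SP_MODEL_PAGESIZE);
}

/* After a successful reservation, record it page by page. */
static void
mark_swap_reserved(vpage_model2_t *vpp, size_t first, size_t npages)
{
	size_t i;

	for (i = first; i < first + npages; i++)
		vpp[i].vpp_flags |= VPP_MODEL_SWAPRES2;
}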
/* all done */ * A vpage structure exists or else the change does not * involve the entire segment. Establish a vpage structure * if none is there. Then, for each page in the range, * adjust its individual permissions. Note that write- * enabling a MAP_PRIVATE page can affect the claims for * locked down memory. Overcommitting memory terminates * See Statement at the beginning of segvn_lockop regarding * the way cowcnts and lckcnts are handled. panic(
"segvn_setprot: no page");
* Did we terminate prematurely? If so, simply unload * the translations to the things we've updated so far. * Either private or shared data with write access (in * which case we need to throw out all former translations * so that we get the right translations set up on fault * and we don't allow write access to any copy-on-write pages * that might be around or to prevent write access to pages * representing holes in a file), or we don't have permission * to access the memory at all (in which case we have to * unload any current translations that might exist). * A shared mapping or a private mapping in which write * protection is going to be denied - just change all the * protections over the range of addresses in question. * segvn does not support any other attributes other * than prot so we can use hat_chgattr. * segvn_setpagesize is called via SEGOP_SETPAGESIZE from as_setpagesize, * to determine if the seg is capable of mapping the requested szc. * addr should always be pgsz aligned but eaddr may be misaligned if * it's at the end of the segment. * XXX we should assert this condition since as_setpagesize() logic * Check that protections are the same within new page * Since we are changing page size we first have to flush * the cache. This makes sure all the pagelock calls have * to recheck protections. * Since we do have the segvn writers lock nobody can fill * the cache with entries belonging to this seg during * the purge. The flush either succeeds or we still have * Operation for sub range of existing segment. /* eaddr is szc aligned */ /* eaddr is szc aligned */ * Break any low level sharing and reset seg->s_szc to 0. * If the end of the current segment is not pgsz aligned * then attempt to concatenate with the next segment. * May need to re-align anon array to * anon_fill_cow_holes() may call VOP_GETPAGE(). * don't take anon map lock here to avoid holding it * across VOP_GETPAGE() calls that may call back into * segvn for klsutering checks. We don't really need * anon map lock here since it's a private segment and * we hold as level lock as writers. * do HAT_UNLOAD_UNMAP since we are changing the pagesize. * unload argument is 0 when we are freeing the segment * and unload was already done. * XXX anon rwlock is not really needed because this is a * private segment and we are writers. panic(
"segvn_claim_pages: no anon slot");
			panic("segvn_claim_pages: no page");
for (i = 0; i <
pg_idx; i++) {
* Returns right (upper address) segment if split occurred. * If the address is equal to the beginning or end of its segment it returns * The offset for an anonymous segment has no signifigance in * terms of an offset into a file. If we were to use the above * calculation instead, the structures read out of * /proc/<pid>/xmap would be more difficult to decipher since * it would be unclear whether two seemingly contiguous * prxmap_t structures represented different segments or a * single segment that had been split up into multiple prxmap_t * structures (e.g. if some part of the segment had not yet * Split the amount of swap reserved. * For MAP_NORESERVE, only allocate swap reserve for pages * being used. Other segments get enough to cover whole * called on memory operations (unmap, setprot, setpagesize) for a subset * of a large page segment to either demote the memory range (SDR_RANGE) * or the ends (SDR_END) by addr/len. * returns 0 on success. returns errno, including ENOMEM, on failure. /* demote entire range */ * If segment protection can be used, simply check against them. * Have to check down to the vpage level. * Check to see if it makes sense to do kluster/read ahead to * addr + delta relative to the mapping at addr. We assume here * that delta is a signed PAGESIZE'd multiple (which can be negative). * For segvn, we currently "approve" of the action if we are * still in the segment and it maps from the same vp/off, * or if the advice stored in segvn_data or vpages allows it. * Currently, klustering is not allowed only if MADV_RANDOM is set. return (-
1);		/* exceeded segment bounds */

	/*
	 * Check to see if either of the pages addr or addr + delta
	 * have advice set that prevents klustering (if MADV_RANDOM advice
	 * is set for entire segment, or MADV_SEQUENTIAL is set and delta
	 * is negative).
	 */
		return (0);
	/* shared mapping - all ok */

		return (0);		/* off original vnode */

		return (-1);		/* one with and one without an anon */

	if (oap == NULL) {		/* implies that ap == NULL */
		return (0);
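/*
 * Illustrative sketch (not part of the original seg_vn.c): the kluster
 * check described above.  Klustering to addr + delta is rejected if the
 * target falls outside the segment, if MADV_RANDOM advice applies, or (for
 * private segments) if exactly one of the two addresses has an anon page,
 * since the two pages would then not come from the same object.  Types and
 * helpers are simplified stand-ins.
 */
#include <stddef.h>
#include <stdint.h>

typedef struct kluster_model {
	uintptr_t	seg_base;
	size_t		seg_size;
	int		advice_random;	/* MADV_RANDOM on segment or page */
	int		is_shared;	/* MAP_SHARED segment */
} kluster_model_t;

/* return 0 to approve klustering, -1 to reject */
static int
kluster_ok(const kluster_model_t *k, uintptr_t addr, intptr_t delta,
    int addr_has_anon, int target_has_anon)
{
	uintptr_t target = addr + delta;

	if (target < k->seg_base || target >= k->seg_base + k->seg_size)
		return (-1);		/* exceeded segment bounds */
	if (k->advice_random)
		return (-1);		/* advice forbids read ahead */
	if (k->is_shared)
		return (0);		/* shared mapping - all ok */
	if (addr_has_anon != target_has_anon)
		return (-1);		/* one with and one without an anon */
	return (0);
}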
/* off original vnode */ * Now we know we have two anon pointers - check to * see if they happen to be properly allocated. * XXX We cheat here and don't lock the anon slots. We can't because * we may have been called from the anon layer which might already * have locked them. We are holding a refcnt on the slots so they * can't disappear. The worst that will happen is we'll get the wrong * names (vp, off) for the slots and make a poor klustering decision. * Swap the pages of seg out to secondary storage, returning the * number of bytes of storage freed. * The basic idea is first to unload all translations and then to call * VOP_PUTPAGE() for all newly-unmapped pages, to push them out to the * swap device. Pages to which other segments have mappings will remain * mapped and won't be swapped. Our caller (as_swapout) has already * performed the unloading step. * The value returned is intended to correlate well with the process's * memory requirements. However, there are some caveats: * 1) When given a shared segment as argument, this routine will * only succeed in swapping out pages for the last sharer of the * segment. (Previous callers will only have decremented mapping * 2) We assume that the hat layer maintains a large enough translation * cache to capture process reference patterns. * Find pages unmapped by our caller and force them * out to the virtual swap device. * Obtain <vp, off> pair for the page, then look it up. * Note that this code is willing to consider regular * pages as well as anon pages. Is this appropriate here? if (
vp == NULL) {
/* untouched zfod page */ * Examine the page to see whether it can be tossed out, * keeping track of how many we've found. * If the page has an i/o lock and no mappings, * it's very likely that the page is being * written out as a result of klustering. * Assume this is so and take credit for it here. * Skip if page is locked or has mappings. * We don't need the page_struct_lock to look at lckcnt * and cowcnt because the page is exclusive locked. * dispose skips large pages so try to demote first. * XXX should skip the remaining page_t's of this * No longer mapped -- we can toss it out. How * we do so depends on whether or not it's dirty. * We must clean the page before it can be * freed. Setting B_FREE will cause pvn_done * to free the page when the i/o completes. * XXX: This also causes it to be accounted * as a pageout instead of a swap: need * B_SWAPOUT bit to use instead of B_FREE. * Hold the vnode before releasing the page lock * to prevent it from being freed and re-used by * Queue all i/o requests for the pageout thread * to avoid saturating the pageout devices. * The page was clean, free it. * XXX: Can we ever encounter modified pages * with no associated vnode here? /*LINTED: constant in conditional context*/ * Credit now even if i/o is in progress. * Wakeup pageout to initiate i/o on all queued requests. * Synchronize primary storage cache with real object in virtual memory. * XXX - Anonymous pages should not be sync'ed out at all. * flush all pages from seg cache * otherwise we may deadlock in swap_putpage * for B_INVAL page (4175402). * Even if we grab segvn WRITER's lock or segp_slock * here, there might be another thread which could've * we acquired the lock here. So, grabbing either * lock here is of not much use. Until we devise * a strategy at upper layers to solve the * synchronization issues completely, we expect * applications to handle this appropriately. * We are done if the segment types don't match * or if we have segment level protections and * No attributes, no anonymous pages and MS_INVALIDATE flag * is not on, just use one big request. if (
vp == NULL)
/* untouched zfod page */ * See if any of these pages are locked -- if so, then we * will have to truncate an invalidate request at the first * locked one. We don't need the page_struct_lock to test * as this is only advisory; even if we acquire it someone * might race in and lock the page after we unlock and before * we do the PUTPAGE, then PUTPAGE simply does nothing. * swapfs VN_DISPOSE() won't * invalidate large pages. * XXX can't help it if it * pages it is no big deal. * Avoid writing out to disk ISM's large pages * because segspt_free_pages() relies on NULL an_pvp * of anon slots of such pages. * swapfs uses page_lookup_nowait if not freeing or * invalidating and skips a page if * page_lookup_nowait returns NULL. * Note ISM pages are created large so (vp, off)'s * page cannot suddenly become large after we unlock * XXX - Should ultimately try to kluster * calls to VOP_PUTPAGE() for performance. * Determine if we have data corresponding to pages in the * primary storage virtual memory cache (i.e., "in core"). return (
len);
/* no anonymous pages created yet */ /* A page exists for the anon slot */ * If page is mapped and writable * Don't get page_struct lock for lckcnt and cowcnt, * since this is purely advisory. /* Gather vnode statistics */ * Try to obtain a "shared" lock on the page * without blocking. If this fails, determine * if the page is in memory. /* Page is incore, and is named */ * Don't get page_struct lock for lckcnt and cowcnt, * since this is purely advisory. /* Gather virtual page information */ * p_cowcnt is updated while mlock/munlocking MAP_PRIVATE and PROT_WRITE region * irrespective of the following factors or anything else: * (1) anon slots are populated or not * (2) cow is broken or not * (3) refcnt on ap is 1 or greater than 1 * If it's not MAP_PRIVATE and PROT_WRITE, p_lckcnt is updated during mlock * if vpage has PROT_WRITE * transfer cowcnt on the oldpage -> cowcnt on the newpage * transfer lckcnt on the oldpage -> lckcnt on the newpage * During copy-on-write, decrement p_cowcnt on the oldpage and increment * p_cowcnt on the newpage *if* the corresponding vpage has PROT_WRITE. * We may also break COW if softlocking on read access in the physio case. * In this case, vpage may not have PROT_WRITE. So, we need to decrement * p_lckcnt on the oldpage and increment p_lckcnt on the newpage *if* the * vpage doesn't have PROT_WRITE. * If a MAP_PRIVATE region loses PROT_WRITE, we decrement p_cowcnt and * increment p_lckcnt by calling page_subclaim() which takes care of * availrmem accounting and p_lckcnt overflow. * If a MAP_PRIVATE region gains PROT_WRITE, we decrement p_lckcnt and * increment p_cowcnt by calling page_addclaim() which takes care of * availrmem availability and p_cowcnt overflow. * Lock down (or unlock) pages mapped by this segment. * XXX only creates PAGESIZE pages if anon slots are not initialized. * At fault time they will be relocated into larger pages. * Hold write lock on address space because may split or concatenate * If this is a shm, use shm's project and zone, else use * project and zone of calling process /* Determine if this segment backs a sysV shm */ * We are done if the segment types don't match * or if we have segment level protections and * If we're locking, then we must create a vpage structure if * none exists. If we're unlocking, then check to see if there * is a vpage -- if not, then we could not have locked anything. * The anonymous data vector (i.e., previously * unreferenced mapping to swap space) can be allocated * by lazily testing for its existence. /* determine number of unlocked bytes in range for lock operation */ /* Only count sysV pages once for locked memory */ * Loop over all pages in the range. Process if we're locking and * page has not already been locked in this mapping; or if we're * unlocking and the page has been locked. * If this isn't a MAP_NORESERVE segment and * we're locking, allocate anon slots if they * don't exist. The page is brought in later on. * Get name for page, accounting for * existence of private copy. * Get page frame. It's ok if the page is * not available when we're unlocking, as this * may simply mean that a page we locked got * truncated out of existence after we locked it. 
* Invoke VOP_GETPAGE() to obtain the page struct * since we may need to read it from disk if its * If the error is EDEADLK then we must bounce * up and drop all vm subsystem locks and then * retry the operation later * This behavior is a temporary measure because * ufs/sds logging is badly designed and will * deadlock if we don't allow this bounce to * happen. The real solution is to re-design * the logging code to work properly. See bug * 4125102 for details of the problem. * Quit if we fail to fault in the page. Treat * the failure as an error, unless the addr * is mapped beyond the end of a file. * See Statement at the beginning of this routine. * claim is always set if MAP_PRIVATE and PROT_WRITE * irrespective of following factors: * (1) anon slots are populated or not * (2) cow is broken or not * (3) refcnt on ap is 1 or greater than 1 * See 4140683 for details * Perform page-level operation appropriate to * operation. If locking, undo the SOFTLOCK * performed to bring the page into memory * after setting the lock. If unlocking, * and no page was found, account for the claim int ret =
1;
/* Assume success */ /* locking page failed */ /* sysV pages should be locked */ /* Credit back bytes that did not get locked */ /* Account bytes that were unlocked */ * Set advice from user for specified pages * There are 5 types of advice: * MADV_NORMAL - Normal (default) behavior (whatever that is) * MADV_RANDOM - Random page references * do not allow readahead or 'klustering' * MADV_SEQUENTIAL - Sequential page references * Pages previous to the one currently being * accessed (determined by fault) are 'not needed' * and are freed immediately * MADV_WILLNEED - Pages are likely to be used (fault ahead in mctl) * MADV_DONTNEED - Pages are not needed (synced out in mctl) * MADV_FREE - Contents can be discarded * MADV_ACCESS_DEFAULT- Default access * MADV_ACCESS_LWP - Next LWP will access heavily * MADV_ACCESS_MANY- Many LWPs or processes will access heavily * In case of MADV_FREE, we won't be modifying any segment private * data structures; so, we only need to grab READER's lock * Large pages are assumed to be only turned on when accesses to the * segment's address range have spatial and temporal locality. That * justifies ignoring MADV_SEQUENTIAL for large page segments. * Also, ignore advice affecting lgroup memory allocation * if don't need to do lgroup optimizations on this system * Since we are going to unload hat mappings * we first have to flush the cache. Otherwise * this might lead to system panic if another * thread is doing physio on the range whose * mappings are unloaded by madvise(3C). * Since we do have the segvn writers lock * nobody can fill the cache with entries * belonging to this seg during the purge. * The flush either succeeds or we still * have pending I/Os. In the later case, * Since madvise(3C) is advisory and * it's not part of UNIX98, madvise(3C) * failure here doesn't cause any hardship. * Note that we don't block in "as" layer. * MADV_FREE is not supported for segments with * underlying object; if anonmap is NULL, anon slots * are not yet populated and there is nothing for * us to do. As MADV_FREE is advisory, we don't * return error in either case. * If advice is to be applied to entire segment, * use advice field in seg_data structure * otherwise use appropriate vpage entry. * Set memory allocation policy for this segment * For private memory, need writers lock on * address space because the segment may be * split or concatenated when changing policy * If policy set already and it shouldn't be reapplied, * Mark any existing pages in given range for * If same policy set already or this is a shared * memory segment, don't need to try to concatenate * segment with adjacent ones. 
* Try to concatenate this segment with previous * one and next one, since we changed policy for * this one and it may be compatible with adjacent * Drop lock for private data of current * segment before concatenating (deleting) it * and return IE_REATTACH to tell as_ctl() that * current segment has changed * unloading mapping guarantees * detection in segvn_fault * Set memory allocation policy for portion of this * Align address and length of advice to page * boundaries for large pages * Check to see whether policy is set already * If policy set already and it shouldn't be reapplied, * For private memory, need writers lock on * address space because the segment may be * split or concatenated when changing policy * Mark any existing pages in given range for * Don't need to try to split or concatenate * segments, since policy is same or this is a shared * Split off new segment if advice only applies to a * portion of existing segment starting in middle * Must flush I/O page cache * before splitting segment * Split segment and return IE_REATTACH to tell * as_ctl() that current segment changed * If new segment ends where old one * did, try to concatenate the new * Set policy for new segment * Split off end of existing segment if advice only * applies to a portion of segment ending before * end of the existing segment * Must flush I/O page cache * before splitting segment * If beginning of old segment was already * split off, use new segment to split end off * Set policy for new segment * Split segment and return IE_REATTACH * to tell as_ctl() that current * If new segment starts where old one * did, try to concatenate it with * Drop lock for private data * of current segment before * concatenating (deleting) it * Create a vpage structure for this seg. * If no vpage structure exists, allocate one. Copy the protections * and the advice from the segment itself to the individual pages. * Dump the pages belonging to this segvn segment. * If pp == NULL, the page either does not exist * or is exclusively locked. So determine if it * exists before searching for it. * lock/unlock anon pages over a given range. Return shadow list "segvn_pagelock: start seg %p addr %p",
seg, addr);
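/*
 * Sketch (assumed, simplified): deciding where a segment covering
 * [base, base + size) would have to be split so that an advice/policy
 * change applies to exactly [addr, addr + len), as described above for
 * ranges that start in the middle or end before the end of the segment.
 * split_plan_t and plan_split() are made up for this illustration.
 */
#include <stdint.h>
#include <stddef.h>

typedef struct split_plan {
	uintptr_t	sp_front;	/* split address before the range, 0 if none */
	uintptr_t	sp_back;	/* split address after the range, 0 if none */
} split_plan_t;

static split_plan_t
plan_split(uintptr_t base, size_t size, uintptr_t addr, size_t len)
{
	split_plan_t sp = { 0, 0 };

	if (addr > base)			/* range starts mid-segment */
		sp.sp_front = addr;
	if (addr + len < base + size)		/* range ends before segment does */
		sp.sp_back = addr + len;
	return (sp);
}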
* We are adjusting the pagelock region to the large page size * boundary because the unlocked part of a large page cannot * be freed anyway unless all constituent pages of a large * page are locked. Therefore this adjustment allows us to * decrement availrmem by the right value (note we don't want * to just decrement availrem by the large page size without * adjusting addr and len because then we may end up * decrementing availrmem by large page size for every * constituent page locked by a new as_pagelock call). * as_pageunlock caller must always match as_pagelock call's * Note segment's page size cannot change while we are holding * as lock. And then it cannot change while softlockcnt is * not 0. This will allow us to correctly recalculate large * for pageunlock *ppp points to the pointer of page_t that * corresponds to the real unadjusted start address. Similar * for pagelock *ppp must point to the pointer of page_t that * corresponds to the real unadjusted start address. * update hat ref bits for /proc. We need to make sure * that threads tracing the ref and mod bits of the * address space get the right data. * Note: page ref and mod bits are updated at reclaim time * If someone is blocked while unmapping, we purge * segment page cache and thus reclaim pplist synchronously * without waiting for seg_pasync_thread. This speeds up * unmapping in cases where munmap(2) is called, while * raw async i/o is still in progress or where a thread * exits on data fault in a multithreaded application. * Even if we grab segvn WRITER's lock or segp_slock * here, there might be another thread which could've * we acquired the lock here. So, grabbing either * lock here is of not much use. Until we devise * a strategy at upper layers to solve the * synchronization issues completely, we expect * applications to handle this appropriately. "segvn_pagelock: unlock seg %p addr %p",
seg, addr);
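/*
 * Sketch (assumed): rounding a pagelock request out to large-page
 * boundaries as described above, so availrmem is adjusted once for each
 * large page rather than once per constituent-page lock request.  The
 * ALIGN_* macros are local stand-ins for the usual power-of-two helpers
 * and require lpgsize to be a power of two.
 */
#include <stdint.h>
#include <stddef.h>

#define	ALIGN_DOWN(x, a)	((x) & ~((uintptr_t)(a) - 1))
#define	ALIGN_UP(x, a)	(((x) + ((uintptr_t)(a) - 1)) & ~((uintptr_t)(a) - 1))

static void
adjust_to_large_page(uintptr_t addr, size_t len, size_t lpgsize,
    uintptr_t *a_addr, size_t *a_len)
{
	uintptr_t start = ALIGN_DOWN(addr, lpgsize);
	uintptr_t end = ALIGN_UP(addr + len, lpgsize);

	*a_addr = start;
	*a_len = (size_t)(end - start);
}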
"segvn_pagelock: reclaim seg %p addr %p",
seg,
addr);
* for now we only support pagelock to anon memory. We've to check * protections for vnode objects and call into the vnode driver. * That's too much for a fast path. Let the fault entry point handle it. "segvn_pagelock: mapped vnode seg %p addr %p",
seg, addr);
* if anonmap is not yet created, let the fault entry point populate it "segvn_pagelock: anonmap null seg %p addr %p",
seg, addr);
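/*
 * Sketch (assumed): the fast-path screening described above.  A lock
 * request against a vnode-backed range, or against a range whose anon
 * map has not been created yet, is refused so the caller falls back to
 * the fault path.  toy_segdata_t and its fields are hypothetical.
 */
#include <errno.h>
#include <stddef.h>

typedef struct toy_segdata {
	void	*sd_vp;		/* backing vnode, if any */
	void	*sd_amp;	/* anon map, NULL until populated */
} toy_segdata_t;

static int
fastpath_lock_ok(const toy_segdata_t *sd)
{
	if (sd->sd_vp != NULL)		/* vnode pages: too much for a fast path */
		return (ENOTSUP);
	if (sd->sd_amp == NULL)		/* no anon slots yet: let fault fill them */
		return (ENOTSUP);
	return (0);
}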
* we acquire segp_slock to prevent duplicate entries * try to find pages in segment page cache "segvn_pagelock: cache hit seg %p addr %p",
seg, addr);
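/*
 * Sketch (assumed): doing the cache lookup and the insert under one lock
 * so that two threads locking the same range cannot both add an entry.
 * The tiny fixed-size table and names are purely for illustration.
 */
#include <pthread.h>
#include <stdint.h>
#include <stddef.h>

#define	TOY_SLOTS	8

static struct {
	uintptr_t	e_addr;
	size_t		e_len;
	int		e_used;
} toy_cache[TOY_SLOTS];
static pthread_mutex_t toy_cache_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns 1 on a hit, 0 after inserting a fresh entry, -1 if the table is full. */
static int
toy_cache_enter(uintptr_t addr, size_t len)
{
	int i, slot = -1, rv = -1;

	(void) pthread_mutex_lock(&toy_cache_lock);
	for (i = 0; i < TOY_SLOTS; i++) {
		if (toy_cache[i].e_used && toy_cache[i].e_addr == addr &&
		    toy_cache[i].e_len == len) {
			(void) pthread_mutex_unlock(&toy_cache_lock);
			return (1);	/* existing entry: no duplicate added */
		}
		if (!toy_cache[i].e_used && slot < 0)
			slot = i;
	}
	if (slot >= 0) {
		toy_cache[slot].e_addr = addr;
		toy_cache[slot].e_len = len;
		toy_cache[slot].e_used = 1;
		rv = 0;
	}
	(void) pthread_mutex_unlock(&toy_cache_lock);
	return (rv);
}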
* Avoid per page overhead of segvn_slock_anonpages() for small * pages. For large pages segvn_slock_anonpages() only does real * work once per large page. The tradeoff is that we may decrement * availrmem more than once for the same page but this is ok * We must never use seg_pcache for COW pages * because we might end up with original page still * lying in seg_pcache even after private page is * created. This leads to data corruption as * aio_write refers to the page still in cache * while all other accesses refer to the private "segvn_pagelock: cache fill seg %p addr %p",
seg, addr);
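/*
 * Sketch (assumed): the "never cache a page that may still be COWed"
 * rule implied above.  A privately mapped, writable page whose anon slot
 * is still shared may be replaced by a private copy on the next write
 * fault, so caching its current identity would leave a stale entry.
 * The structure and flag names are hypothetical.
 */
#include <stddef.h>

typedef struct toy_anon {
	int	an_refcnt;	/* mappings still sharing this anon page */
} toy_anon_t;

#define	TOY_MAP_PRIVATE	0x1
#define	TOY_PROT_WRITE	0x2

static int
safe_to_cache(const toy_anon_t *ap, int maptype, int prot)
{
	if ((maptype & TOY_MAP_PRIVATE) && (prot & TOY_PROT_WRITE) &&
	    ap != NULL && ap->an_refcnt > 1)
		return (0);	/* a future COW could strand the cached page */
	return (1);
}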
"segvn_pagelock: cache miss seg %p addr %p",
seg,
addr);
 * purge any cached pages in the I/O page cache
panic("segvn_reclaim: unaligned addr or len");
* get a memory ID for an addr in a given segment * XXX only creates PAGESIZE pages if anon slots are not initialized. * At fault time they will be relocated into larger pages. * Get memory allocation policy info for specified address in given segment * Get policy info for private or shared memory * Bind text vnode segment to an amp. If we bind successfully mappings will be * established to per vnode mapping per lgroup amp pages instead of to vnode * pages. There's one amp per vnode text mapping per lgroup. Many processes * may share the same text replication amp. If a suitable amp doesn't already * exist in svntr hash table create a new one. We may fail to bind to amp if * segment is not eligible for text replication. Code below first checks for * these conditions. If binding is successful segment tr_state is set to on * and svd->amp points to the amp to use. Otherwise tr_state is set to off and * svd->amp remains as NULL. * If numa optimizations are no longer desired bail out. * Avoid creating anon maps with size bigger than the file size. * If VOP_GETATTR() call fails bail out. * VVMEXEC may not be set yet if exec() prefaults text segment. Set * this flag now before vn_is_mapped(V_WRITE) so that MAP_SHARED * mapping that checks if trcache for this vnode needs to be * invalidated can't miss us. * Bail out if potentially MAP_SHARED writable mappings exist to this * vnode. We don't want to use old file contents from existing * replicas if this mapping was established after the original file * Bail out if the file or its attributes were changed after * this replication entry was created since we need to use the * latest file contents. Note that mtime test alone is not * sufficient because a user can explicitly change mtime via * utimes(2) interfaces back to the old value after modifiying * the file contents. To detect this case we also have to test * ctime which among other things records the time of the last * mtime change by utimes(2). ctime is not changed when the file * is only read or executed so we expect that typically existing * replication amp's can be used most of the time. * if off, eoff and szc match current segment we found the * existing entry we can use. * Don't create different but overlapping in file offsets * entries to avoid replication of the same file pages more * If we didn't find existing entry create a new one. * We want to pick a replica with pages on main thread's (t_tid = 1, * aka T1) lgrp. Currently text replication is only optimized for * workloads that either have all threads of a process on the same * lgrp or execute their large text primarily on main thread. * In case exec() prefaults text on non main thread use * current thread lgrpid. It will become main thread anyway * Set p_tr_lgrpid to lgrpid if it hasn't been set yet. Otherwise * just set it to NLGRPS_MAX if it's different from current process T1 * home lgrp. p_tr_lgrpid is used to detect if process uses text * replication and T1 new home is different from lgrp used for text * replication. When this happens asyncronous segvn thread rechecks if * segments should change lgrps used for text replication. If we fail * to set p_tr_lgrpid with cas32 then set it to NLGRPS_MAX without cas * if it's not already NLGRPS_MAX and not equal lgrp_id we want to * use. We don't need to use cas in this case because another thread * that races in between our non atomic check and set may only change * p_tr_lgrpid to NLGRPS_MAX at this point. 
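/*
 * Sketch (assumed): the shape of the replica lookup described above.  A
 * hash bucket is searched for an entry whose (vp, off, size, szc) key
 * matches and whose recorded mtime/ctime still match the file's current
 * attributes; a mismatch means the file changed after the entry was
 * created, so the lookup fails and no stale replica is reused.  All of
 * the types and names here are hypothetical.
 */
#include <stdint.h>
#include <stddef.h>

typedef struct toy_ts {
	long	tv_sec;
	long	tv_nsec;
} toy_ts_t;

typedef struct toy_svntr {
	struct toy_svntr *tr_next;
	const void	*tr_vp;		/* which file */
	uint64_t	tr_off;		/* mapped offset */
	size_t		tr_size;	/* mapped size */
	uint32_t	tr_szc;		/* page size code */
	toy_ts_t	tr_mtime;	/* file times when the entry was made */
	toy_ts_t	tr_ctime;
} toy_svntr_t;

static int
toy_ts_equal(const toy_ts_t *a, const toy_ts_t *b)
{
	return (a->tv_sec == b->tv_sec && a->tv_nsec == b->tv_nsec);
}

static toy_svntr_t *
toy_svntr_find(toy_svntr_t *bucket, const void *vp, uint64_t off,
    size_t size, uint32_t szc, const toy_ts_t *mtime, const toy_ts_t *ctime)
{
	toy_svntr_t *p;

	for (p = bucket; p != NULL; p = p->tr_next) {
		if (p->tr_vp != vp || p->tr_off != off ||
		    p->tr_size != size || p->tr_szc != szc)
			continue;
		if (!toy_ts_equal(&p->tr_mtime, mtime) ||
		    !toy_ts_equal(&p->tr_ctime, ctime))
			return (NULL);	/* file changed: bail out */
		return (p);
	}
	return (NULL);
}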
* lgrp_move_thread() won't schedule async recheck after * p->p_t1_lgrpid update unless p->p_tr_lgrpid is not * LGRP_NONE. Recheck p_t1_lgrpid once now that p->p_tr_lgrpid * If no amp was created yet for lgrp_id create a new one as long as * we have enough memory to afford it. * Convert seg back to regular vnode mapping seg by unbinding it from its text * replication amp. This routine is most typically called when segment is * unmapped but can also be called when segment no longer qualifies for text * replication (e.g. due to protection changes). If unload_unmap is set use * HAT_UNLOAD_UNMAP flag in hat_unload_callback(). If we are the last user of * svntr free all its anon maps and remove it from the hash table. panic(
"segvn_textunrepl: svntr record not found");
panic("segvn_textunrepl: amp mismatch");
 * This is called when a MAP_SHARED writable mapping is created to a vnode
 * that is currently used for execution (VVMEXEC flag is set). In this case we
 * need to prevent further use of existing replicas.
 *
 * Use tryenter locking since we are locking as/seg and the svntr hash
 * lock in reverse from the synchronous thread order.
 *
 * We don't need to drop the bucket lock, but here we give other
 * threads a chance. svntr and svd can't be unlinked as long as the
 * segment lock is held as a writer and the AS is held as well. After we
 * retake the bucket lock we'll continue from where we left off. We'll be able
 * to reach the end of either list since new entries are always added
 * to the beginning of the lists.
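/*
 * Sketch (assumed): the "trylock when acquiring in reverse order" pattern
 * referred to above, written with pthreads.  When locks A and B are
 * normally taken as A then B, a path that already holds B may only
 * trylock A; on failure it must back off (drop B, yield, retake B) so it
 * cannot deadlock against the normal-order path.  After retaking B the
 * caller must revalidate whatever B protects before continuing.
 */
#include <pthread.h>
#include <sched.h>

static void
lock_a_while_holding_b(pthread_mutex_t *a, pthread_mutex_t *b)
{
	/* Caller holds b on entry; holds both a and b on return. */
	for (;;) {
		if (pthread_mutex_trylock(a) == 0)
			return;
		(void) pthread_mutex_unlock(b);	/* back off out of order */
		sched_yield();			/* give the other path a chance */
		(void) pthread_mutex_lock(b);
	}
}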