vm_pagelist.c revision cb15d5d96b3b2730714c28bfe06cfe7421758b8c
/*
 * CDDL HEADER START
 *
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 *
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each
 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 *
 * CDDL HEADER END
 */
/*
 * Copyright (c) 2004, 2010, Oracle and/or its affiliates. All rights reserved.
 */

/*	Copyright (c) 1984, 1986, 1987, 1988, 1989 AT&T	*/
/*	  All Rights Reserved	*/

/*
 * Portions of this source code were derived from Berkeley 4.3 BSD
 * under license from the Regents of the University of California.
 */

/*
 * This file contains common functions to access and manage the page lists.
 * Many of these routines originated from platform dependent modules and
 * were modified to function in a platform independent manner.
 */

/* vm_cpu_data0 for the boot cpu before kmem is initialized */

/*
 * number of page colors equivalent to requested color in page_get routines.
 * If set, keeps large pages intact longer and keeps MPO allocation
 * from the local mnode in favor of acquiring the 'correct' page color from
 * a demoted large page or from a remote mnode.
 */

/*
 * color equivalency mask for each page size.
 * Mask is computed based on cpu L2$ way sizes and colorequiv global.
 * High 4 bits determine the number of high order bits of the color to ignore.
 * Low 4 bits determine the number of low order bits of the color to ignore
 * (only relevant for hashed index based page coloring).
 */

/*
 * if set, specifies the percentage of large pages that are free from within
 * a large page region before attempting to lock those pages for
 * page_get_contig_pages processing.
 *
 * Should be turned on when kpr is available when page_trylock_contig_pages
 * can be more selective.
 */

/*
 * Limit page get contig page search based on failure counts in pgcpfailcnt[].
 * Enabled by default via pgcplimitsearch.
 *
 * pgcpfailcnt[] is bounded by PGCPFAILMAX (>= 1/2 of installed
 * memory). When reached, pgcpfailcnt[] is reset to 1/2 of this upper
 * bound. This upper bound range guarantees:
 *	- all large page 'slots' will be searched over time
 *	- the minimum (1) large page candidates considered on each pgcp call
 *	- count doesn't wrap around to 0
 */
#endif	/* VM_STATS */
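/*
 * Illustrative sketch, not from the original source: one plausible way a
 * colorequivszc[] entry, with "ignore" counts packed into its high and low
 * 4 bits as described above, could be turned into a color equivalency mask.
 * The helper name color_equiv_mask() and the exact mask layout are
 * assumptions for illustration only.
 */
#include <stdio.h>

static unsigned int
color_equiv_mask(unsigned char ceq, unsigned int ncolors)
{
	unsigned int hi = ceq >> 4;	/* high order color bits to ignore */
	unsigned int lo = ceq & 0xf;	/* low order bits (hashed coloring) */
	unsigned int mask = ncolors - 1; /* ncolors is a power of two */

	mask >>= hi;			/* drop high order color bits */
	mask &= ~((1u << lo) - 1);	/* drop low order color bits */
	return (mask);
}

int
main(void)
{
	/* 128 colors; ignore 1 high order and 2 low order color bits */
	printf("equiv mask = 0x%x\n", color_equiv_mask((1 << 4) | 2, 128));
	return (0);
}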
/* enable page_get_contig_pages */

/*
 * page_freelist_split pfn flag to signify no lo or hi pfn requirement.
 */

/* Flags involved in promotion and demotion routines */

/*
 * Flag for page_demote to be used with PC_FREE to denote that we don't care
 * what the color is as the color parameter to the function is ignored.
 */

/* mtype value for page_promote to use when mtype does not matter */

/*
 * page counters candidates info
 * See page_ctrs_cands comment below for more details.
 * fields are as follows:
 *	pcc_pages_free:		# pages which freelist coalesce can create
 *	pcc_color_free:		pointer to page free counts per color
 */

/*
 * On big machines it can take a long time to check page_counters
 * arrays. page_ctrs_cands is a summary array whose elements are a dynamically
 * updated sum of all elements of the corresponding page_counters arrays.
 * page_freelist_coalesce() searches page_counters only if an appropriate
 * element of page_ctrs_cands array is greater than 0.
 *
 * page_ctrs_cands is indexed by mutex (i), region (r), mnode (m), mrange (g)
 */

/*
 * Return in val the total number of free pages which can be created
 * for the given mnode (m), mrange (g), and region size (r)
 */

/*
 * Return in val the total number of free pages which can be created
 * for the given mnode (m), mrange (g), region size (r), and color (c)
 */

/*
 * We can only allow a single thread to update a counter within the physical
 * range of the largest supported page size. That is the finest granularity
 * possible since the counter values are dependent on each other
 * as you move across region sizes. PP_CTR_LOCK_INDX is used to determine the
 * ctr_mutex lock index for a particular physical range.
 */

/*
 * Local function prototypes.
 */

/*
 * The page_counters array below is used to keep track of free contiguous
 * physical memory. A hw_page_map_t will be allocated per mnode per szc.
 * This contains an array of counters, the size of the array, a shift value
 * used to convert a pagenum into a counter array index or vice versa, as
 * well as a cache of the last successful index to be promoted to a larger
 * page size. As an optimization, we keep track of the last successful index
 * to be promoted per page color for the given size region, and this is
 * allocated dynamically based upon the number of colors for a given
 * page size.
 *
 * Conceptually, the page counters are represented as:
 *
 *	page_counters[region_size][mnode]
 *
 *	region_size:	size code of a candidate larger page made up
 *			of contiguous free smaller pages.
 *
 *	page_counters[region_size][mnode].hpm_counters[index]:
 *		represents how many (region_size - 1) pages either
 *		exist or can be created within the given index range.
 *
 * Let's look at a sparc example:
 *	If we want to create a free 512k page, we look at region_size 2
 *	for the mnode we want. We calculate the index and look at a specific
 *	hpm_counters location. If we see 8 (FULL_REGION_CNT on sparc) at
 *	this location, it means that 8 64k pages either exist or can be created
 *	from 8K pages in order to make a single free 512k page at the given
 *	index. Note that when a region is full, it will contribute to the
 *	counts in the region above it. Thus we will not know what page
 *	size the free pages will be which can be promoted to this new free
 *	page unless we look at all regions below the current region.
 */
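/*
 * Illustrative sketch, not from the original source: the pfn-to-counter-index
 * mapping and full-region test described above. Field names mirror the
 * comment (hpm_base, shift, FULL_REGION_CNT), but the concrete struct layout
 * here is an assumption for illustration.
 */
#include <stddef.h>
#include <stdint.h>

typedef uint64_t pfn_t;

typedef struct sketch_hw_page_map {
	char	*hpm_counters;	/* one counter per candidate large page */
	size_t	hpm_entries;	/* number of entries in hpm_counters */
	pfn_t	hpm_base;	/* pfn mapped to counter index 0 */
	int	hpm_shift;	/* log2(base pages per region at this szc) */
} sketch_hw_page_map_t;

/* index of the region counter covering 'pfn' */
static size_t
sketch_pnum_to_idx(const sketch_hw_page_map_t *hpm, pfn_t pfn)
{
	return ((size_t)((pfn - hpm->hpm_base) >> hpm->hpm_shift));
}

/*
 * In the sparc example above, a counter value of 8 (FULL_REGION_CNT) at
 * region size 2 means eight 64k regions are free or creatable, i.e. a
 * free 512k page can be made at this index.
 */
static int
sketch_region_is_full(const sketch_hw_page_map_t *hpm, pfn_t pfn, int full_cnt)
{
	return (hpm->hpm_counters[sketch_pnum_to_idx(hpm, pfn)] == full_cnt);
}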
/*
 * hw_page_map_t contains all the information needed for the page_counters
 * logic. The fields are as follows:
 *
 *	hpm_counters:	dynamically allocated array to hold counter data
 *	hpm_entries:	entries in hpm_counters
 *	hpm_base:	PFN mapped to counter index 0
 *	hpm_color_current: last index in counter array for this color at
 *			which we successfully created a large page
 */

/*
 * Element zero is not used, but is allocated for convenience.
 */

/*
 * Cached value of MNODE_RANGE_CNT(mnode).
 * This is a function call in x86.
 */

/*
 * The following macros are convenient ways to get access to the individual
 * elements of the page_counters arrays. They can be used on both
 * the left side and right side of equations.
 */

/*
 * Protects the hpm_counters and hpm_color_current memory from changing while
 * looking at page counters information.
 * Grab the write lock to modify what these fields point at.
 * Grab the read lock to prevent any pointers from changing.
 * The write lock can not be held during memory allocation due to a possible
 * recursion deadlock with trying to grab the read lock while the
 * write lock is already held.
 */

/*
 * initialize cpu_vm_data to point at cache aligned vm_cpu_data_t.
 */

/*
 * page size to page size code
 */

/*
 * page size to page size code with the restriction that it be a supported
 * user page size. If it's not a supported user page size, -1 will be returned.
 */

/*
 * Return how many page sizes are available for the user to use. This is
 * what the hardware supports and not based upon how the OS implements the
 * support of different page sizes.
 *
 * If legacy is non-zero, return the number of pagesizes available to legacy
 * applications. The number of legacy page sizes might be less than the
 * exported user page sizes. This is to prevent legacy applications that
 * use the largest page size returned from getpagesizes(3c) from inadvertently
 * using the 'new' large pagesizes.
 */

/*
 * returns the count of the number of base pagesize pages associated with szc
 */
	panic("page_get_pagecnt: out of range %d", szc);
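/*
 * Illustrative sketch, not the original function body: the szc-validated
 * table lookup implied by the panic above. The table sketch_hw_page_array,
 * its hp_pgcnt field, and sketch_mmu_page_sizes are assumptions modeled on
 * the surrounding comments.
 */
typedef unsigned long pgcnt_t;

struct sketch_hw_page {
	unsigned long	hp_size;	/* page size in bytes */
	unsigned int	hp_shift;	/* log2 of hp_size */
	pgcnt_t		hp_pgcnt;	/* base pages per page of this szc */
};

extern struct sketch_hw_page sketch_hw_page_array[];
extern unsigned int sketch_mmu_page_sizes;
extern void panic(const char *, ...);

pgcnt_t
sketch_page_get_pagecnt(unsigned int szc)
{
	if (szc >= sketch_mmu_page_sizes)
		panic("page_get_pagecnt: out of range %d", szc);
	return (sketch_hw_page_array[szc].hp_pgcnt);
}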
	panic("page_get_pagesize: out of range %d", szc);
/*
 * Return the size of a page based upon the index passed in. An index of
 * zero refers to the smallest page size in the system, and as index increases
 * it refers to the next larger supported page size in the system.
 * Note that szc and userszc may not be the same due to unsupported szc's on
 * some platforms.
 */
	panic("page_get_user_pagesize: out of range %d", szc);
	panic("page_get_shift: out of range %d", szc);
	panic("page_get_pagecolors: out of range %d", szc);
/*
 * this assigns the desired equivalent color after a split
 */

/*
 * The interleaved_mnodes flag is set when mnodes overlap in
 * the physbase..physmax range, but have disjoint slices.
 * In this case hpm_counters is shared by all mnodes.
 * This flag is set dynamically by the platform.
 */

/*
 * Size up the per page size free list counters based on physmax
 * of each node and max_mem_nodes.
 *
 * If interleaved_mnodes is set we need to find the first mnode that
 * exists. hpm_counters for the first mnode will then be shared by
 * all other mnodes. If interleaved_mnodes is not set, just set
 * first=mnode each time. That means there will be no sharing.
 */
	int	firstmn;	/* first mnode that exists */
/*
 * We need to determine how many page colors there are for each
 * page size in order to allocate memory for any color specific
 * counters.
 */

/*
 * determine size needed for page counter arrays with
 * base aligned to large page size.
 */

/* add in space for hpm_color_current */

/* add in space for hpm_counters */

/*
 * Round up to always allocate on pointer sized
 * boundaries.
 */

/* add in space for page_ctrs_cands and pcc_color_free */

/* size for page list counts */

/*
 * add some slop for roundups. page_ctrs_alloc will roundup the start
 * address of the counters to ecache_alignsize boundary for every
 * memory node.
 */
	int	firstmn;	/* first mnode that exists */
/*
 * We need to determine how many page colors there are for each
 * page size in order to allocate memory for any color specific
 * counters.
 */

/* page_ctrs_cands and pcc_color_free array */

/* initialize page list counts */

/*
 * the page_counters base has to be aligned to the
 * page count of page size code r otherwise the counts
 * will cross large page boundaries.
 */

/* base needs to be aligned - lower to aligned value */

/* hpm_counters may be shared by all mnodes */

/*
 * Verify that PNUM_TO_IDX and IDX_TO_PNUM
 * satisfy the identity requirement.
 * We should be able to go from one to the other
 * and get consistent values.
 */

/*
 * Roundup the start address of the page_counters to
 * cache aligned boundary for every memory node.
 * page_ctrs_sz() has added some slop for these roundups.
 */

/* Initialize other page counter specific data structures. */

/*
 * Functions to adjust region counters for each size free list.
 * Caller is responsible for acquiring the ctr_mutex lock if necessary and
 * thus can be called during startup without locks.
 */

/* no counter update needed for largest page size */

/*
 * Increment the count of free pages for the current
 * region. Continue looping up in region size incrementing
 * count if the preceding region is full.
 */

/* no counter update needed for largest page size */

/*
 * Decrement the count of free pages for the current
 * region. Continue looping up in region size decrementing
 * count if the preceding region was full.
 */

/*
 * Adjust page counters following a memory attach, since typically the
 * size of the array needs to change, and the PFN to counter index
 * mapping needs to change.
 *
 * It is possible this mnode did not exist at startup. In that case
 * allocate pcc_info_t and pcc_color_free arrays. Also, allow for nranges
 * to change (a theoretical possibility on x86), which means pcc_color_free
 * arrays must be extended.
 */

/* prepare to free non-null pointers on the way out */

/*
 * We need to determine how many page colors there are for each
 * page size in order to allocate memory for any color specific
 * counters.
 */

/*
 * Preallocate all of the new hpm_counters arrays as we can't
 * hold the page_ctrs_rwlock as a writer and allocate memory.
 * If we can't allocate all of the arrays, undo our work so far
 * and return failure.
 */

/*
 * Preallocate all of the new color current arrays as we can't
 * hold the page_ctrs_rwlock as a writer and allocate memory.
 * If we can't allocate all of the arrays, undo our work so far
 * and return failure.
 */

/*
 * Preallocate all of the new pcc_info_t arrays as we can't
 * hold the page_ctrs_rwlock as a writer and allocate memory.
 * If we can't allocate all of the arrays, undo our work so far
 * and return failure.
 */

/*
 * Grab the write lock to prevent others from walking these arrays
 * while we are modifying them.
 */

/*
 * For interleaved mnodes, find the first mnode
 * with valid page counters since the current
 * mnode may have just been added and not have
 * page counters yet.
 */

/*
 * Map the intersection of the old and new
 * counters into the new array.
 */

/* update shared hpm_counters in other mnodes */

/*
 * for now, just reset on these events as it's probably
 * not worthwhile to try and optimize this.
 */

/* cache info for freeing out of the critical path */

/*
 * Verify that PNUM_TO_IDX and IDX_TO_PNUM
 * satisfy the identity requirement.
 * We should be able to go from one to the other
 * and get consistent values.
 */

/* pcc_info_t and pcc_color_free */

/* preserve old pcc_color_free values, if any */

/*
 * when/if x86 does DR, must account for
 * possible change in range index when
 * preserving pcc_info.
 */

/*
 * Now that we have dropped the write lock, it is safe to free all
 * of the memory we have cached above.
 */
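/*
 * Illustrative sketch, not the original page_ctr_add: the "carry" loop
 * described above. Incrementing the free count for the enclosing region
 * continues into the next larger region size only when the region just
 * became full. The array shapes and helper names here are assumptions.
 */
#include <stddef.h>
#include <stdint.h>

#define	SKETCH_MMU_PAGE_SIZES	3

extern char *sketch_counters[SKETCH_MMU_PAGE_SIZES];	/* per-szc counters */
extern size_t sketch_idx(int r, uint64_t pfn);		/* pfn -> index */
extern int sketch_full_region_cnt(int r);		/* FULL_REGION_CNT */

static void
sketch_page_ctr_add(uint64_t pfn)
{
	int r;

	/* no counter update needed for the largest page size */
	for (r = 1; r < SKETCH_MMU_PAGE_SIZES; r++) {
		size_t idx = sketch_idx(r, pfn);

		/* stop unless this region just became completely full */
		if (++sketch_counters[r][idx] != sketch_full_region_cnt(r))
			break;
	}
}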
/*
 * We come through here to free memory when pre-alloc fails, and also to
 * free old pointers which were recorded while locked.
 */

/*
 * Cleanup the hpm_counters field in the page counters
 * arrays.
 */

/*
 * Get the page counters write lock while we are
 * setting the page hpm_counters field to NULL
 * for non-existent mnodes.
 */

/*
 * confirm pp is a large page corresponding to szc
 */

/*
 * add pp to the specified page list. Defaults to head of the page list
 * unless PG_LIST_TAIL is specified.
 * Large pages should be freed via page_list_add_pages().
 */

/*
 * Don't need to lock the freelist first here
 * because the page isn't on the freelist yet.
 * This means p_szc can't change on us.
 */

/*
 * PG_LIST_ISINIT is set during system startup (i.e. single
 * threaded), add a page to the free list and add to the
 * free region counters w/o any locking
 */

/* inline version of page_add() */

/*
 * Add counters before releasing pcm mutex to avoid a race with
 * page_freelist_coalesce and page_freelist_split.
 */

/*
 * It is up to the caller to unlock the page!
 */

/*
 * This routine is only used by kcage_init during system startup.
 * It adjusts the page lists without the overhead of taking locks
 * and updating counters.
 */

/*
 * If this is a large page on the freelist then
 * break it up into smaller pages.
 */

/*
 * Get the list the page is currently on.
 */

/*
 * Delete page from current list.
 */
	*ppp = NULL;		/* page list is gone */
/*
 * Decrement page counters
 */

/*
 * Set no reloc for cage initted pages.
 */

/*
 * Insert page on new list.
 */

/*
 * Increment page counters
 */

/*
 * Update cage freemem counter
 */
	panic("page_list_noreloc_startup: should be here only for sparc");
/*
 * During boot, need to demote a large page to base
 * pagesize pages for seg_kmem for use in boot_alloc()
 */

/*
 * Take a particular page off of whatever freelist the page
 * is currently on.
 *
 * NOTE: Only used for PAGESIZE pages.
 */

/*
 * The p_szc field can only be changed by page_promote()
 * and page_demote(). Only free pages can be promoted and
 * demoted and the free list MUST be locked during these
 * operations. So to prevent a race in page_list_sub()
 * between computing which bin of the freelist lock to
 * grab and actually grabbing the lock we check again that
 * the bin we locked is still the correct one. Notice that
 * the p_szc field could have actually changed on us but
 * if the bin happens to still be the same we are safe.
 */

/*
 * Note that we locked the freelist. This prevents the page
 * from being demoted; the p_szc will not change until we
 * drop pcm mutex.
 */

/*
 * Subtract counters before releasing pcm mutex
 * to avoid race with page_freelist_coalesce.
 */

/*
 * Large pages on the cache list are not supported.
 */
	panic("page_list_sub: large page on cachelist");
/*
 * Somebody wants this particular page which is part
 * of a large page. In this case we just demote the page
 * if it's on the freelist.
 */

/*
 * We have to drop pcm before locking the entire freelist.
 * Once we have re-locked the freelist check to make sure
 * the page hasn't already been demoted or completely
 * freed.
 */

/*
 * Large page is on freelist.
 */

/*
 * Subtract counters before releasing pcm mutex
 * to avoid race with page_freelist_coalesce.
 */

/*
 * See comment in page_list_sub().
 */

/*
 * If we're called with a page larger than szc or it got
 * promoted above szc before we locked the freelist then
 * drop pcm and re-lock entire freelist. If page still larger
 * than szc then demote it.
 */

/*
 * Add the page to the front of a linked list of pages
 * using the p_next & p_prev pointers for the list.
 * The caller is responsible for protecting the list pointers.
 */

/*
 * Remove this page from a linked list of pages
 * using the p_next & p_prev pointers for the list.
 *
 * The caller is responsible for protecting the list pointers.
 */
	*ppp = NULL;		/* page list is gone */
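/*
 * Illustrative sketch, not the original page_add/page_sub: the doubly
 * linked p_next/p_prev list manipulation described above, including the
 * "*ppp = NULL" case when the last page leaves the list. A circular list
 * headed by *ppp is assumed for illustration.
 */
#include <stddef.h>

typedef struct sketch_page {
	struct sketch_page *p_next;
	struct sketch_page *p_prev;
} sketch_page_t;

static void
sketch_page_add(sketch_page_t **ppp, sketch_page_t *pp)
{
	if (*ppp == NULL) {
		pp->p_next = pp->p_prev = pp;	/* first page on the list */
	} else {
		pp->p_next = *ppp;
		pp->p_prev = (*ppp)->p_prev;
		(*ppp)->p_prev->p_next = pp;
		(*ppp)->p_prev = pp;
	}
	*ppp = pp;				/* new front of the list */
}

static void
sketch_page_sub(sketch_page_t **ppp, sketch_page_t *pp)
{
	if (*ppp == pp)
		*ppp = pp->p_next;		/* advance the head */
	if (*ppp == pp)
		*ppp = NULL;			/* page list is gone */
	pp->p_prev->p_next = pp->p_next;
	pp->p_next->p_prev = pp->p_prev;
	pp->p_next = pp->p_prev = pp;		/* unlink the page */
}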
/*
 * Routine fsflush uses to gradually coalesce the free list into larger pages.
 */

/*
 * Create a single larger page (of szc new_szc) from smaller contiguous pages
 * for the given mnode starting at pfnum. Pages involved are on the freelist
 * before the call and may be returned to the caller if requested, otherwise
 * they will be placed back on the freelist.
 * If flags is PC_ALLOC, then the large page will be returned to the user in
 * a state which is consistent with a page being taken off the freelist. If
 * we failed to lock the new large page, then we will return NULL to the
 * caller and put the large page on the freelist instead.
 * If flags is PC_FREE, then the large page will be placed on the freelist,
 * and NULL will be returned.
 * The caller is responsible for locking the freelist as well as any other
 * accounting which needs to be done for a returned page.
 *
 * RFE: For performance pass in pp instead of pfnum so
 *	we can avoid excessive calls to page_numtopp_nolock().
 *	This would depend on an assumption that all contiguous
 *	pages are in the same memseg so we can just add/dec
 *	our pp.
 *
 * There is a potential but rare deadlock situation
 * for page promotion and demotion operations. The problem
 * is there are two paths into the freelist manager and
 * they have different lock orders:
 *
 * page_create()
 *	locks the freelist, then page_lock(EXCL)s the page
 *
 * page_free() and page_reclaim()
 *	caller grabs page_lock(EXCL), then locks the freelist
 *
 * What prevents a thread in page_create() from deadlocking
 * with a thread freeing or reclaiming the same page is the
 * page_trylock() in page_get_freelist(). If the trylock fails
 * it skips the page.
 *
 * The lock ordering for promotion and demotion is the same as
 * for page_create(). Since the same deadlock could occur during
 * page promotion and freeing or reclaiming of a page on the
 * cache list we might have to fail the operation and undo what
 * we have done so far. Again this is rare.
 */

/*
 * Walk each page struct removing it from the freelist,
 * and linking it to all the other pages removed.
 * Once all pages are off the freelist,
 * walk the list, modifying p_szc to new_szc and whatever
 * other info needs to be done to create a large free page.
 *
 * According to the flags, either return the page or put it
 * on the freelist.
 */

/* don't return page of the wrong mtype */

/*
 * Loop through smaller pages to confirm that all pages
 * give the same result for PP_ISNORELOC().
 * We can check this reliably here as the protocol for setting
 * P_NORELOC requires pages to be taken off the free list first.
 */

/* Loop around coalescing the smaller pages into a big page. */

/*
 * Remove from the freelist.
 */

/*
 * Since this page comes from the
 * cachelist, we must destroy the
 * vnode association.
 */

/*
 * We need to be careful not to deadlock
 * with another thread in page_lookup().
 * The page_lookup() thread could be holding
 * the same phm that we need if the two
 * pages happen to hash to the same phm lock.
 * At this point we have locked the entire
 * freelist and page_lookup() could be trying
 * to grab a freelist lock.
 */

/*
 * Concatenate the smaller page(s) onto
 * the large page list.
 */

/*
 * return the page to the user if requested
 * in the properly locked state.
 */

/*
 * Otherwise place the new large page on the freelist
 */

/*
 * A thread must have still been freeing or
 * reclaiming the page on the cachelist.
 * To prevent a deadlock undo what we have
 * done so far and return failure. This
 * situation can only happen while promoting
 * PAGESIZE pages.
 */

/*
 * Break up a large page into smaller size pages.
 * Pages involved are on the freelist before the call and may
 * be returned to the caller if requested, otherwise they will
 * be placed back on the freelist.
 */
/*
 * The caller is responsible for locking the freelist as well as any other
 * accounting which needs to be done for a returned page.
 * If flags is not PC_ALLOC, the color argument is ignored, and thus
 * technically, any value may be passed in but PC_NO_COLOR is the standard
 * which should be followed for clarity's sake.
 * Returns a page whose pfn is < pfnmax
 */

/*
 * Number of PAGESIZE pages for smaller new_szc
 * page.
 */

/*
 * We either break it up into PAGESIZE pages or larger.
 */
	if (npgs == 1) {	/* PAGESIZE case */

/*
 * Break down into smaller lists of pages.
 */

/*
 * Check whether all the pages in this list
 * fit the request criteria.
 */

/*
 * Coalesce free pages into a page of the given szc and color if possible.
 * Return the pointer to the page created, otherwise, return NULL.
 *
 * If pfnhi is non-zero, search for large page with pfn range less than pfnhi.
 */
	int	r = szc;	/* region size */

/* Prevent page_counters dynamic memory from being freed */

/* get pfn range for mtype */

/* use lower limit if given */

/* round to szcpgcnt boundaries */

/* set lo to the closest pfn of the right color */

/* calculate the number of page candidates and initial search index */

/* invalid color, get the closest correct pfn */
	nhi = 0;	/* search kcage ranges */
/*
 * Find lowest intersection of kcage ranges and mnode.
 * MTYPE_NORELOC means look in the cage, otherwise outside.
 */

/* jump to the next page in the range */

/*
 * RFE: For performance maybe we can do something less
 *	brutal than locking the entire freelist. So far
 *	this doesn't seem to be a performance problem?
 */

/*
 * No point looking for another page if we've
 * already tried all of the ones that
 * page_ctr_cands indicated. Stash off where we left
 * off.
 * Note: this is not exact since we don't hold the
 * page_freelist_locks before we initially get the
 * value of cands for performance reasons, but should
 * be a decent approximation.
 */
	nhi = 0;	/* search kcage ranges */
/*
 * For the given mnode, promote as many small pages to large pages as possible.
 * mnode can be -1, which means do them all
 */

/*
 * Lock the entire freelist and coalesce what we can.
 *
 * Always promote to the largest page possible
 * first to reduce the number of page promotions.
 */

/* shared hpm_counters covers all mnodes, so we quit */

/*
 * This is where all policies for moving pages around
 * to different page size free lists are implemented.
 * Returns 1 on success, 0 on failure.
 *
 * So far these are the priorities for this algorithm in descending
 * order:
 *
 *	1) When servicing a request try to do so with a free page
 *	   from next size up. Helps defer fragmentation as long
 *	   as possible.
 *
 *	2) Page coalesce on demand. Only when a freelist
 *	   larger than PAGESIZE is empty and step 1
 *	   will not work since all larger size lists are
 *	   also empty.
 *
 * If pfnhi is non-zero, search for large page with pfn range less than pfnhi.
 */

/*
 * First try to break up a larger page to fill current size freelist.
 * If page found then demote it.
 */

/*
 * If pfnhi is not PFNNULL, look for large page below
 * pfnhi. PFNNULL signifies no pfn requirement.
 */

/* loop through next size bins */

/* we are done with this page size - check next */

/* we have already checked next size bins */

/*
 * Helper routine used only by the freelist code to lock
 * a page. If the page is a large page then it succeeds in
 * locking all the constituent pages or none at all.
 * Returns 1 on success, 0 on failure.
 */

/*
 * Fail if can't lock first or only page.
 */

/*
 * On failure unlock what we have locked so far.
 */

/*
 * We want to avoid attempting to capture these
 * pages as the pcm mutex may be held which could
 * lead to a recursive mutex panic.
 */

/*
 * init context for walking page lists
 * Called when a page of the given szc is unavailable. Sets markers
 * for the beginning of the search to detect when search has
 * completed a full cycle. Sets flags for splitting larger pages
 * and coalescing smaller pages. Page walking proceeds until a page
 * of the desired equivalent color is found.
 */

/*
 * if vac aliasing is possible make sure lower order color
 * bits are never ignored
 */

/*
 * calculate the number of non-equivalent colors and
 * color equivalency mask
 */

/*
 * this is a heterogeneous machine with different CPUs
 * having different size e$ (not supported for ni2/rock)
 */

/* we can split pages in the freelist, but not the cachelist */

/* set next szc color masks and number of free list bins */

/*
 * set mark to flag where next split should occur
 */

/*
 * large pages all have the same vac color
 * so by now we should be done with next
 * size page splitting process
 */

/*
 * check if next page size bin is the
 * same as the next page size bin for ...
 */
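/*
 * Illustrative sketch, not the original page_freelist_fill: the two
 * priorities described above. First try to demote ("split") a free page
 * from the next size up; only if that fails, coalesce smaller free pages
 * on demand. All helper names here are assumptions.
 */
typedef struct sketch_fpage sketch_fpage_t;

extern sketch_fpage_t *sketch_split_next_size_up(int szc);
extern sketch_fpage_t *sketch_coalesce_smaller(int szc);

static sketch_fpage_t *
sketch_page_freelist_fill(int szc)
{
	sketch_fpage_t *pp;

	/* 1) defer fragmentation: use a free page from the next size up */
	if ((pp = sketch_split_next_size_up(szc)) != NULL)
		return (pp);

	/* 2) coalesce on demand, once larger size lists are empty too */
	return (sketch_coalesce_smaller(szc));
}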
	if (mtype < 0) {	/* mnode does not have memory in mtype range */

/*
 * Only hold one freelist lock at a time, that way we
 * can start anywhere and not have to worry about lock
 * ordering.
 */

/*
 * These were set before the page
 * was put on the free list,
 * they must still be set.
 */

/*
 * Walk down the hash chain.
 * 8k pages are linked on p_next
 * and p_prev fields. Large pages
 * are a contiguous group of
 * constituent pages linked together
 * on their p_next and p_prev fields.
 * The large pages are linked together
 * on the hash chain using the p_vpnext and
 * p_vpprev fields of the base constituent
 * page of each large page.
 */
	panic("free page is not. pp %p", (void *)pp);
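/*
 * Illustrative sketch, not the original list walk: traversing the chain
 * described above, where large pages are chained through the p_vpnext
 * field of each large page's base constituent page. The page layout and
 * the circularity of the chain are assumptions for illustration.
 */
#include <stddef.h>

typedef struct sketch_lpage {
	struct sketch_lpage *p_next;	/* constituent page links */
	struct sketch_lpage *p_vpnext;	/* links base pages of large pages */
	unsigned char p_szc;		/* page size code */
} sketch_lpage_t;

/* visit the base constituent page of every large page on a chain */
static void
sketch_walk_large_chain(sketch_lpage_t *first,
    void (*visit)(sketch_lpage_t *))
{
	sketch_lpage_t *pp = first;

	if (pp == NULL)
		return;
	do {
		visit(pp);		/* pp is a base constituent page */
		pp = pp->p_vpnext;
	} while (pp != first);		/* chain assumed circular */
}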
/* calculate the next bin with equivalent color */

/*
 * color bins are all empty if color match. Try and
 * satisfy the request by breaking up or coalescing
 * pages from a different size freelist of the correct
 * color that satisfies the ORIGINAL color requested.
 * If that fails then try pages of the same size but
 * different colors assuming we are not called with
 * PG_MATCH_COLOR.
 */

/* if allowed, cycle through additional mtypes */

/*
 * Returns the count of free pages for 'pp' with size code 'szc'.
 * Note: This function does not return an exact value as the page freelist
 * locks are not held and thus the values in the page_counters may be
 * changing as we walk through the data.
 */

/* Make sure pagenum passed in is aligned properly */

/* Prevent page_counters dynamic memory from being freed */

/* Check for completely full region */

/*
 * If cnt here is full, that means we have already
 * accounted for these pages earlier.
 */

/*
 * Called from page_geti_contig_pages to exclusively lock constituent pages
 * starting from 'spp' for page size code 'szc'.
 *
 * If 'ptcpthreshold' is set, the number of free pages needed in the 'szc'
 * region needs to be greater than or equal to the threshold.
 */

/*
 * check if there are sufficient free pages available before attempting
 * to trylock. Count is approximate as page counters can change.
 */

/* attempt to trylock if there are sufficient already free pages */
	for (i = 0; i < pgcnt; i++) {
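/*
 * Illustrative sketch, not the original code: the ptcpthreshold gate
 * described above. Before trylocking all constituent pages, require that
 * the approximate count of free pages in the szc region meet a percentage
 * threshold. Both helper names and the percentage interpretation are
 * assumptions for illustration.
 */
typedef unsigned long sketch_pgcnt_t;

extern sketch_pgcnt_t sketch_page_freecnt(unsigned int szc);
extern sketch_pgcnt_t sketch_pagecnt(unsigned int szc);

static int
sketch_worth_trylocking(unsigned int szc, unsigned int ptcpthreshold_pct)
{
	sketch_pgcnt_t pgcnt = sketch_pagecnt(szc);
	sketch_pgcnt_t threshold = (pgcnt * ptcpthreshold_pct) / 100;

	/* count is approximate; the page counters can change under us */
	return (sketch_page_freecnt(szc) >= threshold);
}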
/*
 * Claim large page pointed to by 'pp'. 'pp' is the starting set
 * of 'szc' constituent pages that had been locked exclusively previously.
 * Will attempt to relocate constituent pages in use.
 */

/*
 * If this is a PG_FREE_LIST page then its
 * size code can change underneath us due to
 * page promotion or demotion. As an optimization
 * use page_list_sub_pages() instead of
 * page_list_sub().
 */
	for (i = 0; i < npgs; i++, pp++) {
/*
 * page_create_wait freemem accounting done by caller of
 * page_get_freelist and not necessary to call it prior to
 * calling page_get_replacement_page.
 *
 * page_get_replacement_page can call page_get_contig_pages
 * to acquire a large page (szc > 0); the replacement must be
 * smaller than the contig page size to avoid looping or
 * szc == 0 and PGI_PGCPSZC0 is set.
 */

/*
 * If replacement is NULL or do_page_relocate fails, fail
 * the operation.
 */

/*
 * Unlock un-processed target list
 */

/*
 * Free the processed target list.
 */
	ASSERT(hpp = pp);	/* That's right, it's an assignment */

/*
 * Trim kernel cage from pfnlo-pfnhi and store result in lo-hi. Return code
 * of 0 means nothing left after trim.
 */

/* lower part of this mseg inside kernel cage */

/* kernel cage may have transitioned past mseg */

/* else entire mseg in the cage */

/* upper part of this mseg inside kernel cage */

/* kernel cage may have transitioned past mseg */

/* entire mseg outside of kernel cage */

/*
 * called from page_get_contig_pages to search 'pfnlo' thru 'pfnhi' to claim a
 * page with size code 'szc'. Claiming such a page requires acquiring
 * exclusive locks on all constituent pages (page_trylock_contig_pages),
 * relocating pages in use and concatenating these constituent pages into a
 * large page.
 *
 * The page lists do not have such a large page and page_freelist_split has
 * already failed to demote larger pages and/or coalesce smaller free pages.
 *
 * 'flags' may specify PG_COLOR_MATCH which would limit the search of large
 * pages with the same color as 'bin'.
 *
 * 'pfnflag' specifies the subset of the pfn range to search.
 */

/* LINTED : set but not used in function */

/* clear "non-significant" color bits */

/*
 * trim the pfn range to search based on pfnflag. pfnflag is set
 * when there have been previous page_get_contig_page failures to
 * limit the search.
 *
 * The high bit in pfnflag specifies the number of 'slots' in the
 * pfn range and the remainder of pfnflag specifies which slot.
 * For example, a value of 1010b would mean the second slot of
 * the pfn range that has been divided into 8 slots.
 * (A sketch of this slot arithmetic follows below.)
 */

/* skip if 'slotid' slot is empty */

/*
 * This routine can be called recursively so we shouldn't
 * acquire a reader lock if a write request is pending. This
 * could lead to a deadlock with the DR thread.
 *
 * Returning NULL informs the caller that we could not get
 * a contig page with the required characteristics.
 */

/*
 * loop through memsegs to look for contig page candidates
 */

/*
 * trim off kernel cage pages from pfn range and check for
 * a trimmed pfn range returned that does not span the
 * desired large page size.
 */

/* round to szcpgcnt boundaries */

/*
 * set lo to point to the pfn for the desired bin. Large
 * page sizes may only have a single page color
 */

/* set lo to point at appropriate color */

/* mseg cannot satisfy color request */

/* randomly choose a point between lo and hi to begin search */

/* pages unlocked by page_claim on failure */

/* start from the beginning */

/*
 * controlling routine that searches through physical memory in an attempt to
 * claim a large page based on the input parameters when one could not be found
 * on the page free lists.
 *
 * calls page_geti_contig_pages with an initial pfn range from the mnode
 * and mtype. page_geti_contig_pages will trim off the parts of the pfn range
 * that overlaps with the kernel cage or does not match the requested page
 * color if PG_MATCH_COLOR is set. Since this search is very expensive,
 * page_geti_contig_pages may further limit the search range based on
 * previous failure counts (pgcpfailcnt[]).
 *
 * for PGI_PGCPSZC0 requests, page_get_contig_pages will relocate a base
 * pagesize page that satisfies mtype.
 */

/* no allocations from cage */
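/*
 * Illustrative sketch, not the original trimming code: the pfnflag "slot"
 * arithmetic described earlier. The highest set bit of pfnflag gives the
 * number of slots the pfn range is divided into; the remaining bits select
 * which slot to search, so 1010b selects slot 2 of a range divided into 8
 * slots. The helper and its rounding behavior are assumptions.
 */
#include <stdint.h>

typedef uint64_t sketch_pfn_t;

static void
sketch_trim_to_slot(uint64_t pfnflag, sketch_pfn_t *lo, sketch_pfn_t *hi)
{
	uint64_t nslots, slotid, slotlen;

	if (pfnflag == 0)
		return;			/* search the whole range */

	/* isolate the high bit: it encodes the number of slots */
	for (nslots = 1; (nslots << 1) <= pfnflag; nslots <<= 1)
		;
	slotid = pfnflag - nslots;	/* remaining bits pick the slot */

	slotlen = (*hi - *lo + 1) / nslots;
	*lo = *lo + slotid * slotlen;
	*hi = *lo + slotlen - 1;
}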
	if (mtype < 0) {	/* mnode does not have memory in mtype range */

/* do not limit search and ignore color if hi pri */

/* remove color match to improve chances */

/* get pfn range based on mnode and mtype */

/* double the search size */

/*
 * Return 0 if the likelihood is small otherwise return 1.
 *
 * For now, be conservative and check only 1g pages and return 0
 * if there had been previous coalescing failures and the szc pages
 * needed to satisfy request would exhaust most of freemem.
 */

/*
 * Find the `best' page on the freelist for this (vp,off) (as,vaddr) pair.
 *
 * Does its own locking and accounting.
 * If PG_MATCH_COLOR is set, then NULL will be returned if there are no
 * pages of the proper color even if there are pages of a different color.
 *
 * Finds a page, removes it, THEN locks it.
 */

/*
 * If we aren't passed a specific lgroup, or passed a freed lgrp
 * assume we wish to allocate near to the current thread's home.
 */

/*
 * Set a "reserve" of kcage_throttlefree pages for
 * PG_PANIC and cageout thread allocations.
 *
 * Everybody else has to serialize in
 * page_create_get_something() to get a cage page, so
 * that we don't deadlock cageout!
 */

/*
 * Convert size to page size code.
 */
	panic("page_get_freelist: illegal page size request");

/*
 * Try to get a local page first, but try remote if we can't
 * get a page of the right color.
 */

/*
 * for non-SZC0 PAGESIZE requests, check cachelist before checking
 * remote free lists. Caller expected to call page_get_cachelist which
 * will check local cache lists and remote free lists.
 */

/*
 * Try to get a non-local freelist page.
 */

/*
 * when the cage is off chances are page_get_contig_pages() will fail
 * to lock a large page chunk therefore when the cage is off it's not
 * called by default. this can be changed via /etc/system.
 *
 * page_get_contig_pages() also called to acquire a base pagesize page
 * for page_create_get_something().
 */

/*
 * Find the `best' page on the cachelist for this (vp,off) (as,vaddr) pair.
 *
 * If PG_MATCH_COLOR is set, then NULL will be returned if there are no
 * pages of the proper color even if there are pages of a different color.
 * Otherwise, scan the bins for ones with pages. For each bin with pages,
 * try to lock one of them. If no page can be locked, try the
 * next bin. Return NULL if a page can not be found and locked.
 *
 * Finds a page, tries to lock it, then removes it.
 */

/*
 * If we aren't passed a specific lgroup, or passed a freed lgrp
 * assume we wish to allocate near to the current thread's home.
 */

/*
 * Reserve kcage_throttlefree pages for critical kernel
 * threads.
 *
 * Everybody else has to go to page_create_get_something()
 * to get a cage page, so we don't deadlock cageout.
 */

/*
 * Try local cachelists first
 */

/*
 * This is our only chance to allocate remote pages for PAGESIZE
 * requests.
 */
	if (mtype < 0) {	/* mnode does not have memory in mtype range */

/*
 * Only hold one cachelist lock at a time, that way we
 * can start anywhere and not have to worry about lock
 * ordering.
 */

/*
 * We have searched the complete list!
 * And all of them (might only be one)
 * are locked. This can happen since
 * these pages can also be found via
 * the hash list. When found via the
 * hash list, they are locked first,
 * then removed. We give up to let the
 * other thread run.
 */

/*
 * Found and locked a page.
 * Subtract counters before releasing pcm mutex
 * to avoid a race with page_freelist_coalesce
 * and page_freelist_split.
 */

/* calculate the next bin with equivalent color */
#else	/* REPL_PAGE_STATS */
#endif	/* REPL_PAGE_STATS */

/*
 * The freemem accounting must be done by the caller.
 * First we try to get a replacement page of the same size as like_pp,
 * if that is not possible, then we just get a set of discontiguous
 * PAGESIZE pages.
 */

/*
 * Now we reset like_pp to the base page_t.
 * That way, we won't walk past the end of this 'szc' page.
 */

/*
 * Kernel pages must always be replaced with the same size
 * pages, since we cannot properly handle demotion of kernel
 * pages.
 */

/*
 * If an lgroup was specified, try to get the
 * page from that lgroup.
 * NOTE: Must be careful with code below because
 *	 lgroup may disappear and reappear since there
 *	 is no locking for lgroup here.
 */

/*
 * Keep local variable for lgroup separate
 * from lgroup argument since this code should
 * only be exercised when lgroup argument
 * exists.
 */

/* Try the lgroup's freelists first */

/*
 * Now try its cachelists if this is a
 * small page. Don't need to do it for
 * larger ones since page_freelist_coalesce()
 * already failed.
 */

/* Now try its cachelists */

/* Done looking in this lgroup. Bail out. */

/*
 * No lgroup was specified (or lgroup was removed by
 * DR), so just try to get the page as close to
 * like_pp's mnode as possible.
 * First try the local freelist...
 */

/*
 * ...then the local cachelist. Don't need to do it for
 * larger pages because page_freelist_coalesce() already
 * failed.
 */

/* Now try remote freelists */

/* Now try remote cachelists */

/*
 * Break out of while loop under the following cases:
 * - If we successfully got a page.
 * - If pgrflags specified only returning a specific
 *   page size and we could not find that page size.
 * - If we could not satisfy the request with PAGESIZE
 *   or larger pages.
 */

/* try to find contig page */

/*
 * The correct thing to do here is try the next
 * page size down using szc--. Due to a bug
 * with the processing of HAT_RELOAD_SHARE
 * where the sfmmu_ttecnt arrays of all
 * hats sharing an ISM segment don't get updated,
 * using intermediate size pages for relocation
 * can lead to continuous page faults.
 */

/*
 * We were unable to allocate the necessary number
 * of pages.
 * We need to free up any pl.
 */

/*
 * demote a free large page to its constituent pages
 */

/*
 * Factor in colorequiv to check additional 'equivalent' bins.
 */
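/*
 * Illustrative sketch, not from the original source: resetting a constituent
 * page to its large page's base page_t, as in the "reset like_pp to the base
 * page_t" step above, so a replacement walk never runs past the end of the
 * 'szc' page. Large pages are assumed pfn-aligned to their own page count,
 * and a pfn-indexed page array is assumed; both are illustration-only.
 */
#include <stdint.h>

typedef uint64_t sketch_pfn_t;
typedef unsigned long sketch_rpgcnt_t;

typedef struct sketch_rpage {
	sketch_pfn_t	p_pagenum;	/* physical frame number */
	unsigned char	p_szc;		/* page size code */
} sketch_rpage_t;

extern sketch_rpgcnt_t sketch_pagecnt_for_szc(unsigned int szc);

static sketch_rpage_t *
sketch_large_page_base(sketch_rpage_t *pages, sketch_rpage_t *pp)
{
	sketch_rpgcnt_t pgcnt = sketch_pagecnt_for_szc(pp->p_szc);
	sketch_pfn_t base_pfn = pp->p_pagenum & ~((sketch_pfn_t)pgcnt - 1);

	return (&pages[base_pfn]);	/* pages[] assumed indexed by pfn */
}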