mem_cage.c revision ce8eb11a8717b4a57c68fd77ab9f8aac15b16bf2
/* set in kcage_invalidate_page() */
/* set in kcage_expand() */
/* managed by KCAGE_STAT_* macros */
/* set in kcage_cageout */
/* set in kcage_expand */
/* set in kcage_freemem_add() */
/* set in kcage_freemem_sub() */
/* set in kcage_create_throttle */
/* set in kcage_cageout_wakeup */
/* managed by KCAGE_STAT_* macros */

/*
 * No real need for atomics here. For the most part the incs and sets are
 * done by the kernel cage thread. There are a few that are done by any
 * number of other threads. Those cases are noted by comments.
 */

/*
 * Cage expansion happens within a range.
 */

/*
 * The firstfree element is provided so that kmem_alloc can be avoided
 * until that cage has somewhere to go. This is not currently a problem
 * as early kmem_alloc's use BOP_ALLOC instead of page_create_va.
 */

/*
 * Miscellaneous forward references
 */

/*
 * Kernel Memory Cage counters and thresholds.
 */

/* when we use lp for kmem we start the cage at a higher initial value */

/* kstats to export what pages are currently caged */

/*
 * Startup and Dynamic Reconfiguration interfaces.
 * kcage_range_delete_post_mem_del()
 */

/*
 * Called from page_get_contig_pages to get the approximate kcage pfn range
 * for exclusion from search for contiguous pages. This routine is called
 * without kcage_range lock (kcage routines can call page_get_contig_pages
 * through page_relocate) and with the assumption, based on kcage_range_add,
 * that kcage_current_glist always contains a valid pointer.
 */

/*
 * Called from vm_pagelist.c during coalesce to find kernel cage regions
 * within an mnode. Looks for the lowest range between lo and hi.
 *
 * Kernel cage memory is defined between kcage_glist and kcage_current_glist.
 * Non-cage memory is defined between kcage_current_glist and the list end.
 *
 * If incage is set, returns the lowest kcage range. Otherwise returns the
 * lowest non-cage range.
 *
 * Returns zero on success and sets nlo, nhi to the overlapping range.
 * Returns non-zero if no overlapping range is found.
 */
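The lowest-overlap lookup described above can be sketched in isolation. This is a simplified model, not the real routine: the `struct glist` here carries only `base`/`lim`/`next` (the actual `kcage_glist_t` also tracks `curr`, `decr`, and the cage/non-cage split at `kcage_current_glist`), and `lowest_overlap` is a hypothetical name.

```c
#include <stddef.h>

typedef unsigned long pfn_t;

/* Hypothetical, simplified growth-list element. */
struct glist {
	pfn_t base;		/* first pfn in the range */
	pfn_t lim;		/* one past the last pfn */
	struct glist *next;
};

/*
 * Find the lowest range in 'list' overlapping [lo, hi].
 * Returns 0 and narrows *nlo/*nhi to the overlap on success;
 * returns non-zero if no element overlaps, mirroring the
 * contract described in the comment above.
 */
int
lowest_overlap(const struct glist *list, pfn_t lo, pfn_t hi,
    pfn_t *nlo, pfn_t *nhi)
{
	pfn_t best_lo = 0, best_hi = 0;
	int found = 0;

	for (; list != NULL; list = list->next) {
		pfn_t olo = (list->base > lo) ? list->base : lo;
		pfn_t ohi = (list->lim - 1 < hi) ? list->lim - 1 : hi;

		if (olo > ohi)
			continue;	/* no overlap with this element */
		if (!found || olo < best_lo) {
			best_lo = olo;
			best_hi = ohi;
			found = 1;
		}
	}
	if (!found)
		return (1);
	*nlo = best_lo;
	*nhi = best_hi;
	return (0);
}
```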
/*
 * Reader lock protects the list, but kcage_get_pfn
 * running concurrently may advance kcage_current_glist
 * and also update kcage_current_glist->curr. Page
 * coalesce can handle this race condition.
 */

/* find the range limits in this element */

/* return non-zero if no overlapping range found */

/* return overlapping range */

panic("kcage_range_add_internal failed: "
    "ml=%p, ret=0x%x\n", ml, ret);
/*
 * Third arg controls direction of growth: 0: increasing pfns,
 * 1: decreasing pfns.
 */

/*
 * Any overlapping existing ranges are removed by deleting
 * from the new list as we search for the tail.
 */

/*
 * Calls to add and delete must be protected by kcage_range_rwlock.
 */

/*
 * Check if the delete is OK first as a number of elements
 * might be involved and it will be difficult to go
 * back and undo (can't just add the range back in).
 */

/*
 * If there have been no pages allocated from this
 * element, we don't need to check it.
 */

/*
 * If the element does not overlap, it's OK.
 */

/*
 * Overlapping element: Does the range to be deleted
 * overlap the area already used? If so, fail.
 */

/*
 * Calls to add and delete must be protected by kcage_range_rwlock.
 */

/*
 * This routine gets called after a successful Solaris memory
 * delete operation from DR post memory delete routines.
 * No locking is required here as the whole operation is covered
 * by the kcage_range_rwlock writer lock.
 */

/* The delete range overlaps this element. */

/* Delete whole element. */

/* This can never happen. */

/*
 * Remove a section from the middle;
 * need to allocate a new element.
 */

/*
 * Transfer unused range to new.
 * Edit lp in place to preserve
 */

/* Delete part of current block. */

/*
 * If lockit is 1, kcage_get_pfn holds the
 * reader lock for kcage_range_rwlock.
 */

/*
 * Changes to lp->curr can cause race conditions, but
 * they are handled by higher level code (see kcage_next_range).
 */

/*
 * Walk the physical address space of the cage.
 * This routine does not guarantee to return PFNs in the order
 * in which they were allocated to the cage. Instead, it walks
 * each range as they appear on the growth list, returning the
 * PFNs of each range in ascending order.
 *
 * To begin scanning at the lower edge of the cage, reset should be nonzero.
 * To step through the cage, reset should be zero.
 *
 * PFN_INVALID will be returned when the upper end of the cage is
 * reached -- indicating a full scan of the cage has been completed since
 * the previous reset. PFN_INVALID will continue to be returned until
 * kcage_walk_cage is reset.
 */
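The middle-deletion case above (remove a section from the middle of an element, allocating a new element for the upper remainder) can be sketched as follows. The struct and the function name `range_delete_middle` are simplified stand-ins; the real code also unlinks whole-element deletions from the list and holds `kcage_range_rwlock` as a writer.

```c
#include <stdlib.h>

typedef unsigned long pfn_t;

struct glist {
	pfn_t base;		/* first pfn */
	pfn_t lim;		/* one past the last pfn */
	struct glist *next;
};

/*
 * Delete [lo, hi] from element lp. Trimming either edge is done in
 * place; punching a hole in the middle requires a new element for the
 * upper remainder. Returns 0 on success, -1 on allocation failure.
 * Simplified: assumes [lo, hi] lies fully inside the element.
 */
int
range_delete_middle(struct glist *lp, pfn_t lo, pfn_t hi)
{
	struct glist *new;

	if (lp->base == lo) {
		lp->base = hi + 1;	/* trim from the front */
		return (0);
	}
	if (lp->lim == hi + 1) {
		lp->lim = lo;		/* trim from the back */
		return (0);
	}
	/* Remove a section from the middle: allocate a new element. */
	new = malloc(sizeof (*new));
	if (new == NULL)
		return (-1);
	new->base = hi + 1;		/* transfer unused upper range */
	new->lim = lp->lim;
	new->next = lp->next;
	lp->lim = lo;			/* edit lp in place */
	lp->next = new;
	return (0);
}
```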
/*
 * It is possible to receive a PFN_INVALID result on reset if a growth
 * list is not installed or if none of the PFNs in the installed list have
 * been allocated to the cage. In other words, there is no cage.
 *
 * Caller need not hold kcage_range_rwlock while calling this function
 * as the front part of the list is static - pages never come out of
 * the cage.
 *
 * The caller is expected to only be kcage_cageout().
 */

/*
 * In this range the cage grows from the highest
 * address towards the lowest.
 * Arrange to return pfns from curr to lim-1,
 * inclusive, in ascending order.
 */

/*
 * In this range the cage grows from the lowest
 * address towards the highest.
 * Arrange to return pfns from base to curr,
 * inclusive, in ascending order.
 */

if (lp->decr != 0) {
	/* decrementing pfn */
	/* Don't go beyond the static part of the glist. */
} else {
	/* incrementing pfn */
	/* Don't go beyond the static part of the glist. */
}

/*
 * Callback functions to recalc cage thresholds after
 * a kphysm memory add/delete operation.
 */

/* TODO: when should cage refuse memory delete requests? */

/*
 * This is called before a CPR suspend and after a CPR resume. We have to
 * turn off kcage_cageout_ready before a suspend, and turn it back on after a
 * resume.
 */

/*
 * kcage_recalc_preferred_size() increases the initial cage size to improve
 * large page availability when lp for kmem is enabled and kpr is disabled.
 */

/*
 * Kcage_init() builds the cage and initializes the cage thresholds.
 * The size of the cage is determined by the argument preferred_size,
 * or the actual amount of memory, whichever is smaller.
 */

/* increase preferred cage size for lp for kmem */

/* Debug note: initialize this now so early expansions can stat */

/*
 * Initialize cage thresholds and install kphysm callback.
 * If we can't arrange to have the thresholds track with
 * available physical memory, then the cage thresholds may
 * end up over time at levels that adversely affect system
 * performance; so, bail out.
 */
ASSERT(0);
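The two walk cases above (decrementing vs. incrementing growth) both normalize to an ascending pfn window. A minimal sketch, with a simplified `struct grange` whose field names follow the comments (the real element differs):

```c
typedef unsigned long pfn_t;

struct grange {
	pfn_t base;	/* lowest pfn of the element */
	pfn_t lim;	/* one past the highest pfn */
	pfn_t curr;	/* growth cursor */
	int decr;	/* nonzero: cage grows from high toward low */
};

/*
 * Compute the ascending [lo, hi] window of caged pfns in one range:
 * a decrementing range occupies [curr, lim - 1]; an incrementing
 * range occupies [base, curr], per the walk description above.
 */
void
walk_window(const struct grange *r, pfn_t *lo, pfn_t *hi)
{
	if (r->decr != 0) {
		/* cage grows downward: return curr .. lim-1 ascending */
		*lo = r->curr;
		*hi = r->lim - 1;
	} else {
		/* cage grows upward: return base .. curr ascending */
		*lo = r->base;
		*hi = r->curr;
	}
}
```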
/* Catch this in DEBUG kernels. */

/*
 * Limit startup cage size within the range of kcage_minfree
 * and availrmem, inclusively.
 */

/*
 * Construct the cage. PFNs are allocated from the glist. It
 * is assumed that the list has been properly ordered for the
 * platform by the platform code. Typically, this is as simple
 * as calling kcage_range_init(phys_avail, decr), where decr is
 * 1 if the kernel has been loaded into the upper end of physical
 * memory, or 0 if the kernel has been loaded at the low end.
 *
 * Note: it is assumed that we are in the startup flow, so there
 * is no reason to grab the page lock.
 */

/*
 * Set the noreloc state on the page.
 * If the page is free and not already
 * on the noreloc list then move it.
 */

/*
 * Need to go through and find kernel allocated pages
 * and capture them into the Cage. These will primarily
 * be pages gotten through boot_alloc().
 */

/*
 * CB_CL_CPR_POST_KERNEL is the class that executes from cpr_suspend()
 * after the cageout thread is blocked, and executes from cpr_resume()
 * before the cageout thread is restarted. By executing in this class,
 * we are assured that the kernel cage thread won't miss wakeup calls
 * and also CPR's larger kmem_alloc requests will not fail after
 * CPR shuts down the cageout kernel thread.
 */

/*
 * Coalesce pages to improve large page availability. A better fix
 * would be to coalesce pages as they are included in the cage.
 */

/* TODO: any reason to take more care than this with live editing? */

/*
 * kcage_create_throttle()
 * Wakeup cageout thread and throttle waiting for the number of pages
 * requested to become available. For non-critical requests, a
 * timeout is added, since freemem accounting is separate from cage
 * freemem accounting: it's possible for us to get stuck and not make
 * forward progress even though there was sufficient freemem before
 */

/*
 * Obviously, we can't throttle the cageout thread since
 * we depend on it. We also can't throttle the panic thread.
 */
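The startup size limiting described above ("within the range of kcage_minfree and availrmem, inclusively") is a simple clamp. A sketch under that reading, with the threshold names taken from the comments but the function name hypothetical:

```c
typedef unsigned long pgcnt_t;

/*
 * Clamp the preferred startup cage size into
 * [kcage_minfree, availrmem], inclusive. This is a sketch of the
 * limiting step described above, not the actual kcage_init() code.
 */
pgcnt_t
clamp_cage_size(pgcnt_t preferred, pgcnt_t kcage_minfree, pgcnt_t availrmem)
{
	if (preferred < kcage_minfree)
		preferred = kcage_minfree;	/* never below the floor */
	if (preferred > availrmem)
		preferred = availrmem;		/* never above real memory */
	return (preferred);
}
```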
/*
 * Don't throttle threads which are critical for proper
 * vm management if we're above kcage_throttlefree or
 * if freemem is very low.
 */

/*
 * Don't throttle real-time threads if kcage_freemem > kcage_reserve.
 */

/*
 * Cause all other threads (which are assumed to not be
 * critical to cageout) to wait here until their request
 * can be satisfied. Be a little paranoid and wake the
 * kernel cage on each loop through this logic.
 */

/*
 * NOTE: atomics are used just in case we enter
 * mp operation before the cageout thread is ready.
 */

/*
 * return 0 on failure and 1 on success.
 */

/*
 * The szc of a locked page can only change for pages that are
 * non-swapfs (i.e. anonymous memory) file system pages.
 */

for (i = 0; i < npgs; i++, pp++) {
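The throttle exemptions above (cageout thread, panic thread, VM-critical threads, real-time threads) reduce to a policy predicate. A sketch under those rules; all the predicate parameters and the function name are stand-ins for real kernel state, and the freemem comparison is an assumption about what "freemem is very low" means:

```c
#include <stdbool.h>

typedef unsigned long pgcnt_t;

/*
 * Return true if the requesting thread should wait in the throttle
 * loop, per the exemptions described above.
 */
bool
should_throttle(bool is_cageout_thread, bool is_panic_thread,
    bool is_vm_critical, bool is_realtime,
    pgcnt_t kcage_freemem, pgcnt_t kcage_throttlefree,
    pgcnt_t kcage_reserve, pgcnt_t freemem, pgcnt_t minfree)
{
	/* Never throttle the cageout or panic threads: we depend on them. */
	if (is_cageout_thread || is_panic_thread)
		return (false);
	/*
	 * Don't throttle VM-critical threads while we're above
	 * kcage_throttlefree, or when global freemem is already very low.
	 */
	if (is_vm_critical &&
	    (kcage_freemem > kcage_throttlefree || freemem < minfree))
		return (false);
	/* Don't throttle real-time threads above kcage_reserve. */
	if (is_realtime && kcage_freemem > kcage_reserve)
		return (false);
	/* Everyone else waits until the request can be satisfied. */
	return (true);
}
```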
/*
 * Attempt to convert page to a caged page (set the P_NORELOC flag).
 * If successful and the page is free, move the page to the tail of
 * whichever free list it is on.
 *
 * Return values:
 * EBUSY  page already locked, assimilated but not free.
 * ENOMEM page assimilated, but memory too low to relocate. Page not free.
 * EAGAIN page not assimilated. Page not free.
 * ERANGE page assimilated. Page not root.
 * 0      page assimilated. Page free.
 * *nfreedp number of pages freed.
 *
 * NOTE: With error codes ENOMEM, EBUSY, and 0 (zero), there is no way
 * to distinguish between a page that was already a NORELOC page from
 * those newly converted to NORELOC pages by this invocation of
 * kcage_assimilate_page.
 */

/*
 * Need to upgrade the lock on it and set the NORELOC
 * bit. If it is free then remove it from the free
 * list so that the platform free list code can keep
 * NORELOC pages where they should be.
 */

/*
 * Before doing anything, get the exclusive lock.
 * This may fail (eg ISM pages are left shared locked).
 * If the page is free this will leave a hole in the
 * cage. There is no solution yet to this.
 */

/* TODO: we don't really need n any more? */

/*
 * Expand the cage if available cage memory is really low. Calculate
 * the amount required to return kcage_freemem to the level of
 * kcage_lotsfree, or to satisfy throttled requests, whichever is
 * more. It is rare for their sum to create an artificial threshold
 * above kcage_lotsfree, but it is possible.
 *
 * Exit early if expansion amount is equal to or less than zero.
 * (<0 is possible if kcage_freemem rises suddenly.)
 *
 * Exit early when the global page pool (apparently) does not
 * have enough free pages to page_relocate() even a single page.
 */

/*
 * Assimilate more pages from the global page pool into the cage.
 */
n = 0;		/* number of pages PP_SETNORELOC'd */
nf = 0;		/* number of those actually free */

/*
 * Sanity check. Skip this pfn if it is
 */
case 0:		/* assimilated, page is free */
case EBUSY:	/* assimilated, page not free */
case ERANGE:	/* assimilated, page not root */
case ENOMEM:	/* assimilated, but no mem */
case EAGAIN:	/* can't assimilate */
default:	/* catch this with debug kernels */

/*
 * Realign cage edge with the nearest physical address
 * boundary for big pages. This is done to give us a
 * better chance of actually getting usable big pages.
 */

/*
 * Relocate page opp (Original Page Pointer) from cage pool to page rpp
 * (Replacement Page Pointer) in the global pool. Page opp will be freed
 * if relocation is successful, otherwise it is only unlocked.
 * On entry, page opp must be exclusively locked and not free.
 * *nfreedp: number of pages freed.
 */
return (0);
/* success */

/*
 * Based on page_invalidate_pages()
 *
 * Kcage_invalidate_page() uses page_relocate() twice. Both instances
 * of use must be updated to match the new page_relocate() when it
 * becomes available.
 *
 * Return result of kcage_relocate_page or zero if page was directly freed.
 * *nfreedp: number of pages freed.
 */

/*
 * Is this page involved in some I/O? shared?
 * The page_struct_lock need not be acquired to
 * examine these fields since the page has an
 * "exclusive" lock.
 */

/*
 * Unload the mappings and check if mod bit is set.
 */

/*
 * Wait here. Sooner or later, kcage_freemem_sub() will notice
 * that kcage_freemem is less than kcage_desfree. When it does
 * notice, kcage_freemem_sub() will wake us up via call to
 * kcage_cageout_wakeup().
 */

/*
 * Did a complete walk of kernel cage, but didn't free
 * any pages. If only one cpu is online then
 * stop kernel cage walk and try expanding.
 */

/*
 * Do a quick PP_ISNORELOC() and PP_ISFREE test outside
 * of the lock. If one is missed it will be seen next
 * time through.
 */

/*
 * Skip non-caged pages. These pages can exist in the cage
 * because, if a long-term locked page is encountered during
 * cage expansion, the lock prevents the expansion logic from
 * setting the P_NORELOC flag. Hence, non-caged pages can be
 * surrounded by caged pages.
 */

/* catch this with debug kernels */

/* P_NORELOC bit should not have gone away. */

/*
 * In pass {0, 1}, skip page if ref bit is set.
 * In pass {0, 1, 2}, skip page if mod bit is set.
 */

/* On first pass ignore ref'd pages */

/* On pass 2, page_destroy if mod bit is not set */

/*
 * Unload the mappings before
 * checking if mod bit is set.
 */

/*
 * Skip this page if modified.
 */

/*
 * No need to drop the page lock here.
 * Kcage_invalidate_page has done that for us
 * either explicitly or through a page_free.
 */

/*
 * Expand the cage only if available cage memory is really low.
 * This test is done only after a complete scan of the cage.
 * The reason for not checking and expanding more often is to
 * avoid rapid expansion of the cage. Naturally, scanning the
 * cage takes time.
 */
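The pass policy above reduces to a small predicate: on passes 0 and 1 a referenced page is skipped, and on passes 0 through 2 a modified page is skipped, so later passes grow progressively more aggressive. A sketch, with `skip_page` as a hypothetical name:

```c
#include <stdbool.h>

/*
 * Decide whether the cageout scan should skip a page on this pass,
 * per the ref/mod rules above:
 *   pass {0, 1}:    skip if ref bit is set.
 *   pass {0, 1, 2}: skip if mod bit is set.
 */
bool
skip_page(int pass, bool refbit, bool modbit)
{
	if (pass <= 1 && refbit)
		return (true);	/* on early passes ignore ref'd pages */
	if (pass <= 2 && modbit)
		return (true);	/* skip modified pages until later passes */
	return (false);
}
```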
/*
 * So by scanning first, we use that work as a
 * delay loop in between expand decisions.
 */

/*
 * Kcage_expand() will return a non-zero value if it was
 * able to expand the cage -- whether or not the new
 * pages are free and immediately usable. If non-zero,
 * we do another scan of the cage. The pages might be
 * freed during that scan or by the time we get back here.
 * If not, we will attempt another expansion.
 * However, if kcage_expand() returns zero, then it was
 * unable to expand the cage. This is the case when the
 * growth list is exhausted, therefore no work was done
 * and there is no reason to scan the cage again.
 * Note: kernel cage scan is not repeated on single-cpu
 * systems to avoid the kernel cage thread hogging the cpu.
 */

/*
 * If available cage memory is less than abundant
 * and a full scan of the cage has not yet been completed,
 * or a scan has completed and some work was performed,
 * or pages were skipped because of sharing,
 * or we simply have not yet completed two passes,
 * continue scanning.
 */

/*
 * Available cage memory is really low. Time to
 * start expanding the cage. However, the
 * kernel cage thread is not yet ready to
 * do the work. Use *this* thread, which is
 * most likely to be t0, to do the work.
 */

/* else, kernel cage thread is already running */

/*
 * Once per second we wake up all the threads throttled
 * waiting for cage memory, in case we've become stuck
 * and haven't made forward progress expanding the cage.
 */
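The scan/expand control flow described above can be sketched as two small decision functions. This is a deliberate simplification: the real cageout loop also tracks passes, shared-page skips, and abundance thresholds, and the function and parameter names here are hypothetical.

```c
enum next_action { RESCAN, EXPAND, STOP };

/*
 * After a complete cage scan: if pages were freed, scan again (the
 * scan doubles as the delay loop between expand decisions). An
 * unproductive scan -- or any scan on a single-cpu system, to avoid
 * hogging the cpu -- moves on to an expansion attempt.
 */
enum next_action
after_scan(long pages_freed, int ncpus_online)
{
	if (pages_freed > 0 && ncpus_online > 1)
		return (RESCAN);
	return (EXPAND);
}

/*
 * After an expansion attempt: a non-zero return means the cage grew
 * (whether or not the new pages are immediately usable), so rescan;
 * zero means the growth list is exhausted and there is no reason to
 * scan the cage again.
 */
enum next_action
after_expand(int expand_ret)
{
	if (expand_ret != 0)
		return (RESCAN);
	return (STOP);
}
```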