htable.c revision b59c4a48daf5a1863ecac763711b497b2f8321e4
 * Copyright (c) 2004, 2010, Oracle and/or its affiliates. All rights reserved.

 * The variable htable_reserve_amount, rather than HTABLE_RESERVE_AMOUNT,
 * is used in order to facilitate testing of the htable_steal() code.
 * By resetting htable_reserve_amount to a lower value, we can force
 * stealing to occur. The reserve amount is a guess to get us through boot.

 * Used to hand test htable_steal().

 * This variable exists so that we can tune this via /etc/system.
 * Any value works, but a power of two <= mmu.ptes_per_table is best.

 * mutex stuff for access to htable hash

 * A counter to track if we are stealing or reaping htables. When non-zero
 * htable_free() will directly free htables (either to the reserve or kmem)
 * instead of putting them in a hat's htable cache.

 * Track the number of active pagetables, so we can know how many to reap

 * Deal with hypervisor complications.
panic("HYPERVISOR_mmuext_op() failed");
/*LINTED: constant in conditional context*/
panic("HYPERVISOR_mmuext_op() failed");
panic("HYPERVISOR_mmuext_op() failed");
/*LINTED: constant in conditional context*/
panic("HYPERVISOR_mmuext_op() failed");
* Value of "how" should be: * PT_WRITABLE | PT_VALID - regular kpm mapping * PT_VALID - make mapping read-only * returns 0 on success. non-zero for failure. panic(
"HYPERVISOR_mmuext_op() failed");
panic(
"HYPERVISOR_mmuext_op() failed");
panic(
"HYPERVISOR_update_va_mapping() failed");
 * Allocate a memory page for a hardware page table.
 * A wrapper around page_get_physical(), with some extra checks.

 * The first check is to see if there is memory in the system. If we
 * drop to throttlefree, then fail the ptable_alloc() and let the
 * stealing code kick in. Note that we have to do this test here,
 * since the test in page_create_throttle() would let the NOSLEEP
 * allocation go through and deplete the page reserves.
 * The !NOMEMWAIT() lets pageout, fsflush, etc. skip this check.

 * This code makes htable_steal() easier to test. By setting
 * force_steal we force pagetable allocations to fall
 * into the stealing code. Roughly 1 in every "force_steal"
 * page table allocations will fail.
panic("ptable_alloc(): Invalid PFN!!");
 * Free an htable's associated page table page. See the comments
 * for ptable_alloc().

 * need to destroy the page used for the pagetable
panic("ptable_free(): no page for pfn!");
 * Get an exclusive lock, might have to wait for a kmem reader.
panic("failure making kpm r/w pfn=0x%lx", pfn);
 * Put one htable on the reserve list.

 * Take one htable from the reserve.

 * Allocate initial htables and put them on the reserve list

 * Readjust the reserves after a thread finishes using them.

 * Free any excess htables in the reserve list

 * Search the active htables for one to steal. Start at a different hash
 * bucket every time to help spread the pain of stealing

 * Can we rule out reaping?

 * Increment busy so the htable can't disappear. We
 * drop the htable mutex to avoid deadlocks with
 * hat_pageunload() and the hment mutex while we
 * unload and invalidate all PTEs

 * Reacquire htable lock. If we didn't remove all
 * mappings in the table, or another thread added a new
 * mapping behind us, give up on this table.

 * Steal it and unlink the page table.

 * remove from the hash list

 * Break to outer loop to release the
 * higher (ht_parent) pagetable. This
 * spreads out the pain caused by
 * htable_steal().

 * Move hat to the end of the kas list
/* relink at end of hat list */

 * This routine steals htables from user processes. Called by htable_reap
 * (reap=TRUE) or htable_alloc (reap=FALSE). A simplified model of the
 * per-hat loop is sketched below.

 * Limit htable_steal_passes to something reasonable

 * Loop through all user hats. The 1st pass takes cached htables that
 * aren't in use. The later passes steal by removing mappings, too.
/* skip the first hat (kernel) */

 * Skip any hat that is already being stolen from.
 *
 * We skip SHARED hats, as these are dummy
 * hats that host ISM shared page tables.
 *
 * We also skip if HAT_FREEING because hat_pte_unmap()
 * won't zero out the PTE's. That would lead to hitting
 * stale PTEs either here or under hat_unload() when we
 * steal and unload the same page table in competing
 * threads.

 * Mark the HAT as a stealing victim so that it is
 * not freed from under us, e.g. in as_free()

 * Take any htables from the hat's cached "free" list.

 * Don't steal active htables on first pass.

 * do synchronous teardown for the reap case so that
 * we can forget hat; at this time, hat is
 * guaranteed to be around because HAT_VICTIM is set
 * (see htable_free() for similar code)

 * Try to spread the pain of stealing,
 * move victim HAT to the end of the HAT list.

 * Clear the victim flag, hat can go away now (once
 * the hat_mutex is dropped).
/* move on to the next hat */

 * This is invoked from kmem when the system is low on memory. We try
 * to free hments, htables, and ptables to improve the memory situation.

 * Try to reap 5% of the page tables bounded by a maximum of
 * 5% of physmem and a minimum of 10.

 * Note: htable_dont_cache should be set at the time of
 * the first call to htable_free().

 * Let htable_steal() do the work, we just call htable_free()

 * Free up excess reserves

 * Allocate an htable, stealing one or using the reserve if necessary
panic("htable_alloc(): level %d out of range\n", level);
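The per-hat stealing loop reads roughly like the sketch below: skip the kernel hat, skip hats that are SHARED, FREEING, or already victims, mark the chosen hat as a victim while working on it, take cached htables on every pass, and only attack active tables on later passes. The structures and helpers here are simplified stand-ins, not the kernel's hat_t/htable_t.

#include <stddef.h>

#define	HAT_VICTIM	0x01
#define	HAT_SHARED	0x02
#define	HAT_FREEING	0x04

struct hat {
	struct hat	*hat_next;
	unsigned int	hat_flags;
};

/* stand-in: pull unused htables off the hat's cached free list */
static unsigned int
steal_cached(struct hat *hat, unsigned int want)
{
	(void)hat;
	(void)want;
	return (0);
}

/* stand-in: unload mappings and steal active page tables */
static unsigned int
steal_active(struct hat *hat, unsigned int want)
{
	(void)hat;
	(void)want;
	return (0);
}

static unsigned int
steal_pass(struct hat *kas_hat, unsigned int want, int pass)
{
	unsigned int stolen = 0;
	struct hat *hat;

	/* skip the first hat (kernel) */
	for (hat = kas_hat->hat_next; hat != NULL && stolen < want;
	    hat = hat->hat_next) {
		/* skip existing victims, ISM dummy hats and exiting hats */
		if (hat->hat_flags & (HAT_VICTIM | HAT_SHARED | HAT_FREEING))
			continue;

		/* mark as victim so the hat can't be freed under us */
		hat->hat_flags |= HAT_VICTIM;

		stolen += steal_cached(hat, want - stolen);
		if (pass > 0 && stolen < want)
			stolen += steal_active(hat, want - stolen);

		hat->hat_flags &= ~HAT_VICTIM;
	}
	return (stolen);
}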
 * First reuse a cached htable from the hat_ht_cached field; this
 * avoids unnecessary trips through kmem/page allocators.
/* XX64 ASSERT() they're all zero somehow */

 * Allocate an htable, possibly refilling the reserves.

 * Donate successful htable allocations to the reserve.

 * allocate a page for the hardware page table if needed

 * If allocations failed, kick off a kmem_reap() and resort to
 * htable_steal(). We may spin here if the system is very low on
 * memory. If the kernel itself has consumed all memory and kmem_reap()
 * can't free up anything, then we'll really get stuck here.
 * That should only happen in a system where the administrator has
 * misconfigured VM parameters via /etc/system.

 * If we stole for a bare htable, release the pagetable page.

 * make stolen page table writable again in kpm
panic("failure making kpm r/w pfn=0x%lx", ht->ht_pfn);

 * All attempts to allocate or steal failed. This should only happen
 * if we run out of memory during boot, due perhaps to a huge
 * boot_archive. At this point there's no way to continue.
panic("htable_alloc(): couldn't steal\n");
 * Under the 64-bit hypervisor, we have 2 top level page tables.
 * If this allocation fails, we'll resort to stealing.
 * We use the stolen page indirectly, by freeing the
 * stolen page and then retrying the allocation.
panic("2nd steal ptable failed\n");
 * Shared page tables have all entries locked and entries may not
 * be added or deleted.

 * setup flags, etc. for VLP htables

 * Zero out any freshly allocated page table

 * Free up an htable, either to a hat's cached list, the reserves or kmem.

 * If the process isn't exiting, cache the free htable in the hat
 * structure. We always do this for the boot time reserve. We don't
 * do this if the hat is exiting or we are stealing/reaping htables.

 * If we have a hardware page table, free it.

 * We don't free page tables that are accessed by sharing.

 * Free it or put into reserves.

 * This is called when a hat is being destroyed or swapped out. We reap all
 * the remaining htables in the hat cache. If destroying, all left over
 * htables are also destroyed.
 *
 * We also don't need to invalidate any of the PTPs nor do any demapping.

 * Purge the htable cache if just reaping.

 * if freeing, no locking is needed

 * walk thru the htable hash table and free all the htables in it.

 * Unlink an entry for a table at vaddr and level out of the existing table
 * one level higher. We are always holding the HASH_ENTER() when doing this.

 * This is weird, but Xen apparently automatically unlinks empty
 * pagetables from the upper page table. So allow PTP to be 0 already.

 * When a top level VLP page table entry changes, we must issue
 * a reload of cr3 on all processors.
 *
 * If we don't need to do that, then we still have to INVLPG against
 * an address covered by the inner page table, as the latest processors
 * have TLB-like caches for non-leaf page table entries.

 * Link an entry for a new table at vaddr and level into the existing table
 * one level higher. We are always holding the HASH_ENTER() when doing this.

 * When any top level VLP page table entry changes, we must issue
 * a reload of cr3 on all processors using it.
 * We also need to do this for the kernel hat on PAE 32 bit kernel.

 * Release of hold on an htable. If this is the last use and the pagetable
 * is empty we may want to free it, then recursively look at the pagetable
 * above it. The recursion is handled by the outer while() loop.
 *
 * On the metal, during process exit, we don't bother unlinking the tables from
 * upper level pagetables. They are instead handled in bulk by hat_free_end().
 * We can't do this on the hypervisor as we need the page table to be
 * implicitly unpinned before it goes to the free page lists. This can't
 * happen unless we fully unlink it from the page table hierarchy.

 * The common case is that this isn't the last use of
 * an htable so we don't want to free the htable.

 * we always release empty shared htables

 * don't release if in address space tear down

 * At and above max_page_level, free if it's for
 * a boot-time kernel mapping below kernelbase.

 * Remember if we destroy an htable that shares its PFN
 * from elsewhere.

 * Handle release of a table and freeing the htable_t.
 * Unlink it from the table higher (ie. ht_parent).

 * remove this htable from its hash list

 * If we released a shared htable, do a release on the htable
 * from which it shared

 * Find the htable for the pagetable at the given level for the given address.
 * If found acquires a hold that eventually needs to be htable_release()d

 * 32 bit address spaces on 64 bit kernels need to check
 * for overflow of the 32 bit address space

 * Acquires a hold on a known htable (from a locked hment entry).

 * make sure the htable is there

 * Find the htable for the pagetable at the given level for the given address.
 * If found acquires a hold that eventually needs to be htable_release()d.
 * If not found the table is created.
 *
 * Since we can't hold a hash table mutex during allocation, we have to
 * drop it and redo the search on a create. Then we may have to free the newly
 * allocated htable if another thread raced in and created it ahead of us
 * (a simplified model of this pattern is sketched below).
panic("htable_create(): level %d out of range\n", level);
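A userland model of the lookup/allocate/re-search dance described above for htable_create(): search under the bucket lock, drop the lock to allocate, search again, and if another thread created the entry in the meantime, use theirs and discard ours. The single-bucket list and the names are illustrative only.

#include <pthread.h>
#include <stdlib.h>

struct htable {
	struct htable	*ht_next;
	unsigned long	ht_vaddr;
	unsigned int	ht_busy;
};

static pthread_mutex_t bucket_lock = PTHREAD_MUTEX_INITIALIZER;
static struct htable *bucket;

/* search the bucket; caller holds bucket_lock */
static struct htable *
lookup_locked(unsigned long vaddr)
{
	struct htable *ht;

	for (ht = bucket; ht != NULL; ht = ht->ht_next)
		if (ht->ht_vaddr == vaddr)
			return (ht);
	return (NULL);
}

static struct htable *
create(unsigned long vaddr)
{
	struct htable *ht, *new = NULL;

	pthread_mutex_lock(&bucket_lock);
	for (;;) {
		ht = lookup_locked(vaddr);
		if (ht != NULL) {
			++ht->ht_busy;		/* found: take a hold */
			break;
		}
		if (new != NULL) {
			/* 2nd search and still not there, use "new" table */
			new->ht_vaddr = vaddr;
			new->ht_busy = 1;
			new->ht_next = bucket;
			bucket = ht = new;
			new = NULL;
			break;
		}
		/* can't allocate while holding the bucket lock: drop it */
		pthread_mutex_unlock(&bucket_lock);
		new = calloc(1, sizeof (*new));	/* OOM handling omitted */
		pthread_mutex_lock(&bucket_lock);
	}
	pthread_mutex_unlock(&bucket_lock);
	free(new);	/* lost the race: discard the extra allocation */
	return (ht);
}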
 * Create the page tables in top down order.

 * look up the htable at this level

 * if we found the htable, increment its busy cnt
 * and if we had allocated a new htable, free it.

 * If we find a pre-existing shared table, it must
 * share from the same place.
panic("htable shared from wrong place "
    "found htable=%p shared=%p", (void *)ht, (void *)shared);

 * if we didn't find it on the first search
 * allocate a new one and search again

 * 2nd search and still not there, use "new" table

 * Link new table into higher, when not at top level.
 *
 * Note we don't do htable_release(higher).
 * That happens recursively when "new" is removed by
 * htable_release() or htable_steal().

 * If we just created a new shared page table we
 * increment the shared htable's busy count, so that
 * it can't be the victim of a steal even if it's empty.

 * Inherit initial pagetables from the boot program. On the 64-bit
 * hypervisor we also temporarily mark the p_index field of page table
 * pages, so we know not to try making them writable in seg_kpm.

 * make sure the page table physical page is not FREE
panic("page_resv() failed in ptable alloc");
 * Page table pages that were allocated by dboot or
 * in very early startup didn't go through boot_mapin()
 * and so won't have vnode/offsets.
/* match offset calculation in page_get_physical() */
offset += 1ULL << 40;	/* something > 4 Gig */
 * Record in the page_t that it is a pagetable for segkpm setup.

 * Count valid mappings and recursively attach lower level pagetables.

 * As long as all the mappings we had were below kernel base
 * we can release the htable.

 * Walk through a given htable looking for the first valid entry. This
 * routine takes both a starting and ending address. The starting address
 * is required to be within the htable provided by the caller, but there is
 * no such restriction on the ending address.
 *
 * If the routine finds a valid entry in the htable (at or beyond the
 * starting address), the PTE (and its address) will be returned.
 * This PTE may correspond to either a page or a pagetable - it is the
 * caller's responsibility to determine which. If no valid entry is
 * found, 0 (and invalid PTE) and the next unexamined address will be
 * returned.
 *
 * The loop has been carefully coded for optimization.

 * Compute the starting index and ending virtual address

 * The following page table scan code knows that the valid
 * bit of a PTE is in the lowest byte AND that x86 is little endian!!
 * (this trick is illustrated in the sketch below)

 * if we found a valid PTE, load the entire PTE

 * deal with VA hole on amd64

 * Find the address and htable for the first populated translation at or
 * above the given virtual address. The caller may also specify an upper
 * limit to the address range to search. Uses level information to quickly
 * skip unpopulated sections of virtual address spaces.
 *
 * If not found returns NULL. When found, returns the htable and virt addr
 * and has a hold on the htable.

 * If this is a user address, then we know we need not look beyond
 * kernelbase.

 * If we're coming in with a previous page table, search it first
 * without doing an htable_lookup(), this should be frequent.

 * We found nothing in the htable provided by the caller,
 * so fall through and do the full search

 * Find the level of the largest pagesize used by this HAT.

 * Find lowest table with any entry for given address.

 * No htable at this level for the address. If there
 * is no larger page size that could cover it, we can
 * skip right to the start of the next page table.

 * Find the htable and page table entry index of the given virtual address
 * with pagesize at or below given level.
 * If not found returns NULL. When found, returns the htable, sets
 * entry, and has a hold on the htable.
for (l = 0; l <= level; ++l) {
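The endianness trick called out above can be shown with a small standalone scan routine: because the valid bit sits in the lowest byte of each PTE and x86 is little endian, the table can be scanned one byte at a time and the full 64-bit PTE loaded only for a hit. PT_VALID and the table size are illustrative values.

#include <stdint.h>
#include <stddef.h>

#define	PT_VALID	0x01
#define	PTES_PER_TABLE	512

/* return the index of the first valid PTE at or after 'start', or -1 */
static int
scan_for_valid(const uint64_t *table, int start)
{
	const uint8_t *low = (const uint8_t *)table;
	int i;

	for (i = start; i < PTES_PER_TABLE; ++i) {
		/* check only the low byte; little endian puts PT_VALID there */
		if (low[i * sizeof (uint64_t)] & PT_VALID)
			return (i);	/* caller now loads table[i] in full */
	}
	return (-1);
}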
 * Find the htable and page table entry index of the given virtual address.
 * There must be a valid page mapped at the given address.
 * If not found returns NULL. When found, returns the htable, sets
 * entry, and has a hold on the htable.

 * To save on kernel VA usage, we avoid debug information in 32 bit
 * kernels.

 * get the pte index for the virtual address in the given htable's pagetable

 * Given an htable and the index of a pte in it, return the virtual address

 * Need to skip over any VA hole in top level table

 * The code uses compare and swap instructions to read/write PTE's to
 * avoid atomicity problems, since PTEs can be 8 bytes on 32 bit systems.
 * On 64 bit kernels, such aligned accesses
 * will naturally be atomic.

 * The combination of using kpreempt_disable()/_enable() and the hci_mutex
 * is used to ensure that an interrupt won't overwrite a temporary mapping
 * while it's in use. If an interrupt thread tries to access a PTE, it will
 * yield briefly back to the pinned thread which holds the cpu's hci_mutex.

 * On 32 bit kernels, loading a 64 bit PTE is a little tricky
if ((t & 0xffffffff) == p[0])
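The `if ((t & 0xffffffff) == p[0])` test above belongs to that 32-bit read loop. A standalone version of the idea: read the low half, then the high half, and accept the value only if the low half still reads the same, retrying otherwise so a torn read of a concurrently changing PTE is never returned.

#include <stdint.h>

/* assemble a 64-bit PTE from two 32-bit halves without tearing */
static uint64_t
pte_get32(volatile uint32_t *p)
{
	uint64_t t;

	for (;;) {
		t = p[0];
		t |= (uint64_t)p[1] << 32;
		if ((t & 0xffffffff) == p[0])	/* low half unchanged? */
			return (t);
		/* a concurrent update moved the PTE under us: retry */
	}
}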
 * Disable preemption and establish a mapping to the pagetable with the
 * given pfn. This is optimized for the case where it's the same
 * pfn as we last referenced from this CPU.

 * VLP pagetables are contained in the hat_t

 * map the given pfn into the page table window.

 * If kpm is available, use it.

 * Disable preemption and grab the CPU's hci_mutex

 * For hardware we can use a writable mapping.

 * Release access to a page table.

 * nothing to do for VLP htables

 * Drop the CPU's hci_mutex and restore preemption.

 * We need to always clear the mapping in case a page
 * that was once a page table page is ballooned out.

 * Atomic retrieval of a pagetable entry

 * Be careful that loading PAE entries in 32 bit kernel is atomic.

 * Atomic unconditional set of a page table entry, it returns the previous
 * value. For pre-existing mappings if the PFN changes, then we don't care
 * about the old pte's REF / MOD bits. If the PFN remains the same, we leave
 * the REF / MOD bits unchanged.
 *
 * If asked to overwrite a link to a lower page table with a large page
 * mapping, this routine returns the special value of LPAGE_ERROR. This
 * allows the upper HAT layers to retry with a smaller mapping size.
ASSERT(new != 0);
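A compact model of the unconditional set described above, using C11 atomics in place of the kernel's CAS primitive: install the new PTE, and when the PFN is unchanged, carry the old REF/MOD bits over. The LPAGE_ERROR collision check mentioned above is omitted, and the PT_* masks are illustrative values for 4K-page x86 PTEs.

#include <stdatomic.h>
#include <stdint.h>

#define	PT_VALID	0x001ULL
#define	PT_REF		0x020ULL		/* accessed */
#define	PT_MOD		0x040ULL		/* dirty */
#define	PT_PADDR	0x000ffffffffff000ULL	/* pfn bits, 4K pages */

/* unconditionally install 'new'; returns the previous PTE value */
static uint64_t
pte_set(_Atomic uint64_t *ptep, uint64_t new)
{
	uint64_t old, n;

	do {
		old = atomic_load(ptep);
		n = new;
		/* remapping the same pfn: preserve the old REF/MOD bits */
		if ((old & PT_VALID) && (old & PT_PADDR) == (new & PT_PADDR))
			n |= old & (PT_REF | PT_MOD);
	} while (!atomic_compare_exchange_weak(ptep, &old, n));

	return (old);
}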
/* don't use to invalidate a PTE, see x86pte_update */

 * Install the new PTE. If remapping the same PFN, then
 * copy existing REF/MOD bits to new mapping.

 * Another thread may have installed this mapping already,
 * flush the local TLB and be done.

 * Detect if we have a collision of installing a large
 * page mapping where there already is a lower page table.

 * Do a TLB demap if needed, ie. the old pte was valid.
 *
 * Note that a stale TLB writeback to the PTE here either can't happen
 * or doesn't matter. The PFN can only change for NOSYNC|NOCONSIST
 * mappings, but they were created with REF and MOD already set, so
 * no stale writeback will happen.
 *
 * Segmap is the only place where remaps happen on the same pfn and for
 * that we want to preserve the stale REF/MOD bits.

 * Atomic compare and swap of a page table entry. No TLB invalidates are done.
 * This is used for links between pagetables of different levels.
 * Note we always create these links with dirty/access set, so they should
 * never change.

 * We can't use writable pagetables for upper level tables, so fake it.

 * On the 64-bit hypervisor we need to maintain the user mode
 * top page table too.
panic("HYPERVISOR_mmu_update() failed");
 * Invalidate a page table entry as long as it currently maps something that
 * matches the value determined by expect.
 *
 * Also invalidates any TLB entries and returns the previous value of the PTE.

 * If exit()ing just use HYPERVISOR_mmu_update(), as we can't be racing
 * anything else.
panic("HYPERVISOR_mmu_update() failed");
 * Note that the loop is needed to handle changes due to h/w updating
 * of REF/MOD bits.

 * Change a page table entry if it currently matches the value in expect.

 * When removing write permission *and* clearing the
 * MOD bit, check if a write happened via a stale
 * TLB entry before the TLB shootdown finished.
 * If it did happen, simply re-enable write permission and
 * act like the original CAS failed.

 * Copy page tables - this is just a little more complicated than the
 * previous routines. Note that it's also not atomic! It also is never
 * used for VLP pagetables.

 * Acquire access to the CPU pagetable windows for the dest and source.

 * Finish defining the src pagetable mapping

 * The hypervisor only supports writable pagetables at level 0, so we have
 * to install these 1 by 1 the slow way.

 * Zero page table entries - Note this doesn't use atomic stores!

 * Map in the page table to be zeroed.

 * On the hypervisor we don't use x86pte_access_pagetable() since
 * in this case the page is not pinned yet.

 * Called to ensure that all pagetables are in the system dump
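To make the stale-MOD handling concrete, here is a small model of the conditional update described above. C11 atomics stand in for the kernel CAS, the TLB shootdown between the update and the re-check is elided, and the PT_* values are illustrative: if write permission was removed and MOD cleared, but the PTE re-reads with MOD set, write permission is restored and the update reports failure.

#include <stdatomic.h>
#include <stdint.h>
#include <stdbool.h>

#define	PT_WRITABLE	0x002ULL
#define	PT_MOD		0x040ULL

/* change *ptep from expect to new; returns false if the update "failed" */
static bool
pte_update(_Atomic uint64_t *ptep, uint64_t expect, uint64_t new)
{
	uint64_t cur = expect;

	if (!atomic_compare_exchange_strong(ptep, &cur, new))
		return (false);

	/* were we removing write permission and clearing MOD? */
	if ((expect & PT_WRITABLE) && !(new & (PT_WRITABLE | PT_MOD))) {
		/* in the kernel, the TLB shootdown completes before this */
		cur = atomic_load(ptep);
		if (cur & PT_MOD) {
			/* a stale-TLB write landed: restore write permission */
			atomic_fetch_or(ptep, PT_WRITABLE);
			return (false);	/* act like the original CAS failed */
		}
	}
	return (true);
}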