vm_machdep.c revision affbd3ccca8e26191a210ec9f9ffae170f919afd

/*
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License, Version 1.0 only
 * (the "License").  You may not use this file except in compliance
 * with the License.
 *
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each file.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 */
/*
 * Copyright 2005 Sun Microsystems, Inc.  All rights reserved.
 * Use is subject to license terms.
 */

/*	Copyright (c) 1984, 1986, 1987, 1988, 1989  AT&T	*/
/*	  All Rights Reserved	*/

/*
 * Portions of this source code were derived from Berkeley 4.3 BSD
 * under license from the Regents of the University of California.
 */
#pragma ident	"%Z%%M%	%I%	%E% SMI"

/*
 * UNIX machine dependent virtual memory support.
 */

/* 4g memory management */

/* How many page sizes the users can see */

/*
 * Return the optimum page size for a given mapping:
 * use the page size that best fits len; for ISM, use the 1st large page size.
 */

/*
 * This can be patched via /etc/system to allow large pages
 * to be used for mapping application and library text segments.
 */

/*
 * Return a bit vector of large page size codes that
 * can be used to map [addr, addr + len) region.
 */

/*
 * If this isn't a potential unmapped hole in the user's
 * UNIX data or stack segments, just return status info.
 */

/*
 * Check to see if we happened to fault on a currently unmapped
 * part of the UNIX data or stack segments.  If so, create a zfod
 * mapping there and then try calling the fault routine again.
 */
	/* not in either UNIX data or stack segments */

/*
 * The rest of this function implements 3.X/4.X/5.X compatibility.
 * This code is probably not needed anymore.
 */
	/* expand the gap to the page boundaries on each side */

	/*
	 * This page is already mapped by another thread after
	 * we returned from as_fault() above.  We just fall
	 * through as_fault() below.
	 */

/*
 * map_addr_proc() is the routine called when the system is to
 * choose an address for the user.  We will pick an address
 * range which is the highest available below kernelbase.
 *
 * On input it is a hint from the user to be used in a completely
 * machine dependent fashion.  We decide to completely ignore this hint.
 *
 * On output it is NULL if no address can be found in the current
 * process's address space or else an address that is currently
 * not mapped for len bytes with a page of red zone on either side.
 *
 * align is not needed on x86 (it's for virtually addressed caches).
 */

	/* XX64  Yes, this needs more work. */

	/*
	 * This happens when a program wants to map
	 * something in a range that's accessible to a
	 * program in a smaller address space.  For example,
	 * a 64-bit program calling mmap32(2) to guarantee
	 * that the returned address is below 4Gbytes.
	 */

	/*
	 * XX64  This layout is probably wrong .. but in
	 * the event we make the amd64 address space look
	 * like sparcv9 i.e. with the stack -above- the
	 * heap, this bit of code might even be correct.
	 */

	/*
	 * Redzone for each side of the request.  This is done to leave
	 * one page unmapped between segments.  This is not required, but
	 * it's useful for the user because if their program strays across
	 * a segment boundary, it will catch a fault immediately, making
	 * debugging a little easier.
	 */

	/*
	 * Figure out what the alignment should be.
	 * XX64 -- is there an ELF_AMD64_MAXPGSZ or is it the same????
	 */

	/*
	 * Align virtual addresses to ensure that ELF shared libraries
	 * are mapped with the appropriate alignment constraints.
	 */

	/*
	 * Look for a large enough hole starting below userlimit.
	 * After finding it, use the upper part.  Addition of PAGESIZE
	 * is for the redzone as described above.
	 */

		/*
		 * Round address DOWN to the alignment amount,
		 * add the offset, and if this address is less
		 * than the original address, add alignment amount.
		 */

/*
 * Determine whether [base, base+len] contains a valid range of
 * addresses at least minlen long.  base and len are adjusted if
 * required to provide a valid range.
 */

	/* If hi rolled over the top, try cutting back. */

	/*
	 * Deal with a possible hole in the address range between
	 * hole_start and hole_end that should never be mapped.
	 */
		/* lo < hole_start && hi >= hole_end */

/*
 * Determine whether [addr, addr+len] are valid user addresses.
 */

/*
 * Return 1 if the page frame is onboard memory, else 0.
 */

/* initialized by page_coloring_init() */
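The "[base, base+len]" comments above describe a clipping problem: carve a forbidden hole out of a candidate address range and report whether at least minlen bytes survive. Below is a minimal user-space sketch of that idea; the function name, the parameterized hole bounds, and the "keep the larger side" choice are assumptions made for illustration, not a copy of the original routine.

#include <stdint.h>
#include <stdio.h>

/*
 * Hedged sketch: carve the forbidden hole [hole_start, hole_end) out of
 * the candidate range, adjust *basep/*lenp to the usable piece, and
 * return 1 if at least minlen bytes survive.
 */
static int
valid_va_range_sketch(uintptr_t *basep, size_t *lenp, size_t minlen,
    uintptr_t hole_start, uintptr_t hole_end)
{
	uintptr_t lo = *basep;
	uintptr_t hi = lo + *lenp;

	if (hi < lo)				/* hi rolled over the top */
		hi = (uintptr_t)-1;		/* try cutting back */

	if (hi <= hole_start || lo >= hole_end) {
		/* range lies entirely outside the hole: nothing to do */
	} else if (lo < hole_start && hi >= hole_end) {
		/* range straddles the hole: keep the larger side */
		if (hole_start - lo >= hi - hole_end)
			hi = hole_start;
		else
			lo = hole_end;
	} else if (lo < hole_start) {
		hi = hole_start;		/* upper part is in the hole */
	} else {
		lo = hole_end;			/* lower part is in the hole */
	}

	*basep = lo;
	*lenp = hi - lo;
	return (*lenp >= minlen);
}

int
main(void)
{
	uintptr_t base = 0x10000000;
	size_t len = 0x80000000;

	/* hole at [1G, 1.25G): the larger piece above the hole is kept */
	printf("%d\n", valid_va_range_sketch(&base, &len, 0x1000,
	    0x40000000, 0x50000000));
	printf("base=%#lx len=%#zx\n", (unsigned long)base, len);
	return (0);
}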
/*
 * Page freelists and cachelists are dynamically allocated once mnoderangecnt
 * and page_colors are calculated from the l2 cache n-way set size.  Within a
 * mnode range, the page freelist and cachelist are hashed into bins based on
 * color.  This makes it easier to search for a page within a specific memory
 * range.
 */

/*
 * As the PC architecture evolved, memory was clumped into several
 * ranges for various historical I/O devices to do DMA.
 * < 4Gig - PCI bus or drivers that don't understand PAE mode
 */
	0x100000,	/* pfn range for 4G and above */
	0x80000,	/* pfn range for 2G-4G */
	0x01000,	/* pfn range for 16M-2G */
	0x00000,	/* pfn range for 0-16M */

/*
 * These are changed during startup if the machine has limited memory.
 */

/* Used by page layer to know about page sizes */

/*
 * This can be patched via /etc/system to allow old non-PAE aware device
 * drivers to use kmem_alloc'd memory on 32 bit systems with > 4Gig RAM.
 */

/*
 * return the memrange containing pfn
 */
	for (n = 0; n < nranges - 1; ++n) {
		/* ... */
	}
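The loop fragment just above is the pfn-to-memrange lookup. A self-contained sketch of that lookup against the table shown earlier follows; the names ending in _sketch and the exact bounds handling are illustrative assumptions, not a copy of the original routine.

#include <stdio.h>

typedef unsigned long pfn_t;

/* Start pfn of each range, highest range first, as in the table above. */
static const pfn_t memranges_sketch[] = {
	0x100000,	/* pfn range for 4G and above */
	0x80000,	/* pfn range for 2G-4G */
	0x01000,	/* pfn range for 16M-2G */
	0x00000,	/* pfn range for 0-16M */
};
static const int nranges = sizeof (memranges_sketch) / sizeof (pfn_t);

/*
 * Return the index of the memrange containing pfn: scan from the highest
 * range down and stop at the first range whose start pfn is <= pfn.
 */
static int
memrange_num_sketch(pfn_t pfn)
{
	int n;

	for (n = 0; n < nranges - 1; ++n) {
		if (pfn >= memranges_sketch[n])
			break;
	}
	return (n);
}

int
main(void)
{
	/* pfn 0x2000 is 32M with 4K pages, so it falls in the 16M-2G range */
	printf("%d\n", memrange_num_sketch(0x2000));
	return (0);
}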
/*
 * return the mnoderange containing pfn
 */

/*
 * Returns a page list of contiguous pages.  It minimally has to return
 * minctg pages.  Caller determines minctg based on the scatter-gather
 * list length (sgllen).
 *
 * pfnp is set to the next page frame to search on return.
 */
	/* fail if pfn + minctg crosses a segment boundary */

	/* Adjust for next starting pfn to begin at segment boundary. */

	/* exit loop when pgcnt satisfied or segment boundary reached */

		pfnp += i;	/* set to next pfn to search */

	/*
	 * failure: minctg not satisfied.
	 *
	 * if next request crosses segment boundary, set next pfn
	 * to search from the segment boundary.
	 */

	/* clean up any pages already allocated */

/*
 * verify that pages being returned from allocator have correct DMA attribute
 */
		panic("PFN (pp=%p) below dma_attr_addr_lo", pp);

		panic("PFN (pp=%p) above dma_attr_addr_hi", pp);
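The two panics above back up a simple invariant: every page frame handed back by the allocator must lie inside the caller's DMA addressability window. A hedged user-space sketch of that check follows; the struct, the 4K page-size assumption, and check_dma_range() itself are illustrative, with only the dma_attr_addr_lo/dma_attr_addr_hi field names borrowed from the DDI DMA attribute convention.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define	PAGESHIFT	12	/* assumes 4K pages for the illustration */

/* Illustrative stand-in for the two relevant DMA attribute fields. */
typedef struct {
	uint64_t dma_attr_addr_lo;	/* lowest usable physical address */
	uint64_t dma_attr_addr_hi;	/* highest usable physical address */
} dma_attr_sketch_t;

/*
 * Panic-style check: the pfn must map to a physical address inside
 * [dma_attr_addr_lo, dma_attr_addr_hi].
 */
static void
check_dma_range(const dma_attr_sketch_t *mattr, uint64_t pfn)
{
	uint64_t pa = pfn << PAGESHIFT;

	if (pa < mattr->dma_attr_addr_lo) {
		fprintf(stderr, "pfn %#llx below dma_attr_addr_lo\n",
		    (unsigned long long)pfn);
		abort();
	}
	if (pa + ((uint64_t)1 << PAGESHIFT) - 1 > mattr->dma_attr_addr_hi) {
		fprintf(stderr, "pfn %#llx above dma_attr_addr_hi\n",
		    (unsigned long long)pfn);
		abort();
	}
}

int
main(void)
{
	dma_attr_sketch_t attr = { 0, 0xffffffffULL };	/* 32-bit DMA device */

	check_dma_range(&attr, 0x1000);		/* 16M: fine for this device */
	return (0);
}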
/*
 * in order to satisfy the request, must minimally
 * acquire minctg contiguous pages
 */

	/* start from where last searched if the minctg >= lastctgcnt */

	/* conserve 16m memory - start search above 16m when possible */

	/* return when contig pages no longer needed */

	/* cannot find contig pages in specified range */

	/* did not start with lo previously */
	/* allow search to go above startpfn */

	/* return when contig pages no longer needed */

/*
 * combine mem_node_config and memrange memory ranges into one data
 * structure to be used for page list management.
 *
 * mnode_range_cnt() calculates the number of memory ranges for mnode and
 * memranges[].  Used to determine the size of page lists and mnoderanges.
 *
 * mnode_range_setup() initializes mnoderanges.
 */
	/* find the memranges index below contained in mnode range */

	/* increment mnode range counter when memranges or mnode boundary is crossed */

/*
 * Determine if the mnode range specified in mtype contains memory belonging
 * to memory node mnode.  If flags & PGI_MT_RANGE is set then mtype contains
 * the range of indices to 0 or 4g.
 *
 * Return first mnode range type index found otherwise return -1 if none found.
 */
	int	mtlim = 0;	/* default to PGI_MT_RANGE0 */
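As a rough illustration of what the mnode_range_cnt() description above amounts to, the sketch below counts how many memrange boundaries fall inside one memory node's pfn span: one range to begin with, plus one more per boundary crossed. The counting rule, the 4K page-size assumption in the example, and every name ending in _sketch are assumptions made for illustration only.

#include <stdio.h>

typedef unsigned long pfn_t;

/* Start pfn of each DMA memrange, highest first (mirrors the table above). */
static const pfn_t memrange_start[] = { 0x100000, 0x80000, 0x01000, 0x00000 };
#define	NUM_RANGES	(sizeof (memrange_start) / sizeof (memrange_start[0]))

/*
 * Count how many page-list ranges a memory node spanning
 * [physbase, physmax] needs: one to begin with, plus one more each time a
 * memrange boundary falls strictly inside the node's pfn span.
 */
static int
mnode_range_cnt_sketch(pfn_t physbase, pfn_t physmax)
{
	int cnt = 1;
	size_t n;

	for (n = 0; n < NUM_RANGES; n++) {
		if (memrange_start[n] > physbase && memrange_start[n] <= physmax)
			cnt++;
	}
	return (cnt);
}

int
main(void)
{
	/*
	 * A node covering 0 .. 8G (4K pages) crosses the 16M, 2G and 4G
	 * boundaries, so it needs 4 ranges.
	 */
	printf("%d\n", mnode_range_cnt_sketch(0, 0x200000 - 1));
	return (0);
}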
/* Returns the free page count for mnode */

/*
 * Initialize page coloring variables based on the l2 cache parameters.
 * Calculate and return memory needed for page coloring data structures.
 */
	/*
	 * Reduce the memory ranges lists if we don't have large amounts
	 * of memory.  This avoids searching known empty free lists.
	 */
		/* physmax greater than 4g */

	/* setup pagesize for generic page layer */

	/* l2_assoc is 0 for fully associative l2 cache */

	/* for scalability, configure at least PAGE_COLORS_MIN color bins */

	/*
	 * cpu_page_colors is non-zero when a page color may be spread
	 * across multiple bins.
	 */

	/* size for mnoderanges */
	/* size for fpc_mutex and cpc_mutex */
	/* size of page_freelists */
	/* size of page_cachelists */

/*
 * Called once at startup to configure page_coloring data structures and
 * does the 1st page_free()/page_freelist_add().
 */

/*
 * get a page from any list with the given mnode
 */
	/*
	 * check up to page_colors + 1 bins - origbin may be checked twice
	 * because of BIN_STEP skip
	 */
		/* check if page within DMA attributes */
		/* found a page with specified DMA attributes */

	/* failed to find a page in the freelist; try it in the cachelist */

	/* reset mtype start for cachelist search */

	/* start with the bin of matching color */

		/* check if page within DMA attributes */
		/* found a page with specified DMA attributes */

/*
 * This function is similar to page_get_freelist()/page_get_cachelist()
 * but it searches both the lists to find a page with the specified
 * color (or no color) and DMA attributes.  The search is done in the
 * freelist first and then in the cache list within the highest memory
 * range (based on DMA attributes) before searching in the lower
 * memory ranges.
 *
 * Note: This function is called only by page_create_io().
 */
	/* only base pagesize currently supported */

	/*
	 * If we're passed a specific lgroup, we use it.  Otherwise,
	 * assume first-touch placement is desired.
	 */

	/*
	 * Only hold one freelist or cachelist lock at a time, that way we
	 * can start anywhere and not have to worry about lock ordering.
	 */

	/* We can guarantee alignment only for page boundary. */

		/* cycling thru mtype handled by RANGE0 if n == 0 */

		/*
		 * Try local memory node first, but try remote if we can't
		 * get a page of the right color.
		 */

/*
 * allocate pages from high pfn to low.
 */

/*
 * This function is a copy of page_create_va() with an additional
 * argument 'mattr' that specifies DMA memory requirements to
 * the page list functions.  This function is used by the segkmem
 * allocator so it is only to create new pages (i.e. PG_EXCL is set).
 *
 * Note: This interface is currently used by x86 PSM only and is
 * not fully specified so the commitment level is only for
 * private interface specific to x86.  This interface uses PSM
 * specific page_get_anylist() interface.
 */
	/* TRACE: "page_create_start:vp %p off %llx bytes %u flags %x" */
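The page_get_anylist()-style search described above goes to the freelist first and then to the cachelist, starting in the highest memory range the DMA attributes permit and working downward. The following sketch shows only that ordering; the range enumeration and the have-page helpers are illustrative stand-ins, not the real page list interfaces.

#include <stdbool.h>
#include <stdio.h>

/* Illustrative memory-range indices, lowest range first. */
enum { RANGE_0_16M, RANGE_16M_2G, RANGE_2G_4G, RANGE_4G_UP };

/* Stand-ins for "this range's freelist/cachelist has a usable page". */
static bool freelist_has_page(int range)  { (void)range; return (false); }
static bool cachelist_has_page(int range) { return (range == RANGE_16M_2G); }

/*
 * Within each memory range, look in the freelist first and then in the
 * cachelist, starting from the highest range the DMA attributes allow
 * and working downward.
 */
static int
get_any_page_sketch(int hi_range)
{
	int range;

	for (range = hi_range; range >= RANGE_0_16M; range--) {
		if (freelist_has_page(range))
			return (range);
		if (cachelist_has_page(range))
			return (range);
	}
	return (-1);	/* nothing usable in any allowed range */
}

int
main(void)
{
	/* a device limited to 32-bit DMA starts its search in the 2G-4G range */
	printf("page found in range %d\n", get_any_page_sketch(RANGE_2G_4G));
	return (0);
}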
	/*
	 * Do the freemem and pcf accounting.
	 */
	/* TRACE: "page_create_success:vp %p off %llx" */

	/*
	 * If satisfying this request has left us with too little
	 * memory, start the wheels turning to get some back.  The
	 * first clause of the test prevents waking up the pageout
	 * daemon in situations where it would decide that there's
	 * nothing to do.
	 */
		/* TRACE: "pageout_cv_signal:freemem %ld", freemem */

		panic("pg_creat_io: hashin failed %p %p %llx", ...);
/*
 * page_get_contigpage returns when npages <= sgllen.
 * Grab the rest of the non-contig pages below from anylist.
 */

	/*
	 * Loop around collecting the requested number of pages.
	 * Most of the time, we have to `create' a new page.  With
	 * this in mind, pull the page off the free list before
	 * getting the hash lock.  This will minimize the hash
	 * lock hold time, nesting, and the like.  If it turns
	 * out we don't need the page, we put it back at the end.
	 */

		/*
		 * Try to get the page of any color either from
		 * the freelist or from the cache list.
		 */

		/* Not looking for a special page */

			/*
			 * No page found!  This can happen
			 * if we are looking for a page
			 * within a specific memory range
			 * for DMA purposes.  If PG_WAIT is
			 * specified then we wait for a
			 * while and then try again.  The
			 * wait could be forever if we
			 * don't get the page(s) we need.
			 *
			 * Note: XXX We really need a mechanism
			 * to wait for pages in the desired
			 * range.  For now, we wait for any
			 * pages and see if we can use it.
			 */
				goto fail;	/* undo accounting stuff */

		/*
		 * Since this page came from the
		 * cachelist, we must destroy the
		 * old vnode association.
		 */

		/*
		 * Here we have a page in our hot little mitts and are
		 * just waiting to stuff it on the appropriate lists.
		 * Get the mutex and check to see if it really does
		 * not exist.
		 */

			/*
			 * Since we hold the page hash mutex and
			 * just searched for this page, page_hashin
			 * had better not fail.  If it does, that
			 * means some thread did not follow the
			 * page hash mutex rules.  Panic now and
			 * get it over with.  As usual, go down
			 * holding all the locks.
			 */
				panic("page_create: hashin fail %p %p %llx %p", ...);

		/*
		 * Hat layer locking need not be done to set
		 * the following bits since the page is not hashed
		 * and was on the free list (i.e., had no mappings).
		 *
		 * Set the reference bit to protect
		 * against immediate pageout.
		 *
		 * XXXmh modify freelist code to set reference
		 * bit so we don't have to do it here.
		 */

		/*
		 * NOTE: This should not happen for pages associated
		 * with kernel vnode 'kvp'.
		 */
			/*
			 * XX64 - to debug why this happens!
			 * message: "page_create: page not expected "
			 *          "in hash list for kernel vnode - pp 0x%p"
			 */
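The "pull the page off the free list before getting the hash lock" comment above is a lock-hold-minimization pattern: do the possibly slow acquisition outside the short-lived lock, take the lock only to publish, and give the resource back if publication turns out to be unnecessary. Below is a user-space sketch of that pattern, with every name invented for illustration.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustrative stand-ins; none of these names come from the original file. */
struct page { long off; };

static pthread_mutex_t page_hash_mutex = PTHREAD_MUTEX_INITIALIZER;
static struct page *hashed;		/* pretend one-slot "page hash" */

static struct page *
freelist_get(void)			/* potentially slow allocation */
{
	return (malloc(sizeof (struct page)));
}

static void
freelist_put(struct page *pp)
{
	free(pp);
}

/*
 * Pull the page off the free list *before* taking the hash lock, so the
 * hash lock is held only for the short publish step rather than for the
 * whole allocation.  If it turns out we did not need the page, put it
 * back afterwards.
 */
static struct page *
create_page_sketch(long off)
{
	struct page *pp = freelist_get();

	if (pp == NULL)
		return (NULL);

	pthread_mutex_lock(&page_hash_mutex);
	if (hashed != NULL) {
		/* someone else published a page first: give ours back */
		pthread_mutex_unlock(&page_hash_mutex);
		freelist_put(pp);
		return (NULL);
	}
	pp->off = off;
	hashed = pp;			/* short critical section: publish */
	pthread_mutex_unlock(&page_hash_mutex);
	return (pp);
}

int
main(void)
{
	printf("%p\n", (void *)create_page_sketch(0x1000));
	return (0);
}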
		/*
		 * Got a page!  It is locked.  Acquire the i/o
		 * lock since we are going to use the p_next and
		 * p_prev fields to link the requested pages together.
		 */

		/*
		 * Did not need this page after all.
		 * Put it back on the free list.
		 */

	/*
	 * Give up the pages we already got.
	 */
		/*LINTED: constant in conditional ctx*/

	/*
	 * VN_DISPOSE does freemem accounting for the pages in plist
	 * by calling page_free.  So, we need to undo the pcf accounting
	 * for only the remaining pages.
	 */

/*
 * Copy the data from the physical page represented by "frompp" to
 * that represented by "topp".  ppcopy uses CPU->cpu_caddr1 and
 * CPU->cpu_caddr2.  It assumes that no one uses either map at interrupt
 * level and no one sleeps with an active mapping there.
 *
 * Note that the ref/mod bits in the page_t's are not affected by
 * this operation, hence it is up to the caller to update them appropriately.
 */
	/* disable pre-emption so that CPU can't change */

/*
 * Zero the physical page from off to off + len given by `pp'
 * without changing the reference and modified bits of page.
 *
 * We do this using CPU private page address #2, see ppcopy() for more info.
 * pagezero() must not be called at interrupt level.
 */

/*
 * Platform-dependent page scrub call.
 * For now, we rely on pagezero().
 */

/*
 * set up two private addresses for use on a given CPU for use in ppcopy()
 */

/*
 * Create the pageout scanner thread.  The thread has to
 * start at procedure with process pp and priority pri.
 */

/*
 * Function for flushing D-cache when performing module relocations
 * to an alternate mapping.  Unnecessary on Intel / AMD platforms.
 */
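A minimal user-space sketch of the pagezero() contract stated above: zero bytes [off, off + len) of a single page while leaving the page's reference/modified bookkeeping untouched. The struct, its fields, and pagezero_sketch() are illustrative assumptions; only the contract itself comes from the comment.

#include <assert.h>
#include <stdint.h>
#include <string.h>

#define	PAGESIZE	4096

/* Illustrative page with fake ref/mod bookkeeping standing in for page_t. */
typedef struct {
	uint8_t	data[PAGESIZE];
	int	ref;	/* "referenced" software bit */
	int	mod;	/* "modified" software bit */
} page_sketch_t;

/*
 * Zero bytes [off, off + len) of the page; the ref/mod bookkeeping is
 * deliberately left untouched, matching the contract stated above.
 */
static void
pagezero_sketch(page_sketch_t *pp, size_t off, size_t len)
{
	assert(off + len <= PAGESIZE);		/* must stay within one page */
	memset(pp->data + off, 0, len);
}

int
main(void)
{
	static page_sketch_t pg = { .data = { 0xff }, .ref = 1, .mod = 1 };

	pagezero_sketch(&pg, 0, PAGESIZE);
	return (pg.data[0] == 0 && pg.ref == 1 && pg.mod == 1 ? 0 : 1);
}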