2362N/A * Copyright (c) 1999, 2008, Oracle and/or its affiliates. All rights reserved. 0N/A * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER. 0N/A * This code is free software; you can redistribute it and/or modify it 0N/A * under the terms of the GNU General Public License version 2 only, as 2362N/A * published by the Free Software Foundation. Oracle designates this 0N/A * particular file as subject to the "Classpath" exception as provided 2362N/A * by Oracle in the LICENSE file that accompanied this code. 0N/A * This code is distributed in the hope that it will be useful, but WITHOUT 0N/A * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or 0N/A * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License 0N/A * version 2 for more details (a copy is included in the LICENSE file that 0N/A * accompanied this code). 0N/A * You should have received a copy of the GNU General Public License version 0N/A * 2 along with this work; if not, write to the Free Software Foundation, 0N/A * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA. 2362N/A * Please contact Oracle, 500 Oracle Parkway, Redwood Shores, CA 94065 USA 2362N/A * or visit www.oracle.com if you need additional information or have any 0N/A * (C) Copyright Taligent, Inc. 1996, 1997 - All Rights Reserved 0N/A * (C) Copyright IBM Corp. 1996 - 2002 - All Rights Reserved 0N/A * The original version of this source code and documentation 0N/A * is copyrighted and owned by Taligent, Inc., a wholly-owned 0N/A * subsidiary of IBM. These materials are provided under terms 0N/A * of a License Agreement between Taligent and Sun. This technology 0N/A * is protected by multiple US and International patents. 0N/A * This notice and attribution to Taligent may not be removed. 0N/A * Taligent is a registered trademark of Taligent, Inc. 0N/A * A subclass of RuleBasedBreakIterator that adds the ability to use a dictionary 0N/A * to further subdivide ranges of text beyond what is possible using just the 0N/A * state-table-based algorithm. This is necessary, for example, to handle 0N/A * word and line breaking in Thai, which doesn't use spaces between words. The 0N/A * state-table-based algorithm used by RuleBasedBreakIterator is used to divide 0N/A * up text as far as possible, and then contiguous ranges of letters are 0N/A * repeatedly compared against a list of known words (i.e., the dictionary) 0N/A * to divide them up into words. 0N/A * DictionaryBasedBreakIterator uses the same rule language as RuleBasedBreakIterator, 0N/A * but adds one more special substitution name: <dictionary>. This substitution 0N/A * name is used to identify characters in words in the dictionary. The idea is that 0N/A * if the iterator passes over a chunk of text that includes two or more characters 0N/A * in a row that are included in <dictionary>, it goes back through that range and 0N/A * derives additional break positions (if possible) using the dictionary. 0N/A * DictionaryBasedBreakIterator is also constructed with the filename of a dictionary 0N/A * file. It follows a prescribed search path to locate the dictionary (right now, 0N/A * and won't find it in JAR files, but this location is likely to change). The 0N/A * dictionary file is in a serialized binary format. We have a very primitive (and 0N/A * slow) BuildDictionaryFile utility for creating dictionary files, but aren't 0N/A * currently making it public. Contact us for help. 0N/A * a list of known words that is used to divide up contiguous ranges of letters, 0N/A * stored in a compressed, indexed, format that offers fast access 0N/A * a list of flags indicating which character categories are contained in 0N/A * the dictionary file (this is used to determine which ranges of characters 0N/A * to apply the dictionary to) 0N/A * a temporary hiding place for the number of dictionary characters in the 0N/A * last range passed over by next() 0N/A * when a range of characters is divided up using the dictionary, the break 0N/A * positions that are discovered are stored here, preventing us from having 0N/A * to use either the dictionary or the state table again until the iterator 0N/A * leaves this range of text 0N/A * if cachedBreakPositions is not null, this indicates which item in the 0N/A * cache the current iteration position refers to 0N/A * Constructs a DictionaryBasedBreakIterator. 0N/A * @param description Same as the description parameter on RuleBasedBreakIterator, 0N/A * except for the special meaning of "<dictionary>". This parameter is just 0N/A * passed through to RuleBasedBreakIterator's constructor. 0N/A * @param dictionaryFilename The filename of the dictionary file to use 0N/A * Sets the current iteration position to the beginning of the text. 0N/A * (i.e., the CharacterIterator's starting offset). 0N/A * @return The offset of the beginning of the text. 0N/A * Sets the current iteration position to the end of the text. 0N/A * (i.e., the CharacterIterator's ending offset). 0N/A * @return The text's past-the-end offset. 0N/A * Advances the iterator one step backwards. 0N/A * @return The position of the last boundary position before the 0N/A * current iteration position 0N/A // if we have cached break positions and we're still in the range 0N/A // covered by them, just move one step backward in the cache 0N/A // otherwise, dump the cache and use the inherited previous() method to move 0N/A // backward. This may fill up the cache with new break positions, in which 0N/A // case we have to mark our position in the cache 0N/A * Sets the current iteration position to the last boundary position 0N/A * before the specified position. 0N/A * @param offset The position to begin searching from 0N/A * @return The position of the last boundary before "offset" 0N/A // if we have no cached break positions, or "offset" is outside the 0N/A // range covered by the cache, we can just call the inherited routine 0N/A // (which will eventually call other routines in this class that may 0N/A // refresh the cache) 0N/A // on the other hand, if "offset" is within the range covered by the cache, 0N/A // then all we have to do is search the cache for the last break position 0N/A * Sets the current iteration position to the first boundary position after 0N/A * the specified position. 0N/A * @param offset The position to begin searching forward from 0N/A * @return The position of the first boundary after "offset" 0N/A // if we have no cached break positions, or if "offset" is outside the 0N/A // range covered by the cache, then dump the cache and call our 0N/A // inherited following() method. This will call other methods in this 0N/A // class that may refresh the cache. 0N/A // on the other hand, if "offset" is within the range covered by the 0N/A // cache, then just search the cache for the first break position 0N/A * This is the implementation function for next(). 0N/A // if there are no cached break positions, or if we've just moved 0N/A // off the end of the range covered by the cache, we have to dump 0N/A // and possibly regenerate the cache 0N/A // start by using the inherited handleNext() to find a tentative return 0N/A // value. dictionaryCharCount tells us how many dictionary characters 0N/A // we passed over on our way to the tentative return value 0N/A // if we passed over more than one dictionary character, then we use 0N/A // divideUpDictionaryRange() to regenerate the cached break positions 0N/A // for the new range 0N/A // otherwise, the value we got back from the inherited fuction 0N/A // is our return value, and we can dump the cache 0N/A // if the cache of break positions has been regenerated (or existed all 0N/A // along), then just advance to the next break position in the cache 0N/A return -
9999;
// SHOULD NEVER GET HERE! 0N/A * Looks up a character category for a character. 0N/A // this override of lookupCategory() exists only to keep track of whether we've 0N/A // passed over any dictionary characters. It calls the inherited lookupCategory() 0N/A // to do the real work, and then checks whether its return value is one of the 0N/A // categories represented in the dictionary. If it is, bump the dictionary- 0N/A * This is the function that actually implements the dictionary-based 0N/A * algorithm. Given the endpoints of a range of text, it uses the 0N/A * dictionary to determine the positions of any boundaries in this 0N/A * range. It stores all the boundary positions it discovers in 0N/A * cachedBreakPositions so that we only have to do this work once 0N/A * for each time we enter the range. 0N/A // the range we're dividing may begin or end with non-dictionary characters 0N/A // (i.e., for line breaking, we may have leading or trailing punctuation 0N/A // that needs to be kept with the word). Seek from the beginning of the 0N/A // range to the first dictionary character 0N/A // initialize. We maintain two stacks: currentBreakPositions contains 0N/A // the list of break positions that will be returned if we successfully 0N/A // finish traversing the whole range now. possibleBreakPositions lists 0N/A // all other possible word ends we've passed along the way. (Whenever 0N/A // we reach an error [a sequence of characters that can't begin any word 0N/A // in the dictionary], we back up, possibly delete some breaks from 0N/A // currentBreakPositions, move a break from possibleBreakPositions 0N/A // to currentBreakPositions, and start over from there. This process 0N/A // continues in this way until we either successfully make it all the way 0N/A // across the range, or exhaust all of our combinations of break 0N/A // the dictionary is implemented as a trie, which is treated as a state 0N/A // machine. -1 represents the end of a legal word. Every word in the 0N/A // dictionary is represented by a path from the root node to -1. A path 0N/A // that ends in state 0 is an illegal combination of characters. 0N/A // these two variables are used for error handling. We keep track of the 0N/A // farthest we've gotten through the range being divided, and the combination 0N/A // of breaks that got us that far. If we use up all possible break 0N/A // combinations, the text contains an error or a word that's not in the 0N/A // dictionary. In this case, we "bless" the break positions that got us the 0N/A // farthest as real break positions, and then start over from scratch with 0N/A // the character where the error occurred. 0N/A // initialize (we always exit the loop with a break statement) 0N/A // if we can transition to state "-1" from our current state, we're 0N/A // on the last character of a legal word. Push that position onto 0N/A // the possible-break-positions stack 0N/A // look up the new state to transition to in the dictionary 0N/A // if the character we're sitting on causes us to transition to 0N/A // the "end of word" state, then it was a non-dictionary character 0N/A // and we've successfully traversed the whole range. Drop out 0N/A // if the character we're sitting on causes us to transition to 0N/A // the error state, or if we've gone off the end of the range 0N/A // without transitioning to the "end of word" state, we've hit 0N/A // if this is the farthest we've gotten, take note of it in 0N/A // case there's an error in the text 0N/A // wrongBreakPositions is a list of all break positions 0N/A // we've tried starting that didn't allow us to traverse 0N/A // all the way through the text. Every time we pop a 0N/A //break position off of currentBreakPositions, we put it 0N/A // into wrongBreakPositions to avoid trying it again later. 0N/A // If we make it to this spot, we're either going to back 0N/A // up to a break in possibleBreakPositions and try starting 0N/A // over from there, or we've exhausted all possible break 0N/A // positions and are going to do the fallback procedure. 0N/A // This loop prevents us from messing with anything in 0N/A // possibleBreakPositions that didn't work as a starting 0N/A // point the last time we tried it (this is to prevent a bunch of 0N/A // repetitive checks from slowing down some extreme cases) 0N/A // if we've used up all possible break-position combinations, there's 0N/A // an error or an unknown word in the text. In this case, we start 0N/A // over, treating the farthest character we've reached as the beginning 0N/A // of the range, and "blessing" the break positions that got us that 0N/A // far as real break positions 0N/A // if we still have more break positions we can try, then promote the 0N/A // last break in possibleBreakPositions into currentBreakPositions, 0N/A // and get rid of all entries in currentBreakPositions that come after 0N/A // it. Then back up to that position and start over from there (i.e., 0N/A // treat that position as the beginning of a new word) 0N/A // re-sync "c" for the next go-round, and drop out of the loop if 0N/A // we've made it off the end of the range 0N/A // if we didn't hit any exceptional conditions on this last iteration, 0N/A // just advance to the next character and loop 0N/A // dump the last break position in the list, and replace it with the actual 0N/A // end of the range (which may be the same character, or may be further on 0N/A // because the range actually ended with non-dictionary characters we want to 0N/A // keep with the word) 0N/A // create a regular array to hold the break positions and copy 0N/A // the break positions from the stack to the array (in addition, 0N/A // our starting position goes into this array as a break position). 0N/A // This array becomes the cache of break positions used by next() 0N/A // and previous(), so this is where we actually refresh the cache.