 * Copyright (c) 2005, 2009, Oracle and/or its affiliates. All rights reserved.
 * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
 *
 * This code is free software; you can redistribute it and/or modify it
 * under the terms of the GNU General Public License version 2 only, as
 * published by the Free Software Foundation. Oracle designates this
 * particular file as subject to the "Classpath" exception as provided
 * by Oracle in the LICENSE file that accompanied this code.
 *
 * This code is distributed in the hope that it will be useful, but WITHOUT
 * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
 * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
 * version 2 for more details (a copy is included in the LICENSE file that
 * accompanied this code).
 *
 * You should have received a copy of the GNU General Public License version
 * 2 along with this work; if not, write to the Free Software Foundation,
 * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
 *
 * Please contact Oracle, 500 Oracle Parkway, Redwood Shores, CA 94065 USA
 * or visit www.oracle.com if you need additional information or have any

 *******************************************************************************
 * (C) Copyright IBM Corp. and others, 1996-2009 - All Rights Reserved *
 * The original version of this source code and documentation is copyrighted *
 * and owned by IBM, These materials are provided under terms of a License *
 * Agreement between IBM and Sun. This technology is protected by multiple *
 * US and International patents. This notice and attribution to IBM may not *
 *******************************************************************************

 * A mutable set of Unicode characters and multicharacter strings.
 * Objects of this class
 * represent <em>character classes</em> used in regular expressions.
 * A character class specifies a subset of Unicode code points. Legal
 * code points are U+0000 to U+10FFFF, inclusive.
 *
 * <p>The UnicodeSet class is not designed to be subclassed.
 *
 * <p><code>UnicodeSet</code> supports two APIs. The first is the
 * <em>operand</em> API that allows the caller to modify the value of
 * a <code>UnicodeSet</code> object. It conforms to Java 2's
 * <code>java.util.Set</code> interface, although
 * <code>UnicodeSet</code> does not actually implement that
 * interface. All methods of <code>Set</code> are supported, with the
 * modification that they take a character range or single character
 * instead of an <code>Object</code>, and they take a
 * <code>UnicodeSet</code> instead of a <code>Collection</code>. The
 * operand API may be thought of in terms of boolean logic: a boolean
 * OR is implemented by <code>add</code>, a boolean AND is implemented
 * by <code>retain</code>, a boolean XOR is implemented by
 * <code>complement</code> taking an argument, and a boolean NOT is
 * implemented by <code>complement</code> with no argument. In terms
 * of traditional set theory function names, <code>add</code> is a
 * union, <code>retain</code> is an intersection, <code>remove</code>
 * is an asymmetric difference, and <code>complement</code> with no
 * argument is a set complement with respect to the superset range
 * <code>MIN_VALUE-MAX_VALUE</code>.
 *
 * <p>The second API is the
 * <code>applyPattern()</code>/<code>toPattern()</code> API from the
 * <code>java.text.Format</code>-derived classes.
 * Unlike the
 * methods that add characters, add categories, and control the logic
 * of the set, the method <code>applyPattern()</code> sets all
 * attributes of a <code>UnicodeSet</code> at once, based on a
 * string pattern.
 *
 * <p><b>Pattern syntax</b></p>
 *
 * Patterns are accepted by the constructors and the
 * <code>applyPattern()</code> methods and returned by the
 * <code>toPattern()</code> method. These patterns follow a syntax
 * similar to that employed by version 8 regular expression character
 * classes. Here are some simple examples:
 *
 * <table>
 * <tr align="top">
 * <td nowrap valign="top" align="left"><code>[]</code></td>
 * <td valign="top">No characters</td>
 * </tr><tr align="top">
 * <td nowrap valign="top" align="left"><code>[a]</code></td>
 * <td valign="top">The character 'a'</td>
 * </tr><tr align="top">
 * <td nowrap valign="top" align="left"><code>[ae]</code></td>
 * <td valign="top">The characters 'a' and 'e'</td>
 * </tr><tr align="top">
 * <td nowrap valign="top" align="left"><code>[a-e]</code></td>
 * <td valign="top">The characters 'a' through 'e' inclusive, in Unicode code
 * point order</td>
 * </tr><tr align="top">
 * <td nowrap valign="top" align="left"><code>[\\u4E01]</code></td>
 * <td valign="top">The character U+4E01</td>
 * </tr><tr align="top">
 * <td nowrap valign="top" align="left"><code>[a{ab}{ac}]</code></td>
 * <td valign="top">The character 'a' and the multicharacter strings "ab" and
 * "ac"</td>
 * </tr><tr align="top">
 * <td nowrap valign="top" align="left"><code>[\p{Lu}]</code></td>
 * <td valign="top">All characters in the general category Uppercase Letter</td>
 * </tr>
 * </table>
 *
 * Any character may be preceded by a backslash in order to remove any special
 * meaning. White space characters, as defined by UCharacterProperty.isRuleWhiteSpace(), are
 * ignored, unless they are escaped.
 *
 * <p>Property patterns specify a set of characters having a certain
 * property as defined by the Unicode standard.
 * Both the POSIX-like
 * "[:Lu:]" and the Perl-like syntax "\p{Lu}" are recognized. For a
 * complete list of supported property patterns, see the User's Guide.
 * Actual determination of property data is defined by the underlying
 * Unicode database as implemented by UCharacter.
 *
 * <p>Patterns specify individual characters, ranges of characters, and
 * Unicode property sets. When elements are concatenated, they
 * specify their union. To complement a set, place a '^' immediately
 * after the opening '['. Property patterns are inverted by modifying
 * their delimiters: "[:^foo:]" and "\P{foo}". In any other location,
 * '^' has no special meaning.
 *
 * <p>Ranges are indicated by placing a '-' between two
 * characters, as in "a-z". This specifies the range of all
 * characters from the left to the right, in Unicode order. If the
 * left character is greater than or equal to the
 * right character it is a syntax error. If a '-' occurs as the first
 * character after the opening '[' or '[^', or if it occurs as the
 * last character before the closing ']', then it is taken as a
 * literal. Thus "[a\\-b]", "[-ab]", and "[ab-]" all indicate the same
 * set of three characters, 'a', 'b', and '-'.
 *
 * <p>Sets may be intersected using the '&' operator, and the asymmetric
 * set difference may be taken using the '-' operator. For example,
 * "[[:L:]&[\\u0000-\\u0FFF]]" indicates the set of all Unicode letters
 * with values less than 4096. Operators ('&' and '-') have equal
 * precedence and bind left-to-right. Thus
 * "[[:L:]-[a-z]-[\\u0100-\\u01FF]]" is equivalent to
 * "[[[:L:]-[a-z]]-[\\u0100-\\u01FF]]". This only really matters for
 * difference; intersection is both commutative and associative.
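 * A minimal sketch, not part of this class, of why left-to-right binding
 * matters for '-'. It models sets with java.util.BitSet, where and() plays
 * the role of '&' and andNot() the role of '-'. The class and method names
 * (OperatorBindingSketch, range, minus) are hypothetical and exist only for
 * this illustration.

```java
import java.util.BitSet;

// Hypothetical illustration only; UnicodeSet itself does not use BitSet.
final class OperatorBindingSketch {

    // Builds a set holding every code point in [start, end], inclusive.
    static BitSet range(int start, int end) {
        BitSet s = new BitSet();
        s.set(start, end + 1);
        return s;
    }

    // Asymmetric difference a - b; neither operand is modified.
    static BitSet minus(BitSet a, BitSet b) {
        BitSet r = (BitSet) a.clone();
        r.andNot(b);
        return r;
    }

    public static void main(String[] args) {
        BitSet a = range('a', 'z');
        BitSet b = range('a', 'm');
        BitSet c = range('a', 'c');

        // Left-to-right grouping, as in the pattern "[A-B-C]": (A - B) - C
        BitSet leftToRight = minus(minus(a, b), c);

        // The other grouping, A - (B - C), puts 'a'..'c' back into the result.
        BitSet rightToLeft = minus(a, minus(b, c));

        System.out.println(leftToRight.equals(rightToLeft));
    }
}
```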
 * <table>
 * <tr valign=top><td nowrap><code>[a]</code><td>The set containing 'a'
 * <tr valign=top><td nowrap><code>[a-z]</code><td>The set containing 'a'
 * through 'z' and all letters in between, in Unicode order
 * <tr valign=top><td nowrap><code>[^a-z]</code><td>The set containing
 * all characters but 'a' through 'z',
 * that is, U+0000 through 'a'-1 and 'z'+1 through U+10FFFF
 * <tr valign=top><td nowrap><code>[[<em>pat1</em>][<em>pat2</em>]]</code>
 * <td>The union of sets specified by <em>pat1</em> and <em>pat2</em>
 * <tr valign=top><td nowrap><code>[[<em>pat1</em>]&[<em>pat2</em>]]</code>
 * <td>The intersection of sets specified by <em>pat1</em> and <em>pat2</em>
 * <tr valign=top><td nowrap><code>[[<em>pat1</em>]-[<em>pat2</em>]]</code>
 * <td>The asymmetric difference of sets specified by <em>pat1</em> and
 * <em>pat2</em>
 * <tr valign=top><td nowrap><code>[:Lu:] or \p{Lu}</code>
 * <td>The set of characters having the specified
 * Unicode property; in
 * this case, Unicode uppercase letters
 * <tr valign=top><td nowrap><code>[:^Lu:] or \P{Lu}</code>
 * <td>The set of characters <em>not</em> having the given
 * Unicode property
 * </table>
 *
 * <p><b>Warning</b>: you cannot add an empty string ("") to a UnicodeSet.</p>
 *
 * <p><b>Formal syntax</b></p>
 *
 * <table>
 * <tr align="top">
 * <td nowrap valign="top" align="right"><code>pattern := </code></td>
 * <td valign="top"><code>('[' '^'? item* ']') |
 * property</code></td>
 * </tr>
 * <tr align="top">
 * <td nowrap valign="top" align="right"><code>item := </code></td>
 * <td valign="top"><code>char | (char '-' char) | pattern-expr<br>
 * </code></td>
 * </tr>
 * <tr align="top">
 * <td nowrap valign="top" align="right"><code>pattern-expr := </code></td>
 * <td valign="top"><code>pattern | pattern-expr pattern |
 * pattern-expr op pattern<br>
 * </code></td>
 * </tr>
 * <tr align="top">
 * <td nowrap valign="top" align="right"><code>op := </code></td>
 * <td valign="top"><code>'&' | '-'<br>
 * </code></td>
 * </tr>
 * <tr align="top">
 * <td nowrap valign="top" align="right"><code>special := </code></td>
 * <td valign="top"><code>'[' | ']' | '-'<br>
 * </code></td>
 * </tr>
 * <tr align="top">
 * <td nowrap valign="top" align="right"><code>char := </code></td>
 * <td valign="top"><em>any character that is not</em><code> special<br>
 * | ('\\' </code><em>any character</em><code>)<br>
 * | ('\u' hex hex hex hex)<br>
 * </code></td>
 * </tr>
 * <tr align="top">
 * <td nowrap valign="top" align="right"><code>hex := </code></td>
 * <td valign="top"><em>any character for which
 * </em><code>Character.digit(c, 16)</code><em>
 * returns a non-negative result</em></td>
 * </tr>
 * <tr align="top">
 * <td nowrap valign="top" align="right"><code>property := </code></td>
 * <td valign="top"><em>a Unicode property set pattern</em></td>
 * </tr>
 * </table>
 *
 * <table border="1">
 * <tr>
 * <td>Legend: <table>
 * <tr>
 * <td nowrap valign="top"><code>a := b</code></td>
 * <td width="20" valign="top"> </td>
 * <td valign="top"><code>a</code> may be replaced by <code>b</code> </td>
 * </tr>
 * <tr>
 * <td nowrap valign="top"><code>a?</code></td>
 * <td valign="top"></td>
 * <td valign="top">zero or one instance of <code>a</code><br>
 * </td>
 * </tr>
 * <tr>
 * <td nowrap valign="top"><code>a*</code></td>
 * <td valign="top"></td>
 * <td valign="top">zero or more instances of <code>a</code><br>
 * </td>
 * </tr>
 * <tr>
 * <td nowrap valign="top"><code>a | b</code></td>
 * <td valign="top"></td>
 * <td valign="top">either <code>a</code> or <code>b</code><br>
 * </td>
 * </tr>
 * <tr>
 * <td nowrap valign="top"><code>'a'</code></td>
 * <td valign="top"></td>
 * <td valign="top">the literal string between the quotes </td>
 * </tr>
 * </table>
 * </td>
 * </tr>
 * </table>
 *
 * <p>To iterate over the contents of a UnicodeSet, use the UnicodeSetIterator class.

private static final int LOW =
0x000000; // LOW <= all valid values. ZERO for codepoints
private static final int HIGH = 0x110000; // HIGH > all valid values. 10000 for code units.
                                          // 110000 for codepoints

 * Minimum value that can be stored in a UnicodeSet.

 * Maximum value that can be stored in a UnicodeSet.

private int len;    // length used; list may be longer to minimize reallocs
private int[] list; // MUST be terminated with HIGH

// NOTE: normally the field should be of type SortedSet; but that is missing a public clone!!
// is not private so that UnicodeSetIterator can get access

 * The pattern representation of this set. This may not be the
 * most economical pattern. It is the pattern supplied to
 * applyPattern(), with variables substituted and whitespace
 * removed. For sets constructed without applyPattern(), or
 * modified using the non-pattern API, this string will be null,
 * indicating that toPattern() must generate a pattern
 * representation from the inversion list.

private static final int START_EXTRA = 16;
// initial storage. Must be >= 0

 * A set of all characters _except_ the second through last characters of
 * certain ranges. These ranges are ranges of characters whose
 * properties are all exactly alike, e.g. CJK Ideographs from
 * U+4E00 to U+9FA5.

//----------------------------------------------------------------
//----------------------------------------------------------------

 * Constructs an empty set.

 * Constructs a set containing the given range. If <code>start >
 * end</code> then an empty set is created.
 * @param start first character, inclusive, of range
 * @param end last character, inclusive, of range

 * Constructs a set from the given pattern. See the class description
 * for the syntax of the pattern language. Whitespace is ignored.
 * @param pattern a string specifying what characters are in the set
 * @exception java.lang.IllegalArgumentException if the pattern contains
 * a syntax error.

 * Make this object represent the same set as <code>other</code>.
 * @param other a <code>UnicodeSet</code> whose value will be
 * copied to this object

 * Modifies this set to represent the set specified by the given pattern.
 * See the class description for the syntax of the pattern language.
 * Whitespace is ignored.
 * @param pattern a string specifying what characters are in the set
 * @exception java.lang.IllegalArgumentException if the pattern
 * contains a syntax error.

 * Append the <code>toPattern()</code> representation of a
 * string to the given <code>StringBuffer</code>.

 * Append the <code>toPattern()</code> representation of a
 * character to the given <code>StringBuffer</code>.

// Use hex escape notation (<backslash>uxxxx or <backslash>Uxxxxxxxx) for anything
// unprintable
// Okay to let ':' pass through
case '[':  // SET_OPEN:
case ']':  // SET_CLOSE:
case '-':  // HYPHEN:
case '^':  // COMPLEMENT:
case '&':  // INTERSECTION:
case '\\': // BACKSLASH:
// Escape whitespace

 * Append a string representation of this set to result. This will be
 * a cleaned version of the string passed to applyPattern(), if there
 * is one. Otherwise it will be generated.

// If the unprintable character is preceded by an odd
// number of backslashes, then it has been escaped.
// Before unescaping it, we delete the final
// backslash.

 * Generate and append a string representation of this set to result.
 * This does not use this.pat, the cleaned up copy of the string
 * passed to applyPattern().
 * @param includeStrings if false, doesn't include the strings.

// If the set contains at least 2 intervals and includes both
// MIN_VALUE and MAX_VALUE, then the inverse representation will
// be more economical.
// Default; emit the ranges as pairs

// for internal use, after checkFrozen has been called

 * Adds the specified character to this set if it is not already
 * present. If this set already contains the specified character,
 * the call leaves this set unchanged.

// for internal use only, after checkFrozen has been called
// find smallest i such that c < list[i]
// if odd, then it is IN the set
// if even, then it is OUT of the set
if ((i & 1) != 0) return this;
// assert(list[len-1] == HIGH);
// [start_0, limit_0, start_1, limit_1, HIGH]
// [..., start_k-1, limit_k-1, start_k, limit_k, ..., HIGH]
// i == 0 means c is before the first range
// c is before start of next range
// if we touched the HIGH mark, then add a new one
// collapse adjacent ranges
// [..., start_k-1, c, c, limit_k, ..., HIGH]
else if (i > 0 && c == list[i-1]) {
// c is after end of prior range
// no need to check for collapse here
// At this point we know the new char is not adjacent to
// any existing ranges, and it is not 10FFFF.
// [..., start_k-1, limit_k-1, start_k, limit_k, ..., HIGH]
// [..., start_k-1, limit_k-1, c, c+1, start_k, limit_k, ..., HIGH]
// Don't use ensureCapacity() to save on copying.
// NOTE: This has no measurable impact on performance,
// but it might help in some usage patterns.

 * Adds the specified multicharacter string to this set if it is not already
 * present. If this set already contains the multicharacter string,
 * the call leaves this set unchanged.
 * Thus "ch" => {"ch"}
 * <br><b>Warning: you cannot add an empty string ("") to a UnicodeSet.</b>
 * @param s the source string
 * @return this object, for chaining

 * @return a code point IF the string consists of a single one.
 * Otherwise returns -1.
 * @param string to test

// at this point, len = 2
if (cp > 0xFFFF) { // is surrogate pair

 * Complements the specified range in this set. Any character in
 * the range will be removed if it is in this set, or will be
 * added if it is not in this set. If <code>start > end</code>
 * then an empty range is complemented, leaving the set unchanged.
 * @param start first character, inclusive, of range to be complemented
 * @param end last character, inclusive, of range to be complemented

 * This is equivalent to
 * <code>complement(MIN_VALUE, MAX_VALUE)</code>.

 * Returns true if this set contains the given character.
 * @param c character to be checked for containment
 * @return true if the test condition is met

// Set i to the index of the start item greater than ch
// We know we will terminate without length test!
if (c < list[++i]) break;
return ((i & 1) != 0); // return true if odd

 * Returns the smallest value i such that c < list[i]. Caller
 * must ensure that c is a legal value or this method will enter
 * an infinite loop. This method performs a binary search.
 * @param c a character in the range MIN_VALUE..MAX_VALUE
 * @return the smallest integer i in the range 0..len-1,
 * inclusive, such that c < list[i]

// set              list[]          c=0 1 3 4 7 8
// ===              ==============  =============
// []               [110000]          0 0 0 0 0 0
// [\u0000-\u0003]  [0, 4, 110000]    1 1 1 2 2 2
// [\u0004-\u0007]  [4, 8, 110000]    0 0 0 1 1 2
// [:all:]          [0, 110000]       1 1 1 1 1 1

// Return the smallest i such that c < list[i]. Assume
// list[len - 1] == HIGH and that c is legal (0..HIGH-1).
// High runner test. c is often after the last range, so an
// initial check for this condition pays off.
// invariant: c >= list[lo]
// invariant: c < list[hi]

 * Adds all of the elements in the specified set to this set if
 * they're not already present. This operation effectively
 * modifies this set so that its value is the <i>union</i> of the two
 * sets. The behavior of this operation is unspecified if the specified
 * collection is modified while the operation is in progress.
 * @param c set whose elements are to be added to this set.

 * Retains only the elements in this set that are contained in the
 * specified set. In other words, removes from this set all of
 * its elements that are not contained in the specified set. This
 * operation effectively modifies this set so that its value is
 * the <i>intersection</i> of the two sets.
 * @param c set that defines which elements this set will retain.

 * Removes from this set all of its elements that are contained in the
 * specified set.
 * This operation effectively modifies this
 * set so that its value is the <i>asymmetric set difference</i> of
 * the two sets.
 * @param c set that defines which elements will be removed from
 * this set.

 * Removes all of the elements from this set. This set will be
 * empty after this call returns.

 * Iteration method that returns the number of ranges contained in
 * this set.
 * @see #getRangeStart

 * Iteration method that returns the first character in the
 * specified range of this set.
 * @exception ArrayIndexOutOfBoundsException if index is outside
 * the range <code>0..getRangeCount()-1</code>
 * @see #getRangeCount

 * Iteration method that returns the last character in the
 * specified range of this set.
 * @exception ArrayIndexOutOfBoundsException if index is outside
 * the range <code>0..getRangeCount()-1</code>
 * @see #getRangeStart

//----------------------------------------------------------------
// Implementation: Pattern parsing
//----------------------------------------------------------------

 * Parses the given pattern, starting at the given position. The character
 * at pattern.charAt(pos.getIndex()) must be '[', or the parse fails.
 * Parsing continues until the corresponding closing ']'. If a syntax error
 * is encountered between the opening and closing bracket, the parse fails.
 * Upon return from a successful parse, the ParsePosition is updated to
 * point to the character following the closing ']', and an inversion
 * list for the parsed pattern is returned. This method
 * calls itself recursively to parse embedded subpatterns.
 * @param pattern the string containing the pattern to be parsed. The
 * portion of the string from pos.getIndex(), which must be a '[', to the
 * corresponding closing ']', is parsed.
 * @param pos upon entry, the position at which to begin parsing. The
 * character at pattern.charAt(pos.getIndex()) must be a '['. Upon return
 * from a successful parse, pos.getIndex() is either the index of the
 * character after the closing ']' of the parsed pattern, or
 * pattern.length() if the closing ']' is the last character of the
 * pattern string.
 * @return an inversion list for the parsed substring
 * of <code>pattern</code>
 * @exception java.lang.IllegalArgumentException if the parse fails.

// Need to build the pattern in a temporary string because
// _applyPattern calls add() etc., which set pat to empty.
// Skip over trailing whitespace
"\" failed at " + i);
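// The ParsePosition contract described above (start on '[', finish just past
// the matching ']') can be sketched with a much simpler scanner. This is an
// illustration under stated assumptions, not the parser used by this class:
// it only balances brackets and skips backslash escapes, and it ignores
// '{...}' strings, variables and property semantics. BracketScanSketch and
// scanSet are hypothetical names.

```java
import java.text.ParsePosition;

// Hypothetical helper; it locates the matching ']' and nothing more.
final class BracketScanSketch {

    // Scans from pos.getIndex(), which must be '[', to the matching ']'.
    // On success returns the bracketed substring and advances pos to the
    // index just past the closing ']'. On failure pos is left unchanged.
    static String scanSet(String pattern, ParsePosition pos) {
        int start = pos.getIndex();
        if (start >= pattern.length() || pattern.charAt(start) != '[') {
            throw new IllegalArgumentException("Expected '[' at " + start);
        }
        int depth = 0;
        for (int i = start; i < pattern.length(); i++) {
            char ch = pattern.charAt(i);
            if (ch == '\\') {
                i++; // the escaped character has no special meaning
            } else if (ch == '[') {
                depth++;
            } else if (ch == ']' && --depth == 0) {
                pos.setIndex(i + 1);
                return pattern.substring(start, i + 1);
            }
        }
        throw new IllegalArgumentException("Missing ']' in \"" + pattern + "\"");
    }
}
```

// When the closing ']' is the last character of the string, the updated
// index equals pattern.length(), matching the description of pos above.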
 * Parse the pattern from the given RuleCharacterIterator. The
 * iterator is advanced over the parsed pattern.
 * @param chars iterator over the pattern characters. Upon return
 * it will be advanced to the first character after the parsed
 * pattern, or the end of the iteration if all characters are
 * parsed.
 * @param symbols symbol table to use to parse and dereference
 * variables, or null if none.
 * @param rebuiltPat the pattern that was parsed, rebuilt or
 * copied from the input pattern, as appropriate.
 * @param options a bit mask of zero or more of the following:
 * IGNORE_SPACE, CASE.

// Syntax characters: [ ] ^ - & { }
// Recognized special forms for chars, sets: c-c s-s s&s
// mode: 0=before [, 1=between [...], 2=after ]
// lastItem: 0=none, 1=char, 2=set
// Debugging assertion

// -------- Check for property pattern
// setMode: 0=none, 1=unicodeset, 2=propertypat, 3=preparsed

// -------- Parse '[' of opening delimiter OR nested set.
// If there is a nested set, use `setMode' to define how
// the set should be parsed. If the '[' is part of the
// opening delimiter for this pattern, parse special
// strings "[", "[^", "[-", and "[^-". Check for stand-in
// characters representing a nested set in the symbol
// table.
// Prepare to backup if necessary
// Handle opening '[' delimiter
// Fall through to handle special leading '-';
// otherwise restart loop for nested [], \p{}, etc.
// Fall through to handle literal '-' below

// -------- Handle a nested set. This either is inline in
// the pattern or represented by a stand-in that has
// previously been parsed and was looked up in the symbol
// table.
case 3: // `nested' already parsed
// Entire pattern is a category; leave parse loop

// -------- Parse special (syntax) characters. If the
// current character is not special, or if it is escaped,
// then fall through and handle it below.
// Treat final trailing '-' as a literal
// Treat final trailing '-' as a literal
// We have new string. Add it to set and continue;
// we don't need to drop through to the further

//         symbols   nosymbols
// [a-$]   error     error (ambiguous)
// [a$]    anchor    anchor
// [a-$x]  var "x"*  literal '$'
// [a-$.]  error     literal '$'
// *We won't get here in the case of var "x"
break; // literal '$'

// -------- Parse literal characters. This includes both
// escaped chars ("\u4E01") and non-syntax characters
// Don't allow redundant (a-a) or empty (b-a) ranges;
// these are most likely typos.
// Use the rebuilt pattern (pat) only if necessary. Prefer the
// generated pattern.

//----------------------------------------------------------------
// Implementation: Utility methods
//----------------------------------------------------------------

 * Assumes start <= end.

//----------------------------------------------------------------
// Implementation: Fundamental operations
//----------------------------------------------------------------

// polarity = 0, 3 is normal: x xor y
// polarity = 1, 2: x xor ~y == x === y
int i = 0, j = 0, k = 0;
// simplest of all the routines
// sort the values, discarding identicals!
} else if (a != HIGH) { // at this point, a == b
    // discard both values!
// swap list and buffer

// polarity = 0 is normal: x union y
// polarity = 2: x union ~y
// polarity = 1: ~x union y
// polarity = 3: ~x union ~y
int i = 0, j = 0, k = 0;
// change from xor is that we have to check overlapping pairs
// polarity bit 1 means a is second, bit 2 means b is.
case 0: // both first; take lower if unequal
    if (a < b) { // take a
        // Back up over overlapping ranges in buffer[]
        // Pick latter end value in buffer[] vs. list[]
    } else if (b < a) { // take b
    } else { // a == b, take a, drop b
        // This is symmetrical; it doesn't matter if
        // we backtrack with a or b. - liu
case 3: // both second; take higher if unequal, and drop other
    if (b <= a) { // take a
case 1: // a second, b first; if b < a, overlap
    if (a < b) { // no overlap, take a
    } else if (b < a) { // OVERLAP, drop b
    } else { // a == b, drop both!
case 2: // a first, b second; if a < b, overlap
    if (b < a) { // no overlap, take b
    } else if (a < b) { // OVERLAP, drop a
    } else { // a == b, drop both!
// swap list and buffer

// polarity = 0 is normal: x intersect y
// polarity = 2: x intersect ~y == set-minus
// polarity = 1: ~x intersect y
// polarity = 3: ~x intersect ~y
int i = 0, j = 0, k = 0;
// change from xor is that we have to check overlapping pairs
// polarity bit 1 means a is second, bit 2 means b is.
case 0: // both first; drop the smaller
    if (a < b) { // drop a
    } else if (b < a) { // drop b
    } else { // a == b, take one, drop other
case 3: // both second; take lower if unequal
    if (a < b) { // take a
    } else if (b < a) { // take b
    } else { // a == b, take one, drop other
case 1: // a second, b first;
    if (a < b) { // NO OVERLAP, drop a
    } else if (b < a) { // OVERLAP, take b
    } else { // a == b, drop both!
case 2: // a first, b second; if a < b, overlap
    if (b < a) { // no overlap, drop b
    } else if (a < b) { // OVERLAP, take a
    } else { // a == b, drop both!
// swap list and buffer

private static final int max(int a, int b) {
    return (a > b) ? a : b;
}
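// The comments above describe an inversion list: a sorted array of range
// boundaries [start_0, limit_0, start_1, limit_1, ..., HIGH]. The following
// is a minimal standalone sketch of two operations on such lists, written
// from those comments rather than copied from this class. findCodePoint
// mirrors the documented contract (smallest i such that c < list[i]); xor
// follows the "sort the values, discarding identicals" description.
// InversionListSketch is a hypothetical name, and both methods assume each
// list is sorted and terminated with HIGH.

```java
import java.util.Arrays;

final class InversionListSketch {

    static final int HIGH = 0x110000; // greater than every valid code point

    // Returns the smallest i such that c < list[i], by binary search.
    // Assumes list ends with HIGH and 0 <= c < HIGH.
    static int findCodePoint(int[] list, int c) {
        if (c < list[0]) return 0;
        int lo = 0;
        int hi = list.length - 1;
        // invariant: c >= list[lo] and c < list[hi]
        while (hi - lo > 1) {
            int mid = (lo + hi) >>> 1;
            if (c < list[mid]) {
                hi = mid;
            } else {
                lo = mid;
            }
        }
        return hi;
    }

    // An odd index means c lies inside a [start, limit) pair.
    static boolean contains(int[] list, int c) {
        return (findCodePoint(list, c) & 1) != 0;
    }

    // Symmetric difference: merge the boundaries in sorted order and drop
    // any boundary that occurs in both lists.
    static int[] xor(int[] a, int[] b) {
        int[] buffer = new int[a.length + b.length];
        int i = 0, j = 0, k = 0;
        while (true) {
            int x = a[i];
            int y = b[j];
            if (x < y) {
                buffer[k++] = x;
                i++;
            } else if (y < x) {
                buffer[k++] = y;
                j++;
            } else if (x != HIGH) { // x == y: discard both values
                i++;
                j++;
            } else { // both lists reached the terminator
                buffer[k++] = HIGH;
                break;
            }
        }
        return Arrays.copyOf(buffer, k);
    }
}
```

// For example, xor of {'a'..'c'} = [97, 100, HIGH] and {'b'..'d'} =
// [98, 101, HIGH] yields [97, 98, 100, 101, HIGH], that is {'a', 'd'}.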
//----------------------------------------------------------------
// Generic filter-based scanning code
//----------------------------------------------------------------

// VersionInfo for unassigned characters
// Reference comparison ok; VersionInfo caches and reuses

 * Generic filter-based scanning code for UCD property UnicodeSets.

// Walk through all Unicode characters, noting the start
// and end of each range for which filter.contains(c) is
// true. Add each range to a set.

// To improve performance, use the INCLUSIONS set, which
// encodes information about character ranges that are known
// to have identical properties, such as the CJK Ideographs
// from U+4E00 to U+9FA5. INCLUSIONS contains the first
// character of each such range but omits the second through
// last characters.

// TODO Where possible, instead of scanning over code points,
// use internal property data to initialize UnicodeSets for
// those properties. Scanning code points is slow.

// get current range
// for all the code points in the range, process
// only add to the unicodeset on inflection points --
// where the hasProperty value changes to false

 * Remove leading and trailing rule white space and compress
 * internal rule white space to a single space character.
 * @see UCharacterProperty#isRuleWhiteSpace

ch = ' '; // convert to ' '

 * Modifies this set to contain those code points which have the
 * given value for the given property. Prior contents of this
 * set are lost.
 * @param propertyAlias
 * @param symbols if not null, then symbols are first called to see if a property
 * is available. If true, then everything else is skipped.

// VersionInfo.getInstance() does not do

 * Return true if the given iterator appears to point at a
 * property pattern. Regardless of the result, return with the
 * iterator unchanged.
 * @param chars iterator over the pattern characters. Upon return
 * it will be unchanged.
 * @param iterOpts RuleCharacterIterator options

if (c == '[' || c == '\\') {
(d == 'N' || d == 'p' || d == 'P');
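// The two-character lookahead above can be sketched on a plain String. This
// is an illustration only: it assumes a String and an index in place of the
// RuleCharacterIterator used by this class, and it checks nothing beyond the
// opening delimiter. PropertyPatternSketch is a hypothetical name.

```java
final class PropertyPatternSketch {

    // Returns true if the text at pos starts with "[:" or with a backslash
    // followed by 'p', 'P' or 'N'. The input is never modified.
    static boolean resemblesPropertyPattern(String pattern, int pos) {
        // Two characters are needed to recognize any opening delimiter.
        if (pos < 0 || pos + 1 >= pattern.length()) {
            return false;
        }
        char c = pattern.charAt(pos);
        char d = pattern.charAt(pos + 1);
        if (c == '[') {
            return d == ':';
        }
        if (c == '\\') {
            return d == 'N' || d == 'p' || d == 'P';
        }
        return false;
    }
}
```

// A true result only means the text resembles a property pattern; a full
// parse can still fail, for example when the close delimiter is missing.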
 * Parse the given property pattern at the given parse position.
 * @param symbols TODO

// On entry, ppos should point to one of the following locations:
// Minimum length is 5 characters, e.g. \p{L}
boolean posix = false;  // true for [:pat:], false for \p{pat} \P{pat} \N{pat}
boolean isName = false; // true for \N{pat}, o/w false

// Look for an opening [:, [:^, \p, or \P
// Syntax error; "\p" or "\P" not followed by "{"
// Open delimiter not seen

// Look for the matching close delimiter, either :] or }
// Syntax error; close delimiter missing

// Look for an '=' sign. If this is present, we will parse a
// medium \p{gc=Cf} or long \p{GeneralCategory=Format}
// Handle case where no '=' is seen, and \N{}
// This is a little inefficient since it means we have to
// parse "na" back to UProperty.NAME even though we already
// know it's UProperty.NAME. If we refactor the API to
// support args of (int, String) then we can remove
// "na" and make this a little more efficient.

// Move to the limit position after the close delimiter

 * Parse a property pattern.
 * @param chars iterator over the pattern characters. Upon return
 * it will be advanced to the first character after the parsed
 * pattern, or the end of the iteration if all characters are
 * parsed.
 * @param rebuiltPat the pattern that was parsed, rebuilt or
 * copied from the input pattern, as appropriate.
 * @param symbols TODO

//----------------------------------------------------------------
//----------------------------------------------------------------

 * Bitmask for constructor and applyPattern() indicating that
 * white space should be ignored. If set, ignore characters for
 * which UCharacterProperty.isRuleWhiteSpace() returns true,
 * unless they are quoted or escaped. This may be ORed together
 * with other selectors.