FloatingDecimal.java revision 926
/*
 * Copyright 1996-2004 Sun Microsystems, Inc.  All Rights Reserved.
 * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
 *
 * This code is free software; you can redistribute it and/or modify it
 * under the terms of the GNU General Public License version 2 only, as
 * published by the Free Software Foundation.  Sun designates this
 * particular file as subject to the "Classpath" exception as provided
 * by Sun in the LICENSE file that accompanied this code.
 *
 * This code is distributed in the hope that it will be useful, but WITHOUT
 * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
 * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
 * version 2 for more details (a copy is included in the LICENSE file that
 * accompanied this code).
 *
 * You should have received a copy of the GNU General Public License version
 * 2 along with this work; if not, write to the Free Software Foundation,
 * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
 *
 * Please contact Sun Microsystems, Inc., 4150 Network Circle, Santa Clara,
 * CA 95054 USA or visit www.sun.com if you need additional information or
 * have any questions.
 */

    /*
     * Constants of the implementation.
     * Most are IEEE-754 related.
     * (There are more really boring constants at the end.)
     */
    static final long   signMask = 0x8000000000000000L;
    static final long   expMask  = 0x7ff0000000000000L;
    static final long   highbyte = 0xff00000000000000L;
    static final long   highbit  = 0x8000000000000000L;
    /*
     * count number of bits from high-order 1 bit to low-order 1 bit,
     * inclusive.
     */
    // the strategy is to shift until we get a non-zero sign bit
    // then shift until we have no bits left, counting the difference.
    // we do byte shifting as a hack. Hope it helps.
    while ( v > 0L ) { // i.e. while ((v&highbit) == 0L )

    /*
     * Keep big powers of 5 handy for future reference.
     */
    assert p >= 0 : p; // negative power of 5
    // in order to compute 5^p,
    // compute its square root, 5^(p/2), and square.
    // or, let q = p / 2, r = p - q, then
    // 5^p = 5^(q+r) = 5^q * 5^r
    // another common operation

    /*
     * Make a floating double into a FDBigInt.
     * This could also be structured as a FDBigInt
     * constructor, but we'd have to build a lot of knowledge
     * about floating-point representation into it, and we don't want to.
     *
     * AS A SIDE EFFECT, THIS METHOD WILL SET THE INSTANCE VARIABLES
     * bigIntExp and bigIntNBits
     */
    assert lbits != 0L : lbits;
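The bit-counting loop sketched in the comments above (shift until the sign bit is non-zero, then shift until no bits remain, byte-at-a-time) can be expressed today with the `java.lang.Long` intrinsics. A minimal sketch; the method name `countBits` is mine, not the original's:

```java
public class NbitsSketch {
    /**
     * Count the number of bits from the high-order 1 bit to the
     * low-order 1 bit, inclusive -- the quantity the shifting loop
     * in the source computes byte-at-a-time.
     */
    static int countBits(long v) {
        if (v == 0L) return 0;
        return 64 - Long.numberOfLeadingZeros(v) - Long.numberOfTrailingZeros(v);
    }

    public static void main(String[] args) {
        System.out.println(countBits(0b1010L)); // bits 3..1 -> 3
        System.out.println(countBits(1L));      // single bit -> 1
    }
}
```

The original predates these intrinsics, which is why it resorts to the byte-shifting "hack"; on modern JVMs both intrinsics compile to single machine instructions.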
    // doubleToBigInt(0.0)

    /*
     * We now know where the high-order 1 bit is,
     * and we know how many there are.
     */

    /*
     * Compute a number that is the ULP of the given value,
     * for purposes of addition/subtraction. Generally easy.
     * More difficult if subtracting and the argument
     * is a normalized power of 2, as the ULP changes at these points.
     */
    // for subtraction from normalized, powers of 2,
    // use next-smaller exponent

    /*
     * Round a double to a float.
     * In addition to the fraction bits of the double,
     * look at the class instance variable roundDir,
     * which should help us avoid double-rounding error.
     * roundDir was set in hardValueOf if the estimate was
     * close enough, but not exact. It tells us which direction
     * of rounding is preferred.
     */
    // what we have here is special.
    // don't worry, the right thing will happen.

    /*
     * This is the easy subcase --
     * all the significant bits, after scaling, are held in lvalue.
     * negSign and decExponent tell us what processing and scaling
     * has already been done. Exceptional cases have already been
     * stripped out.
     * In particular:
     * lvalue is a finite number (not Inf, nor NaN)
     * lvalue > 0L (not zero, nor negative).
     *
     * The only reason that we develop the digits here, rather than
     * calling on Long.toString() is that we can do it a little faster,
     * and besides want to treat trailing 0s specially. If Long.toString
     * changes, we should re-evaluate this strategy!
     */
    // Discard non-significant low-order bits, while rounding,
    // up to insignificant value.
    // round up based on the low-order bits we're discarding
    // can do int arithmetic rather than long!
    // same algorithm as above (same bugs, too )
    // but using long arithmetic.
    // add one to the least significant digit.
    // in the unlikely event there is a carry out,
    // assert that this will only happen where there
    // is only one digit, e.g. (float)1e-44 seems to do it.
    while ( q == '9' && i > 0 ){
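The `'9'`-carry loop above increments the last decimal digit and propagates the carry through trailing nines; on carry-out of the top digit, the result collapses to a single `'1'` with a larger exponent. A self-contained sketch; the surrounding code is elided in the source, so the body here is a reconstruction under that reading, not the original method:

```java
public class RoundupSketch {
    /**
     * Increment the last of n decimal digits in place, propagating a
     * carry through trailing '9's. Returns true on carry out of the
     * top digit, in which case the result is the single digit '1'
     * and the caller must bump the decimal exponent.
     */
    static boolean roundup(char[] digits, int n) {
        int i = n - 1;
        char q = digits[i];
        if (q == '9') {
            while (q == '9' && i > 0) { // carry through a run of 9s
                digits[i] = '0';
                q = digits[--i];
            }
            if (q == '9') { // carryout! High-order 1, rest 0s, larger exp.
                digits[0] = '1';
                return true;
            }
        }
        digits[i] = (char) (q + 1);
        return false;
    }

    public static void main(String[] args) {
        char[] a = {'1', '2', '9'};
        System.out.println(roundup(a, 3) + " " + new String(a)); // false 130
        char[] b = {'9', '9'};
        System.out.println(roundup(b, 2) + " " + b[0]);          // true 1
    }
}
```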
    // carryout! High-order 1, rest 0s, larger exp.

    /*
     * FIRST IMPORTANT CONSTRUCTOR: DOUBLE
     */
    // discover and delete sign
    // Discover obvious special cases of NaN and Infinity.
    // Normalize denormalized numbers.
    // Insert assumed high-order bit for normalized numbers.
    // Subtract exponent bias.
    // not a denorm, just a 0!
    // call the routine that actually does all the hard work.

    /*
     * SECOND IMPORTANT CONSTRUCTOR: SINGLE
     */
    // discover and delete sign
    // Discover obvious special cases of NaN and Infinity.
    // Normalize denormalized numbers.
    // Insert assumed high-order bit for normalized numbers.
    // Subtract exponent bias.
    // not a denorm, just a 0!
    // call the routine that actually does all the hard work.

    int nFractBits; // number of significant bits of fractBits;
    int nTinyBits;  // number of these to the right of the point.

    // Examine number. Determine if it is an easy case,
    // which we can do pretty trivially using float/long conversion,
    // or whether we must do real work.
    // Look more closely at the number to decide if,
    // with scaling by 10^nTinyBits, the result will fit in
    // a long.

    /*
     * take the fraction bits, which are normalized.
     * (a) nTinyBits == 0: Shift left or right appropriately
     *     to align the binary point at the extreme right, i.e.
     *     where a long int point is expected to be. The integer
     *     result is easily converted to a string.
     * (b) nTinyBits > 0: Shift right by expShift-nFractBits,
     *     which effectively converts to long and scales by
     *     2^nTinyBits. Then multiply by 5^nTinyBits to
     *     complete the scaling. We know this won't overflow
     *     because we just counted the number of bits necessary
     *     in the result. The integer you get from this can
     *     then be converted to a string pretty easily.
     */

    /*
     * The following causes excess digits to be printed
     * out in the single-float case. Our manipulation of
     * halfULP here is apparently not correct. If we
     * better understand how this works, perhaps we can
     * use this special case again. But for the time being,
     *
     *     fractBits >>>= expShift+1-nFractBits;
     *     fractBits *= long5pow[ nTinyBits ];
     *     halfULP = long5pow[ nTinyBits ] >> (1+nSignificantBits-nFractBits);
     *     developLongDigits( -nTinyBits, fractBits, halfULP );
     */

    /*
     * This is the hard case. We are going to compute large positive
     * integers B and S and integer decExp, s.t.
     *     d = ( B / S ) * 10^decExp
     *     decExp = floor( log10(d) )
     *     B = d * 2^nTinyBits * 10^max( 0, -decExp )
     *     S = 10^max( 0, decExp) * 2^nTinyBits
     * (noting that nTinyBits has already been forced to non-negative)
     *
     * I am also going to compute a large positive integer
     *     M = (1/2^nSignificantBits) * 2^nTinyBits * 10^max( 0, -decExp )
     * i.e. M is (1/2) of the ULP of d, scaled like B.
     * When we iterate through dividing B/S and picking off the
     * quotient bits, we will know when to stop when the remainder
     * is <= M.
     *
     * We keep track of powers of 2 and powers of 5.
     */

    /*
     * Estimate decimal exponent. (If it is small-ish,
     * we could double-check.)
     *
     * First, scale the mantissa bits such that 1 <= d2 < 2.
     * We are then going to estimate
     *     log10(d2) ~=~ (d2-1.5)/1.5 + log(1.5)
     *     log10(d) ~=~ log10(d2) + binExp * log10(2)
     * take the floor and call it decExp.
     * FIXME -- use more precise constants here. It costs no more.
     */
    decExp = (int)Math.floor(
        (d2-1.5D)*0.289529654D + 0.176091259 + (double)binExp * 0.301029995663981 );
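The decimal-exponent estimate above can be checked numerically. This sketch uses `Math.getExponent`/`Math.scalb` as stand-ins for the original's bit manipulation to obtain `binExp` and the scaled mantissa `d2`; the helper name `estimateDecExp` is mine, and the sketch assumes a normal (not subnormal) positive `d`:

```java
public class DecExpSketch {
    /**
     * Estimate floor(log10(d)) for a finite, normal d > 0, using the
     * same linear approximation and constants as the source:
     *   log10(d2) ~=~ (d2-1.5)/1.5 + log10(1.5), with 1 <= d2 < 2.
     */
    static int estimateDecExp(double d) {
        int binExp = Math.getExponent(d);   // unbiased binary exponent
        double d2 = Math.scalb(d, -binExp); // scale so that 1 <= d2 < 2
        return (int) Math.floor(
            (d2 - 1.5D) * 0.289529654D + 0.176091259
            + (double) binExp * 0.301029995663981);
    }

    public static void main(String[] args) {
        System.out.println(estimateDecExp(1.0e5)); // 5
        System.out.println(estimateDecExp(1.0));   // 0
        System.out.println(estimateDecExp(0.1));   // -1
    }
}
```

The constant `0.301029995663981` is log10(2) and `0.289529654` is 1/(1.5 ln 10), so the estimate is exact or one too high for the values this code feeds it, which is why the digit loop later discards a possible leading zero quotient.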
    int B2, B5;   // powers of 2 and powers of 5, respectively, in B
    int S2, S5;   // powers of 2 and powers of 5, respectively, in S
    int M2, M5;   // powers of 2 and powers of 5, respectively, in M
    int Bbits;    // binary digits needed to represent B, approx.
    int tenSbits; // binary digits needed to represent 10*S, approx.

    /*
     * the long integer fractBits contains the (nFractBits) interesting
     * bits from the mantissa of d ( hidden 1 added if necessary) followed
     * by (expShift+1-nFractBits) zeros. In the interest of compactness,
     * I will shift out those zeros before turning fractBits into a
     * FDBigInt. The resulting whole number will be
     *     d * 2^(nFractBits-1-binExp).
     */

    /*
     * HACK!! For exact powers of two, the next smallest number
     * is only half as far away as we think (because the meaning of
     * ULP changes at power-of-two bounds) for this reason, we
     * hack M2. Hope this works.
     */
    // since we cannot scale M down far enough,
    // we must scale the other values up.

    /*
     * Construct, Scale, iterate.
     * Some day, we'll write a stopping test that takes
     * account of the asymmetry of the spacing of floating-point
     * numbers below perfect powers of 2
     * 26 Sept 96 is not that day.
     * So we use a symmetric test.
     */

    /*
     * Detect the special cases where all the numbers we are about
     * to compute will fit in int or long integers.
     * In these cases, we will avoid doing FDBigInt arithmetic.
     * We use the same algorithms, except that we "normalize"
     * our FDBigInts before iterating. This is to make division easier,
     * as it makes our first guess (quotient of high-order words)
     * more accurate.
     *
     * Some day, we'll write a stopping test that takes
     * account of the asymmetry of the spacing of floating-point
     * numbers below perfect powers of 2
     * 26 Sept 96 is not that day.
     * So we use a symmetric test.
     */
    // wa-hoo! They're all ints!

    /*
     * Unroll the first iteration. If our decExp estimate
     * was too high, our first quotient will be zero. In this
     * case, we discard it and decrement decExp.
     */
    assert q < 10 : q; // excessively large digit
    if ( (q == 0) && ! high ){
        // oops. Usually ignore leading zero.

    /*
     * HACK! Java spec sez that we always have at least
     * one digit after the . in either F- or E-form output.
     * Thus we will need more than one digit if we're using
     * E-form.
     */
    assert q < 10 : q; // excessively large digit
    // hack -- m might overflow!
    // in this case, it is certainly > b,
    // and b+m > tens, too, since that has overflowed

    // still good! they're all longs!

    /*
     * Unroll the first iteration. If our decExp estimate
     * was too high, our first quotient will be zero. In this
     * case, we discard it and decrement decExp.
     */
    assert q < 10 : q; // excessively large digit
    if ( (q == 0) && ! high ){
        // oops. Usually ignore leading zero.

    /*
     * HACK! Java spec sez that we always have at least
     * one digit after the . in either F- or E-form output.
     * Thus we will need more than one digit if we're using
     * E-form.
     */
    assert q < 10 : q; // excessively large digit
    // hack -- m might overflow!
    // in this case, it is certainly > b,
    // and b+m > tens, too, since that has overflowed

    /*
     * We really must do FDBigInt arithmetic.
     * First, construct our FDBigInt initial values.
     */
    // normalize so that division works better

    /*
     * Unroll the first iteration. If our decExp estimate
     * was too high, our first quotient will be zero. In this
     * case, we discard it and decrement decExp.
     */
    assert q < 10 : q; // excessively large digit
    if ( (q == 0) && ! high ){
        // oops. Usually ignore leading zero.

    /*
     * HACK! Java spec sez that we always have at least
     * one digit after the . in either F- or E-form output.
     * Thus we will need more than one digit if we're using
     * E-form.
     */
    assert q < 10 : q; // excessively large digit

    /*
     * Last digit gets rounded based on stopping condition.
     */

    // choose based on which digits we like.

    // most brain-dead version
    // decExponent has 1, 2, or 3, digits
    result[i++] = (char)(e/10 + '0');
    result[i++] = (char)(e%10 + '0');
    result[i++] = (char)(e/100+'0');
    result[i++] = (char)(e/10+'0');
    result[i++] = (char)(e%10 + '0');
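The "most brain-dead version" of exponent emission above writes one, two, or three digits with divisions by 10 and 100. A self-contained sketch; the `e %= 100` step between the three-digit divisions is elided in the text above and restored here as an assumption, since without it `e/10` would emit the hundreds digit twice:

```java
public class ExpDigitsSketch {
    /** Emit a positive decimal exponent e (1, 2, or 3 digits) as chars. */
    static String expDigits(int e) {
        StringBuilder sb = new StringBuilder();
        if (e <= 9) {
            sb.append((char) (e + '0'));
        } else if (e <= 99) {
            sb.append((char) (e / 10 + '0'));
            sb.append((char) (e % 10 + '0'));
        } else {
            sb.append((char) (e / 100 + '0'));
            e %= 100; // keep only the two low-order digits for the next steps
            sb.append((char) (e / 10 + '0'));
            sb.append((char) (e % 10 + '0'));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(expDigits(7));   // 7
        System.out.println(expDigits(42));  // 42
        System.out.println(expDigits(308)); // 308
    }
}
```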
    in = in.trim(); // don't fool around with white space.
                    // throws NullPointerException if null

    // Check for NaN and Infinity strings
    if(c == 'N' || c == 'I') { // possible NaN or infinity
        // compare Input string to "NaN" or "Infinity"
        else // something is amiss, throw exception
        // For the candidate string to be a NaN or infinity,
        // all characters in input string and target char[]
        // must be matched ==> j must equal targetChars.length
        else { // something went wrong, throw exception
        }
    } else if (c == '0') { // check for hexadecimal floating-point number
        if (ch == 'x' || ch == 'X' ) // possible hex string
    }

    // look for and process decimal floating-point string
    char[] digits = new char[ l ];
    // already saw one ., this is the 2nd.

    /*
     * At this point, we've scanned all the digits and decimal
     * point we're going to see. Trim off leading and trailing
     * zeros, which will just confuse us later, and adjust
     * our initial decimal exponent accordingly.
     * we have seen i total characters.
     * nLeadZero of them were zeros before any other digits.
     * nTrailZero of them were zeros after any other digits.
     * if ( decSeen ), then a . was seen after decPt characters
     * ( including leading zeros which have been discarded )
     * nDigits characters were neither lead nor trailing zeros.
     */
    /*
     * special hack: if we saw no non-zero digits, then the
     * answer is zero!
     * Unfortunately, we feel honor-bound to keep parsing!
     */
    // we saw NO DIGITS AT ALL,

    /* Our initial exponent is decPt, adjusted by the number of
     * discarded zeros. Or, if there was no decPt,
     * then its just nDigits adjusted by discarded trailing zeros.
     */

    /*
     * Look for 'e' or 'E' and an optionally signed integer.
     */
    if ( (i < l)  && (((c = in.charAt(i) )=='e') || (c == 'E') ) ){
        // the next character will cause integer overflow
        break expLoop; // stop parsing exponent.

    // The intent here is to end up with
    // infinity or zero, as appropriate.
    // The reason for yielding such a small decExponent,
    // rather than something intuitive such as
    // expSign*Integer.MAX_VALUE, is that this value
    // is subject to further manipulation in
    // doubleValue() and floatValue(), and I don't want
    // it to be able to cause overflow there!
    // (The only way we can get into trouble here is for
    // really outrageous nDigits+nTrailZero, such as 2 billion. )

    // this should not overflow, since we tested
    // for expVal > (MAX+N), where N >= abs(decExp)

    // if we saw something not a digit ( or end of string )
    // after the [Ee][+-], without seeing any digits at all
    // this is certainly an error. If we saw some digits,
    // but then some trailing garbage, that might be ok.
    // so we just fall through in that case.

    /*
     * We parsed everything we could.
     * If there are leftovers, then this is not good input!
     */

    /*
     * Take a FloatingDecimal, which we presumably just scanned in,
     * and find out what its value is, as a double.
     *
     * AS A SIDE EFFECT, SET roundDir TO INDICATE PREFERRED
     * ROUNDING DIRECTION in case the result is really destined
     * for a single-precision float.
     */
    // First, check for NaN and Infinity values

    /*
     * convert the lead kDigits to a long integer.
     */
    // (special performance hack: start to do it using int)

    /*
     * lValue now contains a long integer with the value of
     * the first kDigits digits of the number.
     * dValue contains the (double) of the same.
     */

    /*
     * We know that the digits can be represented
     * exactly. And if the exponent isn't too outrageous,
     * the whole thing can be done with one operation,
     * thus one rounding error.
     * Note that all our constructors trim all leading and
     * trailing zeros, so simple values (including zero)
     * will always end up here
     */
    /*
     * Can get the answer with one operation,
     * thus one roundoff.
     */
    /*
     * We can multiply dValue by 10^(slop)
     * and it is still "small" and exact.
     * Then we can multiply by 10^(exp-slop)
     * with one rounding.
     */
    /*
     * Else we have a hard case with a positive exp.
     */
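The clamping of outrageous exponents described above is observable from the public API: any decimal exponent far outside double's range can only produce a signed zero or infinity, so a clamped stand-in value is safe for the later arithmetic in doubleValue()/floatValue(). A quick check:

```java
public class ExtremeExponents {
    public static void main(String[] args) {
        // Exponents far beyond double's range saturate cleanly.
        System.out.println(Double.parseDouble("1e400"));  // Infinity
        System.out.println(Double.parseDouble("-1e400")); // -Infinity
        System.out.println(Double.parseDouble("1e-400")); // 0.0 (underflows past 4.9e-324)
        // Well inside the range, parsing is unaffected.
        System.out.println(Double.parseDouble("2e300"));  // 2.0E300
    }
}
```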
    /*
     * Can get the answer in one division.
     */
    /*
     * Else we have a hard case with a negative exp.
     */

    /*
     * The sum of digits plus exponent is greater than
     * what we think we can do with one error.
     *
     * Start by approximating the right answer by,
     * naively, scaling by powers of 10.
     */
    /*
     * Lets face it. This is going to be
     * Infinity. Cut to the chase.
     */
    for( j = 0; exp > 1; j++, exp>>=1 ){
    /*
     * The reason for the weird exp > 1 condition
     * in the above loop was so that the last multiply
     * would get unrolled. We handle it here.
     */
    /*
     * Look more closely at the result.
     * If the exponent is just one too large,
     * then use the maximum finite as our estimate
     * value. Else call the result infinity
     * ( I presume this could happen because
     * rounding forces the result here to be
     * an ULP or two larger than
     * Double.MAX_VALUE ).
     */
    /*
     * Lets face it. This is going to be
     * zero. Cut to the chase.
     */
    for( j = 0; exp > 1; j++, exp>>=1 ){
    /*
     * The reason for the weird exp > 1 condition
     * in the above loop was so that the last multiply
     * would get unrolled. We handle it here.
     */
    /*
     * Look more closely at the result.
     * If the exponent is just one too small,
     * then use the minimum finite as our estimate
     * value. Else call the result 0.0
     * ( I presume this could happen because
     * rounding forces the result here to be
     * an ULP or two less than
     * Double.MIN_VALUE ).
     */

    /*
     * dValue is now approximately the result.
     * The hard part is adjusting it, by comparison
     * with FDBigInt arithmetic.
     * Formulate the EXACT big-number result as
     * bigD0 * 10^exp
     */
    /* AS A SIDE EFFECT, THIS METHOD WILL SET THE INSTANCE VARIABLES
     * bigIntExp and bigIntNBits
     */

    /*
     * Scale bigD, bigB appropriately for
     * big-integer operations.
     * Naively, we multiply by powers of ten
     * and powers of two. What we actually do
     * is keep track of the powers of 5 and
     * powers of 2 we would use, then factor out
     * common divisors before doing the work.
     */
    int B2, B5; // powers of 2, 5 in bigB
    int D2, D5; // powers of 2, 5 in bigD
    int Ulp2;   // powers of 2 in halfUlp.
    // shift bigB and bigD left by a number s. t.
    // halfUlp is still an integer.
    // This is going to be a denormalized number
    // (if not actually zero).
    // half an ULP is at 2^-(expBias+expShift+1)
    // if there are common factors of 2, we might just as well
    // factor them out, as they add nothing useful.
    // do multiplications by powers of 5 and 2

    /*
     * bigB is the scaled-big-int version of our floating-point
     * candidate.
     * bigD is the scaled-big-int version of the exact value
     * as we understand it.
     * halfUlp is 1/2 an ulp of bigB, except for special cases
     * of exact powers of 2.
     *
     * the plan is to compare bigB with bigD, and if the difference
     * is less than halfUlp, then we're satisfied. Otherwise,
     * use the ratio of difference to halfUlp to calculate a fudge
     * factor to add to the floating value, then go 'round again.
     */
    overvalue = true; // our candidate is too big.
    // candidate is a normalized exact power of 2 and
    // is too big. We will be subtracting.
    // For our purposes, ulp is the ulp of the
    // next smaller range.
    // rats. Cannot de-scale ulp this far.
    // must scale diff in other direction.
    overvalue = false; // our candidate is too small.
    // the candidate is exactly right!
    // this happens with surprising frequency
    // difference is exactly half an ULP
    // round to some other value maybe, then finish
    // should check for bigIntNBits == 1 here??
    // difference is non-trivial.
    // could scale addend by ratio of difference to
    // halfUlp here, if we bothered to compute that difference.
    // Most of the time ( I hope ) it is about 1 anyway.

    /*
     * Take a FloatingDecimal, which we presumably just scanned in,
     * and find out what its value is, as a float.
     * This is distinct from doubleValue() to avoid the extremely
     * unlikely case of a double rounding error, wherein the conversion
     * to double has one rounding error, and the conversion of that double
     * to a float has another rounding error, IN THE WRONG DIRECTION,
     * ( because of the preference to a zero low-order bit ).
     */
    // First, check for NaN and Infinity values

    /*
     * convert the lead kDigits to an integer.
     */
    /*
     * iValue now contains an integer with the value of
     * the first kDigits digits of the number.
     * fValue contains the (float) of the same.
     */
    /*
     * We know that the digits can be represented
     * exactly. And if the exponent isn't too outrageous,
     * the whole thing can be done with one operation,
     * thus one rounding error.
     * Note that all our constructors trim all leading and
     * trailing zeros, so simple values (including zero)
     * will always end up here.
     */
    /*
     * Can get the answer with one operation,
     * thus one roundoff.
     */
    /*
     * We can multiply dValue by 10^(slop)
     * and it is still "small" and exact.
     * Then we can multiply by 10^(exp-slop)
     * with one rounding.
     */
    /*
     * Else we have a hard case with a positive exp.
     */
    /*
     * Can get the answer in one division.
     */
    /*
     * Else we have a hard case with a negative exp.
     */
    /*
     * In double-precision, this is an exact floating integer.
     * So we can compute to double, then shorten to float
     * with one round, and get the right answer.
     *
     * First, finish accumulating digits.
     * Then convert that integer to a double, multiply
     * by the appropriate power of ten, and convert to float.
     */
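The compare-and-nudge plan described above (compare the scaled candidate with the exact value; if the difference exceeds half an ulp, adjust and go around again) can be mimicked with `BigDecimal` standing in for FDBigInt. This sketch moves one ulp per iteration, and ignores both the power-of-two ulp hack and the exact-half-ulp round-to-even case that the real loop handles:

```java
import java.math.BigDecimal;

public class CorrectionLoopSketch {
    /**
     * Nudge an approximate double toward the exact value of the decimal
     * string, one ulp at a time, until it is within half an ulp.
     */
    static double correct(String decimal, double guess) {
        BigDecimal exact = new BigDecimal(decimal);
        while (true) {
            // new BigDecimal(double) is the exact binary value of guess
            BigDecimal diff = new BigDecimal(guess).subtract(exact);
            BigDecimal halfUlp =
                new BigDecimal(Math.ulp(guess)).multiply(new BigDecimal("0.5"));
            if (diff.abs().compareTo(halfUlp) <= 0) {
                return guess; // within half an ulp: satisfied
            }
            // use the sign of the difference to pick the direction
            guess = (diff.signum() > 0) ? Math.nextDown(guess) : Math.nextUp(guess);
        }
    }

    public static void main(String[] args) {
        double off = Math.nextUp(0.1); // deliberately one ulp too high
        System.out.println(correct("0.1", off) == 0.1); // true
    }
}
```

The real loop converges faster by scaling the addend when the difference is many ulps, but the stopping test is the same shape: remainder versus half an ulp.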
    /*
     * The sum of digits plus exponent is greater than
     * what we think we can do with one error.
     *
     * Start by weeding out obviously out-of-range
     * results, then convert to double and go to
     * the hard case code.
     */
    /*
     * Lets face it. This is going to be
     * Infinity. Cut to the chase.
     */
    /*
     * Lets face it. This is going to be
     * zero. Cut to the chase.
     */

    /*
     * Here, we do 'way too much work, but throwing away
     * our partial results, and going and doing the whole
     * thing as double, then throwing away half the bits that computes
     * when we convert back to float.
     *
     * The alternative is to reproduce the whole multiple-precision
     * algorithm for float precision, or to try to parameterize it
     * for common usage. The former will take about 400 lines of code,
     * and the latter I tried without success. Thus the semi-hack
     * answer here.
     */

    /*
     * All the positive powers of 10 that can be
     * represented exactly in double/float.
     */
    private static final double small10pow[] = {
        1.0e1,  1.0e2,  1.0e3,  1.0e4,  1.0e5,
        1.0e6,  1.0e7,  1.0e8,  1.0e9,  1.0e10,
        1.0e11, 1.0e12, 1.0e13, 1.0e14, 1.0e15,
        1.0e16, 1.0e17, 1.0e18, 1.0e19, 1.0e20,
    };

    private static final float singleSmall10pow[] = {
        1.0e1f, 1.0e2f, 1.0e3f, 1.0e4f, 1.0e5f,
        1.0e6f, 1.0e7f, 1.0e8f, 1.0e9f, 1.0e10f,
    };

    private static final double big10pow[] = {
        1e16, 1e32, 1e64, 1e128, 1e256 };

    private static final double tiny10pow[] = {
        1e-16, 1e-32, 1e-64, 1e-128, 1e-256 };
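The table entries above are exact because 10^n = 2^n * 5^n, and 5^n fits in a double's 53-bit significand for all the listed exponents (up to n = 22 in general) and in a float's 24-bit significand up to n = 10. This can be checked via `BigDecimal`, whose double constructor builds the exact binary value:

```java
import java.math.BigDecimal;

public class ExactPow10Check {
    /** True iff the double d is exactly the integer 10^n. */
    static boolean exactPow10(double d, int n) {
        return new BigDecimal(d).compareTo(BigDecimal.TEN.pow(n)) == 0;
    }

    public static void main(String[] args) {
        System.out.println(exactPow10(1.0e20, 20)); // true: in the table above
        System.out.println(exactPow10(1.0e22, 22)); // true: last exact double power
        System.out.println(exactPow10(1.0e23, 23)); // false: 10^23 needs > 53 bits
        System.out.println((double) 1.0e10f == 1.0e10); // true: exact in float too
    }
}
```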
        5*5*5*5*5*5*5*5*5*5*5*5*5
    };

    private static final long long5pow[] = {
        5L*5*5*5*5*5*5*5*5*5*5*5,
        5L*5*5*5*5*5*5*5*5*5*5*5*5,
        5L*5*5*5*5*5*5*5*5*5*5*5*5*5,
        5L*5*5*5*5*5*5*5*5*5*5*5*5*5*5,
        5L*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5,
        5L*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5,
        5L*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5,
        5L*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5,
        5L*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5,
        5L*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5,
        5L*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5,
        5L*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5,
        5L*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5,
        5L*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5,
        5L*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5*5,
    };
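The table sizes are no accident: 5^13 is the largest power of 5 that fits in an `int`, and 5^26 (the last `long5pow` entry shown) still fits comfortably in a `long`, so every entry is an exact compile-time constant. A quick check:

```java
import java.math.BigInteger;

public class Pow5Fit {
    public static void main(String[] args) {
        // 5^13 is the largest power of 5 that fits in a positive int...
        System.out.println(BigInteger.valueOf(5).pow(13).bitLength()); // 31
        System.out.println(BigInteger.valueOf(5).pow(14).bitLength()); // 33 > 31

        // ...and 5^26 can be accumulated in a long without overflow.
        long p = 1L;
        for (int i = 0; i < 26; i++) p *= 5;
        System.out.println(p == BigInteger.valueOf(5).pow(26).longValueExact()); // true
    }
}
```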
    // approximately ceil( log2( long5pow[i] ) )
    private static final int n5bits[] = {
        0,  3,  5,  7,  10, 12, 14, 17, 19, 21,
        24, 26, 28, 31, 33, 35, 38, 40, 42, 45,
        47, 49, 52, 54, 56, 59, 61,
    };
    private static final char infinity[] = { 'I', 'n', 'f', 'i', 'n', 'i', 't', 'y' };
    private static final char notANumber[] = { 'N', 'a', 'N' };
    private static final char zero[] = { '0', '0', '0', '0', '0', '0', '0', '0' };
    /*
     * Grammar is compatible with hexadecimal floating-point constants
     * described in section 6.4.4.2 of the C99 specification.
     */
    "([-+])?0[xX](((\\p{XDigit}+)\\.?)|((\\p{XDigit}*)\\.(\\p{XDigit}+)))[pP]([-+])?(\\p{Digit}+)[fFdD]?"

    /*
     * Convert string s to a suitable floating decimal; uses the
     * double constructor and sets the roundDir variable appropriately
     * in case the value is later converted to a float.
     */
    // Verify string is a member of the hexadecimal floating-point
    // string language.
    // Input does not match pattern

    /*
     * We must isolate the sign, significand, and exponent
     * fields. The sign value is straightforward. Since
     * floating-point numbers are stored with a normalized
     * representation, the significand and exponent are
     * interrelated.
     *
     * After extracting the sign, we normalize the
     * significand as a hexadecimal value, calculating an
     * exponent adjust for any shifts made during
     * normalization. If the significand is zero, the
     * exponent doesn't need to be examined since the output
     * will be zero.
     *
     * Next the exponent in the input string is extracted.
     * Afterwards, the significand is normalized as a *binary*
     * value and the input value's normalized exponent can be
     * computed. The significand bits are copied into a
     * double significand; if the string has more logical bits
     * than can fit in a double, the extra bits affect the
     * round and sticky bits which are used to round the final
     * value.
     */
    // Extract significand sign
    // Extract Significand magnitude

    /*
     * Based on the form of the significand, calculate how the
     * binary exponent needs to be adjusted to create a
     * normalized *hexadecimal* floating-point number; that
     * is, a number where there is one nonzero hex digit to
     * the left of the (hexa)decimal point. Since we are
     * adjusting a binary, not hexadecimal exponent, the
     * exponent is adjusted by a multiple of 4.
     *
     * There are a number of significand scenarios to consider;
     * letters are used to indicate nonzero digits:
     *
     * 1. 000xxxx       =>      x.xxx      normalized
     *    increase exponent by (number of x's - 1)*4
     *
     * 2. 000xxx.yyyy   =>      x.xxyyyy   normalized
     *    increase exponent by (number of x's - 1)*4
     *
     * 3. .000yyy       =>      y.yy       normalized
     *    decrease exponent by (number of zeros + 1)*4
     *
     * 4. 000.00000yyy  =>      y.yy       normalized
     *    decrease exponent by (number of zeros to right of point + 1)*4
     *
     * If the significand is exactly zero, return a properly
     * signed zero.
     */

    int leftDigits  = 0; // number of meaningful digits to
                         // left of "decimal" point
                         // (leading zeros stripped)
    int rightDigits = 0; // number of digits to right of
                         // "decimal" point; leading zeros
                         // must always be accounted for

    /*
     * The significand is made up of either
     *
     * 1. group 4 entirely (integer portion only)
     *
     * 2. the fractional portion from group 7 plus any
     * (optional) integer portions from group 6.
     */
    // Leading zeros never matter on the integer portion
    // Group 6 is the optional integer; leading zeros
    // never matter on the integer portion
    // Turn "integer.fraction" into "integer"+"fraction"

    /*
     * Adjust exponent as described above
     */
    } else { // Cases 3 and 4

    // If the significand is zero, the exponent doesn't
    // matter; return a properly signed zero.

    /*
     * Use an int to read in the exponent value; this should
     * provide more than sufficient range for non-contrived
     * inputs.  If reading the exponent in as an int does
     * overflow, examine the sign of the exponent and
     * significand to determine what to do.
     */
    // At this point, we know the exponent is
    // syntactically well-formed as a sequence of
    // digits. Therefore, if a NumberFormatException
    // is thrown, it must be due to overflowing int's
    // range. Also, at this point, we have already
    // checked for a zero significand. Thus the signs
    // of the exponent and significand determine the
    // result:
    // exponent +  =>  +infinity or -infinity

    // Calculate partially adjusted exponent

    // Start copying non-zero bits into proper position in
    // a long; copy explicit bit too; this will be masked
    // later for normal values.

    // First iteration is different, since we only copy
    // from the leading significand bit; one more exponent
    // adjust will be needed...

    // IMPORTANT: make leadingDigit a long to avoid
    // surprising shift semantics!

    /*
     * Left shift the leading digit (53 - (bit position of
     * leading 1 in digit)); this sets the top bit of the
     * significand to 1.  The nextShift value is adjusted
     * to take into account the number of bit positions of
     * the leadingDigit actually used.  Finally, the
     * exponent is adjusted to normalize the significand
     * as a binary value, not just a hex value.
     */
    // The preceding if-else could be replaced by a single
    // code block based on the high-order bit set in
    // leadingDigit.  Given leadingOnePosition,
    // significand |= leadingDigit << (SIGNIFICAND_WIDTH - leadingOnePosition);
    // nextShift = 52 - (3 + leadingOnePosition);
    // exponent += (leadingOnePosition-1);

    /*
     * Now the exponent variable is equal to the normalized
     * binary exponent.  Code below will make representation
     * adjustments if the exponent is incremented after
     * rounding (includes overflows to infinity) or if the
     * result is subnormal.
     */

    // Copy digit into significand until the significand can't
    // hold another full hex digit or there are no more input
    // hex digits.
    // After the above loop, the bulk of the string is copied.
    // Now, we must copy any partial hex digits into the
    // significand AND compute the round bit and start computing
    // the sticky bit.
    if ( i < signifLength ) { // at least one hex input digit exists
        // from nextShift, figure out how many bits need
        // to be copied, if any
        // three bits need to be copied in; can
        // set round bit
        // two bits need to be copied in; can
        // set round and start sticky
        // one bit needs to be copied in
        // Now set round and start sticky, if possible
        // all bits copied into significand; set
        // round and start sticky
        // nonzeros in three low order bits?
        // Round is set; sticky might be set.
        // For the sticky bit, it suffices to check the
        // current digit and test for any nonzero digits in
        // the remaining unprocessed input.
    // else all of string was seen, round and sticky are
    // already set.

    // Check for overflow and update exponent accordingly.
    // overflow to properly signed infinity
    } else { // Finite return value
        // The result returned in this block cannot be a
        // zero or subnormal; however after the
        // significand is adjusted from rounding, we could
        // still overflow in infinity.
        // AND exponent bits into significand; if the
        // significand is incremented and overflows from
        // rounding, this combination will update the
        // exponent correctly, even in the case of
        // Double.MAX_VALUE overflowing to infinity.
    } else { // Subnormal or zero
        // (exponent < DoubleConsts.MIN_EXPONENT)
        // No way to round back to nonzero value
        // regardless of significand if the exponent is
        // less than -1075.
    } else { // -1075 <= exponent <= MIN_EXPONENT -1 = -1023
        /*
         * Find bit position to round to; recompute
         * round and sticky bits, and shift
         * significand right appropriately.
         */
        // Number of bits of significand to preserve is
        // exponent - abs_min_exp +1
        // First, isolate the new round bit
        // create mask to update sticky bits; low
        // order bitsDiscarded bits should be 1

    // The significand variable now contains the currently
    // appropriate exponent bits too.

    /*
     * Determine if significand should be incremented;
     * making this determination depends on the least
     * significant bit and the round and sticky bits.
     *
     * Round to nearest even rounding table, adapted from
     * table 4.7 in "Computer Arithmetic" by Israel Koren.
     * The digit to the left of the "decimal" point is the
     * least significant bit, the digits to the right of
     * the point are the round and sticky bits
     */

    /*
     * Set roundingDir variable field of fd properly so
     * that the input string can be properly rounded to a
     * float value.  There are two cases to consider:
     *
     * 1. rounding to double discards sticky bit
     * information that would change the result of a float
     * rounding (near halfway case between two floats)
     *
     * 2. rounding to double rounds up when rounding up
     * would not occur when rounding to float.
     *
     * The former case only needs to be considered when
     * the bits rounded away when casting to float are all
     * zero; otherwise, float round bit is properly set
     * and sticky will already be true.
     *
     * The lower exponent bound for the code below is the
     * minimum (normalized) subnormal exponent - 1 since a
     * value with that exponent can round up to the
     * minimum subnormal value and the sticky bit
     * information must be preserved (i.e. case 1).
     */
    // Outside above exponent range, the float value
    // will be zero or infinity.

    /*
     * If the low-order 28 bits of a rounded double
     * significand are 0, the double could be a
     * half-way case for a rounding to float.  If the
     * double value is a half-way case, the double
     * significand may have to be modified to round
     * to the right float value (see the stickyRound
     * method).  If the rounding to double has lost
     * what would be float sticky bit information, the
     * double significand must be incremented.  If the
     * double value's significand was itself
     * incremented, the float value may end up too
     * large so the increment should be undone.
     */
    // For negative values, the sign of the
    // roundDir is the same as for positive values
    // since adding 1 increases the significand's
    // magnitude and subtracting 1 decreases the
    // significand's magnitude.  If neither round
    // nor sticky is true, the double value is
    // exact and no adjustment is required for a
    // proper float rounding.

    // If round and sticky were both true,
    // and the least significant
    // significand bit were 0, the rounded
    // significand would not have its
    // low-order bits be zero.  Therefore,
    // we only need to adjust the
    // significand if round XOR sticky is
    // true.
    } else { // prerounding lsb is 1
        // If the prerounding lsb is 1 and the
        // resulting significand has its
        // low-order bits zero, the significand
        // was incremented.  Here, we undo the
        // increment, which will ensure the
        // right guard and sticky bits for the
        // float rounding.

    /*
     * Return <code>s</code> with any leading zeros removed.
     */

    /*
     * Extract a hexadecimal digit from position <code>position</code>
     * of string <code>s</code>.
     */
    throw new AssertionError(
        "Unexpected failure of digit conversion of " +
/*
 * A really, really simple bigint package
 * tailored to the needs of floating base conversion.
 */
class FDBigInt {
    int nWords; // number of words used
    int data[]; // value: data[0] is least significant

    public FDBigInt( int v ){
        nWords = 1;
        data = new int[1];
        data[0] = v;
    }

    public FDBigInt( long v ){
        data = new int[2];
        data[0] = (int)v;
        data[1] = (int)(v>>>32);
        nWords = (data[1]==0) ? 1 : 2;
    }

    private FDBigInt( int [] data, int nwords ){
        this.data = data;
        this.nWords = nwords;
    }

    public FDBigInt( long seed, char digit[], int nd0, int nd ){
        int n = (nd+8)/9;        // estimate size needed.
        if ( n < 2 ) n = 2;
        data = new int[n];       // allocate enough space
        data[0] = (int)seed;     // starting value
        data[1] = (int)(seed>>>32);
        nWords = (data[1]==0) ? 1 : 2;
        int i = nd0;
        int limit = nd-5;        // slurp digits 5 at a time.
        int v;
        while ( i < limit ){
            int ilim = i+5;
            v = (int)digit[i++]-(int)'0';
            while( i < ilim ){
                v = 10*v + (int)digit[i++]-(int)'0';
            }
            multaddMe( 100000, v ); // ... where 100000 is 10^5.
        }
        int factor = 1;
        v = 0;
        while ( i < nd ){
            v = 10*v + (int)digit[i++]-(int)'0';
            factor *= 10;
        }
        if ( factor != 1 ){
            multaddMe( factor, v );
        }
    }
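The constructor above slurps decimal digits five at a time and folds each batch in with a single multiply-add by 10^5. The batching is easiest to see with a plain long accumulator; this stand-alone sketch uses invented names and a long total in place of FDBigInt's word array and multaddMe.

```java
// Sketch of the digit-slurping idea used by the digit-array constructor:
// accumulate five decimal digits into a small int, then fold the batch into
// the running total with one multiply-add by 100000 (10^5). Leftover digits
// at the end are folded in one at a time.
public class SlurpDemo {
    public static long parse(char[] digit, int nd) {
        long total = 0;
        int i = 0;
        int limit = nd - 5;             // slurp digits 5 at a time.
        while (i < limit) {
            int ilim = i + 5;
            int v = digit[i++] - '0';
            while (i < ilim)
                v = 10 * v + digit[i++] - '0';
            total = total * 100000 + v; // ... where 100000 is 10^5.
        }
        while (i < nd)                  // fold in the leftover digits.
            total = 10 * total + digit[i++] - '0';
        return total;
    }

    public static void main(String[] args) {
        char[] d = "31415926535897".toCharArray();
        if (parse(d, d.length) != 31415926535897L) throw new AssertionError();
        System.out.println("ok");
    }
}
```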
    /*
     * Left shift by c bits.
     * Shifts this in place.
     */
    public void
    lshiftMe( int c )throws IllegalArgumentException {
        if ( c <= 0 ){
            if ( c == 0 )
                return; // silly.
            else
                throw new IllegalArgumentException("negative shift count");
        }
        int wordcount = c>>5;
        int bitcount  = c & 0x1f;
        int anticount = 32-bitcount;
        int t[] = data;
        int s[] = data;
        if ( nWords+wordcount+1 > t.length ){
            // reallocate.
            t = new int[ nWords+wordcount+1 ];
        }
        int target = nWords+wordcount;
        int src    = nWords-1;
        if ( bitcount == 0 ){
            // special hack, since an anticount of 32 won't go!
            System.arraycopy( s, 0, t, wordcount, nWords );
            target = wordcount-1;
        } else {
            t[target--] = s[src]>>>anticount;
            while ( src >= 1 )
                t[target--] = (s[src]<<bitcount) | (s[--src]>>>anticount);
            t[target--] = s[src]<<bitcount;
        }
        while( target >= 0 ){
            t[target--] = 0;
        }
        data = t;
        nWords += wordcount + 1;
        // may have constructed high-order word of 0.
        // if so, trim it
        while ( nWords > 1 && data[nWords-1] == 0 )
            nWords--;
    }

    /*
     * normalize this number by shifting until
     * the MSB of the number is at 0x08000000.
     * This is in preparation for quoRemIteration, below.
     * The idea is that, to make division easier, we want the
     * divisor to be "normalized" -- usually this means shifting
     * the MSB into the high word's sign bit. But because we know that
     * the quotient will be 0 < q < 10, we would like to arrange that
     * the dividend not span up into another word of precision.
     * (This needs to be explained more clearly!)
     */
    public int
    normalizeMe() throws IllegalArgumentException {
        int src;
        int wordcount = 0;
        int bitcount  = 0;
        int v = 0;
        for ( src = nWords-1 ; src >= 0 && (v=data[src]) == 0 ; src-- ){
            wordcount += 1;
        }
        if ( src < 0 ){
            // oops. Value is zero. Cannot normalize it!
            throw new IllegalArgumentException("zero value");
        }
        /*
         * In most cases, we assume that wordcount is zero. This only
         * makes sense, as we try not to maintain any high-order
         * words full of zeros. In fact, if there are zeros, we will
         * simply SHORTEN our number at this point. Watch closely...
         */
        nWords -= wordcount;
        /*
         * Compute how far left we have to shift v s.t. its highest-
         * order bit is in the right place. Then call lshiftMe to
         * do the work.
         */
        if ( (v & 0xf0000000) != 0 ){
            // will have to shift up into the next word.
            for( bitcount = 32 ; (v & 0xf0000000) != 0 ; bitcount-- )
                v >>>= 1;
        } else {
            while ( v <= 0x000fffff ){
                // hack: byte-at-a-time shifting
                v <<= 8;
                bitcount += 8;
            }
            while ( v <= 0x07ffffff ){
                v <<= 1;
                bitcount += 1;
            }
        }
        if ( bitcount != 0 )
            lshiftMe( bitcount );
        return bitcount;
    }
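For a word whose top nibble is already clear, the shift distance this normalization computes is simply the number of leading zero bits minus four, which leaves the most significant set bit at position 27 (0x08000000) with four high-order zeros to spare. A stand-alone sketch (invented names) cross-checks the byte-at-a-time loop against `Integer.numberOfLeadingZeros`.

```java
// Sketch of the shift computation in normalizeMe for a single nonzero word v
// with (v & 0xf0000000) == 0: shift left until the MSB sits at bit 27, so a
// later q < 10 quotient estimate cannot spill into the next word. The closed
// form is Integer.numberOfLeadingZeros(v) - 4.
public class NormalizeDemo {
    public static int shiftLoop(int v) { // assumes v != 0 and top nibble clear
        int bitcount = 0;
        while (v <= 0x000fffff) {        // hack: byte-at-a-time shifting
            v <<= 8;
            bitcount += 8;
        }
        while (v <= 0x07ffffff) {
            v <<= 1;
            bitcount += 1;
        }
        return bitcount;
    }

    public static int shiftClosedForm(int v) {
        return Integer.numberOfLeadingZeros(v) - 4;
    }

    public static void main(String[] args) {
        for (int v : new int[]{ 1, 0x1234, 0x000fffff, 0x07ffffff, 0x08000000 })
            if (shiftLoop(v) != shiftClosedForm(v)) throw new AssertionError();
        System.out.println("ok");
    }
}
```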
    /*
     * Multiply a FDBigInt by an int.
     * Result is a new FDBigInt.
     */
    public FDBigInt
    mult( int iv ) {
        long v = iv;
        int r[];
        long p;

        // guess adequate size of r.
        r = new int[ ( v * ((long)data[nWords-1]&0xffffffffL) > 0xfffffffL ) ? nWords+1 : nWords ];
        p = 0L;
        for( int i=0; i < nWords; i++ ) {
            p += v * ((long)data[i]&0xffffffffL);
            r[i] = (int)p;
            p >>>= 32;
        }
        if ( p == 0L ){
            return new FDBigInt( r, nWords );
        } else {
            r[nWords] = (int)p;
            return new FDBigInt( r, nWords+1 );
        }
    }
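The widening idiom `((long)w & 0xffffffffL)` treats a signed 32-bit word as unsigned, so the 64-bit product can be split into a stored low word plus a carry into the next position. A stand-alone sketch of one multiply pass over a little-endian word array; the class name and the fixed extra carry word are illustrative simplifications, not FDBigInt's sizing logic.

```java
// Sketch of the word-at-a-time multiply used by mult(): each word is widened
// unsigned, the 64-bit product is accumulated, the low 32 bits are stored,
// and the high 32 bits carry into the next word.
public class MultDemo {
    // multiply little-endian unsigned words by a small int multiplier
    public static int[] mult(int[] words, int iv) {
        long v = iv;
        int[] r = new int[words.length + 1]; // always leave room for a carry word
        long p = 0L;
        for (int i = 0; i < words.length; i++) {
            p += v * ((long) words[i] & 0xffffffffL); // UNSIGNED CONVERSION
            r[i] = (int) p;
            p >>>= 32;                                // carry to next word
        }
        r[words.length] = (int) p;
        return r;
    }

    public static void main(String[] args) {
        // 0xffffffff * 10 = 0x9fffffff6: low word 0xfffffff6, carry word 9
        int[] r = mult(new int[]{ 0xffffffff }, 10);
        if (r[0] != 0xfffffff6 || r[1] != 9) throw new AssertionError();
        System.out.println("ok");
    }
}
```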
    /*
     * Multiply a FDBigInt by an int and add another int.
     * Result is computed in place.
     */
    public void
    multaddMe( int iv, int addend ) {
        long v = iv;
        long p;

        // unroll 0th iteration, doing addition.
        p = v * ((long)data[0]&0xffffffffL) + ((long)addend&0xffffffffL);
        data[0] = (int)p;
        p >>>= 32;
        for( int i=1; i < nWords; i++ ) {
            p += v * ((long)data[i]&0xffffffffL);
            data[i] = (int)p;
            p >>>= 32;
        }
        if ( p != 0L ){
            data[nWords] = (int)p; // will fail noisily if illegal!
            nWords++;
        }
    }

    /*
     * Multiply a FDBigInt by another FDBigInt.
     * Result is a new FDBigInt.
     */
    public FDBigInt
    mult( FDBigInt other ){
        // crudely guess adequate size for r
        int r[] = new int[ nWords + other.nWords ];
        int i;
        // I think I am promised zeros...

        for( i = 0; i < this.nWords; i++ ){
            long v = (long)this.data[i] & 0xffffffffL; // UNSIGNED CONVERSION
            long p = 0L;
            int j;
            for( j = 0; j < other.nWords; j++ ){
                // UNSIGNED CONVERSIONS ALL 'ROUND.
                p += ((long)r[i+j]&0xffffffffL) + v*((long)other.data[j]&0xffffffffL);
                r[i+j] = (int)p;
                p >>>= 32;
            }
            r[i+j] = (int)p;
        }
        // compute how much of r we actually needed for all that.
        for ( i = r.length-1; i > 0; i-- )
            if ( r[i] != 0 )
                break;
        return new FDBigInt( r, i+1 );
    }
    /*
     * Add one FDBigInt to another. Return a FDBigInt
     */
    public FDBigInt
    add( FDBigInt other ){
        int i;
        int a[], b[];
        int n, m;
        long c = 0L;
        // arrange such that a.nWords >= b.nWords;
        // n = a.nWords, m = b.nWords
        if ( this.nWords >= other.nWords ){
            a = this.data;
            n = this.nWords;
            b = other.data;
            m = other.nWords;
        } else {
            a = other.data;
            n = other.nWords;
            b = this.data;
            m = this.nWords;
        }
        int r[] = new int[ n ];
        for ( i = 0; i < n; i++ ){
            c += (long)a[i] & 0xffffffffL;
            if ( i < m ){
                c += (long)b[i] & 0xffffffffL;
            }
            r[i] = (int)c;
            c >>= 32; // signed shift.
        }
        if ( c != 0L ){
            // oops -- carry out -- need longer result.
            int s[] = new int[ r.length+1 ];
            System.arraycopy( r, 0, s, 0, r.length );
            s[r.length] = (int)c;
            return new FDBigInt( s, s.length );
        }
        return new FDBigInt( r, n );
    }
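The ripple-carry loop in add() depends on the accumulator c never going negative, so the signed shift `c >>= 32` leaves exactly the carry (0 or 1) for the next position. A stand-alone sketch with invented names, restricted to equal-length inputs for brevity:

```java
// Sketch of the ripple-carry addition loop: both words are widened unsigned,
// summed into a long accumulator, the low 32 bits are stored, and the shift
// keeps only the carry. c is at most 2^33 - 2 here, so it is never negative
// and the signed shift is safe.
public class AddDemo {
    public static int[] add(int[] a, int[] b) { // same length, little-endian
        int[] r = new int[a.length + 1];        // room for a carry out
        long c = 0L;
        for (int i = 0; i < a.length; i++) {
            c += (long) a[i] & 0xffffffffL;
            c += (long) b[i] & 0xffffffffL;
            r[i] = (int) c;
            c >>= 32; // signed shift; value is 0 or 1
        }
        r[a.length] = (int) c;
        return r;
    }

    public static void main(String[] args) {
        // 0xffffffff + 1 = 0x100000000: low word 0, carry out 1
        int[] r = add(new int[]{ 0xffffffff }, new int[]{ 1 });
        if (r[0] != 0 || r[1] != 1) throw new AssertionError();
        System.out.println("ok");
    }
}
```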
    /*
     * Subtract one FDBigInt from another. Return a FDBigInt
     * Assert that the result is positive.
     */
    public FDBigInt
    sub( FDBigInt other ){
        int r[] = new int[ this.nWords ];
        int i;
        int n = this.nWords;
        int m = other.nWords;
        int nzeros = 0;
        long c = 0L;
        for ( i = 0; i < n; i++ ){
            c += (long)this.data[i] & 0xffffffffL;
            if ( i < m ){
                c -= (long)other.data[i] & 0xffffffffL;
            }
            if ( ( r[i] = (int)c ) == 0 )
                nzeros++;
            else
                nzeros = 0;
            c >>= 32; // signed shift
        }
        assert c == 0L : c; // borrow out of subtract
        return new FDBigInt( r, n-nzeros );
    }

    /*
     * Compare FDBigInt with another FDBigInt. Return an integer
     * >0: this > other
     *  0: this == other
     * <0: this < other
     */
    public int
    cmp( FDBigInt other ){
        int i;
        if ( this.nWords > other.nWords ){
            // if any of my high-order words is non-zero,
            // then the answer is evident
            int j = other.nWords-1;
            for ( i = this.nWords-1; i > j ; i-- )
                if ( this.data[i] != 0 )
                    return 1;
        } else if ( this.nWords < other.nWords ){
            // if any of other's high-order words is non-zero,
            // then the answer is evident
            int j = this.nWords-1;
            for ( i = other.nWords-1; i > j ; i-- )
                if ( other.data[i] != 0 )
                    return -1;
        } else {
            i = this.nWords-1;
        }
        for ( ; i > 0 ; i-- )
            if ( this.data[i] != other.data[i] )
                break;
        // careful! want unsigned compare!
        int a = this.data[i];
        int b = other.data[i];
        if ( a < 0 ){
            // a is really big, unsigned
            if ( b < 0 ){
                return a-b; // both big, negative
            } else {
                return 1;
                // b not big, answer is obvious;
            }
        } else {
            // a is not really big
            if ( b < 0 ) {
                // but b is really big
                return -1;
            } else {
                return a-b;
            }
        }
    }

    /*
     * Compute
     * q = (int)( this / S )
     * this = 10 * ( this mod S )
     * Return q.
     * This is the iteration step of digit development for output.
     * We assume that S has been normalized, as above, and that
     * "this" has been lshift'ed accordingly.
     * Also assume, of course, that the result, q, can be expressed
     * as an integer, 0 <= q < 10.
     */
    public int
    quoRemIteration( FDBigInt S ) throws IllegalArgumentException {
        // ensure that this and S have the same number of
        // digits. If S is properly normalized and q < 10 then
        // this must be so.
        if ( nWords != S.nWords ){
            throw new IllegalArgumentException("disparate values");
        }
        // estimate q the obvious way. We will usually be
        // right. If not, then we're only off by a little and
        // will re-add.
        int n = nWords-1;
        long q = ((long)data[n]&0xffffffffL) / (long)S.data[n];
        long diff = 0L;
        for ( int i = 0; i <= n ; i++ ){
            diff += ((long)data[i]&0xffffffffL) - q*((long)S.data[i]&0xffffffffL);
            data[i] = (int)diff;
            diff >>= 32; // N.B. SIGNED shift.
        }
        if ( diff != 0L ) {
            // damn, damn, damn. q is too big.
            // add S back in until this turns positive. This should
            // not be very many times!
            long sum = 0L;
            while ( sum == 0L ){
                for ( int i = 0; i <= n; i++ ){
                    sum += ((long)data[i]&0xffffffffL) + ((long)S.data[i]&0xffffffffL);
                    data[i] = (int)sum;
                    sum >>= 32; // Signed or unsigned, answer is 0 or 1
                }
                /*
                 * Originally the following line read
                 * "if ( sum != 0 && sum != -1 )"
                 * but that would be wrong. Because of the
                 * treatment of the two values as entirely unsigned,
                 * it would be impossible for a carry-out to be interpreted
                 * as -1 -- it would have to be a single-bit carry-out, or
                 * +1.
                 */
                assert sum == 0 || sum == 1 : sum; // carry out of division correction
                q -= 1;
            }
        }
        // finally, we can multiply this by 10.
        // it cannot overflow, right, as the high-order word has
        // at least 4 high-order zeros!
        long p = 0L;
        for ( int i = 0; i <= n; i++ ){
            p += 10*((long)data[i]&0xffffffffL);
            data[i] = (int)p;
            p >>= 32; // SIGNED shift.
        }
        assert p == 0L : p; // Carry out of *10
        return (int)q;
    }

    public long
    longValue(){
        // if this can be represented as a long, return the value
        if ( this.nWords == 1 )
            return ((long)data[0]&0xffffffffL);

        assert data[1] >= 0; // value too big
        return ((long)(data[1]) << 32) | ((long)data[0]&0xffffffffL);
    }
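quoRemIteration performs one step of digit development: emit q = value / S, then continue with 10 * (value mod S). On plain longs the step is easy to see; this stand-alone sketch uses invented names, while FDBigInt does the same arithmetic word by word with the normalization and correction machinery above.

```java
// Sketch of the digit-development step behind quoRemIteration, on longs:
// each iteration emits one decimal digit q = value / s and continues with
// value = 10 * (value % s). Repeating it yields the decimal expansion of
// the fraction value/s (the first digit is the integer part).
public class DigitDemo {
    public static String digits(long value, long s, int n) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < n; k++) {
            long q = value / s;       // 0 <= q <= 9 provided value < 10*s
            sb.append((char) ('0' + q));
            value = 10 * (value % s);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 1/8 = 0.125 -> digits "0", "1", "2", "5"
        if (!digits(1, 8, 4).equals("0125")) throw new AssertionError();
        System.out.println("ok");
    }
}
```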