Coding.java revision 2362
2362N/A * Copyright (c) 2001, 2005, Oracle and/or its affiliates. All rights reserved. 0N/A * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER. 0N/A * This code is free software; you can redistribute it and/or modify it 0N/A * under the terms of the GNU General Public License version 2 only, as 2362N/A * published by the Free Software Foundation. Oracle designates this 0N/A * particular file as subject to the "Classpath" exception as provided 2362N/A * by Oracle in the LICENSE file that accompanied this code. 0N/A * This code is distributed in the hope that it will be useful, but WITHOUT 0N/A * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or 0N/A * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License 0N/A * version 2 for more details (a copy is included in the LICENSE file that 0N/A * accompanied this code). 0N/A * You should have received a copy of the GNU General Public License version 0N/A * 2 along with this work; if not, write to the Free Software Foundation, 0N/A * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA. 2362N/A * Please contact Oracle, 500 Oracle Parkway, Redwood Shores, CA 94065 USA 2362N/A * or visit www.oracle.com if you need additional information or have any 0N/A * Define the conversions between sequences of small integers and raw bytes. 0N/A * This is a schema of encodings which incorporates varying lengths, 0N/A * varying degrees of length variability, and varying amounts of signed-ness. 0N/A Coding schema for single integers, parameterized by (B,H,S): 0N/A Let B in [1,5], H in [1,256], S in [0,3]. 0N/A (S limit is arbitrary. B follows the 32-bit limit. H is byte size.) 0N/A A given (B,H,S) code varies in length from 1 to B bytes. 0N/A The 256 values a byte may take on are divided into L=(256-H) and H 0N/A values, with all the H values larger than the L values. 0N/A (That is, the L values are [0,L) and the H are [L,256).) 0N/A The last byte is always either the B-th byte, a byte with "L value" 0N/A (<L), or both. There is no other byte that satisfies these conditions. 0N/A All bytes before the last always have "H values" (>=L). 0N/A Therefore, if L==0, the code always has the full length of B bytes. 0N/A The coding then becomes a classic B-byte little-endian unsigned integer. 0N/A (Also, if L==128, the high bit of each byte acts signals the presence 0N/A of a following byte, up to the maximum length.) 0N/A In the unsigned case (S==0), the coding is compact and monotonic 0N/A in the ordering of byte sequences defined by appending zero bytes 0N/A to pad them to a common length B, reversing them, and ordering them 0N/A lexicographically. (This agrees with "little-endian" byte order.) 0N/A Therefore, the unsigned value of a byte sequence may be defined as: 0N/A or [0..256) if B==1 (**) 0N/A U(b0,b1) == b0 + b1*H 0N/A or [L..L*(1+H) + H^2) if B==2 0N/A U(b0,b1,b2) == b0 + b1*H + b2*H^2 0N/A in [L*(1+H)..L*(1+H+H^2)) 0N/A or [L*(1+H)..L*(1+H+H^2) + H^3) if B==3 0N/A U(b[i]: i<n) == Sum[i<n]( b[i] * H^i ) 0N/A up to L*Sum[i<n]( H^i ) 0N/A or to L*Sum[i<n]( H^i ) + H^n if n==B 0N/A (**) If B==1, the values H,L play no role in the coding. 0N/A As a convention, we require that any (1,H,S) code must always 0N/A encode values less than H. Thus, a simple unsigned byte is coded 0N/A specifically by the code (1,256,0). 0N/A (Properly speaking, the unsigned case should be parameterized as 0N/A S==Infinity. If the schema were regular, the case S==0 would really 0N/A denote a numbering in which all coded values are negative.) 0N/A If S>0, the unsigned value of a byte sequence is regarded as a binary 0N/A integer. If any of the S low-order bits are zero, the corresponding 0N/A signed value will be non-negative. If all of the S low-order bits 0N/A (S>0) are one, the the corresponding signed value will be negative. 0N/A The non-negative signed values are compact and monotonically increasing 0N/A (from 0) in the ordering of the corresponding unsigned values. 0N/A The negative signed values are compact and monotonically decreasing 0N/A (from -1) in the ordering of the corresponding unsigned values. 0N/A In essence, the low-order S bits function as a collective sign bit 0N/A for negative signed numbers, and as a low-order base-(2^S-1) digit 0N/A for non-negative signed numbers. 0N/A Therefore, the signed value corresponding to an unsigned value is: 0N/A Sgn(x) == (x / 2^S)*(2^S-1) + (x % 2^S), if S>0, (x % 2^S) < 2^S-1 0N/A Sgn(x) == -(x / 2^S)-1, if S>0, (x % 2^S) == 2^S-1 0N/A Finally, the value of a byte sequence, given the coding parameters 0N/A (B,H,S), is defined as: 0N/A V(b[i]: i<n) == Sgn(U(b[i]: i<n)) 0N/A The extremal positive and negative signed value for a given range 0N/A of unsigned values may be found by sign-encoding the largest unsigned 0N/A value which is not 2^S-1 mod 2^S, and that which is, respectively. 0N/A Because B,H,S are variable, this is not a single coding but a schema 0N/A of codings. For optimal compression, it is necessary to adaptively 0N/A select specific codings to the data being compressed. 0N/A For example, if a sequence of values happens never to be negative, 0N/A S==0 is the best choice. If the values are equally balanced between 0N/A negative and positive, S==1. If negative values are rare, then S>1 0N/A is more appropriate. 0N/A A (B,H,S) encoding is called a "subrange" if it does not encode 0N/A the largest 32-bit value, and if the number R of values it does 0N/A encode can be expressed as a positive 32-bit value. (Note that 0N/A B=1 implies R<=256, B=2 implies R<=65536, etc.) 0N/A A delta version of a given (B,H,S) coding encodes an array of integers 0N/A by writing their successive differences in the (B,H,S) coding. 0N/A The original integers themselves may be recovered by making a 0N/A running accumulation of sum of the differences as they are read. 0N/A As a special case, if a (B,H,S) encoding is a subrange, its delta 0N/A version will only encode arrays of numbers in the coding's unsigned 0N/A range, [0..R-1]. The coding of deltas is still in the normal signed 0N/A range, if S!=0. During delta encoding, all subtraction results are 0N/A reduced to the signed range, by adding multiples of R. Likewise, 0N/A. during encoding, all addition results are reduced to the unsigned range. 0N/A This special case for subranges allows the benefits of wraparound 0N/A when encoding correlated sequences of very small positive numbers. 0N/A // Code-specific limits: 0N/A // Code range for a all (B,H) codes of length <=nMax (<=B). 0N/A // n < B: L*Sum[i<n]( H^i ) 0N/A // n == B: L*Sum[i<B]( H^i ) + H^B 0N/A assert(B >=
1 && B <=
5);
0N/A assert(H >=
1 && H <=
256);
0N/A if (
nMax ==
0)
return 0;
// no codes of zero length 0N/A if (B ==
1)
return H;
// special case; see (**) above 0N/A /** Largest int representable by (B,H,S) in up to nMax bytes. */ 0N/A //assert(S >= 0 && S <= S_MAX); 0N/A return -
1;
// degenerate max value for empty set of codes 0N/A if (
maxPos <
0)
return -
1;
// No positive codings at all. 0N/A // check for 32-bit wraparound: 0N/A /** Smallest int representable by (B,H,S) in up to nMax bytes. 0N/A Returns Integer.MIN_VALUE if 32-bit wraparound covers 0N/A the entire negative range. 0N/A //assert(S >= 0 && S <= S_MAX); 0N/A // Can code negative values via 32-bit wraparound. 0N/A if (
maxNeg <
0)
return 0;
// No negative codings at all. 0N/A // Some of the arithmetic below is on unsigned 32-bit integers. 0N/A // These must be represented in Java as longs in the range [0..2^32-1]. 0N/A // The conversion to a signed int is just the Java cast (int), but 0N/A // the conversion to an unsigned int is the following little method: 0N/A return ((
long)
sx <<
32) >>>
32;
0N/A assert(
ux >= -
1);
// can be out of 32-bit range; who cares 0N/A // If S>=2 very low negatives are coded by 32-bit-wrapped positives. 0N/A // The lowest negative representable by a negative coding is 0N/A // ~(umax32 >> S), and the next lower number is coded by wrapping 0N/A // the highest positive: 0N/A // CodePos(umax32-1) -> (umax32-1)-((umax32-1)>>S) 0N/A // which simplifies to ~(umax32 >> S)-1. 0N/A return (
0 >
sx) && (
sx >= ~(-
1>>>S));
0N/A return (
int)
ux;
// cast to signed int 0N/A // Sgn(x) == -(x / 2^S)-1 0N/A // Sgn(x) == (x / 2^S)*(2^S-1) + (x % 2^S) 0N/A // Assert special case of S==1: 0N/A assert(!(S ==
1) ||
sx == (((
int)
ux >>>
1) ^ -((
int)
ux &
1)));
0N/A // InvSgn(sx) = (sx / (2^S-1))*2^S + (sx % (2^S-1)) 0N/A // InvSgn(sx) = (-sx-1)*2^S + (2^S-1) 0N/A // Top-level coding of single integers: 0N/A for (
int i =
0; i < B-
1; i++) {
0N/A // Report number of bytes written by updating outpos[0]: 0N/A // Check right away for mis-coding. 0N/A //assert(sx == readInt(out, new int[1], B, H, S)); 0N/A // U(b[i]: i<n) == Sum[i<n]( b[i] * H^i ) 0N/A for (
int i =
0; i < B; i++) {
0N/A //assert(sum >= 0 && sum < codeRangeLong(B, H)); 0N/A // Report number of bytes read by updating inpos[0]: 0N/A // The Stream version doesn't fetch a byte unless it is needed for coding. 0N/A // U(b[i]: i<n) == Sum[i<n]( b[i] * H^i ) 0N/A for (
int i =
0; i < B; i++) {
0N/A public static final int B_MAX =
5;
/* B: [1,5] */ 0N/A public static final int H_MAX =
256;
/* H: [1,256] */ 0N/A public static final int S_MAX =
2;
/* S: [0,2] */ 0N/A private final int B;
/*1..5*/ // # bytes (1..5) 0N/A private final int H;
/*1..256*/ // # codes requiring a higher byte 0N/A private final int L;
/*0..255*/ // # codes requiring a higher byte 0N/A private final int S;
/*0..3*/ // # low-order bits representing sign 0N/A private final int del;
/*0..2*/ // type of delta encoding (0 == none) 0N/A private final int min;
// smallest representable value 0N/A private final int max;
// largest representable value 0N/A private final int umin;
// smallest representable uns. value 0N/A private final int umax;
// largest representable uns. value 0N/A private final int[]
byteMin;
// smallest repr. value, given # bytes 0N/A private final int[]
byteMax;
// largest repr. value, given # bytes 0N/A if (
this.B !=
that.B)
return false;
0N/A if (
this.H !=
that.H)
return false;
0N/A if (
this.S !=
that.S)
return false;
0N/A return (
del<<
14)+(S<<
11)+(B<<
8)+(H<<
0);
0N/A /** Can this coding represent a single value, possibly a delta? 0N/A * This ignores the D property. That is, for delta codings, 0N/A * this tests whether a delta value of 'x' can be coded. 0N/A * For signed delta codings which produce unsigned end values, 0N/A * use canRepresentUnsigned. 0N/A /** Can this coding, apart from its S property, 0N/A * represent a single value? (Negative values 0N/A * can only be represented via 32-bit overflow, 0N/A * so this returns true for negative values 0N/A * if isFullRange is true.) 0N/A // %%% use byte[] buffer 0N/A // Reduce array values to the required range. 0N/A // The following code is a buffered version of this loop: 0N/A // for (int i = start; i < end; i++) 0N/A // writeTo(out, a[i]); 0N/A /** Tell if the range of this coding (number of distinct 0N/A * representable values) can be expressed in 32 bits. 0N/A /** Tell if this coding can represent all 32-bit values. 0N/A * Note: Some codings, such as unsigned ones, can be neither 0N/A * subranges nor full-range codings. 0N/A /** Return the number of values this coding (a subrange) can represent. */ 0N/A return (
max -
min) +
1;
// range includes both min & max 0N/A /** Return a coding suitable for representing summed, modulo-reduced values. */ 0N/A /** Reduce the given value to be within this coding's unsigned range, 0N/A * by adding or subtracting a multiple of (max-min+1). 0N/A // already in unsigned range 0N/A // already in signed range 0N/A // 32-bit overflow, but the next '%=' op needs to be unsigned 0N/A /** Does this coding support at least one negative value? 0N/A Includes codings that can do so via 32-bit wraparound. 0N/A /** Does this coding code arrays by making successive differences? */ 0N/A public int B() {
return B; }
0N/A public int H() {
return H; }
0N/A public int L() {
return L; }
0N/A public int S() {
return S; }
0N/A /** Heuristic measure of the difference between two codings. */ 0N/A // Distance in log space of H (<=128) and L (<128). 0N/A // Double the accuracy of the log: 0N/A // Follow H in log space by the multiplicative inverse of L. 0N/A if (H <=
128)
return H;
0N/A if (L >=
1)
return 128*
128/L;
0N/A /** ceiling(log[2](x)): {1->0, 2->1, 3->2, 4->2, ...} */ 0N/A assert(x-
1 >=
0);
// x in range (int.MIN_VALUE -> 32) 0N/A for (
int i =
10; i >=
0; i = (i <<
1) - (i >>
3)) {
0N/A /** Number of significant bits in i, not counting sign bits. 0N/A * For positive i, it is ceil_lg2(i + 1). 0N/A if (i <
0) i = ~i;
// change sign 0N/A //assert(w == ceil_lg2(i + 1)); 0N/A /** Create an array of successive differences. 0N/A * If min==max, accept any and all 32-bit overflow. 0N/A * Otherwise, avoid 32-bit overflow, and reduce all differences 0N/A * to a value in the given range, by adding or subtracting 0N/A * multiples of the range cardinality (max-min+1). 0N/A * Also, the values are assumed to be in the range [0..(max-min)]. 0N/A // Reduce delta values to the required range. 0N/A // We will force the values to reduce to the right subrange. 0N/A // Huge range; delta values must assume full 32-bit range. 0N/A // final values must be representable 0N/A // Calculate max, min: 0N/A for (
int i =
1; i <
len; i++) {
0N/A /** How many bytes are in the coding of this value? 0N/A * Returns Integer.MAX_VALUE if the value has no coding. 0N/A * The coding must not be a delta coding, since there is no 0N/A * definite size for a single value apart from its context. 0N/A for (
int n =
0; n < B; n++) {
0N/A for (
int n =
0; n < B; n++) {
0N/A //return Coding.of(B, H, S).getLength(deltas, 0, len); 0N/A // add extra bytes for extra-long values 0N/A for (
int n =
1; n <= B; n++) {
0N/A // what is the coding interval [min..max] for n bytes? 0N/A int longer =
0;
// count of guys longer than n bytes 0N/A for (
int i =
0; i <
len; i++) {
0N/A if (
longer ==
0)
break;
// no more passes needed 0N/A || (B ==
1 && H !=
256)
0N/A || (B ==
5 && H ==
256)) {
0N/A return "("+B+
","+H+
","+S+
","+
del+
")";
0N/A // If -ea, print out more informative strings! 0N/A //assert((str = stringForDebug()) != null); 0N/A for (
int n =
1; n <= B; n++) {