Coding.java revision 3063
4756N/A * Copyright (c) 2001, 2005, Oracle and/or its affiliates. All rights reserved. 4756N/A * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER. 4756N/A * This code is free software; you can redistribute it and/or modify it 4756N/A * under the terms of the GNU General Public License version 2 only, as 4756N/A * published by the Free Software Foundation. Oracle designates this 4756N/A * particular file as subject to the "Classpath" exception as provided 4756N/A * by Oracle in the LICENSE file that accompanied this code. 4756N/A * This code is distributed in the hope that it will be useful, but WITHOUT 4756N/A * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or 4756N/A * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License 4756N/A * version 2 for more details (a copy is included in the LICENSE file that 4756N/A * You should have received a copy of the GNU General Public License version 4756N/A * 2 along with this work; if not, write to the Free Software Foundation, 4756N/A * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA. 4756N/A * Please contact Oracle, 500 Oracle Parkway, Redwood Shores, CA 94065 USA 4756N/A * or visit www.oracle.com if you need additional information or have any 4756N/A * Define the conversions between sequences of small integers and raw bytes. 4756N/A * This is a schema of encodings which incorporates varying lengths, 4756N/A * varying degrees of length variability, and varying amounts of signed-ness. 4756N/A Coding schema for single integers, parameterized by (B,H,S): 4756N/A Let B in [1,5], H in [1,256], S in [0,3]. 4756N/A (S limit is arbitrary. B follows the 32-bit limit. H is byte size.) 4756N/A A given (B,H,S) code varies in length from 1 to B bytes. 4756N/A The 256 values a byte may take on are divided into L=(256-H) and H 4756N/A values, with all the H values larger than the L values. 4756N/A (That is, the L values are [0,L) and the H are [L,256).) 4756N/A The last byte is always either the B-th byte, a byte with "L value" 4756N/A (<L), or both. There is no other byte that satisfies these conditions. 4756N/A All bytes before the last always have "H values" (>=L). 4756N/A Therefore, if L==0, the code always has the full length of B bytes. 4756N/A The coding then becomes a classic B-byte little-endian unsigned integer. 4756N/A (Also, if L==128, the high bit of each byte acts signals the presence 4756N/A of a following byte, up to the maximum length.) 4756N/A In the unsigned case (S==0), the coding is compact and monotonic 4756N/A in the ordering of byte sequences defined by appending zero bytes 4756N/A to pad them to a common length B, reversing them, and ordering them 4756N/A lexicographically. (This agrees with "little-endian" byte order.) 4756N/A Therefore, the unsigned value of a byte sequence may be defined as: 4756N/A or [L..L*(1+H) + H^2) if B==2 4756N/A U(b0,b1,b2) == b0 + b1*H + b2*H^2 4756N/A or [L*(1+H)..L*(1+H+H^2) + H^3) if B==3 4756N/A U(b[i]: i<n) == Sum[i<n]( b[i] * H^i ) 4756N/A or to L*Sum[i<n]( H^i ) + H^n if n==B 4756N/A (**) If B==1, the values H,L play no role in the coding. 4756N/A As a convention, we require that any (1,H,S) code must always 4756N/A encode values less than H. Thus, a simple unsigned byte is coded 4756N/A specifically by the code (1,256,0). (Properly speaking, the unsigned case should be parameterized as S==Infinity. If the schema were regular, the case S==0 would really denote a numbering in which all coded values are negative.) If S>0, the unsigned value of a byte sequence is regarded as a binary integer. If any of the S low-order bits are zero, the corresponding signed value will be non-negative. If all of the S low-order bits (S>0) are one, the the corresponding signed value will be negative. The non-negative signed values are compact and monotonically increasing (from 0) in the ordering of the corresponding unsigned values. The negative signed values are compact and monotonically decreasing (from -1) in the ordering of the corresponding unsigned values. In essence, the low-order S bits function as a collective sign bit for negative signed numbers, and as a low-order base-(2^S-1) digit for non-negative signed numbers. Therefore, the signed value corresponding to an unsigned value is: Sgn(x) == (x / 2^S)*(2^S-1) + (x % 2^S), if S>0, (x % 2^S) < 2^S-1 Sgn(x) == -(x / 2^S)-1, if S>0, (x % 2^S) == 2^S-1 Finally, the value of a byte sequence, given the coding parameters V(b[i]: i<n) == Sgn(U(b[i]: i<n)) The extremal positive and negative signed value for a given range of unsigned values may be found by sign-encoding the largest unsigned value which is not 2^S-1 mod 2^S, and that which is, respectively. Because B,H,S are variable, this is not a single coding but a schema of codings. For optimal compression, it is necessary to adaptively select specific codings to the data being compressed. For example, if a sequence of values happens never to be negative, S==0 is the best choice. If the values are equally balanced between negative and positive, S==1. If negative values are rare, then S>1 A (B,H,S) encoding is called a "subrange" if it does not encode the largest 32-bit value, and if the number R of values it does encode can be expressed as a positive 32-bit value. (Note that B=1 implies R<=256, B=2 implies R<=65536, etc.) A delta version of a given (B,H,S) coding encodes an array of integers by writing their successive differences in the (B,H,S) coding. The original integers themselves may be recovered by making a running accumulation of sum of the differences as they are read. As a special case, if a (B,H,S) encoding is a subrange, its delta version will only encode arrays of numbers in the coding's unsigned range, [0..R-1]. The coding of deltas is still in the normal signed range, if S!=0. During delta encoding, all subtraction results are reduced to the signed range, by adding multiples of R. Likewise, . during encoding, all addition results are reduced to the unsigned range. This special case for subranges allows the benefits of wraparound when encoding correlated sequences of very small positive numbers. // Code range for a all (B,H) codes of length <=nMax (<=B). // n < B: L*Sum[i<n]( H^i ) // n == B: L*Sum[i<B]( H^i ) + H^B assert(B >=
1 && B <=
5);
assert(H >=
1 && H <=
256);
if (
nMax ==
0)
return 0;
// no codes of zero length if (B ==
1)
return H;
// special case; see (**) above for (
int n =
1; n <=
nMax; n++) {
/** Largest int representable by (B,H,S) in up to nMax bytes. */ public static int codeMax(
int B,
int H,
int S,
int nMax) {
//assert(S >= 0 && S <= S_MAX); return -
1;
// degenerate max value for empty set of codes if (S ==
0 ||
range >= (
long)
1<<
32)
if (
maxPos <
0)
return -
1;
// No positive codings at all. // check for 32-bit wraparound: /** Smallest int representable by (B,H,S) in up to nMax bytes. Returns Integer.MIN_VALUE if 32-bit wraparound covers the entire negative range. public static int codeMin(
int B,
int H,
int S,
int nMax) {
//assert(S >= 0 && S <= S_MAX); // Can code negative values via 32-bit wraparound. if (
maxNeg <
0)
return 0;
// No negative codings at all. // Some of the arithmetic below is on unsigned 32-bit integers. // These must be represented in Java as longs in the range [0..2^32-1]. // The conversion to a signed int is just the Java cast (int), but // the conversion to an unsigned int is the following little method: return ((
long)
sx <<
32) >>>
32;
assert(
ux >= -
1);
// can be out of 32-bit range; who cares // If S>=2 very low negatives are coded by 32-bit-wrapped positives. // The lowest negative representable by a negative coding is // ~(umax32 >> S), and the next lower number is coded by wrapping // CodePos(umax32-1) -> (umax32-1)-((umax32-1)>>S) // which simplifies to ~(umax32 >> S)-1. return (
0 >
sx) && (
sx >= ~(-
1>>>S));
return (
int)
ux;
// cast to signed int // Sgn(x) == -(x / 2^S)-1 // Sgn(x) == (x / 2^S)*(2^S-1) + (x % 2^S) sx = (
int)
ux - ((
int)
ux >>> S);
// Assert special case of S==1: assert(!(S ==
1) ||
sx == (((
int)
ux >>>
1) ^ -((
int)
ux &
1)));
// InvSgn(sx) = (sx / (2^S-1))*2^S + (sx % (2^S-1)) // InvSgn(sx) = (-sx-1)*2^S + (2^S-1) // Top-level coding of single integers: for (
int i =
0; i < B-
1; i++) {
int b_i = (
int)( L + (
sum % H) );
// Report number of bytes written by updating outpos[0]: // Check right away for mis-coding. //assert(sx == readInt(out, new int[1], B, H, S)); public static int readInt(
byte[]
in,
int[]
inpos,
int B,
int H,
int S) {
// U(b[i]: i<n) == Sum[i<n]( b[i] * H^i ) for (
int i =
0; i < B; i++) {
//assert(sum >= 0 && sum < codeRangeLong(B, H)); // Report number of bytes read by updating inpos[0]: // The Stream version doesn't fetch a byte unless it is needed for coding. // U(b[i]: i<n) == Sum[i<n]( b[i] * H^i ) for (
int i =
0; i < B; i++) {
public static final int B_MAX =
5;
/* B: [1,5] */ public static final int H_MAX =
256;
/* H: [1,256] */ public static final int S_MAX =
2;
/* S: [0,2] */ private final int B;
/*1..5*/ // # bytes (1..5) private final int H;
/*1..256*/ // # codes requiring a higher byte private final int L;
/*0..255*/ // # codes requiring a higher byte private final int S;
/*0..3*/ // # low-order bits representing sign private final int del;
/*0..2*/ // type of delta encoding (0 == none) private final int min;
// smallest representable value private final int max;
// largest representable value private final int umin;
// smallest representable uns. value private final int umax;
// largest representable uns. value private final int[]
byteMin;
// smallest repr. value, given # bytes private final int[]
byteMax;
// largest repr. value, given # bytes private Coding(
int B,
int H,
int S) {
private Coding(
int B,
int H,
int S,
int del) {
if (!(x
instanceof Coding))
return false;
if (
this.B !=
that.B)
return false;
if (
this.H !=
that.H)
return false;
if (
this.S !=
that.S)
return false;
return (
del<<
14)+(S<<
11)+(B<<
8)+(H<<
0);
private static synchronized Coding of(
int B,
int H,
int S,
int del) {
public static Coding of(
int B,
int H,
int S) {
/** Can this coding represent a single value, possibly a delta? * This ignores the D property. That is, for delta codings, * this tests whether a delta value of 'x' can be coded. * For signed delta codings which produce unsigned end values, * use canRepresentUnsigned. return (x >=
min && x <=
max);
/** Can this coding, apart from its S property, * represent a single value? (Negative values * can only be represented via 32-bit overflow, * so this returns true for negative values * if isFullRange is true.) byte[]
buf =
new byte[B];
// Reduce array values to the required range. // The following code is a buffered version of this loop: // for (int i = start; i < end; i++) byte[]
buf =
new byte[
1<<
8];
/** Tell if the range of this coding (number of distinct * representable values) can be expressed in 32 bits. /** Tell if this coding can represent all 32-bit values. * Note: Some codings, such as unsigned ones, can be neither * subranges nor full-range codings. /** Return the number of values this coding (a subrange) can represent. */ return (
max -
min) +
1;
// range includes both min & max /** Return a coding suitable for representing summed, modulo-reduced values. */ /** Reduce the given value to be within this coding's unsigned range, * by adding or subtracting a multiple of (max-min+1). // already in unsigned range // already in signed range // 32-bit overflow, but the next '%=' op needs to be unsigned /** Does this coding support at least one negative value? Includes codings that can do so via 32-bit wraparound. /** Does this coding code arrays by making successive differences? */ public int B() {
return B; }
public int H() {
return H; }
public int L() {
return L; }
public int S() {
return S; }
public int del() {
return del; }
public int min() {
return min; }
public int max() {
return max; }
/** Heuristic measure of the difference between two codings. */ // Distance in log space of H (<=128) and L (<128). // Double the accuracy of the log: // Follow H in log space by the multiplicative inverse of L. if (L >=
1)
return 128*
128/L;
/** ceiling(log[2](x)): {1->0, 2->1, 3->2, 4->2, ...} */ assert(x-
1 >=
0);
// x in range (int.MIN_VALUE -> 32) for (
int i =
10; i >=
0; i = (i <<
1) - (i >>
3)) {
/** Number of significant bits in i, not counting sign bits. * For positive i, it is ceil_lg2(i + 1). if (i <
0) i = ~i;
// change sign //assert(w == ceil_lg2(i + 1)); /** Create an array of successive differences. * If min==max, accept any and all 32-bit overflow. * Otherwise, avoid 32-bit overflow, and reduce all differences * to a value in the given range, by adding or subtracting * multiples of the range cardinality (max-min+1). * Also, the values are assumed to be in the range [0..(max-min)]. for (
int i =
0; i <
count; i++) {
for (
int i =
0; i <
count; i++) {
// Reduce delta values to the required range. // We will force the values to reduce to the right subrange. // Huge range; delta values must assume full 32-bit range. // final values must be representable if (
len ==
0)
return true;
for (
int i =
1; i <
len; i++) {
/** How many bytes are in the coding of this value? * Returns Integer.MAX_VALUE if the value has no coding. * The coding must not be a delta coding, since there is no * definite size for a single value apart from its context. for (
int n =
0; n < B; n++) {
for (
int n =
0; n < B; n++) {
if (L ==
0)
return len * B;
//return Coding.of(B, H, S).getLength(deltas, 0, len); int sum =
len;
// at least 1 byte per // add extra bytes for extra-long values for (
int n =
1; n <= B; n++) {
// what is the coding interval [min..max] for n bytes? int longer =
0;
// count of guys longer than n bytes for (
int i =
0; i <
len; i++) {
if (
longer ==
0)
break;
// no more passes needed (
byte)(
del +
2*S +
8*(B-
1)),
if (!((
1 <= B && B <=
B_MAX) &&
(
0 <= S && S <=
S_MAX) &&
(
1 <= H && H <=
H_MAX) &&
|| (B ==
5 && H ==
256)) {
return "("+B+
","+H+
","+S+
","+
del+
")";
// If -ea, print out more informative strings! //assert((str = stringForDebug()) != null); for (
int n =
1; n <= B; n++) {