The delta encoding starts with a header, followed by blocks of bit-packed deltas. The header fields are:

- the block size is a multiple of 128; it is stored as a ULEB128 int.
- the miniblock count per block is a divisor of the block size such that their quotient, the number of values in a miniblock, is a multiple of 32; it is stored as a ULEB128 int.
- the total value count is stored as a ULEB128 int.
- the first value is stored as a zigzag ULEB128 int.

Each block contains:

- the min delta, stored as a zigzag ULEB128 int (we compute a minimum because bit packing needs non-negative integers).
- the bit width of each miniblock, stored as a byte.
- the miniblocks; each miniblock is a list of bit packed ints according to the bit width stored at the beginning of the block.

To encode a block:

1. Compute the differences between consecutive elements. For the first element in the block, use the last element in the previous block or, in the case of the first block, use the first value of the whole sequence, stored in the header.
2. Compute the frame of reference (the minimum of the deltas in the block). Subtract this min delta from all deltas in the block. This guarantees that all values are non-negative.
3. Encode the frame of reference (min delta) as a zigzag ULEB128 int followed by the bit widths of the miniblocks and the delta values (minus the min delta) bit packed per miniblock.

Having multiple blocks allows us to adapt to changes in the data by changing the frame of reference (the min delta). This can result in smaller values after the subtraction, which in turn can be stored with a lower bit width.

Run Length Encoding / Bit-Packing Hybrid (RLE = 3)

This encoding uses a combination of bit-packing and run length encoding to more efficiently store repeated values.

The grammar for this encoding looks like this, given a fixed bit-width known in advance:

    rle-bit-packed-hybrid: <length> <encoded-data>
    length := length of the <encoded-data> in bytes stored as 4 bytes little endian (unsigned int32)
    ...
    rle-header := varint-encode( (rle-run-len) << 1)

Using the PLAIN_DICTIONARY enum value is deprecated in the Parquet 2.0 specification. Prefer RLE_DICTIONARY in a data page and PLAIN in a dictionary page for Parquet 2.0+ files.
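The delta block encoding steps described in this section can be illustrated with a small Python sketch. It is not a full encoder: it computes the min delta, the adjusted deltas and the per-miniblock bit widths, and builds only the leading bytes of a block (zigzag ULEB128 min delta plus one width byte per miniblock), leaving out the actual bit packing. The function names and the `miniblock_size` parameter are hypothetical, chosen for illustration.

```python
from typing import List, Tuple


def zigzag(n: int) -> int:
    """Map a signed int to an unsigned one: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return (n << 1) if n >= 0 else ((-n << 1) - 1)


def uleb128(n: int) -> bytes:
    """Encode a non-negative int as ULEB128 (7 bits per byte, low bits first)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def delta_block(values: List[int], previous: int,
                miniblock_size: int = 32) -> Tuple[int, List[int], List[int]]:
    """Return (min_delta, adjusted_deltas, miniblock_bit_widths) for one block.

    `previous` is the last value of the preceding block, or the first value
    of the whole sequence (from the header) for the first block.
    """
    deltas = []
    for v in values:
        deltas.append(v - previous)
        previous = v
    min_delta = min(deltas)                      # frame of reference
    adjusted = [d - min_delta for d in deltas]   # all non-negative now
    widths = [
        max(adjusted[i:i + miniblock_size]).bit_length()
        for i in range(0, len(adjusted), miniblock_size)
    ]
    return min_delta, adjusted, widths


def block_prefix(min_delta: int, widths: List[int]) -> bytes:
    """Zigzag ULEB128 min delta followed by one byte per miniblock bit width."""
    return uleb128(zigzag(min_delta)) + bytes(widths)


# Toy input; miniblock_size=3 is for illustration only (real sizes are multiples of 32).
first_value, rest = 7, [5, 3, 1, 2, 3, 4, 5]
min_delta, adjusted, widths = delta_block(rest, previous=first_value, miniblock_size=3)
print(min_delta, adjusted, widths)
print(block_prefix(min_delta, widths).hex())
```

For the toy sequence 7, 5, 3, 1, 2, 3, 4, 5 the deltas are -2, -2, -2, 1, 1, 1, 1, so the sketch prints `-2 [0, 0, 0, 3, 3, 3, 3] [0, 2, 2]` followed by `03000202`. The first miniblock needs 0 bits per value because all of its adjusted deltas are zero.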
Dictionary Encoding (PLAIN_DICTIONARY = 2 and RLE_DICTIONARY = 8)

The dictionary encoding builds a dictionary of values encountered in a given column. The dictionary will be stored in a dictionary page per column chunk. The values are stored as integers using the RLE/Bit-Packing Hybrid encoding. If the dictionary grows too big, whether in size or number of distinct values, the encoding will fall back to the plain encoding. The dictionary page is written first, before the data pages of the column chunk.

Dictionary page format: the entries in the dictionary, in dictionary order, using the plain encoding.

Data page format: the bit width used to encode the entry ids stored as 1 byte (max bit width = 32), followed by the values encoded using RLE/Bit packed described above (with the given bit width).

This is the plain encoding that must be supported for types. The plain encoding is used whenever a more efficient encoding cannot be used. Type-specific formats include:

- INT96: 12 bytes little endian (deprecated)
- BYTE_ARRAY: length in 4 bytes little endian followed by the bytes contained in the array
- FIXED_LEN_BYTE_ARRAY: the bytes contained in the array

For native types, this outputs the data as little endian. Floating point types are encoded in IEEE. For the byte array type, it encodes the length as a 4 byte little endian, followed by the bytes.
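To make the dictionary page and the plain BYTE_ARRAY layout more concrete, here is a minimal Python sketch. It assumes dictionary ids are assigned in order of first appearance, which is one possible choice and not something the text above prescribes. It stops at the entry ids and their bit width and does not implement the RLE/bit-packed encoding of those ids. All function names are hypothetical.

```python
import struct
from typing import Dict, List, Tuple


def plain_byte_array(value: bytes) -> bytes:
    """PLAIN BYTE_ARRAY: 4-byte little-endian length, then the raw bytes."""
    return struct.pack("<I", len(value)) + value


def build_dictionary(values: List[bytes]) -> Tuple[List[bytes], List[int]]:
    """Return (entries in id order, entry id for each input value).

    Assumption: ids are handed out in order of first appearance.
    """
    ids: Dict[bytes, int] = {}
    indices = [ids.setdefault(v, len(ids)) for v in values]
    return list(ids), indices


def dictionary_page(entries: List[bytes]) -> bytes:
    """Dictionary page body: the entries, in dictionary order, plain encoded."""
    return b"".join(plain_byte_array(e) for e in entries)


def index_bit_width(num_entries: int) -> int:
    """Smallest bit width able to hold every entry id (0 .. num_entries - 1)."""
    return max(num_entries - 1, 0).bit_length()


column = [b"a", b"bb", b"a", b"c"]
entries, indices = build_dictionary(column)
print(entries, indices)
print(dictionary_page(entries).hex())
print(index_bit_width(len(entries)))
```

A data page would then start with the bit width as a single byte, followed by the ids encoded with the RLE/bit-packed hybrid at that width. A real writer would also track the dictionary size and fall back to plain encoding once it grows too large.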