org.apache.lucene.codecs.lucene54

Class Lucene54DocValuesFormat

  • All Implemented Interfaces:
    NamedSPILoader.NamedSPI


    public final class Lucene54DocValuesFormat
    extends DocValuesFormat
    Lucene 5.4 DocValues format.

    Encodes the five per-document value types (Numeric, Binary, Sorted, SortedSet, SortedNumeric) with these strategies:

    NUMERIC:

    • Delta-compressed: per-document integers written as deltas from the minimum value, compressed with bitpacking. For more information, see DirectWriter.
    • Table-compressed: when the number of unique values is very small (< 256), and when there are unused "gaps" in the range of values used (such as SmallFloat), a lookup table is written instead. Each per-document entry is then the ordinal into this table, and those ordinals are compressed with bitpacking (DirectWriter).
    • GCD-compressed: when all numbers share a common divisor, such as dates, the greatest common divisor (GCD) is computed, and quotients are stored using Delta-compressed Numerics.
    • Monotonic-compressed: when all numbers are monotonically increasing offsets, they are written as blocks of bitpacked integers, encoding the deviation from the expected delta.
    • Const-compressed: when there is only one possible non-missing value, only the missing bitset is encoded.
    • Sparse-compressed: only documents with a value are stored, and lookups are performed using binary search.
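
    The Delta-compressed and GCD-compressed strategies above can be illustrated with a minimal sketch. It is not the codec's writer: the class NumericEncodingSketch and all of its members are hypothetical names, it assumes a non-empty input whose value range fits in a long, and it only computes the numbers that would be bitpacked rather than writing them with DirectWriter.

```java
/** Illustrative only: hypothetical helper, not part of Lucene. */
final class NumericEncodingSketch {

    /** Result of the sketch: original value = min + gcd * quotient. */
    static final class Encoded {
        final long min;
        final long gcd;
        final long[] quotients;

        Encoded(long min, long gcd, long[] quotients) {
            this.min = min;
            this.gcd = gcd;
            this.quotients = quotients;
        }
    }

    /** Euclid's algorithm for non-negative inputs; gcd(0, x) == x. */
    static long gcd(long a, long b) {
        while (b != 0) {
            long t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    /** Bits needed to bitpack an unsigned value (0 bits for the value 0). */
    static int bitsRequired(long maxValue) {
        return 64 - Long.numberOfLeadingZeros(maxValue);
    }

    /** Assumes a non-empty array whose (max - min) does not overflow a long. */
    static Encoded encode(long[] values) {
        long min = values[0];
        for (long v : values) {
            min = Math.min(min, v);
        }

        // GCD of all deltas from the minimum value.
        long divisor = 0;
        for (long v : values) {
            divisor = gcd(divisor, v - min);
        }
        if (divisor == 0) {
            divisor = 1; // every value equals min; avoid dividing by zero
        }

        long[] quotients = new long[values.length];
        for (int i = 0; i < values.length; i++) {
            quotients[i] = (values[i] - min) / divisor;
        }
        return new Encoded(min, divisor, quotients);
    }

    /** Reverses the encoding for one document. */
    static long decode(Encoded e, int docID) {
        return e.min + e.gcd * e.quotients[docID];
    }
}
```

    For timestamps at 3, 5, and 10 days expressed in milliseconds, the largest plain delta (7 days) needs 30 bits, while the largest quotient after dividing by the GCD of one day (7) needs only 3 bits.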

    BINARY:

    • Fixed-width Binary: one large concatenated byte[] is written, along with the fixed length. Each document's value can be addressed directly with multiplication (docID * length).
    • Variable-width Binary: one large concatenated byte[] is written, along with end addresses for each document. The addresses are written as Monotonic-compressed numerics.
    • Prefix-compressed Binary: values are written in chunks of 16, with the first value written completely and other values sharing prefixes. Chunk addresses are written as Monotonic-compressed numerics. A reverse lookup index is written from a portion of every 1024th term.
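
    The addressing used by the Fixed-width and Variable-width strategies above can be sketched as follows. The class BinaryAddressingSketch and its methods are hypothetical; the sketch keeps the concatenated bytes and the end addresses in plain in-memory arrays, whereas the format stores the addresses as Monotonic-compressed numerics on disk.

```java
import java.util.Arrays;

/** Illustrative only: hypothetical helper, not part of Lucene. */
final class BinaryAddressingSketch {

    /** Fixed-width: every value has the same length, so the start is docID * length. */
    static byte[] fixedWidth(byte[] data, int length, int docID) {
        int start = docID * length;
        return Arrays.copyOfRange(data, start, start + length);
    }

    /**
     * Variable-width: endAddresses[docID] is the exclusive end of the value.
     * The start is the previous document's end address, or 0 for document 0.
     */
    static byte[] variableWidth(byte[] data, long[] endAddresses, int docID) {
        int start = docID == 0 ? 0 : (int) endAddresses[docID - 1];
        int end = (int) endAddresses[docID];
        return Arrays.copyOfRange(data, start, end);
    }
}
```

    A document with an empty value simply repeats the previous end address, so the address sequence never decreases, which is what makes it suitable for monotonic compression.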

    SORTED:

    • Sorted: a mapping of ordinals to deduplicated terms is written as Binary, along with the per-document ordinals written using one of the numeric strategies above.
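
    As a rough illustration of the ordinal mapping described above, the sketch below deduplicates and sorts the terms, then replaces each document's value with its position in that dictionary. The class SortedOrdinalsSketch is hypothetical; it uses String and its natural ordering for brevity (an assumption, since the format works on byte sequences) and assumes every document has a value.

```java
import java.util.Arrays;
import java.util.TreeSet;

/** Illustrative only: hypothetical helper, not part of Lucene. */
final class SortedOrdinalsSketch {

    /** Deduplicated terms in sorted order; a term's array index is its ordinal. */
    static String[] dictionary(String[] perDocValues) {
        return new TreeSet<>(Arrays.asList(perDocValues)).toArray(new String[0]);
    }

    /** Per-document ordinals, found by binary search in the sorted dictionary. */
    static int[] ordinals(String[] perDocValues, String[] dictionary) {
        int[] ords = new int[perDocValues.length];
        for (int doc = 0; doc < perDocValues.length; doc++) {
            ords[doc] = Arrays.binarySearch(dictionary, perDocValues[doc]);
        }
        return ords;
    }
}
```

    The resulting ordinals are small non-negative integers, which is why they can then be handed to any of the numeric strategies.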

    SORTED_SET:

    • Single: if all documents have 0 or 1 value, then data are written like SORTED.
    • SortedSet table: when there are few unique sets of values (< 256), each set is assigned an id, a lookup table is written, and the mapping from document to set id is written using the numeric strategies above.
    • SortedSet: a mapping of ordinals to deduplicated terms is written as Binary; an ordinal list and a per-document index into this list are written using the numeric strategies above.

    SORTED_NUMERIC:

    • Single: if all documents have 0 or 1 value, then data are written like NUMERIC.
    • SortedSet table: when there are few unique sets of values (< 256), each set is assigned an id, a lookup table is written, and the mapping from document to set id is written using the numeric strategies above.
    • SortedNumeric: a value list and a per-document index into this list are written using the numeric strategies above.
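
    The "value list plus per-document index" layout of the SortedNumeric strategy can be sketched as below. The class SortedNumericSketch is hypothetical, and the exact index layout is an assumption: here the index holds one start offset per document plus a final end offset, so a document's values occupy the half-open range between two consecutive offsets.

```java
import java.util.Arrays;

/** Illustrative only: hypothetical helper, not part of Lucene. */
final class SortedNumericSketch {
    /** All documents' values, concatenated in docID order. */
    final long[] values;
    /** Values of document d occupy [offsets[d], offsets[d + 1]). */
    final int[] offsets;

    SortedNumericSketch(long[][] perDoc) {
        offsets = new int[perDoc.length + 1];
        for (int doc = 0; doc < perDoc.length; doc++) {
            offsets[doc + 1] = offsets[doc] + perDoc[doc].length;
        }
        values = new long[offsets[perDoc.length]];
        for (int doc = 0; doc < perDoc.length; doc++) {
            long[] sorted = perDoc[doc].clone();
            Arrays.sort(sorted); // values within a document are kept sorted
            System.arraycopy(sorted, 0, values, offsets[doc], sorted.length);
        }
    }

    /** Returns a copy of the sorted values for one document (possibly empty). */
    long[] valuesFor(int docID) {
        return Arrays.copyOfRange(values, offsets[docID], offsets[docID + 1]);
    }
}
```

    Both arrays are plain integer sequences, and the offsets never decrease, so in this sketch each could be passed to one of the numeric strategies.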

    Files:

    1. .dvd: DocValues data
    2. .dvm: DocValues metadata