org.apache.lucene.classification.utils

Class DatasetSplitter



  • public class DatasetSplitter
    extends Object
    Utility class for creating training / test / cross validation indexes from the original index.
    • Constructor Detail

      • DatasetSplitter

        public DatasetSplitter(double testRatio,
                               double crossValidationRatio)
        Create a DatasetSplitter by specifying the sizes of the test and cross validation indexes.
        Parameters:
        testRatio - the ratio of the original index to be used for the test index, as a double between 0.0 and 1.0
        crossValidationRatio - the ratio of the original index to be used for the cross validation index, as a double between 0.0 and 1.0
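        The following is a minimal sketch of how the two ratios could translate into index sizes. It is not the DatasetSplitter implementation: the class SplitSizesSketch and the method splitSizes are hypothetical names, and the choices to truncate fractional counts, to reject ratios that sum to more than 1.0, and to assign the remaining documents to the training index are assumptions made for illustration.

```java
import java.util.Arrays;

public class SplitSizesSketch {

    /**
     * Hypothetical helper (not part of Lucene): returns the document counts
     * as {training, test, crossValidation} for a source index of totalDocs.
     */
    static int[] splitSizes(int totalDocs, double testRatio, double crossValidationRatio) {
        // Both ratios are documented as doubles between 0.0 and 1.0.
        if (testRatio < 0.0 || testRatio > 1.0
                || crossValidationRatio < 0.0 || crossValidationRatio > 1.0) {
            throw new IllegalArgumentException("ratios must be between 0.0 and 1.0");
        }
        // Assumption: the two ratios together must leave room for training data.
        if (testRatio + crossValidationRatio > 1.0) {
            throw new IllegalArgumentException("ratios must not sum to more than 1.0");
        }
        // Assumption: fractional document counts are truncated.
        int test = (int) (totalDocs * testRatio);
        int crossValidation = (int) (totalDocs * crossValidationRatio);
        // Assumption: whatever is left over goes to the training index.
        int training = totalDocs - test - crossValidation;
        return new int[] {training, test, crossValidation};
    }

    public static void main(String[] args) {
        // 1000 documents, 25% test, 25% cross validation
        System.out.println(Arrays.toString(splitSizes(1000, 0.25, 0.25)));
        // prints [500, 250, 250]
    }
}
```

        Under these assumptions, a testRatio of 0.25 and a crossValidationRatio of 0.25 leave half of the original documents for training.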
    • Method Detail

      • split

        public void split(IndexReader originalIndex,
                          Directory trainingIndex,
                          Directory testIndex,
                          Directory crossValidationIndex,
                          Analyzer analyzer,
                          boolean termVectors,
                          String classFieldName,
                          String... fieldNames)
                   throws IOException
        Split a given index into three indexes, used for training, test and cross validation tasks respectively.
        Parameters:
        originalIndex - an IndexReader on the source index
        trainingIndex - a Directory used to write the training index
        testIndex - a Directory used to write the test index
        crossValidationIndex - a Directory used to write the cross validation index
        analyzer - Analyzer used to create the new docs
        termVectors - true if term vectors should be kept
        classFieldName - name of the field used as the label for classification; this must be indexed with sorted doc values
        fieldNames - names of fields that need to be put in the new indexes or null if all should be used
        Throws:
        IOException - if any writing operation fails on any of the indexes