org.apache.lucene.analysis.query

Class QueryAutoStopWordAnalyzer

  • All Implemented Interfaces:
    Closeable, AutoCloseable


    public final class QueryAutoStopWordAnalyzer
    extends AnalyzerWrapper
    An Analyzer used primarily at query time to wrap another analyzer and provide a layer of protection which prevents very common words from being passed into queries.

    For very large indexes, the cost of reading TermDocs for a very common word can be high. This analyzer was created after experience with a 38-million-document index in which one term occurred in around 50% of documents, causing TermQueries for that term to take 2 seconds.
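
    The wrapping idea described above can be illustrated without any Lucene types. The following is a minimal sketch only: StopWordWrappingSketch, its delegate function, and whitespaceLowercase are hypothetical stand-ins invented for illustration, not part of the Lucene API. It shows a wrapper that runs a delegate tokenizer and then drops tokens found in a set of very common terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical stand-in for an analyzer wrapper; NOT a Lucene class. */
final class StopWordWrappingSketch {
    private final Function<String, List<String>> delegate; // stand-in for the wrapped analyzer
    private final Set<String> stopWords;                   // terms judged too common to query

    StopWordWrappingSketch(Function<String, List<String>> delegate, Set<String> stopWords) {
        this.delegate = delegate;
        this.stopWords = stopWords;
    }

    /** Runs the delegate, then drops every token found in the stop-word set. */
    List<String> analyze(String queryText) {
        List<String> kept = new ArrayList<>();
        for (String token : delegate.apply(queryText)) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    /** Trivial delegate used for illustration: lowercase, then split on whitespace. */
    static List<String> whitespaceLowercase(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }
}
```

    In this sketch the common term never reaches the query, so the expensive postings read for it is avoided; the remaining tokens pass through unchanged.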

    • Constructor Detail

      • QueryAutoStopWordAnalyzer

        public QueryAutoStopWordAnalyzer(Analyzer delegate,
                                         IndexReader indexReader)
                                  throws IOException
        Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency percentage greater than defaultMaxDocFreqPercent
        Parameters:
        delegate - Analyzer whose TokenStream will be filtered
        indexReader - IndexReader to identify the stopwords from
        Throws:
        IOException - Can be thrown while reading from the IndexReader
      • QueryAutoStopWordAnalyzer

        public QueryAutoStopWordAnalyzer(Analyzer delegate,
                                         IndexReader indexReader,
                                         int maxDocFreq)
                                  throws IOException
        Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency greater than the given maxDocFreq
        Parameters:
        delegate - Analyzer whose TokenStream will be filtered
        indexReader - IndexReader to identify the stopwords from
        maxDocFreq - Document frequency above which a term is treated as a stopword
        Throws:
        IOException - Can be thrown while reading from the IndexReader
      • QueryAutoStopWordAnalyzer

        public QueryAutoStopWordAnalyzer(Analyzer delegate,
                                         IndexReader indexReader,
                                         float maxPercentDocs)
                                  throws IOException
        Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency percentage greater than the given maxPercentDocs
        Parameters:
        delegate - Analyzer whose TokenStream will be filtered
        indexReader - IndexReader to identify the stopwords from
        maxPercentDocs - The maximum fraction (between 0.0 and 1.0) of index documents that may contain a term; a term found in a larger share of documents is considered a stop word
        Throws:
        IOException - Can be thrown while reading from the IndexReader
      • QueryAutoStopWordAnalyzer

        public QueryAutoStopWordAnalyzer(Analyzer delegate,
                                         IndexReader indexReader,
                                         Collection<String> fields,
                                         float maxPercentDocs)
                                  throws IOException
        Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for the given selection of fields from terms with a document frequency percentage greater than the given maxPercentDocs
        Parameters:
        delegate - Analyzer whose TokenStream will be filtered
        indexReader - IndexReader to identify the stopwords from
        fields - Selection of fields to calculate stopwords for
        maxPercentDocs - The maximum fraction (between 0.0 and 1.0) of index documents that may contain a term; a term found in a larger share of documents is considered a stop word
        Throws:
        IOException - Can be thrown while reading from the IndexReader
      • QueryAutoStopWordAnalyzer

        public QueryAutoStopWordAnalyzer(Analyzer delegate,
                                         IndexReader indexReader,
                                         Collection<String> fields,
                                         int maxDocFreq)
                                  throws IOException
        Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for the given selection of fields from terms with a document frequency greater than the given maxDocFreq
        Parameters:
        delegate - Analyzer whose TokenStream will be filtered
        indexReader - IndexReader to identify the stopwords from
        fields - Selection of fields to calculate stopwords for
        maxDocFreq - Document frequency above which a term is treated as a stopword
        Throws:
        IOException - Can be thrown while reading from the IndexReader
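
        The constructors above differ only in how the cutoff is expressed: as an absolute document frequency (maxDocFreq) or as a fraction of the index (maxPercentDocs). The sketch below shows one plausible way the two relate. It is an illustration under stated assumptions, not the library's implementation: DocFreqThresholdSketch and both of its methods are hypothetical, the truncating conversion from fraction to count is assumed, and a plain map stands in for the term statistics an IndexReader would supply.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical illustration of the two threshold styles; NOT a Lucene class. */
final class DocFreqThresholdSketch {

    /**
     * Converts a fraction of documents (0.0 to 1.0) into an absolute
     * document-frequency cutoff. Truncation toward zero is an assumption.
     */
    static int toMaxDocFreq(int numDocs, float maxPercentDocs) {
        return (int) (numDocs * maxPercentDocs);
    }

    /**
     * Returns the terms of one field whose document frequency is strictly
     * greater than maxDocFreq, matching the "greater than" wording above.
     */
    static Set<String> stopWordsFor(Map<String, Integer> docFreqByTerm, int maxDocFreq) {
        Set<String> stopWords = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : docFreqByTerm.entrySet()) {
            if (entry.getValue() > maxDocFreq) {
                stopWords.add(entry.getKey());
            }
        }
        return stopWords;
    }
}
```

        With 1000 documents and a fraction of 0.5f, the cutoff in this sketch is 500, so a term present in exactly 500 documents is kept and only terms present in more than 500 become stop words.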
    • Method Detail

      • getWrappedAnalyzer

        protected Analyzer getWrappedAnalyzer(String fieldName)
        Description copied from class: AnalyzerWrapper
        Retrieves the wrapped Analyzer appropriate for analyzing the field with the given name
        Specified by:
        getWrappedAnalyzer in class AnalyzerWrapper
        Parameters:
        fieldName - Name of the field which is to be analyzed
        Returns:
        Analyzer for the field with the given name. Assumed to be non-null
      • wrapComponents

        protected Analyzer.TokenStreamComponents wrapComponents(String fieldName,
                                                                Analyzer.TokenStreamComponents components)
        Description copied from class: AnalyzerWrapper
        Wraps / alters the given TokenStreamComponents, taken from the wrapped Analyzer, to form new components. It is through this method that new TokenFilters can be added by AnalyzerWrappers. By default, the given components are returned.
        Overrides:
        wrapComponents in class AnalyzerWrapper
        Parameters:
        fieldName - Name of the field which is to be analyzed
        components - TokenStreamComponents taken from the wrapped Analyzer
        Returns:
        Wrapped / altered TokenStreamComponents.
      • getStopWords

        public String[] getStopWords(String fieldName)
        Provides information on which stop words have been identified for a field
        Parameters:
        fieldName - The field whose identified stop words will be returned
        Returns:
        the stop words identified for a field
      • getStopWords

        public Term[] getStopWords()
        Provides information on which stop words have been identified for all fields
        Returns:
        the stop words (as terms)
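
        The two getStopWords overloads expose the same data in two shapes: per field as plain strings, and across all fields as field/text pairs. The sketch below illustrates that relationship only. StopWordViewsSketch and its put method are hypothetical, "field:text" strings stand in for Term objects, and returning an empty array for an unknown field is an assumption rather than documented behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical illustration of per-field and all-field views; NOT a Lucene class. */
final class StopWordViewsSketch {
    // Sorted containers keep the output order deterministic for the example.
    private final Map<String, Set<String>> stopWordsPerField = new TreeMap<>();

    /** Hypothetical helper that records stop words for a field. */
    void put(String field, String... words) {
        stopWordsPerField.computeIfAbsent(field, f -> new TreeSet<>())
                .addAll(Arrays.asList(words));
    }

    /** Per-field view: the stop words for one field, or an empty array (assumed). */
    String[] getStopWords(String fieldName) {
        Set<String> words = stopWordsPerField.get(fieldName);
        return words == null ? new String[0] : words.toArray(new String[0]);
    }

    /** All-fields view: one "field:text" entry per stop word, standing in for a Term. */
    List<String> getStopWords() {
        List<String> all = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : stopWordsPerField.entrySet()) {
            for (String word : entry.getValue()) {
                all.add(entry.getKey() + ":" + word);
            }
        }
        return all;
    }
}
```

        The same word can appear under several fields in the all-fields view, because stop words are identified per field rather than globally.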