Class SkipExistingDocumentsProcessorFactory

  • All Implemented Interfaces:
    UpdateRequestProcessorFactory.RunAlways, NamedListInitializedPlugin, SolrCoreAware

    public class SkipExistingDocumentsProcessorFactory
    extends UpdateRequestProcessorFactory
    implements SolrCoreAware, UpdateRequestProcessorFactory.RunAlways

    This Factory generates an UpdateProcessor that will (by default) skip inserting a new document if a document with the same uniqueKey value already exists in the index. It will also skip Atomic Updates to a document if that document does not already exist. This behaviour is applied to each document in turn, so adding a batch of documents can result in some being added and some ignored, depending on what is already in the index. If all of the documents are skipped, no changes to the index will occur.

    These two forms of skipping can be switched on or off independently, by using init params:
    • skipInsertIfExists - This boolean parameter defaults to true, but if set to false then inserts (i.e. not Atomic Updates) will be passed through unchanged even if the document already exists.
    • skipUpdateIfMissing - This boolean parameter defaults to true, but if set to false then Atomic Updates will be passed through unchanged regardless of whether the document exists.

    These params can also be specified per-request to override the configured behaviour for specific updates, e.g. /update?skipUpdateIfMissing=true
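
    The skipping rules and the per-request override described above can be modelled in a few lines of plain Java. This is only an illustrative sketch, not the Solr implementation: the class SkipDecisionSketch and its methods resolve and shouldSkip are hypothetical names, the request parameters are represented as a simple Map, and the index lookup is reduced to a boolean.

```java
import java.util.Map;

// Illustrative model only; hypothetical names, not the Solr source code.
public class SkipDecisionSketch {

    /** A per-request parameter, if present, overrides the configured value. */
    public static boolean resolve(Map<String, String> requestParams,
                                  String name, boolean configured) {
        String override = requestParams.get(name);
        return override == null ? configured : Boolean.parseBoolean(override);
    }

    /**
     * Decide whether a single document should be skipped.
     * Inserts are skipped only if the document exists and skipInsertIfExists is on.
     * Atomic Updates are skipped only if the document is missing and
     * skipUpdateIfMissing is on.
     */
    public static boolean shouldSkip(boolean isAtomicUpdate, boolean existsInIndex,
                                     boolean skipInsertIfExists,
                                     boolean skipUpdateIfMissing) {
        if (isAtomicUpdate) {
            return skipUpdateIfMissing && !existsInIndex;
        }
        return skipInsertIfExists && existsInIndex;
    }

    public static void main(String[] args) {
        // Assumed configured values: skip duplicate inserts, pass updates through.
        boolean configuredSkipInsert = true;
        boolean configuredSkipUpdate = false;

        // Simulates /update?skipUpdateIfMissing=true
        Map<String, String> params = Map.of("skipUpdateIfMissing", "true");

        boolean skipInsert = resolve(params, "skipInsertIfExists", configuredSkipInsert);
        boolean skipUpdate = resolve(params, "skipUpdateIfMissing", configuredSkipUpdate);

        System.out.println("insert of existing doc skipped: "
                + shouldSkip(false, true, skipInsert, skipUpdate));
        System.out.println("atomic update of missing doc skipped: "
                + shouldSkip(true, false, skipInsert, skipUpdate));
    }
}
```

    Because the decision is made per document, running shouldSkip over each document in a batch gives the mixed "some added, some ignored" outcome described earlier.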

    This implementation is a simpler alternative to DocBasedVersionConstraintsProcessorFactory when you are not concerned with versioning, and just want to quietly ignore duplicate documents and/or silently skip updates to non-existent documents (in the same way a database UPDATE would). If your documents do have an explicit version field, and you want to ensure older versions are skipped instead of replacing the indexed document, you should consider DocBasedVersionConstraintsProcessorFactory instead.

    An example chain configuration to use this for skipping duplicate inserts, but not skipping updates to missing documents by default, is:

     <updateRequestProcessorChain name="skipexisting">
       <processor class="solr.LogUpdateProcessorFactory" />
       <processor class="solr.SkipExistingDocumentsProcessorFactory">
         <bool name="skipInsertIfExists">true</bool>
         <bool name="skipUpdateIfMissing">false</bool> <!-- Can override this per-request -->
       </processor>
       <processor class="solr.DistributedUpdateProcessorFactory" />
       <processor class="solr.RunUpdateProcessorFactory" />