Automatic ID generation in Apache Solr

I have been working with Apache Solr for the last few months and have been receiving requests to speed up query processing. During the investigation, I found that generating unique ids for the retrieved documents contributes to query processing time, so I decided to write this post.

Data Structure

Our sample data structure (the fields section of schema.xml) looks like this:

  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
    <field name="name" type="text_general" indexed="true" stored="true" />
    <field name="_version_" type="long" indexed="true" stored="true" />
  </fields>

In addition, I specified which field should contain the unique identifier. This is also done in the schema.xml file:

<uniqueKey>id</uniqueKey>

Solr Configuration

Besides the changes to the schema.xml file, I need to modify the solrconfig.xml file and introduce a proper UpdateRequestProcessorChain, as shown below:

<updateRequestProcessorChain>
  <processor class="solr.UUIDUpdateProcessorFactory">
    <str name="fieldName">id</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

This tells Solr that the contents of the id field are to be generated automatically.
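To make the effect of this chain easier to picture, here is a minimal Python sketch of the idea. It is a conceptual model only, not Solr code: the function name assign_uuid_if_missing is hypothetical, and it assumes the processor fills the field with a random UUID only when the incoming document has no value for it.

```python
import uuid


def assign_uuid_if_missing(doc, field_name="id"):
    """Return a copy of doc with field_name set to a random UUID when absent.

    Hypothetical helper that mimics the idea behind the UUID update
    processor; it is not part of Solr.
    """
    result = dict(doc)
    if result.get(field_name) is None:
        # Assumption: a random (version 4) UUID, rendered as a string.
        result[field_name] = str(uuid.uuid4())
    return result


# A document sent without an id gets one generated.
doc = assign_uuid_if_missing({"name": "Test"})
print(doc["id"])

# A document that already carries an id is left untouched.
print(assign_uuid_if_missing({"id": "abc", "name": "Test"})["id"])
```

The generated value differs on every call, which is why two otherwise identical documents end up as separate entries in the index.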

Simple Test

Enough configuration; time to test it. Run the command below from a terminal to index a document before querying the index.

$> curl -XPOST 'localhost:8993/solr/update?commit=true' --data-binary '<add><doc><field name="name">Test</field></doc></add>' -H 'Content-type:application/xml'
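The same update request can also be prepared from a script. The following is a rough Python sketch using only the standard library; the helper names build_add_payload and build_update_request are hypothetical, and the URL simply copies the host, port, and path from the curl command above, so adjust it to your own setup.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Assumption: same host, port, and path as in the curl command above.
UPDATE_URL = "http://localhost:8993/solr/update?commit=true"


def build_add_payload(fields):
    """Build an <add><doc>...</doc></add> message from a dict of fields."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = str(value)  # ElementTree escapes special characters
    return ET.tostring(add, encoding="unicode")


def build_update_request(fields, url=UPDATE_URL):
    """Prepare (but do not send) a POST request carrying the XML payload."""
    return urllib.request.Request(
        url,
        data=build_add_payload(fields).encode("utf-8"),
        headers={"Content-Type": "application/xml"},
        method="POST",
    )


request = build_update_request({"name": "Test"})
print(request.data.decode("utf-8"))
# prints: <add><doc><field name="name">Test</field></doc></add>

# To actually send it to a running Solr instance:
# urllib.request.urlopen(request)
```

Note that the payload contains no id field; that is left to the update chain.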

If the curl command runs without errors, the document is indexed. To query the index, use the following command:

$> curl -XGET 'localhost:8993/solr/select?q=*:*&indent=true'

This returns the matching documents, as shown below:

<?xml version="1.0" encoding="UTF-8"?>
<response>
 <lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">0</int>
  <lst name="params">
   <str name="indent">true</str>
   <str name="q">*:*</str>
  </lst>
 </lst>
 <result name="response" numFound="1" start="0">
  <doc>
   <str name="name">Test</str>
   <str name="id">1cdee8b4-c42d-4101-8301-4dc350a4d522</str>
   <long name="_version_">1439726523307261952</long>
  </doc>
 </result>
</response>

If you analyze the response, you can see that the unique identifier was generated automatically. If you now run the same commands again (adding a document and querying), the result looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<response>
 <lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">1</int>
  <lst name="params">
   <str name="indent">true</str>
   <str name="q">*:*</str>
  </lst>
 </lst>
 <result name="response" numFound="2" start="0">
  <doc>
   <str name="name">Test</str>
   <str name="id">1cdee8b4-c42d-4101-8301-4dc350a4d522</str>
   <long name="_version_">1439726523307261952</long>
  </doc>
  <doc>
   <str name="name">Test</str>
   <str name="id">9bedcb5f-1b71-4ab7-80a9-9882a6bf319e</str>
   <long name="_version_">1439726693819351040</long>
  </doc>
 </result>
</response>

As you can see, the two documents have different unique identifiers, both generated by Solr.
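This check can also be scripted instead of done by eye. The sketch below is one possible way to do it in Python: it parses a trimmed copy of the response shown above (embedded as a string so that the example runs without a Solr instance) and confirms that every id is a well-formed UUID and that no two ids are equal. The function name extract_ids is hypothetical.

```python
import uuid
import xml.etree.ElementTree as ET

# Trimmed copy of the second response shown above.
RESPONSE = """<response>
 <result name="response" numFound="2" start="0">
  <doc>
   <str name="name">Test</str>
   <str name="id">1cdee8b4-c42d-4101-8301-4dc350a4d522</str>
  </doc>
  <doc>
   <str name="name">Test</str>
   <str name="id">9bedcb5f-1b71-4ab7-80a9-9882a6bf319e</str>
  </doc>
 </result>
</response>"""


def extract_ids(response_xml):
    """Return the id value of every document in a Solr XML response."""
    root = ET.fromstring(response_xml)
    return [el.text for el in root.findall("./result/doc/str[@name='id']")]


ids = extract_ids(RESPONSE)
for value in ids:
    uuid.UUID(value)  # raises ValueError if the id is not a well-formed UUID

print(len(ids), len(set(ids)))  # prints "2 2": two documents, two distinct ids
```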