Solr

From CollectiveAccess Documentation

About Solr

Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, a web administration interface and many more features. It runs in a Java servlet container such as Tomcat [http://lucene.apache.org/solr].

Solr Setup

A quick start tutorial is available here.

The Solr Project Wiki is a good starting point if you want to know more.

Solr Configuration

CollectiveAccess takes care of almost everything once you have a servlet container with Solr up and running. Solr version 1.3.0 has been tested and is supported; older versions may or may not work. To select Solr as your search engine, change the search_engine_plugin setting in app.conf:

search_engine_plugin = Solr

The first thing you need to do is change the search_solr_home_dir setting in app/conf/search.conf to match your setup:

search_solr_home_dir = /usr/local/solr

After that, make sure that the user you are currently logged in as can write to the directory given above, then empty the directory. CollectiveAccess will create a completely new Solr configuration for you, including everything you need to get started. Once the directory is empty, run the script support/utils/createSolrConfiguration.php with a PHP command line interpreter. If something goes wrong, the script should tell you the reason.
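The preconditions above (directory exists, is writable, is empty) can be checked before running the configuration script. The following is a minimal sketch, not part of CollectiveAccess; the function name solr_home_problems is hypothetical, and the path you pass in should be whatever you set as search_solr_home_dir.

```python
import os


def solr_home_problems(path):
    """Return a list of problems; an empty list means the directory looks ready.

    Hypothetical helper: it only checks the three conditions named in the text.
    """
    problems = []
    if not os.path.isdir(path):
        problems.append(f"{path} is not a directory")
        return problems
    # os.access checks permissions for the user running this script.
    if not os.access(path, os.W_OK):
        problems.append(f"{path} is not writable by the current user")
    if os.listdir(path):
        problems.append(f"{path} is not empty")
    return problems
```

Calling solr_home_problems("/usr/local/solr") and getting an empty list back would suggest that createSolrConfiguration.php can be run.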

Now things get a bit complicated. Since CollectiveAccess needs to be able to change the Apache Solr configuration at runtime, the web server user needs write permission on everything that is now in the Solr home directory (there should not be any data yet, just configuration files). For instance, if your Apache web server runs as 'www-data' (the typical setup on Debian-like Linux systems), you should recursively change the ownership of all files in the Solr home to 'www-data'. Alternatively you can change permissions to 777, but that is not recommended!
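To verify the ownership change, one could walk the Solr home and list everything that is still owned by someone else. This is only a sketch under the assumption of a Unix-like system; paths_not_owned_by is a hypothetical helper, and the uid of the web server user has to be looked up separately (for example with pwd.getpwnam("www-data").pw_uid).

```python
import os


def paths_not_owned_by(root, uid):
    """Return all paths under root (root included) whose owner is not uid, sorted."""
    mismatched = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # Each directory shows up once as dirpath, so checking dirpath plus
        # its files covers the whole tree.
        candidates = [dirpath] + [os.path.join(dirpath, f) for f in filenames]
        for path in candidates:
            if os.lstat(path).st_uid != uid:
                mismatched.append(path)
    return sorted(mismatched)
```

An empty result means every file and directory in the Solr home already belongs to the given user.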

After that, you need to restart your servlet container (e.g. Tomcat or Glassfish) and issue a full reindex using the support/utils/reindex.php command line PHP script. CollectiveAccess should take care of the rest.

Inside the Solr Plugin

For those of you who want to use Solr (and maybe the data we index in it) for anything other than CollectiveAccess, or who want to modify the automatic configuration process, here is some documentation of what we do.

Basically the whole CollectiveAccess Solr module consists of 3 parts:

  • createSolrConfiguration.php, the SolrConfiguration class and its templates: provides automatic Solr configuration
  • Solr.php: search engine plugin to provide searching facilities
  • SolrResult.php: search result wrapper

Since CollectiveAccess uses a separate index for each database entity (e.g. objects, entities ...) and Solr uses one big index for everything by default, we use the MultiCore feature. This enables us to address different indexes through different URLs.

General configuration

The createSolrConfiguration.php script (which invokes the static method SolrConfiguration::updateSolrConfiguration()) creates one core for each table that is set up to be indexed in search_indexing.conf. The core name and the instance path correspond to the primary table name of that entity (e.g. ca_objects). The general core setup is written to a file named solr.xml in the root of the Solr home directory. By default it looks like this:

<?xml version="1.0" encoding="UTF-8" ?>
<solr>
	<cores adminPath="/admin/cores">
		<core name="ca_objects" instanceDir="ca_objects" />
		...
	</cores>
</solr>

Solr supports runtime core administration. The "adminPath" attribute of the "cores" element defines the URL of the CoreAdminHandler. We are using that to keep the Solr configuration in sync with the CollectiveAccess configuration.

The "core" elements define which cores Solr is using and where they live.
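As an illustration of the file layout shown above, the following sketch builds an equivalent solr.xml from a list of table names. It is not the SolrConfiguration class itself; build_solr_xml is a hypothetical name, and the list of tables would in practice come from search_indexing.conf.

```python
import xml.etree.ElementTree as ET


def build_solr_xml(core_names, admin_path="/admin/cores"):
    """Build a multi-core solr.xml with one core per table name.

    Core name and instanceDir are both set to the table name, as described above.
    """
    solr = ET.Element("solr")
    cores = ET.SubElement(solr, "cores", adminPath=admin_path)
    for name in core_names:
        ET.SubElement(cores, "core", name=name, instanceDir=name)
    declaration = '<?xml version="1.0" encoding="UTF-8" ?>\n'
    return declaration + ET.tostring(solr, encoding="unicode")
```

For example, build_solr_xml(["ca_objects", "ca_entities"]) yields a document with two core elements.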

Per-core configuration

The configuration of each core consists of 2 files: solrconfig.xml and schema.xml, which live in <solrhome>/<core_name>/conf.

The solrconfig.xml file defines how we talk to Solr and how Solr talks to us. The file is the same for each core. The template resides in support/utils/solrplugin_templates. The key parts are the following:


<requestHandler name="standard" class="solr.StandardRequestHandler" default="true" />

This tells Solr to use the standard request handler, which means that we can send queries to <solr_url>/<core_name>/select using the core and common query parameters, which provide everything we need for our requests.
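A query URL for one core can be assembled as follows. This is a minimal sketch: select_url is a hypothetical helper, the base URL in the example is an assumption about your servlet container, and only the common parameters q, start and rows are used.

```python
from urllib.parse import urlencode


def select_url(solr_url, core_name, query, start=0, rows=10):
    """Build <solr_url>/<core_name>/select?q=...&start=...&rows=..."""
    params = urlencode({"q": query, "start": start, "rows": rows})
    return f"{solr_url.rstrip('/')}/{core_name}/select?{params}"
```

For instance, select_url("http://localhost:8080/solr", "ca_objects", "ca_objects.idno:2009.1") addresses the ca_objects core; the query string is URL-encoded automatically.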

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler" />

This one enables us to post updated data to <solr_url>/<core_name>/update.

<requestHandler name="/admin/" class="org.apache.solr.handler.admin.AdminHandlers" />

This line provides a very useful administration web interface at <solr_url>/<core_name>/admin/. Have a look (and do not omit the trailing slash in the URL)!

<queryResponseWriter name="php" class="org.apache.solr.request.PHPResponseWriter"/>
<queryResponseWriter name="phps" class="org.apache.solr.request.PHPSerializedResponseWriter"/>

These 2 lines enable non-standard response writers suitable for fetching data with PHP (without having to parse the Solr XML responses).

The schema.xml file defines a static "datamodel" for the core: a list of fields with certain properties. This list is derived from the search_indexing.conf configuration file, which allows 2 different field types for each field or "virtual field": those with the DONT_TOKENIZE option and those without it. The Solr schema.xml file mirrors that structure:

<types>
	<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
		<analyzer>
			<tokenizer class="solr.WhitespaceTokenizerFactory"/>
			<filter class="solr.LowerCaseFilterFactory"/>
		</analyzer>
	</fieldType>
	<fieldType name="string" class="solr.StrField" />
	<fieldType name="ignored" stored="false" indexed="false" class="solr.StrField" />
</types>

The first fieldType element ("text") represents the general field type (without the DONT_TOKENIZE option) of the search_indexing.conf, the second one ("string") is meant to be used for those fields with DONT_TOKENIZE enabled. The third one is there for technical reasons.

You might have noticed that there is a second option in the CollectiveAccess search_indexing.conf configuration file called "STORE", which lets you configure fields to be stored in the index to speed up retrieval (but slow down search performance). We don't need to define special field types for that option here, since Solr supports it directly through the "stored" attribute of each field.

After the Solr config generator has been run, all fields in search_indexing.conf should appear in the schema.xml file of the corresponding core. There should be a "fields" element with subelements, like this:

<fields>
...
	<field name="ca_objects.idno" type="string" indexed="true" stored="true" />
...
	<field name="ca_object_labels.name" type="text" indexed="true" stored="false" />
...
	<dynamicField name="*" type="ignored" />
</fields>

In this example you can see the usage of all available options: DONT_TOKENIZE and STORE in the field element "ca_objects.idno" and none of them in the field element "ca_object_labels.name". The last line tells Solr to ignore all fields that are not defined above rather than throwing an error (which is the default behavior). That's why we needed the "ignored" field type above.
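The mapping from the two search_indexing.conf options to field attributes can be sketched like this. The function field_element is hypothetical and only mirrors the rules stated above (DONT_TOKENIZE selects the "string" type, STORE sets stored="true"); the real generator lives in the SolrConfiguration class and its templates.

```python
from xml.sax.saxutils import quoteattr


def field_element(name, options=()):
    """Render one schema.xml <field> element from search_indexing.conf options."""
    field_type = "string" if "DONT_TOKENIZE" in options else "text"
    stored = "true" if "STORE" in options else "false"
    # quoteattr adds the surrounding quotes and escapes special characters.
    return (
        f"<field name={quoteattr(name)} type=\"{field_type}\" "
        f"indexed=\"true\" stored=\"{stored}\" />"
    )
```

With both options set, field_element("ca_objects.idno", ["DONT_TOKENIZE", "STORE"]) reproduces the first field line of the example; with no options, the result has type "text" and stored="false".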

You might also have noticed that we support indexing of dynamic fields ("metadata" in CollectiveAccess language) using the virtual field "_metadata" in search_indexing.conf. Of course we don't want to throw everything into one indexing field internally, so we break it down into single metadata elements as follows:

<field name="ca_objects._ca_attribute_22" type="text" indexed="true" stored="false" />
<field name="ca_objects._ca_attribute_23" type="text" indexed="true" stored="false" />
<field name="ca_objects._ca_attribute_24" type="text" indexed="true" stored="false" />

So the field name rule looks like this: <table_name>._ca_attribute_<element_id>. Since a record may carry several attribute values for the same element id, we concatenate them with a newline character where necessary.
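The naming rule and the newline concatenation can be sketched as follows. Both function names are hypothetical, and the input format (a list of element id and value pairs for one record) is an assumption made for the example.

```python
def attribute_field_name(table_name, element_id):
    """Apply the rule <table_name>._ca_attribute_<element_id>."""
    return f"{table_name}._ca_attribute_{element_id}"


def collect_attribute_values(table_name, attributes):
    """Group (element_id, value) pairs of one record by Solr field name.

    Values sharing an element id are joined with a newline character.
    """
    grouped = {}
    for element_id, value in attributes:
        name = attribute_field_name(table_name, element_id)
        grouped.setdefault(name, []).append(value)
    return {name: "\n".join(values) for name, values in grouped.items()}
```

A record with two values for element 22 therefore ends up with a single ca_objects._ca_attribute_22 field that contains both values, separated by a newline.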

There is one field left that remains to be explained:

<field name="text" type="text" indexed="true" stored="false" multiValued="true"/>

This is needed because of Solr's default behavior. If you pass a simple phrase query without field names, Solr looks for the query terms in a default field, not in every field of the index. So we need this workaround to support simple full text queries: a multivalued "text" field into which all other fields are copied at indexing time. Luckily this is easy to do in Solr:

...
<copyField source="ca_objects.idno" dest="text" />
...
<copyField source="ca_objects._ca_attribute_8" dest="text" />
...

Then you define the default field:

<defaultSearchField>text</defaultSearchField>

... and there you go.
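Generating the copyField lines for a list of field names is mechanical. The sketch below is an illustration only; copy_field_elements is a hypothetical name, and skipping the destination field itself is a precaution added here rather than something the original generator is known to do.

```python
from xml.sax.saxutils import quoteattr


def copy_field_elements(field_names, dest="text"):
    """Render one <copyField> element per source field, all pointing at dest."""
    return [
        f"<copyField source={quoteattr(name)} dest={quoteattr(dest)} />"
        for name in field_names
        if name != dest  # assumption: never copy the catch-all field into itself
    ]
```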

If you have taken a closer look at one of the schema.xml files, you might have noticed that there are two things left that have not been explained yet:

<uniqueKey>ca_objects.object_id</uniqueKey>
<solrQueryParser defaultOperator="AND"/>

The first line defines a unique key which is used by Solr to identify index documents. Luckily CollectiveAccess also uses similar ids internally so we can copy them without any problem. The cool thing about that is that Solr makes sure that this ID is unique among all documents so if you add a new document with an ID that already exists in this index, Solr overwrites it. You can also use this id for very fast delete calls.

The second line should be clear: we define the default operator. Since most people are used to the AND behaviour (which is also used by Google), meaning that the more query terms you enter, the more restrictive your search becomes, we use it as well.

Runtime configuration changes

Since the Solr indexing configuration may change frequently, we use Solr's CoreAdminHandler functionality to keep it (i.e. the schema.xml files) up to date. It has to be updated every time a) the CollectiveAccess search_indexing.conf configuration file is changed or b) a new metadata element is created. As these Solr config updates have to be 'backwards-compatible', we must never delete a field definition from a schema.xml file.

Therefore we cache an array (basically a field list with some options) using Zend Cache in the ca_cache namespace. The cache names are built as follows: ca_search_indexing_info_<table_name>. The caches are stored in app/tmp/.

Once a change to search_indexing.conf or the metadata element list is detected (the detector is called in WLPlugSearchEngineSolr::commitRowIndexing), we look into that cache, pull the field list out, merge it with the current indexing configuration (config file + metadata elements) and create the Solr configuration from the result (the static SolrConfiguration::updateSolrConfiguration() function does that). This way we never 'lose' a field and the Solr configuration is always backwards-compatible. As soon as the Solr configuration has been updated, we issue a RELOAD on all existing cores so that the new configuration is loaded and used.

The Solr configuration generator method SolrConfiguration::updateSolrConfiguration can, however, be told to ignore existing caches (by passing the parameter true), which might be useful for initial setups or cleanups.
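The merge-then-reload idea can be sketched as below. This is not the actual PHP implementation: merge_field_lists and reload_url are hypothetical names, field lists are modelled as plain dictionaries from field name to options, and letting the current configuration win on conflicts is an assumption. The RELOAD request uses the CoreAdminHandler's action and core parameters at the adminPath configured in solr.xml.

```python
from urllib.parse import urlencode


def merge_field_lists(cached, current):
    """Merge the cached field list with the current indexing configuration.

    Fields that exist only in the cache are kept, so no field is ever dropped.
    Assumption: on conflicting options, the current configuration wins.
    """
    merged = dict(cached)
    merged.update(current)
    return merged


def reload_url(solr_url, core_name, admin_path="/admin/cores"):
    """Build the CoreAdminHandler URL that reloads one core."""
    query = urlencode({"action": "RELOAD", "core": core_name})
    return f"{solr_url.rstrip('/')}{admin_path}?{query}"
```

After writing the merged configuration, one would request reload_url(...) once for every existing core.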

Indexing and searching

Indexing and searching a properly configured Solr is pretty much straightforward. We use Zend_Client to send GET requests or POST data to the core-specific handlers "select" or "update". Everything that concerns adding, deleting or updating documents is done according to this link. The process of sending GET requests to query Solr (the StandardRequestHandler at <core_name>/select) is described here. Since the CollectiveAccess query syntax is the Lucene query syntax and Apache Solr is built upon Java Lucene, we don't do any query parsing in the Solr plugin. The only thing we need to do is pass input queries (from the user interface) through utf8_decode before sending the GET request. Please also note that we do not use the StandardRequestHandler with its default output but a special PHP mode that makes it return a serialized PHP array, which we can use directly without having to parse the output XML.
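For orientation, the bodies posted to <core_name>/update follow Solr's XML update message format (add, delete and commit elements). The sketch below builds such messages; the function names are hypothetical and the plugin itself does this in PHP. Because the uniqueKey field identifies a document, posting an add message with an existing id replaces that document.

```python
import xml.etree.ElementTree as ET


def add_message(fields):
    """Build an <add> message for one document from a field name -> value dict."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        ET.SubElement(doc, "field", name=name).text = str(value)
    return ET.tostring(add, encoding="unicode")


def delete_message(unique_id):
    """Build a <delete> message that removes the document with this unique key."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = str(unique_id)
    return ET.tostring(delete, encoding="unicode")


# Changes become visible to searches only after a commit message is posted.
COMMIT_MESSAGE = "<commit/>"
```

Each of these strings would be sent as the body of a POST request with an XML content type.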

Implementation TODO List

  • query rewriting to support wildcard searches (as soon as general query parsing in the search engine is kind of fixed)
  • error handling (i.e. an administrative tool which tells you if something is wrong)
  • return query terms for highlighting