Delivering the "Right" Search Results

The Apache Solr search server that powers Acquia Search has many powerful features. One of the less appreciated ones is the ability to specify at query time that documents matching certain criteria should get an extra "boost" in their relevancy score. This means that they appear higher in the search results.

Imagine that you are maintaining a site and you have recently added Acquia Search. Your boss, Bob, is not pleased, however. He says, "I thought you told me this new search would do a better job of finding the most relevant results, but when I try it, the ones I expect to see come up first are not there." After you protest that the results are good matches to the keywords, further discussion reveals that Bob expects his blog posts to be the most relevant matches!

In the Apache Solr settings you can use the "Content bias settings" tab and "Search fields" tab to adjust the boost (see screen shot below). The boost can be set based on a range of properties including content types and node properties, as well as for cases where a keyword matches a certain node field or taxonomy vocabulary. By changing these configuration options, in most cases you can shift the results so they match the needs of your site. Given the problem with Bob's blog posts, you adjust the settings so that all Blog content gets an extra boost.

However, you may still find that the search results are not optimally relevant, especially if you have certain pieces of content that you think should be highlighted, or some pieces of content that you know are of particularly high quality. In this case, you can add a search boost at the node level to make these "important" nodes come to the top. You can write a very small amount of custom code in a site-specific module to get the desired result.

In our imagined scenario, Bob is still upset because the developers also write blog posts, and those tend to include more of the keywords, so they are better matches. He's also annoyed that when one of his blog posts does show up, it's one he wrote last month. If you have some way to automatically identify the "important" nodes, and the rules can be formulated as a Lucene query, then you may be able to transform those rules into code. For example, like this hook implementation:

<?php
/**
 * Implementation of hook_apachesolr_prepare_query().
 */
function MYMODULE_apachesolr_prepare_query(&$query, &$params) {
  // Posts by big boss Bob from within the past week are extra important.
  $params['bq'][] = "(name:Bob AND created:[NOW-7DAYS TO *])^5";
}
?>

This hook gets called for each search query executed and allows modules to add to or alter the query parameters. The number after the '^' is the boost value applied if the document matches this query. Note the use of the date range syntax with date math to get a time range relative to the current time.
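To make the pieces of that boost query easier to see, here is a minimal sketch that assembles the same string from its parts. The helper name mymodule_recent_author_boost() and its parameters are hypothetical, made up for illustration; only the resulting query syntax comes from the example above. It also assumes a simple one-word author name that needs no escaping.

```php
<?php
/**
 * Hypothetical helper (not part of the apachesolr module): builds a boost
 * query for recent posts by one author.
 */
function mymodule_recent_author_boost($author, $days, $boost) {
  // Solr date math: NOW-7DAYS means "seven days before the current time".
  $range = 'created:[NOW-' . (int) $days . 'DAYS TO *]';
  // The parentheses make the ^boost apply to the whole AND clause.
  // Assumption: $author is a plain single word with no special characters.
  return '(name:' . $author . ' AND ' . $range . ')^' . $boost;
}

/**
 * Implementation of hook_apachesolr_prepare_query().
 */
function MYMODULE_apachesolr_prepare_query(&$query, &$params) {
  // Produces the same string as the hard-coded example.
  $params['bq'][] = mymodule_recent_author_boost('Bob', 7, 5);
}
```

Written this way, widening the window to a month or lowering the boost is a one-argument change.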

In contrast, it may be that these nodes have to be flagged manually. For example, boss Bob now realizes that only selected older posts are important, and only some of the newer ones, and that depends on which ones his friend Fred decides to reprint in an industry newsletter. Depending on how these nodes are used on the site, you could apply the "sticky at the top of lists" or "promoted to the front page" flag to them and apply a boost setting for that flag. In that case you can see in the screen shot that it is simple to add a boost for nodes with those flags set to true.

That often won't work for an existing site where the promoted and sticky flags are already used in the workflow. For example, you may not want all the content that's important for search results to also show on the home page. However, you could do something conceptually similar. One option would be to use a CCK text field that is presented as a set of radio buttons with allowed values 'no' and 'yes'; such a field should be indexed automatically as a separate field in Solr. Let's assume that field is indexed as "ss_cck_field_important". You can then implement a hook to boost at search time all documents with this set to 'yes':

<?php
/**
 * Implementation of hook_apachesolr_prepare_query().
 */
function MYMODULE_apachesolr_prepare_query(&$query, &$params) {
  $params['bq'][] = "ss_cck_field_important:yes^5";
}
?>

Again, the number after the '^' is the boost value. The advantage of query-time boosting is that, like the other options in the UI, you can adjust this value dynamically without re-indexing the documents. Now boss Bob is happy at last because he can micro-manage the relevance of his blog posts.
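One way to take advantage of that flexibility is to read the boost value from a site setting instead of hard-coding it. The sketch below assumes Drupal's variable_get() and a made-up variable name, mymodule_important_boost; the helper function is also hypothetical.

```php
<?php
/**
 * Hypothetical helper: builds the boost query for flagged nodes.
 */
function mymodule_important_boost_query($boost) {
  // Cast to float so a bad setting cannot add extra query syntax.
  return 'ss_cck_field_important:yes^' . (float) $boost;
}

/**
 * Implementation of hook_apachesolr_prepare_query().
 */
function MYMODULE_apachesolr_prepare_query(&$query, &$params) {
  // 'mymodule_important_boost' is an assumed variable name; 5 is the default.
  $boost = variable_get('mymodule_important_boost', 5);
  $params['bq'][] = mymodule_important_boost_query($boost);
}
```

Because variable_get() is provided by Drupal, the hook itself only runs inside a Drupal site. Changing the stored value, for example with variable_set(), would take effect on the next search with no re-indexing.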

Another option that could be used is a per-document boost that's set at index time. This is less flexible since you can't change the boost dynamically, but might be appropriate depending on your needs. You'd do something like this, assuming you've added the same kind of flagging field:

<?php
/**
 * Implementation of hook_apachesolr_update_index().
 */
function MYMODULE_apachesolr_update_index($document, $node, $namespace) {
  // Some rule to determine if this is an important node.
  if (isset($node->field_important) && $node->field_important[0]['value'] == 'yes') {
    $document->setBoost(2.0);
  }
}
?>

We already know that this simple case can be handled via query-time boosting. So you would probably only consider using this if you have complex rules that let you determine from the full node data that a specific node needs an extra boost, and the condition cannot be specified as a query at search time. For example, you might want to add a boost if the word count of the body is within a certain range.

<?php
/**
 * Implementation of hook_apachesolr_update_index().
 */
function MYMODULE_apachesolr_update_index($document, $node, $namespace) {
  // Big boss Bob thinks "good" content has 300 to 600 words.
  $word_count = str_word_count($node->body);
  if ($word_count >= 300 && $word_count <= 600) {
    $document->setBoost(3.0);
  }
}
?>
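The node body usually contains HTML markup, and str_word_count() would count tag and attribute names as words. As a sketch, assuming the body is an HTML string, the rule can be pulled into a small helper that strips tags first. The function name mymodule_word_count_boost() and its default thresholds are hypothetical, chosen to mirror the example above.

```php
<?php
/**
 * Hypothetical helper: returns a boost for bodies whose word count is in
 * range, or NULL when no extra boost should be applied.
 */
function mymodule_word_count_boost($body, $min = 300, $max = 600, $boost = 3.0) {
  // strip_tags() removes markup so tag names and attributes are not counted.
  $word_count = str_word_count(strip_tags($body));
  return ($word_count >= $min && $word_count <= $max) ? $boost : NULL;
}
```

Inside hook_apachesolr_update_index() you would then call $document->setBoost() only when the helper returns a non-NULL value. Since this is an index-time boost, changing the thresholds only affects documents once they are indexed again.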

You may or may not have noticed that this example works around a limitation of the current Apache Solr Search Integration module that I had forgotten - the module is not indexing integer fields or on/off checkboxes by default. After having worked through this use case, I think that may be a mistake, so if that interests you please check out this issue: http://drupal.org/node/949768

Comments

Posted by mike503 (not verified).

Bob sounds like a jerk! :)

Posted by Robert Douglass.

In this example:
<?php
/**
 * Implementation of hook_apachesolr_prepare_query().
 */
function MYMODULE_apachesolr_prepare_query(&$query, &$params) {
  $params['bq'][] = "ss_cck_field_important:yes^5";
}
?>

it's worth noting that the $params['bq'] refers to "boost query". There's a great reference about those and more, here: http://wiki.apache.org/solr/SolrRelevancyFAQ

Then, "ss_cck_field_important" is the Solr field version of the "important" CCK field. To learn what mapping Solr is using for a specific field, you can look under "Administer > Reports > Apache Solr search index" (q=admin/reports/apachesolr) to see a list of the Solr fields in the index.

Robert Douglass
Senior Drupal Advisor, Acquia