Drupal's search compared to Google and Yahoo!

When Drupal does a content search, it optionally weights the results using up to four scoring factors: keyword relevancy, recency of the content, number of comments, and (if the statistics module is enabled) the number of page views. Site administrators can adjust the relative weighting of these scoring factors from the example.com/admin/settings/search administration page. Setting any scoring factor to zero disables it.

In this article, which applies primarily to Drupal 6 but is relevant for Drupal 5 as well, I explore how useful these scoring factors really are, and whether they help Drupal search live up to the high standards set by leaders like Google and Yahoo!. This article is part of a series of search-related articles in preparation for the Minnesota Search Sprint.

Searching for Views

According to the site statistics on Drupal.org, the most searched for term is "views" (see the top 30 search terms below). If Google is the leader in search, it should be fair to use Google's results as a benchmark for the effectiveness of any other search solution. Since search accuracy is somewhat difficult to measure scientifically, I'm turning to Google to aid in measuring the results heuristically. Take a look at Google's results for "views" on Drupal.org, and then take a look at Drupal.org's results for the same search, and take note of your conclusions.

Using Google's results as a benchmark, I took note of the top 10 results from Yahoo! and MSN, as well as Drupal.org with four different settings of the scoring factors. I wanted to measure how the results compared. To do this I measured the intersection of the results (the number of results in the top 10 that the different services have in common).
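The intersection measure described above can be sketched in a few lines of Python. This is only an illustration of the counting, not the tooling I actually used: the function name is my own and the result identifiers are hypothetical placeholders, not real search results.

```python
def top_n_overlap(benchmark, candidate, n=10):
    """Count the results that appear in the top n of both ranked lists.

    Order within the top n is ignored; only membership counts.
    """
    return len(set(benchmark[:n]) & set(candidate[:n]))


# Hypothetical placeholder identifiers, not real results.
benchmark_results = ["node/1", "node/2", "node/3", "node/4"]
candidate_results = ["node/3", "node/9", "node/1", "node/7"]

print(top_n_overlap(benchmark_results, candidate_results, n=4))
```

Because position is ignored, two services that return the same ten pages in a different order still score 10, which is a deliberately forgiving way to compare them.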

Drupal.org doesn't have the statistics module enabled, so there are only three scoring factors: keyword, recency, and comments. The table in Figure 1 lists the values for these three in parentheses, in that order. The setting on Drupal.org at the time of this writing was keyword = 10, recency = 5, comments = 1, which is represented in the table as Drupal (10, 5, 1).

Search              In common with Google
Yahoo!              6
Drupal (0, 0, 0)    4
Drupal (10, 0, 0)   4
Drupal (10, 5, 0)   1
Drupal (10, 5, 1)   1
MSN                 0

Figure 1: Number of Top-10 results in common with Google.

In Figure 1, Yahoo! came closest to Google, with 6 of the top 10 results in common. The algorithms that drive these two search giants clearly have a lot in common. The best results for Drupal search came with the scoring factors all set to zero (which disables them altogether). In that case the scoring and ranking are based on keyword relevance alone. Note that this gives results identical to having keyword = 10 and the other factors at zero (10, 0, 0). I discuss below why having all factors set to zero is preferable. MSN's results were truly terrible, as were the results of Drupal's search if either the recency or the comment factor was turned up artificially high, as in (0, 10, 0) or (0, 0, 10).

A couple of the individual results deserve mention. Both Google and Yahoo! have this link as one of their top 10:

As you can see, this is a listing page of Drupal contributed module projects. Listing pages of any kind have no influence on Drupal's internal search. Drupal's search only looks at individual nodes, so views, blocks, and pages like the path node or taxonomy pages like Drupal 6.x don't play any role. Of course, a search engine that crawls Drupal.org will see these pages and index them.

One of the results that appeared in the Drupal (10, 5, 1) search also deserves mention:

  • Drupal 6.0 released

    ... also been making changes that will make it dependent on Views for some functionality. 1and1.com doesn't support 6.0 I ... I Views II [I can't wait!]. ... CMS available! Now all I need are up to date CCK & Views modules... ahhhhhhhhhhhhhhhhhh ahhhhhhhhhhhhhhhhhh crazy, i ...

Note that the 1 in (10, 5, 1) is for the comment count scoring factor. This post, announcing the release of Drupal 6.0, has nothing to do with Views, but someone mentioned Views in the comments. Because a non-zero weight (1) was given to the number of comments (there are 319), the post claimed a place in the top 10 search results. According to my sense of heuristic measurement, this is wrong, and I believe that Drupal.org is best advised to give comment count a scoring factor weight of zero.

The cost of scoring factors

I am not recommending that we get rid of scoring factors. However, on Drupal.org, the best search results tend to come when they are all turned off. What are the other implications of scoring factors? One big one is performance. Compare the complexity of the queries in Figure 2. The first one is with all scoring factors set to zero, and the second one is with all set to 5 (the default for new installations).

<?php
// Query 1: Without scoring factors

$q = <<<QUERY
SELECT i.type, i.sid,
    (41.383657379 * SUM(i.score * t.count)) AS score
  FROM search_index i
  INNER JOIN search_total t ON i.word = t.word
  INNER JOIN node n ON n.nid = i.sid
  INNER JOIN users u ON n.uid = u.uid
  WHERE n.status = 1
    AND (i.word = 'nobis')
    AND i.type = 'node'
  GROUP BY i.type, i.sid
  HAVING COUNT(*) >= 1
  ORDER BY score DESC
  LIMIT 0, 10
QUERY;

// Query 2: With scoring factors

$q = <<<QUERY
SELECT i.type, i.sid,
    5 * (41.383657379 * SUM(i.score * t.count)) +
    5 * POW(2,
      (GREATEST(MAX(n.created), MAX(n.changed), MAX(c.last_comment_timestamp)) - 1209387295)
      * 6.43e-8) +
    5 * (2.0 - 2.0 / (1.0 + MAX(c.comment_count) * 1)) AS score
  FROM search_index i
  INNER JOIN search_total t ON i.word = t.word
  INNER JOIN node n ON n.nid = i.sid
  INNER JOIN users u ON n.uid = u.uid
  LEFT JOIN node_comment_statistics c ON c.nid = i.sid
  WHERE n.status = 1
    AND (i.word = 'nobis')
    AND i.type = 'node'
  GROUP BY i.type, i.sid
  HAVING COUNT(*) >= 1
  ORDER BY score DESC
  LIMIT 0, 10
QUERY;
?>

Figure 2: The search queries with and without scoring factors.

The first obvious difference is that the calculation for score is much more complex when scoring factors are enabled. MySQL is asked to perform a large number of mathematical operations. There is also an extra LEFT JOIN needed to get the comment count. What? Getting the comment count gives worse search results and makes the query that much more complex? Yes. For any site that is suffering from search-related performance problems, turning all of the scoring factors to zero might make the search results better, and will definitely take less processing power to execute.
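To make the arithmetic in the second query easier to follow, here is a minimal Python sketch of its score expression. It is an illustration only, not Drupal's code: the function names are my own, the keyword relevance is assumed to be the already normalized value that the SQL computes with SUM(i.score * t.count), and the constants (the 6.43e-8 decay and the comment scale of 1) are simply copied from Figure 2.

```python
def recency_term(last_activity, now, decay=6.43e-8):
    """Mirror of POW(2, (last_activity - now) * decay) from Query 2.

    Both timestamps are Unix seconds; the result is 1.0 for content
    touched right now and shrinks toward 0 as the content ages.
    """
    return 2.0 ** ((last_activity - now) * decay)


def comment_term(comment_count, scale=1.0):
    """Mirror of 2.0 - 2.0 / (1.0 + comment_count * scale) from Query 2.

    The result is 0.0 with no comments and approaches 2.0 as they pile up.
    """
    return 2.0 - 2.0 / (1.0 + comment_count * scale)


def combined_score(relevance, last_activity, comment_count, now,
                   weights=(5, 5, 5)):
    """Weighted sum of the three terms, in (keyword, recency, comments) order."""
    w_keyword, w_recency, w_comments = weights
    return (w_keyword * relevance
            + w_recency * recency_term(last_activity, now)
            + w_comments * comment_term(comment_count))


now = 1209387295                  # the timestamp that appears in Query 2
half_year_ago = now - 180 * 86400

print(recency_term(half_year_ago, now))
print(comment_term(319))
```

Since 1 / 6.43e-8 is roughly 15.5 million seconds, the recency term halves about every 180 days. The comment term climbs quickly toward its ceiling of 2: with the scale of 1 shown in Figure 2, a node with 319 comments already contributes 1.99375 before its weight is applied.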

Is there any future for scoring factors? I believe so. There is an excellent patch in the issue queue that would let modules define their own scoring factors, and this alone would make it easier to come up with more effective combinations. Note that this patch is languishing and needs review, so if you are feeling motivated to help out, this is one good place to start.

Scoring factors can also be genuinely useful for those who are looking for specific, targeted information. Suppose I want to search amongst the nodes that have the most comments? Or perhaps I really am looking for more recent posts rather than older posts? Wouldn't it be nice to be able to decide these things for individual searches? For this reason I think it is a mistake to keep the scoring factor controls buried and only available to the administrator. Another patch that is languishing in the issue queue would move them into the advanced search form. This may not be exactly the right answer, but your participation in that issue is also welcome.

In any case, scoring factors will be one of the things that gets discussed at the upcoming Minnesota Search Sprint. It's pretty neat that our beloved CMS has these features built into it. Understanding the scoring factors completely, however, is important to getting the best results out of Drupal search. Please join the Search Group to participate in improving Drupal's search opportunities.

Appendix: Top search terms on Drupal.org.

Count Message
1289 views (Content).
1007 cck (Content).
746 tinymce (Content).
680 Image (Content).
629 gallery (Content).
562 Panels (Content).
535 calendar (Content).
450 Token (Content).
443 pathauto (Content).
429 menu (Content).
419 fckeditor (Content).
405 forum (Content).
360 video (Content).
351 imce (Content).
347 WYSIWYG (Content).
318 taxonomy (Content).
315 jquery (Content).
312 wiki (Content).
301 webform (Content).
297 event (Content).
274 imagecache (Content).
269 rss (Content).
267 blog (Content).
240 profile (Content).
239 sitemap (Content).
239 imagefield (Content).
238 chat (Content).
237 captcha (Content).
235 content (Content).
231 phpbb (Content).

Comments

Posted by Isriya (not verified).

This is a brilliant post.

I'm investigating the Drupal.org search system as well. Given how many resources search consumes, I'm nearly convinced that the best solution is to outsource the search function to a dedicated search server, e.g. Solr, as your module is working on.

Anyway, that is no reason not to improve Drupal's search function. One easy improvement is merging some code from Fuzzy Search into Search core.

Posted by Gerhard Killesreiter (not verified).

I really wish you had done your experiments not on drupal.org but on our scratch site. Setting all weighting factors to 0 will introduce a division by zero error. With our master/slave setup, such massive writes to the watchdog table aren't really to be played with.

Posted by Robert Douglass.

@Gerhard: lesson learned. But then, aren't these controls *supposed* to be adjustable on a live server? Yes.

Posted by Gerhard Killesreiter (not verified).

Well, congrats, you found a bug. :p

Actually somebody else found the PHP errors. However, I think it is a good idea to check the log after making such adjustments. Also, apparently the 0,0,0 setting broke the indexing so it cannot be recommended until the bug has been fixed:

http://drupal.org/node/252580

Posted by Robert Douglass.

Also, apparently the 0,0,0 setting broke the indexing

The indexing on Drupal.org was broken before I started to research this, so that issue has to be looked into still. Changing this setting cannot break the indexing because it only affects the search query, nothing else. Setting to 10, 0, 0 will produce identical results (superior to 10, 5, 1, which is what we've been using), and will avoid the division by zero error.

But this is all a bit off topic here; we should continue on the Drupal.org infrastructure list.

Posted by Scott Reynolds (not verified).

Robert,

Have you tried more specific queries, like "How to set up cron job"? My intuition leads me to believe that more specific queries will lead to different results for those scoring factors (meaning that the scoring factors will lead to much better results).

I would also caution a bit: web engines and site searches use very different algorithms. Google will rank results based on what pages link to the page (among other things), whereas Drupal search is doing a pure text retrieval search.

Have you tried experiments with a standard search corpus to test accuracy?

Posted by Robert Douglass.

Hey Scott,

This experiment ended up gravitating towards "the usefulness of the recency and comment count scoring factors", and in that context I don't think it would make any difference for the search phrase you mentioned.

Drupal search is doing a pure text retrieval search

Not quite. Drupal tracks its own internal links and linked text lends its keywords to the target of the link.

I don't know what you mean by "standard search corpus" but am very interested. I would like to start thinking about building test suites that measure the effectiveness of Drupal's search (text retrieval) and don't know yet how to go about doing this, so any tips would help. How did you test the effectiveness of the Content Recommendation Engine, for example?

Posted by Jose A Reyero (not verified).

Really interesting read.

About Drupal search, the fact that external search engines may provide better search results than our own search engine only means two things:
- They are actually using more information for ratings, namely external linking, which is usually done manually, so it's not AI but real "Human intelligence".
- We are not effectively using internal information that we have and that is not available to external engines. And we are not using the semantic information we have at all.

For instance, the top search terms show that people are most often looking for modules. Wouldn't it make sense to have a rating for content types (first modules, second handbook pages...) instead of such generic ones?

I don't think exposing scoring factors will make it any better, as those use cases seem merely anecdotal to me and our 'Advanced search' page is already overcrowded. It took me some time to find where to check 'modules' when looking for modules, which as we've seen is the most common use case.

One more idea may be scoring for tags. Wouldn't it make sense to have the '6.x' tag score higher than '5.x' (or 4.x)?

Whatever the case, we should be able to provide far better results than external engines when searching our own site, and for that I think the way is to add more 'semantic' scores (content type, tag...) rather than these arbitrary ones.

As for the keyword factor and weighting links, if we think about it, internal linking between content is almost non-existent as compared to navigation-to-content links (menu items) that google may count but we don't.

Posted by Robert Douglass.

as compared to navigation-to-content links (menu items) that google may count but we don't

You really hit on something here. The menus should be giving nodes keyword weight and they're not. Great feature request.