Use Apache Solr to search in files

Drupal's file handling capabilities keep getting better. Beyond the core upload module, the filefield module for CCK has enabled us to build sites with all sorts of files: documents, images, music, videos, and so forth. Searching within these documents, however, has never been a common feature on Drupal sites. Some solutions have existed, particularly for extracting text from PDFs and common word processing documents. With Apache Solr, the attachments module, and an extension library called Tika, things can be much better. With Tika you can extract text not only from Microsoft Office, OpenOffice, and PDF documents; you can also get text and metadata from images, songs, Flash movies, and zipped archives. Searching this text is done as part of the normal Apache Solr driven site search.

This article shows how I set up Tika and the Apache Solr Attachments module on my MacBook Pro running Snow Leopard (Mac OS X 10.6). There are two ways to run Tika: either as a client-side component (where the client is Drupal), or as a server-side component (the server being Solr). The advantage of running Tika client-side is that the files don't need to travel over the wire to have their text extracted. Especially in the case of rich media (movies, images, music) this is quite desirable. Why send a 20 MB video over the network just to get 15-20 lines of text from it? Another important advantage of running Tika client-side is that it works with Acquia Search.

The disadvantages of running Tika client-side are that you have to install it on every client (in a multi-webserver environment, for example), and the processing workload then falls onto your webserver instead of offloading it to the Solr server. Acquia Search also doesn't currently support the option of offloading extraction to the Solr server, though it is a feature we might add.

This article will show you how to install Tika on the client.

What you need

You need Java 1.6 (1.5 should work, but not as many document types are supported). Test this by typing java -version at the command line. Here's what I see on my machine:

robert$ java -version
java version "1.6.0_17"
Java(TM) SE Runtime Environment (build 1.6.0_17-b04-248-10M3025)
Java HotSpot(TM) 64-Bit Server VM (build 14.3-b01-101, mixed mode)

You also need the Java build tool called Maven. If you're on OS X 10.6 like I am, it should already be installed. Check by typing mvn -v at the command line. Here's what I see on my machine:

robert$ mvn -v
Apache Maven 2.2.0 (r788681; 2009-06-26 15:04:01+0200)
Java version: 1.6.0_17
Java home: /System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home
Default locale: en_US, platform encoding: MacRoman
OS name: "mac os x" version: "10.6.2" arch: "x86_64" Family: "mac"
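
Both checks can be rolled into a small pre-flight script. What follows is a minimal sketch, not part of my original setup: the function name check_tool is made up for illustration, and it only confirms that a command is on your PATH, not that its version is recent enough.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the named command is on the PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}
```

To use it, run check_tool java && check_tool mvn before starting the build. You still need to read the version output yourself.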

Building Tika

No matter how you decide to run Tika (client or server side), you'll need to get Tika first. A zip file of the source code is available from the download page on lucene.apache.org. Alternatively, you can get the source directly from the Subversion repository with this command:

svn export -r901084 http://svn.apache.org/repos/asf/lucene/tika/trunk tika-0.6

Either way you'll end up with a directory called tika-0.6. Next we'll use Maven to build Tika from the source files. From the command line, change into the new tika-0.6 directory and check that it contains the pom.xml file. Then type the following two commands. The first gives Maven enough memory, and the second tells it to build Tika:

export MAVEN_OPTS="-Xmx1024m -Xms512m"
mvn install

The first time I tried this I was on the train between Cologne and Brussels using the spotty wireless connection on the Thalys. The build process broke three times and each time I just typed mvn install and it picked up where it had left off, eventually succeeding. Later I tried the build process again in normal conditions and it worked seamlessly.
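
If your connection is as unreliable as mine was, the manual rerun can be automated. This is only a sketch under assumptions: the retry function is a hypothetical helper, not something Maven or Tika provides, and it simply reruns whatever command you hand it until it succeeds or the attempt limit is reached.

```shell
#!/bin/sh
# Hypothetical helper: run a command up to $1 times, stopping at the first success.
retry() {
  max="$1"
  shift
  attempt=1
  while [ "$attempt" -le "$max" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $attempt of $max failed" >&2
    attempt=$((attempt + 1))
  done
  return 1
}
```

For example, retry 5 mvn install would try the build up to five times. A build that fails for a reason other than the network will just fail five times, so check the Maven output if it never succeeds.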

When Maven is done building, your nugget of gold will be tika-app/target/tika-app-0.6.jar. Let's test it out! Still from the tika-0.6 directory, try extracting some text from a file using the new tika-app-0.6.jar file:

java -jar ./tika-app/target/tika-app-0.6.jar -t [path/to/a/file]

Replace [path/to/a/file] with the path to some interesting file you'd like to test. If everything goes right you'll get the text from that file dumped to stdout (which means you'll see it scrolling by in your terminal).
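
To try several files at once, the same -t invocation can be wrapped in a loop. This is a rough sketch rather than anything the Apache Solr Attachments module does itself: extract_text and extract_dir are hypothetical names, and TIKA_JAR is an assumed environment variable that defaults to the jar path from the build above.

```shell
#!/bin/sh
# Path to the jar built above; override by setting TIKA_JAR in the environment.
TIKA_JAR="${TIKA_JAR:-./tika-app/target/tika-app-0.6.jar}"

# Extract plain text from one file, using the same -t option as the test command.
extract_text() {
  java -jar "$TIKA_JAR" -t "$1"
}

# Hypothetical helper: extract every regular file in $1 into $2/<name>.txt.
extract_dir() {
  src="$1"
  out="$2"
  mkdir -p "$out" || return 1
  for f in "$src"/*; do
    # Skip subdirectories and anything else that is not a regular file.
    [ -f "$f" ] || continue
    name=$(basename "$f")
    if ! extract_text "$f" > "$out/$name.txt"; then
      echo "extraction failed: $f" >&2
    fi
  done
}
```

Running extract_dir ~/Documents /tmp/tika-out from the tika-0.6 directory would leave one .txt file per document, which makes it easy to see what Tika gets out of each format before involving Drupal at all.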

As a final step I moved the tika-app-0.6.jar file to ~/bin (the directory where I keep my custom scripts and libraries) and named it tika.jar. This is optional. You can keep the jar file wherever it makes sense to you. Just take note of its absolute path, as you'll need it when configuring the apachesolr_attachments module.
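
That step can be scripted so the absolute path is printed for you. This is a sketch under assumptions: install_jar is a hypothetical helper, and it copies the jar rather than moving it, so the build output stays in place.

```shell
#!/bin/sh
# Hypothetical helper: copy a jar into a directory under a new name
# and print the absolute path of the copy.
install_jar() {
  jar="$1"
  dest_dir="$2"
  name="$3"
  mkdir -p "$dest_dir" || return 1
  cp "$jar" "$dest_dir/$name" || return 1
  # cd + pwd resolves the directory to an absolute path without needing realpath.
  (cd "$dest_dir" && printf '%s/%s\n' "$(pwd)" "$name")
}
```

From the tika-0.6 directory, install_jar ./tika-app/target/tika-app-0.6.jar "$HOME/bin" tika.jar prints the path to paste into the module's configuration screen.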

Now we're ready to use Tika within the context of Drupal and Apache Solr searching.

Drupal, Solr and the Apache Solr Attachments module

For instructions on installing Drupal, Solr, or the Apache Solr module, please refer to the linked resources. You can also get up and running very quickly using the Acquia Drupal Stack Installer and Acquia Search (try it for free).

Note that you may need to give Solr more memory when doing attachment searching. For this example I tested using the Jetty container that comes with the Solr download, but I started it using this command:

java -Xmx1024m -Xms512m -jar start.jar

Get and install the Apache Solr Attachments module in the normal fashion. There is one configuration screen, found at q=admin/settings/apachesolr/attachments. Most of the options are self-explanatory. You may want to allow a wider set of file extensions. For exact information about what is available, see the Tika supported formats page.

[Screenshot: apachesolr attachments module configuration]

Upload files, run cron, and search

The only thing left to do is to upload some files, run cron, and do a search. Search results that match text in files link both to the file and to the node it belongs to. Here's an example of me searching for "merlinofchaos" and finding the views-6.x-3.0-alpha2.tar.gz file that I uploaded (yes, Tika can search in tar.gz files, and yes, that's the whole Views 3 module).

[Screenshot: searching for merlinofchaos and finding views 3]

Here's an example of me searching for "Drupal" and finding both a Word document and an iWork Keynote file.

[Screenshot: searching for Drupal and finding a Word document and an iWork Keynote document]

Comments

Posted on by wmostrey (not verified).

I'm curious how the module responds to files that are attached to multiple nodes, or files that are in the files table but that are no longer attached to a node.

Posted on by Peter Wolanin.

As of now, the module only examines files that are attached to nodes.

I think if the same file is attached to multiple nodes it may get indexed once per node. Since this is not common in D6 I haven't really tried to optimize the behavior.

Posted on by jskulski (not verified).

How wonderful! Thanks!