Give your Web site its own search engine using Lucene

While established Web site search engines, such as Google, are available, the flexibility possible with an integrated search engine should not be underestimated. Give your Web site a boost with its own Lucene search engine.

Adding search functionality to your Web site is one of the easiest ways to improve your users' experience, but integrating a search engine with your application isn't necessarily easy. To help you add a flexible search engine to your Java applications, I'll explain how to use Lucene, an extremely flexible, open source search engine.

Lucene integrates directly with your Web application. It is written in Java and comes from the Apache Jakarta project. Your Java application can use Lucene as the core of any search functionality. Lucene works with any kind of text data; it has no built-in support for Word, Excel, PDF, or XML documents, but solutions exist to support each of these formats with Lucene.

One important point about Lucene is that it is just a search engine. There isn't a built-in Web GUI or a Web crawler. To add Lucene to your Web application, you will have to write a servlet or JSP page that displays a query form and another page that lists the results.

Building an index with Lucene
The text content from your application is indexed by Lucene and stored on the file system as a set of index files. Lucene accepts Document objects that represent a single piece of content, such as a Web page or a PDF file. Your application is responsible for turning its content into Document objects that Lucene can understand.

Each document is composed of one or more Field objects. These fields consist of a name and a value, much like an entry in a hash map. Each field should correspond to a piece of information you will need to either query against or display in the search results. For instance, the title would be used in the search results, so it would be added to the Document object as a field. The fields can be either indexed or not indexed, and the original data can be optionally stored in the index. A field that is stored in the index would be useful when building the search results page. Fields that aren't useful for searching, such as a unique ID, don't need to be indexed, just stored.

Fields may also be tokenized, which means that an analyzer breaks up the content of the field's value into tokens that the search engine can use. Lucene comes with several analyzers, but I use one of the most powerful analyzers, the StandardAnalyzer class.

The StandardAnalyzer class changes all of the text to lowercase and also removes some common stop words. Stop words are words such as "a," "the," and "in" that are very common in content but wouldn't be useful to search for on their own. The same analyzer is also run on the search query, so the query terms are normalized in the same way as the indexed text and can match it. For example, a piece of content that says "The dog is a golden retriever" might be processed into "dog golden retriever" for the index. When a user searches for the words "a Golden Dog," the analyzer processes the query and turns it into "golden dog," which would match our content.
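The lowercase-and-stop-word step described above can be sketched in plain Java. This is a toy illustration only, not Lucene's StandardAnalyzer: the class name ToyAnalyzer, the short stop-word list, and the rule of splitting on anything that is not a letter or digit are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy illustration only: this is NOT Lucene's StandardAnalyzer.
class ToyAnalyzer {
    // A short, hypothetical stop-word list chosen for this example.
    private static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "the", "in", "is", "of"));

    // Lowercase the text, split on anything that is not a letter or digit,
    // and drop empty pieces and stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The dog is a golden retriever")); // [dog, golden, retriever]
        System.out.println(analyze("a Golden Dog"));                  // [golden, dog]
    }
}
```

Because the content and the query pass through the same method, both query tokens ("golden" and "dog") appear among the content tokens, which is why the search in the example finds the document.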

Our example is going to use business objects from a Data Access Object (DAO), which is a common pattern in Java application development. The DAO I'll use, ProductDAO, is shown in Listing A.

To keep things simple for this demo, I'm not going to use a database, and the DAO will just contain a collection of Product objects. In the example, I'm taking Product objects from Listing B and turning them into documents for the index.

The Indexer class in Listing C is responsible for converting Product objects to Lucene Document objects and for creating the Lucene index.

The fields on the Product class are ID, name, short description, and long description. The ID will be stored as an unindexed, untokenized field using the UnIndexed method on the Field class. The name and short description will be stored as indexed, untokenized fields using the Keyword method on the Field class. The search engine will run the query against a separate content field, which will consist of the combined text from the name, short description, and long description fields on the products.
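To make the field layout concrete, here is a minimal plain-Java sketch of the mapping from a product to named fields. It deliberately uses no Lucene classes, so it is not the Indexer from Listing C. The Product stand-in, the field names ("id", "name", "shortDescription", "content"), and the choice of a single space as the separator are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java sketch: no Lucene classes are used here.
class ProductFields {
    // Hypothetical stand-in for the article's Product business object.
    static class Product {
        final String id;
        final String name;
        final String shortDescription;
        final String longDescription;

        Product(String id, String name, String shortDescription, String longDescription) {
            this.id = id;
            this.name = name;
            this.shortDescription = shortDescription;
            this.longDescription = longDescription;
        }
    }

    // Returns field name -> value in insertion order. The field names are assumptions.
    static Map<String, String> toFields(Product p) {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        fields.put("id", p.id);                             // stored only (Field.UnIndexed in the article)
        fields.put("name", p.name);                         // stored and indexed whole (Field.Keyword)
        fields.put("shortDescription", p.shortDescription); // same treatment as the name
        // The text that queries run against: all three text fields joined together.
        fields.put("content", p.name + " " + p.shortDescription + " " + p.longDescription);
        return fields;
    }

    public static void main(String[] args) {
        Product p = new Product("42", "Green Widget", "A small widget", "It is green.");
        for (Map.Entry<String, String> e : toFields(p).entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

In a real indexer, each map entry would become a Field object added to a Document, with the field type chosen as described above.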

After all of the documents are added, optimize the index and close the index writer, which makes the index available for searching. Most applications that use Lucene rely on incremental indexing, where documents that are already in the index are updated individually, rather than deleting the index and building a new one every time.
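The idea behind incremental indexing can be shown with a toy in-memory inverted index, in which updating a document means removing its old entries and adding the new ones while every other document is left alone. This is a sketch of the concept under simplifying assumptions, not Lucene's on-disk index format or API; the class ToyIndex and its method names are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy in-memory inverted index; Lucene's real index lives in files on disk.
class ToyIndex {
    // term -> IDs of the documents that contain the term
    private final Map<String, Set<String>> postings = new HashMap<String, Set<String>>();
    // document ID -> terms currently indexed for it, so old entries can be removed
    private final Map<String, Set<String>> termsByDoc = new HashMap<String, Set<String>>();

    // Add a document, or replace it if the ID is already in the index.
    void update(String docId, String text) {
        delete(docId);
        Set<String> terms = new HashSet<String>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        termsByDoc.put(docId, terms);
        for (String term : terms) {
            Set<String> ids = postings.get(term);
            if (ids == null) {
                ids = new TreeSet<String>();
                postings.put(term, ids);
            }
            ids.add(docId);
        }
    }

    // Remove one document's entries; other documents are untouched.
    void delete(String docId) {
        Set<String> old = termsByDoc.remove(docId);
        if (old == null) {
            return;
        }
        for (String term : old) {
            Set<String> ids = postings.get(term);
            ids.remove(docId);
            if (ids.isEmpty()) {
                postings.remove(term);
            }
        }
    }

    // Returns a copy of the IDs of the documents that contain the term.
    Set<String> search(String term) {
        Set<String> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? Collections.<String>emptySet() : new TreeSet<String>(ids);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.update("1", "Red kettle");
        index.update("2", "Red toaster");
        index.update("1", "Blue kettle"); // only document 1 is re-indexed
        System.out.println(index.search("red"));    // [2]
        System.out.println(index.search("blue"));   // [1]
        System.out.println(index.search("kettle")); // [1]
    }
}
```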

Running a query
Creating a query and searching for results in the index is simpler than creating an index. Your application will ask the user for a search query, which could be a simple word. Lucene has some more advanced Query classes available for Boolean searching or searching by complete phrase.

An example of an advanced query would be "Mutual Fund" AND stock*, which would search for documents that contain the phrase "Mutual Fund" and a word that starts with "stock," such as stocks, stock, or even stockings.
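To show what that query asks of a document, here is a small plain-Java check that requires both an exact phrase and at least one word beginning with a prefix. It is a hand-written illustration of the matching rule, not Lucene's query parser or scoring; the class ToyQueryMatch and its methods are hypothetical.

```java
import java.util.Locale;

// Hand-written illustration of phrase AND prefix matching; not Lucene code.
class ToyQueryMatch {
    // True when the text contains the exact phrase (as consecutive words)
    // and at least one word that starts with the prefix.
    static boolean matches(String text, String phrase, String prefix) {
        String[] words = text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
        String[] phraseWords = phrase.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
        return containsPhrase(words, phraseWords)
                && containsPrefix(words, prefix.toLowerCase(Locale.ROOT));
    }

    // Looks for the phrase words in order at consecutive positions.
    private static boolean containsPhrase(String[] words, String[] phraseWords) {
        for (int start = 0; start + phraseWords.length <= words.length; start++) {
            boolean all = true;
            for (int i = 0; i < phraseWords.length; i++) {
                if (!words[start + i].equals(phraseWords[i])) {
                    all = false;
                    break;
                }
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    private static boolean containsPrefix(String[] words, String prefix) {
        for (String w : words) {
            if (w.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Equivalent in spirit to: "Mutual Fund" AND stock*
        System.out.println(matches("Our mutual fund holds stocks", "Mutual Fund", "stock"));          // true
        System.out.println(matches("A fund with mutual benefits and stock", "Mutual Fund", "stock")); // false
    }
}
```

The second call returns false because "mutual" and "fund" both appear but not as a consecutive phrase, which is the difference between a phrase query and a search for two separate words.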

For more information on queries in Lucene, see the query syntax page on the Lucene Web site.

The Searcher class in Listing D is responsible for searching the Lucene index for the terms you use. For the demo, I am using a simple query that is just a string, without any of the advanced query functionality. I create a Query object from the query string with the QueryParser class, which uses the StandardAnalyzer class to split the query string into tokens, remove stop words, and convert the text to lowercase.

The Query is passed to an IndexSearcher object. The IndexSearcher is initialized with the location of the index on the file system. The search method on IndexSearcher takes the Query and returns a Hits object. The Hits object contains the search results as Lucene Document objects, along with the number of results. Use the doc method on the Hits object to retrieve each document by its position in the results.
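The access pattern for the results, a count plus retrieval by position, can be sketched with a toy stand-in. The ToySearch and ToyHits classes below are hypothetical and only mimic the shape of that loop; the matching rule (a document is kept if it shares at least one word with the query) is a simplification, and no scoring or ordering by relevance is attempted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy stand-in that mimics the shape of a search call and a results loop.
class ToySearch {
    // Hypothetical results object: a size plus access by position.
    static class ToyHits {
        private final List<String> docs;

        ToyHits(List<String> docs) {
            this.docs = docs;
        }

        int length() {
            return docs.size();
        }

        String doc(int n) {
            return docs.get(n);
        }
    }

    // Keeps, in their original order, the documents that share a word with the query.
    static ToyHits search(List<String> documents, String query) {
        Set<String> queryWords = words(query);
        List<String> matched = new ArrayList<String>();
        for (String doc : documents) {
            Set<String> common = words(doc);
            common.retainAll(queryWords);
            if (!common.isEmpty()) {
                matched.add(doc);
            }
        }
        return new ToyHits(matched);
    }

    // Lowercases the text and returns its distinct words.
    private static Set<String> words(String text) {
        Set<String> result = new HashSet<String>(
                Arrays.asList(text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")));
        result.remove("");
        return result;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("Golden retriever puppy", "Tabby cat", "Dog bed");
        ToyHits hits = search(docs, "golden dog");
        for (int i = 0; i < hits.length(); i++) {
            System.out.println(hits.doc(i));
        }
    }
}
```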

The Document object contains the fields I added to the document in the indexer. Fields that were stored in the index, such as the product name, can be read back from the document. The example application runs a query against the search engine and then displays the names of the products it finds.

Running the demo
To run the example for this article, you will need to have a JDK installed and to download the Lucene binary distribution from the Lucene Web site. The lucene-1.3-rc1.jar file from the Lucene distribution must be added to your Java class path to run the demo. The demo will create an index directory called index under the directory where you run the com.greenninja.lucene.Demo class. A typical command line on Windows would be (see Figure A):

java -cp c:\java\lucene-1.3-rc1\lucene-1.3-rc1.jar;. com.greenninja.lucene.Demo

On Unix-like systems, the class path separator is a colon rather than a semicolon. For this example, the sample data is contained in the ProductDAO class, and the query is part of the Demo class.


Figure A
Command line example

