Graduating from grep: Powerful text-file searching with Isearch

How many times have you said to yourself, “I know there’s a file in here somewhere that mentions such and such, but I just can’t find it?”

This problem bedevils most Linux users from time to time, but it’s particularly serious when you’re running a Web or FTP server. Suppose, for instance, you’ve just received e-mail telling you that some of the links on your pages are dead. You need to find all the files that contain the link—and you need to do so fast.

In the Windows world, Web developers can use proprietary Web site management software to perform such searches, but most Linux developers will find themselves falling back on the default Linux text-searching tool called grep. As you’ll see, grep is a useful tool, but there’s much better software and it’s free for the asking: Isearch.

Isearch is an open source package that’s specifically designed for text searching. It uses the latest technology, and what’s more, it can recognize and deal with the structures inherent in a variety of commonly used text files, including HTML and SGML pages, mail folders, list digests, and more. In this Daily Drill Down, I’ll take a look at some nifty ways you can use this great but little-known software.

What’s wrong with grep?
You can do some cool things with grep; for instance, if you’re looking for a file that contains a particular word or phrase, the grep -l switch comes in handy. To try it, switch to /usr/src/linux/Documentation and enter
grep -l situations *.txt

You’ll see a list of all the files that contain at least one instance of this word. Still, grep has more than a few shortcomings, as you’ll discover when you’re using it. The utility’s greatest defect lies in the fact that it’s a line-oriented program. If you try to retrieve a phrase that spans a line break, you’re out of luck. To see why, type grep -l "situations where" *.txt. (The quotation marks make grep treat the two words as a single pattern; without them, grep would search for situations in a file named where.)

Although my version of devices.txt contains this phrase, the words are separated by a line break. Consequently, grep won’t retrieve it.
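
If you don’t have a file handy in which the phrase happens to be split, the following sketch reproduces the problem with a throwaway file. The file name sample.txt and its contents are made up for the demonstration; the tr workaround at the end is just one rough way around the limitation, not a substitute for a real index.

```shell
#!/bin/sh
# Build a throwaway file in which the phrase is split across two lines.
demo_dir=$(mktemp -d)
printf 'There are situations\nwhere this matters.\n' > "$demo_dir/sample.txt"

# Line-oriented search: the phrase never appears on a single line,
# so grep lists no files (and exits with status 1).
line_hits=$(grep -l "situations where" "$demo_dir"/*.txt || true)

# Rough workaround: join the lines with tr first, then count matches.
joined_hits=$(tr '\n' ' ' < "$demo_dir/sample.txt" | grep -c "situations where")

echo "line-oriented matches: ${line_hits:-none}"
echo "after joining lines:   $joined_hits"
rm -rf "$demo_dir"
```

The first search comes back empty even though both words are in the file; only after the line break is replaced with a space does the phrase match.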

The truth is that line-oriented programs are 1970s technology. Newer programs employ a two-step process that begins by creating an index of all the files you want to search. This index, called an inverted file, contains a list of all the unique words that occur in the files, coupled with pointers to their exact locations. Line breaks aren’t a problem, and what’s more, it’s possible to search with Boolean operators, enabling you to create complex, well-designed queries (something that’s very difficult to do with grep). When you search for text, you search the index, not the files, and the search is a lot faster. The downside? If you make changes to the files or add new files, you’ll need to run the indexing software again. But that’s a small price to pay for the improved performance.
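
To make the idea of an inverted file concrete, here is a toy version built with standard shell tools. It is only a sketch: the sample files, the index.tsv name, and the lookup function are all invented for illustration, and the index records just which file each word occurs in rather than exact positions. It says nothing about how Iindex stores its data internally.

```shell
#!/bin/sh
# Toy inverted file: one "word<TAB>file" line per unique word in each file.
work=$(mktemp -d)
printf 'Linus wrote the kernel\n' > "$work/a.txt"
printf 'The kernel handles\nmemory and sound\n' > "$work/b.txt"

index="$work/index.tsv"
for f in "$work"/*.txt; do
    # One lowercase word per line, duplicates dropped, tagged with the file name.
    tr -cs '[:alnum:]' '\n' < "$f" | tr '[:upper:]' '[:lower:]' | sort -u |
        awk -v file="$(basename "$f")" 'NF { print $0 "\t" file }'
done | sort > "$index"

# A search consults the index only; the original files are not read again.
lookup() {
    awk -F '\t' -v w="$1" '$1 == w { print $2 }' "$index"
}

kernel_files=$(lookup kernel | paste -sd ' ' -)
linus_files=$(lookup linus | paste -sd ' ' -)
echo "kernel: $kernel_files"
echo "linus:  $linus_files"
rm -rf "$work"
```

Because the words are split on anything that isn’t a letter or digit, a line break is just another separator, which is why it stops being an obstacle once you search an index instead of raw lines.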

Introducing Isearch
With assistance from the National Science Foundation, the folks at the Center for Networked Information Discovery and Retrieval (CNIDR) started work a few years ago on freeWAIS, a public-domain implementation of the WAIS (Wide Area Information Server) search engine. From its inception, the CNIDR project addressed a serious shortcoming in WAIS—namely, its failure to separate the Internet protocol from the search engine. CNIDR split the project into two parts, and the one that focused on the search engine produced Isearch.

Isearch isn’t just one program; rather, it’s a package. Here’s what’s included:

Iindex: Creates an index of the text files that you want to search.
Isearch: Performs the searches.
Iutil: Offers various tools for maintaining your databases.

To see a list of available options for any of these commands, just type the command name without any options or arguments. If you type Iutil and press [Enter], for example, you see the options available for this command.

Obtaining and installing Isearch
To obtain a copy of Isearch, visit the Isearch download page. You’ll find tarballs and Red Hat (rpm) packages for the latest version, 1.41. Isearch is distributed under a freeware license that enables redistribution (including commercial redistribution) as long as the copyright and permission notices are supplied.

To get help with Isearch, you can join the Isearch mailing list. This page also provides access to a searchable archive of mailing list postings.

Indexing your files
Once you’ve installed Isearch, you can experiment with the program by making databases of text and HTML files on your system. You start by creating an index.

To create an index of all the text files in a directory, do the following:

Switch to the directory that contains the files you want to index, type du -amS | sort -rn | less, and press [Enter]. You’ll see a list of the files in the current directory tree, in reverse numerical order. (The largest files are shown first, and the size is displayed in megabytes.) Note the size of the largest file.
Switch to your home directory.
Create a directory called databases (type mkdir databases and press [Enter]).
Now run the Iindex command to create the database. You’ll use the following options:

-d database-name: For database-name, type the name of the database you want to create. I’m creating a database of the Linux documentation text files in /usr/src/linux/Documentation, so I’ll call my database linux-help. You don’t need to supply a suffix (extension); Isearch supplies the suffix for you.
-t SIMPLE: The -t switch enables you to specify the document type. You’ll learn more about document types in subsequent sections; for now, use the SIMPLE document type, which is appropriate for text files.
-m size: If any of the files you’re indexing is larger than one megabyte, you must use the -m option to increase the memory size of Iindex (1 megabyte by default). For example, later in this Daily Drill Down I’ll explain how to create a database of your e-mail messages. The largest file in my e-mail folder is a hefty 19 megabytes, so I ran Iindex with -m 19.
pathname: For pathname, type the name of the directory containing the files you want to index.
-r: If you’d like to index the files in all associated subdirectories as well as the specified directory, you can use this option.

Here’s an example of an Iindex command:
Iindex -d linux-help -t SIMPLE /usr/src/linux/Documentation/*.txt
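
If you’d rather not work out the -m value by hand, a small helper can do the arithmetic. The sketch below only prints a suggested command; it doesn’t run Iindex. The function name suggest_iindex is my own invention, and the rounding rule (largest file size rounded up to whole megabytes) is an assumption based on the description of -m above.

```shell
#!/bin/sh
# Print (not run) an Iindex command whose -m value covers the largest file.
# suggest_iindex is a hypothetical helper, not part of the Isearch package.
suggest_iindex() {
    s_db=$1
    s_dir=$2
    s_largest=0
    for f in "$s_dir"/*.txt; do
        [ -f "$f" ] || continue
        # Arithmetic expansion strips any padding that wc adds to the count.
        s_size=$(($(wc -c < "$f")))
        [ "$s_size" -gt "$s_largest" ] && s_largest=$s_size
    done
    # Round up to whole megabytes; never go below the 1 MB default.
    s_mb=$(( (s_largest + 1048575) / 1048576 ))
    [ "$s_mb" -lt 1 ] && s_mb=1
    echo "Iindex -d $s_db -t SIMPLE -m $s_mb $s_dir/*.txt"
}

suggest_iindex linux-help /usr/src/linux/Documentation
```

Check the printed command before you paste it in; if no file is larger than one megabyte, the -m 1 it suggests simply matches the default.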

To practice indexing a whole directory tree, you can use the text and HTML documentation files included in your system’s /usr/doc directory tree. First, check the total amount of space the files consume by typing du -hs at the top of the files’ directory tree. If the amount is larger than one megabyte, consider using the -m switch to set aside more memory. Here’s an example of the -r switch in action:
Iindex -d linux-help -t SIMPLE -m 4 -r /usr/src/linux/Documentation
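
Because the index has to be rebuilt whenever the files change, it helps to know when it’s out of date. One possible approach, sketched below, is to keep a marker file that you touch after every Iindex run and then ask find whether anything is newer. The needs_reindex function and the linux-help.stamp file are hypothetical conventions of mine; Isearch itself knows nothing about them.

```shell
#!/bin/sh
# needs_reindex DIR STAMP: succeed if any file under DIR is newer than STAMP.
# STAMP is a hypothetical marker file that you touch after each Iindex run.
needs_reindex() {
    r_dir=$1
    r_stamp=$2
    [ -e "$r_stamp" ] || return 0      # no marker yet: never indexed
    [ -n "$(find "$r_dir" -type f -newer "$r_stamp" 2>/dev/null | head -n 1)" ]
}

docs=/usr/src/linux/Documentation
stamp=$HOME/databases/linux-help.stamp
if needs_reindex "$docs" "$stamp"; then
    echo "Files changed: rerun Iindex, then: touch $stamp"
else
    echo "Index is up to date"
fi
```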

Searching the database
Once you’ve created your database, it’s easy to search. To do so, you use the Isearch utility. You’ll need to specify the database with the -d option. (Note that it’s not necessary to specify the extension.) Here’s a simple search, one that scans the linux-help database (of the entire /usr/src/linux/Documentation tree) for those files that contain the word Linus:
Isearch -d linux-help linus

Note that Isearch performs case-insensitive searches by default.

The result of this search is a scored list, with the highest-ranking file positioned at the top. Each file has its own number in the retrieval set. Isearch remains active and prompts you to select a file to view. If you type a file’s number and press [Enter], you’ll see the file whiz by on-screen. This isn’t a defect; Isearch is meant to be used with some kind of front-end program, so the program dumps all its output to the standard output by default. The Isearch front-end program par excellence is a Web browser. (For more information, visit Isite Information System.) It’s quite easy to find CGI scripts that enable you to query Isearch databases using an HTML form.

Use the asterisk wildcard to truncate the search term, as in the following example:
Isearch -d linux-help linu*

This search retrieves the files containing linux, linus, and any other strings that begin with linu. Note that Isearch performs postfix truncation only; you can’t retrieve linus and linux with *inu*.

To perform a search for a phrase, enclose the phrase within double quotation marks and enclose it again within single quotation marks, as in this example:
Isearch -d linux-help '"situations where"'

The single quotation marks are needed to ensure that the shell passes the entire quoted phrase to Isearch with the quotation marks. If you omit the single quotes, the shell strips the quotation marks before passing the phrase to Isearch, which then performs an OR search.
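
You can watch the shell do this without involving Isearch at all. In the sketch below, count_args is a stand-in function I made up; it simply reports how many arguments it receives and prints each one in brackets, so you can see exactly what a program would be handed in each case.

```shell
#!/bin/sh
# count_args stands in for Isearch here: it reports how many separate
# arguments the shell hands over, and what each one looks like.
count_args() {
    echo "$# argument(s)"
    for a in "$@"; do
        echo "  [$a]"
    done
}

# Double quotes only: the shell removes them, so the quote marks are lost.
count_args "situations where"
# Single quotes around double quotes: the double quotes survive intact.
count_args '"situations where"'
# No quotes at all: the shell passes two separate words.
count_args situations where
```

Only the second form delivers the double quotation marks to the program, which is what tells Isearch to treat the words as a phrase.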

The above search wouldn’t work with grep, as I explained at the beginning of this Daily Drill Down, because the phrase in question spans a line break. But Isearch handles it just fine.

There’s a 32-character limit for phrase searches within Isearch. Don’t enclose more than 32 characters in a quoted string, or the search won’t work.

Isearch really comes into its own with Boolean searches, which are enabled by the -infix option. The following search finds any file in the database that mentions Linus or Torvalds:
Isearch -d linux-help -infix Linus or Torvalds

You can use the and operator to restrict the search to just those documents that mention both terms, as in the following example:
Isearch -d linux-help -infix Linus and Torvalds

There’s an additional operator, andnot, which you can use to exclude unwanted documents:
Isearch -d linux-help -infix memory andnot sound

By using parentheses, you can nest expressions to perform pinpointed searches:
Isearch -d linux-help -infix '(irq and dma) and (sound or audio)'

The Isearch documentation mentions a proximity operator (near), but it doesn’t appear to have been implemented.
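
For comparison, here’s roughly what the and and andnot operators mean when you try to get the same effect out of grep alone. This is a sketch using three made-up sample files; it shows the logic at the level of whole files, and it makes clear how quickly the pipelines would grow if you added nesting.

```shell
#!/bin/sh
work=$(mktemp -d)
printf 'memory\nmanagement\n' > "$work/mm.txt"
printf 'sound card\nmemory\n' > "$work/snd.txt"
printf 'network\n' > "$work/net.txt"

# memory AND sound: look for the second term only in files that hold the first.
and_hits=$(cd "$work" && grep -il memory ./*.txt | xargs grep -il sound)
# memory ANDNOT sound: -L lists the files that do NOT contain the term.
andnot_hits=$(cd "$work" && grep -il memory ./*.txt | xargs grep -iL sound)

echo "memory and sound:    $and_hits"
echo "memory andnot sound: $andnot_hits"
rm -rf "$work"
```

Each extra operator needs another grep stage, and the simple xargs pipeline breaks on file names containing spaces, which is part of why the single -infix query is so much more pleasant.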

Searching HTML files
As I indicated earlier, Iindex can work with the structure of certain types of data files, including HTML files and mail folders. If you specify the HTML data type when you create an index, you can use Isearch to search within a particular HTML tag, such as TITLE, META, or A. Try it!

Begin by creating a database of HTML files using a command such as
Iindex -d mywebs -t HTML -r /home/bryan/webs

Using the HTML data type (-t), this command creates a database called mywebs consisting of all the files in the directory /home/bryan/webs and all associated subdirectories.

Once you’ve created the database, you can use the Iutil command to find out which tags you can search:
Iutil -d mywebs -vf

The -vf option displays a list of fields defined in the database, which, for HTML databases, consists of all the tags you’ve used in the specified HTML pages.

Now try a search. To indicate which field you’d like to search, you form the search word using the pattern field/search-term, as in these examples:

h1/introduction: Searches for the word introduction within the H1 field of an HTML document.
title/home page: Searches for the phrase home page within the TITLE field of an HTML document.

Here’s an example:
Isearch -d mywebs a/bp@nospam.net

This command searches for all the instances of bp@nospam.net within the link tag (A). On-screen, you’ll see a scored, numbered list of the matching documents, and you’ll see the matching lines.
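
If you’re wondering what restricting a search to a field buys you, the following sketch imitates a title/ search with sed and grep on two invented pages. It’s a crude emulation that assumes each TITLE element sits on a single line in lowercase tags; it isn’t how Isearch works, and the title_search function is purely hypothetical.

```shell
#!/bin/sh
work=$(mktemp -d)
printf '<html><head><title>My Home Page</title></head>\n<body><p>Welcome</p></body></html>\n' > "$work/index.html"
printf '<html><head><title>Links</title></head>\n<body><p>Back to my home page</p></body></html>\n' > "$work/links.html"

# List the files whose TITLE element (assumed to be on one line) holds the term.
title_search() {
    t_term=$1
    shift
    for f in "$@"; do
        sed -n 's/.*<title>\(.*\)<\/title>.*/\1/p' "$f" | grep -qi "$t_term" && echo "$f"
    done
    return 0
}

title_hits=$(cd "$work" && title_search "home page" ./*.html)
plain_hits=$(cd "$work" && grep -il "home page" ./*.html | paste -sd ' ' -)
echo "title only: $title_hits"
echo "anywhere:   $plain_hits"
rm -rf "$work"
```

The unrestricted search returns both pages, while the title-only search returns just the one whose title matches. That’s the kind of narrowing a field-specific query gives you.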

Isearch tips and techniques
Once you’ve mastered the fundamentals of searching with Isearch, try some of the following tricks:

Combining field-specific search terms: You can use more than one field-specific search term in an Isearch query. For example, the following query locates all the HTML documents that mention home page within the TITLE tag and Suzanne within a P tag: Isearch -d mywebs title/home page and p/Suzanne.
Performing weighted searches: If you append a colon and a number to the search term, Isearch performs a weighted search. For example, the following search looks for documents that contain Linus or Linux but gives higher preference to the documents containing Linus: Isearch -d linux-help linus:10 linux. (You can use negative numbers to give a term negative weight, if you wish.)
Controlling output: By default, Isearch lists the retrieved documents using a default output setting, which is specific to the document type. For SIMPLE documents, the list includes the filename and the first line of text. For HTML documents, the list includes the filename and the contents of the TITLE tag. However, you can use the -p option to specify which of the document type’s fields you want to view. The following retrieves the text placed in the H1 field of HTML documents that mention Suzanne anywhere in the file: Isearch -d mywebs -p h1 Suzanne. If you perform a field-specific search (such as h1/Suzanne), Isearch prints the contents of the H1 field by default.

Creating a mail database
Isearch comes with a number of additional document types (also called doctypes), including MEDLINE (which recognizes the structure of the leading medical database data records), SGMLTAG (which recognizes SGML tags), and FTP (which recognizes file listings in an FTP site). Among the doctype offerings is MAILFOLDER, which enables you to construct a searchable index of a standard-format mail archive. If you’re using an e-mail program that writes messages to standards-conformant mail folders, you can use the MAILFOLDER doctype to create a searchable index of your mail messages. The following command creates an index of all the messages in my Kmail Mail directory and all associated subdirectories:
Iindex -r -t MAILFOLDER -d mail /home/bryan/Mail

When Iindex finishes, you can view the available search fields using Iutil -d mail -vf (substitute your database’s name for mail):

CONTENT-TYPE
MESSAGE-ID
DATE
REPLY-TO
SENDER
FROM
SUBJECT
TO
MESSAGE-BODY

Here are some examples of ways you can search your e-mail database. The following query retrieves messages from Suzanne that mention Friday night in the subject field:
Isearch -d mail -infix from/Suzanne AND subject/'"Friday Night"'

The following searches for any of the specified terms in the message body:
Isearch -d mail -infix message-body/contract OR message-body/proposal

Isearch horizons
Isearch is only one component of a broader development effort called Isite, which includes not only the search software (Isearch) but also the Internet protocol that enables remote database searching. For more information on Isite, take a look at the project documentation page. I’ll take a closer look at Isite in an upcoming Daily Drill Down.

Bryan Pfaffenberger, a UNIX user since 1985, is a University of Virginia professor, an author, and a passionate advocate of Linux and open source software. A Linux Journal columnist, his recent Linux-related books include Linux Clearly Explained (Morgan-Kaufmann) and Mastering Gnome (Sybex; in press). His hobbies include messing around with his home LAN and sailing the southern Chesapeake Bay. He lives in Charlottesville, VA. If you’d like to contact Bryan, send him an e-mail.
