The good thing about computers is that they make creating thousands of documents easy. The bad thing about computers is a direct consequence of the good one: quickly finding just the documents you need among all the others can be difficult. That’s where the open source Java program called DocFetcher tries to help.

In general, for no particular reason, I’m not too fond of Java applications. I find DocFetcher quite interesting, however. It can work as a multi-user, very portable desktop search engine, if you configure it in the right way. Here are a few tricks to do just that.

First, configure your memory

Installation and basic usage of DocFetcher are quite simple, thanks to its clean user interface (Figure A), integrated manual, and wiki, so I’ll jump directly to the tough parts. The main drawback of DocFetcher may be its default memory configuration.

Figure A

All Java programs have a maximum heap size, that is, a limit on how much memory they can use. Its default value for DocFetcher is 256 megabytes, which is not enough. With that value, DocFetcher runs out of memory not only while indexing my Documents folder (16530 files for a total size just over 4 GB), but also during searches. To make this problem as rare as possible, increase the heap size in the line of the DocFetcher.sh script that actually launches Java:

java -enableassertions -Xmx256m -Xss2m -cp ".:${CLASSPATH}" -Djava.library.path="lib" net.sourceforge.docfetcher.Main "$@"

Change the value of the -Xmx option to at least 512 MB (that is, -Xmx512m) and you’ll be much happier. You may also experiment with the -Xss option, which sets the stack size of each Java thread.
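As a minimal sketch of that change, the snippet below rewrites the -Xmx value with sed. The variable names launch_line and new_line are only illustrative, and the string is a copy of the launch line quoted above, not the script itself.

```shell
# Illustrative copy of the launch line from DocFetcher.sh (single quotes keep
# ${CLASSPATH} and "$@" literal, so nothing is expanded here).
launch_line='java -enableassertions -Xmx256m -Xss2m -cp ".:${CLASSPATH}" -Djava.library.path="lib" net.sourceforge.docfetcher.Main "$@"'

# Replace whatever -Xmx value is present with 512 megabytes.
new_line=$(printf '%s\n' "$launch_line" | sed -E 's/-Xmx[0-9]+[kKmMgG]?/-Xmx512m/')

# Show the result so it can be compared with the original line.
printf '%s\n' "$new_line"
```

The same sed expression could be applied to the real script with something like sed -E -i.bak 's/-Xmx[0-9]+[kKmMgG]?/-Xmx512m/' DocFetcher.sh, which keeps a backup copy in DocFetcher.sh.bak. Check the result by hand before relying on it, since your copy of the script may differ from the line shown here.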

You can also make DocFetcher faster and more robust by limiting the maximum number of results per query. To do that, assign the smallest value that suits your real needs to the MaxResultsTotal option in the general configuration file: conf/program-conf.txt.

Index smartly…

DocFetcher needs to build indexes of your files. After the first run, only new or removed files must be analyzed, so memory consumption becomes less critical. Still, you can noticeably improve the performance and effectiveness of the DocFetcher indexer. If possible, keep compressed archives only in formats that have internal indexes, like Zip, instead of compressed tar files. Otherwise, DocFetcher will have to completely unpack those archives to know what they contain. Above all, make sure that you index only what you really, really, really want or need to index. So, before even starting DocFetcher:

  • remove duplicate files
  • exclude whole classes of files by entering proper regular expressions (e.g.: .*\.xls to skip all Excel spreadsheets) in the “Exclude files” part of the indexing queue window
  • sanitize the names of your files and make them meaningful. As you can see in Figure B, if their names match the search terms, DocFetcher returns relevant files even if their content could not be analyzed as text
  • enable the SourceCodeAnalyzer mode in the configuration file if you need to index a lot of software source code
  • explicitly declare all the formats that are plain text, but may not be recognized as such, starting from markup languages like the great txt2tags (.t2t)
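Before typing an exclusion pattern into the indexing queue window, it can help to try it on a few sample file names. The sketch below does that with grep, under the assumption that the pattern must match the whole file name (which the leading .* in the example above suggests). The function name is_excluded and the sample names are made up for illustration, and grep's extended regular expressions are not identical to Java's, although they agree on a pattern this simple.

```shell
# Pattern from the list above: skip Excel spreadsheets.
pattern='.*\.xls'

# Succeeds if the whole file name matches the pattern (-x anchors both ends).
is_excluded() {
  printf '%s\n' "$1" | grep -Eqx -- "$pattern"
}

# Try the pattern on a few hypothetical file names.
for name in budget.xls budget.xlsx notes.txt report.xls.bak; do
  if is_excluded "$name"; then
    echo "skip  $name"
  else
    echo "index $name"
  fi
done
```

Under that whole-name assumption, only budget.xls is skipped. A pattern such as .*\.xlsx? would be needed to catch the newer .xlsx files as well.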

Figure B

…and search accordingly

There are two ways to make DocFetcher searches more effective. First, restrict them to as few folders as possible and set the minimum and maximum size of the files you expect to find. What really makes the difference, however, is learning how to tell DocFetcher as exactly as possible what it should search for. The complete search syntax is the same as that of the Apache Lucene engine, but here is a very simplified summary to start practicing. Together with the search terms themselves, which are case insensitive, you can specify their:

  • relative weight: “linux^4 windows” means “search for files including both Linux and Windows, but put first those in which Linux is more frequent”
  • distance: “linux windows”~10 stands for “return files in which Linux and Windows are not more than 10 words apart”
  • similarity: “linux~” returns documents containing the word Linux or similar ones

Multiuser, portable search

I mentioned at the beginning that DocFetcher “can work as a multi-user, very portable desktop search engine”. Portability begins with the fact that DocFetcher also runs on Mac OS and Windows, with the same index format. Besides that, DocFetcher is, if I may use a buzzword, cloud-compatible. If both the files and the indexes are on online storage somewhere, any copy of DocFetcher pointing to those indexes will be able to search those files.

However, this will only happen if the indexes and the indexed folders are always in the same relative position (e.g., both under “/dropbox_files”), and you always create the indexes with the “relative paths” option checked. Using this trick, you may also distribute searchable document collections, from e-books to company catalogs, by putting all the files and a copy of DocFetcher on CD-ROMs and USB keys! Just remember to also include a copy of the Java runtime installer, in case the end user doesn’t already have one on her computer.

Finally, DocFetcher indexes can be shared among many users. The quick, dirty, low tech way to do this is to make every user install his or her own copy of the portable version of DocFetcher in his or her own folder, and then set by hand the indexes’ location in the file misc/paths.txt. The only problem in such a scenario would be if everybody were able to rebuild the indexes or alter their configuration. To prevent this, disable all the options related to index creation and modification in the program-conf.txt file of all the DocFetcher installations. Of course, users may still rewrite those same options as they please in the “Advanced Settings” panel of DocFetcher, but don’t worry: protect the indexes by giving write access to both them and the folder they live in only to the DocFetcher administrator.
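On Linux or Mac OS, that last step could look like the sketch below. It works on a throwaway directory so that it can be run safely. The index_dir path, the example file name and the docadmin account are assumptions for illustration, not names used by DocFetcher.

```shell
# Throwaway stand-in for the shared index folder; point index_dir at the real one.
index_dir="$(mktemp -d)/indexes"
mkdir -p "$index_dir"
touch "$index_dir/example-index-file"

# Hand the folder to the administrator account (needs root; "docadmin" is hypothetical):
# chown -R docadmin "$index_dir"

# The owner may read and write; group and others may only read.
# Capital X grants traverse permission on directories without making files executable.
chmod -R u=rwX,go=rX "$index_dir"

# Show the resulting permissions.
ls -ld "$index_dir" "$index_dir/example-index-file"
```

With these permissions, ordinary users can still open the indexes for searching, but any attempt to rebuild or modify them fails at the file system level, whatever they set in their own copy of DocFetcher.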