Wednesday, August 17, 2011

Easy Document Searching With DocFetcher

Recently I had to search through about 50,000 files (assorted PDF, text, Word and HTML documents) for specific strings for a report. At first I tried Google Desktop, limiting the indexing to just the directory containing the files. Unfortunately, Google Desktop kept crashing on me after the first couple of hours of indexing, so I went looking for alternatives. I came across DocFetcher and it looked perfect: open source, supports lots of file formats, and is based on Lucene.

DocFetcher supports the following formats:

  • HTML and plain text (both customizable)
  • Portable Document Format (pdf)
  • Microsoft Office (doc, xls, ppt)
  • Microsoft Office 2007 (docx, xlsx, pptx)
  • OpenOffice.org Writer, Calc, Draw and Impress (odt, ods, odg, odp)
  • Rich Text Format (rtf)
  • AbiWord (abw, abw.gz, zabw)
  • Microsoft Compiled HTML Help (chm)
  • Microsoft Visio (vsd)
  • Scalable Vector Graphics (svg)
It supports regex exclude lists, which is very handy. Searching is super simple: the search field sits at the top of the interface and supports boolean and file type searches. One really nice feature is that it indexed all of the files I wanted in under 2 hours, much faster than Google Desktop was managing before it crashed, and it didn't eat my processor for lunch. There is also a portable version that runs on Windows and Linux. I really can't say enough good things about this program.
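
For anyone who hasn't used a regex exclude list before, here is a rough Python sketch of the general idea: every file name is checked against a list of patterns, and anything that matches is skipped. This is only an illustration and not DocFetcher's actual code; the patterns, the `filter_files` helper and the sample file names are all made up for the example.

```python
import re

# Hypothetical exclusion patterns; illustrative only, not DocFetcher defaults.
EXCLUDE_PATTERNS = [
    r".*\.tmp$",   # temporary files
    r".*\.bak$",   # backup copies
    r"^~\$.*",     # Office lock files such as "~$report.docx"
]

def filter_files(filenames, patterns=EXCLUDE_PATTERNS):
    """Return the file names that match none of the exclusion patterns."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    return [
        name for name in filenames
        if not any(rx.fullmatch(name) for rx in compiled)
    ]

# Made-up file names to show what would be kept for indexing.
files = ["report.pdf", "notes.txt", "draft.bak", "~$report.docx", "cache.TMP"]
print(filter_files(files))  # ['report.pdf', 'notes.txt']
```

Matching is done against the whole file name and ignores case here, which is just a choice made for this sketch.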
