How to search for text inside many OpenDocument files

Marco Fioretti explains how to search inside ODF files from the command line or via a script without manually opening the files.

The OpenDocument Format (ODF) is the default format for texts, spreadsheets, and presentations in all major FOSS office suites: Calligra, KOffice, LibreOffice, and OpenOffice. Above all, ODF is a truly open standard, formally ratified by ISO. Right now, ODF is by far the best solution for complex office documents, for (at least) all these reasons:

  • full support by Free Software
  • usable with Microsoft Office too (as long as you don't rely too much on collaborative editing or versioning, and can live with a few other quirks)
  • highest possible protection from lock-in and intellectual property problems
  • long-term availability
  • last but not least (which is the point of this post): simplicity!

In fact, regardless of their suffixes, all ODF files are just ZIP archives. Inside them, the actual content is always stored in one plain text file called, unsurprisingly, content.xml. This means that it is very simple (especially on Linux!) to analyse, generate, and process ODF documents automatically. All you need are tools already installed by default, or at least available as binary packages, on any GNU/Linux distribution, plus small shell scripts that even novice users can quickly learn to write.
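If you want to see that structure with your own eyes, a minimal sketch like the following should do, assuming unzip is installed. The name odf_peek is made up for this example; it is not a standard tool.

```shell
# odf_peek: hypothetical helper (the name is an assumption for this sketch).
# Lists the members of an ODF archive, then prints the first 300 bytes
# of its content.xml so you can look at the raw XML.
odf_peek() {
    local file="$1"
    # -l lists the archive members without extracting anything
    unzip -l "$file" || return 1
    # -p writes the named member to standard output
    unzip -p "$file" content.xml | head -c 300
    echo    # finish with a newline, since the excerpt has none
}
```

You would call it as odf_peek school_report.odt, where the file name is only a placeholder.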

I have already explained, a couple of years ago, how simple it is to automatically generate ODF texts, spreadsheets, and presentations. Today, I'm going to explain how to search for text inside the same files, from the command line or from a script, without opening them manually.

In general, there are two things that make this task (a bit) complicated. First of all, the files we really want to analyse are hidden inside the only files we see directly, that is the ZIP archives with .odt, .ods or .odp suffixes. The other problem is that those files are... really ugly! During normal operations, content.xml is only seen and processed by software, whose idea of readability is very different from ours. Therefore, while strictly speaking content.xml is plain text that you can manage with any text editor, it is also one unreadable blob of XML, without line breaks or nonessential spaces.

My starting point to find which ODF files contain certain text, without opening them, was a shell function I found in the Ubuntu forums:

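   1 function odtgrep(){
   2   term="$1"                      # the string to search for
   3   for file in *.odt; do          # every ODF text file in the current directory
   4       unzip -p "$file" content.xml | tidy -q -xml 2> /dev/null | grep "$term";
   5       if [ $? -eq 0 ]; then      # 0 is grep's exit status when it found a match
   6           echo "$file";
   7       fi;
   8   done
   9 }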

When used with the -p option, the unzip program extracts from the ODF archive passed as first argument (the quotes help in case that name contains spaces) only the file given as second argument, and prints it to standard output. Then the tidy utility, which -xml tells to treat its input as XML and -q to keep quiet, reformats that text into shorter lines, so that grep can print just the matching ones instead of the whole blob.
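If tidy is not installed on your system, a rough substitute is to break the blob wherever one tag ends and the next begins. This is my own stand-in for this step, not what the forum function does, and split_tags is a name invented for the sketch. It does no real XML parsing, so treat it as an approximation.

```shell
# split_tags: hypothetical filter, a crude stand-in for "tidy -q -xml".
# Reads XML on standard input and inserts a line break between every
# pair of adjacent tags ("><"), so grep sees many short lines.
# The backslash followed by a real newline is the portable way to put
# a line break in a sed replacement.
split_tags() {
    sed 's/></>\
</g'
}
```

In the pipeline it would simply take the place of tidy: unzip -p "$file" content.xml | split_tags | grep "$term"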

All in all, that function searches for the given string in all the ODF text files of the current directory (please note the .odt suffix in line 3 above). It prints every matching line, followed by the name of the file that contains it: line 6 is executed only when the exit status of grep is 0, which means that the string was found. When I used that function to find which ODF file contained the string "Personal Factory", the matching line it printed was:

  <text:span text:style-name="T8">Personal Factory (</text:span>

It's not perfect, I'll grant you, but it does find the text you're looking for, much more quickly than opening each file by hand!
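Two small variations can make the result less noisy. Both are sketches of mine, with invented names (odtgrep_names and strip_tags), not part of the forum function. The first relies only on grep's exit status and prints nothing but the names of the matching files; the second deletes the XML tags from whatever lines you pipe into it.

```shell
# odtgrep_names: hypothetical variant that prints only file names.
# grep -q prints nothing and exits with status 0 on a match, so the
# whole pipeline can be used directly as the condition of "if".
odtgrep_names() {
    local term="$1" file
    for file in *.odt; do
        [ -e "$file" ] || continue   # no .odt files here: skip the literal pattern
        if unzip -p "$file" content.xml | grep -q -- "$term"; then
            printf '%s\n' "$file"
        fi
    done
}

# strip_tags: hypothetical filter that removes everything between
# "<" and ">" on each line. XML entities such as &amp; are left as they are.
strip_tags() {
    sed 's/<[^>]*>//g'
}
```

For example, odtgrep_names "Personal Factory" lists the files, while odtgrep "Personal Factory" | strip_tags shows the matching text without the surrounding markup.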

However, I wasn't satisfied with that function. I wanted to use the same command on both ODF and plain text files: the reason is that I often need to find some string, from a shell script, in all my writings, no matter what formats I saved them in. A quick and dirty, but efficient way to get there is to modify the function above as follows:

   1 function odfgrep(){
   2 FILE="$1"                        # first argument: the file to search
   3 shift                            # drop it, so "$@" holds only grep's arguments
   4 EXT="${FILE##*.}"                # everything after the last dot
   5 case "$EXT" in
   6    odt|ods|odp)
   7         unzip -p "$FILE" content.xml | tidy -q -xml 2> /dev/null | grep "$@" ;;
   8    txt|t2t)
   9         grep  "$@" "$FILE" ;;
  10    *) echo "Sorry, I don't know what to do with $FILE"
  11    ;;
  12 esac
  13 }

What's new here? Well, first of all, this works on all ODF documents (.odt, .ods and .odp), as well as plain text (.txt) and Txt2Tags files (Txt2Tags is a tool that generates documents in several formats from lightly marked-up plain text). Line 4 is the standard Bash black magic to get the extension of a file. If the current file belongs to the first category (lines 6-7), the function works as in the original version. If it is a txt/t2t file (lines 8-9), grep can work on it without any assistance. With all other extensions, the function issues a warning and exits (lines 10-11).
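The "black magic" is easier to trust once you have tried it in isolation. Here is a tiny sketch; file_ext is a name I made up, and the file names are only examples.

```shell
# file_ext: hypothetical helper showing the ${VAR##*.} expansion.
# "##*." removes the longest prefix that ends with a dot,
# leaving only what follows the last dot in the name.
file_ext() {
    printf '%s\n' "${1##*.}"
}

file_ext school_report.odt      # prints: odt
file_ext notes.2012.final.txt   # prints: txt
file_ext README                 # prints: README (no dot, so nothing is removed)
```

The last case explains why a file without any suffix ends up in the catch-all branch of the case statement: its "extension" is the whole name.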

With respect to the original version, this function has another advantage: it can pass options to grep! Here's an example to show you what I mean. If you wanted to find the occurrences of Some_String, case insensitive, in school_report.txt, you would type at the prompt:

grep -i "Some_String" school_report.txt

The odfgrep function works in a similar way (even if you pass multiple options for grep!):

odfgrep school_report.odt -i "Some_String"

The reason is that shell scripts or functions receive their arguments both one by one, as positional parameters ($1 is the first argument, $2 the second and so on), and all together in the special parameter $@, which behaves much like an array of those arguments. Line 2 of odfgrep saves the file name in $FILE, just before we remove it from $@ with the shift command. In our example, initially $@ will be:

school_report.odt -i "Some_String"

After line 3, however, it will have lost its first element, becoming:

-i "Some_String"
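You can check both states of $@ with a throwaway function. This is just a demonstration sketch, and show_shift is an invented name.

```shell
# show_shift: hypothetical demo function, not part of odfgrep.
# Prints the arguments before and after shift, one per line.
show_shift() {
    echo "before shift: $# arguments"
    printf '  [%s]\n' "$@"
    local file="$1"     # save the first argument, as line 2 of odfgrep does
    shift               # drop it from the argument list
    echo "after shift: $# arguments, FILE=$file"
    printf '  [%s]\n' "$@"
}

show_shift school_report.odt -i "Some_String"
# before shift: 3 arguments
#   [school_report.odt]
#   [-i]
#   [Some_String]
# after shift: 2 arguments, FILE=school_report.odt
#   [-i]
#   [Some_String]
```

Writing "$@" inside double quotes matters: each argument stays one word, even if it contains spaces.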

So, we can pass it all to grep in line 7 or line 9! Cool, isn't it? You can use this function as is at the prompt (once you have added it to your .bashrc file) or inside scripts, and expand it in many ways. Do you already have some ideas?
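Here is one such idea, as a sketch only: a wrapper that runs odfgrep on every supported file of a directory and puts the file name in front of each matching line. The name odfgrep_dir and its argument order are my own assumptions, and the wrapper expects the odfgrep function shown earlier to be already defined in your shell.

```shell
# odfgrep_dir: hypothetical wrapper around the odfgrep function.
# Usage: odfgrep_dir DIRECTORY [grep options] PATTERN
odfgrep_dir() {
    local dir="$1" file line
    shift    # what remains in "$@" is handed to odfgrep, and so to grep
    if ! command -v odfgrep >/dev/null 2>&1; then
        echo "odfgrep_dir: please define the odfgrep function first" >&2
        return 1
    fi
    for file in "$dir"/*.odt "$dir"/*.ods "$dir"/*.odp "$dir"/*.txt "$dir"/*.t2t; do
        [ -e "$file" ] || continue   # this pattern matched nothing: skip it
        # Prefix every matching line with the name of its file
        odfgrep "$file" "$@" | while IFS= read -r line; do
            printf '%s: %s\n' "$file" "$line"
        done
    done
}
```

With it, something like odfgrep_dir ~/writings -i "Some_String" would search a whole folder in one go (the folder name is just an example).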

About

Marco Fioretti is a freelance writer and teacher whose work focuses on the impact of open digital technologies on education, ethics, civil rights, and environmental issues.
