OpenOffice and LibreOffice: How to manage hybrid PDFs

If you have many PDF files to manage, it may become difficult to tell which are uneditable and which are hybrid (editable). Here's a tip for sorting out the difference.

My last post lists seven great, little-known features of OpenOffice and LibreOffice. This week I'll look in detail at some implications of the sixth feature, Hybrid PDFs.

Creating such files with either OpenOffice or LibreOffice is, as explained in that other post, really simple: select File | Export as PDF, tick the "Embed OpenDocument file" box, and you'll get a PDF document that embeds a complete copy of the original OpenDocument file. This lets you distribute "read-only" documents that look exactly as you intended, but are still completely editable if necessary.

If this were the whole story, there would be no need to write another post about it. However, once you start thinking about the implications, things become more interesting and deserve a closer look. For example...

How do you recognize hybrid PDFs?

To show what does (and does not) distinguish hybrid PDFs, I exported the OpenDocument version of my previous post (called 7tips_aoo_lo.odt) to both normal and hybrid PDF formats. The first, obvious difference among the three files was their size:

  [marco@polaris hybrid_pdfs]$ ls -l test/7tips_aoo_lo*
  35393 May 24 19:35 7tips_aoo_lo.odt
  60782 May 24 19:35 7tips_aoo_lo.pdf
  96371 May 24 19:35 7tips_aoo_lo_emb.pdf

No surprises here: as expected, the .odt file (being nothing but a ZIP archive) is the smallest one. The normal PDF is much bigger, and the size of the hybrid one (called 7tips_aoo_lo_emb.pdf to highlight that it embeds the original document) is just 196 bytes more than the sum of the first two.
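The size relationship is easy to check from the shell. The snippet below is only a sketch: embed_overhead is a name made up for illustration, and it assumes you pass it the .odt file, the normal PDF and the hybrid PDF, in that order.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original post): print how many bytes
# the hybrid PDF adds on top of the normal PDF plus the embedded .odt file.
embed_overhead() {
    local odt_size pdf_size hybrid_size
    odt_size=$(wc -c < "$1")
    pdf_size=$(wc -c < "$2")
    hybrid_size=$(wc -c < "$3")
    echo $(( hybrid_size - (odt_size + pdf_size) ))
}

# The same arithmetic, using the sizes from the listing above.
echo $(( 96371 - (35393 + 60782) ))   # prints 196
```

With the test files of this post, the call would be: embed_overhead test/7tips_aoo_lo.odt test/7tips_aoo_lo.pdf test/7tips_aoo_lo_emb.pdf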

As you can see in Figure A, LibreOffice (or Apache OpenOffice) has no problem recognizing exactly what type of PDF you told it to open: the normal PDF was opened by Draw, while the ODF component of the hybrid PDF was detected and opened directly by Writer.

Figure A

The trouble starts the day you find yourself with many files, maybe created years before or by somebody else, and no clue as to which ones are normal PDFs and which ones are hybrid files.

Being able to make this distinction is more important than you may think, at least for businesses. Many companies probably would not want to publish online PDF brochures, reports and what not... that also include the ODF original document, fully editable, and with plenty of potentially sensitive metadata.

Unfortunately, not all file managers and PDF viewers seem able to signal embedded documents in hybrid PDFs (Okular, for example, keeps the "Embedded Files" entry grayed out in Figure B):

Working on the command line doesn't seem to change anything. On Unix-like systems (and probably on other platforms too) magic numbers are "numbers embedded at or near the beginning of a file that indicate what type of file it is". On Linux you can use magic numbers with the file command. However, file doesn't see any difference between normal and hybrid PDFs:

  [marco@polaris hybrid_pdfs]$ file test/*
  test/7tips_aoo_lo_emb.pdf: PDF document, version 1.4
  test/7tips_aoo_lo.odt:     OpenDocument Text
  test/7tips_aoo_lo.pdf:     PDF document, version 1.4

The result is the same even if you add the -i option, which shows complete MIME types:

  [marco@polaris hybrid_pdfs]$ file -i test/*
  test/7tips_aoo_lo_emb.pdf: application/pdf; charset=binary
  test/7tips_aoo_lo.odt:     application/vnd.oasis.opendocument.text; charset=binary
  test/7tips_aoo_lo.pdf:     application/pdf; charset=binary
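Since magic numbers sit at or near the beginning of a file, one way to see why file reports the same type is to look at the first bytes yourself. This is a minimal sketch; pdf_header is a hypothetical helper name, and the usage lines assume the test files from this post.

```shell
#!/bin/bash
# Hypothetical helper: print the first 8 bytes of a file, which is where
# the PDF signature (for example "%PDF-1.4") is stored.
pdf_header() {
    head -c 8 "$1"
    echo
}

# Usage, assuming the test files used in this post exist:
#   pdf_header test/7tips_aoo_lo.pdf
#   pdf_header test/7tips_aoo_lo_emb.pdf
```

Both PDFs are reported above as "PDF document, version 1.4", so both calls should print the same signature: whatever marks the embedded document lies further inside the file.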

Searches for other ways to recognize PDF files with embedded objects turn up pages like this, which mention "EF" entries marking those objects. There are, however, no such strings in the hybrid PDFs generated by LibreOffice. After some trials, I found this dirty but apparently effective way to detect if a PDF file created by AOO or LO contains its OpenDocument version:

       1  #! /bin/bash
       2
       3  for F in `find "$1" -type f -iname "*.pdf"`
       4      do
       5          IS_HYBRID=`od -c "$F" | cut -c8- | tr -d " \012" | grep application | grep vnd | grep oasis | grep -c opendocument`
       6      if [ "$IS_HYBRID" == "1" ]
       7          then
       8          echo "$F"
       9          fi
      10      done

This shell script finds all the files with the .pdf extension in the folder passed as first argument (line 3), checks whether they contain the strings application, vnd, oasis and opendocument (line 5), and prints the names of those that do (lines 6-10). Line 5 takes an ASCII listing of the PDF file created with od and removes the offset column at the start of each line (with cut), then all spaces and newlines (with tr). The result is one huge string in which grep can easily detect the substring(s) that are only present inside hybrid PDFs.

The reason to use several invocations of grep is that (don't ask me why) not all LO applications use exactly the same string to mark the files they embed. A more robust version of the script should search for all the exact variants of that string. The code above, in fact, will call hybrid any file that contains all those substrings in any position, but it's enough to illustrate the principle in the space I have.
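Here is one possible shape for a stricter version, offered as a sketch and not as a tested replacement. It rests on an assumption I have not verified against every AOO or LO release: that the marker is always the OpenDocument MIME type prefix, with the slash written either literally or as #2F. Check the pattern against your own files and adjust it. The function names are made up for this example.

```shell
#!/bin/bash
# Sketch of a stricter hybrid-PDF check. ASSUMPTION: the embedded file is
# marked by the OpenDocument MIME type prefix, with the slash either
# literal or escaped as #2F. Adjust PATTERN after inspecting real files.
PATTERN='application(/|#2[Ff])vnd\.oasis\.opendocument'

# Succeed if the marker appears anywhere in the raw bytes of the file.
is_hybrid_pdf() {
    LC_ALL=C grep -a -q -E "$PATTERN" "$1"
}

# Print every hybrid PDF below the given folder, one per line.
# -print0 and read -d '' keep file names containing spaces intact.
list_hybrid_pdfs() {
    find "$1" -type f -iname '*.pdf' -print0 |
        while IFS= read -r -d '' f; do
            if is_hybrid_pdf "$f"; then
                printf '%s\n' "$f"
            fi
        done
}

# Scan the folder given as first argument, if there is one.
if [ -n "${1:-}" ]; then
    list_hybrid_pdfs "$1"
fi
```

Like the original script, this still matches the marker at any position in the file, so it can be fooled by a normal PDF whose text happens to contain that string. What it adds is a contiguous match in place of four independent substrings, and safe handling of file names with spaces.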

Automatic generation of hybrid PDFs

Easy as it is, creating hybrid PDFs from the user interface is still manual work. Can we generate such versions of many files with AOO or LO, without opening them one at a time? Yes, we can. Some time ago, I explained how to automatically convert .doc and ODF files to clean and lean HTML. The same trick explained in that post can be used to generate hybrid PDFs. You only need to use the right export filter, which is writer_globaldocument_pdf_Export (I found its name in the list generated by this macro, also explained in my "lean HTML" post). You can easily insert this command in any shell script:

soffice --headless --convert-to pdf:writer_globaldocument_pdf_Export --outdir . some_odf_file

This will create a hybrid PDF copy of some_odf_file. Giving that copy a name that clearly indicates that it is a hybrid PDF is left as an exercise for the reader.
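For readers who would rather skip the exercise, here is one way it might be done. Treat it as a sketch: hybrid_name, export_hybrid and the SOFFICE variable are names invented for this example, the "_emb" suffix simply follows the naming of my test file, and the code assumes that soffice writes its output into --outdir using the source file's base name with a .pdf extension.

```shell
#!/bin/bash
# Sketch only: build the name of the hybrid copy by adding an "_emb"
# suffix, as in 7tips_aoo_lo_emb.pdf earlier in this post.
hybrid_name() {
    local base
    base=$(basename "$1")
    printf '%s_emb.pdf\n' "${base%.*}"
}

# Convert one file into a temporary folder, then move the result into the
# current directory under its "_emb" name. Set SOFFICE if the binary has
# a different name or path on your system.
export_hybrid() {
    local src=$1 tmpdir base status
    tmpdir=$(mktemp -d) || return 1
    base=$(basename "$src")
    "${SOFFICE:-soffice}" --headless \
        --convert-to pdf:writer_globaldocument_pdf_Export \
        --outdir "$tmpdir" "$src" &&
        mv "$tmpdir/${base%.*}.pdf" "./$(hybrid_name "$src")"
    status=$?
    rm -rf "$tmpdir"
    return $status
}

# Usage: export_hybrid some_odf_file.odt
```

Converting into a temporary folder first means that a normal PDF with the same base name in the current directory is never overwritten.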

About

Marco Fioretti is a freelance writer and teacher whose work focuses on the impact of open digital technologies on education, ethics, civil rights, and environmental issues.
