Fixing OpenDocument MIME magic on Linux

by Sander Marechal

While working on the beta of Officeshots.org I ran into an interesting problem with file type and MIME type detection of OpenDocument files. When a user uploads an ODF file to Officeshots I want to determine the MIME type myself using the PHP Fileinfo extension. Windows users who do not have any ODF-supporting applications installed will report ODF files as application/zip, which is of no use to me. In addition, a malicious user could attempt to upload an executable file and report its MIME type as that of an ODF file.
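The idea of ignoring the client-reported type can be sketched in a few lines. Officeshots uses the PHP Fileinfo extension; the Python below is only a stand-in that shells out to the file command, and the names `detect_mime` and `accept_upload` are hypothetical. It assumes a `file` binary that supports `--brief` and `--mime-type`.

```python
import shutil
import subprocess

# Common prefix of the registered OpenDocument MIME types.
ODF_PREFIX = "application/vnd.oasis.opendocument."


def detect_mime(path):
    """Return the MIME type that the file command derives from the contents."""
    if shutil.which("file") is None:
        raise RuntimeError("the file command is not installed")
    out = subprocess.run(
        ["file", "--brief", "--mime-type", path],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()


def accept_upload(path, claimed_mime, detect=detect_mime):
    """Accept an upload only if its contents look like ODF.

    claimed_mime is whatever the browser sent. It is deliberately ignored
    for the decision, because the client can set it to anything.
    """
    detected = detect(path)
    return detected.startswith(ODF_PREFIX), detected
```

The `detect` parameter exists only so the decision logic can be exercised without a real file or a patched magic database.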

On Linux, the PHP Fileinfo extension relies on the magic file that is provided by the file package. The magic file contains a series of tests that can determine the file type and MIME type of a file by its contents. I found out that the magic file is incomplete for OpenDocument files. Below I will show you what is wrong with the magic file and how you can fix it.

If you don’t care about the technical explanation, you can skip to the fix directly.

The problem with magic

First off, some tests. I ran these tests on Debian Lenny, but I have seen other distributions with the same incomplete file magic support for OpenDocument Format. Here is what I get when I test an .odt file using the file command.

  ~$ file document.odt
  document.odt: OpenDocument Text
  ~$ file --mime document.odt
  document.odt: application/vnd.oasis.opendocument.text

So far, so good. Both the file type description and the MIME type are right. But for any other type of OpenDocument file only the description is correct. The MIME type is not. Below I am testing an .ods spreadsheet.

  ~$ file spreadsheet.ods
  spreadsheet.ods: OpenDocument Spreadsheet
  ~$ file --mime spreadsheet.ods
  spreadsheet.ods: application/octet-stream

The file type "OpenDocument Image Template" is missing from the magic file altogether. There is another problem with the magic file too. An OpenDocument file is basically a zip archive that contains several XML files. The OpenDocument specification does not specify what version of zip to use. The magic file only looks for zip 2.0, which is what most ODF applications use, but not all. Some applications use version 1.0 instead, and according to the ODF spec that is valid. Here is what happens when you try to detect an ODF file zipped with the zip 1.0 standard.

  ~$ file document.odt
  document.odt: Zip archive data, at least v1.0 to extract
  ~$ file --mime document.odt
  document.odt: application/zip
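To see why a version check gets in the way, it helps to look at the first bytes of such a file. The sketch below is Python and is not taken from the patch; `make_odf_stub` and `sniff_odf` are hypothetical helpers. It assumes the usual ODF packaging, in which `mimetype` is the first zip entry, stored uncompressed and without an extra field. The code reads the "version needed to extract" field of the zip local file header (20 for zip 2.0, 10 for zip 1.0) and the MIME string that follows the entry name.

```python
import struct
import zlib

# Fixed-size part of a zip local file header (30 bytes, little-endian):
# signature, version needed, flags, compression method, mod time, mod date,
# CRC-32, compressed size, uncompressed size, name length, extra length.
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")


def make_odf_stub(mimetype, version_needed):
    """Build only the first entry of an ODF-like zip (not a complete archive)."""
    payload = mimetype.encode("ascii")
    name = b"mimetype"
    header = LOCAL_HEADER.pack(
        b"PK\x03\x04", version_needed, 0, 0, 0, 0,
        zlib.crc32(payload), len(payload), len(payload), len(name), 0,
    )
    return header + name + payload


def sniff_odf(data):
    """Return (version_needed, mimetype), or None if data is not ODF-like."""
    if len(data) < LOCAL_HEADER.size:
        return None
    (sig, version, _flags, method, _time, _date, _crc,
     csize, _usize, name_len, extra_len) = LOCAL_HEADER.unpack_from(data)
    # Require the zip signature and a stored (uncompressed) first entry.
    if sig != b"PK\x03\x04" or method != 0:
        return None
    name_start = LOCAL_HEADER.size
    if data[name_start:name_start + name_len] != b"mimetype":
        return None
    body_start = name_start + name_len + extra_len
    mimetype = data[body_start:body_start + csize].decode("ascii", "replace")
    return version, mimetype
```

Calling `sniff_odf` on stubs built with version 20 and version 10 gives the same MIME string; only the version number differs. That suggests a detector can ignore the version field and match on the `mimetype` entry alone.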

Fixing magic detection

I have written a patch for the magic file that fixes all of the above problems. It removes the version test for the ODF zip container, adds the correct MIME type for all the different ODF file types and adds the missing OpenDocument Image Template. This patch is written for /usr/share/file/magic on Debian Lenny. If you want to patch your own Linux distribution then you may need to adapt it. You can view the patch in our Officeshots Trac or download the patch directly from Subversion.

Update 2009-06-29: I have now also created a patch against the original upstream file-5.0.3.

First, make a backup of your original magic file. Then apply the patch to magic.

  ~# cd /usr/share/file
  /usr/share/file# cp magic magic.orig
  /usr/share/file# patch < ~/magic.patch
  patching file magic

After this you need to recompile the magic file. This will create magic.mgc which is the file that is actually used by the file command and the PHP Fileinfo extension.

  /usr/share/file# file -C magic

Now your magic file will correctly identify all OpenDocument file types.

  ~$ file --mime spreadsheet.ods
  spreadsheet.ods: application/vnd.oasis.opendocument.spreadsheet
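To repeat this check for several files at once, something like the following Python sketch could work. It is an illustration and not part of the patch: `EXPECTED`, `file_mime` and `mismatches` are made-up names, the extension table covers only a handful of ODF types, and it assumes a `file` binary that understands `--brief` and `--mime-type`.

```python
import subprocess
from pathlib import Path

ODF_PREFIX = "application/vnd.oasis.opendocument."

# Extension -> MIME subtype suffix for a few ODF types (not exhaustive).
EXPECTED = {
    ".odt": "text",
    ".ods": "spreadsheet",
    ".odp": "presentation",
    ".odg": "graphics",
    ".oti": "image-template",
}


def file_mime(path):
    """Ask the file command for the bare MIME type of path."""
    result = subprocess.run(
        ["file", "--brief", "--mime-type", str(path)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def mismatches(paths, detect=file_mime):
    """Return (path, expected, actual) for every file with a wrong MIME type."""
    bad = []
    for path in map(Path, paths):
        suffix = EXPECTED.get(path.suffix.lower())
        if suffix is None:
            continue  # not an extension this sketch knows about
        expected = ODF_PREFIX + suffix
        actual = detect(path)
        if actual != expected:
            bad.append((str(path), expected, actual))
    return bad
```

An empty list from `mismatches` means every recognised file was reported with the MIME type that matches its extension.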

And that’s all there is to it. Have fun with ODF!

Creative Commons Attribution-ShareAlike

Comments

#1 Frank Groeneveld (http://techfield.org)

Thanks for this! Did you submit the patch upstream?

#2 Sander Marechal (http://www.jejik.com)

Yes, both to Debian and to the upstream file project. Upstream has already committed the patch. See this thread.

#3 Polprav (http://polprav.blogspot.com/)

Hello from Russia!
Can I quote a post in your blog with the link to you?

#4 Sander Marechal (http://www.jejik.com)

Polprav: Of course you can. All my articles are licensed under a "Creative Commons - Attribution - Share alike" license. See the little logo with three circles at the bottom right of the article (it's a link to the full license).

That means you can use, quote, change or sell my article, whatever you want, as long as you mention my name (or link back to me) and you also share your article under the same license.

#5 Anonymous Coward

I need to know how the file command reads the magic number and how the filetype is displayed.

Can I read the magic file from anywhere to know the filetype of a file?

#6 Sander Marechal (http://www.jejik.com)

@Anonymous: The rules for filetype detection are in /usr/share/file/magic. It's human readable (and editable) but complex to parse. I would not recommend that you try to parse it yourself.

Instead, I suggest that you use the proper library and API. For example, from C and C++ you can use libmagic. In PHP you can use the Fileinfo extension. I am sure that Perl, Python and any other language also have a library that wraps around libmagic. That is much, much easier than trying to do it yourself.

#7 fenderbirds

nice article, keep the posts coming
