How to correctly create ODF documents using zip

by Sander Marechal

One of the great advantages of the OpenDocument format is that it is simply a zip file. You can unzip it with any archiver and take a look at the contents, which is a set of XML documents and associated data. Many people are using this feature to create some nifty toolchains. Unzip, make some changes, zip it again and you have a new ODF document. Well… almost.

The OpenDocument Format specification, section 17.4 has one little extra restriction when it comes to zip containers: The file called “mimetype” must be at the beginning of the zip file, it must be uncompressed and it must be stored without any additional file attributes. Unfortunately many developers seem to forget this. It is the number one cause of failed documents at Officeshots.org. If the mimetype file is not correctly zipped then it is not possible to programmatically detect the mimetype of the ODF file. And if the mimetype check fails, Officeshots (and possibly other applications) will refuse the document. This problem is compounded because virtually no ODF validator checks the zip container. They only check the contents.

In this article I will show you how you can properly zip your ODF files, but before I do that I will show you the problem in detail.

Detecting mimetypes

Linux and other Unix-like operating systems do not rely on file extensions to determine the type of a file. Relying on file extensions can be a serious security problem, as you can see in the Windows world. It's simply too easy to change the extension and pretend that a file is of a different type than it really is. Instead, the Unix world looks at the contents of the file itself. This is done by a library called “magic”.

The magic library consists of a large set of rules, which it uses to figure out what type of file it is looking at. For example, it can look at a certain byte offset and see what value it contains. This is precisely the reason why the ODF specification says that you need to zip the mimetype first, without any file attributes. If you do that and open the ODF file in a hex editor, you will see something like this:

  Offset:    Hexadecimal:                                        ASCII:
  00000000 - 50 4b 03 04  14 00 00 08  00 00 c1 b6  66 3b 5e c6  PK..............
  00000010 - 32 0c 27 00  00 00 27 00  00 00 08 00  00 00 6d 69  2.'...'.......mi
  00000020 - 6d 65 74 79  70 65 61 70  70 6c 69 63  61 74 69 6f  metypeapplicatio
  00000030 - 6e 2f 76 6e  64 2e 6f 61  73 69 73 2e  6f 70 65 6e  n/vnd.oasis.open
  00000040 - 64 6f 63 75  6d 65 6e 74  2e 74 65 78  74 50 4b 03  document.textPK.
  ...

This is very easy to match for the magic library. Here is an explanation of the rules that magic uses to test if the file is an ODF file:

  1. Look at the beginning of the file. It should start with the letters PK and then bytes 03 and 04. This means it is a zip file.
  2. Look at offset 30 ("1e" in hex). It should be the string "mimetype".
  3. Look at offset 38 ("26" in hex), directly after the word "mimetype". It should be one of the ODF mimetypes.
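
The three rules can be sketched in a few lines of Python. This is only an illustration of the checks as described above, not the actual rule set of the magic library. The function names and the prefix match on "application/vnd.oasis.opendocument." are assumptions made for this example.

```python
ZIP_SIGNATURE = b"PK\x03\x04"
# Assumption for this sketch: every ODF mimetype starts with this prefix.
ODF_PREFIX = b"application/vnd.oasis.opendocument."


def looks_like_odf(header: bytes) -> bool:
    """Apply the three rules above to the first bytes of a file."""
    # Rule 1: the file starts with the zip local file header signature.
    if header[0:4] != ZIP_SIGNATURE:
        return False
    # Rule 2: the name of the first entry, at offset 30, is "mimetype".
    if header[30:38] != b"mimetype":
        return False
    # Rule 3: an ODF mimetype follows directly at offset 38.
    return header[38:].startswith(ODF_PREFIX)


def file_looks_like_odf(path: str) -> bool:
    """Read enough of the file to cover the header, name and mimetype."""
    with open(path, "rb") as fh:
        return looks_like_odf(fh.read(128))
```

Calling file_looks_like_odf("my-document.odt") would then return True only if the mimetype entry comes first and its contents start right after its name.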

You can guess what happens when you don't zip the mimetype file first: The string "mimetype" won't be at the right offset. And if you accidentally zip it with extra file attributes, then the contents of the mimetype file will not start directly after it. There will be several bytes in between. This causes the magic library to detect it as a standard zip file, not as an ODF file. Here is what such a badly zipped ODF could look like. This file was zipped normally, without paying special attention to the mimetype file:

  Offset:    Hexadecimal:                                        ASCII:
  00000000 - 50 4b 03 04  0a 00 00 00  00 00 25 01  6e 3c 00 00  PK..............
  00000010 - 00 00 00 00  00 00 00 00  00 00 10 00  15 00 43 6f  ..............Co
  00000020 - 6e 66 69 67  75 72 61 74  69 6f 6e 73  32 2f 55 54  nfigurations2/UT
  00000030 - 09 00 03 16  1b 9c 4b 47  1e 9c 4b 55  78 04 00 e8  ......KG..KUx...
  00000040 - 03 e8 03 50  4b 03 04 0a  00 00 00 00  00 25 01 6e  ...PK........%.n
  ...

As you can see, it does not match the rules that the magic library has. Instead of checking your ODF file with a hex editor, you can also simply use the "file" command. For example:

  $ file --mime my-document.odt
  my-document.odt: application/vnd.oasis.opendocument.text

If that command results in "application/zip" or "application/octet-stream" then it means that your ODF file is probably incorrectly zipped. Note that the magic library shipped with "file" up to version 5.0.3 does not contain all mimetypes for ODF files but only for OpenDocument Text (odt) files. File 5.0.3 is the version most commonly shipped with Linux distributions today. I have since submitted a patch that includes all known ODF mimetypes. It was accepted and it should be included in file version 5.0.4 and later.
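
If "file" reports a plain zip, it can help to decode the first local file header and see which field is off. The following is a rough sketch, assuming the standard 30-byte layout of a zip local file header. The function name describe_first_entry and the keys of the returned dictionary are made up for this example.

```python
import struct

# Fixed 30-byte part of a zip local file header, little-endian:
# signature, version needed, flags, compression method, time, date,
# CRC-32, compressed size, uncompressed size, name length, extra length.
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")


def describe_first_entry(data: bytes) -> dict:
    """Decode the first local file header of a zip file."""
    (signature, version, _flags, method, _time, _date,
     _crc, _csize, _usize, name_len, extra_len) = LOCAL_HEADER.unpack_from(data)
    if signature != b"PK\x03\x04":
        raise ValueError("not a zip local file header")
    name = data[30:30 + name_len].decode("ascii", "replace")
    return {
        "name": name,
        "version_needed": version,   # 20 means zip 2.0, 10 means zip 1.0
        "method": method,            # 0 means stored (uncompressed)
        "extra_len": extra_len,      # must be 0 for the mimetype entry
        "data_offset": 30 + name_len + extra_len,
    }
```

Applied to the first hex dump above, this gives the name "mimetype", method 0, no extra bytes and a data offset of 38. For the badly zipped file it gives the name "Configurations2/" and 21 extra bytes, which is why nothing lines up with the rules.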

How to zip an ODF file

So, here is how you can zip an ODF file the right way. Suppose that I have an unzipped ODF file that looks like this:

  + my-document/
      + Configurations2/
      + META-INF/
          - manifest.xml
      + Thumbnails/
          - thumbnail.png
      - content.xml
      - meta.xml
      - mimetype
      - settings.xml
      - styles.xml

From inside the my-document directory, start by creating a new zip file that just contains the mimetype file:

  $ zip -0 -X ../my-document.odt mimetype

The -0 parameter means that the file will not be compressed. The -X parameter means that no extra file attributes will be stored. Next you can add the rest of the files:

  $ zip -r ../my-document.odt * -x mimetype

Be sure to exclude the mimetype file. Now if you look at it with a hex editor, you will see it has been zipped correctly:

  Offset:    Hexadecimal:                                        ASCII:
  00000000 - 50 4b 03 04  14 00 00 08  00 00 c1 b6  66 3b 5e c6  PK..............
  00000010 - 32 0c 27 00  00 00 27 00  00 00 08 00  00 00 6d 69  2.'...'.......mi
  00000020 - 6d 65 74 79  70 65 61 70  70 6c 69 63  61 74 69 6f  metypeapplicatio
  00000030 - 6e 2f 76 6e  64 2e 6f 61  73 69 73 2e  6f 70 65 6e  n/vnd.oasis.open
  00000040 - 64 6f 63 75  6d 65 6e 74  2e 74 65 78  74 50 4b 03  document.textPK.
  ...
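
The same two steps can also be done without the zip command line tool. Below is a minimal sketch using Python's standard zipfile module. The name pack_odf and its arguments are made up for this example, and as a simplification empty directories such as Configurations2 are skipped.

```python
import os
import zipfile


def pack_odf(source_dir: str, target: str) -> None:
    """Zip an unpacked ODF directory with the mimetype entry first."""
    with open(os.path.join(source_dir, "mimetype"), "rb") as fh:
        mimetype = fh.read()

    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as odf:
        # Step 1: the mimetype entry goes first, stored without
        # compression. A plain ZipInfo carries no extra field.
        info = zipfile.ZipInfo("mimetype")
        info.compress_type = zipfile.ZIP_STORED
        odf.writestr(info, mimetype)

        # Step 2: add everything else, excluding the mimetype file.
        for root, dirs, files in os.walk(source_dir):
            dirs.sort()
            for name in sorted(files):
                full = os.path.join(root, name)
                arcname = os.path.relpath(full, source_dir).replace(os.sep, "/")
                if arcname == "mimetype":
                    continue
                odf.write(full, arcname)
```

With the directory layout shown earlier, pack_odf("my-document", "my-document.odt") should produce a file whose first bytes match the hex dump above, apart from the timestamp fields.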

Happy zipping everyone!

Creative Commons Attribution-ShareAlike

Comments

#1 Anonymous Coward

I believe you meant my-document.odt in the mimetype step.

#2 Mira

Note: There should be "my-document.odt" instead of "my-document.zip" in your command:
zip -0 -X ../my-document.zip mimetype

I tested your procedure in Ubuntu 10.04, but I got this:
----------------
00000000 50 4b 03 04 0a 00 00 00 00 00 9d 55 6f 3c 5e c6 |PK.........Uo<^.|
00000010 32 0c 27 00 00 00 27 00 00 00 08 00 00 00 6d 69 |2.'...'.......mi|
00000020 6d 65 74 79 70 65 61 70 70 6c 69 63 61 74 69 6f |metypeapplicatio|
00000030 6e 2f 76 6e 64 2e 6f 61 73 69 73 2e 6f 70 65 6e |n/vnd.oasis.open|
00000040 64 6f 63 75 6d 65 6e 74 2e 74 65 78 74 50 4b 03 |document.textPK.|
--------------------
$ file --mime my-document.odt
my-document.odt: application/zip; charset=binary

It seems that problem is in the bytes 01 and 07

#3 Sander Marechal (http://www.jejik.com)

@Anonymous: Thanks, I fixed it.

@Mira: The output you posted looks fine. What version of `file` are you using? Older versions of `file` assume that the Zip version used is 2.0 or better because that is what OpenOffice.org uses. But the ODF specification does not say anything about it. On most Linux installations, zip creates version 1.0 files unless otherwise specified (that's what the "0a 00" bytes mean at offset 04).

It looks like your magic library is thinking that because it's zip 1.0, it cannot be an ODF file. There doesn't seem to be any way to force zip to use version 2.0.

You can always try to upload your document to Officeshots.org. It is always running the latest, patched magic library. If it uploads, it is recognised.

#4 Bart Hanssens (http://www.fedict.be)

Actually, the -0 is not necessary. Zip will just store the mimetype, with or without -0, since the file is too small to deflate.

#5 jcp

Open Office asks me if I want to repair the file - then it opens. If I only extract content.xml modify it and put it back with: zip -f ../my-document.odt ./content.xml it works. Any suggestion?
(using debian testing, Kernel 3.0.0-1-686-pae, LibreOffice 3.4.3 OOO340m1 (Build:302))

#6 Sander Marechal (http://www.jejik.com)

Not really. I haven't tried this with LibreOffice yet. Do other ODF applications complain about the documents you create?

#7 Yves

Works fine for me at the command line but can't find a way to do the same thing with the Java API. Any ideas how to do that?

#8 Sander Marechal (http://www.jejik.com)

No, sorry. I'm a PHP developer, not a Java developer.

#9 RPW

Maximum respect!
This repacked odf "I/O Error" was driving me crazy!
You saved my day, all praise go to you! And thanks for sharing.
BTW, to help other users: according to a quick test, you can apparently still build a valid file...
* Omitting 'Configurations2' and 'Thumbnails' altogether
* Omitting statistics declarations in the manifest
Cheers,
RPW