HowTo: Translating PDF files with Iceni Infix and OmegaT

This HowTo provides information on using OmegaT and Iceni Infix to translate PDF files.

Background

PDF files fall into two categories: true and "scanned" PDFs.

A "scanned" PDF file is a file in which the PDF format merely serves as a convenient container for scans of hardcopy pages. Often, these scans contain text for translation. There is no way of translating a scanned PDF file other than to recreate the text, either by typing it out again or by OCR (optical character recognition), and to recreate the file layout from scratch. Scanned PDFs are not the subject of this HowTo.

True PDF files (sometimes called "native" or "distilled" PDF files, though "distilled" has a more precise meaning) are PDF files that have been exported from another application, usually a desktop publishing (DTP) program. For a true PDF file to be translated, the correct procedure is usually to produce the translation in the original (DTP) application, and then to follow the same procedure for production of the PDF that was followed for the original. Translating the PDF file by editing it directly is not usually a practical proposition. For translation of PDF files "for information", translators often resort to converting the PDF file to some other file format before translation, such as RTF; the results may be adequate for this purpose, but will not meet the professional standards of the original DTP process.

Iceni Infix

Iceni Infix offers a further option. Infix is a PDF editor, i.e. the text in the PDF can be edited directly. Although it is debatable whether the results of this procedure are comparable to having a DTP professional lay out the translation afresh, they are likely to be much better than those of conversion to a completely different format (such as RTF).

The "Professional" version of Infix has a further function of interest to translators: XML text export. This enables the text to be exported to an XML file which can be translated in a CAT tool. The translated text can then be re-imported into Infix Professional. OmegaT is among the CAT tools for which this process can be used. The procedure is described in this HowTo.

Platforms

Although Infix is a Windows application, Iceni has made efforts to cater for Linux and Macintosh users. Infix Professional can be used on these platforms in conjunction with Crossover Linux and Crossover Macintosh respectively. Each costs approximately €40, and free demonstration versions are available. Both can be obtained from the CodeWeavers website. Specific information on running Iceni Infix on Crossover Linux or Crossover Macintosh is also available.

Translating a PDF file: procedure

Obtain and install Iceni Infix Professional from the Iceni website. A demonstration version is available; at the time of writing, the full version costs around US$150. If you are using Linux or Macintosh, obtain and install the relevant Crossover version before installing Iceni Infix Professional. (Infix is also reported to work on Wine.)

Launch Iceni Infix and open the PDF file you wish to translate. The example in the screenshot is the European Commission's SME user guide in Hungarian.

infix1.png

Export the text from the PDF in Infix's XML format with Document > Translate > Export XML. Save the PDF. This is important: when you export the XML file from the PDF, Infix makes a note in the file of where all the pieces of text ("stories") belong, so you must use this version of the file when re-importing the translated XML file.

Create an OmegaT project in the usual way.

If you are using OmegaT version 2.3 or later, simply place the XML file exported from Infix as described above in the /source folder of your OmegaT project.

If you are using an earlier version of OmegaT, upgrading your OmegaT installation is recommended. Alternatively, you should be able to translate the Infix XML file with good results using earlier versions of OmegaT by means of the HTML filter. To use this filter, simply change the file extension of the XML file exported by Infix from .xml to .html.
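The copy-and-rename step can also be scripted. The following is a minimal sketch, not part of Infix or OmegaT: the function name stage_infix_export and its parameters are made up for illustration, and it assumes the project's source folder is the default "source" folder inside the project directory.

```python
import shutil
from pathlib import Path


def stage_infix_export(xml_file, project_dir, use_html_filter=False):
    """Copy an Infix XML export into the OmegaT project's source folder.

    Hypothetical helper: with use_html_filter=True the copy is given an
    .html extension so that older OmegaT versions apply the HTML filter.
    The original export is left untouched.
    """
    xml_file = Path(xml_file)
    # Assumption: the project uses the default folder name "source".
    source_dir = Path(project_dir) / "source"
    source_dir.mkdir(parents=True, exist_ok=True)

    if use_html_filter:
        name = xml_file.with_suffix(".html").name
    else:
        name = xml_file.name

    destination = source_dir / name
    shutil.copy2(xml_file, destination)
    return destination
```

For example, stage_infix_export("guide.xml", "my_project", use_html_filter=True) would create my_project/source/guide.html. Copying rather than renaming keeps the original export available under its .xml name.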

Reload your OmegaT project. You can now translate the text (see screenshot).

infix2.png

Note: OmegaT's Infix filter maps the <BR/> of Infix to <brx/> tags. This enables the HTML segmentation rule to be used to choose whether or not segmentation should occur at these points.

After completing your translation, create the translated document in the usual way (Ctrl+S, Ctrl+D). Locate the translated XML file in the OmegaT project's /target folder. If you changed the file extension to .html, change the extension of the translated file back to .xml.
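If you translated via the HTML filter, restoring the extension can be scripted as well. This is a rough sketch under two assumptions: the project uses the default "target" folder, and every .html file in that folder is a renamed Infix export. The function name restore_xml_extension is hypothetical.

```python
from pathlib import Path


def restore_xml_extension(project_dir):
    """Rename every .html file in the project's target folder to .xml.

    Hypothetical helper. Assumes the folder holds only renamed Infix
    exports; any genuine HTML files would be renamed too.
    """
    target_dir = Path(project_dir) / "target"
    renamed = []
    for html_file in sorted(target_dir.glob("*.html")):
        xml_file = html_file.with_suffix(".xml")
        # Refuse to overwrite an existing file of the same name.
        if xml_file.exists():
            raise FileExistsError(f"{xml_file} already exists")
        html_file.rename(xml_file)
        renamed.append(xml_file)
    return renamed
```

The function returns the list of renamed files, which are then ready for import into Infix.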

Back in Iceni Infix, import the translated XML file back into the PDF from which you exported it. Save the changes.

All being well, your translation will come out looking like the original, but translated. (See screenshot: only the first three segments have been translated.)

Note that if you use the demonstration version of Iceni Infix in this procedure, your translated PDF file will bear an Iceni watermark. For translations intended "for information", this may not be a problem. Iceni Infix also enables you to save PDF files in RTF format, but in this case the demonstration version really is only suitable for demonstration purposes, as it contains random character substitutions.

infix3.png

As is so often the case with things technical (and things translation), there are catches.

You may find that the embedded fonts in the PDF do not contain all the characters you need. You can presumably resolve this by obtaining and installing the necessary fonts; or you can select a different font for this purpose – which may or may not be an adequate solution.

There is a strong possibility that at some points, your translation will be longer than the original. This needs to be dealt with in Infix, for example by enlarging the box containing the text. Infix has functions for dealing with this and other problems that are beyond the scope of this HowTo.

You may find that segments are broken by hard line breaks at inconvenient places. This situation will probably be familiar to you if you have translated PowerPoint files in OmegaT, or for that matter in other CAT tools. To resolve this, open the original PDF file again in Infix. Select Tools > Text tool. Clicking on the text in question will display a text box and formatting marks. The screenshot shows an example:

infix4.png

Remove the line break. Then save the changes, re-export the text from the PDF to the XML file (changing the file extension to .html again if you are using the HTML filter), and reload your OmegaT project. If your file contains many such inconvenient breaks, it is more efficient to remove them all in one pass, toggling between OmegaT and Infix to see where they are.

Some inconvenient line breaks may be required for correct positioning of the text. In these cases, it is practical to remove them before exporting the file to XML, so that you are presented with cohesive segments for translation, and then to re-insert them in Infix at the end of the process.
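To get an overview of where hard line breaks occur before going through them in Infix, the exported XML file can be searched for <BR/> tags. The sketch below treats the export as plain text and makes no assumptions about the rest of the Infix XML structure; the function names are hypothetical, and UTF-8 encoding of the export is an assumption.

```python
import re
from pathlib import Path

# Matches <BR/>, <BR />, <br/> and similar spellings of the tag.
BREAK_TAG = re.compile(r"<BR\s*/>", re.IGNORECASE)


def find_line_breaks(xml_text, context=30):
    """Return a (line_number, snippet) pair for each <BR/> tag found."""
    hits = []
    for line_number, line in enumerate(xml_text.splitlines(), start=1):
        for match in BREAK_TAG.finditer(line):
            # Keep a little text on either side to identify the passage.
            start = max(match.start() - context, 0)
            end = min(match.end() + context, len(line))
            hits.append((line_number, line[start:end]))
    return hits


def report_line_breaks(xml_file):
    """Print each break found in the given export file."""
    # Assumption: the export is UTF-8 encoded.
    text = Path(xml_file).read_text(encoding="utf-8")
    for line_number, snippet in find_line_breaks(text):
        print(f"{line_number}: {snippet}")
```

Calling report_line_breaks on the exported file prints one line per break with some surrounding text, which can help in locating the corresponding passages in Infix. Whether a given break is inconvenient or needed for positioning still has to be judged in Infix itself.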

Copyright Marc Prior 2011