Translating Word 2007 (Office Open XML, .docx) files in OmegaT
This HowTo provides tips on translating MS Office 2007 (and later) files in OmegaT.
With the advent of Microsoft Office 2007, Microsoft Word, Excel and PowerPoint have new default file formats. These formats are formally known as "Office Open XML", and have the extensions .docx, .xlsx and .pptx respectively. The formats are also used in MS Office 2010. For the sake of convenience, they will be referred to below as "MS Office 2007 file formats".
From Version 1.7.1 onwards, OmegaT has been able to handle Microsoft Office 2007 files directly, without conversion. With the appearance of OmegaT Version 2.1.8, the handling of this file format in OmegaT also became much more user-friendly.
The advantages of using the .docx format in conjunction with OmegaT
As customers/authors upgrade Microsoft Office/Word to more recent versions, translators can expect to receive files increasingly in the new format. Unlike the legacy .doc, .xls and .ppt formats, these files can be handled directly in OmegaT, with no loss of formatting as a result of conversion to and from other formats.
The new format can also serve as a useful way of handling the legacy .doc, .xls and .ppt formats, since they can be converted to their MS Office 2007 counterparts, translated in OmegaT, and converted back again. This procedure is therefore an alternative for translators who would prefer not to use OpenOffice.org for this purpose.
Converting to MS Office 2007 format
You can convert MS Office 97/2000/XP/2003 files to MS Office 2007 format by opening them in MS Office 2007 and saving them with "Save As" in the new format. (Since this is now the standard format, it is simply described as Word, Excel or PowerPoint in MS Office 2007.)
For users who do not have MS Office 2007 or 2010 and do not wish to buy it, Microsoft provides a compatibility plug-in for earlier versions.
Linux users: both MS Office 2007 and the Microsoft compatibility plug-in run on CrossOver Linux.
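For Word users with many legacy files, the "Save As" step described above can also be performed by a macro. The following is a minimal sketch rather than a tested tool: the macro name SaveActiveDocumentAsDocx is hypothetical, and it assumes Word 2007 or later (earlier versions do not know the wdFormatXMLDocument constant) and a document that has already been saved to disk.

```vba
Sub SaveActiveDocumentAsDocx()
    ' Hypothetical helper: saves the open document as .docx next to the original.
    Dim baseName As String
    Dim dotPos As Long

    ' A document that has never been saved has no folder to save into.
    If ActiveDocument.Path = "" Then
        MsgBox "Please save the document once before converting it."
        Exit Sub
    End If

    ' Strip the old extension (e.g. ".doc") from the full path, if there is one.
    baseName = ActiveDocument.FullName
    dotPos = InStrRev(baseName, ".")
    If dotPos > InStrRev(baseName, Application.PathSeparator) Then
        baseName = Left(baseName, dotPos - 1)
    End If

    ' wdFormatXMLDocument is the Office Open XML (.docx) format.
    ActiveDocument.SaveAs FileName:=baseName & ".docx", _
        FileFormat:=wdFormatXMLDocument
End Sub
```

The original file is left untouched, so the legacy version remains available if the translated file has to be converted back.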
Points to note when using the .docx, .xlsx and .pptx formats in OmegaT
In OmegaT versions prior to 2.1.8, each formatting change in an MS Office 2007 file resulted in a long series of tags, which often made handling this file format impractical. As of version 2.1.8, these tags are aggregated (condensed) into a single tag by default. Users wishing to translate MS Office 2007 files should therefore upgrade to OmegaT version 2.1.8 or later.
(The greater ease of tag handling with .docx files comes at the cost of a slight loss in control over formatting. For example, without the "aggregate tags" function, where a word in the source text is in bold and italics, the translator could choose to render it in bold only, or in italics only. With the "aggregate tags" function enabled, this is not possible. Most users will probably find that the greater ease of use outweighs this drawback.)
The .docx format is also prone to the introduction of "nuisance" formatting code which results in unwanted and unnecessary tags appearing in OmegaT's editor pane. Since these are inconvenient during translation, it is worthwhile trying to remove these nuisance tags before beginning translation in OmegaT. OmegaT shares this problem with other CAT tools that handle the .docx format, and the solutions are similar or the same.
How to remove unwanted formatting code from .docx files
Word 2007 marks any words not in its spelling dictionary with tags, so even a couple of words which it considers to be mis-spelt will result in a plethora of tags. If every single word in OmegaT is enclosed within tags, it is possible that Word has attempted to spell-check the text without access to a suitable spelling dictionary, and has therefore marked each word as mis-spelt. One solution is to switch off automatic spelling checking in Word.
Automatic hyphenation also adds unnecessary tags, so disable it. It may be necessary to do this for each document, in which case the procedure in Word 2007 is: Click the Page Layout tab; in the Page Setup group, click Hyphenation; then click None.
Use Search and Replace to remove double spaces (replacing them with a single space).
If your text contains both Asian and "normal" (e.g. Latin) characters, use Ctrl+A to select the entire text, then select a single font for it. Be careful to choose a font that contains both the necessary Asian and "normal" characters, such as MS Mincho, but without the "[text body]" style suffix.
Word 2007 inserts a tag at page breaks, even automatic page breaks. This tag apparently cannot be removed.
Note that changing Word's settings so that it no longer inserts "nuisance" codes (for auto-hyphenation, for example) does not remove codes that are already present in the file. Removal of these codes is described below. Make the configuration changes in Word first, however, or Word may simply re-insert the nuisance codes when the file is next opened.
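The configuration changes listed above (automatic spell-checking, automatic hyphenation, double spaces) can also be applied by a macro. The sketch below is an illustration only: the macro name PrepareForOmegaT is hypothetical, and the limit of ten replacement passes is an arbitrary safety margin. As noted above, it changes settings but does not by itself remove codes already stored in the file.

```vba
Sub PrepareForOmegaT()
    ' Hypothetical macro name; an illustrative sketch, not a tested tool.
    Dim passNo As Integer

    ' Switch off automatic spell-checking while typing (a Word-wide option).
    Options.CheckSpellingAsYouType = False

    ' Switch off automatic hyphenation for the current document.
    ActiveDocument.AutoHyphenation = False

    ' Replace double spaces with single spaces. Each pass roughly halves
    ' a run of spaces, so the replacement is repeated a few times.
    With ActiveDocument.Content.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        .Text = "  "
        .Replacement.Text = " "
        .Forward = True
        .Wrap = wdFindContinue
        .MatchWildcards = False
        For passNo = 1 To 10
            ' Execute returns False once no double space is left.
            If Not .Execute(Replace:=wdReplaceAll) Then Exit For
        Next passNo
    End With
End Sub
```

Hyphenation is a per-document setting, so the macro would have to be run once for each file to be translated.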
As already mentioned, this problem with MS Word is not unique to OmegaT. Third-party tools have been produced to deal with the problem. Two of these are:
CodeZapper (a macro by Dave Turner)
Document Cleaner (free)
Levelling character formatting
An alternative to running the CodeZapper macro on your text is to "level" the formatting. "Levelling" the formatting means applying the format of the first character in a selection of text (such as a whole paragraph) to all subsequent characters. Note that this is not the same as "deleting" the formatting, which would cause the character formatting to revert to the document's default.
You can level the formatting of paragraphs essentially by copying whole paragraphs in MS Word and pasting them back over themselves such that the entire paragraph assumes the formatting of the first character.
Levelling the character formatting of a paragraph manually
Step 1: Make paragraph marks visible. Select the paragraph by clicking in it repeatedly until the entire paragraph is highlighted. Then move the end of the selection one character to the left (Shift+Cursor Left) so that it does not include the paragraph mark.
Step 2: Copy the highlighted text (Ctrl+C). Then select "Paste Special" (Word 2003) or click on the arrow at the bottom of the "Paste" button (Word 2007) to obtain the extended paste options. Finally, paste the content using the "Unformatted Unicode text" option.
Where paragraphs do contain inline formatting (bold, italics, hyperlinks, etc.), you can either:
- level the formatting of these paragraphs using the procedure described above, then re-insert the formatting; or
- mark only the text up to the beginning of the formatted part, copy this text and paste it back over itself, then repeat the process on the text after the formatted part.
Creating a macro to level the character formatting of a paragraph
You can partly automate the above procedure by creating a macro. Brief instructions for creating the macro (in MS Office 2007) are provided below. (For more detailed instructions, refer to your instruction manual.)
If you have not already done so, add the Developer tab to the ribbon as follows: click on the Office button, then select Word Options. Click Popular in the Word Options dialog box, check "Show Developer tab in the Ribbon" and confirm with OK. Close Word.
Launch Word again, create a new Word file and add a paragraph of sample text, at least three lines long, e.g.:
This is a paragraph. This is a paragraph. This is a paragraph. This is a paragraph. This is a paragraph. This is a paragraph. This is a paragraph. This is a paragraph. This is a paragraph. This is a paragraph. This is a paragraph. This is a paragraph. This is a paragraph. This is a paragraph. This is a paragraph. This is a paragraph. This is a paragraph. This is a paragraph. This is a paragraph. This is a paragraph. This is a paragraph. This is a paragraph.
Click on the paragraph mark button to make paragraph marks visible.
Place the cursor in your paragraph of text (not in the first or last lines).
Create the macro:
On the Developer tab, click on Record Macro.
In the Macro name box, type a name for the macro, such as "levelformat". To make the macro available in all documents, select Normal.dotm in the "Store macro in" box. If you have made changes to your Normal.dotm, you might want to back this up first.
Click on the Keyboard button. Click in the box for a new keyboard combination, then experiment with keyboard combinations to find one that is not assigned. If a combination you try is already assigned, a message to this effect is displayed in the dialogue, and you can delete the combination and try another. Ctrl+Shift+9 appears not to be assigned to any other function in the basic Word installation, but you can use any other combination that has not already been assigned.
Click on Assign to assign the shortcut to the macro that you are about to record. Anything you now do will be recorded in the macro, so follow this procedure carefully:
Ctrl+Cursor Up (this takes the cursor to the beginning of the paragraph)
Ctrl+Shift+Cursor Down (this selects the paragraph, including the paragraph mark)
Shift+Cursor Left (this moves the end of the selection one character to the left so that it no longer includes the paragraph mark)
Ctrl+C (this copies the paragraph)
Ctrl+V (this pastes the text of the paragraph back over the paragraph)
After entering these commands, click Stop Recording.
Click on Macros and select levelformat (or whatever name you gave to your macro), then Edit. The code of your macro(s) will be displayed. For levelformat, this should be:
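Sub levelformat()
' levelformat macro
Selection.MoveUp Unit:=wdParagraph, Count:=1
Selection.MoveDown Unit:=wdParagraph, Count:=1, Extend:=wdExtend
Selection.MoveLeft Unit:=wdCharacter, Count:=1, Extend:=wdExtend
Selection.Copy
Selection.PasteAndFormat (wdPasteDefault)
End Sub
Change the line that pastes the text (the line beginning Selection.PasteAndFormat in the listing above; the exact wording recorded may differ between Word versions) so that the text is pasted without its formatting, for example:
Selection.PasteAndFormat (wdFormatPlainText)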
Save with Ctrl+S, and close the macro editing window.
If all has gone according to plan, your macro should now work. To test it, add some formatting to your Word document, for example by making one word bold. Now simply place the cursor anywhere in the paragraph and press Ctrl+Shift+9 (or whatever other keyboard shortcut you selected), and you should see the formatting disappear.
You can change the keyboard shortcut at any time. To do this, click the main "Office" button, then Word Options (bottom right of the dialog).
Click Customize > Choose commands. Choose Macros from the drop-down list. Select "levelformat" (or whatever you called it).
At the bottom of the dialog next to Keyboard shortcuts, click on Customize.
In the Categories box, scroll down to Macros and select it. Then on the right, under Macros, select levelformat (or whatever you called your macro). The current shortcut will be displayed in the relevant box. You can clear it by selecting it and hitting Delete. You can then enter a new shortcut and assign it with Assign as you did before. Close/confirm the dialogs.
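The shortcut can also be assigned from the macro editor rather than through the dialogs. The following is a minimal sketch, assuming that the macro is called levelformat and is stored in Normal.dotm as described above; the name AssignLevelFormatShortcut is hypothetical, and Ctrl+Shift+9 is simply the combination suggested earlier.

```vba
Sub AssignLevelFormatShortcut()
    ' Hypothetical helper: binds Ctrl+Shift+9 to the levelformat macro.
    ' Store the key binding in the Normal template so that it is
    ' available in all documents.
    CustomizationContext = NormalTemplate

    ' If Word cannot find the macro under its short name, the fully
    ' qualified name (project.module.macro) may be needed instead.
    KeyBindings.Add KeyCode:=BuildKeyCode(wdKeyControl, wdKeyShift, wdKey9), _
        KeyCategory:=wdKeyCategoryMacro, _
        Command:="levelformat"
End Sub
```

Running it once is sufficient. To use a different combination, change the key constants passed to BuildKeyCode.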
You can use a similar procedure to create a button for your macro:
Again, click the main "Office" button, then Word Options.
Click Customize > Choose commands. Choose Macros from the drop-down list. Select your macro.
Click on Add. You should see the macro command appear in the right-hand column. Click on Modify.
Pick a symbol, then OK > OK. The symbol for your macro should appear in the toolbar.
This macro is useful for levelling the formatting of a whole paragraph that contains no visible formatting. Where a paragraph contains desired formatting, you must restore the deleted formatting after levelling the paragraph. In paragraphs containing lots of formatting, restoring the deleted formatting can be quite a lot of work. For such paragraphs, consider the following alternatives:
Use Dave Turner's CodeZapper macro.
Create a second macro (with its own button and/or keyboard shortcut) of your own that only levels highlighted text, rather than the entire paragraph. (Record as described above, but only recording the command Ctrl+C followed by Ctrl+V.) You can then highlight text between the points at which formatting changes. This will level the formatting over sections of text that have the same, desired formatting, and you will not then need to restore deleted formatting.
Use the level formatting macro only on paragraphs with little or no formatting. Paragraphs that you have not "levelled" will then contain nuisance tags, but at least the others will be much "cleaner".
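The second alternative above, a macro that levels only the highlighted text, might look roughly as follows when viewed in the macro editor. This is a sketch under assumptions rather than the code Word will necessarily record: the name levelselection is hypothetical, the check for an empty selection is an addition, and the text is assumed to be pasted back as plain text, as in the manual procedure.

```vba
Sub levelselection()
    ' Hypothetical macro name; levels only the currently highlighted text.
    ' Do nothing unless some text is actually highlighted.
    If Selection.Type <> wdSelectionNormal Then
        MsgBox "Highlight the text to be levelled first."
        Exit Sub
    End If

    ' Copy the highlighted text and paste it back over itself
    ' as plain, unformatted text.
    Selection.Copy
    Selection.PasteAndFormat (wdFormatPlainText)
End Sub
```

Because the macro acts only on the selection, you decide where each levelled stretch begins and ends, so formatting outside the selection is left as it is.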
More advanced users may wish to try OmegaT's Text Export function. In conjunction with the te-st-notags script, this provides a separate window in which tags are not displayed. You must still handle tags properly when producing and editing your translation, but you can view it in a separate window without the clutter of tags ("tag soup").
Copyright Marc Prior 2009-2011