OmegaT HowTo: Using the OmegaT tokenizer

The tokenizer plug-in was integrated into OmegaT in version 3.0.0. The following information is therefore applicable only if you are using a version of OmegaT earlier than 3.0.0.
If you have been using the tokenizer plug-in with an earlier version of OmegaT and have upgraded to version 3.0.0 or later of OmegaT, delete the tokenizer plug-in files from your plug-in folder.

The OmegaT tokenizer is a plug-in for OmegaT. It provides better fuzzy and glossary matches by computing the roots ("stemming") of the source words. For example, it recognizes inflected words in the text and displays the corresponding glossary entry, even if the glossary entry contains only the uninflected form of the word.

Preparation

Before using the tokenizer with OmegaT, you must first ensure that your version of OmegaT is suitable and prepared for use with it.

Webstart version of OmegaT: the tokenizer is not compatible with the Webstart version of OmegaT. If you wish to use the tokenizer, install the standard version of OmegaT (latest beta version) for your system.

OmegaT version 2.1.0 and older: the current tokenizer is not compatible with these versions. (The tokenizer can be used with versions 2.0.x and 2.1.0, but this requires both a different version of the tokenizer, and a different installation procedure.) Users are advised to upgrade to the latest beta version of OmegaT.

Windows versions of OmegaT: to be used with the tokenizer, OmegaT (any version) must be launched from a launch script file. A launch script file is not supplied with the Windows OmegaT versions. If you are using the Windows version with JRE, download the file OmegaT_with_JRE.bat; if you are using the Windows version without JRE, download the file OmegaT_without_JRE.bat. After downloading, place the file in the main OmegaT folder (the folder containing the file OmegaT.jar).

Platform-neutral version (on Windows): locate your OmegaT launch script file (OmegaT.bat).

Linux versions/systems: locate your OmegaT launch script file (OmegaT or OmegaT.sh).

Check that OmegaT is launched when you execute the launch script file:
- On Linux, on the command line
- On Windows, by clicking on the launch script file

Installing the tokenizer

After preparing for installation (see above), install the tokenizer as follows:

1. Download the tokenizer zip package (for OmegaT version 2.1.1 and later).

2. Unzip the files from the tokenizer zip package.

3. In the main OmegaT program folder (i.e. the folder containing the file OmegaT.jar), create a sub-folder called "plugins", if a sub-folder with this name does not already exist. Copy the files that you have unzipped from the tokenizer package directly into this sub-folder.
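
On Linux, step 3 can also be carried out from the command line. The following is a minimal sketch and not part of the tokenizer package: the function name install_tokenizer and the example paths are assumptions chosen for illustration, so adjust them to your system.

```shell
#!/bin/sh
# Sketch: copy the unzipped tokenizer files into the OmegaT "plugins" sub-folder.
# install_tokenizer is a hypothetical helper, not an OmegaT command.

install_tokenizer() {
    unzipped_dir=$1   # folder holding the files unzipped in step 2
    omegat_dir=$2     # main OmegaT folder (the one containing OmegaT.jar)

    # Create "plugins" only if it does not already exist.
    mkdir -p "$omegat_dir/plugins" || return 1

    # Copy the contents of the unzipped folder directly into "plugins".
    cp -R "$unzipped_dir"/. "$omegat_dir/plugins/"
}

# Example call (hypothetical paths):
# install_tokenizer "$HOME/Downloads/tokenizer" "$HOME/OmegaT"
```

The trailing "/." makes cp copy what is inside the unzipped folder rather than the folder itself, so the files end up directly in "plugins".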

4. Open your launch script file in a text editor. Windows users (in particular): do not simply click on this file. Instead, launch a text editor (such as Notepad or Wordpad), then open the launch script file with File > Open. You may also be able to right-click with the mouse on the file, then select a text editor in which to open it.

5. The launch script file contains the OmegaT launch command. The basic form of this command is:

java -jar OmegaT.jar

Depending upon your system configuration, the launch command may be slightly different.

6. Choose a tokenizer from the following list, according to your source language:

org.omegat.plugins.tokenizer.LuceneArabicTokenizer
org.omegat.plugins.tokenizer.LuceneBrazilianTokenizer
org.omegat.plugins.tokenizer.LuceneChineseTokenizer
org.omegat.plugins.tokenizer.LuceneCJKTokenizer
org.omegat.plugins.tokenizer.LuceneCzechTokenizer
org.omegat.plugins.tokenizer.LuceneDutchTokenizer
org.omegat.plugins.tokenizer.LuceneFrenchTokenizer
org.omegat.plugins.tokenizer.LuceneGermanTokenizer
org.omegat.plugins.tokenizer.LuceneGreekTokenizer
org.omegat.plugins.tokenizer.LucenePersianTokenizer
org.omegat.plugins.tokenizer.LuceneSmartChineseTokenizer
org.omegat.plugins.tokenizer.LuceneRussianTokenizer
org.omegat.plugins.tokenizer.LuceneThaiTokenizer
org.omegat.plugins.tokenizer.SnowballDanishTokenizer
org.omegat.plugins.tokenizer.SnowballDutchTokenizer
org.omegat.plugins.tokenizer.SnowballEnglishTokenizer
org.omegat.plugins.tokenizer.SnowballFinnishTokenizer
org.omegat.plugins.tokenizer.SnowballFrenchTokenizer
org.omegat.plugins.tokenizer.SnowballGerman2Tokenizer
org.omegat.plugins.tokenizer.SnowballGermanTokenizer
org.omegat.plugins.tokenizer.SnowballHungarianTokenizer
org.omegat.plugins.tokenizer.SnowballItalianTokenizer
org.omegat.plugins.tokenizer.SnowballNorwegianTokenizer
org.omegat.plugins.tokenizer.SnowballPorterTokenizer
org.omegat.plugins.tokenizer.SnowballPortugueseTokenizer
org.omegat.plugins.tokenizer.SnowballRomanianTokenizer
org.omegat.plugins.tokenizer.SnowballRussianTokenizer
org.omegat.plugins.tokenizer.SnowballSpanishTokenizer
org.omegat.plugins.tokenizer.SnowballSwedishTokenizer
org.omegat.plugins.tokenizer.SnowballTurkishTokenizer

At the end of the launch command in your OmegaT launch script file, add a space followed by the argument --ITokenizer= and the full name of the tokenizer you have chosen. Copy the entire line from the list above, and do not leave a space after the equals sign.

For example, to use the English tokenizer (when translating from English), your launch command might now be:

java -jar OmegaT.jar %* --ITokenizer=org.omegat.plugins.tokenizer.SnowballEnglishTokenizer

Or if you are translating from Turkish, it might now be:

java -jar OmegaT.jar %* --ITokenizer=org.omegat.plugins.tokenizer.SnowballTurkishTokenizer

Important: this whole command must appear on one line (even if it appears to be on two lines in the display in which you are reading this).
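
The examples above use %*, which is Windows batch-file syntax for "all arguments passed to the script". In a Linux launch script, the equivalent is "$@". The following is a minimal sketch, assuming a POSIX shell; the names TOKENIZER, launch_omegat and DRY_RUN are illustrative and do not come from OmegaT, and your existing launch script may contain further options that should be kept.

```shell
#!/bin/sh
# Sketch of a Linux launch script with a source-language tokenizer.
# TOKENIZER, launch_omegat and DRY_RUN are names invented for this example.

TOKENIZER=org.omegat.plugins.tokenizer.SnowballEnglishTokenizer

launch_omegat() {
    # "$@" forwards the script's own arguments (the role %* plays in a .bat file).
    # If DRY_RUN is set to a non-empty value, the command is printed, not run.
    ${DRY_RUN:+echo} java -jar OmegaT.jar "$@" "--ITokenizer=$TOKENIZER"
}

# Uncomment the next line to start OmegaT when the script is executed:
# launch_omegat "$@"
```

Setting DRY_RUN=1 before calling launch_omegat is a convenient way to check the assembled command line before starting OmegaT for real.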

7. Execute this file. OmegaT should now launch with the tokenizer function. To test it, check whether glossary entries are displayed even when the current OmegaT segment contains a term in an inflected form that differs from the form in the glossary.

8. If you wish to use different tokenizers because you translate from more than one language, create a separate OmegaT launch script file for each tokenizer you wish to use. Name the launch script files appropriately, for example "OmegaT-EN.bat" for the launch script file containing the command with the English tokenizer and "OmegaT-TR.bat" for the launch script file containing the command with the Turkish tokenizer.
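
On Linux, the separate launch script files described in step 8 can be generated rather than written by hand. The following is a minimal sketch: write_launcher is a hypothetical helper, and the ".sh" file names simply mirror the ".bat" names suggested above. It writes only the basic launch command, so add any other options your own launch script needs.

```shell
#!/bin/sh
# Sketch: create one launch script per source-language tokenizer.
# write_launcher is a name invented for this example.

write_launcher() {
    target_dir=$1   # folder in which to create the script (the OmegaT folder)
    code=$2         # short language label used in the file name, e.g. EN
    tokenizer=$3    # full tokenizer name from the list in step 6

    script="$target_dir/OmegaT-$code.sh"

    # The single-quoted format keeps "$@" literal; %s receives the tokenizer name.
    printf '#!/bin/sh\njava -jar OmegaT.jar "$@" --ITokenizer=%s\n' \
        "$tokenizer" > "$script" || return 1

    # Make the new launch script executable.
    chmod +x "$script"
}

# Example calls (run from the main OmegaT folder):
# write_launcher . EN org.omegat.plugins.tokenizer.SnowballEnglishTokenizer
# write_launcher . TR org.omegat.plugins.tokenizer.SnowballTurkishTokenizer
```

Each generated file contains the launch command on a single line, as required in step 6.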

9. In some cases, you may find that the source-language tokenizer interferes with the target-language spelling checker. You can eliminate this problem by specifying a tokenizer for the target language as well (where available), with the argument --ITokenizerTarget=.

For instance, if you are translating from Chinese to Dutch, try:

java -jar OmegaT.jar %* --ITokenizer=org.omegat.plugins.tokenizer.LuceneChineseTokenizer --ITokenizerTarget=org.omegat.plugins.tokenizer.LuceneDutchTokenizer

10. After creating a launch script as described above, you can configure your system so that OmegaT is launched more conveniently, for example by creating a shortcut. To create a shortcut in Windows:

Right-click on the launch script (OmegaT.bat) and, keeping the right mouse button held down, drag the script to a convenient location, such as your desktop. When you release the mouse button, a menu opens with a number of options. Choose "Create Shortcuts Here".

Alternatively, right-click on the launch script. Select: "Send to", then select "Desktop (create shortcut)".

After creating and testing the shortcut, you can add it to the Start menu by dragging it there.

Copyright Marc Prior 2010-2011