HOWTO: Using the OmegaT tokenizer

About the OmegaT tokenizer

The OmegaT tokenizer is a plugin for OmegaT. It provides better fuzzy and glossary matches in OmegaT, by computing the roots ("stemming") of the source words. For example, it will recognize inflected words in texts and display the corresponding glossary entry, even if the glossary entry contains only the uninflected form of a word.

Preparation

Before using the tokenizer with OmegaT, you must first ensure that your version of OmegaT is suitable and prepared for use with it.

Webstart version of OmegaT: the tokenizer is not compatible with the Webstart version of OmegaT. If you wish to use the tokenizer, install the standard version of OmegaT (latest beta version) for your system.

OmegaT version 2.1.0 and older: the current tokenizer is not compatible with these versions. (The tokenizer can be used with versions 2.0.x and 2.1.0, but this requires both a different version of the tokenizer and a different installation procedure.) Users are advised to upgrade to the latest beta version of OmegaT.

Windows versions of OmegaT: to be used with the tokenizer, OmegaT (any version) must be launched from a launch script file. The Windows versions of OmegaT are not supplied with one. If you are using the Windows version with JRE, download the file OmegaT_with_JRE.bat; if you are using the Windows version without JRE, download the file OmegaT_without_JRE.bat. After downloading, place the file in the main OmegaT folder (the folder containing the file OmegaT.jar).

Platform-neutral version (on Windows): locate your OmegaT launch script file (OmegaT.bat).

Linux versions/systems: locate your OmegaT launch script file (OmegaT or OmegaT.sh).

Check that OmegaT is launched when you execute the launch script file:
- On Linux, on the command line
- On Windows, by clicking on the launch script file
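
On Linux, it can help to confirm first that the folder really contains both OmegaT.jar and one of the launch script names mentioned above. The following is a minimal sketch under that assumption; check_omegat_dir is a hypothetical helper written for this HOWTO, not something supplied with OmegaT.

```shell
#!/bin/sh
# Sketch only: check_omegat_dir is a hypothetical helper, not part of OmegaT.
# Usage: check_omegat_dir <OmegaT folder>
# Prints the path of the launch script if the folder looks prepared.
check_omegat_dir() {
    dir=$1
    if [ ! -f "$dir/OmegaT.jar" ]; then
        echo "OmegaT.jar not found in $dir" >&2
        return 1
    fi
    # These are the Linux launch script names given in this HOWTO.
    for name in OmegaT OmegaT.sh; do
        if [ -f "$dir/$name" ]; then
            echo "$dir/$name"
            return 0
        fi
    done
    echo "no launch script found in $dir" >&2
    return 1
}
```

If the function prints a path, execute that file from the command line and check that OmegaT starts as usual.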

Installing the tokenizer

After preparing for installation (see above), install the tokenizer as follows:

1. Download the tokenizer from sourceforge.net/projects/omegat-plugins.

2. Unzip the files from the tokenizer zip package.

3. In the main OmegaT program folder (i.e. the folder containing the file OmegaT.jar), create a sub-folder called "plugins", if a sub-folder with this name does not already exist. Copy the files that you have unzipped from the tokenizer package directly into this sub-folder.

4. Open your launch script file in a text editor. Windows users (in particular): do not simply click on this file. Instead, launch a text editor (such as Notepad or Wordpad), then open the launch script file with File > Open. You may also be able to right-click with the mouse on the file, then select a text editor in which to open it.

5. The launch script file contains the OmegaT launch command. The basic form of this command is:

java -jar OmegaT.jar

Depending upon your system configuration, the launch command may be slightly different.

6. Choose a tokenizer from the following list, according to your source language:

org.omegat.plugins.tokenizer.LuceneArabicTokenizer
org.omegat.plugins.tokenizer.LuceneBrazilianTokenizer
org.omegat.plugins.tokenizer.LuceneChineseTokenizer
org.omegat.plugins.tokenizer.LuceneCJKTokenizer
org.omegat.plugins.tokenizer.LuceneCzechTokenizer
org.omegat.plugins.tokenizer.LuceneDutchTokenizer
org.omegat.plugins.tokenizer.LuceneFrenchTokenizer
org.omegat.plugins.tokenizer.LuceneGermanTokenizer
org.omegat.plugins.tokenizer.LuceneGreekTokenizer
org.omegat.plugins.tokenizer.LucenePersianTokenizer
org.omegat.plugins.tokenizer.LuceneSmartChineseTokenizer
org.omegat.plugins.tokenizer.LuceneRussianTokenizer
org.omegat.plugins.tokenizer.LuceneThaiTokenizer
org.omegat.plugins.tokenizer.SnowballDanishTokenizer
org.omegat.plugins.tokenizer.SnowballDutchTokenizer
org.omegat.plugins.tokenizer.SnowballEnglishTokenizer
org.omegat.plugins.tokenizer.SnowballFinnishTokenizer
org.omegat.plugins.tokenizer.SnowballFrenchTokenizer
org.omegat.plugins.tokenizer.SnowballGerman2Tokenizer
org.omegat.plugins.tokenizer.SnowballGermanTokenizer
org.omegat.plugins.tokenizer.SnowballHungarianTokenizer
org.omegat.plugins.tokenizer.SnowballItalianTokenizer
org.omegat.plugins.tokenizer.SnowballNorwegianTokenizer
org.omegat.plugins.tokenizer.SnowballPorterTokenizer
org.omegat.plugins.tokenizer.SnowballPortugueseTokenizer
org.omegat.plugins.tokenizer.SnowballRomanianTokenizer
org.omegat.plugins.tokenizer.SnowballRussianTokenizer
org.omegat.plugins.tokenizer.SnowballSpanishTokenizer
org.omegat.plugins.tokenizer.SnowballSwedishTokenizer
org.omegat.plugins.tokenizer.SnowballTurkishTokenizer

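Add the option --ITokenizer= followed immediately by the full name of this tokenizer (copy the entire line from the list) to the end of the launch command in your OmegaT launch script file, leaving a space between the existing command and the new option. In the Windows examples below, %* passes on any arguments given to the launch script file.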

For example, to use the English tokenizer (when translating from English), your launch command might now be:

java -jar OmegaT.jar %* --ITokenizer=org.omegat.plugins.tokenizer.SnowballEnglishTokenizer

Or if you are translating from Turkish, it might now be:

java -jar OmegaT.jar %* --ITokenizer=org.omegat.plugins.tokenizer.SnowballTurkishTokenizer

Important: this whole command must be on a single line in the launch script file (even if it is wrapped onto two lines in the display in which you are reading this).

7. Execute this file, and OmegaT should now launch with the tokenizer function. To test, check whether glossary entries are displayed even when the current OmegaT segment contains a term in a different inflected form from that in the glossary.

8. If you wish to use different tokenizers because you translate from more than one language, create a separate OmegaT launch script file for each tokenizer you wish to use. Name the launch script files appropriately, for example "OmegaT-EN.bat" for the launch script file containing the command with the English tokenizer and "OmegaT-TR.bat" for the launch script file containing the command with the Turkish tokenizer.

9. You can configure your system so that OmegaT is launched more conveniently. For instance, in Windows, drag the OmegaT launch script file to the Start menu to add it.
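
For Linux users, steps 3 and 5 to 8 above can be sketched as a pair of shell functions. This is an illustration under stated assumptions, not a supplied installer: install_plugin_files and write_launcher are hypothetical helper names, the "OmegaT-EN.sh" naming simply mirrors the "OmegaT-EN.bat" example, and the launch command is the basic form shown in step 5, which may differ from the one in your own launch script file.

```shell
#!/bin/sh
# Sketch only: install_plugin_files and write_launcher are hypothetical
# helpers, not part of OmegaT or the tokenizer package.

# Step 3: copy the unzipped tokenizer files into <OmegaT folder>/plugins.
# Usage: install_plugin_files <OmegaT folder> <folder with unzipped files>
install_plugin_files() {
    omegat_dir=$1
    unzipped_dir=$2
    if [ ! -f "$omegat_dir/OmegaT.jar" ]; then
        echo "OmegaT.jar not found in $omegat_dir" >&2
        return 1
    fi
    mkdir -p "$omegat_dir/plugins" || return 1
    # The trailing "/." copies the folder's contents, not the folder itself.
    cp -R "$unzipped_dir"/. "$omegat_dir/plugins"/
}

# Steps 5 to 8: write one launch script per source language.
# Usage: write_launcher <OmegaT folder> <suffix> <tokenizer class name>
write_launcher() {
    script="$1/OmegaT-$2.sh"
    {
        printf '%s\n' '#!/bin/sh'
        # "$@" passes on the script's arguments, as %* does in a .bat file.
        printf 'java -jar OmegaT.jar "$@" --ITokenizer=%s\n' "$3"
    } > "$script" || return 1
    chmod +x "$script"
}
```

For example, calling write_launcher with the OmegaT folder, the suffix EN and org.omegat.plugins.tokenizer.SnowballEnglishTokenizer would create OmegaT-EN.sh in that folder. Because the generated script uses the basic launch command, it has to be run from within the main OmegaT folder so that OmegaT.jar is found.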

Copyright Marc Prior 2010