TextMaven Getting Started

Overview

The Translator reads a list of words as input, looks each word up in one or more dictionaries, and writes the translations to an output file. The dictionaries available for lookup are configured in the configuration file config.xml. Which dictionaries are actually used is determined by the default dictionary specified in the configuration file, by the command line, or by the input file. The output format is defined by the writer, which is also specified on the command line. Like dictionaries, the available writers are configured in the configuration file.

Running the Translator

In order to run the translator, type the following commands:

cd %TM_HOME%\bin
set_env.bat
java -Xmx256m -Dtextmaven.home=%TM_HOME% -cp %_tm_classpath% textmaven.application.translator.Main [options] [input-file]

The command file set_env.bat is provided for convenience. It sets the environment variable _tm_classpath to the classpath required to run the translator. Note also that, as a prerequisite, the server should already have been started.

To avoid typing this lengthy command line, you can use the command file tm instead:

cd %TM_HOME%\bin
tm [options] [input-file]

Options and parameters given to the command file are passed through unchanged to the Java program.

Options are:

`-?`	Prints the usage notes, that is, the option summary you are reading now.
`-o writer-id`	Specifies the output driver (writer) to use. By default, the first driver is used. To get a list of all supported writers, use the option -s. At the time of this writing, `browser` and `word` are supported.
`-b title`	Sets the title of the book in the output file.
`-s`	Prints the ids of all supported languages, dictionaries and writers.
`-f file`	Output is written to the specified file.
`-g`	Turns greedy mode on. In greedy mode, word reduction continues recursively even if a match in the dictionary has already been found.
`-c n`	Overrides the default number of sentences to be included in the output. The default is 3. This option has no effect if the specified writer does not support it.
`-d id=id1,id2,...`	Specifies the dictionaries to be used for lookups. If the option is not specified, the default dictionary is used. More than one id may be specified. The id on the left-hand side specifies the dictionary class to be instantiated. This dictionary has to be of type CompositeDictionary and is initialized with the dictionaries listed on the right-hand side. Currently, two types of composite dictionary are available: `sequence` (id is seqAll) and `union` (id is unionAll). A sequence composite dictionary searches the dictionaries it contains sequentially until a word lookup succeeds. A union composite dictionary searches all the dictionaries it is composed of and writes all translations found.
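The difference between the two composite dictionary types can be illustrated with a small Python sketch. This is not TextMaven's implementation: the function names, the use of plain maps as dictionaries, and the sample entries are all assumptions made for illustration only.

```python
from typing import Dict, List

# Hypothetical stand-in for a configured dictionary: a plain word -> translations map.
Dictionary = Dict[str, List[str]]


def lookup_sequence(dictionaries: List[Dictionary], word: str) -> List[str]:
    """Search the dictionaries in order and stop at the first one that knows the word."""
    for dictionary in dictionaries:
        if word in dictionary:
            return dictionary[word]
    return []


def lookup_union(dictionaries: List[Dictionary], word: str) -> List[str]:
    """Search every dictionary and collect all translations found."""
    translations: List[str] = []
    for dictionary in dictionaries:
        translations.extend(dictionary.get(word, []))
    return translations


# Made-up sample data, not taken from any real TextMaven dictionary.
general = {"bottle": ["Flasche"]}
technical = {"bottle": ["Gasflasche"], "batting": ["Watte"]}
```

With this sample data, a sequence lookup of "bottle" stops after the first dictionary, while a union lookup returns the translations from both.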

Note: If no input file is specified, the program reads its input from the console.

To use a configuration file other than config.xml, set the system property textmaven.configuration to the appropriate configuration file.

Writers

The installation comes with a number of writers configured:

lit	Produces the translations as an ebook that can be read by Microsoft Reader. With some manual steps, which I haven't yet written down, it can be converted into an MS Reader dictionary.
card	This writer is for use in a currently experimental project to create learning cards.
browser	Translations are written as an HTML file organized in a table. Output is optimized for viewing in a browser. If you copy the output file elsewhere, copy the CSS file `TM_HOME\resources\styles.css` along with it.
vocab	Translations are written as a comma-separated file that can be used in postprocessing tools, such as a vocabulary trainer.
word	Translations are written as an HTML file organized in a table. Output is optimized for viewing in Microsoft Word.

Input file format

Input files are ASCII files with each word on a separate line. Each line may consist of a number of fields, separated by the separator specified in the configuration file. The current default separator is the ° character, which I hope does not occur in a translation.

Each line may consist of the following fields:

Field 1	The word to be looked up.
Field 2	The page the word occurred on.
Field 3	The chapter the word occurred in.
Field 4	The dictionary to be used for lookup.
Field 5 ...	Sentences or fragments of sentences in which the word to be looked up occurred.

The following example is a fragment of the sample file TM_HOME\test\wm_test1.txt provided with the installation kit:

singeing°°Chapter 1
singing
°°Chapter 2
batting
humming°2°°°his voice was humming°humming, stunning°humming, hum, plum etc.°should not occcur -> humming
slowly°2
tries°3°Chapter 2°°how many tries
bottles°°°°messages in bottles
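
To make the field layout concrete, here is a minimal Python sketch that splits one input line into the fields described above. It is not the parser TextMaven uses: the names `InputLine` and `parse_line` are hypothetical, the separator is hard-coded to the default ° rather than read from the configuration file, and treating empty fields as absent is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional

SEPARATOR = "°"  # default separator named in the text; the real one comes from the configuration file


@dataclass
class InputLine:
    """Hypothetical container for the fields of one input line."""
    word: str
    page: Optional[str] = None
    chapter: Optional[str] = None
    dictionary: Optional[str] = None
    sentences: List[str] = field(default_factory=list)


def parse_line(line: str, separator: str = SEPARATOR) -> InputLine:
    """Split one line into fields 1 to 4 plus sentences; empty fields become None."""
    parts = line.rstrip("\r\n").split(separator)
    parts += [""] * (4 - len(parts))  # pad missing trailing fields (no-op if 4 or more)
    word, page, chapter, dictionary = parts[:4]
    return InputLine(
        word=word,
        page=page or None,
        chapter=chapter or None,
        dictionary=dictionary or None,
        sentences=[s for s in parts[4:] if s],
    )
```

For the sample line `tries°3°Chapter 2°°how many tries`, this yields the word `tries`, page `3`, chapter `Chapter 2`, no dictionary, and one sentence. How the Translator itself treats a line with an empty first field, such as `°°Chapter 2`, is not modeled here.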
