TextMaven - Analyzer

Overview

The Analyzer extracts words from any ASCII file. It was originally intended to gather statistical data on ebooks, such as the number of times each word occurs. Over time it was enhanced to generate an input file that the Translator can use to produce vocabulary files. A file containing stop words, i.e. words that should not be considered in the process, can be specified on the command line. In addition to generating a simple word list, the Analyzer can wrap the ebook as an HTML file in which each word not contained in the stop word file is linked to the vocabulary file to be produced with the Translator. This feature supports the production of a standalone HTML ebook that can be read online with all relevant vocabulary accessible by a single click.
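
The core idea described above (extract words, skip stop words, count occurrences) can be illustrated with a minimal Java sketch. This is not the Analyzer's actual code: the class name WordCountSketch, the definition of a word as a run of ASCII letters, and lower-casing as the normalization step are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; not the TextMaven implementation.
final class WordCountSketch {

    // Assumption: a "word" is a run of ASCII letters, optionally with inner apostrophes.
    private static final Pattern WORD = Pattern.compile("[A-Za-z]+(?:'[A-Za-z]+)*");

    /** Counts every word that is not a stop word, keeping the order of first appearance. */
    static Map<String, Integer> countWords(String text, Set<String> stopWords) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            // Assumption: normalization simply means lower-casing.
            String word = m.group().toLowerCase(Locale.ROOT);
            if (!stopWords.contains(word)) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
                countWords("The cat saw the other cat.", Set.of("the"));
        // Prints one "word,occurrences" pair per line: cat,2 then saw,1 then other,1
        counts.forEach((word, n) -> System.out.println(word + "," + n));
    }
}
```

A LinkedHashMap is used here so that words appear in the order they were first read; a real tool would sort or post-process the counts as requested by its options.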

Running the Analyzer

To run the Analyzer, type the following commands:

cd %TM_HOME%\bin
set_env.bat
java -Xmx256m -Dtextmaven.home=%TM_HOME% -cp %_tm_classpath% textmaven.application.analyzer.Main [options] [input-file]

The command file set_env.bat is provided for convenience. It sets the environment variable _tm_classpath to the classpath required to run the Analyzer. Note that, as a prerequisite, the server must already have been started.

Instead of typing the lengthy command line, you can use the command file ta:

cd %TM_HOME%\bin
ta [options] [input-file]

Options and parameters given to the command file are passed directly to the Java program.

Options are:

-? Prints these usage notes.
-o file Name of the output file. If none is specified, output is written to the console.
-e files List of files containing words to be excluded from processing. Excluded words are not written and not counted in the statistics. You can specify either a single file or a comma-separated list of files. If more than one file is specified, the list has to be put in quotes.
-s char Character used to separate the word from the distribution and occurrence numbers. Default is ,
-nodist No distribution values are written. If this option is active, each word is written as it was read in the input instead of in its normalized form.
-noocc No occurrence values are written.
-v Verbose mode. If not set, no logging output is written to the console at all, so the output of the program can be piped to another program.
-sort asc|desc Sorts the output by distribution value in ascending or descending order. By default, the output is ordered by occurrence.
-a[=n] Prints all words in the order they occurred without removing duplicates. The optional value n specifies the maximum number of times a word is repeated.
-c[=n] Prints the context, i.e. the sentences in which the word occurred. If the word occurs multiple times in the text, the value n specifies the maximum number of sentences to write to the output.
-m file Merges the output of this program run with the specified file. The file has to be in the same format as the output produced by this program.
-h file Decorates the input file by wrapping each word in a hypertext link to a vocabulary file, which can be produced in a subsequent step with WordMagic. To accomplish this, take the output file specified with the -o option and feed it into WordMagic. Note: The hypertext link points to the file output-file_vocab.html, so WordMagic has to be started with the option -f output-file_vocab.html. After the vocab file has been generated with WordMagic, the text can be viewed through the HTML file output-file_frame.html.
-t template-file Specifies the template file to be used in combination with the -h option (see above). The default template file is template.html and is located in the bin directory.
-version Prints program version.
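
The link decoration described for the -h option can be pictured with the following Java sketch. It is an illustration under stated assumptions, not the Analyzer's implementation: the class name LinkWordsSketch, the use of a #word fragment as the link target inside the vocabulary file, and the example file name book_vocab.html (following the output-file_vocab.html pattern for a hypothetical output file named book) are all invented for this example. The template mechanism of the -t option is left out.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; not the TextMaven implementation.
final class LinkWordsSketch {

    // Assumption: a "word" is a run of ASCII letters.
    private static final Pattern WORD = Pattern.compile("[A-Za-z]+");

    /** Wraps every word that is not a stop word in a link into the vocabulary file. */
    static String linkWords(String text, Set<String> stopWords, String vocabFile) {
        StringBuilder html = new StringBuilder();
        Matcher m = WORD.matcher(text);
        int last = 0;
        while (m.find()) {
            // Copy the text between two words (spaces, punctuation) HTML-escaped.
            html.append(escape(text.substring(last, m.start())));
            String word = m.group();
            String key = word.toLowerCase(Locale.ROOT);
            if (stopWords.contains(key)) {
                html.append(word);
            } else {
                // Assumption: the vocabulary file has one anchor per lower-cased word.
                html.append("<a href=\"").append(vocabFile).append('#').append(key)
                    .append("\">").append(word).append("</a>");
            }
            last = m.end();
        }
        html.append(escape(text.substring(last)));
        return html.toString();
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        // Prints: The <a href="book_vocab.html#cat">cat</a> <a href="book_vocab.html#sat">sat</a>.
        System.out.println(linkWords("The cat sat.", Set.of("the"), "book_vocab.html"));
    }
}
```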

To produce output that can be processed by WordMagic, specify the options -noocc and -nodist to suppress distribution and occurrence information.
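
How the separator and the two suppression options shape a single output line might look like the sketch below. The field order (word, distribution, occurrence) follows the wording of the -s option. The class name OutputLineSketch, the parameter names, and the plain number formatting are assumptions, and the example distribution value is arbitrary, since this page does not define how that value is computed.

```java
// Illustrative sketch only; not the TextMaven implementation.
final class OutputLineSketch {

    /**
     * Builds one output line: the word, then the distribution value and the
     * occurrence count, unless they are suppressed (as with -nodist and -noocc).
     */
    static String formatLine(String word, double distribution, int occurrences,
                             char separator, boolean noDist, boolean noOcc) {
        StringBuilder line = new StringBuilder(word);
        if (!noDist) {
            line.append(separator).append(distribution);
        }
        if (!noOcc) {
            line.append(separator).append(occurrences);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // Default separator, nothing suppressed. Prints: cat,0.5,2
        System.out.println(formatLine("cat", 0.5, 2, ',', false, false));
        // Both values suppressed, leaving a bare word list. Prints: cat
        System.out.println(formatLine("cat", 0.5, 2, ',', true, true));
    }
}
```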

Note: If no input file is specified, the program reads its input from the console.

