Translation Equivalence Detection Client GUI and API

Version .01 Beta
Author of this file and creator of distributed computing wrapper: Michael Nossal, The University of Maryland, nossal@umiacs.umd.edu
Creator of server side translation equivalence detection system: Noah Smith, UMD '01, currently PhD student at Johns Hopkins University
Advisor: Professor Philip Resnik, UMD, resnik@umiacs.umd.edu

Overview:

This software serves as 1) an application that allows documents in different languages to be sent to a server for the detection of translation equivalence, 2) an API that enables developers of NLP applications to quickly integrate translation detection into their distributed system architecture, and 3) an exercise to discover some of the issues that are relevant to building a distributed NLP system.
The system is built in two parts. The actual document translation equivalence software was built by Noah Smith as an undergraduate honors thesis project while at the University of Maryland. To this system, we added a wrapper to support distributed computing, as described above.

Application

The distributed software contains a graphical client that enables the user to send a collection of documents to the Document Translation Equivalence server for scoring (screenshot).

Installation Instructions:

Expand the tar file. The tar file should create the following directories:
1. dte -- root directory that contains subdirectory and this README file
2. dte/doc -- javadoc documentation for Java source code.
3. dte/jars -- contains the libraries required to run or develop with the software (the run command below uses dteclient.jar, parser.jar and xmlrpc.jar).
4. dte/test -- contains a small set of documents for testing the software. The subdirectories "english_bad", "english_good", "french_bad" and "french_good" contain documents that fill out a 2 x 2 matrix in which one axis is the language and the other is whether the document pair is a genuine translation pair or a red herring. Both the real and the apparent pairs are marked by identical file name prefixes (e.g. 12232.1 and 12232.2 from the english_good and french_good directories are translations of each other).
5. dte/client -- contains source code for the client API
6. dte/gui -- contains source code for the client GUI application
Download and install a Java runtime environment, version 1.3 or later, from www.javasoft.com (IBM's should work as well).
From the command line, go to the "jars" directory that was created when you expanded the tar file. Type the following command:

java -cp dteclient.jar:parser.jar:xmlrpc.jar dte.gui.ClientMain chocolate.umiacs.umd.edu 8888
Note: on Windows, replace the colons with semicolons.

User Instructions:

Using the application is straightforward.

First, type a job name. This will be included with the email that is asynchronously sent back to you after a job is completed on the server. In other words, it labels the request.
Choose a language pair, for instance "French/English". The client gets a list of available language pairs from the server at start-up.
Choose a language. Here you are labeling the document you are selecting as belonging to a particular language.
Choose a file or directory. If you choose a directory, then all of the directory's files are added to the document set.
Either choose to wait for the results to appear in the GUI or have them emailed to the provided address. The server may take a long time to process the document collection, especially as the size of the document set increases.

Application Programmer Interface (API)

There are two levels to the API. The Java level is contained in the classes in the client folder. Because communication with the server is done using XML-RPC, the messages themselves provide a second API.

Java API

Among the classes in the dte.client package, the most important is the MatchTranslationsClient class. This package is heavily documented, so I will defer to the javadoc for explanation of the API.

XML-RPC API

Using this lower-level messaging API, it is possible to build applications in a variety of languages, such as C++, PHP, or Perl, that access the DTE functionality. Indeed, XML-RPC client support has been implemented for a variety of languages. You can read about XML-RPC at http://www.xml-rpc.com, which includes links to the specification and a number of implementations.
The "doc" directory contains two files that document the raw XML-RPC messages that get sent over the wire, from both the client's and the server's perspectives.
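To give a feel for what such a message looks like, here is a minimal sketch in Java that builds a generic XML-RPC methodCall envelope with string parameters. The envelope layout follows the XML-RPC specification, but the class name, the method name "dte.exampleMethod" and the parameter values are hypothetical placeholders; the real method names and parameters are the ones recorded in the two files in the "doc" directory.

```java
import java.util.Arrays;
import java.util.List;

class XmlRpcEnvelope {

    // Escape the characters that XML reserves in text content.
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Build an XML-RPC methodCall whose parameters are all strings.
    static String methodCall(String methodName, List<String> params) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\"?>");
        xml.append("<methodCall>");
        xml.append("<methodName>").append(escape(methodName)).append("</methodName>");
        xml.append("<params>");
        for (String p : params) {
            xml.append("<param><value><string>")
               .append(escape(p))
               .append("</string></value></param>");
        }
        xml.append("</params></methodCall>");
        return xml.toString();
    }

    public static void main(String[] args) {
        // "dte.exampleMethod" is a hypothetical name used only for illustration;
        // it is not taken from the DTE server's actual interface.
        String request = methodCall("dte.exampleMethod",
                Arrays.asList("my job", "French/English"));
        System.out.println(request);
    }
}
```

A client in any language would send a document of this shape to the server in the body of an HTTP POST and parse the methodResponse that comes back.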

Understanding the results

The results of the document translation equivalence detection are returned as a table with five columns in ".csv" format (i.e. records separated by newlines, columns delimited by commas). The column headers are as follows:

doc A id

doc B id

t score

correspondence score A

correspondence score B

The T-score is a measure of confidence that the two documents are translation equivalents. Because the T-score is a decimal between zero and one rather than a boolean value, you will have to decide how to interpret it. One possibility is to set a threshold and accept only the pairs that score at or above it. The records are ordered from top to bottom by T-score. For guidance on interpreting the T-score and the correspondence scores, please refer to Noah Smith's undergraduate thesis, which is found here:

Smith, Noah A. (2001). "Detection of Translational Equivalence." Unpublished Undergraduate Honors Thesis. University of Maryland, College Park, MD, USA.

http://www.wam.umd.edu/~nasmith/cmsc-thesis.ps (postscript)

http://www.wam.umd.edu/~nasmith/cmsc-thesis.pdf (pdf)

http://www.wam.umd.edu/~nasmith/csthesis/index.html (html)
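The threshold approach mentioned above can be sketched in a few lines of Java. This is only an illustration, under several assumptions: the class and method names are made up, the sample scores are invented, the threshold of 0.5 is arbitrary, and the document ids are assumed not to contain commas. Because it is not certain whether the returned table includes a header row, the sketch simply skips any record whose third column is not a number.

```java
import java.util.ArrayList;
import java.util.List;

class ResultFilter {

    // Return the [doc A id, doc B id] pairs whose T-score is at least the threshold.
    static List<String[]> pairsAbove(String csv, double threshold) {
        List<String[]> accepted = new ArrayList<String[]>();
        for (String line : csv.split("\\r?\\n")) {
            String[] cols = line.split(",");
            if (cols.length < 5) {
                continue; // blank or malformed record
            }
            double tScore;
            try {
                tScore = Double.parseDouble(cols[2].trim());
            } catch (NumberFormatException e) {
                continue; // not a number, e.g. a header row if one is present
            }
            if (tScore >= threshold) {
                accepted.add(new String[] { cols[0].trim(), cols[1].trim() });
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        // Invented sample data in the five-column layout described above.
        String sample =
              "doc A id,doc B id,t score,correspondence score A,correspondence score B\n"
            + "12232.1,12232.2,0.93,0.81,0.79\n"
            + "12232.1,40417.2,0.12,0.20,0.18\n";
        for (String[] pair : pairsAbove(sample, 0.5)) {
            System.out.println(pair[0] + " <-> " + pair[1]);
        }
    }
}
```

Where to put the threshold depends on how costly false matches are compared with missed ones in your application; the thesis is the place to look for guidance on that choice.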

An exercise to discover difficulties in building a distributed architecture for NLP

I really didn't try to anticipate all of the methods that would be needed to support a generic framework for NLP modules. Instead, I tried to identify the minimal API for client-server communication in order to support a basic client application. Among the pertinent issues are the following:

How are documents defined, and what meta-information is required? (This application would need the document's language.)
What pre-processing steps should occur before a given process is performed? This application assumes no pre-processing. However, it would be improved by better sentence detection and tokenization. Within a general framework, I, as a programmer of a module, would like to insist that certain transformations are done to the documents before they are handed off to my module. This, of course, is the point of a distributed architecture.
Real-time vs. asynchronous communication. NLP modules may take a long time to complete their steps. For this application, asynchronous communication through email makes more sense than requiring that a connection between the client and server be maintained throughout the course of the transaction, which could take hours.
Denial of service attacks (intentional and unintentional). These are not addressed by my software, but they are a real concern, considering the amount of computing resources allocated to each request.
Discovery. This application deals with discovery only in very superficial ways.
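The asynchronous style discussed in the list above can be illustrated with a small Java sketch. It is not how the DTE client or server is implemented (the real system reports back by email); it only shows the general pattern of submitting a long-running job, returning immediately, and being notified on completion. All class and method names here are hypothetical, and the sketch assumes Java 8 or later for CompletableFuture.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class AsyncJobSketch {

    // Stand-in for the long-running scoring step; in the real system
    // this work happens on the server.
    static String scoreDocuments(String jobName) {
        return "results for " + jobName;
    }

    // Start the job on a worker thread and return at once with a handle
    // to the eventual result.
    static CompletableFuture<String> submit(String jobName, ExecutorService workers) {
        return CompletableFuture.supplyAsync(() -> scoreDocuments(jobName), workers);
    }

    public static void main(String[] args) {
        ExecutorService workers = Executors.newSingleThreadExecutor();

        // The callback plays the role of the notification email.
        CompletableFuture<Void> notified = submit("demo job", workers)
                .thenAccept(result -> System.out.println("notify: " + result));

        System.out.println("job submitted; the caller is free to do other work");

        notified.join();    // wait here only so that the demo can exit cleanly
        workers.shutdown();
    }
}
```

The point of the pattern is that the submitter holds no open connection while the job runs, which matters when a job can take hours.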

Please refer to the OpenNLP project for a more credible attempt to define the framework and protocols needed to support distributed NLP systems: http://opennlp.sourceforge.net