Rapidly Retargetable Translingual Detection
Objective:
The objective of this project is to rapidly create usable systems for translingual
document detection that can be employed by analysts who are fluent in English
to detect potentially important documents that are written in other languages.
Approach:
This objective will be met by developing a core set of technologies to automatically extract translation knowledge from naturally occurring resources and to apply that knowledge in translingual detection applications. The extraction effort focuses on three types of naturally occurring resources:
- printed bilingual dictionaries that can be rapidly scanned,
- translation-equivalent Web pages that can be automatically detected in large collections, and
- topically-related collections of "comparable" monolingual texts in each language that can be assembled automatically.
These sources have complementary strengths and limitations; by exploiting
all three it will be possible to rapidly assemble relatively comprehensive
translation lexicons. The resulting lexicons will be used with fully automatic
and semi-automatic (interactive) translingual detection techniques that
are tuned to the characteristics of those sources of translation knowledge.
Recent Accomplishments:
- Scanned Bilingual Dictionaries. Processing scanned bilingual dictionaries involves three stages: zoning, tagging, and lexicon construction. Development of an automated system for zoning has been completed, and some dictionary-specific solutions for the second and third stages have been built as a proof of concept.
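By way of illustration, a minimal sketch of the second and third stages (tagging and lexicon construction) is shown below. The entry layout, the regular expression, and the sample entries are assumptions made for this example; real dictionary layouts vary widely, and the proof-of-concept solutions described above are dictionary-specific.

    import re
    from collections import defaultdict

    # Hypothetical entry layout: "headword  pos.  translation1, translation2"
    # (an illustrative assumption; real scanned dictionaries vary widely).
    ENTRY_PATTERN = re.compile(r"^(?P<headword>\S+)\s+(?P<pos>[a-z]+\.)\s+(?P<translations>.+)$")

    def tag_entry(line):
        """Tag one OCR'd entry line into headword / part-of-speech / translation fields."""
        match = ENTRY_PATTERN.match(line.strip())
        if match is None:
            return None  # the line did not look like an entry (e.g., OCR noise)
        return {
            "headword": match.group("headword"),
            "pos": match.group("pos").rstrip("."),
            "translations": [t.strip() for t in match.group("translations").split(",")],
        }

    def build_lexicon(entry_lines):
        """Fold tagged entries into a headword -> set-of-translations lexicon."""
        lexicon = defaultdict(set)
        for line in entry_lines:
            entry = tag_entry(line)
            if entry is not None:
                lexicon[entry["headword"]].update(entry["translations"])
        return lexicon

    if __name__ == "__main__":
        sample = ["kitab  n.  book, volume", "qalam  n.  pen"]  # invented sample entries
        print(dict(build_lexicon(sample)))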
- Mining Comparable Corpora. Some types of terms that are often important in translingual retrieval applications cannot be reliably obtained from dictionaries. An inventory of dictionary coverage deficiencies has been completed, domain-specific technical terminology and transliterated named entities have been identified as important types of terms that might be found in comparable corpora, and techniques have been designed that leverage the availability of comparable corpora to address those challenges.
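As one concrete illustration of how comparable corpora can help with transliterated named entities, the sketch below matches a foreign-script name against English candidates drawn from a comparable collection by comparing a crude romanization with each candidate's spelling. The romanization table, the similarity threshold, and the example names are assumptions made for this sketch; the techniques designed in the project are not limited to this simple approach.

    from difflib import SequenceMatcher

    # Toy romanization table for a handful of Arabic characters; a real system
    # would need a full, statistically trained transliteration model.
    ROMANIZATION = {"م": "m", "ح": "h", "د": "d", "ا": "a", "ك": "k", "ت": "t", "ب": "b"}

    def romanize(word):
        """Crude character-by-character romanization of a foreign-script word."""
        return "".join(ROMANIZATION.get(ch, "") for ch in word)

    def best_transliteration_match(foreign_word, english_candidates, threshold=0.6):
        """Pick the English candidate whose spelling best matches the romanized form."""
        rom = romanize(foreign_word)
        scored = [(SequenceMatcher(None, rom, cand.lower()).ratio(), cand)
                  for cand in english_candidates]
        score, best = max(scored)
        return best if score >= threshold else None

    if __name__ == "__main__":
        # Candidate names drawn from (hypothetical) comparable news collections.
        print(best_transliteration_match("محمد", ["Mohammed", "Cairo", "Baghdad"]))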
- Arabic Information Retrieval. Semitic languages pose unusual challenges for document detection, and the connected characters used in printed Arabic pose unusual challenges for OCR. The newly available TREC Arabic information retrieval test collection was used as a basis for exploring retrieval from collections of scanned Arabic documents, obtaining excellent results from techniques based on indexing character n-grams.
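The exact n-gram configuration used in those experiments is not detailed here, but the basic idea of character n-gram indexing can be sketched as follows; the choice of word-internal trigrams is an assumption made for this example. Because a single OCR error corrupts only the few n-grams that span the affected character, most index terms of a misrecognized word still match the correctly spelled query term.

    def char_ngrams(text, n=3):
        """Generate overlapping word-internal character n-grams as index terms."""
        terms = []
        for word in text.split():
            if len(word) < n:
                terms.append(word)  # keep short words whole
            else:
                terms.extend(word[i:i + n] for i in range(len(word) - n + 1))
        return terms

    if __name__ == "__main__":
        # "retrieval" and an OCR-corrupted "retrieva1" still share most trigrams.
        print(char_ngrams("retrieval"))
        print(char_ngrams("retrieva1"))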
- Interactive Translingual Retrieval. Automatic document detection techniques based on term matching allow large collections to be searched efficiently, but human interaction is often needed in order to refine information need profiles or to cull the most promising of the detected documents for subsequent processing (e.g., human translation). Evaluation methods for user interaction in translingual retrieval have been developed in cooperation with the European Cross-Language Evaluation Forum (CLEF). Three teams (Maryland, Sheffield, and UNED) participated in the 2001 CLEF interactive track evaluation, reporting results from user studies with a total of 44 participants. The principal results were the development of a suitable evaluation methodology for interactive document selection and the identification of translation of isolated phrases as a compact and useful selection cue. A second set of experiments focusing on interactive query formulation has also recently been completed.
- Translation Detection. Recognizing the availability of existing translations is an important capability, since human translations are typically far superior to those that can be produced automatically. A system for dictionary-based automatic translation detection has been developed.
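The scoring used by that system is not described here; the sketch below illustrates only the general idea of dictionary-based translation detection, scoring a candidate document pair by the fraction of source-document terms that have a dictionary translation present in the candidate target document. The scoring rule and the toy lexicon are assumptions made for this example.

    def translation_score(source_tokens, candidate_tokens, lexicon):
        """Fraction of source tokens with at least one dictionary translation in the candidate."""
        candidate_vocab = set(candidate_tokens)
        covered = sum(1 for tok in source_tokens
                      if lexicon.get(tok, set()) & candidate_vocab)
        return covered / max(len(source_tokens), 1)

    if __name__ == "__main__":
        lexicon = {"kitab": {"book"}, "jadid": {"new"}}  # toy bilingual lexicon
        # A high score suggests the candidate may be a translation of the source.
        print(translation_score(["kitab", "jadid"], ["a", "new", "book"], lexicon))  # 1.0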
- String Error Model for OCR. Modeling OCR errors with high fidelity is an important prerequisite to integrating translation lexicons derived from scanned bilingual dictionaries into translingual retrieval systems, and good error models might also improve accuracy when searching scanned documents. A novel approach to parameter estimation for string error model training in OCR applications has been developed.
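The novel parameter estimation approach itself is not reproduced here, but a much-simplified character confusion model conveys what a string error model captures: the probability that the OCR process produces a given character from a given true character. The training pairs below are invented, and restricting the model to one-for-one character substitutions is a simplifying assumption; real OCR error models must also handle insertions, deletions, and merged characters.

    import math
    from collections import Counter, defaultdict

    def train_confusion_model(aligned_pairs):
        """Estimate P(ocr_char | true_char) from character-aligned (truth, ocr) pairs."""
        counts = defaultdict(Counter)
        for truth, ocr in aligned_pairs:
            for true_ch, ocr_ch in zip(truth, ocr):
                counts[true_ch][ocr_ch] += 1
        return {t: {o: c / sum(obs.values()) for o, c in obs.items()}
                for t, obs in counts.items()}

    def string_log_prob(truth, ocr, model, floor=1e-6):
        """Log-probability that `ocr` was produced by OCR from the true string `truth`."""
        return sum(math.log(model.get(t, {}).get(o, floor))
                   for t, o in zip(truth, ocr))

    if __name__ == "__main__":
        pairs = [("lexicon", "1exicon"), ("detect", "detecl"), ("arabic", "arabic")]
        model = train_confusion_model(pairs)
        print(model["t"])                                  # {'t': 0.5, 'l': 0.5}
        print(string_log_prob("detect", "detecl", model))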
Current Plan:
- Mining the Web for Parallel Text. A system to efficiently search the Internet Archive for translation-equivalent pages is near completion. That system will be used to create translation lexicons in several languages (including Arabic) and to experimentally evaluate the utility of those lexicons for document detection.
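One widely used heuristic for finding translation-equivalent pages, sketched below under simplifying assumptions, pairs URLs that become identical once a language marker (such as "en" or "ar") is removed; matched candidates would then be verified with structural and content-based filters. The marker list and the pairing rule are illustrative assumptions, not a description of the system under development.

    import re

    # Language markers often embedded in the URLs of translated page pairs
    # (an abbreviated, illustrative list; longer markers are listed first).
    LANG_MARKERS = ["english", "arabic", "en", "ar"]

    def normalize(url):
        """Replace any language marker in the URL with a placeholder."""
        pattern = r"(?<![a-z])(" + "|".join(map(re.escape, LANG_MARKERS)) + r")(?![a-z])"
        return re.sub(pattern, "<LANG>", url.lower())

    def candidate_pairs(urls):
        """Group URLs that become identical once language markers are removed."""
        buckets = {}
        for url in urls:
            buckets.setdefault(normalize(url), []).append(url)
        return [tuple(group) for group in buckets.values() if len(group) > 1]

    if __name__ == "__main__":
        urls = ["http://example.org/en/news/1.html",
                "http://example.org/ar/news/1.html",
                "http://example.org/en/about.html"]
        print(candidate_pairs(urls))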
- Scanned Bilingual Dictionaries. Machine learning techniques will be applied, with a minimum of human involvement, to tag the components of bilingual dictionary entries produced by optical character recognition. The utility of the resulting tagged entries will then be demonstrated in a document detection application by automatically extracting a simple translation lexicon for use with the string error model in batch ranked retrieval experiments.
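The learning method is not specified here, but the sketch below shows the kind of per-token features a trained tagger might use to label entry components (headword, part of speech, or translation). Both the feature set and the implied label set are assumptions made for this example.

    POS_ABBREVIATIONS = {"n", "v", "adj", "adv"}  # assumed abbreviation inventory

    def token_features(tokens, i):
        """Simple per-token features for labeling dictionary entry components."""
        tok = tokens[i]
        return {
            "position": i,                               # headwords tend to come first
            "looks_like_pos": tok.rstrip(".").lower() in POS_ABBREVIATIONS,
            "ends_with_period": tok.endswith("."),
            "is_ascii": tok.isascii(),                   # a rough script indicator
            "prev_is_pos": i > 0 and tokens[i - 1].rstrip(".").lower() in POS_ABBREVIATIONS,
        }

    if __name__ == "__main__":
        entry = ["kitab", "n.", "book,", "volume"]       # invented OCR'd entry tokens
        for i, tok in enumerate(entry):
            print(tok, token_features(entry, i))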
- Arabic Information Retrieval. Translingual document detection for Arabic will be evaluated at TREC-2002, and possibly also at TDT-2002.
- Interactive Translingual Retrieval. The results of a collaboration with BBN to generate translated headlines will be evaluated in interactive document detection experiments.
- Mining Comparable Corpora. A small effort will focus on learning improved transliteration models for named entities from comparable corpora.
Technology Transition:
- Translation Detection. The translation detection software was packaged as a service using the OnTAP service architecture. Integration into OnTAP has not yet been completed, but the software is presently available on an open source basis.
- Arabic Information Retrieval. An Arabic light stemmer was packaged as open source software for use by TREC-2002 CLIR track participants in a "standard resources" run. The University of Massachusetts has contributed an improvement to that stemmer, which is now included in the standard distribution. A statistically trained morphological analysis system for Arabic and the full Arabic document image retrieval system have also been packaged as open source software that may be useful to other TIDES and TDT teams.
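For illustration, a much-simplified version of light stemming is sketched below: strip a few common prefixes (such as the conjunction و and the definite article ال) and suffixes, leaving the rest of the word untouched. The abbreviated affix lists and the minimum stem length are assumptions made for this example and do not reproduce the packaged stemmer's actual rules.

    # Abbreviated affix lists (illustrative only); prefixes are tried longest first.
    PREFIXES = ["وال", "ال", "و"]
    SUFFIXES = ["ات", "ون", "ين", "ة", "ه"]

    def light_stem(word, min_stem_len=2):
        """Strip at most one prefix and one suffix, keeping a minimum stem length."""
        for prefix in PREFIXES:
            if word.startswith(prefix) and len(word) - len(prefix) >= min_stem_len:
                word = word[len(prefix):]
                break
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
                word = word[:-len(suffix)]
                break
        return word

    if __name__ == "__main__":
        print(light_stem("والكتاب"))  # strips the prefix, leaving "كتاب"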
- Interactive Translingual Retrieval. A second-generation interactive translingual retrieval interface has been developed in Java and packaged as open source software. This software has been shared with the University of Sheffield to facilitate experiments for the CLEF 2002 interactive track.
Updated by Daqing He / UMIACS / UMD