Rapidly Retargetable Translingual Detection
Objective:
The objective of this project is to rapidly create usable systems for translingual
document detection that can be employed by analysts who are fluent in English
to detect potentially important documents that are written in other languages.
Approach:
This objective will be met by developing a core set of technologies to automatically extract translation knowledge from naturally occurring resources and to apply that knowledge in translingual detection applications. The extraction effort focuses on three types of naturally occurring resources:
- printed bilingual dictionaries that can be rapidly scanned,
- translation-equivalent Web pages that can be automatically detected in large collections, and
- topically-related collections of "comparable" monolingual texts in each language that can be assembled automatically.
These sources have complementary strengths and limitations; by exploiting
all three it will be possible to rapidly assemble relatively comprehensive
translation lexicons. The resulting lexicons will be used with fully automatic
and semi-automatic (interactive) translingual detection techniques that
are tuned to the characteristics of those sources of translation knowledge.
Recent Accomplishments:
- Scanned Bilingual Dictionaries. Processing scanned bilingual dictionaries involves three stages: zoning, tagging, and lexicon construction. Development of an automated system for zoning has been completed, and some dictionary-specific solutions for the second and third stages have been built as a proof of concept.
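By way of illustration, a minimal sketch of the second and third stages (tagging and lexicon construction) is shown below. The entry layout, the regular expression, and the sample entries are assumptions made for this example; real dictionary layouts vary widely, and the proof-of-concept solutions described above are dictionary-specific.

    import re
    from collections import defaultdict

    # Hypothetical entry layout: "headword  pos.  translation1, translation2"
    # (an illustrative assumption; real scanned dictionaries vary widely).
    ENTRY_PATTERN = re.compile(r"^(?P<headword>\S+)\s+(?P<pos>[a-z]+\.)\s+(?P<translations>.+)$")

    def tag_entry(line):
        """Tag one OCR'd entry line into headword / part-of-speech / translation fields."""
        match = ENTRY_PATTERN.match(line.strip())
        if match is None:
            return None  # the line did not look like an entry (e.g., OCR noise)
        return {
            "headword": match.group("headword"),
            "pos": match.group("pos").rstrip("."),
            "translations": [t.strip() for t in match.group("translations").split(",")],
        }

    def build_lexicon(entry_lines):
        """Fold tagged entries into a headword -> set-of-translations lexicon."""
        lexicon = defaultdict(set)
        for line in entry_lines:
            entry = tag_entry(line)
            if entry is not None:
                lexicon[entry["headword"]].update(entry["translations"])
        return lexicon

    if __name__ == "__main__":
        sample = ["kitab  n.  book, volume", "qalam  n.  pen"]  # invented sample entries
        print(dict(build_lexicon(sample)))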
- Mining Comparable Corpora. Some types of terms that are often important in translingual retrieval applications cannot be reliably obtained from dictionaries. An inventory of dictionary coverage deficiencies has been completed, domain-specific technical terminology and transliterated named entities have been identified as important types of terms that might be found in comparable corpora, and techniques have been designed that leverage the availability of comparable corpora to address those challenges.
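As one concrete illustration of how comparable corpora can help with transliterated named entities, the sketch below matches a foreign-script name against English candidates drawn from a comparable collection by comparing a crude romanization with each candidate's spelling. The romanization table, the similarity threshold, and the example names are assumptions made for this sketch; the techniques designed in the project are not limited to this simple approach.

    from difflib import SequenceMatcher

    # Toy romanization table for a handful of Arabic characters; a real system
    # would need a full, statistically trained transliteration model.
    ROMANIZATION = {"م": "m", "ح": "h", "د": "d", "ا": "a", "ك": "k", "ت": "t", "ب": "b"}

    def romanize(word):
        """Crude character-by-character romanization of a foreign-script word."""
        return "".join(ROMANIZATION.get(ch, "") for ch in word)

    def best_transliteration_match(foreign_word, english_candidates, threshold=0.6):
        """Pick the English candidate whose spelling best matches the romanized form."""
        rom = romanize(foreign_word)
        scored = [(SequenceMatcher(None, rom, cand.lower()).ratio(), cand)
                  for cand in english_candidates]
        score, best = max(scored)
        return best if score >= threshold else None

    if __name__ == "__main__":
        # Candidate names drawn from (hypothetical) comparable news collections.
        print(best_transliteration_match("محمد", ["Mohammed", "Cairo", "Baghdad"]))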
- Arabic Information Retrieval. Semitic languages pose unusual challenges for document detection, and the connected characters used in printed Arabic pose unusual challenges for OCR. The newly available TREC Arabic information retrieval test collection was used as a basis for exploring retrieval from collections of scanned Arabic documents, obtaining excellent results from techniques based on indexing character n-grams.
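The exact n-gram configuration used in those experiments is not detailed here, but the basic idea of character n-gram indexing can be sketched as follows; the choice of word-internal trigrams is an assumption made for this example. Because a single OCR error corrupts only the few n-grams that span the affected character, most index terms of a misrecognized word still match the correctly spelled query term.

    def char_ngrams(text, n=3):
        """Generate overlapping word-internal character n-grams as index terms."""
        terms = []
        for word in text.split():
            if len(word) < n:
                terms.append(word)  # keep short words whole
            else:
                terms.extend(word[i:i + n] for i in range(len(word) - n + 1))
        return terms

    if __name__ == "__main__":
        # "retrieval" and an OCR-corrupted "retrieva1" still share most trigrams.
        print(char_ngrams("retrieval"))
        print(char_ngrams("retrieva1"))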
- Interactive Translingual Retrieval. Automatic document detection techniques based on term matching allow large collections to be searched efficiently, but human interaction is often needed in order to refine information need profiles or to cull the most promising of the detected documents for subsequent processing (e.g., human translation). Evaluation methods for user interaction in translingual retrieval have been developed in cooperation with the European Cross-Language Evaluation Forum (CLEF). Three teams (Maryland, Sheffield, and UNED) participated in the 2001 CLEF interactive track evaluation, reporting results from user studies with a total of 44 participants. The principal results were the development of a suitable evaluation methodology for interactive document selection and the identification of translation of isolated phrases as a compact and useful selection cue. A second set of experiments focusing on interactive query formulation has also recently been completed.
- Translation Detection. Recognizing the availability of existing translations is an important capability, since human translations are typically far superior to those that can be produced automatically. A system for dictionary-based automatic translation detection has been developed.
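The scoring used by that system is not described here; the sketch below illustrates only the general idea of dictionary-based translation detection, scoring a candidate document pair by the fraction of source-document terms that have a dictionary translation present in the candidate target document. The scoring rule and the toy lexicon are assumptions made for this example.

    def translation_score(source_tokens, candidate_tokens, lexicon):
        """Fraction of source tokens with at least one dictionary translation in the candidate."""
        candidate_vocab = set(candidate_tokens)
        covered = sum(1 for tok in source_tokens
                      if lexicon.get(tok, set()) & candidate_vocab)
        return covered / max(len(source_tokens), 1)

    if __name__ == "__main__":
        lexicon = {"kitab": {"book"}, "jadid": {"new"}}  # toy bilingual lexicon
        # A high score suggests the candidate may be a translation of the source.
        print(translation_score(["kitab", "jadid"], ["a", "new", "book"], lexicon))  # 1.0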
- String Error Model for OCR. Modeling OCR errors with high fidelity is an important prerequisite to integrating translation lexicons derived from scanned bilingual dictionaries into translingual retrieval systems, and good error models might also improve accuracy when searching scanned documents. A novel approach to parameter estimation for string error model training in OCR applications has been developed.
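The novel parameter estimation approach itself is not reproduced here, but a much-simplified character confusion model conveys what a string error model captures: the probability that the OCR process produces a given character from a given true character. The training pairs below are invented, and restricting the model to one-for-one character substitutions is a simplifying assumption; real OCR error models must also handle insertions, deletions, and merged characters.

    import math
    from collections import Counter, defaultdict

    def train_confusion_model(aligned_pairs):
        """Estimate P(ocr_char | true_char) from character-aligned (truth, ocr) pairs."""
        counts = defaultdict(Counter)
        for truth, ocr in aligned_pairs:
            for true_ch, ocr_ch in zip(truth, ocr):
                counts[true_ch][ocr_ch] += 1
        return {t: {o: c / sum(obs.values()) for o, c in obs.items()}
                for t, obs in counts.items()}

    def string_log_prob(truth, ocr, model, floor=1e-6):
        """Log-probability that `ocr` was produced by OCR from the true string `truth`."""
        return sum(math.log(model.get(t, {}).get(o, floor))
                   for t, o in zip(truth, ocr))

    if __name__ == "__main__":
        pairs = [("lexicon", "1exicon"), ("detect", "detecl"), ("arabic", "arabic")]
        model = train_confusion_model(pairs)
        print(model["t"])                                  # {'t': 0.5, 'l': 0.5}
        print(string_log_prob("detect", "detecl", model))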
Current Plan:
- Mining the Web for Parallel Text. A system to efficiently search the Internet Archive for translation-equivalent pages is near completion. That system will be used to create translation lexicons in several languages (including Arabic) and to experimentally evaluate the utility of those lexicons for document detection.
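One widely used heuristic for finding translation-equivalent pages, sketched below under simplifying assumptions, pairs URLs that become identical once a language marker (such as "en" or "ar") is removed; matched candidates would then be verified with structural and content-based filters. The marker list and the pairing rule are illustrative assumptions, not a description of the system under development.

    import re

    # Language markers often embedded in the URLs of translated page pairs
    # (an abbreviated, illustrative list; longer markers are listed first).
    LANG_MARKERS = ["english", "arabic", "en", "ar"]

    def normalize(url):
        """Replace any language marker in the URL with a placeholder."""
        pattern = r"(?<![a-z])(" + "|".join(map(re.escape, LANG_MARKERS)) + r")(?![a-z])"
        return re.sub(pattern, "<LANG>", url.lower())

    def candidate_pairs(urls):
        """Group URLs that become identical once language markers are removed."""
        buckets = {}
        for url in urls:
            buckets.setdefault(normalize(url), []).append(url)
        return [tuple(group) for group in buckets.values() if len(group) > 1]

    if __name__ == "__main__":
        urls = ["http://example.org/en/news/1.html",
                "http://example.org/ar/news/1.html",
                "http://example.org/en/about.html"]
        print(candidate_pairs(urls))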
- Scanned Bilingual Dictionaries. Machine learning techniques will be applied, with a minimum of human involvement, to tag the components of bilingual dictionary entries produced by optical character recognition. The utility of the resulting tagged entries will then be demonstrated in a document detection application by automatically extracting a simple translation lexicon for use with the string error model in batch ranked retrieval experiments.
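The learning method is not specified here, but the sketch below shows the kind of per-token features a trained tagger might use to label entry components (headword, part of speech, or translation). Both the feature set and the implied label set are assumptions made for this example.

    POS_ABBREVIATIONS = {"n", "v", "adj", "adv"}  # assumed abbreviation inventory

    def token_features(tokens, i):
        """Simple per-token features for labeling dictionary entry components."""
        tok = tokens[i]
        return {
            "position": i,                               # headwords tend to come first
            "looks_like_pos": tok.rstrip(".").lower() in POS_ABBREVIATIONS,
            "ends_with_period": tok.endswith("."),
            "is_ascii": tok.isascii(),                   # a rough script indicator
            "prev_is_pos": i > 0 and tokens[i - 1].rstrip(".").lower() in POS_ABBREVIATIONS,
        }

    if __name__ == "__main__":
        entry = ["kitab", "n.", "book,", "volume"]       # invented OCR'd entry tokens
        for i, tok in enumerate(entry):
            print(tok, token_features(entry, i))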
- Arabic Information Retrieval. Translingual document detection for Arabic will be evaluated at TREC-2002, and possibly also at TDT-2002.
- Interactive Translingual Retrieval. The results of a collaboration with BBN to generate translated headlines will be evaluated in interactive document detection experiments.
- Mining Comparable Corpora. A small effort will focus on learning improved transliteration models for named entities from comparable corpora.
Technology Transition:
- Translation Detection. The translation detection software was packaged as a service using the OnTAP service architecture. Integration into OnTAP has not yet been completed, but the software is presently available on an open source basis.
- Arabic Information Retrieval. An Arabic light stemmer was packaged as open source software for use by TREC-2002 CLIR track participants in a "standard resources" run. The University of Massachusetts has contributed an improvement to that stemmer, which is now included in the standard distribution. A statistically trained morphological analysis system for Arabic and the full Arabic document image retrieval system have also been packaged as open source software that may be useful to other TIDES and TDT teams.
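For illustration, a much-simplified version of light stemming is sketched below: strip a few common prefixes (such as the conjunction و and the definite article ال) and suffixes, leaving the rest of the word untouched. The abbreviated affix lists and the minimum stem length are assumptions made for this example and do not reproduce the packaged stemmer's actual rules.

    # Abbreviated affix lists (illustrative only); prefixes are tried longest first.
    PREFIXES = ["وال", "ال", "و"]
    SUFFIXES = ["ات", "ون", "ين", "ة", "ه"]

    def light_stem(word, min_stem_len=2):
        """Strip at most one prefix and one suffix, keeping a minimum stem length."""
        for prefix in PREFIXES:
            if word.startswith(prefix) and len(word) - len(prefix) >= min_stem_len:
                word = word[len(prefix):]
                break
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
                word = word[:-len(suffix)]
                break
        return word

    if __name__ == "__main__":
        print(light_stem("والكتاب"))  # strips the prefix, leaving "كتاب"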
- Interactive Translingual Retrieval. A second-generation interactive translingual retrieval interface has been developed in Java and packaged as open source software. This software has been shared with the University of Sheffield to facilitate experiments for the CLEF 2002 interactive track.
Updated by Daqing He / UMIACS / UMD