Class AbstractTreebankLanguagePack

java.lang.Object
  extended by edu.stanford.nlp.trees.AbstractTreebankLanguagePack
All Implemented Interfaces:
TreebankLanguagePack, Serializable
Direct Known Subclasses:
ChineseTreebankLanguagePack, NegraPennLanguagePack, PennTreebankLanguagePack, TueBaDZLanguagePack

public abstract class AbstractTreebankLanguagePack
extends Object
implements TreebankLanguagePack, Serializable

This class implements parts of the TreebankLanguagePack API to reduce the work needed for a new implementation. Only the abstract methods below need to be implemented to give a reasonable solution for a new language.
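
The division of labor described above can be illustrated with a small, self-contained sketch. It does not use the real library classes: SketchLanguagePack is a hypothetical stand-in for the base class, ToyLanguagePack is a made-up subclass, and the tag and word inventories are invented for illustration. The membership tests are also an assumption; the real class may decide these cases differently.

```java
import java.util.Arrays;

/** Hypothetical stand-in for the base class: four abstract getters plus derived helpers. */
abstract class SketchLanguagePack {
    abstract String[] punctuationTags();
    abstract String[] punctuationWords();
    abstract String[] sentenceFinalPunctuationTags();
    abstract String[] startSymbols();

    // The derived predicates are written once, against the abstract getters.
    boolean isPunctuationTag(String str) {
        return Arrays.asList(punctuationTags()).contains(str);
    }

    boolean isPunctuationWord(String str) {
        return Arrays.asList(punctuationWords()).contains(str);
    }

    boolean isSentenceFinalPunctuationTag(String str) {
        return Arrays.asList(sentenceFinalPunctuationTags()).contains(str);
    }

    // The first start symbol, or null if none is defined.
    String startSymbol() {
        String[] symbols = startSymbols();
        return (symbols == null || symbols.length == 0) ? null : symbols[0];
    }
}

/** Toy language pack; its inventories are invented for this example. */
class ToyLanguagePack extends SketchLanguagePack {
    @Override String[] punctuationTags() { return new String[] {".", ",", ":"}; }
    @Override String[] punctuationWords() { return new String[] {".", ",", ";", "?"}; }
    @Override String[] sentenceFinalPunctuationTags() { return new String[] {"."}; }
    @Override String[] startSymbols() { return new String[] {"ROOT"}; }
}
```

With this sketch, new ToyLanguagePack().isPunctuationTag(",") is true and startSymbol() returns "ROOT", although the subclass supplies nothing but the four arrays.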

Christopher Manning
See Also:
Serialized Form

Field Summary
static String DEFAULT_ENCODING
          Use this as the default encoding for Readers and Writers of Treebank data.
Constructor Summary
AbstractTreebankLanguagePack()
          Gives a handle to the TreebankLanguagePack
Method Summary
 String basicCategory(String category)
          Returns the basic syntactic category of a String.
 String categoryAndFunction(String category)
          Returns the syntactic category and 'function' of a String.
 Filter evalBIgnoredPunctuationTagAcceptFilter()
          Returns a filter that accepts a String that is a punctuation tag that should be ignored by EVALB-style evaluation, and rejects everything else.
 Filter evalBIgnoredPunctuationTagRejectFilter()
          Returns a filter that accepts everything except a String that is a punctuation tag that should be ignored by EVALB-style evaluation.
 String[] evalBIgnoredPunctuationTags()
          Returns a String array of punctuation tags that EVALB-style evaluation should ignore for this treebank/language.
 Function getBasicCategoryFunction()
          Returns a Function object that maps Strings to Strings according to this TreebankLanguagePack's basicCategory method.
 Function getCategoryAndFunctionFunction()
          Returns a Function object that maps Strings to Strings according to this TreebankLanguagePack's categoryAndFunction method.
 String getEncoding()
          Return the input Charset encoding for the Treebank.
 TokenizerFactory getTokenizerFactory()
          Return a tokenizer factory which might be suitable for tokenizing text that will be used with this Treebank/Language pair, without tokenizing carriage returns (i.e., treating them as white space).
 GrammaticalStructureFactory grammaticalStructureFactory()
          Return a GrammaticalStructureFactory suitable for this language/treebank.
 boolean isEvalBIgnoredPunctuationTag(String str)
          Accepts a String that is a punctuation tag that should be ignored by EVALB-style evaluation, and rejects everything else.
 boolean isLabelAnnotationIntroducingCharacter(char ch)
          Say whether this character is an annotation introducing character.
 boolean isPunctuationTag(String str)
          Accepts a String that is a punctuation tag name, and rejects everything else.
 boolean isPunctuationWord(String str)
          Accepts a String that is a punctuation word, and rejects everything else.
 boolean isSentenceFinalPunctuationTag(String str)
          Accepts a String that is a sentence end punctuation tag, and rejects everything else.
 boolean isStartSymbol(String str)
          Accepts a String that is a start symbol of the treebank.
 char[] labelAnnotationIntroducingCharacters()
          Return an array of characters at which a String should be truncated to give the basic syntactic category of a label.
 Filter punctuationTagAcceptFilter()
          Return a filter that accepts a String that is a punctuation tag name, and rejects everything else.
 Filter punctuationTagRejectFilter()
          Return a filter that rejects a String that is a punctuation tag name, and accepts everything else.
abstract  String[] punctuationTags()
          Returns a String array of punctuation tags for this treebank/language.
 Filter punctuationWordAcceptFilter()
          Returns a filter that accepts a String that is a punctuation word, and rejects everything else.
 Filter punctuationWordRejectFilter()
          Returns a filter that accepts a String that is not a punctuation word, and rejects punctuation.
abstract  String[] punctuationWords()
          Returns a String array of punctuation words for this treebank/language.
 Filter sentenceFinalPunctuationTagAcceptFilter()
          Returns a filter that accepts a String that is a sentence end punctuation tag, and rejects everything else.
abstract  String[] sentenceFinalPunctuationTags()
          Returns a String array of sentence final punctuation tags for this treebank/language.
 String startSymbol()
          Returns a String which is the first (perhaps unique) start symbol of the treebank, or null if none is defined.
 Filter startSymbolAcceptFilter()
          Return a filter that accepts a String that is a start symbol of the treebank, and rejects everything else.
abstract  String[] startSymbols()
          Returns a String array of treebank start symbols.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface edu.stanford.nlp.trees.TreebankLanguagePack
sentenceFinalPunctuationWords, treebankFileExtension

Field Detail


public static final String DEFAULT_ENCODING
Use this as the default encoding for Readers and Writers of Treebank data.

See Also:
Constant Field Values
Constructor Detail


public AbstractTreebankLanguagePack()
Gives a handle to the TreebankLanguagePack

Method Detail


public abstract String[] punctuationTags()
Returns a String array of punctuation tags for this treebank/language.

Specified by:
punctuationTags in interface TreebankLanguagePack
The punctuation tags


public abstract String[] punctuationWords()
Returns a String array of punctuation words for this treebank/language.

Specified by:
punctuationWords in interface TreebankLanguagePack
The punctuation words


public abstract String[] sentenceFinalPunctuationTags()
Returns a String array of sentence final punctuation tags for this treebank/language.

Specified by:
sentenceFinalPunctuationTags in interface TreebankLanguagePack
The sentence final punctuation tags


public String[] evalBIgnoredPunctuationTags()
Returns a String array of punctuation tags that EVALB-style evaluation should ignore for this treebank/language. Traditionally, EVALB has ignored a subset of the total set of punctuation tags in the English Penn Treebank (quotes and period, comma, colon, etc., but not brackets).

Specified by:
evalBIgnoredPunctuationTags in interface TreebankLanguagePack
The EVALB-ignored punctuation tags


public boolean isPunctuationTag(String str)
Accepts a String that is a punctuation tag name, and rejects everything else.

Specified by:
isPunctuationTag in interface TreebankLanguagePack
Whether this is a punctuation tag


public boolean isPunctuationWord(String str)
Accepts a String that is a punctuation word, and rejects everything else. If one can't tell for sure (as for ' in the Penn Treebank), it makes the best guess that it can.

Specified by:
isPunctuationWord in interface TreebankLanguagePack
Whether this is a punctuation word


public boolean isSentenceFinalPunctuationTag(String str)
Accepts a String that is a sentence end punctuation tag, and rejects everything else.

Specified by:
isSentenceFinalPunctuationTag in interface TreebankLanguagePack
Whether this is a sentence final punctuation tag


public boolean isEvalBIgnoredPunctuationTag(String str)
Accepts a String that is a punctuation tag that should be ignored by EVALB-style evaluation, and rejects everything else. Traditionally, EVALB has ignored a subset of the total set of punctuation tags in the English Penn Treebank (quotes and period, comma, colon, etc., but not brackets).

Specified by:
isEvalBIgnoredPunctuationTag in interface TreebankLanguagePack
Whether this is an EVALB-ignored punctuation tag


public Filter punctuationTagAcceptFilter()
Return a filter that accepts a String that is a punctuation tag name, and rejects everything else.

Specified by:
punctuationTagAcceptFilter in interface TreebankLanguagePack
The filter


public Filter punctuationTagRejectFilter()
Return a filter that rejects a String that is a punctuation tag name, and accepts everything else.

Specified by:
punctuationTagRejectFilter in interface TreebankLanguagePack
The filter
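
The relationship between an accept filter and its reject counterpart can be sketched as follows. This is a minimal illustration, not the library's code: java.util.function.Predicate stands in for the Filter type, the class name PunctuationFilterSketch is hypothetical, the tag set is invented, and the sketch assumes the reject filter is simply the logical complement of the accept filter.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

final class PunctuationFilterSketch {
    // Illustrative tag inventory, not taken from any real language pack.
    private static final Set<String> PUNCTUATION_TAGS =
        new HashSet<>(Arrays.asList(".", ",", ":", "``", "''"));

    // Accepts punctuation tags and rejects everything else.
    static Predicate<String> punctuationTagAcceptFilter() {
        return PUNCTUATION_TAGS::contains;
    }

    // Rejects punctuation tags and accepts everything else.
    static Predicate<String> punctuationTagRejectFilter() {
        return punctuationTagAcceptFilter().negate();
    }
}
```

Under these assumptions, the accept filter returns true for "," and false for "NN", and the reject filter returns the opposite in both cases.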


public Filter punctuationWordAcceptFilter()
Returns a filter that accepts a String that is a punctuation word, and rejects everything else. If one can't tell for sure (as for ' in the Penn Treebank), it makes the best guess that it can.

Specified by:
punctuationWordAcceptFilter in interface TreebankLanguagePack
The Filter


public Filter punctuationWordRejectFilter()
Returns a filter that accepts a String that is not a punctuation word, and rejects punctuation. If one can't tell for sure (as for ' in the Penn Treebank), it makes the best guess that it can.

Specified by:
punctuationWordRejectFilter in interface TreebankLanguagePack
The Filter


public Filter sentenceFinalPunctuationTagAcceptFilter()
Returns a filter that accepts a String that is a sentence end punctuation tag, and rejects everything else.

Specified by:
sentenceFinalPunctuationTagAcceptFilter in interface TreebankLanguagePack
The Filter


public Filter evalBIgnoredPunctuationTagAcceptFilter()
Returns a filter that accepts a String that is a punctuation tag that should be ignored by EVALB-style evaluation, and rejects everything else. Traditionally, EVALB has ignored a subset of the total set of punctuation tags in the English Penn Treebank (quotes and period, comma, colon, etc., but not brackets).

Specified by:
evalBIgnoredPunctuationTagAcceptFilter in interface TreebankLanguagePack
The Filter


public Filter evalBIgnoredPunctuationTagRejectFilter()
Returns a filter that accepts everything except a String that is a punctuation tag that should be ignored by EVALB-style evaluation. Traditionally, EVALB has ignored a subset of the total set of punctuation tags in the English Penn Treebank (quotes and period, comma, colon, etc., but not brackets).

Specified by:
evalBIgnoredPunctuationTagRejectFilter in interface TreebankLanguagePack
The Filter


public String getEncoding()
Return the input Charset encoding for the Treebank. See documentation for the Charset class.

Specified by:
getEncoding in interface TreebankLanguagePack
Name of Charset
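
A Charset name of the kind this method returns is typically passed to a Reader when opening treebank data. The sketch below shows one way to do that with standard JDK classes only. The class EncodingSketch and the method readFirstLine are hypothetical, and the encoding name is supplied by the caller because this page does not state the value of DEFAULT_ENCODING.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

final class EncodingSketch {
    // Reads the first line of a stream, decoding bytes with the named Charset.
    static String readFirstLine(InputStream in, String encodingName) {
        // Throws an unchecked exception if the name is illegal or unsupported.
        Charset charset = Charset.forName(encodingName);
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In real use the second argument would come from getEncoding() on a language pack rather than from a literal.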


public char[] labelAnnotationIntroducingCharacters()
Return an array of characters at which a String should be truncated to give the basic syntactic category of a label. The idea here is that Penn Treebank-style labels follow a syntactic category with various functional and cross-referencing information introduced by special characters (such as "NP-SBJ=1"). This would be truncated to "NP" by the array containing '-' and '='.

Specified by:
labelAnnotationIntroducingCharacters in interface TreebankLanguagePack
An array of characters that set off label name suffixes


public String basicCategory(String category)
Returns the basic syntactic category of a String. This implementation truncates the String at the first occurrence of one of the labelAnnotationIntroducingCharacters(). However, there are two special cases to deal with labelAnnotationIntroducingCharacters that appear in category labels: (i) if the first char is in this set, it is never truncated (e.g., '-' or '=' as a token), and (ii) if the label starts with one of this set, a second item of this set is also excluded (to deal with '-LRB-', '-RCB-', etc.).

Specified by:
basicCategory in interface TreebankLanguagePack
category - The whole String name of the label
The basic category of the String
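
The truncation rule and its two special cases can be sketched in a few lines. This is one reading of the description above, not the library's source: the class BasicCategorySketch is hypothetical, and the introducing characters '-' and '=' are assumed from the "NP-SBJ=1" example.

```java
final class BasicCategorySketch {
    // Assumed annotation-introducing characters; real language packs define their own.
    private static final char[] INTRODUCERS = {'-', '='};

    static boolean isIntroducer(char ch) {
        for (char c : INTRODUCERS) {
            if (c == ch) {
                return true;
            }
        }
        return false;
    }

    static String basicCategory(String category) {
        if (category == null) {
            return null;
        }
        boolean startsWithIntroducer = false;
        boolean skippedSecond = false;
        for (int i = 0; i < category.length(); i++) {
            if (!isIntroducer(category.charAt(i))) {
                continue;
            }
            if (i == 0) {
                // Case (i): an introducer in first position never truncates.
                startsWithIntroducer = true;
            } else if (startsWithIntroducer && !skippedSecond) {
                // Case (ii): after a leading introducer, one more is let through.
                skippedSecond = true;
            } else {
                // Ordinary case: cut the label at this introducer.
                return category.substring(0, i);
            }
        }
        return category;
    }
}
```

Under these assumptions, "NP-SBJ=1" becomes "NP", while "-LRB-" and a bare "-" token are returned unchanged.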


public Function getBasicCategoryFunction()
Returns a Function object that maps Strings to Strings according to this TreebankLanguagePack's basicCategory method.

Specified by:
getBasicCategoryFunction in interface TreebankLanguagePack
the String->String Function object


public String categoryAndFunction(String category)
Returns the syntactic category and 'function' of a String. This normally involves truncating numerical coindexation showing coreference, etc. By 'function', this means keeping, say, Penn Treebank functional tags or ICE phrasal functions, perhaps returning them as category-function.

This implementation strips numeric tags after label introducing characters (assuming that non-numeric things are functional tags).

Specified by:
categoryAndFunction in interface TreebankLanguagePack
category - The whole String name of the label
A String giving the category and function
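
The stripping of numeric tags can be sketched as below. The sketch rests on assumptions that this page does not confirm: the class CategoryAndFunctionSketch is hypothetical, the introducing characters are taken to be '-' and '=', and only numeric tags at the end of the label are removed, repeatedly, until none remains.

```java
import java.util.regex.Pattern;

final class CategoryAndFunctionSketch {
    // A numeric tag at the end of the label: one assumed introducer followed by digits.
    private static final Pattern TRAILING_NUMERIC_TAG = Pattern.compile("[-=][0-9]+$");

    static String categoryAndFunction(String category) {
        if (category == null) {
            return null;
        }
        String result = category;
        while (!result.isEmpty()) {
            String stripped = TRAILING_NUMERIC_TAG.matcher(result).replaceFirst("");
            if (stripped.equals(result)) {
                break; // no trailing numeric tag left
            }
            result = stripped;
        }
        return result;
    }
}
```

Under these assumptions, "NP-SBJ-1" and "NP-SBJ=1" both become "NP-SBJ": the functional tag SBJ survives because it is not numeric.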


public Function getCategoryAndFunctionFunction()
Returns a Function object that maps Strings to Strings according to this TreebankLanguagePack's categoryAndFunction method.

Specified by:
getCategoryAndFunctionFunction in interface TreebankLanguagePack
the String->String Function object


public boolean isLabelAnnotationIntroducingCharacter(char ch)
Say whether this character is an annotation introducing character.

Specified by:
isLabelAnnotationIntroducingCharacter in interface TreebankLanguagePack
ch - The character to check
Whether it is an annotation introducing character


public boolean isStartSymbol(String str)
Accepts a String that is a start symbol of the treebank.

Specified by:
isStartSymbol in interface TreebankLanguagePack
Whether this is a start symbol


public Filter startSymbolAcceptFilter()
Return a filter that accepts a String that is a start symbol of the treebank, and rejects everything else.

Specified by:
startSymbolAcceptFilter in interface TreebankLanguagePack
The filter


public abstract String[] startSymbols()
Returns a String array of treebank start symbols.

Specified by:
startSymbols in interface TreebankLanguagePack
The start symbols


public String startSymbol()
Returns a String which is the first (perhaps unique) start symbol of the treebank, or null if none is defined.

Specified by:
startSymbol in interface TreebankLanguagePack
The start symbol


public TokenizerFactory getTokenizerFactory()
Return a tokenizer factory which might be suitable for tokenizing text that will be used with this Treebank/Language pair, without tokenizing carriage returns (i.e., treating them as white space). The implementation in AbstractTreebankLanguagePack returns a factory for WhitespaceTokenizer.

Specified by:
getTokenizerFactory in interface TreebankLanguagePack
A tokenizer factory
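
The behavior described here, splitting on white space and treating carriage returns and newlines as ordinary white space rather than as tokens, can be approximated with plain JDK code. The class WhitespaceTokenizeSketch below is hypothetical and is not the library's WhitespaceTokenizer; it only illustrates the idea.

```java
import java.util.ArrayList;
import java.util.List;

final class WhitespaceTokenizeSketch {
    // Splits on runs of white space; line breaks count as white space, not as tokens.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }
}
```

For the input "The cat\r\nsat ." this sketch yields the four tokens The, cat, sat and the final period, with no token for the line break.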


public GrammaticalStructureFactory grammaticalStructureFactory()
Return a GrammaticalStructureFactory suitable for this language/treebank. (To be overridden in subclasses.)

Specified by:
grammaticalStructureFactory in interface TreebankLanguagePack
A GrammaticalStructureFactory suitable for this language/treebank

Stanford NLP Group