edu.stanford.nlp.parser.lexparser
Class BaseLexicon

java.lang.Object
  extended by edu.stanford.nlp.parser.lexparser.BaseLexicon
All Implemented Interfaces:
Lexicon, Serializable
Direct Known Subclasses:
ChineseLexicon

public class BaseLexicon
extends Object
implements Lexicon

This is the default concrete instantiation of the Lexicon interface. It was originally built for Penn Treebank English.

Author:
Dan Klein, Galen Andrew, Christopher Manning
See Also:
Serialized Form

Field Summary
protected  int lastSentencePosition
           
protected  int lastSignatureIndex
          We cache the last signature looked up, because the parser asks for the same one many times when an unknown word is encountered. (Note that under the current scheme, one unknown word, if seen both sentence-initially and non-initially, will be parsed with two different signatures.)
protected  int lastWordToSignaturize
           
protected static short nullTag
           
protected static int nullWord
           
protected  List[] rulesWithWord
           
protected  Counter<IntTaggedWord> seenCounter
           
protected  boolean smartMutation
           
protected  int smoothInUnknownsThreshold
           
protected  Set<IntTaggedWord> tags
           
protected  int unknownLevel
          What type of equivalence classing is done in getSignature
protected  Counter<IntTaggedWord> unSeenCounter
          Has counts for taggings in terms of unseen signatures.
protected  Set<IntTaggedWord> words
           
 
Fields inherited from interface edu.stanford.nlp.parser.lexparser.Lexicon
BOUNDARY, BOUNDARY_TAG, UNKNOWN_WORD
 
Constructor Summary
BaseLexicon()
           
BaseLexicon(Options.LexOptions op)
           
 
Method Summary
protected  void addTagging(boolean seen, IntTaggedWord itw, double count)
          Adds the tagging with count to the data structures in this Lexicon.
 double evaluateCoverage(Collection<Tree> trees, Set missingWords, Set missingTags, Set<IntTaggedWord> missingTW)
          Evaluates how many words (= terminals) in a collection of trees are covered by the lexicon.
protected  String getSignature(String word, int loc)
          This routine returns a String that is the "signature" of the class of a word.
protected  int getSignatureIndex(int wordIndex, int sentencePosition)
          Returns the index of the signature of the word numbered wordIndex, where the signature is the String representation of unknown word features.
protected  void initRulesWithWord()
           
 boolean isKnown(int word)
          Checks whether a word is in the lexicon.
 boolean isKnown(String word)
          Checks whether a word is in the lexicon.
 void printLexStats()
          Print some statistics about this lexicon.
 void readData(BufferedReader in)
          Populates data in this Lexicon from the character stream given by the BufferedReader in.
 Iterator ruleIteratorByWord(int word, int loc)
          Returns the possible POS taggings for a word.
 double score(IntTaggedWord iTW, int loc)
          Get the score of this word with this tag (as an IntTaggedWord) at this loc.
 void train(Collection trees)
          Trains this lexicon on the Collection of trees.
 void train(Collection trees, double weight)
          Trains this lexicon on the Collection of trees.
protected  List<IntTaggedWord> treeToEvents(Tree tree)
           
 void tune(Collection<Tree> trees)
           
 void writeData(Writer w)
          Writes out data from this Object to the Writer w.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

unknownLevel

protected int unknownLevel
What type of equivalence classing is done in getSignature


smoothInUnknownsThreshold

protected int smoothInUnknownsThreshold

smartMutation

protected boolean smartMutation

rulesWithWord

protected transient List[] rulesWithWord

tags

protected transient Set<IntTaggedWord> tags

words

protected transient Set<IntTaggedWord> words

seenCounter

protected Counter<IntTaggedWord> seenCounter

unSeenCounter

protected Counter<IntTaggedWord> unSeenCounter
Has counts for taggings in terms of unseen signatures. The IntTagWords are for (tag,sig), (tag,null), (null,sig), (null,null). (None for basic UNK if there are signatures.)


nullWord

protected static final int nullWord
See Also:
Constant Field Values

nullTag

protected static final short nullTag
See Also:
Constant Field Values

lastSignatureIndex

protected transient int lastSignatureIndex
We cache the last signature looked up, because the parser asks for the same one many times when an unknown word is encountered. (Note that under the current scheme, one unknown word, if seen both sentence-initially and non-initially, will be parsed with two different signatures.)


lastSentencePosition

protected transient int lastSentencePosition

lastWordToSignaturize

protected transient int lastWordToSignaturize
Constructor Detail

BaseLexicon

public BaseLexicon()

BaseLexicon

public BaseLexicon(Options.LexOptions op)
Method Detail

isKnown

public boolean isKnown(int word)
Checks whether a word is in the lexicon. This version will compile the lexicon into the rulesWithWord array if that hasn't already happened.

Specified by:
isKnown in interface Lexicon
Parameters:
word - The word as an int index to a Numberer
Returns:
Whether the word is in the lexicon

isKnown

public boolean isKnown(String word)
Checks whether a word is in the lexicon. This version works even while the lexicon is still being compiled, because it consults the current counters rather than the compiled rulesWithWord array.

Specified by:
isKnown in interface Lexicon
Parameters:
word - The word as a String
Returns:
Whether the word is in the lexicon

ruleIteratorByWord

public Iterator ruleIteratorByWord(int word,
                                   int loc)
Returns the possible POS taggings for a word.

Specified by:
ruleIteratorByWord in interface Lexicon
Parameters:
word - The word, represented as an integer in Numberer
loc - The position of the word in the sentence (counting from 0). Implementation note: The BaseLexicon class doesn't actually make use of this position information.
Returns:
An Iterator over a List of IntTaggedWords, which pair the word with possible taggings as integer pairs. (Each can be thought of as a tag -> word rule.)

initRulesWithWord

protected void initRulesWithWord()

treeToEvents

protected List<IntTaggedWord> treeToEvents(Tree tree)

train

public void train(Collection trees)
Trains this lexicon on the Collection of trees.

Specified by:
train in interface Lexicon

train

public void train(Collection trees,
                  double weight)
Trains this lexicon on the Collection of trees.


addTagging

protected void addTagging(boolean seen,
                          IntTaggedWord itw,
                          double count)
Adds the tagging with count to the data structures in this Lexicon.


getSignatureIndex

protected int getSignatureIndex(int wordIndex,
                                int sentencePosition)
Returns the index of the signature of the word numbered wordIndex, where the signature is the String representation of unknown word features. Caches the last signature index returned.
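
The caching described above can be illustrated with a minimal sketch. This is not the BaseLexicon source: the class name, the toy signature function, and the explicit word argument are assumptions made so the example is self-contained (the documented method takes only the two int arguments). It assumes the cache is keyed on both the word index and the sentence position, which is consistent with the note that one word can receive two signatures.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical single-entry cache for signature lookups; not the BaseLexicon source. */
final class LastSignatureCache {
    // Hypothetical stand-in for a Numberer: maps signature strings to indices.
    private final Map<String, Integer> signatureIndices = new HashMap<>();

    // The most recent lookup; -1 marks "nothing cached yet".
    private int lastWord = -1;
    private int lastPosition = -1;
    private int lastSignature = -1;

    // Counts how often the signature really had to be computed (demo only).
    int misses = 0;

    int getSignatureIndex(int wordIndex, int sentencePosition, String word) {
        if (wordIndex == lastWord && sentencePosition == lastPosition) {
            return lastSignature; // same query as last time: reuse the answer
        }
        misses++;
        String signature = toySignature(word, sentencePosition);
        Integer index = signatureIndices.get(signature);
        if (index == null) {
            index = signatureIndices.size();
            signatureIndices.put(signature, index);
        }
        lastWord = wordIndex;
        lastPosition = sentencePosition;
        lastSignature = index;
        return index;
    }

    // Toy signature (an assumption): only looks at capitalization and position.
    private static String toySignature(String word, int loc) {
        boolean capitalized = !word.isEmpty() && Character.isUpperCase(word.charAt(0));
        if (!capitalized) {
            return "UNK-LC";
        }
        return loc == 0 ? "UNK-INITC" : "UNK-CAPS";
    }

    public static void main(String[] args) {
        LastSignatureCache cache = new LastSignatureCache();
        cache.getSignatureIndex(7, 3, "Zorblat"); // miss
        cache.getSignatureIndex(7, 3, "Zorblat"); // hit
        cache.getSignatureIndex(7, 0, "Zorblat"); // miss: position changed
        System.out.println("signature computations: " + cache.misses); // prints 2
    }
}
```

A one-entry cache is enough here because the repeated requests for an unknown word arrive back to back; a change of word or position simply overwrites the entry.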


getSignature

protected String getSignature(String word,
                              int loc)
This routine returns a String that is the "signature" of the class of a word. For example, it might represent whether the word is a number or ends in -s. By convention, the strings returned match the pattern UNK-.*, which is assumed not to match any real word. Behavior depends on the unknownLevel (-uwm flag) passed in to the class. The recognized levels are 1-5: 5 is fairly English-specific; 4, 3, and 2 look for various word features (digits, dashes, etc.) which are only vaguely English-specific; 1 uses the last two characters combined with a simple classification by capitalization.

Parameters:
word - The word to make a signature for
loc - Its position in the sentence (mainly so sentence-initial capitalized words can be treated differently)
Returns:
A String that is its signature (equivalence class)
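
To make the idea of a signature concrete, here is a minimal sketch in the spirit of the simplest level (capitalization class plus the last two characters). It is an illustration only: the class name, the feature labels (-INITC, -CAPS, -LC, -NUM), and the exact rules are assumptions, not the strings BaseLexicon actually produces.

```java
import java.util.Locale;

/** Hypothetical signature function; the feature labels are illustrative assumptions. */
final class SignatureSketch {

    static String toySignature(String word, int loc) {
        StringBuilder sb = new StringBuilder("UNK");

        // Capitalization class; sentence-initial capitals are kept separate
        // because they say less about the word than mid-sentence capitals.
        if (!word.isEmpty() && Character.isUpperCase(word.charAt(0))) {
            sb.append(loc == 0 ? "-INITC" : "-CAPS");
        } else {
            sb.append("-LC");
        }

        // A coarse "contains a digit" feature.
        if (word.chars().anyMatch(Character::isDigit)) {
            sb.append("-NUM");
        }

        // Last two characters, lower-cased, for words longer than two characters.
        if (word.length() > 2) {
            sb.append('-')
              .append(word.substring(word.length() - 2).toLowerCase(Locale.ROOT));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toySignature("Walked", 0)); // prints UNK-INITC-ed
        System.out.println(toySignature("Walked", 3)); // prints UNK-CAPS-ed
        System.out.println(toySignature("x42", 1));    // prints UNK-LC-NUM-42
    }
}
```

Every result starts with UNK-, so it follows the convention described above, and the same word yields different signatures at position 0 and elsewhere.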

score

public double score(IntTaggedWord iTW,
                    int loc)
Get the score of this word with this tag (as an IntTaggedWord) at this loc. (Presumably an estimate of P(word | tag).)

Implementation documentation:

Seen:
  c_W = count(W)
  c_TW = count(T,W)
  c_T = count(T)
  c_Tunseen = count(T) among new words in 2nd half
  total = count(seen words)
  totalUnseen = count("unseen" words)
  p_T_U = Pmle(T|"unseen")
  pb_T_W = P(T|W).
    If (c_W > smoothInUnknownsThreshold), pb_T_W = c_TW/c_W
    Else (if not smart mutation) pb_T_W = bayes prior smooth[1] with p_T_U
  p_T = Pmle(T)
  p_W = Pmle(W)
  pb_W_T = log(pb_T_W * p_W / p_T) [Bayes rule]
  Note that this doesn't really properly reserve mass to unknowns.

Unseen:
  c_TS = count(T,Sig|Unseen)
  c_S = count(Sig)
  c_T = count(T|Unseen)
  c_U = totalUnseen above
  p_T_U = Pmle(T|Unseen)
  pb_T_S = Bayes smooth of Pmle(T|S) with P(T|Unseen) [smooth[0]]
  pb_W_T = log(P(W|T)) inverted

Specified by:
score in interface Lexicon
Parameters:
iTW - An IntTaggedWord pairing a word and POS tag
loc - The position in the sentence. In the default implementation this is used only for unknown words to change their probability distribution when sentence initial
Returns:
A double valued score, usually - log P(word|tag)
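
The seen-word case of the implementation notes can be worked through numerically. The sketch below is an assumption-laden illustration, not the BaseLexicon code: in particular, it assumes that "bayes prior smooth" means adding the prior p_T_U with a pseudo-count weight, and all names and numbers are hypothetical.

```java
/** Hypothetical worked example of the seen-word score; not the BaseLexicon source. */
final class SeenWordScoreSketch {

    // Assumed form of the Bayes-prior smoothing: the prior P(T|unseen)
    // enters with weight `smooth`, like `smooth` extra pseudo-observations.
    static double smoothedTagGivenWord(double cTW, double cW, double pTU, double smooth) {
        return (cTW + smooth * pTU) / (cW + smooth);
    }

    // Bayes rule: P(W|T) = P(T|W) * P(W) / P(T), returned as a log.
    static double logWordGivenTag(double pbTW, double pW, double pT) {
        return Math.log(pbTW * pW / pT);
    }

    static double score(double cTW, double cW, double cT, double total,
                        double pTU, double smooth, int smoothInUnknownsThreshold) {
        double pbTW = cW > smoothInUnknownsThreshold
                ? cTW / cW                                   // frequent word: plain MLE
                : smoothedTagGivenWord(cTW, cW, pTU, smooth); // rare word: smooth toward prior
        double pW = cW / total; // Pmle(W)
        double pT = cT / total; // Pmle(T)
        return logWordGivenTag(pbTW, pW, pT);
    }

    public static void main(String[] args) {
        // Made-up counts: the word occurs 10 times, 8 of them with tag T;
        // T occurs 100 times among 1000 seen tokens.
        double frequent = score(8, 10, 100, 1000, 0.5, 1.0, 0);
        // Same counts, but a threshold of 100 forces the smoothed branch.
        double rare = score(8, 10, 100, 1000, 0.5, 1.0, 100);
        System.out.println("MLE branch:      " + frequent);
        System.out.println("smoothed branch: " + rare);
    }
}
```

In the MLE branch the Bayes inversion reduces to log(c_TW / c_T), here log(8/100), which is a quick sanity check on the formula.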

tune

public void tune(Collection<Tree> trees)

readData

public void readData(BufferedReader in)
              throws IOException
Populates data in this Lexicon from the character stream given by the BufferedReader in.

Specified by:
readData in interface Lexicon
Parameters:
in - The BufferedReader to read from
Throws:
IOException

writeData

public void writeData(Writer w)
               throws IOException
Writes out data from this Object to the Writer w. Rules are separated by newline, and rule elements are delimited by \t.

Specified by:
writeData in interface Lexicon
Parameters:
w - The writer to output to
Throws:
IOException

printLexStats

public void printLexStats()
Print some statistics about this lexicon.


evaluateCoverage

public double evaluateCoverage(Collection<Tree> trees,
                               Set missingWords,
                               Set missingTags,
                               Set<IntTaggedWord> missingTW)
Evaluates how many words (= terminals) in a collection of trees are covered by the lexicon. The first argument is the collection of trees; the second through fourth arguments are sets that the method fills in with the results.
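
The calling convention, where the caller passes in empty sets that the method fills, can be sketched as follows. This is a simplified illustration under stated assumptions: it works on (word, tag) string pairs instead of Tree objects, the lexicon is a plain map, and the returned value is assumed to be the fraction of tokens whose word/tag pair is known. None of these names come from the library.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical coverage check with out-parameter sets; not the BaseLexicon source. */
final class CoverageSketch {

    /**
     * @param tokens  (word, tag) pairs, each as a two-element array
     * @param lexicon hypothetical lexicon: word -> tags seen with that word
     * @return fraction of tokens whose (word, tag) pair is in the lexicon
     */
    static double evaluateCoverage(List<String[]> tokens,
                                   Map<String, Set<String>> lexicon,
                                   Set<String> missingWords,
                                   Set<String> missingTags,
                                   Set<String> missingPairs) {
        // A tag counts as known if any word in the lexicon carries it.
        Set<String> knownTags = new HashSet<>();
        for (Set<String> tags : lexicon.values()) {
            knownTags.addAll(tags);
        }

        int covered = 0;
        for (String[] token : tokens) {
            String word = token[0];
            String tag = token[1];
            Set<String> tagsForWord = lexicon.get(word);
            if (tagsForWord == null) {
                missingWords.add(word);
            }
            if (!knownTags.contains(tag)) {
                missingTags.add(tag);
            }
            if (tagsForWord != null && tagsForWord.contains(tag)) {
                covered++;
            } else {
                missingPairs.add(word + "/" + tag);
            }
        }
        return tokens.isEmpty() ? 0.0 : (double) covered / tokens.size();
    }

    public static void main(String[] args) {
        Map<String, Set<String>> lexicon = Map.of("dog", Set.of("NN"), "runs", Set.of("VBZ"));
        List<String[]> tokens = List.of(
                new String[] {"dog", "NN"}, new String[] {"runs", "NNS"},
                new String[] {"cat", "NN"}, new String[] {"zip", "UH"});
        Set<String> words = new HashSet<>(), tags = new HashSet<>(), pairs = new HashSet<>();
        double coverage = evaluateCoverage(tokens, lexicon, words, tags, pairs);
        System.out.println(coverage); // prints 0.25
    }
}
```

Passing the result sets as arguments lets one call report a summary number and the detailed misses together, at the cost of requiring the caller to supply mutable sets.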



Stanford NLP Group