BaseLexicon (Stanford JavaNLP API)

edu.stanford.nlp.parser.lexparser
Class BaseLexicon

java.lang.Object
  edu.stanford.nlp.parser.lexparser.BaseLexicon

All Implemented Interfaces: Lexicon, Serializable

Direct Known Subclasses: ChineseLexicon

public class BaseLexicon
extends Object
implements Lexicon

This is the default concrete instantiation of the Lexicon interface. It was originally built for Penn Treebank English.

Author: Dan Klein, Galen Andrew, Christopher Manning
See Also: Serialized Form

Field Summary
`protected int`	`lastSentencePosition`
`protected int`	`lastSignatureIndex` Caches the last signature looked up, because the same signature is requested many times when an unknown word is encountered. (Note that under the current scheme, one unknown word, if seen both sentence-initially and non-initially, will be parsed with two different signatures.)
`protected int`	`lastWordToSignaturize`
`protected static short`	`nullTag`
`protected static int`	`nullWord`
`protected List[]`	`rulesWithWord`
`protected Counter<IntTaggedWord>`	`seenCounter`
`protected boolean`	`smartMutation`
`protected int`	`smoothInUnknownsThreshold`
`protected Set<IntTaggedWord>`	`tags`
`protected int`	`unknownLevel` The type of equivalence classing performed in getSignature.
`protected Counter<IntTaggedWord>`	`unSeenCounter` Has counts for taggings in terms of unseen signatures.
`protected Set<IntTaggedWord>`	`words`

Fields inherited from interface edu.stanford.nlp.parser.lexparser.Lexicon
`BOUNDARY, BOUNDARY_TAG, UNKNOWN_WORD`

Constructor Summary
`BaseLexicon()`
`BaseLexicon(Options.LexOptions op)`

Method Summary
`protected void`	`addTagging(boolean seen, IntTaggedWord itw, double count)` Adds the tagging with count to the data structures in this Lexicon.
`double`	`evaluateCoverage(Collection<Tree> trees, Set missingWords, Set missingTags, Set<IntTaggedWord> missingTW)` Evaluates how many words (= terminals) in a collection of trees are covered by the lexicon.
`protected String`	`getSignature(String word, int loc)` This routine returns a String that is the "signature" of the class of a word.
`protected int`	`getSignatureIndex(int wordIndex, int sentencePosition)` Returns the index of the signature of the word numbered wordIndex, where the signature is the String representation of unknown word features.
`protected void`	`initRulesWithWord()`
`boolean`	`isKnown(int word)` Checks whether a word is in the lexicon.
`boolean`	`isKnown(String word)` Checks whether a word is in the lexicon.
`void`	`printLexStats()` Print some statistics about this lexicon.
`void`	`readData(BufferedReader in)` Populates data in this Lexicon from the character stream given by the BufferedReader in.
`Iterator`	`ruleIteratorByWord(int word, int loc)` Returns the possible POS taggings for a word.
`double`	`score(IntTaggedWord iTW, int loc)` Get the score of this word with this tag (as an IntTaggedWord) at this loc.
`void`	`train(Collection trees)` Trains this lexicon on the Collection of trees.
`void`	`train(Collection trees, double weight)` Trains this lexicon on the Collection of trees.
`protected List<IntTaggedWord>`	`treeToEvents(Tree tree)`
`void`	`tune(Collection<Tree> trees)`
`void`	`writeData(Writer w)` Writes out data from this Object to the Writer w.

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

unknownLevel

protected int unknownLevel

The type of equivalence classing performed in getSignature.

smoothInUnknownsThreshold

protected int smoothInUnknownsThreshold

smartMutation

protected boolean smartMutation

rulesWithWord

protected transient List[] rulesWithWord

words

protected transient Set<IntTaggedWord> words

seenCounter

protected Counter<IntTaggedWord> seenCounter

unSeenCounter

protected Counter<IntTaggedWord> unSeenCounter

Has counts for taggings in terms of unseen signatures. The IntTagWords are for (tag,sig), (tag,null), (null,sig), (null,null). (None for basic UNK if there are signatures.)

nullWord

protected static final int nullWord

See Also: Constant Field Values

nullTag

protected static final short nullTag

See Also: Constant Field Values

lastSignatureIndex

protected transient int lastSignatureIndex

Caches the last signature looked up, because the same signature is requested many times when an unknown word is encountered. (Note that under the current scheme, one unknown word, if seen both sentence-initially and non-initially, will be parsed with two different signatures.)

lastSentencePosition

protected transient int lastSentencePosition

lastWordToSignaturize

protected transient int lastWordToSignaturize

Constructor Detail

BaseLexicon

public BaseLexicon()

BaseLexicon

public BaseLexicon(Options.LexOptions op)

Method Detail

isKnown

public boolean isKnown(int word)

Checks whether a word is in the lexicon. This version compiles the lexicon into the rulesWithWord array if that hasn't already happened.

Specified by: isKnown in interface Lexicon

Parameters: word - The word as an int index to a Numberer
Returns: Whether the word is in the lexicon

isKnown

public boolean isKnown(String word)

Checks whether a word is in the lexicon. This version works even while the lexicon is still being compiled, because it consults the current counters rather than the compiled rulesWithWord array.

Specified by: isKnown in interface Lexicon

Parameters: word - The word as a String
Returns: Whether the word is in the lexicon

ruleIteratorByWord

public Iterator ruleIteratorByWord(int word,
                                   int loc)

Returns the possible POS taggings for a word.

Specified by: ruleIteratorByWord in interface Lexicon

Parameters: word - The word, represented as an integer in Numberer; loc - The position of the word in the sentence (counting from 0). Implementation note: The BaseLexicon class doesn't actually make use of this position information.
Returns: An Iterator over a List of IntTaggedWords, which pair the word with possible taggings as integer pairs. (Each can be thought of as a tag -> word rule.)





initRulesWithWord
protected void initRulesWithWord()











treeToEvents
protected List<IntTaggedWord> treeToEvents(Tree tree)











train
public void train(Collection trees)

Trains this lexicon on the Collection of trees.


Specified by:
train in interface Lexicon








train
public void train(Collection trees,
                  double weight)

Trains this lexicon on the Collection of trees.











addTagging
protected void addTagging(boolean seen,
                          IntTaggedWord itw,
                          double count)

Adds the tagging with count to the data structures in this Lexicon.











getSignatureIndex
protected int getSignatureIndex(int wordIndex,
                                int sentencePosition)

Returns the index of the signature of the word numbered wordIndex,
 where the signature is the String representation of unknown word
 features.  Caches the last signature index returned.
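
The caching described above can be illustrated with a minimal single-entry cache. This is only a sketch, not the BaseLexicon source: the class name SignatureIndexCache, the computeCalls counter, and the injected compute function are hypothetical, and it assumes the cache is keyed on both the word index and the sentence position (since the same word may receive a different signature sentence-initially).

```java
import java.util.function.IntBinaryOperator;

/** Hypothetical single-entry cache; a sketch, not the actual BaseLexicon code. */
final class SignatureIndexCache {
    // Field names mirror the documented cache fields; -1 means "nothing cached yet".
    private int lastWordToSignaturize = -1;
    private int lastSentencePosition = -1;
    private int lastSignatureIndex = -1;

    private int computeCalls = 0;            // counts cache misses, for illustration
    private final IntBinaryOperator compute; // assumed stand-in for the real signature lookup

    SignatureIndexCache(IntBinaryOperator compute) {
        this.compute = compute;
    }

    int getSignatureIndex(int wordIndex, int sentencePosition) {
        // Cache hit: same word at the same position as the previous call.
        if (wordIndex == lastWordToSignaturize && sentencePosition == lastSentencePosition) {
            return lastSignatureIndex;
        }
        // Cache miss: recompute and remember the key and the result.
        computeCalls++;
        lastSignatureIndex = compute.applyAsInt(wordIndex, sentencePosition);
        lastWordToSignaturize = wordIndex;
        lastSentencePosition = sentencePosition;
        return lastSignatureIndex;
    }

    int computeCalls() {
        return computeCalls;
    }
}
```

Repeated calls with the same (wordIndex, sentencePosition) pair skip the recomputation; any change to either argument replaces the cached entry.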











getSignature
protected String getSignature(String word,
                              int loc)

This routine returns a String that is the "signature" of the class of a
 word.
 For example, it might represent whether it is a number or ends in -s.
 The strings returned by convention match the pattern UNK-.* , which
 is just assumed not to match any real word.
 Behavior depends on the unknownLevel (-uwm flag) passed in to the class.
 The recognized numbers are 1-5: 5 is fairly English-specific; 4, 3, and 2
 look for various word features (digits, dashes, etc.) which are only
 vaguely English-specific; 1 uses the last two characters combined with 
 a simple classification by capitalization.





Parameters:
word - The word to make a signature for
loc - Its position in the sentence (mainly so sentence-initial
             capitalized words can be treated differently)
Returns:
A String that is its signature (equivalence class)
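
As a rough illustration of the level-1 idea (last two characters combined with a simple capitalization class), the following sketch builds a string matching UNK-.* from a word and its position. The class name SignatureSketch and the labels INITC, CAP, LC, and OTHER are invented for this example; the strings BaseLexicon actually produces may differ.

```java
import java.util.Locale;

/** Hypothetical level-1-style signature; labels and layout are assumptions. */
final class SignatureSketch {
    static String signature(String word, int loc) {
        // Simple capitalization class based on the first character.
        String capClass = "OTHER";
        if (!word.isEmpty()) {
            char first = word.charAt(0);
            if (Character.isUpperCase(first)) {
                // Sentence-initial capitals are weaker evidence of a proper noun.
                capClass = (loc == 0) ? "INITC" : "CAP";
            } else if (Character.isLowerCase(first)) {
                capClass = "LC";
            }
        }
        // Last two characters (or the whole word if it is shorter), lowercased.
        String suffix = word.length() > 2 ? word.substring(word.length() - 2) : word;
        return "UNK-" + capClass + "-" + suffix.toLowerCase(Locale.ROOT);
    }
}
```

Under these assumptions, the same unknown word receives different signatures at position 0 and elsewhere, which is the effect the loc parameter exists to support.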





score
public double score(IntTaggedWord iTW,
                    int loc)

Get the score of this word with this tag (as an IntTaggedWord) at this 
 loc.
 (Presumably an estimate of P(word | tag).)
 
 Implementation documentation:  Seen:
 c_W = count(W)    c_TW = count(T,W)
 c_T = count(T)    c_Tunseen = count(T) among new words in 2nd half
 total = count(seen words)    totalUnseen = count("unseen" words)
 p_T_U = Pmle(T|"unseen") 
 pb_T_W = P(T|W).  If c_W > smoothInUnknownsThreshold, pb_T_W = c_TW/c_W;
 otherwise (if not smart mutation) pb_T_W is a Bayes prior smooth [smooth[1]] with p_T_U
 p_T= Pmle(T)    p_W = Pmle(W)
 pb_W_T = log(pb_T_W * p_W / p_T)    [Bayes rule]
 Note that this doesn't really properly reserve mass to unknowns.

 Unseen: 
 c_TS = count(T,Sig|Unseen)    c_S = count(Sig)    c_T = count(T|Unseen)
 c_U = totalUnseen above
 p_T_U = Pmle(T|Unseen)
 pb_T_S = Bayes smooth of Pmle(T|S) with P(T|Unseen)   [smooth[0]]
 pb_W_T = log(P(W|T)) inverted


Specified by:
score in interface Lexicon


Parameters:
iTW - An IntTaggedWord pairing a word and POS tag
loc - The position in the sentence.  In the default implementation
               this is used only for unknown words to change their
               probability distribution when sentence initial
Returns:
A double valued score, usually - log P(word|tag)
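
The seen-word case in the implementation notes can be sketched as a small standalone function. This is an illustration under stated assumptions, not the BaseLexicon code: the class and parameter names are hypothetical, smart mutation is ignored, and the "Bayes prior smooth" is assumed to take the common form (c_TW + smooth * p_T_U) / (c_W + smooth), which the documentation above does not spell out.

```java
/** Hypothetical sketch of the seen-word score; the smoothing form is an assumption. */
final class SeenWordScoreSketch {
    /**
     * @param cTW    count(T,W)
     * @param cW     count(W)
     * @param cT     count(T)
     * @param total  count of seen words
     * @param pTU    Pmle(T | "unseen")
     * @param smooth assumed weight of the prior (plays the role of smooth[1])
     * @param smoothInUnknownsThreshold words with c_W at or below this are smoothed
     * @return log(pb_T_W * p_W / p_T), i.e. an estimate of log P(W|T) via Bayes rule
     */
    static double scoreSeen(double cTW, double cW, double cT, double total,
                            double pTU, double smooth, int smoothInUnknownsThreshold) {
        double pbTW;
        if (cW > smoothInUnknownsThreshold) {
            // Frequent word: plain maximum-likelihood estimate of P(T|W).
            pbTW = cTW / cW;
        } else {
            // Rare word: shrink P(T|W) toward the unseen-word tag distribution.
            pbTW = (cTW + smooth * pTU) / (cW + smooth);
        }
        double pT = cT / total; // Pmle(T)
        double pW = cW / total; // Pmle(W)
        return Math.log(pbTW * pW / pT); // Bayes rule, in log space
    }
}
```

For example, with c_TW = 3, c_W = 4, c_T = 10, and total = 100, a threshold below 4 gives log(0.75 * 0.04 / 0.1); raising the threshold above 4 switches to the smoothed estimate.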






tune
public void tune(Collection<Tree> trees)











readData
public void readData(BufferedReader in)
              throws IOException

Populates data in this Lexicon from the character stream
 given by the BufferedReader in.


Specified by:
readData in interface Lexicon


Parameters:
in - The BufferedReader to read from
Throws:
IOException





writeData
public void writeData(Writer w)
               throws IOException

Writes out data from this Object to the Writer w. Rules are separated by
 newline, and rule elements are delimited by \t.


Specified by:
writeData in interface Lexicon


Parameters:
w - The writer to output to
Throws:
IOException
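
The output layout described above (one rule per line, elements separated by \t) can be sketched as follows. The class name RuleWriterSketch and the choice of rule elements in the usage note (tag, word, count) are assumptions made for illustration; the documentation does not say which elements a rule contains or in what order.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.List;

/** Hypothetical writer for newline-separated, tab-delimited rules. */
final class RuleWriterSketch {
    /** Writes each rule on its own line, joining its elements with tabs. */
    static void writeRules(List<String[]> rules, Writer w) throws IOException {
        for (String[] rule : rules) {
            w.write(String.join("\t", rule));
            w.write("\n");
        }
        w.flush();
    }

    /** Convenience wrapper that returns the serialized rules as a String. */
    static String toText(List<String[]> rules) {
        StringWriter sw = new StringWriter();
        try {
            writeRules(rules, sw);
        } catch (IOException e) {
            // StringWriter does not throw in practice; rethrow unchecked to keep callers simple.
            throw new UncheckedIOException(e);
        }
        return sw.toString();
    }
}
```

Calling toText on two rules with the assumed elements {"NN", "dog", "12.0"} and {"VB", "run", "3.0"} yields two tab-delimited lines, each terminated by a newline.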





printLexStats
public void printLexStats()

Print some statistics about this lexicon.











evaluateCoverage
public double evaluateCoverage(Collection<Tree> trees,
                               Set missingWords,
                               Set missingTags,
                               Set<IntTaggedWord> missingTW)

Evaluates how many words (= terminals) in a collection of trees
 are covered by the lexicon. First arg is the collection of
 trees; second through fourth args get the results.
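
The calling convention (one input collection, three result sets filled in by the method) can be sketched with plain strings in place of Tree and IntTaggedWord. Everything here is an assumption for illustration: the class name CoverageSketch, the map-based lexicon, the "word/tag" string encoding, and the choice to define coverage as the fraction of tokens whose (word, tag) pair is in the lexicon.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical coverage check over (word, tag) tokens; not the BaseLexicon code. */
final class CoverageSketch {
    /**
     * @param tokens  terminals as {word, tag} pairs
     * @param lexicon assumed lexicon: word -> tags seen with that word
     * @param missingWords output: words absent from the lexicon
     * @param missingTags  output: tags absent from the lexicon
     * @param missingTW    output: "word/tag" pairs absent from the lexicon
     * @return fraction of tokens whose (word, tag) pair is covered (0.0 if no tokens)
     */
    static double coverage(List<String[]> tokens, Map<String, Set<String>> lexicon,
                           Set<String> missingWords, Set<String> missingTags,
                           Set<String> missingTW) {
        // Collect every tag the lexicon knows about, regardless of word.
        Set<String> knownTags = new HashSet<>();
        for (Set<String> tags : lexicon.values()) {
            knownTags.addAll(tags);
        }
        int covered = 0;
        for (String[] token : tokens) {
            String word = token[0];
            String tag = token[1];
            Set<String> tagsForWord = lexicon.get(word);
            if (tagsForWord == null) {
                missingWords.add(word);
            }
            if (!knownTags.contains(tag)) {
                missingTags.add(tag);
            }
            if (tagsForWord != null && tagsForWord.contains(tag)) {
                covered++;
            } else {
                missingTW.add(word + "/" + tag);
            }
        }
        return tokens.isEmpty() ? 0.0 : (double) covered / tokens.size();
    }
}
```

The three sets are passed in empty and inspected by the caller afterwards, mirroring how the second through fourth arguments receive the results.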




















  
Stanford NLP Group
