java.lang.Object
edu.stanford.nlp.parser.lexparser.BaseLexicon
public class BaseLexicon
This is the default concrete instantiation of the Lexicon interface. It was originally built for Penn Treebank English.
Field Summary | |
---|---|
protected int |
lastSentencePosition
|
protected int |
lastSignatureIndex
Caches the last signature looked up, because the same signature is requested many times when an unknown word is encountered. (Note that under the current scheme, an unknown word that is seen both sentence-initially and non-initially will be parsed with two different signatures.) |
protected int |
lastWordToSignaturize
|
protected static short |
nullTag
|
protected static int |
nullWord
|
protected List[] |
rulesWithWord
|
protected Counter<IntTaggedWord> |
seenCounter
|
protected boolean |
smartMutation
|
protected int |
smoothInUnknownsThreshold
|
protected Set<IntTaggedWord> |
tags
|
protected int |
unknownLevel
The type of equivalence classing done in getSignature. |
protected Counter<IntTaggedWord> |
unSeenCounter
Has counts for taggings in terms of unseen signatures. |
protected Set<IntTaggedWord> |
words
|
Fields inherited from interface edu.stanford.nlp.parser.lexparser.Lexicon |
---|
BOUNDARY, BOUNDARY_TAG, UNKNOWN_WORD |
Constructor Summary | |
---|---|
BaseLexicon()
|
|
BaseLexicon(Options.LexOptions op)
|
Method Summary | |
---|---|
protected void |
addTagging(boolean seen,
IntTaggedWord itw,
double count)
Adds the tagging with count to the data structures in this Lexicon. |
double |
evaluateCoverage(Collection<Tree> trees,
Set missingWords,
Set missingTags,
Set<IntTaggedWord> missingTW)
Evaluates how many words (= terminals) in a collection of trees are covered by the lexicon. |
protected String |
getSignature(String word,
int loc)
This routine returns a String that is the "signature" of the class of a word. |
protected int |
getSignatureIndex(int wordIndex,
int sentencePosition)
Returns the index of the signature of the word numbered wordIndex, where the signature is the String representation of unknown word features. |
protected void |
initRulesWithWord()
|
boolean |
isKnown(int word)
Checks whether a word is in the lexicon. |
boolean |
isKnown(String word)
Checks whether a word is in the lexicon. |
void |
printLexStats()
Print some statistics about this lexicon. |
void |
readData(BufferedReader in)
Populates data in this Lexicon from the character stream given by the BufferedReader in. |
Iterator |
ruleIteratorByWord(int word,
int loc)
Returns the possible POS taggings for a word. |
double |
score(IntTaggedWord iTW,
int loc)
Get the score of this word with this tag (as an IntTaggedWord) at this loc. |
void |
train(Collection trees)
Trains this lexicon on the Collection of trees. |
void |
train(Collection trees,
double weight)
Trains this lexicon on the Collection of trees. |
protected List<IntTaggedWord> |
treeToEvents(Tree tree)
|
void |
tune(Collection<Tree> trees)
|
void |
writeData(Writer w)
Writes out data from this Object to the Writer w. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
protected int unknownLevel
protected int smoothInUnknownsThreshold
protected boolean smartMutation
protected transient List[] rulesWithWord
protected transient Set<IntTaggedWord> tags
protected transient Set<IntTaggedWord> words
protected Counter<IntTaggedWord> seenCounter
protected Counter<IntTaggedWord> unSeenCounter
protected static final int nullWord
protected static final short nullTag
protected transient int lastSignatureIndex
protected transient int lastSentencePosition
protected transient int lastWordToSignaturize
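The three transient fields lastSignatureIndex, lastSentencePosition and lastWordToSignaturize support the one-entry signature cache mentioned in the field summary. The following is a minimal sketch of such a cache, not BaseLexicon's actual code. It assumes the cache is keyed on both the word index and the sentence position, and the class name `LastSignatureCacheSketch` and the injected `computeSignatureIndex` function are hypothetical stand-ins for the real signature lookup.

```java
import java.util.function.IntBinaryOperator;

public class LastSignatureCacheSketch {
    // Cache state, named after the BaseLexicon fields; -1 means "nothing cached yet".
    private int lastWordToSignaturize = -1;
    private int lastSentencePosition = -1;
    private int lastSignatureIndex = -1;

    // Hypothetical stand-in for the expensive signature computation.
    private final IntBinaryOperator computeSignatureIndex;

    LastSignatureCacheSketch(IntBinaryOperator computeSignatureIndex) {
        this.computeSignatureIndex = computeSignatureIndex;
    }

    int getSignatureIndex(int wordIndex, int sentencePosition) {
        // Assumption: a hit requires the same word at the same position.
        if (wordIndex == lastWordToSignaturize && sentencePosition == lastSentencePosition) {
            return lastSignatureIndex;
        }
        // Miss: recompute and remember the result for the next call.
        lastSignatureIndex = computeSignatureIndex.applyAsInt(wordIndex, sentencePosition);
        lastWordToSignaturize = wordIndex;
        lastSentencePosition = sentencePosition;
        return lastSignatureIndex;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        LastSignatureCacheSketch cache =
            new LastSignatureCacheSketch((w, p) -> { calls[0]++; return w * 10 + p; });
        cache.getSignatureIndex(7, 0);
        cache.getSignatureIndex(7, 0); // served from the cache
        System.out.println("computations: " + calls[0]);
    }
}
```

A single-entry cache is enough when the same unknown word is queried repeatedly in a row, which is the access pattern the field summary describes.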
Constructor Detail |
---|
public BaseLexicon()
public BaseLexicon(Options.LexOptions op)
Method Detail |
---|
public boolean isKnown(int word)
isKnown
in interface Lexicon
word
- The word as an int index to a Numberer
public boolean isKnown(String word)
isKnown
in interface Lexicon
word
- The word as a String
public Iterator ruleIteratorByWord(int word, int loc)
ruleIteratorByWord
in interface Lexicon
word
- The word, represented as an integer in Numberer
loc
- The position of the word in the sentence (counting from 0). Implementation note: The BaseLexicon class doesn't actually make use of this position information.
tag -> word rule.)
protected void initRulesWithWord()
protected List<IntTaggedWord> treeToEvents(Tree tree)
public void train(Collection trees)
train
in interface Lexicon
public void train(Collection trees, double weight)
protected void addTagging(boolean seen, IntTaggedWord itw, double count)
protected int getSignatureIndex(int wordIndex, int sentencePosition)
protected String getSignature(String word, int loc)
word
- The word to make a signature for
loc
- Its position in the sentence (mainly so sentence-initial capitalized words can be treated differently)
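To illustrate what a position-sensitive word signature might look like, here is a small sketch. It is not the scheme BaseLexicon uses: the class name `WordSignatureSketch`, the feature labels (`UNK`, `-INITC`, `-CAPS`, `-NUM`, `-DASH`) and the choice of features are all assumptions made for this example. It only shows how a capitalized word can receive one signature at position 0 and another elsewhere.

```java
public class WordSignatureSketch {

    /** Builds a hypothetical signature string from surface features of the word. */
    static String signature(String word, int loc) {
        StringBuilder sb = new StringBuilder("UNK");
        // Capitalization is treated differently for the sentence-initial position.
        if (!word.isEmpty() && Character.isUpperCase(word.charAt(0))) {
            sb.append(loc == 0 ? "-INITC" : "-CAPS");
        }
        // Mark words that contain at least one digit.
        boolean hasDigit = false;
        for (int i = 0; i < word.length(); i++) {
            if (Character.isDigit(word.charAt(i))) {
                hasDigit = true;
                break;
            }
        }
        if (hasDigit) {
            sb.append("-NUM");
        }
        // Mark words that contain a hyphen.
        if (word.contains("-")) {
            sb.append("-DASH");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(signature("Zyx", 0)); // UNK-INITC
        System.out.println(signature("Zyx", 3)); // UNK-CAPS
    }
}
```

Under a scheme of this kind, the same unknown word maps to two different equivalence classes depending on where it appears in the sentence.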
public double score(IntTaggedWord iTW, int loc)
Implementation documentation:
Seen:
c_W = count(W)
c_TW = count(T,W)
c_T = count(T)
c_Tunseen = count(T) among new words in 2nd half
total = count(seen words)
totalUnseen = count("unseen" words)
p_T_U = Pmle(T|"unseen")
pb_T_W = P(T|W). If (c_W > smoothInUnknownsThreshold) = c_TW/c_W
Else (if not smart mutation) pb_T_W = bayes prior smooth[1] with p_T_U
p_T = Pmle(T)
p_W = Pmle(W)
pb_W_T = log(pb_T_W * p_W / p_T) [Bayes rule]
Note that this doesn't really properly reserve mass to unknowns.
Unseen:
c_TS = count(T,Sig|Unseen)
c_S = count(Sig)
c_T = count(T|Unseen)
c_U = totalUnseen above
p_T_U = Pmle(T|Unseen)
pb_T_S = Bayes smooth of Pmle(T|S) with P(T|Unseen) [smooth[0]]
pb_W_T = log(P(W|T)) inverted
score
in interface Lexicon
iTW
- An IntTaggedWord pairing a word and POS tag
loc
- The position in the sentence. In the default implementation this is used only for unknown words to change their probability distribution when sentence-initial.
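The seen-word branch of the implementation documentation for score can be sketched as follows. This is an illustration under stated assumptions, not the BaseLexicon source: counts are passed in as plain doubles, the class name `SeenWordScoreSketch` and parameter `smooth1` (standing in for smooth[1]) are hypothetical, and the smoothing formula (c_TW + smooth1 * p_T_U) / (c_W + smooth1) is one common reading of "bayes prior smooth", which the documentation does not spell out. The smart mutation case is not covered.

```java
public class SeenWordScoreSketch {

    /**
     * Returns log P(W|T) for a seen word, following the outline in the
     * score documentation. All parameter names are illustrative.
     */
    static double scoreSeen(double cW, double cTW, double cT,
                            double cTunseen, double total, double totalUnseen,
                            double smoothInUnknownsThreshold, double smooth1) {
        // p_T_U = Pmle(T | "unseen")
        double pTU = cTunseen / totalUnseen;

        double pbTW;
        if (cW > smoothInUnknownsThreshold) {
            // Frequent word: plain maximum-likelihood estimate of P(T|W).
            pbTW = cTW / cW;
        } else {
            // Rare word: assumed form of Bayesian-prior smoothing toward p_T_U.
            pbTW = (cTW + smooth1 * pTU) / (cW + smooth1);
        }

        double pT = cT / total; // p_T = Pmle(T)
        double pW = cW / total; // p_W = Pmle(W)

        // Bayes rule: P(W|T) = P(T|W) * P(W) / P(T), returned as a log probability.
        return Math.log(pbTW * pW / pT);
    }

    public static void main(String[] args) {
        // A word seen 100 times, 50 of them with the tag; the tag occurs 200 times in 1000 tokens.
        double s = scoreSeen(100, 50, 200, 5, 1000, 20, 10, 1.0);
        System.out.println(s); // log(0.5 * 0.1 / 0.2) = log(0.25)
    }
}
```

Working in log space keeps the result directly usable as an additive score, which matches the log in the documented pb_W_T.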
public void tune(Collection<Tree> trees)
public void readData(BufferedReader in) throws IOException
readData
in interface Lexicon
in
- The BufferedReader to read from
IOException
public void writeData(Writer w) throws IOException
writeData
in interface Lexicon
w
- The writer to output to
IOException
public void printLexStats()
public double evaluateCoverage(Collection<Tree> trees, Set missingWords, Set missingTags, Set<IntTaggedWord> missingTW)
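The idea behind evaluateCoverage, counting how many terminals are covered by the lexicon, can be sketched without the parser's Tree and IntTaggedWord types. The example below is a simplification under assumptions: sentences are flat lists of word strings rather than trees, the lexicon is a plain set of known words, tags are ignored, and the return value is assumed to be the covered fraction. The class name `CoverageSketch` and all parameter names are hypothetical.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CoverageSketch {

    /**
     * Returns the fraction of tokens found in knownWords and adds every
     * uncovered token to missingWords (an output parameter, mirroring the
     * missingWords argument of evaluateCoverage).
     */
    static double coverage(Collection<List<String>> sentences,
                           Set<String> knownWords,
                           Set<String> missingWords) {
        int total = 0;
        int covered = 0;
        for (List<String> sentence : sentences) {
            for (String word : sentence) {
                total++;
                if (knownWords.contains(word)) {
                    covered++;
                } else {
                    missingWords.add(word);
                }
            }
        }
        // Avoid dividing by zero on an empty collection.
        return total == 0 ? 0.0 : (double) covered / total;
    }

    public static void main(String[] args) {
        List<List<String>> sentences = Arrays.asList(
            Arrays.asList("the", "cat"),
            Arrays.asList("the", "zyx"));
        Set<String> known = new HashSet<>(Arrays.asList("the", "cat"));
        Set<String> missing = new HashSet<>();
        System.out.println(coverage(sentences, known, missing)); // 0.75
        System.out.println(missing); // [zyx]
    }
}
```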