edu.stanford.nlp.parser.lexparser
Class ChineseLexicon

java.lang.Object
  extended by edu.stanford.nlp.parser.lexparser.BaseLexicon
      extended by edu.stanford.nlp.parser.lexparser.ChineseLexicon
All Implemented Interfaces:
Lexicon, Serializable

public class ChineseLexicon
extends BaseLexicon

A lexicon class for Chinese. Extends BaseLexicon, overriding its score and train methods to incorporate a ChineseUnknownWordModel.

Author:
Roger Levy
See Also:
Serialized Form

Field Summary
static boolean useCharBasedUnknownWordModel
           
static boolean useGoodTuringUnknownWordModel
           
 
Fields inherited from class edu.stanford.nlp.parser.lexparser.BaseLexicon
lastSentencePosition, lastSignatureIndex, lastWordToSignaturize, nullTag, nullWord, rulesWithWord, seenCounter, smartMutation, smoothInUnknownsThreshold, tags, unknownLevel, unSeenCounter, words
 
Fields inherited from interface edu.stanford.nlp.parser.lexparser.Lexicon
BOUNDARY, BOUNDARY_TAG, UNKNOWN_WORD
 
Constructor Summary
ChineseLexicon(Options.LexOptions op)
           
 
Method Summary
 double score(IntTaggedWord iTW, int loc)
          Get the score of this word with this tag (as an IntTaggedWord) at this loc.
 void train(Collection trees)
          Trains this lexicon on the Collection of trees.
 
Methods inherited from class edu.stanford.nlp.parser.lexparser.BaseLexicon
addTagging, evaluateCoverage, getSignature, getSignatureIndex, initRulesWithWord, isKnown, isKnown, printLexStats, readData, ruleIteratorByWord, train, treeToEvents, tune, writeData
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

useCharBasedUnknownWordModel

public static boolean useCharBasedUnknownWordModel

useGoodTuringUnknownWordModel

public static boolean useGoodTuringUnknownWordModel
Constructor Detail

ChineseLexicon

public ChineseLexicon(Options.LexOptions op)
Method Detail

train

public void train(Collection trees)
Description copied from class: BaseLexicon
Trains this lexicon on the Collection of trees.

Specified by:
train in interface Lexicon
Overrides:
train in class BaseLexicon

score

public double score(IntTaggedWord iTW,
                    int loc)
Description copied from class: BaseLexicon
Get the score of this word with this tag (as an IntTaggedWord) at this loc. (Presumably an estimate of P(word | tag).)

Implementation documentation:

Seen:
  c_W = count(W)
  c_TW = count(T,W)
  c_T = count(T)
  c_Tunseen = count(T) among new words in 2nd half
  total = count(seen words)
  totalUnseen = count("unseen" words)
  p_T_U = Pmle(T|"unseen")
  pb_T_W = P(T|W):
    If (c_W > smoothInUnknownsThreshold): pb_T_W = c_TW/c_W
    Else (if not smart mutation): pb_T_W = bayes prior smooth[1] with p_T_U
  p_T = Pmle(T)
  p_W = Pmle(W)
  pb_W_T = log(pb_T_W * p_W / p_T) [Bayes rule]
  Note that this doesn't really properly reserve mass to unknowns.

Unseen:
  c_TS = count(T,Sig|Unseen)
  c_S = count(Sig)
  c_T = count(T|Unseen)
  c_U = totalUnseen above
  p_T_U = Pmle(T|Unseen)
  pb_T_S = Bayes smooth of Pmle(T|S) with P(T|Unseen) [smooth[0]]
  pb_W_T = log(P(W|T)) inverted
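
The seen-word branch described above can be sketched as follows. This is a minimal illustration, not the actual BaseLexicon or ChineseLexicon source: the class name SeenWordScoreSketch, the method scoreSeen, and its parameter names are hypothetical, and the additive form used for the "bayes prior smooth" step is an assumption, since the documentation does not spell out that formula. The smart-mutation path is omitted.

```java
// Hypothetical, standalone sketch of the "Seen" scoring steps.
// It does not use or reproduce the Stanford parser classes.
class SeenWordScoreSketch {

    /**
     * Returns log P(W|T) for a word seen in training, via Bayes rule.
     *
     * cTW         count(T,W)
     * cW          count(W)
     * cT          count(T)
     * cTunseen    count(T) among "unseen" words
     * total       count of seen word tokens
     * totalUnseen count of "unseen" word tokens
     * threshold   stands in for smoothInUnknownsThreshold
     * smooth1     stands in for smooth[1] (prior strength; assumed additive)
     */
    static double scoreSeen(double cTW, double cW, double cT, double cTunseen,
                            double total, double totalUnseen,
                            double threshold, double smooth1) {
        // p_T_U = Pmle(T | "unseen")
        double pTU = cTunseen / totalUnseen;

        double pbTW;
        if (cW > threshold) {
            // Frequent word: plain maximum-likelihood estimate of P(T|W).
            pbTW = cTW / cW;
        } else {
            // Rare word: shrink the estimate toward the unseen-word tag
            // distribution. ASSUMPTION: additive prior smoothing.
            pbTW = (cTW + smooth1 * pTU) / (cW + smooth1);
        }

        double pT = cT / total; // Pmle(T)
        double pW = cW / total; // Pmle(W)

        // Bayes rule: P(W|T) = P(T|W) * P(W) / P(T), returned as a log.
        return Math.log(pbTW * pW / pT);
    }

    public static void main(String[] args) {
        // Made-up counts, for illustration only.
        double frequent = scoreSeen(8, 10, 100, 5, 1000, 50, 3, 1.0);
        double rare = scoreSeen(1, 2, 100, 5, 1000, 50, 3, 1.0);
        System.out.println("frequent word: " + frequent);
        System.out.println("rare word:     " + rare);
    }
}
```

With a threshold of 3, the first call takes the unsmoothed path (c_W = 10) and the second takes the smoothed path (c_W = 2). Both return a log probability, so the values are negative.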

Specified by:
score in interface Lexicon
Overrides:
score in class BaseLexicon
Parameters:
iTW - An IntTaggedWord pairing a word and POS tag
loc - The position in the sentence. In the default implementation, this is used only for unknown words, to change their probability distribution when they are sentence-initial.
Returns:
A double-valued score, usually log P(word|tag)


Stanford NLP Group