edu.stanford.nlp.parser.lexparser
Class ChineseTreebankParserParams

java.lang.Object
  extended by edu.stanford.nlp.parser.lexparser.AbstractTreebankParserParams
      extended by edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams
All Implemented Interfaces:
TreebankLangParserParams, Serializable

public class ChineseTreebankParserParams
extends AbstractTreebankParserParams

Parameter file for parsing the Penn Chinese Treebank. Includes category enrichments specific to the Penn Chinese Treebank.

Author:
Roger Levy, Christopher Manning, Galen Andrew
See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class edu.stanford.nlp.parser.lexparser.AbstractTreebankParserParams
AbstractTreebankParserParams.SubcategoryStripper
 
Field Summary
 boolean bikelHeadFinder
           
 boolean charTags
           
static boolean chineseSelectiveTagPA
           
static boolean chineseSplitDouHao
          Chinese: Split the dou hao (a punctuation mark separating members of a list) from other punctuation.
static boolean chineseSplitPunct
          Chinese: split Chinese punctuation several ways, along the lines of English punctuation plus another category for the dou hao.
static boolean chineseSplitPunctLR
          Chinese: split left from right parentheses and quotes (only if chineseSplitPunct is also true).
static boolean chineseSplitVP3
          Chinese: split VPs into VP-COMP, VP-CRD, VP-ADJ.
static boolean chineseVerySelectiveTagPA
           
 boolean discardFrags
           
static boolean gpaAD
          Grandparent annotate all AD.
static boolean markADgrandchildOfIP
          Chinese: mark ADs that are grandchild of IP.
static boolean markCC
          Mark phrases which are conjunctions.
static boolean markIPadjsubj
           
static boolean markIPconj
          Chinese: mark IPs that are conjuncts.
static boolean markIPsisDEC
          Chinese: mark IPs that are part of prenominal modifiers.
static boolean markIPsisterBA
          Chinese: mark IPs that are sister of BA.
static boolean markIPsisterVVorP
          Chinese: mark IPs that are sister of VV or P.
static boolean markModifiedNP
          Chinese: mark left-modified NPs (rightmost NPs with a left-side mod).
static boolean markMultiNtag
          Chinese: mark nominal tags that are part of multi-nominal rewrites.
static boolean markNPconj
          Chinese: mark NPs that are conjuncts.
static boolean markNPmodNP
          Chinese: mark NP modifiers of NPs.
static boolean markPostverbalP
          Chinese: mark P with a left aunt VV, and PP with a left sister VV.
static boolean markPostverbalPP
           
static boolean markPsisterIP
          Chinese: mark Ps that are sister of IP.
static boolean markVPadjunct
          Chinese: mark phrases that are adjuncts of VP (these tend to be locatives/temporals, and have a specific distribution).
static boolean markVVsisterIP
          Chinese: mark VVs that are sister of IP (communication & small-clause-taking verbs).
static boolean mergeNNVV
          Chinese: merge NN and VV.
static boolean paRootDtr
          Chinese: parent annotate daughter of root.
 boolean segmentMarkov
           
 boolean segmentMaxMatch
           
static boolean splitBaseNP
          Mark base NPs.
static boolean splitNPTMP
          Whether to retain the -TMP functional tag on various phrasal categories.
static boolean splitPPTMP
           
static boolean splitXPTMP
           
 boolean sunJurafskyHeadFinder
           
static boolean tagWordSize
          Annotate tags for number of characters contained.
static boolean unaryCP
           
static boolean unaryIP
          Chinese: unary category marking
 boolean useCharacterBasedLexicon
           
 boolean useMaxentDepGrammar
           
 boolean useMaxentLexicon
           
 boolean useSimilarWordMap
           
 
Fields inherited from class edu.stanford.nlp.parser.lexparser.AbstractTreebankParserParams
inputEncoding, outputEncoding, tlp
 
Constructor Summary
ChineseTreebankParserParams()
           
 
Method Summary
 TreeTransformer collinizer()
          Returns a ChineseCollinizer
 TreeTransformer collinizerEvalb()
          Returns a ChineseCollinizer that doesn't delete punctuation
 List defaultTestSentence()
          Return a default sentence for the language (for testing)
 edu.stanford.nlp.parser.lexparser.Extractor dependencyGrammarExtractor(Options op)
           
 DiskTreebank diskTreebank()
          Uses a DiskTreebank with a CHTBTokenizer and a BobChrisTreeNormalizer.
 void display()
          display language-specific settings
 HeadFinder headFinder()
          Returns a ChineseHeadFinder
 Lexicon lex(Options.LexOptions op)
          Returns a ChineseLexicon
static void main(String[] args)
          For testing: loads a treebank and prints the trees.
 MemoryTreebank memoryTreebank()
          Uses a MemoryTreebank with a CHTBTokenizer and a BobChrisTreeNormalizer
 double[] MLEDependencyGrammarSmoothingParams()
          Give the parameters for smoothing in the MLEDependencyGrammar.
 int setOptionFlag(String[] args, int i)
          Set language-specific options according to flags.
 String[] sisterSplitters()
          Returns the splitting strings used for selective splits.
 Tree transformTree(Tree t, Tree root)
          transformTree does all language-specific tree transformations.
 TreeReaderFactory treeReaderFactory()
          Returns a factory for reading in trees from the source you want.
 
Methods inherited from class edu.stanford.nlp.parser.lexparser.AbstractTreebankParserParams
dependencyObjectify, getInputEncoding, getOutputEncoding, lex, parsevalObjectify, parsevalObjectify, pw, pw, setInputEncoding, setOutputEncoding, subcategoryStripper, testMemoryTreebank, treebankLanguagePack, treeTokenizerFactory, typedDependencyClasser, typedDependencyObjectify, untypedDependencyObjectify
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

charTags

public boolean charTags

useCharacterBasedLexicon

public boolean useCharacterBasedLexicon

useMaxentLexicon

public boolean useMaxentLexicon

useMaxentDepGrammar

public boolean useMaxentDepGrammar

segmentMarkov

public boolean segmentMarkov

segmentMaxMatch

public boolean segmentMaxMatch

sunJurafskyHeadFinder

public boolean sunJurafskyHeadFinder

bikelHeadFinder

public boolean bikelHeadFinder

discardFrags

public boolean discardFrags

useSimilarWordMap

public boolean useSimilarWordMap

chineseSplitDouHao

public static boolean chineseSplitDouHao
Chinese: Split the dou hao (a punctuation mark separating members of a list) from other punctuation. Good, but already covered by chineseSplitPunct (below).


chineseSplitPunct

public static boolean chineseSplitPunct
Chinese: split Chinese punctuation several ways, along the lines of English punctuation plus another category for the dou hao. Good.


chineseSplitPunctLR

public static boolean chineseSplitPunctLR
Chinese: split left from right parentheses and quotes (only if chineseSplitPunct is also true). Only very marginal gains, but seems positive.


markVVsisterIP

public static boolean markVVsisterIP
Chinese: mark VVs that are sister of IP (communication & small-clause-taking verbs). Good: gives 0.5%.
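
The sister-marking idea can be illustrated on a toy tree. The following is a minimal sketch only, not the class's actual transformTree logic: the Node class, the method name, and the "-IP" suffix are all hypothetical choices made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SisterMarkSketch {

    /** Hypothetical minimal tree node; the real parser uses its own Tree class. */
    static final class Node {
        String label;
        final List<Node> children;

        Node(String label, Node... children) {
            this.label = label;
            this.children = new ArrayList<>(Arrays.asList(children));
        }

        @Override
        public String toString() {
            if (children.isEmpty()) {
                return label;
            }
            StringBuilder sb = new StringBuilder("(").append(label);
            for (Node c : children) {
                sb.append(' ').append(c);
            }
            return sb.append(')').toString();
        }
    }

    /** Relabels every VV that has an IP among its sisters (suffix "-IP" is an assumption). */
    static void markVVsisterIP(Node parent) {
        boolean hasIP = false;
        for (Node c : parent.children) {
            if (c.label.equals("IP")) {
                hasIP = true;
            }
        }
        for (Node c : parent.children) {
            if (hasIP && c.label.equals("VV")) {
                c.label = "VV-IP";
            }
            markVVsisterIP(c); // recurse into the subtree
        }
    }

    public static void main(String[] args) {
        // (VP (VV shuo) (IP (NP (PN ta)) (VP (VV lai)))), with romanized placeholder words
        Node tree = new Node("VP",
            new Node("VV", new Node("shuo")),
            new Node("IP",
                new Node("NP", new Node("PN", new Node("ta"))),
                new Node("VP", new Node("VV", new Node("lai")))));
        markVVsisterIP(tree);
        System.out.println(tree);
    }
}
```

Only the outer VV is relabeled here, because the inner VV has no IP sister.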


markPsisterIP

public static boolean markPsisterIP
Chinese: mark Ps that are sister of IP. Negative effect.


markIPsisterVVorP

public static boolean markIPsisterVVorP
Chinese: mark IPs that are sister of VV or P. These rarely have punctuation. Small positive effect.


markADgrandchildOfIP

public static boolean markADgrandchildOfIP
Chinese: mark ADs that are grandchild of IP.


gpaAD

public static boolean gpaAD
Grandparent annotate all AD. Seems slightly negative.


chineseVerySelectiveTagPA

public static boolean chineseVerySelectiveTagPA

chineseSelectiveTagPA

public static boolean chineseSelectiveTagPA

markIPsisterBA

public static boolean markIPsisterBA
Chinese: mark IPs that are sister of BA. These always have overt NP. Very slightly positive.


markVPadjunct

public static boolean markVPadjunct
Chinese: mark phrases that are adjuncts of VP (these tend to be locatives/temporals, and have a specific distribution). Necessary even with chineseSplitVP3 and parent annotation because parent annotation happens with unsplit parent categories. Slightly positive.


markNPmodNP

public static boolean markNPmodNP
Chinese: mark NP modifiers of NPs. Quite positive (0.5%).


markModifiedNP

public static boolean markModifiedNP
Chinese: mark left-modified NPs (rightmost NPs with a left-side mod). Slightly positive.


markNPconj

public static boolean markNPconj
Chinese: mark NPs that are conjuncts. Negative on small set.


markMultiNtag

public static boolean markMultiNtag
Chinese: mark nominal tags that are part of multi-nominal rewrites. Doesn't seem any good.


markIPsisDEC

public static boolean markIPsisDEC
Chinese: mark IPs that are part of prenominal modifiers. Negative.


markIPconj

public static boolean markIPconj
Chinese: mark IPs that are conjuncts, or those that have adjuncts or subjects.


markIPadjsubj

public static boolean markIPadjsubj

chineseSplitVP3

public static boolean chineseSplitVP3
Chinese: split VPs into VP-COMP, VP-CRD, VP-ADJ. Negative value.


mergeNNVV

public static boolean mergeNNVV
Chinese: merge NN and VV. A lark.


unaryIP

public static boolean unaryIP
Chinese: unary category marking


unaryCP

public static boolean unaryCP

paRootDtr

public static boolean paRootDtr
Chinese: parent annotate daughter of root. Meant only for selectivesplit=false.


markPostverbalP

public static boolean markPostverbalP
Chinese: mark P with a left aunt VV, and PP with a left sister VV. Note that it's necessary to mark both to thread the context-marking. Used to identify post-verbal Ps, which are rare.


markPostverbalPP

public static boolean markPostverbalPP

splitBaseNP

public static boolean splitBaseNP
Mark base NPs. Good.


tagWordSize

public static boolean tagWordSize
Annotate tags for number of characters contained.
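
As a rough illustration of what annotating a tag with word size could look like, here is a minimal sketch. The method name and the "TAG-N" output format are assumptions made for this example; the format the parser actually produces is not stated on this page.

```java
public class TagWordSizeSketch {

    /**
     * Appends the number of characters in the word to the tag.
     * Counts Unicode code points rather than UTF-16 units, so characters
     * outside the Basic Multilingual Plane count once.
     */
    static String annotate(String tag, String word) {
        int chars = word.codePointCount(0, word.length());
        return tag + "-" + chars; // the "-N" suffix format is hypothetical
    }

    public static void main(String[] args) {
        // "\u4e2d\u56fd" is the two-character word for "China"
        System.out.println(annotate("NR", "\u4e2d\u56fd"));
    }
}
```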


markCC

public static boolean markCC
Mark phrases which are conjunctions. Appears negative, even with 200K words training data.


splitNPTMP

public static boolean splitNPTMP
Whether to retain the -TMP functional tag on various phrasal categories. On 80K words training, minutely helpful; on 200K words, best option gives 0.6%. Doing splitNPTMP and splitPPTMP (but not splitXPTMP) is best.
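
The interaction of the three -TMP flags can be sketched on plain label strings. This is an illustrative sketch under stated assumptions, not the class's actual code: the normalize method is hypothetical, and it assumes that splitXPTMP keeps -TMP on every phrasal category while the other two flags apply to NP and PP only.

```java
public class TmpTagSketch {

    /**
     * Strips functional tags from a category label, keeping -TMP only
     * where one of the flags asks for it.
     */
    static String normalize(String label, boolean splitNPTMP,
                            boolean splitPPTMP, boolean splitXPTMP) {
        int dash = label.indexOf('-');
        if (dash <= 0) {
            return label; // no functional tag, or a label such as "-NONE-"
        }
        String base = label.substring(0, dash);
        boolean hasTmp = false;
        for (String tag : label.substring(dash + 1).split("-")) {
            if (tag.equals("TMP")) {
                hasTmp = true;
            }
        }
        if (!hasTmp) {
            return base;
        }
        boolean keep = splitXPTMP
            || (splitNPTMP && base.equals("NP"))
            || (splitPPTMP && base.equals("PP"));
        return keep ? base + "-TMP" : base;
    }

    public static void main(String[] args) {
        // The combination reported as best above: NP and PP, but not XP.
        System.out.println(normalize("NP-TMP", true, true, false));
        System.out.println(normalize("LCP-TMP", true, true, false));
    }
}
```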


splitPPTMP

public static boolean splitPPTMP

splitXPTMP

public static boolean splitXPTMP
Constructor Detail

ChineseTreebankParserParams

public ChineseTreebankParserParams()
Method Detail

headFinder

public HeadFinder headFinder()
Returns a ChineseHeadFinder

Specified by:
headFinder in interface TreebankLangParserParams
Specified by:
headFinder in class AbstractTreebankParserParams

lex

public Lexicon lex(Options.LexOptions op)
Returns a ChineseLexicon

Specified by:
lex in interface TreebankLangParserParams
Overrides:
lex in class AbstractTreebankParserParams

MLEDependencyGrammarSmoothingParams

public double[] MLEDependencyGrammarSmoothingParams()
Description copied from class: AbstractTreebankParserParams
Give the parameters for smoothing in the MLEDependencyGrammar. Defaults are the ones previously hard coded into MLEDependencyGrammar.

Specified by:
MLEDependencyGrammarSmoothingParams in interface TreebankLangParserParams
Overrides:
MLEDependencyGrammarSmoothingParams in class AbstractTreebankParserParams
Returns:
an array of doubles with smooth_aT_hTWd, smooth_aTW_hTWd, smooth_stop, and interp
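
Because the four values come back as a bare array, a caller may find it convenient to unpack them by position into named fields. The sketch below assumes the order listed above; the SmoothingParams holder class is hypothetical, and the numbers in main are placeholders, not the parser's actual defaults.

```java
public class SmoothingParamsSketch {

    /** Hypothetical named view over the four-element array described above. */
    static final class SmoothingParams {
        final double smoothATHTWd;   // smooth_aT_hTWd
        final double smoothATWHTWd;  // smooth_aTW_hTWd
        final double smoothStop;     // smooth_stop
        final double interp;         // interp

        SmoothingParams(double aT, double aTW, double stop, double interp) {
            this.smoothATHTWd = aT;
            this.smoothATWHTWd = aTW;
            this.smoothStop = stop;
            this.interp = interp;
        }
    }

    /** Unpacks the array by position, rejecting arrays of the wrong length. */
    static SmoothingParams unpack(double[] params) {
        if (params == null || params.length != 4) {
            throw new IllegalArgumentException("expected 4 smoothing parameters");
        }
        return new SmoothingParams(params[0], params[1], params[2], params[3]);
    }

    public static void main(String[] args) {
        double[] placeholder = {1.0, 2.0, 3.0, 0.5}; // placeholder values only
        SmoothingParams p = unpack(placeholder);
        System.out.println("interp = " + p.interp);
    }
}
```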

treeReaderFactory

public TreeReaderFactory treeReaderFactory()
Description copied from interface: TreebankLangParserParams
Returns a factory for reading in trees from the source you want. The returned factory (trf) is responsible for handling the character-set encoding of the input correctly and for properly normalizing trees.

Returns:
A factory that vends an appropriate TreeReader

diskTreebank

public DiskTreebank diskTreebank()
Uses a DiskTreebank with a CHTBTokenizer and a BobChrisTreeNormalizer.


memoryTreebank

public MemoryTreebank memoryTreebank()
Uses a MemoryTreebank with a CHTBTokenizer and a BobChrisTreeNormalizer

Specified by:
memoryTreebank in interface TreebankLangParserParams
Specified by:
memoryTreebank in class AbstractTreebankParserParams

collinizer

public TreeTransformer collinizer()
Returns a ChineseCollinizer

Specified by:
collinizer in interface TreebankLangParserParams
Specified by:
collinizer in class AbstractTreebankParserParams

collinizerEvalb

public TreeTransformer collinizerEvalb()
Returns a ChineseCollinizer that doesn't delete punctuation

Specified by:
collinizerEvalb in interface TreebankLangParserParams
Specified by:
collinizerEvalb in class AbstractTreebankParserParams

sisterSplitters

public String[] sisterSplitters()
Description copied from class: AbstractTreebankParserParams
Returns the splitting strings used for selective splits.

Specified by:
sisterSplitters in interface TreebankLangParserParams
Specified by:
sisterSplitters in class AbstractTreebankParserParams
Returns:
An array containing ancestor-annotated Strings: categories should be split according to these ancestor annotations.

transformTree

public Tree transformTree(Tree t,
                          Tree root)
transformTree does all language-specific tree transformations. Any parameterizations should be inside the specific TreebankLangParserParams class.

Specified by:
transformTree in interface TreebankLangParserParams
Specified by:
transformTree in class AbstractTreebankParserParams
Parameters:
t - The input tree (with non-language specific annotation already done, so you need to strip back to basic categories)
root - The root of the current tree (can be null for words)
Returns:
The fully annotated tree node (with daughters still as you want them in the final result)

display

public void display()
Description copied from class: AbstractTreebankParserParams
display language-specific settings

Specified by:
display in interface TreebankLangParserParams
Specified by:
display in class AbstractTreebankParserParams

setOptionFlag

public int setOptionFlag(String[] args,
                         int i)
Set language-specific options according to flags. This routine should process the option starting in args[i] (which might potentially be several arguments long if it takes arguments). It should return the index after the last index it consumed in processing. In particular, if it cannot process the current option, the return value should be i.

Specified by:
setOptionFlag in interface TreebankLangParserParams
Specified by:
setOptionFlag in class AbstractTreebankParserParams
Parameters:
args - Array of command line arguments
i - Index in command line arguments to try to process as an option
Returns:
The index of the item after arguments processed as part of this command line option.
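
The return-value contract above implies a simple driver loop on the caller's side: advance to the returned index, and treat an unchanged index as "option not recognized". The following is a minimal, self-contained sketch of that loop. The OptionHandler interface, the ToyHandler class, and the flags -verbose and -encoding are hypothetical stand-ins, not options of this class.

```java
import java.util.ArrayList;
import java.util.List;

public class OptionFlagLoopSketch {

    /** Hypothetical stand-in mirroring the setOptionFlag signature. */
    interface OptionHandler {
        int setOptionFlag(String[] args, int i);
    }

    /** Toy handler: one flag with no argument, one flag with one argument. */
    static class ToyHandler implements OptionHandler {
        boolean verbose = false;
        String encoding = "UTF-8";

        @Override
        public int setOptionFlag(String[] args, int i) {
            if (args[i].equals("-verbose")) {
                verbose = true;
                return i + 1;             // consumed one item
            } else if (args[i].equals("-encoding") && i + 1 < args.length) {
                encoding = args[i + 1];
                return i + 2;             // consumed the flag and its argument
            }
            return i;                     // cannot process: index unchanged
        }
    }

    /** Feeds every argument to the handler and collects the ones it rejects. */
    static List<String> consume(OptionHandler handler, String[] args) {
        List<String> unrecognized = new ArrayList<>();
        int i = 0;
        while (i < args.length) {
            int next = handler.setOptionFlag(args, i);
            if (next == i) {
                unrecognized.add(args[i]);
                i++;                      // skip it so the loop terminates
            } else {
                i = next;
            }
        }
        return unrecognized;
    }

    public static void main(String[] args) {
        ToyHandler handler = new ToyHandler();
        String[] demo = {"-verbose", "-bogus", "-encoding", "GB18030"};
        List<String> rest = consume(handler, demo);
        System.out.println(handler.verbose + " " + handler.encoding + " " + rest);
    }
}
```

Returning i for an unknown option lets a caller chain several handlers, trying each in turn before reporting the flag as unknown.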

dependencyGrammarExtractor

public edu.stanford.nlp.parser.lexparser.Extractor dependencyGrammarExtractor(Options op)
Specified by:
dependencyGrammarExtractor in interface TreebankLangParserParams
Overrides:
dependencyGrammarExtractor in class AbstractTreebankParserParams

defaultTestSentence

public List defaultTestSentence()
Return a default sentence for the language (for testing)


main

public static void main(String[] args)
For testing: loads a treebank and prints the trees.



Stanford NLP Group