java.lang.Object
  edu.stanford.nlp.parser.lexparser.LexicalizedParser
public class LexicalizedParser
This class provides the top-level API and command-line interface to a set of reasonably good treebank-trained parsers. The name reflects the main factored parsing model, which provides a lexicalized PCFG parser implemented as a product model of a plain PCFG parser and a lexicalized dependency parser. But you can also run either component parser alone. In particular, it is often useful to do unlexicalized PCFG parsing by using just that component parser.
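To make the "product model" idea concrete, here is a minimal sketch of how two component scores can be combined. It is not the parser's implementation: the `Candidate` class, its field names, and the idea of scoring a pre-enumerated list of analyses are all assumptions made for illustration. The only point shown is that multiplying two probabilities is the same as adding their log probabilities, so the factored score can rank analyses differently from the PCFG score alone.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names come from the parser.
final class ProductModelSketch {

    /** One candidate analysis with a log probability from each component model. */
    static final class Candidate {
        final String label;
        final double pcfgLogProb;
        final double depLogProb;

        Candidate(String label, double pcfgLogProb, double depLogProb) {
            this.label = label;
            this.pcfgLogProb = pcfgLogProb;
            this.depLogProb = depLogProb;
        }

        /** Product of the two probabilities, expressed as a sum of logs. */
        double factoredScore() {
            return pcfgLogProb + depLogProb;
        }
    }

    /** Returns the candidate with the highest factored score, or null for an empty list. */
    static Candidate best(List<Candidate> candidates) {
        Candidate best = null;
        for (Candidate c : candidates) {
            if (best == null || c.factoredScore() > best.factoredScore()) {
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // The PCFG score alone prefers A (-10 > -11); the combined score prefers B.
        List<Candidate> candidates = Arrays.asList(
                new Candidate("A", -10.0, -7.0),
                new Candidate("B", -11.0, -4.0));
        System.out.println(best(candidates).label); // prints "B"
    }
}
```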
See the package documentation for more details and examples of use. See the main method documentation for details of invoking the parser.
Note that training requires a lot of memory. Try running the JVM with -mx1500m.
Field Summary

Modifier and Type | Field | Description |
---|---|---|
protected edu.stanford.nlp.parser.lexparser.BiLexPCFGParser | bparser | The factored parser that combines the dependency and PCFG parsers. |
protected TreeTransformer | debinarizer | |
protected edu.stanford.nlp.parser.lexparser.ExhaustiveDependencyParser | dparser | The dependency parser. |
protected ExhaustivePCFGParser | pparser | The PCFG parser. |
Constructor Summary

Constructor | Description |
---|---|
LexicalizedParser() | Construct a new LexicalizedParser object from a previously serialized grammar read from a property edu.stanford.nlp.SerializedLexicalizedParser, or a default file location. |
LexicalizedParser(ObjectInputStream in) | Construct a new LexicalizedParser object from a previously assembled grammar read from an InputStream. |
LexicalizedParser(Options op) | Construct a new LexicalizedParser object from a previously serialized grammar read from a System property edu.stanford.nlp.SerializedLexicalizedParser, or a default file location (/u/nlp/data/lexparser/englishPCFG.ser.gz). |
LexicalizedParser(ParserData pd) | Construct a new LexicalizedParser object from a previously assembled grammar. |
LexicalizedParser(String parserFileOrUrl) | |
LexicalizedParser(String parserFileOrUrl, boolean isTextGrammar, Options op) | Construct a new LexicalizedParser. |
LexicalizedParser(String treebankPath, FileFilter filt, Options op) | |
LexicalizedParser(String parserFileOrUrl, Options op) | Construct a new LexicalizedParser. |
LexicalizedParser(Treebank trainTreebank, DiskTreebank secondaryTrainTreebank, double weight, GrammarCompactor compactor, Options op) | |
LexicalizedParser(Treebank trainTreebank, GrammarCompactor compactor, Options op) | Construct a new LexicalizedParser. |
LexicalizedParser(Treebank trainTreebank, GrammarCompactor compactor, Options op, Treebank tuneTreebank) | Construct a new LexicalizedParser. |
LexicalizedParser(Treebank trainTreebank, Options op) | |
Method Summary

Modifier and Type | Method | Description |
---|---|---|
Object | apply(Object in) | Converts a Sentence/List/String into a Tree. |
Tree | getBestDependencyParse() | |
Tree | getBestDependencyParse(boolean debinarize) | |
Tree | getBestParse() | Return the best parse of the sentence most recently parsed. |
Tree | getBestPCFGParse() | |
Tree | getBestPCFGParse(boolean stripSubcategories) | |
Lexicon | getLexicon() | |
Options | getOp() | |
static ParserData | getParserDataFromFile(String parserFileOrUrl, Options op) | |
protected static ParserData | getParserDataFromSerializedFile(String serializedFileOrUrl) | |
protected static ParserData | getParserDataFromTextFile(String textFileOrUrl, Options op) | |
protected ParserData | getParserDataFromTreebank(Treebank trainTreebank, DiskTreebank secondaryTrainTreebank, double weight, GrammarCompactor compactor) | A method for training from two different treebanks, the second of which is presumed to be orders of magnitude larger. |
protected ParserData | getParserDataFromTreebank(Treebank trainTreebank, GrammarCompactor compactor, Treebank tuneTreebank) | |
double | getPCFGScore() | |
double | getPCFGScore(String goalStr) | |
TreePrint | getTreePrint() | Return a TreePrint for formatting parsed output trees. |
static void | main(String[] args) | A main program for using the parser with various options. |
protected void | makeParsers() | |
boolean | parse(LatticeReader lr) | Parse a lattice with PCFG parser. |
boolean | parse(List sentence) | Parse a sentence represented as a List. |
boolean | parse(Sentence sentence) | Parse a Sentence. |
boolean | parse(Sentence sentence, String goal) | Parse a Sentence. |
boolean | parse(String sentence) | Tokenize and parse a sentence. |
ParserData | parserData() | |
void | setMaxLength(int maxLength) | Set the maximum length of a sentence that the parser will be willing to parse. |
void | setOptionFlags(String[] flags) | This will set options to the parser, in a way exactly equivalent to passing in the same sequence of command-line arguments. |
double | testOnTreebank(Treebank testTreebank) | Test the parser on a treebank. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
protected ExhaustivePCFGParser pparser
protected edu.stanford.nlp.parser.lexparser.ExhaustiveDependencyParser dparser
protected edu.stanford.nlp.parser.lexparser.BiLexPCFGParser bparser
protected TreeTransformer debinarizer
Constructor Detail |
---|
public LexicalizedParser()
Construct a new LexicalizedParser object from a previously serialized grammar read from a property edu.stanford.nlp.SerializedLexicalizedParser, or a default file location.

public LexicalizedParser(Options op)
Construct a new LexicalizedParser object from a previously serialized grammar read from a System property edu.stanford.nlp.SerializedLexicalizedParser, or a default file location (/u/nlp/data/lexparser/englishPCFG.ser.gz).
Parameters:
op - Options to the parser. These get overwritten by the Options read from the serialized parser; I think the only thing determined by them is the encoding of the grammar iff it is a text grammar

public LexicalizedParser(String parserFileOrUrl, Options op)
Construct a new LexicalizedParser.
Throws:
IllegalArgumentException - If parser data cannot be loaded

public LexicalizedParser(String parserFileOrUrl)

public LexicalizedParser(String parserFileOrUrl, boolean isTextGrammar, Options op)
Construct a new LexicalizedParser.
Throws:
IllegalArgumentException - If parser data cannot be loaded

public LexicalizedParser(ParserData pd)
Construct a new LexicalizedParser object from a previously assembled grammar.
Parameters:
pd - A ParserData object (not null)

public LexicalizedParser(ObjectInputStream in) throws Exception
Construct a new LexicalizedParser object from a previously assembled grammar read from an InputStream.
Parameters:
in - The ObjectInputStream
Throws:
Exception

public LexicalizedParser(Treebank trainTreebank, GrammarCompactor compactor, Options op)
Construct a new LexicalizedParser.
Parameters:
trainTreebank - a treebank to train from

public LexicalizedParser(String treebankPath, FileFilter filt, Options op)

public LexicalizedParser(Treebank trainTreebank, GrammarCompactor compactor, Options op, Treebank tuneTreebank)
Construct a new LexicalizedParser.
Parameters:
trainTreebank - a treebank to train from
tuneTreebank - a treebank to tune free params on (may be null)

public LexicalizedParser(Treebank trainTreebank, DiskTreebank secondaryTrainTreebank, double weight, GrammarCompactor compactor, Options op)

public LexicalizedParser(Treebank trainTreebank, Options op)
Method Detail |
---|
public Options getOp()

public Object apply(Object in)
Converts a Sentence/List/String into a Tree.
Specified by:
apply in interface Function
Parameters:
in - The input Sentence/List/String
Throws:
IllegalArgumentException - If argument isn't a List or String

public boolean parse(Sentence sentence)
Parse a Sentence.
Specified by:
parse in interface Parser
Parameters:
sentence - A Sentence to be parsed

public boolean parse(Sentence sentence, String goal)
Parse a Sentence.
Specified by:
parse in interface Parser
Parameters:
sentence - A Sentence to be parsed
goal - The category to parse the sentence as (e.g., NP, S)

public TreePrint getTreePrint()
Return a TreePrint for formatting parsed output trees.

public boolean parse(String sentence)
Tokenize and parse a sentence.
Parameters:
sentence - The sentence to tokenize and parse

public boolean parse(List sentence)
Parse a sentence represented as a List.
Parameters:
sentence - The sentence to parse
Throws:
UnsupportedOperationException - If the Sentence is too long or of zero length or the parse otherwise fails for resource reasons

public boolean parse(LatticeReader lr)
Parse a lattice with PCFG parser.
Parameters:
lr - a lattice to parse

public Tree getBestParse()
Return the best parse of the sentence most recently parsed.
Specified by:
getBestParse in interface ViterbiParser
Throws:
NoSuchElementException - If no previously successfully parsed sentence

public Tree getBestPCFGParse()

public Tree getBestPCFGParse(boolean stripSubcategories)

public double getPCFGScore()

public double getPCFGScore(String goalStr)

public Tree getBestDependencyParse()

public Tree getBestDependencyParse(boolean debinarize)

public void setMaxLength(int maxLength)
Set the maximum length of a sentence that the parser will be willing to parse.
Parameters:
maxLength - The maximum sentence length the parser will accept

public static ParserData getParserDataFromFile(String parserFileOrUrl, Options op)

public ParserData parserData()

public Lexicon getLexicon()

protected static ParserData getParserDataFromTextFile(String textFileOrUrl, Options op)

protected static ParserData getParserDataFromSerializedFile(String serializedFileOrUrl)

protected final ParserData getParserDataFromTreebank(Treebank trainTreebank, GrammarCompactor compactor, Treebank tuneTreebank)

protected final ParserData getParserDataFromTreebank(Treebank trainTreebank, DiskTreebank secondaryTrainTreebank, double weight, GrammarCompactor compactor)
A method for training from two different treebanks, the second of which is presumed to be orders of magnitude larger.

protected final void makeParsers()

public double testOnTreebank(Treebank testTreebank)
Test the parser on a treebank.
[...] Test.verbose is true.
Parameters:
testTreebank - The treebank to parse

public void setOptionFlags(String[] flags)
This will set options to the parser, in a way exactly equivalent to passing in the same sequence of command-line arguments.
[...] -tLPP) to use. The TreebankLangParserParams should be set up on construction of a LexicalizedParser, by constructing an Options that uses the required TreebankLangParserParams, and passing that to a LexicalizedParser constructor. Note that despite this method being an instance method, many flags are actually set as static class variables.
Parameters:
flags - Arguments to the parser, for example, {"-outputFormat", "typedDependencies", "-maxLength", "70"}
Throws:
IllegalArgumentException - If an unknown flag is passed in

public static void main(String[] args)
A main program for using the parser with various options.
Sample Usages:
java -mx1500m edu.stanford.nlp.parser.lexparser.LexicalizedParser [-v]
-train trainFilesPath fileRange -saveToSerializedFile
serializedGrammarFilename
java -mx1500m edu.stanford.nlp.parser.lexparser.LexicalizedParser
[-v] -train trainFilesPath fileRange
-testTreebank testFilePath fileRange
java -mx512m edu.stanford.nlp.parser.lexparser.LexicalizedParser
[-v] serializedGrammarPath filename+
java -mx512m edu.stanford.nlp.parser.lexparser.LexicalizedParser
[-v] -loadFromSerializedFile serializedGrammarPath
-testTreebank testFilePath fileRange
If the serializedGrammarPath ends in .gz, then the grammar is written and read as a compressed file (GZip).
If the serializedGrammarPath is a URL, starting with http://, then the parser is read from the URL.
A fileRange specifies a numeric value that must be included within a filename for it to be used in training or testing (this works well with most current treebanks). It can be specified like a range of pages to be printed, for instance as 200-2199 or 1-300,500-725,9000 or just as 1.
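The fileRange matching described above can be sketched roughly as follows. This is an illustration, not the parser's own filter: the class and method names are hypothetical, and two behaviours are assumptions, namely that range ends are inclusive and that the number tested is the last run of digits in the file name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of fileRange matching; not the parser's own code.
final class FileRangeSketch {

    /** True if n falls inside any comma-separated item of spec, e.g. "1-300,500-725,9000". */
    static boolean inRanges(int n, String spec) {
        for (String item : spec.split(",")) {
            String[] ends = item.trim().split("-");
            int lo = Integer.parseInt(ends[0]);
            // A single number such as "9000" is treated as the range 9000-9000.
            int hi = ends.length > 1 ? Integer.parseInt(ends[1]) : lo;
            if (n >= lo && n <= hi) { // assumption: both ends are inclusive
                return true;
            }
        }
        return false;
    }

    /** Assumption: the number that counts is the last run of digits in the file name. */
    static boolean accepts(String fileName, String spec) {
        Matcher m = Pattern.compile("\\d+").matcher(fileName);
        int last = -1;
        while (m.find()) {
            last = Integer.parseInt(m.group());
        }
        return last >= 0 && inRanges(last, spec);
    }

    public static void main(String[] args) {
        System.out.println(accepts("wsj_0200.mrg", "200-2199")); // prints "true"
        System.out.println(accepts("wsj_2300.mrg", "200-2199")); // prints "false"
    }
}
```

A file name with no digits at all is rejected in this sketch, which seems the natural reading of "must be included within a filename".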
The parser can write a grammar as either a serialized Java object file
or in a text format, specified with the following alternate
usages:
java edu.stanford.nlp.parser.lexparser.LexicalizedParser [-v] -train
trainFilesPath [fileRange] [-saveToSerializedFile grammarPath]
[-saveToTextFile grammarPath]
If no files are supplied to parse, then a hardwired sentence is parsed.
In the same position as the verbose flag (-v), many other options can be specified. The most useful to an end user are:

-tLPP class
Specify a different TreebankLangParserParams, for when using a different language or treebank (the default is English Penn Treebank).

-encoding charset
Specify the character encoding of the input files. This will override the value in the TreebankLangParserParams, provided this option appears after any -tLPP.

-tokenized
Says that the input is already separated into whitespace-delimited tokens. Unless you also specify -escaper, these tokens must all be correctly tokenized tokens of the appropriate treebank for the parser to work well (for instance, if using the Penn English Treebank, you must have coded "(" as "-LRB-", "3/4" as "3\/4", etc.).

-escaper class
Specify a class of type Function<List<HasWord>,List<HasWord>> to do customized escaping of tokenized text. This class will be run over the tokenized text and can fix the representation of tokens. For instance, it could change "(" to "-LRB-" for the Penn English Treebank. A provided escaper that does such things for the Penn English Treebank is edu.stanford.nlp.process.PTBEscapingProcessor.

-tokenizerFactory class
Specifies a TokenizerFactory class to be used for tokenization.

-sentences token
Specifies a token that marks sentence boundaries. All other tokens will be interpreted literally, and must be tokens returned by the tokenizer. A value of "newline" causes sentence breaking on newlines.

-parseInside element
Specifies that parsing should only be done for tokens inside the indicated XML-style elements (done as simple pattern matching, rather than XML parsing). For example, if this is specified as sentence, then the text inside the sentence element would be parsed as a sentence. Sentences cannot span elements. (Using "-parseInside s" gives you support for the input format of Charniak's parser.)

-tagSeparator char
Specifies to look for tags on words following the word and separated from it by a special character char. For instance, many tagged corpora have the representation "house/NN" and you would use -tagSeparator /.
Notes: This option requires that the input be pretokenized. The separator has to be only a single character, and there is no escaping mechanism. However, splitting is done on the last instance of the character in the token, so that cases like "3\/4/CD" are handled correctly. The parser will in all normal circumstances use the tag you provide, but will override it in the case of very common words in cases where the tag that you provide is not one that it regards as a possible tagging for the word.
The parser supports a format where only some of the words in a sentence have a tag (if you are calling the parser programmatically, you indicate them by having them implement the HasTag interface). You can do this at the command-line by only having tags after some words, but you are limited by the fact that there is no way to escape the tagSeparator character.

-maxLength leng
Specify the longest sentence that will be parsed (and hence indirectly the amount of memory needed for the parser). If this is not specified, the parser will try to dynamically grow its parse chart when long sentences are encountered, but may run out of memory trying to do so.

-outputFormat styles
Choose the style(s) of output sentences: penn for prettyprinting as in the Penn treebank files, or oneline for printing sentences one per line, words, wordsAndTags, dependencies, typedDependencies, or typedDependenciesCollapsed. Multiple options may be specified as a comma-separated list. See TreePrint class for further documentation.

-outputFormatOptions
Provide options that control the behavior of various -outputFormat choices, such as lexicalize, stem, markHeadNodes, or xml. Options are specified as a comma-separated list.
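
The "split on the last instance" rule for -tagSeparator can be illustrated with a short sketch. The class and method names are hypothetical and this is not the parser's code. Treating a token with a leading or trailing separator as untagged is also an assumption made here to keep the example total.

```java
// Hypothetical sketch of -tagSeparator splitting; not the parser's own code.
final class TagSeparatorSketch {

    /**
     * Splits a token into {word, tag} at the LAST occurrence of sep.
     * The tag is null when the token carries no usable tag.
     */
    static String[] split(String token, char sep) {
        int i = token.lastIndexOf(sep);
        // Assumption: an empty word or an empty tag means "no tag supplied".
        if (i <= 0 || i == token.length() - 1) {
            return new String[] {token, null};
        }
        return new String[] {token.substring(0, i), token.substring(i + 1)};
    }

    public static void main(String[] args) {
        String[] a = split("house/NN", '/');
        System.out.println(a[0] + " " + a[1]); // prints "house NN"
        String[] b = split("3\\/4/CD", '/');   // the token is 3\/4/CD
        System.out.println(b[0] + " " + b[1]); // prints "3\/4 CD"
    }
}
```

Because the split uses the last separator, the escaped slash inside "3\/4" stays with the word. An untagged token such as "the" comes back with a null tag, which mirrors the partially tagged input the option description allows.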
Parameters:
args - Command line arguments, as above
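
As an illustration of what an -escaper class does, the sketch below rewrites tokens into the Penn English Treebank conventions mentioned above ("(" as "-LRB-", "3/4" as "3\/4"). It is a stand-in, not PTBEscapingProcessor: it uses java.util.function.Function over plain strings rather than the parser's own Function and HasWord types, the class name is hypothetical, and the ")" to "-RRB-" mapping is added by analogy with the "(" case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for an escaper; operates on String tokens, not HasWord.
final class EscaperSketch implements Function<List<String>, List<String>> {

    @Override
    public List<String> apply(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (t.equals("(")) {
                out.add("-LRB-");
            } else if (t.equals(")")) {
                out.add("-RRB-");
            } else {
                // Escape each slash that is not already preceded by a backslash.
                out.add(t.replaceAll("(?<!\\\\)/", "\\\\/"));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> escaped = new EscaperSketch().apply(Arrays.asList("(", "3/4", ")"));
        System.out.println(escaped); // prints "[-LRB-, 3\/4, -RRB-]"
    }
}
```

The negative lookbehind keeps the sketch idempotent, so running it over text that is already escaped leaves the tokens unchanged.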