LexicalizedParser (Stanford JavaNLP API)

edu.stanford.nlp.parser.lexparser
Class LexicalizedParser

java.lang.Object
  edu.stanford.nlp.parser.lexparser.LexicalizedParser

All Implemented Interfaces:: Parser, ViterbiParser, Function, Serializable

public class LexicalizedParser
extends Object
implements ViterbiParser, Function

This class provides the top-level API and command-line interface to a set of reasonably good treebank-trained parsers. The name reflects the main factored parsing model, which provides a lexicalized PCFG parser implemented as a product model of a plain PCFG parser and a lexicalized dependency parser. But you can also run either component parser alone. In particular, it is often useful to do unlexicalized PCFG parsing by using just that component parser.

See the package documentation for more details and examples of use. See the main method documentation for details of invoking the parser.

Note that training requires a lot of memory to run. Try -mx1500m.

Author:: Dan Klein (original version), Christopher Manning (better features, ParserParams, serialization), Roger Levy (internationalization), Teg Grenager (grammar compaction, tokenization, etc.), Galen Andrew (considerable refactoring)
See Also:: Serialized Form

Field Summary
`protected edu.stanford.nlp.parser.lexparser.BiLexPCFGParser`	`bparser` The factored parser that combines the dependency and PCFG parsers.
`protected TreeTransformer`	`debinarizer`
`protected edu.stanford.nlp.parser.lexparser.ExhaustiveDependencyParser`	`dparser` The dependency parser.
`protected ExhaustivePCFGParser`	`pparser` The PCFG parser.

Constructor Summary
`LexicalizedParser()` Construct a new LexicalizedParser object from a previously serialized grammar read from a property `edu.stanford.nlp.SerializedLexicalizedParser`, or a default file location.
`LexicalizedParser(ObjectInputStream in)` Construct a new LexicalizedParser object from a previously assembled grammar read from an InputStream.
`LexicalizedParser(Options op)` Construct a new LexicalizedParser object from a previously serialized grammar read from a System property `edu.stanford.nlp.SerializedLexicalizedParser`, or a default file location (`/u/nlp/data/lexparser/englishPCFG.ser.gz`).
`LexicalizedParser(ParserData pd)` Construct a new LexicalizedParser object from a previously assembled grammar.
`LexicalizedParser(String parserFileOrUrl)`
`LexicalizedParser(String parserFileOrUrl, boolean isTextGrammar, Options op)` Construct a new LexicalizedParser.
`LexicalizedParser(String treebankPath, FileFilter filt, Options op)`
`LexicalizedParser(String parserFileOrUrl, Options op)` Construct a new LexicalizedParser.
`LexicalizedParser(Treebank trainTreebank, DiskTreebank secondaryTrainTreebank, double weight, GrammarCompactor compactor, Options op)`
`LexicalizedParser(Treebank trainTreebank, GrammarCompactor compactor, Options op)` Construct a new LexicalizedParser.
`LexicalizedParser(Treebank trainTreebank, GrammarCompactor compactor, Options op, Treebank tuneTreebank)` Construct a new LexicalizedParser.
`LexicalizedParser(Treebank trainTreebank, Options op)`

Method Summary
`Object`	`apply(Object in)` Converts a Sentence/List/String into a Tree.
`Tree`	`getBestDependencyParse()`
`Tree`	`getBestDependencyParse(boolean debinarize)`
`Tree`	`getBestParse()` Return the best parse of the sentence most recently parsed.
`Tree`	`getBestPCFGParse()`
`Tree`	`getBestPCFGParse(boolean stripSubcategories)`
`Lexicon`	`getLexicon()`
`Options`	`getOp()`
`static ParserData`	`getParserDataFromFile(String parserFileOrUrl, Options op)`
`protected static ParserData`	`getParserDataFromSerializedFile(String serializedFileOrUrl)`
`protected static ParserData`	`getParserDataFromTextFile(String textFileOrUrl, Options op)`
`protected ParserData`	`getParserDataFromTreebank(Treebank trainTreebank, DiskTreebank secondaryTrainTreebank, double weight, GrammarCompactor compactor)` A method for training from two different treebanks, the second of which is presumed to be orders of magnitude larger.
`protected ParserData`	`getParserDataFromTreebank(Treebank trainTreebank, GrammarCompactor compactor, Treebank tuneTreebank)`
`double`	`getPCFGScore()`
`double`	`getPCFGScore(String goalStr)`
`TreePrint`	`getTreePrint()` Return a TreePrint for formatting parsed output trees.
`static void`	`main(String[] args)` A main program for using the parser with various options.
`protected void`	`makeParsers()`
`boolean`	`parse(LatticeReader lr)` Parse a lattice with PCFG parser.
`boolean`	`parse(List sentence)` Parse a sentence represented as a List.
`boolean`	`parse(Sentence sentence)` Parse a Sentence.
`boolean`	`parse(Sentence sentence, String goal)` Parse a Sentence.
`boolean`	`parse(String sentence)` Tokenize and parse a sentence.
`ParserData`	`parserData()`
`void`	`setMaxLength(int maxLength)` Set the maximum length of a sentence that the parser will be willing to parse.
`void`	`setOptionFlags(String[] flags)` This will set options to the parser, in a way exactly equivalent to passing in the same sequence of command-line arguments.
`double`	`testOnTreebank(Treebank testTreebank)` Test the parser on a treebank.

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

pparser

protected ExhaustivePCFGParser pparser

The PCFG parser.

dparser

protected edu.stanford.nlp.parser.lexparser.ExhaustiveDependencyParser dparser

The dependency parser.

bparser

protected edu.stanford.nlp.parser.lexparser.BiLexPCFGParser bparser

The factored parser that combines the dependency and PCFG parsers.

debinarizer

protected TreeTransformer debinarizer

Constructor Detail

LexicalizedParser

public LexicalizedParser()

Construct a new LexicalizedParser object from a previously serialized grammar read from a property edu.stanford.nlp.SerializedLexicalizedParser, or a default file location.

LexicalizedParser

public LexicalizedParser(Options op)

Construct a new LexicalizedParser object from a previously serialized grammar read from a System property edu.stanford.nlp.SerializedLexicalizedParser, or a default file location (/u/nlp/data/lexparser/englishPCFG.ser.gz).

Parameters:: op - Options to the parser. These are overwritten by the Options read from the serialized parser; apparently the only thing they determine is the encoding of the grammar, and only if it is a text grammar.

LexicalizedParser

public LexicalizedParser(String parserFileOrUrl,
                         Options op)

Construct a new LexicalizedParser. This loads a grammar that was previously assembled and stored.

Throws:: IllegalArgumentException - If parser data cannot be loaded

LexicalizedParser

public LexicalizedParser(String parserFileOrUrl)

LexicalizedParser

public LexicalizedParser(String parserFileOrUrl,
                         boolean isTextGrammar,
                         Options op)

Construct a new LexicalizedParser. This loads a grammar that was previously assembled and stored.

Throws:: IllegalArgumentException - If parser data cannot be loaded

LexicalizedParser

public LexicalizedParser(ParserData pd)

Construct a new LexicalizedParser object from a previously assembled grammar.

Parameters:: pd - A ParserData object (not null)

LexicalizedParser

public LexicalizedParser(ObjectInputStream in)
                  throws Exception

Construct a new LexicalizedParser object from a previously assembled grammar read from an InputStream. One (ParserData) object is read from the stream. The stream is not closed.

Parameters:: in - The ObjectInputStream
Throws:: Exception

LexicalizedParser

public LexicalizedParser(Treebank trainTreebank,
                         GrammarCompactor compactor,
                         Options op)

Construct a new LexicalizedParser.

Parameters:: trainTreebank - a treebank to train from

LexicalizedParser

public LexicalizedParser(String treebankPath,
                         FileFilter filt,
                         Options op)

LexicalizedParser

public LexicalizedParser(Treebank trainTreebank,
                         GrammarCompactor compactor,
                         Options op,
                         Treebank tuneTreebank)

Construct a new LexicalizedParser.

Parameters:: trainTreebank - a treebank to train from; tuneTreebank - a treebank to tune free params on (may be null)

LexicalizedParser

public LexicalizedParser(Treebank trainTreebank,
                         DiskTreebank secondaryTrainTreebank,
                         double weight,
                         GrammarCompactor compactor,
                         Options op)

LexicalizedParser

public LexicalizedParser(Treebank trainTreebank,
                         Options op)

Method Detail

getOp

public Options getOp()

apply

public Object apply(Object in)

Converts a Sentence/List/String into a Tree. If it can't be parsed, it is made into a trivial tree in which each word is attached to a dummy tag ("X") and then to a start nonterminal (also "X").

Specified by:: apply in interface Function

Parameters:: in - The input Sentence/List/String
Returns:: A Tree that is the parse tree for the sentence. If the parser fails, a new Tree is synthesized which attaches all words to the root.
Throws:: IllegalArgumentException - If argument isn't a List or String
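The fallback tree described above can be pictured with a small stand-alone sketch. It does not use the parser's Tree class; `FallbackTreeSketch` and `fallbackBracketing` are hypothetical names introduced only for illustration, and the bracketed string is just one way of writing down the shape "every word under a dummy X tag, all under an X root".

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of the library): bracketed form of the fallback tree. */
final class FallbackTreeSketch {

    /** Attaches every word to a dummy "X" tag and all of those to an "X" root. */
    static String fallbackBracketing(List<String> words) {
        StringBuilder sb = new StringBuilder("(X");
        for (String word : words) {
            // One preterminal per word, labeled with the dummy tag.
            sb.append(" (X ").append(word).append(')');
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        // Prints: (X (X colorless) (X green) (X ideas))
        System.out.println(fallbackBracketing(Arrays.asList("colorless", "green", "ideas")));
    }
}
```

A caller that needs to tell real parses from this fallback would presumably use parse(...) and check its boolean result rather than inspect the tree returned by apply.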

parse

public boolean parse(Sentence sentence)

Parse a Sentence.

Specified by:: parse in interface Parser

Parameters:: sentence - A Sentence to be parsed
Returns:: true iff it could be parsed

parse

public boolean parse(Sentence sentence,
                     String goal)

Parse a Sentence. Goal-directed parsing has not yet been implemented; at present the goal is ignored.

Specified by:: parse in interface Parser

Parameters:: sentence - A Sentence to be parsed; goal - The category to parse the sentence as (e.g., NP, S)
Returns:: true iff it could be parsed

getTreePrint

public TreePrint getTreePrint()

Return a TreePrint for formatting parsed output trees.

parse

public boolean parse(String sentence)

Tokenize and parse a sentence.

Parameters:: sentence - The sentence to tokenize and parse
Returns:: true iff it could be parsed

parse

public boolean parse(List sentence)

Parse a sentence represented as a List.

Parameters:: sentence - The sentence to parse
Returns:: true iff the sentence was accepted by the grammar
Throws:: UnsupportedOperationException - If the Sentence is too long or of zero length or the parse otherwise fails for resource reasons

parse

public boolean parse(LatticeReader lr)

Parse a lattice with PCFG parser.

Parameters:: lr - a lattice to parse
Returns:: Whether the lattice could be parsed by the grammar

getBestParse

public Tree getBestParse()

Return the best parse of the sentence most recently parsed. This will be from the factored parser, if it was used and succeeded; else from the PCFG parser, if it was used and succeeded; else from the dependency parser.

Specified by:: getBestParse in interface ViterbiParser

Returns:: The best tree
Throws:: NoSuchElementException - If no previously successfully parsed sentence
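The selection order described for getBestParse amounts to a simple priority chain. The following sketch mirrors that order with plain values; it assumes that an unused or failed component parser is represented by null, and `BestParseOrderSketch` and `pickBest` are hypothetical names, not library API.

```java
import java.util.NoSuchElementException;

/** Hypothetical illustration of the factored > PCFG > dependency preference. */
final class BestParseOrderSketch {

    /** Each argument is one component's result, or null if that parser was unused or failed. */
    static <T> T pickBest(T factored, T pcfg, T dependency) {
        if (factored != null) {
            return factored;      // the factored parse wins when available
        }
        if (pcfg != null) {
            return pcfg;          // otherwise fall back to the PCFG parse
        }
        if (dependency != null) {
            return dependency;    // last resort: the dependency parse
        }
        throw new NoSuchElementException("No successfully parsed sentence");
    }
}
```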

getBestPCFGParse

public Tree getBestPCFGParse()

getBestPCFGParse

public Tree getBestPCFGParse(boolean stripSubcategories)

getPCFGScore

public double getPCFGScore()

getPCFGScore

public double getPCFGScore(String goalStr)

getBestDependencyParse

public Tree getBestDependencyParse()

getBestDependencyParse

public Tree getBestDependencyParse(boolean debinarize)

setMaxLength

public void setMaxLength(int maxLength)

Set the maximum length of a sentence that the parser will be willing to parse. Sentences longer than this will not be parsed (an Exception will be thrown).

Parameters:: maxLength - The maximum length of a sentence that will be parsed

getParserDataFromFile

public static ParserData getParserDataFromFile(String parserFileOrUrl,
                                               Options op)

parserData

public ParserData parserData()

getLexicon

public Lexicon getLexicon()

getParserDataFromTextFile

protected static ParserData getParserDataFromTextFile(String textFileOrUrl,
                                                      Options op)

getParserDataFromSerializedFile

protected static ParserData getParserDataFromSerializedFile(String serializedFileOrUrl)

getParserDataFromTreebank

protected final ParserData getParserDataFromTreebank(Treebank trainTreebank,
                                                     GrammarCompactor compactor,
                                                     Treebank tuneTreebank)

getParserDataFromTreebank

protected final ParserData getParserDataFromTreebank(Treebank trainTreebank,
                                                     DiskTreebank secondaryTrainTreebank,
                                                     double weight,
                                                     GrammarCompactor compactor)

A method for training from two different treebanks, the second of which is presumed to be orders of magnitude larger.

Trees are not read into memory but processed as they are read from disk.

A weight (typically < 1) can be put on the second treebank.

makeParsers

protected final void makeParsers()

testOnTreebank

public double testOnTreebank(Treebank testTreebank)

Test the parser on a treebank. Parses will be written to stdout, and various other information will be written to stderr and stdout, particularly if Test.verbose is true.

Parameters:: testTreebank - The treebank to parse
Returns:: The labeled precision/recall F₁ (EVALB measure) of the parser on the treebank.
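For readers unfamiliar with the measure, F₁ is the harmonic mean of labeled precision and labeled recall. The sketch below computes it from bracket counts; the class and parameter names are hypothetical, and it says nothing about how the parser itself counts brackets or whether the returned value is scaled to 0-1 or 0-100.

```java
/** Hypothetical illustration of F1 from labeled-bracket counts. */
final class F1Sketch {

    /**
     * @param matched   labeled brackets that appear in both the parser output and the gold tree
     * @param predicted labeled brackets in the parser output
     * @param gold      labeled brackets in the gold tree
     */
    static double f1(int matched, int predicted, int gold) {
        if (matched == 0 || predicted == 0 || gold == 0) {
            return 0.0; // avoid division by zero; nothing was matched
        }
        double precision = (double) matched / predicted;
        double recall = (double) matched / gold;
        return 2 * precision * recall / (precision + recall);
    }
}
```

For example, 6 matched brackets out of 8 predicted and 12 gold give precision 0.75, recall 0.5, and F₁ = 0.6.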

setOptionFlags

public void setOptionFlags(String[] flags)

This will set options to the parser, in a way exactly equivalent to passing in the same sequence of command-line arguments. This is a useful convenience method when building a parser programmatically. The options passed in should be specified like command-line arguments, including with an initial minus sign.

Notes: This can be used to set parsing-time flags for a serialized parser. You can also still change things serialized in Options, but this will probably degrade parsing performance. The vast majority of command line flags can be passed to this method, but you cannot pass in options that specify the treebank or grammar to be loaded, the grammar to be written, trees or files to be parsed or details of their encoding, nor the TreebankLangParserParams (-tLPP) to use. The TreebankLangParserParams should be set up on construction of a LexicalizedParser, by constructing an Options that uses the required TreebankLangParserParams, and passing that to a LexicalizedParser constructor. Note that despite this method being an instance method, many flags are actually set as static class variables.

Parameters:: flags - Arguments to the parser, for example, {"-outputFormat", "typedDependencies", "-maxLength", "70"}
Throws:: IllegalArgumentException - If an unknown flag is passed in

main

public static void main(String[] args)

A main program for using the parser with various options. This program can be used for building and serializing a parser from treebank data, for parsing sentences from a file or URL using a serialized or text grammar parser, and (mainly for parser quality testing) for training and testing a parser on a treebank all in one go.

Sample Usages:
java -mx1500m edu.stanford.nlp.parser.lexparser.LexicalizedParser [-v] -train trainFilesPath fileRange -saveToSerializedFile serializedGrammarFilename

java -mx1500m edu.stanford.nlp.parser.lexparser.LexicalizedParser [-v] -train trainFilesPath fileRange -testTreebank testFilePath fileRange

java -mx512m edu.stanford.nlp.parser.lexparser.LexicalizedParser [-v] serializedGrammarPath filename+

java -mx512m edu.stanford.nlp.parser.lexparser.LexicalizedParser [-v] -loadFromSerializedFile serializedGrammarPath -testTreebank testFilePath fileRange

If the serializedGrammarPath ends in .gz, then the grammar is written and read as a compressed file (GZip). If the serializedGrammarPath is a URL, starting with http://, then the parser is read from the URL. A fileRange specifies a numeric value that must be included within a filename for it to be used in training or testing (this works well with most current treebanks). It can be specified like a range of pages to be printed, for instance as 200-2199 or 1-300,500-725,9000 or just as 1. The parser can write a grammar as either a serialized Java object file or in a text format, specified with the following alternate usages:
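The fileRange syntax can be illustrated with a short stand-alone sketch. It only models the range specification itself, assuming that both endpoints of a range are inclusive; how the parser extracts the numeric value from a filename is not covered here. `FileRangeSketch` is a hypothetical name and not the class the parser uses.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of a fileRange such as "1-300,500-725,9000". */
final class FileRangeSketch {

    /** Each entry is {low, high}, both assumed inclusive. */
    private final List<int[]> ranges = new ArrayList<>();

    FileRangeSketch(String spec) {
        for (String part : spec.split(",")) {
            String p = part.trim();
            int dash = p.indexOf('-');
            if (dash < 0) {
                // A single number such as "9000" is a range of length one.
                int n = Integer.parseInt(p);
                ranges.add(new int[] {n, n});
            } else {
                int low = Integer.parseInt(p.substring(0, dash).trim());
                int high = Integer.parseInt(p.substring(dash + 1).trim());
                ranges.add(new int[] {low, high});
            }
        }
    }

    /** True if n falls inside any of the listed ranges. */
    boolean contains(int n) {
        for (int[] r : ranges) {
            if (n >= r[0] && n <= r[1]) {
                return true;
            }
        }
        return false;
    }
}
```

Under these assumptions, `new FileRangeSketch("1-300,500-725,9000")` accepts 300 and 9000 but rejects 400.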

java edu.stanford.nlp.parser.lexparser.LexicalizedParser [-v] -train trainFilesPath [fileRange] [-saveToSerializedFile grammarPath] [-saveToTextFile grammarPath]

If no files are supplied to parse, then a hardwired sentence is parsed.

In the same position as the verbose flag (-v), many other options can be specified. The most useful to an end user are:

-tLPP class Specify a different TreebankLangParserParams, for when using a different language or treebank (the default is English Penn Treebank)
-encoding charset Specify the character encoding of the input files. This will override the value in the TreebankLangParserParams, provided this option appears after any -tLPP
-tokenized Says that the input is already separated into whitespace-delimited tokens. Unless you also specify -escaper, these tokens must all be correctly tokenized tokens of the appropriate treebank for the parser to work well (for instance, if using the Penn English Treebank, you must have coded "(" as "-LRB-", "3/4" as "3\/4", etc.)
-escaper class Specify a class of type Function<List<HasWord>,List<HasWord>> to do customized escaping of tokenized text. This class will be run over the tokenized text and can fix the representation of tokens. For instance, it could change "(" to "-LRB-" for the Penn English Treebank. A provided escaper that does such things for the Penn English Treebank is edu.stanford.nlp.process.PTBEscapingProcessor
-tokenizerFactory class Specifies a TokenizerFactory class to be used for tokenization
-sentences token Specifies a token that marks sentence boundaries. All other tokens will be interpreted literally, and must be tokens returned by the tokenizer. A value of "newline" causes sentence breaking on newlines.
-parseInside element Specifies that parsing should only be done for tokens inside the indicated XML-style elements (done as simple pattern matching, rather than XML parsing). For example, if this is specified as sentence, then the text inside the sentence element would be parsed as a sentence. Sentences cannot span elements. (Using "-parseInside s" gives you support for the input format of Charniak's parser.)
-tagSeparator char Specifies to look for tags on words following the word and separated from it by a special character char. For instance, many tagged corpora have the representation "house/NN" and you would use -tagSeparator /. Notes: This option requires that the input be pretokenized. The separator has to be only a single character, and there is no escaping mechanism. However, splitting is done on the last instance of the character in the token, so that cases like "3\/4/CD" are handled correctly. The parser will in all normal circumstances use the tag you provide, but will override it in the case of very common words in cases where the tag that you provide is not one that it regards as a possible tagging for the word. The parser supports a format where only some of the words in a sentence have a tag (if you are calling the parser programmatically, you indicate them by having them implement the HasTag interface). You can do this at the command-line by only having tags after some words, but you are limited by the fact that there is no way to escape the tagSeparator character.
-maxLength leng Specify the longest sentence that will be parsed (and hence indirectly the amount of memory needed for the parser). If this is not specified, the parser will try to dynamically grow its parse chart when long sentences are encountered, but may run out of memory trying to do so.
-outputFormat styles Choose the style(s) of output sentences: penn for prettyprinting as in the Penn treebank files, oneline for printing sentences one per line, or words, wordsAndTags, dependencies, typedDependencies, or typedDependenciesCollapsed. Multiple options may be specified as a comma-separated list. See TreePrint class for further documentation.
-outputFormatOptions Provide options that control the behavior of various -outputFormat choices, such as lexicalize, stem, markHeadNodes, or xml. Options are specified as a comma-separated list.
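To make the -escaper option more concrete, here is a minimal sketch of the kind of token rewriting it describes. It is deliberately simplified: it uses java.util.function.Function over plain strings instead of the library's Function<List<HasWord>,List<HasWord>>, it is not the provided PTBEscapingProcessor, and it assumes the input tokens have not already been escaped. The name `EscaperSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical, simplified escaper over String tokens (stand-in for HasWord tokens). */
final class EscaperSketch implements Function<List<String>, List<String>> {

    @Override
    public List<String> apply(List<String> tokens) {
        List<String> out = new ArrayList<>(tokens.size());
        for (String token : tokens) {
            if (token.equals("(")) {
                out.add("-LRB-");                      // Penn Treebank left bracket
            } else if (token.equals(")")) {
                out.add("-RRB-");                      // Penn Treebank right bracket
            } else {
                out.add(token.replace("/", "\\/"));    // "3/4" becomes "3\/4"
            }
        }
        return out;
    }
}
```

An escaper passed to the real parser would need to implement the library's own Function type over HasWord lists; this sketch only shows the token-level mapping.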

See also the package documentation for more details and examples of use.
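The -tagSeparator behavior described in the option list (split on the last instance of the separator character, so that "3\/4/CD" yields the word "3\/4" and the tag "CD") can be sketched as follows. This is an illustration only: `TagSeparatorSketch` is a hypothetical name, and edge cases such as a token that starts or ends with the separator are not handled the way the parser necessarily handles them.

```java
/** Hypothetical illustration of splitting "word/TAG" on the last separator. */
final class TagSeparatorSketch {

    /** Returns {word, tag}; the tag is null when the token contains no separator. */
    static String[] split(String token, char separator) {
        int i = token.lastIndexOf(separator);
        if (i < 0) {
            return new String[] {token, null};   // untagged word
        }
        return new String[] {token.substring(0, i), token.substring(i + 1)};
    }

    public static void main(String[] args) {
        String[] parts = split("3\\/4/CD", '/');
        // Prints: 3\/4 CD
        System.out.println(parts[0] + " " + parts[1]);
    }
}
```

Returning a null tag for untagged tokens corresponds to the partially tagged input that the option description allows.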

Parameters:: args - Command line arguments, as above

Stanford NLP Group
