WordToSentenceProcessor (Stanford JavaNLP API)

edu.stanford.nlp.process
Class WordToSentenceProcessor

java.lang.Object
  edu.stanford.nlp.process.AbstractListProcessor
      edu.stanford.nlp.process.WordToSentenceProcessor

All Implemented Interfaces: ListProcessor, Processor

public class WordToSentenceProcessor
extends AbstractListProcessor

Transforms a Document of Words into a Document of Sentences by grouping the Words. The word stream is assumed to be adequately tokenized already; this class only divides the list into sentences, possibly discarding some separator tokens, according to the following three token sets and two region patterns:

sentenceBoundaryTokens are tokens that are left in a sentence, but are to be regarded as ending a sentence. A canonical example is a period. If two of these follow each other, the second will be a sentence consisting of only the sentenceBoundaryToken.
sentenceBoundaryFollowers are tokens that are left in a sentence, and which can follow a sentenceBoundaryToken while still belonging to the previous sentence. They cannot begin a sentence (except at the beginning of a document). A canonical example is a close parenthesis ')'.
sentenceBoundaryToDiscard are tokens which separate sentences and which should be thrown away. In web documents, a typical example would be a '<p>' tag. If two of these follow each other, they are coalesced: no empty Sentence is output. The end-of-file is not represented in this Set, but the code behaves as if it were a member.
sentenceRegionBeginPattern is a regular expression that marks the start of a sentence region. The marker itself is not included in the sentence.
sentenceRegionEndPattern is a regular expression that marks the end of a sentence region. The marker itself is not included in the sentence.
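
The interplay of the three token sets described above can be illustrated with a minimal, self-contained sketch. It is not the Stanford implementation: the class name `SentenceSplitSketch`, the use of plain `String` tokens instead of Words, and the choice to stop attaching followers once a discarded token has been seen are all assumptions made for illustration. The set contents are the defaults listed in the constructor documentation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for illustration only; tokens are plain Strings.
class SentenceSplitSketch {
    static final Set<String> BOUNDARY_TOKENS =
            new HashSet<>(Arrays.asList(".", "?", "!"));
    static final Set<String> BOUNDARY_FOLLOWERS =
            new HashSet<>(Arrays.asList(")", "]", "\"", "'", "''", "-RRB-", "-RSB-"));
    static final Set<String> BOUNDARY_TO_DISCARD =
            new HashSet<>(Arrays.asList("\n"));

    static List<List<String>> split(List<String> words) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        List<String> lastClosed = null;    // sentence most recently ended by a boundary token
        boolean followersAllowed = false;  // true directly after a boundary token (and its followers)
        for (String w : words) {
            if (BOUNDARY_TO_DISCARD.contains(w)) {
                // Separator is thrown away; consecutive separators produce no empty sentence.
                if (!current.isEmpty()) {
                    sentences.add(current);
                    current = new ArrayList<>();
                }
                followersAllowed = false;  // assumption: a separator ends follower attachment
            } else if (followersAllowed && BOUNDARY_FOLLOWERS.contains(w)) {
                // Follower stays with the sentence that was just closed.
                lastClosed.add(w);
            } else {
                current.add(w);
                followersAllowed = false;
                if (BOUNDARY_TOKENS.contains(w)) {
                    // Boundary token is kept and ends the sentence.
                    sentences.add(current);
                    lastClosed = current;
                    current = new ArrayList<>();
                    followersAllowed = true;
                }
            }
        }
        // End of input behaves like a discarded separator.
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "(", "Hi", ".", ")", "\n", "\n", "Bye", "!", "!", "end");
        for (List<String> sentence : split(words)) {
            System.out.println(sentence);
        }
    }
}
```

In this sketch the close parenthesis joins the first sentence, the two newline tokens vanish without producing an empty sentence, the second `!` forms a sentence of its own, and the trailing `end` is flushed at the end of input.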

Author: Joseph Smarr (jsmarr@stanford.edu), Christopher Manning, Teg Grenager (grenager@stanford.edu)

Constructor Summary
`WordToSentenceProcessor()` Create a `WordToSentenceProcessor` using a sensible default list of tokens to split on.
`WordToSentenceProcessor(Pattern regionBeginPattern, Pattern regionEndPattern)`
`WordToSentenceProcessor(Set boundaryTokens)` Flexibly set the set of acceptable sentence boundary tokens, but with a default set of allowed boundary following tokens.
`WordToSentenceProcessor(Set boundaryTokens, Set boundaryFollowers)` Flexibly set the set of acceptable sentence boundary tokens and also the set of tokens commonly following sentence boundaries, but with a default set of discarded separator tokens.
`WordToSentenceProcessor(Set boundaryTokens, Set boundaryFollowers, Set boundaryToDiscard)` Flexibly set the set of acceptable sentence boundary tokens, the set of tokens commonly following sentence boundaries, and also the set of tokens that are sentence boundaries that should be discarded.

Method Summary
`static void`	`main(String[] args)` Prints some text divided into sentences.
`List`	`process(List words)` Returns a List of Sentences where each element is built from a run of Words in the input Document.

Methods inherited from class edu.stanford.nlp.process.AbstractListProcessor
`processDocument, processLists`

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Constructor Detail

WordToSentenceProcessor

public WordToSentenceProcessor()

Create a WordToSentenceProcessor using a sensible default list of tokens to split on. The default set is: {".","?","!"}.

WordToSentenceProcessor

public WordToSentenceProcessor(Set boundaryTokens)

Flexibly set the set of acceptable sentence boundary tokens, but with a default set of allowed boundary following tokens. The allowed set of boundary followers is: {")","]","\"","\'", "''", "-RRB-", "-RSB-"}.

WordToSentenceProcessor

public WordToSentenceProcessor(Set boundaryTokens,
                               Set boundaryFollowers)

Flexibly set the set of acceptable sentence boundary tokens and also the set of tokens commonly following sentence boundaries, but with a default set of discarded separator tokens. The default set of discarded separator tokens is: {"\n"}.

WordToSentenceProcessor

public WordToSentenceProcessor(Set boundaryTokens,
                               Set boundaryFollowers,
                               Set boundaryToDiscard)

Flexibly set the set of acceptable sentence boundary tokens, the set of tokens commonly following sentence boundaries, and also the set of tokens that are sentence boundaries that should be discarded.

WordToSentenceProcessor

public WordToSentenceProcessor(Pattern regionBeginPattern,
                               Pattern regionEndPattern)
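
This constructor carries no description of its own, so the following is only a sketch of one plausible reading of the region patterns from the class description: tokens outside a region are skipped, and the tokens matching the begin and end patterns are not included in the output. The class name `RegionFilterSketch`, the `<p>` and `</p>` markers, the use of `String` tokens, and the flush of an unterminated region at end of input are assumptions, not documented behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration of region begin/end patterns; not the Stanford class.
class RegionFilterSketch {
    static List<List<String>> regions(List<String> words, Pattern begin, Pattern end) {
        List<List<String>> out = new ArrayList<>();
        List<String> current = null; // non-null only while inside a region
        for (String w : words) {
            if (current == null) {
                // Outside a region: ignore everything until a begin marker appears.
                if (begin.matcher(w).matches()) {
                    current = new ArrayList<>();
                }
            } else if (end.matcher(w).matches()) {
                // End marker closes the region and is itself dropped.
                if (!current.isEmpty()) {
                    out.add(current);
                }
                current = null;
            } else {
                current.add(w);
            }
        }
        // Assumption: an unterminated region is flushed at end of input.
        if (current != null && !current.isEmpty()) {
            out.add(current);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "skip", "<p>", "Hello", "world", "</p>", "junk", "<p>", "Bye", "</p>");
        System.out.println(regions(words, Pattern.compile("<p>"), Pattern.compile("</p>")));
    }
}
```

A real processor would presumably still apply the boundary-token rules to the words inside each region; the sketch leaves that step out to keep the region logic visible.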

Method Detail

process

public List process(List words)

Returns a List of Sentences where each element is built from a run of Words in the input Document. Specifically, reads through each word in the input document and breaks off a sentence after finding a valid sentence boundary token or end of file. Note that for this to work, the words in the input document must have been tokenized with a tokenizer that makes sentence boundary tokens their own tokens (e.g., PTBTokenizer).

Parameters: words - A list of already tokenized words (must implement HasWord)
Returns: A list of Sentence
See Also: WordToSentenceProcessor(Set, Set, Set), Sentence
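
The tokenization requirement can be made concrete with a small sketch. It compares whitespace-only splitting with a deliberately naive regular-expression tokenizer that emits punctuation as separate tokens. The class `TokenizationSketch` and its tokenizer are hypothetical and far simpler than PTBTokenizer; the boundary set is the default {".","?","!"}.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of why boundary tokens must be their own tokens.
class TokenizationSketch {
    static final Set<String> BOUNDARY_TOKENS =
            new HashSet<>(Arrays.asList(".", "?", "!"));

    // Runs of word characters, or any single non-space symbol.
    static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

    // Whitespace-only splitting leaves "world." as a single token.
    static List<String> whitespaceTokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Naive tokenizer that separates punctuation from words.
    static List<String> punctuationAwareTokens(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Counts tokens that a set-membership test would recognize as boundaries.
    static int countBoundaries(List<String> tokens) {
        int n = 0;
        for (String t : tokens) {
            if (BOUNDARY_TOKENS.contains(t)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "Hello world. How are you?";
        System.out.println(whitespaceTokens(text) + " -> "
                + countBoundaries(whitespaceTokens(text)) + " boundaries");
        System.out.println(punctuationAwareTokens(text) + " -> "
                + countBoundaries(punctuationAwareTokens(text)) + " boundaries");
    }
}
```

With whitespace splitting, no token equals a boundary token, so a splitter based on set membership would return the whole input as one sentence; with punctuation separated, both sentence ends are found.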

main

public static void main(String[] args)

Prints some text divided into sentences. It can be used to test sentence division.
Usage: java edu.stanford.nlp.process.WordToSentenceProcessor fileOrUrl+

Parameters: args - Command line argument: files or URLs

Stanford NLP Group
