DocumentPreprocessor (Stanford JavaNLP API)

edu.stanford.nlp.process
Class DocumentPreprocessor

java.lang.Object
  edu.stanford.nlp.process.DocumentPreprocessor

public class DocumentPreprocessor
extends Object

Fully customizable preprocessor for XML, HTML, and plain text documents. Can take any of a number of input formats and return a List of tokenized strings.

Author:: Chris Cox, Jenny Finkel

Constructor Summary
`DocumentPreprocessor()` Constructs a preprocessor using the default tokenizer: `PTBTokenizer`.
`DocumentPreprocessor(TokenizerFactory tokenizerFactory)`

Method Summary
`List`	`getSentencesFromHTML(Reader input)`
`List`	`getSentencesFromHTML(String fileOrURL)`
`List`	`getSentencesFromText(Reader input)`
`List`	`getSentencesFromText(Reader input, Function<List<HasWord>,List<HasWord>> escaper, String sentenceDelimiter, int tagDelimiter)` Produce a list of sentences from a reader.
`List`	`getSentencesFromText(String fileOrURL)` Reads the file and outputs a list of sentences.
`List`	`getSentencesFromText(String fileOrURL, boolean doPTBEscaping, String sentenceDelimiter, int tagDelimiter)`
`List`	`getSentencesFromText(String input, Function<List<HasWord>,List<HasWord>> escaper, String sentenceDelimiter, int tagDelimiter)` Produce a list of sentences from text.
`List`	`getSentencesFromXML(Reader input, String splitOnTag, boolean doPTBEscaping)` Returns a list of sentences contained in an XML file, occurring between the start and end of a selected tag.
`List`	`getSentencesFromXML(String fileOrURL, String splitOnTag)` Returns a list of sentences contained in an XML file, occurring between the start and end of a selected tag.
`List`	`getSentencesFromXML(String fileOrURL, String splitOnTag, boolean doPTBEscaping)` Returns a list of sentences contained in an XML file, occurring between the start and end of a selected tag.
`List`	`getWordsFromHTML(Reader input)`
`List`	`getWordsFromHTML(String fileOrURL)`
`List`	`getWordsFromString(String input)` Gets a list of words from a string.
`List`	`getWordsFromText(Reader input)`
`List`	`getWordsFromText(String fileOrURL)` Reads the file into a single list of words.
`static void`	`main(String[] args)` This provides a simple test method for DocumentPreprocessor.
`void`	`setEncoding(String encoding)` Set the character encoding.
`void`	`setSentenceFinalPuncWords(String[] sentenceFinalPuncWords)`
`void`	`setTokenizerFactory(TokenizerFactory newTokenizerFactory)` Sets the factory from which to produce a `Tokenizer`.
`void`	`usePTBTokenizer()`
`void`	`useWhitespaceTokenizer()` Use tokenizers which tokenize on whitespace.

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Constructor Detail

DocumentPreprocessor

public DocumentPreprocessor(TokenizerFactory tokenizerFactory)

DocumentPreprocessor

public DocumentPreprocessor()

Constructs a preprocessor using the default tokenizer: PTBTokenizer.

Method Detail

setEncoding

public void setEncoding(String encoding)

Set the character encoding.

Parameters:: encoding -
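
The encoding is passed by name. As a rough illustration of why the name matters, the sketch below decodes the same bytes under two charsets using only the Java standard library. The class `EncodingSketch` and its `decode` helper are hypothetical and are not part of DocumentPreprocessor; how the class applies the encoding internally is not described on this page.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingSketch {
    // Hypothetical helper: decode raw bytes using a charset looked up by name.
    static String decode(byte[] bytes, String encoding) {
        return new String(bytes, Charset.forName(encoding));
    }

    public static void main(String[] args) {
        // "caf\u00e9" ends in an e-acute, which UTF-8 stores as two bytes.
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        // Decoding with the matching charset restores the 4-character string.
        System.out.println(decode(utf8, "UTF-8").length());      // prints 4
        // Decoding with a mismatched charset yields 5 characters (mojibake).
        System.out.println(decode(utf8, "ISO-8859-1").length()); // prints 5
    }
}
```

A mismatched encoding does not raise an error here; it silently produces different characters, which is why setting the encoding before reading a file is worth doing explicitly.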

setSentenceFinalPuncWords

public void setSentenceFinalPuncWords(String[] sentenceFinalPuncWords)

setTokenizerFactory

public void setTokenizerFactory(TokenizerFactory newTokenizerFactory)

Sets the factory from which to produce a Tokenizer. The default is PTBTokenizer.

Parameters:: newTokenizerFactory -

usePTBTokenizer

public void usePTBTokenizer()

useWhitespaceTokenizer

public void useWhitespaceTokenizer()

Use tokenizers which tokenize on whitespace.
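
To make the effect of whitespace tokenization concrete, here is a minimal standard-library sketch. It is not the tokenizer that DocumentPreprocessor installs; the class `WhitespaceTokenizeSketch` and its `tokenize` method are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WhitespaceTokenizeSketch {
    // Hypothetical helper: split on runs of whitespace and nothing else,
    // so punctuation stays attached to the neighbouring word.
    static List<String> tokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, world!  It's  fine."));
        // prints [Hello,, world!, It's, fine.]
    }
}
```

Note that `Hello,` and `fine.` keep their punctuation. A Penn Treebank style tokenizer would normally separate such punctuation into its own tokens, which is the main practical difference between the two settings.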

getWordsFromText

public List getWordsFromText(String fileOrURL)
                      throws IOException

Reads the file into a single list of words.

Parameters:: fileOrURL - the path of a text file or URL
Returns:: a list of objects of type Word representing words
Throws:: IOException

getWordsFromText

public List getWordsFromText(Reader input)

Parameters:: input - a Reader of text
Returns:: a List of objects of type Word representing words

getSentencesFromText

public List getSentencesFromText(String fileOrURL)
                          throws IOException

Reads the file and outputs a list of sentences.

Parameters:: fileOrURL - the path of a text file or URL
Returns:: a list of objects of type Sentence
Throws:: IOException

getSentencesFromText

public List getSentencesFromText(String fileOrURL,
                                 boolean doPTBEscaping,
                                 String sentenceDelimiter,
                                 int tagDelimiter)
                          throws IOException

Throws:: IOException

getSentencesFromText

public List getSentencesFromText(Reader input)

Parameters:: input - a Reader of text
Returns:: a List of objects of type Sentence

getSentencesFromText

public List getSentencesFromText(String input,
                                 Function<List<HasWord>,List<HasWord>> escaper,
                                 String sentenceDelimiter,
                                 int tagDelimiter)
                          throws IOException

Produce a list of sentences from text.

Parameters:: input - the path of a file or a URL; escaper - a Function that takes a List of HasWords and returns an escaped version of those words. Passing null means that no escaping is done.; sentenceDelimiter - If null, the sentences are not already segmented and are split using the default sentence delimiters. If non-null, the sentences have already been segmented and are delimited by this token.; tagDelimiter -
Returns:: a List of objects of type List representing sentences
Throws:: IOException

getSentencesFromText

public List getSentencesFromText(Reader input,
                                 Function<List<HasWord>,List<HasWord>> escaper,
                                 String sentenceDelimiter,
                                 int tagDelimiter)

Produce a list of sentences from a reader.

Parameters:: input - the input; escaper - a Function that takes a List of HasWords and returns an escaped version of those words. Passing null means that no escaping is done.; sentenceDelimiter - If null, the sentences are not already segmented and are split using the default sentence delimiters. If non-null, the sentences have already been segmented and are delimited by this token.; tagDelimiter -
Returns:: a List of objects of type List representing sentences
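
The two modes of sentenceDelimiter can be sketched as follows. This is a standard-library illustration under stated assumptions, not the library's implementation: the class `SentenceDelimiterSketch` is hypothetical, the default sentence-final tokens are assumed to be `.`, `?` and `!`, and the delimiter token is assumed to be dropped from the output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SentenceDelimiterSketch {
    // Assumed default sentence-final tokens, for this sketch only.
    static final Set<String> FINAL_PUNCT = new HashSet<>(Arrays.asList(".", "?", "!"));

    static List<List<String>> split(List<String> tokens, String sentenceDelimiter) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : tokens) {
            if (sentenceDelimiter != null && token.equals(sentenceDelimiter)) {
                // Pre-segmented input: the delimiter closes a sentence and is dropped.
                if (!current.isEmpty()) {
                    sentences.add(current);
                    current = new ArrayList<>();
                }
                continue;
            }
            current.add(token);
            if (sentenceDelimiter == null && FINAL_PUNCT.contains(token)) {
                // Unsegmented input: close the sentence after final punctuation.
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current); // trailing tokens with no closing marker
        }
        return sentences;
    }

    public static void main(String[] args) {
        System.out.println(split(Arrays.asList("A", "b", ".", "C", "d", "?"), null));
        // prints [[A, b, .], [C, d, ?]]
        System.out.println(split(Arrays.asList("A", "b", "|", "C", ".", "d"), "|"));
        // prints [[A, b], [C, ., d]]
    }
}
```

In the second call the period does not end a sentence, because a non-null delimiter signals that segmentation has already been done.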

getWordsFromString

public List getWordsFromString(String input)

Gets a list of words from a string.

Parameters:: input - string
Returns:: a List of objects of type Word representing words

getSentencesFromXML

public List getSentencesFromXML(String fileOrURL,
                                String splitOnTag)
                         throws IOException

Returns a list of sentences contained in an XML file, occurring between the start and end of a selected tag. Sentences are segmented using a WordToSentenceProcessor. By default, PTB escaping is applied as well.

Parameters:: fileOrURL -; splitOnTag - the tag which denotes text boundaries
Returns:
Throws:: IOException

getSentencesFromXML

public List getSentencesFromXML(String fileOrURL,
                                String splitOnTag,
                                boolean doPTBEscaping)
                         throws IOException

Returns a list of sentences contained in an XML file, occurring between the start and end of a selected tag. Sentences are segmented using a WordToSentenceProcessor.

Parameters:: fileOrURL -; splitOnTag - the tag which denotes text boundaries; doPTBEscaping - whether to escape PTB tokens using a PTBEscapingProcessor
Returns:
Throws:: IOException

getSentencesFromXML

public List getSentencesFromXML(Reader input,
                                String splitOnTag,
                                boolean doPTBEscaping)

Returns a list of sentences contained in an XML file, occurring between the start and end of a selected tag.

Parameters:: input -; splitOnTag - the tag which denotes text boundaries; doPTBEscaping - whether to escape PTB tokens using a PTBEscapingProcessor
Returns:

getWordsFromHTML

public List getWordsFromHTML(String fileOrURL)
                      throws IOException

Throws:: IOException

getWordsFromHTML

public List getWordsFromHTML(Reader input)

getSentencesFromHTML

public List getSentencesFromHTML(String fileOrURL)
                          throws IOException

Throws:: IOException

getSentencesFromHTML

public List getSentencesFromHTML(Reader input)

main

public static void main(String[] args)
                 throws IOException

This provides a simple test method for DocumentPreprocessor. Usage: DocumentPreprocessor -file filename [-xml tag|-html] [-noSplitSentence]

Throws:: IOException
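
The usage line reads as one required option, `-file filename`, one optional choice between `-xml tag` and `-html`, and one optional switch, `-noSplitSentence`. The sketch below parses arguments under that reading. The class `ArgsSketch` is hypothetical and is not the parsing code in `main`; treating `-xml` and `-html` as mutually exclusive is an interpretation of the `|` in the usage line.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ArgsSketch {
    // Hypothetical parser for: -file filename [-xml tag|-html] [-noSplitSentence]
    static Map<String, String> parse(String[] args) {
        Map<String, String> opts = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i++) {
            String flag = args[i];
            switch (flag) {
                case "-file":
                case "-xml":
                    // These options consume the following argument as their value.
                    if (i + 1 >= args.length) {
                        throw new IllegalArgumentException(flag + " needs a value");
                    }
                    i++;
                    opts.put(flag, args[i]);
                    break;
                case "-html":
                case "-noSplitSentence":
                    // These options are plain switches.
                    opts.put(flag, "true");
                    break;
                default:
                    throw new IllegalArgumentException("Unknown option: " + flag);
            }
        }
        if (!opts.containsKey("-file")) {
            throw new IllegalArgumentException("-file is required");
        }
        if (opts.containsKey("-xml") && opts.containsKey("-html")) {
            throw new IllegalArgumentException("Use either -xml or -html, not both");
        }
        return opts;
    }

    public static void main(String[] args) {
        System.out.println(parse(new String[] {"-file", "a.txt", "-xml", "p"}));
        // prints {-file=a.txt, -xml=p}
    }
}
```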

Stanford NLP Group
