edu.stanford.nlp.process
Class DocumentPreprocessor

java.lang.Object
  extended by edu.stanford.nlp.process.DocumentPreprocessor

public class DocumentPreprocessor
extends Object

Fully customizable preprocessor for XML, HTML, and plain text documents. It accepts input as a file path, a URL, a Reader, or (for getWordsFromString) a String, and returns a List of words or a List of sentences.

Author:
Chris Cox, Jenny Finkel

Constructor Summary
DocumentPreprocessor()
          Constructs a preprocessor using the default tokenizer: PTBTokenizer.
DocumentPreprocessor(TokenizerFactory tokenizerFactory)
           
 
Method Summary
 List getSentencesFromHTML(Reader input)
           
 List getSentencesFromHTML(String fileOrURL)
           
 List getSentencesFromText(Reader input)
           
 List getSentencesFromText(Reader input, Function<List<HasWord>,List<HasWord>> escaper, String sentenceDelimiter, int tagDelimiter)
          Produce a list of sentences from a reader.
 List getSentencesFromText(String fileOrURL)
          Reads the file and outputs a list of sentences.
 List getSentencesFromText(String fileOrURL, boolean doPTBEscaping, String sentenceDelimiter, int tagDelimiter)
           
 List getSentencesFromText(String input, Function<List<HasWord>,List<HasWord>> escaper, String sentenceDelimiter, int tagDelimiter)
          Produce a list of sentences from text.
 List getSentencesFromXML(Reader input, String splitOnTag, boolean doPTBEscaping)
          Returns a list of sentences contained in an XML file, occurring between the begin and end of a selected tag.
 List getSentencesFromXML(String fileOrURL, String splitOnTag)
          Returns a list of sentences contained in an XML file, occurring between the begin and end of a selected tag.
 List getSentencesFromXML(String fileOrURL, String splitOnTag, boolean doPTBEscaping)
          Returns a list of sentences contained in an XML file, occurring between the begin and end of a selected tag.
 List getWordsFromHTML(Reader input)
           
 List getWordsFromHTML(String fileOrURL)
           
 List getWordsFromString(String input)
          Gets a list of words from a string.
 List getWordsFromText(Reader input)
           
 List getWordsFromText(String fileOrURL)
          Reads the file into a single list of words.
static void main(String[] args)
          This provides a simple test method for DocumentPreprocessor.
 void setEncoding(String encoding)
          Set the character encoding.
 void setSentenceFinalPuncWords(String[] sentenceFinalPuncWords)
           
 void setTokenizerFactory(TokenizerFactory newTokenizerFactory)
          Sets the factory from which to produce a Tokenizer.
 void usePTBTokenizer()
           
 void useWhitespaceTokenizer()
          Use tokenizers which tokenize on whitespace.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

DocumentPreprocessor

public DocumentPreprocessor(TokenizerFactory tokenizerFactory)

DocumentPreprocessor

public DocumentPreprocessor()
Constructs a preprocessor using the default tokenizer: PTBTokenizer.

Method Detail

setEncoding

public void setEncoding(String encoding)
Set the character encoding.

Parameters:
encoding - the character encoding to use

setSentenceFinalPuncWords

public void setSentenceFinalPuncWords(String[] sentenceFinalPuncWords)

setTokenizerFactory

public void setTokenizerFactory(TokenizerFactory newTokenizerFactory)
Sets the factory from which to produce a Tokenizer. The default is PTBTokenizer.

Parameters:
newTokenizerFactory - the factory from which to produce a Tokenizer

usePTBTokenizer

public void usePTBTokenizer()

useWhitespaceTokenizer

public void useWhitespaceTokenizer()
Use tokenizers which tokenize on whitespace.
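
To illustrate what tokenizing on whitespace means, here is a minimal sketch that uses only the JDK. The class WhitespaceTokenizeSketch and its tokenize method are hypothetical names invented for this example; they are not part of edu.stanford.nlp.process and do not reproduce the library's tokenizer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not a Stanford NLP class.
class WhitespaceTokenizeSketch {

    // Splits the input on runs of whitespace and drops empty tokens.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        for (String piece : input.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Punctuation stays attached to the neighbouring word.
        System.out.println(tokenize("Hello,  world!\tBye."));
    }
}
```

Unlike a Penn Treebank style tokenizer, a splitter of this kind leaves punctuation attached to words, which is usually appropriate only for input that has already been tokenized.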


getWordsFromText

public List getWordsFromText(String fileOrURL)
                      throws IOException
Reads the file into a single list of words.

Parameters:
fileOrURL - the path of a text file or URL
Returns:
a list of objects of type Word representing words
Throws:
IOException

getWordsFromText

public List getWordsFromText(Reader input)
Parameters:
input - a Reader of text
Returns:
a List of objects of type Word representing words

getSentencesFromText

public List getSentencesFromText(String fileOrURL)
                          throws IOException
Reads the file and outputs a list of sentences.

Parameters:
fileOrURL - the path of a text file or URL
Returns:
a list of objects of type Sentence
Throws:
IOException

getSentencesFromText

public List getSentencesFromText(String fileOrURL,
                                 boolean doPTBEscaping,
                                 String sentenceDelimiter,
                                 int tagDelimiter)
                          throws IOException
Throws:
IOException

getSentencesFromText

public List getSentencesFromText(Reader input)
Parameters:
input - a Reader of text
Returns:
a List of objects of type Sentence

getSentencesFromText

public List getSentencesFromText(String input,
                                 Function<List<HasWord>,List<HasWord>> escaper,
                                 String sentenceDelimiter,
                                 int tagDelimiter)
                          throws IOException
Produce a list of sentences from text.

Parameters:
input - the path of a text file or URL
escaper - a Function that takes a List of HasWords and returns an escaped version of those words. Passing in null here means that no escaping is done.
sentenceDelimiter - If null, the sentences are not already segmented and are split using the default sentence delimiters. If non-null, the sentences have already been segmented and are delimited by this token.
tagDelimiter -
Returns:
a List of objects of type List representing sentences
Throws:
IOException
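
The escaper contract described above (a function from a list of words to an escaped list, with null meaning no escaping) can be sketched with the JDK alone. The sketch below assumes plain Strings in place of HasWord and java.util.function.Function in place of the library's Function type; EscaperSketch, BRACKETS, PTB_BRACKET_ESCAPER, and apply are hypothetical names. It covers only the Penn Treebank bracket escapes, not everything a PTBEscapingProcessor may do.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper for illustration only; not a Stanford NLP class.
class EscaperSketch {

    // Penn Treebank escapes for bracket tokens.
    static final Map<String, String> BRACKETS = new HashMap<>();
    static {
        BRACKETS.put("(", "-LRB-");
        BRACKETS.put(")", "-RRB-");
        BRACKETS.put("[", "-LSB-");
        BRACKETS.put("]", "-RSB-");
        BRACKETS.put("{", "-LCB-");
        BRACKETS.put("}", "-RCB-");
    }

    // An escaper: takes a list of words, returns a new escaped list.
    static final Function<List<String>, List<String>> PTB_BRACKET_ESCAPER = words -> {
        List<String> escaped = new ArrayList<>(words.size());
        for (String word : words) {
            escaped.add(BRACKETS.getOrDefault(word, word));
        }
        return escaped;
    };

    // Mirrors the documented null convention: a null escaper means no escaping.
    static List<String> apply(List<String> words, Function<List<String>, List<String>> escaper) {
        return escaper == null ? words : escaper.apply(words);
    }
}
```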

getSentencesFromText

public List getSentencesFromText(Reader input,
                                 Function<List<HasWord>,List<HasWord>> escaper,
                                 String sentenceDelimiter,
                                 int tagDelimiter)
Produce a list of sentences from a reader.

Parameters:
input - the input
escaper - a Function that takes a List of HasWords and returns an escaped version of those words. Passing in null here means that no escaping is done.
sentenceDelimiter - If null, the sentences are not already segmented and are split using the default sentence delimiters. If non-null, the sentences have already been segmented and are delimited by this token.
tagDelimiter -
Returns:
a List of objects of type List representing sentences
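
The two modes of sentenceDelimiter can be sketched as follows, using only the JDK. This is an illustration under stated assumptions, not the library's implementation: SentenceDelimiterSketch, DEFAULT_FINAL, and split are hypothetical names, the default sentence-final tokens are assumed to be ".", "?" and "!", and the delimiter token is assumed to be dropped from the output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not a Stanford NLP class.
class SentenceDelimiterSketch {

    // Assumed defaults; the library's actual sentence-final tokens may differ.
    static final Set<String> DEFAULT_FINAL = new HashSet<>(Arrays.asList(".", "?", "!"));

    static List<List<String>> split(List<String> tokens, String sentenceDelimiter) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : tokens) {
            if (sentenceDelimiter != null) {
                // Pre-segmented input: the delimiter token closes a sentence and is dropped.
                if (token.equals(sentenceDelimiter)) {
                    if (!current.isEmpty()) {
                        sentences.add(current);
                        current = new ArrayList<>();
                    }
                } else {
                    current.add(token);
                }
            } else {
                // Unsegmented input: a sentence-final token closes a sentence and is kept.
                current.add(token);
                if (DEFAULT_FINAL.contains(token)) {
                    sentences.add(current);
                    current = new ArrayList<>();
                }
            }
        }
        // Keep any trailing words that were not followed by a boundary.
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }
}
```

With a non-null delimiter, punctuation such as "." has no special meaning, so one sentence per line can be obtained by inserting a delimiter token at each line end before calling split.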

getWordsFromString

public List getWordsFromString(String input)
Gets a list of words from a string.

Parameters:
input - string
Returns:
a List of objects of type Word representing words

getSentencesFromXML

public List getSentencesFromXML(String fileOrURL,
                                String splitOnTag)
                         throws IOException
Returns a list of sentences contained in an XML file, occurring between the begin and end of a selected tag. It splits the text into sentences using a WordToSentenceProcessor. By default, it also applies PTB escaping.

Parameters:
fileOrURL - the path of an XML file or URL
splitOnTag - the tag which denotes text boundaries
Returns:
a List of sentences
Throws:
IOException

getSentencesFromXML

public List getSentencesFromXML(String fileOrURL,
                                String splitOnTag,
                                boolean doPTBEscaping)
                         throws IOException
Returns a list of sentences contained in an XML file, occurring between the begin and end of a selected tag. It splits the text into sentences using a WordToSentenceProcessor.

Parameters:
fileOrURL - the path of an XML file or URL
splitOnTag - the tag which denotes text boundaries
doPTBEscaping - whether to escape PTB tokens using a PTBEscapingProcessor
Returns:
a List of sentences
Throws:
IOException

getSentencesFromXML

public List getSentencesFromXML(Reader input,
                                String splitOnTag,
                                boolean doPTBEscaping)
Returns a list of sentences contained in an XML file, occurring between the begin and end of a selected tag.

Parameters:
input - a Reader of XML
splitOnTag - the tag which denotes text boundaries
doPTBEscaping - whether to escape PTB tokens using a PTBEscapingProcessor
Returns:
a List of sentences
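
The idea of taking only the text between the begin and end of a selected tag can be sketched with the JDK alone. In this sketch a regular expression stands in for a real XML parser, so it does not handle nested occurrences of the same tag, CDATA sections, or entities; SplitOnTagSketch and textInside are hypothetical names and do not reflect how the library reads XML.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not a Stanford NLP class.
class SplitOnTagSketch {

    // Returns the text of each <splitOnTag>...</splitOnTag> region, in document order.
    static List<String> textInside(String xml, String splitOnTag) {
        String tag = Pattern.quote(splitOnTag);
        // Opening tag (with optional attributes), non-greedy body, closing tag.
        Pattern region = Pattern.compile(
                "<" + tag + "(?:\\s[^>]*)?>(.*?)</" + tag + "\\s*>", Pattern.DOTALL);
        Matcher matcher = region.matcher(xml);
        List<String> regions = new ArrayList<>();
        while (matcher.find()) {
            // Replace nested markup with spaces, then normalize whitespace.
            String text = matcher.group(1)
                    .replaceAll("<[^>]+>", " ")
                    .replaceAll("\\s+", " ")
                    .trim();
            regions.add(text);
        }
        return regions;
    }
}
```

Each returned region would then be tokenized and split into sentences; text outside the selected tag is ignored.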

getWordsFromHTML

public List getWordsFromHTML(String fileOrURL)
                      throws IOException
Throws:
IOException

getWordsFromHTML

public List getWordsFromHTML(Reader input)

getSentencesFromHTML

public List getSentencesFromHTML(String fileOrURL)
                          throws IOException
Throws:
IOException

getSentencesFromHTML

public List getSentencesFromHTML(Reader input)

main

public static void main(String[] args)
                 throws IOException
This provides a simple test method for DocumentPreprocessor. Usage: DocumentPreprocessor -file filename [-xml tag|-html] [-noSplitSentence]

Throws:
IOException


Stanford NLP Group