java.lang.Object
  edu.stanford.nlp.process.DocumentPreprocessor
public class DocumentPreprocessor
Fully customizable preprocessor for XML, HTML, and plain text documents.
Can take any of a number of input formats and return a List
of tokenized strings.
Constructor Summary | |
---|---|
DocumentPreprocessor() | Constructs a preprocessor using the default tokenizer: PTBTokenizer. |
DocumentPreprocessor(TokenizerFactory tokenizerFactory) | |
Method Summary | | |
---|---|---|
List | getSentencesFromHTML(Reader input) | |
List | getSentencesFromHTML(String fileOrURL) | |
List | getSentencesFromText(Reader input) | |
List | getSentencesFromText(Reader input, Function<List<HasWord>,List<HasWord>> escaper, String sentenceDelimiter, int tagDelimiter) | Produce a list of sentences from a reader. |
List | getSentencesFromText(String fileOrURL) | Reads the file and outputs a list of sentences. |
List | getSentencesFromText(String fileOrURL, boolean doPTBEscaping, String sentenceDelimiter, int tagDelimiter) | |
List | getSentencesFromText(String input, Function<List<HasWord>,List<HasWord>> escaper, String sentenceDelimiter, int tagDelimiter) | Produce a list of sentences from text. |
List | getSentencesFromXML(Reader input, String splitOnTag, boolean doPTBEscaping) | Returns a list of sentences contained in an XML file, occurring between the begin and end of a selected tag. |
List | getSentencesFromXML(String fileOrURL, String splitOnTag) | Returns a list of sentences contained in an XML file, occurring between the begin and end of a selected tag. |
List | getSentencesFromXML(String fileOrURL, String splitOnTag, boolean doPTBEscaping) | Returns a list of sentences contained in an XML file, occurring between the begin and end of a selected tag. |
List | getWordsFromHTML(Reader input) | |
List | getWordsFromHTML(String fileOrURL) | |
List | getWordsFromString(String input) | Gets a list of words from a string. |
List | getWordsFromText(Reader input) | |
List | getWordsFromText(String fileOrURL) | Reads the file into a single list of words. |
static void | main(String[] args) | This provides a simple test method for DocumentPreprocessor. |
void | setEncoding(String encoding) | Set the character encoding. |
void | setSentenceFinalPuncWords(String[] sentenceFinalPuncWords) | |
void | setTokenizerFactory(TokenizerFactory newTokenizerFactory) | Sets the factory from which to produce a Tokenizer. |
void | usePTBTokenizer() | |
void | useWhitespaceTokenizer() | Use tokenizers which tokenize on whitespace. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public DocumentPreprocessor(TokenizerFactory tokenizerFactory)

public DocumentPreprocessor()
Constructs a preprocessor using the default tokenizer: PTBTokenizer.
Method Detail |
---|
public void setEncoding(String encoding)
Set the character encoding.
Parameters:
encoding

public void setSentenceFinalPuncWords(String[] sentenceFinalPuncWords)

public void setTokenizerFactory(TokenizerFactory newTokenizerFactory)
Sets the factory from which to produce a Tokenizer. The default is PTBTokenizer.
Parameters:
newTokenizerFactory

public void usePTBTokenizer()

public void useWhitespaceTokenizer()
Use tokenizers which tokenize on whitespace.
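
To make concrete what tokenizing on whitespace means, here is a minimal plain-Java sketch. It is an illustration only and not the library's implementation: the class `WhitespaceTokenizeSketch` and its `tokenize` method are hypothetical names, and the behavior shown (punctuation stays attached to the adjacent word) is simply what a naive whitespace split produces. A Penn Treebank style tokenizer would typically separate such punctuation into its own tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not part of edu.stanford.nlp.
class WhitespaceTokenizeSketch {

    // Splits on runs of whitespace. Punctuation is not separated
    // from the word it is attached to.
    static List<String> tokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, world!  It's  a test."));
        // prints [Hello,, world!, It's, a, test.]
    }
}
```

Whitespace tokenization is usually the right choice when the input has already been tokenized by another tool and the tokens are separated by spaces.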
public List getWordsFromText(String fileOrURL) throws IOException
Reads the file into a single list of words.
Parameters:
fileOrURL - the path of a text file or URL
Throws:
IOException

public List getWordsFromText(Reader input)
Parameters:
input - a Reader of text

public List getSentencesFromText(String fileOrURL) throws IOException
Reads the file and outputs a list of sentences.
Parameters:
fileOrURL - the path of a text file or URL
Returns:
a List of Sentence objects
Throws:
IOException

public List getSentencesFromText(String fileOrURL, boolean doPTBEscaping, String sentenceDelimiter, int tagDelimiter) throws IOException
Throws:
IOException

public List getSentencesFromText(Reader input)
Parameters:
input - a Reader of text

public List getSentencesFromText(String input, Function<List<HasWord>,List<HasWord>> escaper, String sentenceDelimiter, int tagDelimiter) throws IOException
Produce a list of sentences from text.
Parameters:
input - the path to the filename or URL
escaper - a Function that takes a List of HasWords and returns an escaped version of those words. Passing in null here means that no escaping is done.
sentenceDelimiter - If null, sentences are not already segmented and should be split using the default sentence delimiters; if non-null, sentences have already been segmented and are delimited with this token.
tagDelimiter
Throws:
IOException

public List getSentencesFromText(Reader input, Function<List<HasWord>,List<HasWord>> escaper, String sentenceDelimiter, int tagDelimiter)
Produce a list of sentences from a reader.
Parameters:
input - the input
escaper - a Function that takes a List of HasWords and returns an escaped version of those words. Passing in null here means that no escaping is done.
sentenceDelimiter - If null, sentences are not already segmented and should be split using the default sentence delimiters; if non-null, sentences have already been segmented and are delimited with this token.
tagDelimiter
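
The two modes of the sentenceDelimiter parameter can be sketched in plain Java as follows. This is an illustration of the documented semantics, not the library's code: the class `SentenceSplitSketch`, the `split` method, and the default set of sentence-final tokens (".", "?", "!") are assumptions made for the example, as is the choice to drop the delimiter token from the output in the pre-segmented case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; not part of edu.stanford.nlp.
class SentenceSplitSketch {

    // Assumed default sentence-final tokens, for this sketch only.
    static final Set<String> DEFAULT_FINAL =
            new HashSet<>(Arrays.asList(".", "?", "!"));

    static List<List<String>> split(List<String> tokens, String sentenceDelimiter) {
        List<List<String>> sentences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String tok : tokens) {
            if (sentenceDelimiter != null) {
                // Pre-segmented input: the delimiter token marks a boundary
                // and is dropped (an assumption of this sketch).
                if (tok.equals(sentenceDelimiter)) {
                    if (!current.isEmpty()) {
                        sentences.add(current);
                        current = new ArrayList<>();
                    }
                } else {
                    current.add(tok);
                }
            } else {
                // Unsegmented input: a sentence-final token closes the
                // sentence and is kept as its last token.
                current.add(tok);
                if (DEFAULT_FINAL.contains(tok)) {
                    sentences.add(current);
                    current = new ArrayList<>();
                }
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current);
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("It", "rains", ".", "Stay", "in", "!");
        System.out.println(split(raw, null));
        // prints [[It, rains, .], [Stay, in, !]]

        List<String> pre = Arrays.asList("It", "rains", "<s>", "Stay", "in");
        System.out.println(split(pre, "<s>"));
        // prints [[It, rains], [Stay, in]]
    }
}
```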

public List getWordsFromString(String input)
Gets a list of words from a string.
Parameters:
input - string
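
public List getSentencesFromXML(String fileOrURL, String splitOnTag) throws IOException
Returns a list of sentences contained in an XML file, occurring between the begin and end of a selected tag. See also: WordToSentenceProcessor. By default, it does PTBEscaping as well.
Parameters:
fileOrURL
splitOnTag - the tag which denotes text boundaries
Throws:
IOException

public List getSentencesFromXML(String fileOrURL, String splitOnTag, boolean doPTBEscaping) throws IOException
Returns a list of sentences contained in an XML file, occurring between the begin and end of a selected tag. See also: WordToSentenceProcessor.
Parameters:
fileOrURL
splitOnTag - the tag which denotes text boundaries
doPTBEscaping - whether to escape PTB tokens using a PTBEscapingProcessor
Throws:
IOException

public List getSentencesFromXML(Reader input, String splitOnTag, boolean doPTBEscaping)
Returns a list of sentences contained in an XML file, occurring between the begin and end of a selected tag.
Parameters:
input
splitOnTag - the tag which denotes text boundaries
doPTBEscaping - whether to escape PTB tokens using a PTBEscapingProcessor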
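
As a rough picture of what "between the begin and end of a selected tag" means for splitOnTag, the following plain-Java sketch pulls out the text spans enclosed by a given tag. It is an assumption-laden illustration, not how the class itself reads XML: `TagSpanSketch` and `textBetween` are hypothetical names, and the regular expression handles only simple, non-nested elements. A real XML parser is the more robust choice for arbitrary documents.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not part of edu.stanford.nlp.
class TagSpanSketch {

    // Returns the trimmed text found between <tag ...> and </tag>.
    // Elements of the same name nested inside each other are not handled.
    static List<String> textBetween(String xml, String tag) {
        String q = Pattern.quote(tag);
        Pattern p = Pattern.compile(
                "<" + q + "(?:\\s[^>]*)?>(.*?)</" + q + "\\s*>",
                Pattern.DOTALL);
        Matcher m = p.matcher(xml);
        List<String> spans = new ArrayList<>();
        while (m.find()) {
            spans.add(m.group(1).trim());
        }
        return spans;
    }

    public static void main(String[] args) {
        String xml = "<doc><p>First one.</p><note>skip</note>"
                + "<p id=\"2\">Second one.</p></doc>";
        System.out.println(textBetween(xml, "p"));
        // prints [First one., Second one.]
    }
}
```

Each extracted span would then be tokenized and split into sentences separately, so that a sentence never crosses a tag boundary.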

public List getWordsFromHTML(String fileOrURL) throws IOException
Throws:
IOException

public List getWordsFromHTML(Reader input)

public List getSentencesFromHTML(String fileOrURL) throws IOException
Throws:
IOException

public List getSentencesFromHTML(Reader input)

public static void main(String[] args) throws IOException
This provides a simple test method for DocumentPreprocessor.
Throws:
IOException