java.lang.Object
  edu.stanford.nlp.ling.DocumentReader
public class DocumentReader
Basic mechanism for reading in Documents from various input sources. This default implementation can read from strings, files, URLs, and InputStreams, and can use a given TokenizerFactory to turn the text into words. When working with a new data format, write a new DocumentReader subclass to parse it and then use it with the existing Document APIs, rather than writing new Document classes. Use the protected fields (in, tokenizerFactory, keepOriginalText) to read text and create documents appropriately. Subclasses should ideally provide constructors similar to those of this class, though only the constructor that takes a Reader is required.
Field Summary | |
---|---|
protected Reader |
in
Reader used to read in document text. |
protected boolean |
keepOriginalText
Whether to keep source text in document along with tokenized words. |
protected TokenizerFactory |
tokenizerFactory
TokenizerFactory used to create the tokenizer that chops up document text into words. |
Constructor Summary | |
---|---|
DocumentReader()
Constructs a new DocumentReader without an initial input source. |
|
DocumentReader(Reader in)
Constructs a new DocumentReader using a PTBTokenizerFactory and keeps the original text. |
|
DocumentReader(Reader in,
TokenizerFactory tokenizerFactory,
boolean keepOriginalText)
Constructs a new DocumentReader that will read text from the given Reader and tokenize it into words using the given Tokenizer. |
Method Summary | |
---|---|
static BufferedReader |
getBufferedReader(Reader in)
Wraps the given Reader in a BufferedReader or returns it directly if it is already a BufferedReader. |
boolean |
getKeepOriginalText()
Returns whether created documents will store their source text along with tokenized words. |
Reader |
getReader()
Returns the reader for the text input source of this DocumentReader. |
static Reader |
getReader(File file)
Returns a Reader that reads in the given file. |
static Reader |
getReader(InputStream in)
Returns a Reader that reads in the given InputStream. |
static Reader |
getReader(Object in)
Intelligently returns a Reader for a variety of input sources. |
static Reader |
getReader(String text)
Returns a Reader that reads in the given text. |
static Reader |
getReader(URL url)
Returns a Reader that reads in the given URL. |
TokenizerFactory |
getTokenizerFactory()
Returns the TokenizerFactory used to chop up text into words for the documents. |
protected Document |
parseDocumentText(String text)
Creates a new Document for the given text. |
Document |
readDocument()
Reads the next document's worth of text from the reader and turns it into a Document. |
protected String |
readNextDocumentText()
Reads the next document's worth of text from the reader. |
static String |
readText(Reader in)
Returns everything that can be read from the given Reader as a String. |
void |
setKeepOriginalText(boolean keepOriginalText)
Sets whether created documents should store their source text along with tokenized words. |
void |
setReader(Reader in)
Sets the reader from which to read and create documents. |
void |
setTokenizerFactory(TokenizerFactory tokenizerFactory)
Sets the TokenizerFactory used to chop up text into words for the documents. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
protected Reader in
protected TokenizerFactory tokenizerFactory
protected boolean keepOriginalText
Constructor Detail |
---|
public DocumentReader()
Constructs a new DocumentReader without an initial input source. Call setReader(java.io.Reader) before trying to read any documents. Uses a PTBTokenizer and keeps original text.
public DocumentReader(Reader in)
public DocumentReader(Reader in, TokenizerFactory tokenizerFactory, boolean keepOriginalText)
Method Detail |
---|
public Reader getReader()
public void setReader(Reader in)
public TokenizerFactory getTokenizerFactory()
public void setTokenizerFactory(TokenizerFactory tokenizerFactory)
public boolean getKeepOriginalText()
public void setKeepOriginalText(boolean keepOriginalText)
public Document readDocument() throws IOException
Reads the next document's worth of text from the reader and turns it into a Document. Calls readNextDocumentText() and passes the result to parseDocumentText(java.lang.String) to create the document. Subclasses may override either or both of those methods to handle custom formats of document collections and individual documents, respectively. This method can also be overridden in its entirety to provide custom reading and construction of documents from input text.
Throws: IOException
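The two-step structure of readDocument() can be illustrated with a small standalone sketch. It does not use the Stanford classes: SketchReader and LinePerDocumentReader are hypothetical names, a "document" is modeled as a plain List<String> of whitespace-separated tokens, and returning null at end of input is an assumption of the sketch rather than documented behavior.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for DocumentReader; not the Stanford implementation.
class SketchReader {
  protected Reader in;

  SketchReader(Reader in) {
    this.in = in;
  }

  // Template method: fetch the raw text, then parse it.
  // Assumption: returns null once the input is exhausted.
  public List<String> readDocument() throws IOException {
    String text = readNextDocumentText();
    return text == null ? null : parseDocumentText(text);
  }

  // Default: everything left in the reader is one document.
  protected String readNextDocumentText() throws IOException {
    StringBuilder sb = new StringBuilder();
    char[] buf = new char[4096];
    int n;
    while ((n = in.read(buf)) != -1) {
      sb.append(buf, 0, n);
    }
    return sb.length() == 0 ? null : sb.toString();
  }

  // Default: split on whitespace (a real reader would use a tokenizer).
  protected List<String> parseDocumentText(String text) {
    return Arrays.asList(text.trim().split("\\s+"));
  }
}

// Custom collection format: one document per line.
// Only the "where does a document end" step is overridden.
class LinePerDocumentReader extends SketchReader {
  private final BufferedReader lines;

  LinePerDocumentReader(Reader in) {
    super(in);
    this.lines = new BufferedReader(in);
  }

  @Override
  protected String readNextDocumentText() throws IOException {
    return lines.readLine(); // null at end of input
  }
}
```

With input "a b\nc d e\n", successive calls to readDocument() on a LinePerDocumentReader yield [a, b], then [c, d, e], then null. Overriding parseDocumentText instead would change how each document is built without changing how documents are delimited.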
protected String readNextDocumentText() throws IOException
Throws: IOException
protected Document parseDocumentText(String text)
public static BufferedReader getBufferedReader(Reader in)
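The method summary describes getBufferedReader as wrapping the given Reader in a BufferedReader, or returning it directly if it is already one. A minimal standalone sketch of that behavior follows; BufferedReaderSketch is a hypothetical class name, and the null handling is an assumption, not something this page documents.

```java
import java.io.BufferedReader;
import java.io.Reader;

// Hypothetical helper illustrating the wrap-or-return behavior.
final class BufferedReaderSketch {
  static BufferedReader getBufferedReader(Reader in) {
    if (in == null) {
      return null; // assumption: pass null through rather than throw
    }
    if (in instanceof BufferedReader) {
      return (BufferedReader) in; // already buffered: avoid double wrapping
    }
    return new BufferedReader(in);
  }
}
```

Returning an existing BufferedReader unchanged avoids stacking a second buffer on top of the first.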
public static String readText(Reader in) throws IOException
Throws: IOException
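According to the method summary, readText returns everything that can be read from the given Reader as a String. One way to do that with only java.io is sketched below; ReadTextSketch is a hypothetical class, and the buffer size is an arbitrary choice.

```java
import java.io.IOException;
import java.io.Reader;

// Hypothetical helper: drain a Reader into a String.
final class ReadTextSketch {
  static String readText(Reader in) throws IOException {
    StringBuilder sb = new StringBuilder();
    char[] buf = new char[8192];
    int n;
    // read(char[]) returns -1 at end of stream
    while ((n = in.read(buf)) != -1) {
      sb.append(buf, 0, n);
    }
    return sb.toString();
  }
}
```

Reading in chunks rather than line by line keeps the original line terminators intact, which matters if the source text is to be stored alongside the tokens.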
public static Reader getReader(String text)
public static Reader getReader(File file) throws FileNotFoundException
Throws: FileNotFoundException
public static Reader getReader(URL url) throws IOException
Throws: IOException
public static Reader getReader(InputStream in)
public static Reader getReader(Object in) throws FileNotFoundException, IOException
Intelligently returns a Reader for a variety of input sources (see, for example, getReader(String)). If in is not one of the supported input types, null is returned.
Throws: FileNotFoundException, IOException
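A type-dispatching getReader(Object) could look roughly like the sketch below. The set of supported types is an assumption inferred from the getReader overloads listed in the method summary (String, File, URL, InputStream), as is passing an existing Reader through unchanged; GetReaderSketch is a hypothetical class, not the Stanford code.

```java
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.net.URL;

// Hypothetical helper: choose a Reader based on the runtime type of the input.
final class GetReaderSketch {
  static Reader getReader(Object in) throws IOException {
    if (in instanceof Reader) {
      return (Reader) in; // assumption: an existing Reader is returned as-is
    }
    if (in instanceof String) {
      return new StringReader((String) in); // the string is the text itself
    }
    if (in instanceof File) {
      return new FileReader((File) in); // may throw FileNotFoundException
    }
    if (in instanceof URL) {
      return new InputStreamReader(((URL) in).openStream());
    }
    if (in instanceof InputStream) {
      return new InputStreamReader((InputStream) in); // platform default charset
    }
    return null; // unsupported input type
  }
}
```

Because null signals an unsupported type, callers should check the result before reading from it. FileNotFoundException is a subclass of IOException, so the sketch declares only the latter.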