edu.stanford.nlp.ling
Class DocumentReader

java.lang.Object
  extended by edu.stanford.nlp.ling.DocumentReader

public class DocumentReader
extends Object

Basic mechanism for reading in Documents from various input sources. This default implementation can read from strings, files, URLs, and InputStreams, and uses a given TokenizerFactory to turn the text into words. When working with a new data format, write a new DocumentReader subclass to parse it and use it with the existing Document APIs, rather than writing new Document classes. Use the protected fields (in, tokenizerFactory, keepOriginalText) to read text and create documents appropriately. Subclasses should ideally provide constructors similar to those of this class, though only the constructor that takes a Reader is required.

Author:
Joseph Smarr (jsmarr@stanford.edu)

Field Summary
protected  Reader in
          Reader used to read in document text.
protected  boolean keepOriginalText
          Whether to keep source text in document along with tokenized words.
protected  TokenizerFactory tokenizerFactory
          TokenizerFactory that supplies the tokenizer used to split document text into words.
 
Constructor Summary
DocumentReader()
          Constructs a new DocumentReader without an initial input source.
DocumentReader(Reader in)
          Constructs a new DocumentReader using a PTBTokenizerFactory and keeps the original text.
DocumentReader(Reader in, TokenizerFactory tokenizerFactory, boolean keepOriginalText)
          Constructs a new DocumentReader that will read text from the given Reader and tokenize it into words using a tokenizer from the given TokenizerFactory.
 
Method Summary
static BufferedReader getBufferedReader(Reader in)
          Wraps the given Reader in a BufferedReader or returns it directly if it is already a BufferedReader.
 boolean getKeepOriginalText()
          Returns whether created documents will store their source text along with tokenized words.
 Reader getReader()
          Returns the reader for the text input source of this DocumentReader.
static Reader getReader(File file)
          Returns a Reader that reads in the given file.
static Reader getReader(InputStream in)
          Returns a Reader that reads in the given InputStream.
static Reader getReader(Object in)
          Intelligently returns a Reader for a variety of input sources.
static Reader getReader(String text)
          Returns a Reader that reads in the given text.
static Reader getReader(URL url)
          Returns a Reader that reads in the given URL.
 TokenizerFactory getTokenizerFactory()
          Returns the tokenizer factory used to split text into words for the documents.
protected  Document parseDocumentText(String text)
          Creates a new Document for the given text.
 Document readDocument()
          Reads the next document's worth of text from the reader and turns it into a Document.
protected  String readNextDocumentText()
          Reads the next document's worth of text from the reader.
static String readText(Reader in)
          Returns everything that can be read from the given Reader as a String.
 void setKeepOriginalText(boolean keepOriginalText)
          Sets whether created documents should store their source text along with tokenized words.
 void setReader(Reader in)
          Sets the reader from which to read and create documents.
 void setTokenizerFactory(TokenizerFactory tokenizerFactory)
          Sets the tokenizer factory used to split text into words for the documents.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

in

protected Reader in
Reader used to read in document text. In the default implementation this is guaranteed to be a BufferedReader, so it can safely be cast down. It is typed as Reader in case subclasses do not want it buffered.


tokenizerFactory

protected TokenizerFactory tokenizerFactory
TokenizerFactory that supplies the tokenizer used to split document text into words.


keepOriginalText

protected boolean keepOriginalText
Whether to keep source text in document along with tokenized words.

Constructor Detail

DocumentReader

public DocumentReader()
Constructs a new DocumentReader without an initial input source. setReader(java.io.Reader) must be called before trying to read any documents. Uses a PTBTokenizer and keeps the original text.


DocumentReader

public DocumentReader(Reader in)
Constructs a new DocumentReader using a PTBTokenizerFactory and keeps the original text.


DocumentReader

public DocumentReader(Reader in,
                      TokenizerFactory tokenizerFactory,
                      boolean keepOriginalText)
Constructs a new DocumentReader that will read text from the given Reader and tokenize it into words using a tokenizer from the given TokenizerFactory. The default implementation will internally buffer the reader if it is not already buffered, so there is no need to pre-wrap the reader with a BufferedReader. This class provides several static getReader methods for conveniently reading from many input sources.

Method Detail

getReader

public Reader getReader()
Returns the reader for the text input source of this DocumentReader.


setReader

public void setReader(Reader in)
Sets the reader from which to read and create documents. The default implementation automatically buffers the Reader if it is not already buffered. Subclasses that do not want buffering may override this method to set the in field directly.


getTokenizerFactory

public TokenizerFactory getTokenizerFactory()
Returns the tokenizer factory used to split text into words for the documents.


setTokenizerFactory

public void setTokenizerFactory(TokenizerFactory tokenizerFactory)
Sets the tokenizer factory used to split text into words for the documents.


getKeepOriginalText

public boolean getKeepOriginalText()
Returns whether created documents will store their source text along with tokenized words.


setKeepOriginalText

public void setKeepOriginalText(boolean keepOriginalText)
Sets whether created documents should store their source text along with tokenized words.


readDocument

public Document readDocument()
                      throws IOException
Reads the next document's worth of text from the reader and turns it into a Document. Default implementation calls readNextDocumentText() and passes it to parseDocumentText(java.lang.String) to create the document. Subclasses may wish to override either or both of those methods to handle custom formats of document collections and individual documents respectively. This method can also be overridden in its entirety to provide custom reading and construction of documents from input text.

Throws:
IOException
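
The control flow described for readDocument() is a template method with two hooks. The following is a minimal stand-alone sketch of that flow using only the JDK. TemplateReaderSketch and LowerCaseReaderSketch are hypothetical stand-ins, not the Stanford classes; a "document" here is just a list of words, whitespace splitting replaces the real tokenizer, and returning null when no text is left is an assumption of this sketch.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in mirroring the readDocument() control flow described
// above. It is not the Stanford class; a "document" is just a word list.
class TemplateReaderSketch {
    protected final Reader in;

    TemplateReaderSketch(Reader in) {
        this.in = in;
    }

    // Template method: fetch the next chunk of text, then parse it.
    // Assumption of this sketch: no remaining text yields a null document.
    public List<String> readDocument() throws IOException {
        String text = readNextDocumentText();
        return text == null ? null : parseDocumentText(text);
    }

    // Hook 1: reads everything that is left; null once the input is drained.
    protected String readNextDocumentText() throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = in.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.length() == 0 ? null : sb.toString();
    }

    // Hook 2: whitespace splitting stands in for a real tokenizer.
    protected List<String> parseDocumentText(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? List.of() : Arrays.asList(trimmed.split("\\s+"));
    }
}

// A subclass customizes only the parsing hook and inherits the reading logic.
class LowerCaseReaderSketch extends TemplateReaderSketch {
    LowerCaseReaderSketch(Reader in) {
        super(in);
    }

    @Override
    protected List<String> parseDocumentText(String text) {
        return super.parseDocumentText(text.toLowerCase());
    }

    public static void main(String[] args) throws IOException {
        TemplateReaderSketch reader = new LowerCaseReaderSketch(new StringReader("Hello  World"));
        System.out.println(reader.readDocument()); // [hello, world]
    }
}
```

Overriding a single hook keeps the other default behavior intact, which is the reason the description suggests overriding readDocument() in its entirety only when both steps need to change together.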

readNextDocumentText

protected String readNextDocumentText()
                               throws IOException
Reads the next document's worth of text from the reader. Default implementation reads all the text. Subclasses wishing to read multiple documents from a single input source should read until the next document delimiter and return the text so far. Returns null if there is no more text to be read.

Throws:
IOException
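
Reading several documents from one input source, as described above, amounts to consuming lines up to the next delimiter and returning null once nothing is left. The sketch below shows one way this could look with only the JDK. DelimitedTextSketch and the "---" delimiter line are hypothetical choices for illustration, and the method is written as a static helper here rather than as an override of the real protected method.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical helper, not part of the Stanford API.
class DelimitedTextSketch {
    // Assumed delimiter: a line consisting only of three dashes.
    static final String DELIMITER = "---";

    // Returns the text up to the next delimiter line, or null at end of input.
    static String readNextDocumentText(BufferedReader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        boolean sawLine = false;
        String line;
        while ((line = in.readLine()) != null) {
            sawLine = true;
            if (line.equals(DELIMITER)) {
                break; // end of the current document
            }
            sb.append(line).append('\n');
        }
        // No line could be read at all: the input is exhausted.
        return sawLine ? sb.toString() : null;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(
                new StringReader("first doc\n---\nsecond doc\n"));
        String text;
        while ((text = readNextDocumentText(in)) != null) {
            System.out.print("[" + text + "]");
        }
    }
}
```

The same BufferedReader must be reused across calls, since text that one buffered wrapper has read ahead is not visible to another.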

parseDocumentText

protected Document parseDocumentText(String text)
Creates a new Document for the given text. The default implementation tokenizes the text using the tokenizer provided during construction and stores the words in a new BasicDocument. The text is also stored as the original text in the BasicDocument if keepOriginalText was set in the constructor. Subclasses may wish to extract additional information from the text and/or return another document subclass with additional metadata.
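
To illustrate the effect of keepOriginalText on the parsing step, here is a small stand-alone sketch. ParseSketch and SimpleDoc are hypothetical stand-ins for this class and BasicDocument, and whitespace splitting stands in for the configured tokenizer; the real classes may store the text differently.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins, not the Stanford classes.
class ParseSketch {
    // Minimal document: the tokens plus, optionally, the source text.
    static final class SimpleDoc {
        final List<String> words;
        final String originalText; // null when the source text is not kept

        SimpleDoc(List<String> words, String originalText) {
            this.words = words;
            this.originalText = originalText;
        }
    }

    // Tokenizes the text and keeps the source only when asked to.
    static SimpleDoc parseDocumentText(String text, boolean keepOriginalText) {
        String trimmed = text.trim();
        List<String> words = trimmed.isEmpty()
                ? List.of()
                : Arrays.asList(trimmed.split("\\s+"));
        return new SimpleDoc(words, keepOriginalText ? text : null);
    }

    public static void main(String[] args) {
        SimpleDoc doc = parseDocumentText("to be or not", true);
        System.out.println(doc.words + " / " + doc.originalText);
    }
}
```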


getBufferedReader

public static BufferedReader getBufferedReader(Reader in)
Wraps the given Reader in a BufferedReader or returns it directly if it is already a BufferedReader. Subclasses should use this method before reading from in for efficiency and/or to read entire lines at a time. Note that this should only be done once per reader: a buffered reader reads more than necessary from the underlying reader and stores the rest, so if that buffered reader is discarded and a new one is created for the original reader, text will be missing. In the default DocumentReader implementation, the Reader passed in at construction is already wrapped in a buffered reader, so in can simply be cast down to a BufferedReader without calling this method.
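
The wrap-once rule can be demonstrated with only the JDK. The sketch below assumes the wrapping logic is a plain instanceof check, which matches the description above but is not taken from the Stanford source; BufferOnceSketch is a hypothetical class name.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical stand-alone demonstration, not the Stanford implementation.
class BufferOnceSketch {
    // Returns the reader itself if already buffered, otherwise wraps it.
    static BufferedReader getBufferedReader(Reader in) {
        return (in instanceof BufferedReader)
                ? (BufferedReader) in
                : new BufferedReader(in);
    }

    public static void main(String[] args) throws IOException {
        Reader raw = new StringReader("line one\nline two\n");

        BufferedReader first = getBufferedReader(raw);
        System.out.println(first.readLine()); // line one

        // Asking again for an already buffered reader returns the same object.
        System.out.println(getBufferedReader(first) == first); // true

        // Pitfall: the first wrapper has already pulled the remaining text
        // into its buffer, so a second wrapper around raw finds nothing.
        BufferedReader second = new BufferedReader(raw);
        System.out.println(second.readLine()); // null
    }
}
```

The second line of input is still available through first; it is only invisible to any new wrapper created around the original reader.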


readText

public static String readText(Reader in)
                       throws IOException
Returns everything that can be read from the given Reader as a String. Returns null if the given Reader is null.

Throws:
IOException

getReader

public static Reader getReader(String text)
Returns a Reader that reads in the given text.


getReader

public static Reader getReader(File file)
                        throws FileNotFoundException
Returns a Reader that reads in the given file.

Throws:
FileNotFoundException

getReader

public static Reader getReader(URL url)
                        throws IOException
Returns a Reader that reads in the given URL.

Throws:
IOException

getReader

public static Reader getReader(InputStream in)
Returns a Reader that reads in the given InputStream.


getReader

public static Reader getReader(Object in)
                        throws FileNotFoundException,
                               IOException
Intelligently returns a Reader for a variety of input sources. If in is a File, String, URL, InputStream, or Reader, a Reader is opened for it using one of the other static getReader methods of this class. NOTE: If in is a String it is treated as a file name, not as a text source as with getReader(String). If in is not one of these types, null is returned.

Throws:
FileNotFoundException
IOException
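
The type-based dispatch described for getReader(Object) could be sketched as follows. This is a stand-alone illustration using only the JDK, not the Stanford source: ReaderDispatchSketch is a hypothetical class, and the use of FileReader and InputStreamReader with the platform default charset is an assumption.

```java
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.net.URL;

// Hypothetical stand-alone sketch of dispatching on the input's runtime type.
class ReaderDispatchSketch {
    static Reader getReader(Object in) throws IOException {
        if (in instanceof Reader) {
            return (Reader) in;                       // already a Reader
        }
        if (in instanceof File) {
            return new FileReader((File) in);         // may throw FileNotFoundException
        }
        if (in instanceof String) {
            return new FileReader((String) in);       // a file name, NOT the text itself
        }
        if (in instanceof URL) {
            return new InputStreamReader(((URL) in).openStream());
        }
        if (in instanceof InputStream) {
            return new InputStreamReader((InputStream) in);
        }
        return null;                                  // unsupported type
    }

    public static void main(String[] args) throws IOException {
        Reader text = new StringReader("some text");
        System.out.println(getReader(text) == text);  // true
        System.out.println(getReader(42));            // null
    }
}
```

Because a String argument is interpreted as a file name here, literal text should be wrapped in a Reader first (for example with getReader(String) or a StringReader) before being passed through the Object overload.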


Stanford NLP Group