DocumentReader (Stanford JavaNLP API)

edu.stanford.nlp.ling
Class DocumentReader

java.lang.Object
  edu.stanford.nlp.ling.DocumentReader

public class DocumentReader
extends Object

Basic mechanism for reading in Documents from various input sources. This default implementation can read from strings, files, URLs, and InputStreams, and can use a given Tokenizer to turn the text into words. When working with a new data format, write a new DocumentReader subclass to parse it and use it with the existing Document APIs, rather than writing new Document classes. Use the protected fields (in, tokenizerFactory, keepOriginalText) to read text and create documents appropriately. Subclasses should ideally provide constructors similar to those of this class, though only the constructor that takes a Reader is required.

Author: Joseph Smarr (jsmarr@stanford.edu)

Field Summary
`protected Reader`	`in` Reader used to read in document text.
`protected boolean`	`keepOriginalText` Whether to keep source text in document along with tokenized words.
`protected TokenizerFactory`	`tokenizerFactory` Tokenizer used to chop up document text into words.

Constructor Summary
`DocumentReader()` Constructs a new DocumentReader without an initial input source.
`DocumentReader(Reader in)` Constructs a new DocumentReader using a PTBTokenizerFactory and keeps the original text.
`DocumentReader(Reader in, TokenizerFactory tokenizerFactory, boolean keepOriginalText)` Constructs a new DocumentReader that will read text from the given Reader and tokenize it into words using the given Tokenizer.

Method Summary
`static BufferedReader`	`getBufferedReader(Reader in)` Wraps the given Reader in a BufferedReader or returns it directly if it is already a BufferedReader.
`boolean`	`getKeepOriginalText()` Returns whether created documents will store their source text along with tokenized words.
`Reader`	`getReader()` Returns the reader for the text input source of this DocumentReader.
`static Reader`	`getReader(File file)` Returns a Reader that reads in the given file.
`static Reader`	`getReader(InputStream in)` Returns a Reader that reads in the given InputStream.
`static Reader`	`getReader(Object in)` Intelligently returns a Reader for a variety of input sources.
`static Reader`	`getReader(String text)` Returns a Reader that reads in the given text.
`static Reader`	`getReader(URL url)` Returns a Reader that reads in the given URL.
`TokenizerFactory`	`getTokenizerFactory()` Returns the tokenizer used to chop up text into words for the documents.
`protected Document`	`parseDocumentText(String text)` Creates a new Document for the given text.
`Document`	`readDocument()` Reads the next document's worth of text from the reader and turns it into a Document.
`protected String`	`readNextDocumentText()` Reads the next document's worth of text from the reader.
`static String`	`readText(Reader in)` Returns everything that can be read from the given Reader as a String.
`void`	`setKeepOriginalText(boolean keepOriginalText)` Sets whether created documents should store their source text along with tokenized words.
`void`	`setReader(Reader in)` Sets the reader from which to read and create documents.
`void`	`setTokenizerFactory(TokenizerFactory tokenizerFactory)` Sets the tokenizer used to chop up text into words for the documents.

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

in

protected Reader in

Reader used to read in document text. In the default implementation, this is guaranteed to be a BufferedReader (so it can safely be cast down), but it is typed as Reader in case subclasses don't want it buffered for some reason.

tokenizerFactory

protected TokenizerFactory tokenizerFactory

Tokenizer used to chop up document text into words.

keepOriginalText

protected boolean keepOriginalText

Whether to keep source text in document along with tokenized words.

Constructor Detail

DocumentReader

public DocumentReader()

Constructs a new DocumentReader without an initial input source. You must call setReader(java.io.Reader) before trying to read any documents. Uses a PTBTokenizer and keeps original text.

DocumentReader

public DocumentReader(Reader in)

Constructs a new DocumentReader using a PTBTokenizerFactory and keeps the original text.

DocumentReader

public DocumentReader(Reader in,
                      TokenizerFactory tokenizerFactory,
                      boolean keepOriginalText)

Constructs a new DocumentReader that will read text from the given Reader and tokenize it into words using the given Tokenizer. The default implementation will internally buffer the reader if it is not already buffered, so there is no need to pre-wrap the reader with a BufferedReader. This class provides many getReader methods for conveniently reading from many input sources.

Method Detail

getReader

public Reader getReader()

Returns the reader for the text input source of this DocumentReader.

setReader

public void setReader(Reader in)

Sets the reader from which to read and create documents. The default implementation automatically buffers the Reader if it's not already buffered. Subclasses that don't want buffering may want to override this method to simply set the in field directly.

getTokenizerFactory

public TokenizerFactory getTokenizerFactory()

Returns the tokenizer used to chop up text into words for the documents.

setTokenizerFactory

public void setTokenizerFactory(TokenizerFactory tokenizerFactory)

Sets the tokenizer used to chop up text into words for the documents.

getKeepOriginalText

public boolean getKeepOriginalText()

Returns whether created documents will store their source text along with tokenized words.

setKeepOriginalText

public void setKeepOriginalText(boolean keepOriginalText)

Sets whether created documents should store their source text along with tokenized words.

readDocument

public Document readDocument()
                      throws IOException

Reads the next document's worth of text from the reader and turns it into a Document. Default implementation calls readNextDocumentText() and passes it to parseDocumentText(java.lang.String) to create the document. Subclasses may wish to override either or both of those methods to handle custom formats of document collections and individual documents respectively. This method can also be overridden in its entirety to provide custom reading and construction of documents from input text.
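
The flow described above is a template method: readDocument() stays fixed while subclasses swap in their own text reading or parsing. The following is a minimal standalone sketch of that pattern, not the library's source. MiniReader, LowercasingReader, the List<String> stand-in for Document, the whitespace tokenization, and returning null from readDocument() at end of input are all assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class ReadDocumentSketch {

    /** Hypothetical stand-in for DocumentReader; a List<String> plays the role of Document. */
    static class MiniReader {
        protected BufferedReader in;

        MiniReader(Reader in) {
            this.in = (in instanceof BufferedReader) ? (BufferedReader) in : new BufferedReader(in);
        }

        /** Fixed skeleton: read one document's worth of text, then parse it. */
        public List<String> readDocument() throws IOException {
            String text = readNextDocumentText();
            // Returning null when no text is left is an assumption of this sketch.
            return (text == null) ? null : parseDocumentText(text);
        }

        /** Default hook: everything left on the reader is one document. */
        protected String readNextDocumentText() throws IOException {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return (sb.length() == 0) ? null : sb.toString();
        }

        /** Default hook: split on whitespace (a simplification of real tokenization). */
        protected List<String> parseDocumentText(String text) {
            String trimmed = text.trim();
            if (trimmed.isEmpty()) {
                return new ArrayList<>();
            }
            return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
        }
    }

    /** Overrides only the parsing hook; the reading hook and skeleton are inherited. */
    static class LowercasingReader extends MiniReader {
        LowercasingReader(Reader in) {
            super(in);
        }

        @Override
        protected List<String> parseDocumentText(String text) {
            return super.parseDocumentText(text.toLowerCase(Locale.ROOT));
        }
    }

    public static void main(String[] args) throws IOException {
        MiniReader reader = new LowercasingReader(new StringReader("Hello World"));
        System.out.println(reader.readDocument()); // [hello, world]
        System.out.println(reader.readDocument()); // null: nothing left to read
    }
}
```

Overriding only one hook keeps the other default behavior intact, which is why the documentation suggests overriding readNextDocumentText() for collection formats and parseDocumentText(String) for per-document formats.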

Throws: IOException

readNextDocumentText

protected String readNextDocumentText()
                               throws IOException

Reads the next document's worth of text from the reader. Default implementation reads all the text. Subclasses wishing to read multiple documents from a single input source should read until the next document delimiter and return the text so far. Returns null if there is no more text to be read.
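
As a rough illustration of the multi-document case, the sketch below reads line by line until it reaches a delimiter and returns the text collected so far, returning null once the input is exhausted. It is a standalone stand-in rather than an actual DocumentReader subclass, and the class name DelimitedTextSketch and the "---" delimiter line are hypothetical choices for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class DelimitedTextSketch {
    /** Hypothetical delimiter: a line consisting only of three dashes. */
    static final String DELIMITER = "---";

    private final BufferedReader in;

    DelimitedTextSketch(Reader in) {
        // Buffer exactly once so that readLine() is available.
        this.in = (in instanceof BufferedReader) ? (BufferedReader) in : new BufferedReader(in);
    }

    /** Returns the text up to the next delimiter line, or null if no input remains. */
    String readNextDocumentText() throws IOException {
        StringBuilder text = new StringBuilder();
        boolean sawLine = false;
        String line;
        while ((line = in.readLine()) != null) {
            sawLine = true;
            if (line.equals(DELIMITER)) {
                break; // end of this document; the delimiter itself is dropped
            }
            text.append(line).append('\n');
        }
        return sawLine ? text.toString() : null;
    }

    public static void main(String[] args) throws IOException {
        DelimitedTextSketch reader =
                new DelimitedTextSketch(new StringReader("first doc\n---\nsecond doc\n"));
        String text;
        while ((text = reader.readNextDocumentText()) != null) {
            System.out.print("document: " + text);
        }
    }
}
```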

Throws: IOException

parseDocumentText

protected Document parseDocumentText(String text)

Creates a new Document for the given text. Default implementation tokenizes the text using the tokenizer provided during construction and sticks the words in a new BasicDocument. The text is also stored as the original text in the BasicDocument if keepOriginalText was set in the constructor. Subclasses may wish to extract additional information from the text and/or return another document subclass with additional meta-data.

getBufferedReader

public static BufferedReader getBufferedReader(Reader in)

Wraps the given Reader in a BufferedReader or returns it directly if it is already a BufferedReader. Subclasses should use this method before reading from in for efficiency and/or to read entire lines at a time. Note that this should only be done once per reader: a buffered reader reads more than necessary and stores the rest, so if you then throw that buffered reader out and get a new one for the original reader, text will be missing. In the default DocumentReader implementation, the Reader passed in at construction is already wrapped in a buffered reader, so you can simply cast the in field down to a BufferedReader without calling this method.
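
The warning about wrapping a reader only once can be reproduced with the JDK alone. The sketch below uses a hypothetical helper, bufferOnce, that mirrors the documented wrap-or-return behavior (its null handling is an assumption), and then shows text going missing when a second BufferedReader is created over the same underlying reader.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class BufferOnceSketch {

    /** Hypothetical helper: wrap in a BufferedReader unless the reader already is one. */
    static BufferedReader bufferOnce(Reader in) {
        if (in == null) {
            return null; // assumption of this sketch, not documented behavior
        }
        return (in instanceof BufferedReader) ? (BufferedReader) in : new BufferedReader(in);
    }

    public static void main(String[] args) throws IOException {
        Reader raw = new StringReader("line one\nline two\n");
        BufferedReader buffered = bufferOnce(raw);

        // Wrapping again returns the same object, so no buffered text is lost.
        System.out.println(bufferOnce(buffered) == buffered); // true

        System.out.println(buffered.readLine()); // line one

        // A second wrapper around the raw reader finds nothing: the first
        // wrapper already pulled the rest of this short input into its buffer.
        System.out.println(new BufferedReader(raw).readLine()); // null

        // The original wrapper still holds the remaining text.
        System.out.println(buffered.readLine()); // line two
    }
}
```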

readText

public static String readText(Reader in)
                       throws IOException

Returns everything that can be read from the given Reader as a String. Returns null if the given Reader is null.
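
For readers unfamiliar with the idiom, draining a Reader into a String typically looks like the sketch below. This is an illustrative equivalent written against the JDK only, not the library's implementation; the name readAll and the 4096-character chunk size are arbitrary choices.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class ReadTextSketch {

    /** Hypothetical equivalent of readText: drains the reader, or returns null for null. */
    static String readAll(Reader in) throws IOException {
        if (in == null) {
            return null;
        }
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        // read(char[]) returns -1 once the end of the stream is reached.
        while ((n = in.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readAll(new StringReader("some text\nmore text")));
        System.out.println(readAll(null)); // null
    }
}
```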

Throws: IOException

getReader

public static Reader getReader(String text)

Returns a Reader that reads in the given text.

getReader

public static Reader getReader(File file)
                        throws FileNotFoundException

Returns a Reader that reads in the given file.

Throws: FileNotFoundException

getReader

public static Reader getReader(URL url)
                        throws IOException

Returns a Reader that reads in the given URL.

Throws: IOException

getReader

public static Reader getReader(InputStream in)

Returns a Reader that reads in the given InputStream.

getReader

public static Reader getReader(Object in)
                        throws FileNotFoundException,
                               IOException

Intelligently returns a Reader for a variety of input sources. If in is a File, String, URL, InputStream, or Reader, a Reader is opened for it using one of the other static getReader methods of this class. NOTE: If in is a String, it is treated as a file name, not as a text source as with getReader(String). If in is not one of these types, null is returned.
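
The type-based dispatch described above might be sketched as follows. This is a hypothetical reconstruction using only JDK classes, not the library's code: the name open, returning a Reader argument unchanged, and the use of the platform default charset are all assumptions. Note how the String branch opens a file rather than reading the string's own characters.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.Writer;
import java.net.URL;

final class OpenAnySketch {

    /** Hypothetical dispatch on the runtime type of the input source. */
    static Reader open(Object in) throws IOException {
        if (in instanceof Reader) {
            return (Reader) in; // assumption: an existing Reader is passed through
        }
        if (in instanceof File) {
            return new FileReader((File) in);
        }
        if (in instanceof String) {
            // A String is treated as a file name, not as the text to read.
            return new FileReader(new File((String) in));
        }
        if (in instanceof URL) {
            return new InputStreamReader(((URL) in).openStream());
        }
        if (in instanceof InputStream) {
            return new InputStreamReader((InputStream) in);
        }
        return null; // unsupported type
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("doc", ".txt");
        tmp.deleteOnExit();
        try (Writer w = new FileWriter(tmp)) {
            w.write("from a file");
        }
        // Passing the path as a String opens the file it names.
        try (BufferedReader r = new BufferedReader(open(tmp.getPath()))) {
            System.out.println(r.readLine()); // from a file
        }
        System.out.println(open(Integer.valueOf(42))); // null
    }
}
```

Callers that hold text in a String should therefore use getReader(String) directly rather than routing it through the Object overload.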

Throws: FileNotFoundException, IOException

Stanford NLP Group