BasicDocument (Stanford JavaNLP API)

edu.stanford.nlp.ling
Class BasicDocument

java.lang.Object
  java.util.AbstractCollection<E>
      java.util.AbstractList<E>
          java.util.ArrayList
              edu.stanford.nlp.ling.BasicDocument

All Implemented Interfaces:: Datum, Document, Featurizable, Labeled, Serializable, Cloneable, Iterable, Collection, List, RandomAccess

public class BasicDocument
extends ArrayList
implements Document

Basic implementation of Document that should be suitable for most needs. BasicDocument is an ArrayList for storing words and performs tokenization during construction. Override parse(String) to provide support for custom document formats or to do a custom job of tokenization. BasicDocument should only be used for documents that are small enough to store in memory.

The easiest way to use BasicDocuments is to construct them and call an init method in the same line (we use init methods instead of constructors because they're inherited and allow subclasses to have other more specific constructors). For example, to read in a File called file and tokenize it, you can call

Document doc=new BasicDocument().init(file);
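
The reasoning behind init methods can be illustrated with a small self-contained sketch. It does not use the Stanford classes: MiniDocument, NumberedMiniDocument, and InitChainingDemo are hypothetical stand-ins, and whitespace splitting is assumed in place of the real tokenizer. The point is only that an init method returning this is inherited by subclasses, whereas a constructor would not be.

```java
import java.util.ArrayList;
import java.util.Arrays;

// Hypothetical stand-in for BasicDocument: a list of words filled in by init.
class MiniDocument extends ArrayList<String> {
    protected String title = "";

    // Returns this, so "new MiniDocument().init(...)" fits on one line.
    public MiniDocument init(String text, String title) {
        this.title = (title == null) ? "" : title;
        clear();
        parse(text == null ? "" : text);
        return this;
    }

    public MiniDocument init(String text) {
        return init(text, null);
    }

    // Assumption: whitespace splitting stands in for the real tokenizer.
    protected void parse(String text) {
        String trimmed = text.trim();
        if (!trimmed.isEmpty()) {
            addAll(Arrays.asList(trimmed.split("\\s+")));
        }
    }

    public String title() {
        return title;
    }
}

// The subclass declares its own constructor yet inherits every init overload.
class NumberedMiniDocument extends MiniDocument {
    private final int number;

    NumberedMiniDocument(int number) {
        this.number = number;
    }

    int getNumber() {
        return number;
    }
}

class InitChainingDemo {
    public static void main(String[] args) {
        MiniDocument doc = new NumberedMiniDocument(7).init("the quick brown fox", "demo");
        System.out.println(doc.size() + " words, title=" + doc.title());
    }
}
```

Had init been a constructor, NumberedMiniDocument would have needed to redeclare each overload itself.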

Author:: Joseph Smarr (jsmarr@stanford.edu)
See Also:: Serialized Form

Field Summary
`protected List`	`labels` Label(s) for this document.
`protected String`	`originalText` original text of this document (may be null).
`protected String`	`title` title of this document (never null).
`protected TokenizerFactory`	`tokenizerFactory` TokenizerFactory used to convert the text into words inside `parse(String)`.

Fields inherited from class java.util.AbstractList
`modCount`

Constructor Summary
`BasicDocument()` Constructs a new (empty) BasicDocument using a `PTBTokenizer`.
`BasicDocument(Collection d)`
`BasicDocument(Document d)`
`BasicDocument(TokenizerFactory tokenizerFactory)` Constructs a new (empty) BasicDocument using the given tokenizer.

Method Summary
`void`	`addLabel(Object label)` Adds the given label to the List of labels for this Document if it is not null.
`Collection`	`asFeatures()` Returns `this` (the features are the list of words).
`Document`	`blankDocument()` Returns a new empty BasicDocument with the same title, labels, and tokenizer as this Document.
`BasicDocument`	`init()` Calls init((String)null,null,true)
`BasicDocument`	`init(File textFile)` Calls init(textFile,textFile.getCanonicalPath(),true)
`BasicDocument`	`init(File textFile, boolean keepOriginalText)` Calls init(textFile,textFile.getCanonicalPath(),keepOriginalText)
`BasicDocument`	`init(File textFile, String title)` Calls init(textFile,title,true)
`BasicDocument`	`init(File textFile, String title, boolean keepOriginalText)` Inits a new BasicDocument by reading in the text from the given File.
`BasicDocument`	`init(List words)` Calls init(words,null)
`BasicDocument`	`init(List words, String title)` Inits a new BasicDocument with the given list of words and title.
`BasicDocument`	`init(Reader textReader)` Calls init(textReader,null,true)
`BasicDocument`	`init(Reader textReader, boolean keepOriginalText)` Calls init(textReader,null,keepOriginalText)
`BasicDocument`	`init(Reader textReader, String title)` Calls init(textReader,title,true)
`BasicDocument`	`init(Reader textReader, String title, boolean keepOriginalText)` Inits a new BasicDocument by reading in the text from the given Reader.
`BasicDocument`	`init(String text)` Calls init(text,null,true)
`BasicDocument`	`init(String text, boolean keepOriginalText)` Calls init(text,null,keepOriginalText)
`BasicDocument`	`init(String text, String title)` Calls init(text,title,true)
`BasicDocument`	`init(String text, String title, boolean keepOriginalText)` Inits a new BasicDocument with the given text contents and title.
`BasicDocument`	`init(URL textURL)` Calls init(textURL,textURL.toExternalForm(),true)
`BasicDocument`	`init(URL textURL, boolean keepOriginalText)` Calls init(textURL,textURL.toExternalForm(),keepOriginalText)
`BasicDocument`	`init(URL textURL, String title)` Calls init(textURL,title,true)
`BasicDocument`	`init(URL textURL, String title, boolean keepOriginalText)` Inits a new BasicDocument by reading in the text from the given URL.
`Object`	`label()` Returns the first label for this Document, or null if none have been set.
`Collection`	`labels()` Returns the complete List of labels for this Document.
`static void`	`main(String[] args)` For internal debugging purposes only.
`String`	`originalText()` Returns the text originally used to construct this document, or null if there was no original text.
`protected void`	`parse(String text)` Tokenizes the given text to populate the list of words this Document represents.
`String`	`presentableText()` Returns a "pretty" version of the words in this Document suitable for display.
`static void`	`printState(BasicDocument bd)` For internal debugging purposes only.
`void`	`setLabel(Object label)` Removes all currently assigned labels for this Document then adds the given label.
`void`	`setLabels(Collection labels)` Removes all currently assigned labels for this Document then adds all of the given labels.
`void`	`setTitle(String title)` Sets the title of this Document to the given title.
`void`	`setTokenizerFactory(TokenizerFactory tokenizerFactory)` Sets the tokenizerFactory to be used by `parse(String)`.
`String`	`title()` Returns the title of this document.
`TokenizerFactory`	`tokenizerFactory()` Returns the current TokenizerFactory used by `parse(String)`.

Methods inherited from class java.util.ArrayList
`add, add, addAll, addAll, clear, clone, contains, ensureCapacity, get, indexOf, isEmpty, lastIndexOf, remove, remove, removeRange, set, size, toArray, toArray, trimToSize`

Methods inherited from class java.util.AbstractList
`equals, hashCode, iterator, listIterator, listIterator, subList`

Methods inherited from class java.util.AbstractCollection
`containsAll, removeAll, retainAll, toString`

Methods inherited from class java.lang.Object
`finalize, getClass, notify, notifyAll, wait, wait, wait`

Methods inherited from interface java.util.List
`add, add, addAll, addAll, clear, contains, containsAll, equals, get, hashCode, indexOf, isEmpty, iterator, lastIndexOf, listIterator, listIterator, remove, remove, removeAll, retainAll, set, size, subList, toArray, toArray`

Field Detail

title

protected String title

title of this document (never null).

originalText

protected String originalText

original text of this document (may be null).

labels

protected final List labels

Label(s) for this document.

tokenizerFactory

protected TokenizerFactory tokenizerFactory

TokenizerFactory used to convert the text into words inside parse(String).

Constructor Detail

BasicDocument

public BasicDocument()

Constructs a new (empty) BasicDocument using a PTBTokenizer. Call one of the init methods to populate the document from a desired source.

BasicDocument

public BasicDocument(TokenizerFactory tokenizerFactory)

Constructs a new (empty) BasicDocument using the given tokenizer. Call one of the init methods to populate the document from a desired source.

BasicDocument

public BasicDocument(Document d)

BasicDocument

public BasicDocument(Collection d)

Method Detail

init

public BasicDocument init(String text,
                          String title,
                          boolean keepOriginalText)

Inits a new BasicDocument with the given text contents and title. The text is tokenized using parse(String) to populate the list of words ("" is used if text is null). If keepOriginalText is true, a reference to the original text is also maintained so that the originalText() method returns the text given to this method. Returns a reference to this BasicDocument for convenience (so it's more like a constructor, but inherited).

init

public BasicDocument init(String text,
                          String title)

Calls init(text,title,true)

init

public BasicDocument init(String text,
                          boolean keepOriginalText)

Calls init(text,null,keepOriginalText)

init

public BasicDocument init(String text)

Calls init(text,null,true)

init

public BasicDocument init()

Calls init((String)null,null,true)

init

public BasicDocument init(Reader textReader,
                          String title,
                          boolean keepOriginalText)
                   throws IOException

Inits a new BasicDocument by reading in the text from the given Reader.

Throws:: IOException
See Also:: init(String,String,boolean)

init

public BasicDocument init(Reader textReader,
                          String title)
                   throws IOException

Calls init(textReader,title,true)

Throws:: IOException

init

public BasicDocument init(Reader textReader,
                          boolean keepOriginalText)
                   throws IOException

Calls init(textReader,null,keepOriginalText)

Throws:: IOException

init

public BasicDocument init(Reader textReader)
                   throws IOException

Calls init(textReader,null,true)

Throws:: IOException

init

public BasicDocument init(File textFile,
                          String title,
                          boolean keepOriginalText)
                   throws FileNotFoundException,
                          IOException

Inits a new BasicDocument by reading in the text from the given File.

Throws:: FileNotFoundException; IOException
See Also:: init(String,String,boolean)

init

public BasicDocument init(File textFile,
                          String title)
                   throws FileNotFoundException,
                          IOException

Calls init(textFile,title,true)

Throws:: FileNotFoundException; IOException

init

public BasicDocument init(File textFile,
                          boolean keepOriginalText)
                   throws FileNotFoundException,
                          IOException

Calls init(textFile,textFile.getCanonicalPath(),keepOriginalText)

Throws:: FileNotFoundException; IOException

init

public BasicDocument init(File textFile)
                   throws FileNotFoundException,
                          IOException

Calls init(textFile,textFile.getCanonicalPath(),true)

Throws:: FileNotFoundException; IOException
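
The delegation among the String, Reader, and File overloads can be sketched as follows. This is a self-contained approximation, not the library source: SourceSketchDocument is a hypothetical class, whitespace splitting stands in for the tokenizer, and the choice to store the text exactly as given when keepOriginalText is true is an assumption.

```java
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.Arrays;

// Hypothetical stand-in showing how the overloads funnel into one method.
class SourceSketchDocument extends ArrayList<String> {
    protected String title = "";
    protected String originalText;

    // Core overload: every other init ends up here.
    public SourceSketchDocument init(String text, String title, boolean keepOriginalText) {
        this.title = (title == null) ? "" : title;
        // Assumption: the text is kept exactly as given, or dropped.
        this.originalText = keepOriginalText ? text : null;
        clear();
        String trimmed = (text == null) ? "" : text.trim();
        if (!trimmed.isEmpty()) {
            addAll(Arrays.asList(trimmed.split("\\s+")));
        }
        return this;
    }

    // Reads the whole Reader into memory, then delegates.
    public SourceSketchDocument init(Reader textReader, String title, boolean keepOriginalText)
            throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[4096];
        int read;
        while ((read = textReader.read(buffer)) != -1) {
            sb.append(buffer, 0, read);
        }
        return init(sb.toString(), title, keepOriginalText);
    }

    // Opens the file, delegates to the Reader overload, and closes the file.
    public SourceSketchDocument init(File textFile, String title, boolean keepOriginalText)
            throws IOException {
        try (Reader reader = new FileReader(textFile)) {
            return init(reader, title, keepOriginalText);
        }
    }

    // The canonical path serves as the default title.
    public SourceSketchDocument init(File textFile) throws IOException {
        return init(textFile, textFile.getCanonicalPath(), true);
    }

    public String title() {
        return title;
    }

    public String originalText() {
        return originalText;
    }
}
```

Because the entire source is read into a String before tokenizing, this pattern only suits documents small enough to hold in memory, matching the class description.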

init

public BasicDocument init(URL textURL,
                          String title,
                          boolean keepOriginalText)
                   throws IOException

Inits a new BasicDocument by reading in the text from the given URL.

Throws:: IOException
See Also:: init(String,String,boolean)

init

public BasicDocument init(URL textURL,
                          String title)
                   throws FileNotFoundException,
                          IOException

Calls init(textURL,title,true)

Throws:: FileNotFoundException; IOException

init

public BasicDocument init(URL textURL,
                          boolean keepOriginalText)
                   throws FileNotFoundException,
                          IOException

Calls init(textURL,textURL.toExternalForm(),keepOriginalText)

Throws:: FileNotFoundException; IOException

init

public BasicDocument init(URL textURL)
                   throws FileNotFoundException,
                          IOException

Calls init(textURL,textURL.toExternalForm(),true)

Throws:: FileNotFoundException; IOException

init

public BasicDocument init(List words,
                          String title)

Inits a new BasicDocument with the given list of words and title.

init

public BasicDocument init(List words)

Calls init(words,null)

parse

protected void parse(String text)

Tokenizes the given text to populate the list of words this Document represents. The default implementation uses the current tokenizer and tokenizes the entirety of the text into words. Subclasses should override this method to parse documents in non-standard formats, and/or to pull the title of the document from the text. The given text may be empty ("") but will never be null. Subclasses may want to do additional processing and then just call super.parse.

See Also:: setTokenizerFactory(edu.stanford.nlp.objectbank.TokenizerFactory)
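
A subclass that pulls the title out of the text and then defers to the default tokenization might look like the sketch below. It is self-contained and uses hypothetical classes (ParseSketchDocument and TitleFirstLineDocument) rather than the Stanford ones; the "first line is the title" format and the whitespace tokenizer are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;

// Hypothetical base class: init hands a non-null string to parse.
class ParseSketchDocument extends ArrayList<String> {
    protected String title = "";

    public ParseSketchDocument init(String text) {
        clear();
        parse(text == null ? "" : text); // parse never sees null
        return this;
    }

    // Default behavior: tokenize the entire text (whitespace split assumed).
    protected void parse(String text) {
        String trimmed = text.trim();
        if (!trimmed.isEmpty()) {
            addAll(Arrays.asList(trimmed.split("\\s+")));
        }
    }

    public String title() {
        return title;
    }

    public void setTitle(String title) {
        this.title = (title == null) ? "" : title;
    }
}

// Custom format: the first line is the title, the rest is the body.
class TitleFirstLineDocument extends ParseSketchDocument {
    @Override
    protected void parse(String text) {
        int newline = text.indexOf('\n');
        if (newline >= 0) {
            setTitle(text.substring(0, newline).trim());
            text = text.substring(newline + 1);
        }
        // Do the extra processing first, then reuse the default tokenization.
        super.parse(text);
    }
}
```

With this sketch, new TitleFirstLineDocument().init("My Title\nhello world") yields a two-word document titled "My Title".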

asFeatures

public Collection asFeatures()

Returns this (the features are the list of words).

Specified by:: asFeatures in interface Featurizable

label

public Object label()

Returns the first label for this Document, or null if none have been set.

Specified by:: label in interface Labeled

labels

public Collection labels()

Returns the complete List of labels for this Document. This is an empty collection if none have been set.

Specified by:: labels in interface Labeled

setLabel

public void setLabel(Object label)

Removes all currently assigned labels for this Document then adds the given label. Calling setLabel(null) effectively clears all labels.

setLabels

public void setLabels(Collection labels)

Removes all currently assigned labels for this Document then adds all of the given labels.

addLabel

public void addLabel(Object label)

Adds the given label to the List of labels for this Document if it is not null.
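
The label semantics described above (first label or null, null labels ignored, setters that clear first) can be summarized in a short self-contained sketch. LabelSketch is a hypothetical class that models only this behavior; treating a null collection in setLabels as "no labels" is an assumption, since the documentation does not say.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical stand-in modeling only the documented label behavior.
class LabelSketch {
    protected final List<Object> labels = new ArrayList<>();

    // First label, or null if none have been set.
    public Object label() {
        return labels.isEmpty() ? null : labels.get(0);
    }

    // All labels; empty (never null) if none have been set.
    public Collection<Object> labels() {
        return labels;
    }

    // Null labels are silently ignored.
    public void addLabel(Object label) {
        if (label != null) {
            labels.add(label);
        }
    }

    // Clears, then adds; setLabel(null) therefore just clears.
    public void setLabel(Object label) {
        labels.clear();
        addLabel(label);
    }

    // Clears, then adds all. Assumption: a null collection means no labels.
    public void setLabels(Collection<?> newLabels) {
        labels.clear();
        if (newLabels != null) {
            labels.addAll(newLabels);
        }
    }
}
```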

title

public String title()

Returns the title of this document. The title may be empty ("") but will never be null.

Specified by:: title in interface Document

setTitle

public void setTitle(String title)

Sets the title of this Document to the given title. If the given title is null, sets the title to "".

tokenizerFactory

public TokenizerFactory tokenizerFactory()

Returns the current TokenizerFactory used by parse(String).

setTokenizerFactory

public void setTokenizerFactory(TokenizerFactory tokenizerFactory)

Sets the tokenizerFactory to be used by parse(String). Set this tokenizer before calling one of the init methods, because they will probably call parse. Note that the tokenizer can equivalently be passed in to the constructor.

See Also:: BasicDocument(TokenizerFactory)
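
Why the ordering matters can be seen in a small self-contained sketch. TokenizerSketchDocument is hypothetical, and a java.util.function.Function from text to tokens is assumed as a stand-in for TokenizerFactory, whose real interface is not reproduced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in: the tokenizer is consulted only while init runs.
class TokenizerSketchDocument extends ArrayList<String> {
    // Assumption: a plain function replaces TokenizerFactory.
    protected Function<String, List<String>> tokenizer;

    // Default tokenizer splits on whitespace.
    TokenizerSketchDocument() {
        this(text -> Arrays.asList(text.trim().split("\\s+")));
    }

    // Passing the tokenizer here is equivalent to calling setTokenizer.
    TokenizerSketchDocument(Function<String, List<String>> tokenizer) {
        this.tokenizer = tokenizer;
    }

    void setTokenizer(Function<String, List<String>> tokenizer) {
        this.tokenizer = tokenizer;
    }

    TokenizerSketchDocument init(String text) {
        clear();
        if (text != null && !text.trim().isEmpty()) {
            addAll(tokenizer.apply(text)); // uses whatever tokenizer is set now
        }
        return this;
    }
}
```

In this sketch, changing the tokenizer after init has no effect on words already stored; the document would have to be initialized again.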

blankDocument

public Document blankDocument()

Returns a new empty BasicDocument with the same title, labels, and tokenizer as this Document. This is useful when you want to make a new Document that's like the old document but can be filled with new text (e.g. if you're transforming the contents non-destructively).

Subclasses that want to preserve extra state should override this method and add the extra state to the new document before returning it. The new BasicDocument is created by calling getClass().newInstance() so it should be of the correct subclass, and thus you should be able to cast it down and add extra meta data directly. Note however that in the event an Exception is thrown on instantiation (e.g. if your subclass doesn't have a public empty constructor--it should btw!) then a new BasicDocument is used instead. Thus if you want to be paranoid (or some would say "correct") you should check that your instance is of the correct sub-type as follows (this example assumes the subclass is called NumberedDocument and it has the additional number property):

 Document blankDocument=super.blankDocument();
 if(blankDocument instanceof NumberedDocument) {
     ((NumberedDocument)blankDocument).setNumber(getNumber());
 }
 return blankDocument;

Specified by:: blankDocument in interface Document
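
The reflective instantiation and its fallback can be sketched in a self-contained form. The classes below (BlankSketchDocument, NumberedSketchDocument, NoDefaultCtorDocument) are hypothetical, and the sketch uses getDeclaredConstructor().newInstance() because Class.newInstance() is deprecated in current Java; the documented class is stated to call getClass().newInstance().

```java
import java.util.ArrayList;

// Hypothetical stand-in showing the reflective copy with a safe fallback.
class BlankSketchDocument extends ArrayList<String> {
    protected String title = "";

    public BlankSketchDocument() {
    }

    public BlankSketchDocument blankDocument() {
        BlankSketchDocument blank;
        try {
            // Instantiate the runtime class so subclasses get their own type.
            blank = getClass().getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            // No usable empty constructor: fall back to the base class.
            blank = new BlankSketchDocument();
        }
        blank.title = title;
        return blank;
    }

    public String title() {
        return title;
    }
}

// Subclass that carries extra state across blankDocument().
class NumberedSketchDocument extends BlankSketchDocument {
    private int number;

    public NumberedSketchDocument() {
    }

    public int getNumber() {
        return number;
    }

    public void setNumber(int number) {
        this.number = number;
    }

    @Override
    public BlankSketchDocument blankDocument() {
        BlankSketchDocument blank = super.blankDocument();
        // Check the type first: the fallback may have produced the base class.
        if (blank instanceof NumberedSketchDocument) {
            ((NumberedSketchDocument) blank).setNumber(getNumber());
        }
        return blank;
    }
}

// Subclass without an empty constructor, which triggers the fallback.
class NoDefaultCtorDocument extends BlankSketchDocument {
    NoDefaultCtorDocument(String required) {
    }
}
```

The instanceof check is what keeps the override safe when the fallback returns a plain base-class instance.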

originalText

public String originalText()

Returns the text originally used to construct this document, or null if there was no original text.

presentableText

public String presentableText()

Returns a "pretty" version of the words in this Document suitable for display. The default implementation returns each of the words in this Document separated by spaces. Specifically, each element that implements HasWord has its HasWord.word() printed, and other elements are skipped.

Subclasses that maintain additional information may wish to override this method.
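
The default behavior described above (print the word of each HasWord element, skip everything else, separate with spaces) can be sketched as follows. The code is self-contained: WordLike is a hypothetical one-method interface standing in for HasWord, and PresentableSketch is not part of the library.

```java
import java.util.List;

// Hypothetical stand-in for HasWord.
interface WordLike {
    String word();
}

class PresentableSketch {
    // Joins the words of WordLike elements with single spaces.
    static String presentableText(List<?> elements) {
        StringBuilder sb = new StringBuilder();
        boolean first = true;
        for (Object element : elements) {
            if (element instanceof WordLike) {
                if (!first) {
                    sb.append(' ');
                }
                sb.append(((WordLike) element).word());
                first = false;
            }
            // Elements that are not WordLike are skipped entirely.
        }
        return sb.toString();
    }
}
```

A subclass that tracks extra information, such as part-of-speech tags, could override the equivalent method to include that information in the output.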

main

public static void main(String[] args)

For internal debugging purposes only. Creates and tests various instances of BasicDocument.

printState

public static void printState(BasicDocument bd)
                       throws Exception

For internal debugging purposes only. Prints the state of the given BasicDocument to stderr.

Throws:: Exception

Stanford NLP Group
