edu.stanford.nlp.ling
Class BasicDocument

java.lang.Object
  extended by java.util.AbstractCollection<E>
      extended by java.util.AbstractList<E>
          extended by java.util.ArrayList
              extended by edu.stanford.nlp.ling.BasicDocument
All Implemented Interfaces:
Datum, Document, Featurizable, Labeled, Serializable, Cloneable, Iterable, Collection, List, RandomAccess

public class BasicDocument
extends ArrayList
implements Document

Basic implementation of Document that should be suitable for most needs. BasicDocument is an ArrayList for storing words and performs tokenization during construction. Override parse(String) to provide support for custom document formats or to do a custom job of tokenization. BasicDocument should only be used for documents that are small enough to store in memory.

The easiest way to use a BasicDocument is to construct it and call an init method on the same line. Init methods are used instead of constructors because they are inherited, which leaves subclasses free to define their own, more specific constructors. For example, to read in a file named file and tokenize it, call

Document doc = new BasicDocument().init(file);
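The reasoning behind init methods can be illustrated with a small stand-in that does not depend on the Stanford classes. This is only a sketch: SketchDocument and NumberedSketchDocument are hypothetical names, and whitespace splitting stands in for the real tokenizer. It shows a subclass that declares its own constructor and still inherits every init overload.

```java
import java.util.ArrayList;
import java.util.Arrays;

// Hypothetical stand-in for BasicDocument; not the Stanford class.
class SketchDocument extends ArrayList<String> {
    protected String title = "";

    // init returns this, so construction and population fit on one line.
    SketchDocument init(String text, String title) {
        this.title = (title == null) ? "" : title;
        clear();
        parse(text == null ? "" : text);
        return this;
    }

    SketchDocument init(String text) {
        return init(text, null);
    }

    // Assumption: whitespace splitting in place of a real tokenizer.
    protected void parse(String text) {
        String trimmed = text.trim();
        if (!trimmed.isEmpty()) {
            addAll(Arrays.asList(trimmed.split("\\s+")));
        }
    }
}

// The subclass has its own specific constructor and inherits the init overloads unchanged.
class NumberedSketchDocument extends SketchDocument {
    private final int number;

    NumberedSketchDocument(int number) {
        this.number = number;
    }

    int getNumber() {
        return number;
    }
}

class InitPatternDemo {
    public static void main(String[] args) {
        SketchDocument doc = new NumberedSketchDocument(7).init("the quick brown fox");
        System.out.println(doc.size() + " words: " + doc);
    }
}
```

Had the population logic lived in constructors, NumberedSketchDocument would have needed to redeclare one constructor per input source, since constructors are not inherited in Java.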

Author:
Joseph Smarr (jsmarr@stanford.edu)
See Also:
Serialized Form

Field Summary
protected  List labels
          Label(s) for this document.
protected  String originalText
          original text of this document (may be null).
protected  String title
          title of this document (never null).
protected  TokenizerFactory tokenizerFactory
          TokenizerFactory used to convert the text into words inside parse(String).
 
Fields inherited from class java.util.AbstractList
modCount
 
Constructor Summary
BasicDocument()
          Constructs a new (empty) BasicDocument using a PTBTokenizer.
BasicDocument(Collection d)
           
BasicDocument(Document d)
           
BasicDocument(TokenizerFactory tokenizerFactory)
          Constructs a new (empty) BasicDocument using the given tokenizer.
 
Method Summary
 void addLabel(Object label)
          Adds the given label to the List of labels for this Document if it is not null.
 Collection asFeatures()
          Returns this (the features are the list of words).
 Document blankDocument()
          Returns a new empty BasicDocument with the same title, labels, and tokenizer as this Document.
 BasicDocument init()
          Calls init((String)null,null,true)
 BasicDocument init(File textFile)
          Calls init(textFile,textFile.getCanonicalPath(),true)
 BasicDocument init(File textFile, boolean keepOriginalText)
          Calls init(textFile,textFile.getCanonicalPath(),keepOriginalText)
 BasicDocument init(File textFile, String title)
          Calls init(textFile,title,true)
 BasicDocument init(File textFile, String title, boolean keepOriginalText)
          Inits a new BasicDocument by reading in the text from the given File.
 BasicDocument init(List words)
          Calls init(words,null)
 BasicDocument init(List words, String title)
          Inits a new BasicDocument with the given list of words and title.
 BasicDocument init(Reader textReader)
          Calls init(textReader,null,true)
 BasicDocument init(Reader textReader, boolean keepOriginalText)
          Calls init(textReader,null,keepOriginalText)
 BasicDocument init(Reader textReader, String title)
          Calls init(textReader,title,true)
 BasicDocument init(Reader textReader, String title, boolean keepOriginalText)
          Inits a new BasicDocument by reading in the text from the given Reader.
 BasicDocument init(String text)
          Calls init(text,null,true)
 BasicDocument init(String text, boolean keepOriginalText)
          Calls init(text,null,keepOriginalText)
 BasicDocument init(String text, String title)
          Calls init(text,title,true)
 BasicDocument init(String text, String title, boolean keepOriginalText)
          Inits a new BasicDocument with the given text contents and title.
 BasicDocument init(URL textURL)
          Calls init(textURL,textURL.toExternalForm(),true)
 BasicDocument init(URL textURL, boolean keepOriginalText)
          Calls init(textURL,textURL.toExternalForm(),keepOriginalText)
 BasicDocument init(URL textURL, String title)
          Calls init(textURL,title,true)
 BasicDocument init(URL textURL, String title, boolean keepOriginalText)
          Inits a new BasicDocument by reading in the text from the given URL.
 Object label()
          Returns the first label for this Document, or null if none have been set.
 Collection labels()
          Returns the complete List of labels for this Document.
static void main(String[] args)
          For internal debugging purposes only.
 String originalText()
          Returns the text originally used to construct this document, or null if there was no original text.
protected  void parse(String text)
          Tokenizes the given text to populate the list of words this Document represents.
 String presentableText()
          Returns a "pretty" version of the words in this Document suitable for display.
static void printState(BasicDocument bd)
          For internal debugging purposes only.
 void setLabel(Object label)
          Removes all currently assigned labels for this Document then adds the given label.
 void setLabels(Collection labels)
          Removes all currently assigned labels for this Document then adds all of the given labels.
 void setTitle(String title)
          Sets the title of this Document to the given title.
 void setTokenizerFactory(TokenizerFactory tokenizerFactory)
          Sets the tokenizerFactory to be used by parse(String).
 String title()
          Returns the title of this document.
 TokenizerFactory tokenizerFactory()
          Returns the current TokenizerFactory used by parse(String).
 
Methods inherited from class java.util.ArrayList
add, add, addAll, addAll, clear, clone, contains, ensureCapacity, get, indexOf, isEmpty, lastIndexOf, remove, remove, removeRange, set, size, toArray, toArray, trimToSize
 
Methods inherited from class java.util.AbstractList
equals, hashCode, iterator, listIterator, listIterator, subList
 
Methods inherited from class java.util.AbstractCollection
containsAll, removeAll, retainAll, toString
 
Methods inherited from class java.lang.Object
finalize, getClass, notify, notifyAll, wait, wait, wait
 
Methods inherited from interface java.util.List
add, add, addAll, addAll, clear, contains, containsAll, equals, get, hashCode, indexOf, isEmpty, iterator, lastIndexOf, listIterator, listIterator, remove, remove, removeAll, retainAll, set, size, subList, toArray, toArray
 

Field Detail

title

protected String title
title of this document (never null).


originalText

protected String originalText
original text of this document (may be null).


labels

protected final List labels
Label(s) for this document.


tokenizerFactory

protected TokenizerFactory tokenizerFactory
TokenizerFactory used to convert the text into words inside parse(String).

Constructor Detail

BasicDocument

public BasicDocument()
Constructs a new (empty) BasicDocument using a PTBTokenizer. Call one of the init methods to populate the document from a desired source.


BasicDocument

public BasicDocument(TokenizerFactory tokenizerFactory)
Constructs a new (empty) BasicDocument using the given tokenizer. Call one of the init methods to populate the document from a desired source.


BasicDocument

public BasicDocument(Document d)

BasicDocument

public BasicDocument(Collection d)
Method Detail

init

public BasicDocument init(String text,
                          String title,
                          boolean keepOriginalText)
Inits a new BasicDocument with the given text contents and title. The text is tokenized using parse(String) to populate the list of words ("" is used if text is null). If keepOriginalText is true, a reference to the original text is also kept so that originalText() returns the text given to this method. Returns a reference to this BasicDocument for convenience (so init reads like a constructor, but is inherited).


init

public BasicDocument init(String text,
                          String title)
Calls init(text,title,true)


init

public BasicDocument init(String text,
                          boolean keepOriginalText)
Calls init(text,null,keepOriginalText)


init

public BasicDocument init(String text)
Calls init(text,null,true)


init

public BasicDocument init()
Calls init((String)null,null,true)
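The String overloads above all funnel into the three-argument form. A minimal sketch of that delegation follows, under stated assumptions: TextInitSketch is a hypothetical stand-in, whitespace splitting replaces the real tokenizer, and storing the text reference unchanged when keepOriginalText is true is an assumption about how the reference is kept.

```java
import java.util.ArrayList;

// Hypothetical stand-in; not the Stanford implementation.
class TextInitSketch extends ArrayList<String> {
    private String title = "";
    private String originalText;

    TextInitSketch init(String text, String title, boolean keepOriginalText) {
        this.title = (title == null) ? "" : title;           // title is never null
        this.originalText = keepOriginalText ? text : null;  // assumption: keep the reference as given
        clear();
        String safeText = (text == null) ? "" : text;        // "" is used if text is null
        for (String token : safeText.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                add(token);
            }
        }
        return this;
    }

    // The shorter overloads only supply defaults.
    TextInitSketch init(String text, String title) { return init(text, title, true); }
    TextInitSketch init(String text, boolean keepOriginalText) { return init(text, null, keepOriginalText); }
    TextInitSketch init(String text) { return init(text, null, true); }
    TextInitSketch init() { return init((String) null, null, true); }

    String title() { return title; }
    String originalText() { return originalText; }
}

class TextInitDemo {
    public static void main(String[] args) {
        TextInitSketch doc = new TextInitSketch().init("a b  c", "demo");
        System.out.println(doc.title() + ": " + doc + " from \"" + doc.originalText() + "\"");
    }
}
```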


init

public BasicDocument init(Reader textReader,
                          String title,
                          boolean keepOriginalText)
                   throws IOException
Inits a new BasicDocument by reading in the text from the given Reader.

Throws:
IOException
See Also:
init(String,String,boolean)

init

public BasicDocument init(Reader textReader,
                          String title)
                   throws IOException
Calls init(textReader,title,true)

Throws:
IOException

init

public BasicDocument init(Reader textReader,
                          boolean keepOriginalText)
                   throws IOException
Calls init(textReader,null,keepOriginalText)

Throws:
IOException

init

public BasicDocument init(Reader textReader)
                   throws IOException
Calls init(textReader,null,true)

Throws:
IOException

init

public BasicDocument init(File textFile,
                          String title,
                          boolean keepOriginalText)
                   throws FileNotFoundException,
                          IOException
Inits a new BasicDocument by reading in the text from the given File.

Throws:
FileNotFoundException
IOException
See Also:
init(String,String,boolean)

init

public BasicDocument init(File textFile,
                          String title)
                   throws FileNotFoundException,
                          IOException
Calls init(textFile,title,true)

Throws:
FileNotFoundException
IOException

init

public BasicDocument init(File textFile,
                          boolean keepOriginalText)
                   throws FileNotFoundException,
                          IOException
Calls init(textFile,textFile.getCanonicalPath(),keepOriginalText)

Throws:
FileNotFoundException
IOException

init

public BasicDocument init(File textFile)
                   throws FileNotFoundException,
                          IOException
Calls init(textFile,textFile.getCanonicalPath(),true)

Throws:
FileNotFoundException
IOException

init

public BasicDocument init(URL textURL,
                          String title,
                          boolean keepOriginalText)
                   throws IOException
Inits a new BasicDocument by reading in the text from the given URL.

Throws:
IOException
See Also:
init(String,String,boolean)

init

public BasicDocument init(URL textURL,
                          String title)
                   throws FileNotFoundException,
                          IOException
Calls init(textURL,title,true)

Throws:
FileNotFoundException
IOException

init

public BasicDocument init(URL textURL,
                          boolean keepOriginalText)
                   throws FileNotFoundException,
                          IOException
Calls init(textURL,textURL.toExternalForm(),keepOriginalText)

Throws:
FileNotFoundException
IOException

init

public BasicDocument init(URL textURL)
                   throws FileNotFoundException,
                          IOException
Calls init(textURL,textURL.toExternalForm(),true)

Throws:
FileNotFoundException
IOException

init

public BasicDocument init(List words,
                          String title)
Inits a new BasicDocument with the given list of words and title.


init

public BasicDocument init(List words)
Calls init(words,null)


parse

protected void parse(String text)
Tokenizes the given text to populate the list of words this Document represents. The default implementation uses the current tokenizer and tokenizes the entirety of the text into words. Subclasses should override this method to parse documents in non-standard formats, and/or to pull the title of the document from the text. The given text may be empty ("") but will never be null. Subclasses may want to do additional processing and then just call super.parse.

See Also:
setTokenizerFactory(edu.stanford.nlp.objectbank.TokenizerFactory)
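Overriding parse to pull the title from the text and then deferring to super.parse might look like the following sketch. The classes are hypothetical stand-ins rather than the Stanford ones, the "first line is the title" format is invented for illustration, and whitespace splitting stands in for the tokenizer.

```java
import java.util.ArrayList;
import java.util.Arrays;

// Hypothetical stand-in for the base class.
class ParseBaseSketch extends ArrayList<String> {
    protected String title = "";

    ParseBaseSketch init(String text) {
        clear();
        parse(text == null ? "" : text);  // parse never receives null
        return this;
    }

    // Default behaviour in this sketch: split the entire text on whitespace.
    protected void parse(String text) {
        String trimmed = text.trim();
        if (!trimmed.isEmpty()) {
            addAll(Arrays.asList(trimmed.split("\\s+")));
        }
    }

    String title() {
        return title;
    }
}

// Hypothetical custom format: the first line is the title, the rest is the body.
class TitledSketch extends ParseBaseSketch {
    @Override
    protected void parse(String text) {
        String body = text;
        int newline = text.indexOf('\n');
        if (newline >= 0) {
            title = text.substring(0, newline).trim();
            body = text.substring(newline + 1);
        }
        super.parse(body);  // reuse the default tokenization for the body
    }
}

class ParseOverrideDemo {
    public static void main(String[] args) {
        ParseBaseSketch doc = new TitledSketch().init("My Title\nhello world again");
        System.out.println(doc.title() + " -> " + doc);
    }
}
```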

asFeatures

public Collection asFeatures()
Returns this (the features are the list of words).

Specified by:
asFeatures in interface Featurizable

label

public Object label()
Returns the first label for this Document, or null if none have been set.

Specified by:
label in interface Labeled

labels

public Collection labels()
Returns the complete List of labels for this Document. This is an empty collection if none have been set.

Specified by:
labels in interface Labeled

setLabel

public void setLabel(Object label)
Removes all currently assigned labels for this Document then adds the given label. Calling setLabel(null) effectively clears all labels.


setLabels

public void setLabels(Collection labels)
Removes all currently assigned labels for this Document then adds all of the given labels.


addLabel

public void addLabel(Object label)
Adds the given label to the List of labels for this Document if it is not null.
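The label methods above interact in a way that is easy to misread, so here is a small sketch of the documented behaviour. LabelSketch is a hypothetical stand-in, not the Stanford class, and ignoring a null collection in setLabels is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Hypothetical stand-in that mirrors the documented label behaviour.
class LabelSketch {
    private final List<Object> labels = new ArrayList<>();

    // First label, or null if none have been set.
    Object label() {
        return labels.isEmpty() ? null : labels.get(0);
    }

    // Empty collection if none have been set.
    Collection<Object> labels() {
        return labels;
    }

    void setLabel(Object label) {
        labels.clear();
        addLabel(label);  // null is ignored, so setLabel(null) leaves the list empty
    }

    void setLabels(Collection<?> newLabels) {
        labels.clear();
        if (newLabels != null) {  // assumption: a null collection is treated as empty
            labels.addAll(newLabels);
        }
    }

    void addLabel(Object label) {
        if (label != null) {
            labels.add(label);
        }
    }
}

class LabelDemo {
    public static void main(String[] args) {
        LabelSketch doc = new LabelSketch();
        doc.setLabels(Arrays.asList("sports", "news"));
        doc.addLabel(null);  // no effect
        System.out.println(doc.label() + " of " + doc.labels());
        doc.setLabel(null);  // clears all labels
        System.out.println(doc.labels().isEmpty());
    }
}
```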


title

public String title()
Returns the title of this document. The title may be empty ("") but will never be null.

Specified by:
title in interface Document

setTitle

public void setTitle(String title)
Sets the title of this Document to the given title. If the given title is null, sets the title to "".


tokenizerFactory

public TokenizerFactory tokenizerFactory()
Returns the current TokenizerFactory used by parse(String).


setTokenizerFactory

public void setTokenizerFactory(TokenizerFactory tokenizerFactory)
Sets the tokenizerFactory to be used by parse(String). Set the tokenizer factory before calling one of the init methods, because init will probably call parse. Note that the tokenizer factory can equivalently be passed in to the constructor.

See Also:
BasicDocument(TokenizerFactory)
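The ordering requirement can be sketched with a stand-in in which a plain Function plays the role of the TokenizerFactory. Everything here is hypothetical (TokenizerOrderSketch, setTokenizer, and the two splitting helpers); the point is only that init uses whichever tokenizer is set at the moment it runs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in: a Function replaces the TokenizerFactory.
class TokenizerOrderSketch extends ArrayList<String> {
    private Function<String, List<String>> tokenizer = TokenizerOrderSketch::splitOnWhitespace;

    TokenizerOrderSketch() {}

    // Equivalent to calling setTokenizer right after construction.
    TokenizerOrderSketch(Function<String, List<String>> tokenizer) {
        this.tokenizer = tokenizer;
    }

    void setTokenizer(Function<String, List<String>> tokenizer) {
        this.tokenizer = tokenizer;
    }

    TokenizerOrderSketch init(String text) {
        clear();
        addAll(tokenizer.apply(text == null ? "" : text));  // uses the tokenizer set right now
        return this;
    }

    static List<String> splitOnWhitespace(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<String>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    static List<String> splitOnCommas(String text) {
        if (text.isEmpty()) {
            return new ArrayList<String>();
        }
        return Arrays.asList(text.split(","));
    }
}

class TokenizerOrderDemo {
    public static void main(String[] args) {
        TokenizerOrderSketch doc = new TokenizerOrderSketch();
        doc.setTokenizer(TokenizerOrderSketch::splitOnCommas);  // set first, then init
        doc.init("a b,c d");
        System.out.println(doc);
    }
}
```

In this sketch, changing the tokenizer after init has no effect on the words already stored; init would have to be called again.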

blankDocument

public Document blankDocument()
Returns a new empty BasicDocument with the same title, labels, and tokenizer as this Document. This is useful when you want to make a new Document that's like the old document but can be filled with new text (e.g. if you're transforming the contents non-destructively).

Subclasses that want to preserve extra state should override this method and add the extra state to the new document before returning it. The new BasicDocument is created by calling getClass().newInstance(), so it should be of the correct subclass, and you should be able to cast it down and add extra metadata directly. Note, however, that if an Exception is thrown on instantiation (e.g. if your subclass lacks a public no-argument constructor, which it should have), a new BasicDocument is used instead. To be safe, check that the instance is of the correct subtype as follows (this example assumes the subclass is called NumberedDocument and has the additional number property):

Document blankDocument = super.blankDocument();
if (blankDocument instanceof NumberedDocument) {
    ((NumberedDocument) blankDocument).setNumber(getNumber());
}

Specified by:
blankDocument in interface Document
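The instantiate-with-fallback behaviour described for blankDocument can be sketched without the Stanford classes. All class names below are hypothetical stand-ins. The sketch uses getDeclaredConstructor().newInstance() rather than getClass().newInstance(), since Class.newInstance() has been deprecated since Java 9; that substitution is a choice of this sketch, not something the class is documented to do.

```java
// Hypothetical stand-ins; only the instantiate-with-fallback logic is the point.
class BlankBaseSketch {
    private String title = "";

    BlankBaseSketch() {}

    String title() { return title; }

    void setTitle(String title) { this.title = (title == null) ? "" : title; }

    BlankBaseSketch blankDocument() {
        BlankBaseSketch blank;
        try {
            // Instantiate the runtime class so a subclass gets a blank of its own type.
            blank = getClass().getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            // No usable no-argument constructor: fall back to the base type.
            blank = new BlankBaseSketch();
        }
        blank.setTitle(title);
        return blank;
    }
}

class NumberedBlankSketch extends BlankBaseSketch {
    private int number;

    NumberedBlankSketch() {}

    int getNumber() { return number; }

    void setNumber(int number) { this.number = number; }

    @Override
    BlankBaseSketch blankDocument() {
        BlankBaseSketch blank = super.blankDocument();
        if (blank instanceof NumberedBlankSketch) {  // guard against the fallback case
            ((NumberedBlankSketch) blank).setNumber(getNumber());
        }
        return blank;
    }
}

// No no-argument constructor, so blankDocument takes the fallback branch.
class NoDefaultCtorSketch extends BlankBaseSketch {
    NoDefaultCtorSketch(String required) {}
}

class BlankDocumentDemo {
    public static void main(String[] args) {
        NumberedBlankSketch original = new NumberedBlankSketch();
        original.setNumber(42);
        System.out.println(original.blankDocument().getClass().getSimpleName());
        System.out.println(new NoDefaultCtorSketch("x").blankDocument().getClass().getSimpleName());
    }
}
```

The instanceof check is what keeps the override safe when the fallback returns the base type instead of the subclass.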

originalText

public String originalText()
Returns the text originally used to construct this document, or null if there was no original text.


presentableText

public String presentableText()
Returns a "pretty" version of the words in this Document suitable for display. The default implementation returns each of the words in this Document separated by spaces. Specifically, each element that implements HasWord has its HasWord.word() printed, and other elements are skipped.

Subclasses that maintain additional information may wish to override this method.
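The default behaviour described above (print word() for word-bearing elements, skip everything else, separate with spaces) can be sketched as follows. WordLike is a hypothetical stand-in for the library's HasWord interface, and PresentableTextSketch is likewise an invented name.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

// Hypothetical stand-in for HasWord.
interface WordLike {
    String word();
}

class PresentableTextSketch {
    // Joins word() of every WordLike element with single spaces; other elements are skipped.
    static String presentableText(List<?> elements) {
        StringJoiner joiner = new StringJoiner(" ");
        for (Object element : elements) {
            if (element instanceof WordLike) {
                joiner.add(((WordLike) element).word());
            }
        }
        return joiner.toString();
    }
}

class PresentableTextDemo {
    public static void main(String[] args) {
        WordLike hello = () -> "Hello";
        WordLike world = () -> "world";
        List<Object> elements = Arrays.asList(hello, 42, world);  // 42 is not a WordLike
        System.out.println(PresentableTextSketch.presentableText(elements));  // prints: Hello world
    }
}
```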


main

public static void main(String[] args)
For internal debugging purposes only. Creates and tests various instances of BasicDocument.


printState

public static void printState(BasicDocument bd)
                       throws Exception
For internal debugging purposes only. Prints the state of the given BasicDocument to stderr.

Throws:
Exception


Stanford NLP Group