java.lang.Object
java.util.AbstractCollection<E>
java.util.AbstractList<E>
java.util.ArrayList
edu.stanford.nlp.ling.BasicDocument
public class BasicDocument
Basic implementation of Document that should be suitable for most needs.
BasicDocument is an ArrayList for storing words; it tokenizes the text when the document is initialized. Override parse(String) to provide support for custom document formats or to do a custom job of tokenization. BasicDocument should only be used for documents that are small enough to store in memory.
A document can be constructed and initialized in one statement, for example: Document doc = new BasicDocument().init(file);
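The init-and-return-this pattern and the parse(String) hook can be sketched without the Stanford classes. What follows is a minimal stand-in under stated assumptions, not the library's source: MiniDocument and CommaDocument are hypothetical names, and whitespace splitting merely stands in for the PTBTokenizer default.

```java
import java.util.ArrayList;

// Hypothetical stand-in for BasicDocument: a list of words filled by parse().
class MiniDocument extends ArrayList<String> {
    protected String title = "";      // never null, as documented for title
    protected String originalText;    // may be null

    // Follows the documented contract: "" replaces a null text, the original
    // text is kept only on request, and `this` is returned for chaining.
    MiniDocument init(String text, String title, boolean keepOriginalText) {
        String safeText = (text == null) ? "" : text;
        this.title = (title == null) ? "" : title;
        this.originalText = keepOriginalText ? safeText : null;
        clear();
        parse(safeText);
        return this;
    }

    // Convenience overload that delegates to the three-argument form.
    MiniDocument init(String text) {
        return init(text, null, true);
    }

    // Assumed default tokenization for this sketch: split on whitespace.
    protected void parse(String text) {
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                add(token);
            }
        }
    }
}

// A subclass only needs to replace the tokenization step.
class CommaDocument extends MiniDocument {
    @Override
    protected void parse(String text) {
        for (String field : text.split(",")) {
            String trimmed = field.trim();
            if (!trimmed.isEmpty()) {
                add(trimmed);
            }
        }
    }
}
```

Because init is an ordinary inherited method, new CommaDocument().init("a, b ,c") reaches the overridden parse without CommaDocument declaring any constructor of its own.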
Field Summary | |
---|---|
protected List | labels: Label(s) for this document. |
protected String | originalText: Original text of this document (may be null). |
protected String | title: Title of this document (never null). |
protected TokenizerFactory | tokenizerFactory: TokenizerFactory used to convert the text into words inside parse(String). |
Fields inherited from class java.util.AbstractList |
---|
modCount |
Constructor Summary | |
---|---|
BasicDocument() | Constructs a new (empty) BasicDocument using a PTBTokenizer. |
BasicDocument(Collection d) | |
BasicDocument(Document d) | |
BasicDocument(TokenizerFactory tokenizerFactory) | Constructs a new (empty) BasicDocument using the given tokenizer. |
Method Summary | |
---|---|
void | addLabel(Object label): Adds the given label to the List of labels for this Document if it is not null. |
Collection | asFeatures(): Returns this (the features are the list of words). |
Document | blankDocument(): Returns a new empty BasicDocument with the same title, labels, and tokenizer as this Document. |
BasicDocument | init(): Calls init((String)null,null,true) |
BasicDocument | init(File textFile): Calls init(textFile,textFile.getCanonicalPath(),true) |
BasicDocument | init(File textFile, boolean keepOriginalText): Calls init(textFile,textFile.getCanonicalPath(),keepOriginalText) |
BasicDocument | init(File textFile, String title): Calls init(textFile,title,true) |
BasicDocument | init(File textFile, String title, boolean keepOriginalText): Inits a new BasicDocument by reading in the text from the given File. |
BasicDocument | init(List words): Calls init(words,null) |
BasicDocument | init(List words, String title): Inits a new BasicDocument with the given list of words and title. |
BasicDocument | init(Reader textReader): Calls init(textReader,null,true) |
BasicDocument | init(Reader textReader, boolean keepOriginalText): Calls init(textReader,null,keepOriginalText) |
BasicDocument | init(Reader textReader, String title): Calls init(textReader,title,true) |
BasicDocument | init(Reader textReader, String title, boolean keepOriginalText): Inits a new BasicDocument by reading in the text from the given Reader. |
BasicDocument | init(String text): Calls init(text,null,true) |
BasicDocument | init(String text, boolean keepOriginalText): Calls init(text,null,keepOriginalText) |
BasicDocument | init(String text, String title): Calls init(text,title,true) |
BasicDocument | init(String text, String title, boolean keepOriginalText): Inits a new BasicDocument with the given text contents and title. |
BasicDocument | init(URL textURL): Calls init(textURL,textURL.toExternalForm(),true) |
BasicDocument | init(URL textURL, boolean keepOriginalText): Calls init(textURL,textURL.toExternalForm(),keepOriginalText) |
BasicDocument | init(URL textURL, String title): Calls init(textURL,title,true) |
BasicDocument | init(URL textURL, String title, boolean keepOriginalText): Inits a new BasicDocument by reading in the text from the given URL. |
Object | label(): Returns the first label for this Document, or null if none have been set. |
Collection | labels(): Returns the complete List of labels for this Document. |
static void | main(String[] args): For internal debugging purposes only. |
String | originalText(): Returns the text originally used to construct this document, or null if there was no original text. |
protected void | parse(String text): Tokenizes the given text to populate the list of words this Document represents. |
String | presentableText(): Returns a "pretty" version of the words in this Document suitable for display. |
static void | printState(BasicDocument bd): For internal debugging purposes only. |
void | setLabel(Object label): Removes all currently assigned labels for this Document, then adds the given label. |
void | setLabels(Collection labels): Removes all currently assigned labels for this Document, then adds all of the given labels. |
void | setTitle(String title): Sets the title of this Document to the given title. |
void | setTokenizerFactory(TokenizerFactory tokenizerFactory): Sets the tokenizerFactory to be used by parse(String). |
String | title(): Returns the title of this document. |
TokenizerFactory | tokenizerFactory(): Returns the current TokenizerFactory used by parse(String). |
Methods inherited from class java.util.ArrayList |
---|
add, add, addAll, addAll, clear, clone, contains, ensureCapacity, get, indexOf, isEmpty, lastIndexOf, remove, remove, removeRange, set, size, toArray, toArray, trimToSize |
Methods inherited from class java.util.AbstractList |
---|
equals, hashCode, iterator, listIterator, listIterator, subList |
Methods inherited from class java.util.AbstractCollection |
---|
containsAll, removeAll, retainAll, toString |
Methods inherited from class java.lang.Object |
---|
finalize, getClass, notify, notifyAll, wait, wait, wait |
Methods inherited from interface java.util.List |
---|
add, add, addAll, addAll, clear, contains, containsAll, equals, get, hashCode, indexOf, isEmpty, iterator, lastIndexOf, listIterator, listIterator, remove, remove, removeAll, retainAll, set, size, subList, toArray, toArray |
Field Detail |
---|
protected String title
protected String originalText
protected final List labels
protected TokenizerFactory tokenizerFactory
TokenizerFactory used to convert the text into words inside parse(String).
Constructor Detail |
---|
public BasicDocument()
Constructs a new (empty) BasicDocument using a PTBTokenizer. Call one of the init* methods to populate the document from a desired source.
public BasicDocument(TokenizerFactory tokenizerFactory)
public BasicDocument(Document d)
public BasicDocument(Collection d)
Method Detail |
---|
public BasicDocument init(String text, String title, boolean keepOriginalText)
Inits a new BasicDocument with the given text contents and title. The text is tokenized using parse(String) to populate the list of words ("" is used if text is null). If specified, a reference to the original text is also maintained so that the originalText() method returns the text given to this method. Returns a reference to this BasicDocument for convenience (so it works like a constructor, but is inherited).
public BasicDocument init(String text, String title)
public BasicDocument init(String text, boolean keepOriginalText)
public BasicDocument init(String text)
public BasicDocument init()
public BasicDocument init(Reader textReader, String title, boolean keepOriginalText) throws IOException
Inits a new BasicDocument by reading in the text from the given Reader.
Throws: IOException
See Also: init(String,String,boolean)
public BasicDocument init(Reader textReader, String title) throws IOException
Throws: IOException
public BasicDocument init(Reader textReader, boolean keepOriginalText) throws IOException
Throws: IOException
public BasicDocument init(Reader textReader) throws IOException
Throws: IOException
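The Reader-based init methods point to init(String,String,boolean), which suggests the Reader's contents are collected into a single String before tokenization. A minimal sketch of that reading step, assuming a hypothetical helper named readAll (not part of BasicDocument):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class ReaderText {
    // Drains the reader into one String; a caller could then pass the result
    // to a String-based init method.
    static String readAll(Reader reader) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[4096];
        int count;
        while ((count = reader.read(buffer)) != -1) {
            text.append(buffer, 0, count);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readAll(new StringReader("One small document.")));
    }
}
```

Holding the whole text in memory this way is consistent with the class note that BasicDocument is meant for documents small enough to store in memory.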
public BasicDocument init(File textFile, String title, boolean keepOriginalText) throws FileNotFoundException, IOException
Inits a new BasicDocument by reading in the text from the given File.
Throws: FileNotFoundException, IOException
See Also: init(String,String,boolean)
public BasicDocument init(File textFile, String title) throws FileNotFoundException, IOException
Throws: FileNotFoundException, IOException
public BasicDocument init(File textFile, boolean keepOriginalText) throws FileNotFoundException, IOException
Throws: FileNotFoundException, IOException
public BasicDocument init(File textFile) throws FileNotFoundException, IOException
Throws: FileNotFoundException, IOException
public BasicDocument init(URL textURL, String title, boolean keepOriginalText) throws IOException
Inits a new BasicDocument by reading in the text from the given URL.
Throws: IOException
See Also: init(String,String,boolean)
public BasicDocument init(URL textURL, String title) throws FileNotFoundException, IOException
Throws: FileNotFoundException, IOException
public BasicDocument init(URL textURL, boolean keepOriginalText) throws FileNotFoundException, IOException
Throws: FileNotFoundException, IOException
public BasicDocument init(URL textURL) throws FileNotFoundException, IOException
Throws: FileNotFoundException, IOException
public BasicDocument init(List words, String title)
public BasicDocument init(List words)
protected void parse(String text)
Tokenizes the given text to populate the list of words this Document represents.
See Also: setTokenizerFactory(edu.stanford.nlp.objectbank.TokenizerFactory)
public Collection asFeatures()
Specified by: asFeatures in interface Featurizable
public Object label()
Specified by: label in interface Labeled
public Collection labels()
Specified by: labels in interface Labeled
public void setLabel(Object label)
public void setLabels(Collection labels)
public void addLabel(Object label)
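The label methods have simple list semantics: addLabel ignores null, setLabel and setLabels clear before adding, and label() returns the first label or null. A minimal sketch of those rules, using a hypothetical LabelStore class rather than BasicDocument itself (treating a null collection in setLabels as "no labels" is an assumption of this sketch):

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

class LabelStore {
    private final List<Object> labels = new ArrayList<>();

    // Null labels are ignored, matching the addLabel description.
    void addLabel(Object label) {
        if (label != null) {
            labels.add(label);
        }
    }

    // Replaces all current labels with the given one.
    void setLabel(Object label) {
        labels.clear();
        addLabel(label);
    }

    // Replaces all current labels with the given collection.
    // Assumption: a null collection simply leaves the list empty.
    void setLabels(Collection<?> newLabels) {
        labels.clear();
        if (newLabels != null) {
            for (Object label : newLabels) {
                addLabel(label);
            }
        }
    }

    // First label, or null if none have been set.
    Object label() {
        return labels.isEmpty() ? null : labels.get(0);
    }

    Collection<Object> labels() {
        return labels;
    }
}
```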
public String title()
Specified by: title in interface Document
public void setTitle(String title)
public TokenizerFactory tokenizerFactory()
Returns the current TokenizerFactory used by parse(String).
public void setTokenizerFactory(TokenizerFactory tokenizerFactory)
Sets the tokenizerFactory to be used by parse(String). Set this tokenizer before calling one of the init methods, because init will probably call parse. Note that the tokenizer can equivalently be passed in to the constructor.
See Also: BasicDocument(TokenizerFactory)
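The ordering rule (choose the tokenizer, then call init) can be illustrated with a small stand-in. In this sketch a java.util.function.Function from text to tokens plays the role of TokenizerFactory, and ConfigurableDocument and setTokenizer are hypothetical names, not library API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

class ConfigurableDocument extends ArrayList<String> {
    // Stand-in for TokenizerFactory: any function from text to tokens.
    // Assumed default for this sketch: split on whitespace.
    private Function<String, List<String>> tokenizer =
            text -> Arrays.asList(text.trim().split("\\s+"));

    ConfigurableDocument() { }

    // Equivalent to calling setTokenizer before init.
    ConfigurableDocument(Function<String, List<String>> tokenizer) {
        this.tokenizer = tokenizer;
    }

    void setTokenizer(Function<String, List<String>> tokenizer) {
        this.tokenizer = tokenizer;
    }

    // Tokenizes with whatever tokenizer is set at the time of the call.
    ConfigurableDocument init(String text) {
        clear();
        for (String token : tokenizer.apply(text == null ? "" : text)) {
            if (!token.isEmpty()) {
                add(token);
            }
        }
        return this;
    }
}
```

Changing the tokenizer after init has run leaves the already tokenized words untouched, which is why the setter belongs before the init call.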
public Document blankDocument()
Returns a new empty BasicDocument with the same title, labels, and tokenizer as this Document.
Subclasses that want to preserve extra state should override this method and add the extra state to the new document before returning it. The new BasicDocument is created by calling getClass().newInstance(), so it should be of the correct subclass, and you should be able to cast it down and add extra metadata directly. Note, however, that if an Exception is thrown on instantiation (e.g. if your subclass doesn't have a public empty constructor, which it should), a new BasicDocument is used instead. To be safe, check that the instance is of the correct subtype as follows (this example assumes the subclass is called NumberedDocument and has the additional number property):
Document blankDocument = super.blankDocument();
if (blankDocument instanceof NumberedDocument) {
    ((NumberedDocument) blankDocument).setNumber(getNumber());
}
return blankDocument;
Specified by: blankDocument in interface Document
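The instantiate-or-fall-back behavior described for blankDocument() can be sketched in isolation. BaseDoc, NumberedDoc, and StrictDoc below are hypothetical stand-ins, and the sketch uses getDeclaredConstructor().newInstance() because Class.newInstance() is deprecated in current Java; treat it as an illustration rather than the library's implementation.

```java
import java.util.ArrayList;

class BaseDoc extends ArrayList<String> {
    protected String title = "";

    public BaseDoc() { }

    // Tries to create an instance of the runtime class; falls back to a plain
    // BaseDoc when the subclass cannot be instantiated reflectively.
    BaseDoc blankDocument() {
        BaseDoc blank;
        try {
            blank = getClass().getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            blank = new BaseDoc();
        }
        blank.title = title;
        return blank;
    }
}

class NumberedDoc extends BaseDoc {
    private int number;

    public NumberedDoc() { }

    int getNumber() { return number; }

    void setNumber(int number) { this.number = number; }

    // Copies the extra state only if the blank copy really is a NumberedDoc.
    @Override
    BaseDoc blankDocument() {
        BaseDoc blank = super.blankDocument();
        if (blank instanceof NumberedDoc) {
            ((NumberedDoc) blank).setNumber(getNumber());
        }
        return blank;
    }
}

// No no-argument constructor, so reflection fails and the fallback is used.
class StrictDoc extends BaseDoc {
    StrictDoc(String title) { this.title = title; }
}
```

The instanceof guard matters for classes like StrictDoc: the returned object is then a plain BaseDoc, and an unchecked downcast would throw a ClassCastException.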
public String originalText()
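public String presentableText()
Returns a "pretty" version of the words in this Document suitable for display. Each element that implements HasWord has its HasWord.word() printed, and other elements are skipped.
Subclasses that maintain additional information may wish to override this method.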
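The filtering rule for presentableText() (print word() for HasWord elements, skip everything else) can be sketched as follows. Wordy is a hypothetical stand-in for the HasWord interface, and joining the words with single spaces is an assumption of this sketch.

```java
import java.util.List;

// Hypothetical stand-in for the HasWord interface.
interface Wordy {
    String word();
}

class PresentableText {
    // Appends word() of every Wordy element, separated by single spaces;
    // elements of any other type are skipped.
    static String presentableText(List<?> elements) {
        StringBuilder sb = new StringBuilder();
        for (Object element : elements) {
            if (element instanceof Wordy) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(((Wordy) element).word());
            }
        }
        return sb.toString();
    }
}
```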
public static void main(String[] args)
public static void printState(BasicDocument bd) throws Exception
For internal debugging purposes only.
Throws: Exception