java.lang.Object
  edu.stanford.nlp.process.AbstractTokenizer
    edu.stanford.nlp.process.PTBTokenizer
public class PTBTokenizer
extends AbstractTokenizer
Tokenizer implementation that conforms to the Penn Treebank tokenization conventions. This tokenizer is a Java implementation of Professor Chris Manning's Flex tokenizer, pgtt-treebank.l. It reads raw text and outputs tokens as edu.stanford.nlp.trees.Word objects in the Penn Treebank format. It can optionally return carriage returns as tokens.
Nested Class Summary | |
---|---|
static class | PTBTokenizer.PTBTokenizerFactory |
Field Summary |
---|
Fields inherited from class edu.stanford.nlp.process.AbstractTokenizer |
---|
nextToken |
Constructor Summary | |
---|---|
PTBTokenizer(Reader r) | Constructs a new PTBTokenizer that treats carriage returns as normal whitespace. |
PTBTokenizer(Reader r, boolean tokenizeCRs) | Constructs a new PTBTokenizer that optionally returns carriage returns as their own token. |
PTBTokenizer(Reader r, boolean tokenizeCRs, LexedTokenFactory tokenFactory) | Constructs a new PTBTokenizer that optionally returns carriage returns as their own token, and has a custom LexedTokenFactory. |
Method Summary | |
---|---|
static TokenizerFactory | factory() |
protected Object | getNext(): Internally fetches the next token. |
static void | main(String[] args): Reads a file from the argument and prints its tokens one per line. |
static String | ptb2Text(List ptbWords): Returns a presentable version of the given PTB-tokenized words. |
static String | ptb2Text(String ptbText): Returns a presentable version of the given PTB-tokenized text. |
void | setSource(Reader r): Sets the source of this Tokenizer to be the Reader r. |
Methods inherited from class edu.stanford.nlp.process.AbstractTokenizer |
---|
hasNext, next, peek, remove, tokenize |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
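public PTBTokenizer(Reader r)
Constructs a new PTBTokenizer that treats carriage returns as normal whitespace.
public PTBTokenizer(Reader r, boolean tokenizeCRs)
Constructs a new PTBTokenizer that optionally returns carriage returns as their own token (see PTBLexer.cr).
public PTBTokenizer(Reader r, boolean tokenizeCRs, LexedTokenFactory tokenFactory)
Constructs a new PTBTokenizer that optionally returns carriage returns as their own token (see PTBLexer.cr), and has a custom LexedTokenFactory.
Parameters:
tokenFactory - The LexedTokenFactory to use to create tokens from the text.
Method Detail |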
---|
protected Object getNext()
Internally fetches the next token.
getNext in class AbstractTokenizer
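The inherited nextToken field and the hasNext, next, and peek methods suggest a one-token lookahead built on top of getNext(). The following is a minimal sketch of that pattern, not the library's actual code. The class names LookaheadTokenizer and WhitespaceTokenizer are hypothetical, and the sketch assumes that getNext() returns null once the input is exhausted.

```java
import java.util.NoSuchElementException;

// Hypothetical stand-in for an abstract tokenizer with one-token lookahead.
abstract class LookaheadTokenizer {
    // Buffered token; null means "nothing buffered" (assumed convention).
    protected Object nextToken;

    // Subclasses return the next token, or null at end of input (assumed contract).
    protected abstract Object getNext();

    public boolean hasNext() {
        if (nextToken == null) {
            nextToken = getNext();
        }
        return nextToken != null;
    }

    // Returns the next token without consuming it.
    public Object peek() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return nextToken;
    }

    // Returns the next token and clears the buffer so the following call fetches a new one.
    public Object next() {
        Object result = peek();
        nextToken = null;
        return result;
    }
}

// Toy subclass for illustration only: splits a string on runs of whitespace.
class WhitespaceTokenizer extends LookaheadTokenizer {
    private final String[] parts;
    private int pos = 0;

    WhitespaceTokenizer(String text) {
        String trimmed = text.trim();
        parts = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    @Override
    protected Object getNext() {
        return pos < parts.length ? parts[pos++] : null;
    }
}
```

With this arrangement a subclass only has to supply getNext(); calling peek() repeatedly returns the same token until next() consumes it.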
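public void setSource(Reader r)
Sets the source of this Tokenizer to be the Reader r.
public static String ptb2Text(String ptbText)
Returns a presentable version of the given PTB-tokenized text.
public static String ptb2Text(List ptbWords)
Returns a presentable version of the given PTB-tokenized words. The words are joined into a single string and ptb2Text(String) is called on the output. This method checks whether the elements of the list are subtypes of Word; if so, it uses their word() values to prevent additional text (e.g., POS tags) from creeping in. Otherwise, the toString value is used.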
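The word()-versus-toString choice described for ptb2Text(List) can be illustrated with a small self-contained sketch. It is not the library's implementation: HasWordText and TaggedWordSketch are hypothetical stand-ins for the Word type, JoinWordsSketch.join is an illustrative name, and the single-space separator is an assumption. The sketch covers only the joining step and leaves out the later call to ptb2Text(String).

```java
import java.util.List;
import java.util.StringJoiner;

// Hypothetical stand-in for the library's Word type; only word() matters here.
interface HasWordText {
    String word();
}

// Hypothetical tagged word whose toString() carries extra text (a POS tag).
class TaggedWordSketch implements HasWordText {
    private final String word;
    private final String tag;

    TaggedWordSketch(String word, String tag) {
        this.word = word;
        this.tag = tag;
    }

    @Override
    public String word() {
        return word;
    }

    @Override
    public String toString() {
        return word + "/" + tag;
    }
}

class JoinWordsSketch {
    // Joins the elements with single spaces (the separator is an assumption).
    // Word-like elements contribute word(); everything else contributes toString().
    static String join(List<?> items) {
        StringJoiner joiner = new StringJoiner(" ");
        for (Object item : items) {
            if (item instanceof HasWordText) {
                joiner.add(((HasWordText) item).word());
            } else {
                joiner.add(String.valueOf(item));
            }
        }
        return joiner.toString();
    }
}
```

For a list holding a tagged "Do", the plain string "n't", and a tagged "go", join returns "Do n't go", with the tags left out because word() is used instead of toString().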
public static TokenizerFactory factory()
public static void main(String[] args) throws IOException
Reads a file from the argument and prints its tokens one per line. Usage:
java edu.stanford.nlp.process.PTBTokenizer [-charset charset] [-nl] filename
Parameters:
args - Command line arguments
Throws:
IOException