edu.stanford.nlp.trees.international.pennchinese
Class CHTBTokenizer
java.lang.Object
edu.stanford.nlp.process.AbstractTokenizer
edu.stanford.nlp.trees.international.pennchinese.CHTBTokenizer
- All Implemented Interfaces:
- Tokenizer, Iterator
public class CHTBTokenizer
- extends AbstractTokenizer
A simple tokenizer for tokenizing Penn Chinese Treebank files. A
token is any parenthesis, node label, or terminal. All SGML
content of the files is ignored.
- Author:
- Roger Levy
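The token definition in the class description (parentheses, node labels, and terminals, with SGML ignored) can be illustrated with a minimal, self-contained sketch. This is not the CHTBTokenizer implementation: the class name `TreebankTokenSketch` and its `tokenize` method are hypothetical, and the sketch assumes that "SGML content" means anything between `<` and `>`, which may be narrower than what the real tokenizer skips.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class TreebankTokenSketch {

    // Splits bracketed treebank text into parentheses, node labels, and
    // terminals. Anything between '<' and '>' is dropped (assumed SGML tag).
    public static List<String> tokenize(Reader r) throws IOException {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inTag = false;
        int c;
        while ((c = r.read()) != -1) {
            char ch = (char) c;
            if (inTag) {
                // Skip tag characters until the closing '>'.
                if (ch == '>') inTag = false;
                continue;
            }
            if (ch == '<' || ch == '(' || ch == ')' || Character.isWhitespace(ch)) {
                // A delimiter ends the label or terminal being accumulated.
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
                if (ch == '<') {
                    inTag = true;
                } else if (ch == '(' || ch == ')') {
                    tokens.add(String.valueOf(ch));
                }
            } else {
                current.append(ch);
            }
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // \u4e2d\u56fd is a two-character Chinese terminal, escaped so the
        // source compiles regardless of the compiler's default encoding.
        String sample = "<S ID=1>\n( (NP (NR \u4e2d\u56fd)) )\n</S>";
        for (String token : tokenize(new StringReader(sample))) {
            System.out.println(token);
        }
    }
}
```

Each parenthesis becomes its own token, so a consumer can rebuild the tree structure from the token stream alone.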
Method Summary
Object getNext()
- Internally fetches the next token.
static void main(String[] args)
- The main() method tokenizes a file in the specified encoding and prints it to standard output in the same encoding.
Methods inherited from class java.lang.Object
- clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
CHTBTokenizer
public CHTBTokenizer(Reader r)
- Constructs a new tokenizer from a Reader. Note that converting
the bytes going into the Reader into Java-internal Unicode is
not the tokenizer's job. This can be done by converting the
file with ConvertEncodingThread, or by specifying the file's
encoding explicitly in the Reader with java.io.InputStreamReader.
- Parameters:
r - Reader
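The second option mentioned in the constructor description, naming the encoding explicitly through java.io.InputStreamReader, might look like the sketch below. The class `EncodingReaderSketch` and its helper methods are hypothetical; UTF-8 is used only so the demo is self-contained, and the real file's encoding should be substituted. The call to the CHTBTokenizer constructor appears in a comment because it requires the Stanford library on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

public class EncodingReaderSketch {

    // Wraps a byte stream in a Reader that decodes it with the named charset.
    public static Reader open(InputStream in, String encoding) throws IOException {
        return new InputStreamReader(in, encoding);
    }

    // Drains a Reader into a String (demo helper only).
    public static String readAll(Reader r) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = r.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for file contents; for a real file, a FileInputStream
        // would replace the ByteArrayInputStream.
        byte[] raw = "(NR \u4e2d\u56fd)".getBytes("UTF-8");
        Reader r = open(new ByteArrayInputStream(raw), "UTF-8");
        // With the Stanford library available, the Reader would be handed on:
        //   CHTBTokenizer tokenizer = new CHTBTokenizer(r);
        System.out.println(readAll(r).length());
    }
}
```

Passing the encoding by name keeps the decoding decision with the caller, which matches the division of responsibility the constructor description states.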
getNext
public Object getNext()
- Internally fetches the next token.
- Specified by:
getNext in class AbstractTokenizer
- Returns:
- the next token in the token stream, or null if none exists.
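A getNext() method that returns null at the end of input can back the Iterator interface through a single token of lookahead. The sketch below shows that general pattern under stated assumptions: `LookaheadSketch`, `NullTerminated`, and `over` are hypothetical names, and nothing here is taken from the AbstractTokenizer source.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

public class LookaheadSketch {

    // Hypothetical base class: subclasses supply getNext(), which returns
    // null once the input is exhausted.
    public abstract static class NullTerminated<T> implements Iterator<T> {
        private T lookahead; // fetched but not yet handed out

        protected abstract T getNext();

        @Override
        public boolean hasNext() {
            if (lookahead == null) {
                lookahead = getNext();
            }
            return lookahead != null;
        }

        @Override
        public T next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            T result = lookahead;
            lookahead = null;
            return result;
        }
    }

    // Small concrete source that hands out the given items in order.
    public static Iterator<String> over(final String... items) {
        return new NullTerminated<String>() {
            private int index = 0;

            @Override
            protected String getNext() {
                return index < items.length ? items[index++] : null;
            }
        };
    }

    public static void main(String[] args) {
        Iterator<String> it = over("(", "NP", ")");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```

One consequence of this design is that null can never be a legitimate token, since it is reserved as the end-of-stream signal.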
main
public static void main(String[] args)
throws IOException
- The main() method tokenizes a file in the specified encoding
and prints it to standard output in the same encoding.
Its arguments are (Infile, Encoding).
- Throws:
IOException
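Printing in a named encoding, as main() is described as doing, can be sketched with java.io.OutputStreamWriter. The class `EncodedOutputSketch` and its `printTokens` method are hypothetical, and the one-token-per-line layout is an assumption made for the demo; the description above does not say how main() formats its output.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.Arrays;
import java.util.List;

public class EncodedOutputSketch {

    // Writes each token followed by a newline, encoding characters with the
    // named charset. The stream is flushed but left open for the caller.
    public static void printTokens(List<String> tokens, OutputStream out, String encoding)
            throws IOException {
        Writer writer = new OutputStreamWriter(out, encoding);
        for (String token : tokens) {
            writer.write(token);
            writer.write('\n');
        }
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        // The encoding name comes from the command line, defaulting to UTF-8.
        String encoding = args.length > 0 ? args[0] : "UTF-8";
        List<String> tokens = Arrays.asList("(", "NR", "\u4e2d\u56fd", ")");
        printTokens(tokens, System.out, encoding);
    }
}
```

Writing through an explicit OutputStreamWriter avoids depending on the platform default charset, which may not match the encoding of the input file.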
Stanford NLP Group