edu.stanford.nlp.process
Class WhitespaceTokenizer

java.lang.Object
  extended by edu.stanford.nlp.process.AbstractTokenizer
      extended by edu.stanford.nlp.process.WhitespaceTokenizer
All Implemented Interfaces:
Tokenizer, Iterator

public class WhitespaceTokenizer
extends AbstractTokenizer

Simple Tokenizer implementation that tokenizes on whitespace. This implementation returns Word objects. The eolIsSignificant parameter controls whether end-of-line (EOL) is treated as a token. If it is, EOL is returned as a Word with String value "\n".
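
As a rough illustration of the behavior described above, the following standalone sketch splits input on whitespace and, when eolIsSignificant is true, emits each newline as its own "\n" token. It does not use the Stanford classes: the class name WhitespaceSketch and the method tokenize are hypothetical, and it returns plain strings rather than Word objects. This page does not say how the real class treats "\r\n" or other line terminators, so the sketch assumes only '\n' counts as EOL.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class WhitespaceSketch {

    // Hypothetical helper (not part of the Stanford API): splits the
    // reader's content on whitespace. If eolIsSignificant is true,
    // each '\n' becomes its own "\n" token.
    static List<String> tokenize(Reader r, boolean eolIsSignificant) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        try {
            int c;
            while ((c = r.read()) != -1) {
                char ch = (char) c;
                if (Character.isWhitespace(ch)) {
                    // Whitespace ends the token in progress, if any.
                    if (current.length() > 0) {
                        tokens.add(current.toString());
                        current.setLength(0);
                    }
                    // Assumption: only '\n' is treated as end-of-line.
                    if (ch == '\n' && eolIsSignificant) {
                        tokens.add("\n");
                    }
                } else {
                    current.append(ch);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Flush a trailing token that was not followed by whitespace.
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "the quick\nbrown fox";
        // Without EOL tokens: four word tokens.
        System.out.println(tokenize(new StringReader(text), false).size());
        // With EOL tokens: the newline adds a fifth token.
        System.out.println(tokenize(new StringReader(text), true).size());
    }
}
```

With eolIsSignificant set to false the newline is treated like any other whitespace. With it set to true the token list keeps line boundaries, which matters when downstream code processes one sentence or record per line.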

Author:
Joseph Smarr (jsmarr@stanford.edu), Teg Grenager (grenager@stanford.edu)

Field Summary
 
Fields inherited from class edu.stanford.nlp.process.AbstractTokenizer
nextToken
 
Constructor Summary
WhitespaceTokenizer(Reader r)
          Constructs a new WhitespaceTokenizer with the Reader r as its source.
WhitespaceTokenizer(Reader r, boolean eolIsSignificant)
          Constructs a new WhitespaceTokenizer with the Reader r as its source, with eolIsSignificant controlling whether EOL is returned as a token.
 
Method Summary
static TokenizerFactory factory()
           
static TokenizerFactory factory(boolean eolIsSignificant)
           
protected  Object getNext()
          Internally fetches the next token.
static void main(String[] args)
          Reads a file from the argument and prints its tokens one per line.
 
Methods inherited from class edu.stanford.nlp.process.AbstractTokenizer
hasNext, next, peek, remove, tokenize
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

WhitespaceTokenizer

public WhitespaceTokenizer(Reader r)
Constructs a new WhitespaceTokenizer with the Reader r as its source.


WhitespaceTokenizer

public WhitespaceTokenizer(Reader r,
                           boolean eolIsSignificant)
Constructs a new WhitespaceTokenizer with the Reader r as its source. If eolIsSignificant is true, EOL is returned as a token (a Word with String value "\n").

Method Detail

getNext

protected Object getNext()
Internally fetches the next token.

Specified by:
getNext in class AbstractTokenizer
Returns:
the next token in the token stream, or null if none exists.
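
The inherited nextToken field and the inherited hasNext, next, and peek methods suggest a one-token lookahead built on top of getNext and its null-at-end contract. The following is a minimal sketch under that assumption, not the actual AbstractTokenizer source. The names LookaheadSketch, SketchTokenizer, and ArrayTokenizer are hypothetical.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

public class LookaheadSketch {

    // Hypothetical base class: subclasses supply getNext(), which
    // returns null once the stream is exhausted.
    abstract static class SketchTokenizer implements Iterator<Object> {
        private Object nextToken; // one-token lookahead buffer

        protected abstract Object getNext();

        @Override
        public boolean hasNext() {
            // Fill the buffer lazily; null from getNext() means "no more tokens".
            if (nextToken == null) {
                nextToken = getNext();
            }
            return nextToken != null;
        }

        @Override
        public Object next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            Object result = nextToken;
            nextToken = null; // consume the buffered token
            return result;
        }

        // Returns the upcoming token without consuming it.
        public Object peek() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            return nextToken;
        }
    }

    // Toy subclass over a fixed array, standing in for a real token source.
    static class ArrayTokenizer extends SketchTokenizer {
        private final String[] items;
        private int pos = 0;

        ArrayTokenizer(String... items) {
            this.items = items;
        }

        @Override
        protected Object getNext() {
            return pos < items.length ? items[pos++] : null;
        }
    }

    public static void main(String[] args) {
        SketchTokenizer t = new ArrayTokenizer("a", "b");
        System.out.println(t.peek()); // look ahead without consuming
        while (t.hasNext()) {
            System.out.println(t.next());
        }
    }
}
```

Because null signals the end of the stream in this design, a subclass can never return null as a legitimate token.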

factory

public static TokenizerFactory factory()

factory

public static TokenizerFactory factory(boolean eolIsSignificant)

main

public static void main(String[] args)
                 throws IOException
Reads a file from the argument and prints its tokens one per line. This is mainly a testing aid, but it is also useful standalone for turning a corpus into a one-token-per-line file.

Usage: java edu.stanford.nlp.process.WhitespaceTokenizer filename

Parameters:
args - Command line arguments
Throws:
IOException
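
The command-line behavior described above can be approximated without the Stanford classes. The sketch below is a standalone stand-in, not the library's main method: it relies on java.util.Scanner, whose default delimiter is whitespace, and the names OneTokenPerLineSketch and onePerLine are hypothetical. Reading the file as UTF-8 is also an assumption, since this page does not state which encoding the real class uses.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Scanner;

public class OneTokenPerLineSketch {

    // Hypothetical helper: returns the whitespace-separated tokens
    // of the input, each followed by a newline.
    static String onePerLine(Reader r) {
        StringBuilder out = new StringBuilder();
        // Scanner splits on whitespace by default.
        try (Scanner scanner = new Scanner(r)) {
            while (scanner.hasNext()) {
                out.append(scanner.next()).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java OneTokenPerLineSketch filename");
            return;
        }
        // Assumption: the input file is UTF-8 encoded.
        try (Reader r = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            System.out.print(onePerLine(r));
        }
    }
}
```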


Stanford NLP Group