PTBTokenizer (Stanford JavaNLP API)

edu.stanford.nlp.process
Class PTBTokenizer

java.lang.Object
  edu.stanford.nlp.process.AbstractTokenizer
      edu.stanford.nlp.process.PTBTokenizer

All Implemented Interfaces: Tokenizer, Iterator

public class PTBTokenizer
extends AbstractTokenizer

Tokenizer implementation that conforms to the Penn Treebank tokenization conventions. This tokenizer is a Java implementation of Professor Chris Manning's Flex tokenizer, pgtt-treebank.l. It reads raw text and outputs tokens as edu.stanford.nlp.trees.Word objects in the Penn Treebank format. It can optionally return carriage returns as tokens.

Author: Tim Grow, Teg Grenager (grenager@stanford.edu), Christopher Manning

Nested Class Summary
`static class`	`PTBTokenizer.PTBTokenizerFactory`

Field Summary

Fields inherited from class edu.stanford.nlp.process.AbstractTokenizer
`nextToken`

Constructor Summary
`PTBTokenizer(Reader r)` Constructs a new PTBTokenizer that treats carriage returns as normal whitespace.
`PTBTokenizer(Reader r, boolean tokenizeCRs)` Constructs a new PTBTokenizer that optionally returns carriage returns as their own token.
`PTBTokenizer(Reader r, boolean tokenizeCRs, LexedTokenFactory tokenFactory)` Constructs a new PTBTokenizer that optionally returns carriage returns as their own token, and has a custom LexedTokenFactory.

Method Summary
`static TokenizerFactory`	`factory()`
`protected Object`	`getNext()` Internally fetches the next token.
`static void`	`main(String[] args)` Reads a file from the argument and prints its tokens one per line.
`static String`	`ptb2Text(List ptbWords)` Returns a presentable version of the given PTB-tokenized words.
`static String`	`ptb2Text(String ptbText)` Returns a presentable version of the given PTB-tokenized text.
`void`	`setSource(Reader r)` Sets the source of this Tokenizer to be the Reader r.

Methods inherited from class edu.stanford.nlp.process.AbstractTokenizer
`hasNext, next, peek, remove, tokenize`

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Constructor Detail

PTBTokenizer

public PTBTokenizer(Reader r)

Constructs a new PTBTokenizer that treats carriage returns as normal whitespace.

PTBTokenizer

public PTBTokenizer(Reader r,
                    boolean tokenizeCRs)

Constructs a new PTBTokenizer that optionally returns carriage returns as their own token. CRs come back as Words whose text is the value of PTBLexer.cr.
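The effect of the tokenizeCRs flag can be illustrated with a small standalone sketch. This is not the library's lexer: it splits on whitespace only, the class name CrTokenSketch is hypothetical, and the CR_TOKEN constant is a placeholder, since the actual value of PTBLexer.cr is not given on this page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; the real PTBTokenizer applies full
// Penn Treebank rules, not plain whitespace splitting.
class CrTokenSketch {
    // Placeholder for the value of PTBLexer.cr (an assumption of this sketch).
    static final String CR_TOKEN = "*CR*";

    static List<String> tokenize(String text, boolean tokenizeCRs) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                // Any whitespace ends the token being built.
                if (current.length() > 0) {
                    out.add(current.toString());
                    current.setLength(0);
                }
                boolean lineBreak = (c == '\n' || c == '\r');
                if (lineBreak && tokenizeCRs) {
                    // Treat "\r\n" as a single line break.
                    if (c == '\r' && i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                        i++;
                    }
                    out.add(CR_TOKEN);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            out.add(current.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "a b\r\nc\n";
        System.out.println(tokenize(text, false)); // line breaks act as whitespace
        System.out.println(tokenize(text, true));  // line breaks become tokens
    }
}
```

With the flag off, line breaks behave like any other whitespace; with it on, each line break yields one extra token in the stream.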

PTBTokenizer

public PTBTokenizer(Reader r,
                    boolean tokenizeCRs,
                    LexedTokenFactory tokenFactory)

Constructs a new PTBTokenizer that optionally returns carriage returns as their own token, and has a custom LexedTokenFactory. CRs come back as Words whose text is the value of PTBLexer.cr.

Parameters: tokenFactory - The LexedTokenFactory to use to create tokens from the text.

Method Detail

getNext

protected Object getNext()

Internally fetches the next token.

Specified by: getNext in class AbstractTokenizer

Returns: the next token in the token stream, or null if none exists.
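The contract above (getNext returns null at the end of the stream) is what lets a base class build hasNext, next, peek and tokenize on top of a single lookahead slot such as the inherited nextToken field. The following is a minimal sketch of that pattern under stated assumptions: LookaheadTokenizerSketch is a hypothetical stand-in that splits on whitespace, and it is not the source of AbstractTokenizer.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in, not the library class.
class LookaheadTokenizerSketch implements Iterator<String> {
    private final Reader in;
    private String nextToken;       // one-token lookahead buffer
    private boolean primed = false; // whether nextToken has been filled yet

    LookaheadTokenizerSketch(Reader in) {
        this.in = in;
    }

    // Fetches the next whitespace-delimited token, or null when input is exhausted.
    protected String getNext() {
        try {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = in.read()) != -1) {
                if (Character.isWhitespace(c)) {
                    if (sb.length() > 0) {
                        break; // whitespace after a token ends it
                    }
                } else {
                    sb.append((char) c);
                }
            }
            return sb.length() > 0 ? sb.toString() : null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Fills the lookahead slot on first use.
    private void prime() {
        if (!primed) {
            nextToken = getNext();
            primed = true;
        }
    }

    @Override
    public boolean hasNext() {
        prime();
        return nextToken != null;
    }

    // Returns the upcoming token without consuming it.
    public String peek() {
        prime();
        if (nextToken == null) {
            throw new NoSuchElementException();
        }
        return nextToken;
    }

    @Override
    public String next() {
        String token = peek();
        nextToken = getNext(); // refill the lookahead slot
        return token;
    }

    // Drains the remaining tokens into a list.
    public List<String> tokenize() {
        List<String> out = new ArrayList<>();
        while (hasNext()) {
            out.add(next());
        }
        return out;
    }
}
```

Keeping the end-of-stream signal in getNext means a subclass only has to implement one method, while the iteration logic lives in one place.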

setSource

public void setSource(Reader r)

Sets the source of this Tokenizer to be the Reader r.

ptb2Text

public static String ptb2Text(String ptbText)

Returns a presentable version of the given PTB-tokenized text. PTB tokenization splits up punctuation and does various other things that make simply joining the tokens with spaces look bad. So join the tokens with spaces and run the result through this method to produce nice-looking text. It's not perfect, but it works pretty well.

ptb2Text

public static String ptb2Text(List ptbWords)

Returns a presentable version of the given PTB-tokenized words. Pass in a List of Words or Strings, or a Document, and this method will join the words with spaces and call ptb2Text(String) on the output. The method checks whether the elements in the list are subtypes of Word; if so, it takes their word() values to prevent additional text (e.g., POS tags) from creeping in. Otherwise, each element's toString value is used.
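To make the "join with spaces, then clean up" idea concrete, here is a rough standalone sketch of detokenization. It is not the implementation of ptb2Text: the class DetokenizeSketch and its handful of rules (quote tokens, -LRB-/-RRB- brackets, punctuation, common clitics) are assumptions chosen for illustration and cover far fewer cases than a real detokenizer would.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; the rule set is an assumption, not the library's.
class DetokenizeSketch {
    static String detokenize(List<String> tokens) {
        // Step 1: naive join, which leaves stray spaces around punctuation.
        String s = String.join(" ", tokens);
        // Step 2: PTB-style quote tokens become plain double quotes.
        s = s.replace("`` ", "\"");
        s = s.replace(" ''", "\"");
        // Step 3: PTB-style bracket tokens become parentheses.
        s = s.replace("-LRB-", "(").replace("-RRB-", ")");
        // Step 4: remove the space before closing punctuation and after "(".
        s = s.replaceAll(" ([.,;:!?)])", "$1");
        s = s.replaceAll("\\( ", "(");
        // Step 5: reattach clitics that PTB tokenization splits off.
        s = s.replaceAll(" (n't|'s|'re|'ve|'ll|'d|'m)\\b", "$1");
        return s;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "``", "I", "do", "n't", "know", ",", "''", "she", "said", ".");
        System.out.println(detokenize(tokens));
    }
}
```

The ordering matters in this sketch: quote and bracket tokens are normalized first so that the later punctuation rules see ordinary characters.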

factory

public static TokenizerFactory factory()

main

public static void main(String[] args)
                 throws IOException

Reads a file from the argument and prints its tokens one per line. This is mainly a testing aid, but it can also be quite useful standalone to turn a corpus into a file with one token per line. This main method assumes that the input file is in UTF-8 encoding unless a different charset is specified.

Usage: java edu.stanford.nlp.process.PTBTokenizer [-charset charset] [-nl] filename

Parameters: args - Command line arguments
Throws: IOException
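The usage line above can be read as two optional flags followed by a filename. The sketch below shows one way such arguments might be parsed, with UTF-8 as the default charset as stated above. TokenizerArgsSketch is a hypothetical helper, not the library's main method, and because this page does not say what -nl does, the sketch only records whether the flag was present.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical argument holder; not part of the Stanford API.
class TokenizerArgsSketch {
    final Charset charset;
    final boolean nl;      // whether -nl was given (meaning not documented here)
    final String filename;

    private TokenizerArgsSketch(Charset charset, boolean nl, String filename) {
        this.charset = charset;
        this.nl = nl;
        this.filename = filename;
    }

    static TokenizerArgsSketch parse(String[] args) {
        Charset charset = StandardCharsets.UTF_8; // documented default
        boolean nl = false;
        String filename = null;
        for (int i = 0; i < args.length; i++) {
            if (args[i].equals("-charset")) {
                if (i + 1 >= args.length) {
                    throw new IllegalArgumentException("-charset requires a value");
                }
                charset = Charset.forName(args[++i]);
            } else if (args[i].equals("-nl")) {
                nl = true;
            } else if (filename == null) {
                filename = args[i];
            } else {
                throw new IllegalArgumentException("unexpected argument: " + args[i]);
            }
        }
        if (filename == null) {
            throw new IllegalArgumentException(
                    "usage: [-charset charset] [-nl] filename");
        }
        return new TokenizerArgsSketch(charset, nl, filename);
    }
}
```

A caller could then open the file with the parsed charset, for example through java.nio.file.Files.newBufferedReader, and hand the resulting Reader to a tokenizer constructor.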

Stanford NLP Group
