edu.stanford.nlp.process
Class StripTagsProcessor

java.lang.Object
  extended by edu.stanford.nlp.process.AbstractListProcessor
      extended by edu.stanford.nlp.process.StripTagsProcessor
All Implemented Interfaces:
ListProcessor, Processor

public class StripTagsProcessor
extends AbstractListProcessor

A Processor whose process method deletes all SGML/XML/HTML tags (tokens starting with < and ending with >). Optionally, newlines can be inserted after the end of block-level tags to roughly mark where continuous text was broken up; this helps, for example, with finding sentence boundaries.
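
The behavior described above can be sketched in a few lines of plain Java. This is an illustrative sketch only, not the source of StripTagsProcessor: the class name TagStripSketch, its helper methods, and the contents of BLOCK_TAGS are hypothetical, and it assumes that tokens are plain strings and that "the end of a block-level tag" means a closing tag such as </p>.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative sketch only; not the StripTagsProcessor implementation. */
class TagStripSketch {

    // Assumed sample of block-level tag names; the real blockTags set may differ.
    static final Set<String> BLOCK_TAGS =
        new HashSet<>(Arrays.asList("p", "div", "li", "h1", "h2", "blockquote"));

    /** A token counts as a tag if it starts with '<' and ends with '>'. */
    static boolean isTag(String token) {
        return token.length() >= 2 && token.startsWith("<") && token.endsWith(">");
    }

    /** Extracts the lower-cased element name from a tag token such as "</P>" or "<div id=x>". */
    static String tagName(String tag) {
        int start = 1;
        if (tag.charAt(start) == '/') {
            start++; // skip the slash of a closing tag
        }
        int end = start;
        while (end < tag.length() && Character.isLetterOrDigit(tag.charAt(end))) {
            end++;
        }
        return tag.substring(start, end).toLowerCase(Locale.ROOT);
    }

    /**
     * Returns a new list without tag tokens. If markLineBreaks is true, a "\n"
     * token is emitted in place of each closing block-level tag (an assumption
     * about what "the end of block-level tags" means).
     */
    static List<String> stripTags(List<String> tokens, boolean markLineBreaks) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (!isTag(token)) {
                out.add(token); // ordinary word: keep it
            } else if (markLineBreaks
                    && token.startsWith("</")
                    && BLOCK_TAGS.contains(tagName(token))) {
                out.add("\n"); // mark where a block of text ended
            }
            // every other tag is dropped silently
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> in = Arrays.asList(
            "<p>", "Hello", "<b>", "world", "</b>", "</p>", "<p>", "Bye", "</p>");
        System.out.println(stripTags(in, false)); // prints [Hello, world, Bye]
        System.out.println(stripTags(in, true).size());
    }
}
```

With markLineBreaks set to true, the same input yields the words with a "\n" token after "world" and after "Bye", which gives a downstream sentence splitter a hint that the two paragraphs are separate.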

Author:
Christopher Manning

Field Summary
static Set blockTags
          Block-level HTML tags that are rendered with surrounding line breaks.
 
Constructor Summary
StripTagsProcessor()
          Constructs a new StripTagsProcessor that doesn't mark line breaks.
StripTagsProcessor(boolean markLineBreaks)
          Constructs a new StripTagsProcessor that marks line breaks as specified.
 
Method Summary
 boolean getMarkLineBreaks()
          Returns whether the output of the processor will contain newline words ("\n") at the end of block-level tags.
static void main(String[] args)
          For internal debugging purposes only.
 List process(List in)
          Returns a new Document with the same meta-data as in, and the same words except tags are stripped.
 void setMarkLineBreaks(boolean markLineBreaks)
          Sets whether the output of the processor will contain newline words ("\n") at the end of block-level tags.
 
Methods inherited from class edu.stanford.nlp.process.AbstractListProcessor
processDocument, processLists
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

blockTags

public static final Set blockTags
Block-level HTML tags that are rendered with surrounding line breaks.

Constructor Detail

StripTagsProcessor

public StripTagsProcessor()
Constructs a new StripTagsProcessor that doesn't mark line breaks.


StripTagsProcessor

public StripTagsProcessor(boolean markLineBreaks)
Constructs a new StripTagsProcessor that marks line breaks as specified.

Method Detail

getMarkLineBreaks

public boolean getMarkLineBreaks()
Returns whether the output of the processor will contain newline words ("\n") at the end of block-level tags.


setMarkLineBreaks

public void setMarkLineBreaks(boolean markLineBreaks)
Sets whether the output of the processor will contain newline words ("\n") at the end of block-level tags.


process

public List process(List in)
Returns a new Document with the same meta-data as in, and the same words except tags are stripped.


main

public static void main(String[] args)
For internal debugging purposes only.



Stanford NLP Group