TregexPattern (Stanford JavaNLP API)

edu.stanford.nlp.trees.tregex
Class TregexPattern

java.lang.Object
  edu.stanford.nlp.trees.tregex.TregexPattern

All Implemented Interfaces: Serializable

public abstract class TregexPattern
extends Object
implements Serializable

A TregexPattern is a tgrep-type pattern.

TregexPattern instances can be matched against instances of the Tree class. The main(java.lang.String[]) method can be used to find matching nodes of a treebank from the command line.

Currently supported node/node relations and their symbols:

Symbol	Meaning
A << B	A dominates B
A >> B	A is dominated by B
A < B	A immediately dominates B
A > B	A is immediately dominated by B
A $ B	A is a sister of B (and not equal to B)
A .. B	A precedes B
A . B	A immediately precedes B
A ,, B	A follows B
A , B	A immediately follows B
A <<, B	B is a leftmost descendant of A
A <<- B	B is a rightmost descendant of A
A >>, B	A is a leftmost descendant of B
A >>- B	A is a rightmost descendant of B
A <, B	B is the first child of A
A >, B	A is the first child of B
A <- B	B is the last child of A
A >- B	A is the last child of B
A <` B	B is the last child of A
A >` B	A is the last child of B
A <i B	B is the ith child of A (i > 0)
A >i B	A is the ith child of B (i > 0)
A <-i B	B is the ith-to-last child of A (i > 0)
A >-i B	A is the ith-to-last child of B (i > 0)
A <: B	B is the only child of A
A >: B	A is the only child of B
A <<: B	A dominates B via an unbroken chain (length > 0) of unary local trees.
A >>: B	A is dominated by B via an unbroken chain (length > 0) of unary local trees.
A $++ B	A is a left sister of B (same as $.. for context-free trees)
A $-- B	A is a right sister of B (same as $,, for context-free trees)
A $+ B	A is the immediate left sister of B (same as $. for context-free trees)
A $- B	A is the immediate right sister of B (same as $, for context-free trees)
A $.. B	A is a sister of B and precedes B
A $,, B	A is a sister of B and follows B
A $. B	A is a sister of B and immediately precedes B
A $, B	A is a sister of B and immediately follows B
A <+(C) B	A dominates B via an unbroken chain of (zero or more) nodes matching description C
A >+(C) B	A is dominated by B via an unbroken chain of (zero or more) nodes matching description C
A <<# B	B is a head of phrase A
A >># B	A is a head of phrase B
A <# B	B is the immediate head of phrase A
A ># B	A is the immediate head of phrase B
A == B	A and B are the same node

Label descriptions can be literal strings, which must match labels exactly, or regular expressions in regular expression bars: /regex/. To prevent ambiguity with other Tregex symbols, only standard "identifiers" are allowed as literals, i.e. strings matching [a-zA-Z]([a-zA-Z0-9_])*. If you want to use other symbols, you can do so by using a regular expression instead of a literal string. A disjunctive list of literal strings can be given separated by '|'. The special string '__' (two underscores) can be used to match any node. (WARNING!! Use of the '__' node description may seriously slow down search.) If a label description is preceded by '@', the label will match any node whose basicCategory matches the description. The basicCategory is defined according to a Function mapping Strings to Strings, as provided by AbstractTreebankLanguagePack.getBasicCategoryFunction().
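
The identifier rule for literals can be checked ahead of time with plain java.util.regex. The sketch below is illustrative only: LabelLiteralCheck and isLiteral are hypothetical names, not part of the Tregex API, and the pattern assumes the identifier shape described above (a letter followed by letters, digits, or underscores) rather than the parser's actual grammar.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of edu.stanford.nlp.trees.tregex.
final class LabelLiteralCheck {
    // Assumed identifier shape: one letter, then letters, digits, or underscores.
    private static final Pattern LITERAL = Pattern.compile("[a-zA-Z][a-zA-Z0-9_]*");

    /** Returns true if the whole label has the identifier shape and could be written as a literal. */
    static boolean isLiteral(String label) {
        return LITERAL.matcher(label).matches();
    }

    public static void main(String[] args) {
        for (String label : new String[] {"NP", "NP_SBJ", "NP-SBJ", "-NONE-", "PRP$"}) {
            String verdict = isLiteral(label) ? "literal" : "needs /regex/";
            System.out.println(label + " -> " + verdict);
        }
    }
}
```

Labels such as NP-SBJ or -NONE- fail the check because of the hyphen, so a pattern that targets them would use a regular expression description such as /^-NONE-/ instead.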

In a chain of relations, all relations are relative to the first node in the chain. For example, (S < VP < NP) means "an S over a VP and also over an NP". If instead what you want is an S above a VP above an NP, you should write "S < (VP < NP)".
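
The difference between the two readings can be made concrete with a toy tree model. This is a minimal sketch under stated assumptions: ChainDemo and its Node class are hypothetical stand-ins written for this example (not edu.stanford.nlp.trees.Tree), and the two methods test only the node they are given, whereas a real pattern is tried at every node of a tree.

```java
import java.util.Arrays;
import java.util.List;

// Toy model for illustration only; not the Stanford Tree or Tregex classes.
final class ChainDemo {
    static final class Node {
        final String label;
        final List<Node> children;

        Node(String label, Node... children) {
            this.label = label;
            this.children = Arrays.asList(children);
        }

        // True if some child carries the given label ("this < childLabel").
        boolean hasChild(String childLabel) {
            for (Node child : children) {
                if (child.label.equals(childLabel)) {
                    return true;
                }
            }
            return false;
        }
    }

    // Reading of (S < VP < NP): both relations are tested against the S node.
    static boolean flatChain(Node n) {
        return n.label.equals("S") && n.hasChild("VP") && n.hasChild("NP");
    }

    // Reading of S < (VP < NP): the second relation is tested against the VP child.
    static boolean nestedChain(Node n) {
        if (!n.label.equals("S")) {
            return false;
        }
        for (Node child : n.children) {
            if (child.label.equals("VP") && child.hasChild("NP")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Node sisters = new Node("S", new Node("NP"), new Node("VP", new Node("VBD")));
        Node stacked = new Node("S", new Node("VP", new Node("NP")));
        System.out.println(flatChain(sisters) + " " + nestedChain(sisters)); // true false
        System.out.println(flatChain(stacked) + " " + nestedChain(stacked)); // false true
    }
}
```

In the first tree NP and VP are both children of S, so only the flat chain holds; in the second the NP sits under the VP, so only the parenthesized form holds.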

Nodes can be grouped using parens '(' and ')' as in S < (NP $++ VP) to match an S over an NP, where the NP has a VP as a right sister.

Relations can be combined using the '&' and '|' operators. Thus (NP < NN | < NNS) will match an NP node dominating either an NN or an NNS. (NP > S & $++ VP) matches an NP that is both under an S and has a VP as a right sister.

Relations can be grouped using brackets '[' and ']'. So the expression NP [< NN | < NNS] & > S matches an NP that dominates either an NN or an NNS and is under an S. Without brackets, & takes precedence over |, and equivalent operators are left-associative. Also note that & is the default combining operator if the operator is omitted in a chain of relations, so that (S < VP < NP) is equivalent to (S < VP & < NP). As another example, (VP < VV | < NP % NP) can be written explicitly as (VP [< VV | [< NP & % NP] ] ).

Relations can be negated with the '!' operator, in which case the expression will match only if there is no node satisfying the relation. For example (NP !< NNP) matches only NPs not dominating an NNP. Label descriptions can also be negated with '!': (NP < !NNP|NNS) matches NPs dominating some node that is not an NNP or an NNS.

To consider only the "basic category" of a tree label, i.e. to ignore functional tags or other annotations on the label, prefix that node's description with the @ symbol. For example: (@NP < @/NN.?/). This can only be used for individual nodes; if you want all nodes to use the basic category, it would be more efficient to use a TreeNormalizer to remove functional tags before passing the tree to the TregexPattern.

Naming nodes



  Nodes can be given names using '='.  A named node will be stored in a
 map that maps names to nodes so that if a match is found, the node
 corresponding to the named node can be extracted from the map.  For
 example  (NP < NNP=name)  will match an NP dominating an NNP
 and after a match is found, the map can be queried with the
 name to retrieve the matched node using TregexMatcher.getNode(Object o)
 with (String) argument "name" (not "=name").
 Note that a ParseException will be thrown if a named node is used in the
 scope of a negated relation, because the semantics would be unclear.

 
 Named nodes that refer back to previous named nodes need not have a node
 description -- this is known as "backreferencing".  In this case, the expression
 will match only if the subsequently named node is equal to the previously named
 node (in the == sense).
 For example:  (@NP <, (@NP $+ (/,/ $+ (@NP $+ /,/=comma))) <- =comma) 
 matches an NP dominating exactly the sequence NP comma NP comma.  Multiple
 backreferences are allowed.  If the node with no node description does not refer
 to a previously named node, there will be no error; the expression simply will
 not match anything.

 
 Another way to refer to previously named nodes is with the "link" symbol: '~'.
 A link is like a backreference, except that instead of having to be *equal to* the
 referred node, the current node only has to match the label of the referred to node.
 A link cannot have a node description, i.e. the '~' symbol must immediately follow a
 relation symbol.
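
 The distinction between a backreference and a link comes down to node identity versus label agreement. The sketch below illustrates that distinction on a toy node type; BackrefVsLink, Node, and both method names are hypothetical, and treating "match the label" as plain string equality is an assumption made for this example rather than a statement about how Tregex compares labels.

```java
// Toy model for illustration only; not the Stanford Tree or Tregex classes.
final class BackrefVsLink {
    static final class Node {
        final String label;

        Node(String label) {
            this.label = label;
        }
    }

    // Backreference (=name): the candidate must be the very same node.
    static boolean satisfiesBackreference(Node named, Node candidate) {
        return named == candidate;
    }

    // Link (~name): the candidate only needs the same label (assumed: string equality).
    static boolean satisfiesLink(Node named, Node candidate) {
        return named.label.equals(candidate.label);
    }

    public static void main(String[] args) {
        Node first = new Node("NP");
        Node second = new Node("NP"); // a different node that happens to share the label
        System.out.println(satisfiesBackreference(first, second)); // false
        System.out.println(satisfiesLink(first, second));          // true
        System.out.println(satisfiesBackreference(first, first));  // true
    }
}
```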

 
 Relations can be made optional with the '?' operator.  This way the
 expression will match even if the optional relation is not satisfied, but
 if it is satisfied, named nodes under it will still be put into the map.

 
 The HeadFinder used to determine heads for the head relations, and also
 the Function mapping from labels to Basic Category tags can be
 chosen by using a TregexPatternCompiler.

 
Variable Groups

  If you write a node description using a regular expression, you can assign its matching groups to variable names.
 If more than one node has a group assigned to the same variable name, then matching will only occur when all such groups
 capture the same string.  This is useful for enforcing coindexation constraints.  The syntax is

 

  /<regex-stuff>/#<group-number>%<variable-name>
 

 For example, the pattern (designed for Penn Treebank trees)

 
  @SBAR < /^WH.*-([0-9]+)$/#1%index << (__=empty < (/^-NONE-/ < /^\*T\*-([0-9]+)$/#1%index)) 
 

 will match only when the WH- node under the SBAR is coindexed with the trace node that is given the name empty.
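
 The coindexation constraint amounts to two capture groups yielding the same string. The following sketch reproduces that check with java.util.regex alone, using the two regular expressions from the pattern above; CoindexDemo and coindexed are hypothetical names introduced for this example, and the sample labels are illustrative Penn Treebank style labels, not output from Tregex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration of the variable-group constraint; not part of the Tregex API.
final class CoindexDemo {
    // The same regular expressions as in the example pattern, escaped for Java strings.
    private static final Pattern WH = Pattern.compile("^WH.*-([0-9]+)$");
    private static final Pattern TRACE = Pattern.compile("^\\*T\\*-([0-9]+)$");

    /** True if both labels match and group 1 (the "index" variable) captures the same string. */
    static boolean coindexed(String whLabel, String traceLabel) {
        Matcher wh = WH.matcher(whLabel);
        Matcher trace = TRACE.matcher(traceLabel);
        if (!wh.find() || !trace.find()) {
            return false;
        }
        return wh.group(1).equals(trace.group(1));
    }

    public static void main(String[] args) {
        System.out.println(coindexed("WHNP-1", "*T*-1")); // true: both capture "1"
        System.out.println(coindexed("WHNP-1", "*T*-2")); // false: "1" vs "2"
        System.out.println(coindexed("NP-1", "*T*-1"));   // false: first label is not a WH label
    }
}
```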


 
Current known bugs/shortcomings:

 

  Node search currently takes no advantage of limitations
 imposed by the queried relations.  This reduces the efficiency of
 the search, quite a bit in some cases.
 
 Due to the lack of parent pointers in Trees, parents are found via
 exhaustive depth-first search from the root.  This is a serious efficiency bottleneck.

 




Author:
  Galen Andrew, Roger Levy (rog@csli.stanford.edu)
See Also:
Serialized Form









Field Summary

Modifier and Type	Field
protected static Function<String,String>	currentBasicCatFunction


 






Method Summary

Modifier and Type	Method	Description
static TregexPattern	compile(String tregex)	Creates a pattern from the given string using the default HeadFinder and BasicCategoryFunction.
static void	main(String[] args)	Prints out all matches of a tree pattern on each tree in the path.
TregexMatcher	matcher(Tree t)	Get a TregexMatcher for this pattern on this tree.
String	pattern()
void	prettyPrint()	Print a multi-line representation of the pattern illustrating its syntax to System.out.
void	prettyPrint(PrintStream ps)	Print a multi-line representation of the pattern illustrating its syntax.
void	prettyPrint(PrintWriter pw)	Print a multi-line representation of the pattern illustrating its syntax.
void	setPatternString(String patternString)
abstract String	toString()


 


Methods inherited from class java.lang.Object


clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait


 








Field Detail




currentBasicCatFunction
protected static Function<String,String> currentBasicCatFunction











Method Detail




matcher
public TregexMatcher matcher(Tree t)

Get a TregexMatcher for this pattern on this tree.





Parameters:
t - a tree to match on
Returns:
a TregexMatcher





compile
public static TregexPattern compile(String tregex)
                             throws ParseException

Creates a pattern from the given string using the default HeadFinder and
 BasicCategoryFunction.  If you want to use a different HeadFinder or
 BasicCategoryFunction, use a TregexPatternCompiler object.





Parameters:
tregex - the pattern string
Returns:
a TregexPattern for the string.
Throws:
ParseException - if the string does not parse





pattern
public String pattern()











setPatternString
public void setPatternString(String patternString)











toString
public abstract String toString()


Overrides:
toString in class Object



Returns:
A single-line string representation of the pattern





prettyPrint
public void prettyPrint(PrintWriter pw)

Print a multi-line representation
 of the pattern illustrating its syntax.











prettyPrint
public void prettyPrint(PrintStream ps)

Print a multi-line representation
 of the pattern illustrating its syntax.











prettyPrint
public void prettyPrint()

Print a multi-line representation of the pattern illustrating
 its syntax to System.out.











main
public static void main(String[] args)

Prints out all matches of a tree pattern on each tree in the path.
 Usage: 


 java edu.stanford.nlp.trees.tregex.TregexPattern [-TCwfosnu] pattern [handle] filepath

 
 Arguments:

 
 pattern: the tree pattern, which optionally names some node with =name (for some arbitrary string "name")
 
 handle (optional): the named node in pattern
 
 filepath: the path to files with trees
 

 Options:

 
 -T causes all trees to be printed as processed.  Otherwise only matching nodes
 are printed.
 
 -C suppresses printing of matches, so only the
 number of matches is printed.
 
 -w causes the whole of a tree that matches to be printed.
 
 -f causes the filename to be printed.
 
 -o allows each tree node to be reported only once as the root of a match (by default a node will
 be printed once for every way the pattern matches).
 
 -s causes trees to be printed all on one line (by default they are pretty printed).
 
 -n causes the number of the tree in which the match was found to be
 printed before every match.
 
 -u causes only the label of each matching node to be printed.
 
 -t causes only the yield of the selected node (or the yield of the whole tree, if the -w option is used) to be printed.
 
 -encoding  option allows specification of character set.
 




















  






Stanford NLP Group
