INESS-logo
INESS Search

Treebanks

Tools



INESS Search is a querying system for treebanks in a variety of formats. It handles search in constituency, dependency and HPSG treebanks as well as in LFG treebanks.

The core of INESS Search is a reimplementation of TIGERSearch in Common Lisp and contains numerous extensions and improvements. The most important improvements are:

  • Support for LFG, HPSG, dependency and constituency treebanks
  • First-order predicate logic (existential and universal quantification; negation; quantifier scope can be specified) (with the exception of universal quantification over disjunctions)
  • Search in directed graphs (i.e., LFG F-structures), not only trees (with secondary edges)
  • Search in metadata (for some treebanks; author, document, title)
  • Abbreviated syntax, leading to less verbose query expressions
  • New operators: regular expressions over edge labels, rule operator, c-command and extended head
  • Significantly improved query execution speed
  • Parallelized query execution when searching in more than one treebank
  • Lazy evaluation: search for the first match in every sentence; other matches are calculated when the sentence is displayed
  • Dynamic index updating: the index always reflects the current state of the treebank
  • Aggregation of results by node feature values for selected node variables and features in tabular form
  • Possibility to refine queries (that is, to query a previous query result set). This feature is partially implemented.

Planned improvements and extensions include:

  • Search in parallel treebanks

Tree search: Search in constituency and dependency treebanks

Nodes and node variables

An INESS Search query is a logical form that consists of a set of (existentially or universally) quantified node variables and a specification of predicates (or constraints) on and relations between those node variables.

A treebank graph (or tree) matches a query if the logical form evaluates to true on that graph. In the easiest case of a query containing existentially quantified nodes only, a graph matches if there exists an instantiation of the node variables with nodes from the graph such that the query form is true on those nodes.

Every valid instantiation of the existential node variables that are not in the scope of universal variables is a submatch and is displayed separately in the matching graph.

  • Example: A tree matches the query #x > #y if it contains a node X directly dominating a node Y, since the instantiations #x ← X and #y ← Y make the query form evaluate to true: “X > Y” is true. The pair (X, Y) is one submatch of the query.

Node variables are strings of ASCII characters preceded by one of the variable markers # or %:

#n, #x_1, %vp

  • If the variable marker is #, the node variable is taken to be existentially quantified (there exists an x such that ...).
  • If the variable marker is %, the node variable is universally quantified (for all x: ...). More on quantification below.
  • If the variable name contains an _, its values are not displayed in the frequency tables.

In queries using the abbreviated syntax, node variables are often introduced implicitly. Those implicit variables are always existentially quantified.

  • Example: The abbreviated query NP, which matches nodes labeled ‘NP’, introduces an implicit node variable for the node that should be matched, as can be seen from the equivalent unabbreviated form of the query: #n_1:[label="NP"].

Node constraints and features

Treebank graph nodes come with feature values for a given set of features (or attributes).
The number of available features depends on the treebank and the coded features of a given treebank.

Commonly used features are:

  • word: the surface word form, on leaf nodes in constituency trees, and on all nodes in dependency graphs
  • lemma: the lemma form
  • morph: morphology, used e.g., in the UD treebanks
  • value: surface word form in c-structures (will be changed to word some time)
  • atom: atomic value in an LFG f-structure
  • cat: node label (not used in f-structures)
  • edge: label of an edge to the/a parent node (multi-valued feature in f-structures)
  • child-edge: label of an edge to a child node (multi-valued feature)
  • sec-edge: label of a secondary edge (Tiger-style constituency and dependency treebanks)
  • sec-child-edge: label of a secondary edge to a child node (Tiger-style constituency and dependency treebanks)
  • case, number, gender, person, degree, tense, mood, cat: used in the TIGER treebank
  • type, entity: used in HPSG treebanks

The features available in a given treebank for querying are listed on the Treebank details page.

A node variable can be constrained to a combination (by &) of feature-value pairs using a node constraint expression in square brackets.

Examples:

  • #x:[word="gut"]
  • %y:[lemma="gehen" & pos="V.*"]

For most features, the value is given as a regular expression (see below).
Some features can be set-valued, and they can be queried in a special syntax, where the right side of the feature-value pair is a boolean combination of values in parenthesis.

As example we can take the morph feature of the UD treebanks. Here, the annoated values are sets of feature-value pairs. For instance, the morh feature of the word ‘mora’ in the Slovenian treebank has the value

  • Aspect=Imp Mood=Ind Number=Sing Person=3 Tense=Pres VerbForm=Fin

and a query maching this node could be

  • [morph=("Mood=Ind" "Tense=Pres" !"Number=Plur")]

Inside the parenthesis, the boolean operators and (no overt operator), or (‘|’), and negation (‘!’) can be used.
Values are written in quotes, and here, no regular expressions are allowed.

Global features

For every treebank, the global features lang and treebank are indexed. Their values are the language of the treebank, and the treebank name, respectively.

In some of the NorGramBank LFG treebanks, some additional global features are indexed. These are the following:

  • doc : the document id
  • author : the text author, or translator if the text is translated
  • orig-author : the author of the original (empty if the original is Norwegian)
  • title : the text title
  • language : the language of the sentence of a matching node. This attribute has sentence granularity, as the language of a particular sentence can be different from the treebank language. This feature is particularly important to discriminate between Bokmål and Nynorsk in the nor-stortinget treebanks, which contain sentences in both variants.

Constraints on global features follow the main query expression, separated by two colons. Constraints are not enclosed in square brackets (since they do not relate to specific nodes), and variables can not be used on them.

Example:

NP >* "bok" :: author="Gaarder.*" & title="Sofie.*"

The features lang and treebank cannot be constrained, they are only useful in the tabular display (see below).

Regular expressions on strings

The value of a feature-value pair (in the form attribute="value") is always a regular expression that matches a string. Regular expressions are enclosed in double quotes ("...") or single quotes ('...').

NB 1: This is different from TigerSearch, where a regular expression is enclosed in slashes /.../, whereas string literals are put in single or double quotes. Treating string literals as a special case of a regular expression has proven to be more convenient.

NB 2: Note the different use of double vs. single quotes below.

The following special operators are implemented:

  • A dot (.) represents any character.

Thus, "bu." would match bus and bug, among others.

  • An asterisk (*, «Kleene star») behind an expression means arbitrary many (including zero) repetitions of that expression.

E.g., "help!*" would match help, help!, help!!, help!!!, and so on.

  • In a similar way, a plus (+, Kleene plus) denotes arbitrary many, but at least one repetition of that expression.

"help!+" would match help!, help!!, help!!!, and so on, but not help.

  • A question mark (?, optionality operator) means that the expression it follows is optional, it does not need to be present in the matched string.

"runs?" matches both run and runs.

  • Parentheses (( ... )) can be used to group expressions. A Kleene star or plus or an optionality operator operate on the whole grouped expression:

"(oh)?(la)+" matches la, ohla, lala, ohlala, lalala and so on.

  • Character classes are enclosed in angular brackets ([...]). They can contain arbitrary characters, including . ? * + which do not need to be escaped. You have to escape [ and ].

A character class matches any of the characters enclosed in the brackets. You can specify a range of characters using a hyphen; the character class [a-zåøæA-ZÅØÆ] for instance matches all characters of the Norwegian alphabet (excluding diacritics).
A character class can be defined as a complement using the notation [^...], matching all characters except those following the caret ^.

To give characters that have a special meaning in regular expressions their literal meaning, they have to be escaped with a backslash. For instance, "\." matches a dot, whereas "." would raise an error. Here is the complete list of characters that have to be escaped:

. , ? + * ( [ | ^ $ &

[Some notes about constraint and value variables to come.]

Boolean expressions for set-valued attributes

In LFG-, constituency and dependency treebanks, attributes (notably attributes denoting grammatical features) can be set-valued.
Queries over such attributes can be given as boolean expressions over the values, which is much more convenient than writing a regular expression for a complicated query.

The query [morph=(("adj" | "subst" !"prop") "fem")] would match all word nodes having "adj" "fem" or "subst" "fem" but not "prop" in their feature set (e.g., features="subst appell fem sg"). The available operators are, as can be seen from the example, boolean AND (juxtaposition), OR (|), and NOT (!), where parentheses are used for grouping.

Value negation

In attribute-value expressions, you can use negated values (which should not be confused with existential negation, see below).

A negated attribute-value expression has the form attribute!="value", which matches all nodes whose attribute value does not match the regular expression "value". In the abbreviated syntax for terminals, this negation has the form !"value", which is equaivalent to [word!="value"]. Similarly, a negated tree node label can simply be written as, e.g., !NP, which would match all tree nodes whose label is different from NP.

Node relations

The following operators are mostly equivalent to the corresponding operators in TIGERSearch.

Operator Explanation Notes
> direct dominance
>HD labeled direct dominance A label may contain any of the following characters: a-z, A-Z, 0-9, -, $.
>'>P' If other characters are necessary (as in Constraint Grammar-derived dependency analyses, where labels tend to contain the characters ‘<’ and ‘>’), the label can be escaped with single or double quotes. If double quotes are used, the label may not contain double quotes, and accordingly for single quotes.
>* dominance (minimum distance 1)
>{n} dominance with distance n (n>0) The distance n has to be put into curly braces
>{m,n} dominance with distance between m and n (0<m<n) The distance n has to be put into curly braces
>@l leftmost terminal successor (‘left corner’)
>@r rightmost terminal successor (‘right corner’)
!>L negated labeled direct dominance
!> negated unlabeled direct dominance
!>* negated dominance
!>{n} negated dominance, distance n The distance n has to be put into curly braces
!>{m,n} negated dominance, distance m...n (0<m<n) The distance n has to be put into curly braces
!>@l negated left corner
!>@r negated right corner
. immediate precedence (distance 1) In dependency treebanks, the precedence operator relates to the surface token order.
.* precedence (minimum distance 1) When querying LFG treebanks, this operator can be used to ensure that f-structure nodes are not instantiated several times with permutations of a node set in an otherwise symmetric query. E.g., in #f >$ #g & #f >$ #h & #g .* #h, the last term makes sure that not both (f1, f2) and (f2, f1) can be matches for (#g, #h).
.{n} precedence with distance n (n > 0) The distance n has to be put into curly braces
.{m,n} precedence with distance between m and n (0 > m > n) The distance n has to be put into curly braces
!.* negated precedence
!.{n} negated precedence, distance n The distance n has to be put into curly braces
!.{m,n} negated precedence, distance m...n The distance n has to be put into curly braces
$ siblings
$.* siblings with precedence
!$ negated siblings
>~L labeled secondary edge In contrast to TIGERSearch, the left operand is the parent node of the edge. This is more intuitive and makes the secondary edge relation more similar to the dominance relation. Example: S >SB NP and S >~SB NP behave in the same way.
>~ secondary edge
!>~L negated labelled secondary edge
!>~ negated secondary edge
= equality Does not exist in TIGERSearch; useful in negation to search for the nonexistence of certain node types, i.e. !(#x:NP = #x), which returns sentences without NPs
!=  inequality Does not exist in TIGERSearch

Parts of this table are taken from the TIGERSearch documentation.

Predicates

Predicate names start with an underscore.

Predicate Explanation
_arity(#x,n) matches nodes with arity (number of children) = n
_discont(#x) matches discontinuous nodes
_cont(#x) matches continuous nodes
_term(#x) matches terminal nodes
_nonterm(#x) matches nonterminal nodes
_position(#x,n) the surface token position of a node (in dependency graphs).

Abbreviated syntax

Expression Abbreviation Notes
[cat="NP"] NP node label expression
[cat="NP.*"] NP* simple star wildcard at the end of node label expression; here: all labels starting with NP, e.g., NP, NP-SBJ, NP-POS etc.
[word="Sofie"] "Sofie" double-quoted: terminal nodes in LFG and phrase-structure treebanks; lexical nodes in dependency treebanks. Make sure you use neutral (vertical) double quotes (Unicode U+0022).
[atom="indicative"] 'indicative' single-quoted: abbreviated syntax for atomic f-structure values in LFG treebanks. Make sure you use neutral (vertical) single quotes (Unicode U+0027).
S > #x:NP & #x > NP S > NP > NP relations can be chained

Existential and universal quantification

Variables written with a ‘#’ and implicit node variables are existentially quantified, whereas variables written with a ‘%’ are universally quantified.
In a negated context (expressions enclosed in !( ... )), however, those node variables that are not bound (not mentioned) outside the negated context are interpreted as universally quantified:

#x:NP & !(#x > #s:S & #s > #n:NP)
means:
#x:NP & {forall s, n : #x !> #s:S | #s !> #n:NP}
or, in prose: for any pair s (with cat=S), n (with cat=NP), either x does not dominate s, or s does not dominate n.
This is equivalent to stating that the query matches all NP nodes x not dominating an S that dominates an NP.

Since the variable x is bound outside !(...) (namely in the conjunct #x:NP), it is existentially quantified.

It is important to note that such nodes s, n need not exist. The expression !(#x > #s) is not equivalent to #x !> #s.

By default, the universal quantifiers are in the scope of all existential quantifiers:

%n:NP > #p:PP
is interpreted as
exists p:PP: ( forall n:NP: n > p )

or, in prose, there is a PP that is directly dominated by all PPs.

The default scoping rule can be overridden by explicitly stating the intended scoping order in prenex form (all quantifiers at the beginning):

(%n #p): %n:NP > #p:PP
is interpreted as
forall n:NP: ( exists p:PP: n > p )
or, in prose, every NP directly dominates a PP.

Some example queries

These queries can be run in the Tiger treebank (deu-tiger-con).

Query Explanation
"sehen" Q1. Find sentences that include the word ‘sehen’
!(#w:"sehen" = #w) Q2. Find sentences that do not include the word ‘sehen’.
NP >@r [pos="NN"] Q3. Find noun phrases whose rightmost child is a noun.
#vp:VP >* #v:[pos="VVFIN"] & #vp >* #np:NP & #vp >* #pp:PP & #np >@l #npl & #v . #npl & #np >@r #npr & #pp >@l #ppl & #npr . #ppl Q4. Find verb phrases that contain a verb immediately followed by a noun phrase that is immediately followed by a prepositional phrase.
#c >* #n:NP & #c >* #v:VP & !(#f >* #n & #f >* #v & #c >* #f) & #n !>* #v & #v !>* #n & #n .* #v Q5. Find the first common ancestor of sequences of a noun phrase followed by a verb phrase.
>SB #p1:NP >RC #y >* #p2 >* [pos="CARD"] & #y >@r #v:[pos="VVFIN"] & !(#p2 .* PP .* #v) Q8. Find all NPs which are a subject, inside of which there is a relative clause whose parent is the NP. Inside the relative clause, there must be a phrase p2, inside of which there must be a word which is a cardinal. At the end of the relative clause must be a finite verb whose parent is the same as that of p2. No PP may intervene between p2 and the verb.

The queries Q1-Q5 and Q8 are taken from Ulrik Petersen: Evaluating corpus query systems on functionality and speed: TIGERSearch and Emdros (and slightly changed to fit the syntax of INESS Search). These queries are adapted from Catherine Lai, Steven Bird: Querying and updating treebanks: A critical survey and requirements analysis (2004) and serve as a set of standard queries that treebank query systems should be able to handle.

Example searches for dependency treebanks:

Query Explanation
>pred #x & !(#y . #x) Find a sentence-initial verb.
A matching node x is a node that is the left value in a dependency relation,
such that there is no node y preceding x.
>pred #x & _position(#x,1) Find a sentence-initial verb, alternative formulation:
x is in initial position

LFG Search: Search in LFG treebanks

C-structures

Non-terminal nodes are coded as cat attributes;

Terminal nodes are coded as value attributes.

In addition, lemma forms and morphological features are indexed. They can be queried using the attributes lemma and morph. The morph attribute is set-valued, values to be matched can be given as a boolean combination of atomic features, e.g.: [lemma="bok" & morph=("Noun" (!"Gen" | "Indef"))]. (See also above.)

There is an abbreviated syntax for c-structure nodes: non-terminal nodes can be written as the bare node label string (without quotes, e.g., NP), whereas terminal nodes can be given as a double-quoted string (e.g., "Buch"). (Single-quoted strings are reserved for atomic f-structure values.)

Example query finding all c-structures where an NP-node dominates the lexical node 'Buch': [cat='NP'] >* [word='Buch'].

Using abbreviated syntax, this can be formulated as: NP >* "Buch".

In addition to the operators from the original TIGERSearch query language, there is a “rule” operator -> which can be used to search for node – node-children – configurations, e.g.: NP -> ( N PP ) (all NPs that have exactly an N and a PP child node, in that order). The parentheses are optional. By using node variables, you can query for arbitrarily complex configurations, e.g.: NP -> ( N #x:PP ) & #x -> ( P NP ). Child nodes can be left unspecified by using a dot, e.g.: NP -> ( N . ). The same effect could be achieved by using an unspecified node variable: NP -> ( N #x ), but using the dot is much more efficient. You can use Kleene star and Kleene plus for an arbitrary number of unspecified child nodes: e.g., NP -> .* N .+. (Matching configurations would include NP -> N PP and NP -> AP AP N CPrel, but not NP -> AP N.)

F-structures

Atomic values are coded as atom attributes; (sub-)f-structures correspond to node variables. Features of f-structures are coded as labeled edges: the query #x >SUBJ #y matches all f-structure pairs #x, #y such that #y is the value of the SUBJ feature of #x. Set membership is denoted by >$.

There is also an abbreviated syntax for F-structure atomic values, they can be given as a single-quoted string (e.g., 'past')

Cyclic f-structures: #x >* #x.

Structure sharing: #x >TOPIC #y & #x >OBJ #y; #x >TOPIC #y & #x !>SUBJ #y.

It is advisable not to use unlabeled precedence operators (>, >*) in f-structure queries unless their left and right operands are operands of labeled precedences (i.e. >SUBJ) in addition, since they will be very inefficient. For example, don't try #x > #y:"dat", but rather be more specific, like #x >CASE #y:"dat".

Projections

Operator Explanation Notes
#n >> #f Projection c-structure node n projects to f-structure node f
#n >><< #m Projective equivalence c-structure nodes n and m are in the same domain (project to the same f-structure(s))

Projections from c-structure nodes to sub-f-structures can be queried with the projection relation operator >>: You can for example find all PP nodes that project to an f-structure with an OBJ attribute by PP >> #x & #x >OBJ #y.

Regular expressions over f-structure attributes

There is limited support for regular expressions over f-structure features and the projection operator, which can be used to conveniently express path relations between f-structures.

Path syntax Operator Example Explanation
| (or) #x >( SUBJ | OBJ ) #y #y is SUBJ or OBJ feature value of #x
& (and) #x >( TOPIC & OBJ ) #y #y is TOPIC and OBJ value of #x, structure sharing
(space) #x >( COMP TNS-ASP TENSE ) 'past' There is a path from #x along COMP, TNS-ASP, TENSE to the atomic value ''past''
>> (projection) IP >>( OBJ PRED ) 'Kari' A c-structure IP node projects to an f-structure whose OBJ’s PRED value is Kari.
$ (set element) IP >>( SUBJ $ PRED ) 'Kari' A c-structure IP node projects to an f-structure whose SUBJ contains a set element whose PRED value is Kari.
(combinations) #x >(TOPIC & (SUBJ | XCOMP SUBJ)) #y #y is TOPIC of #x and either SUBJ of #x or XCOMP SUBJ of #x

C-command and Extended Heads

For C-structure nodes, the relations C-command and Extended Head have been implemented. Those relations can be used to test for endocentricity in c-structures that adhere to Bresnan’s X-bar principles. (See Bresnan, Lexical Functional Syntax for an explanation of these concepts.)

Operator Explanation Notes
#n >c> #c n c-commands c
#n >h> #c n is extended head of c

Some example queries for LFG treebanks

Example Explanation
#x >h> #y:S* & !(#y > #z >><< #y) x is a proper extended head of y, that is, x is extended head of y according to Bresnan’s definition, and y dominates no node in its functional domain
I' >>(TOPIC & ADJUNCT $) an I' projecting to an f-structure whose TOPIC is an ADJUNCT member

HPSG Search: Search in HPSG treebanks

Querying Head daughters

Operator Explanation Notes
#x >hd> #y y is head daughter of x
#x >hd>* #y the transitive closure: there is a path from x to y along head daughters

Query result aggregation and tabular display

On the Query page, query results are aggregated according to selected (existentially quantified) node variables and features and displayed in tabular form.

The variables that are selected for display are exactly those variables that are explicitly mentioned in the query expression and do not contain an underscore in their name. This means that none of the variables that are implicitly generated when a query expression is expanded are displayed, since they all contain an underscore.

For a given selected variable, there is one table row for every feature that is selected for that variable. A feature is selected if it is explicitly mentioned in the query expression, and in the case no feature is mentioned a default feature is chosen, depending on the variable type. The default features are:

  • word for dependency and terminal constituency or hpsg nodes
  • value for terminal c-structure nodes
  • cat for non-terminal constituency, c-structure or hpsg nodes
  • atom for f-structure nodes (which results in empty values for non-atomic f-structure nodes)

Features that are necessary to formulate query restrictions, but whose values should not show up in the table must be written with an extra underscore (_lemma below). Vice versa, unrestricted features whose values should appear in the table can be included in the node expression as bare feature names (value and morph below).

Every distinct combination of feature values for the selected variable features makes up one table row; in addition, the number of sentences having those feature values is recorded in the row.

As an example, let us take the query

NP > #n:N* > #w:[value & _lemma='filosofi' & morph] 

In this query, two variables are selected for display (#n and #w), and their selected features are cat (default feature) for #n and value and morph for #w. The variable #n_1, implicitly generated for NP, is not displayed. The resulting table might look like this:

This table can be sorted according to sentence count and the selected features, by clicking on the corresponding column headings. It can be downloaded as a tab-separated values file.

You can inspect the matching sentences for a given feature value combination by clicking on the corresponding row; this brings you to a list of the first 100 matching sentences.


Design & implementation: Paul Meurer, Uni Research Computing, 2018