INESS Search Walkthrough


This walkthrough provides a gentle introduction to using INESS Search, without presupposing knowledge of regular expressions or first-order predicate logic. A more technical overview of INESS Search is available here.

We recommend that you follow these instructions step by step.

Getting started: accessing treebanks

You can search in treebanks without signing in, but you will have access to many more treebanks if you do. Information about signing in may be found here. The following examples assume that you are signed in.

The first step in searching is to choose one or more treebanks that you want to search in. Here we will first search in English treebanks. Open Treebank Selection (in the left margin menu) in a new window or tab. Then you can switch between reading the instructions on this page and performing the searches in the other window/tab. It's important to perform all the searches and examine the results to become familiar with INESS Search.

Under Languages in the new window/tab, click on English. (If the language you want to choose is "grayed out", use the Reset button.) When you scroll down the page, you will see the English treebanks listed (24 as of October 9, 2018). If not all of the boxes to the left of the treebank names are ticked, click on all under Selected.

View one of the treebanks in the list by clicking on its name. For this example, we will choose eng-deepbank (DeepBank, an HPSG treebank of 1989 Wall Street Journal text). Note: The first time you view a treebank you will be asked to accept a license, if the treebank has one. So far eng-deepbank does not have a license.

Clicking on eng-deepbank brings us to the Sentence Overview page for that treebank. By default, this page shows the first 50 sentences in the treebank text. You can only view one treebank at a time, but you can search in several treebanks at once. In the box labeled Search in: at the upper right, you see the names of 13 treebanks. The search will therefore cover 13 treebanks, even though all 24 English treebanks were ticked on the Treebank Selection page. This is because the other treebanks are of different types that are not search compatible.

Searches can be made from three different pages (available in the left margin menu):

  • Sentence Overview
  • Sentence
  • Query

You can use exactly the same search expressions on all these pages, but the pages differ in the way they display the results.

Searching from the Sentence Overview page

You are now on the Sentence Overview page for eng-deepbank. Let’s say we want to search for the word presently. In the query field (the box under Query:, where it says Query expression in gray letters), write:

"presently"

Note that the word must be enclosed in double quotes. Then click on the button Run query.

Under the query field, you will see the total number of matching sentences in all the treebanks that were searched, for instance:

6 matching sentence(s), …

Underneath that, the names of the treebanks that have matching sentences are displayed:

Matches in: eng-deepbank (1), eng-redwoods (1), eng-ud-1.2-dep (2), eng-ud-dep (2)

The name of the treebank that the search was made from, eng-deepbank, is boldfaced, and the number 1 is in parentheses after the name, showing that there is one occurrence of presently in this treebank.
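If you copy the Matches in: line out of the page, the per-treebank counts can be tallied with a few lines of Python. This is only an illustrative sketch that works on the copied text; the function name parse_matches is made up for this example and is not part of INESS.

```python
import re

def parse_matches(line):
    """Map each treebank name in a 'Matches in:' line to its hit count."""
    # A treebank name is followed by its count in parentheses,
    # e.g. "eng-deepbank (1)".
    pairs = re.findall(r"([\w.-]+)\s*\((\d+)\)", line)
    return {name: int(count) for name, count in pairs}

# Text copied from the example result above.
summary = "Matches in: eng-deepbank (1), eng-redwoods (1), eng-ud-1.2-dep (2), eng-ud-dep (2)"

counts = parse_matches(summary)
total = sum(counts.values())
print(counts)
print(total)
```

The per-treebank counts add up to the total number of matching sentences reported under the query field (6 in this example).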

The list of displayed sentences in the currently viewed treebank has now shrunk to the sentences that match the query. For this example, there is one hit in eng-deepbank, the treebank we are in, so only one sentence is displayed in the list.

Clicking on the sentence Lloyd’s presently sells … brings us to the Sentence page, where the analysis is shown. By scrolling down we can see that what was searched for is highlighted in red.

Here on the Sentence page, we see that the information about the matches is displayed slightly differently than it was on the Sentence Overview page. In addition to the number of matches, we also see, for the treebank we are in, the name of the text (wsj13a.1) and the IDs of matching sentences (#20824). In this case there is only one ID number since there is only one hit.

Matches in: eng-deepbank (1): wsj13a.1: #20824; eng-redwoods (1) ; eng-ud-1.2-dep (2) ; eng-ud-dep (2)

Directly under the list of matches we see a yellow box with the matching sentence displayed in boldface in its textual context.

To see matching sentences in a different English treebank, we can for instance click on eng-ud-1.2-dep (Universal Dependencies English Web Treebank, version 1.2). This brings up the Sentence page for the first of two matches in this treebank. Now after the boldfaced treebank name, we see two sentence IDs:

eng-ud-1.2-dep (2): train: #9807, #10672

We are viewing the first of these, sentence #9807. This is shown in two ways: the ID number is displayed in green, and this number is at the top of the yellow box where the sentence (in boldface) is displayed in its context.

Click on #10672, and that sentence will be displayed, again with the match highlighted in red.

Note: this is a dependency treebank, and there are two alternative visualizations. The default visualization is a dependency tree. This kind of tree does not preserve the linear order of the words in the sentence. Tick the option Linear view beneath the yellow box to see the other visualization, where the word order is preserved.

Searching from the Sentence page

Searches can also be performed from the Sentence page. Delete the ending -ly in the query field so that it says "present" and click on Run query.

Under the query field we see that there are 128 matches in four treebanks. But we are still viewing the same sentence as before running the query. In order to view the results of the new query, click on a sentence in the new list of matches, for example on #17, which is the first match in eng-ud-1.2-dep. You can then examine the analyses of the sentences one by one either by clicking on their ID numbers or by using the Next match button.

Searching from the Query page

Until now we have searched for words by enclosing them in double quotes. Writing "present" as a query expression is actually shorthand for writing:

[word="present"]

We can search for the lemma present by using the search expression:

[lemma="present"]

Not all treebanks encode the lemma that each word belongs to, so to do this search we must choose a treebank that has lemma encoding. The treebank we are currently viewing, eng-ud-1.2-dep, has lemmas. In order to change the treebanks being searched to only this treebank, do the following:

(1) click on Search in in the field to the right of the query field,
(2) click on Select none,
(3) choose eng-ud-1.2-dep by ticking the box next to the treebank name,
(4) click on Apply.

Now click on Query in the left margin menu, write [lemma="present"] in the query field, and click on Run query.

We see that there are 30 matches. We can examine them by clicking on the number underneath Count, which brings up an overview of the matching sentences.

Usually we need to use explicit variables in our search expressions, either to make certain things visible in the search results or to express relations between different parts of the search expression. We can add a variable to our search expression like this:

#x:[lemma="present"]

This expression means:

"There is a node x that has the lemma present."

The results are shown as a table with one row. Now add & pos to the search expression:

#x:[lemma="present" & pos]

This expression means:

"There is a node x that has the lemma present. Show also the value of the attribute pos (part of speech)."

The resulting table is sorted by frequency: of the 30 occurrences of the lemma present, 17 are verbs, 10 are adjectives, and 3 are nouns. Clicking on the noun row brings up an overview of only the noun occurrences.
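The grouping behind such a table can be pictured as a simple frequency count over the pos value of each match. The sketch below uses a toy list in place of real search results: the counts are the ones reported above, while the tag names ADJ and NOUN are assumed Universal Dependencies tags (only VERB appears in this walkthrough).

```python
from collections import Counter

# Toy stand-in for the 30 matching nodes: only the pos value of each
# match is kept. Counts are taken from the walkthrough; ADJ and NOUN
# are assumed tag names.
matches = ["VERB"] * 17 + ["ADJ"] * 10 + ["NOUN"] * 3

# Count each distinct value and sort by descending frequency,
# mirroring the one-row-per-value result table.
table = Counter(matches).most_common()
for pos, count in table:
    print(f"{pos}\t{count}")
```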

We can add another attribute:

#x:[lemma="present" & pos & word]

This expression means:

"There is a node x that has the lemma present. Show also the value of the attribute pos (part of speech) and the attribute word."

The resulting table is now sorted by the frequency of the various combinations of the values for pos and word.

Since the value of lemma is always the same when we are searching for one lemma, we don't need that value displayed in the table. The value of the lemma can be suppressed in the search results by writing an underscore before the attribute:

#x:[_lemma="present" & pos & word]

We can further refine the search expression by specifying certain values for the other attributes. For example, we can search for only the wordform present:

#x:[_lemma="present" & pos & word="present"]

Or we can search for only the verb present:

#x:[_lemma="present" & pos="VERB" & word]
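If you generate many similar queries, it can help to build the expression strings programmatically and paste them into the query field. The following is a minimal sketch; node_query is a hypothetical helper written for this walkthrough, not an INESS function, and it does no escaping of quotes inside values.

```python
def node_query(attrs, var=None):
    """Build a node description such as #x:[_lemma="present" & pos & word].

    attrs is a list of (name, value, show) tuples:
      value=None  -> the attribute is listed without a value
      show=False  -> the name gets the underscore prefix that hides it
                     in the result table
    """
    parts = []
    for name, value, show in attrs:
        term = name if show else "_" + name
        if value is not None:
            term += f'="{value}"'
        parts.append(term)
    body = "[" + " & ".join(parts) + "]"
    # Prefix the variable (e.g. #x:) only when one is requested.
    return f"#{var}:{body}" if var else body

verb_query = node_query(
    [("lemma", "present", False), ("pos", "VERB", True), ("word", None, True)],
    var="x",
)
word_query = node_query([("word", "present", True)])
print(verb_query)
print(word_query)
```

The two strings produced here are the verb query shown above and the long form of the shorthand "present".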

Formulating search expressions

Suppose you want to find out what occurs as the object of the verb think in eng-ud-1.2-dep.

In order to formulate search expressions, you need to know something about the annotations in the particular treebank you want to search in and you need to know the operators for linear precedence, dominance, etc.

You can find information about which attributes are indexed for search by consulting Treebank Details in the left margin menu. For eng-ud-1.2-dep, we find the following attributes:

lemma (17 783) · word (23 016) · morph (68) · child-edge (46) · edge (46) · pos (17) · features (50)

Clicking on an attribute will reveal its values. The UD (Universal Dependencies) treebanks use edge for dependency relations. Under edge we find dobj, which is the label for the direct object (see the UD website for more details about the annotation).

The operator for the dependency relation (or dominance) is the > symbol, which must be placed immediately in front of the edge label, i.e. >dobj. The whole expression becomes:

"think" >dobj #x

This expression introduces a variable #x which matches anything that stands in the dobj relation to think. The result is displayed as a table sorted by the frequency of each direct object.
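To make the meaning of this expression concrete, here is a small Python sketch of the same idea over hand-made data. The sentences, token tuples, and the function dependents_of are all invented for illustration; they do not reflect how INESS stores or searches treebanks.

```python
from collections import Counter

# Each token is (id, word, head_id, edge). Invented toy sentences,
# not taken from any treebank; head_id 0 marks the root.
sentences = [
    [(1, "I", 2, "nsubj"), (2, "think", 0, "root"), (3, "it", 2, "dobj")],
    [(1, "They", 2, "nsubj"), (2, "think", 0, "root"), (3, "nothing", 2, "dobj")],
    [(1, "We", 2, "nsubj"), (2, "think", 0, "root"), (3, "it", 2, "dobj")],
    [(1, "You", 2, "nsubj"), (2, "read", 0, "root"), (3, "it", 2, "dobj")],
]

def dependents_of(sentences, head_word, edge="dobj"):
    """Count the words attached to head_word by the given edge label."""
    counts = Counter()
    for tokens in sentences:
        words_by_id = {tid: word for tid, word, _, _ in tokens}
        for _, word, head, label in tokens:
            # Keep the token if its edge label matches and its head
            # is exactly the requested word form.
            if label == edge and words_by_id.get(head) == head_word:
                counts[word] += 1
    return counts.most_common()

objects = dependents_of(sentences, "think")
print(objects)
```

As in the query, the object of read in the last toy sentence is not counted, because its head is not the word think.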

More detailed information about how to formulate search expressions may be found in the INESS Search documentation.

An example-based introduction (in Norwegian) for searching in NorGramBank may be found here. This introduction follows the chapters and examples in the Norwegian reference grammar Norsk referansegrammatikk.

If you need help in formulating search expressions for any treebank or any other assistance in using INESS Search, please do not hesitate to contact our help desk at iness@uib.no. We will get back to you as soon as possible.


Design & implementation: Paul Meurer, CLARINO Bergen Centre, 2020