Overview of MWE annotation in treebanks
Working Group 4 (WG 4) in PARSEME is creating an overview of existing annotation of multiword expressions (MWEs) in treebanks. The table below lists the treebanks that are currently documented or in the process of being documented. Some cells in the table are clickable, leading to wiki pages that provide detailed information about the treebanks and the MWE annotations that they provide. The table has been updated in accordance with the decisions made at the WG 4 sessions at the Frankfurt meeting in September 2014. (The old table is still available here if you need to consult it.)
If you would like to contribute to this WG 4 project by providing MWE information about a treebank that is not already listed in the table, please contact the working group leader Victoria Rosén.
| Treebank | Language | Annotation type | Nominal MWEs: Multiword named entities | Nominal MWEs: NN compounds | Nominal MWEs: Other nominal MWEs | Verbal MWEs: Phrasal verbs | Verbal MWEs: Light verb constructions | Verbal MWEs: VP idioms | Verbal MWEs: Other verbal MWEs | Prepositional MWEs | Adjectival MWEs | MWEs of other categories | Proverbs |
| The Estonian Dependency Treebank | Estonian | dep | NO | N/A | NO | YES | NO | NO | NO | NO | NO | NO | NO |
| The Latvian Treebank | Latvian | dep | YES | YES | NO | N/A | NO | NO | NO | NO | YES | YES | YES |
| META-NORD Sofie Swedish Treebank | Swedish | dep | YES | N/A | NO | NO | NO | NO | NO | NO | NO | NO | NO |
| The Prague Dependency Treebank | Czech | dep | YES | YES | YES | N/A | YES | YES | N/A | COMP | YES | YES | YES |
| RoRefTrees | Romanian | dep | YES | NO | NO | NO | YES | YES | YES | YES | YES | YES | NO |
| The ssj500k Dependency Treebank | Slovene | dep | YES | NO | NO | NO | NO | NO | NO | NO | NO | NO | NO |
| The Szeged Dependency Treebank | Hungarian | dep | YES | NO | NO | YES | YES | NO | NO | N/A | YES | YES | NO |
| The Turin University Treebank | Italian | dep | YES | NO | NO | YES | NO | NO | NO | YES | YES | YES | NO |
| IMST & IWT Turkish Treebanks | Turkish | dep | YES | YES | YES | NO | YES | YES | YES | NO | YES | NO | YES |
| Universal Dependencies Treebanks | many languages | dep | YES | YES | YES | YES | YES | NO | NO | YES | YES | YES | NO |
| The PENN Treebank | English | const | YES | YES | NO | YES | NO | NO | NO | NO | NO | YES | NO |
| The National Corpus of Polish | Polish | const | YES | NO | NO | NO | NO | NO | NO | YES | NO | YES | NO |
| SQUOIA Spanish | Spanish | const | YES | NO | NO | YES | YES | YES | NO | YES | NO | YES | NO |
| The TIGER Treebank | German | const | YES | NO | NO | YES | YES | NO | NO | NO | NO | YES | NO |
| UZH Alpine German | German | const | YES | NO | TBC | YES | YES | YES | TBC | TBC | YES | TBC | NO |
| The Lassy Small Treebank | Dutch | dep/const | YES | YES | YES | YES | COMP | COMP | NO | YES | NO | NO | NO |
| The Eukalyptus Treebank of Written Swedish | Swedish | dep/const | TBC | YES | TBC | YES | YES | YES | YES | YES | YES | YES | YES |
| BulTreeBank | Bulgarian | dep, const | YES | N/A | YES | N/A | COMP | COMP | NO | YES | YES | YES | COMP |
| The French Treebank | French | dep, const | YES | YES | YES | N/A | NO | YES | NO | YES | YES | YES | NO |
| The Cintil Portuguese Treebanks | Portuguese | dep, const (HPSG) | YES | COMP | N/A | N/A | COMP | N/A | N/A | YES | N/A | YES | COMP |
| DeepBank | English | HPSG | YES | YES | YES | YES | NO | NO | NO | NO | NO | NO | NO |
| NorGramBank | Norwegian | LFG | YES | N/A | YES | YES | NO | YES | NO | YES | YES | YES | NO |
1 The wiki system
These pages are written in a Wikimedia-like framework featuring easy hyperlinking and a simple markup language. This framework is similar to Redmine's hyperlinking and text formatting facility; the markup syntax is mostly modeled on Textile. There are, however, some deviations from the Textile syntax: not all of the syntax is supported, and there are a couple of extensions.
1.1 Editing rights
In order to edit the PARSEME MWE pages you first need to sign in (at the top of the page). If you have an identity provider (IdP) that is a member of eduGAIN, you will most likely be able to log in via your IdP after clicking on eduGAIN. You can also log in with a CLARIN account. Otherwise, you will have to register an OpenIdP account.
After signing in for the first time you should send an e-mail to Victoria Rosén so that your editing rights can be set. You will receive an e-mail when you can start editing.
Please use Chrome or Safari when editing.
After signing in, do the following in order:
- choose "Edit is on" in the upper right corner of the screen,
- click on "Links" in the left column of the page (underneath the INESS logo) and then click on "MWEs in PARSEME",
- click on "Edit" in the upper left corner of the page.
1.2 Saving
- When adding new information, remember to save regularly! Click on "Save" before navigating away from an edited page. After saving you can resume editing by clicking on "Edit" again. Don't use the back arrow.
- A backup is taken of the whole wiki every night. In addition, whenever you save a page, the previous version is also kept so that it is easy to return to it if necessary. If you need to go back to a previous version, use the drop-down menu at the top of the page.
2 Editing the table
Each treebank has one row in the table. Please make sure that you do not edit anything except for your own row.
2.1 The treebank column
The name of your treebank should appear in the first cell on the left of the row. Edit this cell in the table if you need to change the name. Do that by clicking on Edit at the top of the page.
For example, if you need to change the name of your treebank from Example Treebank to Example Dependency Treebank, you will change this:
[[exmpl-descr|Example Treebank]]
to this:
[[exmpl-descr|Example Dependency Treebank]]
The official name of the treebank should be used if there is one (example: The Danish Copenhagen Dependency Treebank). If the treebank does not have an official name, use a description that uniquely identifies it by language, formalism, author/institution, or the like. Since the language also appears in the next cell to the right in the table, it doesn't necessarily need to be mentioned in this description.
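The renaming step above changes only the label after the pipe; the page name before the pipe stays the same. A minimal Python sketch of that edit, assuming links have the [[page|label]] shape shown above (the function name rename_link_label is hypothetical and not part of the wiki):

```python
import re

# Matches a wiki link of the form [[page|label]], tolerating spaces and
# line breaks around the parts (an assumption based on the example above).
LINK_RE = re.compile(r"\[\[\s*([^\]|]+?)\s*\|\s*([^\]]+?)\s*\]\]")


def rename_link_label(markup: str, new_label: str) -> str:
    """Return markup with the label of each [[page|label]] link replaced."""
    return LINK_RE.sub(lambda m: f"[[{m.group(1)}|{new_label}]]", markup)


print(rename_link_label("[[exmpl-descr|Example Treebank]]",
                        "Example Dependency Treebank"))
```

The page name (here exmpl-descr) is left untouched, so the cell keeps pointing at the same description page.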
2.2 The MWE columns
By default, cells in the table are filled in with TBC, for ‘to be completed’. The markup around the TBC in editing mode (for example [[mytreebank-mwetype|TBC]]) provides a link to a separate MWE description page where the MWE may be illustrated (see below under 4). After you are finished filling in the information on the MWE description page for a certain MWE, you should replace TBC in the corresponding cell in the main table with YES.
Not all treebanks have annotations for all MWE types in the table. The reason could be that the MWE type does not occur in the language. If that is the case, this cell in the table should be filled in with N/A for "not applicable". You can do this by replacing TBC in the relevant cell, and all the material within the double square brackets surrounding TBC, including the brackets themselves, with N/A.
Change: [[mytreebank-mwetype|TBC]]
to: N/A
Sometimes it will be the case that the MWE type does exist in a language, but the treebank lacks a special annotation for it. Then this cell should be filled in with NO. Do this by replacing TBC in the relevant cell, and all the material within the double square brackets surrounding TBC, including the brackets themselves, with NO.
Change: [[mytreebank-mwetype|TBC]]
to: NO
In cases where the language has a certain type of MWE but the treebank does not include a special annotation for it, the TBC in the table may be changed to COMP (for compositional analysis). You can then illustrate how the construction is analyzed in your treebank and explain why the construction is not treated as a MWE under About the analysis.
The four cell labels are thus:
N/A: the MWE type does not occur in the language
NO: the MWE type occurs in the language but the treebank lacks annotation for it
YES: the MWE type is annotated in the treebank and the MWE page shows how it is analyzed as a MWE
COMP: the MWE type is not annotated, but the MWE page shows how it is analyzed compositionally
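The editing rules for the cell labels can be summarized in a short Python sketch. It assumes that COMP, like YES, keeps the link to the MWE page (since that page shows the compositional analysis), while N/A and NO replace the whole bracketed markup. The names cell_markup, LINKED_LABELS and PLAIN_LABELS are hypothetical and only serve the illustration:

```python
# Labels that keep the link to the MWE description page versus labels that
# replace the whole [[...]] markup. TBC is the default placeholder.
# Assumption: COMP keeps the link in the same way as YES.
LINKED_LABELS = {"TBC", "YES", "COMP"}
PLAIN_LABELS = {"N/A", "NO"}


def cell_markup(page: str, label: str) -> str:
    """Return the text to put in a table cell for the given label."""
    if label in LINKED_LABELS:
        return f"[[{page}|{label}]]"
    if label in PLAIN_LABELS:
        return label
    raise ValueError(f"unknown cell label: {label!r}")


for label in ("TBC", "YES", "COMP", "N/A", "NO"):
    print(label, "->", cell_markup("mytreebank-mwetype", label))
```

Rejecting unknown labels is a design choice of the sketch, intended to catch typos such as "Yes" or "NA" before they end up in the table.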
3 The treebank description page
Each treebank has a description page, which is reached by clicking on the first cell of your row. This page has boldfaced headings like ‘Name’, ‘Size’, ‘Construction method’, etc. When you go to edit mode, you will see that there are comments under each heading with suggestions for what information should be written there. Provide your information under the headings. You can write it either above or below the comments; the position should not affect the formatting. It is a good idea to leave the comments there in case you want to edit again later.
4 The MWE pages
For each type of MWE that is annotated in your treebank, there should be a MWE page that illustrates the analysis. Each "YES" in the table is linked to a MWE page. The page is already filled out with a template with instructions on how to complete it.
The first line at the top of the page gives the MWE type and the name of the treebank. It is important that you copy the name of the MWE type exactly as it is stated at the top of the column in the main table. You may also add an extra title line underneath the main title to specify what type of MWE you are providing. For instance, you may have an extra line with "Complex numeral expressions" under the main title "Named entities".
If you provide examples of several subtypes on one MWE page, you should number the lines with the subtypes. For an example, see Phrasal verbs in NorGramBank.
You should illustrate how each MWE type is annotated by choosing an example from your treebank. Try to choose a short example. If that is not possible, consider showing only the part of the sentence that contains the MWE.
In order to keep the table a manageable size, we have limited the number of columns. The idea is that several different subtypes may be entered on the same MWE page. For instance, on the NorGramBank Phrasal verbs page, there are three subtypes: Particle verbs, Particle verbs with selected prepositions, and Verbs with selected prepositions. If you want to enter several examples, you can copy the entire template one or more times on the same page. Click on the YES in the NorGramBank Phrasal verbs cell to see an example of a finished page.
Each MWE (sub)type on the MWE page should contain four components: 1. Example, 2. Analysis, 3. About the analysis, and 4. Searching for the MWE type.
4.1 Example
- The sentence or phrase used in your example should be entered with correct glossing and an idiomatic translation. Glossing should follow the Leipzig glossing rules as far as possible.
- Put the MWE in the example in boldface by surrounding each word that is part of the MWE by asterisks. For example, writing *kick* *the* *bucket* will result in kick the bucket in boldface.
- If the correspondence between the words in the sentence and the glosses is not one to one, graphical words may be grouped together by using curly brackets (e.g. {ad hoc}). Words may be bracketed in the example sentence or in the glosses as needed.
- Do not put spaces in between words and punctuation marks in the example. Punctuation marks should not be repeated in the glossing.
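The boldface and grouping conventions in the list above can be produced mechanically. The following is a minimal Python sketch; the helper names bold_mwe and group_words are hypothetical, and it assumes the MWE is given as a list of words:

```python
def bold_mwe(words):
    """Wrap each word of the MWE in asterisks, as in *kick* *the* *bucket*."""
    return " ".join(f"*{word}*" for word in words)


def group_words(words):
    """Group several graphical words into one unit with curly brackets."""
    return "{" + " ".join(words) + "}"


print(bold_mwe(["kick", "the", "bucket"]))  # *kick* *the* *bucket*
print(group_words(["ad", "hoc"]))           # {ad hoc}
```

Each word gets its own pair of asterisks, rather than one pair around the whole expression, to match the instruction that every word of the MWE is surrounded separately.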
4.2 Analysis
- Here you should provide a graphics file to illustrate the analysis. First you must write the name of your file instead of "example.png" in "image:example.png". Then you can delete: "REPLACE "example.png" WITH THE NAME OF YOUR PICTURE FILE".
- To upload your picture you must click twice on "Upload file" (at the top of the page on which you want the picture to be displayed). This is not a double click as when opening a file, but two clicks about a second apart. Then you should choose the file you want to upload from the window that appears. The "Upload file" link is not available when you are in the editing version of the page, and it is not visible anymore when you have saved your changes. To get the "Upload file" link back again you can start over from "Links" on the left.
- There is at present no scaling mechanism for changing the size of the picture once it is on the wiki page. If the picture displays too small or too large, either edit your picture to an appropriate size or make your screenshot at an appropriate resolution. Alternatively you can show only the relevant part of the graphics.
- It is possible to upload more than one picture if necessary. Make sure to give the images distinctive names.
4.3 About the analysis
- Since the analysis in your treebank may not be easy to understand for others, you should include a prose description of it.
- Include information on whether the MWE is fixed (a word with spaces) or flexible.
- Be explicit and pedagogical. Remember that not everyone will be familiar with the specific formalism used in your treebank.
- In addition to the prose description you provide, you may want to refer to existing documentation.
4.4 Searching for the MWE type
- If it is possible to search for the MWE type in your treebank, you should include a search expression.
- Provide a prose description that explains what the search expression does.
- Provide information about the search facility you are using, and provide a link to documentation.