PARSEME guidelines for MWE annotation in treebanks
This document summarizes the recommendations of PARSEME's Working Group 4 for the annotation of MWEs in treebanks. The recommendations are based on a detailed study of actual MWE treebank annotations (Rosén et al. 2015, Rosén et al. 2016; see also the Overview of MWE annotation in treebanks).
It is not possible to provide detailed guidelines for the annotation of MWEs in all treebanks, because there are many different methods of treebank construction and many different styles of annotation. In addition, treebank annotations may be based on different linguistic theories. Each treebank will therefore need its own specific annotation guidelines. Here we provide only general guidelines, formulated as principles for MWE annotation that are applicable to all types of treebanks.
MWEs should be annotated as such, so that treebank queries can directly target them.
This is a general principle that aims at improving the ease with which MWEs can be identified in treebanks, without the need to be detected by heuristics. The recursive case of this principle is that MWEs which occur as part of other MWEs should also be annotated as such, so that embeddings of MWEs (for example in the complex name Johann Wolfgang Goethe-Universität Frankfurt am Main) can be discovered.
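The recursive case can be sketched as follows. This is a minimal illustration, assuming a hypothetical span-based annotation in which each MWE is stored as a half-open token range with a type label; the names `MweSpan` and `embedded_pairs` and the labels `organization` and `place` are invented for the example and do not come from any particular treebank.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class MweSpan:
    start: int  # index of the first token (inclusive)
    end: int    # index after the last token (exclusive)
    label: str  # hypothetical type label


tokens = ["Johann", "Wolfgang", "Goethe-Universität", "Frankfurt", "am", "Main"]

spans = [
    MweSpan(0, 6, "organization"),  # the whole complex name
    MweSpan(3, 6, "place"),         # embedded MWE: Frankfurt am Main
]


def embedded_pairs(spans: List[MweSpan]) -> List[Tuple[MweSpan, MweSpan]]:
    """Return (outer, inner) pairs where inner is contained in outer."""
    return [
        (outer, inner)
        for outer in spans
        for inner in spans
        if outer != inner
        and outer.start <= inner.start
        and inner.end <= outer.end
    ]


def text(span: MweSpan) -> str:
    """Surface string covered by a span."""
    return " ".join(tokens[span.start:span.end])


pairs = embedded_pairs(spans)
for outer, inner in pairs:
    print(f"{text(inner)!r} is embedded in {text(outer)!r}")
```

Because both the outer and the inner expression carry their own annotation, a query can retrieve either one without heuristics.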
The annotation of noncompositional MWEs should distinguish them from homonymous strings with a compositional analysis.
Ease of identification implies that MWEs should be distinguished from homonymous constructions which are compositional. For example, under the knife is an English idiom meaning 'undergoing surgery'. This idiom, illustrated in example (1), should be annotated in a way which distinguishes it from the compositional meaning in (2).
(1) The patient is under the knife.
(2) The napkin is under the knife.
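The difference between (1) and (2) can be made concrete with a small sketch. It assumes a hypothetical annotation layer in which an MWE is recorded as a canonical form plus the indices of its tokens; `Sentence`, `string_hits` and `mwe_hits` are illustrative names only. A plain string search matches both sentences, whereas a query on the annotation layer returns only the idiomatic one.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Sentence:
    tokens: List[str]
    # Each MWE annotation: (canonical form, indices of the component tokens)
    mwes: List[Tuple[str, Tuple[int, ...]]] = field(default_factory=list)


# (1): idiomatic reading, so the MWE is annotated
s1 = Sentence("The patient is under the knife .".split(),
              [("under the knife", (3, 4, 5))])
# (2): compositional reading, so no MWE annotation is added
s2 = Sentence("The napkin is under the knife .".split())

corpus = [s1, s2]


def string_hits(corpus: List[Sentence], phrase: str) -> List[Sentence]:
    """Heuristic search on the surface string; cannot tell (1) from (2)."""
    return [s for s in corpus if phrase in " ".join(s.tokens)]


def mwe_hits(corpus: List[Sentence], canonical: str) -> List[Sentence]:
    """Search on the MWE annotation layer; finds only annotated idioms."""
    return [s for s in corpus if any(c == canonical for c, _ in s.mwes)]


print(len(string_hits(corpus, "under the knife")))  # 2
print(len(mwe_hits(corpus, "under the knife")))     # 1
```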
Many MWEs have meanings that cannot be derived compositionally. These should, if possible, be represented at two levels: one level that reflects the idiomatic meaning, and one level that represents the internal syntactic structure, to the extent that it is syntactically regular.
This can be achieved in different ways depending on the grammar formalism. Two examples are shown here.
In some dependency/constituency treebanks, secondary edges can be used. The example above is from the Eukalyptus Treebank of Written Swedish (Adesam et al. 2015).
- The sentence node (S) contains a subject, head and object.
- The head dominates a multiword verb node (VBM), and secondary edges (labeled 'ME') are used to connect the remaining MWE parts to this node.
- Associated with the multiword node is a semantic identifier for the idiomatic sense.
In the LFG formalism, there are two levels of syntactic structure: c(onstituent)-structure and f(unctional)-structure. The figure above shows an example from the Norwegian treebank NorGramBank (Dyvik et al. 2016).
- The verbal idiom finne sted 'take place / occur' is represented as a combined predicate in the value of the PRED attribute in the f-structure on the right.
- The new predicate name 'finne#sted' is built by concatenating the verb predicate and the object predicate.
- This predicate has only the subject as a semantic argument; the object argument is outside the angled brackets, indicating that it is only a syntactic and not a semantic argument of the predicate.
- The c-structure represents the internal constituent structure of the MWE as shown on the left of the figure; it reflects the flexibility of the expression by representing each component of the MWE as a separate node.
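The construction of the combined predicate described in the list above can be sketched in a few lines. This is only an illustration: the function `combined_pred` is hypothetical, and the string layout (semantic arguments inside the angle brackets, purely syntactic arguments after them) is a simplified rendering of common LFG notation, not NorGramBank's actual serialization.

```python
from typing import Sequence


def combined_pred(verb: str, obj: str,
                  semantic_args: Sequence[str],
                  syntactic_only_args: Sequence[str]) -> str:
    """Build a PRED value for a verb + object idiom.

    The predicate name concatenates the verb and object predicates with '#'.
    Semantic arguments are listed inside the angle brackets; arguments that
    are only syntactic are listed after the closing bracket.
    """
    name = f"{verb}#{obj}"
    inside = ",".join(semantic_args)
    outside = ",".join(syntactic_only_args)
    return f"{name}<{inside}>{outside}"


pred = combined_pred("finne", "sted", ["SUBJ"], ["OBJ"])
print(pred)  # finne#sted<SUBJ>OBJ
```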
Individual MWEs should be searchable even if they are discontinuous or variable in form.
This principle will allow non-fixed MWEs to be identified irrespective of their surface forms and word orders. For instance, the morphological and word order variants of the particle verb shut down in examples (3) and (4) should be searchable with a single query.
(3) The company is shutting down the power plant.
(4) The company has shut the power plant down.
In order to fulfill this principle, some normalization is recommended, i.e. each MWE occurrence in a corpus should be associated with its canonical form so as to conflate different morphosyntactic variants of the same MWE. In the simplest case a canonical form is a MWE lemma, e.g. man servant for men servants. Linking to a lexicon or knowledge base of MWEs, e.g. DuELME (Gregoire 2010, Odijk 2013) or dictionary storage for pre-annotation (Bejček & Straňák 2010), should be considered. To the extent that a treebank is a parsed corpus, this should normally be achieved by having appropriate MWE entries in the lexicon used in parsing, as is the case in NorGramBank.
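A minimal sketch of such normalization, using examples (3) and (4): each occurrence is linked to a canonical form, so one query retrieves both the inflected and the discontinuous variant. The classes `MweOccurrence` and `Sentence` and the function `find` are hypothetical and stand in for whatever representation a given treebank uses.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass
class MweOccurrence:
    canonical: str              # canonical form (here: the MWE lemma)
    token_ids: Tuple[int, ...]  # positions of the component tokens


@dataclass
class Sentence:
    tokens: List[str]
    mwes: List[MweOccurrence]


corpus = [
    # (3): contiguous, inflected verb form
    Sentence("The company is shutting down the power plant .".split(),
             [MweOccurrence("shut down", (3, 4))]),
    # (4): discontinuous, particle after the object
    Sentence("The company has shut the power plant down .".split(),
             [MweOccurrence("shut down", (3, 7))]),
]


def find(corpus: List[Sentence],
         canonical: str) -> Iterator[Tuple[Sentence, List[str]]]:
    """Yield (sentence, surface components) for each occurrence of the MWE."""
    for sent in corpus:
        for occ in sent.mwes:
            if occ.canonical == canonical:
                yield sent, [sent.tokens[i] for i in occ.token_ids]


hits = [surface for _, surface in find(corpus, "shut down")]
print(hits)  # [['shutting', 'down'], ['shut', 'down']]
```

The query is written against the canonical form only; surface form and word order play no role in the match.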
Automatic lemmatization of MWEs is non-trivial in the general case, since components of a MWE lemma may not be lemmas themselves, as in to spill the beans but not to spill the bean. In highly inflected languages, automatic lemmatization of some MWE categories, such as person names, may be challenging (Piskorski et al. 2007); therefore assigning manually validated lemmas to named entities in a treebank may be an option (Savary et al. 2010).
It should be possible to search for various types of MWEs based on their characteristics.
This principle implies that, to the extent possible and depending on the MWE ontology, all MWEs belonging to certain types will be retrievable as a set, for instance, all fixed expressions, all particle verb constructions or all VP idioms. The different types need not be annotated at the same level of linguistic analysis. Some may be annotated at word level, such as fixed expressions (so-called words with spaces), while others are annotated at one or more levels of syntactic structure (such as c-structure and f-structure in LFG treebanks, or analytical and tectogrammatical structure in PDT-style treebanks).
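As a rough sketch of type-based retrieval, the following assumes that each annotated MWE carries a type label in addition to the level at which it is annotated. The type labels, the level names and the entry by and large (a commonly cited fixed expression) are illustrative assumptions, not a proposed ontology.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (canonical form, hypothetical type label, hypothetical annotation level)
annotations: List[Tuple[str, str, str]] = [
    ("by and large", "fixed", "word"),
    ("shut down", "particle_verb", "syntax"),
    ("spill the beans", "vp_idiom", "syntax"),
]


def by_type(annotations: List[Tuple[str, str, str]]) -> Dict[str, List[str]]:
    """Index canonical forms by MWE type, regardless of annotation level."""
    index: Dict[str, List[str]] = defaultdict(list)
    for canonical, mwe_type, _level in annotations:
        index[mwe_type].append(canonical)
    return index


index = by_type(annotations)
print(index["particle_verb"])  # ['shut down']
```

The index cuts across annotation levels: a type can be retrieved as a set even when its members are annotated at a different level from members of other types.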
The minimal idiomatic phrase should be annotated as a MWE.
The phrase come off with flying colors may appear to be an idiom. However, with flying colors can occur with many other verbs with the same idiomatic meaning, for example:
(5) He passed the exam with flying colors.
(6) The team won with flying colors.
(7) The bill passed the Senate with flying colors.
The suggested guideline to annotate the minimal phrase as a MWE is relevant for annotation of flat text as well as grammatical structures. This does not prevent the annotation of MWEs in which other MWEs are embedded.
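The reasoning behind examples (5) to (7) can be sketched as a crude check: if a candidate phrase keeps its idiomatic meaning under several different governing verbs, the verb is probably not part of the minimal MWE. The observations below are typed in by hand from the examples, and the function name is hypothetical. Whether the meaning really stays the same remains a judgment for the annotator, so this is an aid and not a decision procedure.

```python
from typing import List, Set, Tuple

# (governing verb lemma, phrase), from the candidate idiom and examples (5)-(7)
observations: List[Tuple[str, str]] = [
    ("come off", "with flying colors"),
    ("pass", "with flying colors"),   # (5)
    ("win", "with flying colors"),    # (6)
    ("pass", "with flying colors"),   # (7)
]


def governing_verbs(observations: List[Tuple[str, str]],
                    phrase: str) -> Set[str]:
    """Collect the distinct verbs observed with the given phrase."""
    return {verb for verb, p in observations if p == phrase}


verbs = governing_verbs(observations, "with flying colors")
# Several distinct verbs suggest that the verb lies outside the minimal MWE.
verb_is_part_of_mwe = len(verbs) == 1
print(sorted(verbs), verb_is_part_of_mwe)
```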
Syntactic/semantic structure should show how additional elements (such as modifiers) interact with MWEs.
The prepositional phrase (to be) in doubt can be modified, e.g. by little as in (8):
(8) Delegates are in little doubt...
Whereas the relation between the modifier and the basic MWE is not expressed in POS-tagged corpora, such structure is normally annotated in treebanks, with the possible exception of fixed expressions as defined in Sag et al. (2002). Treebanks thereby not only provide a richer annotation, but also a less ambiguous one.
A flat annotation would annotate in and doubt as making up the (minimal) MWE, whereas little may be left unannotated. This leaves it unclear whether little is syntactically unrelated, making the MWE truly discontinuous, or whether it is syntactically integrated and modifies the MWE. In a treebank, the modification relation would be apparent from the place of little in the example’s grammatical structure.
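The contrast between flat and structural annotation can be illustrated with a small sketch. It assumes a simple dependency analysis of in little doubt in which little is attached to doubt and doubt to in; the head indices are an assumption made for the example, and other analyses are possible. Given the flat MWE annotation and the heads, one can check whether a token in the gap is attached to an MWE component.

```python
from typing import Dict, List, Set

tokens = ["Delegates", "are", "in", "little", "doubt"]
# Assumed dependency heads: index of each token's head, -1 for the root.
heads = [1, -1, 1, 4, 2]
# Flat annotation: only "in" and "doubt" are marked as the MWE.
mwe = {2, 4}


def gap_tokens(mwe: Set[int]) -> List[int]:
    """Indices lying between MWE components that are not part of the MWE."""
    return [i for i in range(min(mwe), max(mwe) + 1) if i not in mwe]


def modifies_mwe(i: int, heads: List[int], mwe: Set[int]) -> bool:
    """True if token i is attached, directly or transitively, to the MWE."""
    seen: Set[int] = set()
    while i != -1 and i not in seen:
        seen.add(i)
        i = heads[i]
        if i in mwe:
            return True
    return False


result: Dict[str, bool] = {
    tokens[i]: modifies_mwe(i, heads, mwe) for i in gap_tokens(mwe)
}
print(result)  # {'little': True}
```

With the flat annotation alone, only `gap_tokens` is available, and the status of little stays open; the head information is what resolves it.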
Yvonne Adesam, Gerlof Bouma, Richard Johansson (2015). Multiwords, Word Senses and Multiword Senses in the Eukalyptus Treebank of Written Swedish. In Markus Dickinson, Erhard Hinrichs, Agnieszka Patejuk, and Adam Przepiórkowski, editors, Proceedings of the Fourteenth Workshop on Treebanks and Linguistic Theories (TLT14), pages 3–12, Warsaw, Poland. Institute of Computer Science, Polish Academy of Sciences.
Helge Dyvik, Paul Meurer, Victoria Rosén, Koenraad De Smedt, Petter Haugereid, Gyri Smørdal Losnegaard, Gunn Inger Lyse, and Martha Thunes (2016). NorGramBank: A ‘Deep’ Treebank for Norwegian. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Asunción Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 3555–3562, Portorož, Slovenia. ELRA.
Koenraad De Smedt, Victoria Rosén, and Paul Meurer (2015). Studying consistency in UD treebanks with INESS-Search. In Markus Dickinson, Erhard Hinrichs, Agnieszka Patejuk, and Adam Przepiórkowski, editors, Proceedings of the Fourteenth Workshop on Treebanks and Linguistic Theories (TLT14), pages 258–267, Warsaw, Poland. Institute of Computer Science, Polish Academy of Sciences.
Jakub Piskorski, Marcin Sydow, and Anna Kupść (2007). Lemmatization of Polish Person Names. In ACL 2007. Proceedings of the Workshop on Balto-Slavonic NLP 2007, pages 27–34. Association for Computational Linguistics.
Victoria Rosén, Koenraad De Smedt, Gyri Smørdal Losnegaard, Eduard Bejček, Agata Savary, and Petya Osenova (2016). MWEs in treebanks: From survey to guidelines. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Asunción Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 2323–2330, Portorož, Slovenia. ELRA.
Victoria Rosén, Gyri Smørdal Losnegaard, Koenraad De Smedt, Eduard Bejček, Agata Savary, Adam Przepiórkowski, Petya Osenova, and Verginica Barbu Mititelu (2015). A survey of multiword expressions in treebanks. In Markus Dickinson, Erhard Hinrichs, Agnieszka Patejuk, and Adam Przepiórkowski, editors, Proceedings of the Fourteenth Workshop on Treebanks and Linguistic Theories (TLT14), pages 179–193, Warsaw, Poland. Institute of Computer Science, Polish Academy of Sciences.
Agata Savary, Jakub Waszczuk, and Adam Przepiórkowski (2010). Towards the Annotation of Named Entities in the Polish National Corpus. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC). European Language Resources Association, 17-23 May 2010.