The META-NORD Sofie Parallel Treebank

The Sofie Parallel Treebank is a syntactically annotated parallel corpus based on the first chapters of the novel Sofies verden by Jostein Gaarder, published by Aschehoug forlag.

The treebank is a product of the META-NORD project and its goal of promoting the accessibility of existing treebanks for the languages in the project. The Norwegian novel Sofies verden (Gaarder 1991) was chosen as a suitable basis for treebanking because it is linguistically rich and has been professionally translated into many languages, and because treebanks already existed for text selections from this material in some languages in the META-NORD area.

Previous work by the Nordic Treebank Network, funded by the Nordic Language Technology Program (2001-2005), had not been maintained and was no longer accessible. It was therefore decided to gather those treebanks, document them, supplement them with additional treebanks for languages where this was feasible, and make the resulting resources accessible. This work has been a joint effort between META-NORD and INESS.

Under META-NORD, small pilot treebanks were also constructed for the JRC Acquis Multilingual Parallel Corpus of EU/EEA law texts, providing materials from a different genre. These treebanks are available as the Acquis Parallel Treebank under the Parallel treebanks section of this page. More information about the treebank development in META-NORD can be found in the META-NORD Deliverable 3.4 Parallel Treebanks.

Monolingual treebanks

Please contact Paul Meurer to get access to the treebanks. Clicking on a treebank in the list below will take you to the META-SHARE description of that treebank.

The following treebanks are available for download:

Sofie Danish Treebank
Sofie English Treebank
Sofie Estonian Treebank
Sofie German Treebank
Sofie Icelandic Treebank
Sofie Norwegian Treebank
Sofie Swedish Treebank

The following treebanks can be viewed and searched, but are currently not available for download:

Sofie Finnish Treebank
Sofie Georgian Treebank

The Danish, Estonian, German, Icelandic and Swedish treebanks were developed in the Nordic Treebank Network (NTN), while the Norwegian and Finnish treebanks were developed in META-NORD, in cooperation with the INESS project and FinnTreeBank, respectively. The licensing of Finnish Sofie is still under negotiation. The Georgian annotation was developed by Paul Meurer at Uni Computing and is available under the same license as the NTN treebanks. The English treebank was originally part of the SMULTRON treebank, which is distributed free of charge under the Creative Commons Attribution-Noncommercial 2.5 Switzerland license.

All language pairs have been aligned with the INESS Alignment Tool.

Annotation

The annotations were created in the Nordic Treebank Network, META-NORD, ParGram and SMULTRON. The treebank includes dependency, dependency-CG, constituency and LFG annotations, which have been created automatically, semi-automatically or manually, and have been partly or fully validated. The number of sentences varies between treebanks. Sentences for which no appropriate analysis was achieved have also been included in order to preserve the alignment. See the description of each monolingual treebank for annotation details and evaluation.

License

The monolingual treebanks in the META-NORD Sofie Parallel Treebank are available under this license, which permits INESS to distribute the "Sofie analyses" outside the project under the following terms of use:

a. The "Sofie analyses" can only be used for language technology research and development.
b. The users of the "Sofie analyses" are not allowed to redistribute or to publish the "Sofie analyses", only the knowledge and work that has been made on the basis of the "Sofie analyses".
c. The users of the "Sofie analyses" will ensure appropriate acknowledgement/references to the author of the original text, Jostein Gaarder, to Aschehoug Publishing house, the translation publishing house and to the project INESS.

The alignments between the language pairs are available under a CC-BY license.

Please ensure appropriate acknowledgement and reference to the creators of the alignments and linguistic annotations, as specified in the descriptions of the individual treebanks.

If you intend to use the English treebank for commercial purposes, you must direct your request to the rights holder of that treebank. If any other special licensing is required, please contact INESS.

Download

Monolingual treebanks
Alignments

Acknowledgements

META-NORD and INESS would like to thank the author Jostein Gaarder, the translators and the publishers for permission to publish the material in annotated form.

Appreciation also goes to Tekstlaboratoriet at the University of Oslo and to Prof. Joakim Nivre for permission to use the material developed by the Nordic Treebank Network, and finally to FinnTreeBank and the SMULTRON treebank.

Page last updated by Gyri Smørdal Losnegaard 29.01.2013