NorGramBank consists of approximately 50 million words as of May 2016. Part of the treebank (approximately 350,000 words) has been manually disambiguated with semiautomatic support to serve as a gold standard. The treebank is still being developed, and the goal is to reach 100 million words by June 2017.
NorGramBank is being built by parsing a corpus on the XLE platform with the hand-written grammar NorGram (see Dyvik 2000, Butt et al. 2002, and the online documentation). Parsing is automatic. The gold standard is disambiguated manually using discriminants, and the rest of the parsebank is disambiguated by stochastic parse ranking.
There are two levels of syntactic analysis in the corpus: constituent structure (c-structure) and functional structure (f-structure). The c-structure shows how the words in the sentence are grouped hierarchically into phrases. The f-structure shows what syntactic functions (subject, object, adjunct, etc.) the phrases in the sentence have, as well as grammatical features such as number and tense. These two layers are related to each other through the computational grammar used to construct them. Mousing over part of one structure highlights the corresponding part of the other structure.
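The relation between the two levels can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class CNode, the mapping phi, the category labels, and the feature names are hypothetical and do not reproduce NorGram's actual analyses or the XLE data format. The example sentence is "Hunden bjeffer" ("The dog barks").

```python
from dataclasses import dataclass, field

@dataclass
class CNode:
    """Hypothetical c-structure node: a category label with children or a word."""
    label: str
    children: list = field(default_factory=list)
    word: str = ""

def leaves(node):
    """Return the words dominated by a c-structure node, left to right."""
    if node.word:
        return [node.word]
    return [w for child in node.children for w in leaves(child)]

# f-structure as nested attribute-value pairs (illustrative features only)
f_subj = {"PRED": "hund", "NUM": "sg", "DEF": "+"}
f_root = {"PRED": "bjeffe<SUBJ>", "TENSE": "pres", "SUBJ": f_subj}

# c-structure for "Hunden bjeffer"
n = CNode("N", word="Hunden")
np_node = CNode("NP", [n])
v = CNode("V", word="bjeffer")
vp = CNode("VP", [v])
s = CNode("S", [np_node, vp])

# Correspondence from c-structure nodes to f-structures (assumed mapping).
# Several nodes may map to the same f-structure.
phi = {
    id(s): f_root, id(vp): f_root, id(v): f_root,
    id(np_node): f_subj, id(n): f_subj,
}

def corresponding_nodes(root, phi, f):
    """Labels of all nodes (preorder) whose image under phi is the f-structure f."""
    found = [root.label] if phi.get(id(root)) is f else []
    for child in root.children:
        found.extend(corresponding_nodes(child, phi, f))
    return found

print(leaves(s))
print(corresponding_nodes(s, phi, f_subj))
```

Looking up the nodes that correspond to a given f-structure is, in spirit, what the highlighting in the interface displays: selecting the subject's f-structure picks out the NP and the N nodes.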
The texts in the treebank are both fiction (mainly novels, both for children and adults) and nonfiction (many genres, including newspaper text, popular science writing, television subtitles, school textbooks, etc.).
The analyses are stored in XLE Prolog format and may be exported in TigerXML and Penn Treebank formats.
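As an illustration of working with a TigerXML export, the sketch below reads terminals and nonterminals with Python's standard xml.etree.ElementTree module. The sample document is invented for this example: the element names (s, graph, terminals, t, nonterminals, nt, edge) follow the general TigerXML layout, but the attribute values, category labels, and edge labels are assumptions and may differ from what an actual NorGramBank export contains.

```python
import xml.etree.ElementTree as ET

# Invented sample in the general TigerXML layout (not real NorGramBank output)
TIGER_SAMPLE = """<corpus id="demo">
  <body>
    <s id="s1">
      <graph root="s1_500">
        <terminals>
          <t id="s1_1" word="Hunden" pos="N"/>
          <t id="s1_2" word="bjeffer" pos="V"/>
        </terminals>
        <nonterminals>
          <nt id="s1_500" cat="S">
            <edge label="SUBJ" idref="s1_501"/>
            <edge label="HEAD" idref="s1_2"/>
          </nt>
          <nt id="s1_501" cat="NP">
            <edge label="HEAD" idref="s1_1"/>
          </nt>
        </nonterminals>
      </graph>
    </s>
  </body>
</corpus>"""

def read_sentences(xml_text):
    """Yield (sentence id, words, phrases) for each <s> element.

    phrases maps a nonterminal id to (category, [(edge label, child id), ...]).
    """
    root = ET.fromstring(xml_text)
    for s in root.iter("s"):
        words = [t.get("word") for t in s.iter("t")]
        phrases = {
            nt.get("id"): (
                nt.get("cat"),
                [(e.get("label"), e.get("idref")) for e in nt.findall("edge")],
            )
            for nt in s.iter("nt")
        }
        yield s.get("id"), words, phrases

for sent_id, words, phrases in read_sentences(TIGER_SAMPLE):
    print(sent_id, " ".join(words))
    for nt_id, (cat, edges) in phrases.items():
        print(" ", nt_id, cat, edges)
```

For a large export, parsing the file incrementally (for example with ET.iterparse) would probably be preferable to loading the whole document into memory.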
Lexical Functional Grammar (LFG)
Helge Dyvik, Paul Meurer, Victoria Rosén, Koenraad De Smedt, Petter Haugereid, Gyri Smørdal Losnegaard, Gunn Inger Lyse, and Martha Thunes. NorGramBank: A ‘Deep’ Treebank for Norwegian. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), Portorož, Slovenia, 2016. ELRA.
For more publications, please see the INESS Publications page.
Availability and search:
NorGramBank is made available through the INESS treebanking infrastructure. Both c-structures and f-structures may be searched with INESS Search.
MWE annotation in the treebank:
The treebank has both nested and discontinuous MWEs.
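The two properties can be made concrete with a small sketch that treats an MWE as a set of token positions. This representation, the function names, and the English example sentences are assumptions chosen for illustration; they are not the annotation scheme used in the treebank.

```python
def is_discontinuous(indices):
    """True if the token positions of an MWE leave a gap between first and last."""
    idx = sorted(indices)
    return idx[-1] - idx[0] + 1 > len(idx)

def is_nested_in(inner, outer):
    """True if every token of inner also belongs to outer, and outer has more."""
    return set(inner) < set(outer)

# Discontinuous: the particle verb "looked ... up"
tokens_a = "She looked the word up".split()
look_up = {1, 4}

# Nested: "out of" inside the idiom "let the cat out of the bag"
tokens_b = "She did not let the cat out of the bag".split()
idiom = {3, 4, 5, 6, 7, 8, 9}
out_of = {6, 7}

print([tokens_a[i] for i in sorted(look_up)], is_discontinuous(look_up))
print([tokens_b[i] for i in sorted(out_of)], is_nested_in(out_of, idiom))
```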
NorGramBank is a dynamic parsebank. It can be reparsed at any time with new versions of the grammar and lexicon, resulting in a new version of the parsebank. Its main construction phase will be completed by June 30, 2017. A system for versioning will be in place sometime in 2016.
Parseme MWE wiki contact: