INESS :: Treebank Selection

Select a set of treebanks to work with.

Click on a treebank name below to proceed. All selected treebanks will be available for viewing and searching.

Selected	Name	Collection	Type	Sentences	Words	Indexed	Description	License	Downloads
	Total			1 840 491	31 504 351
	Dutch (nld)			65 200	990 087
	nld-lassy-con	Alpino	constituency-alpino	65 200	990 087	yes	The Lassy Small Corpus 1.1 is a 1-million-word corpus with manually verified syntactic annotations. The lemma and POS-tag annotations were automatically assigned using Tadpole, and the syntactic dependency annotations were assigned using the Alpino parser. The automatically assigned lemmas, POS tags and syntactic dependency annotations were then checked and corrected. Organisations involved in the building of the Lassy Large Corpus: Alfa-informatica, University of Groningen; CCL, K.U. Leuven. ACCESS: The INESS copy can be used by all employees and students of the University of Bergen, Department of Linguistic, Literary and Aesthetic Studies. Others must first apply to the rights holders of the original. The Lassy version at INESS may be used for academic purposes under the following conditions: attribution required, no derivatives, no redistribution, non-commercial.	unspecified	no
	English (eng)			101	658
	eng-pargram (aligned)	ParGram	lfg	101	658	yes	The ParGram collection is a collection of parallel treebanks covering a set of chosen syntactic constructions. It is a collaborative effort of the ParGram project, along with the ParSem project, by research groups in industrial and academic institutions around the world. The aim of ParGram is to produce wide-coverage grammars for a variety of languages. These are written collaboratively within the linguistic framework of LFG (Lexical Functional Grammar) and with a commonly agreed-upon set of grammatical features. The XLE (Xerox Linguistic Environment) is used as a development platform. ParSem develops semantic structures based on the ParGram syntactic structures. Most of the ParSem systems use the XLE’s XFR system. Regular semiannual meetings are held to bring together the various research groups involved in ParGram and ParSem.	CC-BY	no
	Georgian (kat)			1 077	10 146
	kat-pargram (aligned)	GeoGram, ParGram	lfg	52	231	yes	The ParGram collection is a collection of parallel treebanks covering a set of chosen syntactic constructions. It is a collaborative effort of the ParGram project, along with the ParSem project, by research groups in industrial and academic institutions around the world. The aim of ParGram is to produce wide-coverage grammars for a variety of languages. These are written collaboratively within the linguistic framework of LFG (Lexical Functional Grammar) and with a commonly agreed-upon set of grammatical features. The XLE (Xerox Linguistic Environment) is used as a development platform. ParSem develops semantic structures based on the ParGram syntactic structures. Most of the ParSem systems use the XLE’s XFR system. Regular semiannual meetings are held to bring together the various research groups involved in ParGram and ParSem.	CC-BY	no
	kat-sofie (aligned)	GeoGram, Sofie	lfg	1 025	9 915	yes	The Georgian part of the META-NORD Sofie Parallel Treebank. This is a syntactically annotated parallel corpus based on the first chapters of the novel “Sofies verden” (Sophie's World) by Jostein Gaarder, published by Aschehoug forlag. The treebank consists of grammatical annotations of extracts from the Georgian translation of the novel. The Georgian translation is published by Bakur Sulakauri Publishing.	unspecified	no
	Indonesian (ind)			79	433
	ind-pargram (aligned)	ParGram	lfg	79	433	yes	The ParGram collection is a collection of parallel treebanks covering a set of chosen syntactic constructions. It is a collaborative effort of the ParGram project, along with the ParSem project, by research groups in industrial and academic institutions around the world. The aim of ParGram is to produce wide-coverage grammars for a variety of languages. These are written collaboratively within the linguistic framework of LFG (Lexical Functional Grammar) and with a commonly agreed-upon set of grammatical features. The XLE (Xerox Linguistic Environment) is used as a development platform. ParSem develops semantic structures based on the ParGram syntactic structures. Most of the ParSem systems use the XLE’s XFR system. Regular semiannual meetings are held to bring together the various research groups involved in ParGram and ParSem.	CC-BY	no
	Norwegian (nor)			1 219 910	24 085 506
	nor-stortinget	NorGram, NorGramBank	lfg	227 699	4 477 383	yes	The treebank "Proceedings of Norwegian parliamentary debates (2008-2015)" is a collection of transcriptions of Norwegian parliamentary debates between 2008 and 2015, downloaded from https://data.stortinget.no/. Work is ongoing to preprocess the documents (e.g. to register previously unknown words) and load them into INESS for automatic parsing; hence, as of June 2016, the treebank is still growing. For up-to-date information on the treebank size and the documents included, choose the relevant treebank and then click "Treebank Details" (in the left-hand menu). Each sentence has the following metadata, which is searchable in the INESS search system: (1) Language variety: Norwegian Bokmål (nob) or Norwegian Nynorsk (nno), based on automatic recognition of language variety, implemented by Paul Meurer at Uni Research Computing. There are also some transcriptions of speeches in English and Danish. (2) Speaker's name. (3) Date and time. (4) Political party to which the speaker belongs. (5) Type of contribution (e.g. 'hovedinnlegg' [main contribution] or 'replikk' [reply]). AVAILABILITY: The material from 2008-2015 is searchable via the corpus tool Corpuscle. Via the treebank portal INESS (clarino.uib.no/iness) you can search the sentence analyses for the set of documents that have so far been preprocessed and automatically parsed.	NLOD	no
	nor-stortinget_1	NorGram, NorGramBank	lfg	257 130	5 090 162	yes	The treebank "Proceedings of Norwegian parliamentary debates (2008-2015)" is a collection of transcriptions of Norwegian parliamentary debates between 2008 and 2015, downloaded from https://data.stortinget.no/. Work is ongoing to preprocess the documents (e.g. to register previously unknown words) and load them into INESS for automatic parsing; hence, as of June 2016, the treebank is still growing. For up-to-date information on the treebank size and the documents included, choose the relevant treebank and then click "Treebank Details" (in the left-hand menu). Each sentence has the following metadata, which is searchable in the INESS search system: (1) Language variety: Norwegian Bokmål (nob) or Norwegian Nynorsk (nno), based on automatic recognition of language variety, implemented by Paul Meurer at Uni Research Computing. There are also some transcriptions of speeches in English and Danish. (2) Speaker's name. (3) Date and time. (4) Political party to which the speaker belongs. (5) Type of contribution (e.g. 'hovedinnlegg' [main contribution] or 'replikk' [reply]). AVAILABILITY: The material from 2008-2015 is searchable via the corpus tool Corpuscle. Via the treebank portal INESS (clarino.uib.no/iness) you can search the sentence analyses for the set of documents that have so far been preprocessed and automatically parsed.	NLOD	no
	nor-stortinget_2	NorGram, NorGramBank	lfg	236 006	4 648 667	yes	The treebank "Proceedings of Norwegian parliamentary debates (2008-2015)" is a collection of transcriptions of Norwegian parliamentary debates between 2008 and 2015, downloaded from https://data.stortinget.no/. Work is ongoing to preprocess the documents (e.g. to register previously unknown words) and load them into INESS for automatic parsing; hence, as of June 2016, the treebank is still growing. For up-to-date information on the treebank size and the documents included, choose the relevant treebank and then click "Treebank Details" (in the left-hand menu). Each sentence has the following metadata, which is searchable in the INESS search system: (1) Language variety: Norwegian Bokmål (nob) or Norwegian Nynorsk (nno), based on automatic recognition of language variety, implemented by Paul Meurer at Uni Research Computing. There are also some transcriptions of speeches in English and Danish. (2) Speaker's name. (3) Date and time. (4) Political party to which the speaker belongs. (5) Type of contribution (e.g. 'hovedinnlegg' [main contribution] or 'replikk' [reply]). AVAILABILITY: The material from 2008-2015 is searchable via the corpus tool Corpuscle. Via the treebank portal INESS (clarino.uib.no/iness) you can search the sentence analyses for the set of documents that have so far been preprocessed and automatically parsed.	NLOD	no
	nor-stortinget_3	NorGram, NorGramBank	lfg	252 845	4 924 530	yes	The treebank "Proceedings of Norwegian parliamentary debates (2008-2015)" is a collection of transcriptions of Norwegian parliamentary debates between 2008 and 2015, downloaded from https://data.stortinget.no/. Work is ongoing to preprocess the documents (e.g. to register previously unknown words) and load them into INESS for automatic parsing; hence, as of June 2016, the treebank is still growing. For up-to-date information on the treebank size and the documents included, choose the relevant treebank and then click "Treebank Details" (in the left-hand menu). Each sentence has the following metadata, which is searchable in the INESS search system: (1) Language variety: Norwegian Bokmål (nob) or Norwegian Nynorsk (nno), based on automatic recognition of language variety, implemented by Paul Meurer at Uni Research Computing. There are also some transcriptions of speeches in English and Danish. (2) Speaker's name. (3) Date and time. (4) Political party to which the speaker belongs. (5) Type of contribution (e.g. 'hovedinnlegg' [main contribution] or 'replikk' [reply]). AVAILABILITY: The material from 2008-2015 is searchable via the corpus tool Corpuscle. Via the treebank portal INESS (clarino.uib.no/iness) you can search the sentence analyses for the set of documents that have so far been preprocessed and automatically parsed.	NLOD	no
	nor-stortinget_4	NorGram, NorGramBank	lfg	246 230	4 944 764	yes	The treebank "Proceedings of Norwegian parliamentary debates (2008-2015)" is a collection of transcriptions of Norwegian parliamentary debates between 2008 and 2015, downloaded from https://data.stortinget.no/. Work is ongoing to preprocess the documents (e.g. to register previously unknown words) and load them into INESS for automatic parsing; hence, as of June 2016, the treebank is still growing. For up-to-date information on the treebank size and the documents included, choose the relevant treebank and then click "Treebank Details" (in the left-hand menu). Each sentence has the following metadata, which is searchable in the INESS search system: (1) Language variety: Norwegian Bokmål (nob) or Norwegian Nynorsk (nno), based on automatic recognition of language variety, implemented by Paul Meurer at Uni Research Computing. There are also some transcriptions of speeches in English and Danish. (2) Speaker's name. (3) Date and time. (4) Political party to which the speaker belongs. (5) Type of contribution (e.g. 'hovedinnlegg' [main contribution] or 'replikk' [reply]). AVAILABILITY: The material from 2008-2015 is searchable via the corpus tool Corpuscle. Via the treebank portal INESS (clarino.uib.no/iness) you can search the sentence analyses for the set of documents that have so far been preprocessed and automatically parsed.	NLOD	no
	Norwegian Nynorsk (nno)			554 042	6 417 062
	nno-child	NorGram, NorGramBank	lfg	106 447	1 043 278	yes	The treebank "NorGrambank children's fiction in Norwegian Nynorsk" is a syntactically annotated corpus based on data taken from bokhylla.no at the National Library of Norway. This treebank is part of the INESS NorGramBank collection (see URL in metadata). As of October 2015, the treebank comprises 106434 sentences, 1043260 words and 76 documents. The source text was OCR-read by the National Library of Norway; INESS has preprocessed the source text semi-automatically with regard to OCR errors (misinterpreted letters etc.) before syntactic parsing.	CLARIN_ACA	no
	nno-fn	NorGram	lfg	21 723	371 744	yes	The "NorGram Non-fiction text in Norwegian Nynorsk from Forskning.no" treebank is a syntactically annotated corpus based on data taken from the Norwegian popular science website Forskning.no. This treebank is part of the INESS NorGramBank collection (see URL in metadata). As of October 2015, the treebank comprises 21723 sentences, 371744 words and 582 documents.	CLARIN_RES-DEP	no
	nno-ndt-lfg	NDT, NorGram, NorGramBank	lfg	17 579	272 023	yes	The treebank "NorGram NDT in LFG in Norwegian Nynorsk (derivate from Norwegian Dependency Treebank)" is based on the text material in the Norwegian Dependency Treebank (NDT), available from Språkbanken at the National Library of Norway. The sentences have been parsed and disambiguated in the Norwegian LFG treebank using the NorGram LFG grammar.	CC-BY	no
	nno-nnk-av	NorGram, NorGramBank	lfg	7 847	123 436	yes	The treebank "NorGramBank annotations of Newspaper text from 'Nynorskkorpuset ved Norsk Ordbok 2014'" is a syntactically annotated corpus which uses text extracts from Nynorskkorpuset ved Norsk Ordbok 2014 (no2014.uio.no). This treebank is part of the INESS NorGramBank collection (see URL in metadata).	CLARIN_ACA-DEP	no
	nno-nnk-sa	NorGram, NorGramBank	lfg	38 332	623 281	yes	The treebank "Annotations of non-fiction text from 'Nynorskkorpuset ved Norsk Ordbok 2014'" is a syntactically annotated corpus which uses text extracts from Nynorskkorpuset ved Norsk Ordbok 2014 (no2014.uio.no). This treebank is part of the INESS NorGramBank collection (see URL in metadata).	CLARIN_ACA-DEP	no
	nno-nnk-sk	NorGram, NorGramBank	lfg	94 409	969 308	yes	The treebank "Annotations of fiction text from 'Nynorskkorpuset ved Norsk Ordbok 2014'" is a syntactically annotated corpus which uses text extracts from Nynorskkorpuset ved Norsk Ordbok 2014 (no2014.uio.no). This treebank is part of the INESS NorGramBank collection (see URL in metadata).	CLARIN_ACA-DEP	no
	nno-novel	NorGram, NorGramBank	lfg	267 705	3 013 992	yes	The "NorGramBank fiction in Norwegian Nynorsk" treebank is a syntactically annotated corpus based on data taken from bokhylla.no at the National Library of Norway. This treebank is part of the INESS NorGramBank collection (see URL in metadata). As of October 2015, the treebank comprises 260285 sentences, 2884376 words and 91 documents. The source text was OCR-read by the National Library of Norway; INESS has preprocessed the source text semi-automatically with regard to OCR errors (misinterpreted letters etc.) before syntactic parsing.	CLARIN_ACA	no
	Portuguese (por)			50	340
	por-pargram	ParGram	lfg	50	340	yes	The ParGram collection is a collection of parallel treebanks covering a set of chosen syntactic constructions. It is a collaborative effort of the ParGram project, along with the ParSem project, by research groups in industrial and academic institutions around the world. The aim of ParGram is to produce wide-coverage grammars for a variety of languages. These are written collaboratively within the linguistic framework of LFG (Lexical Functional Grammar) and with a commonly agreed-upon set of grammatical features. The XLE (Xerox Linguistic Environment) is used as a development platform. ParSem develops semantic structures based on the ParGram syntactic structures. Most of the ParSem systems use the XLE’s XFR system. Regular semiannual meetings are held to bring together the various research groups involved in ParGram and ParSem.	CC-BY	no
	Turkish (tur)			32	119
	tur-pargram (aligned)	ParGram	lfg	32	119	yes	The ParGram collection is a collection of parallel treebanks covering a set of chosen syntactic constructions. It is a collaborative effort of the ParGram project, along with the ParSem project, by research groups in industrial and academic institutions around the world. The aim of ParGram is to produce wide-coverage grammars for a variety of languages. These are written collaboratively within the linguistic framework of LFG (Lexical Functional Grammar) and with a commonly agreed-upon set of grammatical features. The XLE (Xerox Linguistic Environment) is used as a development platform. ParSem develops semantic structures based on the ParGram syntactic structures. Most of the ParSem systems use the XLE’s XFR system. Regular semiannual meetings are held to bring together the various research groups involved in ParGram and ParSem.	CC-BY	no

Design & implementation: Paul Meurer, CLARINO Bergen Centre, 2025