Corpuscle :: Corpus list
Select corpora by language or collection:
Languages: All · Abkhazian (1) · Bulgarian (1) · English (15) · Faroese (1) · Georgian (4) · German (1) · Mingrelian (1) · Norwegian (2) · Norwegian Bokmål (9) · Norwegian Nynorsk (3) · Old Georgian (1) · Old Norse (2) · Scots (1) · Slovenian (1) · Spanish (4) · Svan (1)
Collections: All · ASK (2) · AbNC (1) · Aviskorpus (3) · GNC (7) · ICAME (10) · Menota (2) · PubNoEnPC (4) · Talebanken (1)
Choose a corpus from the list below. Some corpora are only available when you have signed in.
Corpus Language(s) Size (words & punctuation) Updated Description License
Menota Old Norse (non) 1 618 028 2017-06-02 Menota is an archive of Medieval Nordic Texts. CC-BY-SA
GNC Old Georgian Old Georgian (oge) 4 349 721 2017-06-02 The Georgian National Corpus – Old Georgian CC-BY-NC
ASK Hovedkorpus Norwegian Bokmål (nob) 768 043 2018-02-14 ASK is an electronic, searchable text corpus of Norwegian as a second language, with links between linguistic data and personal data. CLARIN_RES-PRIV
GNC Middle Georgian Georgian (kat) 1 242 634 2017-06-02 The Georgian National Corpus – Middle Georgian CC-BY-NC
Menota-test Old Norse (non) 119 239 2018-04-25 This is a test version of Menota, used to test new features and stylesheets. CC-BY-SA
ASK Korrektkorpus Norwegian Bokmål (nob) 785 451 2017-06-02 ASK is an electronic, searchable text corpus of Norwegian as a second language, with links between linguistic data and personal data, established by the Norwegian Second Language Corpus project. CLARIN_RES-PRIV
GNC Modern Georgian Georgian (kat) 451 584 2017-06-02 The Georgian National Corpus – Modern Georgian CC-BY-NC
Føroyskur talumálsbanki Faroese (fao) 45 135 2017-11-05 Transcribed and annotated dialect recordings
GRC Georgian (kat) 182 883 787 2018-02-24 Georgian Reference Corpus unspecified
SSGG Georgian (kat) 152 708 2017-06-02 SSGG – The sociolinguistic situation of present-day Georgia CLARIN_ACA-NC-LOC-PRIV-ND-*
GNC Megrelian Mingrelian (xmf) 89 404 2017-06-02 The Georgian National Corpus – Mingrelian CC-BY-NC
AbNC Abkhazian (abk) 10 085 059 2018-04-05 The Abkhaz National Corpus is a comprehensive and open, grammatically annotated text corpus. CLARIN_PUB-BY-NC-ND
GNC Svan Svan (sva) 473 180 2017-06-02 The Georgian National Corpus – Svan CC-BY-NC
Aviskorpus (Bokmål) Norwegian Bokmål (nob) 1 509 076 098 2017-06-02 The Norwegian Newspaper Corpus (NNC) Bokmål version is a large monitor corpus representing contemporary Norwegian language in the written variety Norwegian Bokmål. CC-BY
Aviskorpus (Nynorsk) Norwegian Nynorsk (nno) 16 070 002 2017-06-02 The Norwegian Newspaper Corpus (NNC) Nynorsk is a large monitor corpus representing contemporary Norwegian language in the written variety Norwegian Nynorsk. CC-BY
Aviskorpus ann. Norwegian Bokmål (nob) 28 969 124 2017-06-02 This is a subpart of Norsk aviskorpus, grammatically annotated and classified. It comprises 35 692 210 tokens and covers Norwegian Bokmål from the period 2001-2009. CC-BY
BulTreeBank Bulgarian (bul) 229 732 2017-06-02 This distribution represents only the morphological information encoded in BulTreeBank - HPSG-based Treebank of Bulgarian. It contains about 214000 tokens. It was used for the training of the TreeTagger for Bulgarian. MS-NC-NoReD
COLA Spanish (spa) 479 474 2017-06-02 COLA (Corpus Oral de Lenguaje Adolescente Resource) is a corpus of recorded, spontaneous speech among teenagers from different schools and youth clubs in Madrid, Buenos Aires and Santiago de Chile. It was created to study teenage language in Spanish. CLARIN_RES-PLAN-INF-PRIV-ND-DEP-*
Child Rights English (eng) 102 967 2017-04-24 This set of documents combines reports to the UN Committee on the Rights of the Child from civil society organizations (CSOs), which include NGOs, NHRIs and Ombudspersons, from Finland, Norway and Spain. The documents also include the Committee on the Rights of the Child's Concluding Observations. All the documents are collected from the previous reporting rounds for Finland (2008), Norway (2009) and Spain (2010). CC-BY
Coryl English (eng) 129 421 2017-06-02 Coryl is a young learner corpus, which consists of English texts written by Norwegian pupils. CC-BY
FTA/Eng English (eng) 1 270 362 2017-06-02 Corpus of Free Trade Agreements (English/Spanish) CLARIN_ACA
FTA/Spa Spanish (spa) 1 342 828 2017-06-02 Corpus of Free Trade Agreements (English/Spanish) CLARIN_ACA
Forskning.no Norwegian Bokmål (nob) 10 084 683 2017-06-02 Data set containing texts from the popular science website forskning.no from the period 1998 - 2012. CLARIN_RES-DEP
Forskning.no (2017) Norwegian Bokmål (nob) 21 467 688 2017-11-20 Data set containing texts from the popular science website forskning.no from the period 1998 - 2017. CLARIN_RES-DEP
ICAME – ACE English (eng) 1 152 533 2017-06-02 ACE is the first systematically compiled heterogeneous corpus in Australia, designed to support a variety of linguistic research. CLARIN_ACA
ICAME – BROWN Family English (eng) 6 897 518 2017-06-02 This is a collection of Brown, LOB, Frown, FLOB, BLOB and BE06. The collection was compiled by UCREL, Lancaster. CLARIN_ACA
ICAME – CEECS English (eng) 514 224 2017-06-02 The Corpus of Early English Correspondence (CEEC) has been compiled for the study of social variables in the history of English. CLARIN_ACA
ICAME – COLT English (eng) 444 166 2017-06-02 COLT is a corpus of London Teenage Language with audio recordings. CLARIN_RES-PLAN-INF-PRIV-ND-DEP-*
ICAME – FLOB English (eng) 1 133 503 2017-06-02 The Freiburg - LOB Corpus of British English (FLOB) contains texts from 1991. CLARIN_ACA
ICAME – FROWN English (eng) 1 145 190 2017-06-02 The Freiburg - Brown Corpus of American English (Frown) contains texts from 1991. CLARIN_ACA
ICAME – HC English (eng) 1 851 007 2017-06-02 The Helsinki Corpus of English Texts is a structured multi-genre diachronic corpus. CLARIN_ACA
ICAME – HCOS Scots (sco) 959 977 2017-06-02 The Helsinki Corpus of Older Scots was compiled as a supplement to the diachronic part of the Helsinki Corpus of English Texts. CLARIN_ACA
ICAME – LLC English (eng) 574 340 2017-06-02 The London-Lund Corpus contains samples of educated spoken British English, in orthographic transcription with detailed prosodic marking. CLARIN_ACA
ICAME – LOB English (eng) 1 156 902 2017-06-02 The Lancaster - Oslo/Bergen (LOB) Corpus is a million-word collection of present-day (1961) British English texts. CLARIN_ACA
Jos1M Slovenian (slv) 1 182 946 2017-06-02 Project JOS: Linguistic Annotation of Slovene (http://nl.ijs.si/jos/index-sl.html) CC-BY-NC
KIAP English (eng) 3 900 925 2017-06-02 KIAP is a corpus of 450 research articles covering three disciplines (economics, linguistics and medicine) and three languages (English, French and Norwegian). CC-BY
NBs frie tekster (Bokmål) Norwegian Bokmål (nob) 516 392 689 2017-06-02 This corpus contains Norwegian (Bokmål) texts from NBDigital (the National Library’s collection of OCRed books) that are free of copyright and have an OCR confidence higher than 0.9. CC-ZERO
NBs frie tekster (Nynorsk) Norwegian Nynorsk (nno) 46 017 287 2017-06-02 This corpus contains Norwegian (Nynorsk) texts from NBDigital (the National Library’s collection of OCRed books) that are free of copyright and have an OCR confidence higher than 0.9. CC-ZERO
NSPC/Nor Norwegian (nor) 2 293 213 2017-06-02 The NSPC is a parallel, unidirectional translation corpus of contemporary Norwegian written texts translated into Spanish, published between 2000 and 2009. It contains fiction and non-fiction, and each text is classified according to genre, the author's gender and the gender and mother tongue of the translator. CLARIN_ACA
NSPC/Spa Spanish (spa) 2 465 968 2017-06-02 The NSPC is a parallel, unidirectional translation corpus of contemporary Norwegian written texts translated into Spanish, published between 2000 and 2009. It contains fiction and non-fiction, and each text is classified according to genre, the author's gender and the gender and mother tongue of the translator. CLARIN_ACA
PubBEPC (eng) English (eng) 448 717 2018-03-22 PubBEPC is a Bokmål-English sentence-aligned parallel corpus built from the public web sites www.nav.no, www.nyinorge.no and skatteetaten.no. CC-BY
PubBEPC (nob) Norwegian Bokmål (nob) 359 401 2018-03-22 PubBEPC is a Bokmål-English sentence-aligned parallel corpus built from the public web sites www.nav.no, www.nyinorge.no and skatteetaten.no. CC-BY
PubNEPC (eng) English (eng) 353 838 2018-04-02 PubNEPC is a Nynorsk-English sentence-aligned parallel corpus built from the public web sites www.nav.no and skatteetaten.no. CC-BY
PubNEPC (nno) Norwegian Nynorsk (nno) 289 723 2018-04-02 PubNEPC is a Nynorsk-English sentence-aligned parallel corpus built from the public web sites www.nav.no and skatteetaten.no. CC-BY
Storting debates Norwegian Bokmål (nob) 28 533 334 2017-06-02 The corpus "Proceedings of Norwegian parliamentary debates (2008-2015)" is a collection of transcriptions of Norwegian parliamentary debates between 2008 and 2015, downloaded from https://data.stortinget.no/. NLOD
TRIS/de-at German (deu) 841 861 2017-06-02 The TRIS corpus (Parallel Corpus of documents from the Technical Regulations Information System for German-Spanish) is a specialized parallel corpus of Spanish-German texts (ES-ES, DE-AT and DE-DE) from the European Commission, from the period 1997-2010. CC-BY-NC-SA
TRIS/es-es Spanish (spa) 1 209 985 2017-06-02 The TRIS corpus (Parallel Corpus of documents from the Technical Regulations Information System for German-Spanish) is a specialized parallel corpus of Spanish-German texts (ES-ES, DE-AT and DE-DE) from the European Commission, from the period 1997-2010. CC-BY-NC-SA
Talk Of Norway Norwegian (nor) 63 803 594 2017-06-02 Corpus based on the Talk of Norway (TON) dataset v. 1.0, a collection of Norwegian parliament speeches from the 1998-1999 to 2015-2016 sessions. NLOD

Design & implementation: Paul Meurer, Universitetet i Bergen, CLARINO Centre, 2018