Corpuscle :: Corpus list
Select corpora by language or collection:
Languages: All · Abkhazian (1) · Bulgarian (1) · English (16) · Faroese (1) · French (1) · Georgian (8) · German (1) · Mingrelian (1) · Norwegian (3) · Norwegian Bokmål (14) · Norwegian Nynorsk (7) · Old Georgian (1) · Old Norse (3) · Scots (1) · Slovenian (1) · Spanish (4) · Svan (1)
Collections: All · ASK (4) · AbNC (1) · Aviskorpus (3) · GNC (11) · ICAME (10) · Menota (4) · PubNoEnPC (4) · Talebanken (4)
Choose a corpus from the list below. Some corpora are only available when you have signed in.
Corpus Language(s) Size (words & punctuation) Updated Description License
Menota (trans) Norwegian Bokmål (nob) 75 387 2020-09-24 Menota Archive
Dialektendring Norwegian Nynorsk (nno) 4 259 502 2020-04-25 Transcribed and annotated dialect recordings CLARIN_RES-PRIV
ASK Hovedkorpus Norwegian Bokmål (nob) 768 043 2020-04-25 ASK is an electronic, searchable text corpus of Norwegian as a second language, with links between linguistic data and personal data. CLARIN_RES-PRIV
Menota Old Norse (non) 1 896 522 2020-09-29 Menota is an archive of Medieval Nordic Texts. CC-BY-SA
GNC Old Georgian Old Georgian (oge) 6 062 122 2020-04-25 The Georgian National Corpus – Old Georgian CC-BY-NC
Industristad Norwegian Nynorsk (nno) 1 775 521 2020-11-01 Transcribed and annotated dialect recordings CLARIN_RES-PRIV
Menota-test Old Norse (non) 268 110 2020-10-13 This is a test version of Menota, used to test new features and stylesheets. CC-BY-SA
Menota-diploma Old Norse (non) 2 038 2020-09-22 Menota Archive, diploma
GNC Middle Georgian Georgian (kat) 1 432 262 2020-04-25 The Georgian National Corpus – Middle Georgian CC-BY-NC
ASK Korrektkorpus Norwegian Bokmål (nob) 785 451 2020-04-25 ASK is an electronic, searchable text corpus of Norwegian as a second language, with links between linguistic data and personal data, established by the Norwegian Second Language Corpus project. CLARIN_RES-PRIV
GNC Modern Georgian Georgian (kat) 2 108 370 2020-04-25 The Georgian National Corpus – Modern Georgian CC-BY-NC
ASK Tillegg Norwegian Bokmål (nob) 44 529 2020-04-25 Supplemental texts for ASK CLARIN_RES-PRIV
Talesøk Norwegian Nynorsk (nno) 1 560 968 2020-04-26 Transcribed and annotated dialect recordings CLARIN_RES-PRIV
GRC Georgian (kat) 202 338 624 2020-04-25 Georgian Reference Corpus unspecified
Føroyskur talumálsbanki Faroese (fao) 471 178 2020-04-25 Transcribed and annotated dialect recordings CLARIN_RES-PRIV
GDC Georgian (kat) 1 694 362 2020-04-25 Georgian dialect corpus
SSGG Georgian (kat) 152 708 2020-04-26 SSGG – The sociolinguistic situation of present-day Georgia CLARIN_ACA-NC-LOC-PRIV-ND-*
AbNC Abkhazian (abk) 10 372 902 2020-04-25 The Abkhaz National Corpus is a comprehensive and open, grammatically annotated text corpus. CLARIN_PUB-BY-NC-ND
GNC Political texts Georgian (kat) 1 436 075 2020-04-25 Georgian National Corpus, Political texts
GNC Law texts Georgian (kat) 1 495 989 2020-04-25 Georgian National Corpus, Old and Middle Georgian, Law texts
GNC Megrelian Mingrelian (xmf) 89 404 2020-04-25 The Georgian National Corpus – Mingrelian CC-BY-NC
GNC Svan Svan (sva) 473 180 2020-04-25 The Georgian National Corpus – Svan CC-BY-NC
GNC Wikipedia Georgian (kat) 20 974 721 2020-04-26 Georgian Wikipedia
ASK Hovedk./2015 Norwegian Bokmål (nob) 36 142 2020-04-25 Andrespråkskorpus: a Norwegian learners’ corpus. 2015 addition CLARIN_RES-PRIV
Avis/INESS Norwegian (nor) 202 360 007 2020-04-25 Years 2012 and 2013 of Aviskorpus
Aviskorpus (Bokmål) Norwegian Bokmål (nob) 1 952 671 824 2020-05-31 The Norwegian Newspaper Corpus (NNC) Bokmål version is a large monitor corpus representing contemporary Norwegian language in the written variety Norwegian Bokmål. CC-BY
Aviskorpus (Nynorsk) Norwegian Nynorsk (nno) 21 010 885 2020-06-01 The Norwegian Newspaper Corpus (NNC) Nynorsk is a large monitor corpus representing contemporary Norwegian language in the written variety Norwegian Nynorsk. CC-BY
Aviskorpus ann. Norwegian Bokmål (nob) 28 969 124 2020-04-26 This is a subpart of Norsk aviskorpus, grammatically annotated and classified. It comprises 35 692 210 tokens and covers Norwegian bokmål in the time span 2001-2009. CC-BY
BulTreeBank Bulgarian (bul) 229 732 2020-04-25 This distribution represents only the morphological information encoded in BulTreeBank - HPSG-based Treebank of Bulgarian. It contains about 214000 tokens. It was used for the training of the TreeTagger for Bulgarian. MS-NC-NoReD
COLA Spanish (spa) 479 474 2020-04-25 COLA (Corpus Oral de Lenguaje Adolescente Resource) is a corpus of recorded, spontaneous speech among teenagers from different schools and youth clubs in Madrid, Buenos Aires and Santiago de Chile. It was created to study teenage language in Spanish. CLARIN_RES-PLAN-INF-PRIV-ND-DEP-*
Child Rights English (eng) 102 967 2020-04-25 This set of documents combines reports to the UN Committee on the Rights of the Child from civil society organizations (CSOs), including NGOs, NHRIs and Ombudspersons, from Finland, Norway and Spain. The documents also include the Committee on the Rights of the Child's Concluding Observations. All documents were collected from the previous reporting rounds for Finland (2008), Norway (2009) and Spain (2010). CC-BY
Coryl English (eng) 129 421 2020-04-25 Coryl is a young learner corpus, which consists of English texts written by Norwegian pupils. CC-BY
FTA/Eng English (eng) 1 270 362 2020-04-25 Corpus of Free Trade Agreements (English/Spanish) CLARIN_ACA
FTA/Spa Spanish (spa) 1 342 828 2020-04-25 Corpus of Free Trade Agreements (English/Spanish) CLARIN_ACA
Forskning.no Norwegian Bokmål (nob) 10 084 683 2020-04-25 Data set containing texts from the popular science website forskning.no from the period 1998 - 2012. CLARIN_RES-DEP
Forskning.no (2017) Norwegian Bokmål (nob) 21 467 688 2020-04-25 Data set containing texts from the popular science website forskning.no from the period 1998 - 2017. CLARIN_RES-DEP
ICAME – ACE English (eng) 1 152 533 2020-04-25 ACE is the first systematically compiled heterogeneous corpus in Australia, designed to support a variety of linguistic research. CLARIN_ACA
ICAME – BROWN Family English (eng) 6 897 518 2020-04-25 This is a collection of Brown, LOB, Frown, FLOB, BLOB and BE06. The collection is made by UCREL, Lancaster. CLARIN_ACA
ICAME – CEECS English (eng) 514 224 2020-04-25 The Corpus of Early English Correspondence (CEEC) has been compiled for the study of social variables in the history of English. CLARIN_ACA
ICAME – COLT English (eng) 444 166 2020-04-25 COLT is a corpus of London Teenage Language with audio recordings. CLARIN_RES-PLAN-INF-PRIV-ND-DEP-*
ICAME – FLOB English (eng) 1 133 503 2020-04-25 The Freiburg - LOB Corpus of British English (FLOB) contains texts from 1991. CLARIN_ACA
ICAME – FROWN English (eng) 1 145 190 2020-04-25 The Freiburg - Brown Corpus of American English (Frown) contains texts from 1991. CLARIN_ACA
ICAME – HC English (eng) 1 851 007 2020-04-25 The Helsinki Corpus of English Texts is a structured multi-genre diachronic corpus. CLARIN_ACA
ICAME – HCOS Scots (sco) 959 977 2020-04-26 The Helsinki Corpus of Older Scots was compiled as a supplement to the diachronic part of the Helsinki Corpus of English Texts. CLARIN_ACA
ICAME – LLC English (eng) 574 340 2020-04-25 The London-Lund Corpus contains samples of educated spoken British English, in orthographic transcription with detailed prosodic marking. CLARIN_ACA
ICAME – LOB English (eng) 1 156 902 2020-04-25 The Lancaster - Oslo/Bergen (LOB) Corpus is a million-word collection of present-day (1961) British English texts. CLARIN_ACA
Jos1M Slovenian (slv) 1 182 946 2020-04-25 Project JOS: Linguistic Annotation of Slovene (http://nl.ijs.si/jos/index-sl.html) CC-BY-NC
KIAP English (eng) 3 900 925 2020-04-25 KIAP is a corpus of 450 research articles covering three disciplines (economics, linguistics and medicine) and three languages (English, French and Norwegian). CC-BY
Leksikografisk bokmålskorpus Norwegian Bokmål (nob) 102 286 906 2020-04-25 Bokmål texts from 1985 to the present. CLARIN_RES-PLAN-BY-NC-LOC-PRIV-NORED-ND-*
NAOB-tekster Norwegian Bokmål (nob) 229 151 252 2020-04-25 This corpus contains Norwegian (Bokmål) texts used for building the NAOB treebank. CLARIN_RES-INF-LOC-PRIV-DEP
NBs frie tekster (Bokmål) Norwegian Bokmål (nob) 516 392 689 2020-04-25 This corpus contains Norwegian (Bokmål) texts from NBDigital (the National Library’s collection of OCRed books) that are free of copyright and have an OCR confidence higher than 0.9. CC-ZERO
NBs frie tekster (Nynorsk) Norwegian Nynorsk (nno) 46 017 287 2020-04-25 This corpus contains Norwegian (Nynorsk) texts from NBDigital (the National Library’s collection of OCRed books) that are free of copyright and have an OCR confidence higher than 0.9. CC-ZERO
NSPC/Nor Norwegian (nor) 2 293 213 2020-04-25 The NSPC is a parallel, unidirectional translation corpus of contemporary Norwegian written texts translated into Spanish, published between 2000 and 2009. It contains fiction and non-fiction, and each text is classified according to genre, the author's gender and the gender and mother tongue of the translator. CLARIN_ACA
NSPC/Spa Spanish (spa) 2 465 968 2020-04-25 The NSPC is a parallel, unidirectional translation corpus of contemporary Norwegian written texts translated into Spanish, published between 2000 and 2009. It contains fiction and non-fiction, and each text is classified according to genre, the author's gender and the gender and mother tongue of the translator. CLARIN_ACA
NTAP English (eng) 660 798 199 2020-04-25 The English NTAP blog corpus comprises 1.5 million English-language blog posts from around 3000 blogs related to climate change issues across science, politics and environment.
NTAP (French) French (fra) 1 506 074 082 2020-04-26 The French NTAP blog corpus comprises 2.3 million French-language blog posts from around 2000 blogs related to climate change issues across science, politics and environment.
Nynorsk-korpus Norwegian Nynorsk (nno) 107 803 034 2020-04-26 Norsk Ordbok's Nynorsk corpus is an electronic collection of Nynorsk texts and currently comprises about 102 million words. CLARIN_RES-NC-DEP
PubBEPC (eng) English (eng) 448 717 2020-04-26 PubBEPC is a Bokmål-English sentence-aligned parallel corpus built from the public web sites www.nav.no, www.nyinorge.no and skatteetaten.no. CC-BY
PubBEPC (nob) Norwegian Bokmål (nob) 359 401 2020-04-26 PubBEPC is a Bokmål-English sentence-aligned parallel corpus built from the public web sites www.nav.no, www.nyinorge.no and skatteetaten.no. CC-BY
PubNEPC (eng) English (eng) 353 838 2020-04-26 PubNEPC is a Nynorsk-English sentence-aligned parallel corpus built from the public web sites www.nav.no and skatteetaten.no. CC-BY
PubNEPC (nno) Norwegian Nynorsk (nno) 289 723 2020-04-26 PubNEPC is a Nynorsk-English sentence-aligned parallel corpus built from the public web sites www.nav.no and skatteetaten.no. CC-BY
Storting debates Norwegian Bokmål (nob) 28 533 334 2020-04-26 The corpus "Proceedings of Norwegian parliamentary debates (2008-2015)" is a collection of transcriptions of Norwegian parliamentary debates between 2008 and 2015, downloaded from https://data.stortinget.no/. NLOD
TRIS/de-at German (deu) 841 861 2020-04-26 The TRIS corpus (Parallel Corpus of documents from the Technical Regulations Information System for German-Spanish) is a specialized parallel corpus of Spanish-German texts (ES-ES, DE-AT and DE-DE) from the European Commission between 1997 and 2010. CC-BY-NC-SA
TRIS/es-es Spanish (spa) 1 209 985 2020-04-26 The TRIS corpus (Parallel Corpus of documents from the Technical Regulations Information System for German-Spanish) is a specialized parallel corpus of Spanish-German texts (ES-ES, DE-AT and DE-DE) from the European Commission between 1997 and 2010. CC-BY-NC-SA
Talk Of Norway Norwegian (nor) 63 803 594 2020-04-26 Corpus based on the Talk of Norway (TON) dataset v. 1.0, a collection of Norwegian parliament speeches from the 1998-1999 to 2015-2016 sessions. NLOD

Design & implementation: Paul Meurer, Universitetet i Bergen, CLARINO Centre, 2020