Corpuscle :: Corpus list
Select corpora by language or collection:
Languages: All · Abkhazian (1) · Bulgarian (1) · English (16) · Faroese (1) · French (1) · Georgian (7) · German (1) · Mingrelian (1) · Norwegian (3) · Norwegian Bokmål (12) · Norwegian Nynorsk (7) · Old Georgian (1) · Old Norse (2) · Scots (1) · Slovenian (1) · Spanish (4) · Svan (1)
Collections: All · ASK (4) · AbNC (1) · Aviskorpus (3) · GNC (10) · ICAME (10) · Menota (2) · PubNoEnPC (4) · Talebanken (4)
Choose a corpus from the list below. Some corpora are only available when you have signed in.
Corpus Language(s) Size (words & punctuation) Updated Description License
Menota Old Norse (non) 1 618 028 2017-06-02 Menota is an archive of Medieval Nordic Texts. CC-BY-SA
GNC Old Georgian Old Georgian (oge) 6 062 122 2019-01-28 The Georgian National Corpus – Old Georgian CC-BY-NC
Dialektendring Norwegian Nynorsk (nno) 3 916 253 2019-04-01 Transcribed and annotated dialect recordings CLARIN_RES-PRIV
ASK Hovedkorpus Norwegian Bokmål (nob) 768 043 2019-01-24 ASK is an electronic, searchable text corpus of Norwegian as a second language, with links between linguistic data and personal data. CLARIN_RES-PRIV
GNC Middle Georgian Georgian (kat) 1 432 262 2019-01-28 The Georgian National Corpus – Middle Georgian CC-BY-NC
ASK Korrektkorpus Norwegian Bokmål (nob) 785 451 2017-06-02 ASK is an electronic, searchable text corpus of Norwegian as a second language, with links between linguistic data and personal data, established by the Norwegian Second Language Corpus project. CLARIN_RES-PRIV
Industristad Norwegian Nynorsk (nno) 1 775 521 2017-06-02 Transcribed and annotated dialect recordings CLARIN_RES-PRIV
Menota-test Old Norse (non) 402 262 2019-05-29 This is a test version of Menota, used to test new features and stylesheets. CC-BY-SA
Talesøk Norwegian Nynorsk (nno) 1 560 968 2017-06-02 Transcribed and annotated dialect recordings CLARIN_RES-PRIV
ASK Tillegg Norwegian Bokmål (nob) 44 529 2017-06-02 Supplemental texts for ASK CLARIN_RES-PRIV
GNC Modern Georgian Georgian (kat) 451 584 2017-06-02 The Georgian National Corpus – Modern Georgian CC-BY-NC
GRC Georgian (kat) 182 883 787 2018-02-24 Georgian Reference Corpus unspecified
Føroyskur talumálsbanki Faroese (fao) 471 178 2019-01-28 Transcribed and annotated dialect recordings CLARIN_RES-PRIV
GDC Georgian (kat) 1 694 362 2017-06-02 Georgian dialect corpus
SSGG Georgian (kat) 152 708 2017-06-02 SSGG – The sociolinguistic situation of present-day Georgia CLARIN_ACA-NC-LOC-PRIV-ND-*
GNC Political texts Georgian (kat) 743 183 2019-01-28 Georgian National Corpus, Modern Georgian
AbNC Abkhazian (abk) 10 372 902 2019-03-21 The Abkhaz National Corpus is a comprehensive and open, grammatically annotated text corpus. CLARIN_PUB-BY-NC-ND
GNC Law texts Georgian (kat) 1 495 989 2019-04-15 Georgian National Corpus, Old and Middle Georgian, Law texts
GNC Megrelian Mingrelian (xmf) 89 404 2017-06-02 The Georgian National Corpus – Mingrelian CC-BY-NC
GNC Svan Svan (sva) 473 180 2017-06-02 The Georgian National Corpus – Svan CC-BY-NC
ASK Hovedk./2015 Norwegian Bokmål (nob) 36 142 2017-06-02 Andrespråkskorpus: a Norwegian learners’ corpus. 2015 addition CLARIN_RES-PRIV
Avis/INESS Norwegian (nor) 202 360 007 2017-06-02 Years 2012 and 2013 of Aviskorpus
Aviskorpus (Bokmål) Norwegian Bokmål (nob) 1 880 624 638 2019-06-13 The Norwegian Newspaper Corpus (NNC) Bokmål version is a large monitor corpus representing contemporary Norwegian language in the written variety Norwegian Bokmål. CC-BY
Aviskorpus (Nynorsk) Norwegian Nynorsk (nno) 19 601 405 2019-06-13 The Norwegian Newspaper Corpus (NNC) Nynorsk is a large monitor corpus representing contemporary Norwegian language in the written variety Norwegian Nynorsk. CC-BY
Aviskorpus ann. Norwegian Bokmål (nob) 28 969 124 2017-06-02 This is a subpart of Norsk aviskorpus, grammatically annotated and classified. It comprises 35 692 210 tokens and covers Norwegian bokmål in the time span 2001-2009. CC-BY
BulTreeBank Bulgarian (bul) 229 732 2017-06-02 This distribution represents only the morphological information encoded in BulTreeBank - HPSG-based Treebank of Bulgarian. It contains about 214000 tokens. It was used for the training of the TreeTagger for Bulgarian. MS-NC-NoReD
COLA Spanish (spa) 479 474 2017-06-02 COLA (Corpus Oral de Lenguaje Adolescente Resource) is a corpus of recorded, spontaneous speech among teenagers from different schools and youth clubs in Madrid, Buenos Aires and Santiago de Chile. It was created for the purpose of studying teenage language in Spanish. CLARIN_RES-PLAN-INF-PRIV-ND-DEP-*
Child Rights English (eng) 102 967 2017-04-24 This set of documents combines reports to the UN Committee on the Rights of the Child from civil society organizations (CSOs), which include NGOs, NHRIs and Ombudspersons, from Finland, Norway and Spain. The documents also include the Committee on the Rights of the Child's Concluding Observations. All the documents were collected from the previous reporting rounds for Finland (2008), Norway (2009) and Spain (2010). CC-BY
Coryl English (eng) 129 421 2017-06-02 Coryl is a young learner corpus, which consists of English texts written by Norwegian pupils. CC-BY
FTA/Eng English (eng) 1 270 362 2017-06-02 Corpus of Free Trade Agreements (English/Spanish) CLARIN_ACA
FTA/Spa Spanish (spa) 1 342 828 2017-06-02 Corpus of Free Trade Agreements (English/Spanish) CLARIN_ACA
Forskning.no Norwegian Bokmål (nob) 10 084 683 2017-06-02 Data set containing texts from the popular science website forskning.no from the period 1998 - 2012. CLARIN_RES-DEP
Forskning.no (2017) Norwegian Bokmål (nob) 21 467 688 2019-01-26 Data set containing texts from the popular science website forskning.no from the period 1998 - 2017. CLARIN_RES-DEP
ICAME – ACE English (eng) 1 152 533 2017-06-02 ACE is the first systematically compiled heterogeneous corpus in Australia, designed to support a variety of linguistic research. CLARIN_ACA
ICAME – BROWN Family English (eng) 6 897 518 2017-06-02 This is a collection of Brown, LOB, Frown, FLOB, BLOB and BE06. The collection is made by UCREL, Lancaster. CLARIN_ACA
ICAME – CEECS English (eng) 514 224 2017-06-02 The Corpus of Early English Correspondence (CEEC) has been compiled for the study of social variables in the history of English. CLARIN_ACA
ICAME – COLT English (eng) 444 166 2017-06-02 COLT is a corpus of London Teenage Language with audio recordings. CLARIN_RES-PLAN-INF-PRIV-ND-DEP-*
ICAME – FLOB English (eng) 1 133 503 2017-06-02 The Freiburg - LOB Corpus of British English (FLOB) contains texts from 1991. CLARIN_ACA
ICAME – FROWN English (eng) 1 145 190 2017-06-02 The Freiburg - Brown Corpus of American English (Frown) contains texts from 1991. CLARIN_ACA
ICAME – HC English (eng) 1 851 007 2017-06-02 The Helsinki Corpus of English Texts is a structured multi-genre diachronic corpus. CLARIN_ACA
ICAME – HCOS Scots (sco) 959 977 2017-06-02 The Helsinki Corpus of Older Scots was compiled as a supplement to the diachronic part of the Helsinki Corpus of English Texts. CLARIN_ACA
ICAME – LLC English (eng) 574 340 2017-06-02 The London-Lund Corpus contains samples of educated spoken British English, in orthographic transcription with detailed prosodic marking. CLARIN_ACA
ICAME – LOB English (eng) 1 156 902 2017-06-02 The Lancaster - Oslo/Bergen (LOB) Corpus is a million-word collection of present-day (1961) British English texts. CLARIN_ACA
Jos1M Slovenian (slv) 1 182 946 2017-06-02 Project JOS: Linguistic Annotation of Slovene (http://nl.ijs.si/jos/index-sl.html) CC-BY-NC
KIAP English (eng) 3 900 925 2017-06-02 KIAP is a corpus of 450 research articles covering three disciplines (economics, linguistics and medicine) and three languages (English, French and Norwegian). CC-BY
Leksikalsk bokmålskorpus Norwegian Bokmål (nob) 102 286 906 2019-01-28 Leksikalsk bokmålskorpus CLARIN_RES
NBs frie tekster (Bokmål) Norwegian Bokmål (nob) 516 392 689 2017-06-02 This corpus contains Norwegian (Bokmål) texts from NBDigital (the National Library’s collection of OCRed books) that are free of copyright and have an OCR confidence higher than 0.9. CC-ZERO
NBs frie tekster (Nynorsk) Norwegian Nynorsk (nno) 46 017 287 2017-06-02 This corpus contains Norwegian (Nynorsk) texts from NBDigital (the National Library’s collection of OCRed books) that are free of copyright and have an OCR confidence higher than 0.9. CC-ZERO
NSPC/Nor Norwegian (nor) 2 293 213 2017-06-02 The NSPC is a parallel, unidirectional translation corpus of contemporary Norwegian written texts translated into Spanish, published between 2000 and 2009. It contains fiction and non-fiction, and each text is classified according to genre, the author's gender and the gender and mother tongue of the translator. CLARIN_ACA
NSPC/Spa Spanish (spa) 2 465 968 2017-06-02 The NSPC is a parallel, unidirectional translation corpus of contemporary Norwegian written texts translated into Spanish, published between 2000 and 2009. It contains fiction and non-fiction, and each text is classified according to genre, the author's gender and the gender and mother tongue of the translator. CLARIN_ACA
NTAP English (eng) 660 798 199 2017-06-02 The English NTAP blog corpus comprises 1.5 million English-language blog posts from around 3000 blogs related to climate change issues across science, politics and environment.
NTAP (French) French (fra) 1 506 074 082 2017-06-02 The French NTAP blog corpus comprises 2.3 million French-language blog posts from around 2000 blogs related to climate change issues across science, politics and environment.
Nynorsk-korpus Norwegian Nynorsk (nno) 107 803 034 2018-10-17 Nynorsk-korpuset CLARIN_RES
PubBEPC (eng) English (eng) 448 717 2018-03-22 PubBEPC is a Bokmål-English sentence-aligned parallel corpus built from the public web sites www.nav.no, www.nyinorge.no and skatteetaten.no. CC-BY
PubBEPC (nob) Norwegian Bokmål (nob) 359 401 2018-03-22 PubBEPC is a Bokmål-English sentence-aligned parallel corpus built from the public web sites www.nav.no, www.nyinorge.no and skatteetaten.no. CC-BY
PubNEPC (eng) English (eng) 353 838 2018-04-02 PubNEPC is a Nynorsk-English sentence-aligned parallel corpus built from the public web sites www.nav.no and skatteetaten.no. CC-BY
PubNEPC (nno) Norwegian Nynorsk (nno) 289 723 2018-04-02 PubNEPC is a Nynorsk-English sentence-aligned parallel corpus built from the public web sites www.nav.no and skatteetaten.no. CC-BY
Storting debates Norwegian Bokmål (nob) 28 533 334 2017-06-02 The corpus "Proceedings of Norwegian parliamentary debates (2008-2015)" is a collection of transcriptions of Norwegian parliamentary debates between 2008 and 2015, downloaded from https://data.stortinget.no/. NLOD
TRIS/de-at German (deu) 841 861 2017-06-02 The corpus TRIS (Parallel Corpus of documents from the Technical Regulations Information System for German-Spanish) is a specialized parallel corpus with Spanish-German (ES-ES, DE-AT and DE-DE) texts from the European Commission between 1997 and 2010. CC-BY-NC-SA
TRIS/es-es Spanish (spa) 1 209 985 2017-06-02 The corpus TRIS (Parallel Corpus of documents from the Technical Regulations Information System for German-Spanish) is a specialized parallel corpus with Spanish-German (ES-ES, DE-AT and DE-DE) texts from the European Commission between 1997 and 2010. CC-BY-NC-SA
Talk Of Norway Norwegian (nor) 63 803 594 2019-01-26 Corpus based on the Talk of Norway (TON) dataset v. 1.0, a collection of Norwegian parliament speeches from the 1998-1999 to 2015-2016 sessions. NLOD

Design & implementation: Paul Meurer, Universitetet i Bergen, CLARINO Centre, 2019