Corpuscle :: Corpus list
Select corpora by language or collection:
Languages: All · Abkhazian (3) · Ancient Greek (to 1453) (1) · Arabic (1) · Armenian (1) · Belarusian (1) · Bulgarian (1) · English (19) · Faroese (1) · French (3) · Georgian (8) · German (4) · Italian (1) · Kirghiz (1) · Lithuanian (1) · Middle Georgian (1) · Mingrelian (1) · Norwegian (3) · Norwegian Bokmål (19) · Norwegian Nynorsk (11) · Old Georgian (1) · Old Norse (4) · Scots (1) · Slovenian (1) · Spanish (4) · Svan (2) · Tigrinya (1)
Collections: All · ASK (5) · AbNC (1) · Aviskorpus (4) · GNC (11) · ICAME (10) · MediaKorpus (4) · Menota (5) · PubNoEnPC (4) · Talebanken (4)
Choose a corpus from the list below. Some corpora are only available when you have signed in.
Corpus Language(s) Size (words & punctuation) Updated Description License
Menota (trans) Norwegian Bokmål (nob) 132 427 2024-10-27 Menota Archive
Dialektendring Norwegian Nynorsk (nno) 4 690 076 2023-12-18 Transcribed and annotated dialect recordings CLARIN_RES-PRIV
GNC Old Georgian Old Georgian (oge) 7 101 021 2022-02-14 The Georgian National Corpus – Old Georgian CC-BY-NC
ASK Hovedkorpus Norwegian Bokmål (nob) 768 043 2020-04-25 ASK is an electronic, searchable text corpus of Norwegian as a second language, with links between linguistic data and personal data. CLARIN_RES-PRIV
Menota Old Norse (non) 2 252 129 2024-11-08 Menota is an archive of Medieval Nordic Texts. CC-BY-SA
Industristad Norwegian Nynorsk (nno) 1 775 521 2020-11-01 Transcribed and annotated dialect recordings CLARIN_RES-PRIV
GNC Middle Georgian Georgian (kat) 1 432 305 2022-02-14 The Georgian National Corpus – Middle Georgian CC-BY-NC
Menota-rune Old Norse (non) 1 894 2023-11-15 Menota Archive, runic inscriptions
Menota-test Old Norse (non) 239 379 2024-11-06 This is a test version of Menota, used to test new features and stylesheets. CC-BY-SA
ASK Korrektkorpus Norwegian Bokmål (nob) 785 451 2020-04-25 ASK is an electronic, searchable text corpus of Norwegian as a second language, with links between linguistic data and personal data, established by the Norwegian Second Language Corpus project. CLARIN_RES-PRIV
Menota-diploma Old Norse (non) 59 741 2024-07-08 Menota Archive, diploma
Talesøk Norwegian Nynorsk (nno) 1 560 968 2020-11-01 Transcribed and annotated dialect recordings CLARIN_RES-PRIV
GNC Modern Georgian Georgian (kat) 1 993 022 2022-02-20 The Georgian National Corpus – Modern Georgian CC-BY-NC
ASK Tillegg Norwegian Bokmål (nob) 44 529 2020-04-25 Supplemental texts for ASK CLARIN_RES-PRIV
Føroyskur talumálsbanki Faroese (fao) 471 178 2020-04-25 Transcribed and annotated dialect recordings CLARIN_RES-PRIV
ASK Tillegg AG Norwegian Bokmål (nob) 94 729 2020-04-25 Supplemental texts for ASK
GRC Georgian (kat) 202 338 621 2021-07-06 Georgian Reference Corpus unspecified
GDC Georgian (kat) 1 694 362 2020-04-25 Georgian dialect corpus
SSGG Georgian (kat) 152 708 2020-04-26 SSGG – The sociolinguistic situation of present-day Georgia CLARIN_ACA-NC-LOC-PRIV-ND-*
AbNC Abkhazian (abk) 10 633 322 2023-09-30 The Abkhaz National Corpus is a comprehensive and open, grammatically annotated text corpus. CLARIN_PUB
GNC Political texts Georgian (kat) 1 436 075 2020-04-25 Georgian National Corpus, Political texts
GNC Law texts Georgian (kat) 1 541 946 2024-11-11 Georgian National Corpus, Old and Middle Georgian, Law texts
GNC Megrelian Mingrelian (xmf) 89 404 2020-04-25 The Georgian National Corpus – Mingrelian CC-BY-NC
GNC Svan Svan (sva) 473 180 2020-04-25 The Georgian National Corpus – Svan CC-BY-NC
GNC Wikipedia Georgian (kat) 20 974 721 2020-04-26 Georgian Wikipedia
ASK Hovedk./2015 Norwegian Bokmål (nob) 36 142 2020-04-25 Andrespråkskorpus: a Norwegian learners’ corpus. 2015 addition CLARIN_RES-PRIV
Amedia (Bokmål) Part 1 Norwegian Bokmål (nob) 822 769 489 2023-12-23 CLARIN_RES
Amedia (Bokmål) Part 2 Norwegian Bokmål (nob) 1 211 144 789 2023-12-22 CLARIN_RES
Amedia (Nynorsk) Norwegian Nynorsk (nno) 134 983 629 2023-12-23 CLARIN_RES
Avis/INESS Norwegian (nor) 202 360 007 2020-04-25 Years 2012 and 2013 of Aviskorpus
Aviskorpus (Bokmål) Norwegian Bokmål (nob) 2 117 226 336 2022-12-02 The Norwegian Newspaper Corpus (NNC) Bokmål version is a large monitor corpus representing contemporary Norwegian language in the written variety Norwegian Bokmål. CC-BY-NC
Aviskorpus (Nynorsk) Norwegian Nynorsk (nno) 21 010 885 2020-06-01 The Norwegian Newspaper Corpus (NNC) Nynorsk is a large monitor corpus representing contemporary Norwegian language in the written variety Norwegian Nynorsk. CC-BY
Aviskorpus ann. Norwegian Bokmål (nob) 28 969 124 2020-04-26 This is a subpart of Norsk aviskorpus, grammatically annotated and classified. It comprises 35 692 210 tokens and covers Norwegian bokmål in the time span 2001-2009. CC-BY-NC
BulTreeBank Bulgarian (bul) 229 732 2020-04-25 This distribution represents only the morphological information encoded in BulTreeBank - HPSG-based Treebank of Bulgarian. It contains about 214000 tokens. It was used for the training of the TreeTagger for Bulgarian. MS-NC-NoReD
COLA Spanish (spa) 479 474 2020-04-25 COLA (Corpus Oral de Lenguaje Adolescente Resource) is a corpus of recorded, spontaneous speech among teenagers from different schools and youth clubs in Madrid, Buenos Aires and Santiago de Chile. It is created for the purpose of studying teenage language in Spanish. CLARIN_ACA-NC-LOC-PRIV-ND-*
Child Rights English (eng) 102 967 2020-04-25 This set of documents combines reports to the UN Committee on the Rights of the Child from civil society organizations (CSOs), which include NGOs, NHRIs and Ombudspersons from Finland, Norway and Spain. The documents also include the Committee on the Rights of the Child's Concluding Observations. All the documents are collected from the previous reporting rounds for Finland (2008), Norway (2009) and Spain (2010). CC-BY
Coryl English (eng) 336 656 2021-03-09 Coryl is a young learner corpus, which consists of English texts written by Norwegian pupils. CC-BY
FTA/Eng English (eng) 1 270 362 2020-04-25 Corpus of Free Trade Agreements (English/Spanish) CLARIN_ACA
FTA/Spa Spanish (spa) 1 342 828 2020-04-25 Corpus of Free Trade Agreements (English/Spanish) CLARIN_ACA
Forskning.no Norwegian Bokmål (nob) 10 084 683 2020-04-25 Data set containing texts from the popular science website forskning.no from the period 1998 - 2012. CLARIN_RES-DEP
Forskning.no (2017) Norwegian Bokmål (nob) 21 467 688 2020-04-25 Data set containing texts from the popular science website forskning.no from the period 1998 - 2017. CLARIN_RES-DEP
Homer Ancient Greek (to 1453) (grc) 229 551 2023-11-12 Odyssey and Iliad
ICAME – ACE English (eng) 1 152 533 2020-04-25 ACE is the first systematically compiled heterogeneous corpus in Australia, designed to support a variety of linguistic research. CLARIN_ACA
ICAME – BROWN Family English (eng) 6 897 518 2020-04-25 This is a collection of Brown, LOB, Frown, FLOB, BLOB and BE06. The collection is made by UCREL, Lancaster. CLARIN_ACA
ICAME – CEECS English (eng) 514 224 2020-04-25 The Corpus of Early English Correspondence (CEEC) has been compiled for the study of social variables in the history of English. CLARIN_ACA
ICAME – COLT English (eng) 444 166 2020-04-25 COLT is a corpus of London Teenage Language with audio recordings. CLARIN_ACA-NC-LOC-PRIV-ND-*
ICAME – FLOB English (eng) 1 133 503 2020-04-25 The Freiburg - LOB Corpus of British English (FLOB) contains texts from 1991. CLARIN_ACA
ICAME – FROWN English (eng) 1 145 190 2020-04-25 The Freiburg - Brown Corpus of American English (Frown) contains texts from 1991. CLARIN_ACA
ICAME – HC English (eng) 1 851 007 2020-04-25 The Helsinki Corpus of English Texts is a structured multi-genre diachronic corpus. CLARIN_ACA
ICAME – HCOS Scots (sco) 959 977 2020-04-26 The Helsinki Corpus of Older Scots was compiled as a supplement to the diachronic part of the Helsinki Corpus of English Texts. CLARIN_ACA
ICAME – LLC English (eng) 574 340 2020-04-25 The London-Lund Corpus contains samples of educated spoken British English, in orthographic transcription with detailed prosodic marking. CLARIN_ACA
ICAME – LOB English (eng) 1 156 902 2020-04-25 The Lancaster - Oslo/Bergen (LOB) Corpus is a million-word collection of present-day (1961) British English texts. CLARIN_ACA
ICNALE English (eng) 1 361 193 2022-03-04 ICNALE (International Corpus Network of Asian Learners of English), Written Essays section, tagged CLARIN_RES
Jos1M Slovenian (slv) 1 182 946 2020-04-25 Project JOS: Linguistic Annotation of Slovene (http://nl.ijs.si/jos/index-sl.html) CC-BY-NC
KIAP English (eng) 3 900 925 2020-04-25 KIAP is a corpus of 450 research articles covering three disciplines (economics, linguistics and medicine) and three languages (English, French and Norwegian). CC-BY
KRLE Norwegian Bokmål (nob) 74 398 2021-06-08 The KRLE corpus belongs to the research group Reading and Writing Didactics at Western Norway University of Applied Sciences and was collected during a research project named “Learning by Writing in Religious Education (RE)”. CLARIN_RES-PRIV
NAOB-tekster Norwegian Bokmål (nob) 229 151 252 2020-04-25 This corpus contains Norwegian (Bokmål) texts used for building the NAOB treebank. CLARIN_RES-INF-LOC-PRIV-DEP
NBs frie tekster (Bokmål) Norwegian Bokmål (nob) 516 392 689 2020-04-25 This corpus contains Norwegian (Bokmål) texts from NBDigital (the National Library’s collection of OCRed books) that are free of copyright and have an OCR confidence higher than 0.9. CC-ZERO
NBs frie tekster (Nynorsk) Norwegian Nynorsk (nno) 46 017 287 2020-04-25 This corpus contains Norwegian (Nynorsk) texts from NBDigital (the National Library’s collection of OCRed books) that are free of copyright and have an OCR confidence higher than 0.9. CC-ZERO
NSPC/Nor Norwegian (nor) 2 293 213 2020-04-25 The NSPC is a parallel, unidirectional translation corpus of contemporary Norwegian written texts translated into Spanish, published between 2000 and 2009. It contains fiction and non-fiction, and each text is classified according to genre, the author's gender and the gender and mother tongue of the translator. CLARIN_ACA
NSPC/Spa Spanish (spa) 2 465 968 2020-04-25 The NSPC is a parallel, unidirectional translation corpus of contemporary Norwegian written texts translated into Spanish, published between 2000 and 2009. It contains fiction and non-fiction, and each text is classified according to genre, the author's gender and the gender and mother tongue of the translator. CLARIN_ACA
NTAP English (eng) 660 798 199 2020-04-25 The English NTAP blog corpus comprises 1.5 million English-language blog posts from around 3000 blogs related to climate change issues across science, politics and environment.
NTAP (French) French (fra) 1 506 074 082 2020-04-26 The French NTAP blog corpus comprises 2.3 million French-language blog posts from around 2000 blogs related to climate change issues across science, politics and environment.
Nagaoka Tigrinya Corpus Tigrinya (tir) 72 469 2024-02-29 The Nagaoka Tigrinya corpus is the first publicly available part-of-speech (PoS) tagged corpus of Tigrinya language.
Naturen Norwegian Bokmål (nob) 2 489 279 2023-12-07 CLARIN_ACA
Nynorsk-korpus Norwegian Nynorsk (nno) 107 273 268 2021-02-24 Nynorsk-korpuset CLARIN_RES
Nynorsk-korpus (2017) Norwegian Nynorsk (nno) 3 447 527 2021-02-12 Nynorsk-korpuset CLARIN_RES
Nynorsk-korpus (2023) Norwegian Nynorsk (nno) 9 927 460 2023-11-15 Nynorsk-korpuset CLARIN_RES
Nynorsk-korpus (Allkunne) Norwegian Nynorsk (nno) 1 605 648 2022-12-06 Nynorsk-korpuset, Allkunne-del CLARIN_RES
PubBEPC (eng) English (eng) 448 717 2020-04-26 PubBEPC is a Bokmål-English sentence-aligned parallel corpus built from the public web sites www.nav.no, www.nyinorge.no and skatteetaten.no. CC-BY
PubBEPC (nob) Norwegian Bokmål (nob) 359 401 2020-04-26 PubBEPC is a Bokmål-English sentence-aligned parallel corpus built from the public web sites www.nav.no, www.nyinorge.no and skatteetaten.no. CC-BY
PubNEPC (eng) English (eng) 353 838 2020-04-26 PubNEPC is a Nynorsk-English sentence-aligned parallel corpus built from the public web sites www.nav.no and skatteetaten.no. CC-BY
PubNEPC (nno) Norwegian Nynorsk (nno) 289 723 2020-04-26 PubNEPC is a Nynorsk-English sentence-aligned parallel corpus built from the public web sites www.nav.no and skatteetaten.no. CC-BY
Rustaveli/abk-1 Abkhazian (abk) 60 502 2023-04-06 GNC Rustaveli Parallel Corpus, Abkhaz translation 1
Rustaveli/abk-2 Abkhazian (abk) 66 362 2023-04-06 GNC Rustaveli Parallel Corpus, Abkhaz translation 2
Rustaveli/ara Arabic (ara) 63 744 2022-07-20 GNC Rustaveli Parallel Corpus, Arabic translation
Rustaveli/bel Belarusian (bel) 60 920 2022-07-20 GNC Rustaveli Parallel Corpus, Belarusian part
Rustaveli/deu-1 German (deu) 84 063 2022-07-20 GNC Rustaveli Parallel Corpus, German translation 1
Rustaveli/deu-2 German (deu) 82 905 2022-07-20 GNC Rustaveli Parallel Corpus, German translation 2
Rustaveli/deu-3 German (deu) 79 044 2022-07-20 GNC Rustaveli Parallel Corpus, German translation 3
Rustaveli/eng-1 English (eng) 92 100 2022-07-20 GNC Rustaveli Parallel Corpus, English translation 1
Rustaveli/eng-2 English (eng) 97 542 2022-07-20 GNC Rustaveli Parallel Corpus, English translation 2
Rustaveli/fra-1 French (fra) 78 384 2022-07-20 GNC Rustaveli Parallel Corpus, French translation 1
Rustaveli/fra-2 French (fra) 83 863 2022-07-20 GNC Rustaveli Parallel Corpus, French translation 2
Rustaveli/hye Armenian (hye) 68 205 2022-07-20 GNC Rustaveli Parallel Corpus, Armenian part
Rustaveli/ita Italian (ita) 1 188 2022-07-20 GNC Rustaveli Parallel Corpus, Italian translation
Rustaveli/kir Kirghiz (kir) 61 820 2022-07-20 GNC Rustaveli Parallel Corpus, Kyrgyz translation
Rustaveli/lit Lithuanian (lit) 46 633 2022-07-20 GNC Rustaveli Parallel Corpus, Lithuanian part
Rustaveli/mge Middle Georgian (mge) 64 305 2022-07-20 GNC Rustaveli Parallel Corpus, Georgian part
Rustaveli/sva Svan (sva) 49 212 2022-07-20 GNC Rustaveli Parallel Corpus, Svan translation
Storting debates Norwegian Bokmål (nob) 28 533 334 2020-04-26 The corpus "Proceedings of Norwegian parliamentary debates (2008-2015)" is a collection of transcriptions of Norwegian parliamentary debates between 2008 and 2015, downloaded from https://data.stortinget.no/. NLOD
TRIS/de-at German (deu) 841 861 2020-04-26 The corpus TRIS (Parallel Corpus of documents from the Technical Regulations Information System for German-Spanish) is a specialized parallel corpus of Spanish-German texts (ES-ES, DE-AT and DE-DE) from the European Commission, dated between 1997 and 2010. CC-BY-NC-SA
TRIS/es-es Spanish (spa) 1 209 985 2020-04-26 The corpus TRIS (Parallel Corpus of documents from the Technical Regulations Information System for German-Spanish) is a specialized parallel corpus of Spanish-German texts (ES-ES, DE-AT and DE-DE) from the European Commission, dated between 1997 and 2010. CC-BY-NC-SA
TV 2 Norwegian Bokmål (nob) 196 930 955 2024-02-23 CLARIN_RES
Talk Of Norway Norwegian (nor) 63 803 594 2020-04-26 Corpus based on the Talk of Norway (TON) dataset v. 1.0, a collection of Norwegian parliament speeches from the 1998-1999 to 2015-2016 sessions. NLOD