Corpus |
Language(s) |
Size (words & punctuation) |
Updated |
Description |
License |
Menota (trans) |
Norwegian Bokmål (nob) |
132 427 |
2024-10-27 |
Menota Archive |
|
|
|
Dialektendring |
Norwegian Nynorsk (nno) |
4 690 076 |
2023-12-18 |
Transcribed and annotated dialect recordings |
|
|
CLARIN_RES-PRIV |
GNC Old Georgian |
Old Georgian (oge) |
7 101 021 |
2022-02-14 |
The Georgian National Corpus – Old Georgian |
|
|
CC-BY-NC |
ASK Hovedkorpus |
Norwegian Bokmål (nob) |
768 043 |
2020-04-25 |
ASK is an electronic, searchable text corpus of Norwegian as a second language, with links between linguistic data and personal data. |
|
|
CLARIN_RES-PRIV |
Menota |
Old Norse (non) |
2 252 129 |
2024-11-08 |
Menota is an archive of Medieval Nordic Texts. |
|
|
CC-BY-SA |
Industristad |
Norwegian Nynorsk (nno) |
1 775 521 |
2020-11-01 |
Transcribed and annotated dialect recordings |
|
|
CLARIN_RES-PRIV |
GNC Middle Georgian |
Georgian (kat) |
1 432 305 |
2022-02-14 |
The Georgian National Corpus – Middle Georgian |
|
|
CC-BY-NC |
Menota-rune |
Old Norse (non) |
1 894 |
2023-11-15 |
Menota Archive, runic inscriptions |
|
|
|
Menota-test |
Old Norse (non) |
239 379 |
2024-11-06 |
This is a test version of Menota, used to test new features and stylesheets. |
|
|
CC-BY-SA |
ASK Korrektkorpus |
Norwegian Bokmål (nob) |
785 451 |
2020-04-25 |
ASK is an electronic, searchable text corpus of Norwegian as a second language, with links between linguistic data and personal data, established by the Norwegian Second Language Corpus project. |
|
|
CLARIN_RES-PRIV |
Menota-diploma |
Old Norse (non) |
59 741 |
2024-07-08 |
Menota Archive, diploma |
|
|
|
Talesøk |
Norwegian Nynorsk (nno) |
1 560 968 |
2020-11-01 |
Transcribed and annotated dialect recordings |
|
|
CLARIN_RES-PRIV |
GNC Modern Georgian |
Georgian (kat) |
1 993 022 |
2022-02-20 |
The Georgian National Corpus – Modern Georgian |
|
|
CC-BY-NC |
ASK Tillegg |
Norwegian Bokmål (nob) |
44 529 |
2020-04-25 |
Supplemental texts for ASK |
|
|
CLARIN_RES-PRIV |
Føroyskur talumálsbanki |
Faroese (fao) |
471 178 |
2020-04-25 |
Transcribed and annotated dialect recordings |
|
|
CLARIN_RES-PRIV |
ASK Tillegg AG |
Norwegian Bokmål (nob) |
94 729 |
2020-04-25 |
Supplemental texts for ASK |
|
|
|
GRC |
Georgian (kat) |
202 338 621 |
2021-07-06 |
Georgian Reference Corpus |
|
|
unspecified |
GDC |
Georgian (kat) |
1 694 362 |
2020-04-25 |
Georgian dialect corpus |
|
|
|
SSGG |
Georgian (kat) |
152 708 |
2020-04-26 |
SSGG – The sociolinguistic situation of present-day Georgia |
|
|
CLARIN_ACA-NC-LOC-PRIV-ND-* |
AbNC |
Abkhazian (abk) |
10 633 322 |
2023-09-30 |
The Abkhaz National Corpus is a comprehensive and open, grammatically annotated text corpus. |
|
|
CLARIN_PUB |
GNC Political texts |
Georgian (kat) |
1 436 075 |
2020-04-25 |
Georgian National Corpus, Political texts |
|
|
|
GNC Law texts |
Georgian (kat) |
1 541 946 |
2024-11-11 |
Georgian National Corpus, Old and Middle Georgian, Law texts |
|
|
|
GNC Megrelian |
Mingrelian (xmf) |
89 404 |
2020-04-25 |
The Georgian National Corpus – Mingrelian |
|
|
CC-BY-NC |
GNC Svan |
Svan (sva) |
473 180 |
2020-04-25 |
The Georgian National Corpus – Svan |
|
|
CC-BY-NC |
GNC Wikipedia |
Georgian (kat) |
20 974 721 |
2020-04-26 |
Georgian Wikipedia |
|
|
|
ASK Hovedk./2015 |
Norwegian Bokmål (nob) |
36 142 |
2020-04-25 |
Andrespråkskorpus: a Norwegian learners’ corpus. 2015 addition |
|
|
CLARIN_RES-PRIV |
Amedia (Bokmål) Part 1 |
Norwegian Bokmål (nob) |
822 769 489 |
2023-12-23 |
|
|
|
CLARIN_RES |
Amedia (Bokmål) Part 2 |
Norwegian Bokmål (nob) |
1 211 144 789 |
2023-12-22 |
|
|
|
CLARIN_RES |
Amedia (Nynorsk) |
Norwegian Nynorsk (nno) |
134 983 629 |
2023-12-23 |
|
|
|
CLARIN_RES |
Avis/INESS |
Norwegian (nor) |
202 360 007 |
2020-04-25 |
Years 2012 and 2013 of Aviskorpus |
|
|
|
Aviskorpus (Bokmål) |
Norwegian Bokmål (nob) |
2 117 226 336 |
2022-12-02 |
The Norwegian Newspaper Corpus (NNC) Bokmål version is a large monitor corpus representing contemporary Norwegian language in the written variety Norwegian Bokmål. |
|
|
CC-BY-NC |
Aviskorpus (Nynorsk) |
Norwegian Nynorsk (nno) |
21 010 885 |
2020-06-01 |
The Norwegian Newspaper Corpus (NNC) Nynorsk is a large monitor corpus representing contemporary Norwegian language in the written variety Norwegian Nynorsk. |
|
|
CC-BY |
Aviskorpus ann. |
Norwegian Bokmål (nob) |
28 969 124 |
2020-04-26 |
This is a subpart of Norsk aviskorpus, grammatically annotated and classified. It comprises 35 692 210 tokens and covers Norwegian Bokmål from 2001 to 2009. |
|
|
CC-BY-NC |
BulTreeBank |
Bulgarian (bul) |
229 732 |
2020-04-25 |
This distribution represents only the morphological information encoded in BulTreeBank - HPSG-based Treebank of Bulgarian. It contains about 214000 tokens. It was used for the training of the TreeTagger for Bulgarian. |
|
|
MS-NC-NoReD |
COLA |
Spanish (spa) |
479 474 |
2020-04-25 |
COLA (Corpus Oral de Lenguaje Adolescente Resource) is a corpus of recorded, spontaneous speech among teenagers from different schools and youth clubs in Madrid, Buenos Aires and Santiago de Chile. It was created to study teenage language in Spanish. |
|
|
CLARIN_ACA-NC-LOC-PRIV-ND-* |
Child Rights |
English (eng) |
102 967 |
2020-04-25 |
This set of documents combines reports to the UN Committee on the Rights of the Child from civil society organizations (CSOs), including NGOs, NHRIs and Ombudspersons, from Finland, Norway and Spain. It also includes the Committee on the Rights of the Child's Concluding Observations. All the documents are collected from the previous reporting rounds for Finland (2008), Norway (2009) and Spain (2010). |
|
|
CC-BY |
Coryl |
English (eng) |
336 656 |
2021-03-09 |
Coryl is a young learner corpus, which consists of English texts written by Norwegian pupils. |
|
|
CC-BY |
FTA/Eng |
English (eng) |
1 270 362 |
2020-04-25 |
Corpus of Free Trade Agreements (English/Spanish) |
|
|
CLARIN_ACA |
FTA/Spa |
Spanish (spa) |
1 342 828 |
2020-04-25 |
Corpus of Free Trade Agreements (English/Spanish) |
|
|
CLARIN_ACA |
Forskning.no |
Norwegian Bokmål (nob) |
10 084 683 |
2020-04-25 |
Data set containing texts from the popular science website forskning.no from the period 1998 - 2012. |
|
|
CLARIN_RES-DEP |
Forskning.no (2017) |
Norwegian Bokmål (nob) |
21 467 688 |
2020-04-25 |
Data set containing texts from the popular science website forskning.no from the period 1998 - 2017. |
|
|
CLARIN_RES-DEP |
Homer |
Ancient Greek (to 1453) (grc) |
229 551 |
2023-11-12 |
Odyssey and Iliad |
|
|
|
ICAME – ACE |
English (eng) |
1 152 533 |
2020-04-25 |
ACE is the first systematically compiled heterogeneous corpus in Australia, designed to support a variety of linguistic research. |
|
|
CLARIN_ACA |
ICAME – BROWN Family |
English (eng) |
6 897 518 |
2020-04-25 |
This is a collection of Brown, LOB, Frown, FLOB, BLOB and BE06, compiled by UCREL, Lancaster. |
|
|
CLARIN_ACA |
ICAME – CEECS |
English (eng) |
514 224 |
2020-04-25 |
The Corpus of Early English Correspondence (CEEC) has been compiled for the study of social variables in the history of English. |
|
|
CLARIN_ACA |
ICAME – COLT |
English (eng) |
444 166 |
2020-04-25 |
COLT is a corpus of London Teenage Language with audio recordings. |
|
|
CLARIN_ACA-NC-LOC-PRIV-ND-* |
ICAME – FLOB |
English (eng) |
1 133 503 |
2020-04-25 |
The Freiburg-LOB Corpus of British English (FLOB) contains texts from 1991. |
|
|
CLARIN_ACA |
ICAME – FROWN |
English (eng) |
1 145 190 |
2020-04-25 |
The Freiburg-Brown Corpus of American English (Frown) contains texts from 1991. |
|
|
CLARIN_ACA |
ICAME – HC |
English (eng) |
1 851 007 |
2020-04-25 |
The Helsinki Corpus of English Texts is a structured multi-genre diachronic corpus. |
|
|
CLARIN_ACA |
ICAME – HCOS |
Scots (sco) |
959 977 |
2020-04-26 |
The Helsinki Corpus of Older Scots was compiled as a supplement to the diachronic part of the Helsinki Corpus of English Texts. |
|
|
CLARIN_ACA |
ICAME – LLC |
English (eng) |
574 340 |
2020-04-25 |
The London-Lund Corpus contains samples of educated spoken British English, in orthographic transcription with detailed prosodic marking. |
|
|
CLARIN_ACA |
ICAME – LOB |
English (eng) |
1 156 902 |
2020-04-25 |
The Lancaster-Oslo/Bergen (LOB) Corpus is a million-word collection of present-day (1961) British English texts. |
|
|
CLARIN_ACA |
ICNALE |
English (eng) |
1 361 193 |
2022-03-04 |
ICNALE (International Corpus Network of Asian Learners of English), Written Essays section, tagged |
|
|
CLARIN_RES |
Jos1M |
Slovenian (slv) |
1 182 946 |
2020-04-25 |
Project JOS: Linguistic Annotation of Slovene (http://nl.ijs.si/jos/index-sl.html) |
|
|
CC-BY-NC |
KIAP |
English (eng) |
3 900 925 |
2020-04-25 |
KIAP is a corpus of 450 research articles covering three disciplines (economics, linguistics and medicine) and three languages (English, French and Norwegian). |
|
|
CC-BY |
KRLE |
Norwegian Bokmål (nob) |
74 398 |
2021-06-08 |
The KRLE corpus belongs to the research group Reading and Writing Didactics at Western Norway University of Applied Sciences and was collected during a research project named “Learning by Writing in Religious Education (RE)”. |
|
|
CLARIN_RES-PRIV |
NAOB-tekster |
Norwegian Bokmål (nob) |
229 151 252 |
2020-04-25 |
This corpus contains Norwegian (Bokmål) texts used for building the NAOB treebank. |
|
|
CLARIN_RES-INF-LOC-PRIV-DEP |
NBs frie tekster (Bokmål) |
Norwegian Bokmål (nob) |
516 392 689 |
2020-04-25 |
This corpus contains Norwegian (Bokmål) texts from NBDigital (the National Library’s collection of OCRed books) that are free of copyright and have an OCR confidence higher than 0.9. |
|
|
CC-ZERO |
NBs frie tekster (Nynorsk) |
Norwegian Nynorsk (nno) |
46 017 287 |
2020-04-25 |
This corpus contains Norwegian (Nynorsk) texts from NBDigital (the National Library’s collection of OCRed books) that are free of copyright and have an OCR confidence higher than 0.9. |
|
|
CC-ZERO |
NSPC/Nor |
Norwegian (nor) |
2 293 213 |
2020-04-25 |
The NSPC is a parallel, unidirectional translation corpus of contemporary Norwegian written texts translated into Spanish, published between 2000 and 2009. It contains fiction and non-fiction, and each text is classified according to genre, the author's gender and the gender and mother tongue of the translator. |
|
|
CLARIN_ACA |
NSPC/Spa |
Spanish (spa) |
2 465 968 |
2020-04-25 |
The NSPC is a parallel, unidirectional translation corpus of contemporary Norwegian written texts translated into Spanish, published between 2000 and 2009. It contains fiction and non-fiction, and each text is classified according to genre, the author's gender and the gender and mother tongue of the translator. |
|
|
CLARIN_ACA |
NTAP |
English (eng) |
660 798 199 |
2020-04-25 |
The English NTAP blog corpus comprises 1.5 million English-language blog posts from around 3000 blogs related to climate change issues across science, politics and environment. |
|
|
|
NTAP (French) |
French (fra) |
1 506 074 082 |
2020-04-26 |
The French NTAP blog corpus comprises 2.3 million French-language blog posts from around 2000 blogs related to climate change issues across science, politics and environment. |
|
|
|
Nagaoka Tigrinya Corpus |
Tigrinya (tir) |
72 469 |
2024-02-29 |
The Nagaoka Tigrinya corpus is the first publicly available part-of-speech (PoS) tagged corpus of the Tigrinya language. |
|
|
|
Naturen |
Norwegian Bokmål (nob) |
2 489 279 |
2023-12-07 |
|
|
|
CLARIN_ACA |
Nynorsk-korpus |
Norwegian Nynorsk (nno) |
107 273 268 |
2021-02-24 |
Nynorsk-korpuset |
|
|
CLARIN_RES |
Nynorsk-korpus (2017) |
Norwegian Nynorsk (nno) |
3 447 527 |
2021-02-12 |
Nynorsk-korpuset |
|
|
CLARIN_RES |
Nynorsk-korpus (2023) |
Norwegian Nynorsk (nno) |
9 927 460 |
2023-11-15 |
Nynorsk-korpuset |
|
|
CLARIN_RES |
Nynorsk-korpus (Allkunne) |
Norwegian Nynorsk (nno) |
1 605 648 |
2022-12-06 |
Nynorsk-korpuset, Allkunne part |
|
|
CLARIN_RES |
PubBEPC (eng) |
English (eng) |
448 717 |
2020-04-26 |
PubBEPC is a Bokmål-English sentence-aligned parallel corpus built from the public web sites www.nav.no, www.nyinorge.no and skatteetaten.no. |
|
|
CC-BY |
PubBEPC (nob) |
Norwegian Bokmål (nob) |
359 401 |
2020-04-26 |
PubBEPC is a Bokmål-English sentence-aligned parallel corpus built from the public web sites www.nav.no, www.nyinorge.no and skatteetaten.no. |
|
|
CC-BY |
PubNEPC (eng) |
English (eng) |
353 838 |
2020-04-26 |
PubNEPC is a Nynorsk-English sentence-aligned parallel corpus built from the public web sites www.nav.no and skatteetaten.no. |
|
|
CC-BY |
PubNEPC (nno) |
Norwegian Nynorsk (nno) |
289 723 |
2020-04-26 |
PubNEPC is a Nynorsk-English sentence-aligned parallel corpus built from the public web sites www.nav.no and skatteetaten.no. |
|
|
CC-BY |
Rustaveli/abk-1 |
Abkhazian (abk) |
60 502 |
2023-04-06 |
GNC Rustaveli Parallel Corpus, Abkhaz translation 1 |
|
|
|
Rustaveli/abk-2 |
Abkhazian (abk) |
66 362 |
2023-04-06 |
GNC Rustaveli Parallel Corpus, Abkhaz translation 2 |
|
|
|
Rustaveli/ara |
Arabic (ara) |
63 744 |
2022-07-20 |
GNC Rustaveli Parallel Corpus, Arabic translation |
|
|
|
Rustaveli/bel |
Belarusian (bel) |
60 920 |
2022-07-20 |
GNC Rustaveli Parallel Corpus, Belarusian part |
|
|
|
Rustaveli/deu-1 |
German (deu) |
84 063 |
2022-07-20 |
GNC Rustaveli Parallel Corpus, German translation 1 |
|
|
|
Rustaveli/deu-2 |
German (deu) |
82 905 |
2022-07-20 |
GNC Rustaveli Parallel Corpus, German translation 2 |
|
|
|
Rustaveli/deu-3 |
German (deu) |
79 044 |
2022-07-20 |
GNC Rustaveli Parallel Corpus, German translation 3 |
|
|
|
Rustaveli/eng-1 |
English (eng) |
92 100 |
2022-07-20 |
GNC Rustaveli Parallel Corpus, English translation 1 |
|
|
|
Rustaveli/eng-2 |
English (eng) |
97 542 |
2022-07-20 |
GNC Rustaveli Parallel Corpus, English translation 2 |
|
|
|
Rustaveli/fra-1 |
French (fra) |
78 384 |
2022-07-20 |
GNC Rustaveli Parallel Corpus, French translation 1 |
|
|
|
Rustaveli/fra-2 |
French (fra) |
83 863 |
2022-07-20 |
GNC Rustaveli Parallel Corpus, French translation 2 |
|
|
|
Rustaveli/hye |
Armenian (hye) |
68 205 |
2022-07-20 |
GNC Rustaveli Parallel Corpus, Armenian part |
|
|
|
Rustaveli/ita |
Italian (ita) |
1 188 |
2022-07-20 |
GNC Rustaveli Parallel Corpus, Italian translation |
|
|
|
Rustaveli/kir |
Kirghiz (kir) |
61 820 |
2022-07-20 |
GNC Rustaveli Parallel Corpus, Kyrgyz translation |
|
|
|
Rustaveli/lit |
Lithuanian (lit) |
46 633 |
2022-07-20 |
GNC Rustaveli Parallel Corpus, Lithuanian part |
|
|
|
Rustaveli/mge |
Middle Georgian (mge) |
64 305 |
2022-07-20 |
GNC Rustaveli Parallel Corpus, Georgian part |
|
|
|
Rustaveli/sva |
Svan (sva) |
49 212 |
2022-07-20 |
GNC Rustaveli Parallel Corpus, Svan translation |
|
|
|
Storting debates |
Norwegian Bokmål (nob) |
28 533 334 |
2020-04-26 |
The corpus "Proceedings of Norwegian parliamentary debates (2008-2015)" is a collection of transcriptions of Norwegian parliamentary debates between 2008 and 2015, downloaded from https://data.stortinget.no/. |
|
|
NLOD |
TRIS/de-at |
German (deu) |
841 861 |
2020-04-26 |
The corpus TRIS (Parallel Corpus of documents from the Technical Regulations Information System for German-Spanish) is a specialized parallel corpus of Spanish-German (ES-ES, DE-AT and DE-DE) texts from the European Commission, covering 1997 to 2010. |
|
|
CC-BY-NC-SA |
TRIS/es-es |
Spanish (spa) |
1 209 985 |
2020-04-26 |
The corpus TRIS (Parallel Corpus of documents from the Technical Regulations Information System for German-Spanish) is a specialized parallel corpus of Spanish-German (ES-ES, DE-AT and DE-DE) texts from the European Commission, covering 1997 to 2010. |
|
|
CC-BY-NC-SA |
TV 2 |
Norwegian Bokmål (nob) |
196 930 955 |
2024-02-23 |
|
|
|
CLARIN_RES |
Talk Of Norway |
Norwegian (nor) |
63 803 594 |
2020-04-26 |
Corpus based on the Talk of Norway (TON) dataset v. 1.0, a collection of Norwegian parliament speeches from the 1998-1999 to 2015-2016 sessions. |
|
|
NLOD |