Corpuscle :: Corpus list
Choose a corpus from the list below. Some corpora are only available when you have signed in.
Corpus Language(s) Size (tokens) (words & punctuation) Created Description License
ASK Hovedkorpus Norwegian Bokmål (nob) 1 129 569 769 802 2017-06-02 The Norwegian Second Language Corpus project (ASK) has established an electronic, searchable text corpus of Norwegian as a second language, with links between linguistic data and personal data. CLARIN_RES-PRIV
ASK Korrektkorpus Norwegian Bokmål (nob) 1 132 810 785 451 2016-09-21 The Norwegian Second Language Corpus project (ASK) has established an electronic, searchable text corpus of Norwegian as a second language, with links between linguistic data and personal data. CLARIN_RES-PRIV
Aviskorpus (Bokmål) Norwegian Bokmål (nob) 1 516 439 773 1 509 076 098 2017-06-02 The Norwegian Newspaper Corpus (NNC) Bokmål version is a large monitor corpus representing contemporary Norwegian language in the written variety Norwegian Bokmål. A corresponding corpus is available for Norwegian nynorsk, see URL in metadata. CC-BY
Aviskorpus (Nynorsk) Norwegian Nynorsk (nno) 16 070 002 2017-06-02 The Norwegian Newspaper Corpus (NNC) Nynorsk is a large monitor corpus representing contemporary Norwegian language in the written variety Norwegian Nynorsk. A corresponding corpus is available for Norwegian bokmål, see URL in metadata. CC-BY
Aviskorpus ann. Norwegian Bokmål (nob) 35 692 210 28 969 124 2017-06-02 This is a subpart of Norsk aviskorpus, grammatically annotated and classified. It comprises 35 692 210 tokens and covers Norwegian bokmål in the time span 2001-2009. CC-BY
BulTreeBank Bulgarian (bul) 259 327 229 732 2017-06-02 This distribution represents only the morphological information encoded in BulTreeBank - HPSG-based Treebank of Bulgarian. It contains about 214000 tokens. It was used for the training of the TreeTagger for Bulgarian. MS-NC-NoReD
COLA Spanish (spa) 751 168 479 474 2017-06-02 COLA (Corpus Oral de Lenguaje Adolescente Resource) is a corpus of recorded, spontaneous speech among teenagers from different schools and youth clubs in Madrid, Buenos Aires and Santiago de Chile. It is created for the purpose of studying teenage language in Spanish. The sound files are coupled with orthographic transcriptions (text files) that are anonymized, making the corpus searchable as text through a web search interface where you can read the text and listen to the corresponding recording. CLARIN_RES-PLAN-INF-PRIV-ND-DEP-*
Child Rights English (eng) 107 550 102 967 2017-06-02 This set of documents is a combination of reports to the UN Committee on the Rights of the Child from civil society organizations (CSOs), which include NGOs, NHRIs and Ombudspersons from Finland, Norway and Spain. The documents also include the Committee on the Rights of the Child's Concluding Observations. All the documents are collected from the previous reporting rounds for Finland (2008), Norway (2009) and Spain (2010). CC-BY
Coryl English (eng) 191 564 129 421 2017-06-02 Coryl is a young learner corpus consisting of English texts written by Norwegian pupils. The texts that make up the corpus were collected in the course of the National Testing of English (writing), 2004-5, at the University of Bergen, and were sampled randomly from pupils in 7th, 10th and 11th grade. The texts are anonymized and have been assigned to levels and half levels on the CEFR by multiple raters. CC-BY
FTA/Eng English (eng) 1 509 567 1 270 362 2017-06-02 FTA parallel corpus, English CLARIN_ACA
FTA/Spa Spanish (spa) 1 581 865 1 342 828 2017-06-02 FTA parallel corpus, Spanish CLARIN_ACA
Forskning.no Norwegian Bokmål (nob) 11 897 288 10 084 683 2017-06-02 Data set containing texts from the popular science website forskning.no. The text material is constituted by articles published by Forskning.no belonging to the following three categories: CLARIN_RES-DEP
Færøsk talemålsbank Faroese (fao) 56 694 45 135 2017-08-30 Transcribed and annotated dialect recordings
ICAME – ACE English (eng) 1 164 145 1 152 533 2017-06-02 ACE was the first systematically compiled heterogeneous corpus in Australia, designed to support a variety of linguistic research. Interest in the differentiation between Australian, British and American English meant that a corpus modeled on the Brown and LOB corpora would provide ready comparisons. It would also serve as a strategic sample of current Australian English, and as a reference corpus for comparisons with more specialised, homogeneous corpora in Australia. CLARIN_ACA
ICAME – BROWN Family English (eng) 7 006 533 6 897 518 2017-06-02 This is a collection of Brown, LOB, Frown, FLOB, BLOB and BE06. The collection is made by UCREL, Lancaster. CLARIN_ACA
ICAME – CEECS English (eng) 566 196 514 224 2017-06-02 The Corpus of Early English Correspondence (CEEC) has been compiled for the study of social variables in the history of English. To enable this, great attention has been paid to the authenticity of letters on the one hand and to the social representativeness of the writers on the other. The timespan covered is from 1417 to 1681, and the size of the whole corpus is 2.7 million words. CLARIN_ACA
ICAME – COLT English (eng) 689 885 444 166 2017-06-02 COLT is a corpus of London Teenage Language with sound, and is now distributed via the search engine Corpuscle. Corpuscle allows you to pass queries to the corpus, and you may ask for concordances, collocations and distribution. CLARIN_RES-PLAN-INF-PRIV-ND-DEP-*
ICAME – FLOB English (eng) 1 267 306 1 133 503 2017-06-02 The Freiburg - LOB Corpus of British English (FLOB) contains texts from 1991. Like the original Brown and LOB corpora, FLOB contains 500 texts of around 2000 words each, distributed across 15 text categories. CLARIN_ACA
ICAME – FROWN English (eng) 1 283 096 1 145 190 2017-06-02 The Freiburg - Brown Corpus of American English (Frown) contains texts from 1991. Like the original Brown and LOB corpora, Frown contains 500 texts of around 2000 words each, distributed across 15 text categories, 9 informative and 6 imaginative. CLARIN_ACA
ICAME – HC English (eng) 2 096 307 1 851 007 2017-06-02 The Helsinki Corpus of English Texts is a structured multi-genre diachronic corpus, which includes periodically organized text samples from Old, Middle and Early Modern English, from the eighth to the beginning of the eighteenth century. Each sample is preceded by a list of parameter codes giving information on the text and its author. CLARIN_ACA
ICAME – HCOS Scots (sco) 1 060 747 959 977 2017-06-02 The Helsinki Corpus of Older Scots was compiled as a supplement to the diachronic part of the Helsinki Corpus of English Texts. The Scottish texts were selected according to the same principles of sociohistorical variation analysis as the main corpus, and the computer format, parameter coding and editorial and typographical conventions are also the same. CLARIN_ACA
ICAME – LLC English (eng) 1 167 375 574 340 2017-06-02 The London-Lund Corpus contains samples of educated spoken British English, in orthographic transcription with detailed prosodic marking. It consists of 100 'texts', each of some 5,000 running words. The text categories represented are spontaneous conversation, spontaneous commentary, spontaneous and prepared oration, etc. CLARIN_ACA
ICAME – LOB English (eng) 1 247 034 1 156 902 2017-06-02 The Lancaster - Oslo/Bergen (LOB) Corpus is a million-word collection of present-day (1961) British English texts. CLARIN_ACA
Jos1M Slovenian (slv) 1 340 171 1 182 946 2017-06-02 Project JOS: Linguistic Annotation of Slovene (http://nl.ijs.si/jos/index-sl.html) CC-BY-NC
KIAP English (eng) 3 945 288 3 900 925 2017-06-02 KIAP is a corpus of 450 research articles covering three disciplines (economics, linguistics and medicine) and three languages (English, French and Norwegian). It is available in Corpuscle at the University of Bergen/Uni Research. Corpuscle allows you to pass queries to the corpus, and you may ask for concordances, collocations and distribution. Due to IPR regulations, the search queries may only return the results in limited context windows. CC-BY
NSPC/Nor Norwegian (nor) 2 946 352 2 293 213 2017-06-02 Norwegian-Spanish parallel corpus, Norwegian CLARIN_ACA
NSPC/Spa Spanish (spa) 3 102 373 2 465 968 2017-06-02 Norwegian-Spanish parallel corpus, Spanish CLARIN_ACA
Storting debates Norwegian Bokmål (nob) 29 482 445 28 533 334 2017-06-02 The corpus "Proceedings of Norwegian parliamentary debates (2008-2015)" is a collection of transcriptions of Norwegian parliamentary debates between 2008 and 2015, downloaded from https://data.stortinget.no/. NLOD
TRIS/de-at German (deu) 1 129 372 841 861 2017-06-02 TRIS CC-BY-NC-SA
TRIS/es-es Spanish (spa) 1 497 496 1 209 985 2017-06-02 TRIS CC-BY-NC-SA
Talk Of Norway Norwegian (nor) 64 304 339 63 803 594 2017-06-02 Corpus based on the Talk of Norway (TON) dataset v. 1.0, a collection of Norwegian parliament speeches from the 1998-1999 to 2015-2016 sessions. NLOD
Menota Old Norse (non) 1 991 981 1 618 664 Medieval Nordic Text Archive CC-BY-SA
ASK Norwegian Bokmål (nob) 1 129 799 769 892 Andrespråkskorpus: a Norwegian learners' corpus CLARIN_RES-PRIV
Talebanken Norwegian Nynorsk (nno) 7 080 765 5 675 549 Transcribed and annotated dialect recordings

Design & implementation: Paul Meurer, Uni Research Computing, 2017