Training corpus ssj500kv1.2
Persistent identifier for the resource:
Contact Person: Krek, Simon
This resource is licensed under the following terms:
Creative_Commons-BY-NC-SA (CC-BY-NC-SA)
Attribution:
Krek, Simon and Erjavec, Tomaž (2014). Training corpus ssj500kv1.2. Jožef Stefan Institute, Slovenia. http://hdl.handle.net/11495/DB26-0437-026E-4
Language(s): Slovenian (sl)
Description:
The ssj500k training corpus is based on two training corpora built within the JOS project. It contains the entire jos100k corpus and an additional 400,000 words from the million-word jos1M corpus. To build a training corpus, the text, which is a sequence of characters (letters, numbers, spaces, symbols, etc.), must first be divided into meaningful units such as paragraphs, sentences, words and punctuation. This procedure consists of segmentation (sentence identification) and tokenization (identification of tokens, i.e. words and punctuation). Two further types of information are then attributed to each word: a base form, or lemma (jagodam, jagodami -> jagoda), and a morphosyntactic tag. The tag is a compact code whose letters encode the word class and the related morphosyntactic features, for example Somei = samostalnik (noun), občno ime (common noun), moški spol (masculine gender), ednina (singular), imenovalnik (nominative). The ssj500k corpus uses the JOS tagset, which contains exactly 1,902 tags combining categories and features according to the specifications of the JOS project.
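
As an illustration of how such a tag can be read, the following is a minimal Python sketch that expands the example tag Somei letter by letter. The lookup tables are hypothetical and deliberately partial: they contain only the values named in the description above, and the attribute labels ("type", "gender", "number", "case") and the function name decode_tag are assumptions made for this example, not part of the JOS specifications.

```python
# Hypothetical, partial lookup tables. Only the values mentioned in the
# description are listed; the full JOS tagset defines many more.
WORD_CLASSES = {"S": "samostalnik (noun)"}

# Assumed order of feature positions after the word-class letter, for nouns only.
FEATURES_BY_CLASS = {
    "S": [
        ("type", {"o": "občno ime (common noun)"}),
        ("gender", {"m": "moški spol (masculine gender)"}),
        ("number", {"e": "ednina (singular)"}),
        ("case", {"i": "imenovalnik (nominative)"}),
    ],
}


def decode_tag(tag):
    """Expand a tag into a list of (attribute, value) pairs.

    Letters that are not in the partial tables are reported as "unknown".
    """
    if not tag:
        raise ValueError("empty tag")
    word_class = tag[0]
    decoded = [("word class", WORD_CLASSES.get(word_class, "unknown"))]
    features = FEATURES_BY_CLASS.get(word_class, [])
    # Pair each remaining letter with the feature expected at that position.
    for (name, values), code in zip(features, tag[1:]):
        decoded.append((name, values.get(code, "unknown")))
    return decoded


for name, value in decode_tag("Somei"):
    print(f"{name}: {value}")
```

A real decoder would load the complete set of categories and features from the JOS specifications rather than from hand-written tables like these.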