Resources

Data Sets

EmpiriST Corpus

See the data set (and the description) on GitHub.

Corpora

AlCo Corpus

The Albanian Corpus (AlCo) contains one hundred million word tokens (text words), making it the first Albanian corpus of this size. The corpus covers different domains of language and contains different text types; it is a reference corpus. Work on the corpus is still in progress, and some texts still need to be replaced or recategorized. Since 2015, the corpus has been annotated with a morpho-syntactic tagset of 77 tags. We use CQPweb, a web-based corpus analysis system, to explore the corpus data.

AlCoPress (2017-2019) Corpus

The Albanian Corpus of Press Texts (AlCoPress) contains around 32 million word tokens (text words). The corpus is annotated in the same way as AlCo. We use CQPweb, a web-based corpus analysis system, to explore the corpus data.

Buzuku (1555) Corpus

The Buzuku Corpus contains the text of the "Missale" (1555) by Gjon Buzuku. The corpus is not annotated.