1 Overview

Context

The Saarbrücken Treebank of Albanian Fiction (STAF) is part of a larger effort to build a parallel corpus for typological investigations, the Corpus of Indo-European Prose and more (CIEP+; Talamo & Verkerk, 2022; Verkerk & Talamo, 2024), which aims to include the translations of 18 literary texts in 50 languages from 12 families. The parallel corpus is automatically annotated using Stanford Stanza (Qi et al., 2020) with models trained on Universal Dependencies (UD), performing the following Natural Language Processing (NLP) tasks: sentence splitting, tokenization, lemmatization, universal parts-of-speech (UPOS) and universal morphological (Universal Features) tagging, and dependency parsing. At the time of writing, twelve languages sampled in CIEP+ do not have pre-trained models and/or available treebanks in the UD collection.
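For illustration, such a pipeline can be run along the following lines. This is a minimal sketch: the language code, the example sentence and the availability of a UD-trained Stanza package for that language are assumptions made for the example, not part of the CIEP+ setup.

import stanza

lang = "sq"  # placeholder: ISO code of a language with a UD-trained Stanza package
stanza.download(lang)

# The pipeline covers the tasks listed above: sentence splitting and tokenization
# (tokenize, mwt), lemmatization (lemma), UPOS and Universal Features tagging (pos),
# and dependency parsing (depparse).
nlp = stanza.Pipeline(lang, processors="tokenize,mwt,pos,lemma,depparse")

doc = nlp("Gjenerali shikoi rrugën.")  # placeholder input sentence
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.id, word.text, word.lemma, word.upos, word.feats, word.head, word.deprel)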

Before the release of STAF in UD v.2.15 (November 2024), Albanian had only one treebank in the UD collection, UD_Albanian-TSA (Toska et al., 2020), which is too small (922 tokens, 60 sentences) to train a reliable model. Since the release of UD v.2.11, a treebank for Gheg Albanian, UD_Gheg-GPS, has been available. Despite the close similarities between this dialect and standard Albanian, which is based on the Tosk dialect but shows a complex interaction with Gheg (Camaj, 1984, xv–xvi), using GPS to annotate written standard Albanian is problematic for a number of reasons. First and foremost, the GPS treebank is based on oral data collected in Kosovo and in Switzerland, containing features of “(semi-)spontaneous speech, like disfluencies and corrections”.1 Furthermore, as is common in oral corpora, the orthography of the GPS treebank is the original transliteration of the collected data, which reflects some features of oral speech and thus differs from the orthography of standard Albanian, e.g., rru:gën vs. standard Albanian rrugën ‘the road.ACC’. Finally, due to the multilingual environment in which the data was collected, GPS contains several examples of code-switching with German, specifically Swiss German.

As for tools performing specific NLP tasks, Kastrati & Biba (2022) report that almost all of the parts-of-speech and morphological taggers for Albanian are “not available online for NLP purposes”. A notable exception is an Albanian model for the Turku Neural Parser Pipeline (Kanerva et al., 2018), which is trained on a large treebank (185K tokens) annotated in the UD framework (Kote et al., 2019); unfortunately, this treebank lacks dependency annotation and, consequently, the model does not perform dependency parsing, a task that is crucial for both applied and theoretical research.

Finally, a new treebank for standard Albanian, the Standard Albanian Language Treebank (SALT), has recently been presented by Kote et al. (2024). SALT is a large treebank (24,537 tokens, 1.4K sentences) annotated in the UD framework with some divergences from the official guidelines (see below); it is used as the ‘seed treebank’ for bootstrapping STAF. SALT is currently unreleased and its use as the seed treebank for STAF is gratefully acknowledged.

2 Method

Steps

Since CIEP+ is a parallel corpus of literary texts, I have decided to focus on the fiction genre for STAF. Developing resources featuring the fiction genre is particularly useful for cross-linguistic comparison because the vast majority of parallel resources, i.e., parallel corpora, only cover the legal and religious genres. Due to its narrative and descriptive nature, the language of the fiction genre is particularly rich both in lexicon and in morpho-syntactic structure; moreover, the fiction genre is often characterized by a certain amount of dialogue, which to some extent mimics spoken language.

The following steps were undertaken in the building of STAF: (i) data collection, (ii) automatic processing and (iii) manual correction.

As for the first step, I have legally acquired Albanian books available in digital format. This included full books in various formats (PDF, EPUB), which were converted to the TXT format using Calibre,2 as well as free book excerpts offered by online vendors. As shown in Table 1, I have sampled 200 sentences from nine fictional books written in standard Albanian by contemporary authors in the 1963–2016 period. The sentence sampling was mostly random, but I tried to keep a balance between dialogue and narrative parts (see the sketch after Table 1). For instance, sentences from Dibra’s Gjumi mbi borë are quite short and mostly contain dialogue; by contrast, sentences from Qosja’s Një dashuri dhe shtatë faje and Kongoli’s Ëndrra e Damokleut contain long narrative and descriptive passages. Finally, some sentences, such as the four sentences from Açka’s Kryqi i harresës, were chosen in order to cover all the possible merged particles, e.g., t’i = të.PART + i.DAT.3SG (see the Reuse potential section).

Table 1

Overview of the texts sampled in STAF.


AUTHOR – TITLE | YEAR | SENTENCES | TOKENS
Ismail Kadare – Gjenerali i Ushtrisë së Vdekur | 1963 | 42 | 593
Dritëro Agolli – Njerëz të krisur | 1995 | 11 | 144
Fatos Kongoli – Lëkura e qenit | 2003 | 12 | 317
Rexhep Qosja – Një dashuri dhe shtatë faje | 2003 | 36 | 529
Flutura Açka – Kryqi i harresës | 2004 | 4 | 64
Fatos Kongoli – Ëndrra e Damokleut | 2004 | 50 | 1207
Enkelejd Lamaj – Libri i bardhë | 2011 | 16 | 90
Enkelejd Lamaj – Vendi diku midis | 2014 | 10 | 229
Ridvan Dibra – Gjumi mbi borë | 2016 | 19 | 152
Total | | 200 | 3325
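The sampling step can be illustrated with a minimal sketch. The sentence splitting shown here is deliberately naive (the actual splitting was done by Stanza), the dialogue heuristic based on an opening dash or guillemet is an assumption made for the example, and the final selection was partly manual.

import random
import re

def sample_candidates(path, n, seed=42):
    """Randomly sample n candidate sentences from a plain-text book file and
    report how many of them look like dialogue turns."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    # Naive sentence splitting on terminal punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    random.seed(seed)
    sample = random.sample(sentences, min(n, len(sentences)))
    dialogue = [s for s in sample if s.startswith(("-", "–", "«"))]
    print(f"{len(dialogue)} dialogue-like sentences out of {len(sample)}")
    return sample

candidates = sample_candidates("gjumi_mbi_bore.txt", 30)  # placeholder file name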

The second step involved the training of an Albanian model for Stanza using an early (November 2023) version of the SALT treebank (Kote et al., 2024), combined with pre-trained word vectors from the FastText collection.3 The resulting model was used to bootstrap the annotation of STAF, automatically processing the 200 sentences for tokenization, lemmatization, parts-of-speech and morphological tagging as well as dependency parsing.
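The bootstrap annotation can be reproduced along the following lines. This is a sketch only: the model and file paths are placeholders, the exact keyword arguments for loading custom models may vary across Stanza versions, and the pretrain file is assumed to have been built from the FastText vectors mentioned above.

import stanza
from stanza.utils.conll import CoNLL

# Pipeline built from the models trained on the SALT seed treebank (all paths are placeholders).
nlp = stanza.Pipeline(
    lang="sq",
    processors="tokenize,mwt,pos,lemma,depparse",
    tokenize_model_path="saved_models/sq_salt_tokenizer.pt",
    mwt_model_path="saved_models/sq_salt_mwt_expander.pt",
    pos_model_path="saved_models/sq_salt_tagger.pt",
    pos_pretrain_path="saved_models/sq_fasttext.pretrain.pt",
    lemma_model_path="saved_models/sq_salt_lemmatizer.pt",
    depparse_model_path="saved_models/sq_salt_parser.pt",
    depparse_pretrain_path="saved_models/sq_fasttext.pretrain.pt",
)

# Automatically annotate the sampled sentences and write them to CoNLL-U
# for subsequent manual correction.
with open("staf_sentences.txt", encoding="utf-8") as f:
    doc = nlp(f.read())
CoNLL.write_doc2conll(doc, "sq_staf_bootstrap.conllu")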

In the third step, three native speakers of Albanian manually corrected the annotated sentences: two of them, AÇ and RR, are professors of Albanian language at the University of Tirana (Albania), and the third, EL, was a student assistant at Saarland University (Germany) who had previously received three months of training in linguistics and in the annotation of UD treebanks. During a visiting period at Saarland University, AÇ and RR corrected 50 sentences and supervised EL in the annotation of 25 sentences; the remaining sentences were corrected by EL under my supervision. The correction of the annotation focused on the parts-of-speech and morphological tagsets and on the dependency parsing, and aimed to fix both processing errors and annotations that diverge from the UD guidelines. As discussed by Kote et al. (2024), these diverging annotations affect both the parts-of-speech/morphological tagset and the dependency parsing, making SALT non-interoperable with the existing UD treebank (TSA) and hindering its comparability with other treebanks in the UD collection.

Quality control

After manual correction, I performed a full review of all sentences with regard to the parts-of-speech and morphological tagsets, as well as the dependency parsing. Furthermore, in order to pass the UD validation test,4 I adapted the annotation of STAF to that of the existing UD treebank, TSA, with some exceptions and new annotation variables. As shown in Table 2, which summarizes the main differences in the annotations of the three treebanks, I have introduced several morphological features, analyzed certain words as multi-word tokens (MWTs), annotated pronominal clitics as objects/indirect objects, and added subtypes for dependency relations.5

Table 2

Differences in the annotation of TSA, STAF, and SALT treebanks. UPOS = Universal Parts of Speech; deprel = Dependency Relation; dephead = Dependency Head; features = Universal Features.


Annotation | TSA | STAF | SALT
Multi-word tokens | no | yes | yes
UPOS tags | 14 | 15 | 17
UPOS for kam ‘to have’ and jam ‘to be’ as copula | AUX | AUX | VERB
Deprels | 33 | 37 | 32
Dephead for adjectival/nominal predications | adj/noun | adj/noun | copula
Deprel for për të + verb | mark | mark | fixed
Deprel for oblique temporal modifiers | obl | obl:tmod | advmod
Deprel for possessive pronouns | det | det:poss | amod:poss
Deprel for articles of prearticulated adjectives | det:adj | det:adj | det
Deprel for pronominal clitics in clitic doubling | expl | obj/iobj | obj/iobj
Features | 36 | 41 | ?
Features for adjectives | Gender, Number | Case, Degree, Gender, Number | Case, Degree, Gender, Number
Features for adpositions | – | Case | –
Features for adverbs | Degree | Degree, AdvType | AdvType
Features for articles | Gender | Case, Definite, Gender, Number, PronType | Case, Gender, Number, PronType
Features for possessive markers (i/e/të/së + possessor) | Gender | Gender, Number | Case, Gender, Number, PronType
Features for personal pronouns | Gender, Number, PronType | Case, Gender, Number, PronType | Case, Gender, Number, PronType
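The inventories in Table 2 can be recomputed directly from a treebank's CoNLL-U file. The following sketch uses the third-party conllu package; the file name is a placeholder, and the counts obtained in this way depend on the file and release used.

import conllu

with open("sq_staf-ud-test.conllu", encoding="utf-8") as f:  # placeholder file name
    sentences = conllu.parse(f.read())

upos_tags, deprels, features = set(), set(), set()
for sentence in sentences:
    for token in sentence:
        if isinstance(token["id"], int):  # skip multi-word token range lines
            upos_tags.add(token["upos"])
            deprels.add(token["deprel"])
            if token["feats"]:
                features.update(token["feats"].keys())

print(len(upos_tags), "UPOS tags;", len(deprels), "deprels;", len(features), "features")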

3 Dataset Description

Repository name

Zenodo

Object name

STAF

Format names and versions

TXT (CoNLL-U format)

Creation dates

2023-07-10 to 2024-11-15

Dataset creators

Luigi Talamo (Saarland University), Edita Luftiu (Saarland University), Nelda Kote (Polytechnic of Tirana), Rozana Rushitu (University of Tirana), Anila Çepani (University of Tirana).

Language

Albanian

License

Creative Commons Attribution 4.0 International

Publication date

2024-11-15

4 Reuse Potential

The following examples illustrate the reuse potential of STAF.

Quantitative empirical research

As a validated Universal Dependencies treebank,6 STAF allows for typological and contrastive studies with respect to the more than 100 languages covered by the UD collection. For instance, merged particles, which result from the combination of a personal pronominal clitic with another pronominal clitic and/or a subordinating marker, are annotated as multi-word tokens, e.g., ma = më.DAT.1SG + e.ACC.3SG (see also Toska et al., 2020, 182–183; Kote et al., 2024, 87); this annotation has recently been exploited in a cross-linguistic study on pronouns (Talamo et al., submitted).
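In CoNLL-U, a merged particle is encoded as a multi-word token: a range line for the surface form is followed by one line per syntactic word. The fragment below is hand-written for illustration; the lemmas, features and heads are plausible values, not copied from STAF.

import conllu

# Illustrative fragment: the surface token "Ma" split into the clitics "më" and "e".
fragment = """\
# text = Ma dha.
1-2\tMa\t_\t_\t_\t_\t_\t_\t_\t_
1\tmë\tunë\tPRON\t_\tCase=Dat|Number=Sing|Person=1|PronType=Prs\t3\tiobj\t_\t_
2\te\tai\tPRON\t_\tCase=Acc|Number=Sing|Person=3|PronType=Prs\t3\tobj\t_\t_
3\tdha\tjap\tVERB\t_\t_\t0\troot\t_\t_
"""

sentence = conllu.parse(fragment)[0]
for token in sentence:
    print(token["id"], token["form"], token["upos"], token["deprel"])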

Training and development set

STAF can serve as a training dataset for models used in several Natural Language Processing tasks, such as lemmatization, parts-of-speech tagging and dependency parsing. For instance, Stanford Stanza includes a model for Albanian trained on STAF.
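For model training, the released CoNLL-U file can be split at sentence level into training and development portions. The sketch below is one way of doing this; the file names and the 90/10 split are arbitrary choices made for the example, not part of the STAF release.

import random

def split_conllu(path, train_path, dev_path, dev_ratio=0.1, seed=42):
    """Split a CoNLL-U file into train and dev portions at sentence level."""
    with open(path, encoding="utf-8") as f:
        # Sentences in CoNLL-U are separated by blank lines.
        blocks = [b for b in f.read().split("\n\n") if b.strip()]
    random.seed(seed)
    random.shuffle(blocks)
    n_dev = max(1, int(len(blocks) * dev_ratio))
    with open(dev_path, "w", encoding="utf-8") as f:
        f.write("\n\n".join(blocks[:n_dev]) + "\n\n")
    with open(train_path, "w", encoding="utf-8") as f:
        f.write("\n\n".join(blocks[n_dev:]) + "\n\n")

split_conllu("sq_staf-ud-test.conllu", "sq_staf-train.conllu", "sq_staf-dev.conllu")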

Testing set

STAF is manually annotated and carefully checked in each of its annotation fields, allowing for its use as a gold-standard resource.
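For instance, a parser's output can be scored against STAF with the official CoNLL 2018 shared-task evaluation script. The sketch below assumes the copy bundled with Stanza (the module path is an assumption, as are the file names); the standalone conll18_ud_eval.py script exposes the same functions.

from stanza.utils.conll18_ud_eval import load_conllu_file, evaluate

# Gold annotation (STAF) and system output for the same sentences, both in CoNLL-U.
gold = load_conllu_file("sq_staf-ud-test.conllu")
system = load_conllu_file("sq_staf_system_output.conllu")

scores = evaluate(gold, system)
for metric in ("UPOS", "UFeats", "Lemmas", "UAS", "LAS"):
    print(f"{metric}: F1 = {scores[metric].f1:.4f}")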