Workshop Proceedings

Invited talks

Serge Sharoff

University of Leeds, UK

Towards Pan-Slavonic NLP: some experiments with Language Adaptation


There is great variation in the amount of NLP resources available for Slavonic languages. For example, the Universal Dependencies treebanks have about 2 million words of training data for Czech and for Russian, only 950 words for Ukrainian, and nothing for Belarusian, Bosnian or Macedonian. Similarly, the Autodesk Machine Translation dataset covers only three Slavonic languages (Czech, Polish and Russian).

In this talk I will discuss a general approach that can be called Language Adaptation, by analogy with Domain Adaptation. In this approach, a model for a particular language processing task is built by lexical transfer of cognate words and by learning a new feature representation for a lesser-resourced language, starting from a better-resourced one. More specifically, I will demonstrate how language adaptation works in such training scenarios as Part-of-Speech tagging, syntactic parsing and translation quality estimation.
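The idea of lexical transfer of cognate words can be illustrated with a minimal sketch. It is not the method presented in the talk: it assumes a plain string-similarity heuristic (Python's difflib) for spotting cognates, a toy Russian lexicon as the better-resourced side, and a handful of Ukrainian tokens as the lesser-resourced side. The names find_cognates and transfer_tags, the example words and the 0.75 threshold are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a crude stand-in for a cognate model."""
    return SequenceMatcher(None, a, b).ratio()


def find_cognates(target_vocab, source_vocab, threshold=0.75):
    """Map each target word to its most similar source word, if similar enough."""
    cognates = {}
    for tgt in target_vocab:
        best = max(source_vocab, key=lambda src: similarity(tgt, src))
        if similarity(tgt, best) >= threshold:
            cognates[tgt] = best
    return cognates


def transfer_tags(tokens, cognates, source_lexicon):
    """Project source-language tags onto target tokens via the cognate map.

    Tokens without a cognate receive None instead of a tag.
    """
    return [(tok, source_lexicon.get(cognates.get(tok))) for tok in tokens]


# Toy data (assumed for illustration): a tiny Russian tag lexicon
# and a few Ukrainian tokens to be tagged.
ru_lexicon = {"книга": "NOUN", "читать": "VERB", "новый": "ADJ"}
uk_tokens = ["читати", "новий", "книга", "що"]

cognate_map = find_cognates(uk_tokens, list(ru_lexicon))
tagged = transfer_tags(uk_tokens, cognate_map, ru_lexicon)
print(tagged)
```

In this sketch a token with no sufficiently similar source word is left untagged, which is where a learned feature representation for the lesser-resourced language would have to take over.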

Workshop Schedule

09:00 - 09:10 Welcome remarks
09:10 - 10:00 Toward Pan-Slavic NLP: Some Experiments with Language Adaptation
Invited talk by Serge Sharoff
Session I: Lexical Semantics
chair: Tanja Samardžić
10:10 - 10:35 Clustering of Russian Adjective-Noun Constructions using Word Embeddings
Andrey Kutuzov, Elizaveta Kuzmenko and Lidia Pivovarova
10:35 - 11:00 A Preliminary Study of Croatian Lexical Substitution
Domagoj Alagić and Jan Šnajder
11:00 - 11:30 Coffee break
Session II: Development of Linguistic Resources
chair: Lidia Pivovarova
11:30 - 11:55 Projecting Multiword Expression Resources on a Polish Treebank
Agata Savary and Jakub Waszczuk
11:55 - 12:20 Lexicon Induction for Spoken Rusyn – Challenges and Results
Achim Rabus and Yves Scherrer
12:20 - 12:45 The Universal Dependencies Treebank for Slovenian
Kaja Dobrovoljc, Tomaž Erjavec and Simon Krek
12:45 - 13:10 Universal Dependencies for Serbian in Comparison with Croatian and Other Slavic Languages
Tanja Samardžić, Mirjana Starović, Željko Agić and Nikola Ljubešić
13:10 - 14:30 Lunch
Session III: Processing Non-Standard Language and User-Generated Content
chair: Serge Sharoff
14:30 - 14:55 Spelling Correction for Morphologically Rich Language: a Case Study of Russian
Alexey Sorokin
14:55 - 15:20 Debunking Sentiment Lexicons: A Case of Domain-Specific Sentiment Classification for Croatian
Paula Gombar, Zoran Medić, Domagoj Alagić and Jan Šnajder
15:20 - 15:45 Adapting a State-of-the-Art Tagger for South Slavic Languages to Non-Standard Text
Nikola Ljubešić, Tomaž Erjavec and Darja Fišer
15:45 - 16:10 Comparison of Short-Text Sentiment Analysis Methods for Croatian
Leon Rotim and Jan Šnajder
16:10 - 16:30 Coffee break
Session IV: Shared Task on Multilingual Named Entity Recognition
chair: Jakub Piskorski, Josef Steinberger
16:30 - 16:45 The First Cross-Lingual Challenge on Recognition, Normalization, and Matching of Named Entities in Slavic Languages
Jakub Piskorski, Lidia Pivovarova, Jan Šnajder, Josef Steinberger and Roman Yangarber
16:45 - 16:55 Language-Independent Named Entity Analysis Using Parallel Projection and Rule-Based Disambiguation
James Mayfield, Paul McNamee and Cash Costello
16:55 - 17:05 Liner2 — a Generic Framework for Named Entity Recognition
Michał Marcińczuk, Jan Kocoń and Marcin Oleksy
17:05 - 17:15 Discussion
Session V: Information Filtering, Retrieval, and Extraction
chair: Jan Šnajder
17:20 - 17:40 Comparison of String Similarity Measures for Obscenity Filtering
Ekaterina Chernyak
17:40 - 18:00 Stylometric Analysis of Parliamentary Speeches: Gender Dimension
Justina Mandravickaite and Tomas Krilavicius
18:00 - 18:20 Towards Never Ending Language Learning for Morphologically Rich Languages
Kseniya Buraya, Lidia Pivovarova, Sergey Budkov and Andrey Filchenkov
18:20 - 18:40 Gender Profiling for Slovene Twitter communication: the Influence of Gender Marking, Content and Style
Ben Verhoeven, Iza Škrjanec and Senja Pollak