Shared Task

Organizers' paper with an overview of the task results:

Jakub Piskorski, Laska Laskova, Michał Marcińczuk, Lidia Pivovarova, Pavel Přibáň, Josef Steinberger, Roman Yangarber. The Second Cross-Lingual Challenge on Recognition, Normalization, Classification, and Linking of Named Entities across Slavic Languages.

BibTex:

   @inproceedings{piskorski-etal-2019-second,
     title = "The Second Cross-Lingual Challenge on Recognition, Normalization, Classification, and Linking of Named Entities across {S}lavic Languages",
     author = "Piskorski, Jakub and Laskova, Laska and Marci{\'n}czuk, Micha{\l} and Pivovarova, Lidia and P{\v{r}}ib{\'a}{\v{n}}, Pavel and Steinberger, Josef and Yangarber, Roman",
     booktitle = "Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing",
     month = aug,
     year = "2019",
     address = "Florence, Italy",
     publisher = "Association for Computational Linguistics",
     url = "https://www.aclweb.org/anthology/W19-3709",
     pages = "63--74"
   }

Complete final ranking

Description

The 2nd edition of the shared task on multilingual named entity recognition aims at recognizing mentions of named entities in web documents in Slavic languages, lemmatizing them, and matching them across languages.

Due to rich inflection, free word order, derivation, and other phenomena exhibited by Slavic languages, the detection of names and their lemmatization is a challenging task. Fostering research and development on this problem, and on the closely related problem of entity linking, is of paramount importance for enabling multilingual and cross-lingual information access.

The 2019 edition of the shared task covers four languages: Bulgarian, Czech, Polish, and Russian. It focuses on the recognition of five types of named entities: persons (PER), locations (LOC), organizations (ORG), products (PRO), and events (EVT).

The task focuses on cross-lingual, document-level extraction of named entities: systems should recognize, classify, and extract all named-entity mentions in a document, but detecting the position of each mention in the text is not required. Furthermore, named-entity mentions should be lemmatized, and mentions referring to the same real-world object should be linked across documents and languages.

The input text collection consists of sets of news articles from online media, each collection revolving around a certain entity or event. The corpus was obtained by crawling the web and parsing the HTML of relevant documents.
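
To make the expected system behavior concrete, here is a minimal Python sketch of assembling such a document-level response. It is illustrative only: the function name, the tuple layout, and the deduplication policy are our own assumptions, and the authoritative input/output formats are defined in the task description PDF below.

    def document_response(mentions):
        # mentions: iterable of (surface_form, lemma, entity_type, entity_id)
        # tuples produced by a recognizer; character offsets are deliberately
        # absent, since mention positions are not required by the task.
        unique = {}
        for form, lemma, etype, eid in mentions:
            # keep exactly one annotation per unique (lower-cased) surface form
            unique.setdefault(form.lower(), (form, lemma, etype, eid))
        return list(unique.values())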

IMPORTANT: it is not mandatory to participate in the full task; e.g., monolingual system responses without lemmatization of the extracted named entities will also be evaluated and scored.

Participation

Teams that plan to participate should register via email to bsnlp (ät) cs.helsinki.fi, including the following information:

Registered teams will receive access to data and additional information about the shared task.

Updates about the shared task will be announced on the BSNLP 2019 Web page, and on the SIGSLAV mailing list.

Detailed task description and system response guidelines

PDF: Detailed definition of the Shared Task, including system response guidelines and relevant input/output formats (UPDATED on 22 January 2019).

Data Sets

Sample data

Sample data consisting of raw documents and corresponding annotations are available HERE.

Training data

Training data are available HERE.

Training data consist of two moderately sized sets of annotated documents, each related to a specific topic (entity or event).

Participants are encouraged to exploit various external named-entity related resources for the languages of the shared task, which can be found at the SIGSLAV web page.

Test data

Test data are available HERE.

The test data set consists of two sets of documents, each related to a specific topic (revolving around an entity or event); these topics are different from those in the training data set.

The format used is exactly the same as for training data.

Consistency Check

A Java tool that checks the consistency of the annotation files (including format, valid entity types, ID assignment, etc.) can be found HERE.

Evaluation Metrics

The Java tool that was used for evaluation can be found HERE. Please read 'readme.txt' inside the archive for more details. Do not hesitate to contact us if you find any problems.

Evaluation is carried out on the system response returned by the participants for the test corpora.

The named entity recognition (exact, case-insensitive matching) and lemmatization tasks are evaluated in terms of precision, recall, and F1 scores. For named entity recognition in particular, two types of evaluation are carried out (both are sketched in code after the definitions):

Relaxed evaluation:  an entity mentioned in a given document is considered to be extracted correctly if the system response includes at least one annotation of a named mention of this entity (regardless of whether the extracted mention is in its base form);

Strict evaluation:  the system response should include exactly one annotation for each unique form of a named mention of an entity that is referred to in a given document, i.e., capturing and listing all variants of an entity is required.
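
The following toy Python sketch illustrates the difference between the two modes. It ignores entity types and lemmas, and the data layouts (sets of surface forms; a list of per-entity form collections) are our own simplifying assumptions; the Java evaluator below remains the authoritative implementation.

    def f1(p, r):
        # harmonic mean of precision and recall
        return 2 * p * r / (p + r) if p + r else 0.0

    def strict_scores(gold_forms, system_forms):
        # Strict mode: compare the sets of unique, lower-cased mention
        # forms; every variant of every entity must be listed.
        gold = {f.lower() for f in gold_forms}
        system = {f.lower() for f in system_forms}
        tp = len(gold & system)
        p = tp / len(system) if system else 0.0
        r = tp / len(gold) if gold else 0.0
        return p, r, f1(p, r)

    def relaxed_recall(gold_entities, system_forms):
        # Relaxed mode: an entity counts as extracted if at least one of
        # its gold mention forms appears in the system response.
        system = {f.lower() for f in system_forms}
        found = sum(1 for forms in gold_entities
                    if any(f.lower() in system for f in forms))
        return found / len(gold_entities) if gold_entities else 0.0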

The document-level and cross-language entity matching tasks are evaluated using the Link-Based Entity-Aware metric (LEA) introduced in (Moosavi and Strube, 2016).

Please note that the evaluation is case-insensitive. That is, all named mentions in the system response and the test corpora are lower-cased before comparison.
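
As a rough illustration of the LEA metric mentioned above, the sketch below scores entity clusters given as sets of mention identifiers; precision is obtained by swapping the arguments, and F1 is their harmonic mean. Singleton handling is simplified relative to the original metric, so treat this as an approximation of (Moosavi and Strube, 2016), not a reimplementation of the official scorer.

    def links(n):
        # number of coreference links in a cluster of n mentions
        return n * (n - 1) // 2

    def lea_recall(key, response):
        # key, response: lists of sets of mention identifiers.
        # Simplifying assumption: a singleton's self-link counts as
        # resolved if its mention occurs anywhere in the response.
        num = den = 0.0
        for k in key:
            if len(k) == 1:
                resolved = 1.0 if any(k <= r for r in response) else 0.0
            else:
                resolved = sum(links(len(k & r)) for r in response) / links(len(k))
            num += len(k) * resolved   # importance(k) = |k|
            den += len(k)
        return num / den if den else 0.0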

Publication and Workshop

Participants in the shared task are invited to submit a paper to the BSNLP 2019 workshop, although submitting a paper is not mandatory for participating in the shared task. Papers must follow the workshop submission instructions and will undergo peer review. Their acceptance will not depend on the results obtained in the shared task (which will take place after the paper submission deadline), but on the quality of the paper (clarity of presentation, etc.). Authors of accepted papers will be informed about the evaluation results of their systems in due time prior to the submission deadline. Accepted papers will appear in the Proceedings of BSNLP 2019, and their authors will present their solutions and results in a dedicated session on the shared task.

Important Dates

Organizers

Contributors

Acknowledgements

The shared task was supported in part by the Europe Media Monitoring Project (EMM), carried out by the Text and Data Mining Unit of the Joint Research Centre of the European Commission.
Work was supported in part by investment in the CLARIN-PL research infrastructure funded by the Polish Ministry of Science and Higher Education.
Work was supported in part by ERDF “Research and Development of Intelligent Components of Advanced Technologies for the Pilsen Metropolitan Area (InteCom)” (no. CZ.02.1.01/0.0/0.0/17_048/0007267), and by Grant No. SGS-2019-018 “Processing of heterogeneous data and its specialized applications.”
The work on Bulgarian data for the shared task was partially supported by the Bulgarian National Interdisciplinary Research e-Infrastructure for Resources and Technologies in favor of the Bulgarian Language and Cultural Heritage, part of the EU infrastructures CLARIN and DARIAH – CLaDA-BG, Grant number DO01-164/28.08.2018.