The 2nd edition of the shared task on multilingual named entity recognition aims at recognizing mentions of named entities in web documents in Slavic languages, lemmatizing them, and matching them across languages.
Due to rich inflection, free word order, derivation, and other phenomena exhibited by Slavic languages, the detection of names and their lemmatization pose a challenging task. Fostering research and development on this problem and the closely related problem of entity linking is of paramount importance for enabling multilingual and cross-lingual information access.
The 2019 edition of the shared task covers four languages:
and focuses on recognition of five types of named entities including:
The task focuses on cross-lingual document-level extraction of named entities, i.e., the systems should recognize, classify, and extract all named entity mentions in a document, but detecting the position of each named entity mention in the text is not required. Furthermore, named-entity mentions should be lemmatized, and the ones referring to the same real-world object should be linked across documents and languages.

The input text collection consists of sets of news articles from online media, each collection revolving around a certain entity or event. The corpus was obtained by crawling the web and parsing the HTML of relevant documents.
IMPORTANT: it is not mandatory to participate in the full task; e.g., monolingual system responses without lemmatization of the extracted named entities will also be evaluated and scored.
Teams that plan to participate should register via email to bsnlp (ät) cs.helsinki.fi, including the following information:
- team name,
- team composition,
- contact person,
- contact email.
Registered teams will receive access to data and additional information about the shared task.
Detailed task description and system response guidelines
PDF: Detailed definition of the Shared Task, including system response guidelines and relevant input/output formats (UPDATED on 22 January 2019).
Sample data consisting of raw documents and corresponding annotations are available HERE.
Training data are available HERE.
Training data consist of two moderately sized sets of annotated documents, each related to a specific topic (entity or event).
Participants are encouraged to exploit various external named-entity-related resources for the languages of the shared task, which can be found at the SIGSLAV web page.
Test data are available HERE.
The test data set consists of two sets of documents, each related to a specific topic (revolving around an entity or event); the topics are different from those in the training data set.
The format used is exactly the same as for training data.
A Java script that checks consistency of the annotation files (including format, valid entity types, id assignment, etc.) can be found HERE.
A Java script that was used for evaluation can be found HERE. Please read 'readme.txt' inside the archive for more details. Do not hesitate to contact us in case you find any problems.
Evaluation is carried out on the system response returned by the participants for the test corpora.
Named entity recognition (exact case-insensitive matching) and lemmatization are evaluated in terms of precision, recall, and F1 scores. For named entity recognition in particular, two types of evaluation are carried out:
Relaxed evaluation: an entity mentioned in a given document is considered to be extracted correctly if the system response includes at least one annotation of a named mention of this entity (regardless of whether the extracted mention is in base form);
Strict evaluation: the system response should include exactly one annotation for each unique form of a named mention of an entity that is referred to in a given document, i.e., capturing and listing all variants of an entity is required.
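One possible reading of these two evaluation regimes, sketched in Python over lower-cased surface forms. The data structures here (sets of surface forms per document, gold entities as sets of variant forms) are illustrative assumptions, not the official submission format:

```python
def strict_scores(gold_forms, sys_forms):
    """Strict evaluation: every unique surface form of every gold
    mention should appear exactly once in the system response.
    gold_forms, sys_forms: sets of lower-cased strings."""
    tp = len(gold_forms & sys_forms)
    p = tp / len(sys_forms) if sys_forms else 0.0
    r = tp / len(gold_forms) if gold_forms else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def relaxed_recall(gold_entities, sys_forms):
    """Relaxed evaluation: a gold entity counts as extracted if at
    least one of its mention variants appears in the system response.
    gold_entities: list of sets of lower-cased variant forms."""
    hit = sum(1 for forms in gold_entities if forms & sys_forms)
    return hit / len(gold_entities) if gold_entities else 0.0
```

For example, a gold entity with variants {"ec", "european commission"} is recalled under the relaxed regime as soon as either form is returned, whereas the strict regime penalizes the missing variant.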
The document-level and cross-language entity matching tasks are evaluated using the Link-Based Entity-Aware metric (LEA) introduced in (Moosavi and Strube, 2016).
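A minimal sketch of LEA, following the formulation in Moosavi and Strube (2016): each entity is weighted by its size, and its resolution score is the fraction of its coreference links found in the other partition. The singleton handling below (one self-link, counted as found if the mention appears anywhere in the other partition) is our reading of the paper's self-link extension, not the official scorer:

```python
def _pair_links(n):
    # Number of unique coreference links among n mentions.
    return n * (n - 1) // 2

def _resolution(entity, others):
    # Fraction of the entity's links that are found in the other partition.
    # Assumption: a singleton carries one self-link, counted as found
    # if its mention appears anywhere in the other partition.
    if len(entity) == 1:
        mention = next(iter(entity))
        return 1.0 if any(mention in o for o in others) else 0.0
    common = sum(_pair_links(len(entity & o)) for o in others)
    return common / _pair_links(len(entity))

def lea(key, response):
    """key, response: lists of entities, each a set of mention ids.
    Returns (recall, precision, F1)."""
    def weighted(a, b):
        total = sum(len(e) for e in a)
        if total == 0:
            return 0.0
        return sum(len(e) * _resolution(e, b) for e in a) / total
    rec = weighted(key, response)
    prec = weighted(response, key)
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return rec, prec, f1
```

For instance, splitting a three-mention gold entity into {1, 2} and {3} recovers only one of its three links, so recall drops to 1/3 while precision stays at 1.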
Please note that the evaluation is case-insensitive: all named mentions in the system response and the test corpora are lower-cased.
Publication and Workshop
Participants in the shared task are invited to submit a paper to the BSNLP 2019 workshop ( http://bsnlp.cs.helsinki.fi/ ), although submitting a paper is not mandatory for participating in the shared task. Papers must follow the workshop submission instructions ( http://bsnlp.cs.helsinki.fi/cfp.html ) and will undergo peer review. Their acceptance will not depend on the results obtained in the shared task (which will take place after the paper submission deadline), but on the quality of the paper (clarity of presentation, etc.). Authors of accepted papers will be informed about the evaluation results of their systems in due time prior to the submission deadline. Accepted papers will appear in the Proceedings of BSNLP 2019, and their authors will present their solutions and results in a dedicated session on the shared task.
- 08 January 2019: Shared task announcement and call for participation
- 15 January 2019: Release of sample data
- 2 March 2019: Registration deadline
- 3 March 2019: Release of full training data for registered participants
- 6 May 2019: Release of blind test data for registered participants
- 7 May 2019: Submission of system responses
- 9 May 2019: Sending results to participants
- 14 May 2019: Shared task paper submission due (non-mandatory)
- 26 May 2019: Notification of acceptance
- 3 June 2019: Camera-ready shared task papers due
- Michał Marcińczuk
- Petya Osenova
- Jakub Piskorski
- Lidia Pivovarova
- Pavel Přibáň
- Kiril Simov
- Josef Steinberger
- Roman Yangarber
- Tomek Bernaś
- Anastasia Golovina
- Laska Laskova
- Natalia Novikova
- Elena Shukshina
- Yana Vorobieva
- Alina Zaharova