Important Dates
- Shared task announcement: 1 December 2020 (training data for most languages available below)
- Release of training data (for remaining languages): 25 January 2021
- Registration deadline: 1 March 2021
- Shared Task paper submission (DRAFT without evaluation test data): 1 March 2021
- Release of test data to registered participants: 3 March 2021
- Submission of system responses: 5 March 2021
- Results announced to participants: 6 March 2021
- Camera-ready Shared Task paper: 8 March 2021
Description
The 3rd edition of the SlavNER Shared Task focuses on the analysis of Named Entities in multilingual Web documents in Slavic languages.
Due to rich inflection, free word order, derivation, and other phenomena present in the Slavic languages, work on Named Entities in these languages is particularly challenging. Fostering research and development on the problems of Named Entities — detecting mentions of names, lemmatization (normalization), classification, and cross-lingual matching — is crucial for cross-lingual information access and wider use of NLP in Slavic languages.
The 3rd edition of the shared task covers six languages:
- Bulgarian,
- Czech,
- Polish,
- Russian,
- Slovene,
- Ukrainian;
and five types of named entities:
- persons,
- locations,
- organizations,
- events,
- products.
The Shared Task focuses on cross-lingual, document-level extraction of named entities — the systems should recognize, classify, and extract all named entity mentions in a document; detecting the position of each named entity mention is not required. Named-entity mentions should be lemmatized, and mentions referring to the same real-world object should be linked across documents and languages. The input text collection consists of sets of documents retrieved from the Web, each set being about a certain entity or event. The corpus was obtained by crawling the Web and parsing the HTML of documents.
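To make the expected document-level output concrete, the sketch below shows what a system response for one document might look like: the first line identifies the document, and each following line gives a surface mention, its base form (lemma), an entity type, and a cross-lingual identifier shared by all mentions of the same real-world entity. The exact file layout, entity type labels, and identifier conventions are specified in the task description document linked below; the labels and identifiers used here are purely illustrative.

```
123456                                        # document identifier (illustrative)
Варшаве    Варшава     LOC    LOC-Warszawa   # inflected mention, lemma, type, cross-lingual ID
Варшавы    Варшава     LOC    LOC-Warszawa   # another inflected variant of the same entity, same ID
Евросоюз   Евросоюз    ORG    ORG-EU         # mention already in base form
```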
Important: it is NOT mandatory to participate in the full task; for example, monolingual responses, or responses without lemmatization of the extracted named entities, can also be evaluated.
See the details about the 1st edition (2017) and the 2nd edition (2019) of this shared task.
Participation
Teams that intend to participate should register by sending an email to bsnlp@cs.helsinki.fi that includes the following information:
- name of team,
- names of team members,
- contact person,
- contact email.
Detailed task description and system response guidelines
PDF: Detailed definition of the Shared Task, including system response guidelines and relevant input/output formats (version 1.2, 18 January 2021).
Data
Training data
The training data for this edition consist of the training and test data from the previous editions (2017 and 2019):
- 2017 Train data - fixed (CS, HR, PL, RU, SK, SL, UA) - updated version with some error corrections (added 17 January 2021)
- 2021 Train data - fixed + UA, SL (UA, SL, RU, PL, CS, BG) - updated version of the 2019 data, with UA and SL data added for the 2019 topics (added 25 January 2021)
For consistency, we keep the original (non-fixed) versions of the data:
- 2017 training data (CS, HR, PL, RU, SK, SL, UA)
- 2019 Sample data (CS, PL, RU, BG)
- 2019 Train data (CS, PL, RU, BG)
- 2019 Test data (CS, PL, RU, BG)
Please refer to task description papers from the previous editions, 2017 and 2019, for more details.
Participants are encouraged to exploit various external named-entity resources for the languages of the Shared Task, which can be found at the SIGSLAV Web Page.
Test Data
The test data set will consist of two sets of documents, each set related to a specific topic (about an entity or event), which are different from the topics in the existing training data sets from 2017 and 2019.
- 2021 Test data (UA, SL, RU, PL, CS, BG)
Tools
Consistency Check
A Java program that checks the consistency of the annotation files (including format, valid entity types, id assignment, etc.) can be found HERE.
Evaluation Metrics
A Java program that was used for evaluation can be found HERE. Please read the file readme.txt inside the archive for more details. Do not hesitate to contact us in case you encounter any problems.
Evaluation is carried out on the system response returned by the participants for the test corpora.
Named entity recognition (exact, case-insensitive matching) and lemmatization are evaluated in terms of precision, recall, and F1 scores. For named entity recognition, two types of evaluation are carried out:
- Relaxed evaluation: an entity mentioned in a given document is considered to be extracted correctly if the system response includes at least one annotation of a named mention of this entity (regardless of whether the extracted mention is in its base form);
- Strict evaluation: the system response should include exactly one annotation for each unique form of a named mention of an entity that is referred to in a given document, i.e., capturing and listing all variants of an entity is required.
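In both regimes the scores themselves follow the standard definitions below; what changes between relaxed and strict evaluation is only which system annotations count as correct and how many annotations are expected per entity (the evaluation script linked above is the authoritative reference):

```latex
P = \frac{\#\text{correct system annotations}}{\#\text{system annotations}}, \qquad
R = \frac{\#\text{correct system annotations}}{\#\text{gold annotations}}, \qquad
F_1 = \frac{2 \cdot P \cdot R}{P + R}
```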
The document-level and cross-language entity matching tasks are evaluated using the Link-Based Entity-Aware (LEA) metric introduced in (Moosavi and Strube, 2016).
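As a reminder of how LEA works (a sketch following Moosavi and Strube, 2016; the evaluation tool above is authoritative): each gold entity cluster and each system entity cluster is treated as a set of mentions, an entity of size $|e|$ has $\mathrm{link}(e) = |e|(|e|-1)/2$ coreference links, and each entity contributes to the score proportionally to its size and to the fraction of its links reproduced by the other side:

```latex
\text{LEA recall} =
\frac{\sum_{k_i \in K} |k_i| \cdot \sum_{r_j \in R} \frac{\mathrm{link}(k_i \cap r_j)}{\mathrm{link}(k_i)}}
     {\sum_{k_i \in K} |k_i|},
\qquad
\text{LEA precision} =
\frac{\sum_{r_j \in R} |r_j| \cdot \sum_{k_i \in K} \frac{\mathrm{link}(r_j \cap k_i)}{\mathrm{link}(r_j)}}
     {\sum_{r_j \in R} |r_j|}
```

Here K denotes the gold (key) entity clusters and R the system (response) clusters, and the two values are combined into an F1 score as usual; details such as the treatment of singleton entities follow the original paper and the evaluation script.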
Please note that the evaluation is case-insensitive: all named mentions in the system response and the test corpora are lower-cased.
Publication and Workshop
Participants in the shared task are invited to submit a paper to the BSNLP 2021 workshop. Submitting a paper is not mandatory for participating in the Shared Task. Papers must follow the workshop submission instructions and will undergo regular peer review; their acceptance will not depend on the results obtained in the shared task, but on the quality of the paper: clarity of presentation, etc. Authors of accepted papers will be informed about the evaluation results of their systems prior to the submission deadline. Accepted papers will appear in the ACL Anthology and will be presented in a session of the BSNLP 2021 Workshop dedicated to the Shared Task.
The deadline for the shared task paper is the same as for the workshop papers, see the Important dates page.