Shared Task

Description

The 2nd edition of the shared task on multilingual named entity recognition addresses the recognition of named-entity mentions in web documents in Slavic languages, their lemmatization, and cross-language matching.

Due to rich inflection, free word order, derivation, and other phenomena exhibited by Slavic languages, the detection of names and their lemmatization is a challenging task. For example, the inflected Polish form "Warszawie" needs to be recognized as a mention of "Warszawa" (Warsaw) and lemmatized accordingly. Fostering research and development on this problem, and on the closely related problem of entity linking, is of paramount importance for enabling multilingual and cross-lingual information access.

The 2019 edition of the shared task covers four languages: Bulgarian, Czech, Polish, and Russian. It focuses on the recognition of five types of named entities: persons, locations, organizations, products, and events.

The task focuses on cross-lingual, document-level extraction of named entities: systems should recognize, classify, and extract all named-entity mentions in a document, but detecting the position of each mention in the text is not required. Furthermore, named-entity mentions should be lemmatized, and mentions referring to the same real-world object should be linked across documents and languages. The input text collection consists of sets of news articles from online media, each set revolving around a certain entity or event. The corpus was obtained by crawling the web and parsing the HTML of relevant documents.
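For illustration only, a document-level system response can be thought of as a set of annotations, each combining the pieces of information listed above (mention, lemma, entity type, and a cross-lingual entity identifier). The field names and example values below are hypothetical; the binding input/output formats are specified in the detailed task description linked further down.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EntityAnnotation:
    """One document-level annotation (hypothetical field names, not the official format)."""
    mention: str    # surface form as it occurs in the document (position is not required)
    lemma: str      # lemmatized (base) form of the mention
    category: str   # one of the named-entity types defined by the task
    entity_id: str  # identifier shared by all mentions of the same real-world object
                    # across documents and languages

# A response for one document: one annotation per named-entity mention form.
response = {
    "doc-0001": [
        EntityAnnotation(mention="Warszawie", lemma="Warszawa",
                         category="LOC", entity_id="ent-warszawa"),
    ],
}
```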

IMPORTANT: it is not mandatory to participate in the full task; e.g., monolingual system responses without lemmatization of the extracted named entities will also be evaluated and scored.

Participation

Teams that plan to participate should register by sending an email to bsnlp (ät) cs.helsinki.fi that includes the requested information about the team.

Registered teams will receive access to data and additional information about the shared task.

Updates about the shared task will be announced on the BSNLP 2019 web page and on the SIGSLAV mailing list.


Detailed task description and system response guidelines

PDF: Detailed definition of the Shared Task, including system response guidelines and relevant input/output formats (UPDATED on 22 January 2019).

Data Sets

Sample data

Sample data consisting of raw documents and corresponding annotations is available HERE.

Training data

Training data will consist of two moderately sized sets of annotated documents, each related to a specific topic (entity or event).

Registered participants will receive the full annotated corpora and further information via email after registration.

Participants are encouraged to exploit various external named-entity-related resources for the languages of the shared task, which can be found on the SIGSLAV web page.

Test data

The test data set will consist of two sets of raw documents, each related to a specific topic (revolving around an entity or event) that differs from the topics in the training data. The test data will be provided to registered participants on 25 April 2019; see the section on "Important Dates" below for further information.

The format will be exactly the same as for the training data.

Evaluation Metrics

Evaluation will be carried out on the system responses returned by the participants for the test corpora.

The named entity recognition (exact, case-insensitive matching) and lemmatization tasks will be evaluated in terms of precision, recall, and F1 scores. For named entity recognition in particular, two types of evaluation will be carried out:

Relaxed evaluation:  an entity mentioned in a given document is considered to be extracted correctly if the system response includes at least one annotation of a named mention of this entity (regardless of whether the extracted mention is in base form);

Strict evaluation:  the system response should include exactly one annotation for each unique form of a named mention of an entity that is referred to in a given document, i.e., capturing and listing all variants of an entity is required.
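A minimal sketch of how these two document-level regimes could be scored for a single document, assuming the gold annotations are given as a mapping from entity to its set of surface forms and the system response as a set of surface forms; this is only an illustration under those assumptions, not the official evaluation script.

```python
def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (zero-safe)."""
    return 2 * p * r / (p + r) if p + r else 0.0

def evaluate_document(gold: dict[str, set[str]],
                      system: set[str]) -> dict[str, tuple[float, float, float]]:
    """Score one document under the relaxed and strict regimes.

    gold:   entity id -> set of surface forms annotated for that entity in the document
    system: set of surface forms returned by the system for the document
    All comparisons are case-insensitive, mirroring the task description.
    """
    gold = {eid: {f.lower() for f in forms} for eid, forms in gold.items()}
    system = {f.lower() for f in system}
    gold_forms = set().union(*gold.values()) if gold else set()

    # Relaxed: a gold entity counts as extracted if at least one of its mentions is returned.
    rel_r = sum(1 for forms in gold.values() if forms & system) / max(len(gold), 1)
    rel_p = len(system & gold_forms) / max(len(system), 1)

    # Strict: every unique gold surface form must be returned (duplicates collapse in a set).
    str_r = len(gold_forms & system) / max(len(gold_forms), 1)
    str_p = len(gold_forms & system) / max(len(system), 1)

    return {"relaxed": (rel_p, rel_r, f1(rel_p, rel_r)),
            "strict": (str_p, str_r, f1(str_p, str_r))}
```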

The document-level and cross-language entity matching task will be evaluated using the Link-Based Entity-Aware metric (LEA) introduced by Moosavi and Strube (2016).
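As a compact illustration of LEA under the formulation of Moosavi and Strube (2016): each entity is viewed as the set of coreference links between its mentions, an entity's importance is its number of mentions, and its resolution score is the fraction of its links reproduced on the other side. The sketch below assumes entities are given as sets of mention identifiers and handles singletons via a self-link; it is not the official scorer.

```python
from itertools import combinations

def _links(entity: set) -> set:
    """Coreference links (unordered mention pairs) within an entity;
    a singleton is given a single self-link so that it still contributes."""
    if len(entity) == 1:
        return {frozenset(entity)}
    return {frozenset(pair) for pair in combinations(entity, 2)}

def _lea_component(source: list, target: list) -> float:
    """Importance-weighted resolution score of `source` entities against `target` entities."""
    total_importance = sum(len(e) for e in source)
    if total_importance == 0:
        return 0.0
    target_links = set().union(*[_links(e) for e in target]) if target else set()
    score = sum(len(e) * len(_links(e) & target_links) / len(_links(e)) for e in source)
    return score / total_importance

def lea(key: list, response: list) -> tuple[float, float, float]:
    """LEA recall, precision, and F1 for gold (`key`) and system (`response`) entities,
    each given as a list of sets of mention identifiers."""
    recall = _lea_component(key, response)
    precision = _lea_component(response, key)
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, precision, f
```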

Please note that the evaluation will be case-insensitive, i.e., all named mentions in the system responses and the test corpora will be lowercased.

Publication and Workshop

Participants in the shared task are invited to submit a paper to the BSNLP 2019 workshop (http://bsnlp.cs.helsinki.fi/), although submitting a paper is not mandatory for participating in the shared task. Papers must follow the workshop submission instructions (http://bsnlp.cs.helsinki.fi/cfp.html) and will undergo peer review. Acceptance will not depend on the results obtained in the shared task (which takes place after the paper submission deadline), but on the quality of the paper (clarity of presentation, etc.). Authors of accepted papers will be informed about the evaluation results of their systems in due time, before the deadline for submitting the final versions of their papers. Accepted papers will appear in the proceedings of BSNLP 2019, and their authors will present their solutions and results in a dedicated session on the shared task.

Important Dates

Organizers

Contributors