4th Shared Task on SlavNER

Recognition, Normalization, Classification and Cross-lingual linking of Named Entities in Slavic Languages

Important Dates

  • See here

    Description

    The 4th edition of the SlavNER Shared Task focuses on the analysis of Named Entities in multilingual Web documents in Slavic languages.

    Due to rich inflection, free word order, derivation, and other phenomena present in the Slavic languages, work on Named Entities is a challenging task. Fostering research & development on the problems of Named Entities — detecting mentions of names, lemmatization (normalization), classification, and cross-lingual matching — is crucial for cross-lingual information access and wider use of NLP in Slavic languages.

    The 4th edition of the shared task covers three languages:

    • Czech,
    • Polish,
    • Russian,

    and five types of named entities:

    • persons,
    • locations,
    • organizations,
    • events,
    • products.

    The Shared Task focuses on cross-lingual, document-level extraction of named entities — the systems should recognize, classify, and extract all named entity mentions in a document; detecting the position of each named entity mention is not required. Named-entity mentions should be lemmatized, and mentions referring to the same real-world object should be linked across documents and languages. The input text collection consists of sets of documents retrieved from the Web, each set being about a certain entity or event. The corpus was obtained by crawling the Web and parsing the HTML of documents.
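
    For a rough idea of what this means in practice, the sketch below (Python) collects document-level output: each unique mention form is reported once, together with its lemma, its type, and a cross-lingual entity identifier, and no character offsets are given. The tab-separated layout and the document-id header line are assumptions made for illustration only; the authoritative response format is defined in the task description PDF linked below.

        # Illustrative sketch only -- the exact response format is specified in the
        # task description PDF; the field layout below is an assumption.
        def write_response(doc_id, mentions, out_path):
            """mentions: iterable of (surface_form, lemma, category, entity_id) tuples
            gathered anywhere in the document; character positions are not needed."""
            seen = {}
            for surface, lemma, category, entity_id in mentions:
                # document-level output: each unique surface form is reported once
                seen.setdefault(surface.lower(), (surface, lemma, category, entity_id))
            with open(out_path, "w", encoding="utf-8") as out:
                out.write(doc_id + "\n")  # assumed document-id header line
                for surface, lemma, category, entity_id in seen.values():
                    out.write("\t".join((surface, lemma, category, entity_id)) + "\n")

    For example, the inflected Polish form "Warszawie" would be reported with the lemma "Warszawa", the type locations, and the same cross-lingual identifier as other mentions of the city in Czech or Russian documents.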

    IMPORTANT: it is NOT mandatory to participate in the full task; for example, monolingual responses without lemmatization of the extracted named entities can also be evaluated.

    See the details about the 1st edition (2017), the 2nd edition (2019), and the 3rd edition (2021) of this shared task.

    Participation

    Teams that intend to participate should register by sending an email to bsnlp@cs.helsinki.fi that includes the following information:

    • name of team,
    • names of team members,
    • contact person,
    • contact email.

    Detailed task description and system response guidelines

    PDF: Detailed definition of the Shared Task, including system response guidelines and relevant input/output formats.

    Data

    Training data

    The training data for this edition consist of the training and test data from the 2021 edition: the entire collection of links to the train/test data.

    Please see the previous editions as well: 2017 and 2019.
    Please refer to the task description papers from the previous editions, 2017 and 2019, for more details.

    Participants are encouraged to exploit various external named-entity resources for the languages of the Shared Task, which can be found at the SIGSLAV Web Page.

    Test Data

    The test data set will consist of a set of documents related to a specific topic (about an entity or event), which is different from the topics in the existing training data sets from 2017, 2019 and 2021.

    Tools

    Consistency Check

    A Java program that checks the consistency of the annotation files (including format, valid entity types, id assignment, etc.) can be found HERE.
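
    Independent of the provided tool, a minimal pre-submission sanity check could look like the Python sketch below; the field layout and the entity-type labels are assumptions and should be adjusted to the format defined in the task description PDF.

        # Minimal sanity check for a response file -- NOT the official tool.
        # Assumes one tab-separated annotation per line after a document-id header.
        VALID_TYPES = {"PER", "LOC", "ORG", "EVT", "PRO"}  # assumed type labels

        def check_response(path):
            errors = []
            with open(path, encoding="utf-8") as f:
                lines = [line.rstrip("\n") for line in f]
            if not lines or "\t" in lines[0]:
                errors.append("missing or malformed document-id header line")
            seen_forms = set()
            for number, line in enumerate(lines[1:], start=2):
                fields = line.split("\t")
                if len(fields) != 4:
                    errors.append(f"line {number}: expected 4 tab-separated fields, got {len(fields)}")
                    continue
                mention, lemma, category, entity_id = fields
                if category not in VALID_TYPES:
                    errors.append(f"line {number}: unknown entity type '{category}'")
                if not entity_id.strip():
                    errors.append(f"line {number}: empty cross-lingual id")
                if mention.lower() in seen_forms:
                    errors.append(f"line {number}: duplicate mention form '{mention}'")
                seen_forms.add(mention.lower())
            return errors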

    Evaluation Metrics

    A Java program that was used for evaluation can be found HERE. Please read the file readme.txt inside the archive for more details. Do not hesitate to contact us in case you encounter any problems.

    Evaluation is carried out on the system responses returned by the participants for the test corpora.

    Named entity recognition (exact, case-insensitive matching) and lemmatization tasks are evaluated in terms of precision, recall, and F1 scores. In particular, for named entity recognition, two types of evaluation are carried out:

    Relaxed evaluation: an entity mentioned in a given document is considered to be extracted correctly if the system response includes at least one annotation of a named mention of this entity (regardless of whether the extracted mention is in its base form);

    Strict evaluation: the system response should include exactly one annotation for each unique form of a named mention of an entity that is referred to in a given document, i.e., capturing and listing all variants of an entity is required.
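
    As a rough illustration of the difference (the Java scorer above is authoritative), both settings reduce to precision, recall, and F1 over sets of lower-cased items: in the strict setting the items are all unique surface forms of the mentions, while in the relaxed setting a gold entity counts as extracted as soon as any one of its mention forms appears in the system response.

        # Illustrative only -- not the official scorer.
        def prf(gold, predicted):
            """Precision, recall, F1 over two sets of lower-cased items."""
            true_positives = len(gold & predicted)
            precision = true_positives / len(predicted) if predicted else 0.0
            recall = true_positives / len(gold) if gold else 0.0
            f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
            return precision, recall, f1

        # Toy data: all gold mention forms of one entity vs. a system response.
        gold_forms = {"warszawa", "warszawie", "warszawy"}  # all inflected variants
        system_forms = {"warszawa", "warszawy"}             # what the system returned

        # Strict: every unique (lower-cased) form must be listed -- recall suffers here.
        print(prf(gold_forms, system_forms))

        # Relaxed: the entity counts as extracted because at least one of its forms
        # appears in the response, regardless of which inflected variant it is.
        entity_extracted = bool(gold_forms & system_forms)  # True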

    The document-level and cross-language entity matching tasks are evaluated using the Link-Based Entity-Aware metric (LEA) introduced in (Moosavi and Strube, 2016).
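
    In brief, LEA scores each entity by the fraction of its coreference links that the other side reproduces, weighted by the entity's size. A sketch of the metric following the cited paper, with K the key (gold) entities and R the response entities:

        \mathrm{LEA\ recall} = \frac{\sum_{k_i \in K} |k_i| \cdot \frac{\sum_{r_j \in R} \mathrm{link}(k_i \cap r_j)}{\mathrm{link}(k_i)}}{\sum_{k_i \in K} |k_i|},
        \qquad \mathrm{link}(e) = \frac{|e|\,(|e|-1)}{2}

    Precision is the symmetric quantity computed over the response entities, and F1 is the harmonic mean of the two.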

    Please note that the evaluation is case-insensitive. That is, all named mentions in the system responses and the test corpora are lower-cased.

    Publication and Workshop

    Participants in the shared task are invited to submit a paper to the SlavNLP 2023 workshop. Submitting a paper is not mandatory for participating in the Shared Task. Papers must follow the workshop submission instructions and will undergo regular peer review. Their acceptance will not depend on the results obtained in the shared task, but on the quality of the paper: clarity of presentation, etc. Authors of accepted papers will be informed about the evaluation results of their systems prior to the submission deadline. Accepted papers will appear in the ACL Anthology and will be presented at a session of the Slavic NLP 2023 Workshop specially dedicated to the Shared Task.

    The deadline for the shared task paper is the same as for the workshop papers, see the Important dates page.