Shared Task Data and Results
Description
The BSNLP 2017 shared task on multilingual named entity recognition aims at recognizing mentions of named entities in web documents in Slavic languages, their normalization / lemmatization, and cross-language matching.
Due to rich inflection, free word order, derivation, and other phenomena exhibited by Slavic languages, the detection of names and their lemmatization are challenging tasks. Fostering research and development on this problem, and on the closely related problem of entity linking, is of paramount importance for enabling multilingual and cross-lingual information access.
The shared task initially covers seven languages:
- Croatian,
- Czech,
- Polish,
- Russian,
- Slovak,
- Slovene,
- Ukrainian,
and focuses on the recognition of four types of named entities:
- persons,
- locations,
- organizations, and
- miscellaneous,
where the last category covers mentions of all other types of named entities, e.g., products and events. The task focuses on cross-lingual, document-level extraction of named entities: systems should recognize, classify, and extract all named entity mentions in a document, but detecting the position of each mention in the text is not required. The input text collection consists of sets of documents from the Web, each collection revolving around a particular entity. The corpus was obtained by posing a query to a search engine and parsing the HTML of the relevant documents. We envision extending the task to additional languages at a later stage, which goes beyond the scope of the 2017 BSNLP Task.
Participation: Instructions
Participation is open. Teams that plan to participate should register by sending an email to bsnlp@cs.helsinki.fi which includes the following information:
- team name
- team composition
- contact person
- contact email
System response files should be encoded in UTF-8.
Upon registration, participants will immediately receive access to data and additional information.
Updates about the shared task will be announced on the BSNLP 2017 web page at:
http://bsnlp-2017.cs.helsinki.fi/shared_task.html
and on the mailing list of SIGSLAV at:
https://groups.google.com/forum/?fromgroups#!forum/sigslav
Details about the named entities to be covered and the system response format are given in the following section.
System response guidelines and format
The task
The Multilingual Named Entity Recognition task consists of three subtasks:
- Named Entity Mention Detection and Classification: recognizing all unique named mentions of entities of four types:
- persons (PER),
- organizations (ORG),
- locations (LOC),
- miscellaneous (MISC).
- Name Normalization: computing for each detected named mention of an entity its corresponding base form/lemma,
- Entity Matching: assigning to each detected named mention of an entity an identifier such that detected mentions referring to the same real-world entity are assigned the same identifier, which we will refer to as the cross-lingual ID (a record sketch of these fields follows after this list).
There is no need to return positional information for named entity mentions.
Systems may be tuned to solving all three subtasks or a subset of the subtasks.
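Throughout the examples below we use Python purely for illustration; the shared task does not prescribe any implementation language. A minimal sketch of a record holding one annotation, with illustrative (not prescribed) field names, could look as follows:

from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    """One line of a system response; field names are illustrative only."""
    mention: str           # the named mention as it appears in the document
    base_form: str         # its lemma / base form (Name Normalization subtask)
    category: str          # one of PER, ORG, LOC, MISC
    cross_lingual_id: str  # shared by all mentions of one real-world entity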
The following rules/conventions should be followed when designing the system/solution:
1. General rules
a) The system should return only one annotation for all occurrences of the same form of a mention (e.g., an inflected variant, acronym, or abbreviation) of a named entity within the same document, unless different occurrences have different entity types (different readings) assigned to them.
b) The evaluation will be case-insensitive, so it does not matter whether the system response includes case information. In particular, if the text includes lowercase, uppercase, and/or mixed-case variants of the same named mention, the system response should include only one annotation for all of them. For instance, for “ISIS”, “isis”, and “Isis” (provided that they refer to the same named-entity type), only one annotation should be returned (e.g., “Isis”). Both conventions are illustrated in the sketch following the response examples below.
c) Recognition of nominal or pronominal mentions of entities is not part of the task.
2. Person names (PER)
a) Person names should not include titles, honorifics, and functions/positions. For example, in the text fragment "CEO Dr. Jan Kowalski", only "Jan Kowalski" should be recognized as a person name. However, initials and pseudonyms are considered named mentions of person names and should be recognized. Similarly, named references to groups of people (that do not have a formal organization unifying them) should also be recognized, e.g., “Ukrainians”.
b) Personal possessives derived from a named mention of a person should be classified as a person, and the base form of the corresponding person name should be extracted. For instance, for "Piskorskijev mejl" (Croatian) it is expected to recognize "Piskorskijev", classified with PER and extract the base form: "Piskorski".
c) Fictional persons and characters are considered persons.
3. Locations (LOC)
a) This category includes all toponyms and geopolitical entities (e.g., cities, counties, provinces, countries, regions, bodies of water, land formations, etc.) and named mentions of facilities (e.g., stadiums, parks, museums, theaters, hotels, hospitals, transportation hubs, churches, railroads, bridges, and other similar urban and non-urban facilities).
b) Even when a named mention of a facility refers to an organization, the LOC tag should be used. For example, from the text phrase "The Schiphol airport has acquired new electronic gates", the mention "The Schiphol airport" should be extracted and classified as LOC.
4. Organizations (ORG)
a) This category covers all kinds of organizations, such as political parties, public institutions, international organizations, companies, religious organizations, sports organizations, education and research institutions, etc.
b) Organization designators and potential mentions of the seat of the organization are considered to be part of the organization name. For instance, from the text fragment "Citi Handlowy w Poznaniu" (a bank in Poznań), the full phrase "Citi Handlowy w Poznaniu" should be extracted.
5. Miscellaneous (MISC)
a) This category covers all other named mentions of entities, e.g., product names (“Motorola Moto X”) and events (conferences, concerts, natural disasters, holidays such as “Święta Bożego Narodzenia”, Polish for Christmas, etc.).
b) This category does not include temporal and numerical expressions, nor identifiers such as email addresses, URLs, postal addresses, etc.
6. Other aspects
a) In case of complex named entities, consisting of nested named entities, only the top-most entity should be recognized. For example, from the text fragment "George Washington University" one should not extract "George Washington", but the entire name, namely, "George Washington University".
7. Input texts
The input texts are the result of downloading HTML pages (mainly news articles or fragments thereof) and converting them into plain text using a hybrid HTML parser. As a result, some extracted texts include not only the core body text of a Web page but also additional pieces of text (e.g., a list of labels from a menu, user comments, etc.) that do not necessarily constitute well-formed utterances in the given language. This applies to a small fraction of the texts, which were nonetheless kept in the training and test collections in order to maintain the flavour of real data. However, obvious HTML parser failures, e.g., extracted javascript code or empty texts, were removed from the data sets. For the task at hand, it is important to emphasize that the entire available text, including the title of the document and any material beyond the main text of the Web page, should be considered for matching names.
8. Output format
The system response should contain, for each file in the input test corpus, a corresponding file in the following format:
The first line should contain only the ID of the file in the test corpus.
Each subsequent line should be of the format:
named entity mention <TAB> base form <TAB> category <TAB> cross-lingual ID
Examples of system response:
ISIS file_16.txt (Polish):
16
Podlascy Czeczeni	Podlascy Czeczeni	PER	6
ISIS	ISIS	ORG	2
Rosji	Rosja	LOC	3
T.G.	T.G.	PER	8
Halida	Halida	PER	9
Z.G.	Z.G.	PER	10
A.Y.	A.Y.	PER	11
S A.	S A.	PER	12
Polsce	Polska	LOC	13
Niemczech	Niemcy	LOC	14
Agencji Bezpieczeństwa Wewnętrznego	Agencja Bezpieczeństwa Wewnętrznego	ORG	15
Turcji	Turcja	LOC	16
Warszawie	Warszawa	LOC	17
Białymstoku	Białystok	LOC	18
Łomży	Łomża	LOC	19
Czeczeni	Czeczeni	PER	20
białostockim sądzie okręgowym	Białostocki Sąd Okręgowy	ORG	22
Magazynu Kuriera Porannego	Magazyn Kuriera Porannego	ORG	23
ISIS file_159.txt (Russian):
159
Варвара Караулова	Варвара Караулова	PER	1
ИГИЛ	ИГИЛ	ORG	2
России	Россия	LOC	3
МГУ	МГУ	ORG	4
Московском окружном военном суде	Московский окружной военный суд	ORG	5
Караулова	Караулова	PER	1
Карауловой	Караулова	PER	1
Александру Иванову	Александра Иванова	PER	21
The cross-lingual identifiers may consist of an arbitrary sequence of alphanumeric characters. The form of the identifiers is not relevant; what matters for the evaluation is that mentions of the same entity across documents, in any language, are assigned the same cross-lingual identifier.
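To make rules 1a/1b and the response format above concrete, here is a minimal Python sketch (the function name and signature are our own, not part of the task specification) that deduplicates mentions case-insensitively per form and category, and writes one response file:

import csv

def write_response(path, file_id, annotations):
    """Write one response file: the file ID on the first line, then one
    tab-separated line per unique (lowercased form, category) pair."""
    seen = set()
    with open(path, "w", encoding="utf-8", newline="") as out:
        out.write(file_id + "\n")
        writer = csv.writer(out, delimiter="\t")
        for mention, base_form, category, xid in annotations:
            key = (mention.lower(), category)  # evaluation is case-insensitive
            if key in seen:
                continue  # rules 1a/1b: one annotation per form and reading
            seen.add(key)
            writer.writerow([mention, base_form, category, xid])

For example, write_response("file_16.out", "16", [("Rosji", "Rosja", "LOC", "3"), ("ROSJI", "Rosja", "LOC", "3")]) writes the ID line and a single annotation line, since the two input tuples differ only in case.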
Data sets
Training data
The training data consists of two sets of about 200 documents each, related to:
- Beata Szydło, the current prime minister of Poland, and
- ISIS, the so-called “Islamic State of Iraq and Syria” terror group
The corresponding set of links from which the data sets were created can be downloaded from “Beata Szydło Links” and “ISIS Links.”
Each of the documents in the collections will be available in the following format:
The first five lines of the document contain the following metadata:
<ID>
<LANGUAGE>
<CREATION-DATE>
<URL>
<TITLE>
The core text to be processed is available from the 6th line till the end of the file.
Please note that both the <CREATION-DATE> and <TITLE> information might be missing (the HTML parser might not have been able to extract them for various reasons). In such cases the corresponding lines are empty. An example of an input file can be downloaded here.
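A minimal sketch for reading one input document in this format (assuming the five metadata lines are always present, even if empty, as stated above; the function and key names are our own):

def read_document(path):
    """Parse one input file: five metadata lines followed by the text body."""
    with open(path, encoding="utf-8") as f:
        lines = f.read().split("\n")
    keys = ("id", "language", "creation_date", "url", "title")
    metadata = dict(zip(keys, (line.strip() for line in lines[:5])))
    body = "\n".join(lines[5:])  # core text; note that the title line should
    return metadata, body        # also be considered when matching names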
Registered participants will receive the full corpora and further information via email directly after registration.
Test data
The test data set will be provided to registered participants in February, in exactly the same format as the training data, i.e., the content of each collection will be focused on one particular entity. Please see the Section on Important Dates for further information.
UTF-8 Encoding is to be used for all system output.
Test data are now publicly available.
Evaluation Metrics
Evaluation will be carried out on the system responses returned by the participants for the test corpora. The test corpora will be distributed to the participants in February 2017 (see the Important Dates Section for details).
The Named Entity Recognition (exact case-insensitive matching) and Name Normalization (sometimes called “lemmatization”) tasks will be evaluated in terms of precision, recall, and F1 scores. In particular, as regards named entity recognition, two types of evaluation will be carried out:
- Relaxed evaluation: an entity mentioned in a given document is considered to be extracted correctly if the system response includes at least one annotation of a named mention of this entity (regardless of whether the extracted mention is in base form);
- Strict evaluation: the system response should include exactly one annotation for each unique form of a named mention of an entity that is referred to in a given document, i.e., capturing and listing all variants of an entity is required.
Analogously, the document-level and cross-language entity matching task will also be evaluated in terms of precision, recall and F1 scores.
Please note that the evaluation will be case-insensitive, i.e., all named mentions in the system responses and the test corpora will be lowercased.
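For orientation only (this is not the official scorer), the strict evaluation reduces to set comparison over lowercased items, and the relaxed evaluation credits an entity if any of its gold variants was returned; a sketch under these assumptions:

def prf(gold, system):
    """Precision, recall, and F1 over two sets of lowercased items."""
    tp = len(gold & system)
    p = tp / len(system) if system else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def relaxed_hits(gold_entities, system_forms):
    """gold_entities: one set of lowercased variant forms per gold entity.
    An entity counts as extracted if at least one variant was returned."""
    return sum(1 for variants in gold_entities if variants & system_forms)

Strict scoring would instead pass the full sets of unique (form, category) pairs from the gold data and the system response directly to prf.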
Publication and Workshop
Participants in the shared task are invited to submit a short paper to the BSNLP 2017 workshop ( http://bsnlp-2017.cs.helsinki.fi/ ), although submitting a paper is not mandatory for participating in the shared task. Papers must follow the workshop submission instructions ( http://bsnlp-2017.cs.helsinki.fi/cfp.html ) and will undergo peer review. Acceptance will depend not on the results obtained in the shared task (which will take place after the paper submission deadline) but on the quality of the paper (clarity of presentation, etc.). Authors of accepted papers will be informed of the evaluation results of their systems in due time and asked to include these results in the final versions of their papers. Accepted papers will appear in the Proceedings of BSNLP 2017, and their authors will present their solutions and results in a dedicated session on the shared task.
Important Dates: Workflow
12 December 2016 | Shared task announcement and release of training/trial data
12 December 2016 | First Call for Participation
21 December 2016 | Second Call for Participation
10 January 2017 | Final Call for Participation
| Deadline for submission of system papers (not mandatory)
16 February 2017 | Release of blind test data for registered participants
19 February 2017 | Submission of system responses
11 February 2017 | Notification of acceptance of system papers
21 February 2017 | Camera-ready system papers due (including the received results of the evaluation)
22 February 2017 | Dissemination of the results to participants
24 March 2017 | Phase II test data release
4 April 2017 | BSNLP 2017 workshop
Please note that the official evaluation results will be announced after the submission deadline for system description papers. Papers that are accepted should subsequently incorporate their evaluation results into the camera-ready version.
Please also note that for participants who are unable to meet the deadlines related to the BSNLP 2017 workshop, a post-workshop evaluation will be possible, as we intend to provide more test corpora for Slavic languages for this task in the future to foster research in this area.