Shared Task Data and Results
The BSNLP 2017 shared task on multilingual named entity recognition aims at recognizing mentions of named entities in web documents in Slavic languages, their normalization / lemmatization, and cross-language matching.
Due to rich inflection, free word order, derivation, and other phenomena exhibited by Slavic languages, the detection of names and their lemmatization is a challenging task. Fostering research and development on this problem, and on the closely related problem of entity linking, is of paramount importance for enabling multilingual and cross-lingual information access.
The shared task initially covers
and focuses on recognition of four types of named entities including:
- persons,
- locations,
- organizations, and
- miscellaneous,
where the last category covers mentions of all other types of named entities, e.g., products, events, etc. The task focuses on cross-lingual document-level extraction of named entities, i.e., the systems should recognize, classify, and extract all named entity mentions in a document, but detecting the position of each named entity mention in the text is not required. The input text collection consists of sets of documents from the Web, each set revolving around a particular entity. The corpus is obtained by posing a query to a search engine and parsing the HTML of the relevant documents. We envision an extension of the task to additional languages at a later stage, which will go beyond the scope of the 2017 BSNLP Task.
Participation is open. Teams that plan to participate should register by sending an email to firstname.lastname@example.org that includes the following information:
- team name
- team composition
- contact person
- contact email
System response files should be encoded in UTF-8.
Upon registration, participants will immediately receive access to data and additional information.
Updates about the shared task will be announced on the BSNLP 2017 web page at:
and on the mailing list of SIGSLAV at:
Details about the named entities to be covered and the system response format are given in the following section.
System response guidelines and format
The Multilingual Named Entity Recognition task consists of three subtasks:
- Named Entity Mention Detection and Classification: recognizing all unique named mentions of entities of four types:
- persons (PER),
- organizations (ORG),
- locations (LOC),
- miscellaneous (MISC).
- Name Normalization: mapping each named mention of an entity to its corresponding base form (also called its lemma).
- Entity Matching: assigning to each detected named mention of an entity an identifier in such a way that detected mentions of entities referring to the same real-world entity should be assigned the same identifier, which we will refer to as cross-lingual ID.
There is no need to return positional information for named entity mentions.
Systems may be tuned to solve all three subtasks or only a subset of them.
The following rules/conventions should be followed when designing the system/solution:
1. General rules
a) The system should not return more than one annotation for all occurrences of the same form of a mention (e.g., inflected variant, acronym or abbreviation) of a named entity within the same document, unless the different occurrences thereof have different entity types (different readings) assigned to them.
b) Since the evaluation will be case-insensitive, it is not relevant whether the system response includes case information. In particular, if the text includes lowercase, uppercase and/or mixed-letter named mention variants of the same entity, the system response should include only one annotation for all of these mentions. For instance, for “ISIS”, “isis”, and “Isis” (provided that they refer to the same named-entity type), only one annotation should be returned (e.g., “Isis”).
c) Recognition of nominal or pronominal mentions of entities is not part of the task.
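Rules 1a and 1b can be illustrated with a minimal sketch that collapses case variants of the same mention form while keeping different readings apart. The tuple layout and the function name `dedupe_annotations` are assumptions made for this illustration, not part of the task definition.

```python
from typing import Iterable, List, Tuple

# (mention, base form, category, cross-lingual ID); this layout is an assumption.
Annotation = Tuple[str, str, str, str]


def dedupe_annotations(annotations: Iterable[Annotation]) -> List[Annotation]:
    """Keep the first annotation for each (lowercased mention, category) pair."""
    seen = set()
    unique: List[Annotation] = []
    for mention, base, category, xid in annotations:
        key = (mention.lower(), category)  # evaluation is case-insensitive
        if key not in seen:
            seen.add(key)
            unique.append((mention, base, category, xid))
    return unique


raw = [
    ("ISIS", "ISIS", "ORG", "2"),
    ("isis", "ISIS", "ORG", "2"),
    ("Isis", "ISIS", "ORG", "2"),
    ("Isis", "Isis", "PER", "7"),  # different entity type, so it is kept
]
deduped = dedupe_annotations(raw)
```

Different inflected forms of the same entity are distinct keys here, so they remain separate annotations, which matches the strict evaluation setting described later.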
2. Person names (PER)
a) Person names should not include titles, honorifics, and functions/positions. For example, in the text fragment "CEO Dr. Jan Kowalski", only "Jan Kowalski" should be recognized as a person name. However, initials and pseudonyms are considered named mentions of person names and should be recognized. Similarly, named references to groups of people (that do not have a formal organization unifying them) should also be recognized, e.g., “Ukrainians”.
b) Personal possessives derived from a named mention of a person should be classified as a person, and the base form of the corresponding person name should be extracted. For instance, for "Piskorskijev mejl" (Croatian) the system is expected to recognize "Piskorskijev", classify it as PER, and extract the base form "Piskorski".
c) Fictional persons and characters are considered persons.
3. Locations (LOC)
a) This category includes all toponyms and geopolitical entities (e.g., cities, counties, provinces, countries, regions, bodies of water, land formations, etc.) and named mentions of facilities (e.g., stadiums, parks, museums, theaters, hotels, hospitals, transportation hubs, churches, railroads, bridges, and other similar urban and non-urban facilities).
b) Even when a named mention of a facility refers to an organization, the LOC tag should be used. For example, from the text phrase "The Schiphol airport has acquired new electronic gates" the mention "The Schiphol airport" should be extracted and classified as LOC.
4. Organizations (ORG)
a) This category covers all kinds of organizations, such as political parties, public institutions, international organizations, companies, religious organizations, sport organizations, education and research institutions, etc.
b) Organization designators and potential mentions of the seat of the organization are considered to be part of the organization name. For instance, from the text fragment "Citi Handlowy w Poznaniu" (a bank in Poznań), the full phrase "Citi Handlowy w Poznaniu" should be extracted.
5. Miscellaneous (MISC)
a) This category covers all other named mentions of entities, e.g., product names (such as “Motorola Moto X”) and events (conferences, concerts, natural disasters, and holidays such as “Święta Bożego Narodzenia”).
b) This category does not include temporal and numerical expressions, nor identifiers such as email addresses, URLs, postal addresses, etc.
6. Other aspects
a) In case of complex named entities, consisting of nested named entities, only the top-most entity should be recognized. For example, from the text fragment "George Washington University" one should not extract "George Washington", but the entire name, namely, "George Washington University".
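The nesting rule can be sketched as a filter over candidate spans. This assumes the system keeps character offsets internally (the task itself does not require returning them); the `Span` layout and the name `keep_topmost` are hypothetical.

```python
from typing import List, Tuple

# (start, end, category) with an exclusive end offset; an assumed internal representation.
Span = Tuple[int, int, str]


def keep_topmost(spans: List[Span]) -> List[Span]:
    """Drop every span that lies inside a strictly longer span."""
    result: List[Span] = []
    for start, end, category in spans:
        nested = any(
            o_start <= start and end <= o_end and (o_end - o_start) > (end - start)
            for o_start, o_end, _ in spans
        )
        if not nested:
            result.append((start, end, category))
    return result


text = "George Washington University"
candidates = [(0, 17, "PER"), (0, 28, "ORG")]  # inner person name, full organization name
top = keep_topmost(candidates)
names = [text[s:e] for s, e, _ in top]
```

Only the outer organization name survives the filter; the embedded person name is discarded.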
7. Input texts
8. Output format
The system response should contain for each file in the input test corpus a corresponding file in the following format:
The first line should contain only the ID of the file in the test corpus.
Each subsequent line should be of the format:
named entity mention <TAB> base form <TAB> category <TAB> cross-lingual ID
Examples of system response:
ISIS file_16.txt (Polish)
16
Podlascy Czeczeni	Podlascy Czeczeni	PER	6
ISIS	ISIS	ORG	2
Rosji	Rosja	LOC	3
T.G.	T.G.	PER	8
Halida	Halida	PER	9
Z.G.	Z.G.	PER	10
A.Y.	A.Y.	PER	11
S A.	S A.	PER	12
Polsce	Polska	LOC	13
Niemczech	Niemcy	LOC	14
Agencji Bezpieczeństwa Wewnętrznego	Agencja Bezpieczeństwa Wewnętrznego	ORG	15
Turcji	Turcja	LOC	16
Warszawie	Warszawa	LOC	17
Białymstoku	Białystok	LOC	18
Łomży	Łomża	LOC	19
Czeczeni	Czeczeni	PER	20
białostockim sądzie okręgowym	Białostocki Sąd Okręgowy	ORG	22
Magazynu Kuriera Porannego	Magazyn Kuriera Porannego	ORG	23
ISIS file_159.txt (Russian)
159
Варвара Караулова	Варвара Караулова	PER	1
ИГИЛ	ИГИЛ	ORG	2
России	Россия	LOC	3
МГУ	МГУ	ORG	4
Московском окружном военном суде	Московский окружной военный суд	ORG	5
Караулова	Караулова	PER	1
Карауловой	Караулова	PER	1
Александру Иванову	Александра Иванова	PER	21
The cross-lingual identifiers may consist of an arbitrary sequence of alphanumeric characters. The form of the identifiers is not relevant; what matters for the evaluation is that mentions of the same entity across documents, in any language, are assigned the same cross-lingual identifier.
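The output format described above (file ID on the first line, then one tab-separated annotation per line) could be produced and read back roughly as follows. This is only a sketch: the function names `format_response` and `parse_response` are invented for the illustration, and no escaping of tabs inside mentions is attempted.

```python
from typing import List, Tuple

# (mention, base form, category, cross-lingual ID)
Annotation = Tuple[str, str, str, str]


def format_response(file_id: str, annotations: List[Annotation]) -> str:
    """Build the response text: the file ID, then one TAB-separated line per annotation."""
    lines = [file_id]
    for annotation in annotations:
        lines.append("\t".join(annotation))
    return "\n".join(lines) + "\n"


def parse_response(content: str) -> Tuple[str, List[Annotation]]:
    """Inverse of format_response; blank lines are skipped."""
    lines = content.splitlines()
    file_id = lines[0].strip()
    annotations: List[Annotation] = []
    for line in lines[1:]:
        if not line.strip():
            continue
        mention, base, category, xid = line.split("\t")
        annotations.append((mention, base, category, xid))
    return file_id, annotations


rows = [
    ("Варвара Караулова", "Варвара Караулова", "PER", "1"),
    ("ИГИЛ", "ИГИЛ", "ORG", "2"),
]
response = format_response("159", rows)
```

When saving the result, the file should be opened with an explicit encoding, e.g. `open(path, "w", encoding="utf-8")`, since UTF-8 is required for system output.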
The training data consists of two sets of about 200 documents each, related to:
- Beata Szydło, the current prime minister of Poland, and
- ISIS, the so-called “Islamic State of Iraq and Syria” terror group
The corresponding set of links from which the data sets were created can be downloaded from “Beata Szydło Links” and “ISIS Links.”
Each of the documents in the collections will be available in the following format:
The first five lines of the document contain the following metadata,
The core text to be processed is available from the 6th line till the end of the file.
Please note that both the <CREATION-DATE> and the <TITLE> information might be missing (since the HTML parsers might not have been able to extract it for various reasons). In such cases the corresponding lines are empty. An example of an input file can be downloaded here.
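A reader for this input layout might look like the sketch below. It only separates the five metadata lines from the core text and deliberately assigns no meaning to individual metadata lines; the function name `split_document` and the sample values are placeholders, not real corpus data.

```python
from typing import List, Tuple


def split_document(raw: str) -> Tuple[List[str], str]:
    """Return (five metadata lines, core text starting at the 6th line)."""
    lines = raw.split("\n")
    metadata = [line.strip() for line in lines[:5]]
    metadata += [""] * (5 - len(metadata))  # pad if the document is unexpectedly short
    text = "\n".join(lines[5:])
    return metadata, text


# Placeholder document: metadata lines 3 and 5 are empty, as happens when a value is missing.
sample = "field-1\nfield-2\n\nfield-4\n\nFirst sentence.\nSecond sentence."
metadata, text = split_document(sample)
```

Empty metadata lines are preserved as empty strings, so the position of each field stays stable even when a value could not be extracted.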
Registered participants will receive the full corpora and further information via email directly after registration.
The test data set will be provided to registered participants in February in exactly the same format as the training data, i.e., the content of each collection will be focused on one particular entity. Please see the Important Dates section for further information.
UTF-8 Encoding is to be used for all system output.
Test data are now publicly available
Evaluation will be carried out on the system response returned by the participants for the test corpora. The test corpora will be distributed to the participants in February 2017 (see Important Dates Section for details).
Named entity recognition (exact case-insensitive matching) and Name Normalization (sometimes called “lemmatization”) tasks will be evaluated in terms of precision, recall, and F1 scores. In particular, as regards named entity recognition, two types of evaluations will be carried out:
- Relaxed evaluation: an entity mentioned in a given document is considered to be extracted correctly if the system response includes at least one annotation of a named mention of this entity (regardless of whether the extracted mention is the base form);
- Strict evaluation: the system response should include exactly one annotation for each unique form of a named mention of an entity that is referred to in a given document, i.e., capturing and listing all variants of an entity is required.
Analogously, the document-level and cross-language entity matching task will also be evaluated in terms of precision, recall and F1 scores.
Please note that the evaluation will be case-insensitive. That is, all named mentions in the system responses and the test corpora will be lowercased.
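To make the measures concrete, the following sketch computes precision, recall, and F1 for the strict setting on one document, plus recall for the relaxed setting. It is not the official scorer: the data layout (mention and category pairs, gold forms grouped by a gold entity key) and all function names are assumptions made for the illustration.

```python
from typing import Dict, Set, Tuple

Form = Tuple[str, str]  # (mention, category)


def prf(true_pos: int, n_system: int, n_gold: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from raw counts, with 0.0 for empty denominators."""
    p = true_pos / n_system if n_system else 0.0
    r = true_pos / n_gold if n_gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def lower_forms(forms: Set[Form]) -> Set[Form]:
    return {(mention.lower(), category) for mention, category in forms}


def strict_scores(system: Set[Form], gold: Set[Form]) -> Tuple[float, float, float]:
    """Every unique gold form must be matched exactly (case-insensitively)."""
    s, g = lower_forms(system), lower_forms(gold)
    return prf(len(s & g), len(s), len(g))


def relaxed_recall(system: Set[Form], gold_entities: Dict[str, Set[Form]]) -> float:
    """An entity counts as found if at least one of its forms is in the response."""
    s = lower_forms(system)
    found = sum(1 for forms in gold_entities.values() if lower_forms(forms) & s)
    return found / len(gold_entities) if gold_entities else 0.0


gold_entities = {
    "e1": {("Rosji", "LOC"), ("Rosja", "LOC")},
    "e2": {("ISIS", "ORG")},
}
gold = set().union(*gold_entities.values())
system = {("rosji", "LOC"), ("isis", "ORG"), ("Halida", "PER")}

p, r, f = strict_scores(system, gold)
rr = relaxed_recall(system, gold_entities)
```

In this toy case the system finds two of the three gold forms and returns one spurious annotation, so strict precision, recall, and F1 are all 2/3, while relaxed recall is 1.0 because each gold entity is matched by at least one form.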
Publication and Workshop
Participants in the shared task are invited to submit a short paper to the BSNLP 2017 workshop ( http://bsnlp-2017.cs.helsinki.fi/ ), although submitting a paper is not mandatory for participation in the shared task. Papers must follow the workshop submission instructions ( http://bsnlp-2017.cs.helsinki.fi/cfp.html ) and will undergo peer review. Their acceptance will not depend on the results obtained in the shared task (which will take place after the deadline for submitting the papers), but on the quality of the paper (clarity of presentation, etc.). Authors of accepted papers will be informed about the evaluation results of their systems in due time and asked to include these results in the final version of their papers. Accepted papers will appear in the Proceedings of BSNLP 2017. Authors of accepted papers will present their solutions and results in a dedicated session on the shared task.
Important Dates: Workflow
12 December 2016: Shared task announcement and release of training/trial data
12 December 2016: First Call for Participation
21 December 2016: Second Call for Participation
10 January 2017: Final Call for Participation
Deadline for submission of system papers (not mandatory)
16 February 2017: Release of blind test data for registered participants
19 February 2017: Submission of system responses
11 February 2017: Notification of acceptance of system papers
21 February 2017: Camera-ready system papers due (including the received results of the evaluation)
22 February 2017: Dissemination of the results to participants
24 March 2017: Phase II test data release
4 April 2017: BSNLP 2017 workshop
Please note that the official evaluation results will be announced after the submission deadline for system description papers. Papers that are accepted should subsequently incorporate their evaluation results into the camera-ready version.
Please also note that for participants who are not able to meet the deadlines related to the BSNLP 2017 workshop, a post-workshop evaluation will be possible, as we intend to provide more test corpora for Slavic languages for this task in the future to foster research in this area.