The BSNLP 2017 shared task on multilingual named entity recognition, normalization, and cross-language matching in web documents in Slavic languages was jointly organized by the Competence Centre on Text Mining and Analysis of the Joint Research Centre of the European Commission, the University of West Bohemia, the University of Helsinki, and the University of Zagreb.
Data and code
Please cite the shared task paper if you use these data or code.
Two datasets were prepared for evaluation, each consisting of documents extracted from the web and related to a given entity. One dataset contains documents related to Donald Trump, the recently elected President of the United States, and the second dataset contains documents related to the European Commission.
The test datasets were created as follows. For each “focus” entity, we posed a separate search query to Google in each of the seven target languages. Each query returned links to documents only in the language of interest. We extracted the first 100 links returned by the search engine, removed duplicate links, downloaded the corresponding HTML pages (mainly news articles or fragments thereof), and converted them into plain text using a hybrid HTML parser.
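The link-filtering and text-conversion steps described above can be illustrated with a minimal sketch. It is not the organizers' tooling: the hybrid HTML parser used for the shared task is not described here, so Python's standard `html.parser` stands in for it. The names `dedupe_links` and `html_to_text` are hypothetical. Downloading is omitted so that the example runs offline.

```python
from html.parser import HTMLParser


def dedupe_links(links, limit=100):
    """Take the first `limit` links, then drop duplicates, preserving order."""
    seen = set()
    unique = []
    for link in links[:limit]:
        if link not in seen:
            seen.add(link)
            unique.append(link)
    return unique


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> content.

    Assumption: a plain stand-in for the task's hybrid HTML parser.
    """

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.parts.append(text)


def html_to_text(html):
    """Convert an HTML string to plain text, one text node per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


if __name__ == "__main__":
    links = ["http://a.example/1", "http://b.example/2", "http://a.example/1"]
    print(dedupe_links(links))
    print(html_to_text("<p>Hello</p><script>var x = 1;</script><p>World</p>"))
```

A sketch like this leaves navigation menus and other page furniture in the output, which is consistent with the documents being only partially cleaned before manual selection.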
From the resulting set of partially “cleaned” documents, circa 20–25 documents were selected for each language and topic to prepare the final test datasets. Annotations for Croatian, Czech, Polish, Russian, and Slovene were made by native speakers; annotations for Slovak were made by native speakers of Czech capable of understanding Slovak. Annotations for Ukrainian were made partly by native speakers and partly by near-native speakers of Ukrainian. Cross-lingual alignment of the entity identifiers was performed by two annotators.
For more details, please consult the shared task paper: Jakub Piskorski, Lidia Pivovarova, Jan Šnajder, Josef Steinberger, and Roman Yangarber. The First Cross-Lingual Challenge on Recognition, Normalization, and Matching of Named Entities in Slavic Languages. BSNLP, 2017 (bib)