The LOD Wikilinks corpus

The Wikilinks corpus[1] is a very large-scale coreference resolution corpus. It contains over 40 million mentions of over 3 million entities. Mentions are manually labeled links to the respective Wikipedia pages, found in natural language text. They were obtained via a web crawl and aggregated, together with the pages' source, into the extended corpus of over 180 GB.

We converted the corpus into the NLP Interchange Format (NIF)[2] and publish it here as Linked Open Data, RDF dumps and an accompanying CSV file. This document describes the format and how to use the corpus.


Every webpage in the corpus was parsed. The text of the HTML element surrounding each Wikipedia link was extracted; if a page contained more than one link, the extracted texts were concatenated. The position of each link in these texts was annotated via string offsets, and the position of the HTML element containing the link was annotated with an XPath expression. For every link to Wikipedia, the respective DBpedia page was included as a link, and the DBpedia ontology classes of the linked resource were added as well. Where a mapping exists, NERD core classes were also added.
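The offset bookkeeping described above can be sketched as follows. This is a simplified illustration, not the actual conversion code; the `"[...]"` separator is the one used in the NIF context strings described below, and its exact spacing is an assumption:

```python
# Sketch: concatenate the texts of the HTML elements that contain
# Wikipedia links and record each link's position in the result
# via string offsets. Simplified illustration, not the real code.

SEP = "[...]"  # snippet separator; exact spelling assumed from the docs

def build_context(snippets):
    """snippets: list of (element_text, [(anchor_text, dbpedia_url), ...]).
    Returns the concatenated context string plus link annotations as
    (begin, end, anchor_text, dbpedia_url) tuples."""
    parts, links, offset = [], [], 0
    for text, anchors in snippets:
        for anchor, url in anchors:
            begin = offset + text.index(anchor)
            links.append((begin, begin + len(anchor), anchor, url))
        parts.append(text)
        offset += len(text) + len(SEP)
    return SEP.join(parts), links

context, links = build_context([
    ("Barack Obama visited Berlin.",
     [("Barack Obama", "http://dbpedia.org/resource/Barack_Obama")]),
    ("The capital of Germany is Berlin.",
     [("Berlin", "http://dbpedia.org/resource/Berlin")]),
])
for begin, end, anchor, _ in links:
    assert context[begin:end] == anchor  # offsets recover the anchor text
```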


The data is available in the Apache file system. However, for ease of use, it is also available as a number of gzipped dump files. Additionally, there is a gzipped CSV file containing the core of the data.

Linked Data Format

We are using NIF 2.0. Every webpage's information is available as a LOD resource. Take, for example, this one:,3680.

First, there is a nif:Context resource. It contains the text string in which the links were originally found in nif:isString, the length of this string in nif:endIndex, and the URL of the page in nif:sourceUrl. In this case, the links are found in more than one element, so the text string is divided into nif:Snippets by "[...]", which we call nif:SnippetSeparators. A snippet is declared as a new resource, denoting its position in the nif:Context string with beginIndex and endIndex. It also links to the position on the original webpage via an XPath expression in the nif:wasConvertedFrom property.
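Given a context string, the snippet boundaries can be recovered from the separator. A minimal sketch, assuming the separator is spelled exactly "[...]" with no surrounding whitespace:

```python
SEP = "[...]"  # nif:SnippetSeparator; exact spelling assumed

def snippet_offsets(context: str):
    """Return (beginIndex, endIndex) pairs for the snippets of a
    nif:Context string, split on the separator."""
    offsets, begin = [], 0
    for part in context.split(SEP):
        offsets.append((begin, begin + len(part)))
        begin += len(part) + len(SEP)
    return offsets

ctx = "First snippet.[...]Second snippet."
assert snippet_offsets(ctx) == [(0, 14), (19, 34)]
```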

Links themselves are resources of type nif:Phrase or nif:Word (depending on the number of words they contain). They again link to the context they reference and to their position in its string via offsets. They also link to the respective DBpedia page via itsrdf:taIdentRef. DBpedia ontology classes are linked with itsrdf:taClassRef. If a core class (a class that is not a subclass of another class, like "Person" or "Location"; these may also be called named entity types) can be found, it is specifically linked via nif:taNerdCoreClassRef.
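The choice between nif:Word and nif:Phrase depends only on the token count of the anchor text; a sketch (whitespace tokenization is our assumption):

```python
def nif_link_type(anchor_text: str) -> str:
    """Return the NIF class for a link's anchor text: a single token
    is a nif:Word, multiple tokens a nif:Phrase. Tokenizing on
    whitespace is an assumption of this sketch."""
    return "nif:Word" if len(anchor_text.split()) == 1 else "nif:Phrase"

assert nif_link_type("Berlin") == "nif:Word"
assert nif_link_type("Barack Obama") == "nif:Phrase"
```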

URIs follow the NIF API specification. The webservice that delivers them supports only one parameter combination at the moment: "?t=url&f=html&i=", followed by the website's URL. Only websites found in the Wikilinks corpus are available. Note that many of the websites themselves are no longer online.
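A request URI for the webservice can be assembled like this. BASE is a placeholder, since the service address is not reproduced here, and percent-encoding the page URL is our assumption:

```python
from urllib.parse import quote

# Placeholder; substitute the actual address of the webservice.
BASE = "http://example.org/wikilinks-nif"

def request_uri(page_url: str) -> str:
    """Build a request for the webservice. The only supported
    parameter combination is ?t=url&f=html&i=<website URL>.
    Percent-encoding the page URL is an assumption of this sketch."""
    return BASE + "?t=url&f=html&i=" + quote(page_url, safe="")

print(request_uri("http://www.example.com/article.html"))
```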


The smallest data compilation we can supply is a gzipped CSV file. Its format is:

"Link text","DBpedia URL","Linked Data URL","NE Type"

where Linked Data URL is a link to the respective resource on this server and NE Type is the type of the resource (e.g. Person, Location). Note that the type is missing in many cases, due to the sheer size of the dataset and the heterogeneity of the data itself.
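Reading the gzipped CSV needs nothing beyond the Python standard library; the dump's filename is whatever you saved it as:

```python
import csv
import gzip

def iter_links(path):
    """Yield (link_text, dbpedia_url, linked_data_url, ne_type) tuples
    from the gzipped CSV dump at `path`. The NE type field is empty
    for many rows, as noted above."""
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as f:
        for row in csv.reader(f):
            yield tuple(row)
```

For example, `sum(1 for *_, ne_type in iter_links(path) if ne_type)` counts the rows that do carry a named entity type.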

Size, License, Publications

We provide 533,016,300 triples for 10,626,762 pages with 31,542,468 links in LOD, with a total dataset size of 79 GB (12 GB gzipped dumps). The gzipped CSV is 1 GB in size. Use of this dataset is free, as in free speech as well as in free beer. However, you may want to cite the original authors of the dataset:

[1] S. Singh, A. Subramanya, F. Pereira, A. McCallum
Wikilinks: A Large-scale Cross-Document Coreference Corpus Labeled via Links to Wikipedia
University of Massachusetts Amherst, CMPSCI Technical Report, UM-CS-2012-015, 2012

as well as the authors of this dataset:

[2] S. Hellmann, J. Lehmann, S. Auer, M. Brümmer
Integrating NLP using Linked Data
Proceedings of the 12th International Semantic Web Conference, Sydney, Australia, October 2013