The DBpedia abstract corpus

Wikipedia is the most important and comprehensive source of open, encyclopedic knowledge. The English Wikipedia alone features over 4,280,000 entities described by basic data points, so-called infoboxes, as well as natural language texts. The DBpedia project has been extracting, mapping, converting and publishing Wikipedia data since 2007, establishing the LOD cloud and becoming its center in the process. The article texts, and the data they may contain, have received far less attention, although they make up the largest part of most articles in terms of writing effort, informational content and size. Only the text of the first, introductory section of each article, called the abstract, is extracted and included in DBpedia.

Links inside the articles are only extracted as an unordered bag, indicating an unspecified relation between the linking and the linked article, but neither where in the text the linked article was mentioned nor which relation holds between the two. As the links are set by the Wikipedia contributors themselves, they represent entities intellectually disambiguated by URL. This makes extracting the abstracts, including the links and their exact positions in the text, an interesting opportunity to create a corpus usable for, among other things, the evaluation of NER and NEL algorithms.

Corpus content

This corpus contains a conversion of Wikipedia abstracts in seven languages (Dutch, English, French, German, Italian, Japanese and Spanish) into the NLP Interchange Format (NIF) [1]. The corpus contains the abstract texts, as well as the position, surface form and linked article of every link in the text. As such, it contains entity mentions manually disambiguated to Wikipedia/DBpedia resources by native speakers, which makes it well suited for NER training and evaluation.

Furthermore, the abstracts represent a special form of text that lends itself to more sophisticated tasks, such as open relation extraction. Their encyclopedic style, following Wikipedia's guidelines on opening paragraphs, adds further interesting properties. The first sentence puts the article in a broader context. Most anaphora will refer to the original topic of the text, making them easier to resolve. Finally, should the same string occur with a different meaning, Wikipedia guidelines suggest that the new meaning should again be linked for disambiguation. In short: this type of text is highly interesting.

Data

All corpora come in a number of gzipped Turtle files. Linked Data is not available at the moment.

Dutch

Dutch corpus, available at http://wiki-link.nlp2rdf.org/abstracts/nl.

Size: 82,062,834 triples
Abstracts: 1,740,494
Average abstract length: 317.82 characters
Links: 8,644,715

English

English corpus, available at http://wiki-link.nlp2rdf.org/abstracts/en.

Size: 279,718,143 triples
Abstracts: 4,415,993
Average abstract length: 523.86 characters
Links: 31,101,184

French

French corpus, available at http://wiki-link.nlp2rdf.org/abstracts/fr.

Size: 83,359,452 triples
Abstracts: 1,476,876
Average abstract length: 349.73 characters
Links: 9,127,782

German

German corpus, available at http://wiki-link.nlp2rdf.org/abstracts/de.

Size: 107,764,449 triples
Abstracts: 1,556,343
Average abstract length: 471.88 characters
Links: 12,108,812

Italian

Italian corpus, available at http://wiki-link.nlp2rdf.org/abstracts/it.

Size: 54,557,207 triples
Abstracts: 907,329
Average abstract length: 398.31 characters
Links: 6,031,300

Japanese (beta)

Japanese corpus, available at http://wiki-link.nlp2rdf.org/abstracts/ja.

Size: 59,647,215 triples
Abstracts: 909,387
Average abstract length: 154.94 characters
Links: 6,660,236

Spanish

Spanish corpus, available at http://wiki-link.nlp2rdf.org/abstracts/es.

Size: 76,422,857 triples
Abstracts: 1,038,639
Average abstract length: 517.66 characters
Links: 8,644,715

Data Format

We are using NIF 2.0.

First, there is a nif:Context resource. It holds the abstract text in nif:isString, the length of this string in nif:endIndex, and the URL of the source page in nif:sourceUrl.
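For illustration, here is a minimal sketch of such a context resource in Turtle. The resource URI, the abstract string and the indices are invented for this example and do not necessarily reflect the corpus' actual URI scheme:

@prefix nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# hypothetical context resource covering one abstract
<http://example.org/Berlin/abstract#char=0,50>
    a nif:Context ;
    nif:isString   "Berlin is the capital and largest city of Germany."@en ;
    nif:beginIndex "0"^^xsd:nonNegativeInteger ;
    nif:endIndex   "50"^^xsd:nonNegativeInteger ;
    nif:sourceUrl  <http://en.wikipedia.org/wiki/Berlin> .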

The links themselves are resources of type nif:Phrase or nif:Word (depending on the number of words they contain). They point to the context they occur in via nif:referenceContext, give their surface form in nif:anchorOf, and mark their position in the context string via the nif:beginIndex and nif:endIndex offsets. They also link the respective DBpedia page via itsrdf:taIdentRef.
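Continuing the invented example from above, a link whose surface form is "Germany" might look like the following sketch (again illustrative URIs, not the corpus' actual ones):

@prefix nif:    <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .

# hypothetical link resource, anchored in the context by character offsets
<http://example.org/Berlin/abstract#char=42,49>
    a nif:Word ;
    nif:referenceContext <http://example.org/Berlin/abstract#char=0,50> ;
    nif:anchorOf   "Germany" ;
    nif:beginIndex "42"^^xsd:nonNegativeInteger ;
    nif:endIndex   "49"^^xsd:nonNegativeInteger ;
    itsrdf:taIdentRef <http://dbpedia.org/resource/Germany> .

Given these offsets, the surface form can always be recovered by taking the substring of the context's nif:isString between nif:beginIndex and nif:endIndex.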

Further languages and domain-specific corpora

Compiling a new language is quite a process, but you can send requests to me. For existing languages, I can easily generate domain-specific corpora using lists of resource URIs, too!

License, Publications, Contact

The license, as per Wikipedia and DBpedia, is CC-BY. For NIF, please cite [1]. There is no publication for this corpus yet, so if you have any cool use cases, feel free to contact me at bruemmer@informatik.uni-leipzig.de.

[1] S. Hellmann, J. Lehmann, S. Auer, M. Brümmer: Integrating NLP using Linked Data. Proceedings of the 12th International Semantic Web Conference, Sydney, Australia, October 2013.

Martin Brümmer, AKSW, University of Leipzig, 2015