Datasets

From Meta, a Wikimedia project coordination wiki

This page lists various places that host Wikimedia datasets, along with tools for working with them.

You can also store tabular and map data using Commons Datasets, and use it from all wikis via Lua and Graphs.

List

Dataset Description URL Last Updated
Official Wikipedia database dumps [1] Present
Parsoid exposes the semantics of content in fully rendered HTML+RDFa, and is available for various languages and projects: enwiki, frwiki, ..., frwiktionary, dewikibooks, ... The prefix pattern is the Wikimedia database name. Users include VE, Flow, Kiwix and Google. Parsoid also supports converting (possibly modified) HTML back to wikitext without introducing dirty diffs. [2] Dead
Taxobox - Wikipedia Infoboxes with Taxonomic information on Animal Species [3] Dead
Wikipedia³ is a conversion of the English Wikipedia into RDF. It's a monthly updated dataset containing around 47 million triples [4] Dead
DBpedia Facts extracted from Wikipedia infoboxes and link structure in RDF format (Auer et al., 2007) [5] 2019
Multiple data sets (English Wikipedia articles that have been transformed into XML) [6] Dead
This is an alphabetical list of film articles (or sections within articles about films). It includes made-for-television films [7] Dead
Using the Wikipedia page-to-page link database [8] Dead
Wikipedia: Lists of common misspellings/For machines [9] Dead
Apache Hadoop is a powerful open source software package designed for sophisticated analysis and transformation of both structured and unstructured complex data. [10] Dead
Wikipedia XML Data [11] 2015
Wikipedia Page Traffic Statistics (up to November 2015) [12] 2015
Complete Wikipedia edit history (up to January 2008) [13] 2008
Wikitech-l page counters [14] 2016
MusicBrainz Database [15] Dead
Datasets of network extracted from User Talk pages [16] 2011
Wikipedia Statistics [17] Present
List of articles created last month/week/day with most users contributing to article within the same period [18] Dead
Wikipedia Taxonomy automatically generated from the network of categories in Wikipedia (RDF Schema format) (Ponzetto and Strube, 2007a–c; Zirn et al., 2008) [19] Dead
Semantic Wikipedia: A snapshot of Wikipedia automatically annotated with named entity tags (Zaragoza et al., 2007) [20] Dead
Cyc to Wikipedia mappings: 50,000 automatically created mappings from Cyc terms to Wikipedia articles (Medelyan and Legg, 2008) [21] Dead
Topic indexed documents: A set of 20 Computer Science technical reports indexed with Wikipedia articles as topics. 15 teams of 2 senior CS undergraduates have independently assigned topics from Wikipedia to each article (Medelyan et al., 2008) [22] Dead
Wikipedia Page Traffic API [23] Present
Articles published using the Content Translation tool. Both detailed lists and summary statistics are available. [24] [25] 2022
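The raw page traffic statistics listed above are distributed as plain-text dump files; in the legacy pagecounts layout, each line holds four space-separated fields: project code, page title, view count, and bytes transferred. A minimal sketch of aggregating such lines with only the Python standard library (the sample lines here are invented for illustration; check a real dump's header for the exact field conventions):

```python
from collections import Counter

def parse_pagecounts(lines):
    """Sum view counts from legacy pagecounts lines: 'project title count bytes'."""
    views = Counter()
    for line in lines:
        parts = line.split(" ")
        if len(parts) != 4:
            continue  # skip malformed or truncated lines
        project, title, count, _bytes = parts
        views[(project, title)] += int(count)
    return views

# Invented sample data in the pagecounts layout:
sample = [
    "en Main_Page 42 1234567",
    "en Main_Page 8 234567",
    "de Wikipedia 5 98765",
]
totals = parse_pagecounts(sample)
print(totals[("en", "Main_Page")])  # 50
```

Streaming a real (gzipped) dump file line by line through the same function avoids holding the whole file in memory.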

Tools to extract data from Wikipedia

This table might be migrated to the Wikipedia article on knowledge extraction.

Tool Description URL Last Updated
Wikilytics Extracting the dumps into a NoSQL database [26] 2017
Wikipedia2text Extracting Text from Wikipedia [27] 2008
Traffic Statistics Wikipedia article traffic statistics [28] Dead
Wikipedia to Plain text Generating a Plain Text Corpus from Wikipedia [29] 2009
DBpedia Extraction Framework The DBpedia software that produces RDF data from over 90 language editions of Wikipedia and Wiktionary (highly configurable for other MediaWiki installations as well). [30] [31] github 2019
Wikiteam Tools for archiving wikis including Wikipedia github 2019
History Flow History flow is a tool for visualizing dynamic, evolving documents and the interactions of multiple collaborating authors [32] Dead
WikiXRay This tool includes a set of Python and GNU R scripts to obtain statistics, graphics and quantitative results for any Wikipedia language version [33] 2012
StatMediaWiki StatMediaWiki is a project that aims to create a tool to collect and aggregate information available in a MediaWiki installation. Results are static HTML pages including tables and graphics that can help to analyze the wiki status and development, or a CSV file for custom processing. [34] Dead
Java Wikipedia Library (JWPL) This is an open-source, Java-based application programming interface that provides access to all information contained in Wikipedia [35] 2016
Wikokit Wiktionary parser and visual interface github 2019
wiki-network Python scripts for parsing Wikipedia dumps with different goals github 2012
Pywikipediabot Python Wikipedia robot framework [36] 2019
WikiRelate API for computing semantic relatedness using Wikipedia (Strube and Ponzetto, 2006) [37] 2006
WikiPrep A Perl tool for preprocessing Wikipedia XML dumps (Gabrilovich and Markovitch, 2007) [38] 2014
W.H.A.T. Wikipedia Hybrid Analysis Tool An analytic tool for Wikipedia with two main functionalities: an article network and extensive statistics. It contains a visualization of the article networks and a powerful interface for analyzing the behavior of authors [39] 2013
QuALiM A question answering system. Given a question in natural language, it returns relevant passages from Wikipedia (Kaisser, 2008) [40] 2008
Koru A demo of a search interface that maps topics involved in both queries and documents to Wikipedia articles. Supports automatic and interactive query expansion (Milne et al., 2007) [41] 2007
Wikipedia Thesaurus A large-scale association thesaurus containing 78M associations (Nakayama et al., 2007a, 2008) [42] Dead
Wikipedia English–Japanese dictionary A dictionary returning translations from English into Japanese and vice versa, enriched with probabilities of these translations (Erdmann et al., 2008) [43] Dead
Wikify Automatically annotates any text with links to Wikipedia articles (Mihalcea and Csomai, 2007) [44] Dead
Wikifier Automatically annotates any text with links to Wikipedia articles describing named entities [45] Dead
Wikipedia Cultural Diversity Observatory Creates a dataset named Cultural Context Content (CCC) for each language edition with the articles that relate to its cultural context (geography, people, traditions, history, companies, etc.). [46] github 2019
Time-series graph of Wikipedia Wikipedia web network stored in Neo4J database. Pagecounts data stored in Apache Cassandra database. Deployment scripts and instructions use corresponding Wikimedia dumps. github [47] 2020
Basic python parsing of dumps A guide for how to parse Wikipedia dumps in python blog script 2017
Wiki Dump Reader A python package to extract text from Wikipedia dumps [48] 2019
MediaWiki Parser from Hell A python library to parse MediaWiki wikicode. docs github 2020
Mediawiki Utilities A collection of utilities for interfacing with MediaWiki:
  • mwapi - utilities for interacting with MediaWiki’s “action” API – usually available at /w/api.php. The most salient feature of this library is the mwapi.Session class that provides a connection session that sustains a logged-in user status and provides convenience functions for calling the MediaWiki API
  • mwdb - utilities for connecting to and querying a MediaWiki database.
  • mwxml - utilities for efficiently processing MediaWiki’s XML database dumps
  • mwreverts - utilities for detecting reverts and identifying the reverted status of edits to a MediaWiki wiki
  • mwsessions - utilities for grouping MediaWiki user actions into sessions. Such methods have been used to measure editor labor hours
  • mwdiffs - utilities for generating information about the difference between revisions.
  • mwoauth - a simple means of performing an OAuth handshake with a MediaWiki installation that has the OAuth extension installed.
  • mwtypes - set of standardized types to be used when processing MediaWiki data
  • mwpersistence - utilities for measuring content persistence and tracking authorship in MediaWiki revisions.
mediawiki github 2020
qwikidata A python utility for interacting with WikiData github 2020
Namespace Database A python utility which:
  • downloads Wikipedia dumps from the fastest mirror
  • partitions dumps so they are more manageable
  • extracts features on a namespace to a MySQL database
github 2020
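Several of the tools above (mwxml, the dump-parsing guide and packages) process the XML database dumps, whose format wraps each article in a page element containing a title and revision text. A stdlib-only sketch using xml.etree.ElementTree.iterparse to stream pages without loading the whole dump into memory; the tiny inline XML is a fabricated stand-in for a real dump, which additionally carries an export-schema XML namespace (stripped generically below) and many more fields per page:

```python
import io
import xml.etree.ElementTree as ET

def iter_pages(xml_file):
    """Stream (title, text) pairs from a MediaWiki-style XML dump, ignoring namespaces."""
    title, text = None, None
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop any '{uri}' XML-namespace prefix
        if tag == "title":
            title = elem.text
        elif tag == "text":
            text = elem.text
        elif tag == "page":
            yield title, text
            elem.clear()  # free the finished subtree so memory stays bounded

# Fabricated two-page dump for illustration:
dump = io.BytesIO(
    b"<mediawiki>"
    b"<page><title>Foo</title><revision><text>foo body</text></revision></page>"
    b"<page><title>Bar</title><revision><text>bar body</text></revision></page>"
    b"</mediawiki>"
)
pages = list(iter_pages(dump))
print(pages[0])  # ('Foo', 'foo body')
```

For a real dump you would pass a file object opened with bz2.open or gzip.open instead of the in-memory BytesIO.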
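The mwsessions entry above groups user actions into sessions by inactivity gaps. The usual approach, sketched here with only the standard library, is to start a new session whenever the gap between consecutive timestamps exceeds a cutoff; the one-hour cutoff below is a common convention in the editor-session literature, not necessarily mwsessions' exact default:

```python
def sessionize(timestamps, cutoff=3600):
    """Group Unix timestamps into sessions separated by gaps longer than cutoff seconds."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= cutoff:
            sessions[-1].append(ts)  # within the cutoff: same session
        else:
            sessions.append([ts])    # gap too large (or first event): new session
    return sessions

events = [0, 100, 200, 10_000, 10_050]
print(sessionize(events))  # [[0, 100, 200], [10000, 10050]]
```

Labor-hour estimates of the kind mentioned for mwsessions then follow from summing each session's duration (last timestamp minus first), typically plus a constant per-session startup allowance.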

See also