Datasets

From Meta, a Wikimedia project coordination wiki

This page lists sources of Wikimedia datasets and tools for working with them.

List

Dataset Description URL
Official Wikipedia database dumps [1]
Taxobox - Wikipedia infoboxes with taxonomic information on animal species [2]
Wikipedia³ is a conversion of the English Wikipedia into RDF. It is updated monthly and contains around 47 million triples [3]
DBpedia Facts extracted from Wikipedia infoboxes and link structure in RDF format (Auer et al., 2007) [4]
Multiple data sets (English Wikipedia articles that have been transformed into XML) [5]
This is an alphabetical list of film articles (or sections within articles about films). It includes made-for-television films [6]
Using the Wikipedia page-to-page link database [7]
Wikipedia: Lists of common misspellings/For machines [8]
Apache Hadoop is a powerful open source software package designed for sophisticated analysis and transformation of both structured and unstructured complex data. [9]


Wikipedia Hits [10]
Top 1000 Accessed Wikipedia Articles [11]
Wikipedia XML Data [12]
Wikipedia Page Traffic Statistics [13]
Complete Wikipedia edit history (up to January 2008) [14]
Wikitech-l page counters [15]
MusicBrainz Database [16]
Datasets of network extracted from User Talk pages [17]
Wikipedia Statistics [18]
List of articles created last month/week/day with most users contributing to article within the same period [19]
Wikipedia Taxonomy automatically generated from the network of categories in Wikipedia (RDF Schema format) (Ponzetto and Strube, 2007 a–c; Zirn et al., 2008) [20]
Semantic Wikipedia: A snapshot of Wikipedia automatically annotated with named entity tags (Zaragoza et al., 2007) [21]
Cyc to Wikipedia mappings: 50,000 automatically created mappings from Cyc terms to Wikipedia articles (Medelyan and Legg, 2008) [22]
Topic indexed documents: A set of 20 Computer Science technical reports indexed with Wikipedia articles as topics. Fifteen teams of two senior CS undergraduates independently assigned topics from Wikipedia to each report (Medelyan et al., 2008) [23]
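
The official database dumps listed above are distributed as XML, often too large to load into memory at once. The following is a minimal sketch of streaming through such a file page by page with Python's standard library. The function names `local_name` and `iter_pages` and the tiny `SAMPLE` document (including its namespace string) are assumptions made for illustration; a real dump has many more elements per page and may use a different schema version.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) for each <page> element, one page at a time."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        yield title, text
        # Free the finished page's subtree; the emptied element itself
        # stays attached to the root.
        elem.clear()


# Hypothetical two-page sample standing in for a real dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Alpha</title>
    <revision><text>First article.</text></revision>
  </page>
  <page>
    <title>Beta</title>
    <revision><text>Second article.</text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
for title, text in pages:
    print(title, len(text))
```

For a compressed dump, a file object from `bz2.open(path, "rb")` could be passed in place of the `io.BytesIO` wrapper, since `iterparse` accepts any binary file object.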

Tools to extract data from Wikipedia

This table might be migrated to the Knowledge Extraction Wikipedia article.

Tool Description URL
Wikilytics Extracting the dumps into a NoSQL database [24]
Wikipedia2text Extracting Text from Wikipedia [25]
Traffic Statistics Wikipedia article traffic statistics [26]
Wikipedia to Plain text Generating a Plain Text Corpus from Wikipedia [27]
Autocomplete Wikipedia Titles Autocomplete Wikipedia Article Titles API that returns an array of up to 100 completions for a given prefix. [28]
DBpedia Extraction Framework The DBpedia software that produces RDF data from over 90 language editions of Wikipedia and Wiktionary (highly configurable for other MediaWikis also). [29] [30]


Wikiteam Tools for archiving wikis including Wikipedia [31]
History Flow History flow is a tool for visualizing dynamic, evolving documents and the interactions of multiple collaborating authors [32]
WikiXRay This tool includes a set of Python and GNU R scripts to obtain statistics, graphics and quantitative results for any Wikipedia language version [33]
StatMediaWiki StatMediaWiki is a project that aims to create a tool to collect and aggregate information available in a MediaWiki installation. Results are static HTML pages including tables and graphics that can help to analyze the wiki status and development, or a CSV file for custom processing. [34]
Java Wikipedia Library (JWPL) An open-source, Java-based application programming interface that provides access to all information contained in Wikipedia [35]
Wikokit Wiktionary parser and visual interface [36]
wiki-network Python scripts for parsing Wikipedia dumps with different goals [37]
Pywikipediabot Python Wikipedia robot framework [38]
WikiRelate API for computing semantic relatedness using Wikipedia (Strube and Ponzetto, 2006) [39]
WikiPrep A Perl tool for preprocessing Wikipedia XML dumps (Gabrilovich and Markovitch, 2007) [40]
W.H.A.T. Wikipedia Hybrid Analysis Tool An analytic tool for Wikipedia with two main functionalities: an article network and extensive statistics. It contains a visualization of the article networks and a powerful interface to analyze the behavior of authors [41]
QuALiM A question answering system. Given a question in natural language, it returns relevant passages from Wikipedia (Kaisser, 2008) [42]
Koru A demo of a search interface that maps topics involved in both queries and documents to Wikipedia articles. Supports automatic and interactive query expansion (Milne et al., 2007) [43]
Wikipedia Thesaurus A large-scale association thesaurus containing 78M associations (Nakayama et al., 2007a, 2008) [44]
Wikipedia English–Japanese dictionary A dictionary returning translations from English into Japanese and vice versa, enriched with probabilities of these translations (Erdmann et al., 2008) [45]
Wikify Automatically annotates any text with links to Wikipedia articles (Mihalcea and Csomai, 2007) [46]
Wikifier Automatically annotates any text with links to Wikipedia articles describing named entities [47]
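
The Autocomplete Wikipedia Titles entry above describes an API that returns up to 100 completions for a given prefix. The sketch below only illustrates that behavior locally, using a binary search over a sorted list of titles; it is not the implementation behind the listed service, and the function name `complete` and the sample titles are assumptions.

```python
import bisect


def complete(sorted_titles, prefix, limit=100):
    """Return up to `limit` titles from `sorted_titles` that start with `prefix`."""
    # In a sorted list, all titles sharing a prefix sit in one contiguous
    # run that begins where the prefix itself would be inserted.
    start = bisect.bisect_left(sorted_titles, prefix)
    matches = []
    for title in sorted_titles[start:start + limit]:
        if not title.startswith(prefix):
            break
        matches.append(title)
    return matches


# Hypothetical title list; a real one could be built from a dump's page titles.
titles = sorted(["Hadoop", "Haskell", "Hawaii", "Helium", "Python"])
result = complete(titles, "Ha")
print(result)
```

Matching here is case-sensitive and relies on plain string ordering; a production service would likely normalize case and rank the results.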

See also