Datasets

This page lists various sources of Wikimedia datasets, along with tools for working with them.

List

Dataset Description URL
Official Wikipedia database dumps (a dump-parsing sketch follows this table) [1]
Parsoid exposes the semantics of content in fully rendered HTML+RDFa and is available for various languages and projects: enwiki, frwiki, ..., frwiktionary, dewikibooks, ... The prefix pattern is the Wikimedia database name. Users include VisualEditor, Flow, Kiwix and Google. Parsoid also supports converting (possibly modified) HTML back to wikitext without introducing dirty diffs (a fetch sketch follows this table). [2]
Taxobox - Wikipedia infoboxes with taxonomic information on animal species [3]
Wikipedia³ is a conversion of the English Wikipedia into RDF. It is a monthly-updated dataset containing around 47 million triples. [4]
DBpedia Facts extracted from Wikipedia infoboxes and link structure, in RDF format (Auer et al., 2007) [5]
Multiple data sets (English Wikipedia articles that have been transformed into XML) [6]
This is an alphabetical list of film articles (or sections within articles about films). It includes made-for-television films. [7]
Using the Wikipedia page-to-page link database [8]
Wikipedia: Lists of common misspellings/For machines [9]
Apache Hadoop is a powerful open source software package designed for sophisticated analysis and transformation of both structured and unstructured complex data. [10]
Wikipedia Hits [11]
Top 1000 Accessed Wikipedia Articles [12]
Wikipedia XML Data [13]
Wikipedia Page Traffic Statistics [14]
Complete Wikipedia edit history (up to January 2008) [15]
Wikitech-l page counters [16]
MusicBrainz Database [17]
Datasets of networks extracted from User Talk pages [18]
Wikipedia Statistics [19]
List of articles created last month/week/day with the most users contributing to the article within the same period [20]
Wikipedia Taxonomy automatically generated from the network of categories in Wikipedia (RDF Schema format) (Ponzetto and Strube, 2007a–c; Zirn et al., 2008) [21]
Semantic Wikipedia: A snapshot of Wikipedia automatically annotated with named entity tags (Zaragoza et al., 2007) [22]
Cyc to Wikipedia mappings: 50,000 automatically created mappings from Cyc terms to Wikipedia articles (Medelyan and Legg, 2008) [23]
Topic indexed documents: A set of 20 Computer Science technical reports indexed with Wikipedia articles as topics. 15 teams of 2 senior CS undergraduates have independently assigned topics from Wikipedia to each article (Medelyan et al., 2008) [24]
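
A minimal sketch of one common way to consume the database dumps listed above: it streams page titles and wikitext out of a compressed pages-articles dump using only the Python standard library. The dump filename and the export-0.10 XML namespace are assumptions; both depend on the dump you actually download.

 import bz2
 import xml.etree.ElementTree as ET
 
 DUMP = "enwiki-latest-pages-articles.xml.bz2"        # example filename, adjust to your dump
 NS = "{http://www.mediawiki.org/xml/export-0.10/}"   # namespace version varies between dumps
 
 with bz2.open(DUMP, "rb") as stream:
     # iterparse keeps memory bounded even for multi-gigabyte dumps
     for event, elem in ET.iterparse(stream, events=("end",)):
         if elem.tag == NS + "page":
             title = elem.findtext(NS + "title")
             text = elem.findtext(NS + "revision/" + NS + "text") or ""
             print(title, len(text))
             elem.clear()  # drop the processed <page> subtree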

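The Parsoid HTML+RDFa mentioned above can be fetched over HTTP; the sketch below goes through the public Wikimedia REST API's page/html endpoint. The article title and User-Agent string are arbitrary examples, and the endpoint path should be checked against the current REST API documentation.

 import urllib.parse
 import urllib.request
 
 title = "Wikipedia"  # example article
 url = "https://en.wikipedia.org/api/rest_v1/page/html/" + urllib.parse.quote(title)
 req = urllib.request.Request(url, headers={"User-Agent": "datasets-page-example/0.1"})
 with urllib.request.urlopen(req) as resp:
     html = resp.read().decode("utf-8")  # Parsoid HTML with RDFa annotations
 print(html[:300])
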
Tools to extract data from Wikipedia

This table might be migrated to the Knowledge extraction Wikipedia article.

Tool Description URL
Wikilytics Extracting the dumps into a NoSQL database [25]
Wikipedia2text Extracting Text from Wikipedia [26]
Traffic Statistics Wikipedia article traffic statistics [27]
Wikipedia to Plain text Generating a Plain Text Corpus from Wikipedia [28]
Autocomplete Wikipedia Titles An API that returns an array of up to 100 article-title completions for a given prefix (see the sketch after this table). [29]
DBpedia Extraction Framework The DBpedia software that produces RDF data from over 90 language editions of Wikipedia and Wiktionary (highly configurable for other MediaWikis also). [30] [31]
Wikiteam Tools for archiving wikis including Wikipedia [32]
History Flow History flow is a tool for visualizing dynamic, evolving documents and the interactions of multiple collaborating authors [33]
WikiXRay This tool includes a set of Python and GNU R scripts to obtain statistics, graphics and quantitative results for any Wikipedia language version [34]
StatMediaWiki StatMediaWiki is a project that aims to create a tool to collect and aggregate information available in a MediaWiki installation. Results are static HTML pages, including tables and graphics, that help analyze the wiki's status and development, or a CSV file for custom processing. [35]
Java Wikipedia Library (JWPL) An open-source, Java-based application programming interface that provides access to all information contained in Wikipedia [36]
Wikokit Wiktionary parser and visual interface [37]
wiki-network Python scripts for parsing Wikipedia dumps with different goals [38]
Pywikipediabot Python Wikipedia robot framework, now maintained as Pywikibot (see the sketch after this table) [39]
WikiRelate API for computing semantic relatedness using Wikipedia (Strube and Ponzetto, 2006) [40]
WikiPrep A Perl tool for preprocessing Wikipedia XML dumps (Gabrilovich and Markovitch, 2007) [41]
W.H.A.T. Wikipedia Hybrid Analysis Tool An analytic tool for Wikipedia with two main functionalities: an article network and extensive statistics. It contains a visualization of the article networks and a powerful interface to analyze the behavior of authors [42]
QuALiM A question answering system. Given a question in natural language, it returns relevant passages from Wikipedia (Kaisser, 2008) [43]
Koru A demo of a search interface that maps topics involved in both queries and documents to Wikipedia articles. Supports automatic and interactive query expansion (Milne et al., 2007) [44]
Wikipedia Thesaurus A large-scale association thesaurus containing 78M associations (Nakayama et al., 2007a, 2008) [45]
Wikipedia English–Japanese dictionary A dictionary returning translations from English into Japanese and vice versa, enriched with probabilities of these translations (Erdmann et al., 2008) [46]
Wikify Automatically annotates any text with links to Wikipedia articles (Mihalcea and Csomai, 2007) [47]
Wikifier Automatically annotates any text with links to Wikipedia articles describing named entities [48]
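
The Pywikipediabot framework listed above is now maintained as Pywikibot; the sketch below shows a minimal read-only use, assuming the package is installed and a user-config.py has been generated for English Wikipedia.

 import pywikibot
 
 site = pywikibot.Site("en", "wikipedia")   # English Wikipedia
 page = pywikibot.Page(site, "Data set")    # example title
 print(page.title(), len(page.text))        # page.text fetches the current wikitext
 for cat in page.categories():              # iterate over the page's categories
     print(cat.title())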

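The autocomplete entry in the table above points to a separate service, but comparable prefix completion is available from any MediaWiki installation through the standard opensearch API module; the sketch below only illustrates that general approach, with an example prefix and a made-up User-Agent string.

 import json
 import urllib.parse
 import urllib.request
 
 params = urllib.parse.urlencode({
     "action": "opensearch",
     "search": "Wikiped",   # example prefix
     "limit": 100,          # up to 100 completions, as in the entry above
     "format": "json",
 })
 req = urllib.request.Request(
     "https://en.wikipedia.org/w/api.php?" + params,
     headers={"User-Agent": "datasets-page-example/0.1"},
 )
 with urllib.request.urlopen(req) as resp:
     query, titles, descriptions, urls = json.load(resp)  # opensearch returns four parallel lists
 print(titles)
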
See also