Datasets
From Meta, a Wikimedia project coordination wiki
This page lists places that host Wikimedia datasets, and tools for working with them.
List
| Dataset Description | URL |
|---|---|
| Official Wikipedia database dumps | [1] |
| Taxobox: Wikipedia infoboxes with taxonomic information on animal species | [2] |
| Wikipedia³ is a conversion of the English Wikipedia into RDF. It is updated monthly and contains around 47 million triples | [3] |
| DBpedia: facts extracted from Wikipedia infoboxes and link structure, in RDF format (Auer et al., 2007) | [4] |
| Multiple data sets (English Wikipedia articles that have been transformed into XML) | [5] |
| Alphabetical list of film articles (or sections within articles about films), including made-for-television films | [6] |
| Using the Wikipedia page-to-page link database | [7] |
| Wikipedia: Lists of common misspellings/For machines | [8] |
| Apache Hadoop is a powerful open source software package designed for sophisticated analysis and transformation of both structured and unstructured complex data. | [9] |
| Wikipedia Hits | [10] |
| Top 1000 Accessed Wikipedia Articles | [11] |
| Wikipedia XML Data | [12] |
| Wikipedia Page Traffic Statistics | [13] |
| Complete Wikipedia edit history (up to January 2008) | [14] |
| Wikitech-l page counters | [15] |
| MusicBrainz Database | [16] |
| Datasets of network extracted from User Talk pages | [17] |
| Wikipedia Statistics | [18] |
| List of articles created in the last month/week/day that had the most users contributing within the same period | [19] |
| Wikipedia Taxonomy: automatically generated from the network of categories in Wikipedia (RDF Schema format) (Ponzetto and Strube, 2007a–c; Zirn et al., 2008) | [20] |
| Semantic Wikipedia: A snapshot of Wikipedia automatically annotated with named entity tags (Zaragoza et al., 2007) | [21] |
| Cyc to Wikipedia mappings: 50,000 automatically created mappings from Cyc terms to Wikipedia articles (Medelyan and Legg, 2008) | [22] |
| Topic indexed documents: A set of 20 Computer Science technical reports indexed with Wikipedia articles as topics. 15 teams of 2 senior CS undergraduates have independently assigned topics from Wikipedia to each article (Medelyan et al., 2008) | [23] |
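
Several of the datasets above are distributed as MediaWiki XML exports. As a minimal sketch of how such a file can be read without loading it into memory, the following uses Python's standard `xml.etree.ElementTree.iterparse`. The function names `local_name` and `iter_pages` and the sample document are illustrative assumptions, not part of any listed dataset or tool; the export namespace version differs between dumps, so the sketch strips it instead of matching on it.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream.

    Hypothetical helper: if a page holds several revisions, the text of
    the last one encountered is returned.
    """
    title, text = None, ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # drop the finished page's children to save memory


# Tiny in-memory stand-in for a dump file (assumed shape of the export format).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision><text>Hello ''world''.</text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
print(pages)
```

For a real compressed dump, the same generator could be fed a stream opened with `bz2.open(path, "rb")`, where `path` is whatever file was downloaded.
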
Tools to extract data from Wikipedia
This table might be migrated to the Knowledge Extraction Wikipedia article.
| Tool | Description | URL |
|---|---|---|
| Wikilytics | Extracting the dumps into a NoSQL database | [24] |
| Wikipedia2text | Extracting Text from Wikipedia | [25] |
| Traffic Statistics | Wikipedia article traffic statistics | [26] |
| Wikipedia to Plain text | Generating a Plain Text Corpus from Wikipedia | [27] |
| Autocomplete Wikipedia Titles | Autocomplete Wikipedia Article Titles API that returns an array of up to 100 completions for a given prefix. | [28] |
| DBpedia Extraction Framework | The DBpedia software that produces RDF data from over 90 language editions of Wikipedia and Wiktionary (highly configurable for other MediaWikis also). | [29] [30] |
| Wikiteam | Tools for archiving wikis including Wikipedia | [31] |
| History Flow | History flow is a tool for visualizing dynamic, evolving documents and the interactions of multiple collaborating authors | [32] |
| WikiXRay | This tool includes a set of Python and GNU R scripts to obtain statistics, graphics and quantitative results for any Wikipedia language version | [33] |
| StatMediaWiki | StatMediaWiki is a project that aims to create a tool to collect and aggregate information available in a MediaWiki installation. Results are static HTML pages including tables and graphics that can help to analyze the wiki status and development, or a CSV file for custom processing. | [34] |
| Java Wikipedia Library (JWPL) | An open-source, Java-based application programming interface that provides access to all information contained in Wikipedia | [35] |
| Wikokit | Wiktionary parser and visual interface | [36] |
| wiki-network | Python scripts for parsing Wikipedia dumps with different goals | [37] |
| Pywikipediabot | Python Wikipedia robot framework | [38] |
| WikiRelate | API for computing semantic relatedness using Wikipedia (Strube and Ponzetto, 2006) | [39] |
| WikiPrep | A Perl tool for preprocessing Wikipedia XML dumps (Gabrilovich and Markovitch, 2007) | [40] |
| W.H.A.T. Wikipedia Hybrid Analysis Tool | An analytic tool for Wikipedia with two main functionalities: an article network and extensive statistics. It contains a visualization of the article networks and a powerful interface to analyze the behavior of authors | [41] |
| QuALiM | A question answering system. Given a question in natural language, it returns relevant passages from Wikipedia (Kaisser, 2008) | [42] |
| Koru | A demo of a search interface that maps topics involved in both queries and documents to Wikipedia articles. Supports automatic and interactive query expansion (Milne et al., 2007) | [43] |
| Wikipedia Thesaurus | A large-scale association thesaurus containing 78M associations (Nakayama et al., 2007a, 2008) | [44] |
| Wikipedia English–Japanese dictionary | A dictionary returning translations from English into Japanese and vice versa, enriched with probabilities of these translations (Erdmann et al., 2008) | [45] |
| Wikify | Automatically annotates any text with links to Wikipedia articles (Mihalcea and Csomai, 2007) | [46] |
| Wikifier | Automatically annotates any text with links to Wikipedia articles describing named entities | [47] |
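
The "Autocomplete Wikipedia Titles" entry above describes an API that returns up to 100 completions for a given prefix. The sketch below shows one simple way such a lookup can work, assuming the titles are held in a sorted in-memory list; it is not the linked service's implementation, and the function name `complete` and the sample titles are hypothetical.

```python
from bisect import bisect_left


def complete(sorted_titles, prefix, limit=100):
    """Return up to `limit` titles from a sorted list that start with `prefix`.

    In sorted order, all strings sharing a prefix are contiguous and begin
    at the position where the prefix itself would be inserted.
    """
    start = bisect_left(sorted_titles, prefix)
    matches = []
    for title in sorted_titles[start:start + limit]:
        if not title.startswith(prefix):
            break  # left the contiguous block of matching titles
        matches.append(title)
    return matches


# Illustrative data only: a few names taken from the tables on this page.
titles = sorted(["Apache Hadoop", "DBpedia", "Wikify", "Wikifier", "Wikipedia"])

print(complete(titles, "Wikif"))
print(complete(titles, "Wiki", limit=2))
```

Binary search keeps each lookup logarithmic in the number of titles, plus the cost of copying out at most `limit` results.
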