User:Halfak (WMF)/Wikimedia data

From Meta, a Wikimedia project coordination wiki

Datasources

Primary

These datasources are official, well defined, maintained and kept up to date.

Content & contributors: database replicas, MediaWiki API, XML dumps, RCStream, Wikidata query service. Reading behavior: page views.
Database replicas (more info)
  • Fully SQL query-able, live copy of the databases behind the wikis. Good for large/complex queries.
  • No text content. If you need to process text, look to one of the other datasources.
  • Query from the web with Quarry, or directly through Tool Labs/LabsDB.
MediaWiki API (more info)
  • A RESTful API with limited query capabilities. It can also perform actions (edit, watchlist, etc.).
  • Slower than the database replicas, but text content is available.
  • Lots of client libraries available for Python, PHP, Java and more.
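
As a rough illustration of the query side of the API, the sketch below builds a request URL for the most recent revisions of a page, including their text, using only the Python standard library. The endpoint `https://en.wikipedia.org/w/api.php` and the helper name `build_revisions_query` are assumptions chosen for illustration; the parameters (`action=query`, `prop=revisions`, `rvprop`, `rvslots`, `rvlimit`) are regular action API parameters. In practice one of the client libraries would usually handle this step.

```python
from urllib.parse import urlencode

# Assumed endpoint for illustration; each wiki exposes its own api.php.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_revisions_query(title, limit=5):
    """Build a URL asking for the most recent revisions of one page.

    Hypothetical helper for this sketch, not part of any library.
    """
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|user|content",  # "content" requests page text
        "rvslots": "main",
        "rvlimit": limit,
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


url = build_revisions_query("Main Page", limit=2)
print(url)
```

Fetching the URL (for example with `urllib.request.urlopen`) returns JSON; the request itself is left out here so the sketch runs without network access.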
XML dumps (more info)
  • Large, full copies of metadata and text content.
  • Highly compressed; the datasets run to terabytes when decompressed.
  • Recommended for large scale analysis of text & editor behavior.
  • Libraries available that simplify streaming decompression and analysis.
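
To make the idea of streaming decompression concrete, here is a minimal sketch using only the Python standard library rather than any of the dedicated libraries. It compresses a tiny, made-up XML sample in memory and then reads it back incrementally, counting revisions per page. The sample document, its namespace URI, and the helper names `local_name` and `revision_counts` are assumptions for illustration; a real dump would be opened from a file on disk.

```python
import bz2
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a dump file; real dumps are far larger. The export
# namespace below is an assumption for this sample.
SAMPLE_XML = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision><id>1</id><text>first draft</text></revision>
    <revision><id>2</id><text>second, longer draft</text></revision>
  </page>
  <page>
    <title>Other</title>
    <revision><id>3</id><text>hello</text></revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def revision_counts(stream):
    """Yield (page title, revision count) pairs while reading incrementally."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) == "page":
            title = None
            revisions = 0
            for child in elem:
                name = local_name(child.tag)
                if name == "title":
                    title = child.text
                elif name == "revision":
                    revisions += 1
            yield title, revisions
            elem.clear()  # drop the page's children once it is processed


# bz2.open decompresses on the fly, so the full XML is never expanded at once.
compressed = io.BytesIO(bz2.compress(SAMPLE_XML))
with bz2.open(compressed, "rb") as stream:
    counts = list(revision_counts(stream))

print(counts)
```

Note that this sketch still holds one whole page in memory at a time, which can matter for pages with very long histories.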
RCStream (more info)
  • A socket.io-based live stream of public events in Wikimedia wikis
  • No text data -- just metadata.
Wikidata query service (more info)
Page views (homepage)
  • Counts of page views by page title and whole wikis in hourly files.
  • The only historical datasource for reading behavior.
  • Query from the web with stats.grok.se
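
For working with the hourly files directly, a small sketch of aggregating views per page follows. It assumes each line holds four space-separated fields (project code, page title, view count, bytes transferred); that layout, the sample numbers, and the helper names `parse_line` and `views_by_page` are assumptions for illustration, so check them against the files you actually download.

```python
from collections import Counter

# Made-up sample in the assumed layout: project, page title, views, bytes.
SAMPLE_LINES = """\
en Main_Page 1200 53000000
en Python_(programming_language) 85 9100000
de Wikipedia:Hauptseite 400 12000000
en Main_Page 300 14000000
"""


def parse_line(line):
    """Split one record into (project, title, views); None if malformed."""
    parts = line.split(" ")
    if len(parts) != 4:
        return None
    project, title, views, _bytes = parts
    return project, title, int(views)


def views_by_page(lines, project):
    """Sum view counts per page title for a single project."""
    totals = Counter()
    for line in lines:
        record = parse_line(line.strip())
        if record and record[0] == project:
            totals[record[1]] += record[2]
    return totals


totals = views_by_page(SAMPLE_LINES.splitlines(), "en")
print(totals.most_common(1))
```

Summing across several hourly files is then a matter of feeding more lines into the same counter.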

Secondary

WikiStats (more info)
  • A collection of reports generated about Wikimedia Projects (active editors, monthly pageviews, etc.)
DBPedia (homepage)
  • A database of structured data extracted from Wikipedias
  • RDF, N-Triples, SPARQL endpoint, Linked Data
Wikimedia @ DataHub.io (homepage)
ORES (more info)
  • Machine learning as a RESTful service
  • Scores revisions by the probability that they are damaging and predicts article quality.
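
A hedged sketch of what using such a scoring service might look like from Python: it builds a request URL and pulls the "damaging" probability out of a response. The base URL, the `/v3/scores/<wiki>/` path, the `models` and `revids` parameters, and the response layout reflect how the service has been exposed as far as I know, but treat them all as assumptions and check the current documentation. The sample response values are made up, and `build_score_url` and `damaging_probability` are hypothetical helpers.

```python
import json
from urllib.parse import urlencode

# Assumed base URL and path layout of the public scoring endpoint.
ORES_URL = "https://ores.wikimedia.org/v3/scores"


def build_score_url(wiki, model, rev_ids):
    """Build a scoring request URL for one model and several revision IDs."""
    query = urlencode({"models": model, "revids": "|".join(map(str, rev_ids))})
    return f"{ORES_URL}/{wiki}/?{query}"


def damaging_probability(response, wiki, rev_id, model="damaging"):
    """Return the probability that a revision is damaging (assumed layout)."""
    score = response[wiki]["scores"][str(rev_id)][model]["score"]
    return score["probability"]["true"]


# Made-up response in the assumed shape; not real scores.
SAMPLE_RESPONSE = json.loads("""
{"enwiki": {"scores": {"123456": {"damaging": {"score": {
    "prediction": false,
    "probability": {"false": 0.92, "true": 0.08}
}}}}}}
""")

print(build_score_url("enwiki", "damaging", [123456]))
print(damaging_probability(SAMPLE_RESPONSE, "enwiki", 123456))
```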
WikiBrainAPI (homepage)
  • Powerful algorithmic processing for Wikipedia
  • Semantic relatedness, page rank calculations, etc.

Data processing libraries

pywikibot (Monolith)
mediawiki-utilities (Unix-style)
  • Primary datasources: mwapi, mwdb, mwxml, mwtypes
  • Auth: mwoauth
  • Data processing: mwreverts, mwsessions, mwpersistence, mwparserfromhell, mwmetrics, mwevents, etc.