User:TBurmeister (WMF)/Sandbox/Research:Data

From Meta, a Wikimedia project coordination wiki
Tracked in Phabricator:
Task T193296

This is a draft of a new page structure for Research:Data or related portal pages that seek to help people get started working with Wikimedia data. Its audience is primarily researchers and data scientists, but also includes developers who are building software or scripts using Wikimedia data.

Types of data

This section describes the major types of publicly available, openly licensed data about Wikimedia projects. Understanding the different types of data is the first step towards working with specific datasets and analysis tools.

Traffic and readership

Placeholder text.

  • This
  • Will
  • Be
  • great
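
As a concrete illustration of this data type while the section is drafted: per-article pageview counts are exposed through the public Wikimedia Analytics REST API. A minimal sketch of building such a request URL (the article and date range here are arbitrary examples):

```python
from urllib.parse import quote

# Base of the public Wikimedia Analytics REST API (pageviews endpoints).
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews"

def per_article_url(project, article, start, end,
                    access="all-access", agent="user", granularity="daily"):
    """Build a per-article pageviews request URL.

    start and end are YYYYMMDD strings; article titles are URL-encoded,
    with spaces as underscores, per the API's conventions.
    """
    title = quote(article.replace(" ", "_"), safe="")
    return (f"{BASE}/per-article/{project}/{access}/{agent}/"
            f"{title}/{granularity}/{start}/{end}")

url = per_article_url("en.wikipedia", "Data science", "20240101", "20240131")
print(url)
```

Actually fetching the JSON response requires network access and, per Wikimedia's API etiquette, a descriptive User-Agent header identifying your tool.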

The analysis process

This section provides an outline for how to approach working with Wikimedia data as part of your research or analysis project.

Frame your questions

Learning goal: Teach participants how to frame research questions that are answerable and relevant to Wikimedia

Before you can identify which datasets are relevant for you, you must define your analysis task or frame your research questions. If you don't yet have a clear analysis task, you may want to:

  • explore / browse data generated by WMF.
  • understand the general types of data available (for example, what terms like "analytics data" mean).
  • browse and learn from research and analyses created by other data consumers using Wikimedia data. Exploring what others have done may help you get a sense of what's possible.

Identify relevant Wikimedia data

Learning goal: Introduce participants to the landscape of open data available on the Wikimedia projects

After you have defined your analysis task or research question, you can start mapping your questions to the available data. Tasks:

  • Find and use data generated by WMF.
  • Identify the available datasets that may be relevant to your task or research question, and decide between them.
  • Align your data needs with the data model and descriptive language used by WMF/the data publishers.

TODO: insert mapping of research interests to data types

After you identify some available datasets that may be relevant to you, you'll have to decide which to investigate further. To help you decide between datasets:

  • Understand the constraints and nuances of the data, like its availability, gaps, outages, biases, reliability, existing tickets (feature requests/known issues).
  • Understand the available access methods.
  • Understand which datasets may or may not be compatible with each other, and why.
  • Consider browsing examples of what others have done with the data, to confirm that it aligns with your goals.

Access and process data

TODO: maybe this is a sub-step of the "prototyping" and "scaling" tasks? Learning goal: Teach participants how to best process the necessary data to answer their research questions, using MediaWiki utility Python libraries (e.g., mwxml, mwedittypes) to access and process the different Wikimedia data sources.
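
The libraries named above (mwxml, mwedittypes) are the usual tools for processing dumps. As a dependency-free sketch of the same idea, here is stdlib-only streaming iteration over pages and revisions in a minimal dump-like XML fragment (the fragment itself is invented for illustration; real dumps are far larger, carry an XML namespace, and are what mwxml handles for you):

```python
import io
import xml.etree.ElementTree as ET

# A tiny, invented fragment shaped like a MediaWiki XML dump.
DUMP = io.BytesIO(b"""<mediawiki>
  <page>
    <title>Example</title>
    <revision><id>1</id><text>First text.</text></revision>
    <revision><id>2</id><text>Second text.</text></revision>
  </page>
</mediawiki>""")

# Stream page elements instead of loading the whole file into memory.
revisions = []
for _event, elem in ET.iterparse(DUMP, events=("end",)):
    if elem.tag == "page":
        title = elem.findtext("title")
        for rev in elem.findall("revision"):
            revisions.append((title, rev.findtext("id"), rev.findtext("text")))
        elem.clear()  # free the parsed subtree as we go

print(revisions)
```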

Throughout your subsequent work with the data:

  • You may need to file a bug or feature request. You need to know where to do that so that the right people see it.
  • You may need to get help. You need to know where to ask for it and how to formulate your help request effectively.
  • You need to know where to look for documentation about the data and tools you're using.

Prototype with small datasets in web interfaces

UIs, PAWS, small language editions

Learning goal: Teach participants how to best gather the necessary data to answer their research questions (part 1: techniques for prototyping)

Learning goal: Introduce participants to the landscape of open infrastructure and tools available for Wikimedia researchers (part 1: infra and tools for prototyping)

As a data consumer who has decided which dataset(s) to investigate or work with, you will need to:

  • understand the process, datasets, and tools for prototyping.
  • set up and start using an appropriate data access method. (TODO: figure out how to present access and processing steps within or before prototyping and scaling steps?)
  • see example queries to help you get started quickly.
  • reference database schema and field definitions.
  • compose correct and efficient queries in an appropriate query language.
  • analyze the output of your queries and iterate on them.
  • debug and troubleshoot queries that are incorrect or too slow.
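
Prototype queries typically run against the wiki replica databases (for example via Quarry or PAWS). As a self-contained sketch of composing such a query, the example below runs against a simplified local stand-in for the MediaWiki `page` table; the real table has more columns and different types, so treat the schema here as illustrative only:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the MediaWiki `page` table.
conn.execute("""CREATE TABLE page (
    page_id INTEGER PRIMARY KEY,
    page_namespace INTEGER,
    page_title TEXT,
    page_is_redirect INTEGER)""")
conn.executemany(
    "INSERT INTO page VALUES (?, ?, ?, ?)",
    [(1, 0, "Example_article", 0),
     (2, 0, "Old_name", 1),               # a redirect
     (3, 1, "Talk:Example_article", 0)])  # a talk page

# Count non-redirect pages in the main (article) namespace.
(count,) = conn.execute(
    "SELECT COUNT(*) FROM page WHERE page_namespace = 0 "
    "AND page_is_redirect = 0").fetchone()
print(count)
```

The same SELECT, adjusted for the real schema, is the kind of query you would iterate on in Quarry before scaling up.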

Scale your analysis

Participants will know how to scale their analyses – e.g., from APIs -> dumps; from PAWS -> Toolforge; from simplewiki -> enwiki

Techniques for scaling your analysis

  • Learning goal: Teach participants how to best gather the necessary data to answer their research questions (part 2: techniques for large-scale analysis)

Tools and infrastructure for scaling your analysis

  • Learning goal: Introduce participants to the landscape of open infrastructure and tools available for Wikimedia researchers (part 2: infra and tools for large-scale analysis)

After you've successfully prototyped your analysis, you'll need to scale it and (where relevant) ensure its continuous reliability:

  • how to transition from APIs to dumps, from PAWS to Toolforge, and from simplewiki to enwiki.
  • understand the issues you may encounter as the scale of data you're processing increases, and how to address them (e.g. public compute resources, compute limitations and ways to handle them).
  • how to get alerted if there are outages or known issues with the upstream sources of your data, or with the dataset you're using.
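
One recurring scaling issue is that full dumps do not fit in memory; streaming a compressed dump line by line keeps memory use flat regardless of file size. A minimal stdlib sketch, with a small generated bz2 file standing in for a real dump:

```python
import bz2
import os
import tempfile

# Write a small bz2 file standing in for a compressed dump.
path = os.path.join(tempfile.mkdtemp(), "sample.bz2")
with bz2.open(path, "wt", encoding="utf-8") as f:
    for i in range(1000):
        f.write(f"line {i}\n")

# Stream the file: only one decompressed line is held at a time,
# so the same loop works on a multi-gigabyte dump.
count = 0
with bz2.open(path, "rt", encoding="utf-8") as f:
    for line in f:
        count += 1
print(count)
```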

Working with public compute resources

  • What is available and how to use it?
  • Limitations of public compute resources and how to handle them
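
Transient failures (rate limits, query timeouts, brief outages) are among those limitations; a common way to handle them is retrying with exponential backoff. A generic sketch, not tied to any specific Wikimedia service:

```python
import time

def with_retries(fn, attempts=4, base_delay=0.01):
    """Call fn(), retrying on exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * (2 ** attempt))

# Demo: a flaky operation that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

result = with_retries(flaky)
print(result, calls["n"])
```

In real use you would retry only on errors you know to be transient, and respect any Retry-After information the service returns.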

Publish and share your analyses

As a data consumer who has completed some analysis or generated a new dataset using Wikimedia data:

  • where can you publish your datasets and/or analyses?
  • how should you license your work that is based on WMF-generated data?
  • where or on what platforms can/should you publish and spread the word about your work?
  • what policies, rules, responsibilities apply?
  • what information should you publish about your dataset, and how/where/in what format to publish it?
    • how can you help potential consumers of your data to find it? what terminology might others use to search for the type of research you're publishing?
    • are there tools to help auto-generate dataset documentation and/or metadata?
  • how and when can you associate your data or analysis with related datasets?
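
To make the documentation and metadata points concrete, here is a sketch of generating a minimal dataset metadata record as JSON. The field names loosely follow schema.org's Dataset vocabulary and the values are illustrative, not a fixed Wikimedia requirement:

```python
import json

# Illustrative metadata record; fields loosely follow schema.org/Dataset.
metadata = {
    "@type": "Dataset",
    "name": "Example: monthly pageviews for selected articles",
    "description": "Derived from the public Wikimedia Pageviews API.",
    "license": "https://creativecommons.org/publicdomain/zero/1.0/",
    "keywords": ["Wikipedia", "pageviews", "Wikimedia"],
    "isBasedOn": "https://wikimedia.org/api/rest_v1/",
}

record = json.dumps(metadata, indent=2, sort_keys=True)
print(record)
```

Publishing a record like this alongside your dataset gives search engines and other researchers the terminology they need to find it.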