There is a vibrant, distributed community of researchers from academia and industry, students, and volunteers producing research about, and informing, Wikimedia projects.
Over time, the Wikimedia Foundation has dedicated significant resources to support these initiatives, in the form of open data and API releases; formal collaborations; grants and outreach initiatives; and ad hoc technical support. We also built internal capacity and infrastructure to provide teams at the Wikimedia Foundation with A/B testing tooling, analytics support and a research computing infrastructure to test and evaluate the impact of new products and programs. Finally, we designed an open access policy to ensure that the entire output of research collaborations the Foundation supports financially or with significant resources be made available under open licenses.
Currently, there is no single team at the Wikimedia Foundation responsible for all data- and research-related initiatives. While the Wikimedia Research department has the primary responsibility for research at the Foundation, many more teams are involved in research-related initiatives across the organization, including:
- Product: Community Tech, Discovery, Editing, Fundraising Tech, Reading
- Technology: Analytics, Design Research, Performance, Research and Data, Security
- Advancement: Online Fundraising, Partnerships and Global Reach
- Community Engagement: Program Capacity & Learning, Resources, Technical Collaboration
We’re creating this guide to help you understand who owns research-related processes and resources. Many processes and resources described in this guide are primarily targeted at internal teams at WMF but we are publishing this guide in the hope that it will provide additional guidance and transparency to researchers in the volunteer and academic communities on the internal functioning of the organization.
Still unsure? If your question is not answered in the FAQ below, feel free to hop on the #wikimedia-research IRC channel or drop a line to research-internal@lists.wikimedia.org, our cross-departmental research list, where someone will be able to help.
Frequently Asked Questions
- 1 About
- 2 Frequently Asked Questions
- 2.1 Where do I find general statistics about Wikimedia projects?
- 2.2 Where do I find data or statistics about a specific Product Audience?
- 2.3 Where do I find official numbers to prepare a talk or a press release?
- 2.4 I have an urgent press query and I need some data, who do I talk to?
- 2.5 Where can I find the exact definition of an official metric used by WMF?
- 2.6 What does the Research and Data team do? Can it support my team’s data analysis needs?
- 2.7 I want to pitch a new project to Research and Data, what should I do?
- 2.8 How can I propose a formal collaboration between my team and the Wikimedia Foundation?
- 2.9 Can the Wikimedia Foundation financially support my research?
- 2.10 Can the Wikimedia Foundation write a letter of support for my grant proposal?
- 2.11 Are there any research and analytics jobs at the Wikimedia Foundation?
- 2.12 Does the Wikimedia Foundation share private data under NDAs for research purposes?
- 2.13 Where do I find a list of open datasets released by WMF for research purposes?
- 2.14 Can the Wikimedia Foundation help me collect data for my study?
- 2.15 How do I get special API privileges for my research?
- 2.16 Where can I learn about current research projects at WMF?
- 2.17 Where can I learn about existing research on a specific topic?
- 2.18 What conferences should I attend to learn about academic research on Wikimedia projects?
- 2.19 How do I release a dataset?
- 2.20 What kind of traffic data is collected by the Wikimedia Foundation?
- 2.21 What instrumentation data is collected for product analytics?
- 2.22 How do I access instrumentation data (EventLogging)?
- 2.23 I wrote a schema and I need to verify the quality of the data collected, who can help me?
- 2.24 Can I get access to Wikimedia's production databases to run some analysis?
- 2.25 I want to run a survey, how do I get started?
- 2.26 How do newly designed features and products go through usability testing?
Where do I find general statistics about Wikimedia projects?
Most reports and dashboards used by WMF are maintained by the Analytics team in coordination with individual audience teams. The most frequently used reports are:
Wikistats, a comprehensive overview of statistics across all Wikimedia projects, was created and is currently maintained by Erik Zachte. A number of reports from Wikistats are in the process of being migrated to a new infrastructure designed by the Analytics Engineering team. Analytics Engineering is responsible for the Wikimedia Foundation's analytics and data computing infrastructure, but as a general rule the team is not responsible for defining specific metrics or providing ad hoc analyses of them; these are owned by the respective Product Audience teams. You can contact the Analytics Engineering team via analytics@lists.wikimedia.org.
Where do I find data or statistics about a specific Product Audience?
Looking for data, metrics and statistics about editors, readers, donors, etc.? You can find an answer by checking this list of reports owned by individual departments or by reaching out to the appropriate audience team.
The Product group provides a hub with the metrics for each audience area reported on a monthly basis. The following is a list of audience teams and examples of the key metrics they report on:
- Discovery: User-perceived load time, zero results rate, API usage, search user engagement, search engine ranking, referred traffic, maps and WDQS usage
- Editing: Active editors, new active editors, Wikipedia article edits, Wikipedia article edits via mobile
- Fundraising Tech: Donation count, average donation rate, peak donation rate, country count, distinct campaign sources, donation source breakdown
- Reading: Pageviews by desktop and mobile web, usage breakdown by global region, Android and iOS app uniques and installs
As a general rule, the best way to issue a request is not to directly contact a data analyst but to direct your query to the product manager.
Teams in the Product group use Phabricator tags to track analysis-related requests:
The best way to formulate a request in Phabricator (borrowing from Neil's great documentation) is to specify:
- What's requested. If you know what you want, be specific! For example, don't just ask for "data about multilingual Wiktionary editors", ask for "the number of contributors who edited more than one Wiktionary in the past month". If you have a question but don't know how it can be answered, say what you've already tried.
- Why it's requested. You don't have to write an essay, but give me enough context that I can interpret, adapt, and prioritize your request. For example, "the number of multilingual Wiktionary users will help us decide whether to give a developer a $10,000 grant to write a tool for them."
- When it's requested. If you have a deadline, explain what it is and what it's tied to. For example, "the Wiktionary tool developer needs to make summer plans, so we need this information by 15 May." Don't just say "as soon as possible." If we drop everything we're doing and work all night, tomorrow is probably possible; is your request so urgent that we need to do that? :)
- Any other helpful information, like relevant documentation.
In most cases, a search on Wikistats will give you the answer you need.
Where do I find official numbers to prepare a talk or a press release?
The Communications team maintains a press kit with official metrics, vetted by the respective audience teams and by Analytics. You can reach the Communications team via communications@wikimedia.org. For other statistics, please look up the list of reports mentioned above or get in touch with the appropriate audience team.
I have an urgent press query and I need some data, who do I talk to?
As a general rule, press queries about specific audiences, related metrics and products should be directed to the Wikimedia Foundation's Communications team, who will in turn connect with the appropriate team. If you receive a query from a reporter, please contact the Communications team at communications@wikimedia.org and they will help you handle the response.
Where can I find the exact definition of an official metric used by WMF?
Metrics used in WMF's official reports maintained by the Analytics team are defined here. If you have questions about the definition of metrics relevant to a specific audience, please contact the corresponding team.
What does the Research and Data team do? Can it support my team’s data analysis needs?
The Research and Data team (R&D) is part of the Wikimedia Research department. Its mandate is to help design and test technology informed by qualitative and quantitative research methods and produce scientifically rigorous knowledge about Wikimedia's users and projects. Examples of projects led by the Research and Data team include:
- AI services to predict the quality of articles and edits (blog)
- recommender systems to increase content coverage or identify missing links (blog)
- models to predict talk page abuse and harassment
- methods to quantify edit productivity and value added by wiki labor
The team can provide guidance on metric definitions and experimental design, as well as statistical and methodological support, on an ad hoc basis. Individual Product teams are responsible for data analysis and metric definition for their corresponding audiences. You can contact the R&D team via our (internal) department mailing list, research-wmf@lists.wikimedia.org.
I want to pitch a new project to Research and Data, what should I do?
The Research and Data team partners with other teams in the organization, community members and academic researchers to design and run projects that typically span multiple months of work. In order to engage with the team, your project will likely be:
- a minimum of one or two quarters in projected time frame
- ahead of specific products or interventions being designed or tested
If you think your project meets these requirements, you can contact the team via this mailing list: research-wmf@lists.wikimedia.org, or by creating a Phabricator task in the backlog of the Research and Data board. If you are looking for audience-specific metrics and statistics, please get in touch with the respective team's product owner.
How can I propose a formal collaboration between my team and the Wikimedia Foundation?
As of March 2015, all formal collaborations with external researchers (both in academia and in industry) that receive support from Wikimedia Foundation staff are subject to our open access policy, which secures the openness and immediate reusability of the output of these projects: data, code and scholarly publications. In most cases, a formal collaboration will require the partner team to sign a Memorandum of Understanding (MOU) acknowledging the terms of the policy. Additional agreements may be required, particularly when the collaboration includes access to private data. For frequently asked questions about this policy, you can read this page. A list of current formal collaborations with the Research team, as well as the process followed by the team to set up new ones, can be found on this page.
Can the Wikimedia Foundation financially support my research?
The Wikimedia Foundation sponsors research projects of strategic importance in the form of grants. Grants can be issued to individuals and organizations alike and can be awarded via calls for participation or directly allocated in the case of research commissioned by the Foundation. More information on different types of grant, and the corresponding requirements, can be found on this page. Research sponsored by a grant from the Wikimedia Foundation is subject to the terms of our open access policy.
Can the Wikimedia Foundation write a letter of support for my grant proposal?
The Wikimedia Foundation does not, except in exceptional circumstances, directly participate in grant applications or research consortia as a partner institution, due to the legal and financial constraints that come with restricted funding. However, we are happy to support individual research projects of particular strategic importance by providing formal endorsements. Letters of endorsement are signed by a C-level executive or by their delegate; they form part of a formal collaboration and are subject to the terms of the Wikimedia Foundation's open access policy.
Are there any research and analytics jobs at the Wikimedia Foundation?
Current openings for part-time and full-time positions in Research, Analytics Engineering and Product are listed on the Wikimedia Foundation's jobs website.
Does the Wikimedia Foundation share private data under NDAs for research purposes?
The Wikimedia Foundation can issue non-disclosure agreements (NDAs) to allow researchers to access private data under the terms of the open access policy and subject to the organization's priorities and policies. The process to request access to private data requires writing a research proposal describing the type of data requested, why it's needed, and the expected outcomes of the proposal. More details can be found in the Wikimedia Research team's formal collaboration instructions.
Where do I find a list of open datasets released by WMF for research purposes?
A comprehensive list of open datasets published by WMF for research purposes can be found at m:Research:Data. You can also search the Wikimedia Foundation's entry on the DataHub or Wikimedia-related datasets available on Figshare. For access to private data, see this question.
Can the Wikimedia Foundation help me collect data for my study?
As a general rule, researchers at the Wikimedia Foundation have little bandwidth to provide data collection / data analysis as a service, outside of the scope of formal collaborations. We are always happy to provide guidance and recommend the appropriate tools, data sources and libraries for a given study on an informal basis. The best way to get support is to post a request to wiki-research-l (for anything related to research design, methods, state of the art on a specific research topic) or to analytics-l (for data sources and APIs maintained by the Wikimedia Analytics Engineering team). You can also get support via the corresponding IRC channels, irc:wikimedia-research and irc:wikimedia-analytics. If your request is about recruiting participants for a survey or study, see the corresponding question.
How do I get special API privileges for my research?
You can access the MediaWiki API to retrieve data from Wikimedia projects with the standard permissions that are granted to your registered username. For most types of data you will not need any kind of special privilege. In some cases the Wikimedia Foundation can grant special permissions (such as high API request limits) on a temporary basis to individual users for research purposes. When these privileges are granted by WMF staff, they form part of a formal collaboration and are subject to the terms of the Wikimedia Foundation's open access policy.
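As a rough illustration of a request that needs no special privileges, the sketch below builds a MediaWiki Action API query for the most recent revisions of a page. The endpoint (English Wikipedia's `api.php`), the parameters chosen, the example User-Agent string and the helper names `build_revisions_query` and `fetch_json` are assumptions made for this example, not something this guide prescribes.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint for illustration: English Wikipedia's Action API.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_revisions_query(title, limit=5):
    """Return a URL requesting the most recent revisions of one page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "timestamp|user|comment",
        "rvlimit": limit,
        "format": "json",
    }
    return API_ENDPOINT + "?" + urllib.parse.urlencode(params)


def fetch_json(url, user_agent="research-example/0.1 (contact: you@example.org)"):
    """Fetch a URL and decode the JSON body. Requires network access."""
    # The User-Agent value is a placeholder; replace it with your own contact details.
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Only prints the URL; call fetch_json(url) to actually send the request.
    print(build_revisions_query("Wikipedia"))
```

Keeping URL construction separate from the network call makes it easy to inspect a query before sending it, which helps when staying within the standard request limits.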
Where can I learn about current research projects at WMF?
We run a weekly, cross-departmental research group every Thursday at 9:30am PT to discuss research in progress, present early results or get feedback on the design of new projects. The meeting is regularly attended by members of the Research and Data and Design Research teams, analysts with various Product teams and from Learning and Evaluation but it's open to anyone in the organization interested in participating. We also host more formal, public presentations on a monthly basis via our Research Showcase and at Monthly Metrics meetings, which you can attend in person if you're in the SF office or watch online via YouTube.
Where can I learn about existing research on a specific topic?
There are several places where you can learn about previous and current research. The most comprehensive resource covering research on Wikimedia projects is the Research Newsletter. The newsletter is a collaboratively maintained monthly overview of new research, edited by Tilman Bayer and Dario Taraborelli with contributions by several volunteer reviewers. It has been published monthly since 2011 and has a fully searchable archive. You can also follow the latest research updates hot off the press via the @WikiResearch handle on Twitter, by subscribing to wiki-research-l or by attending the Wikimedia Research monthly showcase (also available on YouTube). The Wikimedia Research Codex is a complementary effort to summarize past research by organizing it by topic instead of by date; it is currently in progress, and topics are prioritized depending on team needs.
What conferences should I attend to learn about academic research on Wikimedia projects?
There are several scholarly conferences with dedicated tracks on Wikimedia research and/or a long record of publications in the field. The best research on Wikipedia and other Wikimedia projects today happens at conferences such as CSCW, ICWSM, OpenSym and WWW. Wikimania also has regular tracks dedicated to research on our projects.
How do I release a dataset?
Releasing open data about Wikimedia projects for research purposes, while respecting our privacy and data retention policies, is in line with Wikimedia's values and mission to disseminate open knowledge. The Wikimedia Research team maintains an open data repository via the DataHub that anyone can contribute to. We also register and host open datasets for research purposes on Figshare, for citability and discoverability. If you are in a team at WMF dealing with sensitive data, before releasing a new dataset, particularly data obtained from private sources and/or containing personally identifiable information, it is mandatory to consult with the Legal and Security teams. The Research and Data team can provide best practices on how to publish and document the dataset, once its publication has been cleared by these two teams. The release of data from Fundraising is subject to additional restrictions due to our donor policies: before publishing any reports including anonymized or aggregate data from Online Fundraising, please review these guidelines and obtain explicit approval from the team.
What kind of traffic data is collected by the Wikimedia Foundation?
The Wikimedia Analytics Engineering team maintains several large-scale datasets on Wikimedia traffic, stored and processed via a Hadoop cluster, from unsampled HTTP request data to curated pageview data. Detailed documentation on these datasets can be found at: wikitech:Category:Data_stream. As a general rule, traffic data hosted on Hadoop is considered private and accessing it is restricted to WMF staffers and people covered by an NDA. If you are part of a team at the Wikimedia Foundation interested in analyzing traffic data, you can get in touch with Analytics Engineering to request access to the corresponding data stores. Article-level pageview data can be publicly accessed by anyone via the Pageview API.
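For the public article-level data, a request can be sketched as follows. The base URL and path layout reflect our understanding of the Wikimedia REST API's per-article pageviews endpoint and should be checked against its documentation; the helper names `pageviews_url` and `total_views`, the default parameter values and the sample payload are hypothetical and used only for illustration.

```python
import urllib.parse

# Assumed base URL of the public per-article pageviews endpoint.
PAGEVIEWS_BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def pageviews_url(project, article, start, end,
                  access="all-access", agent="user", granularity="daily"):
    """Build a per-article pageviews URL; start and end are YYYYMMDD strings."""
    # Titles use underscores instead of spaces and must be fully URL-encoded.
    encoded_title = urllib.parse.quote(article.replace(" ", "_"), safe="")
    return "/".join([PAGEVIEWS_BASE, project, access, agent,
                     encoded_title, granularity, start, end])


def total_views(payload):
    """Sum the 'views' field over the 'items' list of a decoded JSON response."""
    return sum(item["views"] for item in payload.get("items", []))


if __name__ == "__main__":
    print(pageviews_url("en.wikipedia", "Albert Einstein", "20160101", "20160107"))
    # Made-up numbers, shaped like the response format assumed above.
    sample = {"items": [{"timestamp": "2016010100", "views": 10},
                        {"timestamp": "2016010200", "views": 15}]}
    print(total_views(sample))
```

The sketch only constructs the URL and aggregates an already-decoded response; fetching it is an ordinary HTTP GET and needs no credentials, unlike the Hadoop-hosted data described above.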
What instrumentation data is collected for product analytics?
A comprehensive list of instrumentation data collected via EventLogging (and their respective owner) can be found at m:Research:Schemas. To inquire about a specific schema, please contact the owner on the schema's talk page.
How do I access instrumentation data (EventLogging)?
Instrumentation data used for testing new products and features and measuring how users interact with our sites is provided by EventLogging, a platform maintained by the Analytics team. As a general rule, instrumentation data is private and access to it is restricted to WMF staffers and people covered by a non-disclosure agreement. If you are part of a team at the Wikimedia Foundation interested in producing or analyzing instrumentation data, you can get in touch with Analytics Engineering to request access to the corresponding data stores.
I wrote a schema and I need to verify the quality of the data collected, who can help me?
Product and engineering teams at the Wikimedia Foundation use a platform called EventLogging to collect data and measure user interactions with Wikimedia sites. If you are a user of this platform, you'll find yourself asking not only where the data lives but also whether the data collected matches the specification and whether the instrumentation captures the data as intended. Extensive documentation on EventLogging, its architecture and the data stores it uses is available on this page. Schemas defining data that is collected via EventLogging can be found on this page. The responsibility to audit the quality of data collected lies with the engineers who wrote the instrumentation, in coordination with analysts within their team and product managers who are familiar with feature design and workflows. The Analytics team can provide guidance about the collection of high-throughput data and help identify appropriate sampling rates, when applicable, as well as provide information on the retention window for this data, subject to the Wikimedia Foundation's privacy and retention policies.
Can I get access to Wikimedia's production databases to run some analysis?
The Analytics team maintains real-time replicas of Wikimedia's production databases for analysis purposes via an internal SQL cluster. Production databases contain private data and access to them is restricted to WMF staffers and people covered by a non-disclosure agreement. If you work for a team at the Wikimedia Foundation interested in analyzing this data, you can get in touch with the Analytics team to request access to the cluster. Alternatively, if your request does not involve private data, you can use Quarry to run and save queries against a sanitized copy of Wikimedia's production databases.
I want to run a survey, how do I get started?
The Program Capacity & Learning team maintains the Survey Support Desk, a one-stop shop for anything related to surveys in the Wikimedia context for Wikimedia Foundation staff, Wikimedia affiliates, and volunteers. The team also maintains and provides access to survey platforms used by WMF. The Design Research team can provide overall guidance and support to other teams at WMF on survey design. The Research and Data team can provide guidance on best practices for on-wiki participant recruitment. All WMF-run surveys must be reviewed by the Legal team; see this page for more information.
Surveys run by academic researchers need to meet community expectations before participant recruitment can begin. Creating a research project and discussing the proposed recruitment strategy on wiki-research-l are good, preliminary steps towards successful recruitment of participants for a study. There aren't any global policies regulating third-party research or mechanisms for large-scale subject recruitment, but best practices have been discussed in a number of contexts. en:WP:Research and en:WP:SRAG are the product of the joint efforts of the research community and the English Wikipedia community to try and satisfy two goals:
- Create a mechanism for mass subject recruitment
- Protect the community (and individuals) from the disruption that mass recruitment could cause
Along with these two documents, a few essays are available as tools for educating Wikipedians about research:
How do newly designed features and products go through usability testing?
The Design Research team (DR) supports iteration of concepts and functionality toward a usable and intuitive experience for users. It also provides guidance to other WMF teams via a range of qualitative methods including, but not limited to, usability testing. Requests for the team can be submitted via Phabricator. The team also conducts generative research and collaborates with Research and Data and other teams to help define, at a high level, what products and user experiences should be built (and why) for specific types of users, based on their needs. You can contact the DR team via our (internal) department mailing list, research-wmf@lists.wikimedia.org.